Project Name: Cyber Security News Harvesting. The project builds a database of cyber security news and classifies the articles by news type, e.g. Informational, Advisory, Data Breach, Company Hacking, etc.
Project details: There are three key deliverables:
List of News Sources: a list of websites and other sources that publish credible, high-quality news and articles on cyber security, stored in the following format:
SourceID
SourceName
StartURL
PullMethod = {Web Scraper, API, Other}
CrawlDepth
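The source format above could be represented in code roughly as follows. This is a minimal sketch, not a prescribed design: the class names `NewsSource` and `PullMethod`, the URL and depth checks, and the sample values are all assumptions made for illustration. Only the five field names come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class PullMethod(Enum):
    """Allowed values of the PullMethod field."""
    WEB_SCRAPER = "Web Scraper"
    API = "API"
    OTHER = "Other"


@dataclass(frozen=True)
class NewsSource:
    """One entry in the list of news sources (hypothetical class name)."""
    source_id: str
    source_name: str
    start_url: str
    pull_method: PullMethod
    crawl_depth: int

    def __post_init__(self):
        # Assumed validation rules; the brief does not specify any.
        if not self.start_url.startswith(("http://", "https://")):
            raise ValueError(f"StartURL must be an http(s) URL: {self.start_url!r}")
        if self.crawl_depth < 0:
            raise ValueError("CrawlDepth must not be negative")

    def to_record(self) -> dict:
        """Return the source using the field names from the brief."""
        return {
            "SourceID": self.source_id,
            "SourceName": self.source_name,
            "StartURL": self.start_url,
            "PullMethod": self.pull_method.value,
            "CrawlDepth": self.crawl_depth,
        }


# Sample entry with made-up values.
example_source = NewsSource(
    source_id="SRC-001",
    source_name="Example Security Blog",
    start_url="https://example.com/security",
    pull_method=PullMethod.WEB_SCRAPER,
    crawl_depth=2,
)
```

Keeping the record as a plain dictionary with the original field names means the same list can be written to a file or a database collection without renaming anything.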
News Harvester Service: a Java or Python service that takes the list of news sources produced by the first deliverable as input, pulls news articles from those sources using web scrapers or APIs, strips non-relevant content, and stores the articles in a database (MongoDB) using the following schema:
NewsID
NewsSourceID
OriginalPageURL
NewsContent {HTML}
NewsArticleType
AssociatedTags <TagName, TagValue>
HarvestedAt
PublishedAt
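The cleaning and storage step could be sketched as below, using only the Python standard library. Several things here are assumptions rather than requirements from the brief: the function names `clean_html` and `build_news_document`, the set of tags treated as non-relevant, deriving NewsID from a hash of the page URL, and storing AssociatedTags as a list of TagName/TagValue pairs.

```python
import hashlib
from datetime import datetime, timezone
from html import escape
from html.parser import HTMLParser

# Assumed set of elements that carry no article content.
DROPPED_TAGS = {"script", "style", "nav", "footer", "aside", "form"}


class _ContentCleaner(HTMLParser):
    """Re-emits HTML while skipping everything inside DROPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self._parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        if self._skip_depth == 0 and tag not in DROPPED_TAGS:
            self._parts.append(self.get_starttag_text())

    def handle_data(self, data):
        if self._skip_depth == 0:
            # The parser unescapes text, so escape it again on output.
            self._parts.append(escape(data, quote=False))

    def result(self) -> str:
        return "".join(self._parts)


def clean_html(raw_html: str) -> str:
    """Remove non-relevant elements and keep the remaining markup."""
    cleaner = _ContentCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return cleaner.result()


def build_news_document(source_id, page_url, raw_html, article_type,
                        tags, published_at, harvested_at=None) -> dict:
    """Build one record using the field names from the schema above."""
    return {
        # Assumption: hashing the URL gives a stable ID for de-duplication.
        "NewsID": hashlib.sha256(page_url.encode("utf-8")).hexdigest(),
        "NewsSourceID": source_id,
        "OriginalPageURL": page_url,
        "NewsContent": clean_html(raw_html),
        "NewsArticleType": article_type,
        "AssociatedTags": [{"TagName": k, "TagValue": v} for k, v in tags.items()],
        "HarvestedAt": harvested_at or datetime.now(timezone.utc),
        "PublishedAt": published_at,
    }
```

With the PyMongo driver, a document built this way could be stored with `collection.insert_one(document)`. The sketch leaves out fetching pages and classifying the article type, which depend on the chosen scraper and classifier.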
News Center Service: a RESTful service that exposes a set of APIs through which a consumer can pass criteria such as time frame, news type, or source. The service returns the matching news from the database populated by the News Harvester Service (deliverable 2).
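The lookup behind such an API could translate the request criteria into a MongoDB filter document, roughly as sketched below. The function name `build_news_filter` and its parameter names are hypothetical. Applying the time frame to PublishedAt rather than HarvestedAt is also an assumption, since the brief does not say which timestamp the time frame refers to.

```python
from datetime import datetime, timezone
from typing import Optional


def build_news_filter(start: Optional[datetime] = None,
                      end: Optional[datetime] = None,
                      news_type: Optional[str] = None,
                      source_id: Optional[str] = None) -> dict:
    """Turn optional search criteria into a MongoDB filter document.

    Criteria that are not supplied are left out, so an empty call
    produces an empty filter that matches every stored article.
    """
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be later than end")

    query = {}

    # Time frame: inclusive range on the publication timestamp (assumed).
    published_range = {}
    if start is not None:
        published_range["$gte"] = start
    if end is not None:
        published_range["$lte"] = end
    if published_range:
        query["PublishedAt"] = published_range

    if news_type:
        query["NewsArticleType"] = news_type
    if source_id:
        query["NewsSourceID"] = source_id
    return query


# Example: data breach articles from one source published in January 2024.
example_filter = build_news_filter(
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    end=datetime(2024, 1, 31, tzinfo=timezone.utc),
    news_type="Data Breach",
    source_id="SRC-001",
)
```

The resulting dictionary can be passed to a PyMongo `collection.find(...)` call. Keeping the filter construction in a separate function lets it be tested without a running database or web framework.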