Dear Client:
I can do this job using the open-source Python/Scrapy framework.
I have extensive Python and web data scraping experience with the following technologies, libraries, and languages (a short illustrative spider sketch follows the list):
• Parsing XML, HTML, JSON, JavaScript code, plain text, etc.
• Hadoop/MapReduce, NLTK
• Proxying, delays/throttling, cookie handling
• Scrapy
• Python, lxml, XPath, BeautifulSoup, urllib
• MySQLdb, xlrd, xlwt, csv, minidom, PIL/Image
• Smarty, PHP, C/C++, Java
• Ruby, Mechanize, Nokogiri
• Regular expressions, JS/Ajax/JSON, HTML/XML, PyV8
• CSV, Excel, MySQL
• Selenium WebDriver (Firefox/Chrome), Xvfb, etc.
• Linux (CentOS, Ubuntu), Windows
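To give a flavor of my work, here is a minimal Scrapy spider sketch; the target URL, XPath selectors, and field names are placeholders, not a real site:

    # Minimal illustrative Scrapy spider; the domain, XPath selectors,
    # and field names below are placeholders, not a real target site.
    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["http://example.com/catalog"]

        # Politeness: throttle requests rather than hammer the server.
        custom_settings = {
            "DOWNLOAD_DELAY": 1.0,
            "AUTOTHROTTLE_ENABLED": True,
        }

        def parse(self, response):
            # Extract each product block with XPath and yield a dict item.
            for row in response.xpath('//div[@class="product"]'):
                href = row.xpath('.//a/@href').get()
                yield {
                    "title": row.xpath('.//h2/text()').get(),
                    "price": row.xpath('.//span[@class="price"]/text()').get(),
                    "url": response.urljoin(href) if href else None,
                }
            # Follow the pagination link, if the page has one.
            next_page = response.xpath('//a[@rel="next"]/@href').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)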
I have scraped more than 30 websites with XML/JS/Ajax/dynamic content, including sites spanning multiple regions, countries, and currencies.
I have installed and configured Scrapy on several platforms: CentOS, Ubuntu, and Windows.
I am currently maintaining a Scrapy-based web data capturing/harvesting platform on Ubuntu 12.x for a private US client. It is used to source product attributes and images, classify products, and determine prices for over 30,000 products across different categories (toys, books, medical devices, footwear, apparel, etc.) from 15 different websites, delivered in multiple formats/feeds (HTML/XML/JSON, CSV, Excel, PDF, etc.), for feeding to an e-commerce site. The scrapers store the data directly in a MySQL database of five tables.
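As an illustration of how such scrapers feed MySQL, here is a trimmed-down item pipeline sketch; the credentials, table, and column names are placeholders, not the client's actual schema:

    # Illustrative Scrapy item pipeline writing items to MySQL via MySQLdb.
    # The connection details, table, and column names are placeholders.
    import MySQLdb

    class MySQLStorePipeline(object):
        def open_spider(self, spider):
            self.conn = MySQLdb.connect(
                host="localhost", user="scraper",
                passwd="secret", db="products", charset="utf8"
            )
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Parameterized query: scraped text never goes raw into SQL.
            self.cursor.execute(
                "INSERT INTO product (title, price, url) VALUES (%s, %s, %s)",
                (item.get("title"), item.get("price"), item.get("url")),
            )
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()

A pipeline like this is switched on through Scrapy's ITEM_PIPELINES setting.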
Thanks,
Malik.