I'm looking for a creative developer; relevant experience with web crawling technology is required.
The project goal is to design and develop a standalone, intelligent web crawler that is capable of processing extremely high volumes of data on a monthly basis.
Advanced Search Engine in Java:
• Crawl and index a set of websites defined in the admin area, both in the background and in real time.
• Extract links from each page based on configured keywords.
• Access sites over HTTP and HTTPS, with automatic login on websites that require authentication.
• Save links, HTTPS sessions and other information from each site in a database.
• Report the different errors (parsing, login, etc.).
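To make the keyword-based link extraction requirement more concrete, here is a minimal sketch of what is meant. It is an illustration only, not a prescribed design: the class name KeywordLinkExtractor and its extract method are hypothetical, a link is assumed to match when a keyword appears in its URL or anchor text, and the regular expression is a deliberate simplification (a real crawler would most likely use a proper HTML parser).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and matching rules are assumptions for illustration.
final class KeywordLinkExtractor {
    // Simplified pattern for <a ... href="...">text</a>; not a full HTML parser.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns absolute URLs of links whose href or anchor text contains a keyword. */
    static List<String> extract(String html, String baseUrl, List<String> keywords) {
        List<String> result = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // Strip nested tags from the anchor text before matching.
            String text = m.group(2).replaceAll("<[^>]+>", "");
            String haystack = (href + " " + text).toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                    try {
                        String absolute = base.resolve(href).toString();
                        if (!result.contains(absolute)) {
                            result.add(absolute); // keep first occurrence only
                        }
                    } catch (IllegalArgumentException e) {
                        // Malformed href: a real crawler would log this as a parsing error.
                    }
                    break;
                }
            }
        }
        return result;
    }
}
```

Calling KeywordLinkExtractor.extract(html, "https://example.com/index.html", List.of("java", "crawler")) would return only the links that mention one of the two keywords, resolved against the page URL.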
Website:
• Simple web/HTML user interface linked to the search engine
• User registration
• Multilingual support (English, German, Russian, Mandarin, Japanese, etc…)
• Customized search based on user roles
• Sponsored links
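As a rough idea of what "customized search based on user roles" could look like, the sketch below filters search hits by the roles of the logged-in user. Everything in it is an assumption made for illustration: the RoleAwareSearch and Doc names, the rule that a document with no roles listed is visible to everyone, and the plain title match standing in for a real index query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: names and visibility rules are assumptions, not a specification.
final class RoleAwareSearch {
    /** One indexed page; an empty allowedRoles set is assumed to mean "public". */
    static final class Doc {
        final String url;
        final String title;
        final Set<String> allowedRoles;

        Doc(String url, String title, Set<String> allowedRoles) {
            this.url = url;
            this.title = title;
            this.allowedRoles = allowedRoles;
        }
    }

    /** Returns URLs whose title contains the query and which the user's roles may see. */
    static List<String> search(List<Doc> index, String query, Set<String> userRoles) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Doc doc : index) {
            boolean matches = doc.title.toLowerCase(Locale.ROOT).contains(q);
            boolean visible = doc.allowedRoles.isEmpty()
                || doc.allowedRoles.stream().anyMatch(userRoles::contains);
            if (matches && visible) {
                hits.add(doc.url);
            }
        }
        return hits;
    }
}
```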
Backend GUI:
• Define URLs for crawling/searching/indexing
• Manage the search engine configuration
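To illustrate the kind of configuration the backend GUI would manage, the sketch below reads a list of crawl targets from a standard Java properties text. The key names (target.N.url, target.N.maxDepth, target.N.realTime, target.N.keywords), the defaults and the CrawlConfig class are all hypothetical; the actual storage format (properties file, database table, etc.) is open for the developer to propose.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical configuration model: key names and defaults are assumptions.
final class CrawlConfig {
    /** One admin-defined site to crawl. */
    static final class Target {
        final String url;
        final int maxDepth;        // how many link levels to follow
        final boolean realTime;    // crawl on demand rather than in the background
        final List<String> keywords;

        Target(String url, int maxDepth, boolean realTime, List<String> keywords) {
            this.url = url;
            this.maxDepth = maxDepth;
            this.realTime = realTime;
            this.keywords = keywords;
        }
    }

    /** Parses consecutive target.1.*, target.2.*, ... entries until a url key is missing. */
    static List<Target> parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Target> targets = new ArrayList<>();
        for (int i = 1; props.containsKey("target." + i + ".url"); i++) {
            String prefix = "target." + i + ".";
            List<String> keywords = Arrays.stream(props.getProperty(prefix + "keywords", "").split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
            targets.add(new Target(
                props.getProperty(prefix + "url").trim(),
                Integer.parseInt(props.getProperty(prefix + "maxDepth", "2").trim()),
                Boolean.parseBoolean(props.getProperty(prefix + "realTime", "false").trim()),
                keywords));
        }
        return targets;
    }
}
```

A flat, numbered key scheme like this is easy to edit by hand and to generate from a web form, which fits the requirement that administration be simple; a database-backed model would serve equally well.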
The whole module should be lightweight, and installation and administration should be user-friendly and easily customizable.
The developer should be able to provide a conceptual view of the system before starting the development.
The code must be clear and commented.
The site should be optimized for the major search engines, with Google as the first priority.
The technology offered should be based on open source solutions and be scalable, e.g. able to withstand a significant number of hits per day through database clustering, server load balancing or other appropriate means.
A more detailed scope will be provided to successful bidders, subject to a non-disclosure agreement.
Further functions will be added later, so I'm looking for a long-term relationship with a stable company.
An ongoing period of support will be necessary after delivery, and it is likely that further enhancement work will be required.