On Demand Web Scraper for Product Details/Prices (with Caching Old Results)

Cancelled

Job Description

I need a web scraper built as a web script that can be called with a GET request and returns product details and prices from up to 3 predefined websites. The script must cache this data so that repeated requests for the same product do not trigger redundant scrapes.
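As a rough illustration of the request handling described above, the sketch below pulls the search phrase out of the GET request and normalizes it so that equivalent searches can share one cache entry. Python is used only for illustration; the function name `cache_key_from_url` and the normalization rule (lowercase, collapsed whitespace) are assumptions, not part of the requirements.

```python
from urllib.parse import urlparse, parse_qs


def cache_key_from_url(url: str) -> str:
    """Extract the `product` query parameter and normalize it into a cache key.

    Hypothetical helper: the normalization rule is an assumption.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("product")
    if not values or not values[0].strip():
        raise ValueError("missing required 'product' parameter")
    # parse_qs already decodes '+' and %XX escapes; lowercase and collapse
    # whitespace so that equivalent searches map to the same key.
    return " ".join(values[0].lower().split())


print(cache_key_from_url(
    "http://example.com/web-scraper-script?product=samsung+55+led"
))  # prints: samsung 55 led
```

Normalizing before the cache lookup means `Samsung 55 LED` and `samsung+55+led` would be served from the same cached result.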

EXAMPLE
=========

http://example.com/web-scraper-script?product=samsung+55+led

should scrape 3 sites (names will be provided later; for now, assume amazon.com is one of them) and return the following. The results below are from Amazon:
OPTION 1:
Samsung UN55EH6000 55-Inch 1080p 120Hz LED HDTV (Black)
PRICE
$1,049.99
PRODUCT FEATURES:
Size: 55-Inch
Full HD 1080p
Clear Motion Rate 240
ConnectShare Movie
Wide Color Enhancer Plus
TV without stand (Width x Height x Depth): 49.2-Inch x 31.1-Inch x 3.7-Inch, TV with stand (Width x Height x Depth): 49.2-Inch x 29-Inch x 9-Inch
Connect Share Movie
TV with stand (Width x Height x Depth): 49.2-Inch x 31.1-Inch x 9-Inch, TV without stand (Width x Height x Depth): 49.2-Inch x 29-Inch x 3.7-Inch
LINK:
http://www.amazon.com/Samsung-UN55EH6000-55-Inch-1080p-120Hz/dp/B0074FGSBY/ref=sr_1_1?ie=UTF8&qid=1357393626&sr=8-1&keywords=samsung+55+led

OPTION 2:
Samsung UN55ES6100 55-Inch 1080p 120Hz Slim LED HDTV (Black)
PRICE
$1,497.99
PRODUCT FEATURES:
Size: 55-Inch
Smart Content with Signature Services
Built-in Wi-Fi
Smart Hub
Web Browser
TV with stand (Width x Height x Depth): 49.3-Inch x 31.8-Inch x 10.9-Inch, TV without stand (Width x Height x Depth): 49.3-Inch x 29-Inch x 1.8-Inch
LINK:
http://www.amazon.com/Samsung-UN55ES6100-55-Inch-1080p-120Hz/dp/B007B9PP1C/ref=sr_1_2?ie=UTF8&qid=1357393626&sr=8-2&keywords=samsung+55+led

[THE TOP 5 OPTIONS MUST BE RETURNED IN THE SAME FORMAT]
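One possible way to produce the response layout shown above is sketched below. The `ProductOption` class and `render_options` function are hypothetical names introduced for illustration, and the feature list in the usage example is abbreviated from OPTION 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductOption:
    """One scraped search result (hypothetical structure)."""
    title: str
    price: str
    features: List[str]
    link: str


def render_options(options: List[ProductOption], limit: int = 5) -> str:
    """Render up to `limit` options in the plain-text layout of the example."""
    blocks = []
    for number, option in enumerate(options[:limit], start=1):
        lines = [
            f"OPTION {number}:",
            option.title,
            "PRICE",
            option.price,
            "PRODUCT FEATURES:",
            *option.features,
            "LINK:",
            option.link,
        ]
        blocks.append("\n".join(lines))
    # A blank line separates consecutive options, as in the example.
    return "\n\n".join(blocks)


example = ProductOption(
    title="Samsung UN55EH6000 55-Inch 1080p 120Hz LED HDTV (Black)",
    price="$1,049.99",
    features=["Size: 55-Inch", "Full HD 1080p"],
    link="http://www.amazon.com/Samsung-UN55EH6000-55-Inch-1080p-120Hz/dp/B0074FGSBY/",
)
print(render_options([example]))
```

Keeping the scraped fields in a structure and formatting them only at the end would also make it straightforward to store the same data in the cache.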

OTHER NOTES:
============
- All inbound requests must be logged
- All scraped data should be cached for 1 week
- The script must check the cache before fetching fresh data (to improve response times)
- Contractor is responsible for ensuring that the crawler masquerades as a regular visitor and doesn't get banned at reasonable volume levels (<10k queries per day)
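The logging and caching notes above could be combined roughly as follows. This is a minimal in-memory sketch, not a definitive design: `CachedScraper`, its `lookup` method, and the injected `scrape` and `clock` callables are all hypothetical, and a real deployment would persist the cache and log in the SQL database rather than in Python objects.

```python
import time
from typing import Callable, Dict, List, Tuple

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 1 week, per the notes above


class CachedScraper:
    """Cache-first wrapper around a scrape function (hypothetical sketch)."""

    def __init__(self, scrape: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._scrape = scrape
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}
        # One (timestamp, product, cache_hit) tuple per inbound request.
        self.request_log: List[Tuple[float, str, bool]] = []

    def lookup(self, product: str) -> str:
        now = self._clock()
        entry = self._cache.get(product)
        hit = entry is not None and now - entry[0] < CACHE_TTL_SECONDS
        # Log every inbound request, whether or not it is served from cache.
        self.request_log.append((now, product, hit))
        if hit:
            return entry[1]
        # Missing or stale entry: fetch fresh data and store it.
        result = self._scrape(product)
        self._cache[product] = (now, result)
        return result


calls = []


def fake_scrape(product: str) -> str:
    """Stand-in for the real scraper; records each call."""
    calls.append(product)
    return f"results for {product}"


now = [0.0]  # controllable clock for the demonstration
scraper = CachedScraper(fake_scrape, clock=lambda: now[0])
scraper.lookup("samsung 55 led")   # miss: scrapes
scraper.lookup("samsung 55 led")   # hit: served from cache
now[0] += CACHE_TTL_SECONDS        # one week later the entry is stale
scraper.lookup("samsung 55 led")   # miss: scrapes again
print(len(calls))  # prints: 2
```

Injecting the clock makes the 1-week expiry testable without waiting, and logging before the scrape means a request is recorded even if the fetch fails.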

PLATFORM:
==========
- Any Linux-compatible platform is fine.
- Perl / PHP / Node.js / Java preferred - will consider alternative suggestions
- MySQL database for caching and logging - this needs to be a SQL DB so that it is easy for the administrator to view and analyse
- Note: The server is to be hosted on Amazon EC2
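To illustrate why a SQL store helps the administrator, here is a hedged sketch of two tables and a simple usage report. The table and column names are assumptions, the inserted rows are made-up sample data, and SQLite (from the Python standard library) stands in for MySQL so that the example is self-contained; the MySQL column types would differ.

```python
import sqlite3

# Hypothetical schema: one table for inbound requests, one for cached scrapes.
SCHEMA = """
CREATE TABLE request_log (
    id           INTEGER PRIMARY KEY,
    requested_at TEXT NOT NULL,      -- timestamp of the inbound GET
    product      TEXT NOT NULL,      -- normalized search phrase
    cache_hit    INTEGER NOT NULL    -- 1 if served from cache, else 0
);
CREATE TABLE scrape_cache (
    product    TEXT NOT NULL,
    site       TEXT NOT NULL,        -- e.g. 'amazon.com'
    fetched_at TEXT NOT NULL,        -- used to expire entries after 1 week
    payload    TEXT NOT NULL,        -- serialized results for this site
    PRIMARY KEY (product, site)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO request_log (requested_at, product, cache_hit) VALUES (?, ?, ?)",
    [
        ("2013-01-05 10:00:00", "samsung 55 led", 0),
        ("2013-01-05 10:05:00", "samsung 55 led", 1),
        ("2013-01-05 11:00:00", "lg 42 led", 0),
    ],
)

# The kind of report an administrator might run: requests and cache hits
# per product, most requested first.
report = conn.execute(
    """
    SELECT product, COUNT(*) AS requests, SUM(cache_hit) AS hits
    FROM request_log
    GROUP BY product
    ORDER BY requests DESC, product
    """
).fetchall()
print(report)  # prints: [('samsung 55 led', 2, 1), ('lg 42 led', 1, 0)]
```

Keying the cache on `(product, site)` would let each of the 3 sites expire and refresh independently.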

FUTURE WORK:
=============
This is just the MVP for a large, complex product - there will be significant ongoing work for the successful bidder.