Web scraping tool

Closed - This job posting has been filled.

Job Description

I need a script that scrapes some data from a defined set of URLs and appends it to a comma-separated or other text file (data will keep being added to the same file). The script will be run every 8 hours on an Ubuntu Desktop machine using a cron job. Each time the script runs, the data and the time it was fetched should be placed in a new row in the file. The script can be written in anything that runs on Ubuntu Desktop (I used to run it in 'R', the statistics package, but since the page structure changed I haven't been able to update it).
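For reference, a crontab entry along these lines would run the script every 8 hours (the script and log paths here are only placeholders):

    0 */8 * * * /home/user/scrape_indeed.py >> /home/user/scrape_indeed.log 2>&1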

If fetching the data fails for any reason for any of the URLs, the value placed in that column should be "NA".

The data to scrape:
Each of the URLs listed below shows a piece of text that says, for example, "Jobs 1 to 10 of 34,509". I need the value 34509 extracted and stored under the column name for that URL.
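As a rough sketch of the extraction step (assuming the count still appears in a phrase like the one above; the exact wording on the page may have changed), a Python regex along these lines would pull out the number and strip the comma:

    import re

    text = "Jobs 1 to 10 of 34,509"          # example phrase from the page
    match = re.search(r"of ([\d,]+)", text)  # capture the final count, commas included
    count = match.group(1).replace(",", "") if match else "NA"
    print(count)                             # prints 34509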

URLs:
The column headings and the URLs to scrape the data from are as follows:
1. OpRisk - "http://www.indeed.com/jobs?q=%22operational+risk%22&l="
2. RiskTech - "http://www.indeed.com/jobs?q=%22risk+technology%22&l="
3. ITRisk - "http://www.indeed.com/jobs?q=%22it+risk%22+or+%22technology+risk%22&l="
4. CreditRisk - "http://www.indeed.com/jobs?q=%22credit+risk%22&l="
5. MktRisk - "http://www.indeed.com/jobs?q=%22market+risk%22&l="
6. RiskMgmt - "http://www.indeed.com/jobs?q=%22risk+management%22&l="

Example output (this is tab-delimited, but it could be comma-separated or another format):
datetime riskmgmt mktrisk creditrisk oprisk risktech itrisk
4/14/2012 15:42 38169 966 3243 1709 249 1653
4/14/2012 15:43 38169 966 3243 1709 249 1659
4/14/2012 15:46 38169 966 3243 1709 249 1653
4/14/2012 16:00 38151 963 3239 1707 249 1654
4/14/2012 17:00 38185 964 3244 1705 251 1657
4/14/2012 18:00 38210 965 3245 1705 252 1657
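
As one suggestion for accomplishing the above, here is a minimal end-to-end sketch in Python. The column names and URLs are taken from the list above, while the output path, the "Jobs 1 to 10 of ..." wording, and the plain urllib fetch are assumptions that would need to be checked against the live pages:

    import csv
    import os
    import re
    import urllib.request
    from datetime import datetime

    OUTFILE = "/home/user/indeed_counts.csv"   # hypothetical path; adjust as needed

    COLUMNS = [
        ("OpRisk",     "http://www.indeed.com/jobs?q=%22operational+risk%22&l="),
        ("RiskTech",   "http://www.indeed.com/jobs?q=%22risk+technology%22&l="),
        ("ITRisk",     "http://www.indeed.com/jobs?q=%22it+risk%22+or+%22technology+risk%22&l="),
        ("CreditRisk", "http://www.indeed.com/jobs?q=%22credit+risk%22&l="),
        ("MktRisk",    "http://www.indeed.com/jobs?q=%22market+risk%22&l="),
        ("RiskMgmt",   "http://www.indeed.com/jobs?q=%22risk+management%22&l="),
    ]

    def fetch_count(url):
        # Return the total job count as a string, or "NA" on any failure.
        try:
            html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "ignore")
            match = re.search(r"Jobs \d+ to \d+ of ([\d,]+)", html)   # assumed page wording
            if match:
                return match.group(1).replace(",", "")
        except Exception:
            return "NA"
        return "NA"

    row = [datetime.now().strftime("%m/%d/%Y %H:%M")] + [fetch_count(url) for _, url in COLUMNS]

    # Append the new row; write the header only when the file is first created.
    write_header = not os.path.exists(OUTFILE)
    with open(OUTFILE, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["datetime"] + [name for name, _ in COLUMNS])
        writer.writerow(row)

Whether a plain urllib fetch is enough, or whether the site needs a different user agent or a proper HTML parser, would need to be confirmed against the current pages.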

Any other suggestions for accomplishing the above are welcome.