I have a list of about 500 US government websites from which I want to collect public information, and the list is constantly expanding. I want you to use the Scrapy Python framework to collect the requested information.
Example site that will be crawled: http://www.dppps.sc.gov/sc_most_wa
The basic idea:
1) Write crawlers
2) Collect/download (crawl) text data about most wanted people and transform it into a predefined item (set of fields).
3) Collect/download (crawl) images and be able to associate the text information from 2) with this image.
4) Output the crawled data in JSON format and save the collected images to folders on the file system.
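Steps 2) to 4) could be sketched roughly as follows, using only the Python standard library so the data flow is visible without Scrapy itself. The item class, its field names (including `last_name`, `image_urls`, and `image_paths`), and the helper functions are assumptions for illustration, not the project's actual skeleton. The hash-based file name mirrors how Scrapy's images pipeline names downloaded files by default, but the real layout should follow the reference crawlers that will be provided.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WantedPersonItem:
    # Hypothetical field set; the real predefined item comes from the project skeleton.
    first_name: str
    last_name: str
    source_url: str
    image_urls: list = field(default_factory=list)
    image_paths: list = field(default_factory=list)


def image_path_for(url: str, folder: str = "full") -> str:
    """Derive a stable relative file path for an image from its URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{folder}/{digest}.jpg"


def attach_image_paths(item: WantedPersonItem) -> WantedPersonItem:
    """Record where each image is stored so the text record points at its files."""
    item.image_paths = [image_path_for(url) for url in item.image_urls]
    return item


def to_json_line(item: WantedPersonItem) -> str:
    """Serialize one item as a single JSON object (one line per crawled person)."""
    return json.dumps(asdict(item), sort_keys=True)
```

Because the image path is computed from the image URL, the JSON record and the file on disk can be matched later without keeping any extra lookup table.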
The basic skill requirements for this project are:
1) Know how to write Python code and be able to use the Scrapy Python framework.
2) Understand how to translate fields from the information on the website to a predefined set of fields.
example: Sometimes a site will say "Name" and that field will need to be split into "first_name","middle_name","last_na
3) Pay close attention to detail in your work.
4) Be able to learn from QA responses and apply that to your method of creating future crawlers
5) JIRA experience
6) Git experience
7) Ability to follow a premade skeleton such that the output of your code matches the output of previously created code. (Previously created code will be provided for reference.)
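As an illustration of the field translation described in skill 2), here is a minimal sketch of splitting a single "Name" value into separate fields. The function name `split_name`, the `last_name` key, and the two input formats it handles ("First Middle Last" and "Last, First Middle") are assumptions; real sites will need their own rules, and the provided skeleton takes precedence.

```python
def split_name(full_name: str) -> dict:
    """Split a raw name string into first, middle, and last name fields."""
    # Collapse repeated whitespace so splitting is predictable.
    cleaned = " ".join(full_name.split())

    if "," in cleaned:
        # Assumed format: "Last, First Middle"
        last, _, rest = cleaned.partition(",")
        parts = rest.split()
        first = parts[0] if parts else ""
        middle = " ".join(parts[1:])
        last = last.strip()
    else:
        # Assumed format: "First Middle Last"
        parts = cleaned.split()
        first = parts[0] if parts else ""
        last = parts[-1] if len(parts) > 1 else ""
        middle = " ".join(parts[1:-1])

    return {"first_name": first, "middle_name": middle, "last_name": last}
```

Names with suffixes, multi-word surnames, or aliases are not handled here and would need per-site rules.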
Things that will be provided for you:
1) A very large code base has already been created, and you will have access to the best crawlers in it to use as starting points.
2) You will receive access to our JIRA project for it and you will be able to see everything that is going on inside the project.
3) You will receive a private Git repository for your team on GitHub.
The budget is $20 per crawler, paid to you on oDesk via bonuses. I would like at least 20 crawlers completed per week, but there is no limit on the number I will pay for each week, as long as they are written properly and pass QA.
Note: The $20 per crawler will be paid only if the crawler passes QA.