Crawler & Scraping

Cancelled

Job Description

This is a complex task: the goal is a near-"generic" crawler that intelligently crawls the web to find events and then populates our site with them.

The first step is described below. If all goes well, we are looking for a long-term contract hire who can step into a leadership role and bring independent, creative input to a difficult problem.

--
Write a crawler in Java that periodically scrapes the web for the latest events. We will provide the initial seed set. Each event should have the following fields, each with a valid value:

1. Title
2. Description
3. Image URL (find the image URL and upload the image)
4. Location (with latitude and longitude)
5. Time (start time, plus end time if the source provides one)
6. Source URL
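
The field list above could be modelled roughly as follows. This is only a sketch: the class name CrawledEvent, its field names, and the specific validation rules are assumptions made for illustration and are not part of the requirements. Only the end time is treated as optional, as the list states.

```java
import java.net.URI;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical value type for one crawled event; all names are illustrative. */
final class CrawledEvent {
    final String title;
    final String description;
    final URI imageUrl;              // where the image can be fetched
    final String address;
    final double latitude;
    final double longitude;
    final Instant startTime;
    final Optional<Instant> endTime; // empty when the source lists no end time
    final URI sourceUrl;

    CrawledEvent(String title, String description, URI imageUrl, String address,
                 double latitude, double longitude,
                 Instant startTime, Instant endTime, URI sourceUrl) {
        if (title == null || title.trim().isEmpty()) {
            throw new IllegalArgumentException("title is required");
        }
        // Negated range checks so that NaN coordinates are rejected as well.
        if (!(latitude >= -90.0 && latitude <= 90.0)) {
            throw new IllegalArgumentException("latitude out of range: " + latitude);
        }
        if (!(longitude >= -180.0 && longitude <= 180.0)) {
            throw new IllegalArgumentException("longitude out of range: " + longitude);
        }
        if (startTime == null) {
            throw new IllegalArgumentException("startTime is required");
        }
        if (endTime != null && endTime.isBefore(startTime)) {
            throw new IllegalArgumentException("endTime is before startTime");
        }
        if (sourceUrl == null) {
            throw new IllegalArgumentException("sourceUrl is required");
        }
        this.title = title.trim();
        this.description = description;
        this.imageUrl = imageUrl;
        this.address = address;
        this.latitude = latitude;
        this.longitude = longitude;
        this.startTime = startTime;
        this.endTime = Optional.ofNullable(endTime);
        this.sourceUrl = sourceUrl;
    }
}
```

Rejecting bad values at construction time keeps half-parsed pages out of the database; whether a missing description or image should also be rejected is left open here.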

The data should be stored in MongoDB in the following format:

> db.Nudges.find({"_id" : "n-5248b5f0e4b0356781882f86"}).pretty();
{
    "_id" : "n-5248b5f0e4b0356781882f86",
    "address" : "1 Front Street, Brooklyn, NY 11201, USA",
    "created" : NumberLong("1380496880482"),
    "desc" : "New Yorkers love to argue about the best pizza, with Di Fara's, John's and Lombardi's being among the primary contenders. We won't settle that score here, but if you have only 24 hours you can't go wrong with Grimaldi's, a coal-fired pizzeria under the shadow of the Brooklyn Bridge. Not only will you get a memorable pie, you'll also get a memorable view of Manhattan from one of the oldest — and most picturesque — parts of Brooklyn. Not to mention a jukebox filled with classics by Frank Sinatra, who, legend has it, had Grimaldi's pies flown to him in Vegas.\n",
    "loc" : {
        "lng" : -73.99325650000003,
        "lat" : 40.7025514
    },
    "photoUrl" : "http://www.google.com/helloworl.jpg",
    "source" : "nudgecrawler",
    "startTime" : NumberLong("1382374857000"),
    "title" : "Grimaldi's Pizzeria",
    "uid" : "u-51ac3bb5e4b065b447a31994",
    "updated" : NumberLong("1380496880482")
}
>

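As a rough illustration of the document layout above, the sketch below assembles the same keys in a plain Java map. The class and method names (NudgeDocuments, toNudge) are hypothetical. It assumes that the NumberLong values are epoch milliseconds, which the sample's 13-digit magnitudes suggest but the spec does not state, and that "source" is always "nudgecrawler" as in the sample. The sample shows no field for an end time or a source URL, so the sketch leaves those out rather than guess at names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that mirrors the sample Nudges document shown above. */
final class NudgeDocuments {
    static Map<String, Object> toNudge(String id, String uid, String title, String desc,
                                       String address, double lat, double lng,
                                       String photoUrl, long startTimeMillis, long nowMillis) {
        // Nested location object, with "lng" before "lat" as in the sample.
        Map<String, Object> loc = new LinkedHashMap<>();
        loc.put("lng", lng);
        loc.put("lat", lat);

        // LinkedHashMap keeps insertion order, so keys appear as in the sample.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", id);
        doc.put("address", address);
        doc.put("created", nowMillis);          // assumed epoch milliseconds
        doc.put("desc", desc);
        doc.put("loc", loc);
        doc.put("photoUrl", photoUrl);
        doc.put("source", "nudgecrawler");      // fixed value taken from the sample
        doc.put("startTime", startTimeMillis);  // assumed epoch milliseconds
        doc.put("title", title);
        doc.put("uid", uid);                    // passed through; its meaning is not specified
        doc.put("updated", nowMillis);
        return doc;
    }
}
```

With the official MongoDB Java driver on the classpath, a map like this could be wrapped in an org.bson.Document before insertion; Java long values are stored as 64-bit integers, which the shell prints as NumberLong.
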
Here is the sample seed set:

1. http://www.timeout.com/newyork
2. http://beta.flavorpill.com/newyork
3. http://www.nycgo.com/events/
4. http://events.nydailynews.com/
5. http://sf.funcheap.com/
6. http://www.sfstation.com/
7. http://www.7x7.com/
8. http://eventbrite.com
9. http://lonelyplanet.com
10. http://www.thebolditalic.com/events
11. http://travel.usnews.com/Washington_DC/Things_To_Do/
12. http://www.visitsaltlake.com/events/?e_ViewBy=search&e_submit=1&e_sortBy=eventName&e_pagesize=25&e_catID=0&e_location=-1&e_sDate=10-02-2013&e_eDate=12-30-2013&e_keyword=Keyword+Search&e_submitBtn=GO
13. http://events.cityweekly.net/
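
One possible way to feed a seed set like this into a crawler is a small frontier queue that normalizes URLs and skips duplicates. The sketch below is an assumption about how that might look, not a required design: the class name CrawlFrontier and the normalization rules (lowercase scheme and host, drop the fragment, treat an empty path as "/", ignore ports) are illustrative choices.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical FIFO crawl frontier that de-duplicates normalized URLs. */
final class CrawlFrontier {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Lowercases scheme and host, drops the fragment, maps an empty path to "/". */
    static String normalize(String rawUrl) {
        URI u = URI.create(rawUrl.trim()).normalize();
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute http(s) URL expected: " + rawUrl);
        }
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty()) ? "/" : u.getRawPath();
        String query = (u.getRawQuery() == null) ? "" : "?" + u.getRawQuery();
        // Ports are ignored in this sketch; none of the seed URLs uses one.
        return u.getScheme().toLowerCase(Locale.ROOT) + "://"
                + u.getHost().toLowerCase(Locale.ROOT) + path + query;
    }

    /** Queues the URL unless an equivalent one was already offered. */
    boolean offer(String rawUrl) {
        String normalized = normalize(rawUrl);
        if (!seen.add(normalized)) {
            return false;
        }
        queue.addLast(normalized);
        return true;
    }

    /** Returns the next URL to fetch, or null when the frontier is empty. */
    String poll() {
        return queue.pollFirst();
    }

    int size() {
        return queue.size();
    }
}
```

For the periodic part of the task, a frontier like this could be re-seeded on a timer, for example with a ScheduledExecutorService from java.util.concurrent and scheduleAtFixedRate; the crawl interval is not specified in this posting.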