We need a skilled Python coder to write a script that is going through a set of HTML documents and parses the DOM tree and gives us a JSON stream of the objects like headings, links, images etc.
It is important that we extract as much information as possible in a structured way. For an example, have a look at the attachment.
I am a Python guy myself but am caught up in plenty of work. So please let me know how you would tackle this so I can make sure you know what you're talking about. Knowledge of mechanize, beautifulsoup and NoSQL databeses are a plus. Write me for more infos.
Skills: json, dom