Python Data Capture and Sanitation



Job Description

Deal of the Day – Request for Development


LUE has commissioned the creation of a database to warehouse lead data scraped from our existing order entry platform (OEP). The scrape is performed by Python-based screen-scraping software developed for LUE by a third party and built on the Mechanize library; however, due to revisions in the platform's reporting tool, it is currently non-functional. SQL is managed via the SQLAlchemy library. The project is currently hosted on Linode infrastructure. The current interface for this software is powered by Twisted, a Python-based, event-driven network application engine loosely comparable to the Python web framework Django or perhaps Node.js. Additional development work was done to interact with the email service provider MailChimp, but this API interaction script was never put to use.

Operational Constraints:

Initial project completion target has been placed at 2/15/2013, if possible. Due to the short window of time to complete this project, major revisions, rewrites or expansions of the current code base are likely not feasible without considerable expense. It is therefore recommended that the existing Python code base be retained, corrected and expanded.

Proposed Revisions to Existing Code Base:

The existing code base must be revised to account for the changes in our OEP's reporting tool in order to re-enable data capture. These changes are unlikely to be significant, as the break in functionality is most likely due to small changes in the way the reporting tool's HTML is structured.
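
For illustration only, the sketch below shows one way a scraper can be made less sensitive to small HTML changes: it reads report rows by their header text instead of by fixed cell positions. It uses the standard library's html.parser rather than the existing Mechanize code, and the sample markup, the column names and the names TableRowParser and extract_leads are all hypothetical, since the real report layout is not described here.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        # Finish the open cell, if any, and add its text to the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing closing tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract_leads(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *data = parser.rows
    return [dict(zip(header, row)) for row in data]


# Hypothetical report markup; the real OEP report will differ.
SAMPLE_REPORT = """
<table class="report">
  <tr><th>Name</th><th>Email</th></tr>
  <tr><td>Ada Example</td><td>ada@example.com</td></tr>
  <tr><td>Bob Example</td><td><span>bob@example.com</span></td></tr>
</table>
"""

leads = extract_leads(SAMPLE_REPORT)
```

Because cells are matched to header names, extra wrapper tags inside a cell or a reordering of columns would not break the extraction, though a renamed header still would.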

Proposed Extension: Sanitation

LUE has selected the provider Tower Data to sanitize our lead data. Tower Data provides real-time email validation via a SOAP service and also via batch upload. LUE would like leads to be sanitized via the SOAP service after the scrape has been completed. The sanitized status of each lead must also be recorded in the database, to ensure that only actionable leads are retained in our active database.
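
As a rough sketch of the post-scrape step described above, the example below validates each unchecked lead and writes the result back to its row. Several things here are assumptions made for the sake of a self-contained example: the standard library's sqlite3 stands in for the project's SQLAlchemy layer, the leads table and its email_status column are hypothetical, and fake_validator is a placeholder for a function that would wrap the Tower Data SOAP request.

```python
import sqlite3


def sanitize_leads(conn, validate_email):
    """Validate every unchecked lead and store the outcome in its row.

    validate_email is any callable that takes an address and returns True
    or False; in production it would wrap the Tower Data SOAP request.
    Returns the number of leads that were checked.
    """
    pending = conn.execute(
        "SELECT id, email FROM leads WHERE email_status = 'unchecked'"
    ).fetchall()
    for lead_id, email in pending:
        status = "valid" if validate_email(email) else "invalid"
        conn.execute(
            "UPDATE leads SET email_status = ? WHERE id = ?", (status, lead_id)
        )
    conn.commit()
    return len(pending)


def fake_validator(email):
    # Placeholder rule for the demo only; not how Tower Data validates.
    return "@" in email and not email.endswith(".invalid")


# Hypothetical schema in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leads ("
    "id INTEGER PRIMARY KEY, email TEXT NOT NULL, "
    "email_status TEXT NOT NULL DEFAULT 'unchecked')"
)
conn.executemany(
    "INSERT INTO leads (email) VALUES (?)",
    [("ada@example.com",), ("bob@example.invalid",)],
)

checked = sanitize_leads(conn, fake_validator)
actionable = [
    row[0]
    for row in conn.execute(
        "SELECT email FROM leads WHERE email_status = 'valid' ORDER BY id"
    )
]
```

Keeping the status on the lead row means a later run only re-checks leads still marked unchecked, and the upload step can select only the rows marked valid.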

Proposed Extension: Interspire Upload

LUE has selected the company Interspire to act as our email service provider. Interspire provides a robust XML-based API service, using a self-hosted API client written in PHP. Because their documentation provides code examples in PHP that use cURL to post data to this client, the PycURL library could be used to interact with it from Python. The API provides functionality for email lead upload, list management, and checks against existing lists.
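
To give a sense of what a lead upload request might involve, the sketch below builds an XML payload with the standard library. The element names and the AddSubscriberToList method name are assumptions based on a general recollection of the Interspire XML API and must be checked against the Interspire API docs; the credentials, the list ID and the function name build_add_subscriber_request are hypothetical. The payload is only built here, not sent; posting it to the self-hosted client could then be done with PycURL or any other HTTP library.

```python
import xml.etree.ElementTree as ET


def build_add_subscriber_request(username, token, list_id, email):
    """Build an XML request body asking to add one email to one list.

    Element names are assumed, not confirmed; verify them against the
    Interspire API documentation before use.
    """
    root = ET.Element("xmlrequest")
    ET.SubElement(root, "username").text = username
    ET.SubElement(root, "usertoken").text = token
    ET.SubElement(root, "requesttype").text = "subscribers"
    ET.SubElement(root, "requestmethod").text = "AddSubscriberToList"
    details = ET.SubElement(root, "details")
    ET.SubElement(details, "emailaddress").text = email
    ET.SubElement(details, "mailinglist").text = str(list_id)
    # ElementTree escapes special characters such as & and < in the text.
    return ET.tostring(root, encoding="unicode")


# Hypothetical credentials and list ID.
payload = build_add_subscriber_request(
    "api_user", "example-token", 1, "ada@example.com"
)
```

Building the document with ElementTree rather than string formatting keeps the payload well-formed even when a lead's data contains characters that need escaping.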

Towerdata API docs:

Interspire API docs:

Twisted Web App docs:

Mechanize Library:
