Automated Data Mining/Extraction from Online PDFs

Cancelled · Posted Nov 21, 2011 · Paid on delivery

**Full description is attached.**

DO NOT APPLY FOR THIS JOB IF YOU HAVE NOT READ THE ENTIRE DESCRIPTION.

THIS IS AN AUTOMATED DATA MINING/EXTRACTION JOB -- NOT A MANUAL ONE.

We are looking for a contractor with solid experience developing software for data extraction from online PDFs. The PDFs are scanned copies of IRS forms filed by charities and are available through a single public online source. There are six different types of forms. Because of the IRS scanning process, the position of the data varies from one scanned form to another, and so does scan quality.

We want someone who has demonstrated a history of substantial, successful data mining using PDFs and OCR. If you are looking to learn or expand your profile, this is not for you. Fluent English is a must.

The contractor must develop a program that will do the following:

1. Download scanned PDFs of mixed quality from the online source using a list of URLs in a text file provided by the buyer (approximately 300,000 PDFs and URLs).

2. Extract up to ten numeric and text data fields from each PDF using a combination of automated graphical manipulation and OCR. The location of the data on the pages will be different for each of the 6 types of forms.

3. Incorporate error-checking based on related data fields selected by buyer.

4. Format the data output as a CSV to be uploaded to buyer's SQL database.

5. Provide well-commented source code and an executable. The program will be run on an ongoing basis by the buyer.

6. Deliver written step-by-step operating instructions that a novice user can readily understand and follow.

7. Pass the following accuracy tests when operated by the buyer: Based on 10,000 URLs chosen by the buyer, the program will (a) download 100% of the PDFs and (b) correctly extract from the downloaded PDFs 90% or greater of the designated data fields, with the error-checking identifying all data fields where extraction failed.
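Steps 3 and 4 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the buyer's specification: the field names (`total_revenue`, `total_expenses`, `net_income`), the rule that net income equals revenue minus expenses, the `extraction_ok` flag column, and every function name are hypothetical, since the posting does not say which related fields the buyer will select. The OCR step itself is left out; the records are assumed to arrive as raw strings from an earlier extraction stage.

```python
import csv
import io

# Hypothetical field names; the posting does not name the actual fields.
FIELDS = ["url", "form_type", "total_revenue", "total_expenses", "net_income"]


def parse_amount(text):
    """Convert OCR text such as '1,234' or '(56)' to an int, or None on failure."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "").replace("$", "")
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        # Accounting-style negative, e.g. "(56)"
        negative, cleaned = True, cleaned[1:-1]
    elif cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    # Reject anything that is not plain ASCII digits (e.g. 'O' misread for '0').
    if not (cleaned.isascii() and cleaned.isdigit()):
        return None
    value = int(cleaned)
    return -value if negative else value


def check_record(record):
    """Cross-check related fields (assumed rule: revenue - expenses == net income)."""
    revenue = parse_amount(record.get("total_revenue"))
    expenses = parse_amount(record.get("total_expenses"))
    net = parse_amount(record.get("net_income"))
    if revenue is None or expenses is None or net is None:
        return False
    return revenue - expenses == net


def write_csv(records, out):
    """Write records plus an extraction_ok flag column to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS + ["extraction_ok"])
    writer.writeheader()
    for record in records:
        row = {name: record.get(name, "") for name in FIELDS}
        row["extraction_ok"] = "1" if check_record(record) else "0"
        writer.writerow(row)


def pass_rate(records):
    """Share of records that pass the cross-field check (not the same as accuracy)."""
    if not records:
        return 0.0
    return sum(check_record(r) for r in records) / len(records)
```

Flagging each row instead of dropping failed rows keeps every URL in the output, so the buyer can route flagged records to manual review. Note that `pass_rate` only counts rows whose related fields agree; measuring the 90% accuracy target in step 7 would still require comparing against values verified by hand.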


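Skills: Engineering, Microsoft, Project Management, Script Install, Shell Script, Software Architecture, Software Testing, Windows Desktop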

Project ID: #3710190

About the project

1 proposal · Remote project · Active Dec 5, 2011

1 freelancer is bidding on average $5001 for this job

matfizvw

See private message.

$5000.55 USD in 21 days
(54 reviews)
6.0