
Data Extraction from Word documents using Python or similar tool

$1500-3000 USD

In progress
Posted about 9 years ago

Paid on delivery
I need someone expert in Python or a similar tool to extract data from text documents that I will provide, pulling out numerical and text data according to rules I supply. The documents are in Russian, so the ability to read the Cyrillic alphabet is helpful, but as long as you can include Cyrillic text in the code, it is not essential that you understand the words.
Here is more detail. I have about 40,000 individual Microsoft Word documents (1-4 pages each), which are court decisions in criminal cases (in Russian). I want to create a dataset from the information contained in these documents: for example, the name of the judge, the crimes charged, and the sentence imposed. I have written “identification rules” (in English, but using Russian “trigger words”) that indicate how to extract or code all the variables I want.
For example, to extract the name of the defendant (accused criminal) in each case, the rule is “Extract the first Capitalized word following any one of these trigger words [УСТАНОВИЛ OR Установил OR У С Т А Н О В И Л OR установил].” Another example: to record whether a court reached a verdict, I want to create a variable “verdict,” which can be done by following the rule “Enter 1 or True if the title/header of the document is [П Р И Г О В О Р OR ПРИГОВОР].” Some rules require extracting text from the documents, and some just require a 1/0 or True/False depending on whether certain trigger words or phrases occur in each document.
You would have to code the rules I provide so as to extract the information into a dataset in a format that STATA can read (e.g., Excel, delimited text such as CSV, dta, xml). Again, you only need to reproduce the Russian words in your code, not understand them, although understanding would help. There are a lot of variables I want in the final dataset, and therefore a lot of identification rules: about 150, which take up about 30 pages in a Word document.
The rules vary in complexity, and some have multiple parts, unlike the examples above. That means the programmer will need to communicate with me if some rules are confusing or difficult to program, and I will try to revise them. I also anticipate that the first attempt at compiling the dataset will reveal problems that have to be addressed, and the programmer should be willing to help me address them.
I am uploading the following to help you figure out whether you can do this project: a document with just a few sample rules (“Sample Rules”), a document with just 2 samples of the cases from which the data is to be extracted (“Sample Cases”), and an Excel document that gives a sense of what I need the final product to look like (“Sample Output”), although it need not be in Excel, of course.
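The two sample rules quoted above could be coded roughly as follows. This is only a sketch under stated assumptions: treating a "Capitalized word" as one uppercase Cyrillic letter followed by lowercase Cyrillic letters, treating the "title/header" as the first non-empty line, the column names, and the sample text are all illustrative choices, not part of the posted rules.

```python
import csv
import io
import re

# Trigger spellings quoted in the posting for the defendant-name rule.
DEFENDANT_TRIGGERS = ["УСТАНОВИЛ", "Установил", "У С Т А Н О В И Л", "установил"]
# Header spellings quoted in the posting for the verdict rule.
VERDICT_HEADERS = ["П Р И Г О В О Р", "ПРИГОВОР"]

_TRIGGER_ALT = "|".join(re.escape(t) for t in DEFENDANT_TRIGGERS)
# Assumption: a "Capitalized word" is an uppercase Cyrillic letter
# followed by one or more lowercase Cyrillic letters.
DEFENDANT_RE = re.compile(
    r"(?<!\w)(?:%s)(?!\w).*?\b([А-ЯЁ][а-яё]+)" % _TRIGGER_ALT,
    re.DOTALL,
)


def extract_defendant(text):
    """Return the first capitalized word after a trigger, or '' if none."""
    match = DEFENDANT_RE.search(text)
    return match.group(1) if match else ""


def is_verdict(text):
    """Return 1 if the first non-empty line is one of the verdict headers."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return int(stripped in VERDICT_HEADERS)
    return 0


def rows_to_csv(rows):
    """Serialize result rows to CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["doc_id", "defendant", "verdict"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample text, not taken from the "Sample Cases" attachment.
SAMPLE = "ПРИГОВОР\nИменем Российской Федерации\nСуд УСТАНОВИЛ:\nИванов совершил кражу."
row = {
    "doc_id": "sample-1",
    "defendant": extract_defendant(SAMPLE),
    "verdict": is_verdict(SAMPLE),
}
print(rows_to_csv([row]))
```

Keeping the trigger lists as plain data means the Russian spellings can be pasted in from the rules document without the programmer needing to read them.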
Project ID: 7289126

About this project

27 proposals
Remote project
Active 9 years ago

Awarded to:
I'm a native Russian speaker and a computer science professional with a PhD degree and excellent Python skills. Natural Language Processing is my favorite field, and I have extensive knowledge of it. Please make sure to check my COMPLETE profile (ALL Skills) and see the reviews I received from other employers. It would be my pleasure to do your project and to discuss it with you. Please have a look at another large programming project I completed recently: https://www.freelancer.com/jobs/Excel-Mathematics/Convert-math-proofs-Excel-formulas/ As you can see, I am perfectly capable of handling a $1,500 project to the complete satisfaction of my employer.
$2,000 USD in 21 days
5.0 (24 reviews)
5.6
27 freelancers are bidding an average of $2,181 USD for this job
Hi, I am a top 7th full-stack freelancer. Please get in touch and discuss in detail. I can get started right now. Best!
$3,092 USD in 30 days
5.0 (9 reviews)
6.7
Hello, I'm a Python developer and I'm very interested in your project. The Cyrillic alphabet is native for me because I'm from Ukraine. Please kindly provide more details about the project requirements. Thanks.
$2,500 USD in 25 days
5.0 (24 reviews)
5.8
Hello, I am an experienced Python programmer and I'd like to do this job for you. I'm also a native Russian speaker, so that will simplify the work for me. I'm going to use the python-docx library for doc file parsing (tell me if I cannot use it) and Python 3 (again, if you need a Python 2 script, tell me). I believe I understand the rules; however, implementing some of them will require significant effort. Thank you in advance. PS: Price, time, and milestone splitting are approximate and can be discussed. PPS: Don't be afraid, I don't work for the Federal Security Service :)
$2,275 USD in 30 days
5.0 (75 reviews)
5.8
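As a rough illustration of the document-reading step this bid refers to, here is a minimal sketch that pulls paragraph text out of a .docx file. It deliberately swaps in the Python standard library (zipfile and xml.etree) instead of python-docx so that it has no dependencies. It assumes the files are in the .docx (Office Open XML) format; legacy .doc files would need converting first. The helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over one or more <w:t> run elements.
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        paragraphs.append(text)
    return paragraphs


def tiny_docx(lines):
    """Build a bare-bones in-memory .docx for demonstration purposes only."""
    body = "".join("<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % line for line in lines)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


print(docx_paragraphs(tiny_docx(["ПРИГОВОР", "Суд установил"])))
```

This sketch ignores headers, footers, and footnotes, and a paragraph nested inside a text box would be reported more than once, so a real run over court decisions would need checking against the sample cases.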
A proposal has not yet been provided
$1,500 USD in 30 days
5.0 (98 reviews)
5.8
Good day, I'm a computer scientist with 6+ years of experience in Natural Language Processing and, as a consequence, in converting Word documents to all possible formats. Just a few months ago I was working on a project where I had to convert a Word document to plain text with some specific tags. Based on that experience, I would say that Python is not the best option for scraping Word documents: there is python-docx, a package for processing Word documents, but it is still not good enough. I would suggest a combination of Java (docx4j) and Python, which I think is the best possible combination. If you are interested, contact me here. Please note that the price and time are indicative and depend on further specifications.
$2,500 USD in 30 days
5.0 (8 reviews)
4.1
Hello, you can place your confidence in my Python/Russian-Cyrillic knowledge and experience. Please feel free to ask for any information. Greetings, srdjan
$1,500 USD in 30 days
5.0 (15 reviews)
4.1
Hello, I've been working with Python for the last 3 years and I have great experience in data processing with all types of characters (including Cyrillic alphabets). I'll complete your task within the time frame at minimal cost and will provide you complete support until the end. Hope to hear from you soon!
$1,666 USD in 4 days
5.0 (10 reviews)
4.3
Hi, I am a professional web data scraper specializing in Python programs, PHP scripts, .NET programs, crawlers, and bots. My tool can search data and get information from Aa to Zz with an existing list of English words. Below is the link for your reference as a sample related to my tool being developed. This demo will capture a doctor's name, address, zip, phone, ratings, and reviews on 4 different sites. The final output will be saved in *.XLSX format or as per your requirement. I can start as early as possible, depending on your approval and acceptance. Rest assured that I will deliver high-quality, reliable, efficient, and accurate output. Give me a try and I will try to get the best results and finish the project far before the deadline. Thanks, Ferdous
$2,500 USD in 30 days
4.8 (6 reviews)
3.8
I studied Russian for two years, so I can read Cyrillic letters, though only at school level, so I don't understand much of the sample document. I have ten years of experience with Python, and although what you ask is complex, I believe I can solve the task. I expect to collaborate with you from time to time using effective written communication and also regular web-based reports (I can set this up quickly, don't worry). I recommend a fast-insert NoSQL database (MongoDB) for engine output, and a web report to show rule engine results and perhaps the original document text for comparison. With this kind of architecture I believe we can make the best and most accurate progress.
$3,000 USD in 10 days
5.0 (2 reviews)
3.2
Greetings for the day! I have 6 years of experience in .NET, VBA macros, VBScript, PS scripts, and VB development with applications like SAP, Internet Explorer, Microsoft Outlook, PDF and text files, MS Access, and SQL Server databases. I have also worked on extraction from websites like Amazon, Cellpex, Costco, etc. If awarded this project, I hope to deliver it with maximum accuracy and satisfaction. Please award me this project and contact me for further details. Thanks, Prabakar M
$2,500 USD in 30 days
5.0 (3 reviews)
2.5
Hi there! I know Russian and have 2 years of experience in Python, so I'm able to build this parsing tool for you.
$1,500 USD in 12 days
4.6 (3 reviews)
2.3
A proposal has not yet been provided
$1,500 USD in 10 days
4.1 (2 reviews)
2.3
Hello, as far as I can tell, you want a file containing a summary for every verdict (defendant, judge, sentence, etc.), and you are providing rules for extracting every item (variable) required in that summary. Do you mind giving me an example of a more complex rule? I am a student and have experience in data extraction using scripting languages, though on a much smaller scale. I also know Serbian Cyrillic (very similar to Russian), which might be helpful.
$1,800 USD in 30 days
5.0 (1 review)
1.4
Good afternoon, my name is Alec and I'm from Ukraine. I know the Russian language perfectly, and I think that will make it easier to solve your problem. I have experience programming in Python to scrape sites and online shops. Scraping Word documents is essentially no different. If you are interested, I can do an example for free. Thank you, I await your response.
$2,222 USD in 10 days
0.0 (0 reviews)
0.0
I'm a Serb and my native script is Cyrillic. I can finish this task because I have scraped data from many text documents.
$1,500 USD in 15 days
0.0 (0 reviews)
0.0
I have worked extensively on scraping and regular-expression projects. I've looked over the documents you provided and it doesn't seem hard at all. My approach would be to use Python for the whole lot. I will have a standard container that rules can be fitted into, to make it easy to add new rules to the code as needed. You did not mention how you would like the results of the script to be stored. I would suggest an SQL database so that the results are easily queried.
$2,222 USD in 30 days
0.0 (0 reviews)
0.0
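The "standard container that rules can be fitted into" mentioned in this bid might look something like the sketch below. The bid gives no design details, so the Rule class, the register and run_rules helpers, and the column name has_ustanovil are all hypothetical; only the Russian trigger spellings come from the project description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str                        # column name in the output dataset
    apply: Callable[[str], object]   # maps document text to a value


RULES: List[Rule] = []


def register(name, func):
    """Add a rule to the shared registry (hypothetical helper)."""
    RULES.append(Rule(name, func))


def contains_any(phrases):
    """Build a 1/0 rule: does any phrase occur anywhere in the text?"""
    def check(text):
        # Plain substring test, so 'установил' also matches 'установила'.
        return int(any(phrase in text for phrase in phrases))
    return check


def first_line_in(headers):
    """Build a 1/0 rule: is the first non-empty line one of the headers?"""
    def check(text):
        first = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
        return int(first in headers)
    return check


# New rules are appended here without touching the engine below.
register("verdict", first_line_in(["П Р И Г О В О Р", "ПРИГОВОР"]))
register(
    "has_ustanovil",
    contains_any(["УСТАНОВИЛ", "Установил", "У С Т А Н О В И Л", "установил"]),
)


def run_rules(text):
    """Apply every registered rule and return one dataset row as a dict."""
    return {rule.name: rule.apply(text) for rule in RULES}


# Made-up sample text for illustration.
print(run_rules("ПРИГОВОР\nСуд установил: ..."))
```

With around 150 rules, keeping each rule as a small named function makes it easier to revise a single rule after the client reviews a first pass of the dataset.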
Hello, we are from Novorossiya, so I think there will be no problems with the Russian language. To do everything you asked to a good standard we need about a month; this includes development, testing, and delivery of a finished product at the end. We have implemented similar products, so I think there will be no problems. We will carry out the deal with payment in installments. Write to me and we will discuss it in more detail.
$2,000 USD in 30 days
0.0 (0 reviews)
0.0
Hello, my name is Alexandru. I have 3+ years of experience building complete custom applications using Python. Your project has very clear specifications, thank you for that, and it is very interesting. My solution for this project is based on Python 2.7 and PyQt4, for a nice GUI with which you can easily manipulate any document at any time. The application will also allow you to add or remove any rule without the help of a programmer. Please let me know if you are interested, after which I can provide more details. Looking forward to your reply, Alexandru
$3,333 USD in 15 days
0.0 (0 reviews)
0.0
Hello, I am an experienced software developer and a native Russian speaker. I am interested in your project. I can help you by writing Python code to scrape data from the text documents and produce a CSV file according to the rules you provide. Best regards, Volodymyr
$2,000 USD in 15 days
0.0 (0 reviews)
1.0
Hi, I have more than 14 years of experience and I am an expert in this kind of work. I have completed more than 225 projects. Please look at the feedback left by my employers to learn more about my work. Waiting for your positive response. Thanks.
$2,850 USD in 60 days
0.0 (0 reviews)
1.3

About the client

Cambridge, United States
5.0
1
Payment method verified
Member since November 19, 2014

Client verification

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)