Company Name Matching Engine

Cancelled. Posted Mar 31, 2012. Paid on delivery.

$30-5000 USD

I often need to match company names between two separate, large CSV files. Matching company names well is not a trivial task. Several algorithms and techniques should be considered, including Levenshtein edit distance, Smith-Waterman distance, Jaccard token distance, and weighting common company-name tokens differently from uncommon ones.

For example, given company names such as:

DSZ Investments, LLC

D.S.Z Investment Company

DSZ Investments, L.L.C

DSG Investments, LLC

The first three should be considered the same company, but the fourth should be considered a separate company even though its edit distance from the others is very small. The common token "Company" must carry very low weight in the match, whereas the uncommon token "DSG" must carry a much heavier weight because of its rarity.
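To make the weighting idea concrete, here is a minimal Python sketch of one possible approach, not a finished design. It first shows that plain edit distance cannot separate DSZ from DSG, then scores names with a Jaccard similarity in which each token is weighted by a smoothed inverse document frequency. The frequency table, the corpus size, and the crude plural stripping are all assumptions made up for illustration; in practice the counts would be computed from the actual CSV files.

```python
import math
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def tokens(name):
    """Lowercase, drop periods ('D.S.Z' -> 'dsz'), split on non-alphanumerics,
    and crudely strip a trailing 's' so 'investments' matches 'investment'."""
    words = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    return {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words}

# ASSUMPTION: invented document frequencies standing in for counts that
# would be gathered from the real files (how many names contain each token).
N_NAMES = 10_000
DOC_FREQ = {"llc": 4000, "company": 3000, "investment": 500, "dsz": 1, "dsg": 1}

def weight(token):
    """Smoothed IDF: rare tokens get large weights, common tokens small ones."""
    return math.log((N_NAMES + 1) / (DOC_FREQ.get(token, 0) + 1))

def weighted_jaccard(name_a, name_b):
    """Weight of the shared tokens divided by weight of all tokens."""
    a, b = tokens(name_a), tokens(name_b)
    union = sum(weight(t) for t in a | b)
    if union == 0:
        return 0.0
    return sum(weight(t) for t in a & b) / union

names = [
    "DSZ Investments, LLC",
    "D.S.Z Investment Company",
    "DSZ Investments, L.L.C",
    "DSG Investments, LLC",
]

# Raw edit distance sees only a one-character difference between DSZ and DSG.
print("edit distance:", levenshtein(names[0], names[3]))

# The weighted token score keeps the first three together and pushes DSG away.
for other in names[1:]:
    print(f"{names[0]!r} vs {other!r}: {weighted_jaccard(names[0], other):.3f}")
```

With these made-up frequencies the first three names score well above the fourth, so a single threshold between the two groups would separate them. Where that threshold belongs on real data is something to tune, not something this sketch settles.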

Attached to this post is a highly relevant document that I have read. The principles in it should be codified and integrated into the project.

Experience with this type of matching, or with designing these types of algorithms, would be very helpful. I work in a Unix environment and am looking for a command-line tool that can be run from the bash shell.
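As a rough illustration of the shape such a command-line tool could take, the Python sketch below reads two CSV files, compares every pair of names, and writes matches above a threshold to standard output as CSV. Everything specific here is an assumption for the sake of the example: the column name `name`, the `--column` and `--threshold` options, the short list of legal-form tokens to ignore, and the unweighted Jaccard score. The all-pairs comparison is quadratic, so genuinely large files would need a blocking or indexing step first.

```python
import argparse
import csv
import re
import sys

# ASSUMPTION: a small hand-picked list of low-information legal-form tokens.
COMMON = frozenset({"llc", "inc", "corp", "co", "company", "ltd"})

def key_tokens(name):
    """Normalize a name to its distinctive tokens (periods dropped,
    trailing 's' crudely stripped, common legal-form tokens removed)."""
    words = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    stemmed = {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words}
    return stemmed - COMMON

def jaccard(a, b):
    """Shared tokens divided by all tokens; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def read_names(path, column):
    """Return the non-empty values of one column from a CSV with a header row."""
    with open(path, newline="", encoding="utf-8") as fh:
        return [row[column] for row in csv.DictReader(fh) if row.get(column)]

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Match company names between two CSV files (sketch).")
    parser.add_argument("left_csv")
    parser.add_argument("right_csv")
    parser.add_argument("--column", default="name",
                        help="header of the column holding company names")
    parser.add_argument("--threshold", type=float, default=0.8,
                        help="minimum similarity to report a match")
    args = parser.parse_args(argv)

    # Tokenize the right-hand file once, then compare every pair.
    right = [(n, key_tokens(n)) for n in read_names(args.right_csv, args.column)]
    out = csv.writer(sys.stdout)
    out.writerow(["left_name", "right_name", "score"])
    for left_name in read_names(args.left_csv, args.column):
        left_tokens = key_tokens(left_name)
        for right_name, right_tokens in right:
            score = jaccard(left_tokens, right_tokens)
            if score >= args.threshold:
                out.writerow([left_name, right_name, f"{score:.3f}"])
    return 0

# To run this as a script, end the file with the usual guard:
#     if __name__ == "__main__": sys.exit(main())
```

Saved under a hypothetical name such as match_names.py with that guard added, it could be invoked from bash as `python match_names.py a.csv b.csv --threshold 0.8 > matches.csv`.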

Please review the attached document and let's get the conversation going. Canned replies will be ignored.

Thanks for your interest in this project.

Skills: Script Install, Shell Script

Project ID: #2727519

About the project

1 proposal. Remote project. Active Apr 22, 2012.

1 freelancer is bidding an average of $636 for this job

AnkSoftware

See private message.

$635.8 USD in 20 days

(4 reviews)

5.0
