Company Name Matching Engine

Cancelled. Posted Mar 31, 2012. Paid on delivery.

I often need to match company names between two separate, large CSV files. Matching company names well is not a trivial task. Several algorithms and techniques should be considered, including Levenshtein edit distance, Smith-Waterman distance, Jaccard token distance, and weighting common company-name tokens differently from uncommon ones.
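To make two of the measures named above concrete, here is a minimal Python sketch of Levenshtein edit distance and Jaccard token distance. The function names and the tokenization rule (lowercase runs of letters and digits) are my own assumptions for illustration, not something taken from the attached document.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def tokens(name: str) -> set:
    """Assumed tokenization: lowercase runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)
```

Neither measure knows anything about which characters or tokens matter, so on its own each one treats a one-letter difference in a distinctive token the same as a one-letter difference in a generic suffix.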

For example, provided company names such as:

DSZ Investments, LLC

D.S.Z Investment Company

DSZ Investments, L.L.C

DSG Investments, LLC

The first three should be considered the same company, but the fourth should be considered a separate company, even though its edit distance from the first is only one character. The common token "Company" must carry very low weight in the match, whereas the uncommon token "DSG" must carry much heavier weight because of its rarity.
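One way to get that behaviour is a weighted Jaccard similarity over normalized tokens. The sketch below is only an illustration under stated assumptions: the weight table, the 0.75 threshold, the dot-stripping, and the crude plural-stripping rule are all hypothetical values chosen for this example. In a real run the weights would more plausibly be derived from token frequencies across the two CSV files.

```python
import re

# Hypothetical weights: common company-name tokens count for little,
# any token not listed here is treated as rare and gets weight 1.0.
COMMON_WEIGHTS = {
    "llc": 0.1, "company": 0.1, "inc": 0.1, "corp": 0.1, "ltd": 0.1,
    "investment": 0.3,
}
DEFAULT_WEIGHT = 1.0
THRESHOLD = 0.75  # assumed cut-off for "same company"


def normalize(name: str) -> set:
    """Lowercase, drop dots so 'D.S.Z' and 'L.L.C' collapse to 'dsz' and
    'llc', split on other punctuation, and strip a trailing 's' from
    tokens longer than three characters (a crude plural rule)."""
    text = name.lower().replace(".", "")
    result = set()
    for tok in re.findall(r"[a-z0-9]+", text):
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        result.add(tok)
    return result


def weight(token: str) -> float:
    return COMMON_WEIGHTS.get(token, DEFAULT_WEIGHT)


def weighted_similarity(a: str, b: str) -> float:
    """Weighted Jaccard: weight of shared tokens / weight of all tokens."""
    ta, tb = normalize(a), normalize(b)
    union = sum(weight(t) for t in ta | tb)
    if union == 0:
        return 0.0
    return sum(weight(t) for t in ta & tb) / union


def same_company(a: str, b: str) -> bool:
    return weighted_similarity(a, b) >= THRESHOLD
```

With these assumed weights, "DSZ Investments, LLC" and "D.S.Z Investment Company" differ only in low-weight tokens and land above the threshold, while "DSG Investments, LLC" shares only low-weight tokens with the others and lands well below it.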

I have attached a highly relevant document to this post. The principles in it should be codified and integrated into the project.

Experience with this type of matching, or with designing these types of algorithms, would be very helpful. I work in a Unix environment and am looking for a command-line tool that can be run from the bash shell.

Please review the attached document and let's get the conversation going. Canned replies will be ignored.

Thanks for your interest in this project.

Script Install, Shell Script

Project ID: #2727519

About the project

1 proposal, remote project, active Apr 22, 2012

1 freelancer is bidding an average of $636 for this job

AnkSoftware

See private message.

$635.8 USD in 20 days
(4 reviews)
5.0