Closed

Develop algorithm to remove repeating text

This project received 27 bids from talented freelancers, with an average bid of €140 EUR.

Project budget: €30 - €250 EUR
Total bids: 27
Project description

Every website has text that repeats on each page, for example the header and footer, and perhaps a sidebar.

Usually the important text on the page is the unique text on the page.

For example, if you look at these two websites, you can see there is duplicate text on both pages (mostly at the top and bottom of the pages) which is not important:

[url removed, login to view]

[url removed, login to view]

The important text is mostly the unique job description text.

I need you to develop an algorithm in Python (you can use a library; it doesn't need to be original code) that detects duplicate text. For example, if we merged the HTML from the two links above into a single document, your code would remove the header and footer (and perhaps some other text) because that text appears more than once in the document.
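To make the request concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes that "duplicate text" means a text chunk between tags that occurs more than once in the merged document, and it uses only the Python standard library. The names `text_blocks` and `remove_repeated_text` and the sample pages are hypothetical, made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the text chunks of an HTML string in document order."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def remove_repeated_text(merged_html):
    """Keep only the text chunks that occur exactly once in the merged HTML."""
    blocks = text_blocks(merged_html)
    counts = Counter(blocks)
    return [block for block in blocks if counts[block] == 1]


# Two made-up pages sharing a header and a footer.
page_a = "<html><body><div>Acme Jobs</div><p>Python developer wanted.</p><div>Contact us</div></body></html>"
page_b = "<html><body><div>Acme Jobs</div><p>Java developer wanted.</p><div>Contact us</div></body></html>"

print(remove_repeated_text(page_a + page_b))
# ['Python developer wanted.', 'Java developer wanted.']
```

Exact matching is the simplest choice here. Real pages often have headers that differ slightly between pages (a highlighted menu item, a date), so a fuzzy comparison may be needed in practice.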

Any questions, just ask.

I am not interested in a WordPress website. Thanks.
