Closed

Develop algorithm to remove repeating text

Every website has text that repeats on each page: for example the header, the footer, and perhaps a sidebar.

Usually the important text on a page is the text that is unique to that page.

For example, if you look at these two websites, you can see there is duplicate text on both pages (mostly at the top and bottom of the pages) which is not important:

[url removed, login to view]

[url removed, login to view]

The important text is mostly the unique job description text.

I need you to develop an algorithm in Python (you can use a library; it doesn't need to be original code) which is able to detect duplicate text. So if you could imagine we merged the HTML from the two links above into a single document, your code would remove the header and footer (and perhaps some other text) due to it being duplicate text in the document.
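
One possible reading of the request above, as a minimal sketch rather than a finished solution: split each page into visible text blocks, count how many pages each block appears on, and drop every block that shows up on more than one page. The names `text_blocks` and `unique_text` are hypothetical, exact string matching is an assumption (near-duplicates such as a footer with a changing date would survive), and only the standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise runs of whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the visible text blocks of one HTML page, in document order."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_text(pages):
    """For each HTML page, keep only the text blocks found on no other page."""
    per_page = [text_blocks(html) for html in pages]
    # Count each block once per page, so a repeat inside a single page
    # is not mistaken for site-wide boilerplate.
    page_counts = Counter(b for blocks in per_page for b in set(blocks))
    return [[b for b in blocks if page_counts[b] == 1] for blocks in per_page]


if __name__ == "__main__":
    # Made-up pages standing in for the two job listings.
    page_a = "<div>Acme Jobs</div><p>Python developer wanted.</p><div>Contact us</div>"
    page_b = "<div>Acme Jobs</div><p>Java tester wanted.</p><div>Contact us</div>"
    print(unique_text([page_a, page_b]))
    # [['Python developer wanted.'], ['Java tester wanted.']]
```

With more than two pages, a threshold (for example, "appears on at least half of the pages") may work better than the strict "appears on exactly one page" rule used here.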

Any questions, just ask.

I am not interested in a WordPress website. Thanks.

Skills: Python, Web Scraping

About the Employer:
(0 reviews) Netherlands

Project ID: #12155198

The average bid from 24 freelancers for this job is €143

flashsaiful

Hi, I can do this for you. Please send a message in the PMB for details. Best regards, flashsaiful

€155 EUR in 3 days
(112 reviews)
6.4
DanielVizcaya91

Hello there, my name is Daniel and I would love to help you out with this project. I have a lot of experience parsing texts in order to obtain useful information, so I think that can be applied here to identify duplicate …

€198 EUR in 5 days
(58 reviews)
6.4
lkhelladi

Hello, I'd be glad to implement the desired Python tool for you. Looking forward to chatting with you soon for more details. Best regards,

€94 EUR in 2 days
(39 reviews)
5.4
€155 EUR in 3 days
(29 reviews)
5.0
cracken

Hi, I am competent in this kind of task and can take good care of this project. In fact, I have already done work related to this job before. We can use regex and difflib to compare the two sets of data. Let me know the best of you …

€249 EUR in 5 days
(12 reviews)
4.4
adilhussain0411

Hello! My name is Mehnaz Bashir. I am writing in response to your project. After carefully reviewing the experience requirements and skills required for the job, I feel that I am a suitable match for the job. I have …

€30 EUR in 3 days
(9 reviews)
4.2
some235one

Hi, I can do this using Python. I have done something similar to Wikipedia. The exact solution will depend on how many pages you need

€277 EUR in 3 days
(9 reviews)
4.0
Gnus

Hey, I can write such code by scraping the links, splitting the content into tokens and then comparing them. But are you interested in HTML DOM browsing? In your example link, that would mean scraping everything in this tag: <articl …

€70 EUR in 3 days
(4 reviews)
3.6
MacJeremy

To whom it may concern, if I understood you correctly, I take both pages, compare them, delete everything that is the same, and merge the rest into one page? I am at your disposal for further ques …

€250 EUR in 10 days
(2 reviews)
3.4
€155 EUR in 3 days
(1 review)
2.9
€155 EUR in 3 days
(2 reviews)
3.0
€88 EUR in 3 days
(4 reviews)
2.5
Orpiv

Hello, I would like to introduce our company, Orpiv Tech. We have done projects like yours before; we can show you our past work, or you can check out our portfolio. We can perfectly develop an algorithm in Py …

€40 EUR in 3 days
(1 review)
2.9
drishinfotech

Hello, thank you for the posting. I checked the sites and would like to collaborate with you on this task. Regarding texts, we can use Python libraries like BeautifulSoup or urllib and for the desired …

€111 EUR in 3 days
(1 review)
0.8
phourxx

Greetings, you're looking for a Python programmer to develop a web scraping tool to scrape the details of a job from the website mentioned in the project details. Talking about a perfect match, I am a core Python pro …

€90 EUR in 2 days
(1 review)
0.6
dichotamous

A proposal has not yet been provided

€155 EUR in 4 days
(0 reviews)
0.0
jsbot

Can we go with Selenium and Java? (You'll get a better robot with Selenium if the language is not a concern.) We've scraped many websites with Selenium, among them e-commerce giants Amazon and Flipkart. For any query on Automati …

€222 EUR in 53 days
(0 reviews)
0.0
yuvalkainan

A proposal has not yet been provided

€133 EUR in 3 days
(0 reviews)
0.0
DevoirTechsoft

Hello, we have studied the requirements and found that they match our skills. We have an enthusiastic team with years of experience in HTML, CSS, UI design, PHP+MySQL, JavaScript, jQuery, AJAX, e-commerce …

€155 EUR in 3 days
(0 reviews)
0.0
ngemzinou

I think I understood what you want. I am still not sure what output format you are looking for: do you want HTML output or just text files? An algorithm for this task may not be perfect if the pages' layout/tags a …

€222 EUR in 3 days
(0 reviews)
0.0