
Create a web spider as linux daemon

$250-750 USD

Awarded
Posted over 10 years ago

Paid on delivery
A Linux daemon should spider intranet websites and extract some data. The base URLs of the intranet servers are given as ([login to view URL], [login to view URL] ... [login to view URL]).

A C++ application (daemon) should be built with an interface that allows managing/creating a list of pages (URLs):

- add a host to be spidered (going through all pages on the site, creating a list of the site's pages)
- add a single URL to be spidered (adding it to a site's list of pages)
- remove a host (not to be spidered in future, deleting all related Xapian data and lists of pages)
- remove a single URL, deleting all related Xapian data and removing it from the list of pages to be spidered
- allow setting a list of URL parameters that should be ignored (session IDs, for example)
- specify a time interval after which an already-spidered URL has to be spidered again
- specify a time interval between successive requests to a site IP, to avoid overloading it
- specify a max_depth parameter defining how deep the site should be crawled
- for each site host, a dedicated process should do this job, e.g. 10 site IPs to spider -> 10 processes

The interface should allow definitions such as: spider all URLs from [login to view URL], all from [login to view URL] except [login to view URL], plus spider only [login to view URL].

The processes that spider through the list of pages should:

- get the content of each URL, splitting it into text (content without HTML tags), encoding (charset), title, canonical URL and description (from meta info), plus the current date+time*.
- pass this data to a different application through a function call.

The spider must not run into infinite loops; it therefore has to check whether the raw site content of a URL is identical to that of the same URL with some different parameter. If possible, it should use the canonical tag for this.
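The ignored-parameter requirement above (strip session IDs and similar parameters before comparing URLs) could be sketched as follows; `normalize_url` and its signature are illustrative assumptions, not part of the posting:

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Strip the ignored query parameters (e.g. session ids) from a URL so
// that two URLs differing only in those parameters compare as equal.
std::string normalize_url(const std::string& url,
                          const std::set<std::string>& ignored) {
    const auto qpos = url.find('?');
    if (qpos == std::string::npos) return url;  // no query string
    const std::string base = url.substr(0, qpos);
    std::istringstream query(url.substr(qpos + 1));
    std::vector<std::string> kept;
    std::string param;
    while (std::getline(query, param, '&')) {
        const std::string key = param.substr(0, param.find('='));
        if (ignored.count(key) == 0) kept.push_back(param);
    }
    if (kept.empty()) return base;
    std::string result = base + "?";
    for (std::size_t i = 0; i < kept.size(); ++i) {
        if (i > 0) result += '&';
        result += kept[i];
    }
    return result;
}
```

The daemon would normalize every discovered URL this way before checking it against a site's page list, so that `...page?sid=abc&q=1` and `...page?sid=xyz&q=1` map to the same entry.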
To determine whether a site has already been spidered, the responsible process can "ask" (via a function call) whether the URL has already been spidered (based on the data extracted with *), and if so, whether that was more than max_interval days ago. Yes: spider it again and get the data; no: continue with the next URL.

Starting points:
- [login to view URL]
- [login to view URL]
- [login to view URL]
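The "spider again after max_interval days" decision described above reduces to a small time comparison; the function name `needs_respider` and the convention that zero means "never spidered" are my assumptions for illustration:

```cpp
#include <ctime>

// Return true if the URL must be fetched again: either it has never been
// spidered (last_spidered == 0) or its last visit is older than
// max_interval_days.
bool needs_respider(std::time_t last_spidered, std::time_t now,
                    int max_interval_days) {
    if (last_spidered == 0) return true;  // never spidered yet
    const double elapsed_days = std::difftime(now, last_spidered) / 86400.0;
    return elapsed_days > max_interval_days;
}
```

Each per-host process would call this check before fetching a URL from its page list, skipping entries whose last visit is still within the interval.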
Project ID: 4979819

About this project

5 proposals
Remote project
Active 11 years ago


About the client

Eichberg, Switzerland
5.0 rating (3 reviews)
Member since September 25, 2011

Client verification
