Crawler Specifications
1. The crawler must be invoked from the command line with several parameters that control its behavior.
Required Parameters:
url : the URL to crawl as the start page
Optional Parameters:
max-links : limits the total number of links to fetch (this is not the number of concurrent requests). Defaults to no limit
max-depth : limits the depth of requests from the start url. Defaults to no limit
wait : seconds to wait between link requests. Defaults to 0
include-external : if a page links to an external page, send only a HEAD request to get its status. Defaults to yes
robots : whether the crawler follows the [login to view URL] rules. Defaults to yes
link-rel : whether the crawler respects the rel attribute of a link. Defaults to yes
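The parameter list above could be wired up with argparse as in the sketch below; the yes/no values and defaults follow the spec, while the exact flag spellings (`--max-links`, etc.) are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the crawler's command-line parameters (sketch; names mirror the spec)."""
    parser = argparse.ArgumentParser(description="Web crawler")
    parser.add_argument("url", help="URL to crawl as the start page")
    parser.add_argument("--max-links", type=int, default=None,
                        help="total number of links to fetch (default: no limit)")
    parser.add_argument("--max-depth", type=int, default=None,
                        help="maximum depth of requests from the start URL (default: no limit)")
    parser.add_argument("--wait", type=float, default=0,
                        help="seconds to wait between link requests (default: 0)")
    parser.add_argument("--include-external", choices=["yes", "no"], default="yes",
                        help="send a HEAD request to external links (default: yes)")
    parser.add_argument("--robots", choices=["yes", "no"], default="yes",
                        help="follow the site's robots rules (default: yes)")
    parser.add_argument("--link-rel", choices=["yes", "no"], default="yes",
                        help="respect the rel attribute of links (default: yes)")
    return parser.parse_args(argv)
```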
2. The crawler must crawl only internal links, i.e. links whose host matches the host name part of the url parameter. Subdomains are excluded. If include-external is set to yes, send only a HEAD request to external links and do not follow the links inside them.
3. The crawler must also request links, JS, CSS, objects, videos, and other website prerequisites.
4. The crawler should not download the body of responses other than text/html; it should fetch only the headers of those files.
5. The crawler must write console output for every link crawled, showing the headers and the link (for logging purposes).
6. If a runtime error occurs in the crawler, it must continue crawling the remaining links. The error must be printed to the console output (for logging purposes).
7. Links must be stored in a MySQL database, including prerequisites and, if include-external is on, external links.
8. Do not request links that have already been requested.
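Requirements 2, 4, and 8 can be sketched as small helpers, assuming an exact hostname match and a simple in-memory seen set; the names here are illustrative, not part of the spec:

```python
from urllib.parse import urlparse

def is_internal(link, start_url):
    """Requirement 2: only links whose host exactly matches the start URL's
    host are internal; subdomains are excluded."""
    return urlparse(link).hostname == urlparse(start_url).hostname

def should_fetch_body(content_type):
    """Requirement 4: download the body only for text/html responses;
    everything else gets a headers-only (HEAD) request."""
    return content_type.split(";")[0].strip().lower() == "text/html"

class SeenFilter:
    """Requirement 8: never request the same link twice."""
    def __init__(self):
        self._seen = set()

    def first_visit(self, link):
        if link in self._seen:
            return False
        self._seen.add(link)
        return True
```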
Database Specifications
DBMS: MySQL
Tables
websites: InnoDB
• id – auto increment id
• url – url of the website passed as url parameter from the crawler
• created – date and time when the record is inserted
links: InnoDB
• id – auto increment id
• website_id – foreign key column. Id of website from websites table
• url – url of the requested link.
• name – depends on the file type: for HTML, take the title tag; for a file, take the filename from the response header
• headers – response headers of the requested link
• mimetype – mime type of the requested link
• md5_hash – MD5 hash of the response body. Not applicable for files or external links.
• sha1_hash – SHA-1 hash of the response body. Not applicable for files or external links.
• created – date and time when the record is inserted
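The md5_hash and sha1_hash columns can be derived from the downloaded response body with Python's hashlib; a minimal sketch (the function name is hypothetical):

```python
import hashlib

def body_hashes(body: bytes):
    """Compute the md5_hash and sha1_hash column values for a text/html
    response body (not applicable for files or external links)."""
    return hashlib.md5(body).hexdigest(), hashlib.sha1(body).hexdigest()
```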
link_relations: InnoDB
• link_id – foreign key column. Id of the link from links table. This is the main entity.
• parent_id – foreign key column. Id of the link from the links table. This is the link (referrer) in which the link_id entity was found.
• depth – Depth of the link_id entity.
Note: a link can have multiple parents and different depths, e.g. a Contact link may appear on both the homepage and the about page, so populate this table once per link relation.
The first depth must be the links found at the website url, i.e. the start page of the crawl. They must have a parent_id of 0.
Ex. [login to view URL] is a website from the websites table. The depth of the links inside this page must be set to 1 and parent_id to 0.
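The depth and parent_id rules above can be sketched as a row builder for link_relations; the relation_rows helper and its input shape are assumptions for illustration, not part of the spec:

```python
def relation_rows(found_on):
    """Build link_relations rows from a mapping of
    (parent link id, depth of parent) -> list of child link ids found there.
    Per the spec, start-page links use parent_id 0, so depth 0 for the start
    page itself yields depth 1 for its links. A link found under several
    parents yields one row per parent."""
    rows = []
    for (parent_id, parent_depth), child_ids in found_on.items():
        for child_id in child_ids:
            rows.append({"link_id": child_id,
                         "parent_id": parent_id,
                         "depth": parent_depth + 1})
    return rows
```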
Hi. I am an experienced web-crawling developer with experience in Scrapy and in Python itself. I have done a similar project, and I think we can work on this one too.
Let me know what the deadline for the project is.
Hi.
We are a group of experienced python/javascript developers.
We have done many scraping projects using Scrapy and BeautifulSoup frameworks.
Most of these features are already available in the Scrapy framework; we just need to integrate them and develop glue functions to build the final scraper.
We can use SQLAlchemy in a pipeline to cleanly dump the scraped items into the MySQL database with the features you have specified in the description.
Let us talk more. We would be glad to help.
Thanks
Hi,
I have crawled information from other websites before. I think I can do this well. Let me do it and you will love my quality results.
Many thanks,
Liem
Hello, my name is Seifert, and I built a crawler two years ago in college; of course, at that time it had fewer specifications. The important thing is that I have the experience and knowledge for this topic, and I am sure I will do great work. If you have any doubts or questions, contact me.
Regards Seifert