Development of web scraping modules in Scrapy

Cancelled, posted Mar 23, 2010, paid on delivery

You should develop spider modules that extract the ads from the following five Italian sites, using the Python scraping framework Scrapy:

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

Please see the attached file for a Scrapy example project with two scraping modules for [url removed, login to view] and [url removed, login to view]

Specifically, your task will be:

* Find good starting URLs for the five specified sites, to ensure that the sites can be widely scraped for new ads

* Develop the spider modules for the five sites. The scraping modules MUST be robust, i.e. you MUST NEVER use absolute (full) XPath paths to extract the requested elements. Instead, use relative, well-targeted expressions based on attributes (such as id, class, width, etc.) or on other identifying features, such as contains(@href, 'image').

* Ensure that all available fields (described later), where present in the ads, are extracted. Not all sites have all fields in their ads: you should check which fields are present in the ads of each site, and extract them
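The robustness requirement above can be illustrated with a minimal sketch. It uses the standard-library `xml.etree.ElementTree` only so that it runs without Scrapy installed. The markup, class names, and the `price_text` helper are hypothetical and not taken from any of the five sites. ElementTree supports only a subset of XPath (no `contains()`), whereas Scrapy selectors accept full XPath 1.0 expressions, so the same idea carries over there with richer predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical ad page; real sites will use different markup.
ORIGINAL = """<html><body><div id="wrapper">
  <div class="banner">Promo</div>
  <div class="ad"><h1 class="ad-title">Appartamento</h1><span class="price">250000</span></div>
</div></body></html>"""

# The same page after the site adds one more banner above the ad.
SHIFTED = ORIGINAL.replace(
    '<div class="banner">Promo</div>',
    '<div class="banner">Promo</div>\n  <div class="banner">Promo 2</div>',
)

# Absolute path: depends on the exact position of every ancestor.
ABSOLUTE = "./body/div/div[2]/span"
# Relative path: keyed on identifying attributes, independent of position.
RELATIVE = ".//div[@class='ad']/span[@class='price']"


def price_text(markup, path):
    """Return the text of the first node matching path, or None if absent."""
    node = ET.fromstring(markup).find(path)
    return node.text if node is not None else None


print(price_text(ORIGINAL, ABSOLUTE))  # found on the original layout
print(price_text(SHIFTED, ABSOLUTE))   # lost once the layout shifts
print(price_text(SHIFTED, RELATIVE))   # still found
```

The absolute expression stops matching as soon as a sibling element is added, while the attribute-based one keeps working on both versions of the page.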

**A basic knowledge of Italian is required to work on this project.**

## Deliverables

Here is a list of the fields which you should extract from the ads, where present. Please note that some sites have only a few of these fields:

* source = (string, fixed) the name of the site, e.g. [url removed, login to view] (without the http://)

* title = (string) the title of the ad, e.g. "Appartamento" or "Villa" (like in the examples, you should remove the city/province/region from the title)

* city = (string) the city where the building of the ad is located

* province = (string) the province where the building of the ad is located

* region = (string) the region where the building of the ad is located

* area = (string) for larger cities, the area of the city where the building of the ad is located

* address = (string) the address of the building of the ad

* description = (string) the description of the building of the ad

* sale_rent = (integer) 0 if the building is for sale, 1 if the building is for rent (suggestion: check for the words "vendita" and "affitto" in the ad, as in the examples)

* publish_date = (date) the date when the ad was published

* price = (integer) the price of the building of the ad (default -1 if not specified)

* building_type = (string) the type of the building, e.g. "Residenziale"

* building_surface = (integer) the building surface, in square meters (default -1 if not specified)

* rooms = (integer) the number of rooms of the building (default -1 if not specified)

* bathrooms = (integer) the number of bathrooms of the building (default -1 if not specified)

* box_type = (string) the type/description of the garage (Italian "box auto"), if the building has one

* box_surface = (integer) the surface of the garage in square meters, if the building has one (default -1 if not specified)

* has_balcony = (integer) 0 if the ad says that the building doesn't have a balcony, 1 if the ad says it has a balcony, -1 if unspecified

* has_terrace = (integer) 0 if the ad says that the building doesn't have a terrace, 1 if the ad says it has a terrace, -1 if unspecified

* has_elevator = (integer) 0 if the ad says that the building doesn't have an elevator, 1 if the ad says it has an elevator, -1 if unspecified

* garden_type = (string) the type of the garden

* garden_surface = (integer) the garden surface, in square meters (default -1 if not specified)

* floor = (integer) the floor of the building (default -1 if not specified)

* heating_type = (string) the type of the heating, e.g. "Autonomo" or "Centralizzato"

* building_condition = (string) the condition of the building, e.g. "Ottimo" or "Buono" or "Ristrutturato"
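The default and encoding rules in the field list can be sketched as a few plain-Python helpers. This is one possible reading of the rules, not a required implementation. The names `parse_int`, `parse_sale_rent`, and `parse_flag` are hypothetical, and so are the negation words checked in `parse_flag`. The brief gives no default for sale_rent, so returning `None` when neither word is found is an assumption, as is checking "affitto" first when both words appear.

```python
import re

UNSPECIFIED = -1  # default required by the brief for missing integer fields


def parse_int(text):
    """Return the first integer in text, or -1 if there is none.

    Dots are treated as Italian thousands separators ("250.000" -> 250000).
    """
    if not text:
        return UNSPECIFIED
    match = re.search(r"\d[\d.]*", text)
    if not match:
        return UNSPECIFIED
    return int(match.group(0).replace(".", ""))


def parse_sale_rent(text):
    """Return 1 for rent ("affitto"), 0 for sale ("vendita"), None if unknown."""
    lowered = (text or "").lower()
    if "affitto" in lowered:
        return 1
    if "vendita" in lowered:
        return 0
    return None  # assumption: the brief defines no default here


def parse_flag(text, keyword):
    """Return 1 if keyword is mentioned, 0 if mentioned but negated, -1 if absent.

    The negation check (senza / no / non before the keyword) is a rough heuristic.
    """
    lowered = (text or "").lower()
    if keyword not in lowered:
        return UNSPECIFIED
    negated = re.search(r"\b(?:senza|no|non)\b[^.,;]*" + re.escape(keyword), lowered)
    return 0 if negated else 1


print(parse_int("€ 250.000"))                                 # price
print(parse_sale_rent("Appartamento in vendita a Milano"))    # sale_rent
print(parse_flag("Piano terzo senza ascensore", "ascensore")) # has_elevator
```

Keeping these rules in small pure functions lets each spider pass in whatever text its site exposes, while the -1 and 0/1 conventions stay identical across all five modules.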

You will need to install Scrapy, along with these Python modules:

* libxml2

* lxml

* pywin32 (if you work in win32)

* Twisted

* [url removed, login to view]

PHP

Project ID: #3286407

About the project

1 proposal, remote project, active Apr 6, 2010

1 freelancer is bidding an average of $30 for this job

addshells

See private message.

$29.75 USD in 10 days
(1 review)
0.0