Basic extraction from Wikipedia (from a few specific lists to DB)

Closed. Posted Mar 22, 2011. Paid on delivery.

===================

BACKGROUND

===================

I will provide you with a few lists from Wikipedia (list of ballet companies, list of operas, list of musicals, etc.), and your job is to write a script that extracts the details into two basic MySQL tables (the structure of both tables is given below).

The deliverables for this project are (a) the populated tables and (b) the scripts that were used to extract the data.

**This is the first trial project of this kind. There is more extraction work ahead.**

===================

DATA STRUCTURE

===================

There will be two tables: "entities" table and "entity_names" table:

**entities** table:

- ID

- Wikipedia_Page

- Type

- Primary name ID (which will point to "ID" from "entity_names" table)

**entity_names** table:

- ID

- entity_ID (which will point to "ID" from "entities" table)

- Name

- Type (primary or secondary)

The reason we're using two tables is that a given entity could later have more than one name/alias (for example, "San Francisco Symphony" could also be called "SF Symphony"). For everything you will be extracting, you can set the value of the "Type" field of the "entity_names" table to "primary".
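The two-table layout above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in so it runs without a server (real MySQL DDL would differ, e.g. `AUTO_INCREMENT` and `ENUM`), the lowercase column names and the `add_entity` helper are hypothetical, and the page URL is a placeholder.

```python
import sqlite3

# Assumed column names, derived from the field list in the brief.
SCHEMA = """
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    wikipedia_page TEXT NOT NULL,
    type TEXT NOT NULL,
    primary_name_id INTEGER  -- filled in once the name row exists
);
CREATE TABLE entity_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id INTEGER NOT NULL REFERENCES entities(id),
    name TEXT NOT NULL,
    type TEXT NOT NULL CHECK (type IN ('primary', 'secondary'))
);
"""


def add_entity(conn, name, entity_type, wikipedia_page):
    """Insert one entity plus its primary name and link the two rows."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO entities (wikipedia_page, type) VALUES (?, ?)",
        (wikipedia_page, entity_type),
    )
    entity_id = cur.lastrowid
    cur.execute(
        "INSERT INTO entity_names (entity_id, name, type) VALUES (?, ?, 'primary')",
        (entity_id, name),
    )
    name_id = cur.lastrowid
    # The two tables reference each other, so the entity is updated last.
    cur.execute(
        "UPDATE entities SET primary_name_id = ? WHERE id = ?",
        (name_id, entity_id),
    )
    conn.commit()
    return entity_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
eid = add_entity(
    conn,
    "San Francisco Symphony",
    "orchestra",
    "https://example.org/wiki/San_Francisco_Symphony",  # placeholder URL
)
```

Because each table points at the other, the entity row is inserted first with an empty primary name ID and updated once the name row has an ID. A later alias such as "SF Symphony" would be one more `entity_names` row with type "secondary".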


===================

WHAT TO EXTRACT

===================

1) List of all ballet companies

Source: <[login to view URL]>

Fields to grab:

Name = "Company Name" from the table

Type = ballet_company

Wikipedia page = page for each ballet company (example: [login to view URL])
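As a rough sketch of how the three fields above could be pulled from a list page, the snippet below takes the first link in each table row as the name and page. It rests on assumptions: that the company name is the first linked cell of each row, that the sample HTML (invented here, with made-up company names) resembles the real page, and that `https://example.org/` stands in for the real base URL. The class name `FirstLinkPerRow` is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstLinkPerRow(HTMLParser):
    """Collect (link text, absolute URL) for the first link in each <tr>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []
        self._in_row = False
        self._row_done = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row, self._row_done = True, False
        elif tag == "a" and self._in_row and not self._row_done:
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.rows.append((name, urljoin(self.base_url, self._href)))
                self._row_done = True
            self._href = None
        elif tag == "tr":
            self._in_row = False


# Invented sample markup; the real list page may be structured differently.
SAMPLE = """
<table>
<tr><th>Company Name</th><th>Country</th></tr>
<tr><td><a href="/wiki/Example_Ballet">Example Ballet</a></td>
    <td><a href="/wiki/Exampleland">Exampleland</a></td></tr>
<tr><td><a href="/wiki/Sample_Dance_Theatre">Sample Dance Theatre</a></td>
    <td>Exampleland</td></tr>
</table>
"""

parser = FirstLinkPerRow("https://example.org/")
parser.feed(SAMPLE)
records = [
    {"name": name, "type": "ballet_company", "wikipedia_page": url}
    for name, url in parser.rows
]
```

Lists that are plain bullet lists rather than tables (the opera list may be one) would need a different rule, for example the first link in each list item.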

2) List of Operas

Source: <[login to view URL]>

Name = opera name from the list

Type: opera

Wikipedia page = page for each opera (example: [login to view URL])

(Below, I will only provide the type, as the other fields are self-explanatory based on the above two examples.)

3) List of Opera Companies

Source: <[login to view URL]>

Type: opera_company

4) List of Musicals:

Sources: <[login to view URL]:_A_to_L>

<[login to view URL]:_M_to_Z>

Type: musical

5) List of Orchestras:

Source: <[login to view URL]>

Type: orchestra

6) List of Improv Theater Companies

Source: <[login to view URL]>

Type: improv_theater_company

7) List of Comedians

Source: <[login to view URL]>

Type: comedian

Note: Please only extract those who are still alive (i.e. do not take someone like "Bud Abbott (1895-1974)")
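One possible way to apply the "still alive" rule is to skip any entry whose text carries a birth-death year range, as in the "Bud Abbott (1895-1974)" example. This is a heuristic sketch only: it assumes the list marks deceased people with a `(YYYY-YYYY)` range, which may not hold for every entry, and the function names and the two extra sample entries are invented for illustration.

```python
import re

# A closed year range such as "(1895-1974)"; also accepts an en dash.
LIFESPAN = re.compile(r"\(\s*\d{4}\s*[-\u2013]\s*\d{4}\s*\)")


def looks_deceased(entry):
    """True if the entry text contains a birth-death year range."""
    return LIFESPAN.search(entry) is not None


def strip_dates(entry):
    """Drop a trailing parenthetical that contains a year, e.g. '(born 1950)'."""
    return re.sub(r"\s*\([^)]*\d{4}[^)]*\)\s*$", "", entry).strip()


entries = [
    "Bud Abbott (1895-1974)",      # example from the brief
    "Example Person (born 1950)",  # invented entry
    "Another Example",             # invented entry
]
living = [strip_dates(e) for e in entries if not looks_deceased(e)]
```

Entries with no dates at all pass the filter, so a stricter run might also check each person's own page before inserting the row.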

8) List of Stand-up Comedians

Source: [[login to view URL]

][2] Type: stand_up_comedian

Note: Please only extract those who are still alive

9) List of dance companies:

Source: <[login to view URL]>

Type: dance_company

10) List of pop punk bands

Source: <[login to view URL]>

Type: pop_punk_band
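The ten lists above could be captured in one small configuration that a script loops over. The sketch below is an assumption about how such a script might be organized, not part of the brief: the `ListJob` name and its fields are hypothetical, and the source URLs are left out because they are not visible in this posting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListJob:
    label: str
    entity_type: str           # value for the "Type" field of the entities table
    source_pages: int = 1      # how many list pages the brief names
    living_only: bool = False  # apply the "still alive" rule


JOBS = [
    ListJob("ballet companies", "ballet_company"),
    ListJob("operas", "opera"),
    ListJob("opera companies", "opera_company"),
    ListJob("musicals", "musical", source_pages=2),  # A to L, M to Z
    ListJob("orchestras", "orchestra"),
    ListJob("improv theater companies", "improv_theater_company"),
    ListJob("comedians", "comedian", living_only=True),
    ListJob("stand-up comedians", "stand_up_comedian", living_only=True),
    ListJob("dance companies", "dance_company"),
    ListJob("pop punk bands", "pop_punk_band"),
]

living_only_types = [job.entity_type for job in JOBS if job.living_only]
total_pages = sum(job.source_pages for job in JOBS)
```

Keeping the per-list differences in data makes the later extraction work mentioned in the background a matter of adding entries rather than new code paths.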

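Skills: Java, JavaScript, MySQL, PHP, Script Install, Shell Script, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing, XML, XSLT

Project ID: #3191040

About the project

28 proposals. Remote project. Active Apr 13, 2011.

28 freelancers are bidding on this job, with an average bid of $177/hour.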

repmovsd: See private message. $382.5 USD in 5 days (144 reviews) 7.0

samirkumardas: See private message. $297.5 USD in 5 days (241 reviews) 7.0

sktn: See private message. $143.65 USD in 5 days (262 reviews) 7.1

pbradaric: See private message. $85 USD in 5 days (28 reviews) 6.1

mastirlaa: See private message. $85 USD in 5 days (76 reviews) 6.1

novepi: See private message. $212.5 USD in 5 days (42 reviews) 5.9

Bitquark: See private message. $170 USD in 5 days (44 reviews) 5.9

tomkusvw: See private message. $85 USD in 5 days (62 reviews) 5.7

webspiderinc: See private message. $85 USD in 5 days (53 reviews) 5.5

topleaseu: See private message. $212.5 USD in 5 days (24 reviews) 5.3

oasis21: See private message. $127.5 USD in 5 days (35 reviews) 4.9

szaszalexmcpd: See private message. $85 USD in 5 days (55 reviews) 4.4

lenzai: See private message. $340 USD in 5 days (16 reviews) 4.2

ragastens: See private message. $110.5 USD in 5 days (37 reviews) 4.4

cwaldbieser: See private message. $297.5 USD in 5 days (10 reviews) 4.3

powzak: See private message. $85 USD in 5 days (25 reviews) 4.1

MrRain: See private message. $85 USD in 5 days (13 reviews) 3.8

rased108: See private message. $85 USD in 5 days (29 reviews) 4.6

Archit88: See private message. $136 USD in 5 days (14 reviews) 3.3

ifailed: See private message. $85 USD in 5 days (8 reviews) 2.4