Title: Multithreaded External Website Source Parser (Regular Expressions)
Project Type: Programmed Software, Windows (x64) Application (Including Source Code)
Language: C#.NET, C, or VB.NET (Visual Basic 2010), or another programming language that produces an executable application
Interface: Screenshots of the interface design are attached to the Project Resources; the finished interface should look similar to them.
Budget: $150 total, with one proposed 'milestone' payment of $75 for a completed single-threaded version that saves only the first 100 results of each type of information collected and uses only a single regular expression input file. The first milestone payment will be made after the demonstration application is reviewed. Even this demonstration application must meet certain speed standards, and the programmer must know how speed and accuracy will increase or decrease in the full multithreaded version, given an approximately 25 Mbps internet connection and a PC with a PassMark score of about 5000 running Windows x64 with 8GB of RAM.
Project Summary: This project is a program that parses the source code of external websites. The user will import a list of regular expressions, one per line, from a file named [login to view URL], in the following format:
beginning text##!##ending text
Any text in a page's source code that lies between the 'beginning text' and the 'ending text' (the two parts separated by '##!##' in the input file) will be appended to the file [login to view URL].
The application will also read a second file, named [login to view URL], containing a list of URLs (one per line).
If the regular expression matches more than once within the same page's source code, each match will be appended to the output file ([login to view URL]).
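To illustrate the rule format above, here is a minimal sketch in Python (chosen only for brevity; it is not one of the preferred delivery languages). It assumes that both halves of a rule are literal text rather than regex syntax, and the names `compile_rule` and `extract_all` are hypothetical.

```python
import re

SEPARATOR = "##!##"  # separator shown in the rule format above


def compile_rule(line):
    """Turn 'beginning text##!##ending text' into a compiled pattern."""
    begin, sep, end = line.rstrip("\r\n").partition(SEPARATOR)
    if not sep or not begin or not end:
        raise ValueError(f"malformed rule: {line!r}")
    # Assumption: both halves are literal text, so they are escaped.
    # The non-greedy group stops at the first 'ending text' after each start.
    return re.compile(re.escape(begin) + r"(.*?)" + re.escape(end), re.DOTALL)


def extract_all(rule, page_source):
    """Return every piece of text found between the two markers."""
    return rule.findall(page_source)
```

For example, the rule `<title>##!##</title>` applied to a page containing two title elements would yield two results, one per match. If the rule halves are meant to be real regular expressions, the `re.escape` calls would be dropped.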
Getting Started: This project is best suited to someone who has already developed an application like this, in part or in whole, though it should be quite straightforward for anyone familiar with website scraping, data mining, crawling, and similar work.
Speed, accuracy, and scalability: This software will be run on an approximately 25 Mbps (megabit) internet connection and a CPU with a PassMark score of about 5000, running Windows x64 with 8GB of RAM. The acceptable accuracy requirement is 95%, meaning that for a list of 100 URLs whose source code contains 100 matches for the regular expressions, at least 95 should be found and appended as 95 lines to the [login to view URL] file. The software will work with large flat files, with several million entries in [login to view URL], so it should have no issues either reading a large [login to view URL] or appending to a steadily growing [login to view URL] file.
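One way to meet the large-file and accuracy requirements above is sketched below in Python, purely as an illustration. The helper names (`iter_lines`, `append_results`, `meets_accuracy`) are hypothetical, and collapsing line breaks inside a match so that each result occupies exactly one output line is an assumption, not a stated requirement.

```python
def iter_lines(path):
    """Yield non-empty lines one at a time so the whole file never sits in memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for raw in handle:
            line = raw.strip()
            if line:
                yield line


def append_results(path, results):
    """Append one result per line and return how many were written."""
    count = 0
    with open(path, "a", encoding="utf-8") as handle:
        for item in results:
            # Assumption: whitespace runs (including newlines) are collapsed
            # so that every match takes up exactly one output line.
            handle.write(" ".join(item.split()) + "\n")
            count += 1
    return count


def meets_accuracy(found, expected, percent=95):
    """True when found/expected reaches the required percentage."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return found * 100 >= expected * percent
```

Reading lazily and opening the output in append mode keeps memory use flat regardless of how many entries the input holds or how large the output grows.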
The desired speed of the software, taking into account the 95% accuracy requirement and the internet and hardware specifications above, is approximately 1800 URLs/minute under typical web server speed conditions. The only difficulty in developing this software should be the treatment of slowly responding websites and URLs that cannot be found; how these are handled is left to your discretion.
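As a rough illustration of the speed target, 25 Mbps is about 3.125 MB per second, and 1800 URLs/minute is 30 URLs per second, which leaves roughly 104 KB of bandwidth per page on average. The Python sketch below shows one possible way to handle slow or missing sites with a per-request timeout and a pool of worker threads. The function names, the 10-second timeout, and the worker count of 32 are all assumptions chosen for illustration, not values from the project resources.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url, timeout=10.0):
    """Download a page; raises on DNS failure, HTTP error, or socket timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetcher=fetch, workers=32):
    """Fetch URLs concurrently and return (pages, failures) keyed by URL."""
    pages, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as error:  # slow, missing, or unreachable site
                failures[url] = repr(error)
    return pages, failures


def bytes_per_page_budget(megabits_per_second, urls_per_minute):
    """Average bytes available per page at the given line speed and rate."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second / (urls_per_minute / 60)
```

This sketch submits every URL up front, so for a list of several million URLs it would need to be fed in batches. The timeout here applies to each blocking socket operation rather than to the total download time, so a server that trickles data slowly could still hold a worker for longer than 10 seconds.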
Please take a moment to review the attached project resources, which contain screenshots of the recommended GUI (user interface) for the software. If you have any questions regarding the project, feel free to PM me at any time (I check messages often), and I can provide additional contact information or simply answer your inquiries there. Thank you and good luck.