/crawtext

Python Crawler for collecting domain specific web corpora

Primary LanguagePython

python crawl_trial.py will launch the crawl accroding to parameters declared in crawl_parameters.yml file or any yaml file declared as an argument of crawl_trial.py. The crawler uses the seed sites found in the list of files of a given repertory (path) as well as a query that will be used to validate new webpages (query) found during the crawling process. inlinks_min define the minimum number of citation a page should accumulate before being considered as a candidate to enter the corpus. depth parameter defines the number of steps of corpus extension performed from the initial corpus made of seeds webpages. 

required modules: urllib2,BeautifulSoup, urlparse, sqlite3, pyparsing,urllib, random, multiprocessing,lxml,socket, decruft, feedparser, pattern,warnings, chardet, yaml


TODO list:

  * better scrapping of webpages, for the moment decruft is doing great but can be enhanced (http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/)
  * automatically extract dates from extracted texts (good python solutions in english, still lacking in french)
  * feed the db with cleaner and richer information (like domain name, number of views, etc)
  * take into account the charset when crawling webpages
  * crawl updating process 
  * automatic grab google links to initiate a crawl
  * TBD grid-compliant code...
  * clean the code (debug mode, documentation, etc.)
  * write a comprehensive post_processing.py script to keep compatibility with other developments.
  * monitoring and reporting (page retrieval problems, success, distributions, etc)
  * modular architecture : include a better information extraction process.
  * avoid downloading n times the same content. (md5 comparison)
  * retry to download pages that could not be opened.
  * targeted and careful crawl of each domain (only follow hypertext links with the ~query in the url or in the linkText)