First, install the required Python modules with pip:
pip install -r requirements.txt
To parse items out of Hacker News (demo):
scrapy crawl HackerNews -o results.csv -t csv
To crawl the web from seeds (using scrapy_redis):
scrapy runspider hn_scraper/spiders/SeedsSpider.py
The spider reads seeds from seeds.txt, which has one URL per line:
https://en.wikipedia.org/wiki/Rhombicuboctahedron
https://martinfowler.com/articles/viticulture-gallerist.html
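Loading the seeds file might be sketched like this (a minimal sketch; the function name `load_seeds` and the skipping of blank lines are assumptions, not necessarily the project's actual code):

```python
def load_seeds(path="seeds.txt"):
    """Read seed URLs from a file with one URL per line.

    Blank lines and surrounding whitespace are ignored.
    """
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]
```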
If you are using your own seeds generator, you might do something like:
java -jar impl/task/lambda/target/vericite-task.jar seedslist /opt/seeds.txt
Scraped items (an intermediate look at the data) are written to export.csv:
detected_language,url,response_code,num_links,response_type,declared_language
en,https://en.wikipedia.org/wiki/Anne_Bradstreet,200,472,HTML,en
en,https://quizlet.com/37641628/american-literature-flash-cards/,200,29,HTML,en-gb
en,https://en.wikipedia.org/wiki/Donald_Woods,200,312,HTML,en
es,http://www.taringa.net/posts/info/18045919/Kaprosuchus-el-cocodrilo-jabali.html,200,137,HTML,es
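A quick way to inspect export.csv is the stdlib csv module (a sketch; the column names follow the header shown above, and the helper name `rows_by_language` is an assumption):

```python
import csv

def rows_by_language(path, lang):
    """Yield rows from export.csv whose detected_language matches lang."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["detected_language"] == lang:
                yield row
```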
Items are also written to Redis:
redis-cli
127.0.0.1:6379> lindex SeedsSpider:items 2
{
"detected_language": "en",
"url": "https://quizlet.com/37641628/american-literature-flash-cards/",
"response_code": 200,
"num_links": 29,
"response_type": "HTML",
"declared_language": "en-gb"
}
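Each entry in the SeedsSpider:items list is a JSON document, so it can be decoded with the stdlib json module once fetched (a sketch; with redis-py you would obtain the raw value via `r.lindex("SeedsSpider:items", 2)` first):

```python
import json

def decode_item(raw):
    """Decode one scraped item as stored in the SeedsSpider:items Redis list.

    `raw` is the bytes or str value returned by a Redis client such as
    redis-py's lindex(); returns a plain dict.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)
```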
To do some ML classification on the output:
python language_train.py
The language classifier outputs a classification report, a confusion matrix, and predictions for a sample of URLs:
Classification Report:
precision recall f1-score support
de 0.90 0.45 0.60 40
en 0.92 0.42 0.58 57
es 0.86 0.29 0.43 86
fr 1.00 0.43 0.60 35
no 0.92 0.39 0.55 56
ru 1.00 0.26 0.42 182
sv 0.96 0.31 0.47 153
zh-cn 0.86 1.00 0.93 2578
avg / total 0.88 0.87 0.84 3187
Confusion Matrix:
[[ 18 0 0 0 0 0 0 22]
[ 1 24 0 0 0 0 0 32]
[ 0 1 25 0 0 0 0 60]
[ 1 0 0 15 1 0 0 18]
[ 0 0 0 0 22 0 0 34]
[ 0 0 0 0 0 48 0 134]
[ 0 1 0 0 0 0 47 105]
[ 0 0 4 0 1 0 2 2571]]
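The precision and recall figures in the report can be recomputed from the confusion matrix directly (a pure-Python sketch; rows are true classes and columns are predictions, following scikit-learn's convention, and the function name is an assumption):

```python
def precision_recall(matrix, i):
    """Per-class precision/recall for class index i of a confusion matrix.

    Rows are true labels, columns are predicted labels:
    precision = TP / column sum, recall = TP / row sum.
    """
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum: all predicted as i
    actual = sum(matrix[i])                    # row sum: all truly class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```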
Predictions on a sample of URLs:
#0 predicted:[en], expected:[en], url:[https://en.wikipedia.org/wiki/Rhombicuboctahedron], result: correct
#1 predicted:[en], expected:[en], url:[https://en.wikipedia.org/wiki/Uniform_polyhedron], result: correct
#2 predicted:[en], expected:[sv], url:[https://sv.wikipedia.org/wiki/Geometri], result: incorrect
#3 predicted:[en], expected:[es], url:[https://es.wikipedia.org/wiki/Geometr%C3%ADa], result: incorrect
#4 predicted:[en], expected:[de], url:[https://de.wikipedia.org/wiki/Geometrie], result: incorrect
#5 predicted:[en], expected:[ru], url:[https://ru.wikipedia.org/wiki/%D0%93%D0%B5%D0%BE%D0%BC%D0%B5%D1%82%D1%80%D0%B8%D1%8F], result: incorrect
#6 predicted:[en], expected:[en], url:[https://twitter.com/elonmusk], result: correct
#7 predicted:[sv], expected:[sv], url:[https://twitter.com/elonmusk?lang=sv], result: correct
#8 predicted:[zh-cn], expected:[zh-cn], url:[https://twitter.com/something?lang=zh-cn], result: correct
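The lang= query parameter visible in the last few URLs is a strong signal on its own; a tiny heuristic (a sketch, not part of the classifier) could extract it with the stdlib:

```python
from urllib.parse import urlparse, parse_qs

def declared_lang_from_url(url):
    """Return the lang= query parameter of a URL, or None if absent."""
    qs = parse_qs(urlparse(url).query)
    return qs.get("lang", [None])[0]
```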
Not perfect, but with more data, configuration tweaks, and other fixes, accuracy should improve.