
stimson-web-scraper

Scrapes and crawls websites for textual data and urls in any ISO language

Getting Started on macOS

In a terminal window:

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    xcode-select --install
    brew update
    brew upgrade

    git --version
    git version 2.24.1 (Apple Git-126)

    brew install python3
    python3 --version
        Python 3.7.7

    pip3 install -U pytest
    py.test --version
    This is pytest version 5.4.1, imported from /usr/local/lib/python3.7/site-packages/pytest/__init__.py

Install Desktop tools

Download GitHub Desktop

    open https://desktop.github.com

Optionally Download PyCharm Professional

    open https://www.jetbrains.com/pycharm/download

Generating Your SSH Public Key

Reference:

    open https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/

Check that your GitHub key has been added to the ssh-agent list. Here is my ~/.ssh/config file:

    Host github.com github
        IdentityFile ~/.ssh/id_rsa
        IdentitiesOnly yes
        UseKeychain yes
        AddKeysToAgent yes

Generate the key, add it to the agent, and list the loaded keys:

    cd ~/.ssh
    ssh-keygen -o
    ssh-add -K ~/.ssh/id_rsa
    ssh-add -L
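
To confirm that GitHub accepts the key, run the standard connectivity check; it should greet you by username:

    ssh -T git@github.com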

Get the project source code

    cd ~
    git clone https://github.com/Stimson-Center/stimson-web-scraper.git

Getting started with Web Scraping

Execute the test suite to verify your environment

    cd ~/stimson-web-scraper
    ./run_tests.sh

Execute as a Python 3 executable

    cd ~/stimson-web-scraper/scraper
    ./start.sh
    ./cli.py -u https://www.yahoo.com -l en
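
The same two flags work for other sources and languages; for example, the Thai PDF used later in this README can be scraped from the command line as well:

    ./cli.py -u http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf -l th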

Use as a Python 3 package

Get an article from a website page

import datetime
from scraper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.build()

# Access data scraped from this website page

article.authors
['Leigh Ann Caldwell', 'John Honway']

article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

article.movies
['http://youtube.com/path/to/link.com', ...]

article.keywords
['New Years', 'resolution', ...]

article.summary
'The study shows that 93% of people ...'

article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
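
The attributes above can be collapsed into one helper function. This is a minimal sketch rather than part of the package: the scrape_page name is hypothetical, and it assumes Article.build() raises an ordinary exception on download or parse failures.

from scraper import Article

def scrape_page(url, language='en'):
    # Hypothetical convenience wrapper; build() downloads, parses,
    # and runs keyword/summary extraction in one call.
    article = Article(url, language=language)
    article.build()
    return {
        'title': article.title,
        'authors': article.authors,
        'publish_date': article.publish_date,
        'text': article.text,
        'keywords': article.keywords,
        'summary': article.summary,
    }

result = scrape_page('http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/')
print(result['title'])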

Foreign Language Websites

scraper can extract and detect languages seamlessly. If no language is specified, scraper will attempt to auto-detect one. If you are certain of an article's language, you can specify it with its two-letter ISO code.
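
For example, the fallback just described kicks in when the language argument is simply omitted (a minimal sketch of that behavior, reusing the BBC article shown further below):

from scraper import Article

# No language argument: the library attempts auto-detection as described above.
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
article = Article(url)
article.build()
print(article.title)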

To see the list of supported ISO languages:

import scraper
scraper.get_languages()
Your available languages are:
input code         full name
af			  Afrikaans
ar			  Arabic
be			  Belarusian
bg			  Bulgarian
bn			  Bengali
br			  Portuguese, Brazil
ca			  Catalan
cs			  Czech
da			  Danish
de			  German
el			  Greek
en			  English
eo			  Esperanto
es			  Spanish
et			  Estonian
eu			  Basque
fa			  Persian
fi			  Finnish
fr			  French
ga			  Irish
gl			  Galician
gu			  Gujarati
ha			  Hausa
he			  Hebrew
hi			  Hindi
hr			  Croatian
hu			  Hungarian
hy			  Armenian
id			  Indonesian
it			  Italian
ja			  Japanese
ka			  Georgian
ko			  Korean
ku			  Kurdish
la			  Latin
lt			  Lithuanian
lv			  Latvian
mk			  Macedonian
mr			  Marathi
ms			  Malay
nb			  Norwegian (Bokmål)
nl			  Dutch
no			  Norwegian
np			  Nepali
pl			  Polish
pt			  Portuguese
ro			  Romanian
ru			  Russian
sk			  Slovak
sl			  Slovenian
so			  Somali
sr			  Serbian
st			  Sotho, Southern
sv			  Swedish
sw			  Swahili
ta			  Tamil
th			  Thai
tl			  Tagalog
tr			  Turkish
uk			  Ukrainian
ur			  Urdu
vi			  Vietnamese
yo			  Yoruba
zh			  Chinese
zu			  Zulu

To import an article in a supported ISO language:

from scraper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'

article = Article(url, language='zh') # Chinese
article.build()

print(article.text[:150])

香港行政长官梁振英在各方压力下就其大宅的违章建
僭建问题到立法会接受质询并向香港民众道歉梁振英在星期二12月10日的答问大会开始之际
在其演说中道歉但强调他在违章建筑问题上没有隐瞒的
意图和动机一些亲北京阵营议员欢迎梁振英道歉且认为应能获得香港民众接受但这些议员也质问梁振英有

print(article.title)
港特首梁振英就住宅违建事件道歉

Extract text from Adobe PDF files in any ISO language

from scraper import Article
url = "http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf"
article = Article(url=url, language='th')
article.build()
print(article.text)
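
To keep the extracted text for later processing, write it out as UTF-8, which covers Thai and every other language in the list above; the output filename here is just an example:

# Persist the extracted text; UTF-8 covers Thai and the other ISO languages.
with open('tpch-egm201701.txt', 'w', encoding='utf-8') as f:
    f.write(article.text)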

Get a Wikipedia Article including embedded tables

from scraper import Article
url = "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects"
article = Article(url=url, language='en')
article.build()

print(article.text)
print(article.tables)
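
The exact shape of article.tables is not documented above, so the sketch below only inspects it generically; adapt it once you know the concrete structure in your version:

# Generic inspection of whatever table structures the scrape produced.
for i, table in enumerate(article.tables):
    print(f'--- table {i} ---')
    print(table)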

Optionally Set up a Docker environment

    brew install docker
    docker --version
    cd ~/stimson-web-scraper
    ./run_docker.sh

You will be dropped into the container:

(venv) tf-docker /app >

    ./run_tests.sh

For more details see:

Docker Tutorial

Contributing

  • Fork it
  • Create your feature branch (git checkout -b your_github_name-feature)
  • Commit your changes (git commit -am 'Added some feature')
  • Make sure to add tests for it. This is important so we don't break it in a future version unintentionally.
  • File an Issue
  • Push to the branch (git push origin your_github_name-feature)
  • Create a new Pull Request (the full command sequence is sketched below)
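
Putting those steps together, a typical contribution pass looks like this; <your_github_name> is a placeholder for your own fork:

    git clone https://github.com/<your_github_name>/stimson-web-scraper.git
    cd stimson-web-scraper
    git checkout -b your_github_name-feature
    # ...edit code and add tests...
    git commit -am 'Added some feature'
    git push origin your_github_name-feature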