/Web-Scraper-Public-

Web scraper that extracts the numerals of all languages and saves them in a readable .csv format for later analysis.

Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0).

Numeral-Web-Scraper

Web scraper that extracts the numerals of all languages from languagesandnumbers.com and saves them in a readable .csv format for later analysis.

Requirements

  • See requirements.txt
  • Python 3.9+

Function

Scrapes the numerals of all 251 languages listed at languagesandnumbers.com. The scraped numerals are saved to a CSV file at the desired path, and the file can be opened in any editor for later analysis. A progress bar indicates how many pages are left to scrape.
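As a rough illustration of the scrape-then-save idea described above, here is a minimal sketch that uses only the Python standard library. It is not the code from web-scraper-all.py: the names NumeralTableParser and numerals_to_csv, the sample HTML, and the assumption that numerals sit in plain table rows are all hypothetical and chosen only for this example.

```python
import csv
import io
from html.parser import HTMLParser


class NumeralTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def numerals_to_csv(language, html, out):
    """Parse one page and write one CSV row per table row; returns the row count."""
    parser = NumeralTableParser()
    parser.feed(html)
    writer = csv.writer(out)
    for row in parser.rows:
        writer.writerow([language, *row])
    return len(parser.rows)


# Assumed sample markup; the real pages may be structured differently.
SAMPLE_HTML = (
    "<table><tr><td>1</td><td>eins</td></tr>"
    "<tr><td>2</td><td>zwei</td></tr></table>"
)
buffer = io.StringIO()
count = numerals_to_csv("German", SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

Writing to any file-like object keeps the sketch testable without network or disk access; a real run would pass a file opened with newline="" instead of the in-memory buffer.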

Execution

Binary

[Note: The releases are outdated. I will update them once most of the items listed under TODO are finished. For now, please build the project manually.]

  • Download the .exe-file from the releases tab. Double-click to execute.

Use/build from Source

  • Download and unzip source code or clone the repository with git clone https://github.com/mrtnbm/Web-Scraper-Public-.git
  • Install Python 3.9+, e.g. on Debian/Ubuntu: sudo apt install python3.9
  • Optionally update pip, setuptools, and wheel: python3 -m pip install --upgrade pip setuptools wheel
  • Install the requirements: pip install -r requirements.txt
  • Start the script with python3 web-scraper-all.py (on Windows: python web-scraper-all.py).

Build binary yourself

  • Execute pyinstaller -wF web-scraper-all.py.

Run tests

  • python test-web-scraper-all.py
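For orientation, a test case in the unittest style could look roughly like the sketch below. The helper clean_numeral is hypothetical and defined inline only so the example is self-contained; it is not taken from web-scraper-all.py or test-web-scraper-all.py.

```python
import unittest


def clean_numeral(text):
    """Hypothetical helper: trim a scraped numeral and collapse inner whitespace."""
    return " ".join(text.split())


class CleanNumeralTest(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(clean_numeral("  vingt  et   un "), "vingt et un")

    def test_empty_string_stays_empty(self):
        self.assertEqual(clean_numeral(""), "")


# Run the suite programmatically so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanNumeralTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```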

GUI

  • Main Window for changing settings and selecting a folder to save the csv file


  • Secondary Window for viewing the progression of the script


TODO

  • Test cases for all functions (achieve coverage >= 75%)
  • Refactor main (more separate functions, less code in main)
  • Refactor to meet OOP standards
  • Fix all code smells
  • Redirect artifact uploads so that they are deployed outside of the repository