/tagalog-dictionary-scraper

Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

Tagalog Dictionary Scraper 📒

Ating pag-ibayuhin ang ating talahuluganan! (Let us enrich our dictionary!)

Collects Tagalog words from tagalog.pinoydictionary.com, a database of Tagalog words powered by Cyberspace.ph Web Hosting, using web scraping and web crawling techniques.

24,868 words (as of Oct 20, 2016)


contributions welcome

How is it done? 💪

Each web page is loaded and parsed, and the words enclosed in <dt> tags are extracted.
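As a rough illustration of the <dt> extraction step, here is a minimal sketch using beautifulsoup4. The function name `extract_words` and the sample markup are made up for this example; the real page layout and the actual code in collect_tagalog.py may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a listing page. This markup is invented for the
# example and is not copied from tagalog.pinoydictionary.com.
SAMPLE_HTML = """
<dl>
  <dt><a href="#">aba</a></dt>
  <dd>(definition)</dd>
  <dt><a href="#">abaka</a></dt>
  <dd>(definition)</dd>
</dl>
"""


def extract_words(html):
    """Return the text of every <dt> element, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text(strip=True) drops nested tags and surrounding whitespace.
    return [dt.get_text(strip=True) for dt in soup.find_all("dt")]


print(extract_words(SAMPLE_HTML))
```

The built-in "html.parser" backend is used here so that no extra parser package is needed beyond beautifulsoup4.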

Included is an HTML snippet from tagalog.pinoydictionary.com containing the source of http://tagalog.pinoydictionary.com/list/a/. It serves as a guide to, and an overview of, how dictionary words are extracted from the page.

Disclaimer: I do not own the HTML code cited above; it is owned by tagalog.pinoydictionary.com.

How did the project start? 💭

It was originally intended as a Tagalog dictionary database for Scrabble®, but it can be put to other uses as well.

Tools ✏️

  python -m pip install -U pip beautifulsoup4

Notes 📌

  • tagalog_dict.txt is where the scraper collect_tagalog.py puts the collected words.
  • The output file tagalog_dict.txt will be updated from time to time to keep the collection up to date. 📅
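For readers curious how collected words might end up in tagalog_dict.txt, the following is a small sketch under stated assumptions: one word per line, duplicates removed, sorted alphabetically. The helper `save_words` is hypothetical and is not taken from collect_tagalog.py, whose actual output format may differ.

```python
def save_words(words, path="tagalog_dict.txt"):
    """Write unique, non-empty words to path, one per line, sorted.

    Returns the number of words written. The file layout is an assumption
    made for this example.
    """
    # Strip whitespace, drop blanks, and de-duplicate with a set.
    unique = sorted({w.strip() for w in words if w.strip()})
    with open(path, "w", encoding="utf-8") as f:
        for word in unique:
            f.write(word + "\n")
    return len(unique)
```

Calling `save_words(["bahay", "aba", "bahay"])` would write two lines, `aba` and `bahay`, to tagalog_dict.txt in the current directory.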

License

GNU General Public License 3.0