/cognates

Scrapes Wiktionary to find cognates

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

This project scrapes Wiktionary to find cognates between different languages. The main program, cognate_finder.py, finds cognates between any two languages, provided Wiktionary has category pages for terms in each of those languages derived from a common ancestor.
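To illustrate the idea of matching terms through a common ancestor, here is a minimal sketch. It is not the code in cognate_finder.py: it assumes the scraping step has already produced, for each language, a dictionary mapping each term to its ancestor form. The function name `pair_cognates` and the small sample dictionaries are hypothetical and only for illustration.

```python
from collections import defaultdict


def pair_cognates(terms_a, terms_b):
    """Pair terms from two languages that share the same ancestor form.

    Each argument maps a term to its ancestor form. This input shape is an
    assumption made for the sketch, not the format cognate_finder.py uses.
    """
    # Index the second language by ancestor form for fast lookup.
    by_ancestor = defaultdict(list)
    for term, ancestor in terms_b.items():
        by_ancestor[ancestor].append(term)

    # Every term in the first language pairs with each term in the second
    # language that descends from the same ancestor form.
    pairs = []
    for term, ancestor in terms_a.items():
        for other in by_ancestor.get(ancestor, []):
            pairs.append((term, other, ancestor))
    return sorted(pairs)


# Illustrative sample data (Proto-Indo-European ancestor forms).
persian = {"mādar": "*méh₂tēr", "pedar": "*ph₂tḗr", "barādar": "*bʰréh₂tēr"}
english = {"mother": "*méh₂tēr", "father": "*ph₂tḗr", "water": "*wódr̥"}

cognates = pair_cognates(persian, english)
for persian_term, english_term, ancestor in cognates:
    print(f"{persian_term} ~ {english_term} (from {ancestor})")
```

With this sample data, only the terms that share an ancestor form in both dictionaries are paired; "barādar" and "water" have no partner and are left out.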

I tested 15 language pairs. For each pair, cognate_finder_results.csv shows how many cognate pairs were found and how long the program took to run. Run time ranged from 1 to 46 minutes, depending largely on how well documented the chosen languages were on English Wiktionary.

cognate_finder.py was built from my earlier program, persian_english_cognates.py, which does the same thing but only for Persian and English.

The raw, uncleaned output from both programs is in the example_results folder.

wiktionary_derived_terms_categories.csv lists Wiktionary category pages for terms in one language derived from another. The Python script I wrote to scrape that data is not currently in this repository.
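For reference, English Wiktionary names these category pages with the pattern "X terms derived from Y". The sketch below builds a category title and URL from that pattern. The helper names `derived_terms_category` and `category_url` are hypothetical and are not taken from the scripts in this repository.

```python
from urllib.parse import quote

WIKTIONARY_BASE = "https://en.wiktionary.org/wiki/"


def derived_terms_category(language, ancestor):
    """Return the category title for terms in `language` derived from `ancestor`."""
    return f"Category:{language} terms derived from {ancestor}"


def category_url(language, ancestor):
    """Return the English Wiktionary URL for that category page."""
    # Wiki URLs use underscores in place of spaces in page titles.
    title = derived_terms_category(language, ancestor).replace(" ", "_")
    # Percent-encode anything else that is not URL-safe, keeping the colon.
    return WIKTIONARY_BASE + quote(title, safe=":_-")


print(category_url("Persian", "Proto-Indo-European"))
```

Whether a given category page actually exists depends on Wiktionary's coverage of that language pair, so a real scraper would still need to check the response.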