Identify Māori words in text, and score text by the proportion of words that are in Te Reo.
All scripts are run with python3. To install the package, run python3 setup.py install
from the working directory. To run all scripts you need to first install yelp_uri
and beautifulsoup4
.
kupu_tūtira
takes any textfile and returns a textfile containing a list of the words that it considers to be Māori, as well as hyphenated words. It is passed two filenames in the terminal, the first file containing the corpus, the second being the output file. It also determines how much of the text is Māori, and prints this percentage to the terminal. Use the following format to execute this script.
python3 kupu_tūtira -i input_file.txt -o output_file.txt
hihira_raupapa
takes any textfile and returns two textfiles. One contains a list of words that are considered to be Māori, and another contains a list of words that are of Māori form (consonant-vowel format, no doubling of consonants, always ends in a vowel, Māori alphabet), but are not considered to be Māori words. It checks them all against http://maoridictionary.co.nz . Use the following format to execute this script.
python3 hihira_raupapa -i input_file.txt -g output_file1.txt -b output_file2.txt
auaha_tūtira_tū
creates the stoplists used in thekōmiri_kupu
function (which is called in thekupu_tūtira
script) from a user-provided English corpus. However I have provided such lists in a subfolder referenced throughout the code calledtaumahi_tūtira
, created from Google's provided list of 20,000 most commonly used English words. You may recreate these lists from the same corpus or make your own by executing the script thusly.
python3 auaha_tūtira_tū -i input_file
The directory examples/gazette
uses the kōmiri_kupu
function in the ngā-kupu
library to identify Māori placenames in the New Zealand Gazetteer of placenames. In the map below, red and blue markers indicate Māori and non-Māori placenames, respectively.