Analysis of Japanese Loanwords

General Info

Author: Lindsey Rojtas
Project Title: Analysis of Japanese Loanwords
My Background: I am an undergraduate computer science major and linguistics minor at the University of Pittsburgh. This repository holds my term project for LING1340 (Data Science for Linguists), which examines the use of Japanese loanwords borrowed from English. My Japanese reading ability is very limited, but I can read katakana characters, so I chose to work with loanwords!

About the Project

Katakana is a syllabic Japanese script used for writing onomatopoeia and loanwords, usually of Western origin. A rough English analogue is the convention of writing borrowed words in italics (e.g. "getting a sense of déjà vu"). Gairaigo is the Japanese word for "loanword" or "borrowed word". Click here to learn more about the differences between Katakana and the other two Japanese scripts, Hiragana and Kanji.

My project looks at two aspects of katakana use: the length of a katakana word versus its conversational frequency, and the ratio of katakana words to the total length of a speaker's utterances versus that speaker's age. I wanted to see whether word length and usage are correlated, and whether a speaker's age and the number of loanwords they use are correlated.
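To make the second measure concrete, here is a minimal sketch of one way a katakana ratio could be computed for a string of utterances. It is an illustration, not the code in the project notebooks: the function names `find_katakana_words` and `katakana_ratio` are hypothetical, a "katakana word" is assumed to be any unbroken run of katakana characters, and "total length" is assumed to mean the number of non-whitespace characters.

```python
import re

# Assumption: a katakana "word" is a run that starts with a katakana letter
# (U+30A1..U+30FA) and continues with katakana letters or the long-vowel
# mark ー. Requiring a leading letter avoids counting a stray ー that
# follows hiragana (as in すごーい) as a word.
KATAKANA_RUN = re.compile(r"[ァ-ヺ][ァ-ヺー]*")


def find_katakana_words(text):
    """Return every katakana run found in text, in order of appearance."""
    return KATAKANA_RUN.findall(text)


def katakana_ratio(text):
    """Number of katakana runs divided by the count of non-whitespace characters."""
    total_chars = sum(1 for ch in text if not ch.isspace())
    if total_chars == 0:
        return 0.0
    return len(find_katakana_words(text)) / total_chars


# Toy utterance written for this example (not taken from either corpus).
utterance = "コーヒーとケーキをたのむ"
print(find_katakana_words(utterance))
print(katakana_ratio(utterance))
```

A regular expression is used here only because it needs no extra dependencies. A morphological tokenizer would split words more accurately, for example where two loanwords appear back to back.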

My Found Data

Two corpora were used in this project:

The Nagoya Conversation Corpus is a corpus of 129 unstructured conversations among participants from a range of age groups. Information about the age groups can be found here. This corpus is licensed under Creative Commons BY-NC-ND 4.0. A sample of this dataset, a .txt file containing the first conversation, is available in samples.
The Balanced Corpus of Contemporary Written Japanese's Word List is a list of words and their frequencies, along with other data that is not relevant to this project. The katakana-only list that I'm using is derived from files filtered by Reddit user Alphyn, who has given me permission to use them. I am able to use this corpus for academic purposes. BCCWJ's website does not specify republishing terms for this data, so a sample containing the first 100 entries of the .csv file that Alphyn provided will be put up in samples under Fair Use.
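As a rough illustration of how a word list like this could feed the length-versus-frequency question, the sketch below keeps only katakana-only entries from a CSV and computes a Pearson correlation between word length and frequency. Everything specific here is an assumption: the column names `word` and `frequency`, the helper names `load_katakana_rows` and `pearson`, and the toy rows, which are invented values and not BCCWJ data. The choice of Pearson correlation is also only one option and is not confirmed as the project's method.

```python
import csv
import io
import math
import re
import unicodedata

# Matches strings made up entirely of katakana letters and the long-vowel mark.
KATAKANA_ONLY = re.compile(r"[ァ-ヺー]+")


def load_katakana_rows(csv_file, word_col="word", freq_col="frequency"):
    """Read (word, frequency) pairs, keeping only katakana-only words.

    The column names are hypothetical; adjust them to the real file.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        # NFC normalization keeps voiced kana such as ビ as single characters.
        word = unicodedata.normalize("NFC", row[word_col].strip())
        if KATAKANA_ONLY.fullmatch(word):
            rows.append((word, int(row[freq_col])))
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy data, for illustration only.
toy_csv = io.StringIO(
    "word,frequency\n"
    "テレビ,900\n"
    "コンピューター,300\n"
    "食べる,1200\n"
    "パン,1500\n"
)

rows = load_katakana_rows(toy_csv)
lengths = [len(word) for word, _ in rows]
freqs = [freq for _, freq in rows]
r = pearson(lengths, freqs)
print(rows)
print(r)
```

On the real file, `open(path, encoding="utf-8", newline="")` would replace the `io.StringIO` object. Because word frequencies are typically heavily skewed, a rank-based correlation may be worth comparing against the Pearson value.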
