HTRC TORCHLITE Hackathon group working on Part of Speech

TORCHLITE_PoS

In this project, we extract part-of-speech (POS) counts using the HTRC Extracted Features API and visualize the proportion of each tag at the page level across a volume. To make the results easier to interpret, we group the POS tags into five larger categories for each language: Verbs, Nouns, AdjectivesAdverbs, Pronouns, and Other. We also account for differences in the POS tag sets used for different languages, including Chinese, Spanish, German, French, and Arabic. Our hypothesis was that, in fiction, the beginning of a volume would contain more nouns, since that is where agents and objects are introduced. However, after comparing several volumes, we found that the percentage of each category often does not vary significantly over the course of a single volume. We also found that, unsurprisingly, non-fiction volumes tend to have a larger percentage of nouns. The project lets users enter the HTID of any volume they choose and compare different volumes. Overall, it can be used to gain insight into how POS distributions vary across volumes and across languages.
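The grouping and page-level proportion step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the prefix-to-category mapping assumes Penn Treebank style English tags, the function names are hypothetical, and each page is assumed to be already available as a dictionary of tag counts rather than fetched from the Extracted Features API.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to the five categories.
# The project's real mapping (and its per-language variants) may differ.
CATEGORY_BY_PREFIX = [
    ("VB", "Verbs"),
    ("NN", "Nouns"),
    ("JJ", "AdjectivesAdverbs"),
    ("RB", "AdjectivesAdverbs"),
    ("PRP", "Pronouns"),
    ("WP", "Pronouns"),
]
CATEGORIES = ["Verbs", "Nouns", "AdjectivesAdverbs", "Pronouns", "Other"]


def categorize(tag):
    """Return the category for a POS tag, falling back to 'Other'."""
    for prefix, category in CATEGORY_BY_PREFIX:
        if tag.startswith(prefix):
            return category
    return "Other"


def page_proportions(tag_counts):
    """Turn one page's {tag: count} dict into {category: proportion}."""
    totals = Counter()
    for tag, count in tag_counts.items():
        totals[categorize(tag)] += count
    n = sum(totals.values())
    # Pages with no tokens get 0.0 for every category.
    return {c: (totals[c] / n if n else 0.0) for c in CATEGORIES}


def volume_proportions(pages):
    """Apply page_proportions to every page of a volume, in order."""
    return [page_proportions(page) for page in pages]


# Made-up tag counts for three pages (the last page is empty).
example_pages = [
    {"NN": 6, "VBD": 2, "JJ": 1, "PRP": 1},
    {"NNS": 2, "VB": 2, "DT": 4},
    {},
]
for number, proportions in enumerate(volume_proportions(example_pages), start=1):
    print(number, proportions)
```

Supporting another language under this sketch would mean swapping in a different prefix table for that language's tag set while keeping the same five output categories, so that volumes stay comparable.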

Team Members: Matthew Butler, Gyuri Kang, Glen Layne-Worthey, Savannah Scott, Peizhen Wu, Xuhan Zhang, Haiqi Zhou