/TextAnalysisPython

A Python repository featuring a text analysis tool that generates and counts tokens and n-grams, and a CoNLL file reader for linguistic data processing.


Python Text Analysis and CoNLL File Reader


This repository contains two separate Python scripts. The first performs text analysis: token generation, n-gram generation, token counting, and frequency analysis. The second defines a class for reading CoNLL files. CoNLL, named after the Conference on Computational Natural Language Learning, is a common format for storing linguistic data and annotations.

Ngram Counter

This script reads a text file, generates tokens (words) and n-grams (contiguous sequences of n items from the text), counts the frequency of each token or n-gram, and prints the top k most frequent ones. When several tokens or n-grams share the same frequency at the cutoff, the tied entries are printed as well.
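The steps described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names `tokenize` and `n_grams` are hypothetical, and the tokenization rule (lowercasing and splitting on non-word characters) is an assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (assumed tokenization rule)."""
    return re.findall(r"\w+", text.lower())


def n_grams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("the cat sat on the mat the cat slept")

# Counter maps each token or n-gram to the number of times it occurs.
token_counts = Counter(tokens)
bigram_counts = Counter(n_grams(tokens, 2))

print(token_counts.most_common(2))   # the two most frequent tokens
print(bigram_counts.most_common(1))  # the most frequent bigram
```

Representing n-grams as tuples keeps them hashable, so they can be counted with the same `Counter` used for single tokens.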

Usage

You can use the print_top_k_ex_aequo_most_frequent_tokens(filename, k) function to print the top k most frequent tokens from a file, including ties. Similarly, you can use the print_top_k_ex_aequo_most_frequent_n_grams(filename, n, k) function to print the top k most frequent n-grams from a file, including ties.
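One plausible reading of "including ties" is that every item whose frequency equals that of the k-th ranked item is also reported, so the output can contain more than k entries. The sketch below illustrates that reading only; `top_k_with_ties` is a hypothetical helper, not a function from this repository, and the script's actual tie rule may differ.

```python
from collections import Counter


def top_k_with_ties(counts, k):
    """Return the k most frequent items plus any items tied with the k-th one."""
    # Sort by descending frequency, then by item text for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    if k <= 0 or not ranked:
        return []
    if k >= len(ranked):
        return ranked
    cutoff = ranked[k - 1][1]  # frequency of the k-th ranked item
    return [(item, freq) for item, freq in ranked if freq >= cutoff]


counts = Counter("a b c a b d a e".split())
for item, freq in top_k_with_ties(counts, 2):
    print(item, freq)
```

With the sample counts, asking for the top 3 returns all five items, because `c`, `d`, and `e` are tied at the cutoff frequency.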

CoNLL File Reader

The open_conll class opens a CoNLL file and allows you to iterate over the tokens in the file. It is designed to be used with Python's with statement, which allows for clean resource management.

Usage

You can use the open_conll class to read a CoNLL file as follows:

with open_conll("filename.conll") as infile:
    for token in infile:
        print(token)

This prints every token in the CoNLL file, one per iteration.
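For readers curious how a class can support both the with statement and iteration, here is a minimal sketch. It is not the repository's open_conll implementation: `ConllReader` is a hypothetical name, and the parsing rules (tab-separated columns, blank lines between sentences, comment lines starting with `#`) are assumptions based on common CoNLL variants.

```python
import os
import tempfile


class ConllReader:
    """Hypothetical context manager that yields one token (row) at a time."""

    def __init__(self, filename, encoding="utf-8"):
        self.filename = filename
        self.encoding = encoding
        self._file = None

    def __enter__(self):
        # Open the file when the with block starts.
        self._file = open(self.filename, encoding=self.encoding)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the file even if the with block raised an exception.
        self._file.close()
        return False  # do not suppress exceptions

    def __iter__(self):
        for line in self._file:
            line = line.strip()
            # Skip sentence separators (blank lines) and comment lines.
            if not line or line.startswith("#"):
                continue
            yield line.split("\t")  # one list of column values per token


# Small demonstration on a temporary file with made-up content.
sample = "# sent_id = 1\n1\tThe\tDET\n2\tcat\tNOUN\n\n1\tHi\tINTJ\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".conll", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    path = tmp.name

with ConllReader(path) as infile:
    rows = list(infile)

os.remove(path)
print(rows)
```

Closing the file in `__exit__` is what provides the clean resource management mentioned above: the file is released whether the loop finishes normally or raises an error.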