This repository contains two separate Python scripts. The first performs text analysis: token generation, n-gram generation, token counting, and frequency analysis. The second defines a class for reading files in the CoNLL format (named after the Conference on Computational Natural Language Learning), a common format for storing linguistic data and annotations.
The text analysis script reads a text file, generates tokens (words) and n-grams (contiguous sequences of n items from the text), counts the frequency of each token or n-gram, and prints the top k most frequent tokens or n-grams. It also handles ties in frequency: tied tokens or n-grams are included in the printed top k.
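The steps described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names tokenize and n_grams are hypothetical, and the tokenization rule (lowercasing, splitting on non-word characters) is an assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (assumed tokenization rule)."""
    return re.findall(r"\w+", text.lower())


def n_grams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("the cat sat on the mat and the cat slept")

# Counter maps each token (or n-gram tuple) to its frequency.
token_counts = Counter(tokens)
bigram_counts = Counter(n_grams(tokens, 2))

print(token_counts["the"])             # frequency of a single token
print(bigram_counts[("the", "cat")])   # frequency of a bigram
```

Representing n-grams as tuples keeps them hashable, so the same Counter logic works for tokens and n-grams alike.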
You can use the print_top_k_ex_aequo_most_frequent_tokens(filename, k) function to print the top k most frequent tokens from a file, including ties. Similarly, you can use the print_top_k_ex_aequo_most_frequent_n_grams(filename, n, k) function to print the top k most frequent n-grams from a file, including ties.
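One plausible reading of "including ties" is that every item whose frequency equals that of the k-th ranked item is also printed. The sketch below illustrates that reading only. The helper top_k_ex_aequo is hypothetical, and the repository's functions may resolve ties differently.

```python
from collections import Counter


def top_k_ex_aequo(counts, k):
    """Return the k most frequent (item, count) pairs, plus items tied with the k-th."""
    # Sort by descending count, then by item so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    if k <= 0 or not ranked:
        return []
    if k >= len(ranked):
        return ranked
    cutoff = ranked[k - 1][1]  # frequency of the k-th ranked item
    return [pair for pair in ranked if pair[1] >= cutoff]


counts = Counter({"the": 3, "cat": 2, "mat": 2, "sat": 1})
for token, count in top_k_ex_aequo(counts, 2):
    print(token, count)
```

With k = 2, the result here has three entries, because "cat" and "mat" share the frequency at the cutoff.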
The open_conll class opens a CoNLL file and allows you to iterate over the tokens in the file. It is designed to be used with Python's with statement, which allows for clean resource management.
You can use the open_conll class to read a CoNLL file as follows:
    with open_conll("filename.conll") as infile:
        for token in infile:
            print(token)
This will print all tokens in the CoNLL file.
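For readers curious how a class can support both the with statement and iteration, here is a minimal sketch. It is not the repository's implementation: the name open_conll_sketch is hypothetical, and the file layout is assumed (one token per line, tab-separated columns, blank lines between sentences, comment lines starting with "#").

```python
class open_conll_sketch:
    """Context manager that iterates over the token lines of a CoNLL-style file."""

    def __init__(self, filename, encoding="utf-8"):
        self.filename = filename
        self.encoding = encoding
        self._file = None

    def __enter__(self):
        # Called on entering the with block: open the file and expose the reader.
        self._file = open(self.filename, encoding=self.encoding)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on leaving the with block, even after an exception.
        self._file.close()
        return False  # do not suppress exceptions

    def __iter__(self):
        for line in self._file:
            line = line.rstrip("\n")
            # Skip sentence separators (blank lines) and comment lines (assumed layout).
            if not line.strip() or line.startswith("#"):
                continue
            yield line.split("\t")  # one list of column values per token
```

Because __exit__ always closes the file, the handle is released even if the loop body raises an exception.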