This Project is for course: INF 1340 - Programming for Data Science at University of Toronto (Faculty of Information), supervised by Dr. Maher Elshakankiri

Project: Text Analyzer

Name: Saad Umar

Course: INF 1340 - Programming for Data Science

Lec0101 - Monday - 3pm to 6pm

Instructor: Dr. Maher Elshakankiri

Program Explanation

This project involves creating a text analyzer using python programming to analyze text files and gain insights to the file. The program consists of 13 functions plus the main, two of them are to read and write the input and output files, while the remaining 11 functions perform different types of analysis on the input text file. This program will prompt the user to enter the name of the input file which they wish to analyze (ensure that the input file is in the same directory as the Midterm.py file). The user will then be asked to write the name of the output file to which the results will be displayed. The results of this program will be displayed on the screen, as well as be saved in an output file. Additionally, the output of the count_char functions will be saved in a csv file as well as in the output file and displayed on the screen.

Some of the features of this program are:

Count the number of words in the file
Count the number of sentences in the file (delimiters set as period and question mark)
The number of times each word occurs (frequency) in a document
The frequency of a specific word, which the user will input
Top 10 most common words in the file
TF = Term frequency of each word in the document. Term frequency is the ratio of the number of times the word appears in a document compared to the total number of words in that document. It is an indicator of the term's importance in a given text.
The average number of sentences in a file
The longest word in the input file
The length of the longest word in the input file
The shortest word in the input file
The frequency of each character in the input file

Installation

pip3 install collections

pip3 install scikit-learn

pip3 install pandas

pip3 install nltk

nltk.download('punkt')

nltk.download('stopwords')

Any additional library that is used in this program and needs to be installed on a local machine, can be installed using the format:

pip3 install <libraryName>

Note: This program will run best on python3. If this program is being run on python2, you might run into syntax errors, and might get some potential fomatting issues.

Usage

Ensure you have python3 installed on your local machine
Open any IDE of your choice
Ensure all necessary libraries are installed (refer to the Installation section above)
Place your input text file in the same directory as the program
To run the program, enter the following command:

python3 Midterm.py

After running the program, do the following when prompted:

Enter the input text file name
Enter the output file name
Enter the word whose frequency you want to know (number of times the specified word occurs in the document)

Note: In case of any unexpected errors/or having trouble running the program, please feel free to reach out to me at: s.umar@mail.utoronto.ca