The following Java program processes and analyzes the contents of a literary text stored in a text file.
We first read the contents of the alice29.txt file into a String variable. I used a BufferedReader and a StringBuilder to read the file line by line, appending each line to the String:
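A minimal sketch of that reading step might look like this (the space appended between lines is my assumption, to keep the last word of one line from fusing with the first word of the next):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadAlice {
    // Read the whole file into a single String, one line at a time,
    // appending a space after each line so words stay separated.
    static String readFile(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String alice = readFile("alice29.txt");
        System.out.println("Read " + alice.length() + " characters");
    }
}
```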
From there, we make a series of pre-processing adjustments to prepare the text for analysis. First, the program scans for punctuation characters (commas, quotation marks, etc.) and removes them entirely.
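Since the exact list of punctuation characters isn't shown here, a hedged sketch using a regular expression that strips every character except letters, digits, and whitespace could look like:

```java
public class StripPunct {
    // Remove punctuation by keeping only letters, digits, and whitespace.
    // (The actual program may check a specific list of characters instead.)
    static String stripPunctuation(String text) {
        return text.replaceAll("[^a-zA-Z0-9\\s]", "");
    }

    public static void main(String[] args) {
        // prints: Oh dear Oh dear said Alice
        System.out.println(stripPunctuation("\"Oh dear! Oh dear!\" said Alice."));
    }
}
```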
Next, we tokenize the text into individual words and add them to an ArrayList; because later steps work with arrays, I then converted this ArrayList to a String array:
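The tokenization step might be sketched as follows (lowercasing each token is my assumption, to make later frequency counts case-insensitive):

```java
import java.util.ArrayList;
import java.util.Arrays;

public class Tokenize {
    // Split on runs of whitespace into an ArrayList, then convert to String[].
    static String[] tokenize(String text) {
        ArrayList<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Alice was beginning to get tired")));
    }
}
```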
Before we can remove the stop words, we of course need to obtain them from a stopwords.txt file and tokenize them into a String array as well. This process is mostly the same as for Alice, except for the method used to read the file into a String (the BufferedReader approach did not work here):
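As one possible alternative to BufferedReader, a Scanner can read the stop words token by token; this is a sketch of that idea, not necessarily the method actually used:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class ReadStopwords {
    // Read stopwords.txt with a Scanner, one whitespace-delimited token
    // at a time, and collect the tokens into a String[].
    static String[] readStopwords(String path) throws FileNotFoundException {
        ArrayList<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(new File(path))) {
            while (sc.hasNext()) {
                words.add(sc.next().toLowerCase());
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) throws FileNotFoundException {
        String[] stopwords = readStopwords("stopwords.txt");
        System.out.println(stopwords.length + " stopwords loaded");
    }
}
```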
Next, we remove these stop words from Alice. We do this by storing the array of stop words in a HashSet of Strings, then checking, word by word, whether each word in Alice is NOT in the stop-word set. If so, we add it to a new data structure, an ArrayList of Strings (which is then written to its own text file).
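The HashSet-based filtering described above might be sketched as:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;

public class RemoveStopwords {
    // Keep only the words that are NOT in the stop-word set.
    // HashSet gives O(1) average-case membership checks.
    static ArrayList<String> removeStopwords(String[] words, String[] stopwords) {
        HashSet<String> stops = new HashSet<>(Arrays.asList(stopwords));
        ArrayList<String> kept = new ArrayList<>();
        for (String w : words) {
            if (!stops.contains(w)) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] words = {"alice", "was", "beginning", "to", "get", "tired"};
        String[] stops = {"was", "to", "get"};
        System.out.println(removeStopwords(words, stops)); // prints [alice, beginning, tired]
    }
}
```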
Now that we have an ArrayList of every individual word from the Alice story (including duplicates; that's important), we are ready to analyze word frequencies: we build a HashMap recording how many instances there are of each unique word (also written to its own text file).
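The frequency count can be sketched with a HashMap like this:

```java
import java.util.HashMap;
import java.util.List;

public class WordFrequency {
    // Count how many times each word appears; merge() inserts 1 for a
    // new key or adds 1 to the existing count.
    static HashMap<String, Integer> countFrequencies(List<String> words) {
        HashMap<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countFrequencies(List.of("alice", "rabbit", "alice")));
    }
}
```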
However, it is important to sort this HashMap by frequency; I chose descending order (this, too, is written to its own text file).
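A HashMap itself has no order, so one common way to "sort" it is to sort its entry list by value and copy the result into a LinkedHashMap, which preserves insertion order. In this sketch the counts in main are illustrative placeholders, not the real Alice figures:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SortByFrequency {
    // Sort entries by value, descending, and preserve that order
    // in a LinkedHashMap.
    static LinkedHashMap<String, Integer> sortDescending(Map<String, Integer> counts) {
        ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());
        LinkedHashMap<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        HashMap<String, Integer> counts = new HashMap<>(); // placeholder counts
        counts.put("queen", 68);
        counts.put("alice", 385);
        counts.put("rabbit", 43);
        System.out.println(sortDescending(counts)); // prints {alice=385, queen=68, rabbit=43}
    }
}
```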
The following is the text-file output of the sorted HashMap, listing each unique word and its frequency.
From here, we would finally like to display some statistical results about the data and how it changed, since the data (and even the data structures themselves) carry real value. For this we are interested in four questions:

1. How large was the original alice29.txt file?
2. How large was the Alice dataset after pre-processing (punctuation, spaces, stop words)?
3. How many stop words were there?
4. How many punctuation symbols were present in the original alice29.txt file?

Questions 1-3 were very simple to answer, since we merely need the length() of Strings and arrays and the size() of an ArrayList. Punctuation, however, is a trickier measurement to obtain. I used a 'brute force' method: loop through the Alice String (before pre-processing) and check each character against every punctuation character mentioned earlier. The time complexity of this is O(n) in the length of the text.
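The brute-force punctuation count might be sketched as a single O(n) pass over the raw text; the exact set of punctuation characters checked is an assumption here:

```java
public class CountPunctuation {
    // Count every punctuation character in the raw text in one pass.
    static int countPunctuation(String text) {
        String punctuation = ",.\"'!?;:()-"; // assumed set of characters checked
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (punctuation.indexOf(text.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 2 quotes + 2 exclamation marks + 1 period = 5
        System.out.println(countPunctuation("\"Oh dear! Oh dear!\" said Alice."));
    }
}
```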