README

Stars

Known bugs: None

Design details: -KdTree: My KdTree takes in just the dimensions of the points it will contain and is initialized empty. The tree maintains nodes as TreeNodes which contain NodeObjects (an interface). To add nodes to the tree, one calls the function addNodes (or addNode for a single node), passing in a list of NodeObjects to be stored in TreeNodes and added to the tree. The interface NodeObjects requires the function getPoint() which necessitates any class that implements it to maintain its "location" as a list of coordinates used to compare against other NodeObjects in the construction of the tree. In particular, I created a Star class as a representation of a single star that implements the NodeObject interface and stores all its information including name, iD, and location. -Interpreter: I created an Interpreter class which generates the REPL and parses/handles command line input. The Interpreter class is relatively extensible in that new commands can be added with little effort and without modifying in any capacity the ability of the Interpreter to handle previous commands. Since the Interpreter uses an if/else if/else statement to read commands from the command line (in the eval function), to add a new command is simply of adding another case to the statement (as well as another function to parse for the arguments related to the new command). Further, I factored out argument parsing and tokenizing into their own methods - the former so that I could use those static methods to parse input from the GUI and the latter because it provides broad functionality I'm sure I can use elsewhere. -StarsObject: This class, a representation of a stars database that contains many stars and can be queried, is the essence of this particular project. It maintains all the stars that are uploaded in a KdTree, using these stars to execute functions that correspond with the stars, neighbors, and radius commands to the command line. -GUI: For the GUI, instead of redirecting on the completion of a command, I instead just print on the same query page a new message reporting on the execution of the command. I believe this is a better user interface as the user can then immediately submit another command while still being able to view the output from the last command. Generally, my design consists of text input boxes with instructions above to allow users to input the search parameters for both the nearest neighbors and range searches.

Any runtime/space optimizations: -In maintaining the KdTree, I chose to reconstruct the kdTree each time new nodes are added because while this would greatly lower KdTree insertion efficiency, it would keep the tree balanced, greatly increasing the efficiency of search algorithms like k nearest neighbors and range search. -In implementing the k nearest neighbors search (as well as the range search) algorithm, I choose to store the "current bests" in a hashmap, with the nodes mapped to distances from the point of interest. In this way the distances don't have to be recalculated on each comparison.

Running tests: To run junit tests (located in src/test/java/edu/brown/cs/gk16/), simply run "mvn test" from the command line. To run student written system tests, run the cs32 test harness - "./cs32-test ./tests/student/stars/* " - all extra written csv files are located in data/stars/.

Building/running program: To build the program, run "mvn package" from the command line or alternatively "mvn package -D maven.test.skip=true" to omit the junit tests. To run the program with command line input and output, run "./run" to start. To run the program with the gui, run "./run --gui".

Answer to design questions:

In the Interpreter class' eval function, I would add another case of the if/else if/else statement to check for the new command. Then I would create a new function in the Interpreter class (to be called in eval) to parse for and error check the arguments to the new command. This function would also handle executing the command — if it were stars related, it would presumably call another function contained in the StarsObject class.
If I were to use my KdTree to find points close together on the Earth's surface, I wouldn't be able to use Euclidean distance to calculate the distances between points - since movement occurs only on the spherical surface. For example, the Earth has a diameter of 7,917.5 km, so if I use my KdTree to find points within 7,918 km of a point, the search would return all points, which is incorrect since the distance I would have to travel over the Earth's surface from the given point to another point could exceed that value (traveling from Seville, Spain to Auckland, New Zealand covers 19,913 km for instance).
In order to allow my KdTree to implement the Collection interface, I would have to add "implements Collection" to the class declaration, then include all of the Collection interface's required methods. These include add (essentially the same as my addNodes), addAll, clear, contains, isEmpty, etc.

Explanations for any Checkstyle errors: None known

Autocorrect

Known bugs: None

Design details: -Trie: My trie consists of a root node containing no characters and then each new combination of letters is a new node. For example, if I add the word "a" to the empty trie, it creates a new node whose parent is the root node (a parent stores its children in a hashmap that maps the new character to the child node). But then, if I add "at" to the trie, one more new node is created whose parent is the node containing "a". Each node stores the entire letter combination up until that point as a string, so the new node would store "at". -AcObject: This is basically the representation of an autocorrect object that stores all the corpuses given to it by the user and contains the text prediction algorithms. In this class, I handle the processing of corpuses — storing the frequency of unigrams, bigrams, and ngrams as each word is processed. The reason why I separated the storing of ngrams from bigrams/unigrams (as well as the ranking functions) is because if the user fails to upload a corpus with the n set to some integer greater than 2, the smart ranking should default to the regular autocorrect instead of just the unigram model. -GUI: Similar to the stars interface, the autocorrect interface consists of buttons and input text fields that allow the user to input settings selections and data. To represent the actual text prediction, I use an unordered list and clear/append to the list as words are being suggested by the backend algorithm.

Any runtime/space optimizations: -In finding all the children with the correct levenshtein edit distances (below the threshold), I use an algorithm that traverses the trie to collect the words instead of calculating individually the distance between the current word and each word in the trie. This decreases the runtime because it reduces the number of words that need to be visited (only a segment of the trie). -In ranking and sorting the suggestions, I maintained information about the ngram/bigram/unigram frequencies in hashmaps since the lookup time is constant (instead of having to iterate through some other data structure).

Running tests: To run junit tests (located in src/test/java/edu/brown/cs/gk16/), simply run "mvn test" from the command line. To run student written system tests, run the cs32 test harness - "./cs32-test ./tests/student/autocorrect/* " - all extra written csv files are located in data/autocorrect/.

Smart suggestion ranking: To rank the suggestions in smart mode, I use an n-gram ranking algorithm (like the unigram/bigram model but extended to n previous words before the final word). For example, if n = 4, the algorithm would take the 3 words before the last word + each suggestion and rank the quadrigrams based on the frequencies stored. And then if there are ties between the quadigrams, those are ranked by creating trigrams out of the 2 words before the last word + each suggestion. And it continues on until n = 0 and the unigrams are ranked (plus alphabetical rankings for ties). The user can pick the value of n (default value is 0) and change it by typing "n " or query the value by typing "n". The value for n must be set before loading in the corpus so that information about the frequencies can be stored. The ngram frequencies are stored in a hashmap that maps the value of n (size of the ngram) to a hashmap storing the frequencies for ngrams of that size (each of those hashmaps maps an ngram to a integer that details its frequency in the loaded corpus). In creating the ngrams out of the user input, I also make sure to take into account punctuation (i.e. an ngram shouldn't straddle a period, comma, semicolon, and any other break punctuation).

I thought about ranking based on led distances — which would be relatively easy to implement given that I currently have a method that returns the led-qualifying suggestions as a hashmap mapping led distances to lists of suggestions that have those led distances — but I thought the n-gram algorithm would be cooler and more challenging to implement, as well as a better predictive text model because of the patterned nature of our language (also worthy of swag points???)

This smart ranking method is better than the given bigram/unigram model because it essentially allows the system to keep track of longer patterns. For the phrase "have a good day" might occur as frequently as "good doggy" but if the user inputs "have a good d", it's far more likely that they mean to type "have a good day" rather than "have a good doggy".

Answer to design questions:

If I wanted to handle autocorrecting multiple input fields on the same page with the same corpora, I would just add another input field on the front end (and would have to create new handlers to handle the other input). I wouldn't have to change anything on the backend since the same AcObject's autocorrect function can just be called again to query the same trie. However, if I wanted to handle autocorrecting multiple input on different pages that used different corpora, I would have to instantiate new AcObject for each different corpora, as well as create new handlers for each input field on each of the different pages. In essence, each instance of AcObject is a representation of a corpora.
My trie would need as many nodes as there are new words to store the new words. The way my trie is implemented is that each new combination of a letter and the preceding letters of the word is a separate node — for example to store just the word bell, I would need four new nodes (one for each letter). However, if I then wanted to store the word belly, I would only need one more new node. In addition, each node stores the characters of the parent node plus the new character, so the nodes would store b, be, bel, bell, and belly. Thus, if a new letter was introduced into the alphabet that negated any English word, since all the all the words to be negated are already stored in the trie, each new negated word would only need one more new node. For instance, for the new word badθ, I would just need a new node that's the child of the node containing "bad" that stores the string "badθ".

Explanations for any Checkstyle errors: None known

Bacon

Known bugs: None

Design Details: -EdgeSetGraph: I use an edge set graph because of the fast (constant time) insert method and ease of implementation. However, my EdgeSetGraph maintains the vertices in a hashmap that maps the vertex number to a vertex which allows it to return a vertex in the graph given the unique vertex number. This is convenient for bacon, since I can just keep track of the vertex numbers (in the BaconObject class) and thus can get any given vertex in constant time. -Actor: The actor class is a subclass of the vertex super class which represents a node in the edge set graph. The actor class is specific to bacon as it contains a create edges method that creates the incident edges based on the project specs (co-actors with first names that begin with the same letter as the donor's last name). In this manner, I'm able to keep my code extensible — since this also allows me to factor out the dijkstra's algorithm into its own class for reuse — while also maintaining full bacon functionality. In addition, the actor class is initialized with static caches to make the process of creating edges more efficient. -Dijkstra: This class exists only to find the shortest path between two vertices in a graph and so it only has one method that does exactly that. During the dijkstra's traversal, it calls the method create edges on the vertices. This is advantageous because it allows the edges to be created during the process which saves time since not necessarily all the edges will be visited (and if if there are different specs to creating the edges, I could just create a new subclass that extends the vertex class with a different form of the create edges method) — or if I want the graph to already have the edges, I can just use the vertex super class as my graph node class and the create edges method does nothing. -BaconObject: I factor the querying of the database into a single method called query table for simplification. I also have a method that given two names checks that the first letter of the last name of the first name is the same as the first letter of the first name of the second name, which I use to create the edges. -GUI: I have an input page where the user can type in a path for the database and then two actors to find the shortest path between. The actor input boxes have drop down menus below that show autocorrect suggestions based on the data loaded in from the database. Then to implement the dynamic URLs for the actor and film pages, I add an href tag to the list items (each list item is an actor or film) to make them link to their respective page — there's one ftl file for the actors and one for the films and I just change the listed items on the page depending on the actor or film. I have an onload element on each that calls the javascript method to update those items.

NOTE: For the connect command, since the actor name is not necessarily unique, I use the first actor that the database returns with the given name.

Any runtime/space optimizations: -I minimize the number of sql queries I make by 1) using the guava loading cache structure to cache film names by id, actor names by id, and the number of actors who've acted in a film by film id and by 2) using a single join command when querying the database for co-actors who've acted in the same films as the given actor. Thus, I never make the same query to the database twice since if I've made the query before, the data would be stored locally in one of the cache structures. -Instead of creating the entire graph upon loading the database (i.e. putting in all the actors and connecting edges) I create the edges of the graph during the dijkstra traversal. This saves time since a call to the connect command won't necessarily require each actor be visited and thus won't cover its incident edges. -I implemented the vertex class so that it keeps track of its incident (outgoing) edges so that when I want the incident edges of a vertex, I won't have to search through all the edges in the graph in the edge set graph class and instead can just call the constant time getter method of the vertex.

Answer to design questions:

Right now, in order to add a new graph search algorithm, I would have to create a new class that performs that search and then create a new vertex sub class that creates edges based on new specifications. In order to improve this, I could make a more extensible create edges method that takes in parameters (maybe like the index of the initials to compare or an object to compare instead of strings or the name of the table in the database to query so I don't have to worry about the structure) so that I wouldn't have to create a whole new vertex sub class with a whole new create edges method.
In the function that handles reading in the database, I could check the file extension of the inputted file to determine what file it is (and if the file extension is nonexistent I would attempt to read it as any of the prescribed file types), then depending on what file type it is, I could have different methods that read the data (i.e. one for sqlite3 file and one for csv files) and add the actors to the graph - for csv files I would change my program to allow multiple files instead of having one file that is reset like for the database files. Then, I would factor out any code I have right now to query the database (i.e. finding co-actors etc.) and depending on what file type was given, those method(s) would query the files different. Once I've obtained the data, the data is stored the same way internally (in the edge set graph) so the rest of the program could be the same.
First, I would need to be querying the database for the film release years. Then while performing dijkstra's I would have to store the release year of the film that served as the connecting film between an actor and the "previous" actor (in the path). And when I create the edges for a given actor, I have to check that the release year of the film connecting the given actor to the next actor is after the stored release year. Alternatively, I can just add edges the same way, but also store the film release years in the edges, and then while traversing the graph just check that the release year is after the release year stored for an actor before moving along the particular edge.

Explanations for any Checkstyle errors: None known

griffink3/Database_Search

README

Stars

Autocorrect

Bacon