/Pied-Piper

A basic compression algorithm for .txt files, written in C++

Primary Language: C++

Pied-Piper

Problem Description: I have created a very simple compression program aimed specifically at .txt papers and paragraphs. The user provides the location of a .txt file they would like to compress, and the program parses the file and pulls out the three most frequent words. These words are then replaced with single-byte representation values (1, 2, 3), which reduces the overall size of the .txt file, and the program tells the user how compressed the new file is relative to the original source. The compressed file is given the extension .pra. The program also generates a separate lookup table in the same user-defined folder as the compressed file. When the user wants to decompress the file, the program is run again; this time the user selects decompress from the command-line interface, and the program replaces the single-byte representations with the full words. Unfortunately, I was not able to build a GUI for the program using SFML as I had hoped; however, I would like to try this with Qt in the future. I could also further optimize the program by selecting the longest words in addition to the most frequent.
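
The idea described above can be sketched in a few small functions. This is a minimal illustration, not the code from this repository: the names topWords, makeTable, invert, and substitute are hypothetical, words are assumed to be separated by whitespace, and ties in frequency are assumed to be broken alphabetically.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Table = std::map<std::string, std::string>;

// Return up to n of the most frequent whitespace-separated words.
// std::map iterates alphabetically and stable_sort keeps that order for ties.
std::vector<std::string> topWords(const std::string& text, std::size_t n = 3) {
    std::map<std::string, std::size_t> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) ++counts[word];

    using Entry = std::pair<std::string, std::size_t>;
    std::vector<Entry> ranked(counts.begin(), counts.end());
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const Entry& a, const Entry& b) { return a.second > b.second; });

    std::vector<std::string> result;
    for (std::size_t i = 0; i < ranked.size() && i < n; ++i)
        result.push_back(ranked[i].first);
    return result;
}

// Map each chosen word to its code: first word -> "1", second -> "2", ...
Table makeTable(const std::vector<std::string>& words) {
    Table table;
    for (std::size_t i = 0; i < words.size(); ++i)
        table[words[i]] = std::to_string(i + 1);
    return table;
}

// Swap keys and values so the same table can be used to decompress.
Table invert(const Table& table) {
    Table inverse;
    for (const auto& entry : table) inverse[entry.second] = entry.first;
    return inverse;
}

// Replace every whole word found in the table, copying whitespace unchanged.
std::string substitute(const std::string& text, const Table& table) {
    std::string out, word;
    auto flush = [&]() {
        auto it = table.find(word);
        out += (it != table.end()) ? it->second : word;
        word.clear();
    };
    for (char c : text) {
        if (std::isspace(static_cast<unsigned char>(c))) { flush(); out += c; }
        else word += c;
    }
    flush();
    return out;
}
```

Calling substitute(text, makeTable(topWords(text))) compresses, and calling substitute on the result with invert(table) restores the text. In this sketch the round trip only works when the original text does not already contain the words 1, 2, or 3.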

Program Documentation: Upon opening the parent folder of the program, the user will notice several extra .txt files and a compressed folder. These are purely for example purposes, and any folder or file on the system can be selected. The user will need to build the .cpp file using standard methods; no externally downloaded libraries are needed. Upon running the program, a command-line interface will appear. If this is the first time running the program, the user should select 1 for compress. After the selection, the program will ask the user for the full path of the folder they would like to use to store the .pra compressed files. To use the default, they may use the compressed folder. Simply copy the path to the folder (folder name included) into the console. Note that this was tested on a Windows machine, and while I believe it will run on Unix-based systems, I cannot guarantee success. After passing the compressed folder, the user will be asked for the path to the source .txt file they would like to compress (including the file name and .txt extension). The file is then compressed, and the program displays the words in the file as it goes to show progress. After compression, the initial file size is compared to the size of the two .pra files, and a compression percentage is shown to the user. The two .pra files then appear in the specified folder, and the source file is left unmodified by the process. The main menu then asks the user to make another selection. While decompression can be done immediately afterwards, it is recommended to first select 3 to close the program and then restart it. To decompress the .pra file, the user selects 2 and points to the same folder containing the .pra files as before. This time a new Decompressed.txt file is created in the parent directory of the main.cpp file. At this point, CONGRATS, you have defeated Hooli.
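
The compression percentage reported after compression could be computed along the following lines. This is a sketch under assumptions: the exact formula used by the program is not stated here, so the function name compressionPercent and the choice to count both .pra files together against the original size are illustrative only.

```cpp
#include <cstdint>

// Percentage of space saved when the compressed file and the lookup table
// together replace the original. A negative result means the output grew.
double compressionPercent(std::uintmax_t originalBytes,
                          std::uintmax_t compressedBytes,
                          std::uintmax_t mapperBytes) {
    if (originalBytes == 0) return 0.0;  // avoid dividing by zero on empty input
    const double total = static_cast<double>(compressedBytes) +
                         static_cast<double>(mapperBytes);
    return 100.0 * (1.0 - total / static_cast<double>(originalBytes));
}
```

On a C++17 compiler the byte counts could be obtained with std::filesystem::file_size. Counting the lookup table matters for small inputs, where it can outweigh the savings in the compressed file.
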
For anyone who would like to improve upon this code, some potential areas to focus on include a better GUI; I would like to have a file manager option for selecting folders and files from the system's native file manager. I would also like to make the mapper.pra file hidden in the long run. I would also like to incorporate a progress bar to indicate the compression remaining on longer files; I did discover some neat ways of doing this but would like to have a more custom option. The last major change I would like to see moving forward concerns the saving of the decompressed file: I would like it to be saved in the same user-specified folder as the .pra files, with the same name as the original source file. This would most likely involve saving the source file name in the mapper.pra file, which is being used as a metadata file of sorts. There is also an oddity I would like to address in a future version: if test.txt is edited outside of CLion, it seems to fail to close. I believe this has to do with fstream encoding handling.

Test Cases: As the program takes in .txt files and produces three others (compressed.pra, mapper.pra, and decompressed.txt), it is easier to show the results of the five tests. These can be found in the Test_Case folder. Each test%.txt file is the source, and it is named with the compression percentage it gave. From a quick look, it can be seen that compression is far better on .txt files with many repeated words, which are frankly highly unlikely. You can also see that it has the potential to fail if given files containing 1, 2, or 3, and that newline spacing is sometimes off by one. Overall, however, I feel it handles what it is supposed to very well.
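
One possible guard against the 1, 2, 3 failure mentioned above is to check the source text before compressing. This is only a sketch and not part of the program: the function name hasReservedToken is hypothetical, and it assumes the replacement values appear as standalone whitespace-separated words.

```cpp
#include <sstream>
#include <string>

// Return true if the text already contains one of the replacement values
// as a standalone word, which would be misread during decompression.
bool hasReservedToken(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (word == "1" || word == "2" || word == "3") return true;
    }
    return false;
}
```

A caller could warn the user or refuse to compress when this returns true. Numbers such as 12 or 2023 are not flagged, because they are different words under the whole-word assumption.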