Analysis of JFK documents made public by the government.
Data is available here:
There are 6684 PDF files and 17 WAV files. This is an unsupervised learning project to understand what these files contain. The first step is converting the PDFs to text. I made a first pass at the conversion, but the documents appear to need more image processing before Tesseract OCR gives good results.
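One common kind of image processing before OCR is binarization: turning a grayscale scan into pure black and white so the text stands out from the page. The sketch below is a dependency-free illustration of Otsu thresholding, not the preprocessing actually used in this project. It assumes a page has already been rendered to a grid of 0-255 grayscale values, and the function names `otsu_threshold` and `binarize` are made up for this example. In practice an image library would likely handle this step.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0-255 grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 0 (ink) and 255 (paper)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Toy 2x2 "scan": dark ink in the left column, light paper on the right.
    print(binarize([[10, 200], [12, 210]]))
```

The binarized image would then be passed to the OCR engine in place of the raw scan. Whether this helps on these particular documents is something to check empirically, since heavily stained or unevenly lit scans often need adaptive (local) thresholding instead.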
After I get (relatively) clean text, I will apply NLP and unsupervised algorithms to gain insight into these documents. First I will try to see which documents are similar and identify the common themes throughout the text.
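One simple way to measure which documents are similar is to weight terms with TF-IDF and compare documents by cosine similarity. The sketch below uses only the standard library and is an illustration under assumptions, not the method this project has settled on. The helper names (`tokenize`, `tfidf_vectors`, `cosine`) and the three sample strings are invented for the example and do not come from the released files.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in counts_per_doc:
        # Raw term frequency times a smoothed inverse document frequency.
        vectors.append({
            term: tf * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample strings standing in for OCR output.
    docs = [
        "memo regarding travel to mexico city",
        "report on travel to mexico city embassy",
        "budget request for office supplies",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

With pairwise similarities in hand, documents could be grouped by clustering, and the highest-weighted terms in each group would give a rough view of its theme. OCR noise will inflate the vocabulary with misspelled tokens, so some filtering of rare terms is probably needed first.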