Sr. Design Project – Medtronic
LISPAT (Lost in Space and Time) converts documents from their common pdf/docx format to txt so that Natural Language Processing can be applied to the data, with the goal of finding semantic similarities between two documents. It uses text cleaning, top keyword counts, top n-gram counts, and Python libraries to get the most out of the data. It extracts the most valuable information from each document in order to capture the original information in a manner that preserves and increases its value, accessibility, and usefulness.
IMPORTANT: Please read the notes about LISPAT further down.
Home - LISPAT's Home Page
API Reference - Detailed reference for LISPAT's API
Usage Guides - How to use LISPAT and its features
- Converts 2 uploaded pdf/docx/doc documents for comparison/analysis
- Finds top keywords in both documents
- Highlights selected top keywords
- Searches for a keyword in a specific document
- Visualizes each document's term frequency on a graph
- Visualizes how the 2 documents use a particular word
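The keyword and n-gram counting mentioned above can be illustrated with a minimal Python sketch. This is not LISPAT's actual implementation: the function names, the tiny stop-word list, and the regex tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list; an assumption, not the list LISPAT uses.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def top_keywords(text, k=5):
    """Return the k most frequent tokens with their counts."""
    return Counter(tokenize(text)).most_common(k)

def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-grams (tuples of tokens) with their counts."""
    tokens = tokenize(text)
    # Zip n shifted copies of the token list to form consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)
```

Calling `top_keywords` and `top_ngrams` on the text of each converted document gives two ranked lists that can then be compared side by side.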
Follow the home page link above to access LISPAT.
Notes about LISPAT.
- To clear a search, remove your search words and press Enter.
- Search words are case sensitive.
- Selecting a keyword also matches any word that contains the keyword's characters in sequence. For example, the word "words" is contained within "keywords", so "keywords" will show up in the term search. If you want to match only the whole word, please enter it in the search bar.
- The documents will contain headers and footers from the original pdf or docx document. Because of the processing involved in conversion, the data from each document is not perfectly formatted and some attributes may be missing.
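The difference between the substring matching described in the notes and a whole-word match can be shown with a short Python sketch. It only illustrates the behavior; the function names are hypothetical and this is not how LISPAT itself is implemented.

```python
import re

def substring_matches(term, words):
    """Case-sensitive substring match: 'words' also matches 'keywords'."""
    return [w for w in words if term in w]

def whole_word_matches(term, text):
    """Case-sensitive whole-word match using regex word boundaries."""
    return re.findall(r"\b" + re.escape(term) + r"\b", text)
```

With the term "words", the first function returns both "words" and "keywords", while the second returns only the standalone "words". Neither matches "Words", since both comparisons are case sensitive.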
The graph shows you the differences in the way the words are used in each document.
Top Right Corner: These are common words found in both documents. Ideally, most of your top terms appear in this corner.
Bottom Left Corner: These are the least common words in each document. They are rarely used.
Top Left Corner: These are the most frequent terms found in the 1st Document.
Bottom Right Corner: These are the most frequent terms found in the 2nd Document.
Selecting a term on the graph or column will show you where it was found in each document, in a side-by-side comparison.
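One way to read the four corners is as a classification of each term by how often it appears in each document. The Python sketch below is a simplified illustration under stated assumptions: the `quadrant` function and the fixed `threshold` cutoff are hypothetical, and the actual graph may place terms by a different scale.

```python
from collections import Counter

def quadrant(term, doc1_counts, doc2_counts, threshold=2):
    """Name the graph corner a term falls in, given per-document counts.

    threshold is a hypothetical cutoff between 'frequent' and 'rare'.
    """
    high1 = doc1_counts[term] >= threshold  # frequent in the 1st document
    high2 = doc2_counts[term] >= threshold  # frequent in the 2nd document
    if high1 and high2:
        return "top right"      # common in both documents
    if high1:
        return "top left"       # frequent mainly in the 1st document
    if high2:
        return "bottom right"   # frequent mainly in the 2nd document
    return "bottom left"        # rare in both documents
```

The counts can be built with `Counter(tokens)` for each document; a `Counter` returns 0 for a term it has never seen, so terms missing from one document are handled without extra checks.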
Docker
Note: To install Docker, please see https://docs.docker.com/install/
> docker pull jbrummet/lispat:2.0.0
> docker tag jbrummet/lispat:2.0.0 lispat
> docker run -it -p 5000:5000 lispat
open 0.0.0.0:5000 in the browser
Alternatively, to build the image locally from the repository:
> cd path/to/{lispat}
> docker build . -t lispat
> docker run -it -p 5000:5000 lispat
open 0.0.0.0:5000 in the browser
LISPAT has a fairly large Docker build. The installation process takes a while because of its many Python and JavaScript dependencies.
Cheers, Team LISPAT