BoxSelectCanvas: Self-reading library
Have you ever gone to the library and asked yourself why they don't read all those books and tell you about the important stuff, instead of leaving the reading to you?
This project contains an NLP pipeline framework that puts together various tools for layout recognition, semantic text analysis, and text-to-speech for audiobooks.
- It brings a framework to connect a multitude of different tools with less spaghetti code (or at least neatly sorted spaghetti).
- It takes a deep learning approach to learning layout structures, arbitrarily configurable by learning from tags that are automatically inserted into LaTeX files, which are then compiled and used as training data for layout analysis (see the sketch after this list).
- It employs ELMo from AllenNLP (a contextual embedding model, a precursor of transformers such as BERT or GPT-3) to predict semantic tags within the texts.
- It creates an interactive 3D universe of documents representing the content of the "library". The documents are automatically clustered and the clusters are given titles.
- It brings a dynamic frontend that hosts pages for backend components. The frontend services and page components are built dynamically by the backend!
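The tag-injection step for layout training could look roughly like the following sketch: wrap the arguments of known LaTeX commands in textual markers before compiling, so the rendered PDF carries ground-truth labels for the layout model. The command list, the marker syntax, and the `tag_latex` helper are illustrative assumptions, not the project's actual code.

```python
# Hedged sketch: inject layout tags into LaTeX source before compilation.
# LAYOUT_TAGS and tag_latex are hypothetical names, not the project's API.
import re

LAYOUT_TAGS = {
    "section": "SECTION_TITLE",
    "subsection": "SUBSECTION_TITLE",
    "caption": "CAPTION",
}

def tag_latex(source: str) -> str:
    """Wrap the argument of known commands in visible markers, e.g.
    \\section{Intro} -> \\section{<<SECTION_TITLE>>Intro<</SECTION_TITLE>>}."""
    for command, label in LAYOUT_TAGS.items():
        pattern = r"\\" + command + r"\{([^}]*)\}"
        replacement = r"\\" + command + "{<<" + label + r">>\1<</" + label + ">>}"
        source = re.sub(pattern, replacement, source)
    return source

if __name__ == "__main__":
    print(tag_latex(r"\section{Introduction} \caption{A sample figure}"))
```

The marked-up files would then be compiled, and the positions of the markers in the rendered pages serve as labels for training the layout model.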
Based on a philosophical approach, we try to mine "utterances expressing differences": passages in which different theories, parts of things, meanings, or values are compared with each other. These are the main vehicle for conveying knowledge to the reader, and they build the structure of decisions the reader may come to after reading.
We have made three attempts at this:
- An algorithm that compares and matches based on antonyms (taken from WordNet) and negations. The comparison is done on all sentences and phrases in a text, which are matched if the comparisons and antonyms fit together felicitously (a minimal sketch of this idea follows the list).
- An algorithm that uses a fine-tuned language model (based on AllenNLP's ELMo) to match phrases that are similar to our handmade corpus.
- An algorithm that processes the text pieces with OpenAI's GPT-3, pushing the scientific text into GPT-3 together with the question "What differences are expressed in this text?"
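A minimal sketch of the first, antonym-based idea, assuming NLTK's WordNet interface and a hand-picked set of negation cues (both are illustrative choices, not the project's actual implementation):

```python
# Hedged sketch of antonym/negation matching; names and cue list are assumptions.
from itertools import product
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

NEGATIONS = {"not", "no", "never", "without"}  # assumed negation cues

def antonyms(word: str) -> set:
    """Collect all WordNet antonym lemmas of a word."""
    return {
        ant.name()
        for synset in wn.synsets(word)
        for lemma in synset.lemmas()
        for ant in lemma.antonyms()
    }

def expresses_difference(phrase_a: list, phrase_b: list) -> bool:
    """Match two phrases as a 'difference utterance' if one contains an
    antonym of a word in the other, or if only one of them carries a negation cue."""
    if any(b in antonyms(a) for a, b in product(phrase_a, phrase_b)):
        return True
    return bool(NEGATIONS & (set(phrase_a) ^ set(phrase_b)))

print(expresses_difference("the water is hot".split(), "the water is cold".split()))
```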
You can paste a URL into the frontend and the document will be taken into account by the self-reading library! You can teach it with those documents in three ways:
- Annotate the layout
- Annotate some difference utterances
- Add more documents: the more documents, the more structured and overview-like the topics become
Coming soon:
Embeddings, vector representations of a word's "meaning", are clustered with Gaussian clustering into thematically more or less consistent topics, and those clusters are given titles by a TF-IDF method.
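As a rough sketch with scikit-learn, assuming Gaussian mixture clustering over placeholder embeddings and per-cluster TF-IDF terms as titles (the real pipeline presumably uses its own embeddings and titling logic):

```python
# Hedged sketch: Gaussian mixture clustering of embeddings plus TF-IDF titles.
# The documents and random embeddings below are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "neural layout analysis of pdf pages",
    "detecting headers and captions in documents",
    "text to speech for scientific audiobooks",
    "recording podcasts from generated speech",
]
embeddings = np.random.rand(len(documents), 8)  # placeholder for real document vectors

# Cluster the embeddings with a Gaussian mixture model.
labels = GaussianMixture(n_components=2, covariance_type="diag",
                         random_state=0).fit_predict(embeddings)

# Title each cluster with its highest-scoring TF-IDF terms.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
terms = np.array(vectorizer.get_feature_names_out())
for cluster in np.unique(labels):
    scores = np.asarray(tfidf[labels == cluster].mean(axis=0)).ravel()
    print(f"cluster {cluster}:", ", ".join(terms[scores.argsort()[::-1][:3]]))
```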
Create an audiobook from your favorite scientific paper and share it with the rest of the science world as a great machine-made podcast!
Clone the repository and cd into it:
python -m venv venv
source venv/bin/activate
pip install pip==20.3
pip install torch
pip install -r requirements.txt --use-deprecated=legacy-resolver
cd self-reading-library/python
python backend.py
and in another shell:
cd self-reading-library/react/layouut-viewer-made/
[yarn install]
yarn run dev
Thanks to:
- vasturiano
- arxiv.org
- AllenNLP
- pdf2htmlEX
- parasail
- layoutlmv2
- differencebetween.net
- To authors of over 60000 npm packages!
- To authors of 2000 Python packages!
- To nature, god and friends!