Personal Papers-Explorer is a private PDF paper search engine for a PhD research student at Texas State University.
This project develops an easy-to-use web application for the Personal Papers-Explorer engine. The engine is built on Solr, a standalone enterprise text search server with a REST-like API. The web application lets a user, i.e. a researcher, upload research papers in PDF format into various cores (projects) and search through them. The GROBID (GeneRation Of BIbliographic Data) machine-learning library is used to extract and parse the title, authors, abstract, and body text of the PDF papers. The extracted data are stored in Solr via JSON over HTTP. PHP and Python scripts invoke the Personal Papers-Explorer APIs for accessing stored papers. The web interface also provides a text area for the researcher to annotate each research paper, and these annotations can be exported in multiple formats, such as JSON and XML, for integration with other systems.
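At a high level, the ingestion pipeline can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual script: the endpoint URLs and the core name ComputerScience are taken from the examples later in this document, and the document field names are assumptions.

import requests

GROBID_URL = "http://cloud.science-miner.com/grobid/processFulltextDocument"
SOLR_UPDATE_URL = "http://localhost:8983/solr/ComputerScience/update?commit=true"

def index_paper(pdf_path):
    # Send the PDF to GROBID, which returns the parsed document as TEI XML.
    with open(pdf_path, "rb") as f:
        tei_xml = requests.post(GROBID_URL, files={"input": f}).text
    # The real scripts parse the TEI into title/author/abstract/text fields;
    # here the raw TEI is indexed as a placeholder "text" value, and the
    # unique key field name follows the delete example later in this document.
    doc = {"id": pdf_path, "text": tei_xml}
    requests.post(SOLR_UPDATE_URL, json=[doc]).raise_for_status()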
The following software should be installed to run the project:
Set up the project by changing the following settings. In settings.php under the includes folder:
- $target_dir - Enter the full path to the folder where you want to store the research papers.
In settings.py under the pyCalculation folder:
- importQuery - Solr UI URL.
- fullPathToSolrBinFolder - C:\solr-7.1.0\bin
- fullPathToSolrServerFolder - C:\solr-7.1.0\server\solr
- pathToResearchPapersFolder - Enter the full path to the folder where you want to store the research papers.
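For reference, a settings.py along these lines would cover the items above; the values shown are placeholders to adjust for your installation, and the research papers path in particular is hypothetical.

# pyCalculation/settings.py -- placeholder values, adjust to your installation.
importQuery = "http://localhost:8983/solr"                  # Solr UI URL
fullPathToSolrBinFolder = r"C:\solr-7.1.0\bin"
fullPathToSolrServerFolder = r"C:\solr-7.1.0\server\solr"
pathToResearchPapersFolder = r"C:\PapersExplorer\ResearchPapers"  # placeholder path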
- Automatic metadata extraction - Personal Papers-Explorer automatically extracts author, title and other related metadata for analysis and document search.
- Full-text indexing - Personal Papers-Explorer indexes the full text of entire PDF papers. Full Boolean and phrase search are supported.
- Individual core search capability - Personal Papers-Explorer supports searching individual cores, where cores can be created by the user according to the categories of the papers.
- Annotating the research papers - The web interface also provides a text area for the researcher to annotate each research paper.
- Exporting data to JSON format - The extracted file name, title, authors, abstract, and annotation can be exported into a downloadable JSON file.
A core can be created through the Solr CoreAdmin API using the CREATE action:
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=ComputerScience"
Once the core is created, new fields such as timestamp, title, author, abstract, annotation, summary, references and text are added to the core schema:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":[
{"name":"timestamp","type":"pdate","indexed":true,"stored":true,"default":"NOW","multiValued":false},
{"name":"abstract","type":"string","indexed":true,"stored":true,"docValues":true},
{"name":"author","type":"string","indexed":true,"stored":true,"docValues":true},
{"name":"annotation","type":"string","indexed":true,"stored":true,"docValues":true},
{"name":"summary","type":"string","indexed":true,"stored":true,"docValues":true},
{"name":"references","type":"string","indexed":true,"stored":true,"docValues":true},
{"name":"title","type":"string","indexed":true,"stored":true,"docValues":true},
{"name":"text","type":"string","indexed":true,"stored":true,"docValues":true,"multiValued":true}
]}' http://localhost:8983/solr/ComputerScience/schema
To reflect these changes in the Solr core, the core should be reloaded using the RELOAD action:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=ComputerScience"
Screenshot for core creation:
A Solr core can be deleted using the UNLOAD action:
curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=ComputerScience&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true"
Screenshot for core deletion:
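The PHP and Python scripts drive these CoreAdmin actions programmatically. A minimal Python helper sketch, assuming the requests library and Solr on its default port, might look like:

import requests

SOLR_ADMIN = "http://localhost:8983/solr/admin/cores"

def core_admin(action, **params):
    # Issue a CoreAdmin request, e.g. action=CREATE&name=ComputerScience.
    resp = requests.get(SOLR_ADMIN, params={"action": action, **params})
    resp.raise_for_status()
    return resp.json()

core_admin("CREATE", name="ComputerScience")
core_admin("RELOAD", core="ComputerScience")
core_admin("UNLOAD", core="ComputerScience", deleteIndex="true",
           deleteDataDir="true", deleteInstanceDir="true")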
A PDF document can be uploaded to the Solr core through the extracting request handler API call:
http://localhost:8983/solr/ComputerScience/update/extract?commit=true&literal._id=Band.pdf
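Note that the URL above omits the file itself; the extract handler expects the PDF as a multipart upload. A sketch of the full call in Python, where the form field name myfile is arbitrary and the file path follows the GROBID example below:

import requests

url = ("http://localhost:8983/solr/ComputerScience/update/extract"
       "?commit=true&literal._id=Band.pdf")
with open("ResearchPapers/ComputerScience/Band.pdf", "rb") as f:
    # The extracting request handler accepts the PDF as a multipart upload;
    # the form field name ("myfile") is arbitrary.
    requests.post(url, files={"myfile": f}).raise_for_status()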
The metadata and full text of the research paper PDFs can be extracted via a GROBID API call:
curl -v --form input=@./ResearchPapers/ComputerScience/Band.pdf http://cloud.science-miner.com/grobid/processFulltextDocument
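GROBID responds with TEI XML, from which the title, authors, and abstract are pulled before indexing. A minimal parsing sketch follows; the element paths use the standard TEI namespace GROBID emits and are illustrative, not the project's exact parsing code.

import requests
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

with open("ResearchPapers/ComputerScience/Band.pdf", "rb") as f:
    tei_xml = requests.post(
        "http://cloud.science-miner.com/grobid/processFulltextDocument",
        files={"input": f}).text

root = ET.fromstring(tei_xml)
# The paper title sits in the TEI header; the abstract under the profile section.
title = root.findtext(".//" + TEI + "titleStmt/" + TEI + "title")
abstract_el = root.find(".//" + TEI + "abstract")
abstract = "".join(abstract_el.itertext()).strip() if abstract_el is not None else ""
print(title)
print(abstract[:200])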
An uploaded document can be deleted using its unique ID:
curl http://localhost:8983/solr/ComputerScience/update?commit=true -H "Content-Type: text/xml" --data-binary "<delete><id>4496b4f7-42ad-4dee-92ad-daa78b5c32b9</id></delete>"
Queries support three different Boolean operators. To retrieve all the documents:
http://localhost:8983/solr/ComputerScience/select?q=*:*&fq=*:*&wt=python&rows=2147483647&sort=timestamp%20desc&df=text
The "fq" field is changed base on the queries such as:
- Query using AND: title:Detection AND author:Tech retrieves the documents whose title contains “Detection” and whose author contains “Tech”.
- Query using OR: title:Detection OR author:Tech retrieves the documents whose title contains “Detection” or whose author contains “Tech”.
- Query using NOT (-): title:Detection AND -author:Tech retrieves the documents whose title contains “Detection” and whose author does not contain “Tech”.
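These filter queries can also be issued from a script. A sketch with the requests library, which URL-encodes the parameters automatically; the row limit is an arbitrary example value:

import requests

SELECT = "http://localhost:8983/solr/ComputerScience/select"

def search(fq):
    # q=*:* matches everything; fq narrows the result set with a Boolean filter.
    params = {"q": "*:*", "fq": fq, "wt": "json", "df": "text",
              "sort": "timestamp desc", "rows": 100}
    return requests.get(SELECT, params=params).json()["response"]["docs"]

docs = search("title:Detection AND -author:Tech")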
The annotation field for each PDF paper can be updated using the Solr API call:
curl localhost:8983/solr/ComputerScience/update?commit=true -H 'Content-type:application/json' --data-binary "[{'id':'148796b5-89c3-4e5b-bad9-ee8b5b2d449b','annotation':{'set':'This is test annotation'}}]"
Once the documents are annotated, the data can be exported to a JSON file:
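A minimal export sketch, assuming the field names added to the schema earlier; the output file name is arbitrary:

import json
import requests

SELECT = "http://localhost:8983/solr/ComputerScience/select"
params = {"q": "*:*", "wt": "json", "rows": 2147483647,
          "fl": "id,title,author,abstract,annotation"}
docs = requests.get(SELECT, params=params).json()["response"]["docs"]

# Write the exported fields to a downloadable JSON file.
with open("export.json", "w") as out:
    json.dump(docs, out, indent=2)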
CS5395-004 Independent Study with Dr. Anne Hee Hiong Ngu at Texas State University.