- Prerequisite: Install docker and docker-compose on your local machine in order to be able to execute the commands below. https://docs.docker.com/get-docker/
git clone https://github.com/HaigeWang1/Paper-Semantification.git
- paper_semantification includes a parser that relies on OpenAI public endpoints, so an OpenAI API key is required for it to work.
- Create a .env file in the same folder as docker-compose.yaml
- Set the environment variable in that file:
OPENAI_API_KEY="sk-..."
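Before starting the containers, it can help to confirm that the key is actually present in the file. The following is a minimal sketch, not part of the project: `load_env_file` and `has_openai_key` are hypothetical helper names, and the parser only handles simple `KEY=VALUE` lines.

```python
def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(values):
    """Return True when OPENAI_API_KEY is present and non-empty."""
    return bool(values.get("OPENAI_API_KEY"))
```

Calling `has_openai_key(load_env_file(".env"))` from the folder that contains docker-compose.yaml returns `False` if the variable is missing or empty.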
docker build -t paper_semantification .
Build the docker image for the python service paper_semantification
docker-compose up -d
Run the whole application
The docker-compose file defines two services:
- The Neo4j database can be accessed locally at http://localhost:7474 (connect URL: bolt://localhost:7687).
- Authentication is disabled, so ignore the fields related to authentication.
- Our python service exposes its APIs through a FastAPI server at http://localhost:8000/docs
- You can call the different endpoints that our service exposes
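To see which endpoints are available without opening the browser, the OpenAPI schema can be read programmatically. This is a minimal sketch that assumes the service keeps FastAPI's default schema path `/openapi.json`; the helper names `list_endpoints` and `fetch_schema` are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # from this README
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_schema(base_url=BASE_URL):
    """Download the OpenAPI schema; the service must be running."""
    # Assumption: the schema is served at FastAPI's default /openapi.json.
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the containers up, `list_endpoints(fetch_schema())` returns the method and path of every route the service exposes.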
The purpose of this task is to comprehensively process scholarly papers by leveraging metadata extraction services such as CERMINE and GROBID APIs.
Extract metadata for each paper using CERMINE and GROBID (provided through API), including title, authors, affiliations, publication year, etc.
- Available APIs
- [OPTIONAL] The CEUR-WS template introduces a structure into the PDFs and recommends providing at least an e-mail address or other identifier
- Optimize the extraction of author information based on the template
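GROBID returns its results as TEI XML, so the extraction step needs a small parser. The sketch below assumes the header layout GROBID typically produces (`titleStmt/title` for the title and `analytic/author` for authors). The function name and the output dictionary shape are hypothetical, not the project's actual code.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element):
    """Return stripped element text, or None when missing or empty."""
    if element is None or not element.text:
        return None
    return element.text.strip()


def parse_tei_header(tei_xml):
    """Extract the title and author records from a TEI header string."""
    root = ET.fromstring(tei_xml)
    title = _text(root.find(".//tei:titleStmt/tei:title", TEI_NS))
    authors = []
    for author in root.findall(".//tei:analytic/tei:author", TEI_NS):
        # Join all forenames and the surname into one display name.
        parts = [_text(f) for f in author.findall("tei:persName/tei:forename", TEI_NS)]
        parts.append(_text(author.find("tei:persName/tei:surname", TEI_NS)))
        name = " ".join(part for part in parts if part)
        orgs = [_text(o) for o in author.findall("tei:affiliation/tei:orgName", TEI_NS)]
        authors.append({
            "name": name,
            "email": _text(author.find("tei:email", TEI_NS)),
            "affiliations": [org for org in orgs if org],
        })
    return {"title": title, "authors": authors}
```

CERMINE output uses a different XML schema, so it would need its own parser that maps into the same dictionary shape.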
- Search paper title in DBLP for the DBLP ID.
- Match author names with potential ORCID identifiers.
- Cross-reference paper with Wikidata entries.
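The DBLP lookup in the list above could be sketched as follows. It assumes the public DBLP publication search API at `https://dblp.org/search/publ/api` and a JSON response of the form `result.hits.hit[].info`. The helper names are hypothetical, and the title match is a simple exact comparison after normalization, not a fuzzy match.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DBLP_SEARCH_URL = "https://dblp.org/search/publ/api"


def build_dblp_query(title, max_hits=5):
    """Build the search URL for a paper title."""
    params = {"q": title, "format": "json", "h": max_hits}
    return f"{DBLP_SEARCH_URL}?{urlencode(params)}"


def normalize(text):
    """Keep only lowercase letters and digits so punctuation is ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def find_dblp_key(title, response):
    """Return the DBLP key of the first hit whose title matches, else None."""
    # Assumption: hits live under result.hits.hit; the key is absent when empty.
    hits = response.get("result", {}).get("hits", {}).get("hit", [])
    for hit in hits:
        info = hit.get("info", {})
        if normalize(info.get("title", "")) == normalize(title):
            return info.get("key")
    return None


def search_dblp(title):
    """Query DBLP over the network and return the matching key, if any."""
    with urlopen(build_dblp_query(title), timeout=10) as response:
        return find_dblp_key(title, json.load(response))
```

Normalizing before comparing means a trailing period or different capitalization in the DBLP title does not prevent a match.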
Compare results from CERMINE and GROBID; conduct manual checks for discrepancies. If DBLP data is present, match against CERMINE and GROBID results.
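One way to surface discrepancies for the manual check is a field-by-field comparison of the two extraction results. This is a minimal sketch: the dictionary shape, the field names, and the normalization rule (lowercasing and collapsing whitespace) are assumptions.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        return [normalize(item) for item in value]
    return " ".join(str(value).lower().split())


def find_discrepancies(cermine, grobid, fields=("title", "authors", "year")):
    """Return {field: (cermine_value, grobid_value)} for fields that disagree."""
    differences = {}
    for field in fields:
        left, right = cermine.get(field), grobid.get(field)
        if normalize(left) != normalize(right):
            differences[field] = (left, right)
    return differences
```

Fields returned by `find_discrepancies` are candidates for manual review. If a DBLP record exists, it can be passed as one of the two arguments to compare it against either tool.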
Create nodes for Proceedings, Event, Author, Paper, Affiliations. Ensure papers are connected to proceedings and event. Connect authors to affiliations and papers.
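The graph structure described above can be expressed as parameterized Cypher statements. The sketch below only builds the statements and does not connect to Neo4j. The relationship types (`IN_PROCEEDINGS`, `PRESENTED_AT`, `AUTHORED`, `AFFILIATED_WITH`), the property names, and the input dictionary shape are hypothetical choices, not the project's actual schema.

```python
def build_statements(paper):
    """Return (cypher, parameters) pairs that create the nodes and edges."""
    statements = [(
        # Paper is linked to both its proceedings and its event.
        "MERGE (p:Paper {title: $title}) "
        "MERGE (pr:Proceedings {name: $proceedings}) "
        "MERGE (e:Event {name: $event}) "
        "MERGE (p)-[:IN_PROCEEDINGS]->(pr) "
        "MERGE (p)-[:PRESENTED_AT]->(e)",
        {
            "title": paper["title"],
            "proceedings": paper["proceedings"],
            "event": paper["event"],
        },
    )]
    for author in paper["authors"]:
        # Each author is linked to the paper.
        statements.append((
            "MATCH (p:Paper {title: $title}) "
            "MERGE (a:Author {name: $name}) "
            "MERGE (a)-[:AUTHORED]->(p)",
            {"title": paper["title"], "name": author["name"]},
        ))
        for affiliation in author.get("affiliations", []):
            # Each author is linked to every listed affiliation.
            statements.append((
                "MATCH (a:Author {name: $name}) "
                "MERGE (af:Affiliation {name: $affiliation}) "
                "MERGE (a)-[:AFFILIATED_WITH]->(af)",
                {"name": author["name"], "affiliation": affiliation},
            ))
    return statements
```

Using `MERGE` rather than `CREATE` keeps the import idempotent, so re-running it for the same paper does not duplicate nodes. Each pair could then be executed against bolt://localhost:7687, for example with the `session.run(query, parameters)` call of the official Neo4j Python driver.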
Store KG privately, ensuring security of personal data such as email addresses. If pushing to a public KG, sanitize private data. Synchronize entities linked to Wikidata.
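Sanitizing before a public push can be as simple as dropping private fields from a copy of each record. This is a minimal sketch: the record shape is an assumption, and treating only `email` as private is an illustrative default that should be extended to whatever personal data the KG actually stores.

```python
import copy

# Assumption: e-mail is the only private field in this illustrative record shape.
PRIVATE_FIELDS = ("email",)


def sanitize_paper(paper, private_fields=PRIVATE_FIELDS):
    """Return a copy of the paper record with private author fields removed."""
    public = copy.deepcopy(paper)  # leave the private record untouched
    for author in public.get("authors", []):
        for field in private_fields:
            author.pop(field, None)
    return public
```

Working on a deep copy keeps the private KG record complete while the sanitized version is the only one that leaves the local store.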
- Access API
- Utilize
- Validation
- 2024-01-19: Midterm coordination
- 2024-03-22: Project result delivery
- 2024-03-28: Final presentation