This repository contains the code to extract meaningful financial and non-financial indicators from company reports (PDF files).
Go to the root of the repository, create a Python virtual environment, and activate it:
python3 -m venv env
source env/bin/activate
Copy the file sample_config/.env to the root of the repository and fill in the missing values.
Build and start the services defined in the docker-compose file. The container hosts the pgvector database that stores the embeddings extracted from the PDF files.
sudo docker compose build
sudo docker compose up
To store the semantic embeddings in the database, run
PYTHONHASHSEED=0 python3 main.py --pdf [PDF_PATH] --embed --use_dense --model_name [MODEL_NAME]
where
PYTHONHASHSEED=0 is an environment variable that makes Python's hash function deterministic. hash is applied to each document chunk to produce a unique id, which is used as the primary key in the embedding database. This way, the system avoids loading the same documents multiple times if the above command is run repeatedly;
PDF_PATH can be either a single PDF file or a directory of PDF files;
--embed and --use_dense indicate that the system should embed the documents using the model MODEL_NAME (taken from Hugging Face). By default, MODEL_NAME="sentence-transformers/all-mpnet-base-v2".
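The effect of PYTHONHASHSEED on hash can be checked with a small sketch. It is not code from this repository: the helper name chunk_hash and the chunk text are hypothetical, and the way the repository turns a hash into a primary key is not shown here. The sketch only illustrates that two separate interpreter runs with PYTHONHASHSEED=0 produce the same hash for the same string.

```python
import os
import subprocess
import sys


def chunk_hash(text: str, seed: str) -> int:
    """Hash `text` in a fresh interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout)


chunk = "example chunk text"  # hypothetical document chunk
first = chunk_hash(chunk, "0")
second = chunk_hash(chunk, "0")
same_id = first == second  # both runs agree because the seed is fixed
```

Without a fixed seed, Python randomizes string hashes per process, so the same chunk would likely get a different id on each run and be inserted again.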
To store the data for the sparse embedding, run
PYTHONHASHSEED=0 python3 main.py --pdf [PDF_PATH] --embed --use_sparse --syn_model_name [SYN_MODEL_NAME]
where
PYTHONHASHSEED=0 has the same meaning as above;
PDF_PATH can be either a single PDF file or a directory of PDF files;
--embed and --use_sparse indicate that the system should embed the documents using the model SYN_MODEL_NAME. By default, SYN_MODEL_NAME="tf_idf".
To query the embeddings, run
python3 main.py --pdf [PDF_PATH] --query [QUERY_STRING] --use_dense --model_name [MODEL_NAME] --k [TOP_K_RESULTS]
This command will return the top-k results obtained from the dense (semantic) query.
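As a rough illustration of what a top-k dense query does, the sketch below ranks chunk vectors by cosine similarity to a query vector. The similarity metric, the function names, and the toy vectors are all assumptions made for this example. In the repository the search is performed against the pgvector database, and the metric it uses is not stated here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional embeddings keyed by hypothetical chunk ids.
chunks = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
results = top_k([1.0, 0.1], chunks, k=2)
```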
To do the same for the sparse (syntactic) embeddings, run
python3 main.py --pdf [PDF_PATH] --query [QUERY_STRING] --use_sparse --syn_model_name [SYN_MODEL_NAME] --k [TOP_K_RESULTS]
The ensemble method combines the semantic and syntactic retrieval modes to improve the retrieval results. To use the ensemble, run
python3 main.py --pdf [PDF_PATH] --query [QUERY_STRING] --use_ensemble --model_name [MODEL_NAME] --syn_model_name [SYN_MODEL_NAME] --k [TOP_K_RESULTS] --lambda [LAMBDA_VALUE]
The additional parameter --lambda is a scalar value that controls the importance of syntactic features relative to semantic ones. The higher the value, the more weight is given to SYN_MODEL_NAME (e.g. tf_idf).
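One way to picture the role of --lambda is a weighted blend of the two score sets. The sketch below assumes a convex combination, lam * sparse + (1 - lam) * dense, with lam between 0 and 1. This formula, the function name ensemble_scores, and the toy scores are assumptions for illustration, not the repository's confirmed implementation.

```python
def ensemble_scores(dense, sparse, lam):
    """Blend per-chunk scores; a higher `lam` weights the sparse scores more."""
    ids = set(dense) | set(sparse)
    return {
        cid: lam * sparse.get(cid, 0.0) + (1.0 - lam) * dense.get(cid, 0.0)
        for cid in ids
    }


# Toy scores keyed by hypothetical chunk ids.
dense = {"a": 0.9, "b": 0.4}
sparse = {"a": 0.1, "b": 0.8}

low = ensemble_scores(dense, sparse, lam=0.0)   # dense scores only
high = ensemble_scores(dense, sparse, lam=1.0)  # sparse scores only
best_low = max(low, key=low.get)
best_high = max(high, key=high.get)
```

Under this assumption, moving lam from 0 to 1 shifts the top result from the chunk preferred by the dense model to the one preferred by the sparse model.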
Run the file test.py with
python3 test.py --pdf [PDF_PATH] --use_[dense|sparse|ensemble] --model_name [MODEL_NAME] --syn_model_name [SYN_MODEL_NAME] --checkpoint_rate [CHECKPOINT_RATE]
where --checkpoint_rate is the saving frequency. The files will be stored in the tests/ directory.
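A saving frequency of this kind is commonly implemented by writing the accumulated results every N processed items. The sketch below assumes that interpretation. The function name, the file naming scheme, and the JSON format are hypothetical, and it writes to a temporary directory rather than tests/.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(items, checkpoint_rate, out_dir):
    """Process items, writing accumulated results every `checkpoint_rate` items."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results, written = [], []
    for i, item in enumerate(items, start=1):
        results.append({"item": item})  # placeholder for a real evaluation step
        if i % checkpoint_rate == 0:
            path = out_dir / f"checkpoint_{i}.json"
            path.write_text(json.dumps(results))
            written.append(path.name)
    return written


with tempfile.TemporaryDirectory() as tmp:
    saved = run_with_checkpoints(range(5), checkpoint_rate=2, out_dir=tmp)
```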