In this work we compare different approaches to Common Procurement Vocabulary (CPV) code classification, using data obtained from the Spanish Treasury.
All training and test datasets, as well as the code, are available on Zenodo. Check them out!
In order to preprocess the data, follow the steps below:
- Download the atom datasets into a folder (in our case, `licitacionesPerfilesContratanteCompleto3_2019`).
- Run `atom2csvBS.py`, which will save the results in `docs.csv`.
- Run `data2tt.py` using `docs.csv`. As a result, you will obtain `train.csv` and `test.csv`, which are the datasets used in the notebooks.
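The steps above can be sketched roughly as follows. This is only an illustration of the idea (Atom entries to CSV rows, then a train/test split) and not the code of `atom2csvBS.py` or `data2tt.py`. The extracted fields (`id`, `title`, `summary`), the helper names, the 80/20 split and the fixed seed are all assumptions made for the example. The real scripts may extract other fields (such as the CPV codes) and split the data differently.

```python
import csv
import io
import random
import xml.etree.ElementTree as ET

# Standard Atom namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def atom_to_rows(atom_xml):
    """Return one dict per <entry> in an Atom feed.

    The fields below are generic Atom elements chosen for illustration;
    they are an assumption, not the fields used by atom2csvBS.py.
    """
    root = ET.fromstring(atom_xml)
    rows = []
    for entry in root.findall("atom:entry", ATOM_NS):
        rows.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM_NS),
        })
    return rows


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle deterministically and return (train, test) lists.

    The 0.2 test fraction and the seed are assumed values.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rows_to_csv(rows, fieldnames=("id", "title", "summary")):
    """Serialise rows to CSV text, ready to be written to a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In this sketch, the output of `rows_to_csv(atom_to_rows(feed_text))` would play the role of `docs.csv`, and applying `rows_to_csv` to each half returned by `split_rows` would play the roles of `train.csv` and `test.csv`.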
The notebooks below contain the different approaches compared:
Test for classic approaches: https://www.kaggle.com/marianavasloro/cpvtwodigitsml
Test for the MKaan approach: https://www.kaggle.com/code/marianavasloro/mkaan
Training and evaluation of fine-tuned RoBERTa: https://www.kaggle.com/code/marianavasloro/fine-tuned-roberta-for-spanish-cpv-codes
Full dataset: https://www.kaggle.com/datasets/marianavasloro/dataset
10% dataset: https://www.kaggle.com/datasets/marianavasloro/dataset10
If you use these datasets or our notebooks, please cite this repository and our 2022 SEPLN paper (to appear):
@software{maria_navas_loro_2022_6554604,
author = {María Navas Loro and
Daniel Garijo and
Oscar Corcho},
title = {Multi-label Text Classification for Public Procurement in Spanish},
month = may,
year = 2022,
publisher = {Zenodo},
doi = {10.5281/zenodo.6554604},
url = {https://doi.org/10.5281/zenodo.6554604}
}
This work has been supported by the NextProcurement European Action and by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with Universidad Politécnica de Madrid in the line "Support for R&D projects for Beatriz Galindo researchers", in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
We would like to thank Jennifer Tabita for her contributions to the initial set of notebooks, and the AI4Gov master students for their validation of the approach.
Source of the data: Ministerio de Hacienda.