
Shrinking knowledge bases for knowledge intensive tasks, one key-value pair at a time


Knowledge Base Shrink

Read the ACL paper or the master's thesis, or watch the video presentation.

The 768-dimensional embeddings of the 2019 Wikipedia dump (split into 100-token segments) take almost 150 GB. This poses practical issues for both research and applications. We aim to reduce the size through two methods:
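
For intuition on where the ~150 GB comes from, a back-of-the-envelope estimate (assuming float32 vectors; the segment count below is an assumed order of magnitude, not an exact figure from this project):

# Rough index size: 768 float32 dimensions = 3 KiB per segment vector.
dim, bytes_per_float = 768, 4
num_segments = 50_000_000                         # assumed order of magnitude
total_gb = num_segments * dim * bytes_per_float / 1e9
print(f"~{total_gb:.0f} GB")                      # ~154 GB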

Dimensionality reduction of the embeddings (see the sketch after this list):

  • PCA, autoencoder, random projections
  • Effect on inner product (IP) vs. L2 retrieval
  • Pre-processing
  • Training/evaluation data size dependency
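
A minimal sketch of the PCA route, not the repository's exact pipeline; the data, dimensions, and variable names below are illustrative:

import numpy as np
from sklearn.decomposition import PCA

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Illustrative random data; real vectors come from the retriever's encoder.
docs = np.random.randn(10_000, 768).astype(np.float32)
queries = np.random.randn(16, 768).astype(np.float32)

# Pre-processing: center with a shared mean, then L2-normalize.
mean = docs.mean(axis=0)
docs_p, queries_p = l2_normalize(docs - mean), l2_normalize(queries - mean)

# PCA needs only a small sample (~1k vectors) to fit reliably.
pca = PCA(n_components=128).fit(docs_p[:1000])

# Post-processing: normalize again after projection, then index as usual.
docs_small = l2_normalize(pca.transform(docs_p))
queries_small = l2_normalize(pca.transform(queries_p))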

Document splitting & filtering (see the sketch after this list):

  • Split into segments respecting semantic boundaries
  • Get retrievability annotations and train a filtering system
  • Decrease knowledge base size by clustering (joining neighbours that point to the same document)
    • Observe performance vs. cluster count
    • Cluster aggregation
    • Pre-train vs. post-train reduction effects
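
A minimal sketch of the clustering idea under an assumed data layout (a flat array of segment embeddings plus a parallel list of document ids). It shows the extreme case of merging every segment of a document into one mean vector; the project itself sweeps the cluster count and aggregation method:

import numpy as np
from collections import defaultdict

def merge_segments_by_document(embeddings, doc_ids):
    # Collapse all segment vectors of a document into one mean vector.
    groups = defaultdict(list)
    for vec, doc in zip(embeddings, doc_ids):
        groups[doc].append(vec)
    merged = {doc: np.mean(vecs, axis=0) for doc, vecs in groups.items()}
    docs = list(merged)
    return np.stack([merged[d] for d in docs]), docs

# Toy example: 5 segments from 2 documents collapse to 2 index entries.
segment_embs = np.random.randn(5, 768).astype(np.float32)
segment_docs = ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b"]
index_vecs, index_docs = merge_segments_by_document(segment_embs, segment_docs)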

Recommendations

  • Always use pre- and post-processing (centering & normalization).
  • PCA is a good enough solution that requires very little data (1k vectors) to fit and is stable. The autoencoder provides a slight improvement but is less stable.
  • Reducing precision to 8-bit floats incurs very little performance drop. Combine PCA with this precision reduction for the best trade-off (sketched below).
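
A hedged sketch of combining both reductions. NumPy has no native 8-bit float dtype, so the snippet stands in a simple per-dimension min-max uint8 quantization for the precision step; the paper's exact precision-reduction scheme may differ.

import numpy as np
from sklearn.decomposition import PCA

def quantize_uint8(x):
    # Per-dimension min-max quantization to one byte per dimension.
    lo, hi = x.min(axis=0), x.max(axis=0)
    codes = np.round((x - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    return codes, lo, hi

def dequantize_uint8(codes, lo, hi):
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

docs = np.random.randn(10_000, 768).astype(np.float32)
docs -= docs.mean(axis=0)                              # centering
docs /= np.linalg.norm(docs, axis=1, keepdims=True)    # normalization

reduced = PCA(n_components=128).fit_transform(docs)    # 768 -> 128 dimensions
codes, lo, hi = quantize_uint8(reduced)                # 4 bytes -> 1 byte per dim
approx = dequantize_uint8(codes, lo, hi)               # decode at search time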

Citation

@inproceedings{zouhar2022knowledge,
  title={Knowledge Base Index Compression via Dimensionality and Precision Reduction},
  author={Zouhar, Vil{\'e}m and Mosbach, Marius and Zhang, Miaoran and Klakow, Dietrich},
  booktitle={Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge},
  pages={41--53},
  year={2022},
  url={https://aclanthology.org/2022.spanlp-1.5/},
}

Paper video presentation

Furthermore, this project is also the subject of a Master's thesis.

Acknowledgement

  • Based on the KILT research & dataset.
  • This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102.