
Shrinking knowledge bases for knowledge intensive tasks, one key-value pair at a time


Knowledge Base Shrink

Read the ACL paper or the master's thesis, or watch the video presentation.

The 768-dimensional embeddings of the 2019 Wikipedia dump (split into 100-token segments) take almost 150 GB. This poses practical issues for both research and applications. We aim to reduce the size through two methods:
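
For intuition on where the ~150 GB comes from, a back-of-the-envelope estimate (assuming float32 vectors; the segment count below is an assumed order of magnitude, not an exact figure from this project):

# Rough index size: 768 float32 dimensions = 3 KiB per segment vector.
dim, bytes_per_float = 768, 4
num_segments = 50_000_000                         # assumed order of magnitude
total_gb = num_segments * dim * bytes_per_float / 1e9
print(f"~{total_gb:.0f} GB")                      # ~154 GB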

Dimensionality reduction of the embeddings (see the sketch after this list):

  • PCA, autoencoder, random projections
  • Effect on inner product (IP) vs. L2 retrieval
  • Pre-processing
  • Training/evaluation data size dependency
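
A minimal sketch of the PCA route, not the repository's exact pipeline; the data, dimensions, and variable names below are illustrative:

import numpy as np
from sklearn.decomposition import PCA

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Illustrative random data; real vectors come from the retriever's encoder.
docs = np.random.randn(10_000, 768).astype(np.float32)
queries = np.random.randn(16, 768).astype(np.float32)

# Pre-processing: center with a shared mean, then L2-normalize.
mean = docs.mean(axis=0)
docs_p, queries_p = l2_normalize(docs - mean), l2_normalize(queries - mean)

# PCA needs only a small sample (~1k vectors) to fit reliably.
pca = PCA(n_components=128).fit(docs_p[:1000])

# Post-processing: normalize again after projection, then index as usual.
docs_small = l2_normalize(pca.transform(docs_p))
queries_small = l2_normalize(pca.transform(queries_p))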

Document splitting & filtering (see the sketch after this list):

  • Split into segments respecting semantic boundaries
  • Get retrievability annotations and train a filtering system
  • Decrease knowledge base size by clustering (joining neighbours that point to the same document)
    • Observe performance vs. cluster count
    • Cluster aggregation
    • Pre-train vs. post-train reduction effects
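
A minimal sketch of the clustering idea under an assumed data layout (a flat array of segment embeddings plus a parallel list of document ids). It shows the extreme case of merging every segment of a document into one mean vector; the project itself sweeps the cluster count and aggregation method:

import numpy as np
from collections import defaultdict

def merge_segments_by_document(embeddings, doc_ids):
    # Collapse all segment vectors of a document into one mean vector.
    groups = defaultdict(list)
    for vec, doc in zip(embeddings, doc_ids):
        groups[doc].append(vec)
    merged = {doc: np.mean(vecs, axis=0) for doc, vecs in groups.items()}
    docs = list(merged)
    return np.stack([merged[d] for d in docs]), docs

# Toy example: 5 segments from 2 documents collapse to 2 index entries.
segment_embs = np.random.randn(5, 768).astype(np.float32)
segment_docs = ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b"]
index_vecs, index_docs = merge_segments_by_document(segment_embs, segment_docs)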

Recommendations

  • Always use pre- and post-processing (centering & normalization).
  • PCA is a good enough solution that requires very little data (1k vectors) to fit and is stable. The autoencoder provides a slight improvement but is less stable.
  • Reducing precision to 8-bit floats incurs very little performance drop. Combine PCA with this precision reduction for the best trade-off (sketched below).
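
A hedged sketch of combining both reductions. NumPy has no native 8-bit float dtype, so the snippet stands in a simple per-dimension min-max uint8 quantization for the precision step; the paper's exact precision-reduction scheme may differ.

import numpy as np
from sklearn.decomposition import PCA

def quantize_uint8(x):
    # Per-dimension min-max quantization to one byte per dimension.
    lo, hi = x.min(axis=0), x.max(axis=0)
    codes = np.round((x - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    return codes, lo, hi

def dequantize_uint8(codes, lo, hi):
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

docs = np.random.randn(10_000, 768).astype(np.float32)
docs -= docs.mean(axis=0)                              # centering
docs /= np.linalg.norm(docs, axis=1, keepdims=True)    # normalization

reduced = PCA(n_components=128).fit_transform(docs)    # 768 -> 128 dimensions
codes, lo, hi = quantize_uint8(reduced)                # 4 bytes -> 1 byte per dim
approx = dequantize_uint8(codes, lo, hi)               # decode at search time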

Citation

@inproceedings{zouhar2022knowledge,
  title={Knowledge Base Index Compression via Dimensionality and Precision Reduction},
  author={Zouhar, Vil{\'e}m and Mosbach, Marius and Zhang, Miaoran and Klakow, Dietrich},
  booktitle={Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge},
  pages={41--53},
  year={2022},
  url={https://aclanthology.org/2022.spanlp-1.5/},
}

Paper video presentation

Furthermore, this project is also the subject of a Master's thesis.

Acknowledgement

  • Based on the KILT research & dataset.
  • This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102.