/embed-encode

Base64-encoded embeddings from the OpenAI API

Coordinates of text embeddings retrieved with the OpenAI Python library display more digits than those retrieved directly from the API endpoint or through most other libraries. This repository explores why that is, how to get the same behavior by the same mechanism when working in other languages, and why it is usually not worth the trouble.

More specifically, this repository is a collection of code examples and documentation for the encoding_format argument to the OpenAI embeddings API. When encoding_format is set to base64, the API sends the raw floats encoded in Base64 instead of as JSON numbers. The OpenAI Python library uses this under the hood.
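As a rough illustration of what decoding such a payload involves, here is a minimal Python sketch. It assumes the Base64 text carries packed little-endian 32-bit floats; that layout is an assumption here and is not stated in official documentation. The name decode_embedding is hypothetical, and the sample payload is built locally rather than fetched from the API.

```python
import base64
import struct


def decode_embedding(b64_text: str) -> list[float]:
    """Decode Base64 text into floats, assuming packed little-endian float32."""
    raw = base64.b64decode(b64_text)
    count = len(raw) // 4  # 4 bytes per float32 (assumed layout)
    return list(struct.unpack(f"<{count}f", raw))


# Stand-in for a payload the API might return; built locally for illustration.
sample = base64.b64encode(struct.pack("<3f", 0.5, -0.25, 0.125)).decode("ascii")

print(decode_embedding(sample))  # [0.5, -0.25, 0.125]
```

The sample values are exactly representable in 32 bits, so they come back unchanged. Real embedding coordinates generally are not, which is where the extra digits come from.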

Beware that encoding_format is not officially documented. It could be removed or changed in the future!

License

This project is licensed under 0BSD, which is a "public-domain equivalent" license. See LICENSE for details.

Acknowledgements

These materials arose out of conversations with RonaldGRuckus on the OpenAI Discord server. If not for Ronald's observations about embeddings from the Python library, and the conversations that followed, this repository and its contents would not exist.

Contents

See Why embeddings via the Python library show more digits for a detailed explanation of the effect described above.
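The effect can be reproduced without calling the API. The sketch below assumes the coordinates are 32-bit floats that get widened to Python's 64-bit float when decoded from raw bytes; the coordinate value is a made-up example, not real API output.

```python
import struct

# Hypothetical coordinate as it might appear in a JSON response body.
text_value = "-0.006929283"

# Parsed straight from decimal text into a Python float (64-bit).
parsed = float(text_value)

# Rounded to a 32-bit float and widened back to 64-bit, which is what
# happens when raw float32 bytes are unpacked into Python floats.
widened = struct.unpack("<f", struct.pack("<f", parsed))[0]

print(parsed)   # short: Python prints the shortest text that round-trips
print(widened)  # longer: the exact float32 value needs more digits as a float64
```

Both numbers map to the same 32-bit float, so the extra digits carry no additional information about the embedding.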

The example code in this repository is in three directories:

Note that the reason to use encoding_format, if there is one, would not ordinarily be increased precision. It would instead be improved speed and reduced network usage, which appears to be why the OpenAI Python library uses it.
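To give a sense of the potential savings in transfer size, the following sketch compares the two text forms for one vector. The length of 1536, the random values, and the choice of 9 significant digits for the decimal form are all assumptions made for illustration; actual response sizes depend on how the server formats its numbers.

```python
import base64
import random
import struct

DIMENSIONS = 1536  # assumed embedding length, for illustration only

rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(DIMENSIONS)]

# Base64 text carrying the values as packed little-endian float32.
b64_text = base64.b64encode(struct.pack(f"<{DIMENSIONS}f", *values)).decode("ascii")

# JSON array carrying the values as decimal text. Nine significant digits
# are enough to round-trip any float32.
json_text = "[" + ",".join(format(v, ".9g") for v in values) + "]"

print(len(b64_text))   # 8192 characters: 6144 bytes expanded by 4/3
print(len(json_text))  # noticeably larger under these assumptions
```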

To reiterate the warning above, encoding_format is not officially documented and could be removed or changed at any time. The OpenAI Python library's source code shows how one might use it in a way that partially avoids depending on its continued existence.
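One way to reduce that dependence is to accept either representation when reading a response. The sketch below is not the library's actual code; coerce_embedding is a hypothetical helper, and the little-endian float32 layout is assumed.

```python
import base64
import struct
from typing import Union


def coerce_embedding(value: Union[str, list]) -> list[float]:
    """Return a list of floats from either a Base64 string or a plain list."""
    if isinstance(value, str):
        # Base64 path: assumed to be packed little-endian float32.
        raw = base64.b64decode(value)
        return list(struct.unpack(f"<{len(raw) // 4}f", raw))
    # Fallback path: the server ignored the argument and sent JSON numbers.
    return [float(x) for x in value]


encoded = base64.b64encode(struct.pack("<2f", 1.5, -2.0)).decode("ascii")
print(coerce_embedding(encoded))      # [1.5, -2.0]
print(coerce_embedding([1.5, -2.0]))  # [1.5, -2.0]
```

With this shape, a caller keeps working if the server stops honoring encoding_format and returns ordinary JSON arrays. It does not help if the server rejects the argument outright, which is why the protection is only partial.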