txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive question-answering based systems. txtai also has functionality for zero-shot classification.
NeuML uses txtai and/or the concepts behind it to power all of our Natural Language Processing (NLP) applications. Example applications:
- paperai - AI-powered literature discovery and review engine for medical/scientific papers
- tldrstory - AI-powered understanding of headlines and story text
- neuspo - Fact-driven, real-time sports event and news site
- codequestion - Ask coding questions directly from the terminal
txtai is built on the following stack:
- sentence-transformers
- transformers
- faiss
- Python 3.6+
The easiest way to install is via pip and PyPI:
pip install txtai
You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.
pip install git+https://github.com/neuml/txtai
Python 3.6+ is supported.
Windows and macOS systems have the following prerequisites. No additional steps are needed for Linux.
- Windows: Install C++ Build Tools
- macOS: Run brew install libomp (see this link)
The examples directory has a series of examples and notebooks giving an overview of txtai. See the list of notebooks below.
Notebook | Description |
---|---|
Introducing txtai | Overview of the functionality provided by txtai |
Build an Embeddings index with Hugging Face Datasets | Index and search Hugging Face Datasets |
Build an Embeddings index from a data source | Index and search a data source with word embeddings |
Add semantic search to Elasticsearch | Add semantic search to existing search systems |
Extractive QA with txtai | Introduction to extractive question-answering with txtai |
Extractive QA with Elasticsearch | Run extractive question-answering queries with Elasticsearch |
Apply labels with zero shot classification | Use zero shot learning for labeling, classification and topic modeling |
API Gallery | Using txtai in JavaScript, Java, Rust and Go |
The following sections cover available settings for each txtai component. See the example notebooks for detailed examples on how to use each txtai component.
An Embeddings instance is the engine that provides similarity search. Embeddings can be used to run ad-hoc similarity comparisons or build/search large indices.
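As a rough sketch of what building and searching an index can look like, the snippet below indexes a few sections of text and runs a query. It assumes the index() and search() methods used in the example notebooks, documents passed as (id, text, tags) tuples and results returned as (id, score) tuples. build_documents and search_sections are hypothetical helper names, not part of txtai.

```python
def build_documents(sections):
    """Convert raw text sections into (id, text, tags) tuples for indexing."""
    return [(uid, text, None) for uid, text in enumerate(sections)]


def search_sections(sections, query, limit=1):
    """Index the sections and return the top matches as (id, score) tuples."""
    # Imported lazily so build_documents works without txtai installed
    from txtai.embeddings import Embeddings

    embeddings = Embeddings({"method": "transformers",
                             "path": "sentence-transformers/bert-base-nli-mean-tokens"})

    # Build the index, then run a similarity search against it
    embeddings.index(build_documents(sections))
    return embeddings.search(query, limit)


# Example call (downloads the model on first use):
# search_sections(["US tops 5 million confirmed virus cases",
#                  "Maine man wins $1M from lottery ticket"], "health", 1)
```

The model download and indexing only happen inside search_sections, so the document preparation step can be checked on its own.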
Embeddings parameters are set through the constructor. Examples below.
from txtai.embeddings import Embeddings

# Transformers embeddings model
Embeddings({"method": "transformers",
            "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Word embeddings model (vectors is the path to a local word embeddings model)
Embeddings({"path": vectors,
            "storevectors": True,
            "scoring": "bm25",
            "pca": 3,
            "quantize": True})
method: transformers|words
Sets the sentence embeddings method to use. When set to transformers, the embeddings object builds sentence embeddings using sentence-transformers. Otherwise, a word embeddings model is used. Defaults to words.
path: string
Required field that sets the path for a vectors model. When method is set to transformers, this must be a path to a Hugging Face transformers model. Otherwise, it must be a path to a local word embeddings model.
storevectors: boolean
Enables copying of the vectors model set in path into the embeddings model's output directory on save. This option enables a fully encapsulated index with no external file dependencies.
scoring: bm25|tfidf|sif
For word embedding models, a scoring model allows building weighted averages of word vectors for a given sentence. Supports BM25, tf-idf and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
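To make the idea of a weighted average concrete, here is a minimal pure-Python sketch. It is not txtai's implementation: the word vectors and per-token weights are made-up values standing in for a word embeddings model and a scoring model such as BM25, and weighted_sentence_embedding is a hypothetical helper name.

```python
def weighted_sentence_embedding(tokens, vectors, weights):
    """Build a sentence embedding as a weighted average of word vectors."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    weight_sum = 0.0

    for token in tokens:
        # Skip tokens that have no word vector
        if token not in vectors:
            continue

        # Tokens without a score fall back to a weight of 1.0 (plain mean)
        weight = weights.get(token, 1.0)
        for i, value in enumerate(vectors[token]):
            total[i] += weight * value
        weight_sum += weight

    if weight_sum == 0:
        return total
    return [value / weight_sum for value in total]


# Hypothetical 2-dimensional word vectors and scoring weights
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
weights = {"the": 1.0, "cat": 3.0}

weighted = weighted_sentence_embedding(["the", "cat"], vectors, weights)
mean = weighted_sentence_embedding(["the", "cat"], vectors, {})
```

With these values the rarer term "cat" pulls the weighted embedding toward its own vector, while the empty weights dictionary reduces to a mean embedding.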
pca: int
Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. After pooling of vectors creates a single sentence embedding, this method is applied.
backend: annoy|faiss|hnsw
Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss for Linux/macOS and Annoy for Windows. Faiss currently is not supported on Windows.
Backend-specific settings are set with a configuration object that has the same name as the backend (i.e. annoy, faiss or hnsw). All of these settings are optional and fall back to their defaults when omitted.
annoy:
ntrees: number of trees (int) - defaults to 10
searchk: search_k search setting (int) - defaults to -1
See Annoy documentation for more information on these parameters.
faiss:
components: Comma separated list of components - defaults to None
nprobe: search probe setting (int) - defaults to 6
See Faiss documentation on the index factory and search for more information on these parameters.
hnsw:
efconstruction: ef_construction param for init_index (int) - defaults to 200
m: M param for init_index (int) - defaults to 16
randomseed: random-seed param for init_index (int) - defaults to 100
efsearch: ef search param (int) - defaults to None and not set
See Hnswlib documentation for more information on these parameters.
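As an illustration of how a backend and its settings sit side by side, the sketch below builds a configuration that selects hnsw and overrides its options. The efsearch value of 50 is an arbitrary example; the other values repeat the defaults listed above. Only the dictionary is built here; passing it to the Embeddings constructor is left as a comment.

```python
# Configuration selecting the hnsw backend along with its settings object
config = {
    "method": "transformers",
    "path": "sentence-transformers/bert-base-nli-mean-tokens",
    "backend": "hnsw",
    # Settings object shares its name with the selected backend
    "hnsw": {
        "efconstruction": 200,
        "m": 16,
        "randomseed": 100,
        "efsearch": 50,  # arbitrary illustrative value, unset by default
    },
}

# Look up the settings for whichever backend is selected
backend_settings = config.get(config["backend"], {})

# The dictionary would then be passed to the constructor:
# Embeddings(config)
```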
quantize: boolean
Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings will be stored with 8-bit precision vs 32-bit. Only Faiss currently supports quantization.
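The trade-off behind 8-bit storage can be sketched with a simple scalar quantizer. This is an illustration of the general idea only, not necessarily the scheme Faiss uses internally; quantize and dequantize are hypothetical helper names.

```python
def quantize(vector):
    """Map each float to an 8-bit code (0-255) across the vector's value range."""
    low, high = min(vector), max(vector)

    # Width of one quantization step; fall back to 1.0 for constant vectors
    scale = (high - low) / 255 or 1.0

    codes = bytes(round((value - low) / scale) for value in vector)
    return codes, low, scale


def dequantize(codes, low, scale):
    """Approximately reconstruct the original floats from 8-bit codes."""
    return [low + code * scale for code in codes]


vector = [-1.0, 0.0, 0.5, 1.0]
codes, low, scale = quantize(vector)
restored = dequantize(codes, low, scale)
```

Each dimension takes one byte instead of four, at the cost of a reconstruction error of at most half a quantization step per value.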
txtai provides a light wrapper around a couple of the Hugging Face pipelines. All pipelines have the following common parameters.
path: string
Required path to a Hugging Face model
quantize: boolean
Enables dynamic quantization of the Hugging Face model. This is a runtime setting and doesn't save space. It is used to improve the inference time performance of models.
gpu: boolean
Enables GPU inference.
model: Hugging Face pipeline or txtai pipeline
Shares the underlying model of the passed in pipeline with this pipeline. This allows having variations of a pipeline without having to store multiple copies of the full model in memory.
An Extractor pipeline is a combination of an embeddings query and an Extractive QA model. Filtering the context for a QA model helps maximize performance of the model.
Extractor parameters are set as constructor arguments. Examples below.
Extractor(embeddings, path, quantize, gpu, model, tokenizer)
embeddings: Embeddings object instance
Used to query and find candidate text snippets to run the question-answer model against.
tokenizer: Tokenizer function
Optional custom tokenizer function to parse input queries
A Labels pipeline uses a zero shot classification model to apply labels to input text.
Labels parameters are set as constructor arguments. Examples below.
Labels()
Labels("roberta-large-mnli")
A Similarity pipeline is also a zero shot classifier model where the labels are the queries. The results are transposed to get scores per query/label vs scores per input text.
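The transposition step can be pictured with plain lists. The sketch below assumes a hypothetical score matrix with one row per input text and one column per query/label, and turns it into one ranked list per query. The descending sort and the transpose_scores name are assumptions made for illustration.

```python
def transpose_scores(scores_by_text):
    """Turn per-text label scores into per-query lists of (text index, score)."""
    by_query = []

    # zip(*rows) yields one column per query/label
    for column in zip(*scores_by_text):
        ranked = sorted(enumerate(column), key=lambda item: item[1], reverse=True)
        by_query.append(ranked)

    return by_query


# Hypothetical scores: rows are input texts, columns are queries
scores_by_text = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.3],
]

by_query = transpose_scores(scores_by_text)
```

Here by_query[0] ranks the three texts for the first query and by_query[1] ranks them for the second.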
Similarity parameters are set as constructor arguments. Examples below.
Similarity()
Similarity("roberta-large-mnli")
txtai has a full-featured API that can optionally be enabled for any txtai process. All functionality found in txtai can be accessed via the API. The following is an example configuration and startup script for the API.
Note that this configuration file enables all functionality (embeddings, extractor, labels, similarity). It is suggested that separate processes are used for each instance of a txtai component.
# Index file path
path: /tmp/index

# Allow indexing of documents
writable: True

# Embeddings settings
embeddings:
    method: transformers
    path: sentence-transformers/bert-base-nli-mean-tokens

# Extractor settings
extractor:
    path: distilbert-base-cased-distilled-squad

# Labels settings
labels:

# Similarity settings
similarity:
Assuming this YAML content is stored in a file named index.yml, the following command starts the API process.
CONFIG=index.yml uvicorn "txtai.api:app"
uvicorn is a full-featured, production-ready server with support for SSL and more. See the uvicorn deployment guide for details.
A Dockerfile with commands to install txtai, all dependencies and default configuration is available in this repository.
Copy the Dockerfile from the docker directory on GitHub to a local directory. The following commands show how to run the API process.
docker build -t txtai.api -f docker/api.Dockerfile .
docker run --name txtai.api -p 8000:8000 --rm -it txtai.api
# Alternatively, if nvidia-docker is installed, the build will support GPU runtimes
docker run --name txtai.api --runtime=nvidia -p 8000:8000 --rm -it txtai.api
This will bring up an API instance without having to install Python, txtai or any dependencies on your machine!
The txtai API provides all the major functionality found in this project. But there are differences due to the nature of JSON and differences across the supported programming languages.
Difference | Python | API | Reason |
---|---|---|---|
Return Types | tuples | objects | Consistency across languages. For example, (id, score) in Python is {"id": value, "score": value} via API |
Extractor | extract() | extractor.extract() | Extractor pipeline is a callable object in Python |
Labels | labels() | labels.label() | Labels pipeline is a callable object in Python that supports both string and list input |
Similarity | similarity() | similarity.similarity() | Similarity pipeline is a callable object in Python that supports both string and list input |
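The return type difference in the first row can be sketched as a small conversion. The helper name to_api_results is hypothetical and the result values are made up; the field names follow the {"id": value, "score": value} shape described in the table.

```python
import json


def to_api_results(results):
    """Convert Python (id, score) tuples into JSON-friendly objects."""
    return [{"id": uid, "score": score} for uid, score in results]


# Hypothetical search results in the Python tuple form
python_results = [(0, 0.75), (3, 0.5)]

# Serialized form, as a client in another language would receive it
payload = json.dumps(to_api_results(python_results))
```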
The following programming languages have txtai bindings:
See each of the projects above for details on how to install and use. Please add an issue to request additional language bindings!
For those who would like to contribute to txtai, please see this guide.