This repo aims to index Lex Fridman's podcast transcripts for question answering, using Andrej Karpathy's transcriptions produced with OpenAI's Whisper.
At the moment this code relies on a couple of private packages from MeliorAI, namely:
- gateway-API: for running semantic and lexical search in parallel and aggregating the results.
- semantic-search: for defining lexical and semantic search pipelines, both for indexing and for serving.
- OPTIONALLY: distributed-faiss: for distributed indexing with FAISS as the underlying index (not needed when using vanilla FAISS).
Every other package is openly available.
There are a few, perhaps not well explained, details to keep in mind when configuring the above packages:
- Semantic Search: each document fed from `semsearch/feeder.py` must contain the following keys in `Doc.extra_fields`:
  - `"type": "content"`: an arbitrary name for what constitutes a document in this context (e.g. `page`, `document`). The `gateway-api` needs it to know how to aggregate and route results. The `type` must match the `types` configured in the gateway-api schema for a given category:

        Pipeline:
          Categories:
            content:
              result_type: ContentResultType
              types: "content"  # so the gateway-api knows what type of results to include under this category

  - `"semantic": True`: so the `semantic-inference` service knows these documents are to be loaded for inference.
- gateway API configuration:
  - The `categories` name in `gatewayapi/config.yml`:

        Pipeline:
          Categories:
            content: ...  # this category

    must match the fields of the `SearchResponseType` in `gatewayapi/schema.json`:

        "SearchResponseType": {
          "parent_class": "BaseSearchResponseType",
          "fields": {
            "content_hits": "graphene.List(...)"  // name before _hits
          }
        }
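The naming rules above are easy to get wrong, so a small sanity check may help. The following is only a sketch under stated assumptions: the private packages are not used, the config and schema are represented as plain dictionaries mirroring the snippets above, and the helper names `make_extra_fields` and `check_consistency` are hypothetical and not part of any MeliorAI package.

```python
# Hypothetical in-memory stand-ins for gatewayapi/config.yml and
# gatewayapi/schema.json; the real files are assumed to hold more settings.
GATEWAY_CONFIG = {
    "Pipeline": {
        "Categories": {
            "content": {
                "result_type": "ContentResultType",
                "types": "content",
            }
        }
    }
}

GATEWAY_SCHEMA = {
    "SearchResponseType": {
        "parent_class": "BaseSearchResponseType",
        "fields": {"content_hits": "graphene.List(...)"},
    }
}


def make_extra_fields(doc_type="content"):
    """Build the extra_fields dict each indexed document is expected to carry."""
    return {"type": doc_type, "semantic": True}


def check_consistency(extra_fields, config, schema):
    """Return a list of problems; an empty list means the names line up."""
    problems = []
    categories = config["Pipeline"]["Categories"]
    hit_fields = schema["SearchResponseType"]["fields"]

    # The document "type" must appear under some category's `types`.
    # `types` is accepted as a single string or a list (an assumption).
    known_types = set()
    for category in categories.values():
        types = category["types"]
        known_types.update([types] if isinstance(types, str) else types)
    if extra_fields.get("type") not in known_types:
        problems.append(f"type {extra_fields.get('type')!r} not in any category")

    # Documents must be flagged so they are loaded for semantic inference.
    if extra_fields.get("semantic") is not True:
        problems.append('"semantic" must be True')

    # Every category name needs a matching "<name>_hits" response field.
    for name in categories:
        if f"{name}_hits" not in hit_fields:
            problems.append(f"schema is missing field {name}_hits")

    return problems


if __name__ == "__main__":
    print(check_consistency(make_extra_fields(), GATEWAY_CONFIG, GATEWAY_SCHEMA))
```

Running such a check before indexing would surface a mismatched category or type name early, rather than as silently empty search results.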