Lexicap QA

This repo aims to index the transcripts of Lex Fridman's podcast for question answering, using Andrej Karpathy's transcriptions produced with OpenAI's Whisper.

👉️ Lexicap transcriptions.

At the moment this code relies on a couple of private packages from MeliorAI, namely:

- gateway-API: runs semantic and lexical search in parallel and aggregates the results.
- semantic-search: defines the lexical and semantic search pipelines for both indexing and serving.
- distributed-faiss (optional): distributed indexing with FAISS as the underlying index. Not needed when using vanilla FAISS.

Every other package is openly available.

Misc Notes

There are a few details, perhaps not well explained elsewhere, to keep in mind when configuring the above packages:

Semantic Search: Each document fed from semsearch/feeder.py must contain the following keys as Doc.extra_fields:
- "type": "content"

  An arbitrary name for what constitutes a document in this context (e.g. page, document). The gateway-api needs it to know how to aggregate and route results.

  The type must match the types configured in the gateway-api schema for a given category:
```
Pipeline:
  Categories:
    content:
      result_type: ContentResultType
      types: "content"  # so the gateway-api knows what type of results to include under this category
```
- "semantic": True
  
  So the semantic-inference service knows these documents are to be loaded for inference.
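As a rough illustration of the two required keys, here is a minimal sketch. The `Doc` dataclass below is a hypothetical stand-in (the real class lives in the private semantic-search package and may look different), and `missing_extra_fields` is an illustrative helper, not part of any of the packages above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Hypothetical stand-in for the Doc type used by semantic-search;
# the real class is private and may have a different shape.
@dataclass
class Doc:
    text: str
    extra_fields: Dict[str, Any] = field(default_factory=dict)


# Keys this README says every fed document must carry.
REQUIRED_EXTRA_FIELDS = ("type", "semantic")


def missing_extra_fields(doc: Doc) -> List[str]:
    """Return the required extra_fields keys that are absent from doc."""
    return [key for key in REQUIRED_EXTRA_FIELDS if key not in doc.extra_fields]


doc = Doc(
    text="A transcript chunk.",
    extra_fields={"type": "content", "semantic": True},
)
print(missing_extra_fields(doc))
```

A check like this could run in the feeder before indexing, so that a document missing either key is caught early rather than silently dropped from the results.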

gateway API configuration:

The category names in gatewayapi/config.yml:

```
Pipeline:
  Categories:
    content: ... # this category
```

must match the field names of SearchResponseType in gatewayapi/schema.json, that is, the part of each field name before the _hits suffix:

```
"SearchResponseType": {
   "parent_class": "BaseSearchResponseType",
   "fields": {
      "content_hits": "graphene.List(...)"  // name before _hits
   }
}
```
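The naming rule above can be checked mechanically. The sketch below is only an illustration: the two dictionaries stand in for the parsed contents of gatewayapi/config.yml and gatewayapi/schema.json, and `unmatched_categories` is a hypothetical helper, not a gateway-api function.

```python
from typing import Any, Dict, Set

HITS_SUFFIX = "_hits"


def unmatched_categories(config: Dict[str, Any], schema: Dict[str, Any]) -> Set[str]:
    """Return category names that have no matching '<name>_hits' schema field."""
    categories = set(config["Pipeline"]["Categories"])
    fields = schema["SearchResponseType"]["fields"]
    # Strip the suffix from every '*_hits' field to recover the category name.
    hit_names = {
        name[: -len(HITS_SUFFIX)] for name in fields if name.endswith(HITS_SUFFIX)
    }
    return categories - hit_names


# Stand-ins for the parsed config.yml and schema.json shown above.
config = {
    "Pipeline": {
        "Categories": {
            "content": {"result_type": "ContentResultType", "types": "content"}
        }
    }
}
schema = {
    "SearchResponseType": {
        "parent_class": "BaseSearchResponseType",
        "fields": {"content_hits": "graphene.List(...)"},
    }
}

print(unmatched_categories(config, schema))
```

An empty result means every category has a corresponding field. Any name returned is a category whose results the schema has no field for.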

josemarcosrf/Lexicap-QA
