QA retrieval for Lex Fridman's podcast transcriptions

Lexicap QA

This repo aims to index Lex Fridman's podcast transcriptions for question answering, using Andrej Karpathy's transcriptions produced with OpenAI's Whisper.

👉️ Lexicap transcriptions.

At the moment this code relies on a couple of private packages from MeliorAI, namely:

  • gateway-API: Runs semantic and lexical search in parallel and aggregates the results.

  • semantic-search: To define lexical and semantic searching pipelines both for indexing and serving.

  • OPTIONALLY: distributed-faiss: For distributed indexing using FAISS as the underlying index. It is not needed when using vanilla FAISS.

Every other package is openly available.

Misc Notes

There are a few details, perhaps not well explained elsewhere, to keep in mind when configuring the above packages:

  1. Semantic Search: Each document fed from semsearch/feeder.py must contain the following keys as Doc.extra_fields:

    • "type": "content"

      An arbitrary name for what constitutes a document in this context (e.g.: page, document). The gateway-api needs it to know how to aggregate and route results.

      The type must match one of the types configured in the gateway-api schema for a given category:

      Pipeline:
        Categories:
          content:
            result_type: ContentResultType
            types: "content"  # so the gateway-api knows what type of results to include under this category

    • "semantic": True

      So the semantic-inference service knows these documents are to be loaded for inference.

  2. gateway API configuration:

    • The category names in gatewayapi/config.yml:

      Pipeline:
        Categories:
          content: ... # this category

      must match the fields of the SearchResponseType in the gatewayapi/schema.json:

      "SearchResponseType": {
        "parent_class": "BaseSearchResponseType",
        "fields": {
          "content_hits": "graphene.List(...)"  // the part before _hits is the category name
        }
      }
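
The two notes above boil down to a few consistency rules that are easy to check before indexing. The sketch below is only an illustration and is not part of the MeliorAI packages: the helper names (`make_extra_fields`, `configured_types`, `check_extra_fields`, `check_hits_fields`) are hypothetical, and it assumes the `Pipeline.Categories` section of gatewayapi/config.yml and the contents of gatewayapi/schema.json have already been parsed into plain Python dicts.

```python
import json

# Keys that note 1 says every document must carry in Doc.extra_fields.
REQUIRED_EXTRA_FIELDS = {"type", "semantic"}


def make_extra_fields(doc_type="content", semantic=True):
    """Build the extra_fields payload described in note 1 (hypothetical helper)."""
    return {"type": doc_type, "semantic": semantic}


def configured_types(categories):
    """Collect every document type listed under the configured categories."""
    types = set()
    for cfg in categories.values():
        value = cfg.get("types", [])
        # `types` is a single string in the README example; accept lists too.
        types.update([value] if isinstance(value, str) else value)
    return types


def check_extra_fields(extra_fields, allowed_types):
    """Return a list of problems; an empty list means the keys look consistent."""
    problems = []
    missing = REQUIRED_EXTRA_FIELDS - set(extra_fields)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    doc_type = extra_fields.get("type")
    if doc_type is not None and doc_type not in allowed_types:
        problems.append(f"type {doc_type!r} is not configured in any category")
    return problems


def check_hits_fields(categories, schema):
    """Return category names lacking a '<category>_hits' field (note 2)."""
    fields = schema["SearchResponseType"]["fields"]
    return [name for name in categories if f"{name}_hits" not in fields]


# Parsed form of the Pipeline.Categories section shown above (assumed shape).
CATEGORIES = {"content": {"result_type": "ContentResultType", "types": "content"}}

# The schema fragment shown above, without the explanatory comment.
SCHEMA = json.loads("""
{
  "SearchResponseType": {
    "parent_class": "BaseSearchResponseType",
    "fields": {"content_hits": "graphene.List(...)"}
  }
}
""")

if __name__ == "__main__":
    print(check_extra_fields(make_extra_fields(), configured_types(CATEGORIES)))
    print(check_hits_fields(CATEGORIES, SCHEMA))
```

Both checks return lists of problems rather than raising, so they can be run over a whole batch of documents and reported together before anything is fed to the indexing pipeline.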