
Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

emailai

Purpose

These Python tools are being developed for use in AI-aided investigative reporting, although some may have more general applicability. They are freely available for general use but are provided with no warranty of any kind and no commitment to maintain them. They will be contributed to langchain, llama-hub, and other repositories as time allows and depending on interest.

Searching information for investigative reporting purposes often differs from a simple search for answers. The question being asked may be "who knew what, and when?" rather than "what are the facts?" It may be crucial to know what various actors said at various times and what they heard from others. For practical reasons it is usually necessary to set up a database of documents, for example the documents returned in response to a public records or FOIA request. AI is used to create vector indices and perhaps metadata for the documents. Vector and metadata retrieval are used to find documents relevant to a particular request. Relevant documents are then fed back to an LLM for abstraction, summarization, and/or analysis. Typically the document collection will be large. After resisting public records requests, governments and other organizations which are forced to respond often release a deluge of documents, either to avoid liability for withholding, or to bury the relevant documents in duplicative or extraneous data, or both.
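The retrieval step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not code from this repository: the `Doc` class, the `retrieve` function, the `sender` and `date` metadata keys, and the toy two-dimensional vectors are all hypothetical, and a real setup would use an embedding model and a vector database rather than an in-memory list.

```python
import math
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Doc:
    """Hypothetical record: text, its embedding, and extracted metadata."""
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(docs, query_vector, sender=None, start=None, end=None):
    """Filter on metadata first, then rank every survivor by similarity."""
    hits = []
    for doc in docs:
        sent = doc.metadata.get("date")
        if sender and doc.metadata.get("sender") != sender:
            continue
        if start and (sent is None or sent < start):
            continue
        if end and (sent is None or sent > end):
            continue
        hits.append((cosine(query_vector, doc.vector), doc))
    # Return all matches, most similar first, rather than a fixed top-k.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


docs = [
    Doc("a", [1.0, 0.0], {"sender": "x@a.org", "date": date(2022, 1, 5)}),
    Doc("b", [0.6, 0.8], {"sender": "x@a.org", "date": date(2022, 2, 1)}),
    Doc("c", [1.0, 0.0], {"sender": "y@a.org", "date": date(2022, 1, 6)}),
    Doc("d", [0.0, 1.0], {"sender": "x@a.org", "date": date(2023, 1, 1)}),
]
results = retrieve(docs, [1.0, 0.0], sender="x@a.org",
                   start=date(2022, 1, 1), end=date(2022, 12, 31))
```

Filtering on metadata before ranking keeps the "who and when" constraints exact, while similarity only orders the documents that already satisfy them.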

Problems addressed

  1. Email is often provided as PDFs, but metadata such as sender, recipients, subject, date, and attachments is crucial to establishing responsibility. emailfrompdf.py contains code to parse text from PDFs and return Documents with both content and metadata.
  2. In a large collection of email, the email addresses for the same person are often formatted in different ways, depending on how the name is stored in other people's address books. Tools will be provided to allow retrieval by sender or recipient metadata even where email address formats vary.
  3. Current langchain and llamaindex APIs do not give access to the full capabilities of vector search engines (namely Pinecone) which support dense, sparse, and metadata searches. A more capable tool will be provided.
  4. If you want to know what x wrote to y about z during a certain time period, you can't just retrieve the top few documents by vector similarity and present them to the LLM for summary or analysis; you need almost all documents which might be relevant. However, you can't send that much data to an LLM such as one of OpenAI's models in a single query without exceeding the token limit. Even using multiple queries to cover every possibly relevant document is liable to be expensive. Code will be provided to present documents in relevance order for summarization of relevant information, with a cutoff determined dynamically by the LLM.
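As a rough illustration of problems 1 and 2, the sketch below splits header lines from already-extracted email text and reduces address fields to lowercase bare addresses. It is not the code in emailfrompdf.py: the names `HEADER_RE`, `split_headers`, and `canonical_addresses` are hypothetical, and it assumes each header sits on its own line and that recipients are separated by commas or semicolons. Only the Python standard library is used.

```python
import re
from email.utils import getaddresses

# Assumption: headers appear one per line at the top of the extracted text.
HEADER_RE = re.compile(r"^(From|To|Cc|Subject|Sent|Date):\s*(.*)$", re.IGNORECASE)


def split_headers(text):
    """Split extracted email text into a metadata dict and the body."""
    metadata = {}
    lines = text.splitlines()
    index = 0
    while index < len(lines):
        line = lines[index].strip()
        match = HEADER_RE.match(line)
        if match:
            metadata[match.group(1).lower()] = match.group(2).strip()
        elif line or metadata:
            # A non-header line, or a blank line after headers, ends the block.
            break
        index += 1
    body = "\n".join(lines[index:]).strip()
    return metadata, body


def canonical_addresses(field):
    """Return sorted, lowercase bare addresses from a From/To/Cc value."""
    field = field.replace(";", ",")  # treat semicolons as separators
    pairs = getaddresses([field])
    return sorted({addr.lower() for _name, addr in pairs if "@" in addr})


sample = (
    'From: "Smith, Jane" <Jane.Smith@City.gov>\n'
    "To: bob@example.org; Jane Smith <jane.smith@city.gov>\n"
    "Subject: Budget\n"
    "\n"
    "Hello Bob"
)
meta, body = split_headers(sample)
sender = canonical_addresses(meta["from"])
recipients = canonical_addresses(meta["to"])
```

Storing the canonical address as metadata lets a sender or recipient filter match the same person regardless of how the display name was written. Other formats found in real exports, such as addresses wrapped in `[mailto:...]`, would need extra handling that this sketch does not attempt.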