langchain-ai/langchain

Richer Interface for Multimodal Data and BaseChat Interaction


Checked other resources

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Feature Description

Based on my experience with LangChain, I find the langchain.Document interface too generic, perhaps intentionally so, to support a wide range of use cases. However, as LLMs continue to evolve, especially with growing multimodal capabilities, the need for a richer, more expressive interface for data parsing, storage, retrieval, LLM interaction, and citation becomes increasingly critical.
Currently, langchain.Document is not expressive enough to represent complex or structured information beyond plain text. I also explored langchain.BaseMedia, which partially addresses multimodality (e.g., image/audio), but it still lacks the granularity required for representing tables, formulas, and other structured formats. I'm expanding the concept of "modality" here to include more sophisticated constructs for representing information.

Use Case

Existing Gaps and Observations

There are interfaces in other systems that attempt to support richer representations, but many are tightly coupled with specific document parsers. For example:

  • Docling provides its own schema for structured outputs from parsed documents (text, tables, images, etc.).
  • Other parser APIs like LandingAI or LlamaParse also return structured outputs (e.g., in JSON, HTML, Markdown).

This reveals a broader opportunity: to define a common interface in LangChain that can represent different modalities - TextElement, ImageElement, TableElement, AudioElement, etc. - as part of a unified MultimodalDocument, which could be implemented as something like List[Element].
Such a structure could assume vertically stacked elements by default, but it could also carry layout-aware metadata (e.g., page number, bounding boxes) when the parser is able to provide it.
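
To make the idea concrete, here is a minimal sketch of what such an interface could look like, assuming Pydantic v2; every name in it (Element, TextElement, ImageElement, AnyElement, MultimodalDocument) is hypothetical, not an existing LangChain API:

```python
from typing import Annotated, List, Literal, Optional, Union

from pydantic import BaseModel, Field


class Element(BaseModel):
    """Hypothetical base class for one piece of parsed content."""

    # Layout-aware metadata, populated only when the parser can provide it.
    page_number: Optional[int] = None
    bbox: Optional[List[float]] = None  # [x0, y0, x1, y1]
    metadata: dict = Field(default_factory=dict)


class TextElement(Element):
    type: Literal["text"] = "text"
    text: str


class ImageElement(Element):
    type: Literal["image"] = "image"
    url: Optional[str] = None           # or an inline base64 payload instead
    data_base64: Optional[str] = None
    mime_type: str = "image/png"


# TableElement, FormulaElement, AudioElement, ... would follow the same pattern.
AnyElement = Annotated[Union[TextElement, ImageElement], Field(discriminator="type")]


class MultimodalDocument(BaseModel):
    """Vertically stacked elements by default; layout metadata is optional."""

    elements: List[AnyElement] = Field(default_factory=list)
    metadata: dict = Field(default_factory=dict)
```

A discriminated union on the type field keeps serialization and deserialization unambiguous, which also matters for the [GOOD TO HAVE] items further below.
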
Currently, most LangChain chains/runnables (e.g., for Q&A) are text-based. They expect retrievers to return a List[Document] whose content is textual, and these documents are typically concatenated into a prompt string passed to the LLM.
In a multimodal setting, however, this approach doesn't scale well. You need to interact with the LLM through HumanMessage objects whose content blocks may represent different modalities. The current workaround - dumping everything into Document.metadata and reconstructing HumanMessage content (text, image, etc.) from there - is hacky, hard to reuse, and error-prone.
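
To show what "hacky" means in practice, the workaround today looks roughly like the following; the metadata keys (image_base64, mime_type) are conventions that each loader author has to invent, which is exactly the problem:

```python
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage


def docs_to_message(docs: list[Document]) -> HumanMessage:
    """Rebuild multimodal content blocks from ad-hoc metadata conventions."""
    content = []
    for doc in docs:
        if "image_base64" in doc.metadata:  # convention invented by the loader
            mime = doc.metadata.get("mime_type", "image/png")
            content.append(
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime};base64,{doc.metadata['image_base64']}"
                    },
                }
            )
        else:  # everything else is assumed to be plain text
            content.append({"type": "text", "text": doc.page_content})
    return HumanMessage(content=content)
```
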

There’s a clear design gap here:

  • LangChain expects retrievers to return a List[Document]
  • Chat models (BaseChatModel) expect a List[HumanMessage]
  • But there is no clear or clean path for transforming rich, structured documents into appropriate multimodal messages
  • The missing piece is a (Retriever | Lambda | LLM) pipeline in which the lambda is not straightforward to implement because of the limitations of Document (see the sketch after this list)
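
In runnable terms, the missing piece is the middle step of a chain like the one below, where retriever and llm are placeholders for any retriever and chat model, and docs_to_message is the ad-hoc conversion sketched above:

```python
from langchain_core.runnables import RunnableLambda


def build_multimodal_qa_chain(retriever, llm):
    """retriever: any retriever returning List[Document]; llm: any chat model."""
    to_messages = RunnableLambda(lambda docs: [docs_to_message(docs)])
    return retriever | to_messages | llm


# chain = build_multimodal_qa_chain(my_retriever, my_chat_model)
# answer = chain.invoke("What does the table on page 3 show?")
```
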

Proposed Solution

Suggested Improvements

  1. Define a richer interface for representing structured, multimodal information, implemented via Pydantic models:
    • TextElement, TableElement, ImageElement, FormulaElement, etc.
    • Tables can have nested granularity: headers, rows, and cell-level metadata
    • This would allow more granular RAG operations and citations downstream
  2. Introduce multimodal Loader and Retriever concepts:
    • Parser authors could fill in detailed, structured elements
    • These could later support fine-grained citations in LLM responses
    • They would enable clean identification and handling of different modalities during retrieval
    • They would also help route each type of Element to the appropriate LLM input handler (text, image, table, etc.); a sketch of points 1 and 2 follows this list
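
A hedged sketch of how point 1's table granularity and point 2's routing could fit together; TableCell, TableElement, and element_to_content_block are hypothetical names, and in practice the table model would extend the Element base from the earlier sketch:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class TableCell(BaseModel):
    """Cell-level granularity, so citations can point at a single cell."""

    row: int
    col: int
    text: str
    metadata: dict = Field(default_factory=dict)


class TableElement(BaseModel):
    type: str = "table"
    caption: Optional[str] = None
    headers: List[str] = Field(default_factory=list)
    cells: List[TableCell] = Field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the table as Markdown for text-only models."""
        n_cols = len(self.headers)
        by_row: dict = {}
        for cell in self.cells:
            by_row.setdefault(cell.row, {})[cell.col] = cell.text
        lines = [
            "| " + " | ".join(self.headers) + " |",
            "| " + " | ".join(["---"] * n_cols) + " |",
        ]
        for r in sorted(by_row):
            lines.append(
                "| " + " | ".join(by_row[r].get(c, "") for c in range(n_cols)) + " |"
            )
        return "\n".join(lines)


def element_to_content_block(element) -> dict:
    """Route each element type to the chat-model content block it needs."""
    if isinstance(element, TableElement):
        return {"type": "text", "text": element.to_markdown()}
    # ImageElement -> an image_url block, TextElement -> a text block, etc.
    return {"type": "text", "text": str(element)}
```
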

[GOOD TO HAVE]

  1. Support for over-the-wire serialization, e.g., via Pydantic-compatible formats such as JSON (sketched below)
  2. Support for serialization/deserialization to/from file systems or object storage:
    • Should support selective loading, since RAG workflows typically need only a subset of the full document
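
Both items fall out fairly naturally if the elements are Pydantic models; a sketch, assuming doc is an instance of the hypothetical MultimodalDocument from the earlier sketch:

```python
import json

# Over-the-wire serialization: one call, standard JSON.
payload = doc.model_dump_json()

# Full round trip (discriminated by each element's "type" field).
restored = MultimodalDocument.model_validate_json(payload)

# Selective loading: RAG usually needs only a slice, so heavier elements
# (images, tables) can be skipped without validating the whole document.
text_on_page_3 = [
    el for el in json.loads(payload)["elements"]
    if el["type"] == "text" and el.get("page_number") == 3
]
```
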

Alternatives Considered

No response

Additional Context

Question

  • Are the current behavior and its limitations intentional?
  • Do you agree with the overall proposal?
  • Would this be something worth considering and taking up in LangChain?