microsoft/kernel-memory

Are `.doc` files supported?

gmantri opened this issue · 4 comments

Context / Scenario

We have some Microsoft Word documents in the old format (.doc). When we try to use those documents, Kernel Memory fails to answer questions about them. When we convert the same documents to .docx format, everything works great.

microsoft.docx

What happened?

Our expectation was that both .doc and .docx files would work, but that is not the case: .doc files do not work, while .docx files do.

Importance

a fix would make my life easier

Platform, Language, Versions

Microsoft.KernelMemory.Core - 0.61.240524.1
Microsoft.SemanticKernel - 1.15.0

Relevant log output

No response

dluc commented

Hi @gmantri, the old .doc format is not supported, sorry. Aside from converting files manually, you could:

  • option 1: add a custom handler to the ingestion pipeline that converts .doc files to PDF/DOCX before the text extraction step. See the steps parameter in the Import API.
  • option 2: add a custom decoder for .doc files that does the same, e.g. converts to PDF/DOCX and returns the text. There should be some examples in the examples folder about adding custom decoders.
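Both options rely on a conversion step. Below is a minimal sketch of that step only, assuming LibreOffice is installed and its `soffice` executable is on the PATH. The `DocConverter` class and its method names are hypothetical and not part of Kernel Memory; a custom handler or decoder would call something like this before the text is extracted.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of Kernel Memory.
// Assumption: LibreOffice is installed and "soffice" is on the PATH.
public static class DocConverter
{
    // Arguments for LibreOffice's headless converter.
    public static string[] BuildArguments(string inputPath, string outputDir) =>
        new[] { "--headless", "--convert-to", "docx", "--outdir", outputDir, inputPath };

    // LibreOffice writes the result to the output directory,
    // keeping the base name and changing the extension.
    public static string ExpectedOutputPath(string inputPath, string outputDir) =>
        Path.Combine(outputDir, Path.GetFileNameWithoutExtension(inputPath) + ".docx");

    // Converts a .doc file and returns the path of the resulting .docx file.
    public static string ConvertToDocx(string inputPath, string outputDir)
    {
        var startInfo = new ProcessStartInfo("soffice") { UseShellExecute = false };
        foreach (var arg in BuildArguments(inputPath, outputDir))
        {
            startInfo.ArgumentList.Add(arg);
        }

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start soffice.");
        process.WaitForExit();

        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException(
                $"soffice exited with code {process.ExitCode}.");
        }

        var outputPath = ExpectedOutputPath(inputPath, outputDir);
        if (!File.Exists(outputPath))
        {
            throw new FileNotFoundException("Conversion produced no output file.", outputPath);
        }

        return outputPath;
    }
}
```

The returned .docx path can then be imported like any other supported file. Any other converter would work equally well here; LibreOffice is only one possibility.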

@dluc - Thanks. Is there a list of file types supported by Kernel Memory? All I could find was this: https://github.com/microsoft/kernel-memory?tab=readme-ov-file#kernel-memory-km-and-sk-semantic-memory-sm and it only talks about the file types at a high level (e.g. Word, rather than .docx but not .doc). Having this list would be really helpful.

dluc commented

The default list can be extrapolated from here:

public static IServiceCollection AddDefaultContentDecoders(

        services.AddSingleton<IContentDecoder, TextDecoder>();
        services.AddSingleton<IContentDecoder, MarkDownDecoder>();
        services.AddSingleton<IContentDecoder, HtmlDecoder>();
        services.AddSingleton<IContentDecoder, PdfDecoder>();
        services.AddSingleton<IContentDecoder, ImageDecoder>();
        services.AddSingleton<IContentDecoder, MsExcelDecoder>();
        services.AddSingleton<IContentDecoder, MsPowerPointDecoder>();
        services.AddSingleton<IContentDecoder, MsWordDecoder>();

Using DI, one can inject more decoders, which are automatically picked up by TextExtractionHandler through its constructor parameter:

IEnumerable<IContentDecoder> decoders,

For each file, the handler loops through the list of decoders, asking each one if they support the current file format:

var decoder = this._decoders.LastOrDefault(d => d.SupportsMimeType(uploadedFile.MimeType));

if (decoder is not null) ...
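To illustrate the selection rule, here is a small self-contained sketch. `IDecoderSketch`, `NamedDecoder` and `DecoderPicker` are hypothetical stand-ins that model only the MIME check; they are not the real `IContentDecoder` types. Because the handler uses `LastOrDefault`, the last decoder in the list that supports a MIME type wins, so a decoder added after the defaults should take precedence, assuming the DI container yields services in registration order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// List order stands in for DI registration order: defaults first, custom ones last.
var decoders = new List<IDecoderSketch>
{
    new NamedDecoder("MsWordDecoder",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    new NamedDecoder("PdfDecoder", "application/pdf"),
    new NamedDecoder("CustomPdfDecoder", "application/pdf"),
    // "application/msword" is the standard MIME type for legacy .doc files.
    new NamedDecoder("LegacyDocDecoder", "application/msword"),
};

// Two decoders support PDF; the later one is selected.
Console.WriteLine(DecoderPicker.Pick(decoders, "application/pdf")?.Name ?? "(unsupported)");
// The .doc MIME type is handled only because a custom decoder was added.
Console.WriteLine(DecoderPicker.Pick(decoders, "application/msword")?.Name ?? "(unsupported)");
// No decoder supports this type, so Pick returns null.
Console.WriteLine(DecoderPicker.Pick(decoders, "application/zip")?.Name ?? "(unsupported)");

// Simplified stand-in for a content decoder: only the MIME check is modeled.
public interface IDecoderSketch
{
    string Name { get; }
    bool SupportsMimeType(string mimeType);
}

public sealed record NamedDecoder(string Name, string MimeType) : IDecoderSketch
{
    public bool SupportsMimeType(string mimeType) =>
        string.Equals(mimeType, MimeType, StringComparison.OrdinalIgnoreCase);
}

public static class DecoderPicker
{
    // Same rule as the handler line quoted above: last match wins, null if none.
    public static IDecoderSketch? Pick(IEnumerable<IDecoderSketch> decoders, string mimeType) =>
        decoders.LastOrDefault(d => d.SupportsMimeType(mimeType));
}
```

With only the default decoders in the list, the `application/msword` lookup would return null, which matches the behavior reported for .doc files in this issue.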

Thank you!