Are `.doc` files supported?
gmantri opened this issue · 4 comments
Context / Scenario
We have some Microsoft Word documents in the old format (`.doc`). When we try to use those documents, Kernel Memory fails to answer questions about them. When we convert the documents to `.docx` format, everything works great.
What happened?
Our expectation was that both `.doc` and `.docx` files should work, but that is not happening: `.doc` files do not work, while `.docx` files do.
Importance
a fix would make my life easier
Platform, Language, Versions
Microsoft.KernelMemory.Core - 0.61.240524.1
Microsoft.SemanticKernel - 1.15.0
Relevant log output
No response
Hi @gmantri, the old `.doc` format is not supported, sorry. Aside from converting files manually, you could:
- option 1: add a custom handler to the ingestion pipeline that converts `.doc` files to PDF/DOCX before the text extraction step. See the `steps` parameter in the Import API.
- option 2: add a custom decoder for `.doc` files that does the same, e.g. converting to PDF/DOCX and returning the text. There should be some examples in the `examples` folder about adding custom decoders.
@dluc - Thanks. Is there a list of file types supported by Kernel Memory? All I could find was this: https://github.com/microsoft/kernel-memory?tab=readme-ov-file#kernel-memory-km-and-sk-semantic-memory-sm and it only talks about the file types at a high level (e.g. "Word", without saying whether that means `.docx` only or `.doc` as well). Having this list would be really helpful.
The default list can be inferred from the decoders registered by default:
services.AddSingleton<IContentDecoder, TextDecoder>();
services.AddSingleton<IContentDecoder, MarkDownDecoder>();
services.AddSingleton<IContentDecoder, HtmlDecoder>();
services.AddSingleton<IContentDecoder, PdfDecoder>();
services.AddSingleton<IContentDecoder, ImageDecoder>();
services.AddSingleton<IContentDecoder, MsExcelDecoder>();
services.AddSingleton<IContentDecoder, MsPowerPointDecoder>();
services.AddSingleton<IContentDecoder, MsWordDecoder>();
Using DI, one can inject more decoders, which are automatically picked up by `TextExtractionHandler`.
For each file, the handler goes through the list of decoders and selects the last one that reports support for the file's MIME type:
var decoder = this._decoders.LastOrDefault(d => d.SupportsMimeType(uploadedFile.MimeType));
if (decoder is not null) ...
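To illustrate that selection rule, here is a minimal, self-contained sketch. It deliberately does not use the Kernel Memory types: `IDecoderSketch`, `DocxDecoderSketch`, `LegacyDocDecoderSketch`, and `DecoderSelection` are hypothetical stand-ins invented for this example, and the real `IContentDecoder` interface also has decoding methods that are left out here. Only the `LastOrDefault` lookup is taken from the snippet above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the decoder contract (assumption: only the
// MIME type check matters for selection; decoding methods are omitted).
public interface IDecoderSketch
{
    bool SupportsMimeType(string mimeType);
}

// Stand-in for a built-in decoder that only accepts .docx files.
public sealed class DocxDecoderSketch : IDecoderSketch
{
    public bool SupportsMimeType(string mimeType) =>
        mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
}

// Hypothetical custom decoder that accepts legacy .doc files.
public sealed class LegacyDocDecoderSketch : IDecoderSketch
{
    public bool SupportsMimeType(string mimeType) =>
        mimeType == "application/msword";
}

public static class DecoderSelection
{
    // Same rule as in the snippet above: the last decoder in the list
    // that supports the MIME type is chosen, or null if none does.
    public static IDecoderSketch? Pick(IEnumerable<IDecoderSketch> decoders, string mimeType) =>
        decoders.LastOrDefault(d => d.SupportsMimeType(mimeType));
}
```

In this sketch, calling `DecoderSelection.Pick` with only `DocxDecoderSketch` in the list and the MIME type `application/msword` returns `null`, which mirrors the unsupported `.doc` case. Once `LegacyDocDecoderSketch` is added to the list, it is the one returned. Because the lookup uses `LastOrDefault`, a decoder that appears later in the list takes precedence over an earlier one supporting the same MIME type.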
Thank you!