[Bug] Error uploading pdf file. Exception: 'Could not find the version header comment at the start of the document.'
bancroftway opened this issue · 6 comments
Context / Scenario
Trying to upload a PDF file through KM. Getting error: UglyToad.PdfPig.Core.PdfDocumentFormatException: 'Could not find the version header comment at the start of the document.' Please note that the issue occurs on a wide variety of PDFs, not just a single file. In fact, I reproduced this error with publicly available PDF files from https://rrc.texas.gov/resource-center/research/data-sets-available-for-download/
Please see screenshot below
What happened?
Importance
edge case
Platform, Language, Versions
.NET 9
Relevant log output
UglyToad.PdfPig.Core.PdfDocumentFormatException: 'Could not find the version header comment at the start of the document.'
If the file is not a valid PDF, I don't think there's anything we can do about it.
Could you provide some details about what happens and what you would expect to happen?
This issue appears to be happening on a bunch of pdfs. The pdfs are valid in that I can open them and view them by double-clicking them.
Could you share some of these files?
@dluc sample files are attached below. They are public PDFs from the Texas Railroad Commission (https://rrc.texas.gov/resource-center/research/data-sets-available-for-download/)
I am using KM version: 0.75.240924.1
Here is my code:
public async Task<string> UploadDocumentToMemory(string docId, string fileName, Dictionary<string, string> tags,
    byte[] docContent)
{
    var kmKernel = GetKernelMemoryClient();
    var tc = new TagCollection();
    if (tags?.Any() == true)
    {
        foreach (var kvp in tags.Where(x => !string.IsNullOrEmpty(x.Value)))
        {
            tc.Add(kvp.Key.ToString(), kvp.Value.ToString());
        }
    }
    //delete the document if it already exists
    if (await kmKernel.IsDocumentReadyAsync(docId))
    {
        await DeleteDocumentsFromMemory(new List<string> { docId });
    }
    var id = await kmKernel.ImportDocumentAsync(new MemoryStream(docContent),
        fileName: fileName,
        documentId: docId,
        tags: tc.Any() ? tc : null
    );
    return id;
}
private MemoryServerless GetKernelMemoryClient()
{
    var azureOpenAITextConfig = new AzureOpenAIConfig
    {
        APIType = AzureOpenAIConfig.APITypes.ChatCompletion,
        MaxRetries = 10,
        MaxTokenTotal = configuration.GetValue<int>("AzureServices:OpenAI:LlmMaxTokens"),
        Endpoint = configuration["AzureServices:OpenAI:Url"]!,
        APIKey = configuration["AzureServices:OpenAI:ApiKey"]!,
        Auth = AzureOpenAIConfig.AuthTypes.APIKey,
        Deployment = configuration["AzureServices:OpenAI:ModelName"]!
    };
    var azureOpenAIEmbeddingConfig = new AzureOpenAIConfig()
    {
        APIType = AzureOpenAIConfig.APITypes.EmbeddingGeneration,
        MaxRetries = 10,
        MaxTokenTotal = configuration.GetValue<int>("AzureServices:OpenAI:EmbeddingMaxTokens"),
        Endpoint = configuration["AzureServices:OpenAI:Url"]!,
        APIKey = configuration["AzureServices:OpenAI:ApiKey"]!,
        Auth = AzureOpenAIConfig.AuthTypes.APIKey,
        Deployment = configuration["AzureServices:OpenAI:EmbeddingModel"]!,
        EmbeddingDimensions = null,
        MaxEmbeddingBatchSize = configuration.GetValue<int>("AzureServices:OpenAI:MaxEmbeddingBatchSize"),
    };
    var azureBlobsConfig = new AzureBlobsConfig()
    {
        Auth = AzureBlobsConfig.AuthTypes.ConnectionString,
        ConnectionString = configuration["ConnectionStrings:StorageConnection"]!,
        Container = configuration["AzureServices:OpenAI:EmbeddingsStorageContainer"]!
    };
    var azureAISearchConfig = new AzureAISearchConfig()
    {
        Auth = AzureAISearchConfig.AuthTypes.APIKey,
        APIKey = configuration["AzureServices:Search:AdminApiKey"]!,
        Endpoint = configuration["AzureServices:Search:EndPoint"]!,
        UseHybridSearch = true,
    };
    var memory = new KernelMemoryBuilder()
        .WithAzureOpenAITextGeneration(azureOpenAITextConfig)
        .WithAzureOpenAITextEmbeddingGeneration(azureOpenAIEmbeddingConfig)
        .WithAzureBlobsDocumentStorage(azureBlobsConfig)
        .WithAzureAISearchMemoryDb(azureAISearchConfig)
        .WithSearchClientConfig(new SearchClientConfig { MaxMatchesCount = 4, Temperature = 0, TopP = 0 })
        .Build<MemoryServerless>();
    return memory;
}
chapter1-all-effective-aug21-2017.pdf
digital-map-information-user-guide.pdf
Hi @bancroftway!
I have successfully managed to import the PDF documents you provided (as a reference, I'm using the application I have published here: https://github.com/marcominerva/KernelMemoryService).
Looking at your code:
public async Task<string> UploadDocumentToMemory(string docId, string fileName, Dictionary<string, string> tags,
    byte[] docContent)
{
    // ...
    var id = await kmKernel.ImportDocumentAsync(new MemoryStream(docContent),
        fileName: fileName,
        documentId: docId,
        tags: tc.Any() ? tc : null
    );
    // ...
}
Have you checked that the docContent byte array contains the actual data?
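One quick way to check this, as a minimal sketch: a PDF file starts with a `%PDF-` version header, which is the comment the exception reports as missing. Some readers tolerate the header anywhere in the first 1024 bytes, so the sketch searches that window rather than only offset 0. The class and method names (`PdfSanityCheck`, `LooksLikePdf`) are hypothetical and are not part of Kernel Memory or PdfPig.

```csharp
using System;
using System.Text;

public static class PdfSanityCheck
{
    // ASCII bytes of the "%PDF-" marker that opens the version header.
    private static readonly byte[] Marker = Encoding.ASCII.GetBytes("%PDF-");

    // Hypothetical helper (not a KM or PdfPig API): returns true when the
    // marker appears within the first 1024 bytes of the content.
    public static bool LooksLikePdf(byte[]? content)
    {
        if (content is null || content.Length == 0)
        {
            return false; // an empty upload can never be a PDF
        }

        ReadOnlySpan<byte> head = content.AsSpan(0, Math.Min(content.Length, 1024));
        ReadOnlySpan<byte> marker = Marker;
        return head.IndexOf(marker) >= 0;
    }
}
```

Calling `PdfSanityCheck.LooksLikePdf(docContent)` just before `ImportDocumentAsync` would show whether the bytes were already damaged before they reached KM. If it returns false for a file that opens fine on disk, the problem is likely in how the array was filled rather than in the PDF parser.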
This issue was most likely due to some quirk of Blazor Server, which was corrupting the byte array. I had to do the following to get around this issue. Note: I am using Blazor InputFile to upload files:
<InputFile OnChange="UploadFile" class="form-control"/>
private async Task UploadFile(InputFileChangeEventArgs e)
{
    .....
    using (var stream = new MemoryStream())
    {
        // Copy the whole browser file stream into memory before converting it to an array
        await e.File.OpenReadStream(CommonConstants.Max_Upload_Size_In_Bytes).CopyToAsync(stream);
        stream.Position = 0;
        var docContent = stream.ToArray(); //now the docContent byte array is good.
    }
}
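The exact cause of the corruption was not pinned down in this thread. One common way a byte array ends up incomplete, offered here only as an assumption, is reading from a stream with a single `Read` call, which may return fewer bytes than requested. The sketch below wraps the same `CopyToAsync` pattern as the workaround above in a reusable helper; `StreamHelpers` and `ReadAllBytesAsync` are hypothetical names, not part of Blazor or KM.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class StreamHelpers
{
    // Hypothetical helper: drains the source stream from its current
    // position to the end and returns everything that was read.
    public static async Task<byte[]> ReadAllBytesAsync(Stream source)
    {
        using var buffer = new MemoryStream();
        // CopyToAsync keeps reading until the source reports end of stream,
        // so short reads from the underlying transport are handled for us.
        await source.CopyToAsync(buffer);
        return buffer.ToArray();
    }
}
```

With this in place, the upload handler could obtain the array with `await StreamHelpers.ReadAllBytesAsync(e.File.OpenReadStream(CommonConstants.Max_Upload_Size_In_Bytes))`, where the size constant is the one from the code above.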