[Bug] Error uploading pdf file. Exception: 'Could not find the version header comment at the start of the document.'

Question

bancroftway opened this issue 3 months ago · 6 comments

Context / Scenario

Trying to upload a PDF file through KM. Getting error: UglyToad.PdfPig.Core.PdfDocumentFormatException: 'Could not find the version header comment at the start of the document.' Please note that the issue is occurring on a wide variety of PDFs, not just a single file. In fact, I reproduced this error on publicly available PDF files from here: https://rrc.texas.gov/resource-center/research/data-sets-available-for-download/

Please see screenshot below

What happened?

Importance

edge case

Platform, Language, Versions

.NET 9

Relevant log output

UglyToad.PdfPig.Core.PdfDocumentFormatException: 'Could not find the version header comment at the start of the document.'

Answer 1 · 2024-09-25T18:12:42.000Z

If the file is not a valid PDF, I don't think there's anything we can do about it.

Could you provide some details about what happens and what you would expect to happen?

Answer 2 · 2024-09-25T18:15:38.000Z

This issue appears to be happening on many PDFs, not just one. The PDFs are valid in the sense that I can open and view them by double-clicking them.

Answer 3 · 2024-09-25T19:08:57.000Z

Could you share some of these files?

Answer 4 · 2024-09-25T20:05:42.000Z

@dluc sample files are attached below. They are public PDFs from the Texas Railroad Commission (https://rrc.texas.gov/resource-center/research/data-sets-available-for-download/)

I am using KM version: 0.75.240924.1

Here is my code:

public async Task<string> UploadDocumentToMemory(string docId, string fileName, Dictionary<string, string> tags,
    byte[] docContent)
{
    var kmKernel = GetKernelMemoryClient();

    var tc = new TagCollection();

    if (tags?.Any() == true)
    {
        foreach (var kvp in tags.Where(x => !string.IsNullOrEmpty(x.Value)))
        {
            tc.Add(kvp.Key.ToString(), kvp.Value.ToString());
        }
    }

    //delete the document if it already exists
    if (await kmKernel.IsDocumentReadyAsync(docId))
    {
        await DeleteDocumentsFromMemory(new List<string> { docId });
    }

    var id = await kmKernel.ImportDocumentAsync(new MemoryStream(docContent),
                fileName: fileName,
                documentId: docId,
                tags: tc.Any() ? tc : null
                );

    return id;
}

private MemoryServerless GetKernelMemoryClient()
{

    var azureOpenAITextConfig = new AzureOpenAIConfig
    {
        APIType = AzureOpenAIConfig.APITypes.ChatCompletion,
        MaxRetries = 10,
        MaxTokenTotal = configuration.GetValue<int>("AzureServices:OpenAI:LlmMaxTokens"),
        Endpoint = configuration["AzureServices:OpenAI:Url"]!,
        APIKey = configuration["AzureServices:OpenAI:ApiKey"]!,
        Auth = AzureOpenAIConfig.AuthTypes.APIKey,
        Deployment = configuration["AzureServices:OpenAI:ModelName"]!
    };

    var azureOpenAIEmbeddingConfig = new AzureOpenAIConfig()
    {
        APIType = AzureOpenAIConfig.APITypes.EmbeddingGeneration,
        MaxRetries = 10,
        MaxTokenTotal = configuration.GetValue<int>("AzureServices:OpenAI:EmbeddingMaxTokens"),
        Endpoint = configuration["AzureServices:OpenAI:Url"]!,
        APIKey = configuration["AzureServices:OpenAI:ApiKey"]!,
        Auth = AzureOpenAIConfig.AuthTypes.APIKey,
        Deployment = configuration["AzureServices:OpenAI:EmbeddingModel"]!,
        EmbeddingDimensions = null,
        MaxEmbeddingBatchSize = configuration.GetValue<int>("AzureServices:OpenAI:MaxEmbeddingBatchSize"),
    };

    var azureBlobsConfig = new AzureBlobsConfig()
    {
        Auth = AzureBlobsConfig.AuthTypes.ConnectionString,
        ConnectionString = configuration["ConnectionStrings:StorageConnection"]!,
        Container = configuration["AzureServices:OpenAI:EmbeddingsStorageContainer"]!
    };

    var azureAISearchConfig = new AzureAISearchConfig()
    {
        Auth = AzureAISearchConfig.AuthTypes.APIKey,
        APIKey = configuration["AzureServices:Search:AdminApiKey"]!,
        Endpoint = configuration["AzureServices:Search:EndPoint"]!,
        UseHybridSearch = true,
    };

    var memory = new KernelMemoryBuilder()
        .WithAzureOpenAITextGeneration(azureOpenAITextConfig)
        .WithAzureOpenAITextEmbeddingGeneration(azureOpenAIEmbeddingConfig)
        .WithAzureBlobsDocumentStorage(azureBlobsConfig)
        .WithAzureAISearchMemoryDb(azureAISearchConfig)
        .WithSearchClientConfig(new SearchClientConfig { MaxMatchesCount = 4, Temperature = 0, TopP = 0 })
        .Build<MemoryServerless>();

    return memory;
}

chapter1-all-effective-aug21-2017.pdf
digital-map-information-user-guide.pdf

Answer 5 · 2024-09-26T07:40:23.000Z

Hi @bancroftway!

I have successfully managed to import the PDF documents you provided (as a reference, I'm using the application I have published here: https://github.com/marcominerva/KernelMemoryService).

Looking at your code:

public async Task<string> UploadDocumentToMemory(string docId, string fileName, Dictionary<string, string> tags,
    byte[] docContent)
{
    // ...

    var id = await kmKernel.ImportDocumentAsync(new MemoryStream(docContent),
                fileName: fileName,
                documentId: docId,
                tags: tc.Any() ? tc : null
                );

    // ...
}

Have you checked that the docContent byte array contains the actual data?
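
One way to answer that question before calling ImportDocumentAsync is to look at the first bytes of the array. A PDF file starts with a `%PDF-` version header, which is what the PdfPig exception message refers to. The following is only a minimal sketch: the names `PdfBytesCheck` and `LooksLikePdf` are hypothetical and not part of Kernel Memory or PdfPig, and the 1024-byte search window is an assumption modeled on lenient PDF readers, not on PdfPig's actual parsing rules.

```csharp
using System;

// Hypothetical helper for illustration; not part of Kernel Memory or PdfPig.
public static class PdfBytesCheck
{
    // Returns true when the "%PDF-" marker appears near the start of the data.
    // Assumption: searching the first 1024 bytes is lenient enough for real files.
    public static bool LooksLikePdf(byte[]? content)
    {
        ReadOnlySpan<byte> header = "%PDF-"u8;

        if (content is null || content.Length < header.Length)
        {
            return false;
        }

        // Restrict the search to the beginning of the array.
        ReadOnlySpan<byte> window = content.AsSpan(0, Math.Min(content.Length, 1024));
        return window.IndexOf(header) >= 0;
    }
}
```

If `PdfBytesCheck.LooksLikePdf(docContent)` returns false for a file that opens fine on disk, the bytes were probably damaged or never filled in before they reached `UploadDocumentToMemory`. An empty or all-zero array fails this check.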

Answer 6 · 2024-09-26T20:07:35.000Z

This issue was most likely due to some quirk of Blazor Server, which was corrupting the byte array. I had to do the following to get around this issue. Note: I am using Blazor InputFile to upload files:

<InputFile OnChange="UploadFile" class="form-control"/>

private async Task UploadFile(InputFileChangeEventArgs e)
{
    .....
    using (var stream = new MemoryStream())
    {
        // Copy the uploaded file asynchronously into memory, up to the configured size limit.
        await e.File.OpenReadStream(CommonConstants.Max_Upload_Size_In_Bytes).CopyToAsync(stream);
        stream.Position = 0;
        var docContent = stream.ToArray(); // now the docContent byte array is good
    }
}
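
The workaround above copies the whole upload stream into a MemoryStream before taking the byte array. The same idea can be sketched as a reusable helper with a length check added. The names `StreamBytes` and `ReadAllAsync` and the `expectedLength` parameter are hypothetical; in a Blazor component, `expectedLength` could plausibly come from `e.File.Size`. The thread does not say how the array was filled originally, so this is not a diagnosis of the original corruption.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class StreamBytes
{
    // Copies the entire source stream into memory and verifies the byte count.
    public static async Task<byte[]> ReadAllAsync(Stream source, long expectedLength)
    {
        using var buffer = new MemoryStream();

        // CopyToAsync keeps reading until the source reports end of stream,
        // unlike a single ReadAsync call, which may return fewer bytes than requested.
        await source.CopyToAsync(buffer);

        if (buffer.Length != expectedLength)
        {
            throw new InvalidDataException(
                $"Expected {expectedLength} bytes but read {buffer.Length}.");
        }

        return buffer.ToArray();
    }
}
```

Failing fast on a length mismatch surfaces a truncated upload at the point where it happens, instead of later as a PdfPig parsing error.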