Unstructured-IO/unstructured

bug/Titles not included in chunks by-title


Describe the bug
I am using the API with the by_title chunking strategy. When I compare the PDF with the parsed data, I find that the chunked excerpts don't seem to include their title. If I parse with no chunking, I see that the title is identified correctly, as is the document hierarchy. I would have expected the title to be part of the chunk, as it carries a lot of semantic weight.

To Reproduce

Here is my pipeline. I attach a PDF that it processes. Search for "What we found" in the PDF to see the title of a section; it is this title which occurs in its own CompositeElement.


MAX_CHARACTERS = 1500
CHUNK_OVERLAP = 200
COMBINE_TEXT_UNDER_N_CHARS = 50

Pipeline.from_configs(
    context=ProcessorConfig(),
    indexer_config=LocalIndexerConfig(input_path=input_dir),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        strategy="hi_res",
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
            "reprocess": True,
            "extract_image_block_types": ["Image"],
        },
        reprocess=True,
    ),
    # https://docs.unstructured.io/api-reference/ingest/ingest-configuration/chunking-configuration
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        max_characters=MAX_CHARACTERS,
        chunk_overlap=CHUNK_OVERLAP,
        combine_text_under_n_chars=COMBINE_TEXT_UNDER_N_CHARS,
    ),
    # embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
    uploader_config=LocalUploaderConfig(output_dir=f"{OUTPUT_DIR}/{METHOD}"),
).run()

Here is the file I am testing with ...

oversightgov__faa_quickly_awarded_cares_act_funds_but_can_enhanc__2c7d00c9-def8-4409-9359-1a626bbf69b5.pdf
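
To make the symptom easier to spot in the pipeline output, here is a minimal sketch that lists chunks short enough to be a lone title. It assumes the uploader writes a JSON list of element dicts with "type" and "text" keys; the function name `find_short_chunks` and the 80-character threshold are my own illustrative choices, not part of unstructured.

```python
import json


def find_short_chunks(chunks, max_len=80):
    """Return (index, text) for CompositeElement chunks short enough to be a lone title.

    `chunks` is a list of dicts with "type" and "text" keys (assumed output shape).
    `max_len` is an arbitrary heuristic threshold, not an unstructured setting.
    """
    hits = []
    for i, chunk in enumerate(chunks):
        text = (chunk.get("text") or "").strip()
        if chunk.get("type") == "CompositeElement" and 0 < len(text) <= max_len:
            hits.append((i, text))
    return hits


# Synthetic example mirroring what I see: the title sits in its own chunk.
sample = json.loads(
    """
    [
      {"type": "CompositeElement", "text": "What we found"},
      {"type": "CompositeElement",
       "text": "This body text is long enough that it clearly belongs to a section and is not just a heading by itself."}
    ]
    """
)
suspects = find_short_chunks(sample)
```

On a real run, `chunks` would come from `json.load` on one of the files in the output directory; any hit whose text matches a heading in the PDF is a title that was chunked on its own.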

Expected behavior
I wouldn't expect titles to end up as chunks of their own. Instead, I would expect each title to be part of the main text of its section.
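
To illustrate the behavior I expected, here is a rough post-processing sketch that prepends a title-only chunk to the chunk that follows it. This is only a workaround under my own assumptions, not how by_title is implemented: `merge_title_chunks` is a hypothetical helper, the 80-character threshold is a guess at what counts as a title, and chunks are assumed to be dicts with a "text" key.

```python
def merge_title_chunks(chunks, max_title_len=80, separator="\n\n"):
    """Prepend short (title-like) chunks to the next longer chunk.

    Hypothetical helper: `max_title_len` is a heuristic, and merged chunks
    may end up longer than the max_characters limit used when chunking.
    """
    merged = []
    pending = []  # short texts waiting for a body chunk to attach to
    for chunk in chunks:
        text = (chunk.get("text") or "").strip()
        if 0 < len(text) <= max_title_len:
            pending.append(text)
            continue
        if pending:
            # Copy the chunk so the input list is left unmodified.
            chunk = {**chunk, "text": separator.join(pending + [text])}
            pending = []
        merged.append(chunk)
    if pending:
        # Trailing short chunks have no body to attach to; keep them together.
        merged.append({"type": "CompositeElement", "text": separator.join(pending)})
    return merged


body = "B" * 120  # stand-in for a section body longer than the threshold
result = merge_title_chunks(
    [
        {"type": "CompositeElement", "text": "What we found"},
        {"type": "CompositeElement", "text": body},
    ]
)
```

With this, the "What we found" heading and its section body come out as one chunk, which is what I expected by_title to produce directly.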

Screenshots

Environment Info
I am using the unstructured docker image as found here:

https://github.com/Unstructured-IO/unstructured/tree/main?tab=readme-ov-file#run-the-library-in-a-container

I exec into the container. I also had to install:

pip install unstructured-ingest
pip install unstructured

Here are my versions ...

Python 3.11.10

unstructured 0.15.13
unstructured-client 0.25.9
unstructured-inference 0.7.36
unstructured-ingest 0.0.21
unstructured.paddleocr 2.8.1.0
unstructured.pytesseract 0.3.13

Additional context