bug/Titles not included in chunks by-title
Describe the bug
I am using the API with the by_title chunking strategy. When I compare the PDF with the parsed data, I find that the chunked excerpts don't seem to include their title. If I parse with no chunking, I see that the title is identified correctly, as is the document hierarchy. I would have expected the title to be part of the chunk, as it carries a lot of semantic weight.
To Reproduce
Here is my pipeline. I have attached a PDF that it processes. Search for "What we found" in the PDF to see a section title; it is this title that ends up in its own CompositeElement.
MAX_CHARACTERS = 1500
CHUNK_OVERLAP = 200
COMBINE_TEXT_UNDER_N_CHARS = 50

Pipeline.from_configs(
    context=ProcessorConfig(),
    indexer_config=LocalIndexerConfig(input_path=input_dir),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        strategy="hi_res",
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
            "reprocess": True,
            "extract_image_block_types": ["Image"]
        },
        reprocess=True
    ),
    # https://docs.unstructured.io/api-reference/ingest/ingest-configuration/chunking-configuration
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        max_characters=MAX_CHARACTERS,
        chunk_overlap=CHUNK_OVERLAP,
        combine_text_under_n_chars=COMBINE_TEXT_UNDER_N_CHARS
    ),
    # embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
    uploader_config=LocalUploaderConfig(output_dir=f"{OUTPUT_DIR}/{METHOD}")
).run()
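To make the symptom easier to confirm, here is a minimal sketch of a check that flags chunks whose entire text is just a title. It assumes both the unchunked and chunked outputs have been loaded as lists of dicts with "type" and "text" keys, as in the JSON output. The helper name find_title_only_chunks and the sample data are hypothetical, made up for illustration.

```python
def find_title_only_chunks(elements, chunks):
    """Return the chunks whose whole text equals a Title from the unchunked output."""
    # Collect the text of every element the partitioner labelled as a Title.
    titles = {
        e.get("text", "").strip()
        for e in elements
        if e.get("type") == "Title"
    }
    titles.discard("")
    # A chunk is "title-only" when its stripped text is exactly one of those titles.
    return [c for c in chunks if c.get("text", "").strip() in titles]


# Hypothetical sample data shaped like the JSON output (not from the attached PDF).
elements = [
    {"type": "Title", "text": "What we found"},
    {"type": "NarrativeText", "text": "The audit identified three issues."},
]
chunks = [
    {"type": "CompositeElement", "text": "What we found"},
    {"type": "CompositeElement", "text": "The audit identified three issues."},
]

title_only = find_title_only_chunks(elements, chunks)
print([c["text"] for c in title_only])
```

With the behavior described in this issue, the title chunk is reported; if titles were kept with their section text, the result would be empty.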
Here is the file I am testing with ...
Expected behavior
I wouldn't expect titles to end up as their own chunks. Instead, I would expect each title to be part of the main text of its section.
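The expectation can be sketched as follows. This is not the library's implementation, only an illustration of the grouping I have in mind, and it ignores max_characters, overlap and combine_text_under_n_chars. It assumes the unchunked output as a list of dicts with "type" and "text" keys; the function name chunk_sections_with_titles, the blank-line separator and the sample data are my own choices.

```python
def chunk_sections_with_titles(elements):
    """Group elements into sections, each starting with its Title text."""
    sections = []
    current = []
    for el in elements:
        text = el.get("text", "").strip()
        if not text:
            continue
        # A Title closes the section in progress and opens a new one.
        if el.get("type") == "Title" and current:
            sections.append("\n\n".join(current))
            current = []
        # The title itself is kept as the first piece of its section.
        current.append(text)
    if current:
        sections.append("\n\n".join(current))
    return sections


# Hypothetical sample data (not from the attached PDF).
sample = [
    {"type": "Title", "text": "What we found"},
    {"type": "NarrativeText", "text": "The audit identified three issues."},
    {"type": "Title", "text": "Recommendations"},
    {"type": "NarrativeText", "text": "Address the issues in order."},
]

for section in chunk_sections_with_titles(sample):
    print(repr(section))
```

Each resulting chunk begins with its title, so the title's semantic weight is available when the chunk is embedded or searched.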
Screenshots
Environment Info
I am using the unstructured docker image as found here:
I exec into the container. I also had to install ...
pip install unstructured-ingest
pip install unstructured
Here are my versions ...
Python 3.11.10
unstructured 0.15.13
unstructured-client 0.25.9
unstructured-inference 0.7.36
unstructured-ingest 0.0.21
unstructured.paddleocr 2.8.1.0
unstructured.pytesseract 0.3.13
Additional context