Incorrect result for MarkdownHeaderTextSplitter
Closed this issue · 2 comments
Describe the bug
I just tried using the MarkdownHeaderTextSplitter, but I believe the end result is incorrect.
Steps to reproduce the bug
Try splitting the text "# Foo\n\n ## Bar\n\nHi this is Jim \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly".
The result is:
["Foo\n Bar\nHi this is Jim\nHi this is Joe\n Baz\nHi this is Molly"]
This does not match the behavior of LangChain's Python implementation:
https://python.langchain.com/docs/modules/data_connection/document_transformers/markdown_header_metadata
Expected behavior
It should be:
[
"Foo\nBar\nHi this is Jim \nHi this is Joe",
"Foo\nBaz\nHi this is Molly"
]
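To make the expected behavior concrete, here is a minimal Python sketch of header-aware splitting. It is not the LangChain API (Python or .NET); `split_by_headers` is a hypothetical helper written only for illustration. It assumes that each chunk should start with its active header titles, that a new header closes the previous chunk, and that every line is stripped of surrounding whitespace (so the trailing space after "Jim" is dropped).

```python
import re

# Matches ATX-style Markdown headers such as "# Foo" or "## Bar".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(text):
    """Hypothetical helper: split Markdown into one chunk per header section."""
    chunks = []
    header_stack = []  # (level, title) pairs for the currently active headers
    body = []          # content lines collected under the current headers

    def flush():
        # Emit a chunk only if some content was collected under the headers.
        if body:
            titles = [title for _, title in header_stack]
            chunks.append("\n".join(titles + body))
            body.clear()

    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = HEADER_RE.match(line)
        if match:
            flush()  # a new header ends the previous section
            level = len(match.group(1))
            # Drop headers at the same or a deeper level than the new one.
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, match.group(2)))
        else:
            body.append(line)

    flush()
    return chunks


text = "# Foo\n\n ## Bar\n\nHi this is Jim \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly"
print(split_by_headers(text))
```

With the input from this report, the sketch returns two chunks, one for "Bar" and one for "Baz", each prefixed with the parent header "Foo". The only difference from the expected output above is the stripped trailing space after "Jim".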
NuGet package version
0.12.3-dev.110
This is quite possible. I don't remember exactly who implemented this, and many components have so far been implemented and used only by specific people.
But we'd love any help, from simply writing unit tests to fixing the Markdown splitter itself.
My bad. I remember that the original implementation was kind of confusing, so I decided to simplify it. I probably misunderstood something in the code.