Incorrect result for MarkdownHeaderTextSplitter

Question

Closed this issue 10 months ago · 2 comments

nmklong commented a year ago

Describe the bug

I just tried using the MarkdownHeaderTextSplitter, but I believe the end result is incorrect.

Steps to reproduce the bug

Try splitting the text # Foo\n\n ## Bar\n\nHi this is Jim \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly.

The result is:

["Foo\n Bar\nHi this is Jim\nHi this is Joe\n Baz\nHi this is Molly"]

This does not match the behavior of LangChain's Python implementation:
https://python.langchain.com/docs/modules/data_connection/document_transformers/markdown_header_metadata

Expected behavior

It should be:

[
    "Foo\nBar\nHi this is Jim  \nHi this is Joe",
    "Foo\nBaz\nHi this is Molly"
]
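
For reference, the expected behavior can be sketched in a few lines of Python. This is a minimal illustration, not the code of either LangChain implementation: the function name `split_by_headers` is hypothetical, it assumes ATX-style headers (`#` to `######`), it strips each line and joins lines with a plain `\n`, and it does not handle `#` lines inside fenced code blocks. The sample string is the text from the steps above without a trailing period.

```python
import re

# ATX-style header: 1-6 '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(text: str) -> list[str]:
    """Split markdown into chunks, one per header section.

    Each chunk starts with the titles of the headers currently in scope,
    followed by the body lines of that section.
    """
    chunks: list[str] = []
    header_stack: list[tuple[int, str]] = []  # (level, title) pairs in scope
    body: list[str] = []

    def flush() -> None:
        # Emit the pending section only if it has body text.
        if body:
            titles = [title for _, title in header_stack]
            chunks.append("\n".join(titles + body))
            body.clear()

    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate blocks
        match = HEADER_RE.match(line)
        if match:
            flush()  # a new header closes the previous section
            level = len(match.group(1))
            # Headers at the same or a deeper level go out of scope.
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, match.group(2).strip()))
        else:
            body.append(line)

    flush()
    return chunks


sample = "# Foo\n\n ## Bar\n\nHi this is Jim \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly"
chunks = split_by_headers(sample)
```

With this sketch, `chunks` holds two entries, one for the `Bar` section and one for the `Baz` section, each prefixed with `Foo`. The key point is that a header at the same level replaces the previous one instead of being appended to a single chunk.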

Screenshots

No response

NuGet package version

0.12.3-dev.110

Additional context

No response

Answer 1 · 2024-02-09T23:07:48.000Z

This is quite possible. I don't remember who exactly implemented this, and many parts of the library have so far been implemented and used only by specific people.
We'd love any help, from simply adding unit tests to fixing the Markdown splitter itself.

Answer 2 · 2024-02-26T07:59:15.000Z

My bad. I remember that the original implementation was kind of confusing, so I decided to simplify it. I probably misunderstood something in the code.