hwchase17/notion-qa

How to split a document in a smarter way

RobinHerzog opened this issue · 6 comments

Hello,

I understand that we need to split documents into smaller pieces because OpenAI cannot take the whole text as input.

However, my challenge is to cut the text in a smart way so that it does not break in the middle of a sentence.

Any luck with that?

The simplest option looks to be the built-in SpacyTextSplitter, which uses Spacy, or maybe the NLTKTextSplitter. The former, I know for sure, has a huge footprint -- 400 MB or so for the Spacy model. A nice patch to langchain would be to use sentence_splitter, which is super fast and efficient (as it doesn't require a machine learning model). A little further investigation for this demo suggests the MarkdownTextSplitter might be the best bet.
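
For context, a rough sketch of how those built-in splitters are typically invoked; the chunk sizes and the notion_text variable are placeholders, not values from this repo:

from langchain.text_splitter import SpacyTextSplitter, MarkdownTextSplitter

# spaCy-based, sentence-aware splitting (requires `pip install spacy` and,
# for the default pipeline, `python -m spacy download en_core_web_sm`)
spacy_splitter = SpacyTextSplitter(chunk_size=1000, chunk_overlap=100)
spacy_chunks = spacy_splitter.split_text(notion_text)

# Markdown-aware splitting, which tends to keep headings and their sections together
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
md_chunks = md_splitter.split_text(notion_text)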

second this

I'm using RecursiveCharacterTextSplitter for generic text splitting tasks:
https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html
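
For example (the parameters here are illustrative; by default it tries "\n\n", "\n", " ", and "" in order, so it only breaks mid-sentence when a chunk is still too long):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=200,  # overlap between consecutive chunks
)
chunks = text_splitter.split_text(notion_text)  # notion_text: whatever text you loaded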

Hi,
I wanted to ask about SpacyTextSplitter: can I use multiple separators, like a list of punctuation marks? I am giving an example below. If what I have shown below is possible but I am not doing it right, please write the correct syntax.

Original working code:
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator='\n', pipeline='sentencizer')

My expectation:
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator=['\n', '=', ',', '&'], pipeline='sentencizer')

You can find the SpacyTextSplitter class code in LangChain; it is implemented on top of the TextSplitter class.
If you want to use multiple separators, maybe you should rewrite the class along the lines of the PythonCodeTextSplitter class, which is built on RecursiveCharacterTextSplitter.

Something like the following:

from typing import Any, List

from langchain.text_splitter import RecursiveCharacterTextSplitter


def _make_spacy_pipeline_for_splitting(pipeline: str) -> Any:  # avoid importing spacy at module import time
    try:
        import spacy
    except ImportError:
        raise ImportError(
            "Spacy is not installed, please install it with `pip install spacy`."
        )
    if pipeline == "sentencizer":
        # lightweight rule-based sentence boundaries, no model download needed
        from spacy.lang.en import English

        sentencizer = English()
        sentencizer.add_pipe("sentencizer")
    else:
        # full spaCy model, excluding components we don't need for splitting
        sentencizer = spacy.load(pipeline, exclude=["ner", "tagger"])
    return sentencizer


class SpacyTextSplitter(RecursiveCharacterTextSplitter):
    """Split text using the Spacy package.

    By default, Spacy's `en_core_web_sm` model is used. For faster but
    potentially less accurate splitting, you can use `pipeline='sentencizer'`.
    """

    def __init__(
        self,
        separators: List[str] = ["\n\n"],
        pipeline: str = "en_core_web_sm",
        **kwargs: Any,
    ) -> None:
        """Initialize the spacy text splitter."""
        super().__init__(**kwargs)
        self._tokenizer = _make_spacy_pipeline_for_splitting(pipeline)
        self._separators = separators

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # !!! your code !!!  (one possible sketch follows below)
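
For the "!!! your code !!!" part, here is one possible, untested sketch. It assumes the private helpers (_length_function, _chunk_size, _merge_splits) provided by LangChain's TextSplitter base class; the idea is to split on spaCy sentence boundaries first, and only fall back to the recursive character logic (which walks self._separators in order) for sentences that are still longer than chunk_size:

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # 1. Sentence-split with spaCy so we never cut mid-sentence.
        sentences = [s.text for s in self._tokenizer(text).sents]
        splits: List[str] = []
        for sentence in sentences:
            if self._length_function(sentence) <= self._chunk_size:
                splits.append(sentence)
            else:
                # 2. Over-long sentence: fall back to the parent
                #    RecursiveCharacterTextSplitter logic, which tries
                #    the separators in self._separators one by one.
                splits.extend(super().split_text(sentence))
        # 3. Merge sentences back into chunks of roughly chunk_size,
        #    respecting chunk_overlap (base-class helper).
        return self._merge_splits(splits, " ")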

Thanks a lot!!