How to split documents in a smarter way
RobinHerzog opened this issue · 6 comments
Hello,
I understand that we need to split documents into smaller pieces because OpenAI cannot take the whole text as input.
However, my challenge is to cut the text in a smart way so that it does not break in the middle of a sentence.
Any luck with that?
The simplest option looks to be the built-in SpacyTextSplitter that uses Spacy, or maybe the NLTKTextSplitter. The former I know for sure has a huge footprint -- 400MB or so for the Spacy model. A nice patch to langchain would be to use sentence_splitter, which is super fast and efficient (as it doesn't require a machine learning model). A little further investigation for this demo suggests the MarkdownTextSplitter might be the best bet.
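For reference, both of those splitters ship with langchain; a minimal usage sketch (chunk_size is illustrative, and document_text stands in for your own text):

from langchain.text_splitter import NLTKTextSplitter, SpacyTextSplitter

# NLTK-based sentence splitting (needs `pip install nltk` plus the punkt data).
nltk_splitter = NLTKTextSplitter(chunk_size=1000)
nltk_chunks = nltk_splitter.split_text(document_text)

# Spacy-based sentence splitting (needs `pip install spacy` plus a model,
# e.g. `python -m spacy download en_core_web_sm`).
spacy_splitter = SpacyTextSplitter(chunk_size=1000)
spacy_chunks = spacy_splitter.split_text(document_text)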
second this
I'm using RecursiveCharacterTextSplitter for generic text splitting tasks:
https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html
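A minimal sketch of how it is typically configured (the chunk_size and chunk_overlap values are just placeholders, and document_text stands in for your own text):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# By default it tries "\n\n", "\n", " ", "" in order, so it prefers paragraph
# and line boundaries before falling back to hard character cuts.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
)
chunks = text_splitter.split_text(document_text)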
Hi,
I wanted to ask about SpacyTextSplitter: can I use multiple separators, like a list of punctuation marks? I am giving an example below. If what I have shown below is possible but I am not doing it right, please write the correct syntax.
Original Working Code
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator='\n', pipeline='sentencizer')
My Expectation
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator=['\n', '=', ',', '&'], pipeline='sentencizer')
You can find the SpacyTextSplitter class code in langchain; it is implemented on top of the TextSplitter class.
If you want to use a list of separators, you could rewrite the class along the lines of the PythonCodeTextSplitter class, which is implemented on top of RecursiveCharacterTextSplitter.
Some code like the following:
from typing import Any, List

from langchain.text_splitter import RecursiveCharacterTextSplitter


def _make_spacy_pipeline_for_splitting(pipeline: str) -> Any:  # avoid importing spacy
    try:
        import spacy
    except ImportError:
        raise ImportError(
            "Spacy is not installed, please install it with `pip install spacy`."
        )
    if pipeline == "sentencizer":
        from spacy.lang.en import English

        sentencizer = English()
        sentencizer.add_pipe("sentencizer")
    else:
        sentencizer = spacy.load(pipeline, exclude=["ner", "tagger"])
    return sentencizer


class SpacyTextSplitter(RecursiveCharacterTextSplitter):
    """Splitting text using the Spacy package.

    By default, Spacy's `en_core_web_sm` model is used. For a faster, but
    potentially less accurate splitting, you can use `pipeline='sentencizer'`.
    """

    def __init__(
        self,
        separators: List[str] = ["\n\n"],
        pipeline: str = "en_core_web_sm",
        **kwargs: Any,
    ) -> None:
        """Initialize the spacy text splitter."""
        super().__init__(**kwargs)
        self._tokenizer = _make_spacy_pipeline_for_splitting(pipeline)
        self._separators = separators

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # !!! your code !!!
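For the placeholder above, one possible (untested) sketch is to let Spacy find sentence boundaries first and only fall back to the inherited recursive splitting for sentences that are still longer than the chunk size. The attributes used here (_tokenizer, _separators, _length_function, _chunk_size, _merge_splits) come from the class above and from langchain's TextSplitter base class as they existed at the time; treat this as an assumption, not the official implementation:

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # Let Spacy find sentence boundaries first ...
        sentences = [str(s) for s in self._tokenizer(text).sents]
        splits: List[str] = []
        for sentence in sentences:
            if self._length_function(sentence) <= self._chunk_size:
                splits.append(sentence)
            else:
                # ... then recursively split any over-long sentence on the
                # configured separator list via the parent class.
                splits.extend(super().split_text(sentence))
        # Finally merge small pieces back together up to chunk_size,
        # honouring chunk_overlap.
        return self._merge_splits(splits, " ")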
Thanks a lot!!