DFKI-MLT/JTok

Sentence splitting without whitespaces

Closed · 6 comments

There seems to be no option to split sentences if the text is missing the space after the period.

Example:

Some simple sentence.Wherein, a space is missing.

Even if I do not allow PERIOD within tokens, I still cannot configure the sentence splitting to produce two sentences. Is there a way I missed? If not, it would be nice if the splitting were extensible/configurable.

You could remove PERIOD from the list of punctuation that is allowed to be embedded within a token, as defined here. A PERIOD would then show up as a separate token.
But JTok still would not recognize a sentence ending. This is because after a potential end-of-sentence token, like PERIOD, JTok checks whether the next token is whitespace or a so-called "sentence continuing" token, i.e. another end-of-sentence token, a closing punctuation or a closing bracket. If that is not the case, JTok dismisses the possibility of a sentence end. This behavior is implemented in the code here, so there is nothing to configure. If you want to change that behavior, you have to adapt the code.
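To make the described check concrete, here is a minimal sketch of it in Java. It is not JTok's actual code: the class, the token types and the method name are all made up for illustration, only PERIOD is modeled as an end-of-sentence token, and treating the end of the input as a sentence end is an assumption.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the check described above; not JTok's real classes.
final class EosCheckSketch {

  // Illustrative token types, not JTok's actual token class hierarchy.
  enum Type { WORD, WHITESPACE, PERIOD, CLOSING_PUNCT, CLOSING_BRACKET, OPENING_BRACKET }

  // "Sentence continuing" tokens: another end-of-sentence token,
  // a closing punctuation or a closing bracket.
  static final Set<Type> SENTENCE_CONTINUING =
      EnumSet.of(Type.PERIOD, Type.CLOSING_PUNCT, Type.CLOSING_BRACKET);

  // Returns true if the token at index i is accepted as a sentence end.
  static boolean mayEndSentence(List<Type> tokens, int i) {
    if (tokens.get(i) != Type.PERIOD) {
      return false; // only PERIOD is modeled as an end-of-sentence candidate
    }
    if (i + 1 >= tokens.size()) {
      return true; // assumption: the end of the input also ends the sentence
    }
    Type next = tokens.get(i + 1);
    // Any other follower (e.g. a word) dismisses the sentence end.
    return next == Type.WHITESPACE || SENTENCE_CONTINUING.contains(next);
  }

  public static void main(String[] args) {
    // "sentence. Wherein": period followed by whitespace
    List<Type> spaced = Arrays.asList(Type.WORD, Type.PERIOD, Type.WHITESPACE, Type.WORD);
    // "sentence.Wherein": period directly followed by a word
    List<Type> glued = Arrays.asList(Type.WORD, Type.PERIOD, Type.WORD);
    System.out.println(mayEndSentence(spaced, 1));
    System.out.println(mayEndSentence(glued, 1));
  }
}
```

In this model, the period in "sentence.Wherein" is rejected as a sentence end because the following token is a word, which is the situation from the example in the report.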

This is exactly what I did, and where I stopped. I do not want to modify the code because of the weak copyleft. Is there interest in extending JTok to support this use case by configuration? I could provide the modifications if I find a clean solution. Right now, I apply some postprocessing to solve my use case.

I don't think an extension of JTok would be the correct way to handle this problem. The input in that case is clearly malformed, and I don't think, in general, a linguistic tool should be responsible for fixing such errors in its input. So pre/postprocessing would be the way to go.

I have to agree and disagree. Of course, the tool should not be responsible for fixing malformed input, but it should be robust enough to handle common cases of real-world input. I do not know if a missing space can be called "malformed" with all of its consequences. Preprocessing is not an option, and I think postprocessing is a suboptimal solution from a software development perspective. JTok can be configured so nicely, but not its sentence splitting heuristics, which is a real pity IMHO. Well, I'll stick with the postprocessing for now then... let me know if you are interested in further discussion.
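For readers wondering what such a postprocessing step might look like, here is a rough sketch. It is an illustration, not the code actually used: it assumes the sentences are available as plain strings, and the class name, method name and the regular expression are made up for this example. The rule (lowercase letter, period, then an uppercase letter with nothing in between) is deliberately crude and will also split things like "example.Com".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical postprocessing helper; it does not use any JTok API.
final class MissingSpaceResplit {

  // Zero-width boundary: a lowercase letter plus a period on the left,
  // an uppercase letter on the right.
  private static final Pattern BOUNDARY =
      Pattern.compile("(?<=\\p{Ll}\\.)(?=\\p{Lu})");

  // Splits each sentence again wherever the boundary pattern matches.
  static List<String> resplit(List<String> sentences) {
    List<String> result = new ArrayList<>();
    for (String sentence : sentences) {
      result.addAll(Arrays.asList(BOUNDARY.split(sentence)));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> input =
        Arrays.asList("Some simple sentence.Wherein, a space is missing.");
    for (String sentence : resplit(input)) {
      System.out.println(sentence);
    }
  }
}
```

Because the boundary is zero-width, no characters are dropped: the period stays with the first sentence and the capital letter starts the second one.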

Funny thing: I wanted to check how much effort it would be to get the behavior you want in a post-processing step. In doing so, I noticed that the conditional setting of the eos-flag here is actually redundant. If I set the eos-flag unconditionally, JTok behaves as before according to the unit tests. There are only two cases where the output changes: JTok now recognizes an end-of-sentence after a period that is immediately followed by an opening bracket, and that's actually an improvement. And even better, if you remove PERIOD from the list of punctuation that is allowed to be embedded within a token, as mentioned above, you get exactly the behavior you're looking for. Looks like a win-win situation :-)
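As a toy illustration of the effect (not the actual change in JTok), the sketch below sets an eos flag at every period token and closes the sentence as soon as the next token is not a sentence-continuing one. Everything here is an assumption made for the example: tokens are plain strings, whitespace is already dropped, PERIOD is already a separate token, and the set of continuing tokens is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; class, method and token set are hypothetical.
final class UnconditionalEosSketch {

  // Tokens that may still belong to a sentence after its final period.
  static final Set<String> CONTINUING =
      new HashSet<>(Arrays.asList(".", "!", "?", ")", "]", "\""));

  static List<List<String>> toSentences(List<String> tokens) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    boolean eos = false;
    for (String token : tokens) {
      // A pending eos flag closes the sentence at the first token
      // that does not continue it.
      if (eos && !CONTINUING.contains(token)) {
        sentences.add(current);
        current = new ArrayList<>();
        eos = false;
      }
      current.add(token);
      if (token.equals(".")) {
        eos = true; // set unconditionally, no look at the next token
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "Some", "simple", "sentence", ".",
        "Wherein", ",", "a", "space", "is", "missing", ".");
    for (List<String> sentence : toSentences(tokens)) {
      System.out.println(sentence);
    }
  }
}
```

In this model, a period followed directly by a word or by an opening bracket starts a new sentence, while a closing bracket or quote after the period stays with the sentence it closes.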

So I'll release a new version of JTok soon.

Ah great, thank you :-)