Default regex for words treats numbers as individual words
IMN-MichaelL opened this issue · 2 comments
I am unsure whether this behavior is intended, but if a number is part of a word or acronym, it is broken out as a separate word.
Expected behavior:
v.words('What Is The Best MACD Indicator for MT4?')
(8) ["What", "Is", "The", "Best", "MACD", "Indicator", "for", "MT4"]
v.countWords('What Is The Best MACD Indicator for MT4?')
8
Current behavior:
v.words('What Is The Best MACD Indicator for MT4?')
(9) ["What", "Is", "The", "Best", "MACD", "Indicator", "for", "MT", "4"]
v.countWords('What Is The Best MACD Indicator for MT4?')
9
I realize I could come up with my own regex to solve this, but the default regex works really well apart from this one scenario, which doesn't seem intended.
I am using Voca 1.4.0
In your particular case separating the number is inconvenient, but this is the expected behaviour.
Separating the number is usually a good thing. For example, v.words('IHave4Years') results in [ 'I', 'Have', '4', 'Years' ].
I suggest using v.split() with your own separator, or v.words() with your own words RegExp.
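As a rough sketch of the second suggestion: v.words() and v.countWords() accept a pattern as their second argument, so a pattern that keeps letters and digits together could be passed there. The pattern and the helper name below are assumptions made for illustration, not something Voca ships. The helper uses plain String.prototype.match so that the snippet runs without Voca installed.

```javascript
// Hypothetical pattern (not part of Voca): a run of Unicode letters and
// digits is treated as one word, so "MT4" stays in one piece.
const WORD_WITH_DIGITS = /[\p{L}\p{N}]+/gu;

// Plain-JavaScript stand-in so the sketch runs without Voca.
function wordsKeepingDigits(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(WORD_WITH_DIGITS) || [];
}

// With Voca loaded as `v`, the same pattern could be passed directly:
//   v.words('What Is The Best MACD Indicator for MT4?', WORD_WITH_DIGITS);
//   v.countWords('What Is The Best MACD Indicator for MT4?', WORD_WITH_DIGITS);

console.log(wordsKeepingDigits('What Is The Best MACD Indicator for MT4?'));
```

The trade-off is that this pattern no longer splits on case changes or digits at all, so 'IHave4Years' comes back as a single word.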
Is there any reason that this functionality isn't something that can be toggled with a flag?
I can see it being useful for dealing with code, or perhaps with text from poorly parsed HTML. However, for the kind of text someone might write for an article, book, or web page, this behavior is going to cause more problems than it solves, and being able to choose whether you want it would be ideal.
As further examples, hyphenated text and proper nouns like LibreOffice are split apart. Word processors like LibreOffice produce the word counts I expect because they do not split words the way this library does by default. I realize that in a technical sense these are words, but in a real-world, practical sense not so much.
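For comparison, a minimal sketch of a whitespace-based count, which keeps hyphenated words and names like LibreOffice whole. It assumes that a "word" is simply any run of non-whitespace characters. That is only an approximation of what a word processor reports, not a description of how LibreOffice actually counts. The function name is hypothetical.

```javascript
// Hypothetical helper: counts whitespace-delimited tokens.
// Assumption: any run of non-whitespace characters is one word, so
// "MT4?", "well-known" and "LibreOffice" each count once.
function countWhitespaceWords(text) {
  return text
    .trim()
    .split(/\s+/)
    .filter(Boolean) // drops the empty string produced by empty input
    .length;
}

console.log(countWhitespaceWords('What Is The Best MACD Indicator for MT4?'));
console.log(countWhitespaceWords('LibreOffice is a well-known suite'));
```

Because punctuation stays attached to the tokens, this approach is suitable for counting but would need extra cleanup if the words themselves were required.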