indices not accounting for leading hashtag
ptwobrussell opened this issue · 4 comments
In comparing twitter-text-rb and twitter-text-py there seems to be a very minor offset issue in the start index for hashtags.
= Input =
from twitter_text.extractor import Extractor
extractor = Extractor("Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw")
for ht in extractor.extract_hashtags_with_indices(): print ht
= Actual Output =
{'indices': (44, 49), 'hashtag': u'ds158'}
= Expected Output (as compared to twitter-text-rb and Twitter's production API) =
{'indices': (43, 49), 'hashtag': u'ds158'}
text = u"Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw"
text[44:49] == 'ds158'
Seems to me like this is right. The indices should match the text returned. The Ruby library is either giving the indices for the hashtag including the leading #, or is off by 1.
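To make the two readings concrete, here is a small sketch using only plain string operations, with no twitter-text-py calls. The variable names (`hash_pos`, `without_prefix`, `with_prefix`) are made up for illustration:

```python
text = u"Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw"
tag = u"ds158"

# Position of the leading '#' in the tweet text.
hash_pos = text.index(u"#")

# Reading 1: the indices cover only the returned hashtag text.
without_prefix = (hash_pos + 1, hash_pos + 1 + len(tag))

# Reading 2: the indices also cover the leading '#'.
with_prefix = (hash_pos, hash_pos + 1 + len(tag))

print(without_prefix, text[without_prefix[0]:without_prefix[1]])
print(with_prefix, text[with_prefix[0]:with_prefix[1]])
```

The first pair slices out the bare tag, the second slices out the tag together with its `#`, so the only difference between the two outputs in the report is whether the prefix character is counted.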
I would have thought the same thing, but considering that Twitter is using twitter-text-rb in production, my guess is that they are stripping the leading # and @ off of the entities because they're implied. I agree that it isn't intuitive, but it seems like it must be intentional on their part, and their example docs show it this way as well: http://dev.twitter.com/pages/tweet_entities
Granted, we may be the first to notice it, but that seems unlikely to me.
If you'd like, I could file a bug about this at http://code.google.com/p/twitter-api/issues/list (I don't see one addressing this) to clarify that this is the expected behavior, if that would be helpful to you. All I want is to be able to depend on twitter-text-py one way or the other and know that it'll be consistent with the production Twitter API.
I agree that the library should match the output of the Ruby library. In doing so, though, I don't want to introduce any bugs from the Ruby project. Having checked out the documentation you linked to, I've changed the indices to include the preceding characters.
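For anyone reading later, a rough sketch of what "indices include the preceding character" means in practice. This is not the actual twitter-text-py code: `HASHTAG_RE` is a deliberately simplified, hypothetical pattern and `hashtags_with_indices` is an illustrative helper, whereas the real library's hashtag regex is more involved.

```python
import re

# Hypothetical, simplified pattern: a '#' not preceded by a word character,
# followed by one or more word characters (captured without the '#').
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")

def hashtags_with_indices(text):
    """Return hashtags without the '#', with indices that include it."""
    results = []
    for match in HASHTAG_RE.finditer(text):
        results.append({
            "hashtag": match.group(1),                  # text after the '#'
            "indices": (match.start(), match.end()),    # span of '#' + text
        })
    return results

text = u"Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw"
result = hashtags_with_indices(text)
print(result)
```

With this convention the `hashtag` value stays `ds158`, while slicing the tweet with the reported indices yields `#ds158`, which is the behavior described in the expected output of the original report.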