indices not accounting for leading hashtag

Question

ptwobrussell opened this issue 15 years ago · 4 comments

Comparing twitter-text-rb and twitter-text-py, there is an off-by-one difference in the start index reported for hashtags: twitter-text-py does not count the leading #.

= Input =

from twitter_text.extractor import Extractor
extractor = Extractor("Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw")
for ht in extractor.extract_hashtags_with_indices():
    print(ht)

= Actual Output =

{'indices': (44, 49), 'hashtag': u'ds158'}

= Expected Output (as compared to twitter-text-rb and Twitter's production API) =

{'indices': (43, 49), 'hashtag': u'ds158'}

Answer 1 · 2010-06-29T16:40:15.000Z

text = u"Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw"
text[44:49] == 'ds158'

Seems to me like this is right. The indices should match the text returned. The Ruby library is either giving the indices for the hashtag including the leading #, or is off by 1.
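To make the two possible conventions concrete, here is a minimal sketch using plain string operations only. It makes no twitter-text calls, and the variable names are just for illustration:

```python
text = "Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw"

# Locate the hashtag (including its '#') with a plain string search.
hash_pos = text.index("#ds158")
end_pos = hash_pos + len("#ds158")

# Convention A: indices cover only the tag text, without the '#'.
without_symbol = (hash_pos + 1, end_pos)
# Convention B: indices cover the leading '#' as well.
with_symbol = (hash_pos, end_pos)

print(without_symbol, text[without_symbol[0]:without_symbol[1]])
print(with_symbol, text[with_symbol[0]:with_symbol[1]])
```

Convention A gives (44, 49) and slices to 'ds158'; convention B gives (43, 49) and slices to '#ds158'.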

Answer 2 · 2010-06-29T17:02:33.000Z

I would have thought the same thing, but considering that Twitter is using twitter-text-rb in production, my guess is that they are stripping the leading # and @ off of the entities because they're implied. I agree that it isn't intuitive, but it seems like it must be intentional on their part, and their example docs show it this way as well: http://dev.twitter.com/pages/tweet_entities

Granted, we may be the first to notice it, but that seems unlikely to me.

If you'd like, I could file a bug about this at http://code.google.com/p/twitter-api/issues/list (I don't see one addressing this) to clarify that this is the expected behavior, if that would be helpful to you. All I want is to be able to depend on twitter-text-py one way or the other and know that it'll be consistent with the production Twitter API.

Answer 3 · 2010-06-29T20:04:40.000Z

Fixing indices start offset to include preceding @ and #. Closed by 3b76c75

Answer 4 · 2010-06-29T20:06:11.000Z

I agree that the library should match the output of the Ruby library. In doing so, though, I don't want to introduce any bugs from the Ruby project. Having checked out the documentation you linked to, I've changed the indices to include the preceding characters.
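For anyone reading later, the new behavior can be illustrated roughly as follows. This is only a sketch, not the code from the commit: the pattern is a deliberately simplified stand-in for the real twitter-text hashtag regex, and hashtags_with_indices is a hypothetical helper name.

```python
import re

# Simplified, hypothetical pattern; the real twitter-text regex is more involved.
HASHTAG_RE = re.compile(r"#(\w+)")

def hashtags_with_indices(text):
    """Return hashtags whose (start, end) spans include the leading '#'."""
    results = []
    for match in HASHTAG_RE.finditer(text):
        results.append({
            "hashtag": match.group(1),  # tag text without the '#'
            "indices": match.span(0),   # span of the whole match, '#' included
        })
    return results

text = "Spring has arrived in Guilford @dailyshoot #ds158 http://bit.ly/9guArw"
for ht in hashtags_with_indices(text):
    print(ht)
```

The 'hashtag' value still omits the #, while 'indices' starts at the # itself, so text[start:end] yields '#ds158' rather than 'ds158'.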