Rounding errors found in IDs

Question

kmcelwee opened this issue 4 years ago · 10 comments

The oEmbed process revealed IDs that had been rounded:

1265004698148495400
1265812305817862100
1266080125805834200

These IDs exist in the dataset in fortune-100-blm-dataset. I believe I manually entered the values for Lowe's because of API limits, so that might be the cause.

Request oembed for all tweets ending in 00 to double check that it's limited to these tweets.
Dig into fortune-100-blm-dataset repo and double check scripts.
Pandas automatically reads the ID column as an integer. Verify that this doesn't cause issues.
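For the last item, one way to rule out pandas as a source of rounding is to read the ID column as text. The sketch below is only an illustration: the CSV content and the `Company` column are made up, and it assumes the ID column is named `ID` as in the snippets in this thread. It also shows a known pitfall: if any ID is missing, pandas falls back to float64 for the whole column, which cannot hold 19-digit integers exactly.

```python
import io

import pandas as pd

# Made-up CSV for illustration; the second row has a missing ID.
csv_text = "ID,Company\n1265004698148495361,Lowe's\n,Nike\n"

# Default parsing: the missing value forces the column to float64.
as_float = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps every digit of the ID.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})

print(as_float["ID"].dtype)
print(as_text["ID"].iloc[0])
```

Keeping IDs as strings avoids any numeric conversion, at the cost of having to cast explicitly if the values ever need to be compared as numbers.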

Answer 1 · 2020-10-10T20:02:59.000Z

df['ID'].astype(str).apply(lambda x: len(x)).value_counts()

outputs:

19    83280
18    54160
11      117
17       87
10       38
16       11

So a majority of the IDs have 19 digits.

Answer 2 · 2020-10-10T20:09:01.000Z

The maximum ID value as an integer is 1287171305138204672. It is greater than all of the rounded Lowe's values, yet it does not end in 00, which supports the argument that the rounding is limited to the Lowe's tweets.

Answer 3 · 2020-10-10T20:23:33.000Z

Looking only at IDs that end in 00 leaves us with 3158 IDs:

all_ids = [x for x in df['ID'].astype(str).tolist() if x[-2:] == '00']

Calling get_oembed(tweet_id) on each of these, the following IDs raised errors:

1139202878801715200
1063239357237248000
1192497953505792000
1266080125805834200
1265812305817862100
1265004698148495400
1176691432100249600
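The check above can be sketched roughly as follows. The implementation of `get_oembed` isn't shown in this thread, so the sketch takes it as a parameter; `ids_that_fail` and `fake_get_oembed` are hypothetical names, and the stand-in's behavior is hard-coded purely so the example runs offline.

```python
from typing import Callable, Iterable, List


def ids_that_fail(tweet_ids: Iterable[str],
                  get_oembed: Callable[[str], object]) -> List[str]:
    """Return the IDs for which the oEmbed lookup raises an exception."""
    failed = []
    for tweet_id in tweet_ids:
        try:
            get_oembed(tweet_id)
        except Exception:
            failed.append(tweet_id)
    return failed


# Hypothetical stand-in for the real lookup (no network access).
def fake_get_oembed(tweet_id: str) -> dict:
    if tweet_id == "1265004698148495400":
        raise LookupError(f"no tweet found for {tweet_id}")
    return {"html": "<blockquote>...</blockquote>"}


candidates = ["1265004698148495400", "1287171305138204672"]
failed = ids_that_fail(candidates, fake_get_oembed)
print(failed)
```

An error from the lookup does not by itself mean the ID was rounded; as the follow-up comments show, a deleted tweet fails in the same way, so each failure still needs a manual look.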

Answer 4 · 2020-10-11T01:32:36.000Z

1139202878801715200
Nike
Thu Jun 13 16:08:26 +0000 2019

It doesn’t matter what you play. Nobody wins alone. #BeTrue #UntilWeAllWin

@caster800m @TheChrisMosier @ScoutBassett @KerronClement @MarkMcKenzie4_ @EricKoston @S10bird @brittneyGriner @jordin_canada @jewellloyd https://t.co/veA9PtqwbW

✅ confirmed. This was deleted.

Answer 5 · 2020-10-11T01:43:18.000Z

1063239357237248000,Exelon,Fri Nov 16 01:16:31 +0000 2018,,

RT @Amartines: “HR does not solely own the responsibility for ensuring diversity. Leaders need to be accountable for the make up of their t…

Cannot scroll back far enough. Feed for Exelon stops in 2019. The original tweet exists though. Not exactly sure what happened here.

Answer 6 · 2020-10-11T01:48:18.000Z

1192497953505792000,IBM,Thu Nov 07 17:44:02 +0000 2019,

What does a day without IBM look like?

Watch Techless, where people must complete seemingly simple tasks without using anything that was invented by IBM or could use our technology: https://t.co/GRvjE2fz6s https://t.co/tL5UnsfCbW

✅ confirmed. This was deleted.

Answer 7 · 2020-10-11T01:49:29.000Z

1266080125805834200
1265812305817862100
1265004698148495400

✅ These are all the Lowe's tweets we already know about.

Answer 8 · 2020-10-11T01:53:12.000Z

1176691432100249600,Facebook,Wed Sep 25 02:54:33 +0000 2019

RT @boztank: See you tomorrow at #OC6

https://t.co/oFTviQaIyr

✅ Looks like the original tweet was deleted

Answer 9 · 2020-10-11T02:05:38.000Z

It seems that in the raw data pull (fortune-100-blm-dataset/data/fortune-100-json/Lowes.json), the id did not match id_str. A test has been added to test.py to check for this.
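The actual test in test.py isn't shown here, but a check of this kind might look like the sketch below. It assumes each record carries both an `id` and an `id_str` field, as Twitter API tweet objects do; the two records are made up for illustration.

```python
import json

# Made-up records in the shape of tweet objects; the first has a rounded id.
raw = """
[
  {"id": 1234567890123456800, "id_str": "1234567890123456789"},
  {"id": 1234567890123456999, "id_str": "1234567890123456999"}
]
"""

# Python's json module parses integers exactly, so no precision is lost here.
tweets = json.loads(raw)

# Collect the id_str of every record whose numeric id disagrees with it.
mismatched = [t["id_str"] for t in tweets if str(t["id"]) != t["id_str"]]
print(mismatched)
```

When the two fields disagree, `id_str` is the safer one to trust, since it is a string and cannot be altered by numeric conversion.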

Answer 10 · 2020-10-11T02:09:33.000Z

Pandas supports 64-bit integers by default, and Twitter suggests that's what it's using. I still can't figure out how that error crept in, but it should be all set now.
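One possible explanation, not confirmed anywhere in this thread, is that the IDs passed through a 64-bit float at some point, for example in a tool that treats every JSON number as a double, as JavaScript does. A float64 has a 53-bit significand, so it cannot represent every 19-digit integer. The sketch below uses a made-up ID to show what such a round trip does to the trailing digits.

```python
from decimal import Decimal

# Made-up 19-digit ID, used only for illustration.
original_id = 1265004698148495361

# Converting to float64 snaps the value to the nearest representable number.
as_float = float(original_id)
back_to_int = int(as_float)

# repr() gives the shortest decimal string that round-trips to the same float
# (at most 17 significant digits); expanding it pads the rest with zeros.
shortest_decimal = int(Decimal(repr(as_float)))

print(back_to_int == original_id)   # the exact ID is not recovered
print(back_to_int % 256)            # floats of this size are 256 apart
print(shortest_decimal)             # a 19-digit value ending in zeros
```

The zero-padded form has the same shape as the rounded Lowe's IDs, which would fit values copied from a tool that displays numbers this way, though that remains a guess.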