kmcelwee/fortune-100-blm-report

Rounding errors found in IDs

kmcelwee opened this issue · 10 comments

Oembed process revealed IDs that had been rounded:

1265004698148495400
1265812305817862100
1266080125805834200

This exists within the dataset in fortune-100-blm-dataset. I believe I manually entered values for Lowe's because of API limits, so that might be what's happening.

  • Request oembed for all tweets ending in 00 to double check that it's limited to these tweets.
  • Dig into fortune-100-blm-dataset repo and double check scripts.
  • Pandas automatically reads the ID column as an integer. Confirm that this doesn't cause issues.
df['ID'].astype(str).apply(lambda x: len(x)).value_counts()

outputs:

19    83280
18    54160
11      117
17       87
10       38
16       11

Meaning a majority of tweets have 19 digits.

The maximum ID value as an integer is 1287171305138204672, which is greater than all the Lowe's values that were rounded. So pandas holds IDs of that size as integers without trouble, supporting the argument that it's just the Lowe's tweets.
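As a rough sanity check on the integer question, here is a minimal sketch in plain Python. It compares the largest ID against the signed 64-bit range and shows where a 64-bit float stops representing every integer exactly. Whether a float conversion is what rounded the Lowe's IDs is only a hypothesis and is not confirmed anywhere in this thread; the helper name `survives_float_roundtrip` is made up for illustration.

```python
# Largest value a signed 64-bit integer (the int64 dtype) can hold.
INT64_MAX = 2**63 - 1

# Largest ID in the dataset, as reported above.
MAX_ID = 1287171305138204672

# Above 2**53, a 64-bit float can no longer represent every integer.
FLOAT_EXACT_LIMIT = 2**53


def survives_float_roundtrip(n: int) -> bool:
    """Return True if n is unchanged after being converted to a float and back."""
    return int(float(n)) == n


print(MAX_ID <= INT64_MAX)         # the largest ID fits in a 64-bit integer
print(MAX_ID > FLOAT_EXACT_LIMIT)  # but it is far past the float-safe range
print(survives_float_roundtrip(FLOAT_EXACT_LIMIT + 1))  # False: nearest float is 2**53
```

If the IDs ever pass through a float (or through a tool that parses JSON numbers as floats), digits at the low end can change, so keeping IDs as integers or strings end to end avoids the problem.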

Looking only at IDs that end in 00 leaves us with 3158 IDs:

all_ids = [x for x in df['ID'].astype(str).tolist() if x[-2:] == '00']

Using get_oembed(tweet_id) we get the following tweets that raised errors:

1139202878801715200
1063239357237248000
1192497953505792000
1266080125805834200
1265812305817862100
1265004698148495400
1176691432100249600
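The collection step can be sketched roughly as below. The real `get_oembed` is not shown in this issue, so the sketch takes the lookup function as a parameter and uses a hypothetical stand-in (`fake_get_oembed`) with made-up IDs so it runs without network access.

```python
from typing import Callable, Iterable, List


def find_failing_ids(ids: Iterable[str], check: Callable[[str], object]) -> List[str]:
    """Return the IDs for which `check` raises an exception."""
    failing = []
    for tweet_id in ids:
        try:
            check(tweet_id)
        except Exception:
            failing.append(tweet_id)
    return failing


# Hypothetical stand-in for get_oembed: raises for any ID it does not know.
# The IDs here are invented for the example.
KNOWN_IDS = {"1000"}


def fake_get_oembed(tweet_id: str) -> dict:
    if tweet_id not in KNOWN_IDS:
        raise LookupError(f"no tweet found for {tweet_id}")
    return {"id": tweet_id}


print(find_failing_ids(["1000", "2000"], fake_get_oembed))  # ['2000']
```

In practice, `all_ids` and the real `get_oembed` would be passed in place of the sample list and the stand-in.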

1139202878801715200
Nike
Thu Jun 13 16:08:26 +0000 2019

It doesn’t matter what you play. Nobody wins alone. #BeTrue #UntilWeAllWin

@caster800m @TheChrisMosier @ScoutBassett @KerronClement @MarkMcKenzie4_ @EricKoston @S10bird @brittneyGriner @jordin_canada @jewellloyd https://t.co/veA9PtqwbW

✅ confirmed. This was deleted.

1063239357237248000,Exelon,Fri Nov 16 01:16:31 +0000 2018,,

RT @Amartines: “HR does not solely own the responsibility for ensuring diversity. Leaders need to be accountable for the make up of their t…

Cannot scroll back far enough: the feed for Exelon stops in 2019. The original tweet exists, though. Not exactly sure what happened here.

1192497953505792000,IBM,Thu Nov 07 17:44:02 +0000 2019,

What does a day without IBM look like?

Watch Techless, where people must complete seemingly simple tasks without using anything that was invented by IBM or could use our technology: https://t.co/GRvjE2fz6s https://t.co/tL5UnsfCbW

✅ confirmed. This was deleted.

1266080125805834200
1265812305817862100
1265004698148495400

✅ These are all the Lowe's tweets we already know about.

1176691432100249600,Facebook,Wed Sep 25 02:54:33 +0000 2019

RT @boztank: See you tomorrow at #OC6

https://t.co/oFTviQaIyr

✅ Looks like the original tweet was deleted

It seems that in the raw data pull (fortune-100-blm-dataset/data/fortune-100-json/Lowes.json), the id did not match id_str. A test has been added to test.py to check for this.
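A check along these lines might look like the sketch below. It is not the actual contents of test.py. It assumes the raw file is a JSON list of tweet objects that each carry a numeric `id` and a string `id_str`, and it uses made-up sample values instead of reading Lowes.json. The function name `mismatched_ids` is hypothetical.

```python
import json

# Made-up sample shaped like tweet objects with both "id" and "id_str".
# The second entry simulates a numeric id that no longer matches id_str.
raw = '[{"id": 1000, "id_str": "1000"}, {"id": 2000, "id_str": "2001"}]'


def mismatched_ids(tweets):
    """Return the tweets whose numeric id disagrees with id_str."""
    return [t for t in tweets if str(t["id"]) != t["id_str"]]


tweets = json.loads(raw)
print(mismatched_ids(tweets))  # [{'id': 2000, 'id_str': '2001'}]
```

When the two fields disagree, `id_str` is the safer one to keep, since a string cannot be silently rounded.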

Pandas supports 64-bit integers by default, and Twitter suggests that's what it's using. Still can't figure out how that error crept in, but it should be all set now.