geduldig/TwitterAPI

Hydrating expansions: Replace Vs Append, Mentions

igorbrigadir opened this issue · 6 comments

I like the new option to "hydrate" the includes data, but I noticed two things. It does not hydrate mentions (they are keyed by username only, not by id), and it replaces values instead of appending to them, so downstream code that expects to see a plain value in the JSON, such as an id, gets an object instead.

I have an alternative implementation of the same idea in twarc (https://github.com/DocNow/twarc/blob/v2/twarc/expansions.py); it only appends data, and it handles mentions too.

I could probably adapt this into a PR for this library too, if there's any interest.

Can you print out some example output from your code where data in includes is appended?

Sure, here are a few examples that should cover all the types of expansions:

https://gist.github.com/igorbrigadir/8dd596bac1213851b72f9ecc0d562b15

(that should have an example of each - poll, geo, mention, retweet etc.)

Or just a preview:

tweet before:

{
  "conversation_id": "1365557828438683649",
  "text": "RT @VisitUganda: Arrival, @gad_rogers and his friends are now checking into Riverside Woods Resort for meals and day 1 accommodation. River…",
  "reply_settings": "everyone",
  "context_annotations": [
    {
      "domain": {
        "id": "65",
        "name": "Interests and Hobbies Vertical",
        "description": "Top level interests and hobbies groupings, like Food or Travel"
      },
      "entity": {
        "id": "825047692124442624",
        "name": "Food",
        "description": "Food"
      }
    },
    {
      "domain": {
        "id": "66",
        "name": "Interests and Hobbies Category",
        "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"
      },
      "entity": {
        "id": "1034860760093024257",
        "name": "Meals",
        "description": "Meals"
      }
    }
  ],
  "possibly_sensitive": false,
  "source": "Twitter for Android",
  "public_metrics": {
    "retweet_count": 26,
    "reply_count": 0,
    "like_count": 0,
    "quote_count": 0
  },
  "lang": "en",
  "referenced_tweets": [
    {
      "type": "retweeted",
      "id": "1365282326721175553"
    }
  ],
  "id": "1365557828438683649",
  "created_at": "2021-02-27T07:02:11.000Z",
  "author_id": "230485869",
  "entities": {
    "annotations": [
      {
        "start": 76,
        "end": 97,
        "probability": 0.9051,
        "type": "Place",
        "normalized_text": "Riverside Woods Resort"
      }
    ],
    "mentions": [
      {
        "start": 3,
        "end": 15,
        "username": "VisitUganda"
      },
      {
        "start": 26,
        "end": 37,
        "username": "gad_rogers"
      }
    ]
  }
}

tweet after:

{
  "conversation_id": "1365557828438683649",
  "text": "RT @VisitUganda: Arrival, @gad_rogers and his friends are now checking into Riverside Woods Resort for meals and day 1 accommodation. River…",
  "reply_settings": "everyone",
  "context_annotations": [
    {
      "domain": {
        "id": "65",
        "name": "Interests and Hobbies Vertical",
        "description": "Top level interests and hobbies groupings, like Food or Travel"
      },
      "entity": {
        "id": "825047692124442624",
        "name": "Food",
        "description": "Food"
      }
    },
    {
      "domain": {
        "id": "66",
        "name": "Interests and Hobbies Category",
        "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"
      },
      "entity": {
        "id": "1034860760093024257",
        "name": "Meals",
        "description": "Meals"
      }
    }
  ],
  "possibly_sensitive": false,
  "source": "Twitter for Android",
  "public_metrics": {
    "retweet_count": 26,
    "reply_count": 0,
    "like_count": 0,
    "quote_count": 0
  },
  "lang": "en",
  "referenced_tweets": [
    {
      "type": "retweeted",
      "id": "1365282326721175553",
      "entities": {
        "urls": [
          {
            "start": 206,
            "end": 229,
            "url": "https://t.co/HqQoG50YEp",
            "expanded_url": "https://twitter.com/VisitUganda/status/1365282326721175553/photo/1",
            "display_url": "pic.twitter.com/HqQoG50YEp"
          },
          {
            "start": 206,
            "end": 229,
            "url": "https://t.co/HqQoG50YEp",
            "expanded_url": "https://twitter.com/VisitUganda/status/1365282326721175553/photo/1",
            "display_url": "pic.twitter.com/HqQoG50YEp"
          },
          {
            "start": 206,
            "end": 229,
            "url": "https://t.co/HqQoG50YEp",
            "expanded_url": "https://twitter.com/VisitUganda/status/1365282326721175553/photo/1",
            "display_url": "pic.twitter.com/HqQoG50YEp"
          },
          {
            "start": 206,
            "end": 229,
            "url": "https://t.co/HqQoG50YEp",
            "expanded_url": "https://twitter.com/VisitUganda/status/1365282326721175553/photo/1",
            "display_url": "pic.twitter.com/HqQoG50YEp"
          }
        ],
        "hashtags": [
          {
            "start": 169,
            "end": 181,
            "tag": "VisitUganda"
          },
          {
            "start": 182,
            "end": 205,
            "tag": "TakeOnThePearlWithLove"
          }
        ],
        "annotations": [
          {
            "start": 59,
            "end": 80,
            "probability": 0.9068,
            "type": "Place",
            "normalized_text": "Riverside Woods Resort"
          },
          {
            "start": 117,
            "end": 138,
            "probability": 0.9245,
            "type": "Place",
            "normalized_text": "Riverside Woods Resort"
          },
          {
            "start": 155,
            "end": 161,
            "probability": 0.7633,
            "type": "Place",
            "normalized_text": "Sezibwa"
          }
        ],
        "mentions": [
          {
            "start": 9,
            "end": 20,
            "username": "gad_rogers",
            "url": "https://t.co/gvg87fpUZD",
            "name": "GAD ROGERS",
            "pinned_tweet_id": "1258390322473848833",
            "public_metrics": {
              "followers_count": 54075,
              "following_count": 7679,
              "tweet_count": 134716,
              "listed_count": 32
            },
            "location": "Africa",
            "entities": {
              "url": {
                "urls": [
                  {
                    "start": 0,
                    "end": 23,
                    "url": "https://t.co/gvg87fpUZD",
                    "expanded_url": "https://www.facebook.com/FlashNewUg/",
                    "display_url": "facebook.com/FlashNewUg/"
                  }
                ]
              }
            },
            "id": "1494927698",
            "protected": false,
            "description": "Social media influencer / Hustler \n                               Email: gadrogers@gmail.com\nTel: +256784311866",
            "verified": false,
            "profile_image_url": "https://pbs.twimg.com/profile_images/1353642208008826891/0WjTi63v_normal.jpg",
            "created_at": "2013-06-09T07:36:04.000Z"
          }
        ]
      },
      "conversation_id": "1365282326721175553",
      "text": "Arrival, @gad_rogers and his friends are now checking into Riverside Woods Resort for meals and day 1 accommodation. Riverside Woods Resort is 1.5kms from Sezibwa falls.#VisitUganda #TakeOnThePearlWithLove https://t.co/HqQoG50YEp",
      "reply_settings": "everyone",
      "context_annotations": [
        {
          "domain": {
            "id": "65",
            "name": "Interests and Hobbies Vertical",
            "description": "Top level interests and hobbies groupings, like Food or Travel"
          },
          "entity": {
            "id": "825047692124442624",
            "name": "Food",
            "description": "Food"
          }
        },
        {
          "domain": {
            "id": "66",
            "name": "Interests and Hobbies Category",
            "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"
          },
          "entity": {
            "id": "1034860760093024257",
            "name": "Meals",
            "description": "Meals"
          }
        }
      ],
      "possibly_sensitive": false,
      "source": "Twitter for Android",
      "attachments": {
        "media_keys": [
          "3_1365282159230152712",
          "3_1365282181405433857",
          "3_1365282243904745474",
          "3_1365282284685971463"
        ],
        "media": [
          {},
          {},
          {},
          {}
        ]
      },
      "public_metrics": {
        "retweet_count": 26,
        "reply_count": 2,
        "like_count": 94,
        "quote_count": 0
      },
      "lang": "en",
      "created_at": "2021-02-26T12:47:27.000Z",
      "author_id": "39031835",
      "author": {
        "url": "https://t.co/eh7f7S3SwP",
        "name": "Visit Uganda",
        "public_metrics": {
          "followers_count": 9679,
          "following_count": 4108,
          "tweet_count": 4510,
          "listed_count": 36
        },
        "location": "Uganda",
        "entities": {
          "url": {
            "urls": [
              {
                "start": 0,
                "end": 23,
                "url": "https://t.co/eh7f7S3SwP",
                "expanded_url": "https://www.visituganda.com",
                "display_url": "visituganda.com"
              }
            ]
          },
          "description": {
            "hashtags": [
              {
                "start": 40,
                "end": 47,
                "tag": "UGANDA"
              }
            ],
            "mentions": [
              {
                "start": 88,
                "end": 103,
                "username": "tourismboardug"
              }
            ]
          }
        },
        "id": "39031835",
        "protected": false,
        "username": "VisitUganda",
        "description": "Official Twitter handle for destination #UGANDA 🇺🇬 managed by the Uganda Tourism Board, @tourismboardug.",
        "verified": true,
        "profile_image_url": "https://pbs.twimg.com/profile_images/1139546047154458625/FCX3vc1j_normal.jpg",
        "created_at": "2009-05-10T11:12:08.000Z"
      }
    }
  ],
  "id": "1365557828438683649",
  "created_at": "2021-02-27T07:02:11.000Z",
  "author_id": "230485869",
  "entities": {
    "annotations": [
      {
        "start": 76,
        "end": 97,
        "probability": 0.9051,
        "type": "Place",
        "normalized_text": "Riverside Woods Resort"
      }
    ],
    "mentions": [
      {
        "start": 3,
        "end": 15,
        "username": "VisitUganda",
        "url": "https://t.co/eh7f7S3SwP",
        "name": "Visit Uganda",
        "public_metrics": {
          "followers_count": 9679,
          "following_count": 4108,
          "tweet_count": 4510,
          "listed_count": 36
        },
        "location": "Uganda",
        "entities": {
          "url": {
            "urls": [
              {
                "start": 0,
                "end": 23,
                "url": "https://t.co/eh7f7S3SwP",
                "expanded_url": "https://www.visituganda.com",
                "display_url": "visituganda.com"
              }
            ]
          },
          "description": {
            "hashtags": [
              {
                "start": 40,
                "end": 47,
                "tag": "UGANDA"
              }
            ],
            "mentions": [
              {
                "start": 88,
                "end": 103,
                "username": "tourismboardug"
              }
            ]
          }
        },
        "id": "39031835",
        "protected": false,
        "description": "Official Twitter handle for destination #UGANDA 🇺🇬 managed by the Uganda Tourism Board, @tourismboardug.",
        "verified": true,
        "profile_image_url": "https://pbs.twimg.com/profile_images/1139546047154458625/FCX3vc1j_normal.jpg",
        "created_at": "2009-05-10T11:12:08.000Z"
      },
      {
        "start": 26,
        "end": 37,
        "username": "gad_rogers",
        "url": "https://t.co/gvg87fpUZD",
        "name": "GAD ROGERS",
        "pinned_tweet_id": "1258390322473848833",
        "public_metrics": {
          "followers_count": 54075,
          "following_count": 7679,
          "tweet_count": 134716,
          "listed_count": 32
        },
        "location": "Africa",
        "entities": {
          "url": {
            "urls": [
              {
                "start": 0,
                "end": 23,
                "url": "https://t.co/gvg87fpUZD",
                "expanded_url": "https://www.facebook.com/FlashNewUg/",
                "display_url": "facebook.com/FlashNewUg/"
              }
            ]
          }
        },
        "id": "1494927698",
        "protected": false,
        "description": "Social media influencer / Hustler \n                               Email: gadrogers@gmail.com\nTel: +256784311866",
        "verified": false,
        "profile_image_url": "https://pbs.twimg.com/profile_images/1353642208008826891/0WjTi63v_normal.jpg",
        "created_at": "2013-06-09T07:36:04.000Z"
      }
    ]
  },
  "author": {
    "url": "",
    "name": "Oguta",
    "public_metrics": {
      "followers_count": 36,
      "following_count": 481,
      "tweet_count": 872,
      "listed_count": 0
    },
    "id": "230485869",
    "protected": false,
    "username": "ogutah_2010",
    "description": "Ugandan...",
    "verified": false,
    "profile_image_url": "https://pbs.twimg.com/profile_images/1308857069244682241/yW3enfEy_normal.jpg",
    "created_at": "2010-12-25T17:05:43.000Z"
  }
}

Geo object in tweet before:

  "geo": {
    "place_id": "28edff73b28c1a74"
  }

Geo object in tweet after:

  "geo": {
    "place_id": "28edff73b28c1a74",
    "name": "South Miami",
    "id": "28edff73b28c1a74",
    "full_name": "South Miami, FL",
    "country_code": "US",
    "place_type": "city",
    "country": "United States",
    "geo": {
      "type": "Feature",
      "bbox": [
        -80.306913,
        25.689337,
        -80.284859,
        25.734075
      ],
      "properties": {}
    }
  }

And so on. This way, any code that expects the JSON as defined by the API will still work, because fields are only added, never modified.
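For readers who want to see the append-only idea in code, here is a minimal sketch. It is not the twarc implementation linked above; the function name `append_expansions` and the trimmed sample data are assumptions made for illustration, and only the author, mention, and geo cases from the preview are covered.

```python
import copy


def append_expansions(tweet, includes):
    """Return a copy of `tweet` with matching `includes` merged in as extra fields.

    Hypothetical helper for illustration only; not part of TwitterAPI or twarc.
    """
    users = includes.get("users", [])
    users_by_id = {u["id"]: u for u in users}
    users_by_name = {u["username"].lower(): u for u in users}
    places_by_id = {p["id"]: p for p in includes.get("places", [])}

    out = copy.deepcopy(tweet)

    # author_id stays a string; the full user object goes into a new "author" field.
    author = users_by_id.get(out.get("author_id"))
    if author is not None:
        out["author"] = copy.deepcopy(author)

    # Mentions carry only a username, so they are matched by username, not by id.
    for mention in out.get("entities", {}).get("mentions", []):
        user = users_by_name.get(mention["username"].lower())
        if user is not None:
            # Keys already present on the mention (start, end, username) are kept.
            for key, value in user.items():
                mention.setdefault(key, copy.deepcopy(value))

    # place_id stays; the place's fields are added next to it.
    geo = out.get("geo")
    if geo and geo.get("place_id") in places_by_id:
        for key, value in places_by_id[geo["place_id"]].items():
            geo.setdefault(key, copy.deepcopy(value))

    return out


tweet = {
    "id": "1365557828438683649",
    "author_id": "230485869",
    "entities": {"mentions": [{"start": 3, "end": 15, "username": "VisitUganda"}]},
    "geo": {"place_id": "28edff73b28c1a74"},
}
includes = {
    "users": [
        {"id": "230485869", "username": "ogutah_2010", "name": "Oguta"},
        {"id": "39031835", "username": "VisitUganda", "name": "Visit Uganda"},
    ],
    "places": [{"id": "28edff73b28c1a74", "full_name": "South Miami, FL"}],
}
hydrated = append_expansions(tweet, includes)
```

Code that reads `hydrated["author_id"]` or `hydrated["geo"]["place_id"]` still gets the original strings; the extra data sits in new keys alongside them.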

I haven't had time to look into this yet. It's on my list.

I finally have time to look at your suggestion...

Thank you for pointing out that I neglected to hydrate mentions. I will fix this in my current implementation.

I see why you don't want to destroy the existing value of fields that get hydrated. I chose to replace the value because the result is more compact, and the original value is already present in the hydrated values anyway. However, I would consider doing it the way you suggested, but I have one change in mind, and I would like to know what you think.

It looks like you assign the hydrated values to a new field (usually). For "author_id" you put the hydrated values in "author". For a referenced tweet "id" you put the hydrated values in "entities". For "place_id" you don't put the hydrated values into their own field but add them as siblings to "place_id".

My suggestion is that for each hydrated field ("author_id", "id", "place_id", etc) the hydrated values always get placed into a new field that is a sibling to the hydrated field and named <hydrated field name>_hydrated. So, in my examples, the new fields would be called "author_id_hydrated", "id_hydrated", "place_id_hydrated". And for mentions the new field would be "username_hydrated". I think that is a little more consistent than what you have, if I understand what you have.
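To make the proposed naming concrete, here is a small sketch of the "sibling field" rule. The helper `add_hydrated_sibling` and the lookup dictionaries are hypothetical and exist only for this illustration; they are not taken from the library.

```python
def add_hydrated_sibling(obj, field, lookup):
    """Add '<field>_hydrated' next to obj[field] when `lookup` has a match.

    Hypothetical helper illustrating the naming proposal; not library code.
    """
    key = obj.get(field)
    if key in lookup:
        obj[field + "_hydrated"] = lookup[key]
    return obj


# Trimmed sample data; the real include objects carry many more fields.
users_by_id = {"230485869": {"id": "230485869", "username": "ogutah_2010"}}
users_by_name = {"VisitUganda": {"id": "39031835", "username": "VisitUganda"}}
places_by_id = {"28edff73b28c1a74": {"id": "28edff73b28c1a74", "full_name": "South Miami, FL"}}

tweet = {
    "author_id": "230485869",
    "geo": {"place_id": "28edff73b28c1a74"},
    "entities": {"mentions": [{"start": 3, "end": 15, "username": "VisitUganda"}]},
}

add_hydrated_sibling(tweet, "author_id", users_by_id)
add_hydrated_sibling(tweet["geo"], "place_id", places_by_id)
for mention in tweet["entities"]["mentions"]:
    add_hydrated_sibling(mention, "username", users_by_name)
```

Every hydrated field follows the same rule, so a consumer can find the extra data by appending the suffix to the field name it already knows.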

In v2.7.0, the hydrate_tweet parameter is replaced with hydrate_type, which takes a value from a new Enum:

from enum import Enum

class HydrateType(Enum):
    NONE = 0
    APPEND = 1
    REPLACE = 2

REPLACE replaces the value of the hydrated field with a dictionary of "include" values.
APPEND doesn't touch the hydrated field. Instead, it appends a new field (called SOMETHING_hydrate) and sets its value to a dictionary of "include" values.
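
As a rough illustration of how the two modes differ for a single field, here is a standalone sketch. The Enum is copied from the post above so the block runs on its own; `hydrate_field` and the sample data are assumptions made for this example and are not the library's internal code.

```python
from enum import Enum


class HydrateType(Enum):
    NONE = 0
    APPEND = 1
    REPLACE = 2


def hydrate_field(obj, field, lookup, hydrate_type):
    """Hydrate obj[field] from `lookup` according to `hydrate_type`.

    Hypothetical helper; it only mimics the behaviour described in this thread.
    """
    match = lookup.get(obj.get(field))
    if hydrate_type is HydrateType.NONE or match is None:
        return obj
    if hydrate_type is HydrateType.REPLACE:
        # The original value is overwritten by the include dictionary.
        obj[field] = match
    else:
        # APPEND: the original value is left alone and a new sibling field is added.
        obj[field + "_hydrate"] = match
    return obj


users_by_id = {"230485869": {"id": "230485869", "username": "ogutah_2010"}}

replaced = hydrate_field({"author_id": "230485869"}, "author_id", users_by_id, HydrateType.REPLACE)
appended = hydrate_field({"author_id": "230485869"}, "author_id", users_by_id, HydrateType.APPEND)
untouched = hydrate_field({"author_id": "230485869"}, "author_id", users_by_id, HydrateType.NONE)
```

With REPLACE, `replaced["author_id"]` is a dictionary; with APPEND, `appended["author_id"]` is still the id string and the dictionary sits under `author_id_hydrate`.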

Great, thanks!

I'll test it out later - I'm adding scripts to twarc and I'd like to make it so that people can run some tools like twarc-csv on data they got from elsewhere, like this library.