WordPress/openverse-catalog

Wikimedia audio bit_rate integer overflow

AetherUnbound opened this issue · 7 comments

Description

We received the following alert for a recent Wikimedia audio run:

SELECT aws_s3.table_import_from_s3(
  'provider_data_audio_wikimedia_commons_20220326T000000',
  '',
  'DELIMITER E''	''',
  'openverse-catalog',
  'audio/wikimedia_commons/wikimedia_audio_audio_v001_20220327000000.tsv',
  'us-east-1'
);
, parameters: None
[2022-03-27, 00:16:05 UTC] {taskinstance.py:1718} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1334, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1460, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
    result = execute_callable(context=context)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 174, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 188, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/loader.py", line 44, in load_from_s3
    sql.load_s3_data_to_intermediate_table(
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 142, in load_s3_data_to_intermediate_table
    postgres.run(
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/hooks/dbapi.py", line 205, in run
    self._run_command(cur, sql_statement, parameters)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/hooks/dbapi.py", line 229, in _run_command
    cur.execute(sql_statement)
psycopg2.errors.NumericValueOutOfRange: value "4294967294" is out of range for type integer
CONTEXT:  COPY provider_data_audio_wikimedia_commons_20220326t000000, line 375, column bit_rate: "4294967294"
SQL statement "copy provider_data_audio_wikimedia_commons_20220326T000000  from '/rdsdbdata/extensions/aws_s3/amazon-s3-fifo-15359-20220327T001605Z-0' with DELIMITER E'	'"

That value is the max value for a 32 bit integer. It appears that it's possible to get invalid values for the bit_rate column. We should probably have a case statement for these when upserting (e.g. CASE WHEN bit_rate >= 4294967296 THEN NULL ELSE bit_rate or something similar).

Reproduction

  1. Write the following record to a TSV
116428479	https://commons.wikimedia.org/w/index.php?curid=116428479	https://upload.wikimedia.org/wikipedia/commons/6/62/Sydney_tram_18_-_2016-02-14-2.ogg	\N	ogg	112343	by-sa	4.0	Andy Mabbett	https://www.wikidata.org/wiki/Q15136093	Sydney tram 18 - 2016-02-14-2	{"description": "Audio recording of tram number 18 at Central Station, Sydney, Australia. The vehicle is stationary, as passengers board, The driver sounds a warning bell, several times, to warn of the impending closure of the doors.", "global_usage_count": "1", "date_originally_created": "14 February 2016", "last_modified_at_source": "2022-03-26 14:08:25", "categories": ["2016-02-14", "Audio files by Andy Mabbett", "Audio files made using a WMUK Zoom H1", "Audio files of 2016", "Audio files of Sydney", "Central (Grand Concourse) light rail stop, Sydney", "Content media by years - Supported by Wikimedia UK - 2016", "Self-published work", "Sounds of trams", "Urbos 3 in Sydney"], "channels": "2", "license_url": "https://creativecommons.org/licenses/by-sa/4.0/", "raw_license_url": "https://creativecommons.org/licenses/by-sa/4.0"}	\N	\N	f	wikimedia_audio	wikimedia_audio	provider_api	2144	4294967294	96000	\N	\N	\N	\N
  1. Upload this record to Minio
  2. Attempt to ingest this record
  3. See error.

Screenshots

image

Additional context

Resolution

  • 🙋 I would be interested in resolving this bug.

If the number is still a valid value, changing the column type to bigint wouldn't be preferable?

What is the meaning of this bitrate anyway?

When I look at the file in an audio player, it reports the bitrate as 404 kbps, which shouldn't be nearly as large as that number I don't think... It seems like either bad data or data that we need to transform in some way to get an actual bitrate. I know FLAC has some weird "bitrate" stuff, it doesn't work the same as MP3 or other lossy compression formats (usually FLAC files I've seen don't even report a bitrate and if they do it's a very large number that doesn't correspond to what we usually think of as a useful bitrate).

Ogg Vorbis is apparently always VBR, so maybe this bitrate value isn't even useful for ogg files at all and we can just throw it out. The best you could get is an average for the file (but even that still doesn't explain the huge number here).

I confirmed that this file is indeed a Vorbis file by using the oggz tool:

$ oggz info -a Sydney_tram_18_-_2016-02-14-2.ogg
Content-Duration: 00:00:02.144
Content-Length: 109.710 kB
Content-Bitrate-Average: 419.190 kbps

Vorbis: serialno 0000029513
        221 packets in 27 pages, 8.2 packets/page, 1.138% Ogg overhead
        Content-Length: 109.710 kB
        Content-Bitrate-Average: 419.190 kbps
        Audio-Samplerate: 96000 Hz
        Audio-Channels: 2
        Page-Length-Maximum: 4.508 kB
        Page-Length-StdDev: 1.022 kB
        Packet-Length-Maximum: 3.816 kB
        Packet-Length-StdDev: 262 bytes

As you can see, the -a flag (you can also get it with just -b) also exposes the average bitrate of 419.190 kbps, which looks suspiciously close to 4294967294 but doesn't seem to be related in any way I can tell.

I wonder if it's more reliable to compute the bitrate (or average bitrate in the case of VBR files) ourselves for various files. For the file extension work we'll already need to do something with the actual file itself to get the extension, I wonder if in doing that we'll make it easier to get the bitrate ourselves?

If none of that is possible then I think throwing out values that look like nonsense is a good idea. For example, maybe just not reporting bit rate for ogg files at all considering they're probably either a lossless file in the ogg container or a VBR encoded file, neither one has super useful bitrate information necessarily (for FLAC you'd want bit depth and for VBR a couple seconds of true silence can make a high quality file look like it has a bad average bit rate anyway, it's not a useful metric necessarily, at least as far as I can tell).

These are great questions! Honestly I'm not sure what field the 419kbps bitrate is coming from on the entry's page. Here's the response we get back from the API using the same parameters as the ingestion script:

cURL: curl --request GET --url 'https://commons.wikimedia.org/w/api.php?action=query&pageids=116428479&format=json&iiprop=url%7Cuser%7Cdimensions%7Cextmetadata%7Cmediatype%7Csize%7Cmetadata&prop=globalusage%7Cimageinfo' --header 'content-type: application/json'

Response JSON
{
  "batchcomplete": "",
  "query": {
    "pages": {
      "116428479": {
        "pageid": 116428479,
        "ns": 6,
        "title": "File:Sydney tram 18 - 2016-02-14-2.ogg",
        "globalusage": [
          {
            "title": "Walter_Hunt_(inventor)",
            "wiki": "en.wikipedia.org",
            "url": "https://en.wikipedia.org/wiki/Walter_Hunt_(inventor)"
          },
          {
            "title": "User:Douglas_Coldwell/Sandboxes/Sandbox_30",
            "wiki": "en.wikipedia.org",
            "url": "https://en.wikipedia.org/wiki/User:Douglas_Coldwell/Sandboxes/Sandbox_30"
          }
        ],
        "imagerepository": "local",
        "imageinfo": [
          {
            "user": "Moderndayicarus",
            "size": 112343,
            "width": 0,
            "height": 0,
            "duration": 2.144083333333333,
            "url": "https://upload.wikimedia.org/wikipedia/commons/6/62/Sydney_tram_18_-_2016-02-14-2.ogg",
            "descriptionurl": "https://commons.wikimedia.org/wiki/File:Sydney_tram_18_-_2016-02-14-2.ogg",
            "descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=116428479",
            "metadata": [
              {
                "name": "version",
                "value": 2
              },
              {
                "name": "streams",
                "value": [
                  {
                    "name": 29513,
                    "value": [
                      {
                        "name": "serial",
                        "value": 29513
                      },
                      {
                        "name": "group",
                        "value": 0
                      },
                      {
                        "name": "type",
                        "value": "Vorbis"
                      },
                      {
                        "name": "vendor",
                        "value": "Xiph.Org libVorbis I 20120203 (Omnipresent)"
                      },
                      {
                        "name": "length",
                        "value": 2.144083333333333
                      },
                      {
                        "name": "size",
                        "value": 112343
                      },
                      {
                        "name": "header",
                        "value": [
                          {
                            "name": "vorbis_version",
                            "value": 0
                          },
                          {
                            "name": "audio_channels",
                            "value": 2
                          },
                          {
                            "name": "audio_sample_rate",
                            "value": 96000
                          },
                          {
                            "name": "bitrate_maximum",
                            "value": 0
                          },
                          {
                            "name": "bitrate_nominal",
                            "value": 4294967294
                          },
                          {
                            "name": "bitrate_minimum",
                            "value": 0
                          },
                          {
                            "name": "blocksize_0",
                            "value": 8
                          },
                          {
                            "name": "blocksize_1",
                            "value": 11
                          },
                          {
                            "name": "framing_flag",
                            "value": 0
                          }
                        ]
                      },
                      {
                        "name": "comments",
                        "value": [
                          {
                            "name": "Licence",
                            "value": "CC by-sa 4.0"
                          },
                          {
                            "name": "GENRE",
                            "value": "ambient sound"
                          },
                          {
                            "name": "TITLE",
                            "value": "Sydney Tram 18"
                          },
                          {
                            "name": "DATE",
                            "value": "2016-02-14"
                          },
                          {
                            "name": "ARTIST",
                            "value": "Andy Mabbett"
                          }
                        ]
                      }
                    ]
                  }
                ]
              },
              {
                "name": "length",
                "value": 2.144083333333333
              },
              {
                "name": "offset",
                "value": null
              }
            ],
            "extmetadata": {
              "DateTime": {
                "value": "2022-03-26 14:08:25",
                "source": "mediawiki-metadata",
                "hidden": ""
              },
              "ObjectName": {
                "value": "Sydney tram 18 - 2016-02-14-2",
                "source": "mediawiki-metadata",
                "hidden": ""
              },
              "CommonsMetadataExtension": {
                "value": 1.2,
                "source": "extension",
                "hidden": ""
              },
              "Categories": {
                "value": "2016-02-14|Audio files by Andy Mabbett|Audio files made using a WMUK Zoom H1|Audio files of 2016|Audio files of Sydney|Central (Grand Concourse) light rail stop, Sydney|Content media by years - Supported by Wikimedia UK - 2016|Self-published work|Sounds of trams|Urbos 3 in Sydney",
                "source": "commons-categories",
                "hidden": ""
              },
              "Assessments": {
                "value": "",
                "source": "commons-categories",
                "hidden": ""
              },
              "ImageDescription": {
                "value": "Audio recording of tram number 18 at Central Station, Sydney, Australia. The vehicle is stationary, as passengers board, The driver sounds a warning bell, several times, to warn of the impending closure of the doors.",
                "source": "commons-desc-page"
              },
              "DateTimeOriginal": {
                "value": "14 February 2016",
                "source": "commons-desc-page"
              },
              "Credit": {
                "value": "<span class=\"int-own-work\" lang=\"en\">Own work</span>",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "Artist": {
                "value": "<bdi><a href=\"https://www.wikidata.org/wiki/Q15136093\" class=\"extiw\" title=\"d:Q15136093\"><span title=\"freelance consultant in digital and open content matters, Wikipedian, author, Fellow of the Royal Society of Arts\">Andy Mabbett</span></a>\n</bdi>",
                "source": "commons-desc-page"
              },
              "LicenseShortName": {
                "value": "CC BY-SA 4.0",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "UsageTerms": {
                "value": "Creative Commons Attribution-Share Alike 4.0",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "AttributionRequired": {
                "value": "true",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "LicenseUrl": {
                "value": "https://creativecommons.org/licenses/by-sa/4.0",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "Copyrighted": {
                "value": "True",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "Restrictions": {
                "value": "",
                "source": "commons-desc-page",
                "hidden": ""
              },
              "License": {
                "value": "cc-by-sa-4.0",
                "source": "commons-templates",
                "hidden": ""
              }
            },
            "mediatype": "AUDIO"
          }
        ]
      }
    }
  }
}

In the metadata section, you can see the following properties:

{
  "name": "bitrate_maximum",
  "value": 0
},
{
  "name": "bitrate_nominal",
  "value": 4294967294
},
{
  "name": "bitrate_minimum",
  "value": 0
},

The min & max being 0, and the nominal being 232-2 all lead me to believe that this is a legitimately invalid value.

I think we could calculate the bitrate here ourselves, but based on your description of ogg files, it doesn't seem like it would be that necessary/helpful. I'm open to other thoughts here!

The min & max being 0, and the nominal being 232-2 all lead me to believe that this is a legitimately invalid value.

Yes, it does seem like this is just invalid data, you're right.

I did some sleuthing and found out how Wikimedia handles it.

https://phabricator.wikimedia.org/T3948

Here's the relevant code from OggHandler (I guess): https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/master/includes/handlers/OggHandler/OggHandler.php#L379

So they're just calculating it on the fly it seems (presumably cached somehow). It doesn't seem like they use the API data to render the pages. But then I'm not sure what they used to populate the API data 🤷

So they're just calculating it on the fly it seems (presumably cached somehow). It doesn't seem like they use the API data to render the pages. But then I'm not sure what they used to populate the API data shrug

That's....frustrating! 😅

I think the solution could be to move more of this stuff into ingestion. Of course then we're getting into the business of parsing output from tools like oggz and stuff. Alternatively we could use existing Python tools (or write out own) to inspect the file headers to get the information like Wikimedia is doing.

Or we can just throw away the values.

VBR bitrates are still valueable, I didn't mean to imply they were useless. It's just that in some extreme cases they can be misleading because they're just averages. But for files that are not mostly pure silence (which there are so few of) then a 96kbps VBR file is significantly different than a 300kbps VBR file. That could be useful information for Openverse user.

I agree, I think many of the values we're using could be pulled directly from the files themselves, especially since roughly half our audio files are missing bitrate information:

openledger> select count(*) from audio where bit_rate is null;
+-------+
| count |
|-------|
| 88178 |
+-------+
SELECT 1
Time: 0.191s
openledger> select count(*) from audio;
+--------+
| count  |
|--------|
| 166118 |
+--------+

I do feel that's a down-the-line concern though. I can make a separate issue for it.