Wikimedia audio bit_rate integer overflow
AetherUnbound opened this issue · 7 comments
Description
We received the following alert for a recent Wikimedia audio run:
SELECT aws_s3.table_import_from_s3(
'provider_data_audio_wikimedia_commons_20220326T000000',
'',
'DELIMITER E'' ''',
'openverse-catalog',
'audio/wikimedia_commons/wikimedia_audio_audio_v001_20220327000000.tsv',
'us-east-1'
);
, parameters: None
[2022-03-27, 00:16:05 UTC] {taskinstance.py:1718} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1334, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1460, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
result = execute_callable(context=context)
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 174, in execute
return_value = self.execute_callable()
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 188, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/openverse_catalog/dags/common/loader/loader.py", line 44, in load_from_s3
sql.load_s3_data_to_intermediate_table(
File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 142, in load_s3_data_to_intermediate_table
postgres.run(
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/hooks/dbapi.py", line 205, in run
self._run_command(cur, sql_statement, parameters)
File "/usr/local/airflow/.local/lib/python3.9/site-packages/airflow/hooks/dbapi.py", line 229, in _run_command
cur.execute(sql_statement)
psycopg2.errors.NumericValueOutOfRange: value "4294967294" is out of range for type integer
CONTEXT: COPY provider_data_audio_wikimedia_commons_20220326t000000, line 375, column bit_rate: "4294967294"
SQL statement "copy provider_data_audio_wikimedia_commons_20220326T000000 from '/rdsdbdata/extensions/aws_s3/amazon-s3-fifo-15359-20220327T001605Z-0' with DELIMITER E' '"
That value is the max value for a 32 bit integer. It appears that it's possible to get invalid values for the bit_rate
column. We should probably have a case statement for these when upserting (e.g. CASE WHEN bit_rate >= 4294967296 THEN NULL ELSE bit_rate
or something similar).
Reproduction
- Write the following record to a TSV
116428479 https://commons.wikimedia.org/w/index.php?curid=116428479 https://upload.wikimedia.org/wikipedia/commons/6/62/Sydney_tram_18_-_2016-02-14-2.ogg \N ogg 112343 by-sa 4.0 Andy Mabbett https://www.wikidata.org/wiki/Q15136093 Sydney tram 18 - 2016-02-14-2 {"description": "Audio recording of tram number 18 at Central Station, Sydney, Australia. The vehicle is stationary, as passengers board, The driver sounds a warning bell, several times, to warn of the impending closure of the doors.", "global_usage_count": "1", "date_originally_created": "14 February 2016", "last_modified_at_source": "2022-03-26 14:08:25", "categories": ["2016-02-14", "Audio files by Andy Mabbett", "Audio files made using a WMUK Zoom H1", "Audio files of 2016", "Audio files of Sydney", "Central (Grand Concourse) light rail stop, Sydney", "Content media by years - Supported by Wikimedia UK - 2016", "Self-published work", "Sounds of trams", "Urbos 3 in Sydney"], "channels": "2", "license_url": "https://creativecommons.org/licenses/by-sa/4.0/", "raw_license_url": "https://creativecommons.org/licenses/by-sa/4.0"} \N \N f wikimedia_audio wikimedia_audio provider_api 2144 4294967294 96000 \N \N \N \N
- Upload this record to Minio
- Attempt to ingest this record
- See error.
Screenshots
Additional context
Resolution
- 🙋 I would be interested in resolving this bug.
If the number is still a valid value, changing the column type to bigint wouldn't be preferable?
What is the meaning of this bitrate anyway?
When I look at the file in an audio player, it reports the bitrate as 404 kbps, which shouldn't be nearly as large as that number I don't think... It seems like either bad data or data that we need to transform in some way to get an actual bitrate. I know FLAC has some weird "bitrate" stuff, it doesn't work the same as MP3 or other lossy compression formats (usually FLAC files I've seen don't even report a bitrate and if they do it's a very large number that doesn't correspond to what we usually think of as a useful bitrate).
Ogg Vorbis is apparently always VBR, so maybe this bitrate value isn't even useful for ogg files at all and we can just throw it out. The best you could get is an average for the file (but even that still doesn't explain the huge number here).
I confirmed that this file is indeed a Vorbis file by using the oggz
tool:
$ oggz info -a Sydney_tram_18_-_2016-02-14-2.ogg
Content-Duration: 00:00:02.144
Content-Length: 109.710 kB
Content-Bitrate-Average: 419.190 kbps
Vorbis: serialno 0000029513
221 packets in 27 pages, 8.2 packets/page, 1.138% Ogg overhead
Content-Length: 109.710 kB
Content-Bitrate-Average: 419.190 kbps
Audio-Samplerate: 96000 Hz
Audio-Channels: 2
Page-Length-Maximum: 4.508 kB
Page-Length-StdDev: 1.022 kB
Packet-Length-Maximum: 3.816 kB
Packet-Length-StdDev: 262 bytes
As you can see, the -a
flag (you can also get it with just -b
) also exposes the average bitrate of 419.190 kbps, which looks suspiciously close to 4294967294
but doesn't seem to be related in any way I can tell.
I wonder if it's more reliable to compute the bitrate (or average bitrate in the case of VBR files) ourselves for various files. For the file extension work we'll already need to do something with the actual file itself to get the extension, I wonder if in doing that we'll make it easier to get the bitrate ourselves?
If none of that is possible then I think throwing out values that look like nonsense is a good idea. For example, maybe just not reporting bit rate for ogg files at all considering they're probably either a lossless file in the ogg container or a VBR encoded file, neither one has super useful bitrate information necessarily (for FLAC you'd want bit depth and for VBR a couple seconds of true silence can make a high quality file look like it has a bad average bit rate anyway, it's not a useful metric necessarily, at least as far as I can tell).
These are great questions! Honestly I'm not sure what field the 419kbps
bitrate is coming from on the entry's page. Here's the response we get back from the API using the same parameters as the ingestion script:
cURL: curl --request GET --url 'https://commons.wikimedia.org/w/api.php?action=query&pageids=116428479&format=json&iiprop=url%7Cuser%7Cdimensions%7Cextmetadata%7Cmediatype%7Csize%7Cmetadata&prop=globalusage%7Cimageinfo' --header 'content-type: application/json'
Response JSON
{
"batchcomplete": "",
"query": {
"pages": {
"116428479": {
"pageid": 116428479,
"ns": 6,
"title": "File:Sydney tram 18 - 2016-02-14-2.ogg",
"globalusage": [
{
"title": "Walter_Hunt_(inventor)",
"wiki": "en.wikipedia.org",
"url": "https://en.wikipedia.org/wiki/Walter_Hunt_(inventor)"
},
{
"title": "User:Douglas_Coldwell/Sandboxes/Sandbox_30",
"wiki": "en.wikipedia.org",
"url": "https://en.wikipedia.org/wiki/User:Douglas_Coldwell/Sandboxes/Sandbox_30"
}
],
"imagerepository": "local",
"imageinfo": [
{
"user": "Moderndayicarus",
"size": 112343,
"width": 0,
"height": 0,
"duration": 2.144083333333333,
"url": "https://upload.wikimedia.org/wikipedia/commons/6/62/Sydney_tram_18_-_2016-02-14-2.ogg",
"descriptionurl": "https://commons.wikimedia.org/wiki/File:Sydney_tram_18_-_2016-02-14-2.ogg",
"descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=116428479",
"metadata": [
{
"name": "version",
"value": 2
},
{
"name": "streams",
"value": [
{
"name": 29513,
"value": [
{
"name": "serial",
"value": 29513
},
{
"name": "group",
"value": 0
},
{
"name": "type",
"value": "Vorbis"
},
{
"name": "vendor",
"value": "Xiph.Org libVorbis I 20120203 (Omnipresent)"
},
{
"name": "length",
"value": 2.144083333333333
},
{
"name": "size",
"value": 112343
},
{
"name": "header",
"value": [
{
"name": "vorbis_version",
"value": 0
},
{
"name": "audio_channels",
"value": 2
},
{
"name": "audio_sample_rate",
"value": 96000
},
{
"name": "bitrate_maximum",
"value": 0
},
{
"name": "bitrate_nominal",
"value": 4294967294
},
{
"name": "bitrate_minimum",
"value": 0
},
{
"name": "blocksize_0",
"value": 8
},
{
"name": "blocksize_1",
"value": 11
},
{
"name": "framing_flag",
"value": 0
}
]
},
{
"name": "comments",
"value": [
{
"name": "Licence",
"value": "CC by-sa 4.0"
},
{
"name": "GENRE",
"value": "ambient sound"
},
{
"name": "TITLE",
"value": "Sydney Tram 18"
},
{
"name": "DATE",
"value": "2016-02-14"
},
{
"name": "ARTIST",
"value": "Andy Mabbett"
}
]
}
]
}
]
},
{
"name": "length",
"value": 2.144083333333333
},
{
"name": "offset",
"value": null
}
],
"extmetadata": {
"DateTime": {
"value": "2022-03-26 14:08:25",
"source": "mediawiki-metadata",
"hidden": ""
},
"ObjectName": {
"value": "Sydney tram 18 - 2016-02-14-2",
"source": "mediawiki-metadata",
"hidden": ""
},
"CommonsMetadataExtension": {
"value": 1.2,
"source": "extension",
"hidden": ""
},
"Categories": {
"value": "2016-02-14|Audio files by Andy Mabbett|Audio files made using a WMUK Zoom H1|Audio files of 2016|Audio files of Sydney|Central (Grand Concourse) light rail stop, Sydney|Content media by years - Supported by Wikimedia UK - 2016|Self-published work|Sounds of trams|Urbos 3 in Sydney",
"source": "commons-categories",
"hidden": ""
},
"Assessments": {
"value": "",
"source": "commons-categories",
"hidden": ""
},
"ImageDescription": {
"value": "Audio recording of tram number 18 at Central Station, Sydney, Australia. The vehicle is stationary, as passengers board, The driver sounds a warning bell, several times, to warn of the impending closure of the doors.",
"source": "commons-desc-page"
},
"DateTimeOriginal": {
"value": "14 February 2016",
"source": "commons-desc-page"
},
"Credit": {
"value": "<span class=\"int-own-work\" lang=\"en\">Own work</span>",
"source": "commons-desc-page",
"hidden": ""
},
"Artist": {
"value": "<bdi><a href=\"https://www.wikidata.org/wiki/Q15136093\" class=\"extiw\" title=\"d:Q15136093\"><span title=\"freelance consultant in digital and open content matters, Wikipedian, author, Fellow of the Royal Society of Arts\">Andy Mabbett</span></a>\n</bdi>",
"source": "commons-desc-page"
},
"LicenseShortName": {
"value": "CC BY-SA 4.0",
"source": "commons-desc-page",
"hidden": ""
},
"UsageTerms": {
"value": "Creative Commons Attribution-Share Alike 4.0",
"source": "commons-desc-page",
"hidden": ""
},
"AttributionRequired": {
"value": "true",
"source": "commons-desc-page",
"hidden": ""
},
"LicenseUrl": {
"value": "https://creativecommons.org/licenses/by-sa/4.0",
"source": "commons-desc-page",
"hidden": ""
},
"Copyrighted": {
"value": "True",
"source": "commons-desc-page",
"hidden": ""
},
"Restrictions": {
"value": "",
"source": "commons-desc-page",
"hidden": ""
},
"License": {
"value": "cc-by-sa-4.0",
"source": "commons-templates",
"hidden": ""
}
},
"mediatype": "AUDIO"
}
]
}
}
}
}
In the metadata
section, you can see the following properties:
{
"name": "bitrate_maximum",
"value": 0
},
{
"name": "bitrate_nominal",
"value": 4294967294
},
{
"name": "bitrate_minimum",
"value": 0
},
The min & max being 0
, and the nominal being 232-2 all lead me to believe that this is a legitimately invalid value.
I think we could calculate the bitrate here ourselves, but based on your description of ogg
files, it doesn't seem like it would be that necessary/helpful. I'm open to other thoughts here!
The min & max being 0, and the nominal being 232-2 all lead me to believe that this is a legitimately invalid value.
Yes, it does seem like this is just invalid data, you're right.
I did some sleuthing and found out how Wikimedia handles it.
https://phabricator.wikimedia.org/T3948
Here's the relevant code from OggHandler (I guess): https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/master/includes/handlers/OggHandler/OggHandler.php#L379
So they're just calculating it on the fly it seems (presumably cached somehow). It doesn't seem like they use the API data to render the pages. But then I'm not sure what they used to populate the API data 🤷
So they're just calculating it on the fly it seems (presumably cached somehow). It doesn't seem like they use the API data to render the pages. But then I'm not sure what they used to populate the API data shrug
That's....frustrating! 😅
I think the solution could be to move more of this stuff into ingestion. Of course then we're getting into the business of parsing output from tools like oggz
and stuff. Alternatively we could use existing Python tools (or write out own) to inspect the file headers to get the information like Wikimedia is doing.
Or we can just throw away the values.
VBR bitrates are still valueable, I didn't mean to imply they were useless. It's just that in some extreme cases they can be misleading because they're just averages. But for files that are not mostly pure silence (which there are so few of) then a 96kbps VBR file is significantly different than a 300kbps VBR file. That could be useful information for Openverse user.
I agree, I think many of the values we're using could be pulled directly from the files themselves, especially since roughly half our audio files are missing bitrate information:
openledger> select count(*) from audio where bit_rate is null;
+-------+
| count |
|-------|
| 88178 |
+-------+
SELECT 1
Time: 0.191s
openledger> select count(*) from audio;
+--------+
| count |
|--------|
| 166118 |
+--------+
I do feel that's a down-the-line concern though. I can make a separate issue for it.