Cleaner metadata (e.g. remove "Remastered" suffixes)

Question

YodaEmbedding opened this issue 4 years ago · 10 comments

YodaEmbedding commented 4 years ago

EDIT: I wrote a very customizable Python scrobbler, scrobblez:
https://github.com/YodaEmbedding/scrobblez

Services such as Spotify and Tidal include irrelevant suffixes within their metadata. For example:

Hallowed Be Thy Name - 2015 Remaster
Hallowed Be Thy Name (2015 Remaster)

It makes sense to remove the trailing suffixes since:

It improves the user experience of last.fm features, including comparing libraries and track counts with other users. If there are multiple names for a particular track, this becomes much more difficult or impossible!
Your own past history might contain un-remastered scrobbles, so it now contains both variants, which makes it harder to analyze album/track listening statistics.
It cleans fairly useless information out of the metadata.
The format is specific to the streaming platform (e.g. Spotify, Tidal).

A similar issue was resolved by web-scrobbler: web-scrobbler/web-scrobbler#880. They use the metadata-filter library, available on npm, which provides some useful regexes. See here:

/**
 * Filter rules to remove "Remastered..."-like strings from a text.
 */
export const REMASTERED_FILTER_RULES: FilterRule[] = [
	// Here Comes The Sun - Remastered
	{ source: /-\sRemastered$/, target: '' },
	// Hey Jude - Remastered 2015
	{ source: /-\sRemastered\s\d+$/, target: '' },
	// Let It Be (Remastered 2009)
	// Red Rain (Remaster 2012)
	{ source: /\(Remaster(ed)?\s\d+\)$/, target: '' },
	// Pigs On The Wing (Part One) [2011 - Remaster]
	{ source: /\[\d+\s-\sRemaster\]$/, target: '' },
	// Comfortably Numb (2011 - Remaster)
	// Dancing Days (2012 Remaster)
	{ source: /\(\d+(\s-)?\sRemaster\)$/, target: '' },
	// Outside The Wall - 2011 - Remaster
	// China Grove - 2006 Remaster
	{ source: /-\s\d+(\s-)?\sRemaster$/, target: '' },
	// Learning To Fly - 2001 Digital Remaster
	{ source: /-\s\d+\s.+?\sRemaster$/, target: '' },
	// Your Possible Pasts - 2011 Remastered Version
	{ source: /-\s\d+\sRemastered Version$/, target: '' },
	// Roll Over Beethoven (Live / Remastered)
	{ source: /\(Live\s\/\sRemastered\)$/i, target: '' },
	// Ticket To Ride - Live / Remastered
	{ source: /-\sLive\s\/\sRemastered$/, target: '' },
	// Mothership (Remastered)
	// How The West Was Won [Remastered]
	{ source: /[([]Remastered[)\]]$/, target: '' },
	// A Well Respected Man (2014 Remastered Version)
	// A Well Respected Man [2014 Remastered Version]
	{ source: /[([]\d{4} Re[Mm]astered Version[)\]]$/, target: '' },
	// She Was Hot (2009 Re-Mastered Digital Version)
	// She Was Hot (2009 Remastered Digital Version)
	{ source: /[([]\d{4} Re-?[Mm]astered Digital Version[)\]]$/, target: '' },
];

They also provide regex filters for other unnecessary text such as Live, Explicit, feat., and Album|Stereo|Mono|Deluxe|.... Personally, I find the Remastered filter the most critical, though.
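For anyone who wants to try these rules outside of JavaScript, here is a minimal Python sketch that applies a handful of the patterns above in sequence. The names `REMASTERED_RULES` and `strip_remaster_suffix` are made up for this illustration and are not part of metadata-filter. Trimming the leftover trailing whitespace is likewise an assumption of this sketch.

```python
import re

# A subset of the rules quoted above, as (compiled pattern, replacement) pairs.
# Hypothetical names; this is not the metadata-filter API.
REMASTERED_RULES = [
    (re.compile(r"-\sRemastered$"), ""),             # Here Comes The Sun - Remastered
    (re.compile(r"-\sRemastered\s\d+$"), ""),        # Hey Jude - Remastered 2015
    (re.compile(r"\(Remaster(ed)?\s\d+\)$"), ""),    # Let It Be (Remastered 2009)
    (re.compile(r"\(\d+(\s-)?\sRemaster\)$"), ""),   # Dancing Days (2012 Remaster)
    (re.compile(r"-\s\d+(\s-)?\sRemaster$"), ""),    # China Grove - 2006 Remaster
    (re.compile(r"[(\[]Remastered[)\]]$"), ""),      # Mothership (Remastered)
]


def strip_remaster_suffix(title: str) -> str:
    """Apply every rule in order, then trim leftover whitespace."""
    for pattern, replacement in REMASTERED_RULES:
        title = pattern.sub(replacement, title)
    return title.strip()
```

With these rules, both `Hallowed Be Thy Name - 2015 Remaster` and `Hallowed Be Thy Name (2015 Remaster)` reduce to the bare title, while titles without a suffix pass through unchanged.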

Answer 1 · 2020-09-11T19:51:08.000Z

Hi, thanks for the suggestion. I was under the assumption that Last.fm (and maybe ListenBrainz - not sure) automatically attempted to normalize track names, but I'm guessing it doesn't do that in the case of remasters etc?

I'll look into adding filters like those regexes you've linked (thanks!) soon.

Answer 2 · 2020-09-16T01:17:15.000Z

I've implemented some basic filtering for remasters in the filter-remasters branch. If it's not too much to ask, could you please check it out and see if it does what you expect? Enable using filter-remasters = true in the config. Thanks!

Answer 3 · 2020-09-30T16:21:08.000Z

Hi @YodaEmbedding, just checking in, have you been able to look at this yet? Thanks!

Answer 4 · 2020-10-05T09:33:07.000Z

Seems to miss Aces High - 2015 Remaster. I recommend:

Taking the Remastered regexes and changing them to Remaster(ed)?
Adding a more relaxed filter to deal with In The Court Of The Crimson King (Expanded & Remastered Original Album Mix), such as \([^\(]*Remaster[^$]*\)$... though it looks like last.fm correctly fixes this particular album itself.
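The two recommendations above could be sketched in Python roughly as follows. This is an illustration under assumptions: a single combined dash pattern stands in for the individual `Remaster(ed)?` rules, the relaxed pattern is assumed to start at an opening parenthesis, and the names `DASH_RULE`, `RELAXED_RULE`, and `clean` are hypothetical rather than taken from rescrobbled or metadata-filter.

```python
import re

# Recommendation 1 (sketch): accept "Remaster" as well as "Remastered" after
# a dash, with an optional year on either side.
DASH_RULE = re.compile(r"\s*-\s(\d+\s)?Remaster(ed)?(\s\d+)?$")

# Recommendation 2 (sketch): drop any trailing parenthesised group that
# mentions "Remaster" anywhere inside it.
RELAXED_RULE = re.compile(r"\s*\([^(]*Remaster[^)]*\)$")


def clean(title: str) -> str:
    """Apply both rules in order and return the cleaned title."""
    for rule in (DASH_RULE, RELAXED_RULE):
        title = rule.sub("", title)
    return title
```

A relaxed rule like this trades precision for coverage, so it is probably best applied after the stricter rules and only to the final parenthesised group.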

I could write some more tests and get back to you with a more detailed recommendation some time later this week.

I've actually ported metadata-filter to Python and have been using a personal scrobbler I wrote called scrobblez. Spotify doesn't output clean metadata for a large variety of different titles and last.fm usually doesn't fix it correctly afterwards. scrobblez allows the user to provide their own config.py file with the ability to clean via filters or provide manual overrides (example). Would it make sense to also have something similar for this repo?

My main concerns are:

Removing unnecessary information
Dealing with album artist != artist, multiple artists, "Various Artists" (classical or soundtracks)
Dealing with composer != performers (classical music) in a user-preferred way

Answer 5 · 2020-10-15T16:57:27.000Z

Strange, it does filter correctly for me (and it should be caught by the existing regexes).
Good idea, I will add that filter.

I'm currently working on porting metadata-filter to Rust for use in rescrobbled. I like the idea of a custom config to clean up metadata a lot, but I feel like allowing the user to write their own filters, manual overrides etc. would needlessly complicate the simple config of rescrobbled (and it has to be in a declarative manner). Thoughts?

As for your concerns:

Removing unnecessary information should be possible using the filters from metadata-filter, correct?
I'm not sure I have encountered this before. Do you mean allowing the user to configure whether to scrobble the artist or the album artist for a given track? Last.fm doesn't make a distinction, and currently rescrobbled simply scrobbles the first track artist listed in the MPRIS metadata.
I'm not sure if this is even possible if the music player doesn't report the composer, unless the user wanted to manually list every song/composer combination of the songs they listen to. Even then, naming seems very inconsistent among services/music players and even within Spotify. Any thoughts on this?

Answer 6 · 2020-10-16T01:15:53.000Z

Yes, the unnecessary information removal is all handled by metadata-filter.

I went with the "allow the user to fully write their own configuration" option because giving too many control knobs just increases complexity. It's best if 99% of cases "just work" but the user has the option to adjust for their specific use cases. Here's my personal scrobblez/config.py as a sample. Almost all the manual overrides are defined in a "data-oriented" format so perhaps something along these lines could be adapted for rescrobbled if giving the user "too much freedom" is unwieldy in a non-dynamic language like Rust.
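To give a rough idea of what "data-oriented" overrides can look like, here is a small sketch. It is not copied from scrobblez: the `Track` type, the `OVERRIDES` table, and `apply_overrides` are hypothetical names, and the real config.py may be organized differently.

```python
from typing import Dict, NamedTuple, Tuple


class Track(NamedTuple):
    artist: str
    title: str
    album: str


# Overrides as plain data: an exact (artist, title) key maps to the fields
# that should be replaced. Hypothetical format, for illustration only.
OVERRIDES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("Iron Maiden", "Hallowed Be Thy Name - 2015 Remaster"): {
        "title": "Hallowed Be Thy Name",
    },
}


def apply_overrides(track: Track) -> Track:
    """Return the track with any matching override applied."""
    fields = OVERRIDES.get((track.artist, track.title))
    return track._replace(**fields) if fields else track
```

Because the table is just data, something similar could in principle be expressed in a declarative config file instead of Python.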

For dealing with artists, the way I've found would work* with most classical music on Spotify is this:

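from typing import List, Tuple

# `composers` is assumed to be a user-defined collection of composer names.
def _choose_artist(artists: List[str], album_artists: List[str]) -> Tuple[str, str]:
    """Reduce list of artists to one artist."""
    artist = artists[0]
    non_composers = [x for x in album_artists if x not in composers]
    # Prefer the first non-composer; fall back to the first album artist.
    album_artist = (non_composers + album_artists)[0]
    return artist, album_artist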

This gives the format:

album_artist = primary performer (e.g. "Hilary Hahn")
artist = composer (e.g. "Wolfgang Amadeus Mozart")

* Footnote: I just realized that the Spotify desktop client only reports the first album artist. ¯\_(ツ)_/¯ Theoretically, one could look up the track via the Spotify API or do a MusicBrainz lookup to get the full list of album artists... but now we're getting complicated just for the sake of the Spotify desktop client's shortcomings.

Answer 7 · 2020-12-28T02:10:48.000Z

Ok, I've given this some thought and even implemented a proof of concept (including porting metadata-filter to Rust). This would allow for a config section looking something like:

[filters]
track = [
    { builtin = ["some", "builtin", "rules"] },
    { custom = ["some (.+) regex", "replacement"] },
    { builtin = ["another-builtin-ruleset"] },
]
# album = [...]

These filter rules get squished together into big lists for the track and album and are applied in sequence. This feels really ugly to me and also doesn't allow for much custom/dynamic behavior.
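To make "squished together and applied in sequence" concrete, here is a small Python sketch of that flattening step. It mirrors the shape of the config above using plain dicts; the rule-set names in `BUILTIN_RULES`, the sample patterns, and the functions `flatten` and `apply_rules` are all assumptions for illustration and say nothing about how the Rust proof of concept is actually written.

```python
import re
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (regex, replacement)

# Hypothetical built-in rule sets, keyed by name.
BUILTIN_RULES: Dict[str, List[Rule]] = {
    "remastered": [
        (r"\s*-\s\d+\sRemaster$", ""),
        (r"\s*\(\d+\sRemaster\)$", ""),
    ],
    "trim": [(r"^\s+|\s+$", "")],
}


def flatten(entries: List[dict]) -> List[Rule]:
    """Expand builtin/custom entries into one ordered list of rules."""
    rules: List[Rule] = []
    for entry in entries:
        for name in entry.get("builtin", []):
            rules.extend(BUILTIN_RULES[name])
        if "custom" in entry:
            pattern, replacement = entry["custom"]
            rules.append((pattern, replacement))
    return rules


def apply_rules(text: str, rules: List[Rule]) -> str:
    """Apply each rule in order; later rules see earlier results."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Same shape as the `track = [...]` list in the config section.
track_filters = [
    {"builtin": ["remastered"]},
    {"custom": [r"\s*\(feat\. .+\)$", ""]},
    {"builtin": ["trim"]},
]
```

Since the rules run strictly in order, the position of a custom rule relative to the built-in sets can change the result, which is part of why this layout offers little room for dynamic behavior.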

The other option I've considered is simply adding a config option for a custom filtering script, which would take song metadata on its standard input and write filtered metadata to stdout. Doing that greatly simplifies the implementation and configuration, but does require the user to write all filtering logic themselves. It would allow for all the use cases you've listed. *

Thoughts?

* The latter option could probably also cleanly solve #32... although solving it by telling someone to write their own filtering script feels a bit lazy. 😛

Answer 8 · 2021-05-28T01:31:22.000Z

Closing this issue. I might reopen it in the future but the filter script is a good enough solution for now.

Answer 9 · 2021-09-09T08:03:22.000Z

@InputUsername Are there example filter scripts somewhere? I agree the filter script solution is good enough, but it would be nice to have some script samples :)

Answer 10 · 2021-09-09T11:19:34.000Z

@hugoroy not currently, do you have anything specific in mind that you want to do?

You could use the following Python code as a template:

#!/usr/bin/env python

import sys

artist, title, album = (l.rstrip() for l in sys.stdin.readlines())

# manipulate artist, title and album...

print(artist, title, album, sep='\n')

In principle, any kind of script/program that reads the artist/title/album and outputs them on corresponding lines should work. If there's no output, the track is ignored (not scrobbled).
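As one possible filled-in version of the template, here is a sketch of a filter script that strips remaster suffixes from the title and album. It assumes the input is exactly three lines (artist, title, album) as in the template above. The pattern `REMASTER_SUFFIX` and the helper names are made up for this example, and passing unexpected input through unchanged is a design choice of the sketch, not documented rescrobbled behavior.

```python
#!/usr/bin/env python3
import re
import sys

# Matches a trailing "- 2015 Remaster" / "- Remastered 2015" style suffix, or
# a trailing (...) / [...] group that mentions "Remaster". Hypothetical pattern.
REMASTER_SUFFIX = re.compile(
    r"\s*(-\s(\d+\s)?Remaster(ed)?(\s\d+)?|[(\[][^()\[\]]*Remaster[^()\[\]]*[)\]])$"
)


def clean(text):
    """Remove a trailing remaster suffix, if present."""
    return REMASTER_SUFFIX.sub("", text)


def filter_lines(lines):
    """Take [artist, title, album] lines and return the cleaned triple."""
    artist, title, album = (line.rstrip("\n") for line in lines)
    return [artist, clean(title), clean(album)]


def main():
    lines = sys.stdin.readlines()
    if len(lines) != 3:
        # Unexpected input: pass it through untouched rather than guessing.
        sys.stdout.writelines(lines)
        return
    print(*filter_lines(lines), sep="\n")


if __name__ == "__main__":
    main()
```

To skip a track entirely, the script could return from `main` without printing anything, since no output means the track is not scrobbled.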

Edit: in any case, I've created issue #44 because having examples seems like a good idea.