Podcastindex-org/podping.cloud

Universal podcast identifiers

Closed this issue · 12 comments

The "iTunes ID" of a podcast has become a sort of universal identifier for a podcast in the absence of a truly open standard. With the recent breakage of that API, and the ability to "opt out" of displaying that url publicly in results, it's obvious that there needs to be a truly universal podcast id.

I propose that a podcast be assigned a universal ID when it first enters the podping cloud. That ID will then be its identifier for the life of the podcast, even as it moves among hosting platforms.

I think that the SHA-1 hash of the URL, with the protocol scheme stripped, would make a good ID. For instance:

https://podnews.net/rss --> sha1() --> 31d69452693db24bf436a60f85689999a4c622ee

That gives a predictable varchar(40) storage for directories, apps and indexes to handle.
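A minimal sketch of the proposal, assuming the scheme is removed by dropping a leading `http://` or `https://` and the remainder is hashed as UTF-8 bytes. The helper names `strip_scheme` and `feed_id` are hypothetical, and whether this reproduces the digest shown above depends on normalization details the proposal does not pin down (case, trailing slashes, encoding).

```python
import hashlib


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// (matched case-insensitively)."""
    lowered = url.lower()
    for prefix in ("https://", "http://"):
        if lowered.startswith(prefix):
            return url[len(prefix):]
    return url


def feed_id(url: str) -> str:
    """SHA-1 hex digest of the UTF-8 bytes of the scheme-stripped URL."""
    return hashlib.sha1(strip_scheme(url).encode("utf-8")).hexdigest()


# Always 40 hex characters, and identical for the http and https variants.
print(feed_id("https://podnews.net/rss"))
```

Because the scheme is stripped before hashing, a feed that switches from http to https keeps the same ID under this sketch.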

Is it assumed the URL is lowercase/uppercase?

Could it assume the character encoding is UTF-8 to support IRIs and not just ASCII?

Would it be worthwhile to prefix the hash with the hash type as with SRI? Like this:
sha1-31d69452693db24bf436a60f85689999a4c622ee

Thinking about a way to have this be extensible, similar to the unix/Modular Crypt Format.
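As a rough illustration of the prefixed form, here is a sketch that tags a hex digest with its algorithm name and parses it back. The function names and the table of allowed algorithms are assumptions made for the example. Note that real SRI values carry base64 digests; the hex form follows the example given above.

```python
import hashlib

# Assumed set of allowed algorithms and their hex digest lengths.
DIGEST_HEX_LENGTHS = {"sha1": 40, "sha256": 64}


def tagged_id(stripped_url: str, algo: str = "sha1") -> str:
    """Return '<algo>-<hex digest>' for the given (already stripped) URL."""
    if algo not in DIGEST_HEX_LENGTHS:
        raise ValueError(f"unsupported algorithm: {algo}")
    digest = hashlib.new(algo, stripped_url.encode("utf-8")).hexdigest()
    return f"{algo}-{digest}"


def parse_tagged_id(value: str) -> tuple[str, str]:
    """Split '<algo>-<hex digest>' and check the digest length."""
    algo, sep, digest = value.partition("-")
    if not sep or algo not in DIGEST_HEX_LENGTHS:
        raise ValueError(f"unrecognized id: {value}")
    if len(digest) != DIGEST_HEX_LENGTHS[algo]:
        raise ValueError(f"wrong digest length for {algo}")
    return algo, digest


print(parse_tagged_id("sha1-31d69452693db24bf436a60f85689999a4c622ee"))
```

The prefix costs a few extra characters per ID but lets a consumer tell which algorithm produced the digest without guessing from its length.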

Just for fun I looked through the PodcastIndex DB dump for Hebrew characters in feed URLs and there are none. I'm not sure if this means nobody is running a podcast with Hebrew in the URL. I created one if you need to test. I haven't submitted it to the index; it's just a copy of my experimental Brianoflondon feed. The name is just Brian of London Podcast. The Hebrew for Podcast is .... phonetically Podcast said with a Hebrew accent ;-)

https://brianoflondon.me/podcast2/%D7%91%D7%A8%D7%99%D7%90%D7%9F-%D7%9E%D7%9C%D7%95%D7%A0%D7%93%D7%95%D7%9F/%D7%A4%D7%95%D7%93%D7%A7%D7%90%D7%A1%D7%98-%D7%A9%D7%9C-%D7%91%D7%A8%D7%99%D7%90%D7%9F-%D7%9E%D7%9C%D7%95%D7%A0%D7%93%D7%95%D7%9F.xml

Most browsers show Hebrew in the URL bar but that's what I get on copy paste.
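To show why the encoding question matters for a hash-based ID, here is a small sketch using one percent-encoded path segment taken from the URL above. It decodes the segment as UTF-8 and hashes both spellings; the helper `sha1_hex` is a hypothetical name. The two forms name the same resource but produce different digests, so a spec would need to say which form gets hashed.

```python
import hashlib
from urllib.parse import quote, unquote

# One path segment from the test feed URL, as copied from the browser.
encoded_segment = "%D7%A4%D7%95%D7%93%D7%A7%D7%90%D7%A1%D7%98"

# unquote() decodes percent-escapes as UTF-8 by default.
decoded_segment = unquote(encoded_segment)


def sha1_hex(text: str) -> str:
    """SHA-1 hex digest of the UTF-8 bytes of text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Same resource, two spellings, two different IDs.
same_resource_different_ids = sha1_hex(encoded_segment) != sha1_hex(decoded_segment)

# Re-encoding the decoded form gets back to the percent-encoded spelling.
reencoded = quote(decoded_segment)

print(decoded_segment, same_resource_different_ids, reencoded == encoded_segment)
```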

Yes to utf-8 for sure.

If we introduce other hash formats, it messes up the predictability of knowing what your storage requirements will be on the database side of the app/directory. A varchar(40) column is easily indexed without much disk penalty.

I'm mostly thinking about the potential for hash collisions, which are unlikely with ~3.5 million feeds but become more likely as the number of feeds increases.

SHA-256 is a good compromise (64 hex characters vs 40)

Either way, I'd suggest storing the hash as binary rather than as a string to significantly reduce storage: SHA-1 as binary would be 20 bytes vs. a 40-byte hex string.
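A quick sketch to put numbers on both points, assuming digests behave like uniformly random values. It uses the standard birthday-bound approximation p ≈ n(n-1) / 2^(bits+1), which only holds while p is small, and compares hex and binary sizes for a SHA-1 digest. The function name `collision_probability` is hypothetical.

```python
import hashlib


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday-bound approximation of the chance of any collision."""
    return n_items * (n_items - 1) / 2 ** (bits + 1)


feeds = 3_500_000
p_sha1 = collision_probability(feeds, 160)
p_sha256 = collision_probability(feeds, 256)

# Storage: the same digest as a hex string and as raw bytes.
hex_digest = hashlib.sha1(b"podnews.net/rss").hexdigest()
raw_digest = bytes.fromhex(hex_digest)

print(p_sha1, p_sha256)
print(len(hex_digest), len(raw_digest))  # 40 characters vs 20 bytes
```

Under this approximation, accidental collisions are vanishingly unlikely at a few million feeds for either hash; the estimate says nothing about deliberately constructed collisions.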

I initially was doing the same thing, but there is a major downside to one-way hashing the url: you cannot gracefully fall back if it's a feed you've never seen. That is, most apps can do a good job if passed only the feed url at runtime, using realtime discovery to render a basic feed, even for a url they don't have in their db. This is basically the way the web works!

A truly universal link should contain the absolute feed url, one way or another. My vote is base64url(abs feed url)
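A sketch of the reversible alternative, assuming the URL is encoded as UTF-8 and the trailing `=` padding is dropped so the token is safe in a path or query string. The function names are hypothetical.

```python
import base64


def encode_feed_url(url: str) -> str:
    """base64url-encode an absolute feed URL, without '=' padding."""
    raw = base64.urlsafe_b64encode(url.encode("utf-8"))
    return raw.rstrip(b"=").decode("ascii")


def decode_feed_url(token: str) -> str:
    """Recover the feed URL, restoring any stripped padding first."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding).decode("utf-8")


token = encode_feed_url("https://podnews.net/rss")
print(token, decode_feed_url(token))
```

Unlike a hash, the token grows with the URL length, but any app can decode it and fetch the feed without a lookup.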

you cannot gracefully fallback if it's a feed you've never seen

I should have been more clear with the goal of this. The goal is to have a universal podcast ID that is roughly equivalent in purpose to the role that an iTunes ID serves now. That purpose being to serve as a static marker that apps and services can depend on never changing, while the feed url underneath it can change.

I can’t see a way around having a central registry for these numbers. But “registry” doesn’t have to mean “authority” if we use the podping network to do it. When a new feed url is seen by a podping server, it’s given a PCID and distributed. At that point it’s been “registered”. If the feed url for that PCID ever moves in the future, podping needs a new notification type to let the network know.
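The register-then-move flow could be sketched roughly as below. This is an in-memory toy, not a description of how podping works: the `Registry` class, its method names, and the use of a random UUID as the PCID are all assumptions for illustration, and distribution across the network is left out entirely.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy mapping between feed URLs and stable PCIDs."""
    url_to_pcid: dict[str, str] = field(default_factory=dict)
    pcid_to_url: dict[str, str] = field(default_factory=dict)

    def register(self, feed_url: str) -> str:
        """Return the existing PCID for a URL, or assign a new one."""
        if feed_url in self.url_to_pcid:
            return self.url_to_pcid[feed_url]
        pcid = uuid.uuid4().hex  # placeholder ID format (assumption)
        self.url_to_pcid[feed_url] = pcid
        self.pcid_to_url[pcid] = feed_url
        return pcid

    def move(self, pcid: str, new_url: str) -> None:
        """Handle a 'feed moved' notification: same PCID, new URL."""
        old_url = self.pcid_to_url[pcid]
        del self.url_to_pcid[old_url]
        self.url_to_pcid[new_url] = pcid
        self.pcid_to_url[pcid] = new_url


registry = Registry()
pcid = registry.register("https://podnews.net/rss")
registry.move(pcid, "https://example.com/podnews/rss")
print(pcid, registry.pcid_to_url[pcid])
```

The point of the sketch is that the PCID stays fixed while the URL behind it changes, which is the property the iTunes ID provides today.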

Ah, then I'd suggest changing the title to avoid confusion, "Universal podcast identifiers" -> "Podping podcast identifier" or "Public podcast identifier" etc

"Universal" traditionally means across apps, whereas it sounds like you just mean public (not internal).

Twitter had an interesting blog post about how they generate tweet ids across multiple distributed workers:
https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html
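For readers who don't want to follow the link, here is a rough Snowflake-style sketch: a 64-bit integer built from a millisecond timestamp, a worker number, and a per-millisecond sequence. The 41/10/12 bit split follows the commonly described Snowflake layout; the class name, the `epoch_ms` default, and the injectable `clock` are assumptions for the example, and the code is not thread-safe.

```python
import time


class SnowflakeLikeGenerator:
    """Timestamp (ms) | 10-bit worker id | 12-bit sequence."""

    def __init__(self, worker_id: int, epoch_ms: int = 0, clock=None):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            # Same millisecond: bump the sequence, wrapping at 4096.
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:
                # Sequence exhausted; wait for the next millisecond.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - self.epoch_ms) << 22) | (self.worker_id << 12) | self.sequence


gen = SnowflakeLikeGenerator(worker_id=3)
print(gen.next_id(), gen.next_id())
```

Each worker can mint IDs without coordinating with the others, as long as worker numbers are unique and clocks do not run backwards.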

This is a pretty similar concept to Lamport Timestamps, which are very common in distributed systems. Primarily useful in places where multiple systems have to agree on the value of something. I used them for a very basic implementation of the Paxos consensus algorithm. Definitely could be useful for node-to-node communication at some point.

If the goal is a one time registration why not stick with a standard UUID?

UUIDv5 takes the SHA-1 of a namespace UUID (this could be the standard URL namespace, or there could be a dedicated podcast namespace UUID) plus a name, which here would be the lowercased feed URI without the protocol, as originally proposed.

Takes the best of both worlds with UUID being a known quantity.

In the unlikely event of a collision, you can append a random string to the feed URL, so podnews.net/rss might become podnews.net/rss#47293
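A sketch of this suggestion using Python's standard `uuid` module, assuming the standard URL namespace and a normalization that lowercases the URL and strips the scheme. The helper names and the retry scheme for collisions are hypothetical; a dedicated podcast namespace UUID could be passed in place of `uuid.NAMESPACE_URL`.

```python
import secrets
import uuid


def normalize(feed_url: str) -> str:
    """Lowercase the URL and drop a leading http:// or https://."""
    lowered = feed_url.lower()
    for prefix in ("https://", "http://"):
        if lowered.startswith(prefix):
            return lowered[len(prefix):]
    return lowered


def podcast_uuid(feed_url: str, namespace: uuid.UUID = uuid.NAMESPACE_URL) -> uuid.UUID:
    """Deterministic UUIDv5 derived from the normalized feed URL."""
    return uuid.uuid5(namespace, normalize(feed_url))


def podcast_uuid_avoiding(feed_url: str, taken: set[uuid.UUID]) -> uuid.UUID:
    """On a collision, retry with a random '#<digits>' suffix."""
    candidate = podcast_uuid(feed_url)
    while candidate in taken:
        suffix = secrets.randbelow(100_000)
        candidate = uuid.uuid5(uuid.NAMESPACE_URL, f"{normalize(feed_url)}#{suffix}")
    return candidate


print(podcast_uuid("https://podnews.net/rss"))
```

One trade-off worth noting: once a random suffix is used, the ID can no longer be recomputed from the feed URL alone, so the suffix would have to be recorded somewhere.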

But, what is the benefit of UUID v5 over just a plain sha1? The sha1 is shorter. Or, maybe it's not a technical reason you're thinking of but just a "this is the expected use case for this type of thing"? Which, btw, I'm cool with that line of reasoning.

UUID is actually smaller and shorter: it's 128 bits (32 hex digits, or 36 characters with dashes) vs. 160 bits (40 hex characters) with SHA-1.

The expected use case is a big reason, if only because almost every database has a documented way of handling and storing them. For example, both SQL Server and PostgreSQL have native storage types for UUIDs, and MySQL has built-in functions to convert them to binary for efficient storage. Plus, generally speaking, if you give someone a UUID string, most developers know what to do with it.

I think there would be a lot less resistance in general with that route, and with v5 there's still a documented route of transforming feedURL -> UUID with a hash as opposed to choosing something random.
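The size comparison can be checked directly with the standard library. This sketch only measures representations in Python; how a given database stores them is outside its scope, and the variable names are illustrative.

```python
import hashlib
import uuid

feed_uuid = uuid.uuid5(uuid.NAMESPACE_URL, "podnews.net/rss")
sha1_digest = hashlib.sha1(b"podnews.net/rss")

sizes = {
    "uuid_text": len(str(feed_uuid)),          # 36 characters with dashes
    "uuid_hex": len(feed_uuid.hex),            # 32 hex digits
    "uuid_binary": len(feed_uuid.bytes),       # 16 bytes
    "sha1_hex": len(sha1_digest.hexdigest()),  # 40 hex digits
    "sha1_binary": len(sha1_digest.digest()),  # 20 bytes
}

# A UUID round-trips between its text and binary forms without loss.
restored = uuid.UUID(bytes=feed_uuid.bytes)

print(sizes, restored == feed_uuid)
```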

UUID is actually smaller and shorter

Ah, you're right! Thanks for the correction.

I think there would be a lot less resistance in general with that route, and with v5 there's still a documented route of transforming feedURL -> UUID with a hash as opposed to choosing something random.

Makes perfect sense.