Anarios/return-youtube-dislike

(Feature Request): Query a hash prefix to improve privacy

Opened this issue · 18 comments

Extension or Userscript?

Both

Request or suggest a new feature!

As can be seen here, the API currently receives the full video id. SponsorBlock's API instead receives only a hash prefix of the video id, which improves user privacy.

Ways to implement this!

  1. Add an endpoint in the API to receive a hash prefix and return data from all videos that match this prefix;
  2. Make the clients use this endpoint.
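As a rough illustration of the two steps above, here is a minimal Python sketch of a prefix endpoint and a client that filters the response locally. The names `PrefixIndex`, `full_hash` and `client_lookup` are hypothetical, the in-memory dictionary only stands in for the real API (whose source is not available here), and a single SHA-256 is an assumption, not a hash the project has settled on.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Optional

PREFIX_LEN = 4  # hex characters sent to the server (assumption, mirrors SponsorBlock)


def full_hash(video_id: str) -> str:
    """Hex-encoded SHA-256 of the video id (hypothetical choice of hash)."""
    return hashlib.sha256(video_id.encode("utf-8")).hexdigest()


class PrefixIndex:
    """In-memory stand-in for the server-side store, grouped by hash prefix."""

    def __init__(self) -> None:
        self._buckets: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def add(self, video_id: str, likes: int, dislikes: int) -> None:
        h = full_hash(video_id)
        self._buckets[h[:PREFIX_LEN]][h] = {"likes": likes, "dislikes": dislikes}

    def query(self, prefix: str) -> Dict[str, dict]:
        """Return every record whose full hash starts with the given prefix."""
        return dict(self._buckets.get(prefix, {}))


def client_lookup(index: PrefixIndex, video_id: str) -> Optional[dict]:
    """Send only the prefix, then pick the matching record on the client."""
    h = full_hash(video_id)
    candidates = index.query(h[:PREFIX_LEN])  # the server only sees the prefix
    return candidates.get(h)
```

In this sketch the server never learns which of the returned records the client actually wanted; it only learns the prefix.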

SponsorBlock's hash is SHA-256 with hex encoding, iterated 5000 times; it then crops the first 4 characters and sends them to the server. I would reuse the same idea, since it has already been tested, but I think PBKDF2 is better than a hand-written for loop for three reasons: it is far more widely used than this loop construction, which gives more confidence in its security; it is more optimized (it ran up to 37 times faster in my tests with the same iteration count; I think the issue with the SponsorBlock code is that WebCrypto is slow when iterated because it depends on promises); and it is the only PBKDF available in WebCrypto.
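To make the comparison concrete, the sketch below shows both constructions side by side, in Python rather than WebCrypto. The function names, the fixed salt and the default parameters are assumptions for illustration only; this is not SponsorBlock's code.

```python
import hashlib


def iterated_sha256_prefix(video_id: str, iterations: int = 5000, prefix_len: int = 4) -> str:
    """Hash the id, keep re-hashing the hex digest, then crop a short prefix."""
    digest = video_id
    for _ in range(iterations):
        digest = hashlib.sha256(digest.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def pbkdf2_prefix(
    video_id: str,
    iterations: int = 5000,
    prefix_len: int = 4,
    salt: bytes = b"example-fixed-salt",  # assumption: a public constant shared by all clients
) -> str:
    """Derive 32 bytes with PBKDF2-HMAC-SHA256 and crop a short hex prefix."""
    derived = hashlib.pbkdf2_hmac("sha256", video_id.encode("utf-8"), salt, iterations, dklen=32)
    return derived.hex()[:prefix_len]
```

The two functions generally produce unrelated prefixes for the same id, so clients and server would have to agree on one construction.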

I can work on the second point only, as I could not find the source code of the API. Using the userscript code as a reference, I think I just need to add a function that calls window.crypto.subtle.deriveBits with the video id and returns the hash prefix, then make setState() call this function before calling doXHR(). I think the changes in the extensions are similar. The only caveat I can imagine with this approach is WebCrypto not being available to userscripts or extensions in some browser. If some browser has this issue, then I think the fallback would be sending the video id as it currently works, since a pure JavaScript implementation would be too slow, which is not desirable.

Can you work on this?

  • Yes (in the client code)
  • No (in the server code)

We currently need the full videoId because some data is still being requested from the Google API. Once we turn off Google API usage completely, it can be done.

I'm wondering how it would affect db query performance, though. And what changes in DB structure can be made to improve it.

Currently videoId is a clustered index - so search by full id is very performant. Searching by substring would kill performance - so we'd need a new column - hashedPrefix, and we'd make it a clustered index instead?
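One way to picture the extra column is the sketch below, which uses SQLite as a stand-in (the real database engine and schema are not shown in this thread). The table and column names are hypothetical, and the prefix column gets an ordinary secondary index here; whether it should replace videoId as the clustered index is left open.

```python
import hashlib
import sqlite3
from typing import List, Tuple


def hash_prefix(video_id: str, length: int = 4) -> str:
    """First `length` hex characters of SHA-256(video_id) (assumed hash)."""
    return hashlib.sha256(video_id.encode("utf-8")).hexdigest()[:length]


def create_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE videos (
            video_id    TEXT PRIMARY KEY,
            hash_prefix TEXT NOT NULL,
            likes       INTEGER NOT NULL,
            dislikes    INTEGER NOT NULL
        )
        """
    )
    # The prefix is precomputed on insert, so lookups are an indexed
    # equality match instead of a substring search.
    conn.execute("CREATE INDEX idx_videos_hash_prefix ON videos (hash_prefix)")
    return conn


def insert_video(conn: sqlite3.Connection, video_id: str, likes: int, dislikes: int) -> None:
    conn.execute(
        "INSERT INTO videos (video_id, hash_prefix, likes, dislikes) VALUES (?, ?, ?, ?)",
        (video_id, hash_prefix(video_id), likes, dislikes),
    )


def find_by_prefix(conn: sqlite3.Connection, prefix: str) -> List[Tuple[str, int, int]]:
    cur = conn.execute(
        "SELECT video_id, likes, dislikes FROM videos WHERE hash_prefix = ? ORDER BY video_id",
        (prefix,),
    )
    return cur.fetchall()
```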

from all videos that match this prefix;

Another concern - how many collisions would there be? The SponsorBlock db is around 2 million videos?
I think we're going to have over 4 billion records, so a four-character hash prefix could return too many matches? Hundreds, potentially?
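A back-of-the-envelope estimate, assuming the hash spreads ids uniformly over hex prefixes (the helper names are hypothetical, and the record counts are the rough figures quoted above):

```python
def expected_matches(total_videos: int, prefix_hex_chars: int) -> float:
    """Average number of records sharing one prefix, assuming a uniform hash."""
    return total_videos / (16 ** prefix_hex_chars)


def min_prefix_len(total_videos: int, max_matches: float) -> int:
    """Shortest hex prefix whose average bucket size is at most max_matches."""
    length = 0
    while expected_matches(total_videos, length) > max_matches:
        length += 1
    return length


print(expected_matches(2_000_000, 4))      # about 30.5 records per prefix
print(expected_matches(4_000_000_000, 4))  # about 61,035 records per prefix
print(min_prefix_len(4_000_000_000, 100))  # 7 hex characters
```

Under these assumptions, four hex characters over 4 billion records would mean tens of thousands of matches per prefix rather than hundreds, so a longer prefix would probably be needed, at the cost of a smaller set of candidates to hide in.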

To clarify, SponsorBlock only hashes once for searching segments. The hash isn't the security guarantee, the prefix is.

@ajayyy That's true. My idea of using PBKDF2 comes from my experience with Mega, which used a similar construction (iterated AES instead of iterated SHA) but later migrated to PBKDF2. Because PBKDF2 is a standard, there are good implementations already (WebCrypto supports it), so you get faster results at the same security level, or you can increase the number of iterations to get better security in the same execution time.

By the way, I only noticed that SponsorBlock uses an iterated hash yesterday, when researching existing work for this issue. I always assumed it used something much simpler, such as a single SHA-1 hash prefix like HIBP's Pwned Passwords. I know that migrating hashes is hard, so it is too late to change SponsorBlock's implementation, but it is not too late for this project: I hope that if this gets implemented here it uses something with better performance, like a SHA-512 prefix, or even PBKDF2 if desired.

I am working on a rating system as well (ajayyy/SponsorBlock#1039), so it would be nice if the hashes are the same. Since the hash is just done once per video on device, I think any performance advantages would be minimal.

@ajayyy I had not considered that; both projects using the same hashes is better than switching hashes for better performance. Also, it is always possible to replace the current WebCrypto implementation with an optimized one if desired.

I think most of the overhead is caused by WebCrypto having a promise interface. I tested js-sha256 and it is 20 times faster (using hashHex = sha256(hashHex)). That is not the 37 times of PBKDF2, but still a good improvement. js-sha256 was used before; switching to WebCrypto was suggested by xPaw (I imagine he did not know about WebCrypto's performance issues). Is better performance needed? No. I pretty much only commented on it because I saw the code and quickly remembered my bad experience trying to implement WebCrypto in megajs, but thinking again, it is not worth the tradeoff, as it only loads data for one video at a time and only hashes 5000 times in special cases.

Does it depend on the browser engine? Is Firefox also slow?

@ajayyy I was testing on Chromium (96.0.4664.45), as it shares the same engine as most popular browsers, but here are some results: https://gist.github.com/qgustavor/abf66fc4d8faf62c7e2f1a9e64cc077e

I am working on a rating system as well (ajayyy/SponsorBlock#1039), so it would be nice if the hashes are the same. Since the hash is just done once per video on device, I think any performance advantages would be minimal.

Speaking of that, it would be great if the extension integrated with SponsorBlock's rating system once it is ready, maybe?

@Sominemo yep, each rating will be associated with positive or negative, so will be able to be compatible

I think I just came up with a solution.

Let's make the extension never request a single video id. It is planned to display a ratio bar near every thumbnail. So why not request all videos at once, thereby making the watch history unavailable to the server (because the server will never know which single video out of the 50 requested you actually watched).

Randomize id order before request, and voila

@ajayyy, @qgustavor - what do you think, is it good enough?
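A minimal sketch of the batching idea, assuming the client already has the ids of the thumbnails on the page. `build_batch` is a hypothetical helper, not part of the extension.

```python
import random
from typing import Iterable, List, Optional


def build_batch(
    watched_id: str,
    thumbnail_ids: Iterable[str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge the watched id with the thumbnail ids and shuffle the result.

    After shuffling, the position of an id in the request says nothing
    about which video is actually being watched.
    """
    rng = rng or random.SystemRandom()
    # dict.fromkeys removes duplicates (the watched video may also be a thumbnail).
    unique_ids = list(dict.fromkeys([watched_id, *thumbnail_ids]))
    rng.shuffle(unique_ids)
    return unique_ids
```

The server would then return ratings for every id in the batch, which the extension needs anyway for the ratio bars.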

(because the server will never know which single video out of the 50 requested you actually watched).

Randomize id order before request, and voila

I think that's better than nothing, and accusing somebody based on that would be almost like accusing those who followed the links in issue #335 several times, although still much more effective.

@Anarios well, it sounds good on paper. but there are a few things that need consideration:

  • somewhat important: computing resources needed to find the information (in other words, don't cause a ddos by and against yourself)
    maybe implement it on the client side and not the server side, but that's for another issue if it exists (idk if it does)
    • what about other suggested videos that get loaded on scroll?
  • main point: the suggested videos are the output of a black box, and the black box is personalized for everyone.
    like yo, you just sent this person's entire recommendations to the server. is it really a good idea? because those recommendations are based on the watch history logged by youtube (and subscriptions and the black box itself, potentially), and you could link interests to the ip address (but probably not to an identity for an average internet user).
    with that, if a bad guy with high power (power to acquire identity information from an isp about an ip address, and maybe other shenanigans) managed to collect that type of data and found someone of interest, the "bad guy" could probably go after that person of interest.

it could be possible, but i'd say it's a low chance that it'll happen (the "bad guy" could probably be better off with youtube instead). still, i'm not sure if the solution is best to improve privacy since interests are also included now.

also, i think the feature still leaves a bit of being trackable if the requests are logged with info like ip address and requested ids. say you have a list of ids requested and you know it's that one ip address. the user has gone to the home page, it sends a request of many ids. the user clicks on a video, it sends another request of many ids. there's something common between those 2 requests (in an ideal condition where request 1 is homepage/next video and request 2 is also next video), which is both of them have this video id.
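The linking described above amounts to intersecting the id sets of consecutive requests from the same address. A small sketch with made-up ids (`narrow_candidates` is a hypothetical name):

```python
from typing import Iterable, Set


def narrow_candidates(requests: Iterable[Iterable[str]]) -> Set[str]:
    """Ids present in every logged request; with few overlaps this can
    single out the video the user clicked."""
    sets = [set(r) for r in requests]
    if not sets:
        return set()
    return set.intersection(*sets)


# Request 1: ids shown on the home page. Request 2: ids sent from the watch page.
home_page = ["vid01", "vid02", "vid03", "vid04"]
watch_page = ["vid03", "vid10", "vid11", "vid12"]
print(narrow_candidates([home_page, watch_page]))  # {'vid03'}
```

In practice recommendations overlap more than in this toy example, so the intersection may contain several ids rather than exactly one.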

i'd say it's a trade off between requesting info for 1 thing and for 51 things (from both a technical and a privacy viewpoint). but since a feature like a ratio bar on every video thumbnail is in the plans, i don't think the tradeoff can be made by the developers of the extension (i sound very newbie saying those last few words) and personally i think it should be up to the user to decide if they want to see dislike ratios on suggested videos (probably not with the default settings).

i'm just thinking a lot about it, and how it could be abused. and how does it improve privacy. in conclusion, i'm not confident the feature would improve privacy significantly (obscurity? yes), and the solution is probably not perfect. but i also could be completely wrong about everything in this entire comment too or just a bit extreme with the cases.

edit: also, i somehow spent an entire hour just writing and thinking to make this comment of possibilities. i probably need a break.

computing resources needed to find the information

But I'll need that data anyways. It's not for privacy only - it's just to show rates near every video

i'm not sure if the solution is best to improve privacy since interests are also included now.

well, you can either have an exact watch history, or very fuzzy youtube recommendations; the second is preferable, I thought.

Hashing doesn't give you perfect anonymity either (if I understand it correctly) - it gives you a list of possible collisions. So you can tell "well, this user watched one of these ten videos for sure". Instead of "this user watched this exact video for sure".

So, not that different from my approach? I think. I might be wrong.

@Anarios with the choice of exact, exact but hidden, and vague, yeah i think the vague could be more preferable.

well, you can either have an exact watch history, or very fuzzy youtube recommendations; the second is preferable, I thought.

Recommendations consist of videos related to this particular video and videos related to the user's interests, so if the user disables the extension before following some videos, but YouTube remembers them, recommendations from those videos will leak.

10 videos is just over 3 bits of entropy (log2(10) ≈ 3.32), I think, though I am not sure how that'd be used in calculations.
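The conversion from the number of equally likely candidates to bits is log2(n). A short sketch to check the figures for the set sizes mentioned in this thread (`anonymity_bits` is a hypothetical helper and assumes every candidate is equally likely, which real viewing habits are not):

```python
import math


def anonymity_bits(candidates: int) -> float:
    """Bits of uncertainty about which of `candidates` equally likely videos was watched."""
    if candidates < 1:
        raise ValueError("need at least one candidate")
    return math.log2(candidates)


print(round(anonymity_bits(10), 2))  # 3.32
print(round(anonymity_bits(50), 2))  # 5.64
```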

@Anarios

well, you can either have an exact watch history, or very fuzzy youtube recommendations; the second is preferable, I thought.

What about searching a random word from the title of the current video and using some of the results with some from the recommendations?

On paper, this would provide videos related to each other without exposing your youtube recommendations as much.
The major problem is that this could end with 2 "kinds" of videos being sent (imagine a video with the following title "Most sites have a 'hidden' link": the extension searches 'hidden', which returns things related to neither websites nor links). Something that could be done to reduce the "false search terms" would be to add a filter with some basic words that add no relevant meaning (connectors, etc.), but the problem would exist anyway.
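A minimal sketch of the word filter, using the example title from above. The stop-word list is a tiny illustrative sample and the function names are hypothetical.

```python
import random
import re
from typing import List, Optional

# Illustrative sample only; a real filter would need a much longer list.
STOP_WORDS = frozenset(
    {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for", "with", "have", "has", "is", "are", "most"}
)


def candidate_terms(title: str) -> List[str]:
    """Lowercase the title, split it into words and drop the stop words."""
    words = [w.strip("'") for w in re.findall(r"[a-z0-9']+", title.lower())]
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]


def pick_search_term(title: str, rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick one remaining word at random, or None if nothing is left."""
    terms = candidate_terms(title)
    if not terms:
        return None
    return (rng or random.Random()).choice(terms)


print(candidate_terms("Most sites have a 'hidden' link"))  # ['sites', 'hidden', 'link']
```

The filter removes the connectors, but 'hidden' survives, so the off-topic search described above can still happen.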

This approach shouldn't add overhead to the server (aside from the 50 video queries, but that's required for this kind of "anonymization"), because the search would be done on the extension side. This would require (I guess) getting the user's token and using it against the YouTube API, or reverse engineering the current search page; I'll investigate this.

The major problem for the UX is that there would be 2 quite big requests, making the dislikes load with more latency. I could try to build a demo of this idea to check whether it is noticeable. We can discuss this further on Discord 😄

Feels like too much overhead to me.

This wouldn't add performance overhead, unless you implement it incorrectly.

The way this should work is that the RYD server should have a static storage of files.
Each file would contain either CSV or JSON data with the like and dislike counts corresponding to each video.

All the server would need to do is serve these static assets. Even better, because they are static files, good performance is much easier to achieve (and they can be cached for a limited period of time).

And of course, when new data comes in, the corresponding file should be updated too.

And there should be an internal database, separate from the extension-facing one, which would contain all of the raw video IDs and their like and dislike counts.
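A minimal sketch of how such files could be generated from the internal records, assuming one JSON file per hash prefix and SHA-256 as the hash. The file layout and the function name `export_static_files` are assumptions, not something specified in this thread.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Tuple


def export_static_files(
    records: Dict[str, Tuple[int, int]],  # raw video id -> (likes, dislikes)
    out_dir: Path,
    prefix_len: int = 4,
) -> List[Path]:
    """Write one JSON file per hash prefix, keyed by full hash instead of raw id."""
    buckets: Dict[str, Dict[str, dict]] = {}
    for video_id, (likes, dislikes) in records.items():
        h = hashlib.sha256(video_id.encode("utf-8")).hexdigest()
        buckets.setdefault(h[:prefix_len], {})[h] = {"likes": likes, "dislikes": dislikes}

    paths = []
    for prefix, payload in buckets.items():
        path = out_dir / f"{prefix}.json"
        path.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
        paths.append(path)
    return paths
```

A client would then fetch the file named after its prefix from any static file server or cache and look up its full hash locally. Re-running the export for a changed record only needs to rewrite the one file for that prefix.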


Also, I believe you should host separate end servers at different geographical locations (only for data retrieval). For example:

  • Netherlands
  • Brazil
  • US
  • Singapore
  • Poland
  • India
  • South Africa
  • Germany

And the extension should use the closest geographically located server.