minvws/nl-covid19-coronacheck-mobile-core

Concern: config size denylist

Closed this issue · 14 comments

Describe the bug, issue or concern

I'm not sure to what repo this ticket belongs. Since the denylist code is implemented in the mobilecore, and it concerns the config of the mobilecore, I'm posting it here.

At the time of writing the config size of the mobilecore is ~64 KiB. The size of the config excluding the international / national denylists is ~12 KiB. I'm estimating, with the various layers of encodings on top of the first 128 bits of the sha256 hash that are on the denylist, a single entry / blocked qr code increases the config size, as downloaded from the NGINX server, by ~43 bytes.

I don't know what a good upper limit on the config size would be, but I'm assuming around 5 MiB because the app still has to work during (semi-)large events in the middle of nowhere, where stable internet connections are not guaranteed. Assuming this size increase keeps scaling linearly this would mean that at ~120K blocked QR codes the config size would approach ~5 MiB. At the time of writing there are ~1.2K entries on the denylist, around ~1% of this limit. It is not unlikely to assume that the number of blocked QR codes will keep growing.

compression

Currently the NGINX that serves the config does not have compression enabled by default. At the time of writing the config size of the mobilecore is ~64 KiB. With GZIP compression this would reduce down to ~35 KiB and with Brotli it would reduce down to ~33 KiB. I ran some simulations, where I generated a list of random, somewhat uniformly distributed sha256 hashes and replaced the hashes in the config with those randomly generated ones. This simulates the config filesize as downloaded from the NGINX. From the generated curves it is clearly visible that compression scales quite linearly on this simulated config. The compression ratio on this data is at ~2 for GZIP and at ~2.2 for Brotli.

binary file format

Since the current denylist approach uses the first 128 bits / 16 bytes of a sha256 hash, a binary file format might be considered where the file is chunked in 16 byte parts, one for each blocked QR. Since the binary data will most likely be close to uniformly distributed additional GZIP / Brotli compression would probably perform poor on such data. If the 16-byte chunks are stored in a sorted order an efficient binary search algorithm could be utilized to search for a hash in the binary file. Simulated results show that the ratio when switching to such a file format would be significantly better than GZIP / Brotli compression on the current config file. Where the ratio for GZIP is at ~2, Brotli ~2.2 and binary ~2.8.

denylist_binary
Image1: sketchup of a binary file format that uses 16-byte chunks, one for each entry on the denylist.

simulation results

image2
Image2: simulated config filesize as the denylist grows in size.

image3
Image3: simulated config compression ratio as the denylist grows in size. Higher is better. E.g. binary scores ~2.8 which means it can store ~2.8x more items in the denylist as compared to the original config.

Python example binary file format

#!pip install coronacheck-tools pandas
"""
This is a quick-and-dirty implementation of a
conversion script that converts the denylist to
a chunked binary format and a binary search algo on this
binary format.
""""
from coronacheck_tools.verification.verifier import readconfig

import pandas as pd
import base64
import io
import random


def mobilecore_config_denylist_to_bin(denylist):
    """
    Returns a sorted and chunked bytestring, currently at 16 bytes per entry
    """
    return b''.join(sorted([base64.b64decode(x) for x in denylist.keys()]))

def binary_search_denylist(search_hash, denylist_bin, hash_len=16):
    """
    Do a binary search on the denylist_bin
    """
    listsize = len(denylist_bin)
    elements = listsize // hash_len
    
    idx_high = elements
    idx_low = 0
    steps = 1
    running = True
    while running:
        idx_mid = idx_low + (idx_high-idx_low) // 2
        consider_hash = denylist_bin[idx_mid*hash_len:idx_mid*hash_len+hash_len]
        
        if consider_hash == search_hash:
            return True, steps
        
        if idx_low == idx_mid or idx_high == idx_mid:
            running = False
        
        if consider_hash < search_hash:
            idx_high = idx_high
            idx_low = idx_mid
        elif consider_hash > search_hash:
            idx_high = idx_mid
            idx_low = idx_low
            
        
        steps += 1
    return False, steps

def test_all_hash(denylist_bin, hash_len=16):
    """
    Step through all hashes on the list and check if they can be found
    by the the binary search algo
    """
    elements = len(denylist_bin) // hash_len
    
    stepsdata = []
    
    total_steps = 0
    for idx in range(0, elements):
        h = denylist_bin[idx*hash_len:idx*hash_len+hash_len]
        result, steps = binary_search_denylist(h, denylist_bin)
        total_steps += steps
        
        assert result == True
        
        stepsdata.append({
            'hash': h.hex(),
            'steps': steps
        })
        
    return pd.DataFrame(stepsdata).describe()

def test_nonexistent(denylist_bin):
    nonexistent_search_hash = b'1234567890ABCDEF'
    result, steps = binary_search_denylist(nonexistent_search_hash, denylist_bin)
    
    assert result == False
    
    return steps

config = readconfig()

dhc_denylist = config['mobilecore']['config']['domesticVerificationRules']['proofIdentifierDenylist']
ehc_denylist = config['mobilecore']['config']['europeanVerificationRules']['proofIdentifierDenylist']

dhc_bin = mobilecore_config_denylist_to_bin(dhc_denylist)
ehc_bin = mobilecore_config_denylist_to_bin(ehc_denylist)


dhc_results = test_all_hash(dhc_bin)
ehc_results = test_all_hash(ehc_bin)
print('DHC')
print(f'Filesize {round(len(dhc_bin) / 1024, 2)} KiB with {len(dhc_bin) // 16} blocked QR codes')
print(dhc_results)
print('Non-existent hash takes', test_nonexistent(dhc_bin), 'steps')
print()

print('EHC')
print(f'Filesize {round(len(ehc_bin) / 1024, 2)} KiB with {len(ehc_bin) // 16} blocked QR codes')
print(test_all_hash(ehc_bin))
print('Non-existent hash takes', test_nonexistent(ehc_bin), 'steps')
print()
DHC
Binary filesize 15.53 KiB with 994 blocked QR codes

Lookup steps over all items in bloclist:
            steps
count  994.000000 # total steps when looking up all codes on the list one-by-one
mean     8.980885 # avg number of steps for the binary search to find the hash partial
std      1.389298 # stddev on number of steps needed for lookup
min      1.000000 # min steps needed for lookup
25%      8.000000 # quartiles
50%      9.000000 # quartiles
75%     10.000000 # quartiles
max     10.000000 # max. number of steps to find a QR code
Non-existent hash takes 12 steps # number of steps needed for a lookup of a code that is not on the list

EHC
Filesize 4.06 KiB with 260 blocked QR codes
            steps
count  260.000000
mean     7.069231
std      1.339558
min      1.000000
25%      7.000000
50%      8.000000
75%      8.000000
max      9.000000
Non-existent hash takes 10 steps

Thanks for all the considerations and simulations. We're very aware that the current implementation is not very scalable, as entries are redownloaded with every fetch of the config, and as you mention the JSON encoding is not at all space-efficient (not to mention abusing the JSON parser to get hash table lookups for free). This was a conscious choice.

We're currently considering breaking the list up in separate files that are downloaded only once, and using CBOR to get a space-efficient binary encoding. Webserver compression is disabled by default due to best-practice settings against CRIME-like attacks, although they don't apply to static CDN files. As you mention, it wouldn't help much when compared to properly binary encoding the entries.

Ah yes, splitting it up. That sounds like a bloom filter. Perhaps similar to how HaveIBeenPwnd does it? That should scale well. :)

Will the QR code itself use CBOR? E.g. replace the current ASN.1 with CBOR? Or would this be used for a more efficient representation of the denylist only? What is the added benefit of using CBOR over a pure binary file as described in my post?

Perhaps totally off-topic, but may I ask why individual QR's are denied in the first place?
I always assumed only private keys are blocked?

Also, great research @Sikerdebaard :)

Perhaps totally off-topic, but may I ask why individual QR's are denied in the first place? I always assumed only private keys are blocked?

Also, great research @Sikerdebaard :)

Because they are being massively shared.... :) for some reason people do not like governments to take away their freedom. This repo is a blackspot on the essence open source stands for.

Perhaps totally off-topic, but may I ask why individual QR's are denied in the first place?

You can read the official reasoning, in Dutch, in the memorandum on the regeling, which is attached to this kamerbrief.

This repo is a blackspot on the essence open source stands for.

Feel free to qualify that statement. This repository is unconditionally open source; you can see what's going on. All measures have a majority in parliament, which are representatives of and elected by the public. Regardless of how you or I might disagree with those measures.

Perhaps totally off-topic, but may I ask why individual QR's are denied in the first place?

You can read the official reasoning, in Dutch, in the memorandum on the regeling, which is attached to this kamerbrief.

This repo is a blackspot on the essence open source stands for.

Feel free to qualify that statement. This repository is unconditionally open source; you can see what's going on. All measures have a majority in parliament, which are representatives of and elected by the public. Regardless of how you or I might disagree with those measures.

A certain person won the German elections 31 July 1932 he had also had a majority in Parliament. In fact most dictators are voted in office. Something people tend to forget when they yell how foul the comparison is. Don't tell me its totally different and all that. I bet a year ago you would have told me this would never be used to get people fired wel guess whats going to happen. The "I just work here" or "I just carried out orders" luckily never have and wil stand up in any court. But no need to get your hands dirty You know where the PM button is.

Perhaps since were Nerds wel I am, this might be more fitting.
https://c.tenor.com/fPKoFolJK9gAAAAd/starwars-revengeofthesith.gif

Can we please stay on-topic instead of turning this into some nocturnal proof of Godwin's law? I'm trying to raise a valid concern here. Any other discussion is out-of-scope.

Ah yes, splitting it up. That sounds like a bloom filter. Perhaps similar to how HaveIBeenPwnd does it? That should scale well. :)

Not a bloom filter, as that introduces probability and false positives, but just a list that has been split up, perhaps per day.

Will the QR code itself use CBOR? E.g. replace the current ASN.1 with CBOR? Or would this be used for a more efficient representation of the denylist only? What is the added benefit of using CBOR over a pure binary file as described in my post?

The QR-code certainly could use CBOR, but because of legacy reasons it'll probably stay just ASN.1 encoded. Reasons for using CBOR for the denylist would be to be able to add more structure to the list, for example to handle deletions as well as additions. The CBOR library we used is pretty robust and well tested, so we wouldn't have to invent some ASN.1 structure ourselves.

A certain person...

Ah, an immediate Godwin, and no qualification of "blackspot on the essence open source" at all.

Thanks @confiks for the docs. Personally I'd suggest a combination your (@Sikerdebaard) solution, with a timestamp when the identifier was added: identfier:timestamp or something. Assuming a lifetime of 1 year (paper QR), everything older than one year can be very simply removed. This way, the list won't keep growing, it's size remains manageable. You could even chunk the download process, but that's for a later time.

Also, I surely hope our ministry doesn't have the time to block digital QR's, because of their 24h lifetime, so their share should be very low, right?

Furthermore, even though @spacecabbie might be very passionate, he has a valid point: We must remain critical! We — developers, nerds, computer fanatics — have the power to shape the future, as we're living in the digital age. It is up to us to decide what features we accept or deny to build. Our ministries need our guidance, let's not forget, they don't know what they're dealing with, heck they don't even know how to spell sha-256 :)

The acceptance of the deny-list feature for example: Abuse of QR's should be impossible, because the checking person verifies by the person's ID. In practice this doesn't happen, so we've added a deny list. Ok, but it was not the application failing, it's the checking person failing, classic PEBCAC. The result: The ministry has a tool to randomly (as can be read in the regeling: "indien er gegrond vermoeden bestaat…" trans: "in case of well-founded suspicion") invalidate QR's — read: deny people access. They wouldn't, because function creep = not ok, but they could, and imho, that's bad.

Furthermore, even though @spacecabbie might be very passionate, he has a valid point: We must remain critical! We — developers, nerds, computer fanatics — have the power to shape the future, as we're living in the digital age. It is up to us to decide what features we accept or deny to build. Our ministries need our guidance, let's not forget, they don't know what they're dealing with, heck they don't even know how to spell sha-256 :)

Not passionate tbh at this point it has become defense and i adopt Sun Tzu view on defense. Because in 2 weeks I cannot go to any shops anymore and I am moving houses in 1 month. Zo just because I have a principle I cannot decorate the house or make adjustments.

The acceptance of the deny-list feature for example: Abuse of QR's should be impossible, because the checking person verifies by the person's ID. In practice this doesn't happen, so we've added a deny list.

This happens because most people disagree fundamentally.

Ok, but it was not the application failing, it's the checking person failing, classic PEBCAC. The result: The ministry has a tool to randomly (as can be read in the regeling: "indien er gegrond vermoeden bestaat…" trans: "in case of well-founded suspicion") invalidate QR's — read: deny people access. They wouldn't, because function creep = not ok, but they could, and imho, that's bad.

Couple this with the average antivaxer fondness of conspiracy theories and the dutch government lack of explaining how and what they are doing to prevent it, and you have another conspiracy. Also the QR code shared of that persoon who might be innocent because his code was stolen.

A certain person...

Ah, an immediate Godwin, and no qualification of "blackspot on the essence open source" at all.

Granted I was emotional when I wrote that and after the call to stay on topic by the .. I'll skip what I think of him.
I had planned to just leave it since the discussion should not have been brought here anyway. But you had to go and taunt. So I reread and stand by my statement. Its as easy to claim Godwin as it is to reference to that period unfounded, I do not believe I did that. The similarities are scary at times. Take away emotion or pretended "shock" and just compare some of the facts.

Any ways I am a proud person as wel and do not like to admit mistakes, But yea to my horor there is no mention of morality in or about open source. I was convinced it had said something against use for discriminating or the likes. It does not. Happy ?
Still doubt people would be as happy to contribute to for example an open source project that woulp china to block internet for its people.
.

fobrs commented

All measures have a majority in parliament, which are representatives of and elected by the public.

That's not true. 'De eerste Kamer' still has to give it's vote for placing this flue-like virus on the A-status list. Until then all measures are not lawful: https://www.eerstekamer.nl/wetsvoorstel/35401_wijziging_wet_publieke.

Closing this issue since it's not serving it's intended purpose. @confiks feel free to delete it if this gets out of hand.

Looks like you've thought about solving the issue with the denylist a lot already. I'll be looking forward to seeing your solution so that I can implement it in coronacheck-tools as well.

edit - should anyone be interested, I'm keeping track on the number of blocked QR codes here and here.

Not passionate tbh at this point it has become defense and i adopt Sun Tzu view on defense. Because in 2 weeks I cannot go to any shops anymore and I am moving houses in 1 month. Zo just because I have a principle I cannot decorate the house or make adjustments.

I thought I'd use the word 'passionate' to mellow it down a bit. It's my hope to bring both parties closer together, open the dialogue instead of immediately calling each-other names. But as soon as governments make truth, all researchers go home. If I ever add technical suggestions, I'd always try to enforce better user privacy, for whatever that's worth.

Let's make one thing clear, I wouldn't even touch this code with a stick, not even if it would buy me a house. I strongly believe the resulting apps are going over our fundamental rights like they're nothing. Let's not forget, fundamental rights are absolute and should not be changed in any way or form. Now, governments all over the world change these rights like it's nothing! And the worst, people still believe this is about Covid?!

So yeah, It's my hope to make people aware. And I monitor changes closely, if there's ever a p00p, I'll be shooting it.

I feel you. Luckily I'm not moving, but no apartheids-QR for me ether, because reasons.
I wish you all the best of luck!

There, now I'm surely off-topic. Or perhaps, I'm on the only important topic there actually is?
Just some food to think about, when you're pushing your next commit :)