chromium/hstspreload.org

/api/v2/update consistently times out

nharper opened this issue · 8 comments

This happens both on the production site and when running locally (e.g. via make serve).

To reproduce locally:

  1. Run make serve
  2. Load http://localhost:8080/api/v2/update. About 40,000 entries will be loaded into the local database before the last datastore update fails: (rpc error: code = DeadlineExceeded desc = retry failed with context deadline exceeded; last error: context deadline exceeded)
  3. Load http://localhost:8080/api/v2/update. This times out: Internal error: could not retrieve domain names previously marked as preloaded. (rpc error: code = DeadlineExceeded desc = retry failed with context deadline exceeded; last error: context deadline exceeded)

Locally, this reproduces not only on master (a7b7953) but also on earlier revisions such as 246e763 and f0da6bf.

Sounds like the database is finally too large! This was definitely something I wanted to keep simple until we hit scaling issues, so I guess now is that time. Feel free to revise any architecture decisions. :-)

I did some experimenting with increasing the timeout on my workstation and here are the results:

15 second timeout: First call to /api/v2/update succeeds in loading all 51,718 HSTS entries into the database. Subsequent requests still time out.
30 and 45 second timeouts: Same behavior as the 15 second timeout.
60 second timeout: First call succeeds in loading; second call results in server halting execution:

$ make serve
PASS: Current directory is in $GOPATH.
git rev-parse HEAD > ./frontend/version
go run *.go -local
Setting up local database... checking database connection... done.
Serving from: http://localhost:8080
2018/06/11 15:34:01 Metadata fetch failed: Get http://metadata/computeMetadata/v1/instance/attributes/gae_project: dial tcp: lookup metadata on 127.0.0.1:53: no such host
exit status 1
Makefile:71: recipe for target 'serve' failed
make: *** [serve] Error 1

If I reduce the 60 seconds to 59 seconds, that error goes away. The error appears to come from https://github.com/golang/appengine/blob/master/internal/metadata.go#L35; I haven't tracked down why that gets called from appengine.Main() in server.go, but I think it's a red herring for the problem here.

I find it hard to believe that this failure is just due to the size of the database. Fetching and parsing the JSON file from the Chromium source, which contains the entire contents of the database, takes about 1 second (when running the server locally). Even /api/v2/debug/all-states times out, and that is literally just asking the database for all of its contents. Maybe I'm naive to think that if I can write 50,000 records in under 15 seconds, I should be able to read them all back in a similar amount of time. (Based on the behavior of the /api/v2/debug/all-states endpoint, I don't think this is a problem with the app; rather, it's a problem with Google Cloud Datastore.)

I added some logging to PutStates and statesForQuery (the latter with a trial change to use Client.Run and Next instead of Client.GetAll) to measure how many entries can be written to or read from the database per second. On my workstation, I can write between 526 and 4,500 rows per second but read only 184. (When writing, the first batch of 450 entries takes around 800ms; the time per batch falls off quickly, reaching a steady state under 100ms per batch after about 25 batches.) I don't know how closely this matches the rates in production, but it still seems crazy that writes proceed 24 times faster than reads.
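For reference, batching 450 entries at a time keeps each write under Datastore's 500-entity PutMulti limit. A batch-splitting helper like the one below (hypothetical, not the project's actual code) shows where the per-batch timing measurements above would sit:

```go
package main

import "fmt"

// batchRanges splits n items into [start, end) index pairs of at most
// size items each, e.g. for feeding successive PutMulti calls.
func batchRanges(n, size int) [][2]int {
	var out [][2]int
	for start := 0; start < n; start += size {
		end := start + size
		if end > n {
			end = n
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	// 51,718 entries in batches of 450 -> 115 batches, the last partial.
	b := batchRanges(51718, 450)
	fmt.Println(len(b), b[len(b)-1]) // 115 [51300 51718]
}
```

Timing each PutMulti call over these ranges is what produced the "first batch ~800ms, steady state <100ms" numbers.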

Are you testing using the local emulator or a prod/test environment connecting to GCE?
(Let me know if you don't know how to do the latter. I don't have it off the top of my head, but it should be easy to reconstruct.)

The local emulator is written in Java, and I would caution against using it for any real benchmarks.

That was testing with the local emulator. My next step is to deploy to a test App Engine instance and Datastore; it should be easy enough to create a new project and deploy to that instead of the production project.

Okay, this works for me:

# 1. Create a new project in Google Cloud.
# 2. Select project:
gcloud init
# 3. Get default credentials.
gcloud auth application-default login
# 4. Change the prod project ID to the project you created in step 1 (unless you want to connect to real prod):

prodProjectID = "hstspreload"

# 5. Run
make bulk-preloaded
go run *.go

You can then try:

There's only one local flag, so you'll have to separate the notion of "local app" from "local database" if you want to use the debug endpoints:

if *local {
	server.HandleFunc("/api/v2/debug/all-states", a.DebugAllStates)
	server.HandleFunc("/api/v2/debug/set-preloaded", a.DebugSetPreloaded)
	server.HandleFunc("/api/v2/debug/set-rejected", a.DebugSetRejected)
}
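Splitting the single -local flag into separate app and database flags could look like this (hypothetical flag names -local-app and -local-db; the project currently defines only -local):

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags separates "run the app locally" from "use the local
// database", so a local binary can point at a remote (test or prod)
// Datastore while still exposing the debug endpoints.
func parseFlags(args []string) (localApp, localDB bool, err error) {
	fs := flag.NewFlagSet("server", flag.ContinueOnError)
	fs.BoolVar(&localApp, "local-app", false, "serve on localhost and enable debug endpoints")
	fs.BoolVar(&localDB, "local-db", false, "use the local Datastore emulator")
	err = fs.Parse(args)
	return
}

func main() {
	// Local app against a remote database: debug endpoints on,
	// emulator off.
	app, db, _ := parseFlags([]string{"-local-app"})
	fmt.Println(app, db) // true false
}
```

The debug-endpoint registration above would then be gated on the app flag rather than on a single shared -local.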

However, it shouldn't be too hard to use this setup to make a copy of the real prod environment that you can test against with local code.

(And you should also be able to deploy to your test App Engine using make deploy this way.)

Today I had 5 unsuccessful attempts (four 500 errors, one 502 from nginx) loading /api/v2/update before it succeeded. The successful run took 38.4s.