palewire/django-calaccess-raw-data

Discrepancy between expected and actual zip file size

Closed this issue · 9 comments

On Sunday (8/14) and Monday (8/15) mornings the SoS zip files posted at the regular time (11:20 UTC), and on both days the update process ran successfully without interruption: The zips and raw files were added to the S3 bucket, the website pages baked and were published (hooray!).

However, two warnings were logged because the expected size of the zip and the actual zip file size were not equal.

  • On Sunday morning, the expected size was 764,915,558 bytes but the downloaded zip file was slightly smaller: 764,878,673.
  • On Monday morning, the expected size was 764,917,377 bytes but the downloaded zip file was slightly smaller: 764,915,558.

So, weirdly, the actual zip size on Monday is the same as the expected size on Sunday. It's as if each day we've somehow downloaded the zip from the previous day.

I re-downloaded each day's archive of the dbwebexport.zip from our S3 bucket and confirmed that the values recorded in the RawDataVersion.download_zip_size are the actual sizes of the zip file archived for each day.

I then re-downloaded Mon zip from the SoS and...lo and behold... it's actual size is the same as the expected size: 764,917,377 bytes. I downloaded this file twice with the same result: first, manually via Chrome and then again using our downloadcalaccessrawdata management command.

So now I'm thinking something is happening on their end where these HTTP response values are being updated before the new zip file is actually available to download.

More evidence: The previous most recent snapshot we've tracked in our production environment was released on Fri, 8/12, (no new zip file was posted on Sat, 8/13): The expected and actual size of that zip was 764,878,673, which is the same number of bytes as the zip we downloaded on Sun. The only difference is that we downloaded the zip much later on Fri, at 21:50, after I pushed some fixes to the raw-data app and repeated the end-to-end update process for the download's site.

So, three proposals:

  1. Set the cron on the prod server to run a little later each day.
  2. Add a little more logging to downloadcalaccessrawdata and updatecalaccessrawdata. Specifically, log what's in the download metadata each time we request it (regardless of when we download the zip). And I'm also including the ETag or "entity-tag", which should be a different value if the zip file has changed. And log the actual zip size.
  3. Set up an hourly cron, perhaps on the DEV server to call downloadcalaccessrawdata with the --force-restart so that we can start getting a better sense of exactly when we should be download. And maybe we should reach out with our findings to the SoS website admins as well.

If this actually is the problem, then at some point we might loop back to replace the warning in the downloadcalaccessrawdata command with an exception, and if that exception is raised, try the download again five minutes later, or something like that. I don't know if streaming the download might have something to do with the problem, but we might also try again without streaming.

Ran the update 20 mins later this morning, at 11:45 UTC, and the expected and actual zip size are the same.

I am going to go ahead and raise an exception in the download command if the expected and actual file size are different, and catch and re-try in the update command. Also will keep experimenting with getting the earliest possible time to start the update.

Also, the last-modified value in the HTTP response header is Tue, 16 Aug 2016 11:23:00 GMT, three minutes later than usual. Not sure if that has any significance.

Adding more weirdness to the stew...

Download metadata on 16/Aug/2016 at 11:45:05

ETag: "2320c8-2d989574-923e0d00"
Last-Modified: 2016-08-16 11:23:00+00:00
Content-Length: 764974452
Actual zip size (after download): 764974452

On At 17/Aug/2016 at 11:25:03:

ETag: "2320c8-2d99f99a-aa010780"
Last-Modified: 2016-08-17 11:21:18+00:00
Content-Length: 765065626
Actual zip size (after download): 764974452

Note how all the response headers have changed, but the downloaded zip is the same size as yesterday's.

At 11:40:02:

ETag: "2320c8-2d989574-923e0d00"
Last-Modified: 2016-08-16 11:23:00+00:00
Content-Length: 764974452
Actual zip size (after download): 764974452

All the response headers have reverted to yesterday, and the downloaded zip size is still the same as yesterday.

At 11:55:02:

ETag: "2320c8-2d99f99a-aa010780"
Last-Modified: 2016-08-17 11:21:18+00:00
Content-Length: 765065626
Actual zip size (after download): 765065626 bytes

Now all the response headers are updated again, and the download zip size matches the expected size.

And now let's get EVEN WEIRDER.

The logs above were collected by the dev server. Meanwhile on the prod server, the normal daily routine was ran. Both the update and the download commands make HEAD requests and write the response headers to the log. Here is what that looked this morning:

DEBUG|17/Aug/2016 11:30:02|updatecalaccessrawdata|ETag: "2320c8-2d99f99a-aa010780"
DEBUG|17/Aug/2016 11:30:02|updatecalaccessrawdata|Last-Modified: 2016-08-17 11:21:18+00:00
DEBUG|17/Aug/2016 11:30:02|updatecalaccessrawdata|Content-Length: 765065626
INFO|17/Aug/2016 11:30:02|updatecalaccessrawdata|New CAL-ACCESS version available.
DEBUG|17/Aug/2016 11:30:03|downloadcalaccessrawdata|ETag: "2320c8-2d989574-923e0d00"
DEBUG|17/Aug/2016 11:30:03|downloadcalaccessrawdata|Last-Modified: 2016-08-16 11:23:00+00:00
DEBUG|17/Aug/2016 11:30:03|downloadcalaccessrawdata|Content-Length: 764974452
DEBUG|17/Aug/2016 11:30:03|__init__|Downloading ZIP file
DEBUG|17/Aug/2016 11:30:37|downloadcalaccessrawdata|Download zip size: 764974452 bytes

So a second later (or maybe like every other time we make the request) the response headers all reverted to yesterday's values, and apparently the zip did as well.

Also, the update of the downloads website this morning (8/17) completed, but had a several problems. I think these arose from the scenario outlined above, when a new zip is posted, but the HTTP response yields yesterday's headers and/or content.

The symptoms: downloads/latest/ has today's date -- Wednesday, Aug. 17, 2016 at 11:21 a.m. -- and there's a clean.zip available to download, but no links to the individual files appear. Looking back at the most recent RawDataVersion db record, the update_start_datetime and update_finish_datetime are populated with today's values, but the following fields are empty:

  • download_zip_archive
  • download_zip_size
  • download_start_datetime
  • download_finish_datetime
  • extract_start_datetime
  • extract_finish_datetime

But if you look at the previous RawDataVersion db record for the Tuesday, 8/16 release, the download and extract datetime fields all have 8/17 values. Also the clean_zip_size values are the same for the Tues and Wed records.

So looking back at the prod server log lines pasted above:

  1. updatecalaccessrawdata made a HEAD request
  2. Their server responded with a new Last-modified value dated 8/17
  3. updatecalaccessrawdata creates a new RawDataVersion instance and calls downloadcalaccessrawdata
  4. downloadcalaccessrawdata makes another HEAD request to figure out which version it's working with
  5. Their server responds with a Last-modified value dated 8/16
  6. So downloadcalaccessrawdata fetches the 8/16 RawDataVersion instance and writes the download start and finish datetimes to it (probably the download_zip_archive and download_zip_size too).
  7. The zip downloads fine, and it's actual size matches yesterday's expected size, so downloadcalaccessrawdata finishes without an error.
  8. updatecalaccessrawdata calls extractcalaccessrawfiles
  9. extractcalaccessrawfiles needs to figure out which instance of RawDataVersion it needs to assign to all the RawDataFile instances it creates, so it gets the version with the most recent download_start_datetime. Which, again, is the 8/16 instance.
  10. extractcalaccessrawfiles finishes fine, but the extract start and finish datetimes are written to the 8/16 RawDataVersion instance.
  11. updatecalaccessrawdata calls cleancalaccessrawdata for each file in the csv/ directory, assuming these are all files for the 8/16 RawDataVersion (which...they are).
  12. updatecalaccessrawdata calls loadcalaccessrawdata for each file in the tsv/ directory, working under the same assumption.

So the point of failure is between when the update command calls the download command, and the SoS server decides to pull the ol' switch-a-roo. I propose a patch that checks if downloadcalaccessrawdata was called by updatecalaccessrawdata, then compare the RawDataVersion instances before downloading.

A little more info:

$ dig +vc campaignfinance.cdn.sos.ca.gov a
; <<>> DiG 9.8.3-P1 <<>> +vc campaignfinance.cdn.sos.ca.gov a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41090
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;campaignfinance.cdn.sos.ca.gov.    IN  A

;; ANSWER SECTION:
campaignfinance.cdn.sos.ca.gov. 300 IN  CNAME   dbexport.sos.ca.gov.cdn184.raxcdn.com.
dbexport.sos.ca.gov.cdn184.raxcdn.com. 300 IN CNAME raxcdn.com.mdc.edgesuite.net.
raxcdn.com.mdc.edgesuite.net. 27 IN CNAME   a1907.dscw14.akamai.net.
a1907.dscw14.akamai.net. 20 IN  A   23.62.239.40
a1907.dscw14.akamai.net. 20 IN  A   23.62.239.8

;; Query time: 89 msec
;; SERVER: 128.206.130.244#53(128.206.130.244)
;; WHEN: Wed Aug 17 13:47:22 2016
;; MSG SIZE  rcvd: 207

So I think this is saying they have a CNAME chain that resolves to two IP addresses at akamai.net. If you run this dig command multiple times in succession, then you'll note that the IP addresses sometimes switch order. The sys admin I talked to here in the office suggested this might signify a load-balancing strategy called round-robin DNS.

My guess is that one of these two servers is getting behind the other around the time we start our update. Since we probably lose our requests connection between the HEAD request in the update command and the HEAD request in the download command, that might be the point at which they're switching us from one to the other. That proposal I made in the previous comment should catch that scenario.

I don't yet know if it's possible that, within downloadcalaccessrawdata, the servers might get between making the HEAD request and making the GET request to actually download the file. Going to start logging the response headers of the GET request as well and do a comparison before actually writing the zip.

Maybe this is a CDN issue, where the files are being uploaded, but they're large enough that they don't fully propagate before the new headers do or vice versa. I didn't think that was even possible, but if there's something weird going on with the folks at Akamai, that could totally explain the discrepancy.

Today's update completed without any problems on the prod server. Ran it at 11:45 GMT.

Meanwhile on the dev server I confirmed that, even between the HEAD and GET requests sent by the downloadcalaccessrawdata, the response values are changing:

At 11:25:

ETag: "2320c8-2d9b0cd1-c702c800" <-- (from HEAD)
ETag: "2320c8-2d99f99a-aa010780" <-- (from GET)
Last-Modified: 2016-08-18 11:21:04+00:00 <-- (from HEAD)
Last-Modified: Wed, 17 Aug 2016 11:21:18 GMT <-- (from GET)
Content-Length: 765136081 <-- (from HEAD)
Content-Length: 765065626 <-- (from GET)

At 11:40:

ETag: "2320c8-2d9b0cd1-c702c800" <-- (from HEAD)
ETag: "2320c8-2d99f99a-aa010780" <-- (from GET)
Last-Modified: 2016-08-18 11:21:04+00:00 <-- (from HEAD)
Last-Modified: Wed, 17 Aug 2016 11:21:18 GMT <-- (from GET)
Content-Length: 765136081 <-- (from HEAD)
Content-Length: 765065626 <-- (from GET)

At 11:55:

ETag: "2320c8-2d9b0cd1-c702c800" <-- (from HEAD)
ETag: "2320c8-2d9b0cd1-c702c800" <-- (from GET)
Last-Modified: 2016-08-18 11:21:04+00:00 <-- (from HEAD)
Last-Modified: Thu, 18 Aug 2016 11:21:04 GMT <-- (from GET)
Content-Length: 765136081 <-- (from HEAD)
Content-Length: 765136081 <-- (from GET)

Because of the check I added yesterday, the download was stopped during the first two attempts (though, for some reason the CommandError isn't showing up in the log...), but then finished downloading 765136081 bytes at 11:56:21.

The other thing I just noticed is that the datetime format of Last-Modified is switching as well, which I suppose is further evidence that these are two different servers with different configs. Gonna start logging the server response field value (e.g., Apache/2.2.3 (Red Hat)) as well.

I think we can now close this chapter.