HTTPArchive/data-pipeline

Investigate why the crawl stalled on the remaining 300k HARs in June 2022

rviscomi opened this issue · 2 comments

When we ran the 2022_06_09 crawl, about 300k HAR files were awaiting processing for several days and the summary pipeline was working through them very slowly: only about 1k of the remaining HARs were processed after 3 days. The Dataflow job had been allocated about 100 workers even though it didn't appear to be making any progress, so we stopped the job to reallocate those workers to the non-summary pipeline.
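For reference, here's a minimal sketch (illustrative values only, not our actual launch configuration) of the Dataflow pipeline options that control how many workers a job like this can claim:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: region, bucket, and worker cap are illustrative placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="httparchive",
    region="us-west1",                        # placeholder region
    temp_location="gs://example-bucket/tmp",  # placeholder bucket
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=100,                      # cap on the autoscaled worker pool
)
```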

Investigate what went wrong and why these HARs failed to complete successfully.

Chatted with @giancarloaf about this yesterday. It seems this is a bug in Beam's Python SDK.

The underlying error is the one we've seen in our logs for a few months:

AttributeError: 'int' object has no attribute 'value' [while running 'WriteRequestsToBigQuery/WriteToBigQuery/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)-ptransform-67'
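For context, here's a minimal sketch (not our actual pipeline code; the table, schema, and rows are placeholders) of a streaming-insert write step like the WriteRequestsToBigQuery transform named in that error. The failure is raised inside Beam's BigQueryWriteFn on this streaming-insert path:

```python
import apache_beam as beam

# Sketch only: placeholder table, schema, and rows. The real pipeline's
# 'WriteRequestsToBigQuery' step writes the summary requests data.
with beam.Pipeline() as p:
    (
        p
        | "CreateRows" >> beam.Create([{"url": "https://example.com/", "status": 200}])
        | "WriteRequestsToBigQuery" >> beam.io.WriteToBigQuery(
            "httparchive:scratch.requests_example",  # placeholder table
            schema="url:STRING,status:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            # Streaming inserts run through _StreamToBigQuery/BigQueryWriteFn,
            # where the AttributeError above is raised.
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )
```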

The next step is to file a bug with Beam directly and investigate workarounds.

Worst case, we'll need to keep manually stopping the pipeline each month when it gets hung up like this. If needed, we should cover this in our documentation / developer playbook; a sketch of the manual stop is below.
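A minimal sketch of that manual stop, assuming we cancel the stuck job through the Dataflow v1b3 API; the project, region, and job ID below are placeholders, and the equivalent CLI is `gcloud dataflow jobs cancel <JOB_ID> --region=<REGION>`:

```python
from googleapiclient.discovery import build

def cancel_dataflow_job(project: str, region: str, job_id: str) -> dict:
    """Ask Dataflow to cancel a running job so its workers are released."""
    dataflow = build("dataflow", "v1b3")
    return (
        dataflow.projects()
        .locations()
        .jobs()
        .update(
            projectId=project,
            location=region,
            jobId=job_id,
            # Use "JOB_STATE_DRAINED" instead to finish in-flight work first.
            body={"requestedState": "JOB_STATE_CANCELLED"},
        )
        .execute()
    )

if __name__ == "__main__":
    # Placeholder values; look up the real job ID in the Dataflow console.
    cancel_dataflow_job("httparchive", "us-west1", "<JOB_ID>")
```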

Resolving this error group; it should not reactivate after the Beam SDK version bump to 2.40 in #104:
https://console.cloud.google.com/errors/detail/CO2ox4P-_4T62gE;time=P1D?project=httparchive
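For reference, a minimal sketch of the version bump itself, assuming the Beam dependency is pinned in a setup.py; the actual file and extras touched in #104 may differ:

```python
# setup.py (sketch only): pin the Beam SDK at 2.40.0, the version where
# this error group is expected to stop reactivating.
import setuptools

setuptools.setup(
    name="data-pipeline",
    version="0.0.1",
    packages=setuptools.find_packages(),
    install_requires=[
        "apache-beam[gcp]==2.40.0",
    ],
)
```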