m-lab/etl-gardener

Large number of small archives collected after test date.


There are large numbers of tiny tcpinfo archives, each containing 10 to 20 files. They are processed at the end of each date prefix. The archive names indicate that pusher collected them days after the test dates of the files they contain.
For tcpinfo, there are on the order of 1000-2000 of these archives per day. For ndt5 there are fewer, and they are smaller; most of them seem to be collected shortly after midnight rather than days later.
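
One way to quantify "days after the test date" is to compare the date prefix of an archive path with the timestamp in its file name. The sketch below assumes a path layout of `<datatype>/YYYY/MM/DD/<YYYYMMDD>T<HHMMSS>...`; that layout, the example paths, and the helper name `collection_lag_days` are assumptions for illustration, not taken from this issue.

```python
import re
from datetime import date, datetime

# Assumed (hypothetical) layout: <datatype>/YYYY/MM/DD/<YYYYMMDD>T<HHMMSS>...
ARCHIVE_RE = re.compile(
    r"(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/(?P<stamp>\d{8}T\d{6})"
)


def collection_lag_days(path: str) -> int:
    """Return whole days between the date prefix and the archive timestamp."""
    match = ARCHIVE_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognized archive path: {path!r}")
    # Date prefix: the test date the archive is filed under.
    test_date = date(int(match["y"]), int(match["m"]), int(match["d"]))
    # File-name timestamp: when the archive was collected (assumed).
    collected = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%S").date()
    return (collected - test_date).days


# Hypothetical paths: one collected days later, one shortly after midnight.
late = "tcpinfo/2019/08/01/20190804T001500.000000Z-tcpinfo-example.tgz"
next_day = "ndt5/2019/08/01/20190802T000500.000000Z-ndt5-example.tgz"
print(collection_lag_days(late), collection_lag_days(next_day))
```

Grouping archives by this lag would separate the ndt5 pattern (collected just after midnight) from the tcpinfo pattern (collected days later).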

The tcpinfo files may originate from a small number of very long-lived connections. The tcpinfo sidecar splits connection records into 10-minute sections, so a single connection can produce 144 files per day.
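
The 144 figure is just 24 hours divided into 10-minute sections. As a rough sanity check, the sketch below also estimates how many all-day connections would be needed to account for a given number of small archives. The function names are hypothetical, and the estimate assumes every file comes from a connection that stays open for the full day, which is only a lower bound.

```python
from datetime import timedelta

SECTION = timedelta(minutes=10)  # section length used by the tcpinfo sidecar


def files_per_day(section: timedelta = SECTION) -> int:
    """Files produced per day by one connection that stays open all day."""
    return timedelta(days=1) // section


def min_connections(archives: int, files_per_archive: int) -> int:
    """Lower bound on all-day connections needed to produce that many files."""
    total_files = archives * files_per_archive
    per_connection = files_per_day()
    # Integer ceiling division.
    return -(-total_files // per_connection)


print(files_per_day())            # 144
print(min_connections(1000, 10))  # 70
print(min_connections(2000, 20))  # 278
```

Under that assumption, 1000-2000 archives of 10 to 20 files each would correspond to somewhere between roughly 70 and 280 connections per day, which is at least consistent with "a small number" relative to total test volume.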

The dedup queries may not be handling tcpinfo continuations correctly. This needs to be checked.