cal-itp/data-infra

Bug: unzip_and_validate_gtfs_schedule_hourly - Negsignal.SIGKILL


Describe the bug
Several Airflow runs of the stop_times_txt task have died with this error:
[2024-06-10, 04:19:14 UTC] {gtfs_csv_to_jsonl_hourly.py:153} INFO - Parsed gs://calitp-gtfs-schedule-unzipped-hourly/stop_times.txt/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1NQg==/stop_times.txt
[2024-06-10, 04:19:14 UTC] {gtfs_csv_to_jsonl_hourly.py:115} INFO - Processing gs://calitp-gtfs-schedule-unzipped-hourly/stop_times.txt/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1NVg==/stop_times.txt
[2024-06-10, 04:19:14 UTC] {storage.py:263} INFO - saving 15.5 kB to gs://calitp-gtfs-schedule-parsed-hourly/stop_times/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1NVg==/stop_times.jsonl.gz
[2024-06-10, 04:19:14 UTC] {gtfs_csv_to_jsonl_hourly.py:153} INFO - Parsed gs://calitp-gtfs-schedule-unzipped-hourly/stop_times.txt/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1NVg==/stop_times.txt
[2024-06-10, 04:19:14 UTC] {gtfs_csv_to_jsonl_hourly.py:115} INFO - Processing gs://calitp-gtfs-schedule-unzipped-hourly/stop_times.txt/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1QRQ==/stop_times.txt
[2024-06-10, 04:19:14 UTC] {storage.py:263} INFO - saving 51.4 kB to gs://calitp-gtfs-schedule-parsed-hourly/stop_times/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1QRQ==/stop_times.jsonl.gz
[2024-06-10, 04:19:14 UTC] {gtfs_csv_to_jsonl_hourly.py:153} INFO - Parsed gs://calitp-gtfs-schedule-unzipped-hourly/stop_times.txt/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1QRQ==/stop_times.txt
[2024-06-10, 04:19:14 UTC] {gtfs_csv_to_jsonl_hourly.py:115} INFO - Processing gs://calitp-gtfs-schedule-unzipped-hourly/stop_times.txt/dt=2024-06-10/ts=2024-06-10T03:00:19.181898+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw==/stop_times.txt
[2024-06-10, 04:20:31 UTC] {local_task_job.py:212} INFO - Task exited with return code Negsignal.SIGKILL
[2024-06-10, 04:20:31 UTC] {taskinstance.py:2599} INFO - 0 downstream tasks scheduled from follow-on schedule check

https://b2062ffca77d44a28b4e05f8f5bf4996-dot-us-west2.composer.googleusercontent.com/log?execution_date=2024-06-10T03%3A00%3A00%2B00%3A00&task_id=stop_times_txt&dag_id=unzip_and_validate_gtfs_schedule_hourly&map_index=-1

https://b2062ffca77d44a28b4e05f8f5bf4996-dot-us-west2.composer.googleusercontent.com/log?execution_date=2024-06-09T03%3A00%3A00%2B00%3A00&task_id=stop_times_txt&dag_id=unzip_and_validate_gtfs_schedule_hourly&map_index=-1

https://b2062ffca77d44a28b4e05f8f5bf4996-dot-us-west2.composer.googleusercontent.com/log?execution_date=2024-06-07T03%3A00%3A00%2B00%3A00&task_id=stop_times_txt&dag_id=unzip_and_validate_gtfs_schedule_hourly&map_index=-1

To Reproduce
It happens every day during the 9pm run (execution date 03:00 UTC in the logs linked above).

Expected behavior
It finishes.

Additional context

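This is most likely a memory issue: the task exits with Negsignal.SIGKILL, which is typically what shows up when a worker process is killed for running out of memory.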

The last feed it starts processing before being killed is the Bay Area GTFS feed, which is a lot larger than the others.
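To check which feed a given log line refers to, the base64_url path segment can be decoded. A minimal sketch, assuming the segment is URL-safe base64 of the feed URL (the segment name suggests it, but that is an assumption); `decode_feed_url` is a hypothetical helper, not something from the repo:

```python
import base64


def decode_feed_url(base64_url: str) -> str:
    """Decode a base64_url path segment back into the feed URL (hypothetical helper)."""
    return base64.urlsafe_b64decode(base64_url).decode("utf-8")


# Segment copied from the last "Processing" line before the SIGKILL.
last_feed = decode_feed_url(
    "aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1SRw=="
)
print(last_feed)
```

For the log above this gives the api.511.org datafeeds URL with operator_id=RG, which matches the Bay Area feed being the one it dies on.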

It might be possible to refactor the code so it processes the data in chunks instead of all at once, but the best and easiest next step is probably to

  • increase the memory available to the Kubernetes worker that runs this task and see if the problem goes away.
  • check how many threads this operation uses, and reduce the number if necessary.

and just try it in production.

I'm not entirely sure how to do either of these.
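For reference, the "handle the data in parts" refactor could look roughly like the following. This is only a sketch of the general technique (stream rows from the CSV straight into a gzipped JSONL file so that only one row is held in memory at a time), not a description of what gtfs_csv_to_jsonl_hourly.py does today. `csv_to_jsonl_gz` is a hypothetical name, and the sketch works on local files rather than GCS objects:

```python
import csv
import gzip
import io
import json
import os
import tempfile
from typing import TextIO


def csv_to_jsonl_gz(src: TextIO, dst_path: str) -> int:
    """Convert CSV to gzipped JSONL one row at a time; return the row count.

    Hypothetical helper: memory use stays roughly constant regardless of
    how large the input file is, because rows are never collected in a list.
    """
    count = 0
    reader = csv.DictReader(src)  # yields one dict per row, lazily
    with gzip.open(dst_path, "wt", encoding="utf-8") as out:
        for row in reader:
            out.write(json.dumps(row))
            out.write("\n")
            count += 1
    return count


# Tiny made-up stop_times.txt sample, just to exercise the function.
sample = io.StringIO("trip_id,stop_id,stop_sequence\nt1,s1,1\nt1,s2,2\n")
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "stop_times.jsonl.gz")
    n_rows = csv_to_jsonl_gz(sample, out_path)
    with gzip.open(out_path, "rt", encoding="utf-8") as f:
        first_row = json.loads(f.readline())
```

Whether this helps depends on where the memory actually goes; if the current code already streams rows and the spike comes from somewhere else (for example buffering the whole output before upload), the fix would need to target that step instead.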