stratosphereips/AIP

Failing to process zeek data

Closed this issue · 2 comments

There seems to be an error in this function (read_zeek):
https://github.com/stratosphereips/AIP-Blacklist-Algorithm/blob/ec295f98d3d5a607a84a2c3e698a89d4eec3dd74/lib/aip/data/functions.py#L50

2023-02-11 23:08:48,620 - aip.data.access - DEBUG - Processing hourly file: conn.11:00:00-12:00:00.log.gz
2023-02-11 23:08:59,051 - aip.data.access - DEBUG - Processing hourly file: conn.09:00:00-10:00:00.log.gz
2023-02-11 23:09:09,398 - aip.data.access - DEBUG - Processing hourly file: conn.10:00:00-11:00:00.log.gz
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/aip/data/access.py", line 72, in _process_zeek_file
    zeekdata = read_zeek(z)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/aip/data/functions.py", line 50, in read_zeek
    df = pd.read_csv(path, skiprows=8, names=header['fields'], sep=header['separator'], comment='#', **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/gzip.py", line 507, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
"""

Checking the data sources, the Zeek file was compressed and not empty, but it had a compression error: gzip: unexpected end of file.

The tool could check gzip validity before attempting to decompress and process a file, to avoid this in the future. In bash, gzip -v -t {file} reports whether a gzip file is valid. There should be something similar in Python.
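As far as I know, Python's standard library has no single call equivalent to gzip -t, but reading the stream through to the end has the same effect, because the gzip module only verifies the trailer (CRC and length) once it reaches the end of the stream. A minimal sketch, where is_valid_gzip is a hypothetical helper name and not something that exists in AIP:

```python
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1024 * 1024):
    """Return True if the gzip stream at `path` decompresses cleanly to the end.

    Hypothetical helper, roughly equivalent to `gzip -t`.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks and discard the data; integrity is only
            # confirmed once the end of the stream has been reached.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: truncated stream (the error in the traceback above).
        # OSError: includes gzip.BadGzipFile and a missing/unreadable file.
        # zlib.error: corrupt deflate data.
        return False
    return True
```

A caller such as _process_zeek_file could then skip (and log) any hourly file for which this returns False instead of letting the whole worker pool fail. The cost is that every valid file gets decompressed twice.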

Also, why does it attempt to process a day in the past again?

2023-02-11 23:05:07,922 - aip.data.access - DEBUG - Making  dataset from raw data for dates ['2023-02-10']
2023-02-11 23:05:14,582 - aip.data.access - DEBUG - Processing hourly file: conn.00:00:00-01:00:00.log.gz

....

2023-02-11 23:08:30,638 - aip.data.access - DEBUG - Creating attacks for dates ['2022-10-27']
2023-02-11 23:08:30,640 - aip.data.access - DEBUG - Making  dataset from raw data for dates ['2022-10-27']

@verovaleros The file is not empty, but there was a compression error: gzip: unexpected end of file. To check for this, what type of data needs to be passed into this function? I am able to understand the code base of this file.
Please suggest some steps to fix the bug.

Solved, migrated to zeeklog2pandas, which can read broken gz files.
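For reference, one general way to recover data from a truncated gzip file using only the standard library is to decompress incrementally with zlib and keep whatever comes out before the input runs dry. This is only a sketch of that technique; it is not a description of how zeeklog2pandas works internally, and read_truncated_gzip is a hypothetical name:

```python
import zlib


def read_truncated_gzip(path, chunk_size=64 * 1024):
    """Return (data, complete) for a possibly truncated gzip file.

    `data` holds every byte that could be decompressed; `complete` is True
    only if the end-of-stream marker was reached. Assumes a single-member
    gzip file; corrupt (rather than truncated) input still raises zlib.error.
    """
    # Adding 16 to the window bits tells zlib to expect a gzip header/trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts = []
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # Incremental decompression does not raise on a short stream;
            # it simply returns the output available so far.
            parts.append(decomp.decompress(chunk))
    return b"".join(parts), decomp.eof
```

When complete is False, the last line of the recovered data is probably cut off mid-record, so a caller would likely want to drop everything after the final newline before parsing it as a Zeek log.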