Failing to process zeek data
It seems there is an error in this function (read_zeek):
https://github.com/stratosphereips/AIP-Blacklist-Algorithm/blob/ec295f98d3d5a607a84a2c3e698a89d4eec3dd74/lib/aip/data/functions.py#L50
2023-02-11 23:08:48,620 - aip.data.access - DEBUG - Processing hourly file: conn.11:00:00-12:00:00.log.gz
2023-02-11 23:08:59,051 - aip.data.access - DEBUG - Processing hourly file: conn.09:00:00-10:00:00.log.gz
2023-02-11 23:09:09,398 - aip.data.access - DEBUG - Processing hourly file: conn.10:00:00-11:00:00.log.gz
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/aip/data/access.py", line 72, in _process_zeek_file
    zeekdata = read_zeek(z)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/aip/data/functions.py", line 50, in read_zeek
    df = pd.read_csv(path, skiprows=8, names=header['fields'], sep=header['separator'], comment='#', **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/aip/miniconda3/envs/aip/lib/python3.10/gzip.py", line 507, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
"""
Checking the data sources, the zeek file was compressed and not empty, but gzip reported a compression error: gzip: unexpected end of file.
The tool could check each file for gzip validity before attempting to uncompress and process it, to avoid this in the future. In bash, gzip -v -t {file} reports whether a gzip file is valid or not. There should be something similar in Python.
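For reference, a rough Python equivalent of gzip -t could look like the sketch below. It uses only the standard library and decompresses the whole file while discarding the output. The function name is_valid_gzip and the chunk_size parameter are made up for illustration and are not part of AIP.

```python
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1024 * 1024):
    """Return True if the file at `path` decompresses cleanly to the end.

    Hypothetical helper, not part of AIP. Like `gzip -t`, it has to read
    the whole file, so it costs one full decompression pass.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data in chunks so memory use stays small.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream was truncated (the error in this issue).
        # OSError: bad header or CRC mismatch (gzip.BadGzipFile is a
        #          subclass), and also a missing or unreadable file.
        # zlib.error: corrupt compressed data.
        return False
    return True
```

A caller could then skip and log any hourly file for which is_valid_gzip(path) returns False instead of letting the whole run fail.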
Also, not sure why it attempts to process a day in the past again?
2023-02-11 23:05:07,922 - aip.data.access - DEBUG - Making dataset from raw data for dates ['2023-02-10']
2023-02-11 23:05:14,582 - aip.data.access - DEBUG - Processing hourly file: conn.00:00:00-01:00:00.log.gz
....
2023-02-11 23:08:30,638 - aip.data.access - DEBUG - Creating attacks for dates ['2022-10-27']
2023-02-11 23:08:30,640 - aip.data.access - DEBUG - Making dataset from raw data for dates ['2022-10-27']
@verovaleros The file is not empty, but there was a compression error (gzip: unexpected end of file). To check for this, which type of data needs to be passed into this function? I am able to understand the code base of this file.
Please suggest some steps to fix the bug.
Solved, migrated to zeeklog2pandas, which can read broken gz files.
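For anyone who wants to salvage a truncated file without switching libraries, here is a minimal sketch using only the standard library. It is not how zeeklog2pandas does it (I have not checked its internals), and the name read_truncated_gzip is hypothetical. It feeds the compressed bytes to zlib directly, keeps whatever decompresses, and drops the last partial line when the stream is incomplete. It only handles the first member of a multi-member gzip file.

```python
import zlib


def read_truncated_gzip(path, chunk_size=64 * 1024):
    """Return (data, complete) for a possibly truncated gzip file.

    Hypothetical helper: `data` is the bytes that could be recovered and
    `complete` is True only if the end-of-stream marker was reached.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts = []
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            try:
                parts.append(decomp.decompress(chunk))
            except zlib.error:
                # Corrupt data: keep what was decoded before this chunk.
                break
            if decomp.eof:
                break
    data = b"".join(parts)
    if not decomp.eof:
        # The last line is probably cut in the middle, so drop it.
        data = data[: data.rfind(b"\n") + 1]
    return data, decomp.eof
```

The recovered bytes can be wrapped in io.BytesIO and handed to the existing parser, with complete used to log a warning about the damaged file.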