Put test data into repository

Currently the actual tests cannot be run because the data exists only on the authors' computers.
Since there seems to be a special test batch, it could be checked into the repository.

There are a couple of reasons for this. 1) The data is roughly 500 GB, which I think is too much for GitHub. 2) We do not have a license or permission to share the data.

To implement this we need a routine that first downloads the full data if it is not already available on disk. Only after that can the tests be run. Below is a copy-paste from the README that outlines the data sources.

3: HDFS_v1, Hadoop, BGL thanks to the amazing LogHub team. For full data see Zenodo.
3: Spirit, Thunderbird and Liberty can be found on the USENIX site.
2: Nezha has data from two systems, TrainTicket and Google Cloud Webshop demo. It is the first dataset of microservice-based systems. Like other traditional log datasets it has log data, but additionally there are traces and metrics.
2: ADFA and AWSCTD are two datasets designed for intrusion detection.

What I mean is: the test files seem to use some subset of the data

LogLead/tests/anomaly_detectors.py

Line 14 in c1c7498

test_data_path = os.path.join(home_directory, "Datasets", "test_data")

Or is that still a huge dataset that cannot be shared?

The tests run through the following steps, which are linked via saved files whose naming convention indicates which phase created the data. Step 1 takes in the huge raw data, which we have no permission to share, samples it down, and saves the result for steps 2 and 3. A step 0 is needed for downloading so that anyone can execute the full pipeline. Currently, step 0 is done manually.

The latest commit fixes this. Now the data is downloaded when one runs main.py in the tests folder.
ddc8534

The data download can also be run separately. It does not overwrite existing data: it checks whether the dataset folder exists and, if it does, skips the download.
https://github.com/EvoTestOps/LogLead/blob/main/tests/download_data.py
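
For readers who want to see the "skip if the folder exists" behaviour in isolation, here is a minimal sketch. It is not the code in download_data.py: the function name download_if_missing, its parameters, and the assumption that each dataset comes as a single zip archive are all hypothetical and used only for illustration.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_if_missing(name, url, data_root, fetch=urllib.request.urlretrieve):
    """Download and unpack a zip archive into data_root/name unless it exists.

    Hypothetical helper, not LogLead's API. Returns True if a download
    happened and False if the existing folder was left untouched.
    """
    target = Path(data_root) / name
    if target.exists():
        # The folder is already there: do not download and do not overwrite.
        return False

    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / f"{name}.zip"
        fetch(url, archive)  # fetch(url, destination) writes the archive to disk
        # Create the target only after a successful fetch, so a failed
        # download does not leave an empty folder that would block a retry.
        target.mkdir(parents=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    return True
```

Because the check is on the folder only, a partially unpacked dataset would also be skipped on the next run; deleting the folder forces a fresh download.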

This config file controls what gets downloaded and tested. Commenting out rows disables downloading and testing for those datasets.
https://github.com/EvoTestOps/LogLead/blob/main/tests/datasets.yml

@jnyyssol you had found a couple of new datasets. Can you add them to the config so they also get downloaded? Please also add them to the tests.

This is a bit tricky, because the ADFA and AWSCTD datasets already consist of event IDs. Therefore most of the enhancements don't make sense for them, and some even cause crashes. I got the tests to run with ADFA and AWSCTD by doing the following:

1. Download the data (needs the new py7zr package to unpack .7z archives).
2. Load the data and save these two datasets directly with _eh, which indicates they have been enhanced. This causes the enhancer to skip them.
3. Add a check in anomaly_detectors.py to ensure numeric columns exist (they don't in these datasets, so many tests are skipped).
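
The two guards described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the change made in anomaly_detectors.py: it assumes _eh appears at the end of the saved file's stem, and it uses a plain dict of column lists as a stand-in for the real dataframe. The names needs_enhancement and numeric_columns are hypothetical.

```python
from pathlib import Path

ENHANCED_MARKER = "_eh"  # assumption: the marker ends the file stem, e.g. "adfa_eh.parquet"


def needs_enhancement(path):
    """Return False for files already marked as enhanced, True otherwise."""
    return not Path(path).stem.endswith(ENHANCED_MARKER)


def numeric_columns(table):
    """Return the names of columns whose values are all int or float.

    `table` is a dict mapping column name -> list of values, a simplified
    stand-in for a dataframe. Booleans are not counted as numeric, and
    empty columns are ignored.
    """
    return [
        name
        for name, values in table.items()
        if values
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
    ]


def run_numeric_tests(table):
    """Skip numeric-feature tests when no numeric column is present."""
    cols = numeric_columns(table)
    if not cols:
        return "skipped"
    return f"tested {len(cols)} numeric column(s)"
```

With event-ID-only data such as ADFA and AWSCTD, numeric_columns would return an empty list under this sketch, so the numeric tests are skipped rather than crashing.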

@mmantyla do you think it makes sense to include these two in the tests given that they are so different? Before pushing I still need to check that my changes didn't break anything regarding the other datasets.

I am closing this. The full test data will never be in the LogLead repo. However, the tests folder already has a mechanism for downloading all supported datasets.