- The purpose of this repository is to organize and map data sets that can be used for testing different applications.
- The dataset above currently contains 88 distinct files across 15 folders (15 different file types), 53MB in total.
- https://github.com/filetrust/malicious-test-files - DO NOT CLONE THIS TO YOUR LOCAL PC
- Number of files: > 3,400
- Total size: UNKNOWN
- https://github.com/k8-proxy/k8-test-data/tree/master/TestFiles/VirusShare
- Number of files: 5 zip files with numerous malicious files inside
- Total size: UNKNOWN
- https://github.com/k8-proxy/data-sets/blob/main/xlsx/0000274.xlsx
- The file 0000274.xlsx uses the rand() function, which changes some of the textual data; it also contains graphs derived from that data. Any comparison that actually opens the file should trigger the rand() functions and yield a difference.
- https://k8-mass-download.s3.eu-west-2.amazonaws.com/gov_Files.zip
- Number of files: 9,824
- Total size: 6.87GB
- s3://k8-test-data/gov_uk
- Number of files: 11k
- Total size: 11.9GB
- https://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/
- You can download one zip at a time to your local machine, or use wget within a remote CDR VM (e.g. `wget https://digitalcorpora.s3.amazonaws.com/corpora/files/govdocs1/zipfiles/000.zip`)
- To download all ~350GB of files, run `aws s3 cp s3://digitalcorpora/corpora/files/govdocs1/zipfiles/ . --recursive`
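To fetch more than one archive but less than the full ~350GB, a small helper can generate the URLs for a range of zips. This is a minimal sketch: the function name `govdocs_urls` is hypothetical, and it assumes the other archives follow the same three-digit, zero-padded naming as `000.zip`.

```shell
# Base URL taken from the wget example above.
BASE_URL="https://digitalcorpora.s3.amazonaws.com/corpora/files/govdocs1/zipfiles"

# Print one download URL per archive number in the range [first, last].
# Assumption: archives are named 000.zip, 001.zip, ... (three digits, zero-padded).
# Pass plain integers (0, 1, 2, ...), not zero-padded strings.
govdocs_urls() {
  gd_i="$1"
  gd_last="$2"
  while [ "$gd_i" -le "$gd_last" ]; do
    printf '%s/%03d.zip\n' "$BASE_URL" "$gd_i"
    gd_i=$((gd_i + 1))
  done
}

# Example: list the first three archives (nothing is downloaded here).
govdocs_urls 0 2
```

To actually download, the list can be piped into wget, which reads URLs from standard input when given `-i -`: `govdocs_urls 0 9 | wget -i -`.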
- CDR-Plugin-test-files: s3://cdr-plugin-test-files
- Folder 100: 100 files, 60MB, different file types, can be downloaded here
- Folder 500: 500 files, 375MB, subset of gov-uk, can be downloaded here
- Folder 1000: 1000 files, 604MB, subset of gov-uk, can be downloaded here
- Folder 2000: 2000 files, 1GB, subset of gov-uk, can be downloaded here
- Folder hard-to-process: 3 files, -, files that the GW SDK has trouble processing, worth investigating
- Folder high_threat_govUk: 88 files, -, govUK files marked as high threat after processing
- Failed files from the govUk dataset: s3://cdr-plugin-test-files/govUk_failed.zip
- Folder clean_files: multiple zip files, all files within this folder are clean (successfully rebuilt)
- Folder all-clean-73: 73 files, 40MB, folder 100 that contains just clean files
- s3://cdr-plugin-test-files/clean_files/919_clean.zip, folder 1000 that contains just clean files
- s3://cdr-plugin-test-files/clean_files/1907_clean.zip, folder 2000 that contains just clean files
- s3://cdr-plugin-test-files/clean_files/gov-docs-ok-554.zip
- s3://cdr-plugin-test-files/clean_files/gov-docs-ok-2492.zip
- s3://cdr-plugin-test-files/clean_files/gov-uk-ok-9124.zip
- 35_OK.zip: all clean files, 34.2GB, taken from https://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/ and stored at s3://cdr-plugin-test-files/35_OK.zip
- Scraped websites:

Source | S3 URL | Total files | Total size |
---|---|---|---|
bitsavers | url to download zip | 1267 | 835MB |
wikileaks | url to download zip | 4009 | 1.1GB |
gwsolutions | url to download zip | 490 | 56MB |
fticonsulting | url to download zip | 2823 | 1.3GB |
digitalcorpora | url to download zip | - | 35GB |
- Use `wget` for scraping
- Run one of the commands below to scrape a website. The first scrapes the whole website; the second downloads only pdf, jpg and png files:
  - `sudo wget -r -np -nd -k <WEBSITE URL>`
  - `sudo wget -r -np -nd -k -e robots=off -A pdf,jpg,png <WEBSITE URL>`
- In case of lots of 503 errors, add `-U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"` to the above commands
- If you want to run `wget` in the background, add `-b`
- Before downloading the website, you can first spider it, find the links to the files you want to test, and then download only those:
  - `sudo wget --random-wait --no-check-certificate -e robots=off -o output.log --spider -r <WEBSITE URL>`
  - `grep -oP "http\S+\.(pdf|doc|docx|docm|xls|xlsx|xlsm|ppt|pptx|jpg|gif|png)\b" output.log | sort | uniq > links.txt`
  - `wget -i links.txt --no-check-certificate -e robots=off`
- After downloading the files, check whether additional cleaning is needed (for example, removing unsupported files or files without extensions)
- Websites that can be scraped:
- http://www.bitsavers.org/bits/
- https://wikileaks.org/sony/docs/01
- or any other you would like
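The link-extraction step of the spider workflow can be tried offline before pointing wget at a real site. The sketch below runs the same kind of `grep -oP` filter over a tiny stand-in log. The log lines and example.com URLs are made up for illustration, `extract_links` is a hypothetical helper name, and a trailing `\b` is used in the pattern so that longer extensions such as docx are not cut short to doc. It assumes GNU grep with `-P` support.

```shell
# Work in a temporary directory so nothing in the current folder is overwritten.
work_dir="$(mktemp -d)"

# Stand-in for a wget spider log; the URLs below are illustrative only.
cat > "$work_dir/output.log" <<'EOF'
--2021-01-01 10:00:00--  http://example.com/docs/report.pdf
--2021-01-01 10:00:01--  http://example.com/index.html
--2021-01-01 10:00:02--  http://example.com/img/logo.png
--2021-01-01 10:00:03--  http://example.com/docs/report.pdf
--2021-01-01 10:00:04--  http://example.com/docs/minutes.docx
EOF

# Keep only URLs that end in one of the wanted extensions, de-duplicated.
extract_links() {
  grep -oP 'http\S+\.(pdf|doc|docx|docm|xls|xlsx|xlsm|ppt|pptx|jpg|gif|png)\b' "$1" | sort -u
}

extract_links "$work_dir/output.log" > "$work_dir/links.txt"
cat "$work_dir/links.txt"
```

The resulting `links.txt` has one URL per line, which is the format `wget -i links.txt` expects.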
- https://github.com/filetrust/test-files-generator
- Files generated via the above script are located in the s3://cdr-plugin-test-files/random_generated_pdfs/ folder
- Download files from a specific website
- Run file processing via the CDR Platform
- Check the Threat Dashboard after the run and filter the results by threat level set to high
- Download the CSV with the list of files
- Type the exact file name in a browser, and that should lead you back to the beginning
- Example of govUK list of files with high threat level: discover-file-analysis.txt
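Once the CSV is downloaded, the high-threat file names can be pulled out with a one-line filter. This is only a sketch: the actual layout of the Threat Dashboard export is not documented here, so the column names `file_name` and `threat_level`, the sample rows, and the helper name `high_threat_files` are all assumptions to adapt to the real file.

```shell
# Hypothetical export; the real Threat Dashboard CSV layout may differ.
csv_file="$(mktemp)"
cat > "$csv_file" <<'EOF'
file_name,threat_level
report-a.pdf,high
notes.docx,low
budget.xlsx,high
EOF

# Print the names of files whose threat level is "high", skipping the header row.
# Assumption: column 1 is the file name and column 2 is the threat level.
high_threat_files() {
  awk -F',' 'NR > 1 && $2 == "high" { print $1 }' "$1"
}

high_threat_files "$csv_file"
```

If the real export quotes fields or puts the threat level in a different column, the awk condition needs adjusting accordingly.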