Hang: `-R` with 10,000 files and 20 threads on MacBook Pro
simsong opened this issue · 16 comments
Something is wrong with the -R file iterator. Redesign so that it:
- Counts all of the files and their sizes.
- Computes total bytes to process
- Displays this with the progress bar.
- Doesn't hang.
- Reports start:work and end:work in XML file.
Likely related to #396
Any news here?
Thanks for asking. This has been really causing me a lot of psychic pain but I just haven't gotten to it. If you have a student who can look at this, I can supervise. Otherwise it will need to wait until I finish the book that I'm currently working on, which has to be at the publisher in a few weeks.
@zdavatz - I think that Release 2.1.0 may solve your problem. Can you look and see if your original hang specified a regular expression? Perhaps give me a way to reproduce it?
Great, thank you 🙏 !
Download a website with wget -r
and then run bulk_extractor on the full dir.
Sure, try with wget -r https://www.zuerich.ch/content/zh/de/index.html
Or you just do a new release and I test the release.
Great! Can I already update via Kali Linux?
Kali shows me the version: 2.0.6-0kali1
Can I grab a binary somewhere?
On my Gentoo Linux I run into the following compile errors when running make:
239 | const std::filesystem::path get_input_fname() const;
| ^~~~~~~~~~
be20_api/scanner_set.h:243:28: error: 'filesystem' is not a member of 'std'
243 | const std::vector<std::filesystem::path> &find_files() const { return sc.find_files(); }
| ^~~~~~~~~~
be20_api/scanner_set.h:243:28: error: 'filesystem' is not a member of 'std'
be20_api/scanner_set.h:243:44: error: template argument 1 is invalid
243 | const std::vector<std::filesystem::path> &find_files() const { return sc.find_files(); }
| ^
be20_api/scanner_set.h:243:44: error: template argument 2 is invalid
In file included from be20_api/path_printer.cpp:5:
be20_api/scanner_set.h: In member function 'scanner_set::stats scanner_set::stats::operator+(const scanner_set::stats&)':
be20_api/scanner_set.h:95:64: error: no matching function for call to 'scanner_set::stats::stats(scanner_set::stats)'
95 | return stats(this->ns + s.ns, this->calls + s.calls);
| ^
make[2]: *** [Makefile:1520: be20_api/feature_recorder_file.o] Error 1
be20_api/path_printer.cpp: In member function 'void path_printer::process_http(std::istream&)':
be20_api/path_printer.cpp:319:52: error: 'class abstract_image_reader' has no member named 'image_fname'
319 | out << "X-Image-Filename: " << reader->image_fname() << PrintOptions::HTTP_EOL;
| ^~~~~~~~~~~
make[2]: *** [Makefile:1520: be20_api/feature_recorder.o] Error 1
make[2]: *** [Makefile:1520: be20_api/feature_recorder_set.o] Error 1
make[2]: *** [Makefile:1520: be20_api/path_printer.o] Error 1
make[2]: Leaving directory '/home/zeno/.software/bulk_extractor-2.1.0/src'
make[1]: *** [Makefile:525: all-recursive] Error 1
make[1]: Leaving directory '/home/zeno/.software/bulk_extractor-2.1.0'
make: *** [Makefile:465: all] Error 2
Well, I just ran
wget -r https://www.zuerich.ch/content/zh/de/index.html
src/bulk_extractor -o zuerich-out -R www.zuerich.ch/
Here is the email histogram:
# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.1.0
# Feature-Recorder: email
# Filename: www.zuerich.ch
# Histogram-File-Version: 1.1
n=39 information@zuerich.ch
n=4 hotel@zuerich.com
n=4 mail@dominiquemeienberg.ch
n=4 media@zuerich.com
n=2 anna.schindler@zuerich.ch
n=2 groups@zuerich.com
n=2 info@zuerich.com
n=1 foto@umeisser.ch
There are only 170 files. It processed in less than a second.
Attached is the report.xml file.
report.xml.txt
Great, thank you!