simsong/bulk_extractor

Hang: `-R` with 10,000 files and 20 threads on MacBook Pro

simsong opened this issue · 16 comments

Something is wrong with the -R file iterator. Redesign so that it:

  • Counts all of the files and their sizes.
  • Computes total bytes to process.
  • Displays this with the progress bar.
  • Doesn't hang.
  • Reports start:work and end:work in XML file.

Likely related to #396

Any news here?

Thanks for asking. This has been really causing me a lot of psychic pain but I just haven't gotten to it. If you have a student who can look at this, I can supervise. Otherwise it will need to wait until I finish the book that I'm currently working on, which has to be at the publisher in a few weeks.

@zdavatz - I think that Release 2.1.0 may solve your problem. Can you look and see if your original hang specified a regular expression? Perhaps give me a way to reproduce it?

Great, thank you 🙏 !

Download a website with `wget -r` and then run bulk_extractor on the full dir.

Sure, try with `wget -r https://www.zuerich.ch/content/zh/de/index.html`

Or you just do a new release and I test the release.

Great! Can I already update via Kali Linux?

Kali shows me the version: 2.0.6-0kali1

Can I grab a binary somewhere?

On my Gentoo Linux I run into the following compile errors when running `make`.

  239 |     const std::filesystem::path get_input_fname() const;
      |                ^~~~~~~~~~
be20_api/scanner_set.h:243:28: error: ‘filesystem’ is not a member of ‘std’
  243 |     const std::vector<std::filesystem::path> &find_files()    const { return sc.find_files(); }
      |                            ^~~~~~~~~~
be20_api/scanner_set.h:243:28: error: ‘filesystem’ is not a member of ‘std’
be20_api/scanner_set.h:243:44: error: template argument 1 is invalid
  243 |     const std::vector<std::filesystem::path> &find_files()    const { return sc.find_files(); }
      |                                            ^
be20_api/scanner_set.h:243:44: error: template argument 2 is invalid
In file included from be20_api/path_printer.cpp:5:
be20_api/scanner_set.h: In member function ‘scanner_set::stats scanner_set::stats::operator+(const scanner_set::stats&)’:
be20_api/scanner_set.h:95:64: error: no matching function for call to ‘scanner_set::stats::stats(scanner_set::stats)’
   95 |             return stats(this->ns + s.ns, this->calls + s.calls);
      |                                                                ^
make[2]: *** [Makefile:1520: be20_api/feature_recorder_file.o] Error 1
be20_api/path_printer.cpp: In member function ‘void path_printer::process_http(std::istream&)’:
be20_api/path_printer.cpp:319:52: error: ‘class abstract_image_reader’ has no member named ‘image_fname’
  319 |             out << "X-Image-Filename: " << reader->image_fname() << PrintOptions::HTTP_EOL;
      |                                                    ^~~~~~~~~~~
make[2]: *** [Makefile:1520: be20_api/feature_recorder.o] Error 1
make[2]: *** [Makefile:1520: be20_api/feature_recorder_set.o] Error 1
make[2]: *** [Makefile:1520: be20_api/path_printer.o] Error 1
make[2]: Leaving directory '/home/zeno/.software/bulk_extractor-2.1.0/src'
make[1]: *** [Makefile:525: all-recursive] Error 1
make[1]: Leaving directory '/home/zeno/.software/bulk_extractor-2.1.0'
make: *** [Makefile:465: all] Error 2

Well, I just ran

wget -r https://www.zuerich.ch/content/zh/de/index.html
src/bulk_extractor -o zuerich-out -R www.zuerich.ch/

Here is the email histogram:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.1.0
# Feature-Recorder: email
# Filename: www.zuerich.ch
# Histogram-File-Version: 1.1
n=39	information@zuerich.ch
n=4	hotel@zuerich.com
n=4	mail@dominiquemeienberg.ch
n=4	media@zuerich.com
n=2	anna.schindler@zuerich.ch
n=2	groups@zuerich.com
n=2	info@zuerich.com
n=1	foto@umeisser.ch

There are only 170 files. It processed in less than a second.
Attached is the report.xml file.
report.xml.txt

great, thank you!