simsong/bulk_extractor

Running bulk_extractor with -J uses multiple threads

  • Replicate with public data
  • Provide better debugging to output and report.xml when using -J
  • Create new debugging option to log every sbuf when analyzed in single-threaded mode.

Running bulk_extractor 2.02 with command:

bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o /home/accessions/b_e2x_errors/debug_mode05 -R /home/accessions/UA2023-0021/objects/OPD/ -F /home/scripts/be_regex/uaregex.txt

reported "going multi-threaded (24)"
and eventually reported "All data read; waiting for threads to finish..."

Execution environment:
<execution_environment>
  <cpuid>
    <identification>GenuineIntel</identification>
    <family>0</family>
    <model>0</model>
    <stepping>15</stepping>
    <efamily>0</efamily>
    <emodel>0</emodel>
    <brand>71</brand>
    <clflush_size>808</clflush_size>
    <nproc>110</nproc>
    <apicid>117</apicid>
    <L1_cache_size>262144</L1_cache_size>
  </cpuid>
  <os_sysname>Linux</os_sysname>
  <os_release>5.15.49-linuxkit</os_release>
  <os_version>#1 SMP Tue Sep 13 07:51:46 UTC 2022</os_version>
  <host>039a1d8462b0</host>
  <arch>x86_64</arch>
  <command_line>bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o /home/accessions/b_e2x_errors/debug_mode05 -R /home/accessions/UA2023-0021/objects/OPD/ -F /home/scripts/be_regex/uaregex.txt</command_line>
  <uid>1000</uid>
  <username>rluser</username>
  <start_time>2023-03-15T17:53:45Z</start_time>
</execution_environment>

Hi. I'm trying to understand the error here. Is it that -J makes it use multiple threads, or is it that it hung?

Can you provide /home/accessions/UA2023-0021/objects/OPD/ and /home/scripts/be_regex/uaregex.txt?

Hi -

With -J, I expected bulk_extractor to use only the primary thread, not run in multi-threaded mode.

The regex file is here: https://github.com/laissezfarrell/rl-bitcurator-scripts/blob/master/be_regex/uaregex.txt

I'm not able to share the content files because of access restrictions. I can share some summary data about them (filetypes, sizes, aggregate size, etc), if that would be helpful, though I'm not sure that would be.

Okay. Let me check this out over the weekend. Thanks.

This will almost certainly require review in the be20_api.

Here is a command line that may exercise the bug:

src/bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o out1 -Z -F tests/patterns.txt tests/Images/nps-2010-emails.E01

Also:

src/bulk_extractor --notify_main_thread -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o out1 -Z -F tests/patterns.txt tests/Images/nps-2010-emails.E01

What's the use case for -J/--no-threads? It's not obvious to me; I don't see the advantage of -J over -j 1 (where processing happens on one background thread while the main thread monitors). The disadvantage is the complexity of supporting a different codepath. If there's no use case, maybe this option could just be deleted?

It's very useful for debugging.
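
To make the difference between the two modes concrete, here is a minimal C++ sketch. It is not bulk_extractor or be20_api code: `Job`, `run_inline`, `run_one_worker`, and `record_ids` are hypothetical names invented for illustration. The assumption is that -J corresponds to running every unit of work inline on the calling thread, while -j 1 corresponds to handing the work to a single background thread that the main thread waits on.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one unit of scanner work (not the be20_api type).
using Job = std::function<void()>;

// "-J"-style path (assumed): every job runs inline on the calling thread.
void run_inline(const std::vector<Job>& jobs) {
    for (const auto& job : jobs) {
        job();
    }
}

// "-j 1"-style path (assumed): a single background worker runs the jobs
// while the calling thread only waits for it to finish.
void run_one_worker(const std::vector<Job>& jobs) {
    std::thread worker([&jobs] {
        for (const auto& job : jobs) {
            job();
        }
    });
    worker.join();  // a real program might report progress before joining
}

// Run n jobs through the given runner and record which thread executed each.
std::vector<std::thread::id> record_ids(
        void (*runner)(const std::vector<Job>&), int n) {
    std::vector<std::thread::id> ids;
    std::vector<Job> jobs;
    for (int i = 0; i < n; ++i) {
        jobs.push_back([&ids] { ids.push_back(std::this_thread::get_id()); });
    }
    runner(jobs);  // both runners return only after all jobs have finished
    return ids;
}
```

Calling `record_ids(run_inline, 3)` yields three ids equal to the caller's `std::this_thread::get_id()`, whereas `record_ids(run_one_worker, 3)` yields three ids that differ from it. In the inline path a crash or breakpoint lands directly in the main thread's stack, which is one plausible reason the single-threaded mode is convenient under a debugger.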

Fixed. -J now works properly:

simsong@Simsons-MacBook-Pro bulk_extractor % src/bulk_extractor -1 -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -o out1 -R .
bulk_extractor version: 2.0.6
Input file: "."
Output directory: "out1"
Disk Size: 1577
Scanners: aes base64 elf evtx facebook find gzip httplogs json kml_carved msxml net ntfsindx ntfslogfile ntfsmft ntfsusn outlook sqlite utmp vcard_carved windirs winlnk winprefetch accts email gps
Threading Disabled
running single-threaded (DEBUG)...
20:21:47 Offset 0MB (0.00%) Done in n/a at 2024-01-15 20:21:46
20:21:48 Offset 0MB (2.09%) Done in  0:00:47 at 2024-01-15 20:22:35
20:21:49 Offset 0MB (2.09%) Done in  0:01:34 at 2024-01-15 20:23:23
20:21:50 Offset 0MB (2.09%) Done in  0:02:21 at 2024-01-15 20:24:11
20:21:51 Offset 0MB (2.09%) Done in  0:03:08 at 2024-01-15 20:24:59
...