Question about function name prediction
Hi, thanks for your efforts in open source!
I have been trying to use this tool to generate embeddings for some binaries. I trained the model following https://github.com/OSUSecLab/SymLM?tab=readme-ov-file#model-training, which seems to work. Then I tried to generate the dataset for my test binary following https://github.com/OSUSecLab/SymLM/tree/main/dataset_generation#dataset-preparation, but it seems that all of the generated files in the output directory are empty.
Specifically, the following is my run.sh:
#!/bin/bash
GHIDRA_ANALYZEHEADLESS_PATH='/home/ruoyu/workspace/ghidra/support/analyzeHeadless' # path to ghidra analyzeHeadless executable
GHIDRA_PROJECT_PATH='/home/ruoyu/' # path to ghidra project
GHIDRA_PROJECT_NAME='SymLM' # name of ghidra project
# BINARY_PATH='/home/ruoyu/workspace/SymLM/dataset_generation/sample_binary/bc/bc' # path to binary
BINARY_PATH='/home/ruoyu/workspace/SymLM/php_sample/php_binary' # path to binary
BINARY_ARCHITECTURE='x64' # architecture of binary, options: x86, x64, arm, mips
# DATASET_OUTPUT_DIR='/home/ruoyu/workspace/SymLM/dataset_generation/sample_output/bc' # path to output directory
DATASET_OUTPUT_DIR='/home/ruoyu/workspace/SymLM/php_sample/output' # path to output directory
# generate interprocedural cfg
$GHIDRA_ANALYZEHEADLESS_PATH $GHIDRA_PROJECT_PATH $GHIDRA_PROJECT_NAME -import $BINARY_PATH -readOnly -postScript ./get_calling_context.py
# generate dataset
python ./prepare_dataset.py \
--output_dir $DATASET_OUTPUT_DIR \
--input_binary_path $BINARY_PATH \
--arch $BINARY_ARCHITECTURE
And below is the log from running run.sh. I did not observe any errors during the run.
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu120.04, mixed mode)
INFO Using log config file: jar:file:/home/ruoyu/workspace/ghidra/Ghidra/Framework/Generic/lib/Generic.jar!/generic.log4j.xml (LoggingInitialization)
INFO Using log file: /home/ruoyu/.ghidra/.ghidra_10.1.2_PUBLIC/application.log (LoggingInitialization)
INFO Loading user preferences: /home/ruoyu/.ghidra/.ghidra_10.1.2_PUBLIC/preferences (Preferences)
INFO Class search complete (859 ms) (ClassSearcher)
INFO Initializing SSL Context (SSLContextInitializer)
INFO Initializing Random Number Generator... (SecureRandomFactory)
INFO Random Number Generator initialization complete: NativePRNGNonBlocking (SecureRandomFactory)
INFO Trust manager disabled, cacerts have not been set (ApplicationTrustManagerFactory)
INFO HEADLESS Script Paths:
/home/ruoyu/workspace/ghidra/Ghidra/Processors/DATA/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Debug/Debugger/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Processors/PIC/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/Python/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/MicrosoftCodeAnalyzer/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Processors/8051/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Debug/Debugger-agent-dbgmodel-traceloader/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/FileFormats/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/GnuDemangler/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/VersionTracking/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/Base/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/BytePatterns/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/FunctionID/ghidra_scripts
/home/ruoyu/workspace/ghidra/Ghidra/Features/Decompiler/ghidra_scripts (HeadlessAnalyzer)
INFO HEADLESS: execution starts (HeadlessAnalyzer)
INFO Opening existing project: /home/ruoyu/SymLM (HeadlessAnalyzer)
INFO Opening project: /home/ruoyu/SymLM (HeadlessProject)
INFO REPORT: Processing input files: (HeadlessAnalyzer)
INFO project: /home/ruoyu/SymLM (HeadlessAnalyzer)
INFO IMPORTING: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
INFO REPORT: Import succeeded with language "x86:LE:64:default" and cspec "gcc" for file: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
INFO ANALYZING all memory and code: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
INFO DWARF external debug information found: ExternalDebugInfo [filename=null, crc=0, hash=09ebf3fca5a553c4033819ba96da4ee654f8a02f] (ExternalDebugFilesService)
INFO Unable to find DWARF information, skipping DWARF analysis (DWARFAnalyzer)
ERROR Invalid PNG data at 007ca170 (PngDataType)
INFO hit non-returning function, restarting decompiler switch analyzer later (DecompilerSwitchAnalyzer)
INFO Packed database cache: /tmp/ruoyu-Ghidra/packed-db-cache (PackedDatabaseCache)
INFO -----------------------------------------------------
ASCII Strings 1.276 secs
Apply Data Archives 0.327 secs
Call Convention ID 0.130 secs
Call-Fixup Installer 0.105 secs
Create Address Tables 0.513 secs
Create Address Tables - One Time 12.714 secs
Create Function 0.054 secs
DWARF 0.007 secs
Data Reference 1.895 secs
Decompiler Switch Analysis 1.561 secs
Decompiler Switch Analysis - One Time 74.185 secs
Demangler GNU 0.179 secs
Disassemble Entry Points 12.732 secs
ELF Scalar Operand References 3.396 secs
Embedded Media 0.100 secs
External Entry References 0.009 secs
Function ID 4.464 secs
Function Start Search 0.122 secs
Function Start Search After Code 0.080 secs
Function Start Search After Data 0.085 secs
GCC Exception Handlers 2.694 secs
Non-Returning Functions - Discovered 3.896 secs
Non-Returning Functions - Known 0.006 secs
Reference 2.540 secs
Shared Return Calls 0.789 secs
Stack 25.038 secs
Subroutine References 0.883 secs
x86 Constant Reference Analyzer 22.435 secs
-----------------------------------------------------
Total Time 172 secs
-----------------------------------------------------
(AutoAnalysisManager)
INFO REPORT: Analysis succeeded for file: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
INFO SCRIPT: /home/ruoyu/workspace/SymLM/dataset_generation/get_calling_context.py (HeadlessAnalyzer)
[*] The interprocedural CFG be saved in: ./icfg/php_binary
INFO ANALYZING changes made by post scripts: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
INFO REPORT: Post-analysis succeeded for file: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
INFO REPORT: Discarded file import due to readOnly option: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)
[*] create output folder for an individual binary: /home/ruoyu/workspace/SymLM/php_sample/output/php_binary
[*] load icfg file: ./icfg/php_binary/icfg.json
[*] Dataset for /home/ruoyu/workspace/SymLM/php_sample/php_binary is generated in: /home/ruoyu/workspace/SymLM/php_sample/output/php_binary
Please find the test binary at php_binary.tar.gz. Note that I tried the same procedure with the bc sample you provided, and it successfully generated some non-empty files.
Could you please take a brief look and let me know what could be the potential issue? Your help is very much appreciated!
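For anyone reproducing this, the "all generated files are empty" symptom can be confirmed with a few lines of Python. This is only a minimal sketch: the helper name `find_empty_files` is hypothetical and not part of SymLM, and it simply walks whatever output directory was passed to `--output_dir`.

```python
import os


def find_empty_files(output_dir):
    """Return the sorted paths of all zero-byte files under output_dir."""
    empty = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            path = os.path.join(root, name)
            # A size of 0 means the file was created but nothing was written.
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Calling `find_empty_files('/home/ruoyu/workspace/SymLM/php_sample/output')` would list every empty file produced for the binary, which makes it easy to compare against the output for the bc sample.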
Hi @wuruoyu,
Thanks for your interest! Based on your log, it appears that your test binary doesn't have DWARF information, which is required for our script to get the function boundaries. See details in
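A quick way to check this before running the pipeline is to look for a DWARF section in the ELF file. The sketch below is an assumption-laden illustration, not part of SymLM: the helper names `elf_section_names` and `has_dwarf` are hypothetical, it uses only the Python standard library, and it only tests whether a `.debug_info` (or legacy compressed `.zdebug_info`) section exists, not whether Ghidra can actually parse it.

```python
import struct


def elf_section_names(path):
    """Return the section names of an ELF file, in section-header order."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: %s" % path)

    is_64 = data[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big
    if is_64:
        ehdr_fmt = endian + "HHIQQQIHHHHHH"
        shdr_fmt = endian + "IIQQQQIIQQ"
    else:
        ehdr_fmt = endian + "HHIIIIIHHHHHH"
        shdr_fmt = endian + "10I"

    # Fields after the 16-byte e_ident; we need e_shoff, e_shentsize,
    # e_shnum and e_shstrndx.
    ehdr = struct.unpack_from(ehdr_fmt, data, 16)
    shoff, shentsize, shnum, shstrndx = ehdr[5], ehdr[10], ehdr[11], ehdr[12]

    # Assumption: fewer than 0xff00 sections, so e_shnum is not escaped.
    headers = [
        struct.unpack_from(shdr_fmt, data, shoff + i * shentsize)
        for i in range(shnum)
    ]
    if not headers:
        return []

    # In both layouts, index 4 is sh_offset and index 5 is sh_size.
    str_off, str_size = headers[shstrndx][4], headers[shstrndx][5]
    strtab = data[str_off:str_off + str_size]

    names = []
    for hdr in headers:
        start = hdr[0]                      # sh_name: offset into .shstrtab
        end = strtab.find(b"\0", start)
        if end < 0:
            end = len(strtab)
        names.append(strtab[start:end].decode("ascii", "replace"))
    return names


def has_dwarf(path):
    """True if the file carries a .debug_info or .zdebug_info section."""
    names = elf_section_names(path)
    return ".debug_info" in names or ".zdebug_info" in names
```

Running `has_dwarf('/home/ruoyu/workspace/SymLM/php_sample/php_binary')` should indicate whether debug information is embedded in the binary itself. If it returns `False`, rebuilding the binary with debug information (for example with `-g` under GCC) and leaving it unstripped would be the usual way to get DWARF data into the file.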
Thanks for your response, it is helpful!