OSUSecLab/SymLM

Question about function name prediction

Closed this issue · 2 comments

Hi, thanks for your efforts in open source!

I have been trying to use this tool to generate embeddings for some binaries. I trained the model following https://github.com/OSUSecLab/SymLM?tab=readme-ov-file#model-training, which seemed to work. Then I tried to generate the dataset for my test binary following https://github.com/OSUSecLab/SymLM/tree/main/dataset_generation#dataset-preparation, but all the generated files in the output directory are empty.

Specifically, the following is my run.sh:

#!/bin/bash

GHIDRA_ANALYZEHEADLESS_PATH='/home/ruoyu/workspace/ghidra/support/analyzeHeadless' # path to ghidra analyzeHeadless executable
GHIDRA_PROJECT_PATH='/home/ruoyu/' # path to ghidra project
GHIDRA_PROJECT_NAME='SymLM' # name of ghidra project
# BINARY_PATH='/home/ruoyu/workspace/SymLM/dataset_generation/sample_binary/bc/bc'  # path to binary
BINARY_PATH='/home/ruoyu/workspace/SymLM/php_sample/php_binary'  # path to binary
BINARY_ARCHITECTURE='x64' # architecture of binary, options: x86, x64, arm, mips
# DATASET_OUTPUT_DIR='/home/ruoyu/workspace/SymLM/dataset_generation/sample_output/bc' # path to output directory
DATASET_OUTPUT_DIR='/home/ruoyu/workspace/SymLM/php_sample/output' # path to output directory

# generate interprocedural cfg
$GHIDRA_ANALYZEHEADLESS_PATH $GHIDRA_PROJECT_PATH $GHIDRA_PROJECT_NAME -import $BINARY_PATH -readOnly -postScript ./get_calling_context.py

# generate dataset
python ./prepare_dataset.py \
    --output_dir $DATASET_OUTPUT_DIR \
    --input_binary_path $BINARY_PATH \
    --arch $BINARY_ARCHITECTURE

Below is the log from running run.sh. I did not notice any errors that stopped the run.

openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu120.04, mixed mode)
INFO  Using log config file: jar:file:/home/ruoyu/workspace/ghidra/Ghidra/Framework/Generic/lib/Generic.jar!/generic.log4j.xml (LoggingInitialization)  
INFO  Using log file: /home/ruoyu/.ghidra/.ghidra_10.1.2_PUBLIC/application.log (LoggingInitialization)  
INFO  Loading user preferences: /home/ruoyu/.ghidra/.ghidra_10.1.2_PUBLIC/preferences (Preferences)  
INFO  Class search complete (859 ms) (ClassSearcher)  
INFO  Initializing SSL Context (SSLContextInitializer)  
INFO  Initializing Random Number Generator... (SecureRandomFactory)  
INFO  Random Number Generator initialization complete: NativePRNGNonBlocking (SecureRandomFactory)  
INFO  Trust manager disabled, cacerts have not been set (ApplicationTrustManagerFactory)  
INFO  HEADLESS Script Paths:
    /home/ruoyu/workspace/ghidra/Ghidra/Processors/DATA/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Debug/Debugger/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Processors/PIC/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/Python/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/MicrosoftCodeAnalyzer/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Processors/8051/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Debug/Debugger-agent-dbgmodel-traceloader/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/FileFormats/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/GnuDemangler/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/VersionTracking/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/Base/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/BytePatterns/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/FunctionID/ghidra_scripts
    /home/ruoyu/workspace/ghidra/Ghidra/Features/Decompiler/ghidra_scripts (HeadlessAnalyzer)  
INFO  HEADLESS: execution starts (HeadlessAnalyzer)  
INFO  Opening existing project: /home/ruoyu/SymLM (HeadlessAnalyzer)  
INFO  Opening project: /home/ruoyu/SymLM (HeadlessProject)  
INFO  REPORT: Processing input files:  (HeadlessAnalyzer)  
INFO       project: /home/ruoyu/SymLM (HeadlessAnalyzer)  
INFO  IMPORTING: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
INFO  REPORT: Import succeeded with language "x86:LE:64:default" and cspec "gcc" for file: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
INFO  ANALYZING all memory and code: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
INFO  DWARF external debug information found: ExternalDebugInfo [filename=null, crc=0, hash=09ebf3fca5a553c4033819ba96da4ee654f8a02f] (ExternalDebugFilesService)  
INFO  Unable to find DWARF information, skipping DWARF analysis (DWARFAnalyzer)  
ERROR Invalid PNG data at 007ca170 (PngDataType)  
INFO  hit non-returning function, restarting decompiler switch analyzer later (DecompilerSwitchAnalyzer)  
INFO  Packed database cache: /tmp/ruoyu-Ghidra/packed-db-cache (PackedDatabaseCache)  
INFO  -----------------------------------------------------
    ASCII Strings                              1.276 secs
    Apply Data Archives                        0.327 secs
    Call Convention ID                         0.130 secs
    Call-Fixup Installer                       0.105 secs
    Create Address Tables                      0.513 secs
    Create Address Tables - One Time          12.714 secs
    Create Function                            0.054 secs
    DWARF                                      0.007 secs
    Data Reference                             1.895 secs
    Decompiler Switch Analysis                 1.561 secs
    Decompiler Switch Analysis - One Time     74.185 secs
    Demangler GNU                              0.179 secs
    Disassemble Entry Points                  12.732 secs
    ELF Scalar Operand References              3.396 secs
    Embedded Media                             0.100 secs
    External Entry References                  0.009 secs
    Function ID                                4.464 secs
    Function Start Search                      0.122 secs
    Function Start Search After Code           0.080 secs
    Function Start Search After Data           0.085 secs
    GCC Exception Handlers                     2.694 secs
    Non-Returning Functions - Discovered       3.896 secs
    Non-Returning Functions - Known            0.006 secs
    Reference                                  2.540 secs
    Shared Return Calls                        0.789 secs
    Stack                                     25.038 secs
    Subroutine References                      0.883 secs
    x86 Constant Reference Analyzer           22.435 secs
-----------------------------------------------------
     Total Time   172 secs
-----------------------------------------------------
 (AutoAnalysisManager)  
INFO  REPORT: Analysis succeeded for file: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
INFO  SCRIPT: /home/ruoyu/workspace/SymLM/dataset_generation/get_calling_context.py (HeadlessAnalyzer)  
[*] The interprocedural CFG be saved in: ./icfg/php_binary
INFO  ANALYZING changes made by post scripts: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
INFO  REPORT: Post-analysis succeeded for file: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
INFO  REPORT: Discarded file import due to readOnly option: /home/ruoyu/workspace/SymLM/php_sample/php_binary (HeadlessAnalyzer)  
[*] create output folder for an individual binary: /home/ruoyu/workspace/SymLM/php_sample/output/php_binary
[*] load icfg file: ./icfg/php_binary/icfg.json
[*] Dataset for /home/ruoyu/workspace/SymLM/php_sample/php_binary is generated in: /home/ruoyu/workspace/SymLM/php_sample/output/php_binary

The test binary is attached as php_binary.tar.gz. Note that I ran the same procedure on the bc sample you provided, and it successfully generated non-empty files.
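For reference, a quick way to confirm which generated files are empty is a small helper like the one below. It is only a sketch: the function name `find_empty_files` is made up for illustration, and it assumes nothing about the layout of the output directory beyond it being a normal directory tree.

```python
from pathlib import Path


def find_empty_files(output_dir):
    """Return a sorted list of zero-byte files found anywhere under output_dir."""
    root = Path(output_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling `find_empty_files(DATASET_OUTPUT_DIR)` on the php output directory and on the bc sample output makes the difference between the two runs easy to compare.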

Could you please take a brief look and let me know what could be the potential issue? Your help is very much appreciated!

Hi @wuruoyu,

Thanks for your interest! Based on your log ("Unable to find DWARF information, skipping DWARF analysis"), it appears that your test binary doesn't contain DWARF information, which our script requires to get the function boundaries. See details in

def get_function_reps(die):
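As a rough pre-flight check before running the dataset generation, one could test whether a binary carries a DWARF `.debug_info` section at all. The sketch below is not part of SymLM: the helper names `elf_section_names` and `has_dwarf_sections` are hypothetical, it uses only the Python standard library, and it assumes that the presence of a `.debug_info` (or compressed `.zdebug_info`) section is a reasonable proxy for "has DWARF information". It does not handle ELF files with more than 65,279 sections (extended section numbering).

```python
import struct


def elf_section_names(path):
    """Return the list of section names in an ELF file (standard library only)."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: %s" % path)

    is_64bit = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    if is_64bit:
        ehdr_fmt, shdr_fmt = endian + "HHIQQQIHHHHHH", endian + "IIQQQQIIQQ"
    else:
        ehdr_fmt, shdr_fmt = endian + "HHIIIIIHHHHHH", endian + "IIIIIIIIII"

    # Fields after the 16-byte e_ident; indexes are the same for 32- and 64-bit.
    ehdr = struct.unpack_from(ehdr_fmt, data, 16)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = ehdr[5], ehdr[10], ehdr[11], ehdr[12]
    if e_shoff == 0 or e_shnum == 0:
        return []                           # no section header table

    shdrs = [
        struct.unpack_from(shdr_fmt, data, e_shoff + i * e_shentsize)
        for i in range(e_shnum)
    ]
    # sh_offset and sh_size are fields 4 and 5 in both layouts.
    strtab_off, strtab_size = shdrs[e_shstrndx][4], shdrs[e_shstrndx][5]
    strtab = data[strtab_off:strtab_off + strtab_size]

    names = []
    for shdr in shdrs:
        start = shdr[0]                     # sh_name: offset into the string table
        end = strtab.find(b"\0", start)
        if end == -1:
            end = len(strtab)
        names.append(strtab[start:end].decode("ascii", "replace"))
    return names


def has_dwarf_sections(path):
    """True if the ELF file has a (possibly compressed) .debug_info section."""
    names = elf_section_names(path)
    return ".debug_info" in names or ".zdebug_info" in names
```

If `has_dwarf_sections(BINARY_PATH)` returns `False`, the binary was most likely built without `-g` or had its debug information stripped or split into a separate file, and rebuilding it with debug information would be the first thing to try.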

Thanks for your response, it is helpful!