The code consists of two components: the first takes as input the ACFG disasm data and produces a number of intermediate results. These are then used as input by the second component, which implements the machine learning part.
The first part of the tool is implemented in a Python3 script called `generate_function_traces.py`. We also provide a Docker container with the required dependencies.
The input of `generate_function_traces.py` is a folder with the JSON files extracted via the ACFG disasm IDA plugin:
- For the Datasets we have released, the JSON files are already available in the `features` directory. Please note that the features are downloaded from GDrive as explained in the README.
- To extract the features for a new set of binaries, run the ACFG disasm IDA plugin following the instructions in the README.
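Before running the container, it may help to check that the input directory actually contains the extracted JSON files. The following is only a sketch; the path below matches the Dataset-Vulnerability example used later and should be adjusted to your setup:

```python
from pathlib import Path

# Hypothetical input location: adjust to the ACFG disasm folder of your dataset
acfg_dir = Path("DBs/Dataset-Vulnerability/features/acfg_disasm_Dataset-Vulnerability")

# Count the JSON files that will be processed by generate_function_traces.py
json_files = sorted(acfg_dir.glob("*.json"))
print(f"Found {len(json_files)} ACFG disasm JSON files in {acfg_dir}")
```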
The script will produce the following output:
- A JSON file called `trex_traces.json` that contains the traces used as input by the Trex model.
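As a quick sanity check, the resulting file can be loaded with Python's standard `json` module. This is only a sketch: the output path is hypothetical, and the internal structure of the traces (which is defined by `generate_function_traces.py`) is not assumed here.

```python
import json
from pathlib import Path

# Hypothetical output location: <output_dir>/trex_traces.json
traces_path = Path("Preprocessing/Dataset-Vulnerability-trex/trex_traces.json")

with traces_path.open() as f_in:
    traces = json.load(f_in)

# Report only the number of top-level entries, without assuming the trace format
print(f"Loaded {len(traces)} top-level entries from {traces_path}")
```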
The following are the concrete steps to run the analysis within the provided Docker container:
- Build the docker image:
docker build Preprocessing/ -t trex-preprocessing
- Run the script within the docker container:
docker run --rm -v <path_to_the_acfg_disasm_dir>:/input -v <path_to_the_output_dir>:/output -it trex-preprocessing /code/generate_function_traces.py -i /input -o /output/
You can see all the command line options using:
docker run --rm -it trex-preprocessing /code/generate_function_traces.py --help
Example: run `generate_function_traces.py` on the Dataset-Vulnerability:
docker run --rm -v $(pwd)/../../DBs/Dataset-Vulnerability/features/acfg_disasm_Dataset-Vulnerability:/input -v $(pwd)/Preprocessing/:/output -it trex-preprocessing /code/generate_function_traces.py -i /input -o /output/Dataset-Vulnerability-trex
Run the unit tests:
docker run --rm -v $(pwd)/Preprocessing/testdata/:/input -v $(pwd)/Preprocessing/testdata/trex_temp:/output -it trex-preprocessing /bin/bash -c "( cd /code && python3 -m unittest test_generate_function_traces.py )"
The machine learning component is implemented in a script called `trex_inference.py`. We also provide a Docker container with the required dependencies.
The script takes the following inputs:
- The CSV file with the function pairs to compare. For the Datasets we have released, the function pairs are located in the `pairs` folder (example here).
- The JSON file with the function traces produced by the `generate_function_traces.py` script.
- The directory with the trained model (link to Google Drive).
- The folder with the binarized data from the pre-training phase (link to Google Drive).
The script produces the following output:
- A copy of the input CSV file (called `{original_name}.trex_out.csv`) with an additional `cs` column that contains the cosine similarity between the function pairs.
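To consume the result, the output CSV can be read with Python's standard `csv` module. A minimal sketch, assuming a hypothetical output file name; the `cs` column is the one described above:

```python
import csv

# Hypothetical path: "{original_name}.trex_out.csv" inside the chosen output directory
out_csv = "output/pairs_testing.trex_out.csv"

with open(out_csv, newline="") as f_in:
    reader = csv.DictReader(f_in)
    # The 'cs' column contains the cosine similarity for each function pair
    similarities = [float(row["cs"]) for row in reader]

print(f"{len(similarities)} pairs scored; "
      f"min={min(similarities):.4f}, max={max(similarities):.4f}")
```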
The following are the concrete steps to run the Trex inference within the provided Docker container:
- Build the docker image:
docker build NeuralNetwork/ -t trex-inference
- Run the inference script:
docker run --rm -v <path_to_the_function_pairs_folder>:/pairs -v <path_to_the_trex_traces_folder>:/traces -v <path_to_the_output_folder>:/output -it trex-inference conda run --no-capture-output -n trex python3 trex_inference.py --input-pairs /pairs/<name_of_csv_file> --input-traces /traces/trex_traces.json --model-checkpoint-dir checkpoints/similarity/ --data-bin-dir data-bin-sim/similarity/ --output-dir /output/<name_of_output_dir>
You can see all the command line options using:
docker run --rm -it trex-inference conda run --no-capture-output -n trex python3 trex_inference.py --help
Example:
docker run --rm -v $(pwd)/../../DBs/Dataset-Vulnerability/pairs/:/pairs -v $(pwd)/Preprocessing/Dataset-Vulnerability-trex/:/traces -v $(pwd)/NeuralNetwork/:/output -it trex-inference conda run --no-capture-output -n trex python3 trex_inference.py --input-pairs /pairs/pairs_testing_Dataset-Vulnerability.csv --input-traces /traces/trex_traces.json --model-checkpoint-dir checkpoints/similarity/ --data-bin-dir data-bin-sim/similarity/ --output-dir /output/Dataset-Vulnerability-trex
The following are the main steps that are needed to run the Trex model on a new dataset of functions.
Our implementation only covers the inference part of the model. The official repository of Trex includes additional information on how to run the pre-training and training of the neural network.
- Create a CSV file with the pairs of functions selected for validation and testing. Example here. (`idb_path_1`, `fva_1`) and (`idb_path_2`, `fva_2`) are the "primary keys". A minimal example is sketched after this list.
- Extract the features using the ACFG disasm IDA plugin following the instructions in the README. `idb_path_1` and `idb_path_2` for the selected functions must be valid paths to the IDB files in order to run the IDA plugin correctly.
- Run the `generate_function_traces.py` script following the instructions in Part 1.
- Run the `trex_inference.py` script following the instructions in Part 2.
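The following is a minimal sketch of how such a CSV file could be created with Python's standard `csv` module. The four column names are the "primary keys" listed above; the paths and function addresses are hypothetical, and the released example CSVs may contain additional columns:

```python
import csv

# Hypothetical function pairs: replace with real IDB paths and function virtual addresses
rows = [
    {
        "idb_path_1": "IDBs/Dataset-New/binary_a.i64",  # hypothetical IDB path
        "fva_1": "0x401000",                            # hypothetical function address
        "idb_path_2": "IDBs/Dataset-New/binary_b.i64",  # hypothetical IDB path
        "fva_2": "0x402340",                            # hypothetical function address
    },
]

with open("pairs_testing_new_dataset.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(
        f_out, fieldnames=["idb_path_1", "fva_1", "idb_path_2", "fva_2"])
    writer.writeheader()
    writer.writerows(rows)
```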
- The model was originally trained on four architectures (x86, x64, ARM 32 bit, and MIPS 32 bit), four optimization levels (O0, O1, O2, O3) and one compiler version (GCC-7.5). Also, the pre-training phase requires an emulation component that only supports those architectures. Due to these limitations, we only tested the Trex model on Dataset-2 and Dataset-Vulnerability, which are a subset of the binaries that were released by the Trex authors.
- The Trex model is faster when running on a GPU; however, this is not supported in the Dockerfile that we release.
The original code of Trex is released under the MIT license.