This repository is the official implementation of the paper "RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity Detection".
This repository was forked from a private repository. Before uploading to GitHub, I removed some private scripts, which may lead to errors during execution. If you encounter any errors, please open an issue on GitHub, and I will address it as soon as possible.
Our code is organized as a Python package to facilitate fair comparison of different models. It currently implements several neural network models for binary code similarity detection, including:
- Gemini [paper] [code]
- SAFE [paper] [code]
- GraphEmbed [paper] [code]
- jTrans [paper] [code]
- alpha-diff [paper] [code]
- RCFG2Vec [paper] [code]
- Asteria [paper] [code]
We have tested the code on Ubuntu 22.04 LTS with Python 3.10.
Note: We have met several problems when installing the Python binding of rocksdb on other systems. Compiling rocksdb from source may solve the problem.
We use Binary Ninja and IDA Pro to disassemble the binary code and extract the necessary information, so before running the code you need to install both and hold valid licenses. For Binary Ninja, you also need to install its Python binding.
Binsim depends on rocksdb to save training samples, so you should install it first.
sudo apt install build-essential
sudo apt-get install libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev libgflags-dev
sudo apt install librocksdb-dev
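As a quick sanity check after the apt commands above, the following sketch asks the system linker whether the native libraries can be located. It is not part of this repository: the helper name `check_native_libraries` is hypothetical, and the short library names are assumptions derived from the package names above. It relies only on `ctypes.util.find_library` from the standard library.

```python
import ctypes.util
from typing import Dict, Iterable, Optional

# Short linker names assumed to correspond to the apt packages above
# (librocksdb-dev, libsnappy-dev, zlib1g-dev, libbz2-dev, liblz4-dev,
# libzstd-dev, libgflags-dev). Adjust if your system names them differently.
DEFAULT_LIBRARIES = ["rocksdb", "snappy", "z", "bz2", "lz4", "zstd", "gflags"]


def check_native_libraries(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each library name to the file name the linker finds, or None."""
    return {name: ctypes.util.find_library(name) for name in names}


report = check_native_libraries(DEFAULT_LIBRARIES)
for name, found in report.items():
    print(f"{name:10s} {'found: ' + found if found else 'NOT FOUND'}")
```

A `NOT FOUND` entry only means the linker could not locate the library by that name; it does not by itself prove the Python binding of rocksdb will fail to build.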
After installing the above libraries and packages, you can install the necessary Python packages with the following command:
pip install -r requirements.txt
Note: The dgl package installed by the above command only supports CPU. If you want the GPU version, you need to install it with the command provided on its official website.
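To see which builds of dgl and torch actually ended up in your environment, a minimal sketch like the one below can help. The helper name `installed_version` is hypothetical and not part of this repository; it uses only `importlib.metadata` from the standard library. CUDA builds often carry a local version tag such as `+cu121`, but treat that as a heuristic rather than a guarantee.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the note above; extend the list as needed.
for package in ("dgl", "torch"):
    version = installed_version(package)
    print(f"{package}: {version if version else 'not installed'}")
```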
pip install .
Note: We have implemented an experimental PyTorch operator for TreeLSTM and DAGGRU, which can significantly speed up the training process. If you want to use it, make sure that CUDA is available and that nvcc is installed.
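The two prerequisites in the note above can be checked with a short script before building. This is a sketch under assumptions: the function name `cuda_build_prerequisites` is hypothetical, and it only tests whether `nvcc` is on `PATH` and whether an installed PyTorch reports a usable CUDA device. It does not verify that the nvcc version matches the CUDA version PyTorch was built with.

```python
import importlib.util
import shutil
from typing import Dict


def cuda_build_prerequisites() -> Dict[str, bool]:
    """Report whether nvcc is on PATH and whether PyTorch can see CUDA."""
    status = {
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_cuda_available": False,
    }
    # Import torch only if it is installed, so the check also runs
    # in an environment where requirements are not yet set up.
    if importlib.util.find_spec("torch") is not None:
        import torch

        status["torch_cuda_available"] = bool(torch.cuda.is_available())
    return status


for check, ok in cuda_build_prerequisites().items():
    print(f"{check}: {'yes' if ok else 'no'}")
```

If either check prints `no`, the experimental operator is unlikely to build.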
We provide a guideline for reproducing the experiments in our paper. You can find it here.