Create a new directory for replicating the experiments. In this document, ${ROOT} denotes the path of this directory.
Before reproducing the experiments, install the BinSim package by following the instructions in README.md.
Download the BinaryCorp and Trex datasets and extract them into ${ROOT}/original-dataset.
Note: The Trex dataset link provides a large number of projects; we used only the projects that appear in the original Trex paper.
After extraction, the directory structure should look like this:
.
└── original-dataset
    ├── BinaryCorp
    │   ├── small_train
    │   └── test
    └── Trex
        ├── binutils
        ├── ...
        └── zlib
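Before converting the datasets, it can help to confirm that the extracted layout matches the tree above. The following is a minimal sketch, not part of the BinSim scripts: the helper name `missing_dirs` is hypothetical, and it checks only the folders that the tree names explicitly (the Trex projects hidden behind `...` are not checked).

```python
from pathlib import Path

# Sub-directories named explicitly in the tree above. Trex project folders
# other than binutils and zlib are elided in the source, so they are skipped.
EXPECTED = [
    "original-dataset/BinaryCorp/small_train",
    "original-dataset/BinaryCorp/test",
    "original-dataset/Trex/binutils",
    "original-dataset/Trex/zlib",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_dir()]
```

Calling `missing_dirs(os.environ["ROOT"])` returns an empty list when every named folder is present, and otherwise lists the relative paths that still need to be extracted.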
Because the two datasets have different directory structures, we provide a script that converts them into the same structure. Use the following commands to convert both datasets:
cd ${ROOT}
python experiment/preprocess/code/convert/convert.py --original ${ROOT}/original-dataset/BinaryCorp --converted ${ROOT}/dataset/BinaryCorp binarycorp
python experiment/preprocess/code/convert/convert.py --original ${ROOT}/original-dataset/Trex --converted ${ROOT}/dataset/Trex trex
Once the datasets are converted, you can disassemble them to extract the necessary information. We provide a script for this; it reads configuration files that specify the dataset paths and the information to extract. The configuration files are located at experiment/preprocess/config/dataset.
To disassemble a dataset, modify the paths in its configuration file and run the script with that file. Here we take the Trex dataset as an example and show the detailed steps for extracting the ACFG of its binary functions.
- Modify the configuration file
  Open experiment/preprocess/config/dataset/trex/ACFG.yaml and replace /path/to/your/root with the value of ${ROOT}.

    dataset:
      type: ACFG
      binary-dir: /path/to/your/root/dataset/trex
      dataset-dir: /path/to/your/root/processed-dataset/trex/ACFG
      middle-dir: /path/to/your/root/cache/middle/trex/ACFG
      cache-dir: /path/to/your/root/cache/database/trex
- Disassemble
  Disassemble the dataset with the following command:
    cd ${ROOT}
    python experiment/preprocess/code/disassemble/preprocess-dataset.py --config experiment/preprocess/config/dataset/${DATASET}/${GraphType}.yaml
  Here, ${DATASET} denotes the name of the dataset and ${GraphType} denotes the type of data to extract. There are 9 data types in total, including ACFG (for Gemini), ByteCode (for $\alpha$-diff), CodeAST (for Asteria), InsDAG (for RCFG2Vec), jTransSeq (for jTrans), TokenCFG (for GraphEmbed), and TokenSeq (for SAFE). Replace ${GraphType} with the data type you need.
  Disassembling takes several hours, depending on the number of CPUs available on your machine. On our machine with 80 CPUs, extracting ACFG takes less than 4 hours.
  After disassembling, the extracted data is saved in ${ROOT}/processed-dataset/${DATASET}/${GraphType}, and the directory structure should look like this:
    .
    ├── test
    │   ├── dataset.db
    │   ├── meta.pkl
    │   └── statistics.pkl
    ├── train
    │   ├── dataset.db
    │   ├── dataset.db.lock
    │   ├── meta.pkl
    │   └── statistics.pkl
    └── validation
        ├── dataset.db
        ├── meta.pkl
        └── statistics.pkl
- Extract the validation set (only for BinaryCorp)
  The BinaryCorp dataset does not provide a validation set, so we extract about 30% of the functions from the training set to form one. Use the following command to extract the validation set:
    cd ${ROOT}
    python experiment/preprocess/code/disassemble/split-val-from-train.py --dataset-dir ${ROOT}/processed-dataset/BinaryCorp/ACFG
- Train Ins2Vec (optional)
  GraphEmbed and SAFE adopt Word2Vec to learn representations of assembly instructions (named i2v in their original papers). After disassembling the dataset, you can train the i2v model on the extracted corpus:
    cd ${ROOT}
    python experiment/preprocess/code/disassemble/pretrain.py --config experiment/preprocess/config/ins2vec/trex/graphEmbed-ins2vec-default.yaml ins2vec
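The disassembly step above is run once per data type, with only the configuration file changing. The sketch below builds those command lines for the seven data types named in this document; it prints them and executes nothing. It assumes that a configuration file named `${GraphType}.yaml` exists for each type under the dataset's config folder (the text confirms only `trex/ACFG.yaml`), and the helper name `preprocess_command` is hypothetical.

```python
import shlex

# The seven data types named in this document and the model each one feeds.
DATA_TYPES = {
    "ACFG": "Gemini",
    "ByteCode": "alpha-diff",
    "CodeAST": "Asteria",
    "InsDAG": "RCFG2Vec",
    "jTransSeq": "jTrans",
    "TokenCFG": "GraphEmbed",
    "TokenSeq": "SAFE",
}

SCRIPT = "experiment/preprocess/code/disassemble/preprocess-dataset.py"


def preprocess_command(dataset, graph_type):
    """Build the argv list for one disassembly run (nothing is executed)."""
    if graph_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {graph_type}")
    # Assumption: configs follow the ${DATASET}/${GraphType}.yaml pattern.
    config = f"experiment/preprocess/config/dataset/{dataset}/{graph_type}.yaml"
    return ["python", SCRIPT, "--config", config]


# Print one command per data type so they can be reviewed before running.
for graph_type in DATA_TYPES:
    print(shlex.join(preprocess_command("trex", graph_type)))
```

Each printed line is meant to be run from ${ROOT}, as in the command shown in the Disassemble step.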
Once the dataset is prepared, you can train and evaluate the models with another script and its configuration files. The script is located at experiment/common/train-or-test/train-siamese.py, and the configuration files are located at experiment/code/1_bcsd/config.
Before training and evaluating the models, replace /path/to/your/root in all configuration files with the value of ${ROOT}.
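Editing every configuration file by hand is error-prone, so the replacement can be scripted. The following is a minimal sketch under two assumptions: all configuration files use the `.yaml` extension, and the placeholder appears literally as `/path/to/your/root`. The helper name `fill_root` is hypothetical and not part of the repository. It rewrites files in place, so keep the directory under version control or make a copy first.

```python
from pathlib import Path

PLACEHOLDER = "/path/to/your/root"


def fill_root(config_dir, root):
    """Replace the placeholder in every .yaml file under config_dir.

    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(config_dir).rglob("*.yaml")):
        text = path.read_text(encoding="utf-8")
        if PLACEHOLDER in text:
            # Rewrite the file in place with the real root path.
            path.write_text(text.replace(PLACEHOLDER, str(root)), encoding="utf-8")
            changed.append(path)
    return changed
```

For example, `fill_root("experiment/code/1_bcsd/config", os.environ["ROOT"])` would update the training and test configurations, and the returned list shows which files were touched.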
Then, you can train and evaluate most models with the following command:
cd ${ROOT}
# train
python experiment/common/train-or-test/train-siamese.py --config experiment/code/1_bcsd/config/trex/RCFG2Vec.yaml train
# test
python experiment/common/train-or-test/train-siamese.py --config experiment/code/1_bcsd/config/trex/RCFG2Vec.yaml test --test-config experiment/code/1_bcsd/config/trex/common-test.yaml
For jTrans, we directly use its pretrained model, so you need to use jTrans-test.yaml as the test configuration file:
cd ${ROOT}
# test jTrans
python experiment/common/train-or-test/train-siamese.py --config experiment/code/1_bcsd/config/trex/jTrans.yaml test --test-config experiment/code/1_bcsd/config/trex/jTrans-test.yaml
Note: In our paper, we directly use the pre-trained jTrans model. We did not examine the jTrans training process closely, so we cannot guarantee that training jTrans works correctly in the current version.
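The only difference between the jTrans test command and the general one is the test configuration file. The sketch below captures that rule when building the command line; it executes nothing. It assumes that each model's configuration is named `${Model}.yaml` (confirmed here only for RCFG2Vec and jTrans) and that every model other than jTrans uses common-test.yaml, which the text states only for "most" models. The helper name `build_test_command` is hypothetical.

```python
SCRIPT = "experiment/common/train-or-test/train-siamese.py"
CONFIG_DIR = "experiment/code/1_bcsd/config"


def build_test_command(dataset, model):
    """Build the argv list for evaluating one model (nothing is executed)."""
    # jTrans has its own test configuration; other models are assumed
    # to share common-test.yaml.
    test_cfg = "jTrans-test.yaml" if model == "jTrans" else "common-test.yaml"
    return [
        "python", SCRIPT,
        "--config", f"{CONFIG_DIR}/{dataset}/{model}.yaml",
        "test",
        "--test-config", f"{CONFIG_DIR}/{dataset}/{test_cfg}",
    ]
```

The resulting list can be passed to `subprocess.run(..., cwd=root, check=True)` once the configuration files have been updated.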