adds support for new compilers
- modifies regex in `tiknib/utils.py` to match any version of gcc/clang
- adds `config/path_variables.py` for ease of use.
  note: `path_variables.py` is written to work in both Bash and Python.
- fixes a problem with the ase18 dataset and the coreutils debug
  information which caused all v6.5 functions to be discarded.
- adds tabulate to print out a formatted ROC table.
- adds `-P+` to enable compressing the IDA Pro databases. Saves a lot of
  storage space for this massive dataset! For example, the objdump database
  shrinks from 48M to 6.2M.

```
6.8M Dec  2  2018 /tmp/notpacked/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf
 48M Jan 28 10:43 /tmp/notpacked/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf.i64
6.8M Dec  2  2018 /tmp/packed/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf
6.2M Jan 28 10:41 /tmp/packed/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf.i64
```
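The first bullet is about matching compiler names of any version. The regex in `tiknib/utils.py` is not shown on this page, so the following is only a minimal sketch of the idea. It assumes binary names follow the `package_compiler-version_arch_bits_opt_name.elf` layout seen in the listing above; the pattern and the `parse_binary_name` helper are hypothetical.

```python
import re

# Hypothetical pattern, NOT the one in tiknib/utils.py. It assumes names shaped like
#   <package>_<gcc|clang>-<version>_<arch>_<bits>_<opt>_<binary>.elf
BINARY_NAME_RE = re.compile(
    r"^(?P<package>.+?)_"
    r"(?P<compiler>(?:gcc|clang)-\d+(?:\.\d+)*)_"  # any dotted version number
    r"(?P<arch>[^_]+)_(?P<bits>\d+)_"
    r"(?P<opt>O[^_]+)_"
    r"(?P<binary>.+)\.elf$"
)


def parse_binary_name(path):
    """Return the name components as a dict, or None if the name does not match."""
    file_name = path.rsplit("/", 1)[-1]
    match = BINARY_NAME_RE.match(file_name)
    return match.groupdict() if match else None


info = parse_binary_name("/tmp/packed/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf")
```

Matching the version as `\d+(?:\.\d+)*` rather than listing known releases is what lets a pattern like this accept compilers added later without further changes.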
wideglide committed Jan 28, 2022
1 parent 1f43df3 commit 4239818
Showing 20 changed files with 178 additions and 101 deletions.
11 changes: 7 additions & 4 deletions README.md
@@ -111,7 +111,10 @@ confirm their equivalence. Based on these criteria we conducted several steps to
build ground truth and clean the datasets. For more details, please check [our
paper](https://arxiv.org/abs/2011.10749).

### 1. Run IDA Pro to extract preliminary data for each functions.
### 1. Configure path variables for IDA Pro and this repository (`config/path_variables.py`).


### 2. Run IDA Pro to extract preliminary data for each functions.

**This step takes the most time.**

@@ -145,7 +148,7 @@ Additionally, **you can use this script to run any idascript for numerous
binaries in parallel.**


### 2. Extract source file names and line numbers to build ground truth.
### 3. Extract source file names and line numbers to build ground truth.
This extracts source file name and line number by parsing the debugging
information in a given binary. The binary must have been compiled with
the `-g` option.
@@ -156,7 +159,7 @@ $ python helper/extract_lineno.py \
--threshold 1
```

### 3. Filter functions.
### 4. Filter functions.
This filters functions by checking the source file name and line number.
This removes compiler intrinsic functions and duplicate functions spread
over multiple binaries within the same package.
@@ -167,7 +170,7 @@ $ python helper/filter_functions.py \
--threshold 1
```

### (Optional) 4. Counting the number of functions.
### (Optional) 5. Counting the number of functions.
This counts the number of functions and generates a graph of that function
on the same path of `input_list`. This also prints the numbers separated
by `','`. In the below example, a pdf file containing the graph will be
20 changes: 10 additions & 10 deletions config/config_list_openssl.txt
@@ -1,10 +1,10 @@
/home/dongkwan/tiknib/config/openssl/config_openssl_all.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_arm_arm.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_arm_mips.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_arm_x86.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_mips_arm.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_mips_mips.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_mips_x86.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_x86_arm.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_x86_mips.yml
/home/dongkwan/tiknib/config/openssl/config_openssl_x86_x86.yml
config/openssl/config_openssl_all.yml
config/openssl/config_openssl_arm_arm.yml
config/openssl/config_openssl_arm_mips.yml
config/openssl/config_openssl_arm_x86.yml
config/openssl/config_openssl_mips_arm.yml
config/openssl/config_openssl_mips_mips.yml
config/openssl/config_openssl_mips_x86.yml
config/openssl/config_openssl_x86_arm.yml
config/openssl/config_openssl_x86_mips.yml
config/openssl/config_openssl_x86_x86.yml
33 changes: 28 additions & 5 deletions example/example.sh
@@ -1,24 +1,47 @@
#!/bin/bash

source config/path_variables.py

SECONDS=0
echo "Processing IDA analysis ..."
python3 helper/do_idascript.py \
--idapath "/home/dongkwan/.tools/ida-6.95" \
--idc "tiknib/ida/fetch_funcdata.py" \
--idapath "${IDA_PATH}" \
--idc "${IDA_FETCH_FUNCDATA}" \
--input_list "example/input_list_find.txt" \
--log

echo "Extracting function types ..."

echo "Extract source file names and line numbers... ${SECONDS}s"
python3 helper/extract_lineno.py \
--input_list "example/input_list_find.txt" \
--threshold 1


echo "Filtering functions... ${SECONDS}s"
python3 helper/filter_functions.py \
--input_list "example/input_list_find.txt" \
--threshold 1


echo "Counting functions..."
python3 helper/count_functions.py \
--input_list "example/input_list_find.txt" \
--threshold 1


echo "Extracting function types ... ${SECONDS}s"
python3 helper/extract_functype.py \
--source_list "example/source_list.txt" \
--input_list "example/input_list_find.txt" \
--ctags_dir "data/ctags" \
--threshold 1

echo "Extracting features ..."
echo "Extracting features ... ${SECONDS}s"
python3 helper/extract_features.py \
--input_list "example/input_list_find.txt" \
--threshold 1

echo "Testing features ..."
echo "Testing features ... ${SECONDS}s"
python3 helper/test_roc.py \
--input_list "example/input_list_find.txt" \
--config "config/gnu/config_gnu_normal_all.yml"
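The script above reads `config/path_variables.py` with Bash `source`, while the commit message says the same file is also meant to work in Python. The file's contents are not shown on this page; the sketch below only illustrates one way such a dual-use file can work, assuming it holds nothing but bare `NAME="value"` assignments. The paths and helper names are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical contents: the real config/path_variables.py is not shown here.
# Bare NAME="value" lines (no spaces around "=") and "#" comments parse in both languages.
EXAMPLE_CONFIG = (
    "# shared by Bash and Python\n"
    'IDA_PATH="/opt/ida"\n'
    'IDA_FETCH_FUNCDATA="tiknib/ida/fetch_funcdata.py"\n'
)


def read_with_python(path):
    """Execute the file as Python and return its upper-case variables."""
    namespace = {}
    with open(path) as handle:
        exec(handle.read(), namespace)
    return {key: value for key, value in namespace.items() if key.isupper()}


def read_ida_path_with_bash(path):
    """Source the file in Bash and return IDA_PATH, or None if Bash is unavailable."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    result = subprocess.run(
        [bash, "-c", 'source "$1"; printf "%s" "$IDA_PATH"', "_", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "path_variables.py")
    with open(config_path, "w") as handle:
        handle.write(EXAMPLE_CONFIG)
    python_values = read_with_python(config_path)
    bash_ida_path = read_ida_path_with_bash(config_path)
```

The trade-off of this approach is that the file must stay within the small subset of syntax both languages share: a space around `=` or any Python expression on the right-hand side would break the Bash side.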
5 changes: 3 additions & 2 deletions helper/do_idascript.py
@@ -5,6 +5,7 @@
sys.path.insert(0, os.path.join(sys.path[0], ".."))
from tiknib.idascript import IDAScript
from tiknib.utils import do_multiprocess
from config.path_variables import IDA_PATH, IDA_FETCH_FUNCDATA

if __name__ == "__main__":
op = OptionParser()
@@ -16,15 +17,15 @@
action="store",
type=str,
dest="idapath",
default="/home/dongkwan/.tools/ida-6.95",
default=IDA_PATH,
help="IDA directory path",
)
op.add_option(
"--idc",
action="store",
type=str,
dest="idc",
default="tiknib/ida/fetch_funcdata.py",
default=IDA_FETCH_FUNCDATA,
help="IDA script file",
)
op.add_option(
8 changes: 6 additions & 2 deletions helper/extract_lineno.py
@@ -8,6 +8,7 @@
from tiknib.utils import do_multiprocess
from tiknib.utils import load_func_data, store_func_data
from tiknib.utils import parse_source_path
from config.path_variables import *

import logging
import coloredlogs
@@ -31,6 +32,9 @@ def extract_func_lineno(bin_name):
func["src_path"] = line_map[func_addr][0]
func["src_file"] = parse_source_path(func["src_path"])
func["src_line"] = line_map[func_addr][1]
# Fix ase18 source paths coreutils-6.7-6.5 / coreutils-6.7-6.7
if 'coreutils-6.7-6.5' in func['src_path']:
func['src_path'] = func['src_path'].replace('6.7-6.5', '6.5')
store_func_data(bin_name, func_data_list)
return

@@ -84,8 +88,8 @@ def extract_func_lineno(bin_name):

from tiknib.idascript import IDAScript
idascript = IDAScript(
idapath="/home/dongkwan/.tools/ida-6.95",
idc="tiknib/ida/fetch_funcdata.py",
idapath=IDA_PATH,
idc=IDA_FETCH_FUNCDATA,
force=True,
log=True,
)
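The ase18 path fix in this file can be looked at in isolation. The sketch below restates the same substitution as a standalone function; the function name and the example paths are hypothetical, and the idea that the unpatched `6.7-6.5` directory name is what made a later package-name check fail is an assumption, not something this page confirms.

```python
def normalize_ase18_src_path(src_path):
    """Rewrite the ase18 'coreutils-6.7-6.5' directory name to 'coreutils-6.5'.

    Paths that do not contain the marker are returned unchanged.
    """
    if "coreutils-6.7-6.5" in src_path:
        return src_path.replace("6.7-6.5", "6.5")
    return src_path


# Hypothetical debug-info paths, for illustration only.
fixed = normalize_ase18_src_path("/build/coreutils-6.7-6.5/src/ls.c")
untouched = normalize_ase18_src_path("/build/coreutils-6.7-6.7/src/ls.c")
```

Guarding on the full `coreutils-6.7-6.5` marker keeps the replacement from touching other paths that merely contain the substring `6.7-6.5`.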
2 changes: 1 addition & 1 deletion helper/filter_functions.py
@@ -39,7 +39,7 @@ def filter_funcs(bin_path):
# print(func['name'], func['src_file'], func['src_line'])

# filter functions by package name (remove functions inserted by compilers)
funcs = list(filter(lambda x: x['package'] in x['src_path'], funcs))
funcs = list(filter(lambda x: pack_name in x['src_path'], funcs))
num_pack_funcs = len(funcs)

if num_pack_funcs == 0:
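The changed line filters on a single `pack_name` instead of each function's own `package` field. How `pack_name` is computed is outside this hunk, so the following is only a sketch of the filtering step, with `pack_name` passed in as a parameter and hypothetical function records.

```python
def keep_package_funcs(funcs, pack_name):
    """Keep functions whose source path mentions the package name.

    Functions without a usable 'src_path' (for example, compiler-inserted ones
    with no debug information) are dropped.
    """
    return [func for func in funcs if pack_name in (func.get("src_path") or "")]


# Hypothetical records, for illustration only.
example_funcs = [
    {"name": "main", "src_path": "/build/coreutils-6.5/src/ls.c"},
    {"name": "__libc_csu_init", "src_path": "/build/glibc-2.27/csu/elf-init.c"},
    {"name": "_start", "src_path": None},
]
kept = keep_package_funcs(example_funcs, "coreutils-6.5")
```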
51 changes: 33 additions & 18 deletions helper/get_roc_table.py
@@ -6,6 +6,7 @@
import numpy as np

from optparse import OptionParser
from tabulate import tabulate

sys.path.insert(0, os.path.join(sys.path[0], ".."))
from tiknib.utils import load_cache
@@ -15,6 +16,12 @@
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.INFO)

def config_rename(config_fname):
# TODO: clean up the key name (config_fname to something neat).
config_key = os.path.basename(config_fname)
config_key = re.search("config_(.+).yml", config_key).groups()[0]
return config_key

def calc_tptn_gap(tps, tns):
return np.mean(np.abs(tps - tns), axis=0)

@@ -43,9 +50,7 @@ def load_results(opts):
# select the latest one
cache_dir = sorted(glob.glob("{}/*".format(outdir)))[-1]

# TODO: clean up the key name (config_fname to something neat).
config_key = os.path.basename(config_fname)
config_key = re.search("config_(.+).yml", config_key).groups()[0]
config_key = config_rename(config_fname)
all_data[config_key] = []
features_inter = set()
for idx in range(10):
@@ -65,9 +70,7 @@ def load_results(opts):

# Now fetch real data
for config_idx, config_fname in enumerate(config_fnames):
# TODO: clean up the key name (config_fname to something neat).
config_key = os.path.basename(config_fname)
config_key = re.search("config_(.+).yml", config_key).groups()[0]
config_key = config_rename(config_fname)

rocs = []
aps = []
@@ -154,18 +157,22 @@ def get_results(opts):
config_fnames, total_data, features, features_union = load_results(opts)

# first rows
print(','.join(map(lambda x:
row1 = ["# Train pairs (10^6)"]
row1.extend(list(map(lambda x:
'%.2f' % (x[0] / 1000000.0)
if x[0] > 100000
else '%.2fF' % (x[0] / 10000), total_data[0])))
print(','.join(map(lambda x:
row2 = ["# Test pairs (10^6)"]
row2.extend(list(map(lambda x:
'%.2f' % (x[1] / 1000000.0)
if x[1] > 100000
else '%.2fF' % (x[1] / 10000), total_data[0])))

# second rows
print(','.join(map(lambda x: '%.1f' % (x[0]), total_data[1])))
print(','.join(map(lambda x: '%.1f' % (x[1]), total_data[1])))
row3 = ["Train time"] + ['%.1f' % x[0] for x in total_data[1]]
row4 = ["Test time"] + ['%.1f' % x[1] for x in total_data[1]]

table = [row1, row2, row3, row4]

# third rows
for idx in features_union:
@@ -177,22 +184,30 @@ def get_results(opts):
s.append('%.2f-' % (data[feature][0]))
else:
s.append('%.2f' % (data[feature][0]))
print(','.join(s))
table.append(s)

# fourth row
print(','.join(map(lambda x: '%.1f' % (x), total_data[3])))
row = ["Avg # features"] + ['%.1f' % x for x in total_data[3]]
table.append(row)

# fifth rows
print(','.join(map(lambda x: '%.2f' % (x[0]), total_data[4])))
print(','.join(map(lambda x: '%.2f' % (x[1]), total_data[4])))
row = ["Mean tptn_gap"] + ['%.2f' % x[0] for x in total_data[4]]
table.append(row)
row = ["Std tptn_gap"] + ['%.2f' % x[1] for x in total_data[4]]
table.append(row)

# sixth rows
print(','.join(map(lambda x: '%.2f' % (x[0]), total_data[5])))
print(','.join(map(lambda x: '%.2f' % (x[1]), total_data[5])))
row = ["ROC AUC"] + ['%.2f' % x[0] for x in total_data[5]]
table.append(row)
row = ["Std. of ROC"] + ['%.2f' % x[1] for x in total_data[5]]
table.append(row)

# seventh rows
print(','.join(map(lambda x: '%.2f' % (x[0]), total_data[6])))
print(','.join(map(lambda x: '%.2f' % (x[1]), total_data[6])))
row = ["Avg Prec (AP)"] + ['%.2f' % x[0] for x in total_data[6]]
table.append(row)
row = ["Std of AP"] + ['%.2f' % x[1] for x in total_data[6]]
table.append(row)
print(tabulate(table, floatfmt=".2f"))


if __name__ == "__main__":
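The `config_rename` helper introduced in this file turns a config file path into a short key. The sketch below is a stricter variant of the same idea rather than the committed code: it escapes the dot before `yml`, anchors the match to the whole file name, and raises a clear error instead of failing on `None` when the name does not fit. The function name is hypothetical.

```python
import os
import re


def config_key_from_path(config_fname):
    """Map '.../config_<key>.yml' to '<key>'; raise ValueError for other names."""
    base_name = os.path.basename(config_fname)
    match = re.fullmatch(r"config_(.+)\.yml", base_name)
    if match is None:
        raise ValueError("unexpected config file name: {}".format(base_name))
    return match.group(1)


key = config_key_from_path("config/openssl/config_openssl_arm_x86.yml")
```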
31 changes: 21 additions & 10 deletions helper/run_ase.sh
@@ -1,29 +1,34 @@
#!/bin/bash
#!/bin/bash -ue
set -x

source config/path_variables.py

declare -a input_list=(
# This one is for processing all functions.
"/home/dongkwan/binkit-dataset/ase_debug.txt"
"${BINKIT_DATASET}/ase_debug.txt"
# Then, for experiment and counting, we utilize them separately.
# "/home/dongkwan/binkit-dataset/ase1_debug.txt"
# "/home/dongkwan/binkit-dataset/ase2_debug.txt"
# "/home/dongkwan/binkit-dataset/ase3_debug.txt"
# "/home/dongkwan/binkit-dataset/ase4_debug.txt"
# "${BINKIT_DATASET}/ase1_debug.txt"
# "${BINKIT_DATASET}/ase2_debug.txt"
# "${BINKIT_DATASET}/ase3_debug.txt"
# "${BINKIT_DATASET}/ase4_debug.txt"
)

source_list="/home/dongkwan/binkit-dataset/ase_source_list.txt"
ctags_dir="/home/dongkwan/binkit-dataset/ase_ctags_data"
source_list="${BINKIT_DATASET}/ase_source_list.txt"
ctags_dir="${BINKIT_DATASET}/ase_ctags_data"

SECONDS=0
echo "Processing IDA analysis ..."
for f in "${input_list[@]}"
do
echo "Processing ${f} ..."
python helper/do_idascript.py \
--idapath "/home/dongkwan/.tools/ida-6.95" \
--idc "/home/dongkwan/tiknib/tiknib/ida/fetch_funcdata_v6.95.py" \
--idapath "${IDA_PATH}" \
--idc "${IDA_FETCH_FUNCDATA}" \
--input_list "${f}" \
--log
done

echo "Extract source file names and line numbers... ${SECONDS}s"
for f in "${input_list[@]}"
do
echo "Processing ${f} ..."
@@ -32,6 +37,7 @@ do
--threshold 1
done

echo "Filtering functions... ${SECONDS}s"
for f in "${input_list[@]}"
do
echo "Processing ${f} ..."
@@ -40,6 +46,7 @@ do
--threshold 1
done

echo "Counting functions... ${SECONDS}s"
for f in "${input_list[@]}"
do
echo "Processing ${f} ..."
@@ -48,6 +55,7 @@ do
--threshold 1
done

echo "Extracting function types ... ${SECONDS}s"
for f in "${input_list[@]}"
do
echo "Processing ${f} ..."
@@ -58,10 +66,13 @@ do
--threshold 1
done

echo "Extracting features ... ${SECONDS}s"
for f in "${input_list[@]}"
do
echo "Processing ${f} ..."
python helper/extract_features.py \
--input_list "${f}" \
--threshold 1
done

echo "DONE in ${SECONDS}s"