adds support for new compilers

- modifies the regex in `tiknib/utils.py` to match any version of gcc/clang.
- adds `config/path_variables.py` for ease of use. Note: `path_variables.py` is written to be valid in both Bash and Python.
- fixes a problem with the ase18 dataset and the coreutils debug information that caused all v6.5 functions to be discarded.
- adds tabulate to print a formatted ROC table.
- adds `-P+` to compress the IDA Pro databases. This saves a lot of storage space for this massive dataset! For example, the objdump database shrinks from 48M to 6.2M.

```
6.8M Dec 2 2018 /tmp/notpacked/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf
48M Jan 28 10:43 /tmp/notpacked/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf.i64
6.8M Dec 2 2018 /tmp/packed/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf
6.2M Jan 28 10:41 /tmp/packed/binutils-2.30_clang-7.0_arm_64_O0_objdump.elf.i64
```
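
To illustrate what a version-agnostic match for gcc/clang could look like, here is a minimal sketch that parses binary names such as the objdump example in this message. It is not the regex from `tiknib/utils.py`: the pattern, the field names, and the `parse_binary_name` helper are assumptions for illustration, inferred only from the file name shown in the listing.

```python
import re

# Hypothetical pattern; NOT the actual regex from tiknib/utils.py.
# Assumed layout:
#   <package>_<compiler>-<version>_<arch>_<bits>_<opt>_<binary>.elf
BIN_NAME_RE = re.compile(
    r"^(?P<package>.+?)_"
    r"(?P<compiler>(?:gcc|clang)-\d+(?:\.\d+)*)_"  # any gcc/clang version
    r"(?P<arch>[a-z0-9]+)_"
    r"(?P<bits>32|64)_"
    r"(?P<opt>O[0-3sz]|Ofast)_"
    r"(?P<binary>.+)\.elf$"
)


def parse_binary_name(fname):
    """Return the name fields as a dict, or None if the name does not match."""
    m = BIN_NAME_RE.match(fname)
    return m.groupdict() if m else None


info = parse_binary_name("binutils-2.30_clang-7.0_arm_64_O0_objdump.elf")
```

Matching the version as `\d+(?:\.\d+)*` rather than listing known releases is what keeps such a pattern working when a new compiler version is added to the dataset.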
wideglide · Jan 28, 2022 · 4239818
1 parent 1f43df3
Showing 20 changed files with 178 additions and 101 deletions.
diff --git a/README.md b/README.md
@@ -111,7 +111,10 @@ confirm their equivalence. Based on these criteria we conducted several steps to
 build ground truth and clean the datasets. For more details, please check [our
 paper](https://arxiv.org/abs/2011.10749).
 
-### 1. Run IDA Pro to extract preliminary data for each functions.
+### 1. Configure path variables for IDA Pro and this repository (`config/path_variables.py`).
+
+
+### 2. Run IDA Pro to extract preliminary data for each functions.
 
 **This step takes the most time.**
 
@@ -145,7 +148,7 @@ Additionally, **you can use this script to run any idascript for numerous
 binaries in parallel.**
 
 
-### 2. Extract source file names and line numbers to build ground truth.
+### 3. Extract source file names and line numbers to build ground truth.
 This extracts source file name and line number by parsing the debugging
 information in a given binary. The binary must have been compiled with
 the `-g` option.
@@ -156,7 +159,7 @@ $ python helper/extract_lineno.py \
     --threshold 1
 ```
 
-### 3. Filter functions.
+### 4. Filter functions.
 This filters functions by checking the source file name and line number.
 This removes compiler intrinsic functions and duplicate functions spread
 over multiple binaries within the same package.
@@ -167,7 +170,7 @@ $ python helper/filter_functions.py \
     --threshold 1
 ```
 
-### (Optional) 4. Counting the number of functions.
+### (Optional) 5. Counting the number of functions.
 This counts the number of functions and generates a graph of that function
 on the same path of `input_list`. This also prints the numbers separated
 by `','`. In the below example, a pdf file containing the graph will be

diff --git a/config/config_list_openssl.txt b/config/config_list_openssl.txt
@@ -1,10 +1,10 @@
-/home/dongkwan/tiknib/config/openssl/config_openssl_all.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_arm_arm.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_arm_mips.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_arm_x86.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_mips_arm.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_mips_mips.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_mips_x86.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_x86_arm.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_x86_mips.yml
-/home/dongkwan/tiknib/config/openssl/config_openssl_x86_x86.yml
+config/openssl/config_openssl_all.yml
+config/openssl/config_openssl_arm_arm.yml
+config/openssl/config_openssl_arm_mips.yml
+config/openssl/config_openssl_arm_x86.yml
+config/openssl/config_openssl_mips_arm.yml
+config/openssl/config_openssl_mips_mips.yml
+config/openssl/config_openssl_mips_x86.yml
+config/openssl/config_openssl_x86_arm.yml
+config/openssl/config_openssl_x86_mips.yml
+config/openssl/config_openssl_x86_x86.yml
diff --git a/example/example.sh b/example/example.sh
@@ -1,24 +1,47 @@
 #!/bin/bash
+
+source config/path_variables.py
+
+SECONDS=0
 echo "Processing IDA analysis ..."
 python3 helper/do_idascript.py \
-    --idapath "/home/dongkwan/.tools/ida-6.95" \
-    --idc "tiknib/ida/fetch_funcdata.py" \
+    --idapath "${IDA_PATH}" \
+    --idc "${IDA_FETCH_FUNCDATA}" \
     --input_list "example/input_list_find.txt" \
     --log
 
-echo "Extracting function types ..."
+
+echo "Extract source file names and line numbers... ${SECONDS}s"
+python3 helper/extract_lineno.py \
+    --input_list "example/input_list_find.txt" \
+    --threshold 1
+
+
+echo "Filtering functions... ${SECONDS}s"
+python3 helper/filter_functions.py \
+    --input_list "example/input_list_find.txt" \
+    --threshold 1
+
+
+echo "Counting functions..."
+python3 helper/count_functions.py \
+    --input_list "example/input_list_find.txt" \
+    --threshold 1
+
+
+echo "Extracting function types ... ${SECONDS}s"
 python3 helper/extract_functype.py \
     --source_list "example/source_list.txt" \
     --input_list "example/input_list_find.txt" \
     --ctags_dir "data/ctags" \
     --threshold 1
 
-echo "Extracting features ..."
+echo "Extracting features ... ${SECONDS}s"
 python3 helper/extract_features.py \
     --input_list "example/input_list_find.txt" \
     --threshold 1
 
-echo "Testing features ..."
+echo "Testing features ... ${SECONDS}s"
 python3 helper/test_roc.py \
     --input_list "example/input_list_find.txt" \
     --config "config/gnu/config_gnu_normal_all.yml"
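
The script above reads `config/path_variables.py` with Bash `source`, while the helper scripts import the same file from Python. That file is not shown here, so the following is only a sketch of how one file can satisfy both interpreters. The variable names come from this commit; every value is a placeholder assumption.

```python
# Sketch of a file that is valid in both Bash and Python.
# Every line is either a "#" comment or NAME="value" with no spaces
# around "=", which both interpreters accept as a string assignment.
# All values below are placeholders (assumptions), not the real settings.
IDA_PATH="/path/to/ida"
IDA_FETCH_FUNCDATA="tiknib/ida/fetch_funcdata.py"
BINKIT_DATASET="/path/to/binkit-dataset"
```

Bash would load it with `source config/path_variables.py` and Python with `from config.path_variables import IDA_PATH`. The trade-off is that values should stay literal: a `$VAR` reference inside a value would expand in Bash but remain a plain string in Python.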
diff --git a/helper/do_idascript.py b/helper/do_idascript.py
@@ -5,6 +5,7 @@
 sys.path.insert(0, os.path.join(sys.path[0], ".."))
 from tiknib.idascript import IDAScript
 from tiknib.utils import do_multiprocess
+from config.path_variables import IDA_PATH, IDA_FETCH_FUNCDATA
 
 if __name__ == "__main__":
     op = OptionParser()
@@ -16,15 +17,15 @@
         action="store",
         type=str,
         dest="idapath",
-        default="/home/dongkwan/.tools/ida-6.95",
+        default=IDA_PATH,
         help="IDA directory path",
     )
     op.add_option(
         "--idc",
         action="store",
         type=str,
         dest="idc",
-        default="tiknib/ida/fetch_funcdata.py",
+        default=IDA_FETCH_FUNCDATA,
         help="IDA script file",
     )
     op.add_option(

diff --git a/helper/extract_lineno.py b/helper/extract_lineno.py
@@ -8,6 +8,7 @@
 from tiknib.utils import do_multiprocess
 from tiknib.utils import load_func_data, store_func_data
 from tiknib.utils import parse_source_path
+from config.path_variables import *
 
 import logging
 import coloredlogs
@@ -31,6 +32,9 @@ def extract_func_lineno(bin_name):
         func["src_path"] = line_map[func_addr][0]
         func["src_file"] = parse_source_path(func["src_path"])
         func["src_line"] = line_map[func_addr][1]
+        # Fix ase18 source paths coreutils-6.7-6.5 / coreutils-6.7-6.7
+        if 'coreutils-6.7-6.5' in func['src_path']:
+            func['src_path'] = func['src_path'].replace('6.7-6.5', '6.5')
     store_func_data(bin_name, func_data_list)
     return
 
@@ -84,8 +88,8 @@ def extract_func_lineno(bin_name):
 
         from tiknib.idascript import IDAScript
         idascript = IDAScript(
-            idapath="/home/dongkwan/.tools/ida-6.95",
-            idc="tiknib/ida/fetch_funcdata.py",
+            idapath=IDA_PATH,
+            idc=IDA_FETCH_FUNCDATA,
             force=True,
             log=True,
         )

diff --git a/helper/filter_functions.py b/helper/filter_functions.py
@@ -39,7 +39,7 @@ def filter_funcs(bin_path):
     #        print(func['name'], func['src_file'], func['src_line'])
 
     # filter functions by package name (remove functions inserted by compilers)
-    funcs = list(filter(lambda x: x['package'] in x['src_path'], funcs))
+    funcs = list(filter(lambda x: pack_name in x['src_path'], funcs))
     num_pack_funcs = len(funcs)
 
     if num_pack_funcs == 0:

diff --git a/helper/get_roc_table.py b/helper/get_roc_table.py
@@ -6,6 +6,7 @@
 import numpy as np
 
 from optparse import OptionParser
+from tabulate import tabulate
 
 sys.path.insert(0, os.path.join(sys.path[0], ".."))
 from tiknib.utils import load_cache
@@ -15,6 +16,12 @@
 rootLogger = logging.getLogger()
 rootLogger.setLevel(logging.INFO)
 
+def config_rename(config_fname):
+    # TODO: clean up the key name (config_fname to something neat).
+    config_key = os.path.basename(config_fname)
+    config_key = re.search("config_(.+).yml", config_key).groups()[0]
+    return config_key
+
 def calc_tptn_gap(tps, tns):
     return np.mean(np.abs(tps - tns), axis=0)
 
@@ -43,9 +50,7 @@ def load_results(opts):
         # select the latest one
         cache_dir = sorted(glob.glob("{}/*".format(outdir)))[-1]
 
-        # TODO: clean up the key name (config_fname to something neat).
-        config_key = os.path.basename(config_fname)
-        config_key = re.search("config_(.+).yml", config_key).groups()[0]
+        config_key = config_rename(config_fname)
         all_data[config_key] = []
         features_inter = set()
         for idx in range(10):
@@ -65,9 +70,7 @@ def load_results(opts):
 
     # Now fetch real data
     for config_idx, config_fname in enumerate(config_fnames):
-        # TODO: clean up the key name (config_fname to something neat).
-        config_key = os.path.basename(config_fname)
-        config_key = re.search("config_(.+).yml", config_key).groups()[0]
+        config_key = config_rename(config_fname)
 
         rocs = []
         aps = []
@@ -154,18 +157,22 @@ def get_results(opts):
     config_fnames, total_data, features, features_union = load_results(opts)
 
     # first rows
-    print(','.join(map(lambda x:
+    row1 = ["# Train pairs (10^6)"]
+    row1.extend(list(map(lambda x:
                        '%.2f' % (x[0] / 1000000.0)
                        if x[0] > 100000
                        else '%.2fF' % (x[0] / 10000), total_data[0])))
-    print(','.join(map(lambda x:
+    row2 = ["# Test pairs (10^6)"]
+    row2.extend(list(map(lambda x:
                        '%.2f' % (x[1] / 1000000.0)
                        if x[1] > 100000
                        else '%.2fF' % (x[1] / 10000), total_data[0])))
 
     # second rows
-    print(','.join(map(lambda x: '%.1f' % (x[0]), total_data[1])))
-    print(','.join(map(lambda x: '%.1f' % (x[1]), total_data[1])))
+    row3 = ["Train time"] + ['%.1f' % x[0] for x in total_data[1]]
+    row4 = ["Test time"] + ['%.1f' % x[1] for x in total_data[1]]
+
+    table = [row1, row2, row3, row4]
 
     # third rows
     for idx in features_union:
@@ -177,22 +184,30 @@ def get_results(opts):
                 s.append('%.2f-' % (data[feature][0]))
             else:
                 s.append('%.2f' % (data[feature][0]))
-        print(','.join(s))
+        table.append(s)
 
     # fourth row
-    print(','.join(map(lambda x: '%.1f' % (x), total_data[3])))
+    row = ["Avg # features"] + ['%.1f' % x for x in total_data[3]]
+    table.append(row)
 
     # fifth rows
-    print(','.join(map(lambda x: '%.2f' % (x[0]), total_data[4])))
-    print(','.join(map(lambda x: '%.2f' % (x[1]), total_data[4])))
+    row = ["Mean tptn_gap"] + ['%.2f' % x[0] for x in total_data[4]]
+    table.append(row)
+    row = ["Std tptn_gap"] + ['%.2f' % x[1] for x in total_data[4]]
+    table.append(row)
 
     # sixth rows
-    print(','.join(map(lambda x: '%.2f' % (x[0]), total_data[5])))
-    print(','.join(map(lambda x: '%.2f' % (x[1]), total_data[5])))
+    row = ["ROC AUC"] + ['%.2f' % x[0] for x in total_data[5]]
+    table.append(row)
+    row = ["Std. of  ROC"] + ['%.2f' % x[1] for x in total_data[5]]
+    table.append(row)
 
     # seventh rows
-    print(','.join(map(lambda x: '%.2f' % (x[0]), total_data[6])))
-    print(','.join(map(lambda x: '%.2f' % (x[1]), total_data[6])))
+    row = ["Avg Prec (AP)"] + ['%.2f' % x[0] for x in total_data[6]]
+    table.append(row)
+    row = ["Std of AP"] + ['%.2f' % x[1] for x in total_data[6]]
+    table.append(row)
+    print(tabulate(table, floatfmt=".2f"))
 
 
 if __name__ == "__main__":
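
The `config_rename` helper factored out in this file derives a short key from a config file name. A slightly stricter variant is sketched below as a suggestion, not as part of this commit: it escapes the dot, anchors the pattern to the whole base name, and raises a clear error when the name does not match instead of failing on `None`.

```python
import os
import re


def config_rename(config_fname):
    """Map 'config/openssl/config_openssl_arm_x86.yml' to 'openssl_arm_x86'."""
    base = os.path.basename(config_fname)
    # Escaped dot and full match: only 'config_<key>.yml' is accepted.
    m = re.fullmatch(r"config_(.+)\.yml", base)
    if m is None:
        raise ValueError("unexpected config file name: %s" % config_fname)
    return m.group(1)


key = config_rename("config/openssl/config_openssl_arm_x86.yml")
```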

diff --git a/helper/run_ase.sh b/helper/run_ase.sh
@@ -1,29 +1,34 @@
-#!/bin/bash
+#!/bin/bash -ue
 set -x
 
+source config/path_variables.py
+
 declare -a input_list=(
   # This one is for processing all functions.
-  "/home/dongkwan/binkit-dataset/ase_debug.txt"
+  "${BINKIT_DATASET}/ase_debug.txt"
   # Then, for experiment and counting, we utilize them separately.
-#  "/home/dongkwan/binkit-dataset/ase1_debug.txt"
-#  "/home/dongkwan/binkit-dataset/ase2_debug.txt"
-#  "/home/dongkwan/binkit-dataset/ase3_debug.txt"
-#  "/home/dongkwan/binkit-dataset/ase4_debug.txt"
+#  "${BINKIT_DATASET}/ase1_debug.txt"
+#  "${BINKIT_DATASET}/ase2_debug.txt"
+#  "${BINKIT_DATASET}/ase3_debug.txt"
+#  "${BINKIT_DATASET}/ase4_debug.txt"
 )
 
-source_list="/home/dongkwan/binkit-dataset/ase_source_list.txt"
-ctags_dir="/home/dongkwan/binkit-dataset/ase_ctags_data"
+source_list="${BINKIT_DATASET}/ase_source_list.txt"
+ctags_dir="${BINKIT_DATASET}/ase_ctags_data"
 
+SECONDS=0
+echo "Processing IDA analysis ..."
 for f in "${input_list[@]}"
 do
   echo "Processing ${f} ..."
   python helper/do_idascript.py \
-    --idapath "/home/dongkwan/.tools/ida-6.95" \
-    --idc "/home/dongkwan/tiknib/tiknib/ida/fetch_funcdata_v6.95.py" \
+    --idapath "${IDA_PATH}" \
+    --idc "${IDA_FETCH_FUNCDATA}" \
     --input_list "${f}" \
     --log
 done
 
+echo "Extract source file names and line numbers... ${SECONDS}s"
 for f in "${input_list[@]}"
 do
   echo "Processing ${f} ..."
@@ -32,6 +37,7 @@ do
     --threshold 1
 done
 
+echo "Filtering functions... ${SECONDS}s"
 for f in "${input_list[@]}"
 do
   echo "Processing ${f} ..."
@@ -40,6 +46,7 @@ do
     --threshold 1
 done
 
+echo "Counting functions... ${SECONDS}s"
 for f in "${input_list[@]}"
 do
   echo "Processing ${f} ..."
@@ -48,6 +55,7 @@ do
     --threshold 1
 done
 
+echo "Extracting function types ... ${SECONDS}s"
 for f in "${input_list[@]}"
 do
   echo "Processing ${f} ..."
@@ -58,10 +66,13 @@ do
     --threshold 1
 done
 
+echo "Extracting features ... ${SECONDS}s"
 for f in "${input_list[@]}"
 do
   echo "Processing ${f} ..."
   python helper/extract_features.py \
     --input_list "${f}" \
     --threshold 1
 done
+
+echo "DONE in ${SECONDS}s"