Added description of output for generic token and alias detection to README.
malicialab · Jul 18, 2016 · d234cc4
1 parent 4421a07
commit d234cc4
Showing 2 changed files with 35 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -287,6 +287,19 @@ If the switch is ommitted, the default threshold of 8 is used.
 
 For more details you can refer to our RAID 2016 paper.
 
+**Output**
+
+The above command outputs two files: 
+*malheurReference.gen* and *malheurReference_lb.gen*. 
+Each of them has 2 columns: token and number of families where the token 
+was observed.
+File *malheurReference.gen* is the final output with the detected 
+generic tokens for which the number of families is above 
+the given threshold. 
+The file *malheurReference_lb.gen* has this information for all tokens.
+Thus, *malheurReference.gen* is a subset of *malheurReference_lb.gen*. 
+
+
 ## Preparation: Alias Detection
 
 Different vendors may assign different names (i.e., aliases) for the same
@@ -303,7 +316,7 @@ provided default file.
 But, if you want to test it you can do:
 
 ```
- $./avclass_alias_detect.py -lb data/malheurReference_lb.json -nalias 100 -talias 0.98 > malheurReference.aliases
+$./avclass_alias_detect.py -lb data/malheurReference_lb.json -nalias 100 -talias 0.98 > malheurReference.aliases
 ```
 
 The -nalias threshold provides the minimum number of samples two tokens 
@@ -316,6 +329,25 @@ If the switch is not provided the default is 0.94 (94%).
 
 For more details you can refer to our RAID 2016 paper.
 
+**Output**
+
+The above command outputs two files:
+*malheurReference.aliases* and *malheurReference_lb.alias*.
+Each of them has 6 columns: 
+1. t1: token that is an alias
+2. t2: family for which t1 is an alias
+3. |t1|: number of input samples where t1 was observed
+4. |t2|: number of input samples where t2 was observed
+5. |t1^t2|: number of input samples where both t1 and t2 were observed
+6. |t1^t2|/|t1|: ratio of input samples where both t1 and t2 
+were observed over the number of input samples where t1 was observed.
+
+File *malheurReference.aliases* is the final output with the 
+detected aliases that satisfy the -nalias and -talias thresholds.
+The file *malheurReference_lb.alias* has this information for all tokens.
+Thus, *malheurReference.aliases* is a subset 
+of *malheurReference_lb.alias*.
+
 
 ## Support
 
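The README hunk above describes the `.gen` files as two tab-separated columns (token, number of families), with the final file keeping only tokens whose family count is above the threshold. A minimal sketch of that filtering step follows, written in Python 3 rather than taken from the repository. The function name `read_generic_tokens` and the sample rows are hypothetical; the strict `>` comparison follows the README wording ("above the given threshold") and is an assumption about the actual script.

```python
import csv
import io


def read_generic_tokens(lines, threshold):
    """Return {token: family_count} for rows whose count exceeds threshold.

    Hypothetical helper for illustration; not part of the repository.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the "Token\t#Families" header line
    selected = {}
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed rows
        token, count = row[0], int(row[1])
        # Assumption: "above the threshold" means strictly greater than.
        if count > threshold:
            selected[token] = count
    return selected


# Made-up rows in the layout of malheurReference_lb.gen.
sample = io.StringIO("Token\t#Families\nagent\t12\nzbot\t1\ngeneric\t9\n")
generic = read_generic_tokens(sample, threshold=8)
print(generic)
```

Applied to the full `_lb.gen` listing, a filter like this would yield the subset relationship the README states between the two files.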

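The six-column alias output documented in the README hunk can be checked against the two thresholds in the same way. The sketch below is an illustration in Python 3, not the logic of `avclass_alias_detect.py`: the names `AliasRow` and `filter_aliases` and the sample rows are hypothetical. The diff does not show which count `-nalias` applies to or whether the comparisons are strict, so the sketch assumes `|t1^t2| >= nalias` and `|t1^t2|/|t1| >= talias`.

```python
from collections import namedtuple

# Hypothetical record type mirroring the six documented columns.
AliasRow = namedtuple("AliasRow", "t1 t2 n1 n2 n_both ratio")


def filter_aliases(lines, nalias, talias):
    """Keep rows that meet both thresholds (comparisons are assumptions)."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and the "# t1\tt2\t..." header line
        fields = line.split("\t")
        t1, t2 = fields[0], fields[1]
        n1, n2, n_both = int(fields[2]), int(fields[3]), int(fields[4])
        if n1 == 0:
            continue  # avoid division by zero on malformed rows
        # Recompute |t1^t2|/|t1| from the counts instead of parsing column 6.
        ratio = n_both / n1
        if n_both >= nalias and ratio >= talias:
            kept.append(AliasRow(t1, t2, n1, n2, n_both, ratio))
    return kept


# Made-up rows in the documented column order.
rows = [
    "# t1\tt2\t|t1|\t|t2|\t|t1^t2|\t|t1^t2|/|t1|",
    "foo\tbar\t200\t500\t198\t0.99",
    "baz\tqux\t50\t300\t50\t1.00",
    "abc\tdef\t400\t900\t300\t0.75",
]
kept = filter_aliases(rows, nalias=100, talias=0.98)
print([(r.t1, r.t2) for r in kept])
```

With these sample rows only the first pair passes: the second co-occurs in too few samples and the third has too low a ratio.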
diff --git a/avclass_labeler.py b/avclass_labeler.py
@@ -196,7 +196,7 @@ def main(args):
                             '.gen'
         gen_fd = open(gen_filename, 'w+')
         # Output header line
-        gen_fd.write("Token\t# Families\n")
+        gen_fd.write("Token\t#Families\n")
         sorted_pairs = sorted(token_family_map.iteritems(), 
                               key=lambda x: len(x[1]) if x[1] else 0, 
                               reverse=True)
@@ -216,7 +216,7 @@ def main(args):
         sorted_pairs = sorted(
                 pair_count_map.items(), key=itemgetter(1))
         # Output header line
-        alias_fd.write("# t1\tt2\t|t1|\t|t2|\t|t1^t2|\t|t1^t2|/|t_1|\n")
+        alias_fd.write("# t1\tt2\t|t1|\t|t2|\t|t1^t2|\t|t1^t2|/|t1|\n")
         # Compute token pair statistic and output to alias file
         for (t1,t2),c in sorted_pairs:
             n1 = token_count_map[t1]