Added description of output for generic token and alias detection to README.
malicialab committed Jul 18, 2016
1 parent 4421a07 commit d234cc4
Showing 2 changed files with 35 additions and 3 deletions.
34 changes: 33 additions & 1 deletion README.md
@@ -287,6 +287,19 @@ If the switch is omitted, the default threshold of 8 is used.

For more details you can refer to our RAID 2016 paper.

**Output**

The above command outputs two files:
*malheurReference.gen* and *malheurReference_lb.gen*.
Each file has two columns: the token and the number of families in which
the token was observed.
The file *malheurReference.gen* is the final output: it lists the detected
generic tokens, i.e., those whose number of families is above
the given threshold.
The file *malheurReference_lb.gen* has this information for all tokens.
Thus, *malheurReference.gen* is a subset of *malheurReference_lb.gen*.
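
The relationship between the two files can be illustrated with a minimal sketch. It assumes the rows are tab-separated and preceded by a single header line (as the header written by `avclass_labeler.py` in this commit suggests), and that "above the threshold" means strictly greater. The helper names and the sample tokens and counts are hypothetical, made up for illustration only.

```python
import csv
import io


def read_token_counts(text):
    """Parse 'token<TAB>count' rows into a dict, skipping the header line.

    Assumption: rows are tab-separated and the first line is a header.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the header line
    return {row[0]: int(row[1]) for row in reader if len(row) >= 2}


def generic_tokens(counts, threshold=8):
    """Keep tokens seen in more than `threshold` families.

    Assumption: 'above the threshold' is a strict comparison.
    """
    return {tok: n for tok, n in counts.items() if n > threshold}


# Made-up content in the shape of a *_lb.gen file (all tokens).
sample_lb = "Token\t#Families\ntrojan\t42\nagent\t9\nzbot\t1\n"

counts = read_token_counts(sample_lb)
generic = generic_tokens(counts)  # the subset a .gen file would hold
```

Under these assumptions, every token in `generic` also appears in `counts`, mirroring the subset relation described above.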


## Preparation: Alias Detection

Different vendors may assign different names (i.e., aliases) to the same
@@ -303,7 +316,7 @@ provided default file.
However, if you want to test it, you can run:

```
$./avclass_alias_detect.py -lb data/malheurReference_lb.json -nalias 100 -talias 0.98 > malheurReference.aliases
$./avclass_alias_detect.py -lb data/malheurReference_lb.json -nalias 100 -talias 0.98 > malheurReference.aliases
```

The -nalias threshold provides the minimum number of samples two tokens
@@ -316,6 +329,25 @@ If the switch is not provided, the default is 0.94 (94%).

For more details you can refer to our RAID 2016 paper.

**Output**

The above command outputs two files:
*malheurReference.aliases* and *malheurReference_lb.alias*.
Each file has six columns:
1. t1: token that is an alias
2. t2: family for which t1 is an alias
3. |t1|: number of input samples where t1 was observed
4. |t2|: number of input samples where t2 was observed
5. |t1^t2|: number of input samples where both t1 and t2 were observed
6. |t1^t2|/|t1|: ratio of the number of input samples where both t1 and t2
were observed to the number of input samples where t1 was observed.

The file *malheurReference.aliases* is the final output with the
detected aliases that satisfy the -nalias and -talias thresholds.
The file *malheurReference_lb.alias* has this information for all tokens.
Thus, *malheurReference.aliases* is a subset
of *malheurReference_lb.alias*.
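
As a rough illustration of the six columns, the sketch below parses one row and checks that the last column equals |t1^t2|/|t1|. It assumes tab-separated rows and a header line starting with `#` (matching the header string written by `avclass_labeler.py` in this commit). The way the thresholds are applied in `passes_thresholds` is an assumption, not taken from the tool: -nalias is compared against |t1^t2| and -talias against the ratio, both inclusively. The names `AliasRow`, `parse_alias_line`, and `passes_thresholds`, and the sample row, are hypothetical.

```python
from collections import namedtuple

# One row of the alias output; field names are illustrative only.
AliasRow = namedtuple("AliasRow", "t1 t2 n_t1 n_t2 n_both ratio")


def parse_alias_line(line):
    """Parse one tab-separated row; return None for the '#' header line."""
    if line.startswith("#"):
        return None
    t1, t2, n1, n2, both, ratio = line.rstrip("\n").split("\t")
    return AliasRow(t1, t2, int(n1), int(n2), int(both), float(ratio))


def passes_thresholds(row, nalias, talias):
    """Assumed filter: enough shared samples and a high enough ratio."""
    return row.n_both >= nalias and row.ratio >= talias


header = "# t1\tt2\t|t1|\t|t2|\t|t1^t2|\t|t1^t2|/|t1|\n"
line = "zeus\tzbot\t200\t1000\t198\t0.99\n"  # made-up sample row

row = parse_alias_line(line)
ratio_ok = abs(row.ratio - row.n_both / row.n_t1) < 1e-9
kept = passes_thresholds(row, nalias=100, talias=0.98)
```

With the thresholds from the example command (-nalias 100, -talias 0.98), this made-up row would be kept under the assumed filter.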


## Support

4 changes: 2 additions & 2 deletions avclass_labeler.py
@@ -196,7 +196,7 @@ def main(args):
'.gen'
gen_fd = open(gen_filename, 'w+')
# Output header line
-gen_fd.write("Token\t# Families\n")
+gen_fd.write("Token\t#Families\n")
sorted_pairs = sorted(token_family_map.iteritems(),
key=lambda x: len(x[1]) if x[1] else 0,
reverse=True)
@@ -216,7 +216,7 @@ def main(args):
sorted_pairs = sorted(
pair_count_map.items(), key=itemgetter(1))
# Output header line
-alias_fd.write("# t1\tt2\t|t1|\t|t2|\t|t1^t2|\t|t1^t2|/|t_1|\n")
+alias_fd.write("# t1\tt2\t|t1|\t|t2|\t|t1^t2|\t|t1^t2|/|t1|\n")
# Compute token pair statistic and output to alias file
for (t1,t2),c in sorted_pairs:
n1 = token_count_map[t1]