Improve deduplication for VarCorpus dataset splitting #1
Thank you for your interest in our work! The example in the issue seems to be from Ghidra-O2, so I looked at a few cases and followed the suggested approach of normalizing the `FUN_*` labels. For example:
```diff
 void FUN_00112210(void)
 {
   long @@var_0@@uVar1@@;
-  @@var_0@@uVar1@@ = FUN_00111d90();
+  @@var_0@@uVar1@@ = FUN_001124c0();
   if (@@var_0@@uVar1@@ != 0)
   {
     return;
   }
-  FUN_00111df0();
+  FUN_0010cab0();
   return;
 }

 void FUN_00109590(void *@@var_0@@list@@)
 {
-  FUN_00109530();
+  FUN_001091c0();
   free(@@var_0@@list@@);
   return;
 }
```

In the above examples, the function in each pair is from a different project, has different call sites, and their source code varies.
Thanks for your thorough response! The examples you gave are interesting, and I am generally curious how often such cases occur. However, in your dataset we can also use the (debug) function name to rule out such false matches. Counting only cases where normalized function bodies and function names are identical still shows a similar level of duplication (~19% of functions in test are also in train). Of course, this still may not catch all duplicate functions (and definitely will not match slight variations of the same function), but the functions it catches are almost certainly source-code duplicates.

If not de-duplication, my hope is that future work at least reports accuracy on memorizable and non-memorizable functions separately, or at least compares to a memorizing baseline. The simplest example of this would be a baseline which matches functions in the training set (e.g. using FLIRT, binja sigkit, or just perfect string matches after thorough normalization) and predicts the same names. Clearly, VarBERT is doing more than memorization, but it would be nice if future work quantified this, since there are simpler tools for dealing with libraries of known functions.
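To make the memorizing-baseline idea concrete, here is a minimal sketch (Python; the normalization regex and the `(decompiled_text, debug_name)` pair format are my assumptions about how the corpus would be loaded, and FLIRT or sigkit would be more robust in practice):

```python
import re
from collections import Counter

# Assumed normalization: strip Ghidra's offset-bearing labels and collapse
# whitespace, so byte-identical logic compares equal across binaries.
_LABEL = re.compile(r'\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b')

def normalize(text: str) -> str:
    return re.sub(r'\s+', ' ', _LABEL.sub(lambda m: m.group(1), text)).strip()

def build_memory(train_pairs):
    """train_pairs: iterable of (decompiled_text, debug_name) tuples; the
    pair format is an assumption about how the corpus is loaded."""
    counts = {}
    for text, name in train_pairs:
        counts.setdefault(normalize(text), Counter())[name] += 1
    # Memorize the most frequently observed name for each normalized body.
    return {body: c.most_common(1)[0][0] for body, c in counts.items()}

def predict(memory, decompiled_text):
    """Return a memorized name, or None for non-memorizable functions."""
    return memory.get(normalize(decompiled_text))
```

Reporting model accuracy separately on the functions where `predict` returns a name versus those where it returns `None` would give the memorizable/non-memorizable split.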
Thanks for the really cool work and the open dataset!

I noticed the deduplication (when splitting into test/train) is missing some near/exact duplicates. I understand that near-duplicate detection is tricky, but I think some more aggressive (text-based) normalization could help.

For Ghidra, `FUN_*`, `LAB_*`, `DAT_*`, and `PTR_*` (among others) all leak offsets into the decompilation text. Similar labels exist in IDA.

For example, take these two cases from the Ghidra_O3 dataset (per-binary split):
- `8b24d40a9bcf3a361a784f0d32abb3f0_(0011c500)` (test)
- `2334db8f8877d54ca418f044f1954d86_(0011c4e0)` (train)

They are considered distinct by the current deduplication algorithm. However, diffing them shows that the only difference is the offsets of callees within their respective binaries; these functions are derived from identical source code.
I understand that we can't say for certain that the calls are the same from the text above (we would need to check the callee code, I guess), but I would argue that even if you normalize the call sites (i.e. replacing `FUN_*` with `FUN`), the false-positive rate will be very low. Most pairs of nontrivial functions would be expected to differ by far more than just the names of the called functions.

~20% of the test set is duplicated in train after normalizing for `FUN_*`, `LAB_*`, `DAT_*`, and `PTR_*`.
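For concreteness, a minimal sketch of this kind of normalization check (the regex and the data loading here are illustrative assumptions, not the exact script used):

```python
import re

# Offset-bearing Ghidra labels named above; "(among others)" means further
# autogenerated prefixes exist, so treating these four as the full set is a
# simplification.
_LABEL = re.compile(r'\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b')

def normalize(decompiled: str) -> str:
    """FUN_0011c500 -> FUN, DAT_00112233 -> DAT, etc., then collapse whitespace."""
    text = _LABEL.sub(lambda m: m.group(1), decompiled)
    return re.sub(r'\s+', ' ', text).strip()

def test_duplication_rate(train_texts, test_texts) -> float:
    """Fraction of test functions whose normalized body also occurs in train."""
    seen = {normalize(t) for t in train_texts}
    return sum(normalize(t) in seen for t in test_texts) / len(test_texts)
```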
I would argue that deduplication should ignore all function names, not just the `FUN_*` names left by Ghidra. There are cases where identical library functions may use alternative function names (e.g. see `--zprefix` for zlib), and there are probably cases in the dataset where identical functions have names in some binaries and not others.
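For the stronger variant that ignores all callee names (not just Ghidra's autogenerated ones), a crude text-level approximation is to rewrite every identifier in call position. A sketch, assuming plain decompiled C text (the `CALL` token and the keyword list are my choices for illustration):

```python
import re

# Rewrite any identifier that directly precedes '(' -- i.e. every call site,
# whether FUN_0010cab0, a renamed zlib symbol, or free -- to a fixed token.
# C keywords also precede '(' in decompiled output, so exempt them; type
# names in function-pointer casts would need similar exemptions.
_KEYWORDS = {'if', 'while', 'for', 'switch', 'return', 'sizeof', 'do'}
_CALLEE = re.compile(r'\b([A-Za-z_]\w*)\s*\(')

def normalize_callees(decompiled: str) -> str:
    def repl(m):
        return m.group(0) if m.group(1) in _KEYWORDS else 'CALL('
    return _CALLEE.sub(repl, decompiled)
```

Note this also rewrites the function's own name in its signature, which is consistent with ignoring all function names.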