
Performance: query on large collections #43

Status: Open · yankovs opened this issue Sep 17, 2023 · 1 comment

yankovs (Contributor) commented Sep 17, 2023
Hey!

We see MCRIT as a great tool for malware similarity purposes and want to see whether it can be integrated into our malware pipeline, with emphasis on the API it provides. We have a database with a lot of samples; some families have tens of thousands of associated files. Simple testing with a moderate number of files shows that MCRIT indeed works great. However, once the collection grows to 100k+ files, it slows down significantly, and a query on a single file can take more than 10 minutes. Given the number of samples we already have and the daily volume we receive from different sources, it is only a matter of time before we reach 100k files, even if we start small with a curated set of samples per family.
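
For reference, a minimal sketch of how such a query can be timed against the MCRIT server's REST API; the endpoint path below is an assumption for illustration, not MCRIT's confirmed route:

```python
# Rough timing harness for a single MCRIT query (illustrative only;
# the /query/binary endpoint name is an assumption, not MCRIT's actual API).
import time

import requests

MCRIT_SERVER = "http://localhost:8000"  # assumed local deployment

def time_query(sample_path: str) -> float:
    """Submit one binary for matching and return the elapsed wall-clock time."""
    with open(sample_path, "rb") as f:
        payload = f.read()
    start = time.time()
    # Hypothetical endpoint: submits a binary and waits for the match report.
    response = requests.post(f"{MCRIT_SERVER}/query/binary", data=payload)
    response.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    print(f"query took {time_query('sample.bin'):.1f}s")
```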

Of course, this isn't a trivial problem, and it requires further inspection of each step in the process. Still, I think it raises some questions worth discussing:

  • Why was MongoDB chosen for the project? Is it the right fit if we keep scale in mind?
  • Is the database design optimal, or is there room for improvement in the indexes chosen and the queries performed?
  • How does MCRIT deal with files that contain many functions (we have ones with over 80k! 😅)? Is there any other way to deal with them?
  • Should MCRIT support managed solutions like Amazon's DocumentDB? Those kinds of solutions handle things like sharding the database for horizontal scaling and are easy to deploy. However, DocumentDB in particular isn't quite 100% MongoDB compatible.
  • Where can bottlenecks occur during a query? (One way to probe this is sketched below.)
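
As a concrete starting point for the index and bottleneck questions, here is a minimal pymongo sketch; the collection and field names are placeholders, since we don't know MCRIT's actual schema:

```python
# Inspect query plans and indexes on a MongoDB collection with pymongo.
# The collection/field names ("samples", "function_count") are hypothetical
# placeholders, not MCRIT's actual schema.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mcrit"]  # assumed database name

# List the indexes currently defined on a collection.
for index in db["samples"].list_indexes():
    print(index)

# Ask MongoDB how it would execute a query: a COLLSCAN stage in the
# winning plan means a full collection scan, i.e. a likely bottleneck.
plan = db["samples"].find({"function_count": {"$gt": 80_000}}).explain()
print(plan["queryPlanner"]["winningPlan"])

# If a field is queried often, an index turns the scan into an IXSCAN.
db["samples"].create_index([("function_count", ASCENDING)])
```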

I hope this doesn't come across as a complaint, because we think MCRIT is great and would really love to use it in production :)

danielplohmann (Owner) commented

Hey!
First off, I'm absolutely happy that MCRIT seems to provide value and promises to be useful.
You are also the first organization pushing it to that scale (at least that I'm aware of), and I'm glad that you have already contributed and are engaging here on improvements.

"However, once the collection grows to 100k+ files, it slows down significantly, and a query on a single file can take more than 10 minutes."

By "query", you mean matching an existing sample against the database or having a sample disassembled and matched (query in MCRITweb sense) or a simpler operation like just pulling out the information associated with it?

For the questions you posted, I think it would be worthwhile to schedule a call and talk about this synchronously, as it will quickly become too complex for a threaded discussion. :) It would also help me better understand how you are using the system, which would certainly influence which parts should be addressed first and how.

Feel free to contact me here: [email protected]
