
Performance: query on large collections #43

Status: Open · yankovs opened this issue Sep 17, 2023 · 1 comment

yankovs (Contributor) commented Sep 17, 2023
Hey!

We see MCRIT as a great tool for malware similarity purposes and want to see whether it can be integrated into our malware pipeline, with emphasis on the API it provides. We have a database with a lot of samples; some families have tens of thousands of associated files. Simple testing with a moderate number of files shows that MCRIT indeed works great. However, once the collection grows to 100k+ files, it slows down significantly, and a query on a single file can take more than 10 minutes. Given the number of samples we already have and the daily volume we receive from different sources, it is only a matter of time before we reach 100k files, even if we start small with a curated set of samples per family.
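
For reference, a minimal sketch of how such a query can be timed against the MCRIT server's REST API; the endpoint path below is an assumption for illustration, not MCRIT's confirmed route:

```python
# Rough timing harness for a single MCRIT query (illustrative only;
# the /query/binary endpoint name is an assumption, not MCRIT's actual API).
import time

import requests

MCRIT_SERVER = "http://localhost:8000"  # assumed local deployment

def time_query(sample_path: str) -> float:
    """Submit one binary for matching and return the elapsed wall-clock time."""
    with open(sample_path, "rb") as f:
        payload = f.read()
    start = time.time()
    # Hypothetical endpoint: submits a binary and waits for the match report.
    response = requests.post(f"{MCRIT_SERVER}/query/binary", data=payload)
    response.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    print(f"query took {time_query('sample.bin'):.1f}s")
```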

Of course, this isn't a trivial problem, and it requires further inspection of each step in the process. Still, I think it raises some questions worth discussing:

  • Why was MongoDB chosen for the project? Is it the right fit if we keep scale in mind?
  • Is the database design optimal, or is there room for improvement in the indexes chosen and the queries performed?
  • How does MCRIT deal with files that contain many functions (we have ones with over 80k! 😅)? Is there any other way to deal with them?
  • Should MCRIT support managed solutions like Amazon's DocumentDB? Those kinds of solutions handle things like sharding the database for horizontal scaling and are easy to deploy. However, DocumentDB in particular isn't quite 100% MongoDB compatible.
  • Where can bottlenecks occur during a query? (One way to probe this is sketched below.)
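
As a concrete starting point for the index and bottleneck questions, here is a minimal pymongo sketch; the collection and field names are placeholders, since we don't know MCRIT's actual schema:

```python
# Inspect query plans and indexes on a MongoDB collection with pymongo.
# The collection/field names ("samples", "function_count") are hypothetical
# placeholders, not MCRIT's actual schema.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mcrit"]  # assumed database name

# List the indexes currently defined on a collection.
for index in db["samples"].list_indexes():
    print(index)

# Ask MongoDB how it would execute a query: a COLLSCAN stage in the
# winning plan means a full collection scan, i.e. a likely bottleneck.
plan = db["samples"].find({"function_count": {"$gt": 80_000}}).explain()
print(plan["queryPlanner"]["winningPlan"])

# If a field is queried often, an index turns the scan into an IXSCAN.
db["samples"].create_index([("function_count", ASCENDING)])
```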

I hope this doesn't come across as a complaint, because we think MCRIT is great and would really love to use it in production :)

danielplohmann (Owner) commented

Hey!
First off, I'm absolutely happy that MCRIT seems to provide value and promises to be useful.
You are also the first organization pushing it to that scale (at least that I'm aware of), and I'm glad that you have already contributed and are engaging here on improvements.

"However, once the collection grows to 100k+ files, it slows down significantly, and a query on a single file can take more than 10 minutes."

By "query", you mean matching an existing sample against the database or having a sample disassembled and matched (query in MCRITweb sense) or a simpler operation like just pulling out the information associated with it?

For the questions you posted, I think it would be worthwhile to schedule a call and talk about this synchronously, as it will quickly become too complex for a threaded discussion. :) It would also help me better understand how you are using the system, which would certainly influence which parts should be addressed first and how.

Feel free to contact me here: [email protected]
