
Mongo: BSON document might be bigger than 16mb #42

Open
yankovs opened this issue Sep 13, 2023 · 1 comment
@yankovs
Contributor

yankovs commented Sep 13, 2023

MongoDbStorage's insert_many method should probably check the total size of the batch, or whether any single document is too big by itself. In some (pretty rare) cases, the size can exceed MongoDB's 16 MB BSON document limit and result in an exception:

mcrit-server                  | 2023-09-12 17:57:10 [FALCON] [ERROR] POST /samples => Traceback (most recent call last):
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 229, in _dbInsertMany
mcrit-server                  |     insert_result = self._database[collection].insert_many([self._toBinary(document) for document in data])
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                  |     return func(self, *args, **kwargs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 757, in insert_many
mcrit-server                  |     blk.execute(write_concern, session=session)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 580, in execute
mcrit-server                  |     return self.execute_command(generator, write_concern, session)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 447, in execute_command
mcrit-server                  |     client._retry_with_session(self.is_retryable, retryable_bulk, s, self)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1413, in _retry_with_session
mcrit-server                  |     return self._retry_internal(retryable, func, session, bulk)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                  |     return func(self, *args, **kwargs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1460, in _retry_internal
mcrit-server                  |     return func(session, conn, retryable)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 435, in retryable_bulk
mcrit-server                  |     self._execute_command(
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 381, in _execute_command
mcrit-server                  |     result, to_send = bwc.execute(cmd, ops, client)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 966, in execute
mcrit-server                  |     request_id, msg, to_send = self.__batch_command(cmd, docs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 956, in __batch_command
mcrit-server                  |     request_id, msg, to_send = _do_batched_op_msg(
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 1353, in _do_batched_op_msg
mcrit-server                  |     return _batched_op_msg(operation, command, docs, ack, opts, ctx)
mcrit-server                  | pymongo.errors.DocumentTooLarge: BSON document too large (60427090 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
mcrit-server                  |
mcrit-server                  | During handling of the above exception, another exception occurred:
mcrit-server                  |
mcrit-server                  | Traceback (most recent call last):
mcrit-server                  |   File "falcon/app.py", line 365, in falcon.app.App.__call__
mcrit-server                  |   File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
mcrit-server                  |     func(*args, **kwargs)
mcrit-server                  |   File "/opt/mcrit/mcrit/server/SampleResource.py", line 126, in on_post_collection
mcrit-server                  |     summary = self.index.addReportJson(req.media, username=username)
mcrit-server                  |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 280, in addReportJson
mcrit-server                  |     return self.addReport(report, calculate_hashes=calculate_hashes, calculate_matches=calculate_matches, username=username)
mcrit-server                  |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 265, in addReport
mcrit-server                  |     sample_entry = self._storage.addSmdaReport(smda_report)
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 622, in addSmdaReport
mcrit-server                  |     self._dbInsertMany("functions", function_dicts)
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 238, in _dbInsertMany
mcrit-server                  |     raise ValueError("Database insert failed.")
mcrit-server                  | ValueError: Database insert failed.

Unfortunately, I didn't log which samples caused this, so I don't really have more context to provide 😭. Overall this is pretty uncommon: it happened 4 times across over 120k files.

@danielplohmann
Owner

Oh wow. 😆
Your error message

pymongo.errors.DocumentTooLarge: BSON document too large (60427090 bytes)

suggests that there was a single function whose BSON representation was more than 60 MB in size.
If I had to guess, and you want to find any of the 4 samples for reproduction, I'd search for the largest binaries or .text sections among what you processed. :)

Now, the generic solution for this would go beyond just checking the size.
To avoid losing any data, such oversized objects should instead be stored in GridFS. At that point, probably all control flow graphs should be stored in GridFS, and possibly compressed, which should save roughly 5-10x of the space used, since they are only rarely accessed.
This would also require providing code for migrating the database layout or reindexing the samples.
I'll keep the issue open as a reminder that this problem exists, despite it being a very rare edge case, as your observations suggest.
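The compress-then-GridFS idea could be sketched roughly as follows. Assumptions here: zlib as the codec, JSON as the serialization of the CFG dicts, and hypothetical helper names `pack_cfg`/`unpack_cfg` that do not exist in mcrit. Actual savings depend on the data; the 5-10x figure above is the thread's estimate, not a measurement of this code.

```python
import json
import zlib

def pack_cfg(cfg):
    """Serialize a control flow graph dict and compress it for GridFS.

    zlib level 6 trades speed for ratio; CFG JSON tends to be highly
    repetitive (mnemonics, addresses), so it usually compresses well.
    """
    return zlib.compress(json.dumps(cfg).encode("utf-8"), 6)

def unpack_cfg(blob):
    """Inverse of pack_cfg: decompress and deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# With a live database handle `db` (assumption), storage would then use
# PyMongo's gridfs package, e.g.:
#   import gridfs
#   fs = gridfs.GridFS(db)
#   file_id = fs.put(pack_cfg(cfg), filename="function_cfg")
#   cfg = unpack_cfg(fs.get(file_id).read())
```

Since GridFS chunks files transparently, this also removes the 16 MB per-document ceiling for the stored blobs; the migration/reindexing tooling mentioned above would still be needed to move existing collections over.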
