Writing Your Own DocManager

Luke Lovett edited this page Feb 10, 2015 · 16 revisions

This page details how to write your own DocManager class, which allows you to replicate CRUD operations from MongoDB to another indexing system.

Step 1: Write the Code

The first step to creating your own DocManager is to create a new Python module my_doc_manager.py and define a class called DocManager. All DocManagers inherit from DocManagerBase:

# filename: my_doc_manager.py

from mongo_connector.doc_managers.doc_manager_base import DocManagerBase

class DocManager(DocManagerBase):
    """DocManager that connects to MyIndexingSystem"""

    # methods will go here

There are a number of methods that need to be defined within this class. They are:

  1. __init__(self, url, **kwargs)

    This is the constructor and should be used to do any setup your client needs in order to communicate with the target system. The only required parameter is url, which is the endpoint that the DocManager should target. This is given by the -t option on the command line, or in the targetURL field for this DocManager in Mongo Connector's config file.

    There are also a number of other standard parameters that your DocManager can take advantage of:

    • auto_commit_interval is the interval, in seconds, at which the DocManager should attempt to commit any outstanding changes to the indexing system. A value of None means that changes should not be committed automatically.
    • unique_key gives the unique key the DocManager should use in the target system for documents. The default is _id, the same unique key used by MongoDB.
    • chunk_size specifies the maximum number of documents to be inserted in a batch.

    Anything the user defines in the args section for this DocManager is passed to the constructor as keyword arguments. Document any additional arguments your DocManager accepts so that users can provide values for them in the config file.

  2. stop(self)

    This method is called to stop the DocManager. If you started any threads to take care of auto commit, for example, this is the place to join() them.

  3. upsert(self, doc, namespace, timestamp)

    This should upsert (i.e., insert or overwrite) the document provided in the doc parameter. doc is the full document to be upserted. namespace and timestamp are the namespace (i.e., database + '.' + collection) and Timestamp of the oplog record that caused this event.

  4. bulk_upsert(self, docs, namespace, timestamp)

    This is used to insert documents in bulk into the target system during a collection dump. This method is optional; if it is not provided, mongo-connector falls back to calling upsert on each document serially. However, inserting documents in bulk is much more efficient. namespace and timestamp are the same as above.

  5. remove(self, document_id, namespace, timestamp)

    This should remove the document with the id document_id from the external system. namespace and timestamp are the same as above.

  6. search(self, start_ts, end_ts)

    This should provide an iterator over all documents that were last modified between start_ts and end_ts. Your DocManager implementation needs to take care of how to store this information. This method is called when a MongoDB rollback occurs.

  7. commit(self)

    This method should commit any outstanding changes to the target system.

  8. get_last_doc(self)

    This should return the document most recently modified in the target system.

  9. handle_command(self, doc, namespace, timestamp)

    This optional method processes an arbitrary database command in the oplog. doc is the command document itself. namespace is the original namespace the command was performed on, without any namespace mappings applied. This is contrary to how the other methods work! This is so that your DocManager has maximum flexibility in how to deal with the command.
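
As a rough illustration of how the methods above fit together, here is a minimal sketch that keeps documents in an in-memory dictionary instead of a real indexing system. Several details are assumptions made for the example rather than part of Mongo Connector: the chunk_size default of 1000, timestamps being mutually comparable values such as integers, start_ts and end_ts being treated as inclusive bounds, documents being keyed by their _id field, and the fallback to object when mongo-connector is not installed.

```python
try:
    from mongo_connector.doc_managers.doc_manager_base import DocManagerBase
except ImportError:
    # Assumption for this sketch only: lets it run without mongo-connector.
    DocManagerBase = object


class DocManager(DocManagerBase):
    """In-memory sketch; a real DocManager would talk to the system at url."""

    def __init__(self, url, auto_commit_interval=None, unique_key="_id",
                 chunk_size=1000, **kwargs):
        self.url = url
        self.auto_commit_interval = auto_commit_interval
        self.unique_key = unique_key
        self.chunk_size = chunk_size  # 1000 is a placeholder default
        self.extra_args = kwargs      # values from the config file's args section
        # Maps document id -> (doc, namespace, timestamp).
        self._entries = {}

    def stop(self):
        # No threads were started, so there is nothing to join().
        pass

    def upsert(self, doc, namespace, timestamp):
        # Insert the document, or overwrite an existing one with the same _id.
        self._entries[doc["_id"]] = (doc, namespace, timestamp)

    def bulk_upsert(self, docs, namespace, timestamp):
        # A real backend would send batches here; this sketch just loops.
        for doc in docs:
            self.upsert(doc, namespace, timestamp)

    def remove(self, document_id, namespace, timestamp):
        # Silently ignore ids that are not present.
        self._entries.pop(document_id, None)

    def search(self, start_ts, end_ts):
        # Yield documents last modified within [start_ts, end_ts].
        for doc, _namespace, timestamp in list(self._entries.values()):
            if start_ts <= timestamp <= end_ts:
                yield doc

    def commit(self):
        # Writes to the dictionary take effect immediately.
        pass

    def get_last_doc(self):
        # Return the most recently modified document, or None if empty.
        if not self._entries:
            return None
        doc, _namespace, _timestamp = max(
            self._entries.values(), key=lambda entry: entry[2])
        return doc

    def handle_command(self, doc, namespace, timestamp):
        # Optional; this sketch ignores database commands.
        pass
```

A real implementation would replace the dictionary with calls to a client for the target system, and would record each document's last-modified timestamp in that system so that search and get_last_doc can be answered after a restart.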

It might be helpful to see an example implementation of a DocManager. For this, we recommend taking a look at doc_manager_simulator.py, which is used in the test suite to mock replicating CRUD operations.

Step 2: Distribute Your DocManager

You don't need to make a pull request to this project in order for users to be able to use your DocManager. You can distribute your DocManager on PyPI separately as its own package, and users can install it alongside Mongo Connector, referencing it on the command line like any of the built-in DocManagers. All you need to do is:

  1. Create the project directory structure to mirror that of the built-in DocManagers like this:

     project/mongo_connector/__init__.py
     project/mongo_connector/doc_managers/__init__.py
     project/mongo_connector/doc_managers/your_custom_doc_manager.py
    
  2. Put the following in your mongo_connector/__init__.py and mongo_connector/doc_managers/__init__.py:

     from pkgutil import extend_path
     __path__ = extend_path(__path__, __name__)
    
  3. If you want to distribute this on PyPI, you'll probably want to add a README.rst and a setup.py to install your mongo_connector package.
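
To see why the extend_path lines in step 2 matter, the following sketch builds two separate directory trees that each contain the same package and checks that modules from both can be imported. It uses a hypothetical package name, demo_connector, as a stand-in for mongo_connector so that it cannot clash with a real installation; the module names and the NAME constants are likewise made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# The same two lines that step 2 puts in each __init__.py.
INIT_SOURCE = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_distribution(root, module_name, body):
    """Create root/demo_connector/doc_managers/<module_name>.py."""
    package_dir = os.path.join(root, "demo_connector", "doc_managers")
    os.makedirs(package_dir)
    for directory in (os.path.dirname(package_dir), package_dir):
        with open(os.path.join(directory, "__init__.py"), "w") as handle:
            handle.write(INIT_SOURCE)
    with open(os.path.join(package_dir, module_name + ".py"), "w") as handle:
        handle.write(body)


def load_from_two_distributions():
    """Import one module from each tree through the shared package name."""
    with tempfile.TemporaryDirectory() as core_root, \
            tempfile.TemporaryDirectory() as plugin_root:
        make_distribution(core_root, "builtin_doc_manager",
                          "NAME = 'builtin'\n")
        make_distribution(plugin_root, "your_custom_doc_manager",
                          "NAME = 'custom'\n")
        sys.path[:0] = [core_root, plugin_root]
        try:
            importlib.invalidate_caches()
            builtin = importlib.import_module(
                "demo_connector.doc_managers.builtin_doc_manager")
            custom = importlib.import_module(
                "demo_connector.doc_managers.your_custom_doc_manager")
            return builtin.NAME, custom.NAME
        finally:
            sys.path.remove(core_root)
            sys.path.remove(plugin_root)


if __name__ == "__main__":
    print(load_from_two_distributions())
```

Without the extend_path lines, Python would stop at the first demo_connector directory it finds on sys.path, and the module that lives only in the second tree would fail to import.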

That's it! Install your DocManager and test it with mongo-connector -d your_custom_doc_manager. No ".py" at the end; Mongo Connector will find the correct file just from the module name.
