Integrity docs #189

Open
wants to merge 6 commits into master
Changes from 4 commits
60 changes: 60 additions & 0 deletions docs/src/integrity/integrity.rst
@@ -0,0 +1,60 @@
***********************
Computational Integrity
***********************

The integrity module of Opaque ensures that the untrusted job driver hosted on the cloud service schedules tasks in the manner computed by Spark's Catalyst query optimizer.
Opaque runs on Spark, which uses data partitioning to speed up computation.
Specifically, Catalyst computes a physical query plan for a given dataframe query and delegates Spark SQL operations on data partitions to Spark workers (running in enclaves).
Each of these individual units is trusted, but the intermediate steps in which the units communicate are controlled by the job driver, which runs as untrusted code in the cloud.
The integrity module detects whether the job driver has deviated from the query plan computed by Catalyst.

Overview
--------
The main idea behind integrity support is to tag each step of computation with a MAC (message authentication code), attached by the enclave worker when it has completed its computation.
Member: Can you elaborate on what exactly the MAC is over?

Collaborator (author): See new commit.

Member: Would be good to elaborate a bit further on what is MAC'd over -- maybe for example you can explain what is in each LogEntry, similar to what we say in the "Building Blocks" section of this document (but of course with the updated fields).

Collaborator (author): Added the flatbuffers schema explicitly and a short description of each object in 2db9389.

Each enclave worker logs the MACs it receives from previous workers. At the end of the query, these logged MACs are used to reconstruct a graph of the computation that was actually executed.
This graph is compared to the one computed by Catalyst.
If the graphs are isomorphic, no tampering has occurred.
Otherwise, the query result returned by the cloud is rejected.
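
As a rough illustration of this accept/reject decision, the sketch below compares two small, labeled ecall DAGs. The names are hypothetical rather than Opaque's actual classes, and comparing node labels and labeled edges is a simplification of the isomorphism check described above:

.. code-block:: scala

    // Hypothetical sketch, not Opaque's actual JobVerificationEngine API.
    // Nodes are labeled by the ecall they represent; edges follow the flow
    // of MACs from one enclave worker to the next.
    case class JobNode(id: Int, ecall: String)

    case class JobDag(nodes: Seq[JobNode], edges: Set[(JobNode, JobNode)])

    object IntegrityCheckSketch {
      // Simplified stand-in for the isomorphism check: compare the multiset
      // of node labels and the set of (parent ecall, child ecall) edges.
      def accept(executed: JobDag, expected: JobDag): Boolean = {
        val sameNodes =
          executed.nodes.map(_.ecall).sorted == expected.nodes.map(_.ecall).sorted
        val sameEdges =
          executed.edges.map { case (p, c) => (p.ecall, c.ecall) } ==
            expected.edges.map { case (p, c) => (p.ecall, c.ecall) }
        sameNodes && sameEdges
      }
    }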

Implementation
--------------
Two main extensions were made to support integrity: one in the enclave code, and one in the Scala client application.

Enclave Code
^^^^^^^^^^^^
In the enclave code (C++), modifications were made to the ``FlatbuffersWriters.cpp`` file.
Member: Modifications were also made to FlatbuffersReaders for the during-execution integrity checks: checking whether all blocks that were output from the previous ecall were indeed received by the subsequent ecall.

Member: Can you also add a section about the during-execution checks and post-verification checks?

Collaborator (author): Can you elaborate on this? I think I wrote a little about the post-verification checks in the part about the "Scala / Job Verification Engine Code", which outlines the reconstruction of the executed and expected DAGs. Do you want me to explain more in this section?

Member: Yeah, I think it'd be good to either have a separate section or to add to the "Overview" section a bit about the during-execution integrity checks and the post-verification checks, i.e. say that as part of integrity we perform checks during execution and post-execution. In particular, maybe we can talk about what each is checking for.

Collaborator (author): Added a little more detail in commit 2db9389.

A MAC over the output is attached to every ``EncryptedBlocks`` object that an enclave worker produces.
No further modifications need to be made to the application logic, since this functionality hooks into how Opaque workers output their data.
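
To make concrete what the MAC covers (the serialized bytes of the ``EncryptedBlocks`` output), here is a hypothetical sketch. The real code runs in C++ inside the enclave (``FlatbuffersWriters.cpp``); the MAC algorithm and key handling shown here are assumptions for illustration only:

.. code-block:: scala

    import javax.crypto.Mac
    import javax.crypto.spec.SecretKeySpec

    // Illustration only: the real MAC is computed in C++ inside the enclave.
    // HMAC-SHA256 and the shared-key handling here are assumptions.
    object OutputMacSketch {
      def macOverOutput(sharedKey: Array[Byte],
                        serializedEncryptedBlocks: Array[Byte]): Array[Byte] = {
        val mac = Mac.getInstance("HmacSHA256")
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA256"))
        mac.doFinal(serializedEncryptedBlocks)
      }
    }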

Scala/Application Code
^^^^^^^^^^^^^^^^^^^^^^
The main extension supporting integrity is the ``JobVerificationEngine``, a piece of Scala code that broadly carries out three tasks:

1. Reconstruct the flow of information between enclave workers.

2. Compute the corresponding DAG of ecalls for a given query.

3. Compare the two DAGs and output "accept" or "reject."

These steps happen in the ``verify`` method of the ``JobVerificationEngine`` class.
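
The skeleton below sketches how the three tasks might fit together. The types and method names are illustrative placeholders, not the actual ``JobVerificationEngine`` interface:

.. code-block:: scala

    // Hypothetical skeleton; names are placeholders, not Opaque's actual API.
    // LogEntry and PlanNode stand in for the real flatbuffers / Spark types.
    case class LogEntry(ecall: String,
                        inputMacs: Seq[Vector[Byte]],
                        outputMac: Vector[Byte])
    case class PlanNode(operator: String, children: Seq[PlanNode])

    trait VerifierSketch {
      type Dag  // whatever DAG representation is chosen

      def reconstructExecutedDag(log: Seq[LogEntry]): Dag   // task 1
      def expectedDagFromPlan(plan: PlanNode): Dag          // task 2
      def dagsMatch(executed: Dag, expected: Dag): Boolean  // task 3

      // "verify" simply chains the three tasks and accepts or rejects.
      def verify(log: Seq[LogEntry], plan: PlanNode): Boolean =
        dagsMatch(reconstructExecutedDag(log), expectedDagFromPlan(plan))
    }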

Reconstructing the executed DAG of ecalls involves iterating through the MACs attached by enclave workers, which are provided to the Job Verification Engine in the ``LogEntryChain`` object.
This object is populated by Opaque when a query is executed and Spark's ``collect`` method is called.

Output MACs of parent ecalls correspond to the input MACs of their children. Using this correspondence, the executed DAG is created.
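
A minimal sketch of this matching step, under the assumption that each log entry records the ecall name, its input MACs, and its output MAC (the field names here are hypothetical):

.. code-block:: scala

    // Hypothetical sketch of DAG reconstruction; field names are illustrative.
    case class EcallLogEntry(ecall: String,
                             inputMacs: Seq[Vector[Byte]],
                             outputMac: Vector[Byte])

    object ExecutedDagSketch {
      // An edge parent -> child exists whenever one of the child's input MACs
      // equals the parent's output MAC.
      def edges(log: Seq[EcallLogEntry]): Set[(EcallLogEntry, EcallLogEntry)] = {
        val byOutputMac = log.map(e => e.outputMac -> e).toMap
        (for {
          child  <- log
          mac    <- child.inputMacs
          parent <- byOutputMac.get(mac)
        } yield parent -> child).toSet
      }
    }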

The "expected" DAG is created from Spark's ``dataframe.queryPlan.executedPlan`` object which is a recursive tree node of Spark Operators.
The Job Verification Engine contains the logic to transform this tree of operators into a tree of ecalls.
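
A sketch of this transformation on a simplified operator tree; the operator names and the operator-to-ecall mapping shown are illustrative and do not reflect the exact ecalls Opaque's operators use:

.. code-block:: scala

    // Hypothetical sketch: map a simplified tree of physical operators to a
    // tree of ecall nodes. Operator and ecall names are placeholders; a real
    // operator may expand into several ecalls.
    case class OperatorNode(name: String, children: Seq[OperatorNode])
    case class EcallNode(ecall: String, children: Seq[EcallNode])

    object ExpectedDagSketch {
      def fromOperatorTree(op: OperatorNode): EcallNode = {
        val childEcalls = op.children.map(fromOperatorTree)
        op.name match {
          case "EncryptedProjectExec" => EcallNode("project", childEcalls)
          case "EncryptedFilterExec"  => EcallNode("filter", childEcalls)
          case other                  => EcallNode(other.toLowerCase, childEcalls)
        }
      }
    }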

Adding Integrity Support for New Operators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To add integrity support for a new operator, one should make changes to both the enclave code and the Job Verification Engine code.

In the enclave, make sure that the enclave context's ``finish_ecall`` method is called before returning in ``Enclave.cpp``.

In the Job Verification Engine, add the logic that transforms the operator into the list of ecalls it uses in ``generateJobNodes``.
This amounts to adding a case to the match statement in that function.

Furthermore, add the logic to connect the ecalls together in ``linkEcalls``.
As above, this amounts to adding a case to the match statement in that function, but it requires knowledge of how each ecall transfers data partitions to its successor ecall
(broadcast, all-to-one, one-to-all, etc.).
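
A rough sketch of both additions, using a hypothetical operator ``EncryptedNewOpExec`` and made-up ecall names; the actual ``generateJobNodes`` and ``linkEcalls`` signatures in the ``JobVerificationEngine`` may differ:

.. code-block:: scala

    // Hypothetical sketch; operator, ecall, and method names are placeholders,
    // and the real generateJobNodes / linkEcalls signatures may differ.
    object NewOperatorSketch {
      sealed trait Linkage
      case object OneToOne  extends Linkage  // partition i feeds partition i
      case object AllToOne  extends Linkage  // every partition feeds one partition
      case object Broadcast extends Linkage  // one partition feeds every partition

      // Analogue of adding a case in generateJobNodes: which ecalls the new
      // operator expands into.
      def ecallsForOperator(operatorName: String): Seq[String] = operatorName match {
        // ... cases for existing operators ...
        case "EncryptedNewOpExec" => Seq("newOpBoundaries", "newOpProcess")
        case _                    => Seq.empty
      }

      // Analogue of adding a case in linkEcalls: how each ecall hands its
      // partitions to the next ecall.
      def linkageForEcall(ecall: String): Linkage = ecall match {
        // ... cases for existing ecalls ...
        case "newOpBoundaries" => AllToOne  // e.g. gather boundary information
        case "newOpProcess"    => OneToOne  // then process each partition locally
        case _                 => OneToOne
      }
    }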