Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: sanitize transformation error message to hide record values #221

Merged
merged 11 commits into from
Jan 16, 2025

Conversation

ChristoGrab
Copy link
Collaborator

@ChristoGrab ChristoGrab commented Jan 15, 2025

What

A user alerted us to a Record transformation error that was exposing sensitive email values in their Builder test reads. Although we're not overly concerned about the security implications given these logs are not persisted long-term and should only be available to the project owner, we should err on the side of caution when logging source data. This PR updates the error message constructor to sanitize data values, only displaying the structure/type of the field causing the failure.

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling and reporting mechanism for type transformations.
    • Enhanced error messages to provide more detailed information about type conversion failures.
  • Documentation

    • Updated error message format to clarify type transformation issues.
    • Improved readability of warning messages during type conversions.
  • Tests

    • Added new test cases for scenarios involving null input data and deep nested structures.

@ChristoGrab ChristoGrab requested a review from bnchrch January 15, 2025 17:34
@ChristoGrab ChristoGrab marked this pull request as ready for review January 15, 2025 20:29
@ChristoGrab ChristoGrab changed the title security fix: sanitize transformation error message to hide record values chore: sanitize transformation error message to hide record values Jan 15, 2025
Copy link
Contributor

coderabbitai bot commented Jan 15, 2025

📝 Walkthrough

Walkthrough

The changes enhance the error messaging mechanism in the type transformation utility within the Airbyte CDK. The get_error_message method in the TypeTransformer class has been refactored to provide more detailed and structured error reporting. A new private method, _get_type_structure, has been introduced to recursively analyze the input data's type structure, improving the clarity of error messages related to type conversion failures.

Changes

File Change Summary
airbyte_cdk/sources/utils/transform.py - Updated implementation of get_error_message method
- Added new private method _get_type_structure
- Enhanced formatting of error messages
- Renamed key_path to field_path
- Introduced constant MAX_NESTING_DEPTH = 3
unit_tests/sources/utils/test_transform.py - Revised warning message formats for clarity
- Improved error message descriptions
- Added new test cases for null input and deep nested structures

Sequence Diagram

sequenceDiagram
    participant Transformer as TypeTransformer
    participant ValidationError as Validation Error
    participant ErrorMessage as Error Message Generator

    Transformer->>ValidationError: Receive validation error
    Transformer->>ErrorMessage: Call get_error_message()
    ErrorMessage->>ErrorMessage: _get_type_structure()
    ErrorMessage-->>Transformer: Return detailed error message
Loading

Hey! It looks like the changes really focus on making the error reporting clearer and more informative. Would you like to dive deeper into how the new type structure analysis works? It seems like a fascinating approach! Wdyt? 😊

Finishing Touches

  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/utils/transform.py (1)

228-236: LGTM! The recursive type structure analysis looks solid.

The implementation effectively sanitizes sensitive data by focusing on type information rather than actual values. I particularly like how it handles nested dictionaries recursively.

One thought: Would it be helpful to add a docstring explaining the purpose and return format of _get_type_structure? This could help future maintainers understand its security implications, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76b5306 and c16ab54.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/utils/transform.py (1 hunks)
  • unit_tests/sources/utils/test_transform.py (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/utils/transform.py (1)

237-240: The error message format looks good and secure!

The new format effectively communicates type transformation issues without exposing sensitive data. I like how it:

  1. Uses dot notation for clear field paths
  2. Shows type structure instead of actual values
  3. Maintains helpful context for debugging
unit_tests/sources/utils/test_transform.py (1)

107-107: The test coverage looks comprehensive!

The updated test cases effectively verify that:

  1. Error messages follow the new secure format
  2. Type information is correctly structured
  3. Various edge cases are handled properly
  4. No sensitive data is exposed

I particularly appreciate how the test cases cover nested objects, arrays, and complex type scenarios. This helps ensure the security fix is robust across different data structures.

Also applies to: 114-114, 199-199, 208-208, 215-215, 222-222, 255-255, 266-266, 275-275

airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added the chore label Jan 15, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/utils/transform.py (2)

237-237: How about adding a return type hint? 🤔

The method could benefit from an explicit return type annotation. Maybe something like this? wdyt?

-    def _get_type_structure(self, input_data: Any, current_depth: int = 0) -> Any:
+    def _get_type_structure(self, input_data: Any, current_depth: int = 0) -> Union[str, Dict[str, Any]]:

246-254: Would you consider simplifying this nested condition? 🤔

The nested if-else structure could be simplified using a helper function or dictionary mapping. Here's one way to do it, wdyt?

     if current_depth >= MAX_NESTING_DEPTH:
-        return (
-            "object"
-            if isinstance(input_data, dict)
-            else "array"
-            if isinstance(input_data, list)
-            else python_to_json[type(input_data)]
-        )
+        if isinstance(input_data, dict):
+            return "object"
+        if isinstance(input_data, list):
+            return "array"
+        return python_to_json[type(input_data)]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c16ab54 and f90e61a.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/utils/transform.py (2 hunks)
  • unit_tests/sources/utils/test_transform.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/utils/test_transform.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/utils/transform.py (2)

12-12: Nice addition of the MAX_NESTING_DEPTH constant! 👍

This addresses the previous concern about infinite recursion nicely. The value of 3 seems reasonable for type structure representation.


229-236: Clean implementation of the sanitized error message! 🎯

The new message format effectively hides sensitive values while still providing useful debugging context through type information and field path.

airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
airbyte_cdk/sources/utils/transform.py (1)

229-236: Consider truncating very long field paths for readability?

The error message looks good and safely avoids exposing sensitive data! For better readability, what do you think about truncating very long field paths (e.g., showing only the last N segments with an ellipsis prefix)? wdyt?

-        field_path = ".".join(map(str, e.path))
+        MAX_PATH_SEGMENTS = 3
+        path_segments = list(map(str, e.path))
+        if len(path_segments) > MAX_PATH_SEGMENTS:
+            field_path = "..." + ".".join(path_segments[-MAX_PATH_SEGMENTS:])
+        else:
+            field_path = ".".join(path_segments)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f90e61a and c97cf85.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/utils/transform.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
airbyte_cdk/sources/utils/transform.py (2)

12-12: LGTM! Good call on adding the MAX_NESTING_DEPTH constant.

The constant helps prevent stack overflow from deeply nested structures while maintaining readable error messages.


229-262: Great job on the security improvements! 🔒

The changes effectively prevent sensitive data leakage while maintaining useful error messages. The type structure provides enough context for debugging without exposing actual values.

airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/utils/transform.py (2)

229-235: How about enhancing the docstring to be more descriptive? wdyt?

The docstring could better explain the method's purpose and return value format. Here's a suggestion:

     def get_error_message(self, e: ValidationError) -> str:
         """
-        Construct a sanitized error message from a ValidationError instance.
+        Construct a sanitized error message from a ValidationError instance that describes
+        the type conversion failure without exposing sensitive data values.
+
+        Args:
+            e: The ValidationError instance containing error details
+
+        Returns:
+            A string message in the format: "Failed to transform value from type 'X' to type 'Y' at path: 'path'"
         """

237-237: Should we add a more specific return type hint? wdyt?

The return type Any could be more specific to better document the possible return values:

-    def _get_type_structure(self, input_data: Any, current_depth: int = 0) -> Any:
+    def _get_type_structure(self, input_data: Any, current_depth: int = 0) -> Union[str, dict[str, Any]]:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c97cf85 and 3738363.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/utils/transform.py (2 hunks)
  • unit_tests/sources/utils/test_transform.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/utils/test_transform.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/utils/transform.py (2)

12-12: LGTM! Good choice on the nesting depth.

The constant is well-placed and the value of 3 provides a good balance between detail and brevity in error messages.


255-256: 🛠️ Refactor suggestion

Should we add graceful handling for custom types? wdyt?

The current implementation might raise KeyError for custom types not in python_to_json. Consider adding a fallback:

         else:
-            return python_to_json[type(input_data)]
+            try:
+                return python_to_json[type(input_data)]
+            except KeyError:
+                return f"custom_type({type(input_data).__name__})"

Likely invalid or redundant comment.

@ChristoGrab ChristoGrab requested a review from bnchrch January 16, 2025 00:40
Copy link
Contributor

@bnchrch bnchrch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well sanitized
a-man-is-spraying-something-in-a-kitchen-with-the-words-box-video-downloader-behind-him

@ChristoGrab ChristoGrab merged commit b5ed82c into main Jan 16, 2025
18 of 19 checks passed
@ChristoGrab ChristoGrab deleted the christo/sanitize-error-message branch January 16, 2025 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants