
feat: Adds ZipfileDecoder component #169

Merged

merged 40 commits into from Jan 17, 2025

Conversation


@pnilan pnilan commented Dec 12, 2024

What

How

  • Creates a new ZipfileDecoder class and component, enabling connector developers to specify the underlying data parser component (defaulting to the JsonParser component)

Recommended Reviewing Order

  1. zipfile_decoder.py
  2. model_to_component_factory.py
  3. declarative_component_schema.yaml
  4. test_zipfile_decoder.py

Summary by CodeRabbit

  • New Features

    • Added support for decoding zip file responses in the Airbyte CDK.
    • Introduced ZipfileDecoder to handle compressed data formats.
    • Enabled parsing of zip files with flexible parser options (JSON, Gzip, Csv).
    • Enhanced ModelToComponentFactory with a method to create ZipfileDecoder components.
    • Updated SessionTokenAuthenticator to integrate the new decoder.
  • Bug Fixes

    • Implemented error handling for invalid zip file responses.
  • Tests

    • Added comprehensive unit tests for ZipfileDecoder functionality.

@github-actions github-actions bot added the enhancement New feature or request label Dec 12, 2024

coderabbitai bot commented Dec 12, 2024

📝 Walkthrough

The pull request introduces a new ZipfileDecoder component to the Airbyte CDK, designed to handle responses compressed in zip files. This decoder allows for flexible parsing of zipped data by utilizing a configurable parser (such as GzipParser or JsonParser). The implementation spans multiple files, including schema definitions, decoder logic, and a corresponding test suite, enabling developers to easily decode and process zip-compressed responses in declarative source connectors.
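As a rough illustration of the flow described above (unzip the response body in memory, then hand each member file to a parser), here is a standard-library sketch; the names are illustrative, not the CDK's actual classes:

```python
import io
import json
import zipfile
from typing import Any, Iterator


def decode_zip_response(content: bytes) -> Iterator[Any]:
    """Open a zipped response body and parse each member as JSON
    (mirroring the default JsonParser behavior described above)."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            yield json.loads(archive.read(name))


# Build a zip archive in memory and decode it:
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("a.json", json.dumps({"id": 1}))
    zf.writestr("b.json", json.dumps({"id": 2}))
records = list(decode_zip_response(buffer.getvalue()))
```

In the actual component the JSON step is pluggable (e.g. a GzipParser could sit in front of a JsonParser), which is what the configurable `parser` field enables.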

Changes

File | Change Summary
--- | ---
airbyte_cdk/sources/declarative/declarative_component_schema.yaml | Added ZipfileDecoder component and updated SimpleRetriever to include it
airbyte_cdk/sources/declarative/decoders/__init__.py | Updated imports and __all__ list to include new parsers and ZipfileDecoder
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py | New file with ZipfileDecoder implementation
airbyte_cdk/sources/declarative/models/declarative_component_schema.py | Added ZipfileDecoder class definition and updated JsonParser and SessionTokenAuthenticator
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py | Added method to create ZipfileDecoder
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py | New test file for ZipfileDecoder functionality

Assessment against linked issues

Objectives from the linked issue:
  • Create ZipfileDecoder
  • Inherit from Parser class
  • Support underlying parser

Suggested reviewers

  • maxi297
  • natikgadzhi

Hey there! 👋 I noticed you've implemented the ZipfileDecoder exactly as proposed in the issue. Would you like me to elaborate on any specific aspect of the implementation? Wdyt about adding some additional error handling or logging? 🤔



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (7)
unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1)

30-35: Simplify test assertions for clarity?

In your test function, would it be cleaner to compare the entire list of parsed results with the expected output directly? This could make the test more straightforward and easier to read. Wdyt?

Here's how you might adjust it:

def test_json_parser_with_valid_data(raw_data, expected):
-    for i, actual in enumerate(JsonParser().parse(raw_data)):
-        if isinstance(expected, list):
-            assert actual == expected[i]
-        else:
-            assert actual == expected
+    actual = list(JsonParser().parse(raw_data))
+    assert actual == (expected if isinstance(expected, list) else [expected])
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (3)

17-23: Consider adding error handling to the helper function?

The create_zip_from_dict function looks good but might benefit from some error handling for invalid inputs. What do you think about adding try-catch blocks to handle potential serialization errors? wdyt?

def create_zip_from_dict(data: Union[dict, list]):
    try:
        zip_buffer = BytesIO()
        with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
            zip_file.writestr("data.json", json.dumps(data))  # writestr expects str/bytes, so serialize first
        zip_buffer.seek(0)
        return zip_buffer.getvalue()
    except (TypeError, ValueError) as e:
        raise ValueError(f"Failed to create zip from data: {e}")

35-36: Clarify the commented code's purpose?

There's a commented line that seems to show an alternative implementation. Consider either removing it if it's no longer needed or adding a comment explaining why it's kept for reference. wdyt?


25-44: Consider adding more test scenarios?

The test covers basic happy path scenarios but could benefit from additional test cases. What do you think about adding tests for:

  • Error cases (malformed zip files)
  • Empty files
  • Large datasets
  • Invalid JSON data
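The malformed-zip case from the list above could be sketched like this (plain asserts instead of the repo's pytest conventions; the helper name is hypothetical):

```python
import io
import zipfile


def open_zip_archive(content: bytes) -> zipfile.ZipFile:
    """Hypothetical helper mirroring the decoder's first step:
    treat the raw response body as a zip archive."""
    return zipfile.ZipFile(io.BytesIO(content))


# Malformed bytes should raise zipfile.BadZipFile, which the
# ZipfileDecoder is expected to catch and log.
try:
    open_zip_archive(b"definitely not a zip archive")
    raised = False
except zipfile.BadZipFile:
    raised = True
```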
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1900-1916: Documentation could be more detailed, wdyt?

The ZipfileDecoder component looks good, but the documentation could be enhanced to help users better understand its capabilities. Consider adding:

  1. How multiple files within a zip are handled
  2. Example usage showing parser configuration
  3. What happens if no parser is specified (default behavior)
   ZipfileDecoder:
     title: Zipfile Decoder
-    description: Decoder for response data that is returned as zipfile(s).
+    description: |
+      Decoder for response data that is returned as zipfile(s). When multiple files are present in the zip,
+      they are processed sequentially. If no parser is specified, JsonParser is used by default.
+      
+      Example usage:
+      ```yaml
+      decoder:
+        type: ZipfileDecoder
+        parser:
+          type: JsonParser
+      ```

1917-1927: Would adding error handling details be helpful?

The JsonParser component looks good for basic use cases, but users might benefit from understanding error scenarios. Consider enhancing the description with:

  1. How parsing errors are handled
  2. Expected input/output format examples
   JsonParser:
     title: JsonParser
-    description: Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+    description: |
+      Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+      Raises JsonParseError if the input cannot be parsed as valid JSON.
+      
+      Example input/output:
+      Input: '{"key": "value"}'
+      Output: {"key": "value"}
+      
+      Input: b'{"key": "value"}'
+      Output: {"key": "value"}
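The input/output contract suggested above can be exercised with a minimal stand-in (hypothetical, not the CDK's JsonParser class):

```python
import json
from typing import Any, Union


def parse_json(data: Union[str, bytes, bytearray], encoding: str = "utf-8") -> Any:
    """Decode bytes-like input first, then parse it as JSON."""
    if isinstance(data, (bytes, bytearray)):
        data = data.decode(encoding)
    return json.loads(data)


# str and bytes inputs yield the same dictionary:
from_str = parse_json('{"key": "value"}')
from_bytes = parse_json(b'{"key": "value"}')
```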

1928-1949: Should we use a more practical example?

The CustomParser component looks good, but the example might be too whimsical. Consider:

  1. Using a real-world example that demonstrates practical usage
  2. Adding information about the base Parser class requirements
   CustomParser:
     title: Custom Parser
-    description: Use this to implement custom parser logic.
+    description: |
+      Use this to implement custom parser logic by extending the base Parser class.
+      The custom implementation must override the parse method with signature:
+      def parse(self, response: Union[str, bytes, bytearray]) -> Dict[str, Any]
     type: object
     additionalProperties: true
     required:
       - type
       - class_name
     properties:
       class_name:
         examples:
-          - "source_rivendell.components.ElvishParser"
+          - "source_amplitude.components.GzipJsonParser"
+          - "source_salesforce.components.CsvParser"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceebfda and 49d0ec8.

📒 Files selected for processing (10)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/parsers/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/parsers/parsers.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6 hunks)
  • unit_tests/sources/declarative/decoders/parsers/__init__.py (1 hunks)
  • unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • unit_tests/sources/declarative/decoders/parsers/__init__.py
  • airbyte_cdk/sources/declarative/decoders/parsers/__init__.py
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/decoders/__init__.py (1)

10-10: ZipfileDecoder added to exports correctly!

The inclusion of ZipfileDecoder in the module imports and __all__ list looks great. Nice work!

Also applies to: 12-12

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

744-750: LGTM! Clean implementation of JsonParser

The JsonParser class follows the established pattern and correctly allows extra fields through Config.extra.


1492-1502: LGTM! Well-structured ZipfileDecoder implementation

The ZipfileDecoder class is well-designed with:

  • Proper inheritance from BaseModel
  • Flexible parser configuration through Union type
  • Clear field descriptions
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)

69-69: LGTM! Clean import additions

The new imports are properly organized and follow the existing import structure.

Also applies to: 71-71


481-481: LGTM! Well-organized constructor mapping

The new entries in PYDANTIC_MODEL_TO_CONSTRUCTOR are properly placed and maintain alphabetical ordering.

Also applies to: 517-517


1695-1704: LGTM! Well-implemented ZipfileDecoder factory method

The create_zipfile_decoder method follows the established pattern and properly handles the optional parser configuration.


1705-1708: LGTM! Clean JsonParser factory method

The create_json_parser method is simple and follows the established pattern for basic components.


@pnilan pnilan closed this Jan 14, 2025
@pnilan pnilan reopened this Jan 15, 2025
@pnilan pnilan closed this Jan 15, 2025
@pnilan pnilan reopened this Jan 15, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)

30-39: Consider adding context manager for zip file handling

The zip file should be properly closed after use. Wdyt about wrapping it in a context manager?

-        try:
-            zip_file = zipfile.ZipFile(io.BytesIO(response.content))
-        except zipfile.BadZipFile as e:
-            logger.exception(e)
-            logger.error(
-                f"Received an invalid zip file in response to URL: {response.request.url}. "
-                f"The size of the response body is: {len(response.content)}"
-            )
-            yield {}
+        with io.BytesIO(response.content) as content_buffer:
+            try:
+                with zipfile.ZipFile(content_buffer) as zip_file:
+                    try:
+                        for filename in zip_file.namelist():
+                            with zip_file.open(filename) as file:
+                                buffered_file = io.BytesIO(file.read())
+                                yield from self.parser.parse(buffered_file)
+                    except Exception as e:
+                        logger.exception(f"Error processing zip file contents: {e}")
+                        yield {}
+            except zipfile.BadZipFile as e:
+                logger.exception(e)
+                logger.error(
+                    f"Received an invalid zip file in response to URL: {response.request.url}. "
+                    f"The size of the response body is: {len(response.content)}"
+                )
+                yield {}

22-22: Add type annotation for parser

The parser attribute should have a type annotation to improve code clarity and help with static type checking. Wdyt?

-    parser: Parser
+    parser: Parser | None = None
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2580-2584: Consider adding ZipfileDecoder to error message

The error message for unsupported decoders should be updated to include ZipfileDecoder in the list of supported decoders. Wdyt?

    _UNSUPPORTED_DECODER_ERROR = (
        "Specified decoder of {decoder_type} is not supported for pagination."
-        "Please set as `JsonDecoder`, `XmlDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
+        "Please set as `JsonDecoder`, `XmlDecoder`, `ZipfileDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
        "If using `GzipParser`, please ensure that the lowest level inner_parser is a `JsonParser`."
    )
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2925-2938: LGTM! Consider adding validation options?

The JsonParser implementation looks good. What do you think about adding options for handling invalid JSON (e.g., strict_mode, allow_nan)? This could help users customize the parsing behavior, wdyt?

docs/RELEASES.md (2)

52-52: Consider making the TODO more actionable?

The TODO comment could be more specific about what alternatives should be explored. Maybe add acceptance criteria or potential approaches to consider? For example:

  • API endpoint to query version info
  • CLI command to display current version
  • Web UI indicator
    wdyt?

60-61: Minor grammar suggestion

Consider changing "publish pipeline" to "publishing pipeline" in both sections for better grammar, wdyt?

-Once the publish pipeline has completed
+Once the publishing pipeline has completed

Also applies to: 68-70

🧰 Tools
🪛 LanguageTool

[grammar] ~60-~60: Did you mean the noun “publishing”?
Context: ...ting Manifest-Only connectors Once the publish pipeline has completed, choose a connec...

(PREPOSITION_VERB)

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (4)

52-69: Consider improving memory efficiency for large responses

The current implementation reads the entire response into memory with data.read(). For large JSON responses, this could lead to high memory usage. What do you think about streaming the response instead?

Here's a possible approach:

-        raw_data = data.read()
-        body_json = self._parse_orjson(raw_data) or self._parse_json(raw_data)
+        # Stream in chunks of 64KB
+        chunks = []
+        while True:
+            chunk = data.read(65536)
+            if not chunk:
+                break
+            chunks.append(chunk)
+        raw_data = b''.join(chunks)

wdyt? 🤔


59-64: How about including the actual parsing error?

The error message could be more helpful if it included the actual parsing error from both attempts. Would you consider something like this?

         if body_json is None:
             raise AirbyteTracedException(
-                message="Response JSON data failed to be parsed. See logs for more information.",
-                internal_message=f"Response JSON data failed to be parsed.",
+                message="Failed to parse JSON data using both orjson and json libraries. See logs for details.",
+                internal_message=f"Failed to parse JSON data. Last error: {exc}",
                 failure_type=FailureType.system_error,
             )

71-85: DRY: Consider consolidating the encoding logic

Both parsing methods decode with the same encoding. Maybe we could do this once? 🎯

+    def _decode_raw_data(self, raw_data: bytes) -> str:
+        return raw_data.decode(self.encoding)
+
     def _parse_orjson(self, raw_data: bytes) -> Optional[Any]:
         try:
-            return orjson.loads(raw_data.decode(self.encoding))
+            return orjson.loads(self._decode_raw_data(raw_data))

66-69: Simplify the generator pattern

The current implementation has a conditional yield. We could make it more straightforward, wdyt?

-        if isinstance(body_json, list):
-            yield from body_json
-        else:
-            yield from [body_json]
+        yield from [body_json] if not isinstance(body_json, list) else body_json
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 96ec874 and aae3e77.

📒 Files selected for processing (9)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (4 hunks)
  • airbyte_cdk/sources/declarative/decoders/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (12 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (11 hunks)
  • airbyte_cdk/sources/declarative/requesters/README.md (2 hunks)
  • docs/RELEASES.md (2 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte_cdk/sources/declarative/requesters/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/declarative/decoders/__init__.py
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py

[error] 42-42: Argument 1 to "BufferedReader" has incompatible type "IO[bytes]"; expected "RawIOBase"

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

[error] 1797-1797: "ZipfileDecoder" has no attribute "inner_parser"


[error] 1798-1798: Unexpected keyword arguments "parameters" and "inner_parser" for "ZipfileDecoder"

🪛 LanguageTool
docs/RELEASES.md

[grammar] ~60-~60: Did you mean the noun “publishing”?
Context: ...ting Manifest-Only connectors Once the publish pipeline has completed, choose a connec...

(PREPOSITION_VERB)


[grammar] ~68-~68: Did you mean the noun “publishing”?
Context: ...ng Low-Code Python connectors Once the publish pipeline has completed, set the version...

(PREPOSITION_VERB)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

125-153: Well-structured test cases!

Great job on the comprehensive test coverage with parameterized tests for various JSON structures.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2045-2062: LGTM! Well-structured component definition for handling zip files.

The ZipfileDecoder component is well-defined with clear requirements and appropriate parser options.

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)

10-15: Nice choice of dependencies! 🚀

Using orjson as the primary JSON parser is a great performance optimization. The error handling imports look well-thought-out too.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1643-1653: Consider enhancing the documentation?

The implementation looks good, but what do you think about adding more details in the description field? For example:

  • Explain what happens if the zip file contains multiple files
  • Clarify if there are any filename pattern requirements
  • Document any size limitations

Here's a suggested diff:

     type: Literal["ZipfileDecoder"]
     parser: Union[GzipParser, JsonParser] = Field(
         ...,
-        description="Parser to parse the decompressed data from the zipfile(s).",
+        description="Parser to parse the decompressed data from the zipfile(s). If the zip contains multiple files, all files will be processed. The parser will be applied to each file's content. Files are processed in alphabetical order.",
         title="Parser",
     )

wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1761-1764: Consider adding error handling to factory methods?

The implementations look good, but what do you think about adding some defensive programming? For example:

  • Validate that the encoding is supported in create_json_parser
  • Check that the parser is compatible in create_zipfile_decoder

Here's a suggested diff:

     def create_json_parser(model: JsonParserModel, config: Config, **kwargs: Any) -> JsonParser:
         encoding = model.encoding if model.encoding else "utf-8"
+        try:
+            "test".encode(encoding)
+        except LookupError:
+            raise ValueError(f"Unsupported encoding: {encoding}")
         return JsonParser(encoding=encoding)

     def create_zipfile_decoder(
         self, model: ZipfileDecoderModel, config: Config, **kwargs: Any
     ) -> ZipfileDecoder:
         parser = self._create_component_from_model(model=model.parser, config=config)
+        if not isinstance(parser, (GzipParser, JsonParser)):
+            raise ValueError(f"Unsupported parser type: {type(parser)}. Must be GzipParser or JsonParser")
         return ZipfileDecoder(parser=parser)

wdyt?

Also applies to: 1794-1798

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between aae3e77 and fe6859f.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1204-1210: LGTM! Clean and focused implementation.

The JsonParser model is well-structured with a sensible default encoding. The extra configuration allowance provides good flexibility.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2580-2600: Great refactoring of pagination decoder support! 👏

The extraction of the error message and addition of helper methods significantly improves code maintainability and readability. The recursive parser support check is particularly elegant.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1660-1670: LGTM! The ZipfileDecoder class looks well-designed.

The implementation follows the established patterns and provides good flexibility through the configurable parser field. The docstring could be enhanced though - what do you think about adding more details about the purpose and usage? wdyt?

Consider adding a docstring like:

class ZipfileDecoder(BaseModel):
+    """Decoder for handling zip file responses.
+    
+    This decoder extracts and parses data from zip files using a configurable parser
+    (either GzipParser or JsonParser) to process the decompressed content.
+    """
    class Config:
        extra = Extra.allow

    type: Literal["ZipfileDecoder"] 
    parser: Union[GzipParser, JsonParser] = Field(
        ...,
        description="Parser to parse the decompressed data from the zipfile(s).",
        title="Parser",
    )
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between fe6859f and 8a1ccf0.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

69-69: LGTM! Clean import addition.

The import is properly placed with other decoder imports.


578-578: LGTM! Proper mapping registration.

The ZipfileDecoderModel is correctly mapped to its factory method.


1804-1809: LGTM! The factory method looks good.

The implementation correctly creates the parser and instantiates the decoder. The fix from the previous pipeline failure has been properly applied.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2071-2088: Consider enhancing the component description?

The description could be more detailed to help users understand:

  • When to use this decoder
  • What types of zip files are supported (single file, multi-file archives)
  • Whether it handles nested zip files
  • Any size limitations

What do you think about expanding it? Here's a suggestion:

-    description: Decoder for response data that is returned as zipfile(s).
+    description: |
+      Decoder for handling API responses containing zip file(s). This decoder:
+      - Supports both single and multi-file zip archives
+      - Automatically extracts and processes compressed files
+      - Delegates the parsing of extracted content to a configured parser component
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8a1ccf0 and fcb3184.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2917-2917: LGTM! Consistent decoder references

The ZipfileDecoder references are correctly added to all relevant decoder lists in SimpleRetriever and AsyncRetriever components. The placement and indentation are consistent with other decoder references.

Also applies to: 3116-3117, 3128-3129


2083-2088: Verify supported parser types

The schema currently supports only GzipParser and JsonParser. Should we also support other parsers like JsonLineParser, CsvParser, etc., for more flexibility? This would align with the parser options available in CompositeRawDecoder.

Let me check the codebase for any existing implementations that might benefit from additional parser support:

✅ Verification successful

Current parser support is appropriate for the use case

The ZipfileDecoder is specifically designed for the declarative framework and works with parsers from the composite_raw_decoder module. The file-based parsers (CsvParser, JsonLineParser, etc.) belong to a different framework with different interfaces and purposes. The current GzipParser and JsonParser support aligns well with the decoder's intended use case.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for files that might use ZipfileDecoder to understand parser requirements
rg -l "ZipfileDecoder|zip_file" --type python

Length of output: 78


Script:

#!/bin/bash
# Search for ZipfileDecoder and zip_file without file type filter
rg "ZipfileDecoder|zip_file"

# Search for parser-related files
fd "parser|decoder" --type f

# Search for parser class definitions
ast-grep --pattern 'class $_Parser'

Length of output: 5030


Script:

#!/bin/bash
# Check ZipfileDecoder implementation
cat airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py

# Check parser interface requirements
cat airbyte_cdk/sources/declarative/decoders/decoder.py

Length of output: 2553

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)

28-30: Add docstring for better clarity?

The implementation is correct, but would it be helpful to add a docstring explaining why this decoder doesn't support streaming? Wdyt?

     def is_stream_response(self) -> bool:
+        """
+        Indicates whether this decoder supports streaming responses.
+        Returns False as ZIP files need to be processed in their entirety.
+        """
         return False

34-46: Consider using context manager for zip file?

The error handling looks good, but would it be cleaner to use a context manager for the zip file? This ensures proper cleanup. Wdyt?

-        try:
-            zip_file = zipfile.ZipFile(BytesIO(response.content))
-        except zipfile.BadZipFile as e:
+        try:
+            with zipfile.ZipFile(BytesIO(response.content)) as zip_file:
+                for file_name in zip_file.namelist():
+                    unzipped_content = zip_file.read(file_name)
+                    buffered_content = BytesIO(unzipped_content)
+                    yield from self.parser.parse(buffered_content)
+        except zipfile.BadZipFile as e:
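A self-contained sketch of the context-manager approach, using a stub `parse_json` in place of the CDK's configured parser (the stub and the in-memory archive are illustrative, not CDK code):

```python
import json
import zipfile
from io import BytesIO


def parse_json(buffer: BytesIO):
    # Stand-in for the configured parser; yields one record per file.
    yield json.loads(buffer.read().decode("utf-8"))


def decode_zip(content: bytes):
    try:
        # The context manager guarantees the archive is closed even if a
        # parser raises mid-iteration.
        with zipfile.ZipFile(BytesIO(content)) as zip_file:
            for file_name in zip_file.namelist():
                buffered_content = BytesIO(zip_file.read(file_name))
                yield from parse_json(buffered_content)
    except zipfile.BadZipFile as e:
        raise ValueError(f"Received an invalid zip file in response: {e}")


# Build an in-memory zip with one JSON entry and decode it.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.json", json.dumps({"key": "value"}))

records = list(decode_zip(buffer.getvalue()))
```

Because `decode_zip` is a generator, the `BadZipFile` handling only runs once the caller starts iterating, which matches how the decoder's `decode` is consumed.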
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcb3184 and 1c8cd66.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)

1-22: LGTM! Well-organized imports.

The imports are logically grouped and all necessary dependencies are included.


24-27: LGTM! Clean class definition.

The dataclass implementation with flexible parser configuration looks good.

@pnilan pnilan requested a review from maxi297 January 17, 2025 02:39
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1660-1670: The implementation looks good but should we restrict extra fields?

The ZipfileDecoder implementation looks clean and well-structured. However, I noticed that Config allows extra fields. While this provides flexibility, it could potentially lead to unexpected behavior if unknown fields are passed. Would you consider changing extra = Extra.allow to Extra.forbid to make the component more strict and predictable? wdyt?

class ZipfileDecoder(BaseModel):
-    class Config:
-        extra = Extra.allow
+    class Config:
+        extra = Extra.forbid

    type: Literal["ZipfileDecoder"]
    parser: Union[GzipParser, JsonParser, JsonLineParser, CsvParser] = Field(
        ...,
        description="Parser to parse the decompressed data from the zipfile(s).",
        title="Parser",
    )
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2071-2090: Should we enhance the documentation with examples?

The schema definition for ZipfileDecoder looks good, but would it be helpful to add some examples showing how to configure it with different parser types? This could help users understand how to properly configure the component. wdyt?

For instance, we could add examples like:

examples:
  - type: ZipfileDecoder
    parser:
      type: JsonParser
  - type: ZipfileDecoder
    parser:
      type: CsvParser
      delimiter: ","
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c8cd66 and 1777cac.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)

1877-1877: LGTM! Clean integration with SessionTokenAuthenticator

The ZipfileDecoder is properly integrated into the decoder field options.


Line range hint 2919-2919: LGTM! Clean integration with SimpleRetriever

The ZipfileDecoder is properly integrated into the decoder field options.


Line range hint 3118-3119: LGTM! Clean integration with AsyncRetriever

The ZipfileDecoder is properly integrated into both decoder and download_decoder field options.

Also applies to: 3130-3131

Contributor

@maxi297 maxi297 left a comment

Just a couple of comments but overall seems fine. Poke me on Slack once those are addressed

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (2)

16-20: Consider encoding the data and using consistent compression settings.

The function writes raw data to the zip file. Should we encode it like in create_multi_zip_from_dict? Also, might be good to use ZIP_DEFLATED here for consistency, wdyt?

 def create_zip_from_dict(data: Union[dict, list]) -> bytes:
     zip_buffer = BytesIO()
-    with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
-        zip_file.writestr("data.json", data)
+    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
+        file_content = json.dumps(data).encode("utf-8")
+        zip_file.writestr("data.json", file_content)
     return zip_buffer.getvalue()
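For context on why the encoding matters: `zipfile.ZipFile.writestr` accepts only `str` or `bytes`, so passing the raw dict would raise a TypeError during test setup. A quick round-trip sanity check of the corrected helper (standalone, outside the test suite):

```python
import json
import zipfile
from io import BytesIO
from typing import Union


def create_zip_from_dict(data: Union[dict, list]) -> bytes:
    zip_buffer = BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
        # writestr needs str or bytes, so serialize the dict first.
        zip_file.writestr("data.json", json.dumps(data).encode("utf-8"))
    return zip_buffer.getvalue()


payload = {"key1": "value1"}
with zipfile.ZipFile(BytesIO(create_zip_from_dict(payload))) as zf:
    roundtrip = json.loads(zf.read("data.json"))
```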

56-68: Consider enhancing multi-file test coverage.

The test verifies basic multi-file functionality, but could we add:

  1. Test for consistent file order processing?
  2. Test with mixed content types (some gzipped, some not)?
  3. Descriptive messages for assertions?

Also, should we verify the behavior when some files in the zip are invalid/corrupted, wdyt?

 def test_zipfile_decoder_with_multi_file_response(requests_mock):
     data_to_zip = [{"key1": "value1"}, {"key2": "value2"}, {"key3": "value3"}]
 
     mocked_response = create_multi_zip_from_dict(data_to_zip)
 
     decoder = ZipfileDecoder(parser=JsonParser())
     requests_mock.register_uri("GET", "https://airbyte.io/", content=mocked_response)
     response = requests.get("https://airbyte.io/")
     results = list(decoder.decode(response))
 
-    assert len(results) == 3
+    assert len(results) == 3, "Expected exactly 3 files in the zip"
     for i, actual in enumerate(results):
-        assert actual == data_to_zip[i]
+        assert actual == data_to_zip[i], f"File {i} content does not match expected data"
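On the file-order question: in CPython's zipfile module, `namelist()` reflects the central-directory order, which matches the order entries were written, so the positional comparison above should be deterministic. A minimal standalone check of that assumption:

```python
import zipfile
from io import BytesIO

buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    # Entries are recorded in the central directory in write order.
    for name in ("a.json", "b.json", "c.json"):
        zf.writestr(name, b"{}")

with zipfile.ZipFile(BytesIO(buffer.getvalue())) as zf:
    names = zf.namelist()
```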
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1777cac and c0b2130.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1)

33-54: Consider adding error handling test cases.

The test covers happy paths well, but what about error scenarios? Maybe we could add test cases for:

  • Malformed zip files
  • Invalid JSON content
  • Empty zip files
  • Nested directories in zip

This aligns with the previous feedback about testing edge cases, wdyt?
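Two of those error paths can be exercised with the stdlib alone; a sketch of what such tests could look like (these use raw zipfile calls rather than the decoder itself, since the exception the decoder ultimately raises should be checked against zipfile_decoder.py):

```python
import zipfile
from io import BytesIO


def test_malformed_zip_raises():
    # Non-zip bytes should be rejected before any parsing happens.
    try:
        zipfile.ZipFile(BytesIO(b"definitely not a zip"))
        raised = False
    except zipfile.BadZipFile:
        raised = True
    assert raised


def test_empty_zip_yields_nothing():
    # A valid archive with no members should decode to zero records.
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w"):
        pass
    with zipfile.ZipFile(BytesIO(buffer.getvalue())) as zf:
        assert zf.namelist() == []


test_malformed_zip_raises()
test_empty_zip_yields_nothing()
```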

@pnilan pnilan requested a review from maxi297 January 17, 2025 17:40
Contributor

@maxi297 maxi297 left a comment

LGTM, thanks for addressing those comments!

@pnilan pnilan merged commit d2016c6 into main Jan 17, 2025
19 checks passed
@pnilan pnilan deleted the pnilan/declarative/zipfiledecoder branch January 17, 2025 19:21
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(ZipfileDecoder) - Create new ZipfileDecoder that takes parser to further parse a response
3 participants