
feat: Adds ZipfileDecoder component #169

Merged

merged 40 commits into from Jan 17, 2025

Conversation


@pnilan pnilan commented Dec 12, 2024

What

How

  • Creates a new ZipfileDecoder class and component, enabling connector developers to specify the underlying data parser component (defaulting to the JsonParser component)

Recommended Reviewing Order

  1. zipfile_decoder.py
  2. model_to_component_factory.py
  3. declarative_component_schema.yaml
  4. test_zipfile_decoder.py

Summary by CodeRabbit

  • New Features

    • Added support for decoding zip file responses in the Airbyte CDK.
    • Introduced ZipfileDecoder to handle compressed data formats.
    • Enabled parsing of zip files with flexible parser options (JSON, Gzip, Csv).
    • Enhanced ModelToComponentFactory with a method to create ZipfileDecoder components.
    • Updated SessionTokenAuthenticator to integrate the new decoder.
  • Bug Fixes

    • Implemented error handling for invalid zip file responses.
  • Tests

    • Added comprehensive unit tests for ZipfileDecoder functionality.

@github-actions github-actions bot added the enhancement New feature or request label Dec 12, 2024

coderabbitai bot commented Dec 12, 2024

📝 Walkthrough

The pull request introduces a new ZipfileDecoder component to the Airbyte CDK, designed to handle responses compressed in zip files. This decoder allows for flexible parsing of zipped data by utilizing a configurable parser (such as GzipParser or JsonParser). The implementation spans multiple files, including schema definitions, decoder logic, and a corresponding test suite, enabling developers to easily decode and process zip-compressed responses in declarative source connectors.
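As a rough illustration of the flow described above (unzip the response body in memory, then hand each member file to a parser), here is a standard-library sketch; the names are illustrative, not the CDK's actual classes:

```python
import io
import json
import zipfile
from typing import Any, Iterator


def decode_zip_response(content: bytes) -> Iterator[Any]:
    """Open a zipped response body and parse each member as JSON
    (mirroring the default JsonParser behavior described above)."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            yield json.loads(archive.read(name))


# Build a zip archive in memory and decode it:
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("a.json", json.dumps({"id": 1}))
    zf.writestr("b.json", json.dumps({"id": 2}))
records = list(decode_zip_response(buffer.getvalue()))
```

In the actual component the JSON step is pluggable (e.g. a GzipParser could sit in front of a JsonParser), which is what the configurable `parser` field enables.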

Changes

File | Change Summary
--- | ---
airbyte_cdk/sources/declarative/declarative_component_schema.yaml | Added ZipfileDecoder component and updated SimpleRetriever to include it
airbyte_cdk/sources/declarative/decoders/__init__.py | Updated imports and __all__ list to include new parsers and ZipfileDecoder
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py | New file with ZipfileDecoder implementation
airbyte_cdk/sources/declarative/models/declarative_component_schema.py | Added ZipfileDecoder class definition and updated JsonParser and SessionTokenAuthenticator
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py | Added method to create ZipfileDecoder
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py | New test file for ZipfileDecoder functionality

Assessment against linked issues

Objectives from the linked issue:
  • Create ZipfileDecoder
  • Inherit from Parser class
  • Support underlying parser

Suggested reviewers

  • maxi297
  • natikgadzhi

Hey there! 👋 I noticed you've implemented the ZipfileDecoder exactly as proposed in the issue. Would you like me to elaborate on any specific aspect of the implementation? Wdyt about adding some additional error handling or logging? 🤔



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (7)
unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1)

30-35: Simplify test assertions for clarity?

In your test function, would it be cleaner to compare the entire list of parsed results with the expected output directly? This could make the test more straightforward and easier to read. Wdyt?

Here's how you might adjust it:

def test_json_parser_with_valid_data(raw_data, expected):
-    for i, actual in enumerate(JsonParser().parse(raw_data)):
-        if isinstance(expected, list):
-            assert actual == expected[i]
-        else:
-            assert actual == expected
+    actual = list(JsonParser().parse(raw_data))
+    assert actual == (expected if isinstance(expected, list) else [expected])
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (3)

17-23: Consider adding error handling to the helper function?

The create_zip_from_dict function looks good but might benefit from some error handling for invalid inputs. What do you think about adding try-catch blocks to handle potential serialization errors? wdyt?

def create_zip_from_dict(data: Union[dict, list]):
    try:
        zip_buffer = BytesIO()
        with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
            zip_file.writestr("data.json", json.dumps(data))  # writestr expects str/bytes, so serialize first
        zip_buffer.seek(0)
        return zip_buffer.getvalue()
    except (TypeError, ValueError) as e:
        raise ValueError(f"Failed to create zip from data: {e}")

35-36: Clarify the commented code's purpose?

There's a commented line that seems to show an alternative implementation. Consider either removing it if it's no longer needed or adding a comment explaining why it's kept for reference. wdyt?


25-44: Consider adding more test scenarios?

The test covers basic happy path scenarios but could benefit from additional test cases. What do you think about adding tests for:

  • Error cases (malformed zip files)
  • Empty files
  • Large datasets
  • Invalid JSON data
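The malformed-zip case from the list above could be sketched like this (plain asserts instead of the repo's pytest conventions; the helper name is hypothetical):

```python
import io
import zipfile


def open_zip_archive(content: bytes) -> zipfile.ZipFile:
    """Hypothetical helper mirroring the decoder's first step:
    treat the raw response body as a zip archive."""
    return zipfile.ZipFile(io.BytesIO(content))


# Malformed bytes should raise zipfile.BadZipFile, which the
# ZipfileDecoder is expected to catch and log.
try:
    open_zip_archive(b"definitely not a zip archive")
    raised = False
except zipfile.BadZipFile:
    raised = True
```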
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1900-1916: Documentation could be more detailed, wdyt?

The ZipfileDecoder component looks good, but the documentation could be enhanced to help users better understand its capabilities. Consider adding:

  1. How multiple files within a zip are handled
  2. Example usage showing parser configuration
  3. What happens if no parser is specified (default behavior)
   ZipfileDecoder:
     title: Zipfile Decoder
-    description: Decoder for response data that is returned as zipfile(s).
+    description: |
+      Decoder for response data that is returned as zipfile(s). When multiple files are present in the zip,
+      they are processed sequentially. If no parser is specified, JsonParser is used by default.
+      
+      Example usage:
+      ```yaml
+      decoder:
+        type: ZipfileDecoder
+        parser:
+          type: JsonParser
+      ```

1917-1927: Would adding error handling details be helpful?

The JsonParser component looks good for basic use cases, but users might benefit from understanding error scenarios. Consider enhancing the description with:

  1. How parsing errors are handled
  2. Expected input/output format examples
   JsonParser:
     title: JsonParser
-    description: Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+    description: |
+      Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+      Raises JsonParseError if the input cannot be parsed as valid JSON.
+      
+      Example input/output:
+      Input: '{"key": "value"}'
+      Output: {"key": "value"}
+      
+      Input: b'{"key": "value"}'
+      Output: {"key": "value"}
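The input/output contract suggested above can be exercised with a minimal stand-in (hypothetical, not the CDK's JsonParser class):

```python
import json
from typing import Any, Union


def parse_json(data: Union[str, bytes, bytearray], encoding: str = "utf-8") -> Any:
    """Decode bytes-like input first, then parse it as JSON."""
    if isinstance(data, (bytes, bytearray)):
        data = data.decode(encoding)
    return json.loads(data)


# str and bytes inputs yield the same dictionary:
from_str = parse_json('{"key": "value"}')
from_bytes = parse_json(b'{"key": "value"}')
```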

1928-1949: Should we use a more practical example?

The CustomParser component looks good, but the example might be too whimsical. Consider:

  1. Using a real-world example that demonstrates practical usage
  2. Adding information about the base Parser class requirements
   CustomParser:
     title: Custom Parser
-    description: Use this to implement custom parser logic.
+    description: |
+      Use this to implement custom parser logic by extending the base Parser class.
+      The custom implementation must override the parse method with signature:
+      def parse(self, response: Union[str, bytes, bytearray]) -> Dict[str, Any]
     type: object
     additionalProperties: true
     required:
       - type
       - class_name
     properties:
       class_name:
         examples:
-          - "source_rivendell.components.ElvishParser"
+          - "source_amplitude.components.GzipJsonParser"
+          - "source_salesforce.components.CsvParser"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceebfda and 49d0ec8.

📒 Files selected for processing (10)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/parsers/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/parsers/parsers.py (1 hunks)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6 hunks)
  • unit_tests/sources/declarative/decoders/parsers/__init__.py (1 hunks)
  • unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • unit_tests/sources/declarative/decoders/parsers/__init__.py
  • airbyte_cdk/sources/declarative/decoders/parsers/__init__.py
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/decoders/__init__.py (1)

10-10: ZipfileDecoder added to exports correctly!

The inclusion of ZipfileDecoder in the module imports and __all__ list looks great. Nice work!

Also applies to: 12-12

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

744-750: LGTM! Clean implementation of JsonParser

The JsonParser class follows the established pattern and correctly allows extra fields through Config.extra.


1492-1502: LGTM! Well-structured ZipfileDecoder implementation

The ZipfileDecoder class is well-designed with:

  • Proper inheritance from BaseModel
  • Flexible parser configuration through Union type
  • Clear field descriptions
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)

69-69: LGTM! Clean import additions

The new imports are properly organized and follow the existing import structure.

Also applies to: 71-71


481-481: LGTM! Well-organized constructor mapping

The new entries in PYDANTIC_MODEL_TO_CONSTRUCTOR are properly placed and maintain alphabetical ordering.

Also applies to: 517-517


1695-1704: LGTM! Well-implemented ZipfileDecoder factory method

The create_zipfile_decoder method follows the established pattern and properly handles the optional parser configuration.


1705-1708: LGTM! Clean JsonParser factory method

The create_json_parser method is simple and follows the established pattern for basic components.


@pnilan pnilan closed this Jan 14, 2025
@pnilan pnilan reopened this Jan 15, 2025
@pnilan pnilan closed this Jan 15, 2025
@pnilan pnilan reopened this Jan 15, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)

30-39: Consider adding context manager for zip file handling

The zip file should be properly closed after use. Wdyt about wrapping it in a context manager?

-        try:
-            zip_file = zipfile.ZipFile(io.BytesIO(response.content))
-        except zipfile.BadZipFile as e:
-            logger.exception(e)
-            logger.error(
-                f"Received an invalid zip file in response to URL: {response.request.url}. "
-                f"The size of the response body is: {len(response.content)}"
-            )
-            yield {}
+        with io.BytesIO(response.content) as content_buffer:
+            try:
+                with zipfile.ZipFile(content_buffer) as zip_file:
+                    try:
+                        for filename in zip_file.namelist():
+                            with zip_file.open(filename) as file:
+                                buffered_file = io.BytesIO(file.read())
+                                yield from self.parser.parse(buffered_file)
+                    except Exception as e:
+                        logger.exception(f"Error processing zip file contents: {e}")
+                        yield {}
+            except zipfile.BadZipFile as e:
+                logger.exception(e)
+                logger.error(
+                    f"Received an invalid zip file in response to URL: {response.request.url}. "
+                    f"The size of the response body is: {len(response.content)}"
+                )
+                yield {}

22-22: Add type annotation for parser

The parser attribute should have a type annotation to improve code clarity and help with static type checking. Wdyt?

-    parser: Parser
+    parser: Parser | None = None
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2580-2584: Consider adding ZipfileDecoder to error message

The error message for unsupported decoders should be updated to include ZipfileDecoder in the list of supported decoders. Wdyt?

    _UNSUPPORTED_DECODER_ERROR = (
        "Specified decoder of {decoder_type} is not supported for pagination."
-        "Please set as `JsonDecoder`, `XmlDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
+        "Please set as `JsonDecoder`, `XmlDecoder`, `ZipfileDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
        "If using `GzipParser`, please ensure that the lowest level inner_parser is a `JsonParser`."
    )
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2925-2938: LGTM! Consider adding validation options?

The JsonParser implementation looks good. What do you think about adding options for handling invalid JSON (e.g., strict_mode, allow_nan)? This could help users customize the parsing behavior, wdyt?

docs/RELEASES.md (2)

52-52: Consider making the TODO more actionable?

The TODO comment could be more specific about what alternatives should be explored. Maybe add acceptance criteria or potential approaches to consider? For example:

  • API endpoint to query version info
  • CLI command to display current version
  • Web UI indicator
    wdyt?

60-61: Minor grammar suggestion

Consider changing "publish pipeline" to "publishing pipeline" in both sections for better grammar, wdyt?

-Once the publish pipeline has completed
+Once the publishing pipeline has completed

Also applies to: 68-70

🧰 Tools
🪛 LanguageTool

[grammar] ~60-~60: Did you mean the noun “publishing”?
Context: ...ting Manifest-Only connectors Once the publish pipeline has completed, choose a connec...

(PREPOSITION_VERB)

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (4)

52-69: Consider improving memory efficiency for large responses

The current implementation reads the entire response into memory with data.read(). For large JSON responses, this could lead to high memory usage. What do you think about streaming the response instead?

Here's a possible approach:

-        raw_data = data.read()
-        body_json = self._parse_orjson(raw_data) or self._parse_json(raw_data)
+        # Stream in chunks of 64KB
+        chunks = []
+        while True:
+            chunk = data.read(65536)
+            if not chunk:
+                break
+            chunks.append(chunk)
+        raw_data = b''.join(chunks)

wdyt? 🤔


59-64: How about including the actual parsing error?

The error message could be more helpful if it included the actual parsing error from both attempts. Would you consider something like this?

         if body_json is None:
             raise AirbyteTracedException(
-                message="Response JSON data failed to be parsed. See logs for more information.",
-                internal_message=f"Response JSON data failed to be parsed.",
+                message="Failed to parse JSON data using both orjson and json libraries. See logs for details.",
+                internal_message=f"Failed to parse JSON data. Last error: {exc}",
                 failure_type=FailureType.system_error,
             )

71-85: DRY: Consider consolidating the encoding logic

Both parsing methods decode with the same encoding. Maybe we could do this once? 🎯

+    def _decode_raw_data(self, raw_data: bytes) -> str:
+        return raw_data.decode(self.encoding)
+
     def _parse_orjson(self, raw_data: bytes) -> Optional[Any]:
         try:
-            return orjson.loads(raw_data.decode(self.encoding))
+            return orjson.loads(self._decode_raw_data(raw_data))

66-69: Simplify the generator pattern

The current implementation has a conditional yield. We could make it more straightforward, wdyt?

-        if isinstance(body_json, list):
-            yield from body_json
-        else:
-            yield from [body_json]
+        yield from [body_json] if not isinstance(body_json, list) else body_json
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 96ec874 and aae3e77.

📒 Files selected for processing (9)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (4 hunks)
  • airbyte_cdk/sources/declarative/decoders/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (12 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (11 hunks)
  • airbyte_cdk/sources/declarative/requesters/README.md (2 hunks)
  • docs/RELEASES.md (2 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte_cdk/sources/declarative/requesters/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/declarative/decoders/__init__.py
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py

[error] 42-42: Argument 1 to "BufferedReader" has incompatible type "IO[bytes]"; expected "RawIOBase"

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

[error] 1797-1797: "ZipfileDecoder" has no attribute "inner_parser"


[error] 1798-1798: Unexpected keyword arguments "parameters" and "inner_parser" for "ZipfileDecoder"

🪛 LanguageTool
docs/RELEASES.md

[grammar] ~60-~60: Did you mean the noun “publishing”?
Context: ...ting Manifest-Only connectors Once the publish pipeline has completed, choose a connec...

(PREPOSITION_VERB)


[grammar] ~68-~68: Did you mean the noun “publishing”?
Context: ...ng Low-Code Python connectors Once the publish pipeline has completed, set the version...

(PREPOSITION_VERB)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

125-153: Well-structured test cases!

Great job on the comprehensive test coverage with parameterized tests for various JSON structures.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2045-2062: LGTM! Well-structured component definition for handling zip files.

The ZipfileDecoder component is well-defined with clear requirements and appropriate parser options.

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)

10-15: Nice choice of dependencies! 🚀

Using orjson as the primary JSON parser is a great performance optimization. The error handling imports look well-thought-out too.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1643-1653: Consider enhancing the documentation?

The implementation looks good, but what do you think about adding more details in the description field? For example:

  • Explain what happens if the zip file contains multiple files
  • Clarify if there are any filename pattern requirements
  • Document any size limitations

Here's a suggested diff:

     type: Literal["ZipfileDecoder"]
     parser: Union[GzipParser, JsonParser] = Field(
         ...,
-        description="Parser to parse the decompressed data from the zipfile(s).",
+        description="Parser to parse the decompressed data from the zipfile(s). If the zip contains multiple files, all files will be processed. The parser will be applied to each file's content. Files are processed in alphabetical order.",
         title="Parser",
     )

wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1761-1764: Consider adding error handling to factory methods?

The implementations look good, but what do you think about adding some defensive programming? For example:

  • Validate that the encoding is supported in create_json_parser
  • Check that the parser is compatible in create_zipfile_decoder

Here's a suggested diff:

     def create_json_parser(model: JsonParserModel, config: Config, **kwargs: Any) -> JsonParser:
         encoding = model.encoding if model.encoding else "utf-8"
+        try:
+            "test".encode(encoding)
+        except LookupError:
+            raise ValueError(f"Unsupported encoding: {encoding}")
         return JsonParser(encoding=encoding)

     def create_zipfile_decoder(
         self, model: ZipfileDecoderModel, config: Config, **kwargs: Any
     ) -> ZipfileDecoder:
         parser = self._create_component_from_model(model=model.parser, config=config)
+        if not isinstance(parser, (GzipParser, JsonParser)):
+            raise ValueError(f"Unsupported parser type: {type(parser)}. Must be GzipParser or JsonParser")
         return ZipfileDecoder(parser=parser)

wdyt?

Also applies to: 1794-1798

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between aae3e77 and fe6859f.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1204-1210: LGTM! Clean and focused implementation.

The JsonParser model is well-structured with a sensible default encoding. The extra configuration allowance provides good flexibility.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2580-2600: Great refactoring of pagination decoder support! 👏

The extraction of the error message and addition of helper methods significantly improves code maintainability and readability. The recursive parser support check is particularly elegant.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1660-1670: LGTM! The ZipfileDecoder class looks well-designed.

The implementation follows the established patterns and provides good flexibility through the configurable parser field. The docstring could be enhanced though - what do you think about adding more details about the purpose and usage? wdyt?

Consider adding a docstring like:

class ZipfileDecoder(BaseModel):
+    """Decoder for handling zip file responses.
+    
+    This decoder extracts and parses data from zip files using a configurable parser
+    (either GzipParser or JsonParser) to process the decompressed content.
+    """
    class Config:
        extra = Extra.allow

    type: Literal["ZipfileDecoder"] 
    parser: Union[GzipParser, JsonParser] = Field(
        ...,
        description="Parser to parse the decompressed data from the zipfile(s).",
        title="Parser",
    )
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between fe6859f and 8a1ccf0.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

69-69: LGTM! Clean import addition.

The import is properly placed with other decoder imports.


578-578: LGTM! Proper mapping registration.

The ZipfileDecoderModel is correctly mapped to its factory method.


1804-1809: LGTM! The factory method looks good.

The implementation correctly creates the parser and instantiates the decoder. The fix from the previous pipeline failure has been properly applied.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2071-2088: Consider enhancing the component description?

The description could be more detailed to help users understand:

  • When to use this decoder
  • What types of zip files are supported (single file, multi-file archives)
  • Whether it handles nested zip files
  • Any size limitations

What do you think about expanding it? Here's a suggestion:

-    description: Decoder for response data that is returned as zipfile(s).
+    description: |
+      Decoder for handling API responses containing zip file(s). This decoder:
+      - Supports both single and multi-file zip archives
+      - Automatically extracts and processes compressed files
+      - Delegates the parsing of extracted content to a configured parser component
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8a1ccf0 and fcb3184.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2917-2917: LGTM! Consistent decoder references

The ZipfileDecoder references are correctly added to all relevant decoder lists in SimpleRetriever and AsyncRetriever components. The placement and indentation are consistent with other decoder references.

Also applies to: 3116-3117, 3128-3129


2083-2088: Verify supported parser types

The schema currently supports only GzipParser and JsonParser. Should we also support other parsers like JsonLineParser, CsvParser, etc., for more flexibility? This would align with the parser options available in CompositeRawDecoder.

Let me check the codebase for any existing implementations that might benefit from additional parser support:

✅ Verification successful

Current parser support is appropriate for the use case

The ZipfileDecoder is specifically designed for the declarative framework and works with parsers from the composite_raw_decoder module. The file-based parsers (CsvParser, JsonLineParser, etc.) belong to a different framework with different interfaces and purposes. The current GzipParser and JsonParser support aligns well with the decoder's intended use case.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for files that might use ZipfileDecoder to understand parser requirements
rg -l "ZipfileDecoder|zip_file" --type python

Length of output: 78


Script:

#!/bin/bash
# Search for ZipfileDecoder and zip_file without file type filter
rg "ZipfileDecoder|zip_file"

# Search for parser-related files
fd "parser|decoder" --type f

# Search for parser class definitions
ast-grep --pattern 'class $_Parser'

Length of output: 5030


Script:

#!/bin/bash
# Check ZipfileDecoder implementation
cat airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py

# Check parser interface requirements
cat airbyte_cdk/sources/declarative/decoders/decoder.py

Length of output: 2553

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)

28-30: Add docstring for better clarity?

The implementation is correct, but would it be helpful to add a docstring explaining why this decoder doesn't support streaming? Wdyt?

     def is_stream_response(self) -> bool:
+        """
+        Indicates whether this decoder supports streaming responses.
+        Returns False as ZIP files need to be processed in their entirety.
+        """
         return False

34-46: Consider using context manager for zip file?

The error handling looks good, but would it be cleaner to use a context manager for the zip file? This ensures proper cleanup. Wdyt?

-        try:
-            zip_file = zipfile.ZipFile(BytesIO(response.content))
-        except zipfile.BadZipFile as e:
+        try:
+            with zipfile.ZipFile(BytesIO(response.content)) as zip_file:
+                for file_name in zip_file.namelist():
+                    unzipped_content = zip_file.read(file_name)
+                    buffered_content = BytesIO(unzipped_content)
+                    yield from self.parser.parse(buffered_content)
+        except zipfile.BadZipFile as e:
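A self-contained sketch of the context-manager approach, using a stub `parse_json` in place of the CDK's configured parser (the stub and the in-memory archive are illustrative, not CDK code):

```python
import json
import zipfile
from io import BytesIO


def parse_json(buffer: BytesIO):
    # Stand-in for the configured parser; yields one record per file.
    yield json.loads(buffer.read().decode("utf-8"))


def decode_zip(content: bytes):
    try:
        # The context manager guarantees the archive is closed even if a
        # parser raises mid-iteration.
        with zipfile.ZipFile(BytesIO(content)) as zip_file:
            for file_name in zip_file.namelist():
                buffered_content = BytesIO(zip_file.read(file_name))
                yield from parse_json(buffered_content)
    except zipfile.BadZipFile as e:
        raise ValueError(f"Received an invalid zip file in response: {e}")


# Build an in-memory zip with one JSON entry and decode it.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.json", json.dumps({"key": "value"}))

records = list(decode_zip(buffer.getvalue()))
```

Because `decode_zip` is a generator, the `BadZipFile` handling only runs once the caller starts iterating, which matches how the decoder's `decode` is consumed.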
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcb3184 and 1c8cd66.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)

1-22: LGTM! Well-organized imports.

The imports are logically grouped and all necessary dependencies are included.


24-27: LGTM! Clean class definition.

The dataclass implementation with flexible parser configuration looks good.

@pnilan pnilan requested a review from maxi297 January 17, 2025 02:39
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1660-1670: The implementation looks good but should we restrict extra fields?

The ZipfileDecoder implementation looks clean and well-structured. However, I noticed that Config allows extra fields. While this provides flexibility, it could potentially lead to unexpected behavior if unknown fields are passed. Would you consider changing extra = Extra.allow to Extra.forbid to make the component more strict and predictable? wdyt?

class ZipfileDecoder(BaseModel):
-    class Config:
-        extra = Extra.allow
+    class Config:
+        extra = Extra.forbid

    type: Literal["ZipfileDecoder"]
    parser: Union[GzipParser, JsonParser, JsonLineParser, CsvParser] = Field(
        ...,
        description="Parser to parse the decompressed data from the zipfile(s).",
        title="Parser",
    )
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2071-2090: Should we enhance the documentation with examples?

The schema definition for ZipfileDecoder looks good, but would it be helpful to add some examples showing how to configure it with different parser types? This could help users understand how to properly configure the component. wdyt?

For instance, we could add examples like:

examples:
  - type: ZipfileDecoder
    parser:
      type: JsonParser
  - type: ZipfileDecoder
    parser:
      type: CsvParser
      delimiter: ","
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c8cd66 and 1777cac.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)

1877-1877: LGTM! Clean integration with SessionTokenAuthenticator

The ZipfileDecoder is properly integrated into the decoder field options.


Line range hint 2919-2919: LGTM! Clean integration with SimpleRetriever

The ZipfileDecoder is properly integrated into the decoder field options.


Line range hint 3118-3119: LGTM! Clean integration with AsyncRetriever

The ZipfileDecoder is properly integrated into both decoder and download_decoder field options.

Also applies to: 3130-3131

Contributor

@maxi297 maxi297 left a comment

Just a couple of comments but overall seems fine. Poke me on Slack once those are addressed

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (2)

16-20: Consider encoding the data and using consistent compression settings.

The function writes raw data to the zip file. Should we encode it like in create_multi_zip_from_dict? Also, might be good to use ZIP_DEFLATED here for consistency, wdyt?

 def create_zip_from_dict(data: Union[dict, list]) -> bytes:
     zip_buffer = BytesIO()
-    with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
-        zip_file.writestr("data.json", data)
+    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
+        file_content = json.dumps(data).encode("utf-8")
+        zip_file.writestr("data.json", file_content)
     return zip_buffer.getvalue()
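For context on why the encoding matters: `zipfile.ZipFile.writestr` accepts only `str` or `bytes`, so passing the raw dict would raise a TypeError during test setup. A quick round-trip sanity check of the corrected helper (standalone, outside the test suite):

```python
import json
import zipfile
from io import BytesIO
from typing import Union


def create_zip_from_dict(data: Union[dict, list]) -> bytes:
    zip_buffer = BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
        # writestr needs str or bytes, so serialize the dict first.
        zip_file.writestr("data.json", json.dumps(data).encode("utf-8"))
    return zip_buffer.getvalue()


payload = {"key1": "value1"}
with zipfile.ZipFile(BytesIO(create_zip_from_dict(payload))) as zf:
    roundtrip = json.loads(zf.read("data.json"))
```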

56-68: Consider enhancing multi-file test coverage.

The test verifies basic multi-file functionality, but could we add:

  1. Test for consistent file order processing?
  2. Test with mixed content types (some gzipped, some not)?
  3. Descriptive messages for assertions?

Also, should we verify the behavior when some files in the zip are invalid/corrupted, wdyt?

 def test_zipfile_decoder_with_multi_file_response(requests_mock):
     data_to_zip = [{"key1": "value1"}, {"key2": "value2"}, {"key3": "value3"}]
 
     mocked_response = create_multi_zip_from_dict(data_to_zip)
 
     decoder = ZipfileDecoder(parser=JsonParser())
     requests_mock.register_uri("GET", "https://airbyte.io/", content=mocked_response)
     response = requests.get("https://airbyte.io/")
     results = list(decoder.decode(response))
 
-    assert len(results) == 3
+    assert len(results) == 3, "Expected exactly 3 files in the zip"
     for i, actual in enumerate(results):
-        assert actual == data_to_zip[i]
+        assert actual == data_to_zip[i], f"File {i} content does not match expected data"
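On the file-order question: in CPython's zipfile module, `namelist()` reflects the central-directory order, which matches the order entries were written, so the positional comparison above should be deterministic. A minimal standalone check of that assumption:

```python
import zipfile
from io import BytesIO

buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    # Entries are recorded in the central directory in write order.
    for name in ("a.json", "b.json", "c.json"):
        zf.writestr(name, b"{}")

with zipfile.ZipFile(BytesIO(buffer.getvalue())) as zf:
    names = zf.namelist()
```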
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1777cac and c0b2130.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1)

33-54: Consider adding error handling test cases.

The test covers happy paths well, but what about error scenarios? Maybe we could add test cases for:

  • Malformed zip files
  • Invalid JSON content
  • Empty zip files
  • Nested directories in zip

This aligns with the previous feedback about testing edge cases, wdyt?
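Two of those error paths can be exercised with the stdlib alone; a sketch of what such tests could look like (these use raw zipfile calls rather than the decoder itself, since the exception the decoder ultimately raises should be checked against zipfile_decoder.py):

```python
import zipfile
from io import BytesIO


def test_malformed_zip_raises():
    # Non-zip bytes should be rejected before any parsing happens.
    try:
        zipfile.ZipFile(BytesIO(b"definitely not a zip"))
        raised = False
    except zipfile.BadZipFile:
        raised = True
    assert raised


def test_empty_zip_yields_nothing():
    # A valid archive with no members should decode to zero records.
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w"):
        pass
    with zipfile.ZipFile(BytesIO(buffer.getvalue())) as zf:
        assert zf.namelist() == []


test_malformed_zip_raises()
test_empty_zip_yields_nothing()
```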

@pnilan pnilan requested a review from maxi297 January 17, 2025 17:40
Contributor

@maxi297 maxi297 left a comment

LGTM, thanks for addressing those comments!

@pnilan pnilan merged commit d2016c6 into main Jan 17, 2025
19 checks passed
@pnilan pnilan deleted the pnilan/declarative/zipfiledecoder branch January 17, 2025 19:21
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(ZipfileDecoder) - Create new ZipfileDecoder that takes parser to further parse a response
3 participants