feat: Adds ZipfileDecoder component #169
Conversation
📝 Walkthrough

The pull request introduces a new ZipfileDecoder component.

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Finishing Touches
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (7)
unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1)
30-35: Simplify test assertions for clarity? In your test function, would it be cleaner to compare the entire list of parsed results with the expected output directly? This could make the test more straightforward and easier to read. Wdyt?

Here's how you might adjust it:

```diff
 def test_json_parser_with_valid_data(raw_data, expected):
-    for i, actual in enumerate(JsonParser().parse(raw_data)):
-        if isinstance(expected, list):
-            assert actual == expected[i]
-        else:
-            assert actual == expected
+    actual = list(JsonParser().parse(raw_data))
+    assert actual == expected if isinstance(expected, list) else [expected]
```

unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (3)
17-23: Consider adding error handling to the helper function? The create_zip_from_dict function looks good but might benefit from some error handling for invalid inputs. What do you think about adding try-catch blocks to handle potential serialization errors? Wdyt?

```python
def create_zip_from_dict(data: Union[dict, list]):
    try:
        zip_buffer = BytesIO()
        with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
            zip_file.writestr("data.json", data)
        zip_buffer.seek(0)
        return zip_buffer.getvalue()
    except (TypeError, ValueError) as e:
        raise ValueError(f"Failed to create zip from data: {e}")
```
35-36: Clarify the commented code's purpose? There's a commented line that seems to show an alternative implementation. Consider either removing it if it's no longer needed or adding a comment explaining why it's kept for reference. Wdyt?
25-44: Consider adding more test scenarios? The test covers basic happy path scenarios but could benefit from additional test cases. What do you think about adding tests for:
- Error cases (malformed zip files)
- Empty files
- Large datasets
- Invalid JSON data
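For the malformed-zip case, a test can build an invalid payload directly and assert it trips `zipfile.BadZipFile`, the same exception the decoder catches. A stdlib-only sketch — the `is_valid_zip` helper is hypothetical, not part of the CDK:

```python
import zipfile
from io import BytesIO


def is_valid_zip(payload: bytes) -> bool:
    """Mirror the check ZipfileDecoder performs: does the payload parse as a zip archive?"""
    try:
        zipfile.ZipFile(BytesIO(payload))
        return True
    except zipfile.BadZipFile:
        return False


# Arbitrary bytes are rejected before any parsing is attempted.
assert is_valid_zip(b"this is not a zip file") is False

# An empty-but-valid archive is accepted.
buffer = BytesIO()
with zipfile.ZipFile(buffer, mode="w"):
    pass
assert is_valid_zip(buffer.getvalue()) is True
```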
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)
1900-1916: Documentation could be more detailed, wdyt? The ZipfileDecoder component looks good, but the documentation could be enhanced to help users better understand its capabilities. Consider adding:
- How multiple files within a zip are handled
- Example usage showing parser configuration
- What happens if no parser is specified (default behavior)
````diff
 ZipfileDecoder:
   title: Zipfile Decoder
-  description: Decoder for response data that is returned as zipfile(s).
+  description: |
+    Decoder for response data that is returned as zipfile(s). When multiple files are present in the zip,
+    they are processed sequentially. If no parser is specified, JsonParser is used by default.
+
+    Example usage:
+    ```yaml
+    decoder:
+      type: ZipfileDecoder
+      parser:
+        type: JsonParser
+    ```
````
1917-1927: Would adding error handling details be helpful? The JsonParser component looks good for basic use cases, but users might benefit from understanding error scenarios. Consider enhancing the description with:
- How parsing errors are handled
- Expected input/output format examples
```diff
 JsonParser:
   title: JsonParser
-  description: Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+  description: |
+    Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
+    Raises JsonParseError if the input cannot be parsed as valid JSON.
+
+    Example input/output:
+      Input: '{"key": "value"}'
+      Output: {"key": "value"}
+
+      Input: b'{"key": "value"}'
+      Output: {"key": "value"}
```
1928-1949: Should we use a more practical example? The CustomParser component looks good, but the example might be too whimsical. Consider:
- Using a real-world example that demonstrates practical usage
- Adding information about the base Parser class requirements
```diff
 CustomParser:
   title: Custom Parser
-  description: Use this to implement custom parser logic.
+  description: |
+    Use this to implement custom parser logic by extending the base Parser class.
+    The custom implementation must override the parse method with signature:
+      def parse(self, response: Union[str, bytes, bytearray]) -> Dict[str, Any]
   type: object
   additionalProperties: true
   required:
     - type
     - class_name
   properties:
     class_name:
       examples:
-        - "source_rivendell.components.ElvishParser"
+        - "source_amplitude.components.GzipJsonParser"
+        - "source_salesforce.components.CsvParser"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
- airbyte_cdk/sources/declarative/decoders/__init__.py (1 hunks)
- airbyte_cdk/sources/declarative/decoders/parsers/__init__.py (1 hunks)
- airbyte_cdk/sources/declarative/decoders/parsers/parsers.py (1 hunks)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (6 hunks)
- unit_tests/sources/declarative/decoders/parsers/__init__.py (1 hunks)
- unit_tests/sources/declarative/decoders/parsers/test_parsers.py (1 hunks)
- unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- unit_tests/sources/declarative/decoders/parsers/__init__.py
- airbyte_cdk/sources/declarative/decoders/parsers/__init__.py
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/decoders/__init__.py (1)
10-10: ZipfileDecoder added to exports correctly! The inclusion of ZipfileDecoder in the module imports and __all__ list looks great. Nice work!

Also applies to: 12-12
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)
744-750: LGTM! Clean implementation of JsonParser. The JsonParser class follows the established pattern and correctly allows extra fields through Config.extra.
1492-1502: LGTM! Well-structured ZipfileDecoder implementation. The ZipfileDecoder class is well-designed with:
- Proper inheritance from BaseModel
- Flexible parser configuration through Union type
- Clear field descriptions
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
69-69: LGTM! Clean import additions. The new imports are properly organized and follow the existing import structure.

Also applies to: 71-71
481-481: LGTM! Well-organized constructor mapping. The new entries in PYDANTIC_MODEL_TO_CONSTRUCTOR are properly placed and maintain alphabetical ordering.

Also applies to: 517-517
1695-1704: LGTM! Well-implemented ZipfileDecoder factory method. The create_zipfile_decoder method follows the established pattern and properly handles the optional parser configuration.
1705-1708: LGTM! Clean JsonParser factory method. The create_json_parser method is simple and follows the established pattern for basic components.
…e `CompositeRawDecoder`
Actionable comments posted: 2
🧹 Nitpick comments (10)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)
30-39: Consider adding context manager for zip file handling. The zip file should be properly closed after use. Wdyt about wrapping it in a context manager?

```diff
-        try:
-            zip_file = zipfile.ZipFile(io.BytesIO(response.content))
-        except zipfile.BadZipFile as e:
-            logger.exception(e)
-            logger.error(
-                f"Received an invalid zip file in response to URL: {response.request.url}. "
-                f"The size of the response body is: {len(response.content)}"
-            )
-            yield {}
+        with io.BytesIO(response.content) as content_buffer:
+            try:
+                with zipfile.ZipFile(content_buffer) as zip_file:
+                    try:
+                        for filename in zip_file.namelist():
+                            with zip_file.open(filename) as file:
+                                buffered_file = io.BytesIO(file.read())
+                                yield from self.parser.parse(buffered_file)
+                    except Exception as e:
+                        logger.exception(f"Error processing zip file contents: {e}")
+                        yield {}
+            except zipfile.BadZipFile as e:
+                logger.exception(e)
+                logger.error(
+                    f"Received an invalid zip file in response to URL: {response.request.url}. "
+                    f"The size of the response body is: {len(response.content)}"
+                )
+                yield {}
```
22-22: Add type annotation for parser. The parser attribute should have a type annotation to improve code clarity and help with static type checking. Wdyt?

```diff
-    parser: Parser
+    parser: Parser | None = None
```

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2580-2584: Consider adding ZipfileDecoder to error message. The error message for unsupported decoders should be updated to include ZipfileDecoder in the list of supported decoders. Wdyt?

```diff
 _UNSUPPORTED_DECODER_ERROR = (
     "Specified decoder of {decoder_type} is not supported for pagination."
-    "Please set as `JsonDecoder`, `XmlDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
+    "Please set as `JsonDecoder`, `XmlDecoder`, `ZipfileDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
     "If using `GzipParser`, please ensure that the lowest level inner_parser is a `JsonParser`."
 )
```

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2925-2938: LGTM! Consider adding validation options? The JsonParser implementation looks good. What do you think about adding options for handling invalid JSON (e.g., strict_mode, allow_nan)? This could help users customize the parsing behavior, wdyt?

docs/RELEASES.md (2)
52-52: Consider making the TODO more actionable? The TODO comment could be more specific about what alternatives should be explored. Maybe add acceptance criteria or potential approaches to consider? For example:
- API endpoint to query version info
- CLI command to display current version
- Web UI indicator
wdyt?
60-61: Minor grammar suggestion. Consider changing "publish pipeline" to "publishing pipeline" in both sections for better grammar, wdyt?

```diff
-Once the publish pipeline has completed
+Once the publishing pipeline has completed
```

Also applies to: 68-70
🧰 Tools
🪛 LanguageTool
[grammar] ~60-~60: Did you mean the noun “publishing”?
Context: ...ting Manifest-Only connectors Once the publish pipeline has completed, choose a connec...(PREPOSITION_VERB)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (4)
52-69: Consider improving memory efficiency for large responses. The current implementation reads the entire response into memory with data.read(). For large JSON responses, this could lead to high memory usage. What do you think about streaming the response instead?

Here's a possible approach:

```diff
-        raw_data = data.read()
-        body_json = self._parse_orjson(raw_data) or self._parse_json(raw_data)
+        # Stream in chunks of 64KB
+        chunks = []
+        while True:
+            chunk = data.read(65536)
+            if not chunk:
+                break
+            chunks.append(chunk)
+        raw_data = b''.join(chunks)
```

wdyt? 🤔
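The chunked-read loop in the suggestion can be exercised on its own. Note that once the chunks are joined, the full payload is in memory anyway, so this mainly bounds the size of individual read calls. A small runnable sketch — the `read_in_chunks` helper is hypothetical, not CDK code:

```python
from io import BytesIO


def read_in_chunks(stream, chunk_size: int = 65536) -> bytes:
    """Accumulate fixed-size chunks instead of a single unbounded read()."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


payload = b"x" * 200_000  # larger than one chunk
assert read_in_chunks(BytesIO(payload)) == payload
```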
59-64: How about including the actual parsing error? The error message could be more helpful if it included the actual parsing error from both attempts. Would you consider something like this?

```diff
 if body_json is None:
     raise AirbyteTracedException(
-        message="Response JSON data failed to be parsed. See logs for more information.",
-        internal_message=f"Response JSON data failed to be parsed.",
+        message="Failed to parse JSON data using both orjson and json libraries. See logs for details.",
+        internal_message=f"Failed to parse JSON data. Last error: {exc}",
         failure_type=FailureType.system_error,
     )
```
71-85: DRY: Consider consolidating the encoding logic. Both parsing methods decode with the same encoding. Maybe we could do this once? 🎯

```diff
+    def _decode_raw_data(self, raw_data: bytes) -> str:
+        return raw_data.decode(self.encoding)
+
     def _parse_orjson(self, raw_data: bytes) -> Optional[Any]:
         try:
-            return orjson.loads(raw_data.decode(self.encoding))
+            return orjson.loads(self._decode_raw_data(raw_data))
```
66-69: Simplify the generator pattern. The current implementation has a conditional yield. We could make it more straightforward, wdyt?

```diff
-        if isinstance(body_json, list):
-            yield from body_json
-        else:
-            yield from [body_json]
+        yield from [body_json] if not isinstance(body_json, list) else body_json
```
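For what it's worth, the two forms behave identically; a quick stdlib-only check (the `emit_*` helpers are illustrative stand-ins for the decoder's yield logic, not actual CDK functions):

```python
def emit_branchy(body_json):
    # Original pattern: conditional yield-from.
    if isinstance(body_json, list):
        yield from body_json
    else:
        yield from [body_json]


def emit_compact(body_json):
    # Suggested one-liner: wrap non-lists, then yield from the result.
    yield from [body_json] if not isinstance(body_json, list) else body_json


# Lists are flattened element by element; single objects are wrapped.
assert list(emit_branchy([1, 2])) == list(emit_compact([1, 2])) == [1, 2]
assert list(emit_branchy({"k": "v"})) == list(emit_compact({"k": "v"})) == [{"k": "v"}]
```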
📜 Review details
📒 Files selected for processing (9)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (4 hunks)
- airbyte_cdk/sources/declarative/decoders/__init__.py (2 hunks)
- airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (12 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (11 hunks)
- airbyte_cdk/sources/declarative/requesters/README.md (2 hunks)
- docs/RELEASES.md (2 hunks)
- unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- airbyte_cdk/sources/declarative/requesters/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/declarative/decoders/__init__.py
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
[error] 42-42: Argument 1 to "BufferedReader" has incompatible type "IO[bytes]"; expected "RawIOBase"
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 1797-1797: "ZipfileDecoder" has no attribute "inner_parser"
[error] 1798-1798: Unexpected keyword arguments "parameters" and "inner_parser" for "ZipfileDecoder"
🪛 LanguageTool
docs/RELEASES.md
[grammar] ~60-~60: Did you mean the noun “publishing”?
Context: ...ting Manifest-Only connectors Once the publish pipeline has completed, choose a connec...
(PREPOSITION_VERB)
[grammar] ~68-~68: Did you mean the noun “publishing”?
Context: ...ng Low-Code Python connectors Once the publish pipeline has completed, set the version...
(PREPOSITION_VERB)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Analyze (python)
🔇 Additional comments (3)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
125-153: Well-structured test cases! Great job on the comprehensive test coverage with parameterized tests for various JSON structures.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2045-2062: LGTM! Well-structured component definition for handling zip files. The ZipfileDecoder component is well-defined with clear requirements and appropriate parser options.

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)
10-15: Nice choice of dependencies! 🚀 Using orjson as the primary JSON parser is a great performance optimization. The error handling imports look well-thought-out too.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1643-1653: Consider enhancing the documentation? The implementation looks good, but what do you think about adding more details in the description field? For example:
- Explain what happens if the zip file contains multiple files
- Clarify if there are any filename pattern requirements
- Document any size limitations
Here's a suggested diff:
```diff
 type: Literal["ZipfileDecoder"]
 parser: Union[GzipParser, JsonParser] = Field(
     ...,
-    description="Parser to parse the decompressed data from the zipfile(s).",
+    description="Parser to parse the decompressed data from the zipfile(s). If the zip contains multiple files, all files will be processed. The parser will be applied to each file's content. Files are processed in alphabetical order.",
     title="Parser",
 )
```

wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1761-1764: Consider adding error handling to factory methods? The implementations look good, but what do you think about adding some defensive programming? For example:
- Validate that the encoding is supported in create_json_parser
- Check that the parser is compatible in create_zipfile_decoder
Here's a suggested diff:
```diff
 def create_json_parser(model: JsonParserModel, config: Config, **kwargs: Any) -> JsonParser:
     encoding = model.encoding if model.encoding else "utf-8"
+    try:
+        "test".encode(encoding)
+    except LookupError:
+        raise ValueError(f"Unsupported encoding: {encoding}")
     return JsonParser(encoding=encoding)

 def create_zipfile_decoder(
     self, model: ZipfileDecoderModel, config: Config, **kwargs: Any
 ) -> ZipfileDecoder:
     parser = self._create_component_from_model(model=model.parser, config=config)
+    if not isinstance(parser, (GzipParser, JsonParser)):
+        raise ValueError(f"Unsupported parser type: {type(parser)}. Must be GzipParser or JsonParser")
     return ZipfileDecoder(parser=parser)
```

wdyt?
Also applies to: 1794-1798
📜 Review details
📒 Files selected for processing (3)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1204-1210: LGTM! Clean and focused implementation. The JsonParser model is well-structured with a sensible default encoding. The extra configuration allowance provides good flexibility.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2580-2600: Great refactoring of pagination decoder support! 👏 The extraction of the error message and addition of helper methods significantly improves code maintainability and readability. The recursive parser support check is particularly elegant.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1660-1670: LGTM! The ZipfileDecoder class looks well-designed. The implementation follows the established patterns and provides good flexibility through the configurable parser field. The docstring could be enhanced though - what do you think about adding more details about the purpose and usage? wdyt?

Consider adding a docstring like:

```diff
 class ZipfileDecoder(BaseModel):
+    """Decoder for handling zip file responses.
+
+    This decoder extracts and parses data from zip files using a configurable parser
+    (either GzipParser or JsonParser) to process the decompressed content.
+    """
     class Config:
         extra = Extra.allow

     type: Literal["ZipfileDecoder"]
     parser: Union[GzipParser, JsonParser] = Field(
         ...,
         description="Parser to parse the decompressed data from the zipfile(s).",
         title="Parser",
     )
```
📜 Review details
📒 Files selected for processing (3)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
69-69: LGTM! Clean import addition. The import is properly placed with other decoder imports.
578-578: LGTM! Proper mapping registration. The ZipfileDecoderModel is correctly mapped to its factory method.
1804-1809: LGTM! The factory method looks good. The implementation correctly creates the parser and instantiates the decoder. The fix from the previous pipeline failure has been properly applied.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2071-2088: Consider enhancing the component description? The description could be more detailed to help users understand:
- When to use this decoder
- What types of zip files are supported (single file, multi-file archives)
- Whether it handles nested zip files
- Any size limitations
What do you think about expanding it? Here's a suggestion:
```diff
-  description: Decoder for response data that is returned as zipfile(s).
+  description: |
+    Decoder for handling API responses containing zip file(s). This decoder:
+    - Supports both single and multi-file zip archives
+    - Automatically extracts and processes compressed files
+    - Delegates the parsing of extracted content to a configured parser component
```
📜 Review details
📒 Files selected for processing (2)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
2917-2917: LGTM! Consistent decoder references. The ZipfileDecoder references are correctly added to all relevant decoder lists in SimpleRetriever and AsyncRetriever components. The placement and indentation are consistent with other decoder references.

Also applies to: 3116-3117, 3128-3129
2083-2088: Verify supported parser types. The schema currently supports only GzipParser and JsonParser. Should we also support other parsers like JsonLineParser, CsvParser, etc., for more flexibility? This would align with the parser options available in CompositeRawDecoder.

Let me check the codebase for any existing implementations that might benefit from additional parser support:
✅ Verification successful
Current parser support is appropriate for the use case
The ZipfileDecoder is specifically designed for the declarative framework and works with parsers from the composite_raw_decoder module. The file-based parsers (CsvParser, JsonLineParser, etc.) belong to a different framework with different interfaces and purposes. The current GzipParser and JsonParser support aligns well with the decoder's intended use case.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Search for files that might use ZipfileDecoder to understand parser requirements
rg -l "ZipfileDecoder|zip_file" --type python
```

Length of output: 78

Script:

```shell
#!/bin/bash
# Search for ZipfileDecoder and zip_file without file type filter
rg "ZipfileDecoder|zip_file"

# Search for parser-related files
fd "parser|decoder" --type f

# Search for parser class definitions
ast-grep --pattern 'class $_Parser'
```

Length of output: 5030

Script:

```shell
#!/bin/bash
# Check ZipfileDecoder implementation
cat airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py

# Check parser interface requirements
cat airbyte_cdk/sources/declarative/decoders/decoder.py
```

Length of output: 2553
Actionable comments posted: 1
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)
28-30: Add docstring for better clarity? The implementation is correct, but would it be helpful to add a docstring explaining why this decoder doesn't support streaming? Wdyt?

```diff
 def is_stream_response(self) -> bool:
+    """
+    Indicates whether this decoder supports streaming responses.
+    Returns False as ZIP files need to be processed in their entirety.
+    """
     return False
```
34-46: Consider using context manager for zip file? The error handling looks good, but would it be cleaner to use a context manager for the zip file? This ensures proper cleanup. Wdyt?

```diff
-        try:
-            zip_file = zipfile.ZipFile(BytesIO(response.content))
-        except zipfile.BadZipFile as e:
+        try:
+            with zipfile.ZipFile(BytesIO(response.content)) as zip_file:
+                for file_name in zip_file.namelist():
+                    unzipped_content = zip_file.read(file_name)
+                    buffered_content = BytesIO(unzipped_content)
+                    yield from self.parser.parse(buffered_content)
+        except zipfile.BadZipFile as e:
```
📜 Review details
📒 Files selected for processing (2)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (1 hunks)
- unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- unit_tests/sources/declarative/decoders/test_zipfile_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py (2)
1-22: LGTM! Well-organized imports. The imports are logically grouped and all necessary dependencies are included.
24-27: LGTM! Clean class definition. The dataclass implementation with flexible parser configuration looks good.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1660-1670: The implementation looks good but should we restrict extra fields? The ZipfileDecoder implementation looks clean and well-structured. However, I noticed that Config allows extra fields. While this provides flexibility, it could potentially lead to unexpected behavior if unknown fields are passed. Would you consider removing the extra = Extra.allow to make the component more strict and predictable? wdyt?

```diff
 class ZipfileDecoder(BaseModel):
-    class Config:
-        extra = Extra.allow
+    class Config:
+        extra = Extra.forbid

     type: Literal["ZipfileDecoder"]
     parser: Union[GzipParser, JsonParser, JsonLineParser, CsvParser] = Field(
         ...,
         description="Parser to parse the decompressed data from the zipfile(s).",
         title="Parser",
     )
```

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2071-2090: Should we enhance the documentation with examples? The schema definition for ZipfileDecoder looks good, but would it be helpful to add some examples showing how to configure it with different parser types? This could help users understand how to properly configure the component. wdyt?

For instance, we could add examples like:

```yaml
examples:
  - type: ZipfileDecoder
    parser:
      type: JsonParser
  - type: ZipfileDecoder
    parser:
      type: CsvParser
      delimiter: ","
```
📜 Review details
📒 Files selected for processing (2)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)
1877-1877: LGTM! Clean integration with SessionTokenAuthenticator. The ZipfileDecoder is properly integrated into the decoder field options.
Line range hint 2919-2919: LGTM! Clean integration with SimpleRetriever. The ZipfileDecoder is properly integrated into the decoder field options.
Line range hint 3118-3119: LGTM! Clean integration with AsyncRetriever. The ZipfileDecoder is properly integrated into both decoder and download_decoder field options.

Also applies to: 3130-3131
Just a couple of comments, but overall this seems fine. Poke me on Slack once those are addressed.
…d test for multifile zip responses
Actionable comments posted: 0
🧹 Nitpick comments (2)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (2)
16-20: Consider encoding the data and using consistent compression settings.

The function writes raw data to the zip file. Should we encode it like in `create_multi_zip_from_dict`? Also, might be good to use `ZIP_DEFLATED` here for consistency, wdyt?

```diff
 def create_zip_from_dict(data: Union[dict, list]) -> bytes:
     zip_buffer = BytesIO()
-    with zipfile.ZipFile(zip_buffer, mode="w") as zip_file:
-        zip_file.writestr("data.json", data)
+    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
+        file_content = json.dumps(data).encode("utf-8")
+        zip_file.writestr("data.json", file_content)
     return zip_buffer.getvalue()
```
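With the suggested diff applied, the helper is runnable as-is. A quick roundtrip sketch (outside the PR, just to sanity-check the suggestion):

```python
import json
import zipfile
from io import BytesIO
from typing import Union


def create_zip_from_dict(data: Union[dict, list]) -> bytes:
    """Zip a JSON-serializable payload into a single-entry archive (suggested fix applied)."""
    zip_buffer = BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zip_file:
        # Encode to bytes explicitly so dicts/lists round-trip cleanly.
        file_content = json.dumps(data).encode("utf-8")
        zip_file.writestr("data.json", file_content)
    return zip_buffer.getvalue()
```

Reading the archive back with `zipfile.ZipFile(BytesIO(...))` and `json.loads` recovers the original payload, which confirms the encoding fix works end to end.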
56-68: Consider enhancing multi-file test coverage.

The test verifies basic multi-file functionality, but could we add:
- Test for consistent file order processing?
- Test with mixed content types (some gzipped, some not)?
- Descriptive messages for assertions?
Also, should we verify the behavior when some files in the zip are invalid/corrupted, wdyt?
```diff
 def test_zipfile_decoder_with_multi_file_response(requests_mock):
     data_to_zip = [{"key1": "value1"}, {"key2": "value2"}, {"key3": "value3"}]
     mocked_response = create_multi_zip_from_dict(data_to_zip)
     decoder = ZipfileDecoder(parser=JsonParser())
     requests_mock.register_uri("GET", "https://airbyte.io/", content=mocked_response)
     response = requests.get("https://airbyte.io/")
     results = list(decoder.decode(response))
-    assert len(results) == 3
+    assert len(results) == 3, "Expected exactly 3 files in the zip"
     for i, actual in enumerate(results):
-        assert actual == data_to_zip[i]
+        assert actual == data_to_zip[i], f"File {i} content does not match expected data"
```
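For the mixed-content suggestion above, a fixture helper could look like this (a sketch only — `create_mixed_zip` is hypothetical and not part of the PR; it gzips every other entry to exercise both paths):

```python
import gzip
import json
import zipfile
from io import BytesIO
from typing import List


def create_mixed_zip(entries: List[dict]) -> bytes:
    """Build a zip where even-indexed entries are gzipped JSON and odd-indexed are plain JSON."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, entry in enumerate(entries):
            payload = json.dumps(entry).encode("utf-8")
            if i % 2 == 0:
                # Gzip inside the archive to simulate mixed content types.
                zf.writestr(f"data_{i}.json.gz", gzip.compress(payload))
            else:
                zf.writestr(f"data_{i}.json", payload)
    return buffer.getvalue()
```

A test could then assert that the decoder either handles or cleanly rejects the gzipped entries, depending on the intended contract.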
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py` (1 hunk)
- `unit_tests/sources/declarative/decoders/test_zipfile_decoder.py` (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/decoders/zipfile_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/decoders/test_zipfile_decoder.py (1)
33-54: Consider adding error handling test cases.

The test covers happy paths well, but what about error scenarios? Maybe we could add test cases for:
- Malformed zip files
- Invalid JSON content
- Empty zip files
- Nested directories in zip
This aligns with the previous feedback about testing edge cases, wdyt?
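As a starting point for the malformed-zip cases, the stdlib already gives a clear failure signal to build tests around (a sketch under the assumption that the decoder surfaces `zipfile.BadZipFile` in some form — the exact failure mode would need checking against the implementation):

```python
import zipfile
from io import BytesIO

# Hypothetical fixtures for the error-path tests suggested above.
MALFORMED_PAYLOADS = [
    b"not a zip at all",     # no zip structure whatsoever
    b"PK\x03\x04truncated",  # zip magic bytes but no central directory
    b"",                     # empty response body
]


def is_bad_zip(content: bytes) -> bool:
    """Return True when `content` cannot be opened as a zip archive."""
    try:
        zipfile.ZipFile(BytesIO(content))
        return False
    except zipfile.BadZipFile:
        return True
```

All three payloads raise `zipfile.BadZipFile`, while a well-formed but empty archive opens fine, so "empty zip file" needs its own test case distinct from "malformed bytes".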
LGTM, thanks for addressing those comments!
What

Adds a `ZipfileDecoder` component to decode responses containing compressed zip files (e.g. the `source-amplitude` events stream).

How

Adds the `ZipfileDecoder` class and component. Enables connector devs to specify the underlying data parser component (or defaults to the `JsonParser` component).

Recommended Reviewing Order
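As a rough mental model of what the component does (a sketch only — the real CDK `ZipfileDecoder` delegates per-file parsing to the configurable parser component, while this standalone version hardcodes JSON):

```python
import json
import zipfile
from io import BytesIO
from typing import Any, Generator


def decode_zip_response(content: bytes) -> Generator[Any, None, None]:
    """Open a zipped response body and yield the parsed contents of each archived file."""
    with zipfile.ZipFile(BytesIO(content)) as archive:
        for name in archive.namelist():
            # The actual component would call the configured Parser here
            # (JsonParser by default) instead of json.loads directly.
            yield json.loads(archive.read(name))
```

Yielding per archived file is what makes the multi-file zip case (tested below) fall out naturally: each entry becomes one parsed result.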
Summary by CodeRabbit

New Features
- New `ZipfileDecoder` to handle compressed data formats.
- `ModelToComponentFactory` with a method to create `ZipfileDecoder` components.
- `SessionTokenAuthenticator` updated to integrate the new decoder.

Bug Fixes

Tests
- Unit tests for `ZipfileDecoder` functionality.