File is overwritten with each sink operation #31
Comments
@pedro-muniz -- this sounds great, can you make this change (or make it configurable) and submit a PR?
The smart_open plugin does not support append mode. I'll try to work on this. https://github.com/crowemi/target-s3/blob/main/target_s3/formats/format_base.py#L10
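For illustration, a minimal sketch of the limitation, assuming smart_open is the I/O layer here (the bucket, key, and payload below are made up):

```python
# Illustrative only: smart_open's S3 backend is used with a write mode, and
# because S3 objects cannot be appended to (as noted above), an "ab" mode
# cannot simply be swapped in. Each open in "wb" replaces the whole object.
from smart_open import open as smart_open_open

with smart_open_open("s3://example-bucket/example-prefix/records.jsonl", "wb") as fh:
    fh.write(b'{"id": 1}\n')
```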
Bumped into the very same behaviour with Parquet files, which have a different implementation. A workaround could be to keep the writer open and keep adding, e.g.:
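```python
# Rough sketch, assuming pyarrow backs the Parquet format; write_batch() and
# close_writer() are illustrative helpers, not the actual target-s3 code.
# Keeping the ParquetWriter open means each write_table() call appends a new
# row group to the same file instead of replacing it.
import pyarrow as pa
import pyarrow.parquet as pq

_writer = None

def write_batch(records: list[dict], path: str) -> None:
    global _writer
    table = pa.Table.from_pylist(records)
    if _writer is None:
        _writer = pq.ParquetWriter(path, table.schema)
    _writer.write_table(table)  # appends a row group; earlier batches are kept

def close_writer() -> None:
    global _writer
    if _writer is not None:
        _writer.close()
        _writer = None
```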
However, each set of records may have a different schema, and I'm not sure how to overcome this.
I managed to resolve the issue I described above by extending the batch size and age; please see #32 for details. In my case, I use hourly batching and a minute grain for the filename. This combination solves the overwriting problem for me.
I think those are useful parameters to have control over, but they don't resolve this issue. For example, setting the granularity to microseconds already gives us a workaround for most cases without these variables. S3 objects don't support an append operation, so the best solution is to create a new file for each sink operation, and IMHO that behaviour shouldn't hinge on a particular parameter combination. What do you think?
Another thing to add here: when dealing with a large number of files, S3 I/O has shown very good performance at an object size of ~100 MB. That size gives a good blend of I/O latency and record count, so rolling to a new file based on size would also be a good option to add (sketched below).
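To make the idea concrete, a rough sketch of size-based rolling; `RollingBuffer`, `TARGET_OBJECT_BYTES`, and `upload` are hypothetical names, not part of target-s3:

```python
# Hypothetical sketch: accumulate serialized records and roll to a new S3
# object once the buffered payload reaches roughly the target size.
TARGET_OBJECT_BYTES = 100 * 1024 * 1024  # ~100 MB per object

class RollingBuffer:
    def __init__(self, upload):
        self.upload = upload            # callable(bytes) -> None, e.g. an S3 put
        self.chunks: list[bytes] = []
        self.size = 0

    def add(self, payload: bytes) -> None:
        self.chunks.append(payload)
        self.size += len(payload)
        if self.size >= TARGET_OBJECT_BYTES:
            self.flush()

    def flush(self) -> None:
        if self.chunks:
            self.upload(b"".join(self.chunks))  # each flush becomes a new object
            self.chunks.clear()
            self.size = 0
```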
Indeed, for Parquet files AWS recommends ~250 MB per file. However, I didn't see any built-in mechanism in Meltano to flush based on byte size. Moreover, with compression enabled, estimating output size against a byte-size limit is just as hard as it is with a row-count limit.
When using target-s3 with more than 10k records, we've noticed that the only way to achieve the desired outcome is to set "append_date_to_prefix_grain": "microsecond". This is because each sink operation overwrites the generated file instead of appending to it, so if the tap sends more than 10k records to the target within a short time frame, some data may be lost during the file write. Could we consider changing the write mode from "w" to "a"? Is there a particular reason for using write mode? If so, it might be necessary to create a new file for each sink operation to prevent data loss.
For comparison, target-s3-parquet uses append mode to write data:
https://github.com/gupy-io/target-s3-parquet/blob/main/target_s3_parquet/sinks.py#L83C15-L83C25
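To illustrate the "new file per sink operation" idea, a hedged sketch (not the actual target-s3 code; `build_object_key` and its uuid suffix are illustrative choices):

```python
# Illustrative only: since S3 objects cannot be appended to, every flush writes
# to a unique key so earlier files are never overwritten.
import uuid
from datetime import datetime, timezone

def build_object_key(prefix: str, stream_name: str, extension: str) -> str:
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")  # microsecond grain
    return f"{prefix}/{stream_name}/{ts}-{uuid.uuid4().hex[:8]}.{extension}"

# e.g. build_object_key("raw/meltano", "orders", "parquet")
# -> "raw/meltano/orders/20240101T120000123456-1a2b3c4d.parquet"
```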