Add raw data filter regex and task #294

Open · wants to merge 5 commits into base: feature/automate-to-cog-transformation
Conversation

paridhi-parajuli (Contributor) commented Feb 5, 2025

We need an additional task to filter the discovered files for cases where we need to select specific subfolders or files within the provided prefix.

Changes:

  • Added a raw data filter regex to the config
  • Added raw data filtering based on the regex
  • Made the XCom pass the S3 location of the discovered files rather than the list itself
  • Increased the runs per DAG to 20 for process_files

Example scenario:

{
    "collection_name": "",
    "data_acquisition_method": "s3",
    "data_prefix": "",
    "dest_data_bucket": "ghgc-data-store-develop",
    "ext": ".nc",
    "nodata": -9999,
    "plugins_uri": "",
    "raw_data_bucket": "noaa-goes16",
    "raw_data_prefix": "ABI-L1b-RadF/2024",
    "raw_data_filter_regex": "(178|179)/00/.*MC02.*\\.nc$"
}


Say we want the file hierarchy:

2024
├── 178
│   ├── 00
│   │   ├── file1_MC02.nc
│   │   ├── file2_MC08.nc
│   │   └── file3_MC02.nc
│   ├── 01
│   │   ├── file4_MC08.nc
│   │   └── file5_MC02.nc
├── 179
│   ├── 00
│   │   ├── file8_MC02.nc
│   │   └── file9_MC08.nc
│   ├── 01
│   │   ├── file10_MC08.nc
│   │   └── file11_MC02.nc
└── 180
    ├── 00
    │   ├── file14_MC02.nc
    │   └── file15_MC08.nc
    ├── 01
    │   ├── file16_MC08.nc
    │   └── file17_MC02.nc
    

We want the subfolders /178 and /179, within those just /00, and only the files containing MC02.
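As a minimal sketch of the filtering step described above (the function name and list shape are assumptions for illustration, not the PR's actual code), applying the configured regex to discovered S3 keys selects exactly the files from the hierarchy we want:

```python
import re

# The pattern from the example config; note it is applied to the key
# relative to raw_data_prefix's parent, so "178/00/..." appears in the key.
RAW_DATA_FILTER_REGEX = r"(178|179)/00/.*MC02.*\.nc$"

def filter_discovered_keys(keys, pattern):
    """Keep only the S3 keys matching the configured filter regex."""
    compiled = re.compile(pattern)
    return [key for key in keys if compiled.search(key)]

keys = [
    "ABI-L1b-RadF/2024/178/00/file1_MC02.nc",
    "ABI-L1b-RadF/2024/178/00/file2_MC08.nc",  # wrong channel
    "ABI-L1b-RadF/2024/178/01/file5_MC02.nc",  # wrong hour folder
    "ABI-L1b-RadF/2024/179/00/file8_MC02.nc",
    "ABI-L1b-RadF/2024/180/00/file14_MC02.nc", # wrong day folder
]

print(filter_discovered_keys(keys, RAW_DATA_FILTER_REGEX))
# Keeps only the 178/00 and 179/00 MC02 files.
```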

amarouane-ABDELHAK (Contributor) commented Feb 5, 2025

1. Can we increase the runs per DAG to something like 20? This line: https://github.com/NASA-IMPACT/veda-data-airflow/blob/automated-transformation/add-raw-data-filter/dags/automated_transformation/automation_dag.py#L117
2. Make the XCom pass the S3 location of the discovered files rather than the list itself.
3. Chunk the writes of the files to S3 if there are more than 900.
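The chunking suggested in point 3 can be sketched as a plain list-splitting helper (the helper name and the 900-item threshold mirror the comment; how each chunk is then uploaded to S3 is left out, since that depends on the PR's `write_xcom_to_s3` implementation):

```python
# Hedged sketch: split a discovered-files list into chunks of at most 900
# before writing to S3, per the reviewer's suggestion. `chunk_list` is a
# hypothetical helper, not code from this PR.
CHUNK_SIZE = 900

def chunk_list(items, size=CHUNK_SIZE):
    """Split `items` into consecutive sublists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

files = [f"file_{i}.nc" for i in range(2000)]
chunks = chunk_list(files)
print([len(chunk) for chunk in chunks])  # 2000 files -> chunks of 900, 900, 200
```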
@@ -5,6 +5,9 @@
from airflow.models.param import Param
from airflow.operators.dummy_operator import DummyOperator
from slack_notifications import slack_fail_alert
from airflow.models.variable import Variable
from dags.veda_data_pipeline.utils.xcom_to_s3 import write_xcom_to_s3,read_xcom_from_s3

This import shouldn't work! Try:

from veda_data_pipeline.utils.xcom_to_s3 import write_xcom_to_s3, read_xcom_from_s3
