Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEN-1516, GEN-1555, GEN-1576] make table update cohort specific #158

Merged
merged 43 commits into from
Oct 31, 2024

Conversation

danlu1
Copy link
Contributor

@danlu1 danlu1 commented Oct 24, 2024

Problem:

  1. Currently, the update table step is NOT cohort-specific which means that all the data in the internal tables get overwritten.
  2. Outputs are either saved to prod environment or local. We would like to be able to save them in staging project for testing or validation purpose.
  3. Need Docker image for the feature branch specific for testing
  4. Can't test this data processing step individually in Tower.
  5. None values in the Synapse table has been converted to blank since None is in default na_values list of pandas dataframe.
  6. Non-informative backslash in the input data.

Solution:

  1. Table queries are updated to include cohort filter.
  2. Wipe table line has been updated to wipe cohort lines only.
  3. Staging projects are added and Staging tables are used.
  4. The table_update step has been added in the Nextflow workflow. Nextflow scripts for table_update step can intake docker image as a parameter.
  5. Remove None from the na_values of the asDataFrame() function.
  6. A new column specific remove_backslash function is added to remove non-informative backslashes in the data before pushing to Synapse table.

Testing:

  1. Unit tests for new and modified functions are added
  2. Comparison scrips were run to compare versions of the tables and got expected results.
  3. Scripts run through both in EC2 instance and Tower.

danlu1 and others added 30 commits October 1, 2024 16:51
…ge-Bionetworks/genie-bpc-pipeline into GEN-1516-table_update_cohort_specific

merge upstream changes
Update README file to reflect new parameters in update_data_table.py
CONTRIBUTING.md Show resolved Hide resolved
main.nf Show resolved Hide resolved
nextflow.config Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
…ge-Bionetworks/genie-bpc-pipeline into gen-1516-table_update_cohort_specific

merge upstream changes
@rxu17 rxu17 self-requested a review October 24, 2024 23:56
Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just minor comments

@danlu1 danlu1 changed the title [GEN-1516, GEN-1555] make table update cohort specific [GEN-1516, GEN-1555, GEN-1576] make table update cohort specific Oct 28, 2024
CONTRIBUTING.md Outdated Show resolved Hide resolved
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! Please make sure all of Rixing's comments are addressed!

danlu1 and others added 3 commits October 28, 2024 23:17
Co-authored-by: Thomas Yu <[email protected]>
…ge-Bionetworks/genie-bpc-pipeline into gen-1516-table_update_cohort_specific

merge upstream changes
@danlu1 danlu1 merged commit 86afb66 into develop Oct 31, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants