add function to check join keys
danlu1 committed Dec 12, 2024
2 parents 38592da + ade4e9d commit 2396bbd
Showing 27 changed files with 1,227 additions and 467 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
* @Sage-Bionetworks/genie_admins
76 changes: 76 additions & 0 deletions .github/workflows/build-docker-images.yml
@@ -0,0 +1,76 @@
name: Build and Push Docker Images

on:
push:
branches: [develop, 'GEN*', 'gen*']
paths:
- 'scripts/**'
- '.github/workflows/build-docker-images.yml'
workflow_dispatch:

jobs:
build_references_docker:
runs-on: ubuntu-latest
strategy:
matrix:
module: ["references", "table_updates", "uploads"] # Define the modules you want to loop through for builds
env:
REGISTRY: ghcr.io
IMAGE_NAME: sage-bionetworks/genie-bpc-pipeline
permissions:
contents: read
packages: write

steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 2

- name: Setup Docker buildx
uses: docker/setup-buildx-action@v3

- name: Fetch the default branch (develop) for comparison
run: git fetch origin develop:refs/remotes/origin/develop --depth=1

- name: Check for Changes in scripts/${{ matrix.module }}
id: check_changes
run: |
# Determine the correct DIFF_BASE
if [ "${{ github.ref_name }}" = "develop" ]; then
# On the develop branch, compare with the previous commit (HEAD^)
DIFF_BASE="HEAD^"
else
# On feature branches, compare with origin/develop
if git merge-base --is-ancestor origin/develop HEAD; then
DIFF_BASE="origin/develop"
else
DIFF_BASE=$(git rev-list --max-parents=0 HEAD) # Use the initial commit as fallback
fi
fi
# Compare changes between DIFF_BASE and HEAD
if git diff --name-only $DIFF_BASE -- scripts/${{ matrix.module }} | grep -q .; then
echo "CHANGED=true" >> $GITHUB_ENV
else
echo "CHANGED=false" >> $GITHUB_ENV
fi
- name: Log in to GitHub Container Registry
if: env.CHANGED == 'true'
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Build and Push Docker Image for scripts/${{ matrix.module }}
if: env.CHANGED == 'true'
uses: docker/build-push-action@v5
with:
context: scripts/${{ matrix.module }}
push: true
tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ matrix.module }}-${{ github.ref_name }}
cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ matrix.module }}-${{ github.ref_name }}-cache
cache-to: type=inline,mode=max
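
The change check in the workflow above comes down to asking whether any changed path falls under `scripts/<module>`. Below is a minimal sketch of that filter as a standalone shell function, assuming the list of changed paths is supplied on stdin (for example from `git diff --name-only`). The function name `module_changed` is hypothetical and not part of the repository, and it filters by path prefix rather than by a git pathspec as the workflow does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): succeeds if any path read from
# stdin lies under scripts/<module>/.
module_changed() {
  local module="$1"
  # -q: exit 0 on the first match, print nothing
  grep -q "^scripts/${module}/"
}

# Example with a hand-written list of changed paths
changed_paths=$'scripts/references/update.R\nmain.nf'
if printf '%s\n' "$changed_paths" | module_changed references; then
  echo "CHANGED=true"
else
  echo "CHANGED=false"
fi
```

Running the function against each matrix module locally is one way to predict which images a push would rebuild.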

38 changes: 36 additions & 2 deletions CONTRIBUTING.md
@@ -1,5 +1,28 @@
# genie-bpc-pipeline: Contributing Guidelines

## Getting started
1. [Clone the repository](https://help.github.com/articles/cloning-a-repository/) to your local machine so you can begin making changes.
2. On your local machine make sure you have the latest version of the `develop` branch:

```
git checkout develop
git pull origin develop
```
3. Create a feature branch off the `develop` branch and work on it. The branch should be named after the JIRA issue you are working on, in **lowercase** (e.g., `gen-1234-{feature-here}`). Make the branch name as informative as possible.
```
git checkout develop
git checkout -b gen-1234-{feature-here}
```
4. Once you have made your additions or changes, make sure you write tests and run [comparison scripts](https://github.com/Sage-Bionetworks/Genie_processing/blob/ed806d163fa4063a84920483e8ada21ea4b6cf47/README.md#comparisons-between-two-synapse-entities) to ensure changes are expected.
5. At this point, you have only created the branch locally. You need to push it to your fork on GitHub.
```
git add your file
git commit -m "your commit information"
git push --set-upstream origin gen-1234-{feature-here}
```
6. Create a pull request from the feature branch to the `develop` branch. When changes are made to `scripts/<module_folder_name>` (only applies to `scripts/references` and `scripts/table_updates` for now), a GitHub Action is triggered to build a Docker image for the branch. You can check it [here](https://github.com/Sage-Bionetworks/genie-bpc-pipeline/pkgs/container/genie-bpc-pipeline).
## Nextflow Pipeline contribution
Here is how to contribute to the nextflow workflow of the genie-bpc-pipeline
@@ -8,18 +31,29 @@ Here is how to contribute to the nextflow workflow of the genie-bpc-pipeline
If you wish to contribute a new step, please use the following guidelines:
1. Add a new process step as a module under `genie-bpc-pipeline/modules`
2. Write the process step code
1. Add a new process step as a nextflow module under `genie-bpc-pipeline/modules`
2. Write the process step code and add it to the appropriate module folder under `scripts/<module_folder_name>`
3. Add the process step to the workflow section in `main.nf` as a step
4. Add to any pre-existing process steps that need this step as an input and vice versa
5. Add any new parameters to `nextflow_schema.json` with help text.
6. Add any new parameter's default values to the set parameter default values section in `main.nf`.
7. Add any additional validation for all relevant parameters. See validation section in `main.nf`
### Adding a new process module
We have automated Docker builds to GHCR whenever there are changes to the scripts within a "module", as each module has its own image. Whenever a new module is added, the GitHub workflow `.github/workflows/build-docker-images.yml` should be updated.
1. Under `jobs`, add your module name to `matrix:`
1. Once you push your changes, your Docker image will be built and tagged in the form `<registry>/<repo>:<folder_name>-<branch>` (Example: `ghcr.io/sage-bionetworks/genie-bpc-pipeline:references-gen-1485-update-potential-phi`)
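
The tag format can be sketched as a small shell function. This mirrors the `tags:` expression in `.github/workflows/build-docker-images.yml`, where `REGISTRY` is `ghcr.io` and `IMAGE_NAME` is `sage-bionetworks/genie-bpc-pipeline`. The function name `image_tag` is hypothetical and only illustrates how the pieces combine.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository) composing an image reference as
#   <registry>/<image_name>:<module>-<branch>
image_tag() {
  local registry="$1" image_name="$2" module="$3" branch="$4"
  printf '%s/%s:%s-%s\n' "$registry" "$image_name" "$module" "$branch"
}

image_tag ghcr.io sage-bionetworks/genie-bpc-pipeline references gen-1485-update-potential-phi
# prints: ghcr.io/sage-bionetworks/genie-bpc-pipeline:references-gen-1485-update-potential-phi
```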
### Default values
Parameters should be initialized / defined with default values in the set parameter default values section in `main.nf`
### Default processes resource requirements
Defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `nextflow.config`.
### Testing
1. To test locally, you can’t reference `ghcr.io/sage-bionetworks/docker_image_name` directly. You will need to pull the Docker image to your local machine and test with it.
2. To test on Seqera, go to the test_bpc_pipeline, edit the pipeline by pointing it to your feature branch, then update. Doing this will allow you to select parameters from the dropdown menu directly.
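
A local test run might look like the following dry-run sketch, which only prints the commands rather than executing them. It assumes Docker is enabled in your Nextflow configuration, that you are logged in to GHCR if the image is private, and that the `SYNAPSE_AUTH_TOKEN` secret is already set up. The helper name `print_local_test_commands` and the branch name are hypothetical; `--step` and `--references_docker` correspond to the pipeline parameters `params.step` and `params.references_docker`.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not in the repository): prints, but does not
# execute, the commands for pulling a branch image and running one step with it.
print_local_test_commands() {
  local module="$1" branch="$2" step="$3"
  local image="ghcr.io/sage-bionetworks/genie-bpc-pipeline:${module}-${branch}"
  echo "docker pull ${image}"
  echo "nextflow run main.nf --step ${step} --${module}_docker ${image}"
}

print_local_test_commands references gen-1234-my-feature update_potential_phi_fields_table
```

Review the printed commands before running them, since other required parameters (such as the cohort) may also need to be supplied.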
40 changes: 28 additions & 12 deletions main.nf
@@ -24,6 +24,7 @@ params.comment = 'NSCLC public release update'
params.production = false
params.schema_ignore_params = ""
params.help = false
params.step = "update_potential_phi_fields_table"

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -44,9 +45,12 @@ NfcoreSchema.validateParameters(workflow, params, log)
if (params.cohort == null) { exit 1, 'cohort parameter not specified!' }
if (params.comment == null) { exit 1, 'comment parameter not specified!' }
if (params.production == null) { exit 1, 'production parameter not specified!' }
if (params.step == null) { exit 1, 'step parameter not specified!' }


// Print parameter summary log to screen
log.info NfcoreSchema.paramsSummaryLog(workflow, params)
log.info "Running step: ${params.step}"

// Print message for production mode vs test mode
if (params.production) {
@@ -66,6 +70,7 @@ else {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

include { update_potential_phi_fields_table } from './modules/update_potential_phi_fields_table'
include { run_quac_upload_report_error } from './modules/run_quac_upload_report_error'
include { run_quac_upload_report_warning } from './modules/run_quac_upload_report_warning'
include { merge_and_uncode_rca_uploads } from './modules/merge_and_uncode_rca_uploads'
@@ -77,7 +82,6 @@ include { run_quac_comparison_report } from './modules/run_quac_comparison_repor
include { create_masking_report } from './modules/create_masking_report'
include { update_case_count_table } from './modules/update_case_count_table'
include { run_clinical_release } from './modules/run_clinical_release'

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RUN WORKFLOW
@@ -87,17 +91,29 @@ include { run_clinical_release } from './modules/run_clinical_release'
workflow BPC_PIPELINE {
ch_cohort = Channel.value(params.cohort)
ch_comment = Channel.value(params.comment)

run_quac_upload_report_error(ch_cohort)
run_quac_upload_report_warning(run_quac_upload_report_error.out, ch_cohort, params.production)
merge_and_uncode_rca_uploads(run_quac_upload_report_warning.out, ch_cohort, params.production)
// remove_patients_from_merged(merge_and_uncode_rca_uploads.out, ch_cohort, params.production)
update_data_table(merge_and_uncode_rca_uploads.out, ch_cohort, ch_comment, params.production)
update_date_tracking_table(update_data_table.out, ch_cohort, ch_comment, params.production)
run_quac_table_report(update_date_tracking_table.out, ch_cohort, params.production)
run_quac_comparison_report(run_quac_table_report.out, ch_cohort, params.production)
create_masking_report(run_quac_comparison_report.out, ch_cohort, params.production)
update_case_count_table(create_masking_report.out, ch_comment, params.production)

if (params.step == "update_potential_phi_fields_table") {
update_potential_phi_fields_table(ch_comment, params.production)
// validate_data.out.view()
} else if (params.step == "merge_and_uncode_rca_uploads"){
merge_and_uncode_rca_uploads("default", ch_cohort, ch_comment, params.production)
} else if (params.step == "update_data_table") {
update_data_table("default", ch_cohort, ch_comment, params.production)
} else if (params.step == "genie_bpc_pipeline"){
update_potential_phi_fields_table(ch_comment, params.production)
run_quac_upload_report_error(update_potential_phi_fields_table.out, ch_cohort)
run_quac_upload_report_warning(run_quac_upload_report_error.out, ch_cohort, params.production)
merge_and_uncode_rca_uploads(run_quac_upload_report_warning.out, ch_cohort, ch_comment, params.production)
// remove_patients_from_merged(merge_and_uncode_rca_uploads.out, ch_cohort, params.production)
update_data_table(merge_and_uncode_rca_uploads.out, ch_cohort, ch_comment, params.production)
update_date_tracking_table(update_data_table.out, ch_cohort, ch_comment, params.production)
run_quac_table_report(update_date_tracking_table.out, ch_cohort, params.production)
run_quac_comparison_report(run_quac_table_report.out, ch_cohort, params.production)
create_masking_report(run_quac_comparison_report.out, ch_cohort, params.production)
update_case_count_table(create_masking_report.out, ch_comment, params.production)
} else {
exit 1, 'step not supported'
}
}
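
The step dispatch above accepts four values and exits on anything else. A wrapper script could apply the same check before launching Nextflow. This is a minimal sketch; the function name `is_supported_step` is hypothetical, and the accepted values are copied from the `step` enum in `nextflow_schema.json`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not in the repository) mirroring the values
# that the workflow's step dispatch accepts.
is_supported_step() {
  case "$1" in
    update_potential_phi_fields_table|merge_and_uncode_rca_uploads|update_data_table|genie_bpc_pipeline)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

step="update_data_table"
if is_supported_step "$step"; then
  echo "step ok: $step"
else
  echo "step not supported: $step" >&2
fi
```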

/*
7 changes: 4 additions & 3 deletions modules/merge_and_uncode_rca_uploads.nf
@@ -3,13 +3,14 @@ Merge and uncode REDcap export data files.
*/
process merge_and_uncode_rca_uploads {

container 'sagebionetworks/genie-bpc-pipeline-uploads'
container "$params.uploads_docker"
secret 'SYNAPSE_AUTH_TOKEN'
debug true

input:
val previous
val cohort
val comment
val production

output:
@@ -19,13 +20,13 @@ process merge_and_uncode_rca_uploads {
if (production) {
"""
cd /usr/local/src/myscripts/
Rscript merge_and_uncode_rca_uploads.R -c $cohort -u -v
Rscript merge_and_uncode_rca_uploads.R -c $cohort -v --production --save_synapse --comment $comment
"""
}
else {
"""
cd /usr/local/src/myscripts/
Rscript merge_and_uncode_rca_uploads.R -c $cohort -v
Rscript merge_and_uncode_rca_uploads.R -c $cohort -v --save_synapse --comment $comment
"""
}
}
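
One thing to check when passing `--comment $comment` is how the shell treats a value that contains spaces, such as the default `'NSCLC public release update'`. The sketch below shows plain shell word splitting; `count_args` is a hypothetical stand-in for the `Rscript` call, not part of the pipeline. Whether this affects the pipeline depends on how comment values are supplied and how the R script parses its arguments.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical stand-in for the Rscript call; it reports how
# many arguments the shell actually passed to it.
count_args() {
  echo "$#"
}

comment='NSCLC public release update'

# Unquoted: the shell splits the comment on whitespace into four words.
count_args --comment $comment      # prints 5
# Quoted: the comment stays a single argument.
count_args --comment "$comment"    # prints 2
```

If multi-word comments are expected, quoting the interpolated value in the script block is one possible guard.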
1 change: 1 addition & 0 deletions modules/run_quac_upload_report_error.nf
@@ -9,6 +9,7 @@ process run_quac_upload_report_error {
debug true

input:
val previous
val cohort

output:
4 changes: 2 additions & 2 deletions modules/update_data_table.nf
@@ -2,16 +2,16 @@
Update Synapse tables with merged and uncoded data.
*/
process update_data_table {
container "$params.table_updates_docker"

container 'sagebionetworks/genie-bpc-pipeline-table-updates'
secret 'SYNAPSE_AUTH_TOKEN'
debug true

input:
val previous
val cohort
val production
val comment
val production

output:
stdout
30 changes: 30 additions & 0 deletions modules/update_potential_phi_fields_table.nf
@@ -0,0 +1,30 @@
/*
Updates the potential PHI fields table with any new variables to redact
*/
process update_potential_phi_fields_table {

container "$params.references_docker"
secret 'SYNAPSE_AUTH_TOKEN'
debug true

input:
val comment
val production

output:
stdout

script:
if (production) {
"""
cd /usr/local/src/myscripts/
Rscript update_potential_phi_fields_table.R -c $comment --production
"""
}
else {
"""
cd /usr/local/src/myscripts/
Rscript update_potential_phi_fields_table.R -c $comment
"""
}
}
10 changes: 10 additions & 0 deletions nextflow.config
@@ -13,6 +13,10 @@ manifest {
profiles {
aws_prod {
process {
withName: update_potential_phi_fields_table {
memory = 32.GB
cpus = 8
}
withName: run_workflow_case_selection {
memory = 32.GB
cpus = 8
@@ -46,5 +50,11 @@ profiles {
cpus = 8
}
}
params {
// docker image parameters, see nextflow_schema.json for details
references_docker = "sagebionetworks/genie-bpc-pipeline-references"
uploads_docker = "sagebionetworks/genie-bpc-pipeline-uploads"
table_updates_docker = "sagebionetworks/genie-bpc-pipeline-table-updates"
}
}
}
23 changes: 23 additions & 0 deletions nextflow_schema.json
@@ -44,6 +44,29 @@
false
]
},
"step": {
"type": "string",
"default": "update_potential_phi_fields_table",
"description": "Available BPC steps",
"enum": [
"update_potential_phi_fields_table",
"merge_and_uncode_rca_uploads",
"update_data_table",
"genie_bpc_pipeline"
]
},
"references_docker":{
"type": "string",
"description": "Name of docker to use in processes in scripts/references"
},
"uploads_docker":{
"type": "string",
"description": "Name of docker to use in processes in scripts/uploads"
},
"table_updates_docker":{
"type": "string",
"description": "Name of docker to use in processes in scripts/table_updates"
},
"schema_ignore_params": {
"type": "string",
"description": "Put parameters to ignore for validation here separated by comma",
12 changes: 6 additions & 6 deletions scripts/case_selection/config.yaml
@@ -470,13 +470,13 @@ phase:
# production: 0
# adjusted: 0
PROV:
production: 133
adjusted: 133
production: 115
adjusted: 115
pressure: 5
irr: 0
UCSF:
production: 30
adjusted: 30
production: 38
adjusted: 38
pressure: 5
irr: 0
UHN:
@@ -487,8 +487,8 @@ phase:
# production: 0
# adjusted: 0
VHIO:
production: 70
adjusted: 70
production: 80
adjusted: 80
pressure: 5
irr: 0
#WAKE: