-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove feature components in favor of quantms-rescoring #46
Conversation
Support for DIA-NN 2.0
fixed error and clean codes
# Conflicts: # pyproject.toml
WalkthroughThe changes remove outdated percolator script details from the documentation and eliminate related CLI utilities and tests. The version has been updated from "0.0.18" to "0.0.19" across configuration and initialization files. Significant improvements have been introduced in the Changes
Sequence Diagram(s)sequenceDiagram
participant User as User
participant D2M as diann2mztab()
participant Rpt as Report File
User->>D2M: Call diann2mztab with parameters
D2M->>Rpt: Attempt to load "report.tsv"
alt "report.tsv" not found
D2M->>Rpt: Load "report.parquet"
end
D2M->>D2M: Filter decoy precursors (if "Decoy" exists)
D2M->>D2M: Adjust peptide sequence modifications
D2M->>D2M: Merge DataFrames with new parameters
D2M->>User: Return generated mztab output
Suggested reviewers
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
quantmsutils/diann/diann2mztab.py (1)
718-721
: Remove or repurpose unused variable ‘file’
Static analysis flags that 'file' is assigned but never used. Either remove it or replace the if-else with aternary operator if you plan to use it.Apply this diff if the variable is not needed:
- if "2.0" not in diann_version: - file = list(pg.columns[5:]) - else: - file = list(pg.columns[4:])🧰 Tools
🪛 Ruff (0.8.2)
718-721: Use ternary operator
file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:])
instead ofif
-else
-blockReplace
if
-else
-block withfile = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:])
(SIM108)
721-721: Local variable
file
is assigned to but never usedRemove assignment to unused variable
file
(F841)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
README.md
(0 hunks)pyproject.toml
(2 hunks)quantmsutils/__init__.py
(1 hunks)quantmsutils/diann/diann2mztab.py
(27 hunks)quantmsutils/features/sage_feature.py
(0 hunks)quantmsutils/features/snr.py
(0 hunks)quantmsutils/quantmsutilsc.py
(0 hunks)recipe/meta.yaml
(1 hunks)tests/test_commands.py
(0 hunks)
💤 Files with no reviewable changes (5)
- tests/test_commands.py
- quantmsutils/quantmsutilsc.py
- README.md
- quantmsutils/features/sage_feature.py
- quantmsutils/features/snr.py
✅ Files skipped from review due to trivial changes (2)
- quantmsutils/init.py
- recipe/meta.yaml
🧰 Additional context used
🪛 Ruff (0.8.2)
quantmsutils/diann/diann2mztab.py
718-721: Use ternary operator file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:])
instead of if
-else
-block
Replace if
-else
-block with file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:])
(SIM108)
721-721: Local variable file
is assigned to but never used
Remove assignment to unused variable file
(F841)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (27)
quantmsutils/diann/diann2mztab.py (26)
51-59
: Function signature looks good
The new parameters enhance clarity for DIA-NN conversions. No further issues found.
97-102
: Appropriate decoy filtering for MSstats
Filtering out decoy entries (Decoy == 1) helps ensure MSstats analysis is limited to target data.
112-116
: Careful handling of DIA-NN decoy “^”
Appending a caret prefix only when present ensures consistent naming for decoy peptides. This looks correct.
260-262
: Skipping caret character is correct
Excluding “^” from mass calculations prevents incorrect mass contributions for decoy peptides.
272-295
: Condition for unknown amino acids
The extended condition properly checks for recognized amino acids, logging unknowns. This is fine, though consider consolidating the list or using a data structure if it grows further.
331-335
: Fallback to parquet file is valid
Attempting to locate “report.tsv” first, then defaulting to “report.parquet,” is a robust approach for DIA-NN version differences.
372-372
: Expanded support for DIA-NN versions
Adding “2.0.1” and “2.0.2” ensures broader coverage. No issues found.
377-383
: Extended convert_to_mztab function signature
Passing in all necessary parameters aligns well with the updates in DIA-NN output.
407-407
: Refactoring references to ‘assay’ and ‘study_variable’
Renaming columns and merging DataFrames help map run information accurately. This appears consistent with the new integration logic.Also applies to: 412-412, 415-415, 420-420, 428-428, 436-436, 438-438
469-476
: Conditional loading for non-2.0 vs 2.0
Selecting columns for the main report based on DIA-NN version keeps backward compatibility.
490-494
: Efficient precursor m/z calculation
Mapping unique sequences to computed masses offers a significant performance gain. Good optimization.
783-789
: Averaging protein abundance per study variable
Dropping source columns after computing the mean is valid. Ensure no references depend on them later.
877-883
: Updated mztab_peh function signature
Inclusion of ms_run_index fosters consistent indexing in downstream operations. No issues found.
969-972
: Pivot logic for search_engine_score
Aggregating Q.Value by precursor and run, then pivoting for ms_run references is a solid approach.Also applies to: 977-981
1037-1037
: Merging aggregated data
Combining the aggregated table (aggtable) with existing columns ensures each precursor has updated stats. No issues found.
1043-1050
: Adding decoy column to PEH
Conditional merging of "Decoy" avoids errors for versions that don’t store decoy info.
1062-1062
: New mztab_psh function
Excellent step to handle PSM-level data separately in the pipeline.
1123-1125
: Normalized RT and fallback to nearest match
Dividing RT by 60 and performing nearest-match merges ensures cross-file consistency. A robust fallback for missing MS2.Scan values.Also applies to: 1128-1129
1131-1182
: Mapping MS2.Scan to RT
Merging with both columns ensures accurate referencing for the PSM data. Logging unassigned scans is helpful for debugging.
1184-1185
: Fallback merge for RT-only matching
When “MS2.Scan” is unavailable, using RT as the single key is a sensible fallback.
1190-1205
: Defining PSM columns
Extracting essential fields, including Q.Value, ensures the PSM sub-table is well structured for mzTab.
1206-1250
: Handling Decoy in PSH
Creating “opt_global_cv_MS:1002217_decoy_peptide” for both decoy and non-decoy peptides is consistent with upstream logic.
1269-1272
: Finding modifications
Applying a per-row approach to parse modifications is correct. No concerns.
1281-1283
: Stripping '^' from PSM annotation
Replacing decoy character before constructing the final string ensures consistent peptidoform sequences.
1564-1564
: Interval merging logic
Maximizing merged intervals for coverage calculation is correct. No issues.
1576-1577
: calculate_protein_coverages function
The approach for partial string matching of accessions against FASTA is well-documented. Implementation looks fine.pyproject.toml (1)
6-6
: Version bump
Incrementing version from 0.0.18 to 0.0.19 aligns with broader updates to the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)
1119-1122
: Consider using consistent float precision.The code mixes float32 and float64 types for RT values. Consider standardizing to float64 throughout to avoid potential precision issues.
-target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1) -group['RT'] = group['RT'].astype('float64') +# Convert all RT values to float64 for consistency +target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1).astype('float64')
1118-1182
: Consider refactoring MS2 scan matching logic.The MS2 scan matching logic is complex and could benefit from being extracted into a separate function for better maintainability.
+def match_ms2_scans(rt_matched, target, group): + """Match MS2 scans between DIA-NN report and mzML file. + + Args: + rt_matched: DataFrame with RT-matched data + target: DataFrame with target data + group: DataFrame with group data + + Returns: + DataFrame with matched MS2 scans + """ + if "MS2.Scan" not in group.columns: + logger.info("MS2.Scan column isn't in DIA-NN report, only Matching MS2 by RT values") + return rt_matched + + logger.info("Mapping DIA-NN MS2 to mzML by MS2.Scan and RT columns") + new_target = target.copy() + new_target.columns = [ + "scan_RT", + "scan_opt_global_spectrum_reference", + "MS2.Scan", + "scan_exp_mass_to_charge", + ] + scan_matched = pd.merge(rt_matched, new_target, on="MS2.Scan") + + scan_matched["unassigned_matched"] = scan_matched.apply( + lambda row: 1 if row["MS2.Scan"] != row["DIANN-intraID"] else 0, axis=1 + ) + + if len(scan_matched[scan_matched["unassigned_matched"] == 1]) > 0: + v_str = scan_matched[scan_matched["unassigned_matched"] == 1]["MS2.Scan"].tolist() + logger.info( + f"RT matched and DIA-NN Scan matched are inconsistent in mzML. Keep Scan matched: {v_str}" + ) + scan_matched.drop( + [ + "RT", + "opt_global_spectrum_reference", + "DIANN-intraID", + "exp_mass_to_charge", + "unassigned_matched", + ], + inplace=True, + axis=1, + ) + scan_matched.rename( + columns={ + "scan_RT": "RT", + "scan_opt_global_spectrum_reference": "opt_global_spectrum_reference", + "scan_exp_mass_to_charge": "exp_mass_to_charge", + }, + inplace=True, + ) + else: + scan_matched.drop( + [ + "scan_RT", + "scan_opt_global_spectrum_reference", + "scan_exp_mass_to_charge", + "unassigned_matched", + ], + inplace=True, + axis=1, + ) + + return scan_matched # In mztab_psh function: -if "MS2.Scan" in group.columns: - logger.info("Mapping DIA-NN MS2 to mzML by MS2.Scan and RT columns") - new_target = target - new_target.columns = [ - "scan_RT", - "scan_opt_global_spectrum_reference", - "MS2.Scan", - "scan_exp_mass_to_charge", - ] - scan_matched = pd.merge(rt_matched, new_target, on="MS2.Scan") - ... - out_mztab_psh = pd.concat([out_mztab_psh, scan_matched]) -else: - logger.info("MS2.Scan column isn't in DIA-NN report, only Matching MS2 by RT values") - out_mztab_psh = pd.concat([out_mztab_psh, rt_matched]) +out_mztab_psh = pd.concat([out_mztab_psh, match_ms2_scans(rt_matched, target, group)])
1560-1561
: Fix line indentation.The line is over-indented. Adjust the indentation to match the surrounding code.
- max(merged_starts[-1] + merged_lengths[-1], start + length) - merged_starts[-1] + max(merged_starts[-1] + merged_lengths[-1], start + length) - merged_starts[-1]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
quantmsutils/diann/diann2mztab.py
(27 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (6)
quantmsutils/diann/diann2mztab.py (6)
50-60
: LGTM! Function signature formatting improves readability.The parameters of the
diann2mztab
function have been reformatted with consistent indentation, making the code more readable.
98-103
: LGTM! Decoy filtering for MSstats analysis.The code now properly filters out decoy precursors (where Decoy=1) before MSstats analysis, while preserving them for mzTab output. This is a good practice as decoy hits should not be included in downstream statistical analysis.
113-118
: LGTM! Improved handling of DIA-NN decoy tag.The code now correctly handles the DIA-NN decoy tag (^) in peptide sequences by:
- Preserving the tag when converting modifications
- Prepending the tag to maintain decoy identification
332-337
: LGTM! Enhanced DIA-NN version support.The code now:
- Handles both .tsv (DIA-NN 1.8.1) and .parquet (DIA-NN 2.0+) report formats
- Includes support for DIA-NN 2.0.x versions
- Properly loads decoy information from DIA-NN 2.0+ outputs
Also applies to: 472-477
413-414
: LGTM! Improved DataFrame merging logic.The code now:
- Ensures correct type casting for assay and study_variable columns
- Efficiently calculates protein abundance by averaging MS runs from the same sample
Also applies to: 778-784
1202-1242
: LGTM! Enhanced PSH table decoy handling.The code now properly:
- Includes decoy information in PSH table when available
- Sets default decoy value to 0 when not available
- Maintains consistent column ordering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)
640-651
: Consider using a more efficient data structure for ms_run_index.The current implementation builds the ms_run_index dictionary using a loop. Consider using a more efficient dictionary comprehension.
-ms_run_index = {} -for i in range(1, len(index_ref["Run"]) + 1): - ms_run_index[index_ref.loc[i - 1, "Run"]] = i +ms_run_index = {index_ref.loc[i - 1, "Run"]: i for i in range(1, len(index_ref["Run"]) + 1)}
1118-1120
: Consider using pandas type conversion for better performance.The current implementation uses apply for type conversion. Consider using pandas astype for better performance.
-target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1) -group['RT'] = group['RT'].astype('float64') +target.loc[:, "RT"] = (target["RT"] / 60).astype('float64') +group['RT'] = group['RT'].astype('float64')
1117-1117
: Add a comment explaining the RT conversion.The code converts RT from seconds to minutes but lacks a clear comment explaining this conversion.
+# Convert retention time from seconds to minutes for standardization target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
quantmsutils/diann/diann2mztab.py
(27 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (8)
quantmsutils/diann/diann2mztab.py (8)
51-59
: LGTM! Function parameters are well-organized.The parameters of the
diann2mztab
function are properly indented and aligned, improving code readability.
98-103
: Improved handling of decoy precursors.The code now properly filters out decoy precursors (where Decoy=1) before MSstats analysis, while preserving them for mzTab output. This ensures accurate downstream analysis.
114-117
: Enhanced peptide sequence handling.The code now properly handles DIA-NN decoy tags (^) in peptide sequences, ensuring correct sequence representation in the output.
332-336
: Improved error handling for report file retrieval.The code now gracefully handles both tsv and parquet formats, adapting to different DIA-NN versions.
472-476
: Enhanced version-specific report loading.The code now properly handles different report formats based on DIA-NN version:
- Version 1.8.1: Loads TSV with MS2.Scan column
- Version 2.0+: Loads Parquet with Decoy column
1038-1045
: Improved decoy handling in PSH construction.The code now properly handles the Decoy column in PSH sub-table construction, ensuring accurate representation of decoy peptides in the output.
1123-1180
: Robust MS2 scan matching logic.The code now handles cases where MS2.Scan column is not available in DIA-NN report:
- If MS2.Scan exists: Matches by both RT and scan number
- If MS2.Scan doesn't exist: Matches only by RT values
This makes the code more resilient to different DIA-NN versions.
373-374
:✅ Verification successful
Verify DIA-NN version compatibility.
The supported DIA-NN versions have been updated to include newer versions (2.0, 2.0.1, 2.0.2).
Run this script to check if there are any unsupported versions in use:
🏁 Script executed:
#!/bin/bash # Description: Check for any unsupported DIA-NN versions in the codebase rg -A 5 "DIA-NN.*version" --glob '!*.{pyc,pyo}'Length of output: 2299
DIA-NN Version Compatibility Check Passed
The updated code in
quantmsutils/diann/diann2mztab.py
(lines 373-374) now correctly verifies that the DIA-NN version is one of the supported ones:["1.8.1", "2.0", "2.0.1", "2.0.2"]
. The output from the verification script confirms no unsupported versions are in use. Please continue to update this list as new DIA-NN versions emerge if needed.
Summary by CodeRabbit