
Remove feature components in favor of quantms-rescoring #46

Merged: 14 commits into main on Feb 20, 2025

Conversation

@ypriverol (Member) commented Feb 20, 2025

Summary by CodeRabbit

  • Documentation
    • Updated user documentation by removing references to legacy features to ensure clarity on current capabilities.
  • Chores
    • Bumped the package version to 0.0.19.
    • Streamlined the command-line interface by deprecating outdated commands.
  • Refactor
    • Enhanced the key data processing tool with improved error handling, more accurate filtering, and broader compatibility with updated output standards.

coderabbitai bot (Contributor) commented Feb 20, 2025

Walkthrough

The changes remove outdated percolator script details from the documentation and eliminate related CLI utilities and tests. The version has been updated from "0.0.18" to "0.0.19" across configuration and initialization files. Significant improvements have been introduced in the diann2mztab module—enhancing parameter readability, decoy filtering, peptide sequence handling, error resiliency, and DataFrame merging. Additionally, commands for the removed features have been deleted from the CLI entry point.

Changes

| File(s) | Change Summary |
| --- | --- |
| README.md | Removed the "Features to percolator scripts" section and the descriptions for sage2feature and spectrum2feature. |
| pyproject.toml, quantmsutils/__init__.py, recipe/meta.yaml | Updated version from "0.0.18" to "0.0.19" (with minor reformatting). |
| .../diann/diann2mztab.py | Improved function signature formatting; added decoy filtering, peptide sequence adjustment, robust error handling for report file lookup, updated supported DIA-NN versions, refined DataFrame merging, and introduced new parameters. |
| .../features/sage_feature.py, .../features/snr.py | Removed the CLI utility files for sage2feature and spectrum2feature, eliminating the associated command implementations. |
| .../quantmsutilsc.py, tests/test_commands.py | Removed CLI command registrations and tests related to the deleted sage and spectrum feature utilities. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant D2M as diann2mztab()
    participant Rpt as Report File

    User->>D2M: Call diann2mztab with parameters
    D2M->>Rpt: Attempt to load "report.tsv"
    alt "report.tsv" not found
        D2M->>Rpt: Load "report.parquet"
    end
    D2M->>D2M: Filter decoy precursors (if "Decoy" exists)
    D2M->>D2M: Adjust peptide sequence modifications
    D2M->>D2M: Merge DataFrames with new parameters
    D2M->>User: Return generated mztab output

Suggested reviewers

  • jpfeuffer
  • daichengxin

Poem

I'm a hopping coder rabbit, light on my feet,
Skipping old commands, making new beats.
With versions renewed and code clear as day,
I celebrate fresh paths in a joyful display.
In the realm of bytes, I merrily bound,
Leaving legacy errors far underground!
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
quantmsutils/diann/diann2mztab.py (1)

718-721: Remove or repurpose unused variable ‘file’
Static analysis flags that 'file' is assigned but never used. Either remove it or replace the if-else block with a ternary operator if you plan to use it.

Apply this diff if the variable is not needed:

-    if "2.0" not in diann_version:
-        file = list(pg.columns[5:])
-    else:
-        file = list(pg.columns[4:])
🧰 Tools
🪛 Ruff (0.8.2)

718-721: Use ternary operator file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:]) instead of if-else-block

Replace if-else-block with file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:])

(SIM108)


721-721: Local variable file is assigned to but never used

Remove assignment to unused variable file

(F841)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa1a35 and 04d2af6.

📒 Files selected for processing (9)
  • README.md (0 hunks)
  • pyproject.toml (2 hunks)
  • quantmsutils/__init__.py (1 hunks)
  • quantmsutils/diann/diann2mztab.py (27 hunks)
  • quantmsutils/features/sage_feature.py (0 hunks)
  • quantmsutils/features/snr.py (0 hunks)
  • quantmsutils/quantmsutilsc.py (0 hunks)
  • recipe/meta.yaml (1 hunks)
  • tests/test_commands.py (0 hunks)
💤 Files with no reviewable changes (5)
  • tests/test_commands.py
  • quantmsutils/quantmsutilsc.py
  • README.md
  • quantmsutils/features/sage_feature.py
  • quantmsutils/features/snr.py
✅ Files skipped from review due to trivial changes (2)
  • quantmsutils/__init__.py
  • recipe/meta.yaml
🧰 Additional context used
🪛 Ruff (0.8.2)
quantmsutils/diann/diann2mztab.py

718-721: Use ternary operator file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:]) instead of if-else-block

Replace if-else-block with file = list(pg.columns[5:]) if "2.0" not in diann_version else list(pg.columns[4:])

(SIM108)


721-721: Local variable file is assigned to but never used

Remove assignment to unused variable file

(F841)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (27)
quantmsutils/diann/diann2mztab.py (26)

51-59: Function signature looks good
The new parameters enhance clarity for DIA-NN conversions. No further issues found.


97-102: Appropriate decoy filtering for MSstats
Filtering out decoy entries (Decoy == 1) helps ensure MSstats analysis is limited to target data.
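
A minimal sketch of this filtering, assuming the DIA-NN 2.0 "Decoy" column and an otherwise invented report frame:

import pandas as pd

# Only the "Decoy" column name comes from the DIA-NN 2.0+ report; the rest is illustrative.
report = pd.DataFrame(
    {"Precursor.Id": ["AAK2", "AAK2-rev"], "Run": ["run1", "run1"], "Decoy": [0, 1]}
)

# Keep decoys for the mzTab output, but exclude them from the MSstats input.
msstats_input = report[report["Decoy"] == 0].copy() if "Decoy" in report.columns else report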


112-116: Careful handling of DIA-NN decoy “^”
Appending a caret prefix only when present ensures consistent naming for decoy peptides. This looks correct.


260-262: Skipping caret character is correct
Excluding “^” from mass calculations prevents incorrect mass contributions for decoy peptides.
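
A rough illustration of skipping the decoy tag during mass calculation (the residue masses and helper below are assumptions, not the module's actual code):

# Monoisotopic residue masses for the residues in this example only.
AA_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768, "I": 113.08406,
           "D": 115.02694, "K": 128.09496}
WATER = 18.010565

def peptide_mass(sequence: str) -> float:
    # The "^" decoy tag carries no mass, so it is skipped entirely.
    return WATER + sum(AA_MASS[aa] for aa in sequence if aa != "^")

assert peptide_mass("^PEPTIDEK") == peptide_mass("PEPTIDEK")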


272-295: Condition for unknown amino acids
The extended condition properly checks for recognized amino acids, logging unknowns. This is fine, though consider consolidating the list or using a data structure if it grows further.
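
A sketch of the kind of check described here, with the canonical residue set and logger name assumed:

import logging

logger = logging.getLogger(__name__)
CANONICAL_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def is_known_residue(aa: str) -> bool:
    # Log anything outside the recognized amino-acid alphabet instead of failing.
    if aa not in CANONICAL_AAS:
        logger.warning("Unknown amino acid encountered: %s", aa)
        return False
    return True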


331-335: Fallback to parquet file is valid
Attempting to locate “report.tsv” first, then defaulting to “report.parquet,” is a robust approach for DIA-NN version differences.
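
A minimal sketch of that fallback, assuming the report file names but not the module's actual loader:

from pathlib import Path
import pandas as pd

def load_diann_report(report_dir: str) -> pd.DataFrame:
    # Prefer report.tsv (DIA-NN 1.8.1); fall back to report.parquet (DIA-NN 2.0+).
    tsv = Path(report_dir) / "report.tsv"
    if tsv.exists():
        return pd.read_csv(tsv, sep="\t")
    return pd.read_parquet(Path(report_dir) / "report.parquet")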


372-372: Expanded support for DIA-NN versions
Adding “2.0.1” and “2.0.2” ensures broader coverage. No issues found.


377-383: Extended convert_to_mztab function signature
Passing in all necessary parameters aligns well with the updates in DIA-NN output.


407-407: Refactoring references to ‘assay’ and ‘study_variable’
Renaming columns and merging DataFrames help map run information accurately. This appears consistent with the new integration logic.

Also applies to: 412-412, 415-415, 420-420, 428-428, 436-436, 438-438


469-476: Conditional loading for non-2.0 vs 2.0
Selecting columns for the main report based on DIA-NN version keeps backward compatibility.
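
A hedged sketch of version-dependent column selection; the column lists are assumptions built on the MS2.Scan (1.8.1) vs. Decoy (2.0+) distinction noted elsewhere in this review:

import pandas as pd

diann_version = "2.0.1"  # illustrative
report = pd.DataFrame(columns=["Run", "Protein.Ids", "Modified.Sequence",
                               "Precursor.Charge", "Q.Value", "Decoy"])

# MS2.Scan exists in 1.8.1 reports, Decoy in 2.0+ reports; keep whichever is present.
if "2.0" in diann_version:
    wanted = ["Run", "Protein.Ids", "Modified.Sequence", "Precursor.Charge", "Q.Value", "Decoy"]
else:
    wanted = ["Run", "Protein.Ids", "Modified.Sequence", "Precursor.Charge", "Q.Value", "MS2.Scan"]
report = report[[c for c in wanted if c in report.columns]]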


490-494: Efficient precursor m/z calculation
Mapping unique sequences to computed masses offers a significant performance gain. Good optimization.
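
A small sketch of the optimization, using a placeholder mass function and assumed column names:

import pandas as pd

def peptide_mass(seq: str) -> float:
    return 110.0 * len(seq)  # placeholder, not the real monoisotopic mass

PROTON = 1.007276
report = pd.DataFrame({"Modified.Sequence": ["PEPTIDEK", "PEPTIDEK", "AAAK"],
                       "Precursor.Charge": [2, 3, 1]})

# Compute each unique sequence's mass once, then map it back onto every row.
mass_by_seq = {s: peptide_mass(s) for s in report["Modified.Sequence"].unique()}
report["Calculate.Precursor.Mz"] = (
    report["Modified.Sequence"].map(mass_by_seq) + report["Precursor.Charge"] * PROTON
) / report["Precursor.Charge"]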


783-789: Averaging protein abundance per study variable
Dropping source columns after computing the mean is valid. Ensure no references depend on them later.
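
Roughly, the averaging looks like this (column names invented for illustration):

import pandas as pd

pg = pd.DataFrame({"accession": ["P12345"],
                   "protein_abundance_assay[1]": [1.0e6],
                   "protein_abundance_assay[2]": [2.0e6]})

run_cols = ["protein_abundance_assay[1]", "protein_abundance_assay[2]"]
# Average the per-run abundances into one study variable, then drop the source columns.
pg["protein_abundance_study_variable[1]"] = pg[run_cols].mean(axis=1)
pg = pg.drop(columns=run_cols)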


877-883: Updated mztab_peh function signature
Inclusion of ms_run_index fosters consistent indexing in downstream operations. No issues found.


969-972: Pivot logic for search_engine_score
Aggregating Q.Value by precursor and run, then pivoting for ms_run references is a solid approach.

Also applies to: 977-981
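
A compact sketch of the aggregate-then-pivot step, with the frame and run index invented for illustration:

import pandas as pd

report = pd.DataFrame({"Precursor.Id": ["AAK2", "AAK2", "CCK2"],
                       "Run": ["run1", "run2", "run1"],
                       "Q.Value": [0.001, 0.002, 0.01]})
ms_run_index = {"run1": 1, "run2": 2}

# Best Q.Value per precursor and run, spread into one score column per ms_run.
best = report.groupby(["Precursor.Id", "Run"], as_index=False)["Q.Value"].min()
wide = best.pivot(index="Precursor.Id", columns="Run", values="Q.Value")
wide.columns = [f"search_engine_score[1]_ms_run[{ms_run_index[r]}]" for r in wide.columns]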


1037-1037: Merging aggregated data
Combining the aggregated table (aggtable) with existing columns ensures each precursor has updated stats. No issues found.


1043-1050: Adding decoy column to PEH
Conditional merging of "Decoy" avoids errors for versions that don’t store decoy info.


1062-1062: New mztab_psh function
Excellent step to handle PSM-level data separately in the pipeline.


1123-1125: Normalized RT and fallback to nearest match
Dividing RT by 60 and performing nearest-match merges ensures cross-file consistency. A robust fallback for missing MS2.Scan values.

Also applies to: 1128-1129
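
A sketch of the nearest-RT fallback using pandas merge_asof; the frames are invented, and the mzML retention times are converted from seconds to minutes first:

import pandas as pd

spectra = pd.DataFrame({"RT": [65.4, 130.2, 191.0], "spectrum_ref": [101, 102, 103]})
spectra["RT"] = spectra["RT"] / 60.0  # mzML RT in seconds -> minutes

psms = pd.DataFrame({"RT": [1.08, 3.20], "Precursor.Id": ["AAK2", "CCK2"]})

# When MS2.Scan is unavailable, match each PSM to the nearest spectrum by RT.
matched = pd.merge_asof(psms.sort_values("RT"), spectra.sort_values("RT"),
                        on="RT", direction="nearest")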


1131-1182: Mapping MS2.Scan to RT
Merging with both columns ensures accurate referencing for the PSM data. Logging unassigned scans is helpful for debugging.


1184-1185: Fallback merge for RT-only matching
When “MS2.Scan” is unavailable, using RT as the single key is a sensible fallback.


1190-1205: Defining PSM columns
Extracting essential fields, including Q.Value, ensures the PSM sub-table is well structured for mzTab.


1206-1250: Handling Decoy in PSH
Creating “opt_global_cv_MS:1002217_decoy_peptide” for both decoy and non-decoy peptides is consistent with upstream logic.


1269-1272: Finding modifications
Applying a per-row approach to parse modifications is correct. No concerns.


1281-1283: Stripping '^' from PSM annotation
Replacing decoy character before constructing the final string ensures consistent peptidoform sequences.


1564-1564: Interval merging logic
Maximizing merged intervals for coverage calculation is correct. No issues.
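
The merging idea, sketched independently of the module's exact variables:

def merge_intervals(intervals):
    # Merge overlapping (start, end) peptide matches before summing coverage.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

covered = sum(end - start for start, end in merge_intervals([(0, 10), (5, 20), (30, 40)]))
# covered == 30 residues, later divided by the protein length for coverage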


1576-1577: calculate_protein_coverages function
The approach for partial string matching of accessions against FASTA is well-documented. Implementation looks fine.

pyproject.toml (1)

6-6: Version bump
Incrementing version from 0.0.18 to 0.0.19 aligns with broader updates to the codebase.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)

1119-1122: Consider using consistent float precision.

The code mixes float32 and float64 types for RT values. Consider standardizing to float64 throughout to avoid potential precision issues.

-target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1)
-group['RT'] = group['RT'].astype('float64')
+# Convert all RT values to float64 for consistency
+target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1).astype('float64')

1118-1182: Consider refactoring MS2 scan matching logic.

The MS2 scan matching logic is complex and could benefit from being extracted into a separate function for better maintainability.

+def match_ms2_scans(rt_matched, target, group):
+    """Match MS2 scans between DIA-NN report and mzML file.
+    
+    Args:
+        rt_matched: DataFrame with RT-matched data
+        target: DataFrame with target data
+        group: DataFrame with group data
+    
+    Returns:
+        DataFrame with matched MS2 scans
+    """
+    if "MS2.Scan" not in group.columns:
+        logger.info("MS2.Scan column isn't in DIA-NN report, only Matching MS2 by RT values")
+        return rt_matched
+
+    logger.info("Mapping DIA-NN MS2 to mzML by MS2.Scan and RT columns")
+    new_target = target.copy()
+    new_target.columns = [
+        "scan_RT",
+        "scan_opt_global_spectrum_reference",
+        "MS2.Scan",
+        "scan_exp_mass_to_charge",
+    ]
+    scan_matched = pd.merge(rt_matched, new_target, on="MS2.Scan")
+
+    scan_matched["unassigned_matched"] = scan_matched.apply(
+        lambda row: 1 if row["MS2.Scan"] != row["DIANN-intraID"] else 0, axis=1
+    )
+
+    if len(scan_matched[scan_matched["unassigned_matched"] == 1]) > 0:
+        v_str = scan_matched[scan_matched["unassigned_matched"] == 1]["MS2.Scan"].tolist()
+        logger.info(
+            f"RT matched and DIA-NN Scan matched are inconsistent in mzML. Keep Scan matched: {v_str}"
+        )
+        scan_matched.drop(
+            [
+                "RT",
+                "opt_global_spectrum_reference",
+                "DIANN-intraID",
+                "exp_mass_to_charge",
+                "unassigned_matched",
+            ],
+            inplace=True,
+            axis=1,
+        )
+        scan_matched.rename(
+            columns={
+                "scan_RT": "RT",
+                "scan_opt_global_spectrum_reference": "opt_global_spectrum_reference",
+                "scan_exp_mass_to_charge": "exp_mass_to_charge",
+            },
+            inplace=True,
+        )
+    else:
+        scan_matched.drop(
+            [
+                "scan_RT",
+                "scan_opt_global_spectrum_reference",
+                "scan_exp_mass_to_charge",
+                "unassigned_matched",
+            ],
+            inplace=True,
+            axis=1,
+        )
+
+    return scan_matched

# In mztab_psh function:
-if "MS2.Scan" in group.columns:
-    logger.info("Mapping DIA-NN MS2 to mzML by MS2.Scan and RT columns")
-    new_target = target
-    new_target.columns = [
-        "scan_RT",
-        "scan_opt_global_spectrum_reference",
-        "MS2.Scan",
-        "scan_exp_mass_to_charge",
-    ]
-    scan_matched = pd.merge(rt_matched, new_target, on="MS2.Scan")
-    ...
-    out_mztab_psh = pd.concat([out_mztab_psh, scan_matched])
-else:
-    logger.info("MS2.Scan column isn't in DIA-NN report, only Matching MS2 by RT values")
-    out_mztab_psh = pd.concat([out_mztab_psh, rt_matched])
+out_mztab_psh = pd.concat([out_mztab_psh, match_ms2_scans(rt_matched, target, group)])

1560-1561: Fix line indentation.

The line is over-indented. Adjust the indentation to match the surrounding code.

-                    max(merged_starts[-1] + merged_lengths[-1], start + length) - merged_starts[-1]
+                max(merged_starts[-1] + merged_lengths[-1], start + length) - merged_starts[-1]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04d2af6 and 09273b9.

📒 Files selected for processing (1)
  • quantmsutils/diann/diann2mztab.py (27 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (6)
quantmsutils/diann/diann2mztab.py (6)

50-60: LGTM! Function signature formatting improves readability.

The parameters of the diann2mztab function have been reformatted with consistent indentation, making the code more readable.


98-103: LGTM! Decoy filtering for MSstats analysis.

The code now properly filters out decoy precursors (where Decoy=1) before MSstats analysis, while preserving them for mzTab output. This is a good practice as decoy hits should not be included in downstream statistical analysis.


113-118: LGTM! Improved handling of DIA-NN decoy tag.

The code now correctly handles the DIA-NN decoy tag (^) in peptide sequences by:

  1. Preserving the tag when converting modifications
  2. Prepending the tag to maintain decoy identification

332-337: LGTM! Enhanced DIA-NN version support.

The code now:

  1. Handles both .tsv (DIA-NN 1.8.1) and .parquet (DIA-NN 2.0+) report formats
  2. Includes support for DIA-NN 2.0.x versions
  3. Properly loads decoy information from DIA-NN 2.0+ outputs

Also applies to: 472-477


413-414: LGTM! Improved DataFrame merging logic.

The code now:

  1. Ensures correct type casting for assay and study_variable columns
  2. Efficiently calculates protein abundance by averaging MS runs from the same sample

Also applies to: 778-784


1202-1242: LGTM! Enhanced PSH table decoy handling.

The code now properly:

  1. Includes decoy information in PSH table when available
  2. Sets default decoy value to 0 when not available
  3. Maintains consistent column ordering

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)

640-651: Consider using a more efficient data structure for ms_run_index.

The current implementation builds the ms_run_index dictionary using a loop. Consider using a more efficient dictionary comprehension.

-ms_run_index = {}
-for i in range(1, len(index_ref["Run"]) + 1):
-    ms_run_index[index_ref.loc[i - 1, "Run"]] = i
+ms_run_index = {index_ref.loc[i - 1, "Run"]: i for i in range(1, len(index_ref["Run"]) + 1)}

1118-1120: Consider using pandas type conversion for better performance.

The current implementation uses apply for type conversion. Consider using pandas astype for better performance.

-target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1)
-group['RT'] = group['RT'].astype('float64')
+target.loc[:, "RT"] = (target["RT"] / 60).astype('float64')
+group['RT'] = group['RT'].astype('float64')

1117-1117: Add a comment explaining the RT conversion.

The code converts RT from seconds to minutes but lacks a clear comment explaining this conversion.

+# Convert retention time from seconds to minutes for standardization
target.loc[:, "RT"] = target.apply(lambda x: (x["RT"] / 60), axis=1)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09273b9 and f72c96f.

📒 Files selected for processing (1)
  • quantmsutils/diann/diann2mztab.py (27 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (8)
quantmsutils/diann/diann2mztab.py (8)

51-59: LGTM! Function parameters are well-organized.

The parameters of the diann2mztab function are properly indented and aligned, improving code readability.


98-103: Improved handling of decoy precursors.

The code now properly filters out decoy precursors (where Decoy=1) before MSstats analysis, while preserving them for mzTab output. This ensures accurate downstream analysis.


114-117: Enhanced peptide sequence handling.

The code now properly handles DIA-NN decoy tags (^) in peptide sequences, ensuring correct sequence representation in the output.


332-336: Improved error handling for report file retrieval.

The code now gracefully handles both tsv and parquet formats, adapting to different DIA-NN versions.


472-476: Enhanced version-specific report loading.

The code now properly handles different report formats based on DIA-NN version:

  • Version 1.8.1: Loads TSV with MS2.Scan column
  • Version 2.0+: Loads Parquet with Decoy column

1038-1045: Improved decoy handling in PSH construction.

The code now properly handles the Decoy column in PSH sub-table construction, ensuring accurate representation of decoy peptides in the output.


1123-1180: Robust MS2 scan matching logic.

The code now handles cases where MS2.Scan column is not available in DIA-NN report:

  1. If MS2.Scan exists: Matches by both RT and scan number
  2. If MS2.Scan doesn't exist: Matches only by RT values

This makes the code more resilient to different DIA-NN versions.


373-374: Verify DIA-NN version compatibility.

✅ Verification successful

The supported DIA-NN versions have been updated to include newer versions (2.0, 2.0.1, 2.0.2).

Run this script to check if there are any unsupported versions in use:


🏁 Script executed:

#!/bin/bash
# Description: Check for any unsupported DIA-NN versions in the codebase
rg -A 5 "DIA-NN.*version" --glob '!*.{pyc,pyo}'

Length of output: 2299


DIA-NN Version Compatibility Check Passed

The updated code in quantmsutils/diann/diann2mztab.py (lines 373-374) now correctly verifies that the DIA-NN version is one of the supported ones: ["1.8.1", "2.0", "2.0.1", "2.0.2"]. The output from the verification script confirms no unsupported versions are in use. Please continue to update this list as new DIA-NN versions emerge if needed.
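
The guard implied here amounts to the following sketch (variable names assumed):

SUPPORTED_DIANN_VERSIONS = ["1.8.1", "2.0", "2.0.1", "2.0.2"]

diann_version = "2.0.2"  # illustrative
if diann_version not in SUPPORTED_DIANN_VERSIONS:
    raise ValueError(f"Unsupported DIA-NN version: {diann_version}")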

@ypriverol ypriverol merged commit 56f0d67 into main Feb 20, 2025
7 checks passed