
Output file format not strictly enough defined. #33

Open
BioGeek opened this issue Oct 10, 2024 · 3 comments

Comments


BioGeek commented Oct 10, 2024

The current description of the output format contains a lot of ambiguities and several algorithms interpret them in different ways.

A non-exhaustive list:

aa_scores

  • biatNovo-DDA: a string of comma-separated negative float values with two decimal digits: "-3.20,-3.77,-4.74,-5.10,-4.31,-3.78,-3.91,-4.04,-3.52,-4.12,-3.13,-7.27,-4.34,-3.76"
  • pepnet: space separated positive float values of six decimal digits between square brackets:
    [0.164515 0.235235 0.218719 0.358655 0.252523 0.227940 0.342400 0.456557 0.576003 0.679042 0.927740 0.996059 0.999982 0.999307 0.999602 0.999992 0.999995 0.999997 0.999989 0.999987 0.999997 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]
  • pi-HelixNovo: pipe-separated positive float values of two decimal digits:
    0.03|0.03|0.03|0.03|0.03|1.0|0.67|0.4|0.14|0.11|0.24|0.22|0.19|0.31
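To illustrate how divergent these encodings are, here is a best-effort parser (a hypothetical helper, `parse_aa_scores`, not part of any of these tools) that accepts all three observed variants and returns a plain list of floats:

```python
import re

def parse_aa_scores(raw: str) -> list[float]:
    """Best-effort parser for the aa_scores variants observed above.

    Handles comma-separated (biatNovo-DDA), bracketed space-separated
    (pepnet), and pipe-separated (pi-HelixNovo) encodings.
    """
    s = raw.strip()
    if s.startswith('[') and s.endswith(']'):
        s = s[1:-1]  # pepnet wraps the values in square brackets
    # Split on any of the observed delimiters: comma, pipe, or whitespace
    tokens = [t for t in re.split(r'[,|\s]+', s) if t]
    return [float(t) for t in tokens]
```

A validator that has to cope with all current outputs would need something like this; a strict spec would instead mandate exactly one of these encodings.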

spectrum_id

  • pepnet: includes the subfolder and file extension: 9_species_human/151009_exo4_1.mgf:0
  • all the other algorithms: filename without file extension: 151009_exo4_1:0
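A sketch of a normalizer (hypothetical helper, `normalize_spectrum_id`) that reduces the pepnet form to the bare `filename:index` form used by the other tools:

```python
import os

def normalize_spectrum_id(spectrum_id: str) -> str:
    """Strip any subfolder and file extension from the filename part,
    e.g. '9_species_human/151009_exo4_1.mgf:0' -> '151009_exo4_1:0'.
    """
    filename, _, index = spectrum_id.rpartition(':')
    base = os.path.splitext(os.path.basename(filename))[0]
    return f"{base}:{index}"
```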

sequence

  • biatNovo-DDA: sometimes has nan as sequence:
    151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,

score

  • biatNovo-DDA: sometimes has no score value:
    151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
  • biatNovo-DDA and casanovo use negative score values; instanovo, pepnet and pi-HelixNovo use positive values. Please clarify whether score means a log probability or a confidence score.

Are extra columns allowed?

  • biatNovo-DDA: spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,precursor_charge,protein_access_id,scan_list_middle,scan_list_original,predicted_score_max
  • casanovo: sequence,PSM_ID,accession,unique,database,database_version,search_engine,score,modifications,retention_time,charge,exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,start,end,aa_scores
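Whichever way this is decided, the answer changes the header check: if extra columns are allowed, a validator should test for a required subset rather than exact equality. A sketch (hypothetical helper, `header_ok`):

```python
# The four columns every output file must provide.
REQUIRED = {"sequence", "score", "aa_scores", "spectrum_id"}

def header_ok(fieldnames, allow_extra=True):
    """Check a CSV header against the required columns.

    With allow_extra=True, additional tool-specific columns pass;
    with allow_extra=False, the header must match exactly.
    """
    cols = set(fieldnames or [])
    return REQUIRED <= cols if allow_extra else cols == REQUIRED
```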

Ideally one would want a validator added to the codebase that checks the output file and complains if the output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results.

Something like:

import csv
import os
import re
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class CSVRow(BaseModel):
    sequence: str
    score: float
    aa_scores: str
    spectrum_id: str

    @field_validator('sequence')
    @classmethod
    def validate_sequence(cls, v: str) -> str:
        valid_aa = set("GASPVTCLINDQKEMHFRYW")
        
        stripped_seq = re.sub(r'\[UNIMOD:\d+\]', '', v)
        
        if not all(aa in valid_aa for aa in stripped_seq):
            invalid_aa = set(stripped_seq) - valid_aa
            raise ValueError(f"Invalid amino acid(s) {', '.join(invalid_aa)} in sequence {v}")
        
        return v
    
    @field_validator('score')
    @classmethod
    def validate_score(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("Score must be between 0 and 1")
        return v

    @field_validator('aa_scores')
    @classmethod
    def validate_aa_scores(cls, v: str) -> str:
        try:
            _ = [float(score) for score in v.split(',')]
        except ValueError:
            raise ValueError("Invalid aa_scores format. Must be a string of comma-separated floats.")

        return v

    @field_validator('spectrum_id')
    @classmethod
    def validate_spectrum_id(cls, v: str) -> str:
        if '/' in v:
            raise ValueError("spectrum_id cannot contain forward slashes")
        if not re.match(r'^.+:\d+$', v):
            raise ValueError("Invalid spectrum_id format. Must be in the format 'filename:index'")
        return v


def validate_csv(file_path: str) -> List[CSVRow]:
    validated_rows = []
    
    with open(file_path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        
        if reader.fieldnames is None or set(reader.fieldnames) != {'sequence', 'score', 'aa_scores', 'spectrum_id'}:
            raise ValueError("CSV file must contain columns: sequence, score, aa_scores, spectrum_id")

        for row in reader:
            try:
                validated_row = CSVRow(**row)
                validated_rows.append(validated_row)
            except ValidationError as e:
                print(f"Validation error in row: {row}")
                print(e)
    
    return validated_rows

# Example usage
if __name__ == "__main__":
    output_dir = "outputs/9_species_human"
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        try:
            validated_data = validate_csv(file_path)
            print(f"Successfully validated {len(validated_data)} rows in {filename}")
        except ValueError as e:
            print(f"Validation failed for {filename}: {str(e)}")
@bittremieux

biatNovo-DDA and casanovo use negative score values; instanovo, pepnet and pi-HelixNovo use positive values. Please clarify whether score means a log probability or a confidence score.

This is just what is reported by the tools; we don't interpret them as probabilities or otherwise. The raw values are not used directly; instead, only the corresponding ranking of the PSMs is used.
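Under that reading, a rank-only comparison could be sketched as follows (hypothetical helper, `rank_psms`, not the project's actual code):

```python
def rank_psms(psms):
    """Sort (spectrum_id, score) pairs by score, best first.

    Only the relative order matters, so this works for any tool whose
    scores increase with confidence, whether the raw values are
    negative log-probabilities or positive confidence scores.
    """
    return sorted(psms, key=lambda psm: psm[1], reverse=True)
```

Note the assumption that higher always means better within a single tool's output; that, too, would need to be stated in the spec.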

Ideally one would want a validator added to the codebase that checks the output file and complains if the output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results.

Yes, this should be added.

Thanks for reporting. Indeed, the output should be properly validated and ensured to be correct @PominovaMS. The ambiguity and even incorrect encoding of the aa_scores for example is problematic.


BioGeek commented Oct 14, 2024

The validation code above is based on my interpretation of the output format description, but I am not sure it is correct. So if the description can be updated to remove the ambiguities then I'm happy to submit a PR with the updated validation code.

@bittremieux

Thanks. Some of these inconsistencies have been reported before, but it seems that @PominovaMS hasn't been able to update the instructions and code yet. We'll keep you posted.
