
Output file format not strictly enough defined. #33

Open
BioGeek opened this issue Oct 10, 2024 · 3 comments

Comments


BioGeek commented Oct 10, 2024

The current description of the output format contains a lot of ambiguities and several algorithms interpret them in different ways.

A non-exhaustive list:

aa_scores

  • biatNovo-DDA: a string of comma-separated negative float values with two decimal digits: "-3.20,-3.77,-4.74,-5.10,-4.31,-3.78,-3.91,-4.04,-3.52,-4.12,-3.13,-7.27,-4.34,-3.76"
  • pepnet: space separated positive float values of six decimal digits between square brackets:
    [0.164515 0.235235 0.218719 0.358655 0.252523 0.227940 0.342400 0.456557 0.576003 0.679042 0.927740 0.996059 0.999982 0.999307 0.999602 0.999992 0.999995 0.999997 0.999989 0.999987 0.999997 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]
  • pi-HelixNovo: pipe-separated positive float values of two decimal digits:
    0.03|0.03|0.03|0.03|0.03|1.0|0.67|0.4|0.14|0.11|0.24|0.22|0.19|0.31
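To illustrate how divergent these encodings are, here is a best-effort parser (a hypothetical helper, `parse_aa_scores`, not part of any of these tools) that accepts all three observed variants and returns a plain list of floats:

```python
import re

def parse_aa_scores(raw: str) -> list[float]:
    """Best-effort parser for the aa_scores variants observed above.

    Handles comma-separated (biatNovo-DDA), bracketed space-separated
    (pepnet), and pipe-separated (pi-HelixNovo) encodings.
    """
    s = raw.strip()
    if s.startswith('[') and s.endswith(']'):
        s = s[1:-1]  # pepnet wraps the values in square brackets
    # Split on any of the observed delimiters: comma, pipe, or whitespace
    tokens = [t for t in re.split(r'[,|\s]+', s) if t]
    return [float(t) for t in tokens]
```

A validator that has to cope with all current outputs would need something like this; a strict spec would instead mandate exactly one of these encodings.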

spectrum_id

  • pepnet: includes the subfolder and file extension: 9_species_human/151009_exo4_1.mgf:0
  • all the other algorithms: filename without file extension: 151009_exo4_1:0
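A sketch of a normalizer (hypothetical helper, `normalize_spectrum_id`) that reduces the pepnet form to the bare `filename:index` form used by the other tools:

```python
import os

def normalize_spectrum_id(spectrum_id: str) -> str:
    """Strip any subfolder and file extension from the filename part,
    e.g. '9_species_human/151009_exo4_1.mgf:0' -> '151009_exo4_1:0'.
    """
    filename, _, index = spectrum_id.rpartition(':')
    base = os.path.splitext(os.path.basename(filename))[0]
    return f"{base}:{index}"
```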

sequence

  • biatNovo-DDA: sometimes has nan as sequence:
    151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,

score

  • biatNovo-DDA: sometimes has no score value:
    151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
  • biatNovo-DDA and casanovo use negative score values; instanovo, pepnet and pi-HelixNovo use positive values. Please clarify whether score means a log probability or a confidence score.

Are extra columns allowed?

  • biatNovo-DDA: spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,precursor_charge,protein_access_id,scan_list_middle,scan_list_original,predicted_score_max
  • casanovo: sequence,PSM_ID,accession,unique,database,database_version,search_engine,score,modifications,retention_time,charge,exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,start,end,aa_scores
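Whichever way this is decided, the answer changes the header check: if extra columns are allowed, a validator should test for a required subset rather than exact equality. A sketch (hypothetical helper, `header_ok`):

```python
# The four columns every output file must provide.
REQUIRED = {"sequence", "score", "aa_scores", "spectrum_id"}

def header_ok(fieldnames, allow_extra=True):
    """Check a CSV header against the required columns.

    With allow_extra=True, additional tool-specific columns pass;
    with allow_extra=False, the header must match exactly.
    """
    cols = set(fieldnames or [])
    return REQUIRED <= cols if allow_extra else cols == REQUIRED
```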

Ideally one would want a validator added to the codebase that checks the output file and complains if the output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results.

Something like:

import csv
import os
import re
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class CSVRow(BaseModel):
    sequence: str
    score: float
    aa_scores: str
    spectrum_id: str

    @field_validator('sequence')
    @classmethod
    def validate_sequence(cls, v: str) -> str:
        valid_aa = set("GASPVTCLINDQKEMHFRYW")
        
        stripped_seq = re.sub(r'\[UNIMOD:\d+\]', '', v)
        
        if not all(aa in valid_aa for aa in stripped_seq):
            invalid_aa = set(stripped_seq) - valid_aa
            raise ValueError(f"Invalid amino acid(s) {', '.join(invalid_aa)} in sequence {v}")
        
        return v
    
    @field_validator('score')
    @classmethod
    def validate_score(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("Score must be between 0 and 1")
        return v

    @field_validator('aa_scores')
    @classmethod
    def validate_aa_scores(cls, v: str) -> str:
        try:
            _ = [float(score) for score in v.split(',')]
        except ValueError:
            raise ValueError("Invalid aa_scores format. Must be a string of comma-separated floats.")

        return v

    @field_validator('spectrum_id')
    @classmethod
    def validate_spectrum_id(cls, v: str) -> str:
        if '/' in v:
            raise ValueError("spectrum_id cannot contain forward slashes")
        if not re.match(r'^.+:\d+$', v):
            raise ValueError("Invalid spectrum_id format. Must be in the format 'filename:index'")
        return v


def validate_csv(file_path: str) -> List[CSVRow]:
    validated_rows = []
    
    with open(file_path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        
        if reader.fieldnames is None or set(reader.fieldnames) != {'sequence', 'score', 'aa_scores', 'spectrum_id'}:
            raise ValueError("CSV file must contain columns: sequence, score, aa_scores, spectrum_id")

        for row in reader:
            try:
                validated_row = CSVRow(**row)
                validated_rows.append(validated_row)
            except ValidationError as e:
                print(f"Validation error in row: {row}")
                print(e)
    
    return validated_rows

# Example usage
if __name__ == "__main__":
    output_dir = "outputs/9_species_human"
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        try:
            validated_data = validate_csv(file_path)
            print(f"Successfully validated {len(validated_data)} rows in {filename}")
        except ValueError as e:
            print(f"Validation failed for {filename}: {str(e)}")
@bittremieux

biatNovo-DDA and casanovo use negative score values; instanovo, pepnet and pi-HelixNovo use positive values. Please clarify whether score means a log probability or a confidence score.

This is just what is reported by the tools; we don't interpret them as probabilities or otherwise. The raw values are not used directly; instead, only the corresponding ranking of the PSMs is used.
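Under that reading, a rank-only comparison could be sketched as follows (hypothetical helper, `rank_psms`, not the project's actual code):

```python
def rank_psms(psms):
    """Sort (spectrum_id, score) pairs by score, best first.

    Only the relative order matters, so this works for any tool whose
    scores increase with confidence, whether the raw values are
    negative log-probabilities or positive confidence scores.
    """
    return sorted(psms, key=lambda psm: psm[1], reverse=True)
```

Note the assumption that higher always means better within a single tool's output; that, too, would need to be stated in the spec.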

Ideally one would want a validator added to the codebase that checks the output file and complains if the output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results.

Yes, this should be added.

Thanks for reporting. Indeed, the output should be properly validated and ensured to be correct @PominovaMS. The ambiguity and even incorrect encoding of the aa_scores for example is problematic.


BioGeek commented Oct 14, 2024

The validation code above is based on my interpretation of the output format description, but I am not sure it is correct. So if the description can be updated to remove the ambiguities then I'm happy to submit a PR with the updated validation code.

@bittremieux

Thanks. Some of these inconsistencies have been reported before, but it seems that @PominovaMS hasn't been able to update the instructions and code yet. We'll keep you posted.
