Output file format not strictly enough defined. #33
Comments
These are just the values reported by the tools; we don't interpret them as probabilities or otherwise. The raw values are not used directly; only the corresponding ranking of the PSMs is used.
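To illustrate why the scale of the scores does not matter when only the ranking is used, here is a minimal sketch. It assumes that a higher score means a better PSM; the `ranks` helper and the example values are made up for illustration and are not part of the benchmark code.

```python
import math


def ranks(scores):
    """Return the rank of each score, where rank 0 is the highest score."""
    # Indices of the scores, sorted from best (highest) to worst (lowest).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Illustrative values only: the same three PSMs scored on two different scales.
probabilities = [0.9, 0.2, 0.5]
log_probabilities = [math.log(p) for p in probabilities]

# A monotonic transformation (here the logarithm) leaves the ranking unchanged.
assert ranks(probabilities) == ranks(log_probabilities)
```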
Yes, this should be added. Thanks for reporting. Indeed, the output should be properly validated and ensured to be correct @PominovaMS. The ambiguity and even incorrect encoding of the |
The validation code above is based on my interpretation of the output format description, but I am not sure it is correct. So if the description can be updated to remove the ambiguities then I'm happy to submit a PR with the updated validation code.
Thanks. Some of these inconsistencies have been reported before, but it seems that @PominovaMS hasn't been able to update the instructions and code yet. We'll keep you posted.
The current description of the output format contains a lot of ambiguities and several algorithms interpret them in different ways.
A non-exhaustive list:
aa_scores
"-3.20,-3.77,-4.74,-5.10,-4.31,-3.78,-3.91,-4.04,-3.52,-4.12,-3.13,-7.27,-4.34,-3.76"
[0.164515 0.235235 0.218719 0.358655 0.252523 0.227940 0.342400 0.456557 0.576003 0.679042 0.927740 0.996059 0.999982 0.999307 0.999602 0.999992 0.999995 0.999997 0.999989 0.999987 0.999997 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]
0.03|0.03|0.03|0.03|0.03|1.0|0.67|0.4|0.14|0.11|0.24|0.22|0.19|0.31
spectrum_id
9_species_human/151009_exo4_1.mgf:0
151009_exo4_1:0
sequence
nan as sequence: 151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
score
151009_exo4_1:11001,1.0,nan,,,430.887054443359,2.0,,151009_exo4_1:11001,151009_exo4_1:11001,
Are extra columns allowed?
spectrum_id,feature_area,sequence,score,aa_scores,precursor_mz,precursor_charge,protein_access_id,scan_list_middle,scan_list_original,predicted_score_max
sequence,PSM_ID,accession,unique,database,database_version,search_engine,score,modifications,retention_time,charge,exp_mass_to_charge,calc_mass_to_charge,spectrum_id,pre,post,start,end,aa_scores
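To make the aa_scores ambiguity concrete, here is a minimal sketch of a lenient parser that accepts all three encodings quoted above (quoted comma-separated, bracketed space-separated, and pipe-separated). The `parse_aa_scores` function is hypothetical and not part of the benchmark code; a stricter specification would make this kind of guessing unnecessary.

```python
import re


def parse_aa_scores(raw):
    """Parse an aa_scores string in any of the three observed encodings."""
    # Drop surrounding whitespace, double quotes, and square brackets.
    cleaned = raw.strip().strip('"').strip("[]")
    # Split on commas, pipes, or whitespace, ignoring empty tokens.
    tokens = [token for token in re.split(r"[,|\s]+", cleaned) if token]
    return [float(token) for token in tokens]
```

Each of the three example lines listed under aa_scores parses to a plain list of floats with this helper, which is what a single, well-defined encoding would give directly.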
Ideally, a validator would be added to the codebase that checks the output file and complains if output.csv is not up to spec. An output file that does not conform to the spec can cause the evaluation step to fail or give wrong results. Something like:
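A minimal sketch of what such a validator could look like follows. This is not the code originally posted, and every rule in it is an assumption made for illustration: the required columns are taken to be the four discussed above, spectrum_id is assumed to have the form `<file stem>:<index>`, score must be a finite number, aa_scores must be a comma-separated list of finite numbers, and extra columns are tolerated by default. The names `validate_output`, `REQUIRED_COLUMNS`, and `SPECTRUM_ID_RE` are hypothetical.

```python
import csv
import io
import math
import re

# Assumed rules for illustration only; the actual spec does not pin these down.
REQUIRED_COLUMNS = {"sequence", "score", "aa_scores", "spectrum_id"}
SPECTRUM_ID_RE = re.compile(r"^[^/:]+:\d+$")  # assumed form: <file stem>:<index>


def _is_finite_float(text):
    """Return True if text parses as a float that is neither NaN nor infinite."""
    try:
        return math.isfinite(float(text))
    except (TypeError, ValueError):
        return False


def validate_output(handle, allow_extra_columns=True):
    """Return a list of human-readable problems found in an output CSV."""
    reader = csv.DictReader(handle)
    columns = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - columns
    if missing:
        # Without the required columns the rows cannot be checked at all.
        return [f"header: missing columns {sorted(missing)}"]
    problems = []
    extra = columns - REQUIRED_COLUMNS
    if extra and not allow_extra_columns:
        problems.append(f"header: unexpected columns {sorted(extra)}")
    for row in reader:
        where = f"line {reader.line_num}"
        spectrum_id = (row["spectrum_id"] or "").strip()
        if not SPECTRUM_ID_RE.match(spectrum_id):
            problems.append(f"{where}: malformed spectrum_id {spectrum_id!r}")
        sequence = (row["sequence"] or "").strip()
        if not sequence or sequence.lower() == "nan":
            problems.append(f"{where}: missing sequence")
        if not _is_finite_float(row["score"]):
            problems.append(f"{where}: score is not a finite number")
        aa_scores = (row["aa_scores"] or "").strip()
        if not aa_scores or not all(
            _is_finite_float(token) for token in aa_scores.split(",")
        ):
            problems.append(f"{where}: aa_scores is not a comma-separated list of numbers")
    return problems
```

Under these assumed rules, the row quoted under sequence and score above would be flagged three times: for the nan sequence, the empty score, and the empty aa_scores. Whether extra columns, path-style spectrum IDs, or other aa_scores separators should be accepted is exactly what the format description would need to settle.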