Adjusting Parameters

Weighting

There are several options for adjusting the feature weighting, but the main distinction is between uniform and statistical weighting. Uniform weighting gives every feature the same weight in the scoring. It is the default if no reference protein set is given. We also recommend it when the features of interest are very abundant. For example, features such as Transmembrane domains are very common in the protein set of a taxon, which leads to lower weights under statistical weighting.

We generally recommend the statistical weighting for standard analyses, as it assigns weights based on feature abundance, so that rare features have more impact. This does not necessarily reflect the biological importance of a feature. However, it reduces the impact of highly abundant features such as low complexity regions, whose importance to the function is hard to discern.
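
To make the difference between the two weighting schemes concrete, here is a minimal sketch. It is not the formula FAS uses: the inverse-count weighting, the function names, the feature name `pfam_rare_domain`, and the reference counts are all assumptions chosen for illustration. It only shows the general idea that an abundant feature ends up with a lower weight than a rare one.

```python
from collections import Counter


def uniform_weights(features):
    """Give every distinct feature of an architecture the same weight."""
    distinct = sorted(set(features))
    return {f: 1.0 / len(distinct) for f in distinct}


def abundance_weights(features, reference_counts):
    """Weight features inversely to their count in a reference protein set.

    Illustrative assumption only: FAS may use a different function.
    """
    distinct = sorted(set(features))
    # The +1 avoids a division by zero for features absent from the reference.
    raw = {f: 1.0 / (reference_counts.get(f, 0) + 1) for f in distinct}
    total = sum(raw.values())
    # Normalize so that the weights of one architecture sum to 1.
    return {f: w / total for f, w in raw.items()}


# Hypothetical reference counts: one very common and one rare feature.
reference_counts = Counter({"tmhmm_transmembrane": 999, "pfam_rare_domain": 9})
architecture = ["tmhmm_transmembrane", "pfam_rare_domain"]

uniform = uniform_weights(architecture)
statistical = abundance_weights(architecture, reference_counts)
```

In this toy setup the rare domain dominates the abundance-based weights, while both features count equally under uniform weighting.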

We do not recommend changing the weight correction function in standard analyses. This option is mainly for fine-tuning the weighting. It might be of interest when setting up a standard reference protein set. (We usually recommend using the protein set of the seed or ortholog taxa as reference; however, this may not always be feasible.) There are no definitive settings to recommend here. We suggest testing the different functions and deciding based on the results.

Adjusting Feature Weights Manually

Lastly, the minimal weight constraints option is useful when a specific feature is of particular interest. Again, this option has no recommended settings, as it depends highly on the research question. For example, if one is interested in proteins that contain Transmembrane domains, it might be of interest to set the minimal weight to 0.5 to make sure that this domain drives the scoring even in architectures with many features:

 #tool_constraints
 coils N
 flps N
 pfam N
 seg N
 signalp N
 smart N
 tmhmm N
 #feature_constraints
 tmhmm_transmembrane 0.5

An example of this can be seen in the figure below. Here, we compared the standard statistical weighting against the weighting with the minimal weight for Transmembrane domains set to 0.5. We calculated both versions for a set of orthologs predicted by OMA pairwise for the Quest for Orthologs benchmark service and plotted them against each other in a scatter plot. The majority of pairs, in which neither protein has any Transmembrane domains, show no change in score (diagonal from 0,0 to 1,1). Pairs with Transmembrane domains, however, can differ in score.

As an example of a specific protein pair, the next figure shows the architectures of Q9Y6X5 and P39997, a case where one protein has a Transmembrane domain while the other lacks it. By default, the domain gets a weight of 0.16, resulting in a FAS score of 0.61. With the minimal weight of 0.5 for Transmembrane domains, this domain has a higher weight, which further reduces the score to 0.28.

Note that this option is only available when using the statistical, reference-based weighting.
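
When testing several minimal weights, it can be convenient to generate the constraints file from a script. The following is a small sketch that reproduces the layout of the example file shown earlier in this section. The function name `format_constraints` is hypothetical, the `N` placeholder for the tool constraints is copied as-is from the example, and the exact whitespace FAS expects is an assumption.

```python
def format_constraints(tools, feature_min_weights):
    """Build the text of a constraints file in the layout of the example above.

    tools: names of the annotation tools, each written with the 'N' placeholder
    feature_min_weights: mapping of feature id to its minimal weight
    """
    lines = ["#tool_constraints"]
    lines += [f"{tool} N" for tool in tools]
    lines.append("#feature_constraints")
    lines += [f"{feature} {weight}" for feature, weight in feature_min_weights.items()]
    return "\n".join(lines) + "\n"


tools = ["coils", "flps", "pfam", "seg", "signalp", "smart", "tmhmm"]
text = format_constraints(tools, {"tmhmm_transmembrane": 0.5})

# To store the result, write it to a path of your choice, for example:
# with open("constraints.txt", "w") as handle:
#     handle.write(text)
```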

Score weighting

The score weighting decides how much the individual scores (Multiplicity Score, MS; Positional Score, PS) contribute to the combined FAS score. As the .tsv output of FAS also provides the individual scores, we do not recommend changing these weights. However, as the linearization is driven by the combined FAS score, changing them might be of interest when working with proteins where large differences in the position of features are expected. This might be relevant when using FAS to find similar feature architectures independent of gene orthology, as was done in the comparison with BLAST in FACT. It might also be of interest when comparing different splice forms of genes affected by alternative splicing, as this can shift the position of features in the resulting protein isoforms. Here, one can increase the weight of the PS (~0.5 to 0.7).

Another change that might be of interest is to include the Clan Score (CS) in the combined FAS score. The CS is similar to the MS; however, it uses (Pfam) clan ids instead of feature ids when available. This means that features are considered identical if they are part of the same clan. As we only have clan information for Pfam, and the linearization is implemented to handle overlapping annotations, this score does not contribute to the combined score by default. However, it might be of interest when working with feature sets that provide clan or similar information on which feature types are similar (e.g., InterPro). Here, one might want to add the CS or even replace the MS with it.
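
The effect of the score weights can be illustrated with a short sketch. It assumes that the combined score is a weighted sum of the individual scores and that the weights sum to 1. The function name, the individual scores, and the weight values are made up for illustration and are not FAS defaults.

```python
def combined_fas_score(ms, ps, cs, w_ms, w_ps, w_cs):
    """Combine MS, PS and CS as a weighted sum (assumed form of the score)."""
    total = w_ms + w_ps + w_cs
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_ms * ms + w_ps * ps + w_cs * cs


# Hypothetical individual scores for one protein pair.
ms, ps, cs = 0.8, 0.4, 0.9

# Emphasize feature positions, e.g. when comparing splice isoforms.
position_heavy = combined_fas_score(ms, ps, cs, w_ms=0.4, w_ps=0.6, w_cs=0.0)

# Replace the MS with the CS, e.g. for feature sets with clan-like grouping.
clan_instead_of_ms = combined_fas_score(ms, ps, cs, w_ms=0.0, w_ps=0.3, w_cs=0.7)
```

Because the individual scores are reported in the .tsv output, such alternative combinations can also be explored after a run without recalculating anything; only the linearization depends on the weights used during the run.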

Overlap Thresholds

The overlap thresholds make it possible to ignore small overlaps. By default, FAS takes all overlaps into account. Optimal overlap thresholds depend somewhat on the features; however, we do not recommend going higher than 30 AA for the max overlap and 50% for the max overlap percentage.
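
The following sketch shows how two such thresholds could interact. It rests on assumptions that are not confirmed by FAS itself: an overlap is ignored only if it stays within both thresholds, the percentage is measured against the shorter of the two features, and features are given as inclusive (start, end) positions. All function and parameter names are hypothetical.

```python
def overlap_length(a, b):
    """Number of shared positions of two features given as inclusive (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_ignorable_overlap(a, b, max_overlap, max_overlap_fraction):
    """Return True if the overlap of a and b is small enough to be ignored.

    Assumption: both thresholds must be met, and the fraction refers to
    the shorter of the two features.
    """
    shared = overlap_length(a, b)
    if shared == 0:
        return True  # the features do not overlap at all
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared <= max_overlap and shared / shorter <= max_overlap_fraction


# Two 100+ AA features sharing 10 positions: ignorable at the upper limits.
small = is_ignorable_overlap((1, 100), (91, 200), max_overlap=30, max_overlap_fraction=0.5)

# A 20 AA feature sharing 15 positions: under 30 AA, but 75% of the feature.
short_feature = is_ignorable_overlap((1, 20), (6, 100), max_overlap=30, max_overlap_fraction=0.5)
```

The second case illustrates why a percentage threshold is useful in addition to an absolute one: a short feature can be almost completely covered by an overlap that is small in absolute terms.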

Priority Mode thresholds

We do not recommend deactivating priority mode completely, as the calculation time for certain proteins becomes infeasible. It might be of interest to change the thresholds at which priority mode is applied. This is a trade-off between calculation speed and precision. The default thresholds are set more towards speed; as such, lowering them will not increase calculation speed by much. The thresholds can be increased; however, we do not recommend increasing the priority threshold and max cardinality past 100 and 100000, respectively.

e-value thresholds

By default, the e-value threshold of FAS for Pfam/SMART domains is a bit more inclusive. For alternative values, we recommend looking at the default cutoffs of the individual feature classes.