Write the gene families by ordering the cluster based on their size. #265

jpjarnoux · 2024-08-19T12:53:05Z

This PR is linked to the issue #263.

Previously, the genes of each gene family were written to the TSV file in arbitrary order. As reported in issue #263, this made the clustering appear non-deterministic even though the number of clusters was the same. The clusters are now ordered by size and the genes alphabetically, so the gene_families.tsv file is identical across repeated ppanggolin cluster runs.
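
As an illustration of the ordering described above, here is a minimal sketch, not the code from this PR. The function name `ordered_family_rows`, the dict-based input, the largest-first direction and the tie-break on family name are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def ordered_family_rows(families: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (family, gene) rows in a reproducible order.

    Hypothetical helper: families are sorted by decreasing size (assumed
    direction), ties are broken by family name (assumed), and genes are
    sorted alphabetically within each family.
    """
    rows: List[Tuple[str, str]] = []
    by_size_then_name = sorted(
        families.items(), key=lambda item: (-len(item[1]), item[0])
    )
    for family_name, genes in by_size_then_name:
        for gene in sorted(genes):
            rows.append((family_name, gene))
    return rows


# Toy input with made-up identifiers, deliberately given out of order.
families = {"famB": ["g3"], "famA": ["g2", "g1"], "famC": ["g4"]}
rows = ordered_family_rows(families)
```

Because both sort keys are total orders, the rows no longer depend on dict or set iteration order, which is what makes a byte-for-byte comparison of the output file meaningful.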

This change makes it possible to check in the GitHub Action whether the computed clustering differs from the expected one. We added a checksum file containing the SHA-256 digest of each gene_families.tsv; whenever the action writes one of these files, its digest is checked against the expected value.
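
The action itself uses the shell checksum tools, but the same check can be sketched in Python for local use. This is only an illustration: the helper names `sha256_of_file` and `matches_expected` are hypothetical and not part of ppanggolin.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_expected("gene_families.tsv", expected)` returns `True` only when the file is byte-identical to the one the expected digest was computed from, which is why the deterministic ordering is a prerequisite for this test.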

Also, as mentioned in the issue, the mmseqs2 result2repseq command was missing the --threads option. This has been fixed.
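
For illustration, the thread count can be passed through when the command line is assembled. The sketch below only builds the argument list and does not run mmseqs2. The function name and database paths are hypothetical, and the positional order (sequence DB, result DB, output DB) is my understanding of the result2repseq usage rather than something stated in this PR.

```python
from typing import List


def result2repseq_command(
    seq_db: str, result_db: str, out_db: str, threads: int
) -> List[str]:
    """Build an mmseqs result2repseq argument list (hypothetical helper).

    Assumes the positional order <sequenceDB> <resultDB> <outDB>.
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        "mmseqs", "result2repseq",
        seq_db, result_db, out_db,
        "--threads", str(threads),
    ]


# Placeholder paths, for illustration only.
cmd = result2repseq_command("seq_db", "clu_db", "rep_db", threads=4)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where mmseqs2 is installed.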

N.B. It would be worth adding sha256sum checks for other output files as well.

jpjarnoux added 4 commits August 19, 2024 14:43

Add thread arguments to mmseqs2 result2repseq command

030783f

Make write gene families in order and test it

6b68bf9

Add possibility to write compress file

f204803

Test for sha256

b3a768b

JeanMainguy changed the base branch from master to dev August 19, 2024 13:12

replace sha256sum by shasum -a 256 for mac compatibility

3c2f9c8

jpjarnoux force-pushed the deterCluster branch from e43de5b to 3c2f9c8 Compare August 19, 2024 13:28

jpjarnoux added 2 commits August 19, 2024 15:52

fix use of shasum

7f46cf3

add sha256 to all gene_families.tsv computed

3f28d98

jpjarnoux requested a review from JeanMainguy August 19, 2024 15:28

jpjarnoux mentioned this pull request Aug 19, 2024

Non-deterministic clustering (possibly due to multi-threading) #263

Closed

JeanMainguy approved these changes Aug 20, 2024


JeanMainguy merged commit 3208797 into dev Aug 20, 2024
4 checks passed

JeanMainguy deleted the deterCluster branch October 29, 2024 16:44
