I have a workflow which uses kmc to count all k-mers in an extremely large dataset of about 250,000 fasta files. The workflow was originally built with v3.2.1 of kmc, but it stalled when I updated to v3.2.4. Unfortunately it doesn't exit or report an error. Here are the details I can provide:
The program spends some time printing * characters, and then it prints Stage 1: 0% before stalling. There are 511 bin files in the workdir. Htop shows no processor activity, but the commands are still listed.
Before changing versions, I spent time trying to make sure that none of the fasta.gz files were corrupted:
gzip -t was clean for all genomes
py_fasta_validator did not indicate a problem with any of the fasta formatting
I ran kmc on each genome individually and it returned a result for all of them. It did fail on a few genomes at first, but passed when I reran those; this could be because I used xargs to parallelize 94 at a time.
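For anyone wanting to repeat the integrity check above, here is a minimal Python sketch of roughly what `gzip -t` does: decompress each file to the end and report any that fail. It is not the command used in the original workflow. The function names `gzip_is_intact` and `find_corrupt` and the worker count are assumptions made for illustration.

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so that truncation and CRC errors surface.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is a subclass of OSError; truncated files raise EOFError.
        return False


def find_corrupt(paths, workers=8):
    """Check many files concurrently and return those that failed (hypothetical helper)."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(gzip_is_intact, paths))
    return [p for p, ok in zip(paths, results) if not ok]
```

Calling `find_corrupt` on the list of fasta.gz paths returns the files that could not be fully decompressed. An empty list corresponds to a clean `gzip -t` run. It says nothing about whether the FASTA content itself is well formed.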
The data is publicly available. They are all the genomes I could collect from the GTDB database via NCBI. I'm attaching two lists. One contains the FTP links I used to download all the genomes; they may or may not still be valid. The other is a subset of the genomes that I used when I encountered the error. You'll need about 351G of space to download all the genomes, and the final database ends up being about 4 TB. I'm working on an AWS EC2 instance (r5a.24xlarge) running AWS Linux 2023. KMC was installed using mamba, and the call was made from within a snakemake pipeline which I cannot share.
I've already moved past the problem and have to get to the downstream analysis. I'll try the changes you suggested the next time I run this pipeline (likely in a few weeks).