v3.2.4 stalls on very large dataset but v3.2.1 does not #238

Open
peter-kanvas opened this issue Jul 4, 2024 · 2 comments

Comments

@peter-kanvas

I have a workflow which uses kmc to count all k-mers in an extremely large dataset of about 250,000 fasta files. The workflow was originally built with v3.2.1 of kmc, but it stalled when I updated to v3.2.4. Unfortunately it doesn't exit or report an error. Here are the details I can provide:

KMC call:

kmc -fm -ci0 -cx100000000000 -t94 -k75 -m745 @reference_list database databse_dir

Result:

The program spends some time printing * characters, and then it prints Stage 1: 0% before stalling. There are 511 bin files in the workdir. Htop shows no processor activity, but the commands are still listed.
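
Because the stalled process neither exits nor reports an error, a pipeline step can sit idle indefinitely. As a minimal sketch (not a KMC feature), the call could be wrapped in a wall-clock limit so a hung run fails loudly. The helper name `run_with_deadline` and the 48-hour limit in the usage comment are assumptions made for illustration; a limit would have to be chosen well above the time a healthy run takes.

```python
import subprocess


def run_with_deadline(cmd, max_seconds):
    """Run cmd and return its exit code.

    Returns None if the command exceeded max_seconds; in that case
    subprocess.run kills the child process before raising TimeoutExpired.
    """
    try:
        return subprocess.run(cmd, timeout=max_seconds).returncode
    except subprocess.TimeoutExpired:
        return None


# Hypothetical usage with the call from this report (the limit is an assumption):
# cmd = ["kmc", "-fm", "-ci0", "-cx100000000000", "-t94", "-k75", "-m745",
#        "@reference_list", "database", "databse_dir"]
# if run_with_deadline(cmd, max_seconds=48 * 3600) is None:
#     raise SystemExit("kmc exceeded the wall-clock limit; treating it as stalled")
```

A wall-clock limit cannot tell a stall from a slow run, so it only helps when the expected runtime is roughly known.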

Before changing versions, I spent time trying to make sure that none of the fasta.gz files were corrupted.

  • gzip -t was clean for all genomes
  • py_fasta_validator did not indicate a problem with any of the fasta formatting
  • I ran kmc on each genome individually and it returned a result for all of them. It did fail on a few genomes at first, but those passed when I reran them; this could be because I used xargs to run 94 at a time.
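
For anyone who wants to repeat the integrity check without shelling out, here is a rough Python equivalent of the `gzip -t` pass. It is only a sketch: the function names are made up for this example, and it checks that each file decompresses end to end, not that the FASTA content is valid.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses cleanly (similar in spirit to `gzip -t`)."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in chunks so truncation and CRC errors surface.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is an OSError subclass; truncated files raise EOFError.
        return False


def find_corrupt(paths):
    """Return the paths (as strings) that fail the decompression check."""
    return [str(p) for p in paths if not gzip_is_intact(p)]
```

Running `find_corrupt` over the file names in the input list would give the subset worth re-downloading before a full run.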
@marekkokot
Contributor

Hello,

this sounds bad.
Is this data downloadable somehow, so that I could try to reproduce this bug?

Some ideas you could try to narrow it down:

  • use fewer threads and less RAM
  • check on some subset of input files, for example, if it also occurs on half of the files, a quarter of the files, and so on

I would really like to fix it because it seems to be quite a serious bug, but without reproducing this, it may be really challenging.
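
The subset idea above (half of the files, then a quarter, and so on) could be sketched as below. This assumes the `@reference_list` file holds one input path per line; the helper names and the `subset_N.txt` naming are invented for this example. Each written subset can then be passed to kmc as `@subset_N.txt` in place of the full list.

```python
from pathlib import Path


def split_list(entries, parts=2):
    """Split entries into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(entries), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional entry.
        end = start + size + (1 if i < extra else 0)
        chunks.append(entries[start:end])
        start = end
    return chunks


def write_subsets(list_file, out_dir, parts=2):
    """Write each chunk of the input list to out_dir/subset_<i>.txt and return the paths."""
    lines = Path(list_file).read_text().splitlines()
    entries = [ln.strip() for ln in lines if ln.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_list(entries, parts)):
        subset = out_dir / f"subset_{i}.txt"
        subset.write_text("\n".join(chunk) + "\n")
        paths.append(subset)
    return paths
```

If only one half stalls, splitting that half again narrows the search; if every subset passes, the problem more likely depends on total input size than on a particular file.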

@peter-kanvas
Author

The data is publicly available. They are all the genomes I could collect from the gtdb database via NCBI. I'm attaching two lists. One is the ftp links I used to download all the genomes. They may or may not still be valid. The other is a subset of the genomes that I used when I encountered the error. You'll need about 351G of space to download all the genomes, and the final database ends up being about 4 TB. I'm working on an AWS EC2 instance (r5a.24xlarge) running AWS Linux 2023. KMC was installed using mamba, and the call was made from within a snakemake pipeline which I cannot share.

I've already moved past the problem and have to get to the downstream analysis. I'll try the changes you suggested the next time I run this pipeline (likely in a few weeks).

reference_genome_list_kan002_v3.txt.gz
gtdb_in_genbank_ftp_links.txt.gz
