Ability to filter and export a taxonomy in e.g. NCBI taxdump style #112

Open
jfy133 opened this issue Feb 27, 2025 · 2 comments

Comments

@jfy133

jfy133 commented Feb 27, 2025

Please feel free to close if not in scope of the tool

I have been trying to make a very small version of the NCBI taxonomy with just a few species for test-data purposes (metagenomic database building).

However I've been struggling to find a tool that can efficiently build such very small versions of the NCBI taxonomy.

In an ideal world, what I am looking for is a tool to which I can supply the NCBI taxonomy and a few taxon IDs (e.g. at species level), and which will then export, in the same NCBI taxdmp format, just the taxonomic 'lineages' from root to the specified taxon IDs.

Then I can use the resulting nodes.dmp and names.dmp in any tool that accepts NCBI taxonomy as input.

e.g. if I borrow the taxonkit CLI style, assuming the full NCBI taxdmp files downloaded from the NCBI FTP are already in TAXONKIT_DB:

taxonkit subset --taxids 707,9606 -o ./custom_taxdmp/

Where custom_taxdmp/ would contain

nodes.dmp
names.dmp
<...>

And the two example files would contain the contents of the following attached files (which I 'manually' reconstructed with a horrible bash script)
(too big to post)

custom_taxdmp.zip

I hope my request makes sense, and I would like to think this tool would be a suitable place for such functionality.

@shenwei356
Owner

Hi James, it's quite easy.

Step 1: preparing taxids in the subset tree

# here, only keep nodes at the rank of species
taxonkit list --ids 707,9606 -I "" \
    | taxonkit filter -E species \
    | taxonkit lineage -t \
    | cut -f 3 \
    | sed -s 's/;/\n/g' \
    > taxids.txt

# the root node
echo 1 >> taxids.txt
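
For readers who want to see what Step 1 boils down to, here is a minimal sketch of the same idea in plain awk: starting from each leaf taxid, follow the parent column of nodes.dmp up to the root and collect every taxid on the way. It is not how taxonkit does it internally, and it skips the `taxonkit list` / `taxonkit filter` expansion to species. The tree and its taxids are hypothetical toy data, and the `workdir` and `leaves.txt` names are made up for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical toy nodes.dmp (NOT real NCBI data), first three columns only:
# taxid | parent taxid | rank, in the taxdump "\t|\t" layout.
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1   1  "no rank" \
    10  1  genus \
    11  1  genus \
    100 10 species \
    101 100 strain \
    20  1  genus \
    200 20 species \
    > "$workdir/nodes.dmp"

# The leaves whose lineages we want to keep (hypothetical taxids).
printf '%s\n' 101 200 > "$workdir/leaves.txt"

awk -F '\t[|]\t' '
    NR == FNR { parent[$1] = $2; next }    # first file: remember each parent
    {
        id = $1
        # Climb towards the root; stop early at ancestors already emitted.
        while ((id in parent) && !(id in seen)) {
            seen[id] = 1
            print id
            if (parent[id] == id) break    # the root is its own parent
            id = parent[id]
        }
    }
' "$workdir/nodes.dmp" "$workdir/leaves.txt" | sort -n > "$workdir/taxids.txt"

cat "$workdir/taxids.txt"
```

Because the walk only stops once it reaches the node that is its own parent, the root taxid ends up in the list without a separate `echo 1`. Taxid 11 is not an ancestor of either leaf, so it is left out.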

Step 2: extracting data of needed nodes

mkdir subset

# method 1: using https://github.com/shenwei356/csvtk
csvtk grep -Ht -f 1 -P taxids.txt ~/.taxonkit/nodes.dmp > subset/nodes.dmp

# names.dmp contains some bare " characters in non-quoted fields, so fix the quotes before filtering and remove them again afterwards.
cat ~/.taxonkit/names.dmp \
    | csvtk fix-quotes -t \
    | csvtk grep -Ht -f 1 -P taxids.txt \
    | csvtk del-quotes -t \
    > subset/names.dmp

# ------------------------------------------
# method 2: using grep, it's much easier and faster here when filtering via values in the FIRST column.
# grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/nodes.dmp > subset/nodes.dmp
# grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/names.dmp > subset/names.dmp

touch subset/delnodes.dmp subset/merged.dmp
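
As a possible third variant of Step 2, the filtering can also be sketched with awk alone: load taxids.txt into a lookup table and keep every row whose first column is an exact match. This is an alternative to the csvtk and grep methods above, not a replacement endorsed by the tool author. All taxids and names below are hypothetical toy data, and `workdir` is a scratch directory made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
mkdir "$workdir/subset"

# Taxids to keep (hypothetical).
printf '%s\n' 1 10 101 > "$workdir/taxids.txt"

# Hypothetical toy nodes.dmp: taxid | parent taxid | rank
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1    1  "no rank" \
    10   1  genus \
    101  10 species \
    1010 10 species \
    > "$workdir/nodes.dmp"

# Hypothetical toy names.dmp: taxid | name | unique name | name class
# One row carries bare double quotes, like some rows in the real file.
printf '%s\t|\t%s\t|\t%s\t|\t%s\t|\n' \
    1    root                  "" "scientific name" \
    10   "Toy genus"           "" "scientific name" \
    101  "Toy genus alpha"     "" "scientific name" \
    101  'alpha "type" strain' "" synonym \
    1010 "Toy genus beta"      "" "scientific name" \
    > "$workdir/names.dmp"

for f in nodes.dmp names.dmp; do
    # First file fills the lookup table; second file is filtered on column 1.
    awk -F '\t[|]\t' 'NR == FNR { keep[$1]; next } $1 in keep' \
        "$workdir/taxids.txt" "$workdir/$f" > "$workdir/subset/$f"
done

wc -l "$workdir"/subset/*.dmp
```

awk does not treat double quotes specially, so the quoted synonym row passes through unchanged, and the exact match on column 1 means taxid 1010 is not picked up just because 101 or 10 is in the list.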

Checking the result. Since there are only two leaves here, we can simply dump the whole tree:

$ wc -l subset/*.dmp
   0 subset/delnodes.dmp
   0 subset/merged.dmp
 144 subset/names.dmp
  39 subset/nodes.dmp
 183 total

$ taxonkit list --ids 1 --data-dir subset/ -nr
1 [no rank] root
  131567 [no rank] cellular organisms
    2 [superkingdom] Bacteria
      1224 [phylum] Pseudomonadota
        1236 [class] Gammaproteobacteria
          135623 [order] Vibrionales
            641 [family] Vibrionaceae
              662 [genus] Vibrio
                28174 [species] Vibrio ordalii
    2759 [superkingdom] Eukaryota
      33154 [clade] Opisthokonta
        33208 [kingdom] Metazoa
          6072 [clade] Eumetazoa
            33213 [clade] Bilateria
              33511 [clade] Deuterostomia
                7711 [phylum] Chordata
                  89593 [subphylum] Craniata
                    7742 [clade] Vertebrata
                      7776 [clade] Gnathostomata
                        117570 [clade] Teleostomi
                          117571 [clade] Euteleostomi
                            8287 [superclass] Sarcopterygii
                              1338369 [clade] Dipnotetrapodomorpha
                                32523 [clade] Tetrapoda
                                  32524 [clade] Amniota
                                    40674 [class] Mammalia
                                      32525 [clade] Theria
                                        9347 [clade] Eutheria
                                          1437010 [clade] Boreoeutheria
                                            314146 [superorder] Euarchontoglires
                                              9443 [order] Primates
                                                376913 [suborder] Haplorrhini
                                                  314293 [infraorder] Simiiformes
                                                    9526 [parvorder] Catarrhini
                                                      314295 [superfamily] Hominoidea
                                                        9604 [family] Hominidae
                                                          207598 [subfamily] Homininae
                                                            9605 [genus] Homo
                                                              9606 [species] Homo sapiens


$ echo 28174 | taxonkit lineage -nr --data-dir subset/
28174   cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;Vibrio ordalii       Vibrio ordalii  species
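
For larger subsets where dumping the whole tree is impractical, one further check that might be useful is to confirm that every parent taxid in the subset nodes.dmp is itself present as a node. The sketch below is an assumption-laden helper, not part of taxonkit: `check_parents` is a hypothetical function name, and both toy files use made-up taxids.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print every taxid whose parent is absent from the given nodes.dmp;
# return non-zero if at least one such orphan exists.
check_parents() {
    awk -F '\t[|]\t' '
        { parent[$1] = $2 }
        END {
            bad = 0
            for (id in parent)
                if (!(parent[id] in parent)) { print id; bad = 1 }
            exit bad
        }
    ' "$1"
}

workdir=$(mktemp -d)

# Hypothetical complete toy tree: 1 <- 10 <- 101
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1 1 "no rank"  10 1 genus  101 10 species \
    > "$workdir/complete.dmp"

# Hypothetical broken toy tree: node 10 was dropped, so 101 is orphaned.
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1 1 "no rank"  101 10 species \
    > "$workdir/broken.dmp"

if check_parents "$workdir/complete.dmp"; then
    echo "complete.dmp: OK"
fi

if ! orphans=$(check_parents "$workdir/broken.dmp"); then
    echo "broken.dmp: missing parent for taxid(s): $orphans"
fi
```

Pointing the same function at subset/nodes.dmp should print nothing and return zero if Step 1 captured every ancestor, including the root.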

@jfy133
Author

jfy133 commented Feb 28, 2025

Hi @shenwei356 !

This is absolutely perfect! This generated exactly what I needed and solved a problem I've been stuck on for the last couple of days. This should absolutely be added as a tutorial.

You are quite right, that is indeed rather easy, I think I was just thinking about how taxonkit works in the wrong way (and indeed csvtk helps a lot here). I'll definitely take more time to look through the examples more closely.

Thank you very very much!

If it helps, I think the following tutorial title and 'description', or a variant of them, would match the searches I had been trying:

Title: 'Filtering or subsetting taxdmp files to make a custom taxdmp'
First sentence/intro: You want to create a smaller version of the official NCBI taxonomy taxdmp filtered or subset to just the lineages of certain species, for purposes such as creating small test data for testing of tools using taxdmp files.

But of course, just a suggestion :)
