- Working with
geneHummus
- load the package
- Legume family ids
- conserved domains
- SPARCLE architectures
- protein ids
- species table
- XP accessions
- Customizing
geneHummus
- Brassicaceae
- Cucurbitaceae
- Rosaceae
- Solanaceae
library(geneHummus)
Load the object legumesIds
that contains the legumes taxonomy
identifiers. You can easily download the taxids for your own family or
species from the NCBI.
data("legumesIds")
Plant gene families are characterized by common protein structures. The structure that defines a given family can be found in relevant literature. For example, the three conserved domains that define the ARF gene family are:
- B3 DNA binding domain
- Auxin response factor
- AUX/IAA family
NCBI databases host these conserved domains written in this way :
- B3_DNA
- Auxin_resp
- AUX_IAA
You can find this nomenclature or search for your own domains by typing in the search box on top of the Conserved Domain Database (NCBI). That search will give the accession number for each conserved domain.
arf <- c("pfam02362", "pfam06507", "pfam02309")
Now, we want to get the protein architectures. The conserved domain architectures are retrieved from the SPARCLE database (NCBI), a resource for the functional characterization and labeling of protein sequences. For example: the first SPARCLE architectures for "pfam02362" are:
head(getArch_ids(arf[1]), n=10)
#> [1] "13674487" "13674483" "13674476" "13674465" "13674455" "13674450"
#> [7] "13674431" "12034184" "11279088" "11279084"
Not all of the architectures link to ARF proteins. But the architecture
ids for the ARF proteins will be definitely among them. We want to get
the SPARCLE architectures for the vector arf
at once using the
function getArchids
.
archids <- getArch_ids(arf)
We know from relevant literature that the last domain (AUX/IAA) in the
canonical ARF protein structure may or may not be present. Therefore, at
least two domains have to be present. In other words, we want to filter
the SPARCLE architectures that contain, at least, those 2 domains (in
that sequential order!). Also, it is recommended to give the family name as second filter.
The function filterArch_ids
will do the job :
my_filter <- c("B3_DNA", "Auxin_resp")
family_name = "auxin response factor"
filtered_archids <- filterArch_ids(archids, my_filter, family_name)
We can look at the labels of the architectures using the
getArch_labels
function. Our proteins will have any of these SPARCLE
ids and labels:
getArch_labels(filtered_archids)
#> [1] "13674487 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674483 protein containing domains B3_DNA, Auxin_resp, Pox_F17, and AUX_IAA"
#> [1] "13674476 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674465 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674455 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674450 protein containing domains B3_DNA, Auxin_resp, Activator_LAG-3, and AUX_IAA"
#> [1] "12034184 protein containing domains B3_DNA, Auxin_resp, Med15, and PB1"
#> [1] "11130507 B3_DNA and Auxin_resp domain-containing protein"
#> [1] "11130491 protein containing domains B3_DNA, Auxin_resp, and PEARLI-4"
#> [1] "11130490 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "11130489 B3_DNA and Auxin_resp domain-containing protein"
#> [1] "10492348 protein containing domains B3_DNA, Auxin_resp, and PB1"
#> [1] "11279093 protein containing domains B3_DNA, Auxin_resp, and PB1"
#> [1] "11279092 B3_DNA and Auxin_resp domain-containing protein"
#> [1] "10332700 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "10332698 B3_DNA and Auxin_resp domain-containing protein"
Now we'll get all proteins identifiers with any of those architectures.
For that, we use the getProteins_from_tax_ids
function. Depending on
your dataset this step may take from seconds to 3-5 minutes. The
taxonomy family is given as second argument to the function:
arf_legumes = getProteins_from_tax_ids(filtered_archids, legumesIds)
In the exceptional case, the function call causes an error in the HTTP2
framing layer, it is recommended to call again the
getProteins_from_tax_ids
function after running the following code
first:
httr::set_config(httr::config(http_version = 0))
At this point we have likely identified the whole set of ARF protein ids for the Legume family. Let's look at the first ten ARF protein ids:
arf_legumes[1:10]
#> [1] "1150156484" "1150156482" "950933327" "593705262" "1379669790"
#> [6] "357520645" "1021547483" "1431757174" "1150137558" "593188509"
Finally we will get the protein accessions (usually referred as "XP" ids
in the RefSeq database). We want to use here the getAccessions
function.
arf_accs = getAccessions(arf_legumes)
This is our final results. The object is a data frame with two columns
(accession
, and organism
). Let's look at the first elements of the
object.
head(arf_accs)
#> accession organism
#> 1 XP_025621726 Arachis hypogaea
#> 2 XP_014518560 Vigna radiata var. radiata
#> 3 XP_006592682 Glycine max
#> 4 XP_025665252 Arachis hypogaea
#> 5 XP_003600594 Medicago truncatula
#> 6 XP_003616115 Medicago truncatula
At this point, we can go over the results with two different approaches.
The first one generates a summary of total number of proteins per species
by using the function accessions_by_spp
.
knitr::kable(accessions_by_spp(arf_accs))
organism | N.seqs |
---|---|
Abrus precatorius | 94 |
Arachis duranensis | 57 |
Arachis hypogaea | 120 |
Arachis ipaensis | 63 |
Cajanus cajan | 56 |
Cicer arietinum | 52 |
Glycine max | 87 |
Lupinus angustifolius | 92 |
Medicago truncatula | 52 |
Phaseolus vulgaris | 31 |
Vigna angularis | 43 |
Vigna radiata var. radiata | 42 |
The second approach uses the function accessions_from_spp
to obtain
the proteins for a given species. For example, the XP accessions for
Medicago are :
medicago = accessions_from_spp(arf_accs, "Medicago truncatula")
Look at the first elements of the object medicago
:
head(medicago, n = 10)
#> [1] "XP_003600594" "XP_003616115" "XP_013445290" "XP_024638199"
#> [5] "XP_024640656" "XP_013448901" "XP_003614872" "XP_003593664"
#> [9] "XP_003593869" "XP_024638577"
We have already summarized the data and the protein XP accessions for
each species are contained in the object list my_legumes
. The XPs are
sorted in the following order:
- chickpea
- medicago
- soybean
- arachis_duranensis
- arachis_ipaensis
- cajanus
- vigna_angulata
- vigna_radiata
- phaseolus
- lupinus
For example, the ARF proteins for chickpea are:
data("my_legumes")
my_legumes[[1]]
#> [1] "XP_012573396" "XP_004503803" "XP_004490828" "XP_012568938"
#> [5] "XP_012573395" "XP_004504543" "XP_012572776" "XP_004497510"
#> [9] "XP_004508020" "XP_012571326" "XP_004504542" "XP_004505104"
#> [13] "XP_012569005" "XP_004485845" "XP_012574137" "XP_004511136"
#> [17] "XP_004508021" "XP_004485416" "XP_004510661" "XP_004508019"
#> [21] "XP_004487099" "XP_004508018" "XP_004485417" "XP_004505967"
#> [25] "XP_004485979" "XP_004508022" "XP_004506012" "XP_012571810"
#> [29] "XP_004488112" "XP_012570270" "XP_004510646" "XP_004490754"
#> [33] "XP_012574328" "XP_012570274" "XP_004488113" "XP_004510662"
#> [37] "XP_012574145" "XP_012567350" "XP_012572936" "XP_012570835"
#> [41] "XP_012571327" "XP_004485844" "XP_004505103" "XP_004503551"
#> [45] "XP_004503553"
geneHummus
can be customized to be suitable for other agronomically
important taxonomic families beyond legumes. For that, we´ll use the
same functions as we did earlier and pass the corresponding taxonomic
filter as argument. You can download from the NCBI other taxonomy ids
and customize your search for your own species. When installing the
genehummus
package, you'll have access to several objects that contain
the taxonomy ids for the families:
- Brassicaceae:
brassicaceaeIds
- Cucurbitaceae:
cucurbitaceaeIds
- Rosaceae:
rosaceaeIds
- Solanaceae:
solanaceaeIds
You can see the ids by calling, for example :
head(geneHummus:::brassicaceaeIds)
#> [1] "2358338" "2340982" "2340889" "2340888" "2340887" "2340886"
For example, this chunk will get the ARF protein ids for the Solanaceae family:
solids = geneHummus:::solanaceaeIds
arf_sol = getProteins_from_tax_ids(filtered_archids, solids)
Then, we can get the the XP accessions for each species in the same way we did for the legume species.
sol_accs = getAccessions(arf_sol)
head(accessions_from_spp(sol_accs, "Solanum tuberosum"))
#> [1] "XP_006349280" "XP_006350452" "XP_015170637" "XP_006339723"
#> [5] "XP_006343312" "XP_006357892"
head(accessions_from_spp(sol_accs, "Nicotiana tabacum"))
#> [1] "XP_016445203" "XP_016513811" "XP_016510734" "XP_016435534"
#> [5] "XP_016487494" "XP_016441644"
Using geneHummus
with the identifiers for the default taxonomy
families, we have already identified the deduced ARF gene family members
based on current genomic resources for the following species:
Family | Species | # ARF proteins |
---|---|---|
Brassicaceae | Brassica napus | 123 |
Brassicaceae | Camelina sativa | 86 |
Brassicaceae | Raphanus sativus | 52 |
Brassicaceae | Brassica oleracea | 49 |
Brassicaceae | Brassica rapa | 48 |
Brassicaceae | Arabidopsis thaliana | 44 |
Brassicaceae | Capsella rubella | 28 |
Brassicaceae | Eutrema salsugineum | 26 |
Brassicaceae | Arabidopsis lyrata | 24 |
Cucurbitaceae | Cucurbita maxima | 71 |
Cucurbitaceae | Cucurbita pepo | 71 |
Cucurbitaceae | Cucurbita moschata | 67 |
Cucurbitaceae | Cucumis sativus | 27 |
Cucurbitaceae | Momordica charantia | 26 |
Cucurbitaceae | Cucumis melo | 24 |
Rosaceae | Pyrus x bretschneideri | 51 |
Rosaceae | Malus domestica | 48 |
Rosaceae | Prunus avium | 33 |
Rosaceae | Rosa chinensis | 28 |
Rosaceae | Prunus persica | 27 |
Rosaceae | Prunus mume | 25 |
Rosaceae | Fragaria vesca | 23 |
Solanaceae | Nicotiana tabacum | 103 |
Solanaceae | Nicotiana tomentosiformis | 73 |
Solanaceae | Nicotiana sylvestris | 56 |
Solanaceae | Nicotiana attenuata | 49 |
Solanaceae | Capsicum annuum | 46 |
Solanaceae | Solanum tuberosum | 43 |
Solanaceae | Solanum lycopersicum | 39 |
Solanaceae | Solanum pennellii | 34 |
If you run into problem using geneHummus
, or just need help with the
package, you can contact us by opening an issue at the
github
repository.