Skip to content

Latest commit

 

History

History
503 lines (436 loc) · 13.9 KB

genHummus.md

File metadata and controls

503 lines (436 loc) · 13.9 KB

Content

  1. Working with geneHummus
  1. Customizing geneHummus
  • Brassicaceae
  • Cucurbitaceae
  • Rosaceae
  • Solanaceae

geneHummus

Load the package

library(geneHummus)

Read Legumes taxids

Load the object legumesIds that contains the legumes taxonomy identifiers. You can easily download the taxids for your own family or species from the NCBI.

data("legumesIds")

Define conserved domains

Plant gene families are characterized by common protein structures. The structure that defines a given family can be found in relevant literature. For example, the three conserved domains that define the ARF gene family are:

  • B3 DNA binding domain
  • Auxin response factor
  • AUX/IAA family

NCBI databases host these conserved domains written in this way :

  • B3_DNA
  • Auxin_resp
  • AUX_IAA

You can find this nomenclature or search for your own domains by typing in the search box on top of the Conserved Domain Database (NCBI). That search will give the accession number for each conserved domain.

arf <- c("pfam02362", "pfam06507", "pfam02309")

Obtain SPARCLE architectures

Now, we want to get the protein architectures. The conserved domain architectures are retrieved from the SPARCLE database (NCBI), a resource for the functional characterization and labeling of protein sequences. For example: the first SPARCLE architectures for "pfam02362" are:

head(getArch_ids(arf[1]), n=10)
#>  [1] "13674487" "13674483" "13674476" "13674465" "13674455" "13674450"
#>  [7] "13674431" "12034184" "11279088" "11279084"

Not all of the architectures link to ARF proteins. But the architecture ids for the ARF proteins will be definitely among them. We want to get the SPARCLE architectures for the vector arf at once using the function getArchids.

archids <- getArch_ids(arf)

We know from relevant literature that the last domain (AUX/IAA) in the canonical ARF protein structure may or may not be present. Therefore, at least two domains have to be present. In other words, we want to filter the SPARCLE architectures that contain, at least, those 2 domains (in that sequential order!). Also, it is recommended to give the family name as second filter.
The function filterArch_ids will do the job :

my_filter <- c("B3_DNA", "Auxin_resp")
family_name = "auxin response factor" 
filtered_archids <- filterArch_ids(archids, my_filter, family_name)

Get SPARCLE labels

We can look at the labels of the architectures using the getArch_labels function. Our proteins will have any of these SPARCLE ids and labels:

getArch_labels(filtered_archids)
#> [1] "13674487 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674483 protein containing domains B3_DNA, Auxin_resp, Pox_F17, and AUX_IAA"
#> [1] "13674476 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674465 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674455 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "13674450 protein containing domains B3_DNA, Auxin_resp, Activator_LAG-3, and AUX_IAA"
#> [1] "12034184 protein containing domains B3_DNA, Auxin_resp, Med15, and PB1"
#> [1] "11130507 B3_DNA and Auxin_resp domain-containing protein"
#> [1] "11130491 protein containing domains B3_DNA, Auxin_resp, and PEARLI-4"
#> [1] "11130490 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "11130489 B3_DNA and Auxin_resp domain-containing protein"
#> [1] "10492348 protein containing domains B3_DNA, Auxin_resp, and PB1"
#> [1] "11279093 protein containing domains B3_DNA, Auxin_resp, and PB1"
#> [1] "11279092 B3_DNA and Auxin_resp domain-containing protein"
#> [1] "10332700 protein containing domains B3_DNA, Auxin_resp, and AUX_IAA"
#> [1] "10332698 B3_DNA and Auxin_resp domain-containing protein"

Collect protein ids

Now we'll get all proteins identifiers with any of those architectures. For that, we use the getProteins_from_tax_ids function. Depending on your dataset this step may take from seconds to 3-5 minutes. The taxonomy family is given as second argument to the function:

arf_legumes = getProteins_from_tax_ids(filtered_archids, legumesIds)

In the exceptional case, the function call causes an error in the HTTP2 framing layer, it is recommended to call again the getProteins_from_tax_ids function after running the following code first:

httr::set_config(httr::config(http_version = 0))

At this point we have likely identified the whole set of ARF protein ids for the Legume family. Let's look at the first ten ARF protein ids:

arf_legumes[1:10]
#>  [1] "1150156484" "1150156482" "950933327"  "593705262"  "1379669790"
#>  [6] "357520645"  "1021547483" "1431757174" "1150137558" "593188509"

Collect protein accessions

Finally we will get the protein accessions (usually referred as "XP" ids in the RefSeq database). We want to use here the getAccessions function.

arf_accs = getAccessions(arf_legumes)

This is our final results. The object is a data frame with two columns (accession, and organism). Let's look at the first elements of the object.

head(arf_accs)
#>      accession                   organism
#> 1 XP_025621726           Arachis hypogaea
#> 2 XP_014518560 Vigna radiata var. radiata
#> 3 XP_006592682                Glycine max
#> 4 XP_025665252           Arachis hypogaea
#> 5 XP_003600594        Medicago truncatula
#> 6 XP_003616115        Medicago truncatula

Number of proteins per species

At this point, we can go over the results with two different approaches. The first one generates a summary of total number of proteins per species by using the function accessions_by_spp.

knitr::kable(accessions_by_spp(arf_accs))
organism N.seqs
Abrus precatorius 94
Arachis duranensis 57
Arachis hypogaea 120
Arachis ipaensis 63
Cajanus cajan 56
Cicer arietinum 52
Glycine max 87
Lupinus angustifolius 92
Medicago truncatula 52
Phaseolus vulgaris 31
Vigna angularis 43
Vigna radiata var. radiata 42

Proteins for a given species

The second approach uses the function accessions_from_spp to obtain the proteins for a given species. For example, the XP accessions for Medicago are :

medicago = accessions_from_spp(arf_accs, "Medicago truncatula")

Look at the first elements of the object medicago :

head(medicago, n = 10)
#>  [1] "XP_003600594" "XP_003616115" "XP_013445290" "XP_024638199"
#>  [5] "XP_024640656" "XP_013448901" "XP_003614872" "XP_003593664"
#>  [9] "XP_003593869" "XP_024638577"

We have already summarized the data and the protein XP accessions for each species are contained in the object list my_legumes. The XPs are sorted in the following order:

  • chickpea
  • medicago
  • soybean
  • arachis_duranensis
  • arachis_ipaensis
  • cajanus
  • vigna_angulata
  • vigna_radiata
  • phaseolus
  • lupinus

For example, the ARF proteins for chickpea are:

data("my_legumes")
my_legumes[[1]]
#>  [1] "XP_012573396" "XP_004503803" "XP_004490828" "XP_012568938"
#>  [5] "XP_012573395" "XP_004504543" "XP_012572776" "XP_004497510"
#>  [9] "XP_004508020" "XP_012571326" "XP_004504542" "XP_004505104"
#> [13] "XP_012569005" "XP_004485845" "XP_012574137" "XP_004511136"
#> [17] "XP_004508021" "XP_004485416" "XP_004510661" "XP_004508019"
#> [21] "XP_004487099" "XP_004508018" "XP_004485417" "XP_004505967"
#> [25] "XP_004485979" "XP_004508022" "XP_004506012" "XP_012571810"
#> [29] "XP_004488112" "XP_012570270" "XP_004510646" "XP_004490754"
#> [33] "XP_012574328" "XP_012570274" "XP_004488113" "XP_004510662"
#> [37] "XP_012574145" "XP_012567350" "XP_012572936" "XP_012570835"
#> [41] "XP_012571327" "XP_004485844" "XP_004505103" "XP_004503551"
#> [45] "XP_004503553"

Customizing GeneHummus

geneHummus can be customized to be suitable for other agronomically important taxonomic families beyond legumes. For that, we´ll use the same functions as we did earlier and pass the corresponding taxonomic filter as argument. You can download from the NCBI other taxonomy ids and customize your search for your own species. When installing the genehummus package, you'll have access to several objects that contain the taxonomy ids for the families:

  • Brassicaceae: brassicaceaeIds
  • Cucurbitaceae: cucurbitaceaeIds
  • Rosaceae: rosaceaeIds
  • Solanaceae: solanaceaeIds

You can see the ids by calling, for example :

head(geneHummus:::brassicaceaeIds)
#> [1] "2358338" "2340982" "2340889" "2340888" "2340887" "2340886"

For example, this chunk will get the ARF protein ids for the Solanaceae family:

solids = geneHummus:::solanaceaeIds
arf_sol = getProteins_from_tax_ids(filtered_archids, solids)

Then, we can get the the XP accessions for each species in the same way we did for the legume species.

sol_accs = getAccessions(arf_sol)
head(accessions_from_spp(sol_accs, "Solanum tuberosum"))
#> [1] "XP_006349280" "XP_006350452" "XP_015170637" "XP_006339723"
#> [5] "XP_006343312" "XP_006357892"
head(accessions_from_spp(sol_accs, "Nicotiana tabacum"))
#> [1] "XP_016445203" "XP_016513811" "XP_016510734" "XP_016435534"
#> [5] "XP_016487494" "XP_016441644"

Using geneHummus with the identifiers for the default taxonomy families, we have already identified the deduced ARF gene family members based on current genomic resources for the following species:

Family Species # ARF proteins
Brassicaceae Brassica napus 123
Brassicaceae Camelina sativa 86
Brassicaceae Raphanus sativus 52
Brassicaceae Brassica oleracea 49
Brassicaceae Brassica rapa 48
Brassicaceae Arabidopsis thaliana 44
Brassicaceae Capsella rubella 28
Brassicaceae Eutrema salsugineum 26
Brassicaceae Arabidopsis lyrata 24
Cucurbitaceae Cucurbita maxima 71
Cucurbitaceae Cucurbita pepo 71
Cucurbitaceae Cucurbita moschata 67
Cucurbitaceae Cucumis sativus 27
Cucurbitaceae Momordica charantia 26
Cucurbitaceae Cucumis melo 24
Rosaceae Pyrus x bretschneideri 51
Rosaceae Malus domestica 48
Rosaceae Prunus avium 33
Rosaceae Rosa chinensis 28
Rosaceae Prunus persica 27
Rosaceae Prunus mume 25
Rosaceae Fragaria vesca 23
Solanaceae Nicotiana tabacum 103
Solanaceae Nicotiana tomentosiformis 73
Solanaceae Nicotiana sylvestris 56
Solanaceae Nicotiana attenuata 49
Solanaceae Capsicum annuum 46
Solanaceae Solanum tuberosum 43
Solanaceae Solanum lycopersicum 39
Solanaceae Solanum pennellii 34

If you run into problem using geneHummus , or just need help with the package, you can contact us by opening an issue at the github repository.