diff --git a/004-NCBI/01-NCBI_overview.Rmd b/004-NCBI/01-NCBI_overview.Rmd index ec6fc6c..d93e130 100755 --- a/004-NCBI/01-NCBI_overview.Rmd +++ b/004-NCBI/01-NCBI_overview.Rmd @@ -2,6 +2,15 @@ **NOTE:** The following material was partially adapted by N. Brouwer from [Wikipedia](https://en.wikipedia.org/wiki/National_Center_for_Biotechnology_Information). See the underlying .Rmd file for information on specific paragraphs from Wikipedia. +## Key concepts + +* NCBI +* Rentrez +* accession numbers +* BLAST + +## NCBI + The **National Center for Biotechnology Information (NCBI)** is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It was founded in 1988 and is approved and funded by the government of the United States. The NCBI houses a series of **databases** relevant to the basic and applied life sciences and is an important resource for **bioinformatics** tools and services. Major databases include **GenBank** for DNA sequences and **PubMed**, a **bibliographic** database for biomedical literature. All these databases are available online through the **Entrez** search engine. @@ -12,7 +21,7 @@ In this chapter we'll briefly discuss the major databases, following up with spe The GenBank sequence database is an open access collection of publicly available DNA and protein sequences. If you've work with sequence data, you'll work with GenBank. GenBank is the actual database, and it can be searched several ways. For example, you can search for a sequence by its ID number (**accession number**) if you know it, or do a **BLAST search** using an actual sequence to look for similar sequences. -A key component of GenBank are the **GeneBank Records**, which are annotated summaries of sequences in the databases. For example, below is shown record for a gene [pallysin](https://www.ncbi.nlm.nih.gov/protein/AAC65720) from syphilis In addition to the actual A, T, C, and Gs of the sequences, the record provides **metadata**, such as the scientific name of the organism (*Treponema pallidum*), who did the sequencing, the name of the paper where the sequence was published, and important features of the gene. +A key component of GenBank are the **GeneBank Records**, which are annotated summaries of sequences in the databases. For example, below is shown record for a gene [pallysin](https://www.ncbi.nlm.nih.gov/protein/AAC65720) https://www.ncbi.nlm.nih.gov/protein/AAC65720 from syphilis In addition to the actual A, T, C, and Gs of the sequences, the record provides **metadata**, such as the scientific name of the organism (*Treponema pallidum*), who did the sequencing, the name of the paper where the sequence was published, and important features of the gene. A key feature of PubMed records is that they are **hyperlinked** to other NCBI databases. For example, you can click link under the name of the paper which reported the sequence of the gene and it will take you to the PubMed record for that paper (see below). You can also click the "Run BLAST" link and you can search the database for similar sequences. This protein coded for by this particular gene has had its structured solved using x-ray crystallography, and you can see these results under "Protein 3D Structure." In a later chapter we'll get to know these records in further detail. @@ -29,20 +38,18 @@ knitr::include_graphics("images/genbank_record.png") ## Entrez -https://en.wikipedia.org/wiki/Entrez - + -"Entrez is a federated search engine and web portal that allows users to search many discrete health sciences databases of the NCBI website. The name "Entrez" (a greeting meaning "Come in" in French) was chosen to reflect the spirit of welcoming the public to search the content available from NCBI." +Entrez is a search engine and web portal that allows users to search many discrete health sciences databases of the NCBI website. The name "Entrez" (a greeting meaning "Come in" in French) was chosen to reflect the spirit of welcoming the public to search the content available from NCBI. -"Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system." +Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system. ## BLAST -https://en.wikipedia.org/wiki/BLAST_(biotechnology) - + -"BLAST (basic local alignment search tool)is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence." +BLAST (Basic Local Alignment Search Tool) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. diff --git a/004-NCBI/02-NCBI_genebank_fasta.Rmd b/004-NCBI/02-NCBI_genebank_fasta.Rmd index 0fa568f..175af0b 100755 --- a/004-NCBI/02-NCBI_genebank_fasta.Rmd +++ b/004-NCBI/02-NCBI_genebank_fasta.Rmd @@ -31,7 +31,7 @@ library(compbio4all) ## Introduction -**NCBI** is the National Center for Biotechnology Information. The [NCBI Webiste](www.ncbi.nlm.nih.gov/) is the entry point to a large number of databases giving access to **biological sequences** (DNA, RNA, protein) and biology-related publications. +**NCBI** is the National Center for Biotechnology Information. The [NCBI Webiste](https://www.ncbi.nlm.nih.gov/) www.ncbi.nlm.nih.gov/ is the entry point to a large number of databases giving access to **biological sequences** (DNA, RNA, protein) and biology-related publications. When scientists sequence DNA, RNA and proteins they typically publish their data via databases with the NCBI. Each is given a unique identification number known as an **accession number**. For example, each time a unique human genome sequence is produced it is uploaded to the relevant databases, assigned a unique **accession**, and a website created to access it. Sequence are also cross-referenced to related papers, so you can start with a sequence and find out what scientific paper it was used in, or start with a paper and see if any sequences are associated with it. @@ -47,7 +47,7 @@ In this chapter we'll typically refer generically to "NCBI data" and "NCBI datab Almost published biological sequences are available online, as it is a requirement of every scientific journal that any published DNA or RNA or protein sequence must be deposited in a public database. The main resources for storing and distributing sequence data are three large databases: -1. USA: **[NCBI database](www.ncbi.nlm.nih.gov/)** (www.ncbi.nlm.nih.gov/) +1. USA: **[NCBI database](https://www.ncbi.nlm.nih.gov/)** (www.ncbi.nlm.nih.gov/) 1. Europe: **European Molecular Biology Laboratory (EMBL)** database (https://www.ebi.ac.uk/ena) 1. Japan: **DNA Database of Japan (DDBJ)** database (www.ddbj.nig.ac.jp/). These databases collect all publicly available DNA, RNA and protein sequence data and make it available for free. They exchange data nightly, so contain essentially the same data. The redundancy among the databases allows them to serve different communities (e.g. native languages), provide different additional services such as tutorials, and assure that the world's scientists have their data backed up in different physical locations -- a key component of good data management! @@ -87,7 +87,7 @@ As mentioned above, for each sequence the NCBI database stores some extra inform To view the GenBank entry for the DEN-1 Dengue virus, follow these steps: -1. Go to the [NCBI website](www.ncbi.nlm.nih.gov) (www.ncbi.nlm.nih.gov). +1. Go to the [NCBI website](https://www.ncbi.nlm.nih.gov) (www.ncbi.nlm.nih.gov). 1. Search for the accession number NC_001477. 1. Since we searched for a particular accession we are only returned a single main result which is titled "NUCLEOTIDE SEQUENCE: Dengue virus 1, complete genome." 1. Click on "Dengue virus 1, complete genome" to go to the GenBank entry. diff --git a/004-NCBI/03-NCBI_seqdata_by_GUI1.Rmd b/004-NCBI/03-NCBI_seqdata_by_GUI1.Rmd index 5171afa..1084849 100755 --- a/004-NCBI/03-NCBI_seqdata_by_GUI1.Rmd +++ b/004-NCBI/03-NCBI_seqdata_by_GUI1.Rmd @@ -10,7 +10,7 @@ The following chapter was originally written by Avril Coghlan. It provides brie ## Retrieving genome sequence data via the NCBI website -You can easily retrieve DNA or protein sequence data by hand from the [NCBI](www.ncbi.nlm.nih.gov) Sequence Database via its website www.ncbi.nlm.nih.gov. +You can easily retrieve DNA or protein sequence data by hand from the [NCBI](https://www.ncbi.nlm.nih.gov/) Sequence Database via its website www.ncbi.nlm.nih.gov. Dengue DEN-1 DNA is a viral DNA sequence and its NCBI **accession number** is NC_001477. To retrieve the DNA sequence for the Dengue DEN-1 virus from NCBI, go to the NCBI website, type “NC_001477” in the Search box at the top of the webpage, and press the “Search” button beside the Search box. diff --git a/004-NCBI/04-uniprot_by_GUI-AC07-01.Rmd b/004-NCBI/04-uniprot_by_GUI-AC07-01.Rmd index f1edc56..c5eed0a 100755 --- a/004-NCBI/04-uniprot_by_GUI-AC07-01.Rmd +++ b/004-NCBI/04-uniprot_by_GUI-AC07-01.Rmd @@ -22,7 +22,7 @@ In a previous vignette you learned how to retrieve sequences from the NCBI datab As mentioned previously, a subsection of the NCBI database called **RefSeq** consists of high quality DNA and protein sequence data. Furthermore, the NCBI entries for the RefSeq sequences have been **manually curated**, which means that expert biologists employed by NCBI have added additional information to the NCBI entries for those sequences, such as details of scientific papers that describe the sequences. -Another extremely important manually curated database is [**UniProt**](www.uniprot.org), which focuses on protein sequences. UniProt aims to contains manually curated information on all known protein sequences. While many of the protein sequences in UniProt are also present in RefSeq, the amount and quality of manually curated information in UniProt is much higher than that in RefSeq. +Another extremely important manually curated database is [UniProt](https://www.uniprot.org/) www.uniprot.org, which focuses on protein sequences. UniProt aims to contains manually curated information on all known protein sequences. While many of the protein sequences in UniProt are also present in RefSeq, the amount and quality of manually curated information in UniProt is much higher than that in RefSeq. For each protein in UniProt, the UniProt curators read all the scientific papers that they can find about that protein, and add information from those papers to the protein’s UniProt entry. For example, for a human protein, the UniProt entry for the protein usually includes information about the biological function of the protein, in what human tissues it is expressed, whether it interacts with other human proteins, and much more. All this information has been manually gathered by the UniProt curators from scientific papers, and the papers in which the found the information are always listed in the UniProt entry for the protein. @@ -42,7 +42,7 @@ This tells us that *Mycobacterium* is a species of bacteria, which belongs to a Back up at the top under "organism" is says "Status", which tells us the **annotation score** is 2 out of 5, that it is a "Protein inferred from homology", which means what we know about it is derived from bioinformatics and computational tools, not lab work. -Beside the heading “Function”, it says that the function of this protein is that it “Removes the pyruvyl group from chorismate to provide 4-hydroxybenzoate (4HB)”. This tells us this protein is an enzyme (a protein that increases the rate of a specific biochemical reaction), and tells us what is the particular biochemical reaction that this enzyme is involved in. At the end of this info it says "By similarity", which again indicates that what we know about this protein comes from bioinformatics, not lab work. +Beside the heading “Function”, it says that the function of this protein is that it “Removes the pyruvyl group from chorismate to provide 4-hydroxybenzoate (4HB)”. This tells us this protein is an enzyme (a protein that increases the rate of a specific biochemical reaction), and tells us what is the particular biochemical reaction that this enzyme is involved in. At the end of this info it says "By similarity", which again indicates that what we know about this protein comes from bioinformatics, not lab work. ### Protein sequence and size @@ -88,6 +88,13 @@ We can confirm str(lepraeseq) ``` +```{r eval = F} + # 'SeqFastadna' chr [1:210] "m" "t" "n" "r" "t" "l" "s" "r" "e" "e" "i" ... + # - attr(*, "name")= chr "sp|Q9CD83|PHBS_MYCLE" + # - attr(*, "Annot")= chr ">sp|Q9CD83|PHBS_MYCLE Chorismate pyruvate-lyase + # OS=Mycobacterium leprae (strain TN) OX=272631 GN=ML0133 PE=3 SV=1" +``` + For the other sequence @@ -99,6 +106,16 @@ file.2 <- system.file("./extdata/A0PQ23.fasta", package = "compbio4all") # load fasta ulcerans <- read.fasta(file = file.2) ulceransseq <- ulcerans[[1]] + +str(ulceransseq)[[1]] +``` + + +```{r} + # 'SeqFastadna' chr [1:212] "m" "l" "a" "v" "l" "p" "e" "k" "r" "e" "m" ... + # - attr(*, "name")= chr "tr|A0PQ23|A0PQ23_MYCUA" + # - attr(*, "Annot")= chr ">tr|A0PQ23|A0PQ23_MYCUA Chorismate pyruvate-lyase +# OS=Mycobacterium ulcerans (strain Agy99) OX=362242 GN=MUL_2003 PE=4 SV=1" ``` diff --git a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-159-1.png b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-159-1.png old mode 100755 new mode 100644 index aa7a334..a695096 Binary files a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-159-1.png and b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-159-1.png differ diff --git a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-166-1.png b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-166-1.png index 5248f4e..651404e 100644 Binary files a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-166-1.png and b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-166-1.png differ diff --git a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-167-1.png b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-167-1.png index 784aa53..066b75d 100644 Binary files a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-167-1.png and b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-167-1.png differ diff --git a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-168-1.png b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-168-1.png old mode 100755 new mode 100644 index 52671e1..5248f4e Binary files a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-168-1.png and b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-168-1.png differ diff --git a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-169-1.png b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-169-1.png old mode 100755 new mode 100644 index a90c75a..784aa53 Binary files a/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-169-1.png and b/_bookdown_files/lbrb_files/figure-html/unnamed-chunk-169-1.png differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-159-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-159-1.pdf old mode 100755 new mode 100644 index bd27236..260b408 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-159-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-159-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-166-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-166-1.pdf index e0d441a..a852ff9 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-166-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-166-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-167-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-167-1.pdf index e189a3e..d35026f 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-167-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-167-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-168-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-168-1.pdf old mode 100755 new mode 100644 index 5816338..269d00a Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-168-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-168-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-169-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-169-1.pdf old mode 100755 new mode 100644 index 05143da..4568ea0 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-169-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-169-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-84-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-84-1.pdf index 7b04168..a28e3ba 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-84-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-84-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-85-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-85-1.pdf index 0ac7434..3221b17 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-85-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-85-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-87-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-87-1.pdf index d32703d..b4a4caf 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-87-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-87-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-88-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-88-1.pdf index 99c8cd7..cee85d8 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-88-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-88-1.pdf differ diff --git a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-90-1.pdf b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-90-1.pdf index f403af1..4fd3992 100644 Binary files a/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-90-1.pdf and b/_bookdown_files/lbrb_files/figure-latex/unnamed-chunk-90-1.pdf differ diff --git a/docs/R-objects.html b/docs/R-objects.html index 5944677..c19545c 100644 --- a/docs/R-objects.html +++ b/docs/R-objects.html @@ -265,10 +265,12 @@
You can easily retrieve DNA or protein sequence data by hand from the NCBI Sequence Database via its website www.ncbi.nlm.nih.gov.
+You can easily retrieve DNA or protein sequence data by hand from the NCBI Sequence Database via its website www.ncbi.nlm.nih.gov.
Dengue DEN-1 DNA is a viral DNA sequence and its NCBI accession number is NC_001477. To retrieve the DNA sequence for the Dengue DEN-1 virus from NCBI, go to the NCBI website, type “NC_001477” in the Search box at the top of the webpage, and press the “Search” button beside the Search box.
(While this is the normal workflow, accessions related to well-known organisms can sometimes turn up using a direct Google search. This is the case for NC_001477 - if you Google it you can go directly to the website with the genome sequence.)
On the results page of a normal NCBI search you will see the number of hits to “NC_001477” in each of the NCBI databases on the NCBI website. There are many databases on the NCBI website, for example, PubMed and Pubmed Central contain abstracts from scientific papers, the Genes and Genomes database contains DNA and RNA sequence data, the Proteins database contains protein sequence data, and so on.
diff --git a/docs/downloading-sequences-from-uniprot-by-hand.html b/docs/downloading-sequences-from-uniprot-by-hand.html index 00fdd92..fb8871e 100644 --- a/docs/downloading-sequences-from-uniprot-by-hand.html +++ b/docs/downloading-sequences-from-uniprot-by-hand.html @@ -265,10 +265,12 @@In a previous vignette you learned how to retrieve sequences from the NCBI database. The NCBI database is a key database in bioinformatics because it contains essentially all DNA sequences ever sequenced.
As mentioned previously, a subsection of the NCBI database called RefSeq consists of high quality DNA and protein sequence data. Furthermore, the NCBI entries for the RefSeq sequences have been manually curated, which means that expert biologists employed by NCBI have added additional information to the NCBI entries for those sequences, such as details of scientific papers that describe the sequences.
-Another extremely important manually curated database is UniProt, which focuses on protein sequences. UniProt aims to contains manually curated information on all known protein sequences. While many of the protein sequences in UniProt are also present in RefSeq, the amount and quality of manually curated information in UniProt is much higher than that in RefSeq.
+Another extremely important manually curated database is UniProt www.uniprot.org, which focuses on protein sequences. UniProt aims to contains manually curated information on all known protein sequences. While many of the protein sequences in UniProt are also present in RefSeq, the amount and quality of manually curated information in UniProt is much higher than that in RefSeq.
For each protein in UniProt, the UniProt curators read all the scientific papers that they can find about that protein, and add information from those papers to the protein’s UniProt entry. For example, for a human protein, the UniProt entry for the protein usually includes information about the biological function of the protein, in what human tissues it is expressed, whether it interacts with other human proteins, and much more. All this information has been manually gathered by the UniProt curators from scientific papers, and the papers in which the found the information are always listed in the UniProt entry for the protein.
Just like NCBI, UniProt also assigns an accession to each sequence in the UniProt database. Although the same protein sequence may appear in both the NCBI database and the UniProt database, it will have different NCBI and UniProt accessions. However, there is usually a link on the NCBI entry for the protein sequence to the UniProt entry, and vice versa.
We can confirm
str(lepraeseq)
# 'SeqFastadna' chr [1:210] "m" "t" "n" "r" "t" "l" "s" "r" "e" "e" "i" ...
+ # - attr(*, "name")= chr "sp|Q9CD83|PHBS_MYCLE"
+ # - attr(*, "Annot")= chr ">sp|Q9CD83|PHBS_MYCLE Chorismate pyruvate-lyase
+ # OS=Mycobacterium leprae (strain TN) OX=272631 GN=ML0133 PE=3 SV=1"
For the other sequence
-# locate the file within the package using system.file()
-.2 <- system.file("./extdata/A0PQ23.fasta", package = "compbio4all")
- file
-
-# load fasta
-<- read.fasta(file = file.2)
- ulcerans <- ulcerans[[1]] ulceransseq
# locate the file within the package using system.file()
+.2 <- system.file("./extdata/A0PQ23.fasta", package = "compbio4all")
+ file
+
+# load fasta
+<- read.fasta(file = file.2)
+ ulcerans <- ulcerans[[1]]
+ ulceransseq
+str(ulceransseq)[[1]]
# 'SeqFastadna' chr [1:212] "m" "l" "a" "v" "l" "p" "e" "k" "r" "e" "m" ...
+ # - attr(*, "name")= chr "tr|A0PQ23|A0PQ23_MYCUA"
+ # - attr(*, "Annot")= chr ">tr|A0PQ23|A0PQ23_MYCUA Chorismate pyruvate-lyase
+ # OS=Mycobacterium ulcerans (strain Agy99) OX=362242 GN=MUL_2003 PE=4 SV=1"
You can interact with R’s console similar to a scientific calculator. For example, you can use parentheses to set up mathematical statements like
-5*(1+1)
5*(1+1)
## [1] 10
Note however that you have to be explicit about multiplication. If you try the following it won’t work.
-5(1+1)
5(1+1)
R also has built-in functions that work similar to what you might have used in Excel. For example, in Excel you can calculate the average of a set of numbers by typing “=average(1,2,3)” into a cell. R can do the same thing except
mean(c(1,2,3))
mean(c(1,2,3))
## [1] 2
Where “c(…)” packages up the numbers the way the mean() function wants to see them.
If you just do the following R will give you an answer, but its the wrong one
-mean(1,2,3)
mean(1,2,3)
This is a common issue with R – and many programs, really – it won’t always tell you when somethind didn’t go as planned. This is because it doesn’t know something didn’t go as planned; you have to learn the rules R plays by.
See if you can reproduce the following results
Division
-10/3
10/3
## [1] 3.333333
The standard deviation
-sd(c(5,10,15)) # note the use of "c(...)"
sd(c(5,10,15)) # note the use of "c(...)"
## [1] 5
The code you’ve chosen to run will be sent by RStudio from the script editor over to the console. The console will show you both the code and then the output.
You can run several lines of code if you want; the console will run a line, print the output, and then run the next line. First I’ll use the command mean(), and then the command sd() for the standard deviation:
-mean(c(1,2,3))
mean(c(1,2,3))
## [1] 2
-sd(c(1,2,3))
sd(c(1,2,3))
## [1] 1
If you are using a function in R you can get info about how it works like this
- ?mean
?mean
In RStudio the help screen should appear, probably above your console. If you start reading this help file, though, you don’t have to go far until you start seeing lots of R lingo, like “S3 method”,“na.rm”, “vectors”. Unfortunately, the R help files are usually not written for beginners, and reading help files is a skill you have to acquire.
For example, when we load data into R in subsequent lessons we will use a function called “read.csv”
Access the help file by typing “?read.csv” into the console and pressing enter. Surprisingly, the function that R give you the help file isn’t what you asked for, but is read.table(). This is a related function to read.csv, but when you’re a beginner thing like this can really throw you off.
@@ -543,35 +545,35 @@Practice the following operations. Type the directly into the console and execute them. Also write them in a script in the script editor and run them.
Square roots
-sqrt(42)
sqrt(42)
## [1] 6.480741
The date Some functions in R can be executed within nothing in the parentheses.
-date()
## [1] "Sat Sep 18 00:31:13 2021"
+date()
## [1] "Sat Sep 18 01:23:04 2021"
Exponents The ^ is used for exponents
-42^2
42^2
## [1] 1764
A series of numbers A colon between two numbers creates a series of numbers.
-1:42
1:42
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
logs The default for the log() function is the natural log.
-log(42)
log(42)
## [1] 3.73767
log10() gives the base-10 log.
-log10(42)
log10(42)
## [1] 1.623249
exp() raises e to a power
-exp(3.73767)
exp(3.73767)
## [1] 42.00002
Multiple commands can be nested
-sqrt(42)^2
-log(sqrt(42)^2)
-exp(log(sqrt(42)^2))
sqrt(42)^2
+log(sqrt(42)^2)
+exp(log(sqrt(42)^2))
Here is an example of the contents of a FASTA file. (If your are viewing this chapter in the form of the source .Rmd file, the cat()
function is included just to print out the content properly and is not part of the FASTA format).
cat(">gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
-MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
-AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
-QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
-LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK")
cat(">gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
+MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
+AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
+QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
+LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK")
## >gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
## MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
## AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
@@ -422,23 +424,23 @@ 16.1 Example FASTA file
16.2 Multiple sequences in a single FASTA file
Multiple sequences can be stored in a single FASTA file, each on separated by a line and have its own headline.
-cat(">LCBO - Prolactin precursor - Bovine
-MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
-EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
-VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
-ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
-
->MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
-MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
-FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
-DIDGDGQVNYEEFVQMMTAK*
-
->gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
-LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
-EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
-LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
-GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
-IENY")
+cat(">LCBO - Prolactin precursor - Bovine
+MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
+EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
+VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
+ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
+
+>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
+MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
+FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
+DIDGDGQVNYEEFVQMMTAK*
+
+>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
+LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
+EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
+LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
+GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
+IENY")
## >LCBO - Prolactin precursor - Bovine
## MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
## EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
@@ -467,18 +469,18 @@ 16.3 Multiple sequence alignments
- Each spaces added to the beginning and end of sequences that vary in length are indicated by
~
In the sample FASTA file below, the example1
sequence has a gap of 8 near its beginning. The example2
sequence has numerous ~
indicating that this sequence is missing data from its beginning that are present in the other sequences. The example3
sequence has numerous ~
at its end, indicating that this sequence is shorter than the others.
-cat(">example1
-MKALWALLLVPLLTGCLA........EGELEVTDQLPGQSDQP.WEQALNRFWDYLRWVQ
-GNQARDRLEEVREQMEEVRSKMEEQTQQIRLQAEIFQARIKGWFEPLVEDMQRQWANLME
-KIQASVATNSIASTTVPLENQ
->example2
-~~~~~~~~~~~~~~~~~~~~~~~~~~KVQQELEPEAGWQTGQP.WEAALARFWDYLRWVQ
-SSRARGHLEEMREQIQEVRVKMEEQADQIRQKAEAFQARLKSWFEPLLEDMQRQWDGLVE
-KVQAAVAT.IPTSKPVEEP~~
->example3
-MRSLVVFFALAVLTGCQARSLFQAD..............APQPRWEEMVDRFWQYVSELN
-AGALKEKLEETAENL...RTSLEGRVDELTSLLAPYSQKIREQLQEVMDKIKEATAALPT
-QA~~~~~~~~~~~~~~~~~~~")
+cat(">example1
+MKALWALLLVPLLTGCLA........EGELEVTDQLPGQSDQP.WEQALNRFWDYLRWVQ
+GNQARDRLEEVREQMEEVRSKMEEQTQQIRLQAEIFQARIKGWFEPLVEDMQRQWANLME
+KIQASVATNSIASTTVPLENQ
+>example2
+~~~~~~~~~~~~~~~~~~~~~~~~~~KVQQELEPEAGWQTGQP.WEAALARFWDYLRWVQ
+SSRARGHLEEMREQIQEVRVKMEEQADQIRQKAEAFQARLKSWFEPLLEDMQRQWDGLVE
+KVQAAVAT.IPTSKPVEEP~~
+>example3
+MRSLVVFFALAVLTGCQARSLFQAD..............APQPRWEEMVDRFWQYVSELN
+AGALKEKLEETAENL...RTSLEGRVDELTSLLAPYSQKIREQLQEVMDKIKEATAALPT
+QA~~~~~~~~~~~~~~~~~~~")
## >example1
## MKALWALLLVPLLTGCLA........EGELEVTDQLPGQSDQP.WEQALNRFWDYLRWVQ
## GNQARDRLEEVREQMEEVRSKMEEQTQQIRLQAEIFQARIKGWFEPLVEDMQRQWANLME
@@ -505,16 +507,16 @@ 16.4 FASTQ Format
- Line 4 encodes the quality values for the sequence in Line 2 of the file, and must contain the same number of symbols as letters in the sequence.
A FASTQ file containing a single sequence might look like this:”
-cat("@SEQ_ID
-GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
-+
-!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65")
+cat("@SEQ_ID
+GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
++
+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65")
## @SEQ_ID
## GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
## +
## !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Here are the quality value characters in left-to-right increasing order of quality (ASCII):”
-!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
+!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
FASTQ files typically do not include line breaks and do not wrap around when they reach the width of a normal page or file.
diff --git a/docs/introduction-to-biological-sequences-databases.html b/docs/introduction-to-biological-sequences-databases.html
index 7da4a24..a4f9ebc 100644
--- a/docs/introduction-to-biological-sequences-databases.html
+++ b/docs/introduction-to-biological-sequences-databases.html
@@ -265,10 +265,12 @@
- 12 NCBI: The National Center for Biotechnology Information
- 13 Introduction to biological sequences databases
@@ -414,7 +416,7 @@ 13.1 Topics
NCBI is the National Center for Biotechnology Information. The NCBI Webiste is the entry point to a large number of databases giving access to biological sequences (DNA, RNA, protein) and biology-related publications.
+NCBI is the National Center for Biotechnology Information. The NCBI Webiste www.ncbi.nlm.nih.gov/ is the entry point to a large number of databases giving access to biological sequences (DNA, RNA, protein) and biology-related publications.
When scientists sequence DNA, RNA and proteins they typically publish their data via databases with the NCBI. Each is given a unique identification number known as an accession number. For example, each time a unique human genome sequence is produced it is uploaded to the relevant databases, assigned a unique accession, and a website created to access it. Sequence are also cross-referenced to related papers, so you can start with a sequence and find out what scientific paper it was used in, or start with a paper and see if any sequences are associated with it.
This chapter provides an introduction to the general search features of the NCBI databases via the interface on the website, including how to locate sequences using accession numbers and other search parameters, such as specific authors or papers. Subsequent chapters will introduce advanced search features, and how to carry out searches using R commands.
One consequence of the explosion of biological sequences used in publications is that the system of databases has become fairly complex. There are databases for different types of data, different types of molecules, data from individual experiments showing genetic variation and also the consensus sequence for a given molecule. Luckily, if you know the accession number of the sequence you are looking for – which will be our starting point throughout this book – its fairly straight forward. There are numerous other books on bioinformatics and genomics that provide all details if you need to do more complex searches.
@@ -424,7 +426,7 @@Almost published biological sequences are available online, as it is a requirement of every scientific journal that any published DNA or RNA or protein sequence must be deposited in a public database. The main resources for storing and distributing sequence data are three large databases:
As mentioned above, for each sequence the NCBI database stores some extra information such as the species that it came from, publications describing the sequence, etc. This information is stored in the GenBank entry (aka GenBank Record) for the sequence. The GenBank entry for a sequence can be viewed by searching the NCBI database for the accession number for that sequence.
To view the GenBank entry for the DEN-1 Dengue virus, follow these steps:
Logging splits up multiplication into addition. So, log(m*n) is the same as log(m) + log(n)
You can check this
-<-10
- m<-11
- nlog(m*n)
<-10
+ m<-11
+ nlog(m*n)
## [1] 4.70048
-log(m)+log(n)
log(m)+log(n)
## [1] 4.70048
-log(m*n) == log(m)+log(n)
log(m*n) == log(m)+log(n)
## [1] TRUE
Exponentiation undos logs
-exp(log(m*n))
exp(log(m*n))
## [1] 110
-*n m
*n m
## [1] 110
The key equation in BLAST’s E values is
u = ln(Kmn)/lambda
This can be changed to
[ln(K) + ln(mn)]/lambda
We can check this
-<- 1
- K <- 10
- m <- 11
- n <- 110
- lambda
-log(K*m*n)/lambda
<- 1
+ K <- 10
+ m <- 11
+ n <- 110
+ lambda
+log(K*m*n)/lambda
## [1] 0.04273164
-log(K) + log(m*n))/lambda (
log(K) + log(m*n))/lambda (
## [1] 0.04273164
-log(K*m*n)/lambda == (log(K) + log(m*n))/lambda
log(K*m*n)/lambda == (log(K) + log(m*n))/lambda
## [1] TRUE
NOTE: The following material was partially adapted by N. Brouwer from Wikipedia. See the underlying .Rmd file for information on specific paragraphs from Wikipedia.
+The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It was founded in 1988 and is approved and funded by the government of the United States. The NCBI houses a series of databases relevant to the basic and applied life sciences and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for biomedical literature. All these databases are available online through the Entrez search engine.
In this chapter we’ll briefly discuss the major databases, following up with specific details in subsequent chapters. One thing to note is that in practice people do no always adhere to the specific names of databases and other tools. For example, someone may say “I searched NCBI for sequences.” It would be more accurate to say “I used Entrez to search GenBank for sequences.” For in-depth details on all the features see Database resources of the National Center for Biotechnology Information
-The GenBank sequence database is an open access collection of publicly available DNA and protein sequences. If you’ve work with sequence data, you’ll work with GenBank. GenBank is the actual database, and it can be searched several ways. For example, you can search for a sequence by its ID number (accession number) if you know it, or do a BLAST search using an actual sequence to look for similar sequences.
-A key component of GenBank are the GeneBank Records, which are annotated summaries of sequences in the databases. For example, below is shown record for a gene pallysin from syphilis In addition to the actual A, T, C, and Gs of the sequences, the record provides metadata, such as the scientific name of the organism (Treponema pallidum), who did the sequencing, the name of the paper where the sequence was published, and important features of the gene.
+
A key component of GenBank are the GeneBank Records, which are annotated summaries of sequences in the databases. For example, below is shown record for a gene pallysin https://www.ncbi.nlm.nih.gov/protein/AAC65720 from syphilis In addition to the actual A, T, C, and Gs of the sequences, the record provides metadata, such as the scientific name of the organism (Treponema pallidum), who did the sequencing, the name of the paper where the sequence was published, and important features of the gene.
A key feature of PubMed records is that they are hyperlinked to other NCBI databases. For example, you can click link under the name of the paper which reported the sequence of the gene and it will take you to the PubMed record for that paper (see below). You can also click the “Run BLAST” link and you can search the database for similar sequences. This protein coded for by this particular gene has had its structured solved using x-ray crystallography, and you can see these results under “Protein 3D Structure.” In a later chapter we’ll get to know these records in further detail.
PubMed and PubMed Central (PMC) are databases of scientific articles related to data contained in NCBI databases. PubMed contains basic bibliographic information, the abstract, relevant links to sequences, and to the websites of the actual publishers of the papers. If the text of an article is open-access, PMC should have a copy of it. Articles in PMC contain relevant hyperlinks, such as to any sequences that are mentioned. For example, the image below show where the sequence of Tp0751 is mentioned in a paper and linked to the GenBank record we looked at above.
https://en.wikipedia.org/wiki/Entrez
+“Entrez is a federated search engine and web portal that allows users to search many discrete health sciences databases of the NCBI website. The name”Entrez” (a greeting meaning “Come in” in French) was chosen to reflect the spirit of welcoming the public to search the content available from NCBI.”
+Entrez is a search engine and web portal that allows users to search many discrete health sciences databases of the NCBI website. The name “Entrez” (a greeting meaning “Come in” in French) was chosen to reflect the spirit of welcoming the public to search the content available from NCBI.
-“Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.”
+Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.
https://en.wikipedia.org/wiki/BLAST_(biotechnology)
+“BLAST (basic local alignment search tool)is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.”
+BLAST (Basic Local Alignment Search Tool) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.
We now need to load up all our bioinformatics and phylogenetics software into R. This is done with the library()
command.
To run this code just clock on the sideways green triangle all the way to the right of the code.
NOTE: You’ll likely see some red code appear on your screen. No worries, totally normal!
-# github packages
-library(compbio4all)
-
-# CRAN packages
-library(rentrez)
-library(seqinr)
-library(ape)
-
-# Bioconductor packages
-library(msa)
-library(Biostrings)
# github packages
+library(compbio4all)
+
+# CRAN packages
+library(rentrez)
+library(seqinr)
+library(ape)
+
+# Bioconductor packages
+library(msa)
+library(Biostrings)
Normally we’d have to download these sequences by hand through pointing and clicking on GeneBank records on the NCBI website. In R we can do it automatically; this might take a second.
All the code needed is this:
-# Human shroom 3 (H. sapiens)
-<- entrez_fetch(db = "protein",
- hShroom3 id = "NP_065910",
- rettype = "fasta")
# Human shroom 3 (H. sapiens)
+<- entrez_fetch(db = "protein",
+ hShroom3 id = "NP_065910",
+ rettype = "fasta")
The output is in FASTA format; we’ll use the cat()
to do a little formatting for us:
cat(hShroom3)
cat(hShroom3)
## >NP_065910.3 protein Shroom3 [Homo sapiens]
## MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKL
## QAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAA
@@ -574,29 +576,29 @@ 18.3 Downloading macro-molecular
## IPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
Note the initial >
, then the header line of NP_065910.3 protein Shroom3 [Homo sapiens]
. After that is the amino acid sequence. The underlying data also includes the newline character \n
to designate where each line of amino acids stops.
We can get the rest of the data by just changing the id = ...
argument:
# Mouse shroom 3a (M. musculus)
-<- entrez_fetch(db = "protein",
- mShroom3a id = "AAF13269",
- rettype = "fasta")
-
-# Human shroom 2 (H. sapiens)
-<- entrez_fetch(db = "protein",
- hShroom2 id = "CAA58534",
- rettype = "fasta")
-
-
-# Sea-urchin shroom
-<- entrez_fetch(db = "protein",
- sShroom id = "XP_783573",
- rettype = "fasta")
# Mouse shroom 3a (M. musculus)
+<- entrez_fetch(db = "protein",
+ mShroom3a id = "AAF13269",
+ rettype = "fasta")
+
+# Human shroom 2 (H. sapiens)
+<- entrez_fetch(db = "protein",
+ hShroom2 id = "CAA58534",
+ rettype = "fasta")
+
+
+# Sea-urchin shroom
+<- entrez_fetch(db = "protein",
+ sShroom id = "XP_783573",
+ rettype = "fasta")
I’m going to check about how long each of these sequences is - each should have an at least slightly different length. If any are identical, I might have repeated an accession name or re-used an object name. The function nchar()
counts of the number of characters in an R object.
nchar(hShroom3)
nchar(hShroom3)
## [1] 2070
-nchar(mShroom3a)
nchar(mShroom3a)
## [1] 2083
-nchar(sShroom)
nchar(sShroom)
## [1] 1758
-nchar(hShroom2)
nchar(hShroom2)
## [1] 1673
We can remove this non-sequence information using a function I wrote called fasta_cleaner()
, which is in the compbio4all
package. The function uses regular expressions to remove the info we don’t need.
ASIDE: If we run the name of the command with out any quotation marks we can see the code:
- fasta_cleaner
fasta_cleaner
## function(fasta_object, parse = TRUE){
##
## fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)","\\4",fasta_object)
@@ -625,87 +627,87 @@ 18.4 Prepping macromolecular sequ
##
## return(fasta_object[[1]])
## }
-## <bytecode: 0x7f81ca9a24b8>
+## <bytecode: 0x7fcccbb38a70>
## <environment: namespace:compbio4all>
End ASIDE
Now use the function to clean our sequences; we won’t worry about what pare = ...
is for.
<- fasta_cleaner(hShroom3, parse = F)
- hShroom3 <- fasta_cleaner(mShroom3a, parse = F)
- mShroom3a <- fasta_cleaner(hShroom2, parse = F)
- hShroom2 <- fasta_cleaner(sShroom, parse = F) sShroom
<- fasta_cleaner(hShroom3, parse = F)
+ hShroom3 <- fasta_cleaner(mShroom3a, parse = F)
+ mShroom3a <- fasta_cleaner(hShroom2, parse = F)
+ hShroom2 <- fasta_cleaner(sShroom, parse = F) sShroom
Now let’s take a peek at what our sequences look like:
- hShroom3
hShroom3
## [1] "MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKLQAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAAWSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYLRRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGRERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNIPRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSLGSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGMLYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFCSVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDFQEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTKAEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQGFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEPLGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGSPESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPHTPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFPESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAYLQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLSGANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSGFGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTPCPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGPRGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTSDPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPVYSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEAESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDKSLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYYSVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEALISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHEDARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDFIPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL"
We can do a global alignment of one sequence against another using the pairwiseAlignment()
function from the Bioconductor package Biostrings
(note that capital “B” in Biostrings
; most R package names are all lower case, but not this one).
Let’s align human versus mouse shroom:
-<- Biostrings::pairwiseAlignment(
- align.h3.vs.m3a
- hShroom3, mShroom3a)
<- Biostrings::pairwiseAlignment(
+ align.h3.vs.m3a
+ hShroom3, mShroom3a)
We can peek at the alignment
- align.h3.vs.m3a
align.h3.vs.m3a
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MMRTTEDFHKPSATLN-SNTATKGRYIYLEAFLE...KAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
## subject: MK-TPENLEEPSATPNPSRTPTE-RFVYLEALLE...KAGAISLPPALTGHATPGGTSVFGGVFPTLTSPL
## score: 2189.934
The score tells us how closely they are aligned; higher scores mean the sequences are more similar. Its hard to interpret the number on its own so we can get the percent sequence identity (PID) using the pid()
function.
::pid(align.h3.vs.m3a) Biostrings
::pid(align.h3.vs.m3a) Biostrings
## [1] 70.56511
So, shroom3 from humans and shroom3 from mice are ~71% similar (at least using this particular method of alignment, and there are many ways to do this!)
What about human shroom 3 and sea-urchin shroom?
-<- Biostrings::pairwiseAlignment(
- align.h3.vs.h2
- hShroom3, hShroom2)
<- Biostrings::pairwiseAlignment(
+ align.h3.vs.h2
+ hShroom3, hShroom2)
First check out the score using score()
, which accesses it directly without all the other information.
score(align.h3.vs.h2)
score(align.h3.vs.h2)
## [1] -5673.853
Now the percent sequence alignment with pid()
:
::pid(align.h3.vs.h2) Biostrings
::pid(align.h3.vs.h2) Biostrings
## [1] 33.83277
So Human shroom 3 and Mouse shroom 3 are 71% identical, but Human shroom 3 and human shroom 2 are only 34% similar? How does it work out evolutionary that a human and mouse gene are more similar than a human and a human gene? What are the evolutionary relationships among these genes within the shroom gene family?
I’ve copied a table from a published paper which has accession numbers for 15 different Shroom genes.
-<- c("CAA78718" , "X. laevis Apx" , "xShroom1",
- shroom_table "NP_597713" , "H. sapiens APXL2" , "hShroom1",
- "CAA58534" , "H. sapiens APXL", "hShroom2",
- "ABD19518" , "M. musculus Apxl" , "mShroom2",
- "AAF13269" , "M. musculus ShroomL" , "mShroom3a",
- "AAF13270" , "M. musculus ShroomS" , "mShroom3b",
- "NP_065910", "H. sapiens Shroom" , "hShroom3",
- "ABD59319" , "X. laevis Shroom-like", "xShroom3",
- "NP_065768", "H. sapiens KIAA1202" , "hShroom4a",
- "AAK95579" , "H. sapiens SHAP-A" , "hShroom4b",
- #"DQ435686" , "M. musculus KIAA1202" , "mShroom4",
- "ABA81834" , "D. melanogaster Shroom", "dmShroom",
- "EAA12598" , "A. gambiae Shroom", "agShroom",
- "XP_392427" , "A. mellifera Shroom" , "amShroom",
- "XP_783573" , "S. purpuratus Shroom" , "spShroom") #sea urchin
<- c("CAA78718" , "X. laevis Apx" , "xShroom1",
+ shroom_table "NP_597713" , "H. sapiens APXL2" , "hShroom1",
+ "CAA58534" , "H. sapiens APXL", "hShroom2",
+ "ABD19518" , "M. musculus Apxl" , "mShroom2",
+ "AAF13269" , "M. musculus ShroomL" , "mShroom3a",
+ "AAF13270" , "M. musculus ShroomS" , "mShroom3b",
+ "NP_065910", "H. sapiens Shroom" , "hShroom3",
+ "ABD59319" , "X. laevis Shroom-like", "xShroom3",
+ "NP_065768", "H. sapiens KIAA1202" , "hShroom4a",
+ "AAK95579" , "H. sapiens SHAP-A" , "hShroom4b",
+ #"DQ435686" , "M. musculus KIAA1202" , "mShroom4",
+ "ABA81834" , "D. melanogaster Shroom", "dmShroom",
+ "EAA12598" , "A. gambiae Shroom", "agShroom",
+ "XP_392427" , "A. mellifera Shroom" , "amShroom",
+ "XP_783573" , "S. purpuratus Shroom" , "spShroom") #sea urchin
I’ll do a bit of formatting; you can ignore these details if you want
-# convert to matrix
-<- matrix(shroom_table,
- shroom_table_matrix byrow = T,
- nrow = 14)
- # convert to dataframe
-<- data.frame(shroom_table_matrix,
- shroom_table stringsAsFactors = F)
-
-# name columns
-names(shroom_table) <- c("accession", "name.orig","name.new")
-
-# Create simplified species names
-$spp <- "Homo"
- shroom_table$spp[grep("laevis",shroom_table$name.orig)] <- "Xenopus"
- shroom_table$spp[grep("musculus",shroom_table$name.orig)] <- "Mus"
- shroom_table$spp[grep("melanogaster",shroom_table$name.orig)] <- "Drosophila"
- shroom_table$spp[grep("gambiae",shroom_table$name.orig)] <- "mosquito"
- shroom_table$spp[grep("mellifera",shroom_table$name.orig)] <- "bee"
- shroom_table$spp[grep("purpuratus",shroom_table$name.orig)] <- "sea urchin" shroom_table
# convert to matrix
+<- matrix(shroom_table,
+ shroom_table_matrix byrow = T,
+ nrow = 14)
+ # convert to dataframe
+<- data.frame(shroom_table_matrix,
+ shroom_table stringsAsFactors = F)
+
+# name columns
+names(shroom_table) <- c("accession", "name.orig","name.new")
+
+# Create simplified species names
+$spp <- "Homo"
+ shroom_table$spp[grep("laevis",shroom_table$name.orig)] <- "Xenopus"
+ shroom_table$spp[grep("musculus",shroom_table$name.orig)] <- "Mus"
+ shroom_table$spp[grep("melanogaster",shroom_table$name.orig)] <- "Drosophila"
+ shroom_table$spp[grep("gambiae",shroom_table$name.orig)] <- "mosquito"
+ shroom_table$spp[grep("mellifera",shroom_table$name.orig)] <- "bee"
+ shroom_table$spp[grep("purpuratus",shroom_table$name.orig)] <- "sea urchin" shroom_table
Take a look:
- shroom_table
shroom_table
## accession name.orig name.new spp
## 1 CAA78718 X. laevis Apx xShroom1 Xenopus
## 2 NP_597713 H. sapiens APXL2 hShroom1 Homo
@@ -725,41 +727,41 @@ 18.6 The shroom family of genes
18.7 Downloading multiple sequences
Instead of getting one sequence at a time we can download several by accessing the “accession” column from the table
-$accession shroom_table
+$accession shroom_table
## [1] "CAA78718" "NP_597713" "CAA58534" "ABD19518" "AAF13269" "AAF13270"
## [7] "NP_065910" "ABD59319" "NP_065768" "AAK95579" "ABA81834" "EAA12598"
## [13] "XP_392427" "XP_783573"
We can give this whole set of accessions to entrez_fetch()
:
-<- entrez_fetch(db = "protein",
- shrooms id = shroom_table$accession,
- rettype = "fasta")
+<- entrez_fetch(db = "protein",
+ shrooms id = shroom_table$accession,
+ rettype = "fasta")
We can look at what we got here with cat()
(I won’t display this because it is very long!)
-cat(shrooms)
+cat(shrooms)
The current format of these data is a single, long set of data. This is a standard way to store, share and transmit FASTA files, but in R we’ll need a slightly different format.
We’ll download all of the sequences again, this time using a function from compbio4all
called entrez_fetch_list()
which is a wrapper function I wrote to put the output of entrez_fetch()
into an R data format called a list.
-<- entrez_fetch_list(db = "protein",
- shrooms_list id = shroom_table$accession,
- rettype = "fasta")
+<- entrez_fetch_list(db = "protein",
+ shrooms_list id = shroom_table$accession,
+ rettype = "fasta")
Now we have a list which as 14 elements, one for each sequence in our table.
-length(shrooms_list)
+length(shrooms_list)
## [1] 14
We now need to clean up each one fo these sequences. We can do that using a simple for()
loop:
-for(i in 1:length(shrooms_list)){
-<- fasta_cleaner(shrooms_list[[i]], parse = F)
- shrooms_list[[i]] }
+for(i in 1:length(shrooms_list)){
+<- fasta_cleaner(shrooms_list[[i]], parse = F)
+ shrooms_list[[i]] }
Second to last step: we need to take each one of our sequences from our list and put it into a vector, in particular a named vector
-# make a vector to store output
-<- rep(NA, length(shrooms_list))
- shrooms_vector
-# run the loop
-for(i in 1:length(shrooms_vector)){
-<- shrooms_list[[i]]
- shrooms_vector[i]
- }
-# name the vector
-names(shrooms_vector) <- names(shrooms_list)
+# make a vector to store output
+<- rep(NA, length(shrooms_list))
+ shrooms_vector
+# run the loop
+for(i in 1:length(shrooms_vector)){
+<- shrooms_list[[i]]
+ shrooms_vector[i]
+ }
+# name the vector
+names(shrooms_vector) <- names(shrooms_list)
Now the final step: we need to convert our named vector to a string set using Biostrings::AAStringSet()
. Note the _ss
tag at the end of the object we’re assigning the output to, which designates this as a string set.
-<- Biostrings::AAStringSet(shrooms_vector) shrooms_vector_ss
+<- Biostrings::AAStringSet(shrooms_vector) shrooms_vector_ss
18.8 Multiple sequence alignment
@@ -767,8 +769,8 @@ 18.8 Multiple sequence alignment<
18.8.1 Building an Multiple Sequence Alignment (MSA)
We’ll use the software msa,
which implements the ClustalW multiple sequence alignment algorithm. Normally we’d have to download the ClustalW program and either point-and-click our way through it or use the command line*, but these folks wrote up the algorithm in R so we can do this with a line of R code. This will take a second or two.
-<- msa(shrooms_vector_ss,
- shrooms_align method = "ClustalW")
+<- msa(shrooms_vector_ss,
+ shrooms_align method = "ClustalW")
## use default substitution matrix
@@ -777,7 +779,7 @@ 18.8.2 Viewing an MSA
18.8.2.1 Viewing an MSA in R
We can look at the output from msa()
, but its not very helpful
- shrooms_align
+ shrooms_align
## CLUSTAL 2.1
##
## Call:
@@ -802,19 +804,19 @@ 18.8.2.1 Viewing an MSA in R
## Con -------------------------...------------------------- Consensus
A function called print_msa()
(Coghlan 2011) which I’ve put intocombio4all
can give us more informative output by printing out the actual alignment into the R console.
To use print_msa()
We need to make a few minor tweaks though first. These are behind the scenes changes so don’t worry about the details right now. We’ll change the name to shrooms_align_seqinr
to indicate that one of our changes is putting this into a format defined by the bioinformatics package seqinr
.
-class(shrooms_align) <- "AAMultipleAlignment"
-<- msaConvert(shrooms_align, type = "seqinr::alignment") shrooms_align_seqinr
+class(shrooms_align) <- "AAMultipleAlignment"
+<- msaConvert(shrooms_align, type = "seqinr::alignment") shrooms_align_seqinr
I won’t display the output from shrooms_align_seqinr
because its very long; we have 14 shroom genes, and shroom happens to be a rather long gene.
-print_msa(alignment = shrooms_align_seqinr,
-chunksize = 60)
+print_msa(alignment = shrooms_align_seqinr,
+chunksize = 60)
18.8.2.2 Displaying an MSA as an R plot
I’m going to just show about 100 amino acids near the end of the alignment, where there is the most overlap across all of the sequences. This is set with the start = ...
and end = ...
arguments. Note that we’re using the shrooms_align
object.
-::ggmsa(shrooms_align, # shrooms_align, NOT shrooms_align_seqinr
- ggmsastart = 2000,
- end = 2100)
+::ggmsa(shrooms_align, # shrooms_align, NOT shrooms_align_seqinr
+ ggmsastart = 2000,
+ end = 2100)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
@@ -822,17 +824,17 @@ 18.8.2.2 Displaying an MSA as an
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
-![](lbrb_files/figure-html/unnamed-chunk-157-1.png)
+![](lbrb_files/figure-html/unnamed-chunk-159-1.png)
18.8.2.3 Saving an MSA as PDF
We can take a look at the alignment in PDF format if we want. I this case I’m going to just show about 100 amino acids near the end of the alignment, where there is the most overlap across all of the sequences. This is set with the y = c(...)
argument.
-msaPrettyPrint(shrooms_align, # alignment
-file = "shroom_msa.pdf", # file name
- y=c(2000, 2100), # range
- askForOverwrite=FALSE)
+msaPrettyPrint(shrooms_align, # alignment
+file = "shroom_msa.pdf", # file name
+ y=c(2000, 2100), # range
+ askForOverwrite=FALSE)
You can see where R is saving things by running getwd()
-getwd()
+getwd()
## [1] "/Users/nlb24/OneDrive - University of Pittsburgh/0-books/lbrb-bk/lbrb"
On a Mac you can usually find the file by searching in Finder for the file name, which I set to be “shroom_msa.pdf” using the file = ...
argument above.
@@ -842,13 +844,13 @@ 18.8.2.3 Saving an MSA as PDF
18.9 Genetic distance.
Next need to first get an estimate of how similar each sequences is. The more amino acids that are identical to each other, the more similar.
Instead of similarity, we usually work in terms of difference or genetic distance (a.k.a. evolutionary distance). This is done with the dist.alignment()
function.
-<- seqinr::dist.alignment(shrooms_align_seqinr,
- shrooms_dist matrix = "identity")
+<- seqinr::dist.alignment(shrooms_align_seqinr,
+ shrooms_dist matrix = "identity")
We’ve made a matrix using dist.alignment()
; let’s round it off so its easier to look at using the round()
function.
-<- round(shrooms_dist,
- shrooms_dist_rounded digits = 3)
+<- round(shrooms_dist,
+ shrooms_dist_rounded digits = 3)
Now let’s look at it
- shrooms_dist_rounded
+ shrooms_dist_rounded
## NP_065768 AAK95579 AAF13269 AAF13270 NP_065910 ABD59319 CAA58534
## AAK95579 0.000
## AAF13269 0.884 0.917
@@ -882,84 +884,84 @@ 18.9 Genetic distance.
18.10 Phylognetic trees (finally!)
We got our sequence, built multiple sequence alignment, and calculated the genetic distance between sequences. Now we are - finally - ready to build a phylogenetic tree.
First, we let R figure out the structure of the tree. There are MANY ways to build phylogenetic trees. We’ll use a common one used for exploring sequences called neighbor joining algorithm via the function nj()
. Neighbor joining uses genetic distances to cluster sequences into clades.
-<- nj(shrooms_dist) tree
+<- nj(shrooms_dist) tree
18.10.1 Plotting phylogenetic trees
Now we’ll make a quick plot of our tree using plot()
(and add a little label using a function called mtext()
).
-# plot tree
-plot.phylo (tree, main="Phylogenetic Tree",
-type = "unrooted",
- use.edge.length = F)
-
-# add label
-mtext(text = "Shroom family gene tree - unrooted, no branch lengths")
-![](lbrb_files/figure-html/unnamed-chunk-164-1.png)
-This is an **unrooted tree*. For the sake of plotting we’ve also ignored the evolutionary distance between the sequences.
-To make a rooted tree we remove type = "unrooted
.
-# plot tree
-plot.phylo (tree, main="Phylogenetic Tree",
-use.edge.length = F)
-
-# add label
-mtext(text = "Shroom family gene tree - rooted, no branch lenths")
-![](lbrb_files/figure-html/unnamed-chunk-165-1.png)
-We can include information about branch length by setting use.edge.length = ...
to T
.
# plot tree
plot.phylo (tree, main="Phylogenetic Tree",
-use.edge.length = T)
-
-# add label
-mtext(text = "Shroom family gene tree - rooted, with branch lenths")
+type = "unrooted",
+ use.edge.length = F)
+
+# add label
+mtext(text = "Shroom family gene tree - unrooted, no branch lengths")
![](lbrb_files/figure-html/unnamed-chunk-166-1.png)
+This is an **unrooted tree*. For the sake of plotting we’ve also ignored the evolutionary distance between the sequences.
+To make a rooted tree we remove type = "unrooted
.
+# plot tree
+plot.phylo (tree, main="Phylogenetic Tree",
+use.edge.length = F)
+
+# add label
+mtext(text = "Shroom family gene tree - rooted, no branch lenths")
+![](lbrb_files/figure-html/unnamed-chunk-167-1.png)
+We can include information about branch length by setting use.edge.length = ...
to T
.
+# plot tree
+plot.phylo (tree, main="Phylogenetic Tree",
+use.edge.length = T)
+
+# add label
+mtext(text = "Shroom family gene tree - rooted, with branch lenths")
+![](lbrb_files/figure-html/unnamed-chunk-168-1.png)
Some of the branches are now very short, but most are very long, indicating that these genes have been evolving independently for many millions of years.
Let’s make a fancier plot. Don’t worry about all the steps; I’ve added some more code to add some annotations on the right-hand side to help us see what’s going on.
-plot(tree, main="Phylogenetic Tree")
-mtext(text = "Shroom family gene tree")
-
-<- 0.551
- x <- 0.6
- x2
-# label Shrm 3
-segments(x0 = x, y0 = 1,
-x1 = x, y1 = 4,
- lwd=2)
- text(x = x*1.01, y = 2.5, "Shrm 3",adj = 0)
-
-segments(x0 = x, y0 = 5,
-x1 = x, y1 = 6,
- lwd=2)
- text(x = x*1.01, y = 5.5, "Shrm 2",adj = 0)
-
-segments(x0 = x, y0 = 7,
-x1 = x, y1 = 9,
- lwd=2)
- text(x = x*1.01, y = 8, "Shrm 1",adj = 0)
-
-segments(x0 = x, y0 = 10,
-x1 = x, y1 = 13,
- lwd=2)
- text(x = x*1.01, y = 12, "Shrm ?",adj = 0)
-
-
-segments(x0 = x, y0 = 14,
-x1 = x, y1 = 15,
- lwd=2)
- text(x = x*1.01, y = 14.5, "Shrm 4",adj = 0)
-
-
-segments(x0 = x2, y0 = 1,
-x1 = x2, y1 = 6,
- lwd=2)
-
-segments(x0 = x2, y0 = 7,
-x1 = x2, y1 = 9,
- lwd=2)
-
-segments(x0 = x2, y0 = 10,
-x1 = x2, y1 = 15,
- lwd=2)
-![](lbrb_files/figure-html/unnamed-chunk-167-1.png)
+plot(tree, main="Phylogenetic Tree")
+mtext(text = "Shroom family gene tree")
+
+<- 0.551
+ x <- 0.6
+ x2
+# label Shrm 3
+segments(x0 = x, y0 = 1,
+x1 = x, y1 = 4,
+ lwd=2)
+ text(x = x*1.01, y = 2.5, "Shrm 3",adj = 0)
+
+segments(x0 = x, y0 = 5,
+x1 = x, y1 = 6,
+ lwd=2)
+ text(x = x*1.01, y = 5.5, "Shrm 2",adj = 0)
+
+segments(x0 = x, y0 = 7,
+x1 = x, y1 = 9,
+ lwd=2)
+ text(x = x*1.01, y = 8, "Shrm 1",adj = 0)
+
+segments(x0 = x, y0 = 10,
+x1 = x, y1 = 13,
+ lwd=2)
+ text(x = x*1.01, y = 12, "Shrm ?",adj = 0)
+
+
+segments(x0 = x, y0 = 14,
+x1 = x, y1 = 15,
+ lwd=2)
+ text(x = x*1.01, y = 14.5, "Shrm 4",adj = 0)
+
+
+segments(x0 = x2, y0 = 1,
+x1 = x2, y1 = 6,
+ lwd=2)
+
+segments(x0 = x2, y0 = 7,
+x1 = x2, y1 = 9,
+ lwd=2)
+
+segments(x0 = x2, y0 = 10,
+x1 = x2, y1 = 15,
+ lwd=2)
+![](lbrb_files/figure-html/unnamed-chunk-169-1.png)
diff --git a/lbrb.Rmd b/lbrb.Rmd
index 8de713f..6669f0e 100644
--- a/lbrb.Rmd
+++ b/lbrb.Rmd
@@ -1957,6 +1957,15 @@ Original content by OpenStax (CC BY 4.0; Download for free at
**NOTE:** The following material was partially adapted by N. Brouwer from [Wikipedia](https://en.wikipedia.org/wiki/National_Center_for_Biotechnology_Information). See the underlying .Rmd file for information on specific paragraphs from Wikipedia.
+## Key concepts
+
+* NCBI
+* Rentrez
+* accession numbers
+* BLAST
+
+## NCBI
+
The **National Center for Biotechnology Information (NCBI)** is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It was founded in 1988 and is approved and funded by the government of the United States. The NCBI houses a series of **databases** relevant to the basic and applied life sciences and is an important resource for **bioinformatics** tools and services. Major databases include **GenBank** for DNA sequences and **PubMed**, a **bibliographic** database for biomedical literature. All these databases are available online through the **Entrez** search engine.
@@ -1967,7 +1976,7 @@ In this chapter we'll briefly discuss the major databases, following up with spe
The GenBank sequence database is an open access collection of publicly available DNA and protein sequences. If you've work with sequence data, you'll work with GenBank. GenBank is the actual database, and it can be searched several ways. For example, you can search for a sequence by its ID number (**accession number**) if you know it, or do a **BLAST search** using an actual sequence to look for similar sequences.
-A key component of GenBank are the **GeneBank Records**, which are annotated summaries of sequences in the databases. For example, below is shown record for a gene [pallysin](https://www.ncbi.nlm.nih.gov/protein/AAC65720) from syphilis In addition to the actual A, T, C, and Gs of the sequences, the record provides **metadata**, such as the scientific name of the organism (*Treponema pallidum*), who did the sequencing, the name of the paper where the sequence was published, and important features of the gene.
+A key component of GenBank are the **GeneBank Records**, which are annotated summaries of sequences in the databases. For example, below is shown record for a gene [pallysin](https://www.ncbi.nlm.nih.gov/protein/AAC65720) https://www.ncbi.nlm.nih.gov/protein/AAC65720 from syphilis In addition to the actual A, T, C, and Gs of the sequences, the record provides **metadata**, such as the scientific name of the organism (*Treponema pallidum*), who did the sequencing, the name of the paper where the sequence was published, and important features of the gene.
A key feature of PubMed records is that they are **hyperlinked** to other NCBI databases. For example, you can click link under the name of the paper which reported the sequence of the gene and it will take you to the PubMed record for that paper (see below). You can also click the "Run BLAST" link and you can search the database for similar sequences. This protein coded for by this particular gene has had its structured solved using x-ray crystallography, and you can see these results under "Protein 3D Structure." In a later chapter we'll get to know these records in further detail.
@@ -1984,21 +1993,19 @@ knitr::include_graphics("images/genbank_record.png")
## Entrez
-https://en.wikipedia.org/wiki/Entrez
-
+
-"Entrez is a federated search engine and web portal that allows users to search many discrete health sciences databases of the NCBI website. The name "Entrez" (a greeting meaning "Come in" in French) was chosen to reflect the spirit of welcoming the public to search the content available from NCBI."
+Entrez is a search engine and web portal that allows users to search many discrete health sciences databases of the NCBI website. The name "Entrez" (a greeting meaning "Come in" in French) was chosen to reflect the spirit of welcoming the public to search the content available from NCBI.
-"Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system."
+Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.
## BLAST
-https://en.wikipedia.org/wiki/BLAST_(biotechnology)
-
+
-"BLAST (basic local alignment search tool)is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence."
+BLAST (Basic Local Alignment Search Tool) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.
@@ -2037,7 +2044,7 @@ library(compbio4all)
## Introduction
-**NCBI** is the National Center for Biotechnology Information. The [NCBI Webiste](www.ncbi.nlm.nih.gov/) is the entry point to a large number of databases giving access to **biological sequences** (DNA, RNA, protein) and biology-related publications.
+**NCBI** is the National Center for Biotechnology Information. The [NCBI Webiste](https://www.ncbi.nlm.nih.gov/) www.ncbi.nlm.nih.gov/ is the entry point to a large number of databases giving access to **biological sequences** (DNA, RNA, protein) and biology-related publications.
When scientists sequence DNA, RNA and proteins they typically publish their data via databases with the NCBI. Each is given a unique identification number known as an **accession number**. For example, each time a unique human genome sequence is produced it is uploaded to the relevant databases, assigned a unique **accession**, and a website created to access it. Sequence are also cross-referenced to related papers, so you can start with a sequence and find out what scientific paper it was used in, or start with a paper and see if any sequences are associated with it.
@@ -2053,7 +2060,7 @@ In this chapter we'll typically refer generically to "NCBI data" and "NCBI datab
Almost published biological sequences are available online, as it is a requirement of every scientific journal that any published DNA or RNA or protein sequence must be deposited in a public database. The main resources for storing and distributing sequence data are three large databases:
-1. USA: **[NCBI database](www.ncbi.nlm.nih.gov/)** (www.ncbi.nlm.nih.gov/)
+1. USA: **[NCBI database](https://www.ncbi.nlm.nih.gov/)** (www.ncbi.nlm.nih.gov/)
1. Europe: **European Molecular Biology Laboratory (EMBL)** database (https://www.ebi.ac.uk/ena)
1. Japan: **DNA Database of Japan (DDBJ)** database (www.ddbj.nig.ac.jp/).
These databases collect all publicly available DNA, RNA and protein sequence data and make it available for free. They exchange data nightly, so contain essentially the same data. The redundancy among the databases allows them to serve different communities (e.g. native languages), provide different additional services such as tutorials, and assure that the world's scientists have their data backed up in different physical locations -- a key component of good data management!
@@ -2093,7 +2100,7 @@ As mentioned above, for each sequence the NCBI database stores some extra inform
To view the GenBank entry for the DEN-1 Dengue virus, follow these steps:
-1. Go to the [NCBI website](www.ncbi.nlm.nih.gov) (www.ncbi.nlm.nih.gov).
+1. Go to the [NCBI website](https://www.ncbi.nlm.nih.gov) (www.ncbi.nlm.nih.gov).
1. Search for the accession number NC_001477.
1. Since we searched for a particular accession we are only returned a single main result which is titled "NUCLEOTIDE SEQUENCE: Dengue virus 1, complete genome."
1. Click on "Dengue virus 1, complete genome" to go to the GenBank entry.
@@ -2273,7 +2280,7 @@ The following chapter was originally written by Avril Coghlan. It provides brie
## Retrieving genome sequence data via the NCBI website
-You can easily retrieve DNA or protein sequence data by hand from the [NCBI](www.ncbi.nlm.nih.gov) Sequence Database via its website www.ncbi.nlm.nih.gov.
+You can easily retrieve DNA or protein sequence data by hand from the [NCBI](https://www.ncbi.nlm.nih.gov/) Sequence Database via its website www.ncbi.nlm.nih.gov.
Dengue DEN-1 DNA is a viral DNA sequence and its NCBI **accession number** is NC_001477. To retrieve the DNA sequence for the Dengue DEN-1 virus from NCBI, go to the NCBI website, type “NC_001477” in the Search box at the top of the webpage, and press the “Search” button beside the Search box.
@@ -2320,7 +2327,7 @@ In a previous vignette you learned how to retrieve sequences from the NCBI datab
As mentioned previously, a subsection of the NCBI database called **RefSeq** consists of high quality DNA and protein sequence data. Furthermore, the NCBI entries for the RefSeq sequences have been **manually curated**, which means that expert biologists employed by NCBI have added additional information to the NCBI entries for those sequences, such as details of scientific papers that describe the sequences.
-Another extremely important manually curated database is [**UniProt**](www.uniprot.org), which focuses on protein sequences. UniProt aims to contains manually curated information on all known protein sequences. While many of the protein sequences in UniProt are also present in RefSeq, the amount and quality of manually curated information in UniProt is much higher than that in RefSeq.
+Another extremely important manually curated database is [UniProt](https://www.uniprot.org/) www.uniprot.org, which focuses on protein sequences. UniProt aims to contains manually curated information on all known protein sequences. While many of the protein sequences in UniProt are also present in RefSeq, the amount and quality of manually curated information in UniProt is much higher than that in RefSeq.
For each protein in UniProt, the UniProt curators read all the scientific papers that they can find about that protein, and add information from those papers to the protein’s UniProt entry. For example, for a human protein, the UniProt entry for the protein usually includes information about the biological function of the protein, in what human tissues it is expressed, whether it interacts with other human proteins, and much more. All this information has been manually gathered by the UniProt curators from scientific papers, and the papers in which the found the information are always listed in the UniProt entry for the protein.
@@ -2340,7 +2347,7 @@ This tells us that *Mycobacterium* is a species of bacteria, which belongs to a
Back up at the top under "organism" is says "Status", which tells us the **annotation score** is 2 out of 5, that it is a "Protein inferred from homology", which means what we know about it is derived from bioinformatics and computational tools, not lab work.
-Beside the heading “Function”, it says that the function of this protein is that it “Removes the pyruvyl group from chorismate to provide 4-hydroxybenzoate (4HB)”. This tells us this protein is an enzyme (a protein that increases the rate of a specific biochemical reaction), and tells us what is the particular biochemical reaction that this enzyme is involved in. At the end of this info it says "By similarity", which again indicates that what we know about this protein comes from bioinformatics, not lab work.
+Beside the heading “Function”, it says that the function of this protein is that it “Removes the pyruvyl group from chorismate to provide 4-hydroxybenzoate (4HB)”. This tells us this protein is an enzyme (a protein that increases the rate of a specific biochemical reaction), and tells us what is the particular biochemical reaction that this enzyme is involved in. At the end of this info it says "By similarity", which again indicates that what we know about this protein comes from bioinformatics, not lab work.
### Protein sequence and size
@@ -2386,6 +2393,13 @@ We can confirm
str(lepraeseq)
```
+```{r eval = F}
+ # 'SeqFastadna' chr [1:210] "m" "t" "n" "r" "t" "l" "s" "r" "e" "e" "i" ...
+ # - attr(*, "name")= chr "sp|Q9CD83|PHBS_MYCLE"
+ # - attr(*, "Annot")= chr ">sp|Q9CD83|PHBS_MYCLE Chorismate pyruvate-lyase
+ # OS=Mycobacterium leprae (strain TN) OX=272631 GN=ML0133 PE=3 SV=1"
+```
+
For the other sequence
@@ -2397,6 +2411,16 @@ file.2 <- system.file("./extdata/A0PQ23.fasta", package = "compbio4all")
# load fasta
ulcerans <- read.fasta(file = file.2)
ulceransseq <- ulcerans[[1]]
+
+str(ulceransseq)[[1]]
+```
+
+
+```{r}
+ # 'SeqFastadna' chr [1:212] "m" "l" "a" "v" "l" "p" "e" "k" "r" "e" "m" ...
+ # - attr(*, "name")= chr "tr|A0PQ23|A0PQ23_MYCUA"
+ # - attr(*, "Annot")= chr ">tr|A0PQ23|A0PQ23_MYCUA Chorismate pyruvate-lyase
+# OS=Mycobacterium ulcerans (strain Agy99) OX=362242 GN=MUL_2003 PE=4 SV=1"
```
diff --git a/lbrb.log b/lbrb.log
index 24eb8b2..a815718 100644
--- a/lbrb.log
+++ b/lbrb.log
@@ -1,4 +1,4 @@
-This is XeTeX, Version 3.14159265-2.6-0.999992 (TeX Live 2020) (preloaded format=xelatex 2020.4.7) 18 SEP 2021 00:33
+This is XeTeX, Version 3.14159265-2.6-0.999992 (TeX Live 2020) (preloaded format=xelatex 2020.4.7) 18 SEP 2021 01:24
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -1123,23 +1123,33 @@ t;!
]
Chapter 12.
-Underfull \hbox (badness 10000) in paragraph at lines 2355--2356
+Underfull \hbox (badness 2376) in paragraph at lines 2373--2374
+[]\TU/lmr/m/n/10 A key component of GenBank are the \TU/lmr/bx/n/10 GeneBank Re
+cords\TU/lmr/m/n/10 , which are annotated summaries
+ []
+
+
+Underfull \hbox (badness 10000) in paragraph at lines 2373--2374
[]
+[65]
File: images/genbank_record.png Graphic file (type bmp)
-Overfull \hbox (338.73183pt too wide) in paragraph at lines 2359--2360
+Overfull \hbox (338.73183pt too wide) in paragraph at lines 2377--2378
[][]
[]
-[65] [66] [67] [68
-]
+Underfull \vbox (badness 10000) has occurred while \output is active []
+
+[66] [67] [68]
Chapter 13.
-[69] [70] [71] [72] [73]
-Underfull \hbox (badness 1540) in paragraph at lines 2642--2642
+[69
+
+] [70] [71] [72] [73]
+Underfull \hbox (badness 1540) in paragraph at lines 2656--2656
[]\TU/lmr/bx/n/14.4 Example: finding the sequences published in Nature
[]
@@ -1151,14 +1161,14 @@ Chapter 14.
Chapter 15.
[77
-] [78] [79] [80
-
-]
+] [78] [79] [80]
Chapter 16.
Underfull \vbox (badness 1838) has occurred while \output is active []
-[81] [82] [83] [84]
+[81
+
+] [82] [83] [84]
Chapter 17.
[85
@@ -1167,43 +1177,43 @@ Chapter 17.
]
Chapter 18.
[87] [88]
-Underfull \hbox (badness 10000) in paragraph at lines 3221--3223
+Underfull \hbox (badness 10000) in paragraph at lines 3251--3253
[]
-Underfull \hbox (badness 10000) in paragraph at lines 3236--3238
+Underfull \hbox (badness 10000) in paragraph at lines 3266--3268
[]
[89] [90] [91]
-Overfull \hbox (11526.3668pt too wide) in paragraph at lines 3428--3429
+Overfull \hbox (11526.3668pt too wide) in paragraph at lines 3458--3459
[][]
[]
[92] [93] [94]
-Underfull \hbox (badness 1824) in paragraph at lines 3601--3602
+Underfull \hbox (badness 1824) in paragraph at lines 3631--3632
[]\TU/lmr/m/n/10 We’ll download all of the sequences again, this time using a f
unction from [][][][] called
[]
[95] [96]
-File: lbrb_files/figure-latex/unnamed-chunk-157-1.pdf Graphic file (type pdf)
-
19.1.4.4 Comments
One of the reasons we use script files is that we can combine R code with comments that tell us what the R code is doing. Comments are preceded by the hashtag symbol #. Frequently we’ll write code like this:
-If you highlight all of this code (including the comment) and then click on “run”, you’ll see that RStudio sends all of the code over console.
Comments can also be placed at the end of a line of code
-Sometimes we write code and then don’t want R to run it. We can prevent R from executing the code even if its sent to the console by putting a “#” infront of the code.
If I run this code, I will get just the mean but not the sd.
-Doing this is called commenting out a line of code.