Intro to R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

plot(pressure)

You will be making an RMarkdown document as part of the lab to turn in at the end of this section

Microbiome Lab Background

This is a Daphnia

This is a Daphnia

Goals

  1. Be able to understand and use the DADA2 pipeline to get accurate sample inference from amplicon data wrt possible errors like chimeras or 3% OTU assignment
  2. Be able to assess the quality and limitations of your microbiome data
  3. Be able to visualize microbiome analyses using ggplot2 in R

Data Files

  1. Demultiplexed fasta files for several microbiome accessions from three genotype lines of Daphnia magna from Dr. Shaack’s lab (Frankel-Bricker et al., in review) from three different temperature treatments (low, medium, and high)
  2. A phyloseq object, that Mick made when doing his analysis of the paper, in cooking-show fashion, to allow you to start making graphs in R on an already curated dataset while you wait during the filtering, trim, and taxonomic assignment steps.
  3. A CSV file with the experimental metadata.
  4. The most recent SILVA dataset for assigning ASVs.
  5. A cheatsheet of Mick’s R Markdown file, in case you get really stuck, though we encourage you to troubleshoot with colleagues first.

Biological Background

We want to assess variation in the microbiota of Daphnia magna across genotypes, populations, and temperature. According to Wikipedia, which you can reference in peer-reviewed journals now (see many papers by the famous scientist Mike Freeling), the microbiome “describes either the collective genomes of the microorganisms that reside in an environmental niche or the microorganisms themselves… [which] includes bacteria, archaea, protists, fungi and viruses… and [has] been found to be crucial for immunologic, hormonal and metabolic homeostasis of their host.” In short, microbes affect Daphnia fitness and therefore the microbiome is interesting from an evolutionary and ecological perspective, especially in the light of climate change. In particular, the idea of a super organism or holobiont has become popular in the microbiome literature, but it is uncertain what that means with respect to our theories on how natural selection acts on “individuals.”

Methodological Background

The study of microbiomes is a burgeoning field often described as the “wild west,” due to its descriptive nature, high-impact ideas with low methodological standards, ad hoc hypotheses, and general irreproducibility. Therefore understanding the methodologies used in the field will allow us to read these papers with a more critical eyes, as well as help us develop best-practices as the field matures, since many of us are interested in the biological questions raised by such an approach.

16S rRNA gene sequencing is a standard way of conducting microbiome analyses and is the standard method of classifying and identifying Bacteria and Archaea. We will be learning the DADA2 pipeline as implemented in R which takes as its starting input “Illumina-sequenced paired-end fastq files that have been split (or “demultiplexed”) by sample and from which the barcodes/adapters have already been removed. The end product is an amplicon sequence variant (ASV) table, a higher-resolution analogue of the traditional OTU table, which records the number of times each exact amplicon sequence variant was observed in each sample.” We will be talking more about the difference between ASVs and traditional 3% OTUs in discussion.

Sampling sites

Sampling sites

Then we will be re-analyzing the data from Frankel-Bricker et al. (in review) in order to see if we can reproduce their results and learn about creative ways to do data visualization in RStudio using R markdown and the ggplot2 package.

Daphnia magna lineages used in this experiment were descendants of lineages originally collected from three locales: Finland (F), Israel (I), and Germany (G; see Figure 1 and Supplemental Table Sx). Three genotypes were assayed from each location, for a total of 9 genotypes: (FA, FB, FC, IA, IB, IC, GA, GB, GC). Experimental animals for each lineage were maintained in duplicate (C1, C2) at each temperature in order to test for a jar effect (referred to as ‘Source’ in Table 2).

The V3-V4 hypervariable region of the microbial 16S rDNA gene was amplified using the primer pair 341f-785R (Klindworth et al., 2013/Illumina MiSeq Manufacturer’s Protocol) using linker sequences proposed by Takahashi (Takahashi et al., 2014), and CS Tags (adapters) recommended by the University of Idaho Genomics Resources Core. PCR produced amplicons of approximately 460 bp. Polymerase chain reactions (50 ul) were performed using QuantaBio 5PRIME HotMasterMix following the manufacturer’s protocol. Secondary PCR products were quantified using a Qubit fluorometer and subsequently pooled and sequenced on the Illumina MiSeq V3 platform.

An interesting methodological aspect of this study is that they make use of many different controls to see the degree to which individual laboratory environments shape the microbiome of the hosts raised therein. We will also be looking at the results of the controls and brainstorm best practices in how we actually can use controls, which is surprisingly not a trivial problem.

Microbiome Lab Assignment

We will be implementing this lab in the Rstudio environment. Please work on this lab in this Rmarkdown file and export as a html or PDF when you are finished. You can answer the questions in the markdown file. Go to the CyVerse website and login in to download the files. Then create a directory on your desktop and modify their path to go to these files.

setwd(PATH)

Part 1. From de-multiplexed reads to a Phyloseq object

First we need to load some packages. Remember you can use the install.packages function to first download these packages if you do not already have them. They have been commented out of the file for ease of knitting.

# source("https://bioconductor.org/biocLite.R")
# biocLite("BiocStyle")
# biocLite("DECIPHER")
# biocLite("phangorn")
# biocLite("adespatial")
# biocLite("DESeq2")

# library(dada2)
# library("knitr")
# library(phyloseq)
# library(ggplot2)
# library(tidyverse)
# library("BiocStyle")
# library("DECIPHER")
# library("phangorn")
# library(devtools)
# library("vegan")
# library("DESeq2")
# library("adespatial")
# library("genefilter")
# library("dbplyr")
# library("microbiomeSeq")
# library("nlme")
# library("lme4")
# library("lmerTest")
# library("pairwiseAdonis")
# library("RVAideMemoire")
# library("gridExtra")

DADA2

Go to the DADA2 website (https://benjjneb.github.io/dada2/tutorial.html) and work through the tutorial with our data. Use the CSV file and the SILVA dataset to assign ASVs. Remember to embed your R code in the Markdown file in a way that visualizes the data nicely in the output. Below is a template from Mick’s markdown file that you can use to modify. You can either uncomment the lines and modify them to work on your datasets or you can use the tutorial to write the code yourselves.

Set up Path and filter reads

# pathF <- "/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/R1"
# pathR <- "/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/R2"
# filtpathF <- file.path(pathF, "filtered")
# filtpathR <- file.path(pathR, "filtered")
# fastqFs <- sort(list.files(pathF, pattern="fastq"))
# fastqRs <- sort(list.files(pathR, pattern="fastq"))
# if(length(fastqFs) != length(fastqRs)) stop("Forward and reverse files do not match.")
# 
# out3 <- filterAndTrim(fwd=file.path(pathF, fastqFs),
#                      filt=file.path(filtpathF, fastqFs), rev=file.path(pathR, fastqRs),
#                      filt.rev=file.path(filtpathR, fastqRs), truncLen=c(250,240),
#                      maxN=0, maxEE=c(6,6), truncQ=2,
#                      rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=TRUE)

Learn error rates and merge pair-ends

# filtpathF <- "/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/R1/filtered"
# filtpathR <- "/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/R2/filtered"
# filtFs <- list.files(filtpathF, pattern="fastq", full.names = TRUE)
# filtRs <- list.files(filtpathR, pattern="fastq", full.names = TRUE)
# sample.names <- sapply(strsplit(basename(filtFs), "_"), `[`, 1)
# sample.namesR <- sapply(strsplit(basename(filtRs), "_"), `[`, 1)
# if(!identical(sample.names, sample.namesR)) stop("Forward and reverse files do not match.")
# names(filtFs) <- sample.names
# names(filtRs) <- sample.names
# set.seed(100)
# errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)
# errR <- learnErrors(filtRs, nbases=1e8, multithread=TRUE)
# mergers <- vector("list", length(sample.names))
# names(mergers) <- sample.names
# 
# for(sam in sample.names) {
#   cat("Processing:", sam, "\n")
#   derepF <- derepFastq(filtFs[[sam]])
#   ddF <- dada(derepF, err=errF, multithread=TRUE)
#   derepR <- derepFastq(filtRs[[sam]])
#   ddR <- dada(derepR, err=errR, multithread=TRUE)
#   merger <- mergePairs(ddF, derepF, ddR, derepR)
#   mergers[[sam]] <- merger
# }
Question 1

How do we know how much we should filter our reads? Following the recommendations in the DADA2 tutorial, did this work? What compromise do we need to do here? Write this in the space below:

Save reads

# seqtab3 <- makeSequenceTable(mergers)
# saveRDS(seqtab3, "/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/output/seqtabnew3.rds")
# st3 <- readRDS("/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/output/seqtabnew3.rds")
# seqtab3 <- removeBimeraDenovo(st3, method="consensus", multithread=TRUE)

Check filters

Let’s see how many of the initial reads were retaiened after the filtering step and the removal of the chimeras.

# getN <- function(x) sum(getUniques(x))
# track <- cbind(out3,rowSums(seqtab3))
# colnames(track) <- c("input", "filtered","nonchim")
# rownames(track) <- sample.names
# head(track)
# write.csv(track, "~/Desktop/track.csv")
# track <- read.csv("track.csv")
# head(track)

Assign OTUs

We are using the Silva database v132 to assign OTUs.

# tax <- assignTaxonomy(seqtab3, "/Users/michaelsong/Downloads/silva_nr_v132_train_set1.fasta", multithread=TRUE)
# saveRDS(seqtab3, "/Users/michaelsong/Downloads/newseqtab4_final.rds")
# saveRDS(tax, "/Users/michaelsong/Downloads/newtax4_final.rds")
#tax <- readRDS("/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/output/newtax3_final.rds")
Question 2

How might this impact how we think about diversity? Write around three sentence below about the goals of the study in the markdown file and about some of the different ways people can think about microbial or just taxonomic diversity and how that is interesting both in terms of the microbiome, but also evolutionarily. Don’t spend too much time on this, it is just good to muse on these types of questions, as never to lose sight of the forest of evolution.

Build Tree

We’re building a ML tree for phylogenetic distance.

# seqs <- getSequences(seqtab3)
# names(seqs) <- seqs
# alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
# phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
# dm <- dist.ml(phang.align)
# treeNJ <- NJ(dm)
# 
# fit = pml(treeNJ, data=phang.align)
# fitGTR <- update(fit, k=4, inv=0.2)
# fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
#                     rearrangement = "stochastic", control = pml.control(trace = 0))
# saveRDS(fitGTR, "/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/output/fitGTR.rds")
# fitGTR1 <- readRDS("/Volumes/Micksexternal/jonas/data/demultiplexed/dbcAmplicons/output/fitGTR.rds")
# detach("package:phangorn", unload=TRUE)
#load table and make Phyloseq object
#samdf <- read.csv("/Volumes/Micksexternal/Microbiome4.csv", row.names = 1)
# pstree <- phyloseq(otu_table(seqtab3, taxa_are_rows=FALSE, errorIfNULL = TRUE),
#                   sample_data(samdf),
#                  tax_table(tax)
#                   ,phy_tree(fitGTR1$tree))
# head(samdf)

Modify phyloseq object

#Separate Experimental from Control Data
# pstreeExp <- phyloseq::subset_samples(pstree, Experiment=="exp")
# pstreeExp <- phyloseq::subset_samples(pstreeExp, Population!="Control")
#If Finaland is bad it can be removed
# pstreeNoFin <- phyloseq::subset_samples(pstreeExp, Population!="Finland")
# pstreeNoFinNoIs <- phyloseq::subset_samples(pstreeNoFin, Population!="Israel")
# pstreeNoFinNoGer <- phyloseq::subset_samples(pstreeNoFin, Population!="Germany")
# pstreeFin <- phyloseq::subset_samples(pstreeExp, Population="Finland")
#Reduce experiment samples to top 20 most abundant OTUs
# top20 <- names(sort(taxa_sums(pstreeNoFin), decreasing=TRUE))[1:20]
# ps.top20 <- transform_sample_counts(pstreeNoFin, function(OTU) OTU/sum(OTU))
# ps.top20 <- prune_taxa(top20, ps.top20)

Part 2. Data Visualization

For this part of the lab, we can use Mick’s already curated Phyloseq object which includes all three genotypes and the controls. You can load “pstreeNoFin.RDS” (just the experimental samples) and “pstreeExp.RDS” (experimental samples and controls).

Question 3

Go to the Tutorials of https://joey711.github.io/phyloseq/ and to http://userweb.eng.gla.ac.uk/umer.ijaz/projects/microbiomeSeq_Tutorial.html. There are so many interesting ways to visual microbiome data. Make 5 different plots that assess at least 4 different aspects of this study. Write figure legends for them. Make sure to have at least one of the plots include data from the controls. Embed the r scripts and plots below:

Question 4

Using these figures, write a three-ish sentences in the markdown file below about the role of Temperature and Genotype of microbial communities in Daphnia pulex:

Question 5

What about the controls? What if anything can we say from the one or two plots you made. Write a couple sentences below about the limitations of the study and summarize a quick google search on the current best practices of controls in microbiome studies. Remember to cite this!

Below are some examples of decent (good enough to be submitted to Molecular Ecology, but you can do so much better) plots. Feel free to use this data to explore data visualization, but do not post or publish it without permission from Dr. Sarah Schaack, who is a great person just to talk to!

Ordination Plots

library("phyloseq")
library("ggplot2")
pstreeNoFin <- readRDS("/Users/michaelsong/Desktop/pstreeNoFin.RDS") 
ps.prop <- transform_sample_counts(pstreeNoFin, function(otu) otu/sum(otu))
ord.nmds.bray <- ordinate(ps.prop, method="DCA", distance="unifrac")
## Warning in decorana(veganifyOTU(physeq), ...): some species were removed
## because they were missing in the data
pORD <- plot_ordination(ps.prop, ord.nmds.bray, color="Temperature", shape = "Population" ,title="unifrac NMDS")
plot420 <- pORD + geom_point(size=7, alpha=0.75)
plot420

Alpha Diversity

library("microbiomeSeq")
library("vegan")
pt <- plot_anova_diversity(pstreeNoFin, method = c("richness", "fisher", "simpson", "shannon", "evenness"), 
    grouping_column = "Temperature", pValueCutoff = 0.05)

pg <- plot_anova_diversity(pstreeNoFin, method = c("richness", "fisher", "simpson", "shannon", "evenness"), 
    grouping_column = "Genotype", pValueCutoff = 0.05)

print(pt)

Beta Diversity

library(dada2)
library(phyloseq)
library(phangorn)
library(tidyverse)
library(reshape2)

testps_real <- pstreeNoFin
# rarify reads
set.seed(23000)
psrare <- rarefy_even_depth(testps_real, sample.size = 15000)

# get taxa make up at least 1% average reads/sample
# convert reads to relative abundance per sample
psrare_relabundance <- transform_sample_counts(psrare, function(x) x / sum(x))


# Filter taxa that have >= 1% average reads/sample
ps_filter <- filter_taxa(psrare_relabundance, function(x) mean(x) >= 1e-2, TRUE)



#Plot for both Populations and Temperatures, plotted at family level...not scaled for percent of reads
Final_Test1 <- plot_bar(ps_filter, "Temperature", fill = "Family", title = "Final Test 1") +
facet_wrap(~Population)

#subsets based on population (different for my project)
popI <- subset_samples(ps_filter, Population == "Israel")
popG <- subset_samples(ps_filter, Population == "Germany")

#Merge samples by temperature, rename variables in new object by “levels” of former object
popItemp <- merge_samples(popI, "Temperature")
sample_data(popItemp)$Population <- levels(sample_data(popI)$Population)
sample_data(popItemp)$Temperature <- levels(sample_data(popI)$Temperature)

popGtemp <- merge_samples(popG, "Temperature")
sample_data(popGtemp)$Population <- levels(sample_data(popG)$Population)
sample_data(popGtemp)$Temperature <- levels(sample_data(popG)$Temperature)

#Convert relative reads to percentages
Itemp_percent <- transform_sample_counts(popItemp, function(x) 100 * x/sum(x))
Gtemp_percent <- transform_sample_counts(popGtemp, function(x) 100 * x/sum(x))

###These are the money plots right here
#Separate plots for each population, bars for temps and Family as legend/color

popItemp_plot <- plot_bar(Itemp_percent, "Temperature", fill = "Family", title = "Israel") +
facet_wrap(~Population)
popItemp_plot

popGtemp_plot <- plot_bar(Gtemp_percent, "Temperature", fill = "Family", title = "Germany") +
facet_wrap(~Population)
popGtemp_plot