Tools for Reproducible Research and Dynamic Documents with R

IAMCS Machine Learning and Applied Statistics Workshop Series

February 25, 2014

Mathew W. McLean
Research Assistant Professor
Texas A & M University
http://stat.tamu.edu/~mmclean


Your research should be reproducible!

  • An alarming number of results published in top academic journals cannot be reproduced
  • Ideally, output of research is not just a paper, but all code and data needed to reproduce it
  • Literate Programming:
    • A program should have plain language explanations interspersed with source code
    • Tools such as Sweave and knitr make this simple to do
    • Results are automatically included when document is created (no exporting needed)
    • Changes are automatically incorporated if methods change

It makes your life easier!

  • Already know LaTeX?
    • The basics of knitr can be learned quite quickly
      • Just write LaTeX as usual with benefit of being able to add chunks of R code
      • No wasted time saving R results and manually inputting them into a TeX file
  • Do you really need all the bells and whistles of LaTeX?
    • If you're fine with your document just being viewed on the web, then you don't
    • If you don't need to submit for publication to a journal, then you probably don't
    • If you don't, you can use Markdown!
      • Has an extremely simple syntax which can be learned in minutes
      • Can be converted to HTML or PDF formats nearly instantaneously
  • For teaching, the interactivity we can gain by using HTML has considerable potential

It makes your life easier!

  • RStudio development environment for R
  • Several features that make it easier to code in R including:

It makes your life easier!

  • Using GitHub or another version control service for collaborations
    • No back and forth emailing of code (each developer gets their own copy)
    • Easy to go back to an old version of code if something breaks
    • Compare multiple versions of code at once
    • Add visibility to your work, makes it easier for others to access and share it
    • If you want, other users (total strangers!) can edit your code and suggest changes
    • Putting code on it is easier than submitting to CRAN (code doesn't even have to work!)
    • Keep track of progress and how much work is being done by whom
  • A system for annotating a document to separate metadata from text
  • E.g. TeX, LaTeX, HTML (HyperText Markup Language)
  • Sometimes the markup is hidden from the author, e.g. Microsoft Word (WYSIWYG)
  • Procedural markup: provides instructions for other programs to process, e.g. TeX
  • Descriptive/semantic markup: the markup provides descriptive labels, e.g. HTML
  • Lightweight markup language - markup language with simple syntax
    • Easy to write and easy to read
    • Markdown is a widely used example
    • Can be quickly converted to other markup languages such as HTML
  • Provides a simple syntax for creating reproducible documents
  • R code is executed and embedded, then report is converted to Markdown and HTML
    • Objects created in one chunk of code are available to later chunks
  • In its simplest/easiest use, any R script may be converted to HTML or PDF
  • RStudio Notebooks
    • Runs R code in script, outputs results immediately after each line of code
    • Open any R script in RStudio and press Ctrl+Shift+Alt+H
  • The stitch and spin functions in knitr provide additional formatting possibilities
  • A quick reference document is available in RStudio and here
  • Template is provided when .Rmd file started
  • Mathematical notation is supported via MathJax; uses same markup to denote math as TeX
  • Appearance can be customized using CSS
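
As a rough illustration of the spin route mentioned above, the sketch below converts a small R script to R Markdown and then knits it. The script is held in a character vector rather than a file purely for convenience, and its contents are made up for the example; `spin()` and `knit()` are knitr functions, called here through their `text` argument.

```r
library(knitr)

# A tiny script: lines starting with #' become prose, the rest become code chunks
script <- c(
  "#' # A spun report",
  "#' The sum below is computed when the document is knit.",
  "x <- 1 + 1",
  "x"
)

# Convert the script to R Markdown without knitting it yet
rmd <- spin(text = script, knit = FALSE, format = "Rmd")

# Knit the R Markdown text to Markdown; the code is run and its results embedded
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

Working from `text` keeps the example free of temporary files; with a real script one would normally pass the file name instead.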


Header 1
======
### Header 2
Here is some inline R code: `r 1+1`
**This is bold** and *this is emphasized* and `this is monospace` and
[this is a link](http://example.com)
* Here
* is
  23. a
  54. nested
  1. list

And now a code block

```{r}
rnorm(3)
```

Header 1

Header 2

Here is some inline R code: 2 This is bold and this is emphasized and this is monospace and this is a link

  • Here
  • is
    1. a
    2. nested
    3. list

And now a code block

rnorm(3)
## [1] -1.7282  0.8667 -1.2091
  • Sweave and knitr use a special TeX file with extension .Rnw
  • A LaTeX document that includes R code
  • Code chunks are started with << >>= and ended with @
  • Inline code is written \Sexpr{ }
  • Helpful quick reference
<<chunkname, options>>=
  library(ggplot2)
  qplot(mpg, wt, data=mtcars)
@
Here is some inline code \Sexpr{1+1}
  • cache - set to TRUE to store long calculations
    • even the slightest change to the chunk will cause it to be recalculated
    • use dependson='chunkname' to specify which chunks the current chunk depends on
  • echo - FALSE to not output the code, or a numeric vector selecting which expressions to show (negative indices hide them)
  • eval - FALSE to not evaluate the chunk, or a numeric vector selecting which expressions to evaluate (negative indices skip them)
  • include - FALSE to evaluate a chunk but not include the output
  • tidy - FALSE to turn off formatting of code (useful if text isn't wrapping properly)
  • highlight - FALSE to not colour code and output
  • Lots of options for controlling plot output; see here
  • For more customization you may need to set hooks
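
To avoid repeating the same options in every chunk, they can be set once near the top of the document. A minimal sketch using knitr's `opts_chunk` object follows; the particular values are only illustrative.

```r
library(knitr)

# Set defaults for all later chunks; individual chunks can still override them
opts_chunk$set(echo = FALSE, cache = TRUE, tidy = FALSE)

# Inspect the defaults now in effect
current <- opts_chunk$get(c("echo", "cache", "tidy"))
str(current)

# Go back to knitr's built-in defaults
opts_chunk$restore()
```

In an .Rnw or .Rmd file this would typically sit in a first chunk with include=FALSE so that it leaves no trace in the output.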
  • Goal is to automatically generate a quiz from a test bank of questions
  • Below are three questions written in syntax for .Rnw files and stored in an R vector
    • \Sexpr{...} for inline R code; <<>>= ... @ for chunks
TestBank <- c("Construct a \\emph{\\Sexpr{c('box plot', 'stem-and-leaf plot',
  'Normal Q-Q plot')[sample(3, 1)]}} for the following sample of data
  (\\Sexpr{paste0(round(rnorm(10, mean = 60), 1), collapse = ', ')}).",
"Construct a \\Sexpr{sample(c(90, 95, 99), 1)}\\% confidence interval for
the ratio of the variances of the following two samples
(\\Sexpr{paste0(rpois(10, 10), collapse = ', ')}) and
(\\Sexpr{paste0(rpois(6, 6), collapse = ', ')}).",
"What is the probability that an exponential random variable with $\\lambda = 1$
exceeds 1.5 given that it exceeds 1?")
  • We can randomly choose questions from the test bank
  • Here notice the first two questions will have new data each time they are generated
  • We can then write a brew document to generate the questions and output an .Rnw file
  • Assuming we saved the TestBank vector in TestBank.R...
  • Chunks of R code are delimited using <% ... %> or <%= ... %> (latter outputs the results)
\documentclass[12pt]{article}
\title{An Example Quiz using \texttt{brew} and \texttt{knitr} \\ \large IAMCS Seminar}
\author{Mathew McLean}
\usepackage{amsmath,graphicx}
\begin{document}
\maketitle
<%  source("TestBank.R")
test.length <- 2
questions <- TestBank[sample(length(TestBank), test.length)] %>
<%= for (i in seq_along(questions)){
  cat(paste0("\\textbf{Question ", i, "}\n"))
  cat(paste(questions[i], "\\vskip 5cm\n", sep = "\n"))
}  %>
\end{document}

Assuming we have saved the document on the previous slide as "quiz.brew"...

library(brew)
library(knitr)
brew("quiz.brew", "quiz.Rnw")
knit2pdf("quiz.Rnw")  # writes quiz.tex, then compiles it to quiz.pdf
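
Because the questions are drawn at random, regenerating the same quiz later requires fixing the random seed before the draw. The sketch below shows the idea in base R; the three-element `bank`, the helper `draw_quiz`, and the seed value are hypothetical stand-ins and not part of the original quiz code.

```r
# Hypothetical question bank standing in for the TestBank vector
bank <- c("Question A", "Question B", "Question C")

# Draw test.length questions; the same seed always gives the same selection
draw_quiz <- function(bank, test.length, seed) {
  set.seed(seed)
  bank[sample(length(bank), test.length)]
}

quiz1 <- draw_quiz(bank, 2, seed = 2014)
quiz2 <- draw_quiz(bank, 2, seed = 2014)
identical(quiz1, quiz2)  # TRUE: the draw is reproducible
```

In the brew document, the same effect could be had by calling set.seed() before the line that samples from TestBank; the random data inside each \Sexpr{} would then be fixed as well, provided the seed is also set when the .Rnw file is knit.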
  • Slidify and RStudio R Presentations convert Rmd files into HTML5 slides
    • RStudio's version is easier to use
    • Slidify allows for more customization, can use several different frameworks
  • Slightly different syntax for both
    • For Slidify, slides are separated using "---"
    • For RStudio R Presentations, new slide begins with header using "====="
  • Allows users to interact with your slides
    • All the powers of JavaScript are at your fingertips
  • Compilation is much faster than LaTeX/Beamer
  • Easier to type and fewer commands to remember
  • With a little CSS and HTML knowledge, it becomes very easy to customize and debug

Average daily temperature for several Canadian cities (click and hover on curves and names)

See rCharts, googleVis, and ggvis packages in R

Average daily temperature (℃) and total yearly rainfall (mm.)

[Interactive map legend: yearly rainfall, average daily temperature]
library(googleVis)
## ... omitted formatting of cw.dat data.frame
gvisGeoChart(cw.dat, locationvar="lat.long", colorvar='ave.temp', hovervar='place',
                          sizevar='total.precip', options=list(region='CA', resolution='provinces'))
  • Shiny can be used to write applications for the web in R
  • As the user interacts with the app (changes inputs), R code is rerun to update the outputs (reactivity)
  • As with the tools presented earlier, no knowledge of HTML or JavaScript is needed
  • Apps can be deployed locally using the shiny package or hosted on the web
  • A great tutorial and lots of examples are available on the RStudio website
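
A minimal sketch of a reactive app follows, assuming the shiny package is installed. The input and output IDs (`n`, `hist`) and the choice of a histogram are invented for illustration and are not the app shown in the talk.

```r
library(shiny)

# User interface: one slider input and one plot output
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server logic: the plot is redrawn whenever input$n changes (reactivity)
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("n =", input$n))
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch locally only when running interactively
if (interactive()) runApp(app)
```

Moving the slider changes `input$n`, which causes the `renderPlot` expression to rerun without any explicit event-handling code.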


  • For managing changes to documents and code
  • In a distributed VCS

    • There is no central repository
    • Each developer has their own copy of all code
    • No need to be connected to a server, other than to share with other developers
  • The two most popular systems are Git and Subversion (SVN)

    • Git is a distributed VCS, SVN is not
    • R project uses SVN
  • With Git, nearly all operations are done locally and very quickly

    • Delta compression is used: only differences between versions are stored
  • Backup of all files
  • Track history of all changes to every file
  • Public or private depending on preference
  • Easy to transfer files between workstations (e.g. home/laptop and office PC)
  • Take frequent "snapshots" (commits) of code, making it easier to find where issues started
  • Have a branch for working code only, one for fixing bugs, another for adding new features
    • branching and merging is so fast/easy that it can and should be done frequently
  • No emailing of code to anyone; they simply check out a copy of the repository
  • Users can easily report issues and suggest changes (a pull request) on GitHub
  • I personally began using the GitHub GUI before knowing any Git, finding it very easy to use
    • Learn more as projects grow in size and add collaborators
  • Free public repositories
    • Private starts at $7/month for five repositories
  • GUI for Windows provides simple interface for many commands
  • Install R packages hosted on GitHub using devtools's install_github function
  • Can use Travis CI to automatically test your software
    • With each push to repo, Travis CI will build your project and run tests
  • init - initializes a new repository
  • clone - create a copy of a repository
  • branch - an isolated working copy of the project
  • commit - the state of the project at a particular time (used to record changes)
  • HEAD - symbolic name of the currently checked out commit
  • status - gives information about the current working branch
  • log - show the commit logs
  • remote - manage the repositories you track (the "remotes")
  • push - push your changes to another repo (a complete copy must be pushed)
  • fetch - get files from another repository
  • pull - fetch from another repo and merge it with your branch
    The GUI can take care of the common tasks, but you will need to use a shell for some actions
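
The commands listed above can be tried in a throwaway repository. The sketch below drives the git command line from R with `system2()` so that it stays in the same language as the other examples. It assumes git (1.8.5 or later, for the `-C` flag) is on the PATH; the file name, commit message, and identity are invented for the example.

```r
# Run the rest only if a git executable is available
has_git <- nzchar(Sys.which("git"))

if (has_git) {
  repo <- tempfile("demo-repo")
  dir.create(repo)

  # Helper: run a git command inside the temporary repository
  git <- function(...) {
    system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
  }

  git("init")                                    # init: new empty repository
  writeLines("x <- 1", file.path(repo, "analysis.R"))
  git("add", "analysis.R")                       # stage the new file
  git("-c", "user.name=Demo", "-c", "user.email=demo@example.com",
      "commit", "-m", shQuote("First snapshot")) # commit: record a snapshot
  print(git("status", "--short"))                # status: pending changes, if any
  history <- git("log", "--oneline")             # log: one line per commit
  print(history)
}
```

In day-to-day use these commands would be typed directly in a shell, or issued through the GUI.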
  • Makes it easy to print, search, update, import, combine references in R
  • Can read in .bib files and read PDFs using Poppler to create a citation
  • Citations can also be searched for on Crossref, Google Scholar, PubMed, and Zotero
  • Searching is very flexible, can search by arbitrary field plus key or entry type
  • Citations can be generated for inclusion in RMarkdown and RHTML documents
  • Bibliography of all cited references can be printed at end of document
    • hyperlinks automatically added, many bibliography and citation styles available
  • Supports BibLaTeX and BibTeX and converting BibLaTeX to BibTeX if forced
  • Simple interface for setting default behaviour for the most used functions
library(RefManageR)
bib <- ReadBib(system.file("Bib", "biblatexExamples.bib", package = "RefManageR"), check = FALSE)
bib[[49:51]]
## [1] B. Malinowski. _Argonauts of the Western Pacific. An account of native enterprise and
## adventure in the Archipelagoes of Melanesian New Guinea_. 8th ed. London: Routledge and Kegan
## Paul, 1972.
##
## [2] M. Maron. _Animal Triste_. Trans. from the German by B. Goldstein. Lincoln: University of
## Nebraska Press, 2000.
##
## [3] W. Massa. _Crystal structure determination_. 2nd ed. Berlin: Spinger, 2004.
bib2 <- ReadBib(system.file("Bib", "RJC.bib", package = "RefManageR"))
print(bib2[2:3], .opts = list(style = "text", bib.style = "alphabetic", sorting = "none"))
## [Jen+13] E. M. Jennings, J. S. Morris, R. J. Carroll, et al. "Bayesian methods for
## expression-based integration of various types of genomics data". In: _EURASIP Journal on
## Bioinformatics and Systems Biology_ 2013.1 (2013), pp. 1-11.
##
## [Gar+13] T. P. Garcia, S. Müller, R. J. Carroll, et al. "Identification of important
## regressor groups, subgroups and individuals via regularization methods: application to
## gut microbiome data". In: _Bioinformatics, btt_ 608 (2013).
  • Authors and other "name list" fields are stored in the person class
bib2[1]$author
## [1] "N Serban"    "A M Staicu"  "R J Carroll"
  • Let's add url and urldate fields for Jennings et al. and fix the journal for Garcia et al.
bib2[2:3] <- list(c(url="http://bsb.eurasipjournals.com/content/2013/1/13",
                          urldate = "2014-02-02"), c(doi="10.1093/bioinformatics/btt608",
                          journal = "Bioinformatics"))
print(bib2[2:3], .opts = list(style = "html", bib.style = "authoryear"))

Garcia, T. P, S. Müller, R. J. Carroll, et al. (2013). “Identification of important regressor groups, subgroups and individuals via regularization methods: application to gut microbiome data”. In: Bioinformatics 608. DOI: 10.1093/bioinformatics/btt608.

Jennings, E. M, J. S. Morris, R. J. Carroll, et al. (2013). “Bayesian methods for expression-based integration of various types of genomics data”. In: EURASIP Journal on Bioinformatics and Systems Biology 2013.1, pp. 1–11. URL: http://bsb.eurasipjournals.com/content/2013/1/13 (visited on Feb. 02, 2014).

  • How often is someone with family name "Wang" a coauthor?
length(bib2[author = "wang"])
## [1] 37
  • How often is N. Wang a coauthor?
length(SearchBib(bib2, author = "Wang, N.",
                 .opts = list(match.author = "family.with.initials")))
## [1] 19
  • Use names to extract keys and bib$bibtype to get entry types
  • How often did RJC and Ruppert collaborate after leaving UNC in July, 1987?
length(SearchBib(bib2, author='ruppert', date="1987-07/",
   .opts = list(match.date = "exact")))
## [1] 53
  • Carroll and Ruppert papers NOT in the 1990's
length(SearchBib(bib2, author='ruppert', date = "!1990/1999"))
## [1] 59
  • Can specify keys using e.g. bib2[c(key1, key2)] or bib2[key1,key2]
  • Carroll and Ruppert tech reports at UNC "OR" Carroll and Ruppert JASA papers
length(bib2[list(author='ruppert', bibtype="report", institution="north carolina"),
  list(author="ruppert",journal="journal of the american statistical association")])
## [1] 22
  • Interface for options similar to options function
old.opts <- BibOptions(bib.style = "authoryear", match.author = "exact", ignore.case = FALSE,
                       first.inits = FALSE, style = "html")
bib[author = "Baez, John C."]

Baez, John C. and Aaron D. Lauda (2004a). “Higher-Dimensional Algebra V: 2-Groups”. Version 3. In: Theory and Applications of Categories 12, pp. 423-491. arXiv: math/0307200v3.

— (2004b). Higher-Dimensional Algebra V: 2-Groups. arXiv: math/0307200v3.

BibOptions(old.opts) # restore defaults
  • Cite functions for citations, convenience functions Citet, AutoCite, TextCite, Citep
    • Specify style as "html" or "markdown" for HTML and Markdown documents
    • Hyperlinks will be included automatically, and may be turned off if desired
    • The code in unevaluated, chunk form is as follows
BibOptions(check.entries = FALSE, style = "html")
Citep(bib, author = "Itzhaki", .opts = list(cite.style = "alphabetic"))
TextCite(bib, c("loh", "wilde"), .opts = list(cite.style = "authoryear", hyperlink = "#35"))
  • Inline (wrapped in I()): Here is one citation [Itz96], two more are Loh (1992); Wilde (1899)
  • Itz96 goes to the arXiv doc, the other two references go to the bibliography slide

Now we print the bibliography, using results='asis' in the knitr chunk

PrintBibliography(bib, .opts = list(bib.style = "numeric"))

[1] N. Itzhaki. Some remarks on 't Hooft's S-matrix for black holes. Mar. 11, 1996. arXiv: hep-th/9603067.

[2] N. C. Loh. “High-Resolution Micromachined Interferometric Accelerometer”. MA Thesis. Cambridge, Mass.: Massachusetts Institute of Technology, 1992.

[3] O. Wilde. The Importance of Being Earnest: A Trivial Comedy for Serious People. English and American drama of the Nineteenth Century. Leonard Smithers and Company, 1899. Google Books: googlebooks.

  • On a single webpage, the hyperlinks in e.g. [1], will go to the point of first citation
    • Getting this working automatically for HTML5 slides is not implemented yet

Learning More