Designing and Running NGS Workflows

Author: Thomas Girke (thomas.girke@ucr.edu)

Last update: 27 April, 2016

Alternative formats of this tutorial: .Rmd HTML, .Rmd, .R, HTML Slides

Introduction

systemPipeR provides utilities for building and running automated end-to-end analysis workflows for a wide range of next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke , 2014). Important features include a uniform workflow interface across different NGS applications, automated report generation, and support for running both R and command-line software, such as NGS aligners or peak/variant callers, on local computers or compute clusters. The latter supports interactive job submissions and batch submissions to queuing systems of clusters. For instance, systemPipeR can be used with most command-line aligners such as BWA (Li , 2013; Li et al., 2009), TopHat2 (Kim et al., 2013) and Bowtie2 (Langmead et al., 2012), as well as the R-based NGS aligners Rsubread (Liao et al., 2013) and gsnap (gmapR) (Wu et al., 2010). Efficient handling of complex sample sets (e.g. FASTQ/BAM files) and experimental designs is facilitated by a well-defined sample annotation infrastructure which improves reproducibility and user-friendliness of many typical analysis workflows in the NGS area (Lawrence et al., 2013).

Motivation and advantages of sytemPipeR environment:

Facilitates design of complex NGS workflows involving multiple R/Bioconductor packages
Common workflow interface for different NGS applications
Makes NGS analysis with Bioconductor utilities more accessible to new users
Simplifies usage of command-line software from within R
Reduces complexity of using compute clusters for R and command-line software
Accelerates runtime of workflows via parallelzation on computer systems with mutiple CPU cores and/or multiple compute nodes
Automates generation of analysis reports to improve reproducibility

A central concept for designing workflows within the sytemPipeR environment is the use of workflow management containers called SYSargs (see Figure 1). Instances of this S4 object class are constructed by the systemArgs function from two simple tabular files: a targets file and a param file. The latter is optional for workflow steps lacking command-line software. Typically, a SYSargs instance stores all sample-level inputs as well as the paths to the corresponding outputs generated by command-line- or R-based software generating sample-level output files, such as read preprocessors (trimmed/filtered FASTQ files), aligners (SAM/BAM files), variant callers (VCF/BCF files) or peak callers (BED/WIG files). Each sample level input/outfile operation uses its own SYSargs instance. The outpaths of SYSargs usually define the sample inputs for the next SYSargs instance. This connectivity is established by writing the outpaths with the writeTargetsout function to a new targets file that serves as input to the next systemArgs call. Typically, the user has to provide only the initial targets file. All downstream targets files are generated automatically. By chaining several SYSargs steps together one can construct complex workflows involving many sample-level input/output file operations with any combinaton of command-line or R-based software.

**Figure 1:** Workflow design structure of *systemPipeR*

The intended way of running sytemPipeR workflows is via *.Rnw or *.Rmd files, which can be executed either line-wise in interactive mode or with a single command from R or the command-line using a Makefile. This way comprehensive and reproducible analysis reports in PDF or HTML format can be generated in a fully automated manner by making use of the highly functional reporting utilities available for R. Templates for setting up custom project reports are provided as *.Rnw files by the helper package systemPipeRdata and in the vignettes subdirectory of systemPipeR. The corresponding PDFs of these report templates are available here: systemPipeRNAseq, systemPipeRIBOseq, systemPipeChIPseq and systemPipeVARseq. To work with *.Rnw or *.Rmd files efficiently, basic knowledge of Sweave or knitr and Latex or R Markdown v2 is required.

Jump to: