The following performs variant calling with GATK, BCFtools and VariantTools in parallel mode on a compute cluster (McKenna et al., 2010; Li, 2011). If a cluster is not available, the runCommandline function can be used to run the variant calling with GATK and BCFtools for each sample sequentially on a single machine, or callVariants in the case of VariantTools. Typically, the user would choose only one of these variant callers rather than running all of them.
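For instance, a sequential single-machine run could look like the following sketch, where args is the SYSargs instance created in the corresponding subsection below:

## Sequential single-machine alternative (sketch); 'args' is the SYSargs
## instance created in the GATK or BCFtools subsection below.
runCommandline(args=args)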

Variant calling with GATK

The following creates in the initial step a new targets file (targets_bam.txt). The first column of this file gives the paths to the BAM files created in the alignment step. The new targets file and the parameter file gatk.param are used to create a new SYSargs instance for running GATK. Since GATK involves many processing steps, it is executed by a bash script gatk_run.sh in which the user can specify the detailed run parameters. All three files are expected to be located in the current working directory. Sample files for gatk.param and gatk_run.sh are available in the param subdirectory provided by systemPipeRdata.
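The targets_bam.txt file can be written with writeTargetsout from the SYSargs object of the alignment step (a minimal sketch; it assumes that args still refers to the alignment-step SYSargs object, which is not shown in this section):

writeTargetsout(x=args, file="targets_bam.txt", overwrite=TRUE)  # 'args' from the alignment step (assumption)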

moduleload("picard/1.130"); moduleload("samtools/1.3")
system("picard CreateSequenceDictionary R=./data/tair10.fasta O=./data/tair10.dict")
system("samtools faidx data/tair10.fasta")
args <- systemArgs(sysma="param/gatk.param", mytargets="targets_bam.txt")
resources <- list(walltime="20:00:00", nodes=paste0("1:ppn=", 1), memory="10gb")
reg <- clusterRun(args, conffile=".BatchJobs.R", template="torque.tmpl", Njobs=18, runid="01",
                  resourceList=resources)
waitForJobs(reg)
# unlink(outfile1(args), recursive = TRUE, force = TRUE)
writeTargetsout(x=args, file="targets_gatk.txt", overwrite=TRUE)

Variant calling with BCFtools

The following runs the variant calling with BCFtools. This step requires the parameter file sambcf.param and the bash script sambcf_run.sh in the current working directory.

args <- systemArgs(sysma="param/sambcf.param", mytargets="targets_bam.txt")
resources <- list(walltime="20:00:00", nodes=paste0("1:ppn=", 1), memory="10gb")
reg <- clusterRun(args, conffile=".BatchJobs.R", template="torque.tmpl", Njobs=18, runid="01",
                  resourceList=resources)
waitForJobs(reg)
# unlink(outfile1(args), recursive = TRUE, force = TRUE)
writeTargetsout(x=args, file="targets_sambcf.txt", overwrite=TRUE)

Variant calling with VariantTools

library(gmapR); library(BiocParallel); library(BatchJobs)
args <- systemArgs(sysma="param/vartools.param", mytargets="targets_gsnap_bam.txt")
f <- function(x) {
    library(VariantTools); library(gmapR); library(systemPipeR)
    args <- systemArgs(sysma="param/vartools.param", mytargets="targets_gsnap_bam.txt")
    gmapGenome <- GmapGenome(systemPipeR::reference(args), directory="data", name="gmap_tair10chr", create=FALSE)
    tally.param <- TallyVariantsParam(gmapGenome, high_base_quality = 23L, indels = TRUE)
    bfl <- BamFileList(infile1(args)[x], index=character())
    var <- callVariants(bfl[[1]], tally.param)
    sampleNames(var) <- names(bfl)
    writeVcf(asVCF(var), outfile1(args)[x], index = TRUE)
}
funs <- makeClusterFunctionsTorque("torque.tmpl")
param <- BatchJobsParam(length(args), resources=list(walltime="20:00:00", nodes="1:ppn=1", memory="6gb"), cluster.functions=funs)
register(param)
d <- bplapply(seq(along=args), f)
writeTargetsout(x=args, file="targets_vartools.txt", overwrite=TRUE)

Inspect VCF file

VCF files can be imported into R with the readVcf function. Both VCF and VRanges objects provide convenient data structures for working with variant data (e.g. SNP quality filtering; see the example after the code below).

library(VariantAnnotation)
args <- systemArgs(sysma="param/filter_gatk.param", mytargets="targets_gatk.txt")
vcf <- readVcf(infile1(args)[1], "A. thaliana")
vcf
vr <- as(vcf, "VRanges")
vr
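For example, a basic quality filter can be applied directly to the VCF object (a sketch; the QUAL cutoff of 20 is an arbitrary example value):

## Keep variants with a Phred-scaled quality of at least 20 (example cutoff)
qual <- rowRanges(vcf)$QUAL
vcf_filt <- vcf[!is.na(qual) & qual >= 20]
vcf_filt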