Rqc : A Bioconductor Package for Quality Control of High-Throughput Sequencing Data

As sequencing costs drop with the constant improvements in the ﬁeld, next-generation sequencing becomes one of the most used technologies in biological research. Sequencing technology allows the detailed characterization of events at the molecular level, including gene expression, genomic sequence and structural variants. Such experiments result in billions of sequenced nucleotides and each one of them is associated to a quality score. Several software tools allow the quality assessment of whole experiments. However, users need to switch between software environments to perform all steps of data analysis, adding an extra layer of complexity to the data analysis workﬂow. We developed Rqc , a Bioconductor package designed to assist the analyst during assessment of high-throughput sequencing data quality. The package uses parallel computing strategies to optimize large data sets processing, regardless of the sequencing platform. We created new data quality visualization strategies by using established analytical procedures. That improves the ability of identifying patterns that may aﬀect downstream procedures, including undesired sources technical variability. The software provides a framework for writing customized reports that integrates seamlessly to the R / Bioconductor environment, including publication-ready images. The package also oﬀers an interactive tool to generate quality reports dynamically. Rqc is implemented in R and it is freely available through the Bioconductor project ( https://bioconductor.org/packages/Rqc/ ) for Windows, Linux and Mac OS X operating systems.


Introduction
Next-generation sequencing (NGS) has become the standard tool to investigate the association between molecular data and phenotypes of interest. This is the result of improvements on the  Figure 1: Workflow for data analysis using the Rqc package. The standard analytical pipeline processes the FASTQ files to map the reads to a reference. The resulting BAM files are later used for quantification and downstream analyses. Our software supports both FASTQ and BAM files as input and does not affect existing pipelines as it provides information that complements the workflow. The standard Rqc output is a self-contained HTML report.
technology and constant drops in costs. High-performance equipment sequences millions of short DNA fragments, also known as reads, yielding billions of nucleotides. During the basecalling process, the sequencer assigns a quality score to every base that comprises the read. This score indicates the degree of certainty that the equipment has correctly identified the nucleotide. Compared to previous approaches, NGS technologies produce bigger yields faster and cheaper. However, higher performance comes at a price: Larger amounts of data make it harder to identify failures that might have occurred during the sequencing run. Common problems are contamination with adapter sequences, drops of quality scores at specific cycles and redundant reads. These issues should be accounted for as early as possible, as they may affect downstream analyses negatively (Bravo and Irizarry 2010). Also, studies may use different sequencing platforms to generate data, which may add extra complexity to the quality assessment step (The 1000 Genomes Project Consortium 2012), requiring tools to be efficient and capable of processing large amounts of data regardless of the technology used for sequencing.
The analysis of high-throughput sequencing data is a procedure that requires careful quality assessment (QA), which can take place at different moments, as Figure 1 shows. The FASTQ file, obtained after base-calling, is comprised of sequenced fragments along with base-specific quality scores. This is the first opportunity for quality control, where researchers search for systematic deviations of quality using PHRED-scaled scores. The second occasion where additional quality control can take place is after mapping the reads to a reference. At this point, the FASTQ files are no longer the objects of interest and, instead, the researcher performs QA on BAM files, which result from the mapping strategy of choice. On both situations, the analyst gathers information to decide what procedures (e.g., removal of reads, trimming, clipping) should be applied to ensure that the data comply with the quality requirements of the study. Acting at these moments allows the researcher to obtain higher success rates in downstream procedures, as they may be affected by low quality reads or contaminants.
Another source of complexity during the analysis of NGS data is the integration of different tools used during the process. One common workflow to identify candidates for differential expression using RNA-Seq data includes: (A) QA using FastQC (Andrews 2016), which is implemented in Java and executed either from the command line or using a graphical user interface. We developed Rqc (Souza and Carvalho 2018) to address QA of NGS data using highperformance strategies to handle file formats that are agnostic to sequencing platforms (i.e., FASTQ and BAM files). Additionally, our solution uses the Bioconductor environment, which is known for delivering high-quality software that implements cutting-edge methodologies for analysis of biological data. Therefore, we provide the users with a quality control tool for NGS data that can be easily integrated to existing pipelines and also used to establish a Bioconductor-based workflow for the analysis of high-throughput data.
This paper describes the Rqc software. Section 2 describes how the Rqc package was developed. In Section 3, we show the usefulness of Rqc package in a QA of public NGS data. Section 4 describes how the Rqc package can help analyzing NGS data.

Implementation
The Rqc package is developed in R, using the Bioconductor framework to efficiently process FASTQ and BAM files and to simplify the delivery of a unified pipeline for NGS data analysis. Our software makes constant use of the Bioconductor three main concepts (Gentleman et al. 2004): transparency, through the development of a free and open-source software; efficiency, by using the infrastructure defined by core-maintained packages integrated into a single development ecosystem; and reproducibility, by ensuring that our software runs on any operating system supported by the R software and generates the same result.
Rqc uses the ShortRead (Morgan, Anders, Lawrence, Aboyoun, Pages, and Gentleman 2009) and Rsamtools (Morgan, Pagés, Obenchain, and Hayden 2016b) packages to extract information from raw and aligned data files respectively, allowing the use of different file formats. Our package supports both FASTQ and BAM file formats. By default, the software processes a random sample of records from each input file to improve execution time. The user can adjust the size of the chunk: larger sizes increase the statistical power, generating more reliable results; smaller values allow for better control of computational resources. The user can also choose to process the whole file, which Rqc does without compromising the available computing resources.
We use the BiocParallel (Morgan, Obenchain, Lang, and Thompson 2016a) package, the main Bioconductor's backend for high-performance computing, to process multiple files in parallel.  Figure 2: Rqc uses the R statistical environment to create reports for quality assessment. The rqcQA method processes the input data and returns a list of 'RqcResultSet' objects, which contains quality-related statistics used to create the final report. The rqcReport method uses the result list combined with a template file, which can be customized by the user, to produce the HTML report.
The parallel processing takes place on single thread or multiple cores, depending on the user's setup. Using this feature, execution time can be significantly reduced. Rqc performs parallel processing of files automatically, setting the number of processing units according to the available computational resources. We made the choice for automatic configuration to deploy a software that is both efficient and simple to use. However, the user has full control of the parallel processing settings, which can be customized when needed.
Given the input files, as shown by Figure 2, Rqc depends on the execution of the rqcQA function to produce the table containing the summary statistics required to create the images for the final report. These statistics are stored as tidy data (Wickham 2014) using a listlike data structure called 'RqcResultSet'. The list containing the result data is processed through the rqcReport method, which combines high-quality images produced via the ggplot2 (Wickham 2009) package with R Markdown template files to generate an HTML report using the knitr (Xie 2018) package.

Results
We developed Rqc, an optimized Bioconductor package to assist the analyst during quality control steps for high-throughput sequencing data. It uses parallel computing strategies to process multiple files efficiently. The package handles both FASTQ and BAM files, allowing the user to assess the quality of the data at different stages of the analysis. Sequencing data are summarized into frequency tables that are used later to generate charts, tables and other statistics.
To demonstrate the operation of the Rqc package we used a set of public data of a study on RNA sequencing (RNA-seq), which is part of the 1000 Genomes Project (The 1000 Genomes Project Consortium 2012). We used 5 paired-end samples of each population group available in the public data set resulting in 50 FASTQ files. These files are used as input for the rqcQA function, which was configured to process 4 files in parallel. To reproduce this demonstration, R version equal or greater than 3.4.2, Bioconductor 3.6 and Rqc 1.12.0 are required. The following code shows how to install the Rqc package.

'RqcResultSet' data structure
The 'RqcResultSet' class stores summarized data from one file that was previously processed. As Rqc package may process multiple files at the same time, a list of 'RqcResultSet' objects is returned. This result list is used as input by many methods with different tasks such as computation of statistics used in plots, chart and report generation. Table 1 presents the available methods for accessing a list of 'RqcResultSet' objects. These methods are also used by other Rqc functions to extract data and produce additional statistics required by different visualization strategies. Table 2 shows general information about processed files.

Method
Description perFileInformation File name, base directory, number of sampled reads, total reads. perFileTopReads Most represented sequencing reads and their counts. perReadWidth Frequency distribution of read width. perReadFrequency Number of unique reads, duplicated reads, etc. perReadQuality Frequency distribution of mean quality of reads. perCycleQuality Frequency distribution of cycle-specific quality. perCycleBasecall Frequency distribution of cycle-specific base call.

Available plots
The Rqc package provides a number of plots to the user. The overall objective of these graphical summaries is to allow the analyst to make an informed decision regarding the quality of the data under inspection. All Rqc methods that generate graphs return an object of class 'ggplot'. These objects can be combined with graphical elements provided by other ggplot2 functions, such as themes and color palettes.
rqcReadQualityBoxPlot generates a graphic chart of per file average read quality distribution that provides an overview of the files (Figure 3). From this chart we can check whether all files have similar qualities. In situations where the DNA samples are sequenced by different research groups, or over several days, there may be quality changes of the fragments through this graph.
Working with many files may complicate the visualization of some graphic charts created by Rqc. The package provides methods for subsetting a list of 'RqcResultSet' objects to address this issue. For example, we could select the result data only from CEU samples or we can also subset by pair of files. R> qa.ceu <-subsetByGroup(qa, "CEU") R> qa.pair1 <-subsetByPair(qa, 1) One new visualization scheme that we implemented in Rqc allows the user to easily determine what percentage of the reads exceed a given threshold of average quality. We can use this information as an indicator of success of a sequencing experiment. For example, a sample that presents 10% of the reads exceeding the Q20 threshold (i.e., 10% of the reads have PHRED-scaled average quality of at least 20) will likely be removed from the analysis or even re-sequenced, if possible. A graphic chart with this information can be obtained using the rqcReadQualityPlot method. Figure 4 presents the average quality pattern by showing on the x-axis quality thresholds and on the y-axis the percentage of reads that exceed that quality level; we use this chart as an indicator of the need for re-sequencing samples according to the minimum average quality.
The biplot is generated by performing a principal component analysis (PCA) on a matrix containing the sample-specific average quality scores per cycle. Using this approach, we can easily investigate differences of patterns of quality at different sequencing cycles. The user can generate such plot using the method rqcCycleAverageQualityPcaPlot ( Figure 5).
The analyst can investigate the average quality per cycle by using box plots or line plots. This strategy allows for the user to pinpoint cycles that show specific behaviors, like sudden drop in quality. rqcCycleAverageQualityPlot generates a graphic chart of cycle-specific average quality ( Figure 6).  R> rqcCycleAverageQualityPcaPlot(qa.pair1) + theme_bw()  Figure 5: Biplot from PCA of cycle-specific read average quality. There are some discrepancies in the quality of the files of the same sample.
R> rqcCycleAverageQualityPlot(qa.pair1) + scale_color_brewer(palette = "Set1") q q q q q q qq q qq qq q q qq q q q q qq q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q qq q q qqq q q q q q q q q qqq qq q q q q q q qq qq q q q q q q q qqq q q q q q q q q q qq q q q q q q q q q q q Another addition by the Rqc package is the ability of generating a heatmap of the similarity between the most common reads observed at a given sample. In our experience, this is of great importance when assessing whether or not different reads may actually be originated from the same fragment, but one of them has some contamination by adapters. This chart can be generated with the rqcFileHeatmap method, which allows for setting the number of most frequent reads to be used. On Figure 8, we present the similarity heatmap, which can assist on the identification of reads that are likely to be measuring the same target.
The analysis of nucleotide composition per cycle is essential to identify biases like those caused by Illumina primers on transcriptome sequencing (Hansen, Brenner, and Dudoit 2010). rqcCycleBaseCallsPlot generates a stacked bar plot that describes the proportion of each nucleotide called for every cycle of sequencing (Figure 9). rqcCycleBaseCallsLinePlot shows the same result using lines instead of stacked bars.

Customized and interactive reports
The rqcQA and rqcReport functions are closely related, as the former generates the input required by the latter. We chose to have these two different functions to allow the user to use the Rqc results in other situations, like using a different engine to create graphics. The rqcQA function requires the input files (FASTQ or BAM) and returns a list of 'RqcResultSet' objects, which is a table containing the summary statistics needed for the quality report. This table and a template file are the input for the rqcReport, which creates the HTML quality report.
R> rqcReport(qa, outdir = ".", file = "qa_report") [1] "/home/welliton/git/rqcpaper/Code/qa_report.html" Users can compose R Markdown files, which contain texts and chunks of R code. These files should be passed as arguments to the rqcReport method. Chunks of R code are executed and their results are captured and merged with pre-existing text, generating a self-contained HTML report that contains text, tables and figures.
The user can also choose to use interactive tools to generate quality reports. We provide the rqcShinyReport method, which uses the shiny package (Chang, Cheng, Allaire, Xie, and McPherson 2018) to deliver a graphical interface where the user can select samples, groups of samples and plots of interest. This service is deployed as a web server and all computations happen in real-time.

Conclusion
We developed the Rqc Bioconductor package to provide a simple and effective strategy for reporting information about the quality high-throughput sequencing data sets. In addition, it can provide a deeper view of data quality by reading entire files without degrading the system performance and allowing the data analyst to identify biases that would be undetectable if only a subset of the data was analyzed. Rqc builds a self-contained report with high-resolution graphics that can be used directly in publications or shared with others. It also implements an interactive web application that simplifies the analysis of data sets, as it does not require coding. Rqc is completely integrated with and deployed through Bioconductor, simplifying its incorporation in Bioconductor-based pipelines. All of the software dependencies are transparently resolved by the R/Bioconductor package management system. The Rqc package is technology-independent and cross-platform, providing the user with the same experience on the three major operating systems (MS Windows, Mac OS X and Linux).