Deep transcriptome sequencing (RNA-seq) provides massive and valuable information about functional elements in the genome. Ideally, transcriptome sequencing should be able to directly identify and quantify all RNA species, small or large, low or high abundance. However, RNA-seq is a complicated, multistep process involving sample preparation, amplification, fragmentation, purification and sequencing. A single improper operation would result in biased or even unusable data. Therefore, it is always a good practice to check the quality of your RNA-seq data before analyses.
The sample is sequenced as non-stranded RNA-seq
#================================================== #All numbers are READ count #================================================== Total records: 153595188 QC failed: 0 Optical/PCR duplicate: 0 Non primary hits 41941600 Unmapped reads: 0 mapq < mapq_cut (non-unique): 27964578 mapq >= mapq_cut (unique): 83689010 Read-1: 41844505 Read-2: 41844505 Reads map to '+': 41844505 Reads map to '-': 41844505 Non-splice reads: 71676706 Splice reads: 12012304 Reads mapped in proper pairs: 83689010 Proper-paired reads map to different chrom:0
Provided a BAM/SAM file and reference gene model, this module will calculate how mapped reads were distributed over genome feature (like CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions). When genome features are overlapped (e.g. a region could be annotated as both exon and intron by two different transcripts) , they are prioritize as: CDS exons > UTR exons > Introns > Intergenic regions, for example, if a read was mapped to both CDS exon and intron, it will be assigned to CDS exons.
RSeQC cannot assign those reads that:
processing /home/statonse/db/Ha412v1r1_genes_v1.0.bed ... Done Total Reads 111653588 Total Tags 126498278 Total Assigned Tags 95045641 ===================================================================== Group Total_bases Tag_count Tags/Kb CDS_Exons 50673964 76929639 1518.13 5'UTR_Exons 0 0 0.00 3'UTR_Exons 0 0 0.00 Introns 83648462 2871536 34.33 TSS_up_1kb 42018537 997779 23.75 TSS_up_5kb 195923394 3444855 17.58 TSS_up_10kb 370551414 5624355 15.18 TES_down_1kb 40257870 3827196 95.07 TES_down_5kb 183624376 7031468 38.29 TES_down_10kb 353055864 9620111 27.25 =====================================================================
Two strategies were used to determine reads duplication rate: * Sequence based: reads with exactly the same sequence content are regarded as duplicated reads. * Mapping based: reads mapped to the same genomic location are regarded as duplicated reads. For splice reads, reads mapped to the same starting position and splice the same way are regarded as duplicated reads.
This module is used to check the nucleotide composition bias. In ideal condition (genome is random and RNA-seq reads is randomly sampled from genome), we expect A%=C%=G%=T%=25% at each position of reads. It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently.This is entirely normal and expected. This bias could be easily examined by NVC (Nucleotide versus cycle) plot. NVC plot is generated by overlaying all reads together, then calculating nucleotide composition for each position of read (or each sequencing cycle).
NOTE: this program expect a fixed read length
According to SAM specification, if Q is the character to represent “base calling quality” in SAM file, then Phred Quality Score = ord(Q) - 33. Here ord() is python function that returns an integer representing the Unicode code point of the character when the argument is a unicode object, for example, ord(‘a’) returns 97. Phred quality score is widely used to measure “reliability” of base-calling, for example, phred quality score of 20 means there is 1/100 chance that the base-calling is wrong, phred quality score of 30 means there is 1/1000 chance that the base-calling is wrong. In general: Phred quality score = -10xlog(10)P, here P is probability that base-calling is wrong.
Heatmap: use different color to represent nucleotide density (“blue”=low density,”orange”=median density,”red”=high density”)
Read coverage over gene body. This module is used to check if reads coverage is uniform and if there is any 5’/3’ bias. This module scales all transcripts to 100 nt and calculates the number of reads covering each nucleotide position. Finally, it generates a plot illustrating the coverage profile along the gene body.
This module is used to calculate the inner distance (or insert size) between two paired RNA reads. The distance is the mRNA length between two paired fragments. We first determine the genomic (DNA) size between two paired reads: D_size = read2_start - read1_end, then
NOTE: Not all read pairs were used to estimate the inner distance distribution. Those low quality, PCR duplication, multiple mapped reads were skipped.
This program will compare detected splice junctions to reference gene model. splicing annotation is performed in two levels: splice event level and splice junction level.
All detected junctions can be grouped to 3 exclusive categories:
It’s very important to check if current sequencing depth is deep enough to perform alternative splicing analyses. For a well annotated organism, the number of expressed genes in particular tissue is almost fixed so the number of splice junctions is also fixed. The fixed splice junctions can be predetermined from reference gene model. All (annotated) splice junctions should be rediscovered from a saturated RNA-seq data, otherwise, downstream alternative splicing analysis is problematic because low abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments from BAM or SAM file, and then detects splice junctions from each subset and compares them to reference gene model.
Wang L, Wang S, Li W* RSeQC: quality control of RNA-seq experiments Bioinformatics (2012) 28 (16): 2184-2185. doi: 10.1093/bioinformatics/bts356 pubmed