This article provides a complete resource for researchers and bioinformaticians seeking to enhance the accuracy of their RNA-seq analyses, particularly for detecting novel splicing events and complex transcriptomes. We detail the STAR two-pass mapping methodology, a powerful technique that significantly improves splice junction discovery and quantification by using splice junction information from an initial mapping pass to guide a more sensitive second alignment. The content covers foundational concepts, step-by-step protocols for both single and multi-sample scenarios, critical troubleshooting and optimization strategies, and a comparative validation of the method's performance against standard approaches. By implementing this workflow, scientists in drug development and clinical research can achieve more reliable detection of biologically meaningful splicing alterations, such as those relevant in cancer, ultimately strengthening findings from precision oncology studies.
The accurate reconstruction of the transcriptome through RNA sequencing (RNA-seq) fundamentally depends on the precise alignment of sequencing reads to a reference genome. A central challenge in this process is the identification of splice junctions, the points where exons connect following the removal of introns. While standard alignment pipelines perform robustly for annotated junctions, their ability to detect novel splice junctions, including those with non-canonical signals, is significantly limited. These limitations stem from inherent algorithmic constraints and a reliance on pre-existing gene annotations, which can confound downstream analyses such as isoform discovery, differential splicing analysis, and the identification of novel biomarkers [1] [2]. False positive splice junctions can introduce erroneous edges in splice graphs, dramatically increasing their complexity and compromising the accuracy of transcript assembly algorithms [1]. This application note delineates the specific limitations of standard RNA-seq alignment and provides detailed protocols for employing advanced methods, such as the STAR two-pass mapping method, to overcome these challenges within the context of a broader research thesis on improving detection accuracy.
Standard RNA-seq aligners face several intrinsic hurdles that impede the sensitive discovery of novel splicing events.
Benchmarking studies reveal measurable disparities in the performance of various aligners, particularly at the base-level and junction base-level resolution. A recent assessment of five popular RNA-seq alignment tools on Arabidopsis thaliana data provides a clear comparison of their capabilities.
Table 1: Benchmarking RNA-Seq Aligner Performance on Base-Level and Junction-Level Accuracy
| Aligner | Overall Base-Level Accuracy | Junction Base-Level Accuracy | Key Strengths |
|---|---|---|---|
| STAR | >90% [3] | Not reported | Superior base-level accuracy, efficient seed-based algorithm [3] [4] |
| SubRead | Not reported | >80% [3] | Highest junction base-level accuracy, robust for plant data [3] |
| HISAT2 | Not reported | Not reported | Efficient local indexing, successor to TopHat2 [3] |
| BBMap | Not reported | Not reported | Splice-aware, aligns to significantly mutated genomes [3] |
As illustrated in Table 1, an aligner that excels in overall base-level accuracy (e.g., STAR) may not be the top performer in the specific task of junction base-level alignment, where SubRead emerged as the most accurate in the tested context [3]. This underscores the importance of selecting an alignment tool based on the primary research objective, whether that is overall expression quantification or the discovery of novel splice variants.
The STAR (Spliced Transcripts Alignment to a Reference) software package employs a novel algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays, followed by seed clustering and stitching, to enable ultra-fast and accurate spliced alignment [4]. Its two-pass mapping mode is specifically designed to enhance the detection of novel splice junctions.
The fundamental principle of two-pass mapping is to use information gleaned from an initial alignment pass to inform and refine a second alignment pass. In the first pass, RNA-seq reads are aligned to the reference genome, and novel splice junctions are discovered de novo. In the second pass, these newly discovered junctions are incorporated into the genome index, providing a more complete reference for the final alignment. This process significantly increases the sensitivity of mapping reads that span novel junctions [5] [6].
Necessary Resources
Input Files
Protocol Steps
Generate the Genome Indices (First Time Setup)
Note: The --sjdbOverhang parameter should be set to the read length minus 1. This specifies the length of the genomic sequence around the annotated junction to be used for constructing the splice junction database [6].
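As a minimal sketch of the note above, the following Python snippet derives the --sjdbOverhang value from the read length and assembles a genome-generation command as an argument list. The file paths, thread count, and helper name are placeholders chosen for illustration; the STAR options used (--runMode genomeGenerate, --genomeDir, --genomeFastaFiles, --sjdbGTFfile, --sjdbOverhang, --runThreadN) are standard STAR parameters. Nothing is executed.

```python
import shlex


def genome_generate_cmd(genome_dir, fasta, gtf, read_length, threads=8):
    """Build a STAR genome-indexing command as an argument list.

    All paths are placeholders; the command is only assembled, not run.
    """
    sjdb_overhang = read_length - 1  # read length minus 1, per the note above
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(sjdb_overhang),
        "--runThreadN", str(threads),
    ]


# Hypothetical inputs: 101-nt reads, placeholder file names
cmd = genome_generate_cmd("star_index", "genome.fa", "annotation.gtf", read_length=101)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.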
First Pass Mapping Align your RNA-seq reads to the initial genome index. The goal here is to generate a list of novel junctions for each sample.
The critical output from this step is the pass1_SJ.out.tab file, which contains all detected splice junctions.
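To make the content of this file concrete, here is a small sketch that parses SJ.out.tab rows, assuming the usual nine tab-separated columns (chromosome, intron start, intron end, strand, intron motif, annotation flag, uniquely mapping reads, multi-mapping reads, maximum overhang). The example rows are invented for illustration, and the `Junction` type and `parse_sj_line` helper are hypothetical names.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int         # first base of the intron (1-based)
    end: int           # last base of the intron (1-based)
    strand: int        # 0 = undefined, 1 = +, 2 = -
    motif: int         # 0 = non-canonical, 1-6 = canonical motif codes
    annotated: int     # 0 = unannotated (novel), 1 = annotated
    unique_reads: int  # uniquely mapping reads crossing the junction
    multi_reads: int   # multi-mapping reads crossing the junction
    max_overhang: int  # maximum spliced alignment overhang


def parse_sj_line(line):
    """Convert one tab-separated SJ.out.tab row into a Junction record."""
    fields = line.rstrip("\n").split("\t")
    return Junction(fields[0], *(int(x) for x in fields[1:9]))


# Invented example rows, for illustration only
example_rows = [
    "chr1\t14830\t14969\t2\t2\t1\t120\t15\t38",
    "chr1\t17056\t17232\t2\t2\t0\t4\t0\t21",
    "chr2\t5000\t7500\t0\t0\t0\t1\t0\t12",
]
junctions = [parse_sj_line(row) for row in example_rows]
novel = [j for j in junctions if j.annotated == 0]
print(len(novel), "unannotated junctions")
```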
Second Pass Mapping
For the most comprehensive novel junction discovery, combine the SJ.out.tab files from all samples in your experiment. You can then run the second pass for each individual sample, incorporating the aggregated junction information.
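One way to sketch the multi-sample second pass is to hand every sample's first-pass junction file to STAR through --sjdbFileChrStartEnd, which accepts several files and inserts their junctions before mapping. The sample names, FASTQ names, and output prefixes below are hypothetical, and the helper only assembles commands rather than running them.

```python
def second_pass_cmd(genome_dir, fastqs, sj_files, out_prefix, threads=8):
    """Build a second-pass STAR command (argument list); nothing is run."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastqs,
        # junctions from ALL samples are supplied to every second-pass run
        "--sjdbFileChrStartEnd", *sj_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


samples = ["sampleA", "sampleB", "sampleC"]  # hypothetical sample names
all_sj = [f"{s}_pass1_SJ.out.tab" for s in samples]

commands = {
    s: second_pass_cmd(
        "star_index",
        [f"{s}_R1.fastq", f"{s}_R2.fastq"],
        all_sj,
        f"{s}_pass2_",
    )
    for s in samples
}
print(" ".join(commands["sampleA"]))
```

Each sample is still aligned separately; only the junction evidence is shared across the experiment.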
Alternative Simplified Workflow: If you are processing a single sample or wish to run the two-pass process on a per-sample basis, you can use the --twopassMode option directly, which performs both passes automatically in a single command [5]:
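A minimal sketch of such a single-command run, assembled in Python, is given here. It assumes the `Basic` value of --twopassMode and uses placeholder paths and a hypothetical helper name; the command is built but not executed.

```python
import shlex


def twopass_basic_cmd(genome_dir, fastqs, out_prefix, threads=8):
    """Build a per-sample STAR command that runs both passes in one call."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastqs,
        "--twopassMode", "Basic",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


# Placeholder index directory and FASTQ names
single_cmd = twopass_basic_cmd(
    "star_index", ["sample_R1.fastq", "sample_R2.fastq"], "sample_"
)
print(shlex.join(single_cmd))
```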
The final alignments from the second pass, typically in a BAM file (Aligned.out.bam or Aligned.sortedByCoord.out.bam), will contain significantly more reads accurately mapped across novel junctions, providing a superior foundation for downstream analysis.
The following diagram illustrates the logical flow and key components of the STAR two-pass mapping protocol.
Successful implementation of a robust RNA-seq alignment pipeline requires both specific computational tools and careful consideration of experimental design. The following table details key resources.
Table 2: Key Research Reagent Solutions for RNA-Seq Junction Detection
| Item Name | Function/Application | Specifications/Notes |
|---|---|---|
| STAR Aligner | Ultra-fast spliced alignment of RNA-seq reads to a reference genome. | Uses sequential maximum mappable prefix (MMP) search and clustering/stitching algorithm. Requires significant RAM (~30 GB for human) [4] [6]. |
| Reference Genome | The genomic sequence to which RNA-seq reads are aligned. | Critical for accuracy. Use the most recent and comprehensive assembly available for your species. |
| Gene Annotation (GTF) | Provides coordinates of known genes, transcripts, and exon boundaries. | Used by STAR during genome indexing to improve initial junction mapping accuracy (e.g., from Ensembl or GENCODE) [6]. |
| High-Quality RNA-seq Library | The starting material for sequencing. | Use protocols that minimize technical variation. Random hexamers and library prep kits designed for full-transcript coverage are recommended [7]. |
| Post-Alignment Classifier (e.g., DeepSplice) | A deep learning tool to filter false positive splice junctions. | Uses convolutional neural networks to classify candidate junctions based on sequence features, independent of read support [1]. |
The limitations of standard RNA-seq alignment in detecting novel splice junctions present a critical challenge that can compromise the integrity of transcriptomic studies. However, by understanding these constraints (including annotation dependence, short anchor biases, and performance gaps between aligners), researchers can make informed decisions. The implementation of the STAR two-pass mapping method, as detailed in this application note, provides a powerful and validated strategy to overcome these limitations. By systematically incorporating novel junction discoveries from a first pass into the reference for a second pass, this protocol significantly enhances mapping sensitivity and accuracy. When combined with careful experimental design and potential supplementary tools like DeepSplice for junction filtering, this approach ensures that the analysis of RNA-seq data yields a more complete and reliable picture of the transcriptome, ultimately strengthening subsequent biological conclusions.
Two-pass mapping is an advanced bioinformatic framework designed to enhance the detection and quantification of splice junctions in RNA sequencing (RNA-seq) data. This method directly addresses a fundamental limitation of conventional single-pass alignment: the inherent bias towards known, annotated splice junctions, which reduces sensitivity for novel splice junction discovery [8]. The core innovation of two-pass mapping lies in its separation of the alignment process into two distinct phases: a discovery phase and a quantification phase. This separation allows for a more sensitive and comprehensive transcriptomic analysis, as it enables the alignment algorithm to utilize evidence from the entire dataset to inform the final read alignments, rather than relying solely on pre-existing annotations [8] [9].
The rationale for this approach is elegantly simple yet powerful. In the first pass, splice junctions are discovered with high stringency, generating a sample-specific set of potential splicing events. In the second pass, these newly discovered junctions are used as an augmented reference, allowing the aligner to quantify them with the same sensitivity typically reserved for annotated junctions [8]. This process has been shown to significantly improve the accuracy of intron detection and is particularly valuable for quantifying novel splicing events, which are crucial for understanding complex biological processes like development and disease [9].
Traditional single-pass RNA-seq alignment methods typically rely on existing gene annotations to guide the mapping of reads across splice junctions. While this approach reduces background noise, it introduces a significant quantification bias. The alignment software inherently requires more evidence to align a read to a novel, unannotated splice junction than to a known one [8]. This preferential treatment stems from the more stringent alignment scores often applied to novel junctions, implicitly demanding greater evidence for reads spliced over novel junctions compared with known splice junctions [8]. Consequently, this bias can lead to under-representation of novel biological splicing events in the final quantification, potentially masking important transcriptomic variations.
The two-pass methodology fundamentally restructures this process by decoupling junction discovery from quantification:
This approach effectively creates a sample-specific annotation reference, tailored to the actual splicing events present in the dataset, thereby overcoming the annotation-dependent bias of single-pass methods.
Empirical studies across diverse RNA-seq datasets have consistently demonstrated the substantial benefits of two-pass alignment. The performance gains are evident across multiple metrics, particularly for novel splice junction quantification.
Table 1: Performance Improvements with Two-Pass Alignment Across Various RNA-seq Datasets
| Sample Type | Read Length | Splice Junctions Improved | Median Read Depth Ratio (2-pass vs 1-pass) |
|---|---|---|---|
| Lung Adenocarcinoma Tissue | 48 nt | 99% | 1.68× |
| Lung Normal Tissue | 48 nt | 98% | 1.71× |
| Reference RNA (UHRR) | 75 nt | 94-97% | 1.25-1.26× |
| Lung Cancer Cell Lines | 101 nt | 97% | 1.19-1.21× |
| Arabidopsis Samples | 101 nt | 95-97% | 1.12× |
The data reveal that two-pass alignment improves quantification for the vast majority (94-99%) of splice junctions across various biological contexts, from human cancer samples to plant specimens [8]. The median read depth over these splice junctions increased by as much as 1.7-fold, indicating substantially improved detection sensitivity [8]. This enhanced coverage is particularly valuable for detecting lower-abundance transcripts and for applications requiring high quantification accuracy, such as differential splicing analysis and isoform-level expression studies.
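To illustrate how the two summary columns in the table can be read, the sketch below computes a "fraction improved" and a "median read depth ratio" from per-junction read depths. The junction names and depths are invented, and taking the median of per-junction ratios is an assumption made here for illustration; the cited study's exact definition may differ.

```python
from statistics import median

# Invented per-junction read depths for one sample
one_pass_depth = {"jA": 10, "jB": 4, "jC": 20, "jD": 8}
two_pass_depth = {"jA": 15, "jB": 8, "jC": 20, "jD": 12}

# Per-junction depth ratio (two-pass / one-pass)
ratios = [two_pass_depth[j] / one_pass_depth[j] for j in one_pass_depth]

# Fraction of junctions with more supporting reads after the second pass
improved_fraction = sum(r > 1 for r in ratios) / len(ratios)

# Median of the per-junction ratios
median_ratio = median(ratios)

print(f"improved: {improved_fraction:.0%}, median ratio: {median_ratio:.2f}x")
```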
The improved sensitivity of two-pass mapping directly translates to enhanced performance in downstream analytical applications, particularly differential splicing analysis. Recent methodologies like the Differential Exon-Junction Usage (DEJU) workflow leverage two-pass alignment to achieve superior detection of alternative splicing events.
Table 2: Statistical Power and False Discovery Rate (FDR) in Differential Splicing Detection
| Splicing Pattern | Sample Size (n) | DEJU-edgeR Power | DEJU-edgeR FDR | DEJU-limma Power | DEJU-limma FDR |
|---|---|---|---|---|---|
| Exon Skipping (ES) | 3 | 0.977 | 0.022 | 0.975 | 0.043 |
| Mutually Exclusive Exon (MXE) | 5 | 0.993 | 0.040 | 0.995 | 0.062 |
| Alternative Splice Site (ASS) | 10 | 0.977 | 0.038 | 0.979 | 0.047 |
| Intron Retention (IR) | 10 | 0.964 | 0.042 | 0.968 | 0.050 |
The DEJU workflow, which incorporates two-pass alignment, demonstrates high statistical power (0.96 or above for every splicing pattern in Table 2) for detecting various splicing events, while keeping the false discovery rate at or near the nominal 0.05 level (0.022 to 0.062 in Table 2) [10]. The power increases with larger sample sizes, particularly for more challenging events like alternative splice site usage and intron retention [10]. This performance represents a significant advancement over traditional differential exon usage approaches that do not incorporate junction information.
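For readers less familiar with these two metrics, the following sketch shows the standard definitions applied to a simulation with known ground truth: power is the share of truly differential genes that are detected, and FDR is the share of detections that are false. The counts are invented so that the result resembles one table row; they are not taken from the cited study.

```python
def power_and_fdr(true_positives, false_positives, n_truly_differential):
    """Return (power, FDR) from simulation counts with known ground truth."""
    detections = true_positives + false_positives
    power = true_positives / n_truly_differential
    fdr = false_positives / detections if detections else 0.0
    return power, fdr


# Invented counts: 1000 truly differential genes, 977 found, 22 false calls
power, fdr = power_and_fdr(true_positives=977, false_positives=22,
                           n_truly_differential=1000)
print(f"power = {power:.3f}, FDR = {fdr:.3f}")
```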
The following detailed protocol outlines the implementation of two-pass alignment using the STAR aligner, which has been extensively validated for this purpose:
First Pass - Junction Discovery:
This step identifies high-confidence splice junctions, including novel events [8].
Collect the junctions detected in the first pass from all samples (SJ.out.tab files). Filter to include junctions supported by sufficient evidence (e.g., at least 3 uniquely mapping reads across all samples) [10].
Second Pass - Enhanced Quantification:
This creates a sample-enhanced reference that includes both annotated and discovered junctions [10] [8].
This results in BAM alignment files with optimized mapping across all detected splicing events [10].
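The junction collection and filtering step of this protocol can be sketched as follows. The function sums uniquely mapping reads (column 7 of SJ.out.tab) per junction across samples, keeps junctions reaching a total of 3, and emits four-column rows (chromosome, start, end, strand) of the kind accepted by --sjdbFileChrStartEnd. The input rows are invented and the helper name is hypothetical.

```python
from collections import defaultdict

# SJ.out.tab encodes strand as 0/1/2; the junction list uses ./+/-
STRAND_SYMBOL = {"0": ".", "1": "+", "2": "-"}


def merge_and_filter(sj_tables, min_unique_total=3):
    """Sum unique-read support per junction across samples and filter.

    sj_tables: one iterable of SJ.out.tab lines per sample.
    Returns tab-separated 'chrom, start, end, strand' rows, sorted.
    """
    totals = defaultdict(int)
    for lines in sj_tables:
        for line in lines:
            f = line.rstrip("\n").split("\t")
            key = (f[0], int(f[1]), int(f[2]), STRAND_SYMBOL[f[3]])
            totals[key] += int(f[6])  # column 7: uniquely mapping reads
    kept = sorted(k for k, n in totals.items() if n >= min_unique_total)
    return ["\t".join(map(str, k)) for k in kept]


# Invented first-pass output for two samples
sample_a = ["chr1\t100\t200\t1\t1\t0\t2\t0\t20",
            "chr1\t300\t400\t2\t2\t0\t1\t0\t15"]
sample_b = ["chr1\t100\t200\t1\t1\t0\t1\t0\t18",
            "chr2\t50\t90\t1\t1\t1\t5\t0\t30"]

merged = merge_and_filter([sample_a, sample_b])
print("\n".join(merged))
```

In this toy input, the chr1:100-200 junction passes only because support from both samples is pooled, which is the point of filtering across all samples rather than per sample.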
Following two-pass alignment, the DEJU workflow provides a robust framework for differential splicing analysis:
Quantify reads with featureCounts (Rsubread), using the nonSplitOnly=TRUE and juncCounts=TRUE arguments to obtain both internal exon and exon-exon junction count matrices simultaneously [10].
Test for differential splicing with diffSpliceDGE in edgeR or diffSplice in limma, applying the Simes method or F-test to summarize feature-level results at the gene level [10].
Table 3: Key Research Reagent Solutions for Two-Pass Mapping
| Resource | Type | Function | Implementation Notes |
|---|---|---|---|
| STAR Aligner | Software | Spliced alignment of RNA-seq reads | Ultrafast universal RNA-seq aligner using sequential maximum mappable seed search [4] |
| RSubread/featureCounts | Software | Read quantification at exon and junction levels | Used with nonSplitOnly=TRUE and juncCounts=TRUE for DEJU analysis [10] |
| edgeR | Software | Differential expression and splicing analysis | Used with diffSpliceDGE function for statistical testing [10] |
| limma | Software | Linear models for microarray and RNA-seq data | Used with diffSplice function for differential splicing [10] |
| GENCODE Annotation | Reference | High-quality gene annotation | Provides baseline splice junction database for initial alignment [8] |
| 2passtools | Software | Machine-learning-filtered splice junctions | Extends two-pass concept to long-read RNA sequencing [9] |
Successful implementation of two-pass mapping requires careful parameter selection:
Use --outFilterType BySJout in STAR to ensure alignment output consistency with reported splice junction results [8].
Require novel junctions to be spanned by at least 8 nucleotides (alignSJoverhangMin 8) while allowing known junctions to be spanned by as few as 3 nucleotides (alignSJDBoverhangMin 3) [8].
While initially developed for short-read RNA-seq, the two-pass principle has been successfully adapted to emerging technologies:
The separation of discovery and quantification in two-pass mapping represents a fundamental advancement in RNA-seq analysis, providing researchers with enhanced sensitivity for detecting novel splicing events and more accurate quantification of transcriptomic diversity. This approach has become particularly valuable in biomedical research, where comprehensive detection of splicing variations can reveal novel disease mechanisms and potential therapeutic targets.
Within the context of a broader thesis on RNA-seq bioinformatics, this application note addresses a critical methodological challenge: the accurate detection and quantification of novel splice junctions from RNA sequencing data. Splicing accuracy is paramount for downstream analyses in both basic research and drug development, from identifying disease-associated splicing variants to characterizing therapeutic targets. The STAR two-pass mapping method has emerged as a powerful solution to enhance splice junction detection, particularly for unannotated splicing events. This protocol details the experimental and computational procedures to implement this method and provides quantitative validation of its performance, specifically documenting the 1.7-fold improvement in median read depth over novel splice junctions that can be achieved through this approach [12].
The core advantage of the two-pass alignment method is demonstrated through rigorous computational benchmarking. The following table summarizes the key quantitative findings from a foundational study that evaluated the performance of two-pass alignment across diverse transcriptome sequencing datasets [12].
Table 1: Quantitative Improvements from Two-Pass Alignment
| Metric | Performance Improvement | Scope of Improvement |
|---|---|---|
| Novel Junction Quantification | Improvement for at least 94% of simulated novel junctions | Observed per sample across a variety of transcriptome sequencing datasets |
| Median Read Depth | Up to 1.7-fold deeper median read depth over novel splice junctions | Compared to single-pass alignment methods |
This evidence confirms that the two-pass method is not merely a procedural change but yields a substantial, measurable enhancement in the data quality available for subsequent splicing analysis.
This section provides a step-by-step workflow for implementing the STAR two-pass alignment method for a multi-sample RNA-seq study.
The objective of the first pass is to generate a comprehensive, sample-specific catalog of splice junctions.
Genome Generation: Generate a STAR genome index using the reference genome and, if available, a gene annotation file (GTF/GFF). Using annotations at this stage is recommended.
First-Pass Mapping: Run STAR in the first-pass mode for each sample individually. For samples with multiple sequencing lanes, combine the FASTQ files for the sample.
--outSAMtype None instructs STAR not to write aligned reads to a SAM/BAM file, saving disk space because the first-pass alignments are only temporary.
The key output is the SJ.out.tab file, which contains all detected splice junctions for the sample.
To enhance the efficiency and accuracy of the second pass, the junctions from all samples are consolidated and filtered.
Concatenate Junctions: Collect the SJ.out.tab files from the first pass of all samples.
Filter Junctions: Apply filters to remove low-confidence junctions. A recommended filter includes [13]:
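Junctions whose read support (column 7 of SJ.out.tab) falls below a threshold are removed (e.g., keep only junctions with >= 5 uniquely mapping reads).
Junctions on the mitochondrial chromosome (chrM) are often excluded.
The filtered junction file is used to create an enhanced genome index for the final alignment.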
Re-generate Genome Index: Re-run the STAR genome generation command, this time including the filtered, consolidated junction file from the previous step.
Final Alignment: Map the original sample reads to the new, junction-informed genome index. This produces the final BAM alignment files for downstream analysis.
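The filtering, index re-generation, and final alignment steps can be sketched together as below. The threshold of 5 uniquely mapping reads and the chrM exclusion follow the examples given in this protocol; the paths, helper names, and example rows are placeholders, and the commands are assembled without being run.

```python
def keep_junction(line, min_unique=5, excluded_chroms=("chrM",)):
    """True if an SJ.out.tab row passes the support and chromosome filters."""
    f = line.rstrip("\n").split("\t")
    return f[0] not in excluded_chroms and int(f[6]) >= min_unique


def regenerate_index_cmd(new_genome_dir, fasta, gtf, filtered_sj,
                         read_length, threads=8):
    """Genome generation that adds the filtered junctions to the index."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", new_genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbFileChrStartEnd", filtered_sj,
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", str(threads),
    ]


def final_alignment_cmd(new_genome_dir, fastqs, out_prefix, threads=8):
    """Second-pass mapping against the junction-informed index."""
    return [
        "STAR", "--genomeDir", new_genome_dir,
        "--readFilesIn", *fastqs,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


# Invented rows: well supported, weakly supported, and mitochondrial
rows = ["chr1\t100\t200\t1\t1\t0\t7\t0\t30",
        "chr1\t300\t400\t1\t1\t0\t2\t0\t30",
        "chrM\t10\t90\t1\t1\t0\t50\t0\t30"]
kept_rows = [r for r in rows if keep_junction(r)]

index_cmd = regenerate_index_cmd("star_index_pass2", "genome.fa",
                                 "annotation.gtf", "filtered_SJ.tab", 101)
align_cmd = final_alignment_cmd("star_index_pass2",
                                ["s1_R1.fastq", "s1_R2.fastq"], "s1_final_")
print(len(kept_rows), "junction(s) kept")
```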
The following diagram illustrates the logical flow and data products of the two-pass mapping protocol detailed above.
The successful implementation of this protocol relies on a suite of software tools and reference data. The following table details these essential components and their functions.
Table 2: Key Research Reagents and Computational Tools
| Item Name | Function / Purpose in Protocol |
|---|---|
| STAR Aligner | The core splice-aware aligner used for both the first and second passes of read mapping [14]. |
| Reference Genome (e.g., GRCh38) | The FASTA file of the subject's genome used for building the alignment index and mapping reads. |
| Gene Annotation (e.g., Gencode GTF) | Provides known transcript models to guide initial alignment and assist in classifying novel junctions [14]. |
| High-Performance Computing (HPC) Cluster | Essential for handling the significant memory and CPU requirements of genome indexing and parallel alignment of multiple samples. |
| Splice Junction File (SJ.out.tab) | The data file output by STAR that contains the coordinates and supporting read counts for all detected splice junctions. This is the key product of the first pass [14]. |
The advent of high-throughput RNA sequencing (RNA-seq) has revolutionized precision oncology by enabling a comprehensive view of the transcriptome. Accurate alignment of sequencing reads is a critical first step, and the STAR (Spliced Transcripts Alignment to a Reference) two-pass mapping method has emerged as a gold standard for its speed and improved accuracy in detecting complex genomic events. This protocol details the application of the STAR two-pass method for two critical analyses in oncology: differential splicing and clinical variant detection, framing them within a thesis on enhanced accuracy through two-pass mapping.
Differential splicing analysis identifies genes that undergo alternative splicing changes between conditions (e.g., tumor vs. normal), which can reveal novel biomarkers or therapeutic targets.
2.1. Experimental Protocol
Step 1: Two-Pass Genome Alignment with STAR
First Pass: Align samples individually to discover novel splice junctions.
Junction Compilation: Collect all splice junctions from all samples (SJ.out.tab files) into a single list.
Step 2: Quantification and Analysis
2.2. Data Presentation
Table 1: Example Output from rMATS Analysis of Tumor vs. Normal Tissue
| Gene Symbol | Splicing Event Type | P-Value | FDR | Inclusion Level (Tumor) | Inclusion Level (Normal) | ΔInclusion Level |
|---|---|---|---|---|---|---|
| MYC | Skipped Exon (SE) | 1.2E-08 | 0.001 | 0.15 | 0.85 | -0.70 |
| BCL2L1 | Alternative 5'SS (A5SS) | 5.7E-05 | 0.024 | 0.90 | 0.45 | +0.45 |
| CD44 | Mutually Exclusive Exons (MXE) | 3.4E-06 | 0.005 | 0.10 (Exon A) / 0.90 (Exon B) | 0.80 (Exon A) / 0.20 (Exon B) | N/A |
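The ΔInclusion Level column in this example table is the tumor inclusion level minus the normal inclusion level. A short check of the two numeric rows, using a hypothetical helper and the values shown in the table:

```python
def delta_inclusion(tumor_level, normal_level):
    """Difference in inclusion level, tumor minus normal, to two decimals."""
    return round(tumor_level - normal_level, 2)


myc_delta = delta_inclusion(0.15, 0.85)     # MYC, skipped exon
bcl2l1_delta = delta_inclusion(0.90, 0.45)  # BCL2L1, alternative 5' splice site
print(myc_delta, bcl2l1_delta)
```

A negative value means the exon or splice site is used less in tumor than in normal tissue under this sign convention.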
2.3. Visualization
STAR Two-Pass Splicing Analysis
RNA-seq enables the detection of expressed variants, including single nucleotide variants (SNVs) and indels, which can be driver mutations in cancer. Two-pass mapping improves the alignment in complex regions, reducing false positives.
3.1. Experimental Protocol
Step 1: Two-Pass Alignment and Processing
Step 2: Variant Calling and Filtration
3.2. Data Presentation
Table 2: Example Clinically Relevant Variants Detected from RNA-seq
| Gene | cDNA Change | Protein Change | Variant Type | dbSNP ID | Allele Frequency (Tumor) | Predicted Effect (e.g., VEP) |
|---|---|---|---|---|---|---|
| KRAS | c.35G>A | p.G12D | SNV | rs121913529 | 0.32 | Missense, Oncogenic |
| EGFR | c.2573T>G | p.L858R | SNV | rs121434568 | 0.28 | Missense, Responsive to TKI |
| BRAF | c.1799T>A | p.V600E | SNV | rs113488022 | 0.41 | Missense, Oncogenic |
3.3. Visualization
RNA-seq Clinical Variant Calling
Table 3: Essential Research Reagent Solutions for RNA-seq Analysis in Oncology
| Item | Function/Benefit |
|---|---|
| STAR Aligner | Ultra-fast RNA-seq aligner that utilizes sequential maximum mappable seed search for high accuracy, especially for splice junction discovery. |
| rMATS | Statistical software for detecting differential alternative splicing from replicate RNA-seq data. |
| GATK | Industry-standard toolkit for variant discovery in high-throughput sequencing data, with specific best practices for RNA-seq. |
| GENCODE Annotation | High-quality, comprehensive human gene annotation, providing the reference transcriptome for alignment and quantification. |
| VEP (Variant Effect Predictor) | Tool that determines the functional consequences (e.g., missense, frameshift) of genomic variants. |
| dbSNP Database | Public repository of human single nucleotide variations and indels, used for annotating and filtering common polymorphisms. |
| TruSeq RNA Library Prep Kit | A widely used kit for preparing stranded RNA-seq libraries, ensuring high-quality input for sequencing. |
Within the broader investigation of STAR two-pass mapping for enhanced accuracy in transcriptome analysis, a critical operational decision faces researchers: selecting the optimal two-pass implementation. The choice between Genome Re-generation and Direct Two-Pass Mode fundamentally shapes the balance between computational burden, project scalability, and analytical sensitivity. This Application Note delineates these two methodologies, providing structured quantitative data, detailed protocols, and strategic guidance to empower researchers and drug development professionals in selecting the most efficacious workflow for their experimental objectives.
RNA sequencing data analysis is fundamentally complicated by the gapped nature of RNA transcripts, where exons are separated by introns of varying lengths. Spliced alignment tools must therefore identify non-contiguous genomic segments from which reads originate. Conventional single-pass alignment methods inherently favor known, annotated splice junctions, creating a quantification bias against novel splicing events [8].
Two-pass alignment addresses this limitation by separating the processes of splice junction discovery and read quantification. The rationale is elegant: an initial alignment pass performed with high stringency identifies splice junctions present in the sample. These newly discovered junctions are then provided as an augmented annotation for a second alignment pass, which can proceed with lower stringency parameters, thereby achieving higher sensitivity to novel or rare splicing variants [8]. This approach has been demonstrated to improve the quantification of at least 94% of simulated novel splice junctions, providing as much as a 1.7-fold increase in median read depth over those junctions compared to single-pass methods [8].
The core methodological decision in implementing two-pass alignment lies in how the splice junctions discovered in the first pass are integrated into the second pass. The two primary strategies, Genome Re-generation and Direct Two-Pass Mode, differ significantly in their computational handling of this information.
This method involves a distinct genome index generation step between the two alignment passes.
Workflow:
First-pass alignment of each sample yields its detected splice junctions (SJ.out.tab file).
Advantages: Considered the "gold standard" for sensitivity, as the merged junction list provides the most complete set of splicing information for the entire project [8]. This is particularly powerful for heterogeneous datasets, such as cancer transcriptomes, where comprehensive junction discovery is paramount [15].
This streamlined approach performs both passes and the required indexing in a single execution for each sample.
Workflow:
Advantages: Offers a significant gain in operational simplicity and convenience, as it requires no manual file management between steps. It is suitable for studies containing single samples or incompatible samples where a unified project-wide index is not necessary [16].
The logical flow of these two strategies is illustrated below.
Empirical evaluations consistently demonstrate that two-pass alignment significantly enhances the sensitivity of spliced alignment. The following table summarizes key performance metrics from a study profiling two-pass alignment across diverse RNA-seq datasets, including human cancer samples and Arabidopsis thaliana.
Table 1: Performance Gains of Two-Pass Alignment Across Various RNA-seq Datasets [8]
| Sample | Description | Read Pairs (millions) | Splice Junctions Improved | Median Read Depth Ratio |
|---|---|---|---|---|
| TCGA-50-5933_T | Lung Adenocarcinoma Tissue | 48 | 99% | 1.68× |
| TCGA-50-5933_N | Lung Normal Tissue | 52 | 98% | 1.71× |
| UHRR_rep1 | Universal Human Reference RNA | 83 | 94% | 1.25× |
| UHRR_rep2 | Universal Human Reference RNA | 85 | 97% | 1.26× |
| LCS22T | Lung Adenocarcinoma Tissue | 52 | 98% | 1.20× |
| A549 | Lung Cancer Cell Line | 92 | 97% | 1.21× |
| AT_flowerbuds | Arabidopsis Flower Buds | 192 | 97% | 1.12× |
The performance benefit primarily arises from the method's ability to permit alignment of sequence reads with shorter overhangs at splice junctions. The first pass identifies junctions from reads with long, unambiguous anchors, allowing the second pass to confidently align reads that have as few as 3-5 nucleotides spanning the same junction, which would otherwise be discarded in a single-pass approach [8] [17]. While this increased sensitivity can potentially introduce more false positive junctions, these are often readily identifiable by simple classification and filtering based on alignment metrics [8] [9].
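The overhang argument can be illustrated with a deliberately simplified model. The function below treats acceptance of a spliced read as a single threshold check on its overhang, with a stricter minimum for junctions absent from the junction database and a looser one for junctions already in it. The default values 8 and 3 are illustrative settings for STAR's alignSJoverhangMin and alignSJDBoverhangMin parameters; real STAR alignment scoring involves far more than this one test.

```python
def spliced_read_accepted(overhang, junction_in_database,
                          novel_min=8, database_min=3):
    """Toy model: is a read's junction overhang long enough to be aligned?

    novel_min mirrors alignSJoverhangMin (junctions not in the database);
    database_min mirrors alignSJDBoverhangMin (junctions in the database).
    """
    threshold = database_min if junction_in_database else novel_min
    return overhang >= threshold


# A read with only 4 nt across a junction:
first_pass = spliced_read_accepted(4, junction_in_database=False)
# After the junction was found in pass 1 and added to the database:
second_pass = spliced_read_accepted(4, junction_in_database=True)
print(first_pass, second_pass)
```

The same read is rejected when the junction is unknown and accepted once the junction has been added, which is how the second pass recovers additional junction-spanning reads.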
This protocol is recommended for projects where the highest possible splice junction sensitivity is required and computational resources are available.
Step 1: First Pass Alignment (Per Sample)
The SJ.out.tab file from each sample contains the discovered splice junctions.
Step 2: Merge Junctions from All Samples
Merge the SJ.out.tab files from all samples in the project. Filtering can be applied at this stage (e.g., based on read counts supporting the junction) to remove likely artifacts.
Step 3: Re-generate Genome Index
Step 4: Second Pass Alignment (All Samples)
This protocol is optimal for rapid analysis of individual samples or when computational simplicity is a priority.
STAR Command:
Process: STAR automatically executes the two-pass procedure internally. The final alignments are written to the output file specified by --outSAMtype (default is SAM). For downstream variant calling, such as with the VarRNA pipeline, it is recommended to output coordinate-sorted BAM files [15].
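As a sketch of how this mode might be applied across a set of independent samples, the snippet below builds one command per sample with coordinate-sorted BAM output. The sample names, gzipped FASTQ naming scheme, index directory, and thread count are assumptions for illustration; --readFilesCommand zcat is included only because the hypothetical inputs are gzip-compressed.

```python
def direct_two_pass_cmd(genome_dir, fastq_r1, fastq_r2, out_prefix, threads=12):
    """Per-sample STAR command using the built-in two-pass mode."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq_r1, fastq_r2,
        "--readFilesCommand", "zcat",  # inputs assumed to be .fastq.gz
        "--twopassMode", "Basic",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


sample_names = ["tumor_01", "normal_01"]  # hypothetical samples
per_sample = [
    direct_two_pass_cmd("star_index",
                        f"{name}_R1.fastq.gz", f"{name}_R2.fastq.gz",
                        f"{name}_")
    for name in sample_names
]
for cmd in per_sample:
    print(" ".join(cmd))
```

With these settings the final alignments would be written to a file named with the given prefix followed by Aligned.sortedByCoord.out.bam.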
Successful implementation of the two-pass alignment workflows requires careful preparation of key computational reagents. The following table details these essential components.
Table 2: Key Research Reagents and Computational Resources
| Item / Resource | Function / Role in the Workflow | Specifications & Notes |
|---|---|---|
| Reference Genome | The DNA sequence against which RNA-seq reads are aligned. | FASTA format. Must be the same version used for generating the genome index and annotations. Example: GRCh38 for human. |
| Gene Annotation | Provides known transcript models and splice junctions to guide the initial alignment. | GTF or GFF3 format. Source (e.g., GENCODE, Ensembl) and version must be consistent with the reference genome. |
| STAR Aligner | The software package that performs the spliced alignment of RNA-seq reads. | Open-source. Requires compilation on Unix/Linux/Mac OS systems. |
| Genome Index | A pre-processed representation of the reference genome that enables ultra-fast read mapping. | Generated by STAR's genomeGenerate run mode. For the human genome, requires ~30GB of RAM. |
| High-Performance Computing Node | Provides the necessary computational power and memory to run alignment jobs. | RAM: ≥ 32GB for human genome. Storage: >100GB free space for output. CPU: Multiple cores (e.g., 12) to use --runThreadN for parallel processing. |
The choice between Genome Re-generation and Direct Two-Pass Mode should be guided by the experimental design and analytical goals.
For Novel Isoform Discovery and Cancer Transcriptomics: The Genome Re-generation approach is unequivocally recommended. Its ability to leverage splice junction information across all samples before the final alignment maximizes sensitivity for detecting rare and sample-specific splicing events, which is critical in contexts like cancer [8] [15]. This makes it ideal for assembling novel transcripts or working with non-model organisms.
For Differential Gene Expression (DGE) Analysis: For standard gene-level (as opposed to isoform-level) DGE in annotated organisms, the convenience of Direct Two-Pass Mode (--twopassMode Basic) often makes it the preferred choice [16]. It provides a sensitivity boost over single-pass mapping without the complexity of managing a custom genome index for the entire project.
General Consideration: The fundamental trade-off is between the maximum sensitivity and project-level consistency offered by Genome Re-generation and the operational simplicity and per-sample autonomy of the Direct Two-Pass Mode.
In conclusion, both two-pass strategies represent a significant advancement over single-pass alignment, directly addressing the quantification bias against novel splicing events. By integrating these methods, researchers in both basic science and drug development can achieve a more complete and accurate picture of the transcriptome, ultimately strengthening findings related to disease mechanisms and therapeutic targets.
Within the framework of research on STAR's two-pass mapping method for enhanced accuracy, the first alignment pass serves a critical function: it is the primary discovery phase for identifying splice junctions, including novel or unannotated splicing events. The parameters configured during this initial pass directly govern the sensitivity and specificity of this junction discovery process, thereby fundamentally influencing the quality of all subsequent analyses. This protocol details the essential parameters for the first alignment pass, providing a validated methodology for researchers and drug development professionals to optimize initial junction discovery.
The first pass of STAR alignment prioritizes the detection of splice junctions from the RNA-seq data. The following parameters are crucial for balancing sensitivity with specificity.
The table below summarizes the key parameters for the first pass, their recommended settings and the rationale for their use.
Table 1: Essential parameters for the first pass of STAR alignment.
| Parameter | Recommended Setting | Impact on Junction Discovery |
|---|---|---|
| --twopassMode | Basic | Activates the standard two-pass mode, where junctions discovered in Pass 1 are used in Pass 2 [18] [19]. |
| --sjdbOverhang | ReadLength - 1 | Specifies the length of the genomic sequence around annotated junctions; critical for sensitive junction alignment [20] [18]. |
| --alignSJoverhangMin | 8 (or higher) | Minimum overhang for unannotated junctions; higher values increase specificity but may reduce novel junction sensitivity [8]. |
| --alignSJDBoverhangMin | 3 (or 1) | Minimum overhang for annotated junctions; lower values increase sensitivity for known junctions [8] [21]. |
| --outFilterMultimapNmax | 20 | Maximum number of loci a read can map to; helps control for multi-mapping reads [21] [22]. |
| --outSAMtype | BAM SortedByCoordinate | Outputs a coordinate-sorted BAM file, which is standard for downstream analysis [20] [22]. |
| --limitSjdbInsertNsj | 1000000 | Maximum number of junctions to be inserted into the genome; prevents memory overload with large datasets [19] [23]. |
- --twopassMode Basic instructs STAR to perform a first pass to discover junctions and then automatically use them as "annotated" for the more sensitive second pass [18] [19]. This method has been shown to significantly improve the quantification of novel splice junctions, providing as much as a 1.7-fold increase in median read depth over those junctions compared to single-pass alignment [8].
- The --sjdbOverhang parameter is typically set to the maximum read length minus one. For example, for 100 bp paired-end reads, the ideal value is 99 [20] [22]. This ensures the aligner has sufficient sequence context for accurate mapping across splice junctions.
- A higher --alignSJoverhangMin (e.g., 8) in the first pass increases the confidence in the initially discovered novel junctions by requiring a longer exact match on either side of the junction [8]. This creates a high-quality set of novel junctions to be used in the second pass.

This protocol outlines the steps for generating genome indices and performing the first pass of alignment, which produces the initial set of splice junctions.
Before alignment, a genome index must be generated. This step requires a reference genome FASTA file and a gene annotation file (GTF format is recommended).
Command for Genome Index Generation:
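The command itself is missing from this copy of the protocol. As a hedged sketch, index generation with an annotation file could be wrapped as below. The function name `build_star_index`, the `STAR_BIN` override, the thread count, and the file paths are illustrative assumptions; the overhang default of 99 follows the 100 bp example used in this protocol.

```bash
#!/usr/bin/env bash
# Sketch: build a STAR genome index from a FASTA file and a GTF annotation.
# Function name, thread count, and paths are illustrative placeholders.

build_star_index() {
    local fasta="$1" gtf="$2" index_dir="$3" overhang="${4:-99}"
    mkdir -p "$index_dir"
    # STAR_BIN is a hypothetical override for the location of the STAR binary.
    "${STAR_BIN:-STAR}" \
        --runMode genomeGenerate \
        --runThreadN 8 \
        --genomeDir "$index_dir" \
        --genomeFastaFiles "$fasta" \
        --sjdbGTFfile "$gtf" \
        --sjdbOverhang "$overhang"
}

# Example call (not executed here):
# build_star_index GRCh38.fa gencode.annotation.gtf /path/to/genome_index 99
```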
Key Parameters:
- --runMode genomeGenerate: Sets STAR to index generation mode.
- --genomeDir: Path to the directory where the genome indices will be stored.
- --sjdbOverhang 99: This should match the value planned for the alignment step. For reads of varying length, use the maximum read length minus one [20] [22].

The first-pass alignment uses the genome index to map reads and, crucially, to discover splice junctions.
Command for First-Pass Alignment:
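The command is not included in this copy. One possible form, assuming paired-end gzipped FASTQ input and the settings listed in Table 1 of this protocol, is sketched below. The function name `align_with_table1_settings`, the `STAR_BIN` override, the thread count, and the paths are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: alignment using the Table 1 settings (two-pass mode plus overhang,
# multimapping, and junction-insertion limits). Names and paths are placeholders.

align_with_table1_settings() {
    local index_dir="$1" r1="$2" r2="$3" prefix="$4"
    # STAR_BIN is a hypothetical override for the location of the STAR binary.
    "${STAR_BIN:-STAR}" \
        --runThreadN 8 \
        --genomeDir "$index_dir" \
        --readFilesIn "$r1" "$r2" \
        --readFilesCommand zcat \
        --twopassMode Basic \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 3 \
        --outFilterMultimapNmax 20 \
        --limitSjdbInsertNsj 1000000 \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$prefix"
}

# Example call (not executed here):
# align_with_table1_settings /path/to/genome_index s1.R1.fastq.gz s1.R2.fastq.gz s1.
```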
Key Parameters for First Pass:
- --twopassMode Basic: This single parameter simplifies the process, as STAR will automatically use the first pass to collect junctions for the second pass [19].
- --readFilesCommand zcat: Use if input FASTQ files are compressed (e.g., .gz).
- --outSAMtype BAM SortedByCoordinate: Outputs a sorted BAM file, which is standard for downstream analysis [22].

The following diagram illustrates the logical flow and critical outputs of the two-pass alignment method, highlighting the central role of the first pass.
Successful execution of the two-pass RNA-seq alignment protocol requires the following key materials and computational resources.
Table 2: Key Research Reagent Solutions for STAR Two-Pass Alignment.
| Item | Function / Role in the Protocol |
|---|---|
| Reference Genome (FASTA) | The reference genomic sequence to which RNA-seq reads are aligned (e.g., GRCh38 for human). It is strongly recommended to include major chromosomes and scaffolds [20]. |
| Gene Annotation (GTF/GFF) | Provides known transcript models and splice junctions, which are used during genome indexing to guide initial alignment. GENCODE annotations are recommended for human data [8] [21]. |
| High-Performance Computing (HPC) | STAR is memory-intensive; aligning to the human genome typically requires ~38GB of RAM. Multi-core processors are necessary for parallelization [18] [22]. |
| Splice Junction File (SJ.out.tab) | The key intermediate output of the first pass. It contains the coordinates of all detected splice junctions, which become the custom annotation for the second pass [13] [19]. |
| STAR Aligner | The software tool that performs the spliced alignment of RNA-seq reads to the reference genome using the described algorithm [4] [22]. |
Accurate identification of splice junctions is a critical step in RNA-seq data analysis, directly influencing downstream interpretations of transcript diversity and gene expression. The STAR aligner summarizes high-confidence splice junctions in an SJ.out.tab file, which provides a comprehensive tab-delimited summary of splice junctions detected during alignment [24]. Within the context of STAR two-pass mapping research, proper curation of this file is essential for improving the accuracy of novel junction discovery and quantification [8] [9]. Two-pass alignment methodology significantly enhances sensitivity for detecting unannotated junctions by using junctions discovered in an initial alignment pass to inform a second, more sensitive mapping round [8]. This guide presents a standardized framework for filtering the SJ.out.tab file to distinguish high-confidence splicing events, enabling researchers to maximize the reliability of their transcriptomic analyses.
The SJ.out.tab file generated by STAR provides a complete summary of detected splice junctions with precise genomic coordinates and supporting evidence metrics. Proper interpretation of each column is fundamental to effective filtering.
Table 1: Comprehensive Description of SJ.out.tab Columns
| Column | Name | Description | Interpretation |
|---|---|---|---|
| 1 | contig name | Chromosome or contig name | Genomic location of the junction |
| 2 | first base | First base of intron (1-based) | Intron start; donor site for + strand junctions, acceptor site for - strand |
| 3 | last base | Last base of intron (1-based) | Intron end; acceptor site for + strand junctions, donor site for - strand |
| 4 | strand | Strand orientation (0: undefined, 1: +, 2: -) | Direction of transcription |
| 5 | intron motif | Splice site sequence motif | Canonical/non-canonical classification |
| 6 | annotated | Junction annotation status | 0: unannotated, 1: annotated |
| 7 | unique reads | Uniquely mapping reads spanning junction | Primary evidence support |
| 8 | multi-mapping reads | Multi-mapping reads spanning junction | Supporting but ambiguous evidence |
| 9 | max overhang | Maximum spliced alignment overhang | Anchoring quality indicator |
The intron motif column (column 5) classifies splice sites using a standardized code: 0 for noncanonical, 1 for GT/AG, 2 for CT/AC, 3 for GC/AG, 4 for CT/GC, 5 for AT/AC, and 6 for GT/AT [24]. Codes 2, 4, and 6 are the reverse-strand forms of codes 1, 3, and 5. This classification is crucial as canonical motifs (particularly GT/AG) represent the vast majority of biologically valid splice sites. The maximum spliced alignment overhang (column 9) represents the length of the read sequence that anchors across the junction, serving as a key confidence metric: longer overhangs indicate more reliable alignment across the splice junction [24].
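To make the column layout concrete, the following sketch tallies junctions and uniquely mapped reads per intron motif. The function name `summarize_sj_motifs` is hypothetical, and the inclusion of motif code 6 (GT/AT) follows the STAR manual.

```bash
#!/usr/bin/env bash
# Sketch: count junctions and unique reads per intron motif in an SJ.out.tab file.
# Output columns: motif name, number of junctions, summed unique reads (column 7).

summarize_sj_motifs() {
    awk -F'\t' '
        BEGIN {
            name[0] = "non-canonical"; name[1] = "GT/AG"; name[2] = "CT/AC"
            name[3] = "GC/AG"; name[4] = "CT/GC"; name[5] = "AT/AC"
            name[6] = "GT/AT"
        }
        { n[$5]++; uniq[$5] += $7 }   # column 5 = motif code, column 7 = unique reads
        END {
            for (m = 0; m <= 6; m++)
                if (n[m] > 0) printf "%s\t%d\t%d\n", name[m], n[m], uniq[m]
        }
    ' "$1"
}

# Example call (not executed here):
# summarize_sj_motifs sample1.SJ.out.tab
```

A quick tally like this is a cheap sanity check: a sample in which noncanonical junctions make up a large share of the total usually deserves closer inspection before filtering.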
The following filtering strategy combines established practices from the DRAGEN Bio-IT Platform with community-validated approaches to create a robust framework for junction curation.
Table 2: Tiered Filtering Thresholds for Junction Curation
| Filter Category | Parameter | Threshold | Biological Rationale |
|---|---|---|---|
| Read Support | Unique mapping reads | ≥ 3 for noncanonical motifs | Reduces false positives from alignment errors |
| | Unique mapping reads | ≥ 2 for canonical motifs | Balances sensitivity and specificity |
| | Unique mapping reads | Increased thresholds for long introns | Addresses mapping complexity over long distances |
| Junction Anchoring | Maximum overhang | ≥ 12 for canonical motifs | Ensures sufficient sequence evidence |
| | Maximum overhang | ≥ 30 for noncanonical motifs | Compensates for unusual splice sequences |
| Splice Site Preference | Intron motif | Prioritize canonical (codes 1-6) | Reflects biological prevalence of major splice types |
| Annotation Status | Annotated junctions | Automatic inclusion | Leverages existing transcriptomic knowledge |
Long introns require special consideration in filtering strategies. Based on DRAGEN implementation, junctions longer than 50,000 bases should require at least 2 uniquely mapping reads, those longer than 100,000 bases should require at least 3 reads, and junctions exceeding 200,000 bases should require at least 4 uniquely mapping reads [24]. These adjusted thresholds compensate for the increased mapping complexity across large genomic distances.
The two-pass alignment method significantly improves novel junction detection by separating junction discovery from quantification [8]. In the first alignment pass, junctions are identified using stringent parameters. These discovered junctions are then collected and used as a "novel annotation" to guide a second alignment pass with more sensitive parameters [6]. This approach particularly benefits the quantification of novel splice junctions by permitting alignment of reads with shorter junction overhangs that would otherwise fail to map [8].
Research demonstrates that two-pass alignment can improve quantification of approximately 94-99% of simulated novel splice junctions across diverse RNA-seq datasets, providing as much as 1.7-fold deeper median read depth over these junctions [8]. The gain comes from reads that span a junction by only a short length: these reads can be aligned in the second pass, which recovers junction coverage that a single-pass approach would miss.
The following workflow diagram illustrates the complete junction curation process, integrating both standard filtering and two-pass methodology:
For researchers analyzing non-model organisms or working with long-read sequencing technologies, advanced tools like 2passtools provide machine-learning-enhanced filtering. This approach uses alignment metrics and sequence information to filter spurious splice junctions from long-read alignments before the second alignment pass [9]. The integration of alignment and sequence information produces significant improvement in splice junction accuracy for subsequent genome-guided annotation.
Necessary Resources
Procedure
Apply specialized filtering for noncanonical junctions requiring stronger evidence:
Extract annotated junctions regardless of other metrics:
Merge and deduplicate the high-confidence junction sets:
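The individual commands for these steps are not preserved in this copy. As a sketch under stated assumptions, the tiered thresholds from Table 2 and the long-intron rules can be applied in a single awk pass, which makes a separate merge and deduplication step unnecessary because each junction appears once per SJ.out.tab file. The function name `filter_sj_tiered` is hypothetical, and treating every motif code above 0 as canonical is an assumption based on STAR's motif coding.

```bash
#!/usr/bin/env bash
# Sketch: tiered filtering of one SJ.out.tab file.
#   annotated junctions (col 6 == 1)      -> always kept
#   canonical motifs    (col 5 > 0)       -> >= 2 unique reads, overhang >= 12
#   noncanonical motifs (col 5 == 0)      -> >= 3 unique reads, overhang >= 30
#   introns > 100,000 / > 200,000 bases   -> at least 3 / 4 unique reads
# The > 50,000 rule (>= 2 reads) is already implied by the base thresholds.

filter_sj_tiered() {
    awk -F'\t' '
        {
            len = $3 - $2 + 1
            if ($6 == 1) { print; next }
            if ($5 == 0) { min_reads = 3; min_overhang = 30 }
            else         { min_reads = 2; min_overhang = 12 }
            if (len > 200000 && min_reads < 4)      min_reads = 4
            else if (len > 100000 && min_reads < 3) min_reads = 3
            if ($7 >= min_reads && $9 >= min_overhang) print
        }
    ' "$1"
}

# Example call (not executed here):
# filter_sj_tiered sample1.SJ.out.tab > sample1.SJ.filtered.tab
```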
Necessary Resources
Procedure
Second pass alignment - Use discovered junctions for sensitive mapping:
Junction curation - Apply filtering thresholds to final SJ.out.tab:
Table 3: Essential Research Reagent Solutions for Junction Analysis
| Resource | Function | Application Notes |
|---|---|---|
| STAR Aligner | Spliced alignment of RNA-seq reads | Ultra-fast, sensitive; requires significant RAM [4] |
| Reference Genome | Genomic coordinate system | Use primary assembly without alternate contigs for STAR [25] |
| Gene Annotation (GTF) | Known splice junction reference | GENCODE/Ensembl recommended; crucial for annotation status [6] |
| 2passtools | Machine-learning junction filtering | Particularly valuable for long-read RNA-seq data [9] |
| Qualimap | Alignment quality assessment | Evaluates 3'/5' bias, genomic region distribution [25] |
| DRAGEN Bio-IT | Accelerated RNA-seq pipeline | Provides validated, production-grade filtering thresholds [24] |
Systematic curation of splice junctions from the SJ.out.tab file is an essential component of robust RNA-seq analysis. By implementing the tiered filtering framework presented in this guide - incorporating read support thresholds, junction anchoring quality, motif classification, and annotation status - researchers can significantly enhance the reliability of their splice junction datasets. When integrated with the two-pass alignment method, this approach provides a comprehensive strategy for maximizing both sensitivity and specificity in junction detection, particularly for novel splicing events. The standardized protocols and reagent solutions outlined here offer practical implementation guidance, enabling the research community to adopt consistent, reproducible practices for high-confidence junction curation in diverse transcriptomic studies.
In the analysis of RNA sequencing (RNA-seq) data, accurate alignment of sequencing reads to a reference genome is a foundational step, yet it is complicated by the phenomenon of pre-mRNA splicing, which creates splice junctions. Two-pass alignment is a sophisticated computational strategy designed to enhance the sensitivity of spliced alignment, particularly for the discovery and quantification of novel splice junctions not present in existing annotation files [8]. The core rationale is elegant: an initial alignment pass is performed to discover splice junctions from the data itself. These discovered junctions are then filtered to create a custom set of junctions, which is provided to guide a second, more sensitive alignment pass [26]. This method directly addresses a key limitation of standard, single-pass alignment, where the alignment algorithm inherently requires more stringent evidence to align a read across a novel junction compared to an annotated one, thus creating a quantification bias against novel splicing events [8]. By treating the newly discovered junctions as "annotated" in the second pass, two-pass alignment reduces this bias, permitting a more sensitive realignment of reads. Profiling of this approach has demonstrated that it can improve the quantification of a vast majority of novel splice junctions, delivering as much as a 1.7-fold deeper median read depth over these junctions compared to single-pass methods [8].
The implementation of a two-pass alignment workflow requires careful execution at each stage. The following section details the core protocol and the critical step of junction filtering.
The STAR (Spliced Transcripts Alignment to a Reference) aligner is widely used for implementing two-pass alignment due to its speed and sensitivity [8]. The standard workflow for multiple samples is as follows:
Initial Genome Indexing: Generate a genome index using the reference genome sequence and, if available, a gene annotation file (GTF/GFF).
Note: The --sjdbOverhang parameter should be set to the maximum read length minus 1. [27]
First-Pass Alignment: Perform the first alignment pass for all samples individually. This step generates a list of detected splice junctions for each sample (SJ.out.tab file).
Junction Consolidation and Filtering: Collect the SJ.out.tab files from all samples into a single directory. These files are then concatenated and, crucially, filtered to remove likely spurious junctions (see Section 2.2 for detailed criteria).
Second Genome Indexing: Generate a new genome index that includes the filtered set of splice junctions from the first pass.
Second-Pass Alignment: Re-align each sample using the new genome index containing the filtered, sample-derived junctions. The resulting alignments (BAM files) are used for downstream splicing analysis.
```bash
# Second-pass alignment of one sample against the junction-aware index
STAR --genomeDir /path/to/secondpass_genome_index \
     --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix sample1.pass2.
```
[27]
The following diagram illustrates the logical flow and data products of this workflow:
A critical advancement in the two-pass protocol is the filtering of splice junctions prior to the second indexing step. Incorporating all discovered junctions, including those with low support, can introduce noise, increase computational burden, and reduce the percentage of uniquely mapped reads [13]. Filtering aims to retain high-confidence junctions. Based on community best practices and recommendations from the STAR author, the following criteria should be applied to the concatenated SJ.out.tab file [27]:
chrM.

This filtering step has been shown to mitigate the drawbacks of two-pass alignment, reducing the drop in uniquely mapped reads to about 0.4% and substantially decreasing the number of technical splicing differences detected between first and second passes of the same sample [13].
The application of two-pass alignment with filtered junctions has a measurable impact on splicing analysis. The following tables summarize key quantitative findings from empirical studies.
Table 1: Effect of Two-Pass Alignment on Splice Junction Quantification [8]
| Sample Type | Splice Junctions with Improved Quantification | Median Read Depth Ratio (2-pass / 1-pass) |
|---|---|---|
| Lung Adenocarcinoma | 98 - 99% | 1.68x - 1.71x |
| Reference RNA (UHRR) | 94 - 97% | 1.25x - 1.26x |
| Lung Cancer Cell Lines | 97% | 1.19x - 1.21x |
| Arabidopsis Tissues | 95 - 97% | 1.12x |
Table 2: Comparison of Splicing Analysis Outcomes (1-pass vs. 2-pass) [13]
| Metric | 1-Pass Alignment | 2-Pass Alignment (with Filtering) |
|---|---|---|
| Detection Power | Baseline | Identifies more significant splicing changes (LSVs) |
| Novelty | Relies on prior annotation | Enables discovery of unannotated, sample-specific junctions |
| Reproducibility | High for detected events | Additional events unique to 2-pass are less reproducible |
| Uniquely Mapped Reads | Baseline | ~0.4% decrease |
| Computational Load | Baseline | Increased runtime and memory (moderated by filtering) |
The data reveals a core trade-off: two-pass alignment boosts sensitivity and discovery power at the cost of introducing a set of less reproducible splicing events. While the vast majority of differential splicing events detected in one pass are also detected in the other, with over 99% showing minimal difference (dPSI < 0.025), the subset of events found exclusively in the second pass tends to have lower reproducibility rates between biological replicate analyses [13]. Therefore, the choice to use two-pass alignment should be guided by the study's goal: it is excellent for hypothesis generation but may require more stringent validation of its unique findings.
Table 3: Essential Computational Tools and Resources for Two-Pass Alignment
| Item | Function in the Two-Pass Protocol |
|---|---|
| STAR Aligner | The primary software used for fast, sensitive spliced alignment and for implementing the two-pass workflow [8] [27]. |
| Reference Genome Sequence (FASTA) | The genomic DNA sequence for the organism being studied, used for building the alignment index. |
| Gene Annotation (GTF/GFF) | A file of known gene models, providing a set of high-confidence splice junctions for the initial alignment pass [8]. |
| SJ.out.tab File | STAR's output file format that details all splice junctions detected in an alignment, used as input for the second pass [27] [28]. |
| 2passtools | A software package that provides advanced, machine-learning-based filtering of splice junctions from long-read sequencing data for use in two-pass alignment [26]. |
| DRAGEN Bio-IT Platform | An alignment platform that also supports a two-pass mode using an SJ.out.tab file from a first pass to increase sensitivity [28]. |
| Splicing Quantification Tools (e.g., MAJIQ) | Downstream software that detects and quantifies alternative splicing changes from the BAM files produced by the alignment pipeline [13]. |
In summary, two-pass alignment with integrated junction filtering is a powerful method for enhancing the sensitivity of RNA-seq analysis towards novel splicing events. The empirical evidence demonstrates clear benefits, including significantly improved read depth over novel junctions [8]. However, researchers must be aware of the trade-offs, namely a slight reduction in uniquely mapped reads and the introduction of less reproducible splicing events [13].
The decision to employ this method should be strategic. For studies where the goal is a comprehensive, hypothesis-generating survey of the transcriptome, including in non-model organisms or disease contexts with expected novel splicing, two-pass alignment is highly recommended. For studies focused on the highly accurate quantification of a pre-defined set of splicing events, a well-executed single-pass alignment may be sufficient. Regardless of the choice, the implementation of rigorous junction filtering is a critical step that mitigates the drawbacks of the two-pass approach, making it a more robust and reliable protocol for modern splicing analysis.
Within the context of research on the STAR two-pass mapping method for improved accuracy, a critical methodological decision arises in multi-sample studies: how to most effectively leverage splice junction information across an entire dataset. The standard two-pass alignment method, which significantly enhances novel splice junction discovery and quantification by performing an initial discovery pass followed by a more sensitive alignment pass using the discovered junctions, presents a particular challenge when applied to multiple samples [8]. Researchers must choose between performing two-pass alignment on a per-sample basis or implementing a strategy to pool splice junctions discovered across all samples to create a unified reference for the second pass. This protocol outlines the best practices for the latter approach, providing a systematic framework for pooling junctions across an entire dataset to maximize the consistency and sensitivity of splice junction detection in multi-sample RNA-seq studies. Quantitative evidence demonstrates that two-pass alignment can improve quantification for 94-99% of novel splice junctions and provide as much as 1.7-fold deeper median read coverage over these features [8].
The fundamental principle behind two-pass alignment involves separating the processes of splice junction discovery and quantification to overcome the inherent bias in standard alignment methods that favor annotated junctions over novel ones [8]. In the first pass, alignment is performed with high stringency to discover splice junctions present in the data. In the second pass, these discovered junctions are used as "annotated" junctions, allowing the aligner to apply less stringent parameters and achieve higher sensitivity when aligning reads across these splice boundaries [8]. This approach has been shown to work particularly well by permitting alignment of sequence reads with shorter spanning lengths across splice junctions, thereby increasing the recovery of junction-spanning reads that might otherwise be missed [8].
When working with multiple samples from the same experimental conditions, pooling junctions across the entire dataset provides several distinct advantages over sample-specific two-pass alignment:
According to Alexander Dobin, the creator of STAR, this pooled approach "allows for more uniform detection of novel splicing across the samples" compared to sample-specific methods [29].
The pooled junction approach follows a structured workflow that can be conceptually divided into three main phases: initial sample processing, junction consolidation, and final alignment. The logical flow and dependencies between these stages are illustrated below:
The decision of which samples to include in junction pooling should be guided by biological and technical considerations:
Effective filtering of pooled junctions is critical for balancing sensitivity and specificity:
Table 1: Junction Filtering Parameters and Recommendations
| Filtering Parameter | Recommended Value | Rationale | Considerations |
|---|---|---|---|
| Minimum Unique Reads | 3-5 reads across samples | Balances sensitivity with false positive reduction | Higher thresholds reduce spurious junctions but may miss low-expression events |
| Overhang Length | 8-10 bp minimum | Ensures sufficient evidence for splice site | Shorter overhangs increase sensitivity but may introduce alignment errors |
| Canonical Splice Sites | Preferentially retain GT-AG, GC-AG, AT-AC | Biological relevance | Non-canonical sites may represent true biological variants or alignment artifacts |
| Multimapping Reads | Exclude or carefully evaluate | Potential for misassignment | Some true junctions may initially appear as multimapping |
For each sample in the dataset, perform first-pass alignment using STAR with the following key parameters:
Table 2: Critical First-Pass Alignment Parameters
| Parameter | Value | Function | Biological Rationale |
|---|---|---|---|
| --alignIntronMin | 20 | Minimum intron size | Prevents very short deletions from being identified as introns |
| --alignIntronMax | 1000000 | Maximum intron size | Accommodates known long introns while limiting spurious alignments |
| --alignSJoverhangMin | 8 | Minimum overhang for unannotated junctions | Balances sensitivity and specificity for novel junctions |
| --alignSJDBoverhangMin | 3 | Minimum overhang for annotated junctions | Increased sensitivity for known junctions |
| --outFilterType BySJout | Enabled | Reduces false junctions | Filters output based on splice junction evidence |
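A first-pass discovery run with the parameters in Table 2 could be sketched as follows. Because only SJ.out.tab is needed from this pass, the sketch suppresses alignment output with `--outSAMtype None`; that choice, along with the function name `first_pass_discovery`, the `STAR_BIN` override, the thread count, and the paths, is an assumption rather than part of the original protocol.

```bash
#!/usr/bin/env bash
# Sketch: first-pass junction discovery for one sample using the Table 2 settings.
# Names, thread count, and paths are illustrative placeholders.

first_pass_discovery() {
    local index_dir="$1" r1="$2" r2="$3" prefix="$4"
    # STAR_BIN is a hypothetical override for the location of the STAR binary.
    "${STAR_BIN:-STAR}" \
        --runThreadN 8 \
        --genomeDir "$index_dir" \
        --readFilesIn "$r1" "$r2" \
        --readFilesCommand zcat \
        --alignIntronMin 20 \
        --alignIntronMax 1000000 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 3 \
        --outFilterType BySJout \
        --outSAMtype None \
        --outFileNamePrefix "$prefix"
}

# Example call (not executed here):
# first_pass_discovery /path/to/genome_index s1.R1.fastq.gz s1.R2.fastq.gz pass1/s1.
```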
After first-pass alignment for each sample, carefully manage the output files:
Collect all SJ.out.tab files from the first-pass alignment and combine them:
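The original command is missing here. One way to combine the files, shown as a sketch, is to concatenate them and collapse identical junctions while summing their read support. The function name `pool_sj_files` and the extra tenth output column (number of samples in which the junction was seen) are assumptions for illustration and are not part of STAR's SJ.out.tab format.

```bash
#!/usr/bin/env bash
# Sketch: pool SJ.out.tab files from several samples.
# Junctions are keyed on contig, start, end, and strand. Unique and multi-mapping
# read counts are summed, the largest overhang is kept, and a tenth column
# records how many input files contained the junction.

pool_sj_files() {
    cat "$@" | awk -F'\t' -v OFS='\t' '
        {
            key = $1 OFS $2 OFS $3 OFS $4
            if (!(key in motif)) order[++n] = key   # remember first-seen order
            motif[key] = $5; annot[key] = $6
            uniq[key] += $7; multi[key] += $8
            if ($9 > maxov[key]) maxov[key] = $9
            samples[key]++
        }
        END {
            for (i = 1; i <= n; i++) {
                k = order[i]
                print k, motif[k], annot[k], uniq[k], multi[k], maxov[k], samples[k]
            }
        }
    '
}

# Example call (not executed here):
# pool_sj_files pass1/*.SJ.out.tab > pooled.SJ.tab
```

Keeping the sample count alongside the summed reads makes it possible to distinguish a junction seen weakly in many samples from one seen strongly in a single sample.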
Apply comprehensive filtering to the pooled junction list:
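The filtering command is not preserved in this copy. A minimal sketch, assuming a pooled file that keeps STAR's column order (motif in column 5, summed unique reads in column 7, maximum overhang in column 9), applies the read-support and overhang recommendations from Table 1. Dropping all noncanonical junctions is a stricter choice than the "preferentially retain" wording in the table and is an assumption of this sketch, as is the function name `filter_pooled_sj`.

```bash
#!/usr/bin/env bash
# Sketch: filter a pooled junction list.
# Keeps canonical junctions (col 5 > 0) with enough uniquely mapped reads
# (col 7) and a sufficient maximum overhang (col 9). Multi-mapping reads
# (col 8) do not count toward support.

filter_pooled_sj() {
    local min_reads="${2:-3}" min_overhang="${3:-8}"
    awk -F'\t' -v r="$min_reads" -v o="$min_overhang" \
        '$5 > 0 && $7 >= r && $9 >= o' "$1"
}

# Example call (not executed here):
# filter_pooled_sj pooled.SJ.tab 3 8 > pooled.SJ.filtered.tab
```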
Generate new genome indices incorporating the filtered pooled junctions:
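As a sketch of this step, the filtered junction list can be passed to index generation through `--sjdbFileChrStartEnd` alongside the original annotation. The function name `build_index_with_junctions`, the `STAR_BIN` override, the thread count, and the paths are hypothetical; the overhang default of 99 assumes 100 bp reads.

```bash
#!/usr/bin/env bash
# Sketch: regenerate the STAR index with pooled junctions added to the annotation.
# Names, thread count, and paths are illustrative placeholders.

build_index_with_junctions() {
    local fasta="$1" gtf="$2" sj_file="$3" index_dir="$4" overhang="${5:-99}"
    mkdir -p "$index_dir"
    # STAR_BIN is a hypothetical override for the location of the STAR binary.
    "${STAR_BIN:-STAR}" \
        --runMode genomeGenerate \
        --runThreadN 8 \
        --genomeDir "$index_dir" \
        --genomeFastaFiles "$fasta" \
        --sjdbGTFfile "$gtf" \
        --sjdbFileChrStartEnd "$sj_file" \
        --sjdbOverhang "$overhang"
}

# Example call (not executed here):
# build_index_with_junctions GRCh38.fa gencode.gtf pooled.SJ.filtered.tab /path/to/pass2_index 99
```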
The --sjdbOverhang parameter should be set to the read length minus 1; this setting is critical for proper junction inclusion [29].
Align all samples using the new genome index containing pooled junctions:
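The command is not reproduced in this copy. A sketch of the second pass as a loop over sample names is shown below. The FASTQ naming scheme (`<sample>.R1.fastq.gz` and `<sample>.R2.fastq.gz`), the function name `second_pass_all`, the `STAR_BIN` override, and the thread count are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: second-pass alignment of every sample against the junction-aware index.
# Assumes paired FASTQ files named <sample>.R1.fastq.gz and <sample>.R2.fastq.gz.

second_pass_all() {
    local index_dir="$1" fastq_dir="$2" out_dir="$3"
    shift 3
    local sample
    for sample in "$@"; do
        # STAR_BIN is a hypothetical override for the location of the STAR binary.
        "${STAR_BIN:-STAR}" \
            --runThreadN 8 \
            --genomeDir "$index_dir" \
            --readFilesIn "$fastq_dir/$sample.R1.fastq.gz" "$fastq_dir/$sample.R2.fastq.gz" \
            --readFilesCommand zcat \
            --outSAMtype BAM SortedByCoordinate \
            --outFileNamePrefix "$out_dir/$sample.pass2."
    done
}

# Example call (not executed here):
# second_pass_all /path/to/pass2_index fastq pass2 sampleA sampleB sampleC
```

Because every sample is aligned against the same index, the set of junctions treated as annotated is identical across the project.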
Table 3: Research Reagent Solutions for STAR Two-Pass Alignment with Junction Pooling
| Resource Type | Specific Tool/Resource | Function in Protocol | Availability |
|---|---|---|---|
| Alignment Software | STAR (v2.4.0h1 or newer) | Spliced alignment of RNA-seq reads | https://github.com/alexdobin/STAR |
| Reference Genome | Species-specific (e.g., GRCh38 for human) | Genomic coordinate system for alignment | GENCODE, UCSC, ENSEMBL |
| Annotation File | GTF/GFF3 file (e.g., GENCODE Basic) | Gene model annotation for guided alignment | GENCODE, ENSEMBL |
| Junction Database | Custom-generated from pooled samples | Enhanced splice site reference for second pass | Generated in protocol |
| Quality Control | FastQC, MultiQC | Assessment of raw read and alignment quality | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |
| Sequence Processing | fastp, Trim Galore | Adapter trimming and quality control | https://github.com/OpenGene/fastp |
| Quantification | featureCounts, HTSeq | Read counting for gene expression analysis | https://subread.sourceforge.net/ |
After completing the two-pass alignment with pooled junctions, several quality metrics should be examined:
The resulting alignments are suitable for various downstream analyses:
The following diagram illustrates a systematic approach to addressing common challenges in the junction pooling workflow:
Pooling junctions across an entire dataset for STAR two-pass alignment represents a powerful strategy for maximizing splicing detection consistency and sensitivity in multi-sample RNA-seq studies. This approach leverages the collective evidence from all samples to create a comprehensive junction database that enhances the alignment of individual samples. The method is particularly valuable for studies focusing on alternative splicing, novel isoform discovery, or detection of low-prevalence splicing events. When properly implemented with appropriate filtering and quality control, this protocol can significantly enhance the reliability and biological relevance of RNA-seq analyses in both basic research and drug development contexts.
Accurate detection of alternative splicing from RNA sequencing data remains computationally challenging due to fundamental limitations in aligning short reads across splice junctions. Standard alignment approaches exhibit preferential alignment to known splice junctions, creating a discovery bias against novel splicing events and reducing statistical power for differential splicing analysis [8]. This bias is particularly problematic in cancer research, where novel splice variants can drive tumor progression and represent potential therapeutic targets [30].
The two-pass alignment method, implemented in splice-aware aligners like STAR, addresses this limitation through an iterative approach that separates junction discovery from quantification. The first alignment pass identifies splice junctions across all samples with high stringency, while the second pass utilizes these discovered junctions as an expanded reference to enable more sensitive alignment of reads with short exonic overlaps [8] [31]. This method significantly improves the quantification of novel splice junctions, with studies reporting as much as 1.7-fold deeper median read coverage over these junctions [8].
The DEJU (Differential Exon-Junction Usage) workflow represents a sophisticated downstream integration of this two-pass alignment approach, specifically designed to leverage the enhanced junction detection for improved differential splicing analysis [10]. By explicitly incorporating exon-exon junction reads alongside traditional exon counts, DEJU resolves the double-counting issue inherent in standard differential exon usage (DEU) methods while expanding the range of detectable splicing events, particularly alternative splice sites and intron retention [10] [32].
Comprehensive simulation studies benchmarked the performance of the DEJU workflow against existing methods across various alternative splicing patterns. The results demonstrate that incorporating exon-exon junction reads significantly enhances detection power while effectively controlling false discovery rates.
Table 1: Performance of DEJU-edgeR across splicing patterns and sample sizes
| Splicing Pattern | Sample Size (n) | FDR | Statistical Power |
|---|---|---|---|
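| Exon Skipping (ES) | 3 | 0.022 | 0.977 |
| Exon Skipping (ES) | 5 | 0.029 | 0.991 |
| Exon Skipping (ES) | 10 | 0.038 | 0.992 |
| Mutually Exclusive Exons (MXE) | 3 | 0.030 | 0.990 |
| Mutually Exclusive Exons (MXE) | 5 | 0.040 | 0.993 |
| Mutually Exclusive Exons (MXE) | 10 | 0.045 | 0.995 |
| Alternative Splice Sites (ASS) | 3 | 0.027 | 0.839 |
| Alternative Splice Sites (ASS) | 5 | 0.027 | 0.927 |
| Alternative Splice Sites (ASS) | 10 | 0.038 | 0.977 |
| Intron Retention (IR) | 3 | 0.030 | 0.866 |
| Intron Retention (IR) | 5 | 0.031 | 0.934 |
| Intron Retention (IR) | 10 | 0.042 | 0.964 |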
Data derived from Pham et al. (2025) benchmarking studies [10].
DEJU-edgeR consistently demonstrated effective FDR control at the nominal 0.05 level across all splicing patterns, though it operated slightly more conservatively than DEJU-limma [10]. The statistical power for detecting differential splicing events increased substantially with larger sample sizes, with particularly notable improvements for alternative splice site and intron retention events, which are traditionally challenging to detect with standard DEU approaches.
Table 2: Method comparison across computational frameworks
| Method | Junction Read Incorporation | Double-Counting Resolution | Computational Efficiency | ASS/IR Detection Sensitivity |
|---|---|---|---|---|
| DEJU-edgeR | Yes | Yes | High | High |
| DEJU-limma | Yes | Yes | High | High |
| DEU-edgeR | No | No | High | Limited |
| DEU-limma | No | No | High | Limited |
| DEXSeq | No | Partial | Moderate | Moderate |
| JunctionSeq | Partial | Partial | Moderate | Moderate |
DEJU methods demonstrated superior performance in detecting a broader range of alternative splicing events while effectively controlling the false discovery rate [10]. The integration of two-pass alignment specifically enhanced detection of alternative splice sites and intron retention events that are often missed by standard exon-based approaches.
The initial alignment phase establishes the foundation for successful downstream DEJU analysis through comprehensive junction discovery.
Step 1: Genome Indexing and First-Pass Alignment
Step 2: Splice Junction Consolidation and Filtering
Step 3: Second-Pass Alignment with Enhanced Junction Database
This two-pass approach improves quantification of at least 94% of simulated novel splice junctions and provides as much as 1.7-fold deeper median read depth over these junctions [8].
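The consolidation and filtering step (Step 2 above) can be sketched in plain Python. This is a minimal illustration, not the published DEJU pipeline: the function name `pool_junctions` and the thresholds `min_unique` and `min_samples` are assumptions chosen for the example. It relies only on the standard SJ.out.tab column layout (chromosome, intron start, intron end, strand code, motif code, annotation flag, unique reads, multi-mapped reads, maximum overhang) and emits the four-column chromosome/start/end/strand format that `--sjdbFileChrStartEnd` accepts.

```python
from collections import defaultdict

STRAND = {"0": ".", "1": "+", "2": "-"}  # SJ.out.tab column 4 encoding


def pool_junctions(sj_tables, min_unique=3, min_samples=1):
    """Pool novel junctions from several SJ.out.tab tables.

    sj_tables: iterable of iterables of tab-separated SJ.out.tab lines.
    min_unique / min_samples are illustrative thresholds (assumptions).
    Returns sorted 'chrom<TAB>start<TAB>end<TAB>strand' lines.
    """
    unique_reads = defaultdict(int)  # junction -> unique reads summed over samples
    n_samples = defaultdict(int)     # junction -> number of samples reporting it
    for table in sj_tables:
        for line in table:
            f = line.rstrip("\n").split("\t")
            if len(f) < 9 or f[5] == "1":  # skip malformed or annotated rows
                continue
            key = (f[0], int(f[1]), int(f[2]), STRAND.get(f[3], "."))
            unique_reads[key] += int(f[6])
            n_samples[key] += 1
    kept = [k for k in unique_reads
            if unique_reads[k] >= min_unique and n_samples[k] >= min_samples]
    return ["\t".join(map(str, k)) for k in sorted(kept)]
```

In practice each table would be an open SJ.out.tab file handle from one first-pass run, and the returned lines would be written to a single file passed to the second pass. Stricter thresholds trade sensitivity for fewer spurious junctions in the pooled database.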
The alignment files from two-pass STAR are processed to generate combined exon and junction count matrices.
Step 1: Generate Flattened Exon Annotation
Step 2: Create Junction Database from Annotation
Step 3: Simultaneous Exon and Junction Quantification
The nonSplitOnly=TRUE and juncCounts=TRUE parameters are crucial for resolving the double-counting issue by distinguishing internal exon reads from exon-exon junction reads [10].
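The distinction between internal exon reads and exon-exon junction reads can be illustrated from the alignment records themselves. The sketch below is a conceptual stand-in and not the Rsubread implementation: it assumes only the SAM convention that an `N` operation in the CIGAR string marks a skipped intron, and the helper names `read_introns` and `classify_read` are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_introns(pos, cigar):
    """Return intron intervals (1-based, inclusive) implied by N operations.

    pos is the SAM POS field (1-based leftmost aligned reference base).
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns


def classify_read(pos, cigar):
    """'junction' for split reads, 'internal' for reads within one exon."""
    return "junction" if read_introns(pos, cigar) else "internal"
```

Counting "internal" reads toward exons and "junction" reads toward junction features means each read contributes to one feature type only, which is the idea behind avoiding double counting.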
The final phase implements statistical detection of differential splicing using the combined exon-junction count matrix.
Step 1: Data Preprocessing and Normalization
Step 2: Differential Exon-Junction Usage Analysis
Step 3: Results Interpretation and Visualization
Two-Pass Alignment and DEJU Integration Workflow: This diagram illustrates the complete analytical pathway from raw sequencing data to differential splicing detection, highlighting the critical integration points between two-pass alignment and the DEJU statistical framework.
Table 3: Essential research reagents and computational solutions for two-pass DEJU analysis
| Category | Tool/Resource | Specific Function | Implementation Notes |
|---|---|---|---|
| Alignment | STAR (2-pass mode) | Spliced alignment with novel junction discovery | Use --outFilterType BySJout, --alignSJoverhangMin 8 [8] |
| Quantification | Rsubread featureCounts | Simultaneous exon and junction counting | Set nonSplitOnly=TRUE, juncCounts=TRUE [10] |
| Statistical Analysis | edgeR (diffSpliceDGE) | Differential exon-junction usage testing | Apply filterByExpr filtering, TMM normalization [10] |
| Reference Genome | GENCODE Basic | Comprehensive gene annotation | Provides high-quality gene models for human/mouse [32] |
| Quality Control | FastQC + MultiQC | Sequencing data quality assessment | Identify sequencing biases affecting junction detection [30] |
| Preprocessing | trim_galore | Adapter trimming and quality filtering | Improves alignment accuracy, particularly at read ends [32] |
The effectiveness of the integrated two-pass/DEJU workflow is highly dependent on experimental design parameters. Statistical power for detecting differential splicing events increases substantially with larger sample sizes, particularly for challenging patterns like alternative splice sites and intron retention [10]. Researchers should aim for at least 5 biological replicates per condition to achieve power >0.92 for most splicing events, with 10 replicates providing optimal detection capability across all event types.
For organisms with incomplete annotations, the two-pass approach provides particular value by systematically discovering novel junctions in the first alignment phase and incorporating them into the quantitative framework of DEJU analysis. However, the workflow is not recommended for non-model organisms with highly fragmented reference genomes, as the dependency on genome-guided alignment may yield unreliable results [32].
The two-pass alignment process requires substantial computational resources, particularly during the genome re-indexing phase that incorporates discovered junctions. Researchers should allocate sufficient memory (≥32 GB RAM for mammalian genomes) and processing cores to ensure practical runtime. The DEJU analysis component itself is computationally efficient, with benchmarking studies demonstrating favorable performance compared to alternative methods like DEXSeq and JunctionSeq [10].
The integration of two-pass alignment with the DEJU analytical framework represents a significant advancement in differential splicing detection, particularly for clinical and cancer research applications where comprehensive splicing characterization can reveal biologically meaningful events with potential diagnostic and therapeutic implications [10] [30].
In the analysis of RNA-sequencing (RNA-seq) data, junction filtering refers to the critical process of identifying and quantifying splice junctions, the points where exons are joined together after introns are removed from pre-messenger RNA. The fundamental challenge in this process lies in balancing sensitivity (the ability to correctly identify true novel splice junctions) with specificity (the ability to avoid false positive alignments). The STAR (Spliced Transcripts Alignment to a Reference) aligner, a widely used tool for RNA-seq analysis, employs sophisticated algorithms to navigate this trade-off, with its two-pass mapping method representing a significant advancement for improving the accuracy of novel junction discovery [8] [6].
Traditional single-pass alignment methods exhibit an inherent bias toward known, annotated splice junctions. These methods require greater evidence to align reads across novel splice junctions compared to known junctions, systematically reducing sensitivity for discovering unannotated splicing events [8]. The two-pass alignment strategy directly addresses this limitation by separating the processes of junction discovery and quantification, thereby increasing sensitivity without substantially compromising specificity. This approach is particularly valuable for research applications where comprehensive transcriptome characterization is essential, such as in cancer genomics, developmental biology, and the study of rare genetic disorders [8] [33].
The STAR two-pass mapping method operates through a sequential process that enhances junction detection sensitivity. In the first pass, alignment is performed with high stringency parameters across all samples individually. During this stage, STAR performs splice junction discovery, identifying all potential splice junctions, including both annotated and novel candidates. The key innovation of two-pass mapping lies in the second pass, where the junctions discovered in the first pass are used as an augmented "annotation" file. This customized reference allows the aligner to apply less stringent alignment parameters specifically for the novel junctions, effectively reducing the systematic bias against them [8] [6].
Table 1: Comparison of STAR Mapping Modes
| Feature | Single-Pass Mapping | Two-Pass Mapping |
|---|---|---|
| Junction Discovery | Single step using only existing annotations | Separate discovery and quantification phases |
| Sensitivity to Novel Junctions | Reduced due to alignment bias | Improved through customized junction database |
| Computational Requirements | Lower | Approximately double the alignment time |
| Handling of Unannotated Splice Sites | Requires more spanning nucleotides | Permits alignment with shorter spanning lengths |
| Recommended Use Cases | Routine expression analysis with focus on annotated features | Novel isoform discovery, comprehensive transcriptome characterization |
There are two primary implementations of the two-pass method: two-pass individual, which updates the splice junction database with novel junctions from the first pass of a single experiment, and two-pass collective, where junctions are discovered across multiple experiments before being used to create a unified junction database [34]. The individual approach is generally recommended for most applications, as it avoids potential batch effects while still providing significant benefits for novel junction detection [34].
Empirical studies have demonstrated that two-pass alignment significantly improves the quantification of novel splice junctions. Across diverse RNA-seq datasets, including human tissues, cell lines, and even Arabidopsis samples, two-pass alignment improved quantification for at least 94% of simulated novel splice junctions [8]. The method provides as much as 1.7-fold deeper median read depth over these splice junctions compared to single-pass alignment, substantially enhancing the statistical power for downstream differential splicing analyses [8].
Table 2: Performance Metrics of Two-Pass Alignment Across Sample Types
| Sample Type | Splice Junctions Improved | Median Read Depth Ratio | Expected Read Depth Ratio |
|---|---|---|---|
| Lung Adenocarcinoma Tissue | 99% | 1.68× | 1.75× |
| Reference RNA (UHRR) | 94-97% | 1.25-1.26× | 1.35× |
| Lung Normal Tissue | 96-98% | 1.18-1.71× | 1.23× |
| Lung Cancer Cell Lines | 97% | 1.19-1.21× | 1.19× |
| Arabidopsis Samples | 95-97% | 1.12× | 1.12× |
The mechanism behind this improvement involves the alignment of sequence reads with shorter spanning lengths across splice junctions. By treating newly discovered junctions as "known" in the second pass, STAR reduces the minimum required overhang length, allowing more reads to map confidently to these locations [8]. This technical adjustment directly addresses the sensitivity-specificity trade-off by maintaining specificity through the initial high-stringency discovery phase while enhancing sensitivity during the final quantification phase.
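The overhang mechanism can be made concrete with a toy acceptance rule. This is a deliberately simplified sketch, not STAR's scoring logic: the function name is hypothetical, and the two thresholds are assumptions standing in for the separate minimum-overhang settings STAR applies to unannotated junctions (`--alignSJoverhangMin`) and to junctions present in its database (`--alignSJDBoverhangMin`).

```python
def junction_read_accepted(left_overhang, right_overhang, in_junction_db,
                           novel_min=8, db_min=3):
    """Toy rule: a spliced read is kept if its shorter overhang meets the
    minimum for its junction class (illustrative thresholds, assumptions)."""
    min_overhang = db_min if in_junction_db else novel_min
    return min(left_overhang, right_overhang) >= min_overhang


# A read with 43 nt on one side of the junction and 5 nt on the other:
first_pass = junction_read_accepted(43, 5, in_junction_db=False)   # junction still novel
second_pass = junction_read_accepted(43, 5, in_junction_db=True)   # junction now "known"
```

Under these assumed thresholds the same read is rejected in the first pass and accepted in the second, which is how short-overhang reads add depth over junctions discovered in the first pass.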
Before initiating the two-pass mapping protocol, ensure appropriate computational resources are available. For the human genome, STAR requires approximately 30 GB of RAM, with 32 GB recommended for optimal performance. Sufficient disk space (>100 GB) should be available for storing output files, and multiple processing threads (typically 8-12 for standard workstations) will significantly improve processing speed [6].
The necessary software components include:
Workflow of STAR Two-Pass Mapping: This diagram illustrates the complete two-pass mapping process.
Step 1: Generate Genome Indices Create reference genome indices using the STAR genomeGenerate function:
The --sjdbOverhang parameter should be set to (read length - 1), which is 100 for standard 101bp paired-end reads [6].
Step 2: First Pass Mapping - Junction Discovery Perform initial alignment to discover novel splice junctions:
This step generates the SJ.out.tab file containing all discovered splice junctions, which will be used in the second pass [8] [6].
Step 3: Second Pass Mapping - Enhanced Alignment In the second pass, incorporate the discovered junctions for improved sensitivity:
The critical parameter --sjdbFileChrStartEnd specifies the junctions discovered in the first pass, and the reduced --alignSJoverhangMin value (5 vs. 8) enhances sensitivity for novel junctions [8].
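The three steps above can be sketched as STAR invocations assembled in Python. This is a minimal example under stated assumptions rather than the exact commands of the original protocol: all paths, the `sample1_` prefix, and the thread count are placeholders; input is assumed to be gzipped paired-end FASTQ (hence `zcat`); and the first pass discards alignments with `--outSAMtype None` because only SJ.out.tab is needed from it. The overhang values follow the 8 and 5 quoted in the text.

```python
import shlex


def single_sample_two_pass(genome_dir, fasta, gtf, read1, read2,
                           read_length=101, threads=8, prefix="sample1_"):
    """Return the three STAR commands as argument lists (nothing is executed)."""
    index = ["STAR", "--runMode", "genomeGenerate",
             "--runThreadN", str(threads),
             "--genomeDir", genome_dir,
             "--genomeFastaFiles", fasta,
             "--sjdbGTFfile", gtf,
             "--sjdbOverhang", str(read_length - 1)]  # read length minus 1
    common = ["STAR", "--runThreadN", str(threads),
              "--genomeDir", genome_dir,
              "--readFilesIn", read1, read2,
              "--readFilesCommand", "zcat"]
    # Pass 1: junction discovery only, stricter overhang for novel junctions.
    pass1 = common + ["--outFileNamePrefix", prefix + "pass1_",
                      "--outSAMtype", "None",
                      "--alignSJoverhangMin", "8"]
    # Pass 2: feed pass-1 junctions back in and write the final sorted BAM.
    pass2 = common + ["--outFileNamePrefix", prefix + "pass2_",
                      "--sjdbFileChrStartEnd", prefix + "pass1_SJ.out.tab",
                      "--alignSJoverhangMin", "5",
                      "--outSAMtype", "BAM", "SortedByCoordinate"]
    return index, pass1, pass2


for cmd in single_sample_two_pass("star_index", "genome.fa", "annotation.gtf",
                                  "sample1_R1.fastq.gz", "sample1_R2.fastq.gz"):
    print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths are real. For a single sample, STAR also offers `--twopassMode Basic`, which performs both passes in one invocation.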
Table 3: Essential Research Reagents and Computational Tools for Junction Analysis
| Resource | Type | Function/Purpose | Implementation Notes |
|---|---|---|---|
| STAR Aligner | Software | Spliced alignment of RNA-seq reads | Version 2.4.0h1 or newer recommended; supports two-pass mode [8] [6] |
| GENCODE Annotations | Reference | Comprehensive gene annotation | Provides baseline junction database; v21+ recommended for human studies [8] |
| NCBI SRA Tools | Software | Access to public RNA-seq datasets | Useful for method validation and comparison [8] |
| SAMtools/BEDTools | Software | Processing alignment files | Essential for downstream analysis of BAM files [33] |
| High-Performance Computing | Infrastructure | Computational resource requirements | 30+ GB RAM, multi-core processors for efficient two-pass execution [6] |
The two-pass mapping method proves particularly valuable in research contexts where comprehensive junction identification is critical. In clinical genetics, for instance, long-read sequencing technologies are increasingly employed to detect diverse genomic alterations, but short-read RNA-seq with enhanced junction detection remains invaluable for quantifying splicing changes [33]. The improved sensitivity of two-pass mapping enables more accurate detection of pathogenic splicing variants and novel isoforms associated with disease states.
For drug discovery research, particularly in studies focusing on tight junction biology in epithelial and endothelial barriers, accurate transcriptome quantification is essential. AI-based prediction of drug-gene interactions relies on high-quality RNA-seq data, where comprehensive junction detection can reveal subtle changes in gene expression and isoform usage in response to therapeutic compounds [35]. The two-pass method provides the robust data foundation necessary for such computational approaches.
Integration with Downstream Analyses: This diagram illustrates how two-pass mapping integrates with broader research workflows.
The STAR two-pass mapping method represents a significant methodological advancement for junction filtering in RNA-seq analysis, effectively balancing the competing demands of sensitivity and specificity. By treating junction discovery and quantification as separate processes, this approach mitigates the systematic bias against novel splice junctions inherent in conventional methods while maintaining high alignment accuracy. The protocol outlined in this document provides researchers with a robust framework for implementing this powerful technique, complete with performance benchmarks and technical specifications.
As RNA-seq applications continue to evolve toward more complex clinical and diagnostic settings, methods that enhance sensitivity without compromising specificity will become increasingly valuable. The two-pass junction filtering strategy establishes a foundation for reliable splice junction detection that can support diverse research objectives, from basic transcriptome characterization to clinical biomarker discovery.
The STAR (Spliced Transcripts Alignment to a Reference) two-pass mapping method is a foundational technique in RNA-seq data analysis, designed to significantly improve the accuracy of spliced alignment, particularly for the discovery of novel splice junctions not present in existing genome annotations [6]. This approach operates on a simple yet powerful principle: information about splice junctions gathered from an initial mapping pass over a dataset is used to create an enhanced genome index, which subsequently guides a second, more sensitive alignment of the reads [26]. While this method offers substantial benefits for detecting complex RNA sequence arrangements and novel isoforms, its implementation in large-scale studies, such as those involving multiple patients, tissues, or time series, presents considerable computational challenges that require careful resource management and strategic planning.
The primary resource challenge stems from the two-fold nature of the process. The first pass alignment must be executed for multiple samples, each generating a unique set of splice junctions. The subsequent genome indexing step, which incorporates these discovered junctions, is both memory and compute-intensive [6]. In studies encompassing dozens or hundreds of samples, computational overhead and data storage requirements grow with every added sample, and a per-sample workflow repeats the costly re-indexing step each time. Furthermore, the decision between performing a two-pass analysis per individual sample versus a coordinated two-pass across all samples in a study has profound implications for data consistency, computational efficiency, and final analytical outcomes [34]. This protocol details strategies to navigate these demands effectively, enabling researchers to leverage the improved accuracy of two-pass mapping without being thwarted by its computational cost.
Successful execution of a large-scale two-pass mapping study requires a clear understanding of the necessary computational resources. The most significant demands are placed on system memory (RAM), processing power (CPU cores), and storage space. The table below summarizes the key resource requirements for a standard STAR two-pass workflow, using the human genome as a reference point.
Table 1: Computational Resource Requirements for STAR Two-Pass Mapping (Human Genome Reference)
| Resource Component | Minimum Requirement | Recommended for Large Studies | Notes |
|---|---|---|---|
| System Memory (RAM) | 32 GB [6] | 64 GB or higher | Scale with genome size; critical for genome generation step. |
| CPU Cores | 8 cores | 16-32 cores | --runThreadN parameter; scales mapping speed [6]. |
| Storage (Hard Disk) | 100 GB free space [6] | 1 TB+ | For genome indices, temporary files, and output BAM/Junction files. |
| Genome Indexing | ~30 GB RAM for human genome [6] | N/A | A one-time, memory-intensive process. |
| Two-Pass Runtime | ~2x single-pass time [34] | N/A | Varies with read depth, number of samples, and system specs. |
The memory requirement is perhaps the most critical constraint. As a rule of thumb, STAR requires approximately 10 x GenomeSize bytes of RAM [6]. For a human genome (~3 GigaBases), this translates to ~30 GB, making 32 GB a practical minimum. When planning for large-scale studies, allocating 64 GB or more provides a comfortable buffer for parallel processing and handling larger-than-expected intermediate files. The processing throughput is highly dependent on the number of available CPU cores. The --runThreadN parameter controls this, and it is typically set to the number of physical cores available [6]. On systems with efficient hyper-threading, increasing this number to up to twice the number of physical cores can further improve speed. It is crucial to note that the two-pass process effectively doubles the mapping workload, and the runtime difference, while not exactly 100% more, is still significant compared to single-pass mapping [34].
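The rule of thumb above is simple arithmetic and can be written as a small helper for planning. The function name is hypothetical, and the estimate covers only the stated 10 bytes per genome base, not any extra headroom for sorting or parallel jobs.

```python
def star_ram_estimate_gb(genome_size_bases, bytes_per_base=10):
    """Rule-of-thumb RAM estimate: ~10 bytes per genome base, in GB (1e9 bytes)."""
    return genome_size_bases * bytes_per_base / 1e9


print(star_ram_estimate_gb(3_000_000_000))  # human genome, ~3 gigabases -> 30.0
```

For a smaller genome the same call gives a proportionally smaller figure, which is why the table's requirements are stated specifically for the human reference.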
A key decision in designing a large-scale study is the choice between two distinct two-pass modes: the "Individual Sample" two-pass and the "Multi-Sample" two-pass. This strategic choice has a direct and major impact on project logistics, computational load, and the biological interpretation of the results.
In this mode, each sample in a study is processed independently through the complete two-pass workflow. The splice junctions discovered in the first pass of a specific sample are used to create a custom genome index for that same sample, which is then used for its second pass [34]. This method is ideal for studies containing single samples or for projects where samples are biologically heterogeneous or incompatible (e.g., different species, different treatments where novel splicing is not expected to be shared) [16]. The primary advantage is that it maximizes the sensitivity for finding sample-specific novel junctions. The disadvantage is the high computational cost, as a new genome index must be generated for every single sample.
For cohesive studies involving multiple replicates or related samples (e.g., a time-course experiment, patient cohorts), a more efficient approach is the multi-sample two-pass. In this strategy, the first pass is run on all samples individually. Then, the splice junction files (SJ.out.tab) from all samples are collected and used to generate a single, unified, study-wide genome index [36]. This single index is then used for the second pass mapping of every sample. This method is computationally more efficient as it requires only one genome re-generation step, drastically saving time and computational resources. It ensures consistency across the dataset, which is critical for downstream comparative analyses like differential expression or splicing. A potential drawback is that junctions unique to a single sample and supported by very few reads might be lost if filtering is applied when merging the junction files.
The following diagram illustrates the logical decision process and the two workflows:
This section provides a step-by-step protocol for executing a multi-sample two-pass mapping analysis, which is the most resource-efficient strategy for a typical large-scale cohort study.
First, generate the standard reference genome index using your genome FASTA file and annotation GTF file. This initial index will be used for all first-pass alignments.
Run the first mapping pass for each sample in your study. This step should be executed for every sample (e.g., via a loop or job array on a cluster).
This command will produce, among other output files, a SJ.out.tab file in each sample's output directory. This file contains the list of splice junctions detected in that sample and is the crucial input for the next step.
After all first-pass jobs are complete, create a new genome index that incorporates the novel junctions discovered across the entire study. The latest STAR best practices recommend providing the junction files from all samples separately to the --sjdbFileChrStartEnd parameter, rather than merging them manually [36].
Finally, using the newly created genome index, perform the final alignment for each sample. This second pass will have higher sensitivity as it utilizes the study-wide set of discovered splice junctions.
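The multi-sample stages described above (first pass per sample, one study-wide re-index, second pass per sample) can be sketched together as follows. This is an illustration under assumptions, not a prescribed pipeline: sample names, paths, and the `_pass1_` / `_pass2_` prefixes are placeholders, the base index is assumed to exist already, and reads are assumed to be gzipped paired-end FASTQ. The junction files are passed individually to `--sjdbFileChrStartEnd`, as the text recommends.

```python
def multi_sample_two_pass(samples, base_index, pooled_index, fasta, gtf,
                          read_length=101, threads=16):
    """samples: dict of sample name -> (read1, read2) FASTQ paths.

    Returns (first-pass commands, re-index command, second-pass commands)
    as argument lists; nothing is executed here.
    """
    def align(index, name, tag, extra):
        r1, r2 = samples[name]
        return ["STAR", "--runThreadN", str(threads),
                "--genomeDir", index,
                "--readFilesIn", r1, r2,
                "--readFilesCommand", "zcat",
                "--outFileNamePrefix", f"{name}_{tag}_"] + extra

    # Pass 1: junction discovery per sample; alignments are not kept.
    pass1 = [align(base_index, s, "pass1", ["--outSAMtype", "None"])
             for s in samples]
    # One study-wide index built from every sample's SJ.out.tab.
    sj_files = [f"{s}_pass1_SJ.out.tab" for s in samples]
    reindex = ["STAR", "--runMode", "genomeGenerate",
               "--runThreadN", str(threads),
               "--genomeDir", pooled_index,
               "--genomeFastaFiles", fasta,
               "--sjdbGTFfile", gtf,
               "--sjdbOverhang", str(read_length - 1),
               "--sjdbFileChrStartEnd"] + sj_files
    # Pass 2: final alignment of every sample against the pooled index.
    pass2 = [align(pooled_index, s, "pass2",
                   ["--outSAMtype", "BAM", "SortedByCoordinate"])
             for s in samples]
    return pass1, reindex, pass2
```

On a cluster, the first-pass and second-pass lists map naturally onto job arrays, with the single re-index job acting as the dependency between them.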
The final output, a coordinate-sorted BAM file, is now ready for downstream analyses such as transcript quantification or differential expression testing, with the improved accuracy afforded by the two-pass method.
The following table details the key research reagents and computational materials essential for implementing the two-pass mapping protocol described above.
Table 2: Essential Research Reagents and Computational Materials
| Item Name | Specifications / Version | Function / Purpose |
|---|---|---|
| STAR Aligner | Version 2.4.1a or later [6] | The core software that performs the ultra-fast spliced alignment of RNA-seq reads to the reference genome. |
| Reference Genome | Species-specific FASTA file (e.g., GRCh38 for human) [6] | The genomic sequence to which the RNA-seq reads are aligned. Serves as the foundational coordinate system. |
| Gene Annotation | GTF/GFF3 file (e.g., from Ensembl, GENCODE) [6] | Provides coordinates of known genes, transcripts, and exon boundaries to guide the aligner. |
| RNA-seq Reads | FASTQ files (paired-end or single-end) [6] | The raw input data representing the sequenced fragments of the transcriptome. |
| High-Performance Computing (HPC) Node | 16+ CPU cores, 32+ GB RAM, Linux/Unix OS [6] | The physical or virtual computational environment required to execute the memory- and processor-intensive alignment steps. |
The sequencing of full-length RNAs using long-read technologies from PacBio and Oxford Nanopore Technologies (ONT) has revolutionized our ability to decipher the true complexity of eukaryotic transcriptomes. Unlike short-read sequencing, long-read RNA sequencing (lrRNA-seq) can capture complete transcript isoforms, allowing for unambiguous identification of splicing events, alternative transcription start and termination sites, and polyadenylation patterns [9]. However, this potential is tempered by a significant challenge: the relatively high error rates inherent to long-read technologies can substantially reduce the accuracy of intron identification during genome alignment [9]. These alignment errors are not merely random noise; they often manifest systematically as spurious splice junctions: incorrectly inferred exon-intron boundaries that lead to mis-annotated open reading frames, incorrectly truncated protein predictions, and ultimately, compromised biological interpretations [9].
The fundamental issue arises because spliced alignment algorithms must balance the bonus for aligning a short exon with sequencing errors against the penalty for opening two flanking introns. When the former is insufficient to overcome the latter, alignment failures occur, resulting in common error profiles such as the failure to align terminal exons, the skipping of short internal exons, the introduction of spurious terminal exons, and large, erroneous insertions relative to the reference genome [9]. For example, in Arabidopsis, a specific error at the short (42 nt) exon 6 of the FLM gene (AT1G77080) resulted in only 19.3% of simulated reads aligning to the correct transcript isoform when using standard methods [9]. Such systematic errors confound downstream analyses, including transcript quantification and differential splicing detection, underscoring the critical need for robust computational methods to identify and mitigate spurious junctions.
The two-pass alignment strategy represents a powerful methodological framework designed to overcome the limitations of single-pass alignment by separating the processes of splice junction discovery and read quantification. This approach directly addresses the core problem that aligners, when working without prior knowledge, implicitly require greater evidence to align reads across novel splice junctions compared to known ones, creating a bias against the discovery and accurate quantification of novel splicing events [8].
The rationale behind two-pass alignment is both elegant and effective. In the first pass, reads are aligned with high stringency to generate an initial set of splice junctions from the data itself. This set is then filtered to remove likely false positives (spurious junctions). In the second pass, these high-confidence, sample-specific junctions are provided to the aligner as "known" junctions to guide the realignment of all reads. This strategy permits lower stringency alignment in the second pass, thereby increasing sensitivity, particularly for reads that span novel splice junctions with short overhangs [8]. The process effectively shares junction information across all alignments within a sample, reducing the systematic under-detection of novel junctions that plagues single-pass methods.
Benchmarking studies have demonstrated that this two-pass workflow provides significant benefits, including improved mapping rates for junction-spanning reads, superior read placement accuracy, and enhanced splice junction recall [8]. Notably, it has been shown to improve the quantification of a vast majority (at least 94%) of simulated novel splice junctions, delivering as much as a 1.7-fold increase in median read depth over these junctions compared to single-pass alignment [8].
The performance advantages of two-pass alignment are consistent and measurable across diverse biological contexts. The following table synthesizes key quantitative findings from a benchmark study that evaluated two-pass alignment across twelve RNA-seq samples, including human tissues and Arabidopsis samples [8].
Table 1: Performance Benefits of Two-Pass Alignment Across Various RNA-Seq Samples
| Sample Description | Read Length | Splice Junctions Improved | Median Read Depth Ratio (2-pass / 1-pass) |
|---|---|---|---|
| Lung Adenocarcinoma Tissue | 48 nt | 99% | 1.68× |
| Lung Normal Tissue | 48 nt | 98% | 1.71× |
| Universal Human Reference RNA (UHRR) | 75 nt | 94-97% | 1.25-1.26× |
| Lung Cancer Cell Lines | 101 nt | 97% | ~1.20× |
| Arabidopsis Flower Buds | 101 nt | 97% | 1.12× |
| Arabidopsis Leaves | 101 nt | 95% | 1.12× |
The data reveal two critical trends. First, the two-pass method improves quantification for the vast majority of splice junctions in every sample tested (94% to 99%). Second, the magnitude of improvement, reflected in the median read depth ratio, is most pronounced in samples with shorter read lengths (48-75 nucleotides), while still providing a measurable benefit for longer reads and in non-human organisms such as Arabidopsis [8]. This evidence underscores the broad applicability of the method for enhancing the accuracy of novel junction discovery and quantification.
This section provides a detailed, step-by-step protocol for implementing a two-pass alignment workflow with a focus on filtering spurious junctions, incorporating best practices from established tools like STAR and 2passtools.
The goal of the first pass is to generate an initial, comprehensive set of splice junctions from the raw sequencing data.
Step 1: Initial Genome Indexing. Prior to alignment, a reference genome index must be created. This requires a genome sequence in FASTA format and, if available, a high-quality annotation file (GTF format). While annotation is not strictly mandatory for the first pass, it can improve initial mapping accuracy.
Step 2: First-Pass Alignment with STAR.
Critical Parameters:
- --runThreadN: Number of CPU threads to use.
- --genomeDir: Path to the genome index.
- --readFilesCommand zcat: For reading gzipped FASTQ files.
- --outSJfilterCountTotalMin: Filters out junctions with very low read support, an initial step against spurious calls [6].

Step 3: Extract Initial Splice Junction List. After alignment, STAR outputs a file named SJ.out.tab containing all detected splice junctions. This file is the starting point for downstream filtering.
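To make the read-support filter concrete, the sketch below applies a total-count threshold to individual SJ.out.tab rows. It models only the condition expressed by `--outSJfilterCountTotalMin`, which takes four integers (non-canonical; GT/AG and CT/AC; GC/AG and CT/GC; AT/AC and GT/AT), and uses STAR's default values of 3 1 1 1. STAR combines this with other junction filters, so the function is an approximation for illustration, and its name is hypothetical.

```python
# SJ.out.tab column 5 motif code -> position in the four-integer threshold list
# 0: non-canonical | 1,2: GT/AG, CT/AC | 3,4: GC/AG, CT/GC | 5,6: AT/AC, GT/AT
MOTIF_CLASS = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}


def passes_total_count(sj_line, thresholds=(3, 1, 1, 1)):
    """True if unique + multi-mapped read support meets the motif's threshold."""
    f = sj_line.rstrip("\n").split("\t")
    motif, unique, multi = int(f[4]), int(f[6]), int(f[7])
    return unique + multi >= thresholds[MOTIF_CLASS[motif]]


rows = [
    "chr1\t100\t200\t0\t0\t0\t1\t1\t12",  # non-canonical, 2 supporting reads
    "chr1\t300\t400\t1\t1\t0\t1\t0\t12",  # GT/AG, 1 supporting read
]
kept = [r for r in rows if passes_total_count(r)]
```

With the default thresholds only the GT/AG row is kept, reflecting the stricter evidence demanded of non-canonical junctions.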
The SJ.out.tab file contains both genuine and spurious junctions. The 2passtools software provides a sophisticated method to filter the latter using alignment metrics and sequence information [9].
Step 4: Filter Junctions with 2passtools.
2passtools employs a machine learning-based (logistic regression) approach to classify junctions as genuine or spurious based on features such as sequence motifs, read support, and alignment metrics [9].
The filtered, high-confidence junctions are now used to create an enhanced genome index for the final alignment.
Step 5: Generate a New Genome Index with Filtered Junctions.
Step 6: Perform Second-Pass Alignment.
The resulting BAM file (sample1_pass2_finalAligned.sortedByCoord.out.bam) contains the final alignments, which exhibit significantly improved accuracy for splice junctions and are suitable for downstream analyses like transcript quantification and differential splicing [9] [8].
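Steps 5 and 6 can be sketched in the same way. The sketch below only builds the two command lines; it assumes the filtered junctions have been written to a file (here the hypothetical filtered_junctions.tab) in a format that STAR accepts for --sjdbFileChrStartEnd. All paths, the thread count, and the overhang value of 100 are placeholder assumptions. The output prefix is chosen to reproduce the BAM file name mentioned above.

```python
def star_reindex_cmd(new_genome_dir, fasta, filtered_sj, overhang,
                     threads=8, gtf=None):
    """Assemble a genomeGenerate command that inserts filtered junctions."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", new_genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbFileChrStartEnd", filtered_sj,
        "--sjdbOverhang", str(overhang),
    ]
    if gtf is not None:
        cmd += ["--sjdbGTFfile", gtf]  # optional annotation, if available
    return cmd


def star_second_pass_cmd(genome_dir, fastq_r1, fastq_r2, out_prefix, threads=8):
    """Assemble the second-pass alignment command against the new index."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq_r1, fastq_r2,
        "--readFilesCommand", "zcat",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


# Hypothetical paths, for illustration only.
reindex = star_reindex_cmd("star_index_pass2", "genome.fa",
                           "filtered_junctions.tab", overhang=100)
second = star_second_pass_cmd("star_index_pass2", "sample1_R1.fastq.gz",
                              "sample1_R2.fastq.gz", "sample1_pass2_final")
# STAR appends this suffix to the prefix for coordinate-sorted BAM output.
expected_bam = "sample1_pass2_final" + "Aligned.sortedByCoord.out.bam"
print(" ".join(reindex))
print(" ".join(second))
print(expected_bam)
```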
The following diagram illustrates the complete workflow:
Successful implementation of this workflow relies on a combination of software tools, reference data, and computational resources. The table below catalogs the key components.
Table 2: Essential Resources for the Two-Pass Alignment Workflow
| Category | Item | Function / Description |
|---|---|---|
| Software | STAR Aligner [6] | Performs ultra-fast spliced alignment of RNA-seq reads; supports two-pass mode. |
| | 2passtools [9] | Applies machine-learning-based filtering to remove spurious splice junctions from long-read alignments. |
| | SQANTI3 [37] | Performs rigorous quality control, curation, and functional annotation of long-read transcript models post-alignment. |
| Reference Data | Genome Sequence (FASTA) | The reference genome of the organism under study (e.g., GRCh38 for human, TAIR10 for Arabidopsis). |
| | Gene Annotation (GTF) | A high-quality annotation file (e.g., from GENCODE or Ensembl) to guide initial alignment and for result annotation. |
| Computational Resources | High-Memory Server | ~30 GB RAM for human genome alignment; sufficient free disk space (>100 GB) for temporary and output files [6]. |
| | Multi-core CPUs | Multiple physical cores to run alignment threads in parallel, significantly reducing computation time. |
Even with a robust pipeline, quality control is paramount. Researchers should leverage tools like SQANTI3 to perform in-depth characterization of the final transcript models [37]. SQANTI3 classifies transcripts into structural categories (e.g., Full-Splice Match, Novel-In-Catalog), evaluates the reliability of transcription start and termination sites using supporting data like CAGE-seq, and identifies non-canonical splice sites and potential reverse transcriptase artifacts [37]. Key metrics to monitor include:
Systematic errors, such as a high proportion of junctions with non-canonical motifs or a bias towards 3'-fragmented transcripts (Incomplete-Splice-Match), can indicate issues with RNA degradation or library preparation that alignment alone cannot fully resolve [37]. Integrating these QC metrics creates a feedback loop for continuously refining both wet-lab and computational processes, ensuring the highest possible accuracy in defining the transcriptome.
Within the framework of research on the STAR two-pass mapping method for improved accuracy, fine-tuning specific alignment parameters is paramount for maximizing the sensitivity and reliability of RNA-seq analyses. The two-pass mapping method, a hallmark of the STAR aligner, significantly enhances the discovery of novel splice junctions by leveraging information from all samples in an experiment [10]. In its first pass, STAR performs alignment and compiles a comprehensive set of detected junctions. These junctions are then incorporated into the genome index for a second mapping pass, substantially improving alignment accuracy for reads spanning splice junctions [10]. However, the efficacy of this sophisticated approach is highly dependent on the proper configuration of key parameters that govern splice junction awareness.
Among these parameters, --sjdbOverhang and --outFilterType require particular attention, as they directly influence how the aligner handles reads spanning exon-exon boundaries. Misconfiguration of these settings can lead to suboptimal alignment rates, reduced junction discovery, and ultimately, compromised downstream analyses such as differential exon-junction usage (DEJU) studies [10]. This protocol provides detailed guidance on optimizing these critical parameters, with specific consideration for varying read lengths and experimental designs commonly encountered in pharmaceutical and clinical research settings.
The --sjdbOverhang parameter is exclusively utilized during the genome generation step (--runMode genomeGenerate) and defines the length of genomic sequence flanking each side of annotated splice junctions that STAR incorporates into its splice junction database [38] [39]. When constructing the reference, STAR extracts N exonic bases from both the donor and acceptor sites for each annotated junction, creating hybrid sequences that facilitate the alignment of reads spanning these junctions [39]. This parameter effectively determines the maximum possible alignment overhang for reads crossing splice junctions during the mapping process.
According to Alexander Dobin, the developer of STAR, the ideal value for this parameter is mate_length - 1 [38] [39]. For example, with 100-base pair reads, the optimal setting is 99, as this would permit a read to map with 99 bases on one side of a junction and a single base on the other [38]. This configuration ensures that even reads aligning with minimal crossing at junction boundaries can be successfully mapped.
The optimal configuration of --sjdbOverhang varies significantly based on read length characteristics, necessitating different strategies for homogeneous versus heterogeneous datasets:
Table 1: Recommended sjdbOverhang Settings for Different Read Length Scenarios
| Read Length Scenario | Recommended Value | Rationale | Technical Considerations |
|---|---|---|---|
| Uniform read length (e.g., 100 bp) | Read length - 1 (e.g., 99) | Ideal for maximum sensitivity with specific read length [38] | Ensures even reads with minimal junction overhangs are captured |
| Mixed read lengths | 100 (default) [39] or maximum read length - 1 | Balances sensitivity with practicality for diverse datasets [40] | Default 100 works well for most longer reads; very short reads (<50 bp) may need special consideration [39] |
| Trimmed reads with variable lengths | 100 (default) or maximum post-trimming length - 1 | Accommodates length variation while maintaining junction sensitivity [39] | Using the maximum possible overhang is safer than too short; marginally impacts efficiency [39] |
For very short reads (<50 bases), Dobin strongly recommends using the optimum sjdbOverhang = mateLength - 1 to preserve sensitivity [39]. For longer reads, a generic value of 100 is generally sufficient and more practical for multi-study comparisons [39]. When dealing with multiple datasets of varying read lengths, researchers must either generate separate indices for each distinct length or utilize the default value of 100, which provides robust performance across most common sequencing scenarios [40] [39].
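The decision rules above can be condensed into a small helper. This is a minimal sketch of the guidance in the table, not part of STAR itself; the function name, the prefer_default switch, and the 50-base cutoff parameter are illustrative assumptions. For paired-end data, the lengths passed in are mate lengths.

```python
def recommended_sjdb_overhang(read_lengths, prefer_default=True,
                              default=100, short_cutoff=50):
    """Suggest an --sjdbOverhang value from the observed read (mate) lengths.

    - Uniform length: length - 1.
    - All reads shorter than `short_cutoff`: longest length - 1.
    - Mixed lengths: the generic default, or longest length - 1 if
      `prefer_default` is False.
    """
    lengths = sorted(set(read_lengths))
    if not lengths:
        raise ValueError("at least one read length is required")
    longest = lengths[-1]
    if len(lengths) == 1 or longest < short_cutoff or not prefer_default:
        return longest - 1
    return default


print(recommended_sjdb_overhang([100]))                                # uniform 100 bp reads
print(recommended_sjdb_overhang([48]))                                 # very short reads
print(recommended_sjdb_overhang([75, 101, 150]))                       # mixed, use default
print(recommended_sjdb_overhang([75, 101, 150], prefer_default=False)) # mixed, use maximum
```

The helper returns longest - 1 rather than a shorter value for mixed data when the default is declined, following the note that an overhang that is too long is safer than one that is too short.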
Experimental Objective: Establish the correct --sjdbOverhang value during genome indexing to maximize splice junction detection sensitivity for your specific RNA-seq dataset.
Materials and Reagents:
Methodology:
- Set --sjdbOverhang as read_length - 1.

Validation: Execute alignment on a subset of data and examine the SJ.out.tab file to verify adequate junction discovery rates compared to historical data with similar experimental conditions.
The --outFilterType BySJout parameter plays a distinct but complementary role to --sjdbOverhang in optimizing alignment precision. When enabled, STAR keeps only those reads whose splice junctions passed its junction filters and were written to SJ.out.tab, discarding alignments that rely on junctions that failed filtering [10]. This filtering is applied after the initial mapping phase and serves as a quality control step, eliminating spurious alignments that might otherwise compromise the accuracy of junction quantification.
In the context of two-pass mapping, --outFilterType BySJout is particularly valuable as it leverages the comprehensive junction database compiled from all samples to refine alignments in the second pass [10]. This creates a more stringent and biologically plausible set of alignments, especially important for detecting subtle splicing alterations in differential splicing analyses.
Experimental Objective: Implement --outFilterType BySJout to improve alignment accuracy for differential exon-junction usage (DEJU) analysis.
Materials and Reagents:
--sjdbOverhang

Methodology:
BySJout filter to compile an initial junction database:
Perform the second pass of alignment with the BySJout filter enabled:
Proceed with read quantification using featureCounts with both nonSplitOnly and juncCounts set to TRUE to generate both exon and junction count matrices [10].
Validation: Compare the proportion of uniquely mapping reads and the number of detected junctions between runs with and without the BySJout filter. Expect a moderate reduction in overall alignment rate with a concomitant increase in alignment quality.
The following workflow diagram illustrates the integrated relationship between parameter configuration, two-pass mapping, and downstream splicing analysis:
Table 2: Key Research Reagents and Computational Tools for STAR Two-Pass Mapping
| Tool/Reagent | Function | Implementation Notes |
|---|---|---|
| STAR Aligner (v2.7.0+) | Spliced alignment of RNA-seq reads | Requires compilation for HPC optimization; supports two-pass mode [10] |
| Reference Genome Sequence | Genomic coordinate system | Ensembl or GENCODE recommended; ensure consistency with annotation [41] |
| Annotation File (GTF) | Gene model definitions | Use version-matched annotations from GENCODE for optimal results [41] |
| Rsubread/featureCounts | Read quantification | Set both nonSplitOnly and juncCounts to TRUE for DEJU analysis [10] |
| edgeR/limma | Differential splicing analysis | Implement diffSpliceDGE or diffSplice functions for DEJU workflow [10] |
| High-Performance Computing Cluster | Computational infrastructure | 32+ GB RAM recommended for human genome indexing [41] |
Proper configuration of --sjdbOverhang and --outFilterType parameters within the STAR two-pass mapping workflow significantly enhances the detection of splice junctions, particularly those that are novel or exhibit subtle usage differences between experimental conditions. The DEJU analysis workflow, which incorporates both exon and exon-exon junction reads, has demonstrated superior statistical power in detecting differential splicing events compared to methods that rely solely on exon-level counts [10]. When optimized according to the protocols outlined herein, researchers can expect improved sensitivity for alternative splicing events, enhanced accuracy in differential splicing analysis, and more biologically meaningful results in subsequent functional investigations, thereby advancing drug discovery and development pipelines that rely on precise transcriptome characterization.
Within the broader research on STAR two-pass mapping methods for improved accuracy, the validation of results and interpretation of mapping metrics stand as critical, non-negotiable steps. The two-pass alignment method, which involves an initial mapping pass to discover novel splice junctions followed by a second pass that uses these junctions to guide the final alignment, significantly enhances sensitivity in detecting novel splicing events [8] [6]. However, the increased sensitivity necessitates rigorous quality control procedures to ensure that improvements in junction discovery do not come at the cost of precision or introduce alignment artifacts. This protocol provides a comprehensive framework for researchers, scientists, and drug development professionals to systematically validate their two-pass alignment outcomes, correctly interpret key mapping metrics, and confirm biological validity.
Following the execution of a STAR two-pass alignment, the first quality control checkpoint involves a thorough examination of the mapping statistics generated by STAR. These metrics provide the initial indicators of both data quality and alignment performance. Proper interpretation distinguishes between expected outcomes of a successful two-pass alignment and potential warning signs of issues.
Table 1: Key STAR Alignment Metrics and Their Interpretation in Two-Pass Context
| Metric | Description | Expected Range/Value for Healthy RNA-seq | Significance in Two-Pass Context |
|---|---|---|---|
| Uniquely Mapped Reads | Percentage of reads mapped to a single genomic location | Typically >70-80% for human | Slight decreases may occur due to increased junction discovery; large drops may indicate issues. |
| Multi-Mapped Reads | Reads mapped to multiple locations | Varies by genome complexity | May increase in second pass as novel junctions provide new alignment possibilities. |
| Mismatch Rate | Percentage of mismatched bases in alignments | <2% for high-quality libraries | Should remain stable between passes; increases could indicate alignment errors. |
| Junction Reads | Reads spanning splice junctions | Highly variable; should be substantial | Should increase in second pass, indicating novel junction incorporation. |
| Insertion/Deletion Rate | Frequency of small indels in alignments | Generally low | Monitor for increases suggesting alignment stringency is too low in second pass. |
The Log.final.out file generated by STAR provides a comprehensive summary of the alignment results. When working with two-pass alignment, particular attention should be paid to the percentage of reads mapped to multiple loci and the splice junction counts. The two-pass method inherently increases sensitivity to novel splice junctions, which may manifest as a moderate increase in multi-mapped reads compared to a basic single-pass approach [8]. This occurs because the discovery of novel junctions in the first pass provides additional, sometimes less unique, alignment possibilities for reads in the second pass. However, a dramatic increase in multi-mapped reads (e.g., >30%) may indicate that alignment parameters are too permissive.
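The check described above can be automated with a few lines of Python. The sketch below assumes the usual Log.final.out layout, in which each metric is written as a label, a vertical bar, and a value. The sample text is invented purely for illustration, and the 30% threshold is the example value from the paragraph above rather than a fixed rule.

```python
def parse_log_final_out(text):
    """Parse 'label | value' lines from a STAR Log.final.out file into a dict."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics


def percent(metrics, key):
    """Convert a value such as '6.25%' to a float."""
    return float(metrics[key].rstrip("%"))


def multimapping_flag(metrics, threshold=30.0):
    """Return True if the multi-mapped percentage exceeds the threshold."""
    return percent(metrics, "% of reads mapped to multiple loci") > threshold


# Invented example content; real files contain many more lines.
SAMPLE_LOG = (
    "                          Number of input reads |\t1000000\n"
    "                        Uniquely mapped reads % |\t88.50%\n"
    "                       Number of splices: Total |\t420000\n"
    "             % of reads mapped to multiple loci |\t6.25%\n"
)

metrics = parse_log_final_out(SAMPLE_LOG)
print(percent(metrics, "Uniquely mapped reads %"))
print(multimapping_flag(metrics))
```

Running the same parser on the first-pass and second-pass logs makes it straightforward to compare metrics side by side.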
The SJ.out.tab files from both the first and second passes should be compared. A successful two-pass execution will typically show a substantial increase in the number of detected splice junctions in the second pass, reflecting the incorporation of novel junctions discovered in the first pass. The SJ.out.tab file includes crucial information about each junction, including the genomic coordinates, strand, motif, annotation status, and the number of uniquely and multi-mapped reads supporting the junction.
Purpose: To distinguish high-confidence novel splice junctions from potential alignment artifacts, ensuring the biological relevance of two-pass findings.
Materials:
- SJ.out.tab files from both alignment passes

Methodology:

- Identify junctions in the SJ.out.tab file that are not present in the reference annotation. These are your candidate novel junctions.

Expected Outcomes: A robust two-pass alignment should yield a set of novel junctions with strong read support and predominantly canonical motifs. The number of novel junctions is highly experiment-dependent but should be biologically plausible.
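Candidate selection can be sketched as below. The parser relies on the nine tab-separated columns of SJ.out.tab (chromosome, intron start, intron end, strand, motif code, annotation flag, uniquely mapped reads, multi-mapped reads, maximum overhang). The thresholds for read support and overhang are illustrative assumptions, not recommendations from the protocol, and the example rows are invented.

```python
from collections import namedtuple

Junction = namedtuple(
    "Junction",
    "chrom start end strand motif annotated unique multi overhang",
)


def parse_sj_out_tab(text):
    """Parse SJ.out.tab content into Junction records."""
    junctions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        f = line.split("\t")
        junctions.append(Junction(f[0], *(int(x) for x in f[1:9])))
    return junctions


def high_confidence_novel(junctions, min_unique=3, min_overhang=10,
                          require_known_motif=True):
    """Keep unannotated junctions with adequate support.

    motif == 0 marks a non-canonical motif; annotated == 0 marks a junction
    absent from the annotation supplied to STAR.
    """
    return [
        j for j in junctions
        if j.annotated == 0
        and j.unique >= min_unique
        and j.overhang >= min_overhang
        and (j.motif != 0 or not require_known_motif)
    ]


# Invented rows for illustration.
SAMPLE_SJ = (
    "chr1\t14830\t14969\t2\t2\t1\t50\t3\t38\n"   # annotated junction
    "chr1\t20000\t21000\t1\t1\t0\t12\t0\t25\n"   # novel, well supported
    "chr1\t30000\t30500\t0\t0\t0\t1\t0\t6\n"     # novel, weak and non-canonical
    "chr2\t5000\t7000\t1\t3\t0\t2\t4\t30\n"      # novel, too few unique reads
)

novel = high_confidence_novel(parse_sj_out_tab(SAMPLE_SJ))
print(novel)
```

Thresholds should be tuned to sequencing depth; stricter values trade sensitivity for precision.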
Purpose: To leverage the improved junction detection from two-pass alignment for more powerful differential splicing analysis, specifically testing for Differential Exon-Junction Usage (DEJU).
Materials:
Methodology:
- Use featureCounts from the Rsubread package with both nonSplitOnly=TRUE and juncCounts=TRUE arguments to simultaneously quantify internal exon reads and exon-exon junction reads [10]. This generates a unified count matrix of exons and junctions.
- Filter lowly expressed features with the filterByExpr function in edgeR.
- Normalize library sizes with calcNormFactors in edgeR to account for compositional biases between libraries.
- Test for differential usage with diffSpliceDGE in edgeR or diffSplice in limma. These functions test for differential usage of each feature (exon or junction) between conditions.

Expected Outcomes: The DEJU workflow, powered by two-pass alignment data, demonstrates increased statistical power to detect true differential splicing events compared to methods using only exon counts [10]. In simulation studies, this gain in power is achieved while the false discovery rate remains effectively controlled.
The following workflow diagram illustrates the integrated quality control and validation process for STAR two-pass alignment, connecting the key steps from initial alignment to biological interpretation.
Table 2: Essential Research Reagent Solutions for Two-Pass Alignment and Validation
| Tool/Resource | Function | Application in Protocol |
|---|---|---|
| STAR Aligner | Spliced Transcripts Alignment to a Reference; performs the two-pass alignment | Core alignment engine for both first and second passes [6] |
| Rsubread/featureCounts | Quantifies reads mapping to genomic features including exons and junctions | Generation of count matrices for DEJU analysis [10] |
| edgeR/limma | Statistical analysis packages for differential expression/usage | Statistical testing for differential exon-junction usage [10] |
| GENCODE Annotations | High-quality reference gene annotations | Provides known splice junctions for initial guidance and novelty assessment [21] |
| GM12878 RNA-seq Data | Benchmarking dataset from ENCODE | Positive control for pipeline validation and performance comparison [6] |
| Picard Tools | Java-based command-line utilities for handling sequencing data | Post-alignment quality metrics and BAM file processing [21] |
Robust quality control for STAR two-pass mapping extends beyond simple metric checking to encompass a holistic validation framework. By systematically interpreting mapping statistics, rigorously filtering novel junctions, leveraging integrated analysis approaches like DEJU, and confirming biological relevance, researchers can fully capitalize on the enhanced sensitivity of two-pass alignment. This comprehensive approach ensures that the discovered novel splicing events and quantitative results provide a reliable foundation for downstream biological insights and therapeutic development.
Alternative splicing (AS) is a pivotal post-transcriptional regulatory mechanism in eukaryotes, enabling the generation of multiple mRNA isoforms from a single gene and significantly contributing to transcriptomic and proteomic diversity [10] [42]. The detection of differential splicing (DS) between biological conditions is crucial for understanding cellular adaptation, development, and disease mechanisms. In oncology, for instance, dysregulated splicing is extensively linked to cancer pathogenesis, where mis-spliced isoforms of cancer-related genes can drive cellular transformation, proliferation, and metastasis [10] [42].
Differential Exon Usage (DEU) analysis, which relies on exon-level read counts from short-read RNA-seq data, has been a standard method for studying DS. However, a key limitation of conventional DEU analysis is its disregard for exon-exon junction information, which can reduce statistical power in detecting splicing alterations [10] [43] [42]. The standard exon counting method often leads to double-counting of reads that span two exons (junction reads) and lacks sensitivity in detecting specific splicing events like exon extensions, alternative splice sites, or retained introns [10].
To address these limitations, we present the Differential Exon-Junction Usage (DEJU) workflow. This approach integrates both exon and exon-exon junction reads into the established Rsubread-edgeR/limma frameworks, resolving the double-counting issue and providing a more powerful and accurate method for DS detection [10]. This application note details the DEJU protocol, benchmarks its performance against existing methods, and demonstrates its integration with the STAR two-pass mapping method to enhance detection accuracy within a comprehensive thesis research framework.
The DEJU workflow introduces a novel feature quantification strategy that jointly analyzes exon and exon-exon junction reads. The core principle involves treating exon-junctions as distinct features alongside exons in the final count matrix. This ensures that each exon-junction read is uniquely assigned to a single feature, effectively resolving the double-counting problem inherent in traditional DEU approaches and ensuring that library sizes accurately reflect the true number of sequence reads [10] [42].
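The double-counting problem can be illustrated with a toy example. The sketch below is not the featureCounts algorithm; it is a simplified model in which each read is a list of aligned blocks, the two exons and three reads are invented, and each read is assumed to span at most one junction.

```python
from collections import Counter

# Toy gene model: exon name -> (start, end), inclusive coordinates.
EXONS = {"E1": (100, 200), "E2": (300, 400)}


def overlaps(block, exon):
    return block[0] <= exon[1] and block[1] >= exon[0]


def exon_only_counts(reads):
    """Conventional exon counting: a read counts once for every exon it touches."""
    counts = Counter()
    for blocks in reads:
        hit = {name for name, ex in EXONS.items()
               for b in blocks if overlaps(b, ex)}
        counts.update(hit)
    return counts


def exon_junction_counts(reads):
    """DEJU-style counting: split reads go to a junction feature,
    unsplit reads go to the exon that contains them."""
    counts = Counter()
    for blocks in reads:
        if len(blocks) > 1:
            left, right = blocks[0], blocks[1]
            counts[f"J:{left[1]}-{right[0]}"] += 1
        else:
            for name, ex in EXONS.items():
                if overlaps(blocks[0], ex):
                    counts[name] += 1
                    break
    return counts


reads = [
    [(120, 170)],               # internal to E1
    [(180, 200), (300, 330)],   # spans the E1-E2 junction
    [(310, 360)],               # internal to E2
]

print(exon_only_counts(reads))      # junction read is counted for E1 and E2
print(exon_junction_counts(reads))  # every read is counted exactly once
```

In the exon-only scheme the three reads produce four counts, whereas the exon-junction scheme yields exactly three, so the column sum equals the number of reads.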
This methodology significantly improves the detection of various splicing events, including:
Notably, the DEJU workflow is the first to enable robust detection of intron retention events, which are often missed by methods that do not incorporate junction information [42]. The following diagram illustrates the complete DEJU analysis workflow, from read alignment to differential usage analysis.
Comprehensive simulation studies were performed to benchmark DEJU against established methods (DEU-edgeR, DEU-limma, DEXSeq, JunctionSeq) using datasets with known ground truths. The simulations incorporated various splicing patterns (ES, MXE, ASS, IR) and tested different sample sizes (n=3, 5, 10 per group) and library sizes to evaluate robustness [10] [42].
Table 1: Statistical Power and False Discovery Rate (FDR) of DEJU Workflows Across Different Splicing Patterns and Sample Sizes
| Splicing Pattern | Sample Size (n) | DEJU-edgeR FDR | DEJU-edgeR Power | DEJU-limma FDR | DEJU-limma Power |
|---|---|---|---|---|---|
| Exon Skipping (ES) | 3 | 0.022 | 0.977 | 0.043 | 0.975 |
| | 5 | 0.029 | 0.991 | 0.044 | 0.990 |
| | 10 | 0.038 | 0.992 | 0.051 | 0.992 |
| Mutually Exclusive Exons (MXE) | 3 | 0.030 | 0.990 | 0.061 | 0.991 |
| | 5 | 0.040 | 0.993 | 0.062 | 0.995 |
| | 10 | 0.045 | 0.995 | 0.063 | 0.995 |
| Alternative Splice Site (ASS) | 3 | 0.027 | 0.839 | 0.038 | 0.877 |
| | 5 | 0.027 | 0.927 | 0.037 | 0.947 |
| | 10 | 0.038 | 0.977 | 0.047 | 0.979 |
| Intron Retention (IR) | 3 | 0.030 | 0.866 | 0.042 | 0.880 |
| | 5 | 0.031 | 0.934 | 0.041 | 0.940 |
| | 10 | 0.042 | 0.964 | 0.050 | 0.968 |
The simulation results demonstrate that DEJU-edgeR and DEJU-limma maintain high statistical power (>0.83 across all patterns even at n=3) while effectively controlling the FDR. DEJU-edgeR showed slightly more conservative FDR control compared to DEJU-limma, particularly for MXE events. Both implementations showed improved detection power with larger sample sizes, with the most notable gains observed for ASS and IR events [10].
The benchmarking studies revealed several key advantages of the DEJU workflow:
Table 2: Method Comparison for Detecting Differential Splicing Events (n=3 per group)
| Method | True Positives | False Positives | Key Strengths | Limitations |
|---|---|---|---|---|
| DEJU-edgeR | High | Lowest | Best FDR control, high power for all events | Slightly conservative |
| DEJU-limma | High | Low | High power, good FDR control | Struggles with MXE FDR control |
| DEU-edgeR | Moderate | Low | Standard DEU workflow | Misses IR, lower power for ASS |
| DEU-limma | Moderate | Low | Standard DEU workflow | Misses IR, lower power for ASS |
| DEXSeq | High for ASS | Moderate | Good for ASS events | Lower power for other events |
| JunctionSeq | High for IR | High | Detects IR events | Poor FDR control |
DEJU-based workflows detected intron retention events, which were missed by DEU methods that lack junction information [42]. JunctionSeq also detected IR events but demonstrated poorer FDR control than DEJU. The junction-only approaches (junc-edgeR and junc-limma) also effectively controlled FDR and outperformed DEU methods, further highlighting the value of junction information [42].
The following performance diagram visualizes the true positive and false positive detection rates across methods, highlighting DEJU's superior performance profile.
Principle: The two-pass mapping method maximizes sensitivity for novel junction detection by incorporating junctions discovered in an initial mapping pass into a refined genome index for the final alignment [10]. This approach is particularly valuable for detecting previously unannotated splicing events that may be biologically significant.
Procedure:
- Use the --twopassMode Basic option to enable two-pass mode.

Junction Filtering and Genome Re-indexing:

Second Pass Mapping:

- Use the --outFilterType BySJout option to retain only reads aligning to junctions passing filtering thresholds.

Principle: The featureCounts function within the Rsubread package is configured to differentiate internal exon reads from exon-exon junction reads, generating separate count matrices that are subsequently combined [10].

Procedure:

- useMetaFeatures = FALSE (to quantify individual exons)
- nonSplitOnly = TRUE (to quantify internal exon reads)
- juncCounts = TRUE (to generate exon-exon junction count matrix)

Procedure:

- Filter lowly expressed features with the filterByExpr function in edgeR.
- Normalize library sizes with normLibSizes to account for composition biases between libraries.

Differential Usage Testing:

- Use the diffSpliceDGE function in edgeR to identify differentially used features.
- Alternatively, use the diffSplice function in limma to test for differential usage.

Result Interpretation:
Table 3: Essential Research Reagents and Computational Tools for DEJU Analysis
| Category | Item | Specification/Version | Function in DEJU Workflow |
|---|---|---|---|
| Alignment Tool | STAR Aligner | Version 2.7.10a or higher | Splice-aware read alignment using two-pass method for novel junction detection [10] |
| Quantification Software | Rsubread (featureCounts) | Bioconductor release 3.18 or higher | Quantification of internal exon and exon-exon junction reads with specific parameter settings [10] |
| Statistical Analysis | edgeR + limma | Bioconductor release 3.18 or higher | Differential exon-junction usage testing and result summarization at gene level [10] |
| Reference Genome | Species-specific genome assembly | e.g., GRCh38 (human), GRCm39 (mouse) | Reference sequence for read alignment and annotation [44] |
| Annotation File | GTF/GFF annotation | e.g., GENCODE, Ensembl | Gene model information for exon and junction quantification [10] |
| RNA-seq Library Prep Kit | TruSeq Stranded mRNA Kit | Illumina | Library preparation for strand-specific RNA sequencing [44] |
| Exome Capture Kit | SureSelect XTHS2 | Agilent Technologies | Target enrichment for combined RNA and DNA exome sequencing approaches [44] |
The DEJU workflow was applied to an RNA-seq dataset from mouse mammary gland epithelial cells, comparing luminal progenitor and mature luminal cell populations. This application revealed biologically meaningful splicing events that were not detected by previous methods that lacked junction information [10]. The integration of junction reads enabled the identification of specific isoform switches potentially critical for mammary gland development and differentiation.
Comprehensive splicing analysis is increasingly important for clinical interpretation of genetic variants. Studies of the CDH1 gene (associated with hereditary diffuse gastric cancer) demonstrate how RNA-seq splicing profiles enable classification of variants of uncertain significance (VUS) [45]. The CDH1 gene exhibits a complex alternative splicing profile with at least eleven alternative splicing events, including four novel junctions originating from intron 2 [45]. DEJU-based approaches provide the quantitative framework necessary to distinguish pathological splicing events from normal splicing variation.
The DEJU workflow directly builds upon thesis research focusing on STAR two-pass mapping methodology. The two-pass approach is integral to DEJU's performance, as it ensures comprehensive detection of both annotated and novel splicing events. The first pass identifies candidate junctions, while the second pass utilizes this expanded junction database to improve alignment accuracy and sensitivity [10]. This alignment strategy, combined with DEJU's quantification and analysis methods, creates a complete framework for enhanced differential splicing detection.
Robust quality control is essential throughout the DEJU workflow. Key QC metrics include:
For clinical applications, additional validation using reference standards containing known splicing variants is recommended to establish analytical performance characteristics [44] [46].
The DEJU workflow represents a significant advancement in differential splicing analysis by incorporating exon-exon junction reads alongside traditional exon counts. This approach demonstrates enhanced statistical power, improved false discovery rate control, and unique capability to detect intron retention events compared to existing methods. The integration with STAR two-pass mapping provides a robust framework for comprehensive splicing analysis that captures both annotated and novel splicing events. As RNA-seq applications continue to expand in both basic research and clinical diagnostics, the DEJU methodology offers researchers a powerful tool for uncovering biologically and clinically relevant splicing alterations.
In the analysis of high-throughput RNA sequencing (RNA-seq) data, controlling the false discovery rate (FDR) is a critical statistical challenge to ensure that identified differentially spliced genes are biologically meaningful and not artifacts of multiple hypothesis testing. The Differential Exon-Junction Usage (DEJU) workflow represents a significant methodological advancement by incorporating exon-exon junction reads into differential splicing analysis, thereby enhancing the detection of alternative splicing events while effectively controlling FDR. Framed within the context of research on STAR two-pass mapping for improved accuracy, this protocol details how DEJU-edgeR and DEJU-limma achieve robust FDR control through their unique analytical approach.
The DEJU methodology addresses a fundamental limitation in conventional differential exon usage (DEU) analysis, which typically relies solely on exon-level read counts without considering junction reads. This omission not only reduces statistical power but can also lead to double-counting issues where reads spanning two exons contribute counts to both exons, potentially inflating false positive rates. By jointly analyzing exon and exon-exon junction reads as distinct features, the DEJU workflow provides a more comprehensive view of splicing events while implementing rigorous statistical controls to maintain FDR at the desired threshold [10].
In RNA-seq experiments, where tens of thousands of hypotheses are tested simultaneously, controlling the family-wise error rate (FWER) often proves overly conservative, severely limiting power to detect true biological effects. The false discovery rate (FDR), defined as the expected proportion of incorrectly rejected null hypotheses among all rejected hypotheses, has emerged as a more practical error metric that balances discovery with reliability [47]. Without proper FDR control, researchers risk pursuing false leads, misallocating resources, and drawing incorrect biological conclusions.
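The two error rates can be written compactly. The symbols are introduced here for illustration only: R denotes the number of rejected null hypotheses and V the number of those rejections that are false.

```latex
\mathrm{FWER} = \Pr(V \geq 1), \qquad
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right]
```

Because the FDR constrains only the expected proportion of false rejections, it is less stringent than the FWER whenever some null hypotheses are false, which is why it retains more power when many hypotheses are tested.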
The challenge of FDR control is particularly acute in splicing analysis, where multiple exons and junctions within the same gene exhibit complex dependency structures. Traditional FDR methods like the Benjamini-Hochberg (BH) procedure assume exchangeability of all tests, but this assumption often fails in splicing analyses where features within genes are correlated. The DEJU workflow addresses this through gene-level summarization of feature-level statistics using established methods like the Simes procedure or F-tests, which appropriately account for these dependencies while controlling the FDR [10].
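The two-stage logic of gene-level summarization followed by FDR adjustment can be sketched as follows. This is a textbook implementation of the Simes combination and the Benjamini-Hochberg adjustment, written for illustration; it is not a reproduction of the edgeR or limma code, and the gene names and feature-level p-values are invented.

```python
def simes_pvalue(pvalues):
    """Combine feature-level p-values for one gene with the Simes method."""
    p = sorted(pvalues)
    n = len(p)
    return min(1.0, min(n * pv / rank for rank, pv in enumerate(p, start=1)))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Invented feature-level p-values (exons and junctions) per gene.
feature_p = {
    "geneA": [0.001, 0.20, 0.50],
    "geneB": [0.30, 0.40, 0.60],
    "geneC": [0.01, 0.02, 0.03],
}

genes = list(feature_p)
gene_p = [simes_pvalue(feature_p[g]) for g in genes]
gene_fdr = benjamini_hochberg(gene_p)
for g, p, q in zip(genes, gene_p, gene_fdr):
    print(g, round(p, 4), round(q, 4))
```

Summarizing to one p-value per gene before adjustment means the multiple-testing correction is applied across genes rather than across correlated features within a gene.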
Conventional DEU analysis tools, including earlier versions of DEXSeq and JunctionSeq, typically rely on flattened exon counts without leveraging the information contained in exon-exon junction reads. This approach limits their ability to detect certain types of splicing events, particularly those involving alternative splice sites, retained introns, or nested exon skipping [10]. Furthermore, the common practice of double-counting exon-junction reads that span multiple exons can distort expression estimates and potentially inflate error rates.
Evidence suggests that even popular differential expression methods like DESeq2 and edgeR can exhibit exaggerated false positive rates in some scenarios, with actual FDRs sometimes exceeding 20% when the target is 5% in population-level RNA-seq studies with large sample sizes [48]. This highlights the critical need for specialized methods like DEJU that are specifically designed for splicing analysis and incorporate appropriate statistical controls.
The DEJU workflow is tightly integrated with the STAR two-pass mapping method, which enhances junction discovery and alignment accuracy. This integration begins with sensitive initial alignment, followed by genome re-indexing using discovered junctions, and concludes with a second mapping pass that optimizes junction read placement [10] [32]. The following diagram illustrates the complete workflow from read alignment through FDR-controlled differential splicing analysis:
The DEJU workflow implements multiple strategies to ensure robust FDR control while maintaining high statistical power:
Unique Molecular Feature Assignment: By setting both nonSplitOnly=TRUE and juncCounts=TRUE in featureCounts, DEJU ensures that each exon-junction read is uniquely assigned to a single feature, eliminating double-counting artifacts that could distort statistical estimates and inflate false discovery rates [10].
Comprehensive Feature Filtering: The workflow employs filterByExpr from edgeR to remove lowly expressed exons and junctions prior to statistical testing, reducing the multiple testing burden and focusing analysis on features with sufficient data to support reliable inference [10] [32].
Appropriate Normalization: TMM (Trimmed Mean of M-values) normalization accounts for composition biases between libraries, ensuring that technical artifacts do not masquerade as biological splicing differences [10].
Hierarchical Testing Approach: DEJU implements a two-level testing strategy where differential usage is first tested at the individual exon/junction level, then summarized to gene-level FDR control using established methods like the Simes procedure, which appropriately accounts for the dependency structure among features within genes [10].
Comprehensive simulation studies benchmarking DEJU against existing methods demonstrate its ability to maintain FDR control while achieving high statistical power. The following table summarizes the FDR control and statistical power of DEJU-edgeR and DEJU-limma across different alternative splicing patterns and sample sizes, based on simulated datasets where the ground truth was known [10]:
Table 1: FDR and Statistical Power of DEJU Methods Across Splicing Patterns
| Splicing Pattern | Sample Size (n) | DEJU-edgeR FDR | DEJU-edgeR Power | DEJU-limma FDR | DEJU-limma Power |
|---|---|---|---|---|---|
| Exon Skipping (ES) | 3 | 0.022 | 0.977 | 0.043 | 0.975 |
| Exon Skipping (ES) | 5 | 0.029 | 0.991 | 0.044 | 0.990 |
| Exon Skipping (ES) | 10 | 0.038 | 0.992 | 0.051 | 0.992 |
| Mutually Exclusive Exons (MXE) | 3 | 0.030 | 0.990 | 0.061 | 0.991 |
| Mutually Exclusive Exons (MXE) | 5 | 0.040 | 0.993 | 0.062 | 0.995 |
| Mutually Exclusive Exons (MXE) | 10 | 0.045 | 0.995 | 0.063 | 0.995 |
| Alternative Splice Sites (ASS) | 3 | 0.027 | 0.839 | 0.038 | 0.877 |
| Alternative Splice Sites (ASS) | 5 | 0.027 | 0.927 | 0.037 | 0.947 |
| Alternative Splice Sites (ASS) | 10 | 0.038 | 0.977 | 0.047 | 0.979 |
| Intron Retention (IR) | 3 | 0.030 | 0.866 | 0.042 | 0.880 |
| Intron Retention (IR) | 5 | 0.031 | 0.934 | 0.041 | 0.940 |
| Intron Retention (IR) | 10 | 0.042 | 0.964 | 0.050 | 0.968 |
The data demonstrate that both DEJU implementations effectively control FDR near or below the nominal 0.05 threshold across all splicing patterns and sample sizes. DEJU-edgeR shows slightly more conservative FDR control compared to DEJU-limma, particularly for mutually exclusive exons where DEJU-limma's FDR reaches 0.063 at larger sample sizes. Both methods show increasing statistical power with larger sample sizes, with particularly notable improvements for alternative splice site and intron retention events [10].
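As a quick arithmetic check of the statement above, the FDR columns of Table 1 can be compared against the nominal threshold. The sketch below simply re-enters the table values (no new data) and lists every entry that exceeds 0.05; the dictionary layout and variable names are illustrative choices.

```python
NOMINAL = 0.05

# (pattern, sample size) -> (DEJU-edgeR FDR, DEJU-limma FDR), copied from Table 1.
FDR = {
    ("ES", 3): (0.022, 0.043), ("ES", 5): (0.029, 0.044), ("ES", 10): (0.038, 0.051),
    ("MXE", 3): (0.030, 0.061), ("MXE", 5): (0.040, 0.062), ("MXE", 10): (0.045, 0.063),
    ("ASS", 3): (0.027, 0.038), ("ASS", 5): (0.027, 0.037), ("ASS", 10): (0.038, 0.047),
    ("IR", 3): (0.030, 0.042), ("IR", 5): (0.031, 0.041), ("IR", 10): (0.042, 0.050),
}

methods = ("DEJU-edgeR", "DEJU-limma")
exceed = {m: [] for m in methods}
worst = {m: 0.0 for m in methods}
for (pattern, n), values in FDR.items():
    for method, fdr in zip(methods, values):
        worst[method] = max(worst[method], fdr)
        if fdr > NOMINAL:
            exceed[method].append((pattern, n, fdr))

for method in methods:
    print(method, "max FDR:", worst[method], "above nominal:", exceed[method])
```

In this tabulation DEJU-edgeR never exceeds the threshold (maximum 0.045), while DEJU-limma exceeds it for all three MXE sample sizes and marginally for exon skipping at n = 10 (0.051).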
The DEJU workflow shows superior performance compared to existing DEU analysis methods. In benchmarking studies, DEJU demonstrated enhanced power to detect differential splicing events while effectively controlling the FDR, unlike some conventional methods that either become overly conservative or fail to control FDR adequately [10]. The incorporation of exon-exon junction reads provides particular advantages for detecting certain splicing events:
Table 2: Advantages of DEJU for Different Splicing Event Types
| Splicing Event Type | Limitation of Conventional DEU | DEJU Advantage |
|---|---|---|
| Alternative 3'/5' Splice Sites | Limited detection capability without junction information | Junction reads directly capture alternative splice site usage |
| Intron Retention | Difficult to distinguish from exon definition | Junction reads provide evidence of intronic sequence inclusion |
| Exon Skipping | Relies on changes in exon coverage | Both exon coverage and junction exclusion provide complementary evidence |
| Mutually Exclusive Exons | May miss coordinated changes | Junction patterns reveal mutually exclusive relationships |
The ability to detect these subtle splicing events without inflating false positives makes DEJU particularly valuable for studies aiming to identify clinically relevant splicing biomarkers or therapeutic targets [10].
The accuracy of junction read alignment is fundamental to DEJU's FDR control performance. The following protocol details the optimized STAR two-pass mapping procedure:
This two-pass approach significantly improves junction detection sensitivity compared to single-pass mapping, particularly for novel or low-abundance splicing events. The --outFilterType BySJout option filters out alignments with spurious junctions, reducing false positive junction calls that could compromise downstream FDR control [10] [32].
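A minimal sketch of the multi-sample procedure described above (first pass, re-indexing with discovered junctions, second pass) is given here as a Python script that only assembles STAR command lines. The STAR options used are standard ones, but the sample names, file paths, directory names, thread count, and the helper functions are all assumptions made for illustration; the junction collation and filtering between the passes is not part of this sketch.

```python
import shlex


def genome_generate_cmd(genome_dir, fasta, gtf, read_length, sj_files=(), threads=8):
    """Build a STAR indexing command; --sjdbOverhang is read length minus 1."""
    cmd = ["STAR", "--runMode", "genomeGenerate",
           "--runThreadN", str(threads),
           "--genomeDir", genome_dir,
           "--genomeFastaFiles", fasta,
           "--sjdbGTFfile", gtf,
           "--sjdbOverhang", str(read_length - 1)]
    if sj_files:
        # Junctions discovered in the first pass are added to the index.
        cmd += ["--sjdbFileChrStartEnd", *sj_files]
    return cmd


def align_cmd(genome_dir, fastqs, out_prefix, threads=8):
    """Build a STAR alignment command with junction-based output filtering."""
    return ["STAR", "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--readFilesIn", *fastqs,
            "--outFilterType", "BySJout",
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", out_prefix]


def two_pass_plan(samples, fasta, gtf, read_length, filtered_sj="filtered.SJ.tab"):
    """Return the ordered list of commands for a multi-sample two-pass run."""
    plan = [genome_generate_cmd("index_pass1", fasta, gtf, read_length)]
    for name, fastqs in samples.items():
        plan.append(align_cmd("index_pass1", fastqs, f"pass1/{name}."))
    # Between the passes, the per-sample SJ.out.tab files are collated and
    # filtered into `filtered_sj` (hypothetical file name, produced elsewhere).
    plan.append(genome_generate_cmd("index_pass2", fasta, gtf, read_length,
                                    sj_files=[filtered_sj]))
    for name, fastqs in samples.items():
        plan.append(align_cmd("index_pass2", fastqs, f"pass2/{name}."))
    return plan


# Hypothetical paired-end samples with 100 bp reads.
samples = {
    "sampleA": ["sampleA_R1.fastq", "sampleA_R2.fastq"],
    "sampleB": ["sampleB_R1.fastq", "sampleB_R2.fastq"],
}
plan = two_pass_plan(samples, "genome.fa", "annotation.gtf", read_length=100)
for cmd in plan:
    print(shlex.join(cmd))
```

Each printed line can be run in a shell or passed to subprocess.run(cmd, check=True). Output directories such as pass1/ must exist beforehand, and gzipped FASTQ input would additionally need --readFilesCommand zcat.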
Following alignment, exon and junction reads are quantified using featureCounts with specific parameters optimized for DEJU analysis:
The critical parameters nonSplitOnly=TRUE and juncCounts=TRUE ensure proper separation of internal exon reads from exon-exon junction reads, eliminating double-counting and creating distinct features for statistical testing [10] [32].
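The double-counting problem can be illustrated with a toy example. The sketch below is not featureCounts' algorithm; it assumes a hypothetical two-exon gene and represents each read as a list of aligned blocks, so that a split (junction-spanning) read has two blocks. It contrasts counting every exon a read touches with assigning non-split reads to exons and split reads to a separate junction feature.

```python
# Hypothetical flattened exon annotation: name -> (start, end), inclusive.
EXONS = {"E1": (100, 200), "E2": (300, 400)}


def overlaps(block, exon):
    """True if an aligned block and an exon share at least one base."""
    return block[0] <= exon[1] and exon[0] <= block[1]


def count_standard(reads):
    """Every exon touched by a read gets a count, so split reads count twice."""
    counts = {name: 0 for name in EXONS}
    for blocks in reads:
        for name, exon in EXONS.items():
            if any(overlaps(b, exon) for b in blocks):
                counts[name] += 1
    return counts


def count_exon_junction(reads):
    """Non-split reads go to exons; split reads go to a junction feature."""
    counts = {name: 0 for name in EXONS}
    for blocks in reads:
        if len(blocks) == 1:
            for name, exon in EXONS.items():
                if overlaps(blocks[0], exon):
                    counts[name] += 1
        else:
            # One count per junction crossed, keyed by the flanking positions.
            for left, right in zip(blocks, blocks[1:]):
                key = f"J:{left[1]}-{right[0]}"
                counts[key] = counts.get(key, 0) + 1
    return counts


reads = [
    [(120, 170)],              # internal to E1
    [(320, 370)],              # internal to E2
    [(180, 200), (300, 330)],  # spans the E1-E2 junction
    [(190, 200), (300, 340)],  # spans the E1-E2 junction
]
standard = count_standard(reads)
separated = count_exon_junction(reads)
print("standard:", standard, "total =", sum(standard.values()))
print("exon + junction:", separated, "total =", sum(separated.values()))
```

With four reads, the standard scheme reports six counts because each junction read contributes to both exons, whereas the separated scheme reports exactly four counts spread over two exon features and one junction feature.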
The core DEJU analysis implements rigorous statistical testing with built-in FDR control:
Both implementations use the Simes method to combine feature-level p-values within genes, providing strong FDR control while accounting for the dependency structure among exons and junctions from the same gene [10].
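The gene-level summarization can be sketched generically as follows. This is a plain-Python illustration of the Simes combination followed by Benjamini-Hochberg adjustment across genes, not the code used inside edgeR or limma; the gene names and feature-level p-values are made up for the example.

```python
def simes(pvalues):
    """Simes combined p-value: the minimum over i of n * p_(i) / i."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical feature-level p-values (exons and junctions) grouped by gene.
feature_p = {
    "Gene1": [0.001, 0.20, 0.45, 0.80],
    "Gene2": [0.30, 0.55, 0.60],
    "Gene3": [0.04, 0.03, 0.90],
}
genes = list(feature_p)
gene_p = [simes(feature_p[g]) for g in genes]
gene_fdr = benjamini_hochberg(gene_p)
for gene, p, q in zip(genes, gene_p, gene_fdr):
    print(f"{gene}: Simes p = {p:.3f}, BH-adjusted = {q:.3f}")
```

A gene is then called differentially spliced when its adjusted value falls below the chosen FDR threshold, so a single strongly significant exon or junction (as in Gene1) is enough to flag the gene while the number of features per gene is taken into account.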
Table 3: Key Research Reagent Solutions for DEJU Analysis
| Reagent/Resource | Function in DEJU Workflow | Implementation Notes |
|---|---|---|
| STAR Aligner | Read alignment and junction discovery | Two-pass mode with --outFilterType BySJout for optimal junction detection |
| Rsubread featureCounts | Exon and junction quantification | Critical parameters: nonSplitOnly=TRUE, juncCounts=TRUE |
| edgeR Package | Statistical testing for DEJU-edgeR | Uses diffSpliceDGE function with gene-level Simes correction |
| limma Package | Statistical testing for DEJU-limma | Uses diffSplice function with voom transformation for RNA-seq data |
| Reference Genome | Genomic coordinate system | Must match annotation files (GTF/SAF) |
| Annotation Files | Feature definitions (exons, genes) | Flattened exon SAF file and junction database |
| High-Quality RNA Samples | Input material for RNA-seq | RIN > 8.0 recommended for splicing analysis |
| Stranded RNA-seq Library Prep | Library construction | Preserves strand information for accurate junction assignment |
The DEJU workflow represents a significant advancement in differential splicing analysis by incorporating exon-exon junction reads to enhance detection power while maintaining robust FDR control. Through its integration with STAR two-pass mapping, specialized quantification procedures, and rigorous statistical testing with gene-level FDR control, DEJU addresses critical limitations of conventional differential exon usage methods. The quantitative performance data demonstrate that both DEJU-edgeR and DEJU-limma effectively control FDR near or below the nominal 0.05 threshold across diverse splicing patterns and sample sizes, while achieving high statistical power particularly for larger sample sizes. This balanced performance makes DEJU particularly valuable for clinical and pharmacological research where reliable identification of splicing biomarkers can inform drug development programs.
Accurate detection of alternative splicing is fundamental to understanding transcriptomic diversity in health and disease. The standard differential exon usage (DEU) analysis often relies solely on exon-level read counts, which can lead to reduced statistical power in identifying splicing alterations. This application note presents a comprehensive benchmark of a novel differential exon-junction usage (DEJU) workflow that integrates STAR two-pass mapping with junction-aware quantification, comparing its performance against established methods like DEXSeq and JunctionSeq. Our findings demonstrate that incorporating exon-exon junction reads through an optimized two-pass alignment strategy significantly enhances detection capabilities for biologically meaningful splicing events while effectively controlling false discovery rates.
The DEJU workflow implements a sophisticated integration of alignment and quantification strategies specifically designed to overcome limitations in conventional differential splicing analysis.
Splice-Aware Alignment with STAR Two-Pass Mapping: RNA-seq reads are first aligned using the STAR aligner in two-pass mapping mode [10]. In the initial pass, splice junctions are discovered from all samples across experimental conditions. These junctions are collapsed and filtered before being used to re-index the reference genome for a second round of mapping. This approach significantly improves sensitivity to novel junction detection, with studies demonstrating up to 1.7-fold deeper median read depth over splice junctions compared to single-pass alignment [8]. The --outFilterType BySJout option ensures only high-quality junctions are retained.
Junction-Incorporated Feature Quantification: Aligned reads are quantified using featureCounts from the Rsubread package with specific parameters: useMetaFeatures=FALSE, nonSplitOnly=TRUE (for internal exon counts), and juncCounts=TRUE (for exon-exon junction counts) [10]. This configuration generates separate count matrices for internal exons and junctions, which are subsequently concatenated into a single exon-junction count matrix. This strategy resolves the double-counting issue prevalent in standard exon counting, where junction reads contribute counts to both exons.
Downstream Statistical Analysis: The combined exon-junction count matrix undergoes standard pre-processing including low-expression filtering with filterByExpr and TMM normalization. Differential splicing analysis is then performed using either diffSpliceDGE in edgeR or diffSplice in limma, with feature-level results summarized to the gene level via the Simes method or an F-test [10].
To evaluate performance, we implemented multiple workflows including DEJU-edgeR, DEJU-limma, standard DEU-edgeR, DEU-limma, DEXSeq, and JunctionSeq. Comprehensive simulation studies generated data with known ground truths across various splicing patterns: exon skipping (ES), mutually exclusive exons (MXE), alternative 5' and 3' splice sites (ASS), and intron retention (IR). Performance was assessed based on statistical power, false discovery rate (FDR) control, and robustness across different sample sizes (3, 5, and 10 per group) and library sizes [10].
Table 1: Key Computational Tools for Implementing Splicing Analysis Workflows
| Tool Name | Primary Function | Key Parameters/Setting | Role in Workflow |
|---|---|---|---|
| STAR Aligner | Spliced alignment of RNA-seq reads | --twopassMode Basic, --outFilterType BySJout | Performs two-pass alignment to detect novel and known splice junctions |
| Rsubread/featureCounts | Read quantification | nonSplitOnly=TRUE, juncCounts=TRUE | Generates exon and exon-exon junction count matrices |
| edgeR | Differential analysis | diffSpliceDGE(), filterByExpr() | Statistical testing for differential exon-junction usage |
| limma | Differential analysis | diffSplice() | Alternative statistical testing for differential usage |
| DEXSeq | Differential exon usage | Exonic bin counting | Benchmark method for exon-based differential splicing |
| JunctionSeq | Differential junction usage | Junction and exon counting | Benchmark method incorporating junction information |
The benchmarking results demonstrate that the DEJU workflow significantly enhances detection capabilities across all alternative splicing patterns while maintaining stringent false discovery rate control.
Table 2: Performance Comparison of DEJU-edgeR Across Splicing Patterns (5 samples per group)
| Splicing Pattern | FDR | Statistical Power | Key Advantage |
|---|---|---|---|
| Exon Skipping (ES) | 0.029 | 0.991 | Superior detection of complete exon exclusion |
| Mutually Exclusive Exons (MXE) | 0.040 | 0.993 | Enhanced identification of alternative exon pairs |
| Alternative Splice Sites (ASS) | 0.027 | 0.927 | Improved resolution of subtle splice site shifts |
| Intron Retention (IR) | 0.031 | 0.934 | Unique capability to detect intron retention events |
The DEJU workflow demonstrated particularly notable advantages in detecting alternative splice site events and intron retention, which are challenging for conventional methods. While DEJU-edgeR effectively controlled FDR at or below the nominal 0.05 level across all splicing patterns, DEJU-limma showed slightly elevated FDR for mutually exclusive exons (0.062) [10]. The power for detecting alternative splice sites and intron retention events increased substantially with larger sample sizes, reaching 0.977 and 0.964 respectively with 10 samples per group.
When benchmarked against established methods, the DEJU workflow demonstrated superior performance characteristics:
Comparison with Standard DEU Methods: DEJU-edgeR and DEJU-limma detected a significantly higher number of true positive events compared to standard DEU approaches, particularly for alternative splice site (ASS) and intron retention (IR) events [42]. Standard DEU methods showed limited capability in detecting IR events due to their reliance solely on exon-level counts.
Comparison with DEXSeq and JunctionSeq: While DEXSeq detected more true positive ASS cases than standard DEU methods, it was outperformed by DEJU workflows in comprehensive event detection [10]. JunctionSeq, which also incorporates junction information, demonstrated capability in detecting IR events but showed less effective FDR control compared to the DEJU method.
Junction-Only Approaches: Workflows utilizing only junction counts (junc-edgeR and junc-limma) effectively controlled FDR and outperformed standard DEU methods, though they were less comprehensive than the full DEJU approach that integrates both exon and junction information [42].
For researchers seeking to implement this optimized workflow, the following step-by-step protocol provides a detailed guide:
Sample Preparation and Sequencing
Computational Requirements
Two-Pass Alignment with STAR
Perform first alignment pass for all samples:
Collate splice junctions from all samples and filter (e.g., junctions with ≥3 uniquely mapping reads)
Re-generate genome indices incorporating filtered junctions:
Perform second alignment pass with updated indices:
Junction-Aware Quantification and Differential Analysis
Combine exon and junction count matrices in R:
Perform differential splicing analysis with edgeR:
To validate computational predictions, we recommend:
Diagram 1: DEJU Workflow Integrating Two-Pass Mapping and Junction-Aware Analysis. This workflow illustrates the sequential process from raw RNA-seq data to differential splicing detection, highlighting the critical two-pass alignment and junction incorporation steps.
The enhanced detection capability of the DEJU workflow has significant implications for disease research and drug development:
Enhanced Biomarker Discovery: The improved sensitivity in detecting alternative splicing events, particularly intron retention and alternative splice site usage, enables identification of more comprehensive splicing signatures in disease states such as cancer [10]. These signatures can serve as valuable diagnostic or prognostic biomarkers.
Therapeutic Target Identification: Aberrant splicing is increasingly recognized as a driver in numerous diseases. The DEJU workflow's ability to detect subtle splicing alterations provides a more complete picture of potential therapeutic targets, including splice-switching opportunities.
Experimental Validation Efficiency: By reducing false positives and increasing true detection rates, the DEJU workflow streamlines downstream experimental validation efforts, optimizing resource allocation in drug development pipelines.
Researchers should consider several factors when implementing this workflow:
Computational Resources: Two-pass mapping requires substantial computational resources, including approximately 30 GB of RAM for human genome alignment and sufficient temporary storage for intermediate files [6].
Sample Size Requirements: While the DEJU workflow performs well with modest sample sizes (3-5 per group), power for detecting certain splicing events (particularly ASS and IR) increases substantially with larger sample sizes (10 per group) [10].
Method Selection: DEJU-edgeR is recommended when strict FDR control is paramount, while DEJU-limma may offer slightly higher sensitivity for some splicing patterns, though with potentially less stringent FDR control for mutually exclusive exons.
While the DEJU workflow represents a significant advancement, several limitations warrant consideration:
Computational Intensity: The two-pass mapping approach requires additional computational time compared to single-pass methods, though this is partially mitigated by STAR's efficient alignment algorithm.
Complex Splicing Patterns: While performance is improved across all major splicing patterns, detection of complex combinatorial splicing events remains challenging and represents an area for future methodology development.
Integration with Emerging Technologies: As long-read sequencing technologies mature, integration of short-read based DEJU analysis with long-read validation presents a promising approach for comprehensive splicing characterization.
The DEJU workflow, leveraging STAR two-pass mapping and junction-aware quantification, represents a substantial improvement over existing methods for differential splicing analysis. By effectively incorporating exon-exon junction information and resolving the double-counting problem inherent in standard DEU approaches, this method demonstrates enhanced statistical power across diverse splicing patterns while maintaining rigorous false discovery rate control. The implementation of this workflow will enable researchers to more comprehensively characterize alternative splicing in transcriptomic studies, with particular value for disease mechanism investigation and therapeutic development.
The study of alternative splicing (AS) using RNA sequencing (RNA-seq) is pivotal for understanding transcriptomic diversity in health and disease. AS is a key post-transcriptional mechanism in eukaryotes that enables a single gene to produce multiple mRNA isoforms, contributing significantly to proteomic complexity and cellular adaptation [10]. In the context of mammary gland biology, deciphering AS patterns is essential for uncovering molecular mechanisms underlying development, lactation, and diseases such as breast cancer and mastitis [49] [50]. However, standard differential exon usage (DEU) analyses often lack statistical power as they typically rely on exon-level read counts and fail to incorporate exon-exon junction information, which is crucial for capturing the full spectrum of splicing events [10]. This methodology gap can obscure the detection of biologically meaningful splicing alterations. Recent advancements in spliced alignment algorithms, specifically the two-pass mapping method with the STAR aligner, coupled with new analytical workflows like Differential Exon-Junction Usage (DEJU), have demonstrated enhanced capability to overcome these limitations. This application note details a case study within a broader research thesis on improving splicing detection accuracy, showcasing how these integrated methods revealed previously undetectable, biologically significant splicing events in a mouse mammary gland dataset [10]. The findings underscore the real-world impact of refined computational protocols in mammary gland transcriptomics, offering researchers and drug development professionals powerful tools to uncover novel regulatory mechanisms and potential therapeutic targets.
The Differential Exon-Junction Usage (DEJU) workflow represents a significant evolution in differential splicing analysis. It was developed to address a key limitation in standard DEU analysis, where reads that span two exons (exon-exon junction reads) are, by default, counted for both exons. This double-counting issue can reduce statistical power and obscure true splicing variations [10]. The DEJU workflow resolves this by jointly analyzing exon and exon-exon junction reads as distinct features.
The complete DEJU protocol involves the following key steps [10]:
1. Splice-aware alignment: Reads are aligned with STAR in two-pass mode. The --outFilterType BySJout option is used to retain only junction reads aligned to junctions passing a predefined filtering threshold (e.g., more than three uniquely mapping reads across all samples) in the final BAM files.
2. Exon and junction quantification: Reads are counted with the featureCounts function within the Rsubread package. Crucially, the annotation file used is a flattened and merged exon annotation, and the arguments useMetaFeatures=FALSE, nonSplitOnly=TRUE, and juncCounts=TRUE are set. This configuration allows for the simultaneous generation of two count matrices: one for internal exon reads and another for exon-exon junction reads.
3. Filtering and normalization: Lowly expressed exons and junctions are removed with the filterByExpr function in edgeR. Library composition biases are then corrected using the Trimmed Mean of M-values (TMM) normalization method, implemented via the normLibSizes function in edgeR.
4. Differential usage testing: Differential usage is tested with the diffSpliceDGE function in edgeR or the diffSplice function in limma. These functions identify specific exons or junctions that are differentially used between experimental groups. The results from these feature-level tests are subsequently summarized at the gene level using either the Simes method (to combine p-values) or an F-test (to combine quasi-likelihood F-test statistics) to determine differentially spliced genes.
The two-pass mapping strategy with STAR is critical for enhancing the detection of novel splice junctions, which are essential for a comprehensive DEJU analysis. The protocol, as detailed in [6], is as follows:
1. First pass: Align each sample with STAR to discover splice junctions (written to the SJ.out.tab files).
2. Junction collation and filtering: Collate the SJ.out.tab files from all samples. Filter the combined junction list to remove low-confidence junctions. A common filter, as applied in the DEJU study and other analyses, includes removing junctions with low read support (e.g., column 7, which represents uniquely mapping reads, < 5), non-canonical splice sites (column 5 = 0), and junctions from mitochondrial DNA (chrM) [10] [13].
3. Second pass: Supply the filtered junctions to STAR via the --sjdbFileChrStartEnd option and perform the second mapping pass.
The following workflow diagram illustrates the key stages of this integrated methodology:
The DEJU workflow, underpinned by STAR two-pass mapping, was applied to a mouse mammary gland RNA-seq dataset comparing two mammary epithelial cell (MEC) types: luminal progenitor and mature luminal cells [10]. This real-world case study serves as a benchmark for the protocol's efficacy in revealing biologically meaningful insights that were previously undetectable with standard DEU approaches.
Comprehensive simulation studies were conducted to benchmark the performance of the DEJU workflow against existing methods like DEU-edgeR, DEU-limma, DEXSeq, and JunctionSeq. The benchmarks evaluated the ability to detect various AS patterns, namely Exon Skipping (ES), Mutually Exclusive Exons (MXE), Alternative 5' and 3' Splice Sites (ASS), and Intron Retention (IR), while controlling the false discovery rate (FDR). The DEJU workflow demonstrated superior statistical power across all splicing events [10].
The following table summarizes the key performance metrics for DEJU-edgeR and DEJU-limma from these simulation studies:
Table 1: Performance of DEJU Analysis on Simulated Data
| Splicing Pattern | Sample Size (n) | DEJU-edgeR | DEJU-limma | ||
|---|---|---|---|---|---|
| FDR | Power | FDR | Power | ||
| Exon Skipping (ES) | 3 | 0.022 | 0.977 | 0.043 | 0.975 |
| 5 | 0.029 | 0.991 | 0.044 | 0.990 | |
| 10 | 0.038 | 0.992 | 0.051 | 0.992 | |
| Mutually Exclusive Exons (MXE) | 3 | 0.030 | 0.990 | 0.061 | 0.991 |
| 5 | 0.040 | 0.993 | 0.062 | 0.995 | |
| 10 | 0.045 | 0.995 | 0.063 | 0.995 | |
| Alternative Splice Site (ASS) | 3 | 0.027 | 0.839 | 0.038 | 0.877 |
| 5 | 0.027 | 0.927 | 0.037 | 0.947 | |
| 10 | 0.038 | 0.977 | 0.047 | 0.979 | |
| Intron Retention (IR) | 3 | 0.030 | 0.866 | 0.042 | 0.880 |
| 5 | 0.031 | 0.934 | 0.041 | 0.940 | |
| 10 | 0.042 | 0.964 | 0.050 | 0.968 |
Data adapted from [10]. FDR: False Discovery Rate. Power: Statistical Power. The table shows that both DEJU-edgeR and DEJU-limma maintain high power (>0.83) across all event types, with power increasing with sample size. DEJU-edgeR consistently controls the FDR at or below the nominal 0.05 level, while DEJU-limma shows a slightly elevated FDR, particularly for MXE events.
The application of the DEJU workflow to the mouse mammary gland RNA-seq dataset yielded significant biological insights. The analysis successfully identified statistically significant differential splicing events in genes with known and potential roles in mammary gland development and function that were not detected by the standard DEU approach [10]. This demonstrates the practical utility and enhanced sensitivity of the integrated two-pass mapping and DEJU protocol in a real research scenario, enabling the discovery of novel regulatory mechanisms in mammary biology.
The benchmark data provides clear guidance for experimental design. The power to detect differential splicing, particularly for complex events like ASS and IR, increases substantially with larger sample sizes [10]. While both DEJU-edgeR and DEJU-limma are powerful, the choice between them can be based on FDR control stringency; DEJU-edgeR is more conservative, making it suitable for studies where minimizing false positives is critical.
The following diagram synthesizes the logical relationship between the methodological improvements and their resulting biological impact, as demonstrated in the case study:
Successful implementation of the described protocols requires a set of key software tools and genomic resources. The following table details these essential components, their specific functions, and relevant considerations for researchers.
Table 2: Essential Research Reagents and Computational Tools
| Tool/Resource | Function in the Workflow | Key Specifications / Notes |
|---|---|---|
| STAR Aligner | Spliced alignment of RNA-seq reads to the reference genome. Supports two-pass mapping for novel junction discovery. | Critical parameters include --runThreadN, --genomeDir, --sjdbGTFfile, --sjdbOverhang, and --outFilterType BySJout [10] [6]. |
| R/Bioconductor | Provides the statistical computing environment for downstream analysis. | Essential packages: Rsubread (for featureCounts), edgeR (for filtering, normalization, and diffSpliceDGE), and limma (for diffSplice) [10]. |
| Reference Genome | The sequence from the target species used as a map for read alignment. | Must be in FASTA format. Ensure consistency between the genome build and the annotation file (e.g., both GRCm38 for mouse) [6]. |
| Gene Annotation | A file (GTF format) specifying the coordinates of known genes, transcripts, and exons. | Used by STAR during genome indexing and by featureCounts for read quantification. The --sjdbOverhang should be set to read length minus 1 [6]. |
| Junction Filtering Script | A custom script or command to filter the collated junctions from the first pass of STAR mapping. | Typical filters: remove junctions with < 5 unique reads, non-canonical splice sites, and mitochondrial junctions. This step improves efficiency [10] [13]. |
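The junction filtering script listed in the table could look roughly like the sketch below. It assumes the standard nine-column SJ.out.tab layout written by STAR (column 1 chromosome, columns 2 and 3 intron start and end, column 4 strand, column 5 intron motif, column 7 uniquely mapping reads) and applies the three filters named in this note. Filtering each line separately and then de-duplicating junctions across samples is an assumption of this sketch, as are the function names and the example records.

```python
def keep_junction(fields, min_unique=5):
    """Apply the three filters to one split SJ.out.tab record."""
    chrom = fields[0]
    motif = int(fields[4])          # 0 means non-canonical splice site
    unique_reads = int(fields[6])   # uniquely mapping reads at the junction
    return chrom != "chrM" and motif != 0 and unique_reads >= min_unique


def filter_junctions(lines, min_unique=5):
    """Filter collated SJ.out.tab lines and de-duplicate by coordinates."""
    kept = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        if keep_junction(fields, min_unique):
            kept.add((fields[0], int(fields[1]), int(fields[2]), int(fields[3])))
    return sorted(kept)


# Made-up records as they might appear after concatenating two samples.
collated = [
    "chr1\t1000\t2000\t1\t1\t1\t12\t0\t40",   # kept
    "chr1\t1000\t2000\t1\t1\t1\t7\t1\t38",    # same junction, second sample
    "chr1\t5000\t6000\t2\t0\t0\t30\t0\t50",   # non-canonical motif
    "chr2\t300\t900\t1\t2\t0\t4\t0\t20",      # fewer than 5 unique reads
    "chrM\t100\t400\t1\t1\t0\t99\t0\t45",     # mitochondrial
]
filtered = filter_junctions(collated)
print(filtered)
```

In practice the lines would be read from the per-sample SJ.out.tab files, and the surviving junctions written to a tab-delimited file for the second pass. The threshold is a parameter because this note quotes different example cut-offs in different places.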
This application note has detailed a robust and powerful protocol for differential splicing analysis that integrates the STAR two-pass mapping method with the DEJU workflow. The case study on mouse mammary gland RNA-seq data underscores the real-world impact of this approach, enabling the discovery of biologically meaningful splicing events that are invisible to standard methodologies. The double-counting problem inherent in conventional exon-based analyses is effectively resolved by treating exon-junction reads as distinct features, thereby enhancing statistical power and accuracy [10].
The comprehensive benchmarking data confirm that the DEJU workflow maintains high statistical power across a diverse range of alternative splicing events, including the often hard-to-detect alternative splice sites and intron retention, while effectively controlling the false discovery rate. This makes it a superior choice for researchers aiming to obtain a complete picture of the splicing landscape. Furthermore, the protocol is scalable and integrates seamlessly within the widely adopted Rsubread-edgeR/limma frameworks, ensuring computational efficiency and accessibility to the scientific community [10].
For researchers and drug development professionals working in mammary gland biology and beyond, adopting this integrated two-pass mapping and DEJU analysis protocol offers a significant advantage. It transforms the ability to link transcriptional diversity to cellular function and disease mechanisms, paving the way for the identification of novel diagnostic markers and therapeutic targets rooted in the regulation of alternative splicing.
In precision oncology, the accurate detection of expressed mutations is critical for clinical decision-making, therapy selection, and patient stratification. While DNA-based sequencing identifies potential genetic variants, it cannot distinguish whether these mutations are actually transcribed into RNA and therefore likely to impact protein function and therapeutic response. Targeted RNA sequencing (RNA-Seq) has emerged as a powerful solution to this limitation, bridging the "DNA to protein divide" by enabling focused, sensitive detection of expressed mutations, fusion transcripts, and splicing variants in clinically relevant genes. The clinical utility of this approach is substantially enhanced when coupled with advanced bioinformatic methods, particularly the STAR two-pass mapping method, which significantly improves the accuracy of splice junction detection and mutation identification. This Application Note outlines established protocols and analytical frameworks for implementing targeted RNA-Seq with two-pass alignment to enhance expressed mutation detection in clinical and research settings, providing researchers and drug development professionals with practical methodologies to strengthen somatic mutation findings for diagnostic, prognostic, and therapeutic predictive purposes [11].
The STAR two-pass alignment method represents a significant advancement for detecting novel splicing events and expressed mutations in RNA-Seq data. In conventional single-pass alignment, preference is given to known splice junctions, which biases quantification against novel splice junctions and reduces detection power for unannotated variants. Two-pass alignment addresses this limitation by separating the processes of splice junction discovery and quantification [8].
In the first pass, splice junctions are discovered with high stringency, and these newly identified junctions are then used as annotations in a second alignment pass to permit lower stringency alignment and higher sensitivity [8]. This approach has demonstrated remarkable improvements in detection capabilities, providing as much as 1.7-fold deeper median read coverage over novel splice junctions and improving quantification accuracy for at least 94% of simulated novel splice junctions across diverse RNA-Seq datasets [8]. For clinical applications, this enhanced sensitivity is crucial for identifying low-abundance mutant transcripts and novel splicing events that may have therapeutic implications.
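The hand-off between the two passes can be illustrated by reading the junction table that STAR writes after the first pass (`SJ.out.tab`, nine tab-separated columns) and selecting the unannotated junctions. This is a minimal sketch, not part of the cited protocol: the `min_unique_reads` threshold and the example rows are hypothetical, and with `--twopassMode Basic` STAR performs this hand-off internally.

```python
from typing import Iterable, List, NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int         # first base of the intron (1-based)
    end: int           # last base of the intron (1-based)
    strand: int        # 0 = undefined, 1 = +, 2 = -
    motif: int         # 0 = non-canonical, 1-6 = motif codes from the STAR manual
    annotated: bool    # True if the junction was present in the supplied annotation
    unique_reads: int  # uniquely mapping reads crossing the junction
    multi_reads: int   # multi-mapping reads crossing the junction
    max_overhang: int  # maximum spliced alignment overhang


def parse_sj_out(lines: Iterable[str]) -> List[Junction]:
    """Parse the lines of an SJ.out.tab file into Junction records."""
    junctions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        start, end, strand, motif, annot, uniq, multi, overhang = map(int, fields[1:9])
        junctions.append(
            Junction(fields[0], start, end, strand, motif, bool(annot), uniq, multi, overhang)
        )
    return junctions


def select_novel(junctions: List[Junction], min_unique_reads: int = 3) -> List[Junction]:
    """Keep unannotated junctions with enough unique read support.

    The default threshold is an illustrative assumption, not a recommended value.
    """
    return [j for j in junctions if not j.annotated and j.unique_reads >= min_unique_reads]


# Hypothetical first-pass rows, for illustration only.
example_rows = [
    "chr1\t14830\t14969\t2\t2\t1\t10\t3\t38",
    "chr1\t15039\t15795\t2\t2\t0\t5\t0\t40",
    "chr1\t16766\t16857\t2\t2\t0\t1\t4\t12",
]
novel = select_novel(parse_sj_out(example_rows))
for j in novel:
    print(j.chrom, j.start, j.end, j.unique_reads)
```

In a manual multi-sample workflow, junctions selected this way would be pooled across samples and supplied to the second pass; in single-sample Basic mode no such script is needed.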
The following protocol outlines the key steps for implementing STAR two-pass alignment in targeted RNA-Seq analysis:
1. First-Pass Alignment and Junction Discovery
   - `--twopassMode Basic` parameter to enable two-pass mode
2. Second-Pass Alignment
   - `--outFilterType BySJout` option [10]
3. Quality Control and Error Mitigation
Table 1: Key Parameters for STAR Two-Pass Alignment in Targeted RNA-Seq
| Parameter | Recommended Setting | Purpose |
|---|---|---|
| `--twopassMode` | `Basic` | Enables two-pass alignment mode |
| `--outFilterType` | `BySJout` | Keeps only reads whose junctions pass the filtering applied to SJ.out.tab |
| `--alignSJoverhangMin` | `8` | Requires reads to span novel junctions by at least 8 nucleotides |
| `--alignSJDBoverhangMin` | `3` | Requires reads to span known junctions by at least 3 nucleotides |
| `--outSAMtype` | `BAM SortedByCoordinate` | Outputs coordinate-sorted BAM files |
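The Table 1 settings can be assembled into a single STAR invocation. The sketch below only builds and prints the command line; it does not run STAR. The index directory, FASTQ names, output prefix, and thread count are hypothetical placeholders, and uncompressed FASTQ input is assumed (gzipped input would additionally need `--readFilesCommand`).

```python
import shlex
from typing import List, Sequence


def build_star_two_pass_command(
    genome_dir: str,
    reads: Sequence[str],
    out_prefix: str,
    threads: int = 4,
) -> List[str]:
    """Return a STAR argument list using the Table 1 settings."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,                      # one file (single-end) or two (paired-end)
        "--outFileNamePrefix", out_prefix,
        "--twopassMode", "Basic",                     # run both passes in one invocation
        "--outFilterType", "BySJout",                 # keep reads whose junctions passed filtering
        "--alignSJoverhangMin", "8",                  # minimum overhang, unannotated junctions
        "--alignSJDBoverhangMin", "3",                # minimum overhang, annotated junctions
        "--outSAMtype", "BAM", "SortedByCoordinate",  # coordinate-sorted BAM output
    ]


# Placeholder paths, for illustration only.
cmd = build_star_two_pass_command(
    genome_dir="star_index",
    reads=["sample_R1.fastq", "sample_R2.fastq"],
    out_prefix="sample_",
)
print(shlex.join(cmd))
```

Returning a list rather than a single string lets the command be passed directly to `subprocess.run(cmd, check=True)` without shell quoting issues.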
The DEJU analysis workflow represents a significant advancement over traditional differential exon usage (DEU) analysis by incorporating both exon and exon-exon junction reads, thereby resolving the double-counting issue inherent in standard approaches [10]. This method enhances statistical power while effectively controlling the false discovery rate (FDR), making it particularly valuable for clinical applications where accuracy is paramount.
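The double-counting issue can be seen in a toy example. The sketch below is a simplified illustration, not the Rsubread implementation: reads are represented as lists of aligned blocks, the exon coordinates are hypothetical, and a junction is keyed by the last base of the upstream block and the first base of the downstream block.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)
Read = List[Interval]       # one block per contiguous aligned segment


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_exon_only(reads: List[Read], exons: Dict[str, Interval]) -> Counter:
    """Exon-level counting: a read counts once for every exon it overlaps."""
    counts: Counter = Counter()
    for blocks in reads:
        for name, exon in exons.items():
            if any(overlaps(block, exon) for block in blocks):
                counts[name] += 1
    return counts


def count_exon_and_junction(reads: List[Read], exons: Dict[str, Interval]):
    """DEJU-style counting: unsplit reads go to exons, split reads go to junctions."""
    exon_counts: Counter = Counter()
    junction_counts: Counter = Counter()
    for blocks in reads:
        if len(blocks) == 1:
            for name, exon in exons.items():
                if overlaps(blocks[0], exon):
                    exon_counts[name] += 1
        else:
            for left, right in zip(blocks, blocks[1:]):
                junction_counts[(left[1], right[0])] += 1
    return exon_counts, junction_counts


# Hypothetical two-exon gene and three reads; the second read spans the junction.
exons = {"E1": (100, 200), "E2": (300, 400)}
reads = [[(150, 190)], [(180, 200), (300, 320)], [(310, 350)]]

exon_only = count_exon_only(reads, exons)
exon_counts, junction_counts = count_exon_and_junction(reads, exons)
print(sum(exon_only.values()), "counts from", len(reads), "reads (exon-only)")
print(sum(exon_counts.values()) + sum(junction_counts.values()), "counts (exon + junction)")
```

In the exon-only scheme the junction-spanning read contributes to both exons, so three reads yield four counts; assigning split reads to a junction feature keeps each read counted once.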
DEJU Workflow Protocol:

Feature Quantification:
- `nonSplitOnly` and `juncCounts` arguments to TRUE to obtain both internal exon and junction count matrices simultaneously [10]

Data Integration:
- `filterByExpr` function in edgeR [10]
- `normLibSizes` function to account for composition biases between libraries [10]

Downstream Analysis:
- `diffSpliceDGE` function in edgeR or `diffSplice` function in limma [10]

Table 2: Performance Comparison of DEJU vs Traditional Methods for Detecting Splicing Events
| Splicing Pattern | Sample Size (n) | DEJU-edgeR Power | DEJU-limma Power | DEJU-edgeR FDR | DEJU-limma FDR |
|---|---|---|---|---|---|
| Exon Skipping (ES) | 3 | 0.977 | 0.975 | 0.022 | 0.043 |
| Mutually Exclusive Exon (MXE) | 3 | 0.990 | 0.991 | 0.030 | 0.061 |
| Alternative 3'/5' Splice Site (ASS) | 3 | 0.839 | 0.877 | 0.027 | 0.038 |
| Intron Retention (IR) | 3 | 0.866 | 0.880 | 0.030 | 0.042 |
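For readers interpreting the power and FDR columns, the sketch below shows how such metrics are conventionally computed from a simulation with known ground truth. It assumes the usual definitions (power = TP / (TP + FN), FDR = FP / (TP + FP)); the counts are hypothetical and are not taken from the cited benchmark.

```python
from typing import Tuple


def power_and_fdr(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Return (power, FDR) from confusion counts of a simulated benchmark."""
    truly_spliced = true_pos + false_neg  # genes simulated with a splicing change
    called = true_pos + false_pos         # genes reported as differentially spliced
    power = true_pos / truly_spliced if truly_spliced else 0.0
    fdr = false_pos / called if called else 0.0
    return power, fdr


# Hypothetical counts, for illustration only.
power, fdr = power_and_fdr(true_pos=90, false_pos=3, false_neg=10)
print(f"power={power:.3f} fdr={fdr:.3f}")
```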
Targeted RNA-Seq provides orthogonal validation for DNA-identified mutations while independently detecting additional expressed variants. The clinical implementation requires careful consideration of two primary scenarios [11]:
Scenario 1: RNA-Seq to Verify and Prioritize DNA Variants
Scenario 2: Independent RNA-Seq Analysis
The following workflow diagram illustrates the integrated process of two-pass mapping and differential exon-junction usage analysis for improved mutation detection:
Table 3: Essential Research Reagents and Computational Tools for Targeted RNA-Seq Mutation Detection
| Reagent/Tool | Function | Application Notes |
|---|---|---|
| STAR Aligner | Spliced alignment of RNA-Seq reads | Use version 2.4.1a or later; supports two-pass mapping for novel junction detection [6] |
| Rsubread/featureCounts | Quantification of exon and junction reads | Set nonSplitOnly=TRUE and juncCounts=TRUE for DEJU analysis [10] |
| edgeR/limma | Differential expression and usage analysis | Use diffSpliceDGE (edgeR) or diffSplice (limma) for junction-level analysis [10] |
| Agilent Clear-seq Panels | Targeted enrichment of cancer transcripts | Longer probes (120bp); comprehensive cancer coverage [11] |
| Roche Comprehensive Cancer Panels | Targeted RNA sequencing | Shorter probes (70-100bp); focused cancer gene content [11] |
| AMPure XP Beads | cDNA purification and size selection | 0.6:1 beads to cDNA ratio recommended for library purification [51] |
| RNase Inhibitor | Prevention of RNA degradation | Critical for maintaining RNA integrity during library prep [51] |
Targeted RNA-Seq with two-pass mapping has demonstrated significant clinical utility in multiple oncology contexts. In cancer research, particularly in hematological malignancies, this approach can detect gene fusion events that are elusive to traditional detection methods [52]. For example, in acute myeloid leukemia (AML), targeted RNA-Seq can identify clinically relevant markers and detect acquired somatic mutations with high sensitivity [52].
The technology also enables detection of point mutations, insertions or deletions, and complex splicing variations that may be missed by DNA sequencing alone [11] [53]. Studies have revealed that up to 18% of single nucleotide variants (SNVs) detected by DNA-Seq in lung and other cancers were not transcribed, indicating that some mutations detected by DNA-Seq are likely clinically irrelevant [11]. This highlights the critical need to validate the clinical relevance of mutations using RNA-Seq, ensuring that tumor molecular classification and treatment decisions are based on actionable genetic targets that are expressed in the patient's tumor [11].
The accurate detection of expressed mutations through targeted RNA-Seq plays an increasingly important role in drug development, particularly for targeted therapies and mRNA-based individualized neoantigen therapies. For example, mRNA-4157 (V940) is a novel mRNA-based individualized neoantigen therapy encoding up to 34 neoantigens, designed to target a patient's unique set of cancer neoantigens [11]. The neoantigen selection algorithm developed for this therapy verifies and prioritizes amino acid candidates, underscoring how RNA-level validation complements DNA-based mutation detection in advanced therapeutic development [11].
The integration of STAR two-pass mapping with targeted RNA-Seq analysis represents a significant advancement in precision medicine, enabling more accurate detection of expressed mutations with direct clinical relevance. The DEJU analytical framework further enhances this approach by incorporating exon-junction information, thereby improving statistical power while controlling false discovery rates. For researchers and drug development professionals, these methodologies provide a robust foundation for validating DNA-identified mutations while independently discovering novel expressed variants that may impact therapeutic decisions. As precision medicine continues to evolve, the combination of targeted RNA-Seq with advanced alignment and analytical methods will play an increasingly critical role in ensuring that clinical decisions are based on functionally relevant, expressed mutations rather than DNA variants of uncertain transcriptional significance.
STAR two-pass mapping is a proven, powerful methodology that decisively overcomes a fundamental limitation in standard RNA-seq analysis by significantly improving the detection and quantification of novel and annotated splice junctions. By adopting this approach, researchers can achieve greater statistical power in differential splicing analyses, uncover previously hidden biologically relevant isoforms, and generate more robust data for precision medicine applications. The future of clinical RNA-seq, particularly in oncology, will increasingly rely on such sensitive methods to distinguish driver from passenger mutations and identify therapeutically actionable expressed variants. As the field advances, the integration of two-pass mapping with emerging long-read sequencing technologies and machine-learning-based junction filtering, as seen in tools like 2passtools, promises to further lower the detection barriers for complex splicing landscapes across diverse biological and clinical contexts.