
Complete Oxford Nanopore poly(A) tail analysis pipeline for Dorado DRS data.
Source:R/check_tails_dorado_DRS.R
check_tails_dorado_DRS.RdThis comprehensive wrapper function orchestrates the complete analysis of Oxford Nanopore direct RNA sequencing (DRS) data processed with Dorado basecaller (>=1.0.0) using POD5 file format. The pipeline identifies and characterizes non-adenosine nucleotides within poly(A) tails through advanced signal processing, machine learning-based classification, and statistical analysis.
Usage
check_tails_dorado_DRS(
dorado_summary,
pod5_dir,
num_cores = 1,
qc = TRUE,
save_dir,
prefix = "",
part_size = 40000,
cleanup = FALSE
)Arguments
- dorado_summary
Character string or data frame. Either the full path to a Dorado summary file (.txt, .tsv, .csv) or an in-memory data frame containing summary information. Required columns:
read_id(unique identifier),filename(POD5 file name),poly_tail_length(tail length in nucleotides),poly_tail_start(start coordinate),poly_tail_end(end coordinate). Additional columns such asalignment_genome,alignment_mapqare automatically included if present.- pod5_dir
Character string. Full path to the directory containing POD5 files. The directory should contain POD5 files referenced in the
filenamecolumn of the Dorado summary. Files can be organized in subdirectories (recursive search is performed). Typical ONT directory structures are automatically handled.- num_cores
Integer [1]. Number of physical CPU cores to use for parallel processing. Should not exceed
parallel::detectCores() - 1to maintain system responsiveness. Higher core counts significantly reduce processing time for large datasets. Memory usage scales approximately linearly with core count.- qc
Logical [TRUE]. Enable comprehensive quality control filtering. When
TRUE, applies stringent filters including:Minimum poly(A) tail length (10 nucleotides)
Coordinate validation and sanitization
Terminal position masking (reduces false positives)
Statistical outlier detection and removal
Set to
FALSEonly for preliminary analysis or when using pre-filtered data.- save_dir
Character string. Full path to the output directory where all results, intermediate files, and logs will be saved. The directory will be created if it doesn't exist. If the directory contains existing files, the user will be prompted to choose whether to abort or overwrite. Requires sufficient disk space (depending on dataset size).
- prefix
Character string [""]. Optional prefix for output filenames. When provided, this string is prepended to all output files between the timestamp and file type identifier. Useful for organizing multiple analyses or experimental conditions. Should contain only filesystem-safe characters (alphanumeric, underscore, hyphen).
- part_size
Integer [40000]. Maximum number of reads to process in each file partition. Must be >= 1. Larger values increase memory usage but may improve processing efficiency. Smaller values reduce memory footprint and enable processing of very large datasets on memory-constrained systems. Optimal values typically range from 10,000 to 100,000 depending on available RAM.
- cleanup
Logical [FALSE]. Controls removal of intermediate files after successful analysis completion. When
TRUE, removes all temporary subdirectories (dorado_summary_dir,polya_signal_dir,nonA_temp_dir,polya_chunks_dir) keeping only final results and log files. WhenFALSE, preserves all intermediate files for debugging or detailed inspection. Recommended to keepTRUEto save disk space.
Value
A named list containing comprehensive analysis results:
read_classesData frame with per-read classification results including:
readname: Unique read identifiercontig: Reference genome/transcript name (if aligned)polya_length: Estimated poly(A) tail lengthqc_tag: Quality control statusclass: Classification result ("decorated", "blank", "unclassified")comments: Detailed classification rationale using standard codes
nonadenosine_residuesData frame with detailed information about detected non-A nucleotides including:
readname: Read identifiercontig: Reference informationprediction: Predicted nucleotide type (C, G, U)est_nonA_pos: Estimated position within poly(A) tailpolya_length: Total tail lengthqc_tag: Quality metrics
Details
Reads are classified into three main categories with specific biological interpretations. The classification system is hierarchical, with quality control failures taking precedence over biological interpretation.
The Dorado and Guppy pipelines share the same core algorithm but differ in several technical details that affect output interpretation.
- Dorado
Summary files include alignment_direction field explicitly. Unmapped reads have direction = "*"
- Dorado
Integer positions:
round(position, 0). Reflects the discrete nature of nucleotide positions unlike nanopolish-based predictions- Dorado
Does provides move data in BAM files, however iteration through them is time-consuming, so it was omitted in dorado pipelines. Pseudomoves are computed from raw signal using z-score peak detection algorithm.
- Dorado
BAC code specifically checks poly_tail_start = 0. In DRS, this is definitively an error as adapter sequences precede poly(A) signal
Despite these technical differences, both pipelines produce compatible output tables with identical column names and interpretable values. Users can compare results across pipelines by accounting for the integer vs decimal position difference. Classification categories (decorated/blank/unclassified) have identical biological meanings across both pipelines.
Recommendation: Use Dorado pipeline for new analyses (modern format, actively maintained). Use Guppy pipeline only for legacy data or when POD5 conversion is not feasible.
Note
Important considerations:
Ensure sufficient disk space (typically 2-5x input size) in
save_dirThe function generates detailed log files for troubleshooting
Results should always be assigned to a variable to prevent console overflow
POD5 files must correspond exactly to reads in the Dorado summary
For datasets >1M reads, consider batch processing or increased
part_sizePosition estimates are approximate (±2-3 nt accuracy); validate critical findings
Pipeline Overview
The analysis pipeline consists of several integrated stages:
Input Validation: Validates Dorado summary files and POD5 directories
Data Preprocessing: Splits large datasets and applies quality filters
Signal Extraction: Extracts poly(A) tail signals from POD5 files
Feature Engineering: Computes pseudomoves and signal characteristics
Segmentation: Identifies candidate modification regions
GAF Creation: Creates Gramian Angular Fields for CNN input
Classification: Applies trained neural networks for prediction
Output Creation: Produces comprehensive results and statistics
Cleanup: Optionally removes intermediate files to save disk space
Input Requirements
This pipeline requires specific input formats:
Dorado Summary: Must contain poly(A) information columns:
read_id,filename,poly_tail_length,poly_tail_start,poly_tail_endPOD5 Files: Raw signal files corresponding to reads in summary
Output table explanation
1. read_classes Table This table provides a complete accounting of ALL reads in the analysis, with each read assigned to one of three biological categories (class) and given a specific technical code (comments) explaining the classification.
2. nonadenosine_residues Table
This table provides modification-level detail for decorated reads only. Each row represents a single predicted non-adenosine residue within a poly(A) tail. Reads can have multiple rows if multiple modifications detected.
Output Structure
The function creates several output subdirectories:
dorado_summary_dir/Split summary files for parallel processing
polya_signal_dir/Extracted poly(A) signals in RDS format
nonA_temp_dir/Intermediate non-A prediction files
polya_chunks_dir/Signal segmentation data
Performance Characteristics
Scalability: Efficient parallel processing across multiple cores
Memory Management: Smart data partitioning prevents memory overflow
Progress Tracking: Comprehensive logging and progress indicators
Error Handling: Robust error recovery and informative diagnostics
Quality Control
When qc = TRUE, the pipeline applies several quality filters:
Poly(A) tail length filtering (minimum 10 nucleotides)
Coordinate validation (start < end positions)
Terminal position masking to reduce false positives
Duplicate read detection and handling
Algorithm Details
The pipeline employs several sophisticated algorithms:
Pseudomove Computation: Z-score based outlier detection with adaptive windowing
Signal Segmentation: Run-length encoding for modification region identification
GAF Transformation: Gramian Angular Field generation for CNN compatibility
Neural Classification: Pre-trained CNNs for nucleotide type prediction
Classification Codes
The comments column uses standardized 3-letter codes for precise
technical documentation. These codes are essential for filtering datasets
and understanding pipeline behavior:
Comments:
- YAY
Non-A residue successfully detected and classified
- MAU
No signal deviations detected; pure poly(A) signal
- MPU
Signal deviations present but predicted as adenosine only
- IRL
Poly(A) tail too short (< 10 nt) for reliable analysis
- UNM
Read unmapped to reference genome
- BAC
Invalid coordinate system (poly_tail_start = 0)
System Requirements
R Version: >= 3.5.0
Python: >= 3.6 with pod5 module (
pip install pod5)Memory: >= 8GB RAM (16GB+ recommended for large datasets)
Storage: Temporary space ~2-5x input file size
Dependencies: Tensorflow/Keras for neural network inference
Error Handling
Input validation with informative error messages
Graceful handling of corrupted or missing files
Automatic retry mechanisms for transient failures
Detailed logging of all errors and warnings
Safe cleanup of temporary files on failure
Performance Tips
Use SSDs for
save_dirto improve I/O performanceAdjust
part_sizebased on available RAM (larger = faster, more memory)Use
num_cores = parallel::detectCores() - 1for maximum speedEnsure POD5 files and output directory are on fast storage
Consider preprocessing very large datasets (>1M reads) in batches
See also
check_tails_guppy for legacy Guppy-based analysis,
preprocess_inputs for input preprocessing,
process_dorado_signal_files for signal processing,
create_outputs_dorado for output generation,
filter_signal_by_threshold for signal analysis