
Create Ninetails output tables for Dorado DRS pipeline
Source:R/ninetails_core_functions_dorado_DRS.R
create_outputs_dorado.RdThis function integrates Dorado poly(A) tail summaries, non-adenosine predictions from CNN models, and poly(A) chunk information to generate two main outputs: per-read classifications and non-adenosine residue predictions with estimated positions along the poly(A) tail.
Usage
create_outputs_dorado(
dorado_summary_dir,
nonA_temp_dir,
polya_chunks_dir,
num_cores = 1,
qc = TRUE,
original_summary
)Arguments
- dorado_summary_dir
Character string. Path to a directory containing Dorado summary files (.txt, .tsv, or .csv) with per-read poly(A) tail information for reads that passed Ninetails quality control filtering. These files should contain only reads meeting the four QC criteria: (I) mapped to reference, (II) MAPQ > 0, (III) valid poly(A) coordinates (poly_tail_start != 0), and (IV) poly(A) tail length >= 10 nt.
- nonA_temp_dir
Character string. Path to a directory containing non-adenosine prediction RDS files generated from CNN models. Each RDS file contains predictions for signal chunks, stored as either named vectors or lists with 'chunkname' and 'prediction' elements.
- polya_chunks_dir
Character string. Path to a directory containing poly(A) chunk RDS files. These files contain segmented signal chunks with positional information (chunk_start_pos, chunk_end_pos) used for calculating estimated positions of non-A residues along the poly(A) tail.
- num_cores
Integer. Number of cores to use for parallelized file loading and processing. Must be a positive integer. Default is 1. Parallel processing uses foreach with doSNOW backend and displays progress bars during file loading operations.
- qc
Logical. Whether to apply quality control filtering of terminal predictions. When
TRUE(default), predictions within 2 nt of either end of the poly(A) tail are removed to reduce false positives from adapter-tail boundary artifacts. Reads with only terminal predictions are reclassified as blank (MPU).- original_summary
Character string or data frame. Path to the original unfiltered Dorado summary file, or the data frame itself. This file should contain ALL reads from the sequencing run, including those that failed QC filtering. Required for complete read accounting - the function will classify all reads, including those filtered during preprocessing. Must contain columns: filename, read_id, poly_tail_length, poly_tail_start, poly_tail_end, alignment_genome, alignment_direction, alignment_mapq.
Value
A named list with two data frames:
- read_classes
Data frame with per-read classification results for ALL reads in original_summary. Includes columns: readname, contig, polya_length, qc_tag (MAPQ), class (decorated/blank/unclassified), and comments (classification code).
- nonadenosine_residues
Data frame with per-chunk predictions of non-adenosine residues for decorated reads only. Includes columns: readname, contig, prediction (C/G/U), est_nonA_pos (estimated position from 3' end), polya_length, qc_tag (MAPQ). Empty dataframe if no non-A residues detected.
Details
The function implements a complete read accounting system that classifies ALL reads from the original summary file into biologically meaningful categories based on both quality control metrics and modification detection results.
This function is not intended to be used outside the pipeline wrapper (as standalone function).
Read Classification System
The function classifies reads into three main categories with specific biological interpretations:
1. decorated - Reads with detected non-adenosine modifications
Comment code: YAY (Yes, A modification was detected - Yeah!)
Criteria: Read passed all QC filters AND CNN detected at least one non-A residue (C, G, or U) that survived terminal filtering
Biological meaning: Poly(A) tail contains non-adenosine residues, indicating potential guanylation, uridylation, or cytidylation
2. blank - Reads without detected modifications
MAU (Move transition Absent, Unmodified): No signal deviations detected; tail appears to be pure poly(A)
MPU (Move transition Present, Unmodified): Signal deviations detected but CNN predicted only A residues, or predictions were filtered as terminal artifacts
IRL (Insufficient Read Length): Poly(A) tail < 10 nt; too short for reliable modification detection
Biological meaning: Poly(A) tail contains only adenosine residues or is too short/ambiguous for modification calling
3. unclassified - Reads that failed quality control
UNM (UnMapped): Read unmapped to reference (alignment_direction = "*" AND alignment_mapq = 0); cannot determine tail context
BAC (BAd Coordinates): Invalid poly(A) tail coordinates (poly_tail_start = 0); indicates segmentation failure
Biological meaning: Technical failures preventing reliable modification analysis; excluded from biological interpretation
Position Estimation
Non-adenosine residue positions are estimated using the formula:
est_nonA_pos = round(poly_tail_length - ((poly_tail_length * centr_signal_pos) / signal_length), 0)
Where:
centr_signal_pos: Mean of chunk start and end positions in signal spacesignal_length: 0.2 × (poly_tail_end - poly_tail_start) converts signal coordinates to nucleotide spacepoly_tail_length: Total poly(A) tail length from Dorado
Positions are reported as distance from the 3' end of the tail (position 1 = most 3' nucleotide). In Dorado pipeline, all position calculations are rounded to integers, unlike the Guppy legacy pipeline which uses decimal precision.
Quality Control Criteria
Reads in dorado_summary_dir must meet four criteria to be
processed by the neural network:
Mapped: alignment_direction is "+" or "-", not "*"
High mapping quality: alignment_mapq > 0
Valid coordinates: poly_tail_start != 0 (in DRS, adapter passes through pore first, so poly(A) cannot start at position 0)
Sufficient length: poly_tail_length >= 10 nt
Reads failing any criterion are marked with appropriate comment codes (UNM, BAC, IRL) but are still included in the output for complete read accounting.
Output Format Details
read_classes columns:
- readname
Read identifier (UUID)
- contig
Reference contig/transcript name
- polya_length
Poly(A) tail length in nucleotides
- qc_tag
Mapping quality score (MAPQ)
- class
Read classification: decorated, blank, or unclassified
- comments
3-letter code: YAY, MAU, MPU, IRL, UNM, or BAC
nonadenosine_residues columns:
- readname
Read identifier (UUID)
- contig
Reference contig/transcript name
- prediction
Predicted nucleotide: C, G, or U (A excluded)
- est_nonA_pos
Estimated position from 3' end (integer, 1-based)
- polya_length
Total poly(A) tail length in nucleotides
- qc_tag
Mapping quality score (MAPQ)
Examples
if (FALSE) { # \dontrun{
# Basic usage with complete read accounting
results <- create_outputs_dorado(
dorado_summary_dir = "data/filtered_summaries",
nonA_temp_dir = "data/cnn_predictions",
polya_chunks_dir = "data/signal_chunks",
num_cores = 8,
qc = TRUE,
original_summary = "data/original_dorado_summary.tsv"
)
} # }