This vignette describes the three analysis pipelines available in ninetails for detecting non-adenosine residues in poly(A) tails. For working with the output data, see vignette("postprocessing"). For visualization, see vignette("plotting") and vignette("shiny_app").

check_tails_dorado_DRS() is the recommended function for direct RNA sequencing (DRS) data basecalled with Dorado ≥ 1.0.0 using POD5 format.

results <- ninetails::check_tails_dorado_DRS(
  dorado_summary = "path/to/dorado_alignment_summary.txt",
  pod5_dir       = "path/to/pod5_dir/",
  num_cores      = 2,
  qc             = TRUE,
  save_dir       = "~/output/",
  prefix         = "experiment1",
  part_size      = 40000,
  cleanup        = FALSE
)

Parameters

Parameter Type Default Description
dorado_summary character/data.frame required Path to Dorado summary file. Must contain: read_id, filename, poly_tail_length, poly_tail_start, poly_tail_end
pod5_dir character required Path to directory containing POD5 files
num_cores integer 1 Number of CPU cores for parallel processing
qc logical TRUE Apply quality control filtering (see below)
save_dir character required Output directory for results files
prefix character "" Optional prefix for output filenames (e.g., "experiment1_")
part_size integer 40000 Maximum reads per processing chunk (reduce if memory limited)
cleanup logical FALSE Remove intermediate files after completion

Quality control (qc = TRUE)

When qc = TRUE, the pipeline applies the following filters before signal analysis:

  • Reads with poly_tail_start = 0 are classified as BAC (bad coordinates) and excluded from signal processing
  • Reads with poly_tail_length < 10 are classified as IRL (insufficient read length)
  • Unmapped reads (no contig assignment) are classified as UNM

These reads are still included in the output read_classes table but are marked as unclassified.
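The filter rules above can be sketched as plain R. The snippet below is an illustrative reimplementation of the qc = TRUE logic on a toy summary table, not ninetails' internal code; the column names follow the summary requirements listed earlier, and the data are invented.

```r
# Toy Dorado summary rows (invented data).
summary_df <- data.frame(
  read_id          = c("r1", "r2", "r3", "r4"),
  poly_tail_start  = c(120, 0, 95, 110),
  poly_tail_length = c(85, 40, 6, 52),
  contig           = c("tx1", "tx2", "tx3", NA)
)

# Assign a preliminary code following the qc = TRUE rules described above.
qc_code <- function(start, len, contig) {
  if (start == 0) "BAC"           # bad coordinates
  else if (len < 10) "IRL"        # insufficient read length
  else if (is.na(contig)) "UNM"   # unmapped
  else "OK"                       # proceeds to signal analysis
}

summary_df$comments <- mapply(qc_code,
                              summary_df$poly_tail_start,
                              summary_df$poly_tail_length,
                              summary_df$contig)
summary_df$comments
# "OK" "BAC" "IRL" "UNM"
```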

Generating the Dorado summary file

The Dorado summary file is produced by running dorado summary on your aligned BAM file:

dorado summary aligned_reads.bam > dorado_summary.txt

The summary must include poly(A) tail estimates. Ensure that Dorado was run with --estimate-poly-a during basecalling.
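Before launching a long run, it can be worth verifying the summary in R. The snippet below is a suggested pre-flight check, not part of ninetails; it writes a toy two-line summary to a temporary file and confirms that all required columns are present.

```r
# Columns required by check_tails_dorado_DRS().
required <- c("read_id", "filename", "poly_tail_length",
              "poly_tail_start", "poly_tail_end")

# Toy stand-in for a real dorado summary file.
path <- tempfile(fileext = ".txt")
writeLines(paste(
  paste(required, collapse = "\t"),
  "r1\tfile1.pod5\t85\t120\t950", sep = "\n"), path)

dorado_summary <- read.delim(path)
missing <- setdiff(required, colnames(dorado_summary))
if (length(missing) > 0) {
  stop("Summary is missing required columns: ",
       paste(missing, collapse = ", "),
       " - was Dorado run with --estimate-poly-a?")
}
```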

Pipeline steps

The pipeline executes the following steps in order:

  1. Input validation (preprocess_inputs()): Validates file paths, checks required columns, verifies POD5 directory structure
  2. Summary processing (process_dorado_summary()): Reads and standardizes the Dorado summary file
  3. Quality filtering (filter_dorado_summary()): Applies QC filters and assigns preliminary classifications
  4. Signal extraction (extract_tails_from_pod5()): Extracts raw poly(A) tail signals from POD5 files via the Python pod5 module
  5. Feature engineering (create_tail_features_list_dorado()): Computes pseudomoves, signal statistics, and identifies candidate modification positions
  6. Segmentation (create_tail_chunk_list_dorado()): Splits tail signals into 100-point segments centered on candidate positions
  7. GAF creation (process_dorado_signal_files()): Converts segments to Gramian Angular Fields (GASF + GADF two-channel images)
  8. CNN classification (predict_gaf_classes()): Applies the trained convolutional neural network to classify each segment as A, C, G, or U
  9. Output creation (create_outputs_dorado()): Aggregates per-segment predictions into per-read classifications and produces final output tables
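For intuition about step 7, the sketch below builds the two GAF channels from scratch for one toy segment, following the standard Gramian Angular Field construction (Wang & Oates, 2015): rescale the signal to [-1, 1], map each point to an angle, then form the summation (GASF) and difference (GADF) fields. ninetails' own implementation may differ in details.

```r
# Stand-in for one 100-point tail segment (step 6 output).
segment <- sin(seq(0, 2 * pi, length.out = 100))

# Rescale to [-1, 1], then map each point to an angle.
x   <- 2 * (segment - min(segment)) / (max(segment) - min(segment)) - 1
phi <- acos(x)

# Two channels of the GAF "image" fed to the CNN in step 8.
gasf <- cos(outer(phi, phi, `+`))   # cos(phi_i + phi_j)
gadf <- sin(outer(phi, phi, `-`))   # sin(phi_i - phi_j)

dim(gasf)  # each segment becomes a 100 x 100 matrix per channel
```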

Output files

Two tab-separated files are saved in save_dir:

  • {prefix}read_classes.txt — one row per read
  • {prefix}nonadenosine_residues.txt — one row per detected non-A residue (decorated reads only)

The function also returns both tables as a named list:

class_data <- results$read_classes
residue_data <- results$nonadenosine_residues

Dorado cDNA Pipeline (Under Development)

Warning

This pipeline is currently under construction. Please do not use it for production analyses.

check_tails_dorado_cDNA() extends the DRS pipeline for complementary DNA (cDNA) sequencing data with BAM file processing and automatic read orientation classification.

results <- ninetails::check_tails_dorado_cDNA(
  bam_file       = "path/to/aligned_cdna.bam",
  dorado_summary = "path/to/dorado_summary.txt",
  pod5_dir       = "path/to/pod5_dir/",
  num_cores      = 2,
  qc             = TRUE,
  save_dir       = "~/output/",
  prefix         = "experiment1",
  part_size      = 40000,
  cleanup        = FALSE
)

Additional parameter (vs DRS)

Parameter Type Description
bam_file character Path to BAM file containing aligned cDNA reads with basecalled sequences

All other parameters are identical to check_tails_dorado_DRS().

cDNA-specific processing

The cDNA pipeline includes additional steps not present in the DRS pipeline:

  1. BAM splitting (split_bam_file_cdna()): Splits the BAM file into manageable chunks
  2. Sequence extraction (extract_data_from_bam()): Extracts basecalled sequences from BAM records
  3. Orientation detection (detect_orientation_single() / detect_orientation_multiple()): Classifies each read as polyA, polyT, or unidentified using Dorado-style SSP/VNP primer matching against basecalled sequences
  4. Separate processing: Poly(A) and poly(T) reads are processed independently through the signal analysis pipeline
  5. Result merging (merge_cdna_results()): Combines poly(A) and poly(T) results into unified output tables

Output

Output tables include an additional tail_type column (polyA or polyT) indicating the read orientation.


Guppy Legacy Pipeline

Warning

This pipeline supports only data basecalled with Guppy version 6.0.0 or lower. It is maintained for backward compatibility and will not be further optimized. For new analyses, use the Dorado DRS pipeline.

check_tails_guppy() is the legacy function for DRS data basecalled with Guppy using fast5 format and Nanopolish poly(A) coordinates.

results <- ninetails::check_tails_guppy(
  polya_data = system.file('extdata', 'test_data',
                           'nanopolish_output.tsv',
                           package = 'ninetails'),
  sequencing_summary = system.file('extdata', 'test_data',
                                   'sequencing_summary.txt',
                                   package = 'ninetails'),
  workspace = system.file('extdata', 'test_data',
                          'basecalled_fast5',
                          package = 'ninetails'),
  num_cores = 2,
  basecall_group = 'Basecall_1D_000',
  pass_only = TRUE,
  save_dir = '~/output/'
)

Parameters

Parameter Type Default Description
polya_data character/data.frame required Path to Nanopolish polya output or tailfindr output
sequencing_summary character/data.frame required Path to sequencing summary file
workspace character required Path to directory with multi-fast5 files
num_cores integer 1 Number of CPU cores
basecall_group character "Basecall_1D_000" Fast5 hierarchy level
pass_only logical TRUE Include only “PASS” reads from Nanopolish QC
qc logical TRUE Label terminal positions with “-WARN” suffix
save_dir character required Output directory
part_size integer 1000000 Max rows per processing chunk

Note: For tailfindr compatibility, see vignette("tailfindr").


Output structure

All pipelines return a named list with two data frames:

read_classes

Complete accounting of all reads in the analysis.

Column Description
readname Unique read identifier
contig Reference sequence/transcript
polya_length Estimated poly(A) tail length (nt)
qc_tag Quality tag from basecaller QC
class Classification: decorated, blank, or unclassified
comments 3-letter code explaining the classification (see below)
tail_type polyA or polyT (cDNA pipeline only)
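A quick way to see how class and comments relate is to cross-tabulate them. The table below uses invented reads with the columns described above; with real results you would tabulate results$read_classes instead.

```r
# Toy read_classes table (invented data, columns as documented above).
read_classes <- data.frame(
  readname = c("r1", "r2", "r3", "r4"),
  class    = c("decorated", "blank", "blank", "unclassified"),
  comments = c("YAY", "MAU", "MPU", "IRL")
)

# Cross-tabulate classification against its explanatory code.
tab <- table(read_classes$class, read_classes$comments)
tab
```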

nonadenosine_residues

Modification-level detail for decorated reads only. Each row represents one detected non-adenosine residue.

Column Description
readname Unique read identifier
contig Reference sequence/transcript
prediction Predicted nucleotide: C, G, or U
est_nonA_pos Estimated position within the poly(A) tail (nucleotides from the 3’ end)
polya_length Total tail length (nt)
qc_tag Quality tag
tail_type polyA or polyT (cDNA pipeline only)

Note: A single read can have multiple rows in nonadenosine_residues if it contains more than one non-A residue.
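The one-row-per-residue layout means counting rows per readname gives the number of non-A residues in each decorated read. A toy illustration with invented data (columns as described above):

```r
# Toy nonadenosine_residues table: read r1 carries two residues.
residues <- data.frame(
  readname     = c("r1", "r1", "r2"),
  prediction   = c("C", "U", "G"),
  est_nonA_pos = c(12, 40, 7)
)

# Residues per decorated read.
per_read <- as.data.frame(table(residues$readname),
                          responseName = "n_residues")
names(per_read)[1] <- "readname"
per_read
```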


Classification codes

Dorado pipelines (DRS and cDNA)

Class Code Explanation
decorated YAY Non-adenosine residue detected
blank MAU No signal deviation; pure poly(A) tail
blank MPU Signal deviation present but predicted as adenosine
unclassified IRL Poly(A) tail too short (< 10 nt)
unclassified UNM Read unmapped
unclassified BAC Invalid coordinates (poly_tail_start = 0)

Guppy legacy pipeline

Class Code Explanation
decorated YAY Move transition present, non-A detected
blank MAU Move transition absent, non-A undetected
blank MPU Move transition present, non-A undetected
unclassified QCF Nanopolish QC failed
unclassified NIN Not included (pass_only = TRUE)
unclassified IRL Insufficient read length

Memory considerations

Warning

Signal transformations can place a heavy load on memory, especially for full sequencing runs with millions of reads.

To manage memory:

  • Adjust part_size to control reads per chunk. Default is 40,000 (Dorado) or 1,000,000 (Guppy). Reduce to 10,000–20,000 if experiencing memory issues.
  • Use cleanup = TRUE to remove intermediate files after processing completes.
  • Monitor system memory during the GAF creation and CNN prediction steps, which are the most memory-intensive.
  • For very large runs (>1M reads), consider splitting input files manually and processing in batches.
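A minimal sketch of the manual-batching idea from the last bullet: since the parameter table notes that dorado_summary accepts a data.frame, the summary can be split in R and each batch run separately with its own prefix. The sizes below are toy values; a real run would use a batch size on the order of several hundred thousand reads.

```r
# Toy summary: 1200 reads split into batches of 500.
toy_summary <- data.frame(read_id = sprintf("r%04d", 1:1200))
batch_size  <- 500
batches <- split(toy_summary,
                 ceiling(seq_len(nrow(toy_summary)) / batch_size))

# Each element could then be passed to the pipeline, e.g.
#   dorado_summary = batches[[i]], prefix = paste0("batch", i, "_")
sapply(batches, nrow)
# 500 500 200
```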

Performance tips

  • Parallelization: Set num_cores to the number of available CPU cores minus one. The pipeline parallelizes signal extraction, feature engineering, and GAF creation across chunks.
  • Disk space: Intermediate files (signal data, feature lists, GAF matrices) can be large. Ensure sufficient free disk space in save_dir. Use cleanup = TRUE to remove intermediates automatically.
  • GPU acceleration: If TensorFlow is configured with GPU support, the CNN classification step benefits from GPU acceleration. This is the single largest speedup available.
  • SSD storage: Using SSD storage for pod5_dir and save_dir significantly reduces I/O bottlenecks during signal extraction.
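For the num_cores setting, base R's parallel package can report the available cores. A common pattern (a suggestion, not a ninetails requirement) is to leave one core free for the operating system and the R session itself:

```r
# Detect available cores and keep one free; detectCores() can return NA
# on some platforms, so fall back to a single core.
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
```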

Next steps

After running the pipeline:

  1. Postprocess: Load, annotate, merge, and summarize results — vignette("postprocessing")
  2. Visualize: Generate static plots — vignette("plotting")
  3. Inspect signals: Examine raw squiggles — vignette("signal_inspection")
  4. Explore interactively: Launch the Shiny dashboard — vignette("shiny_app")