This vignette describes the three analysis pipelines available in ninetails for detecting non-adenosine residues in poly(A) tails. For working with the output data, see vignette("postprocessing"). For visualization, see vignette("plotting") and vignette("shiny_app").

check_tails_dorado_DRS() is the recommended function for direct RNA sequencing (DRS) data basecalled with Dorado ≥ 1.0.0 using POD5 format.

results <- ninetails::check_tails_dorado_DRS(
  dorado_summary = "path/to/dorado_alignment_summary.txt",
  pod5_dir       = "path/to/pod5_dir/",
  num_cores      = 2,
  qc             = TRUE,
  save_dir       = "~/output/",
  prefix         = "experiment1",
  part_size      = 40000,
  cleanup        = FALSE
)

Parameters

Parameter Type Default Description
dorado_summary character/data.frame required Path to Dorado summary file. Must contain: read_id, filename, poly_tail_length, poly_tail_start, poly_tail_end
pod5_dir character required Path to directory containing POD5 files
num_cores integer 1 Number of CPU cores for parallel processing
qc logical TRUE Apply quality control filtering (see below)
save_dir character required Output directory for results files
prefix character "" Optional prefix for output filenames (e.g., "experiment1_")
part_size integer 40000 Maximum reads per processing chunk (reduce if memory limited)
cleanup logical FALSE Remove intermediate files after completion

Quality control (qc = TRUE)

When qc = TRUE, the pipeline applies the following filters before signal analysis:

  • Reads with poly_tail_start = 0 are classified as BAC (bad coordinates) and excluded from signal processing
  • Reads with poly_tail_length < 10 are classified as IRL (insufficient read length)
  • Unmapped reads (no contig assignment) are classified as UNM

These reads are still included in the output read_classes table but are marked as unclassified.
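The filter rules above can be sketched as plain R. The snippet below is an illustrative reimplementation of the qc = TRUE logic on a toy summary table, not ninetails' internal code; the column names follow the summary requirements listed earlier, and the data are invented.

```r
# Toy Dorado summary rows (invented data).
summary_df <- data.frame(
  read_id          = c("r1", "r2", "r3", "r4"),
  poly_tail_start  = c(120, 0, 95, 110),
  poly_tail_length = c(85, 40, 6, 52),
  contig           = c("tx1", "tx2", "tx3", NA)
)

# Assign a preliminary code following the qc = TRUE rules described above.
qc_code <- function(start, len, contig) {
  if (start == 0) "BAC"           # bad coordinates
  else if (len < 10) "IRL"        # insufficient read length
  else if (is.na(contig)) "UNM"   # unmapped
  else "OK"                       # proceeds to signal analysis
}

summary_df$comments <- mapply(qc_code,
                              summary_df$poly_tail_start,
                              summary_df$poly_tail_length,
                              summary_df$contig)
summary_df$comments
# "OK" "BAC" "IRL" "UNM"
```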

Generating the Dorado summary file

The Dorado summary file is produced by running dorado summary on your aligned BAM file:

dorado summary aligned_reads.bam > dorado_summary.txt

The summary must include poly(A) tail estimates. Ensure that Dorado was run with --estimate-poly-a during basecalling.
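Before launching a long run, it can be worth verifying the summary in R. The snippet below is a suggested pre-flight check, not part of ninetails; it writes a toy two-line summary to a temporary file and confirms that all required columns are present.

```r
# Columns required by check_tails_dorado_DRS().
required <- c("read_id", "filename", "poly_tail_length",
              "poly_tail_start", "poly_tail_end")

# Toy stand-in for a real dorado summary file.
path <- tempfile(fileext = ".txt")
writeLines(paste(
  paste(required, collapse = "\t"),
  "r1\tfile1.pod5\t85\t120\t950", sep = "\n"), path)

dorado_summary <- read.delim(path)
missing <- setdiff(required, colnames(dorado_summary))
if (length(missing) > 0) {
  stop("Summary is missing required columns: ",
       paste(missing, collapse = ", "),
       " - was Dorado run with --estimate-poly-a?")
}
```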

Pipeline steps

The pipeline executes the following steps in order:

  1. Input validation (preprocess_inputs()): Validates file paths, checks required columns, verifies POD5 directory structure
  2. Summary processing (process_dorado_summary()): Reads and standardizes the Dorado summary file
  3. Quality filtering (filter_dorado_summary()): Applies QC filters and assigns preliminary classifications
  4. Signal extraction (extract_tails_from_pod5()): Extracts raw poly(A) tail signals from POD5 files via the Python pod5 module
  5. Feature engineering (create_tail_features_list_dorado()): Computes pseudomoves, signal statistics, and identifies candidate modification positions
  6. Segmentation (create_tail_chunk_list_dorado()): Splits tail signals into 100-point segments centered on candidate positions
  7. GAF creation (process_dorado_signal_files()): Converts segments to Gramian Angular Fields (GASF + GADF two-channel images)
  8. CNN classification (predict_gaf_classes()): Applies the trained convolutional neural network to classify each segment as A, C, G, or U
  9. Output creation (create_outputs_dorado()): Aggregates per-segment predictions into per-read classifications and produces final output tables
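For intuition about step 7, the sketch below builds the two GAF channels from scratch for one toy segment, following the standard Gramian Angular Field construction (Wang & Oates, 2015): rescale the signal to [-1, 1], map each point to an angle, then form the summation (GASF) and difference (GADF) fields. ninetails' own implementation may differ in details.

```r
# Stand-in for one 100-point tail segment (step 6 output).
segment <- sin(seq(0, 2 * pi, length.out = 100))

# Rescale to [-1, 1], then map each point to an angle.
x   <- 2 * (segment - min(segment)) / (max(segment) - min(segment)) - 1
phi <- acos(x)

# Two channels of the GAF "image" fed to the CNN in step 8.
gasf <- cos(outer(phi, phi, `+`))   # cos(phi_i + phi_j)
gadf <- sin(outer(phi, phi, `-`))   # sin(phi_i - phi_j)

dim(gasf)  # each segment becomes a 100 x 100 matrix per channel
```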

Output files

Two tab-separated files are saved in save_dir:

  • {prefix}read_classes.txt — one row per read
  • {prefix}nonadenosine_residues.txt — one row per detected non-A residue (decorated reads only)

The function also returns both tables as a named list:

class_data <- results$read_classes
residue_data <- results$nonadenosine_residues

Dorado cDNA Pipeline (Under Development)

Warning

This pipeline is currently under construction. Please do not use it for production analyses.

check_tails_dorado_cDNA() extends the DRS pipeline for complementary DNA (cDNA) sequencing data with BAM file processing and automatic read orientation classification.

results <- ninetails::check_tails_dorado_cDNA(
  bam_file       = "path/to/aligned_cdna.bam",
  dorado_summary = "path/to/dorado_summary.txt",
  pod5_dir       = "path/to/pod5_dir/",
  num_cores      = 2,
  qc             = TRUE,
  save_dir       = "~/output/",
  prefix         = "experiment1",
  part_size      = 40000,
  cleanup        = FALSE
)

Additional parameter (vs DRS)

Parameter Type Description
bam_file character Path to BAM file containing aligned cDNA reads with basecalled sequences

All other parameters are identical to check_tails_dorado_DRS().

cDNA-specific processing

The cDNA pipeline includes additional steps not present in the DRS pipeline:

  1. BAM splitting (split_bam_file_cdna()): Splits the BAM file into manageable chunks
  2. Sequence extraction (extract_data_from_bam()): Extracts basecalled sequences from BAM records
  3. Orientation detection (detect_orientation_single() / detect_orientation_multiple()): Classifies each read as polyA, polyT, or unidentified using Dorado-style SSP/VNP primer matching against basecalled sequences
  4. Separate processing: Poly(A) and poly(T) reads are processed independently through the signal analysis pipeline
  5. Result merging (merge_cdna_results()): Combines poly(A) and poly(T) results into unified output tables

Output

Output tables include an additional tail_type column (polyA or polyT) indicating the read orientation.


Guppy Legacy Pipeline

Warning

This pipeline supports only data basecalled with Guppy version 6.0.0 or lower. It is maintained for backward compatibility and will not be further optimized. For new analyses, use the Dorado DRS pipeline.

check_tails_guppy() is the legacy function for DRS data basecalled with Guppy using fast5 format and Nanopolish poly(A) coordinates.

results <- ninetails::check_tails_guppy(
  polya_data = system.file('extdata', 'test_data',
                           'nanopolish_output.tsv',
                           package = 'ninetails'),
  sequencing_summary = system.file('extdata', 'test_data',
                                   'sequencing_summary.txt',
                                   package = 'ninetails'),
  workspace = system.file('extdata', 'test_data',
                          'basecalled_fast5',
                          package = 'ninetails'),
  num_cores = 2,
  basecall_group = 'Basecall_1D_000',
  pass_only = TRUE,
  save_dir = '~/output/'
)

Parameters

Parameter Type Default Description
polya_data character/data.frame required Path to Nanopolish polya output or tailfindr output
sequencing_summary character/data.frame required Path to sequencing summary file
workspace character required Path to directory with multi-fast5 files
num_cores integer 1 Number of CPU cores
basecall_group character "Basecall_1D_000" Fast5 hierarchy level
pass_only logical TRUE Include only “PASS” reads from Nanopolish QC
qc logical TRUE Label terminal positions with “-WARN” suffix
save_dir character required Output directory
part_size integer 1000000 Max rows per processing chunk

Note: For tailfindr compatibility, see vignette("tailfindr").


Output structure

All pipelines return a named list with two data frames:

read_classes

Complete accounting of all reads in the analysis.

Column Description
readname Unique read identifier
contig Reference sequence/transcript
polya_length Estimated poly(A) tail length (nt)
qc_tag Quality tag from basecaller QC
class Classification: decorated, blank, or unclassified
comments 3-letter code explaining the classification (see below)
tail_type polyA or polyT (cDNA pipeline only)
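A quick way to see how class and comments relate is to cross-tabulate them. The table below uses invented reads with the columns described above; with real results you would tabulate results$read_classes instead.

```r
# Toy read_classes table (invented data, columns as documented above).
read_classes <- data.frame(
  readname = c("r1", "r2", "r3", "r4"),
  class    = c("decorated", "blank", "blank", "unclassified"),
  comments = c("YAY", "MAU", "MPU", "IRL")
)

# Cross-tabulate classification against its explanatory code.
tab <- table(read_classes$class, read_classes$comments)
tab
```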

nonadenosine_residues

Modification-level detail for decorated reads only. Each row represents one detected non-adenosine residue.

Column Description
readname Unique read identifier
contig Reference sequence/transcript
prediction Predicted nucleotide: C, G, or U
est_nonA_pos Estimated position within the poly(A) tail (nucleotides from the 3’ end)
polya_length Total tail length (nt)
qc_tag Quality tag
tail_type polyA or polyT (cDNA pipeline only)

Note: A single read can have multiple rows in nonadenosine_residues if it contains more than one non-A residue.
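The one-row-per-residue layout means counting rows per readname gives the number of non-A residues in each decorated read. A toy illustration with invented data (columns as described above):

```r
# Toy nonadenosine_residues table: read r1 carries two residues.
residues <- data.frame(
  readname     = c("r1", "r1", "r2"),
  prediction   = c("C", "U", "G"),
  est_nonA_pos = c(12, 40, 7)
)

# Residues per decorated read.
per_read <- as.data.frame(table(residues$readname),
                          responseName = "n_residues")
names(per_read)[1] <- "readname"
per_read
```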


Classification codes

Dorado pipelines (DRS and cDNA)

Class Code Explanation
decorated YAY Non-adenosine residue detected
blank MAU No signal deviation; pure poly(A) tail
blank MPU Signal deviation present but predicted as adenosine
unclassified IRL Poly(A) tail too short (< 10 nt)
unclassified UNM Read unmapped
unclassified BAC Invalid coordinates (poly_tail_start = 0)

Guppy legacy pipeline

Class Code Explanation
decorated YAY Move transition present, non-A detected
blank MAU Move transition absent, non-A undetected
blank MPU Move transition present, non-A undetected
unclassified QCF Nanopolish QC failed
unclassified NIN Not included (pass_only = TRUE)
unclassified IRL Insufficient read length

Memory considerations

Warning

Signal transformations can place a heavy load on memory, especially for full sequencing runs with millions of reads.

To manage memory:

  • Adjust part_size to control reads per chunk. Default is 40,000 (Dorado) or 1,000,000 (Guppy). Reduce to 10,000–20,000 if experiencing memory issues.
  • Use cleanup = TRUE to remove intermediate files after processing completes.
  • Monitor system memory during the GAF creation and CNN prediction steps, which are the most memory-intensive.
  • For very large runs (>1M reads), consider splitting input files manually and processing in batches.
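A minimal sketch of the manual-batching idea from the last bullet: since the parameter table notes that dorado_summary accepts a data.frame, the summary can be split in R and each batch run separately with its own prefix. The sizes below are toy values; a real run would use a batch size on the order of several hundred thousand reads.

```r
# Toy summary: 1200 reads split into batches of 500.
toy_summary <- data.frame(read_id = sprintf("r%04d", 1:1200))
batch_size  <- 500
batches <- split(toy_summary,
                 ceiling(seq_len(nrow(toy_summary)) / batch_size))

# Each element could then be passed to the pipeline, e.g.
#   dorado_summary = batches[[i]], prefix = paste0("batch", i, "_")
sapply(batches, nrow)
# 500 500 200
```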

Performance tips

  • Parallelization: Set num_cores to the number of available CPU cores minus one. The pipeline parallelizes signal extraction, feature engineering, and GAF creation across chunks.
  • Disk space: Intermediate files (signal data, feature lists, GAF matrices) can be large. Ensure sufficient free disk space in save_dir. Use cleanup = TRUE to remove intermediates automatically.
  • GPU acceleration: If TensorFlow is configured with GPU support, the CNN classification step benefits from GPU acceleration. This is the single largest speedup available.
  • SSD storage: Using SSD storage for pod5_dir and save_dir significantly reduces I/O bottlenecks during signal extraction.
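For the num_cores setting, base R's parallel package can report the available cores. A common pattern (a suggestion, not a ninetails requirement) is to leave one core free for the operating system and the R session itself:

```r
# Detect available cores and keep one free; detectCores() can return NA
# on some platforms, so fall back to a single core.
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
```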

Next steps

After running the pipeline:

  1. Postprocess: Load, annotate, merge, and summarize results — vignette("postprocessing")
  2. Visualize: Generate static plots — vignette("plotting")
  3. Inspect signals: Examine raw squiggles — vignette("signal_inspection")
  4. Explore interactively: Launch the Shiny dashboard — vignette("shiny_app")