This vignette describes the three analysis pipelines available in ninetails for detecting non-adenosine residues in poly(A) tails. For working with the output data, see `vignette("postprocessing")`. For visualization, see `vignette("plotting")` and `vignette("shiny_app")`.
## Dorado DRS Pipeline (Recommended)

`check_tails_dorado_DRS()` is the recommended function for direct RNA sequencing (DRS) data basecalled with Dorado ≥ 1.0.0 using POD5 format.
```r
results <- ninetails::check_tails_dorado_DRS(
  dorado_summary = "path/to/dorado_alignment_summary.txt",
  pod5_dir = "path/to/pod5_dir/",
  num_cores = 2,
  qc = TRUE,
  save_dir = "~/output/",
  prefix = "experiment1",
  part_size = 40000,
  cleanup = FALSE
)
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dorado_summary` | character/data.frame | required | Path to Dorado summary file. Must contain: `read_id`, `filename`, `poly_tail_length`, `poly_tail_start`, `poly_tail_end` |
| `pod5_dir` | character | required | Path to directory containing POD5 files |
| `num_cores` | integer | 1 | Number of CPU cores for parallel processing |
| `qc` | logical | TRUE | Apply quality control filtering (see below) |
| `save_dir` | character | required | Output directory for results files |
| `prefix` | character | "" | Optional prefix for output filenames (e.g., "experiment1_") |
| `part_size` | integer | 40000 | Maximum reads per processing chunk (reduce if memory limited) |
| `cleanup` | logical | FALSE | Remove intermediate files after completion |
### Quality control (`qc = TRUE`)

When `qc = TRUE`, the pipeline applies the following filters before signal analysis:

- Reads with `poly_tail_start = 0` are classified as `BAC` (bad coordinates) and excluded from signal processing
- Reads with `poly_tail_length < 10` are classified as `IRL` (insufficient read length)
- Unmapped reads (no `contig` assignment) are classified as `UNM`

These reads are still included in the output `read_classes` table but are marked as `unclassified`.
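As an illustration only, the QC rules above can be mimicked on a summary data frame with base R. The real filtering happens inside `filter_dorado_summary()`, and the rule precedence shown here is an assumption:

```r
# Toy Dorado summary with the columns the pipeline expects
summary_df <- data.frame(
  read_id = c("r1", "r2", "r3", "r4"),
  contig = c("tx1", "tx2", NA, "tx3"),
  poly_tail_start = c(150, 0, 200, 180),
  poly_tail_length = c(85, 40, 60, 5)
)

# Mirror the documented QC rules: BAC, then UNM, then IRL;
# everything else proceeds to signal analysis
summary_df$qc_tag <- with(summary_df, ifelse(
  poly_tail_start == 0, "BAC",
  ifelse(is.na(contig), "UNM",
         ifelse(poly_tail_length < 10, "IRL", "PASS"))
))

summary_df$qc_tag
#> [1] "PASS" "BAC"  "UNM"  "IRL"
```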
### Generating the Dorado summary file

The Dorado summary file is produced by running `dorado summary` on your aligned BAM file (e.g., `dorado summary aligned.bam > dorado_alignment_summary.txt`). The summary must include poly(A) tail estimates, so ensure that Dorado was run with `--estimate-poly-a` during basecalling.
### Pipeline steps

The pipeline executes the following steps in order:

1. **Input validation** (`preprocess_inputs()`): Validates file paths, checks required columns, verifies POD5 directory structure
2. **Summary processing** (`process_dorado_summary()`): Reads and standardizes the Dorado summary file
3. **Quality filtering** (`filter_dorado_summary()`): Applies QC filters and assigns preliminary classifications
4. **Signal extraction** (`extract_tails_from_pod5()`): Extracts raw poly(A) tail signals from POD5 files via the Python `pod5` module
5. **Feature engineering** (`create_tail_features_list_dorado()`): Computes pseudomoves, signal statistics, and identifies candidate modification positions
6. **Segmentation** (`create_tail_chunk_list_dorado()`): Splits tail signals into 100-point segments centered on candidate positions
7. **GAF creation** (`process_dorado_signal_files()`): Converts segments to Gramian Angular Fields (GASF + GADF two-channel images)
8. **CNN classification** (`predict_gaf_classes()`): Applies the trained convolutional neural network to classify each segment as A, C, G, or U
9. **Output creation** (`create_outputs_dorado()`): Aggregates per-segment predictions into per-read classifications and produces final output tables
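The segmentation step (step 6) can be sketched conceptually: cut fixed-width 100-point windows centered on each candidate position, clamped to the signal boundaries. This toy helper is illustrative only, not the package's implementation:

```r
# Illustrative sketch of the segmentation idea: cut a raw tail signal
# into fixed 100-point windows centered on candidate positions.
# The actual implementation lives in ninetails::create_tail_chunk_list_dorado().
segment_signal <- function(signal, candidates, width = 100) {
  half <- width %/% 2
  lapply(candidates, function(pos) {
    # Clamp the window so it stays fully inside the signal
    start <- max(1, min(pos - half, length(signal) - width + 1))
    signal[start:(start + width - 1)]
  })
}

signal <- sin(seq(0, 20, length.out = 500))   # stand-in for a raw squiggle
chunks <- segment_signal(signal, candidates = c(60, 480))

lengths(chunks)
#> [1] 100 100
```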
### Output files

Two tab-separated files are saved in `save_dir`:

- `{prefix}read_classes.txt` — one row per read
- `{prefix}nonadenosine_residues.txt` — one row per detected non-A residue (decorated reads only)

The function also returns both tables as a named list:

```r
class_data <- results$read_classes
residue_data <- results$nonadenosine_residues
```

## Dorado cDNA Pipeline (Under Development)
> **Warning**
> This pipeline is currently under construction. Please do not use it for production analyses.

`check_tails_dorado_cDNA()` extends the DRS pipeline for complementary DNA (cDNA) sequencing data with BAM file processing and automatic read orientation classification.
```r
results <- ninetails::check_tails_dorado_cDNA(
  bam_file = "path/to/aligned_cdna.bam",
  dorado_summary = "path/to/dorado_summary.txt",
  pod5_dir = "path/to/pod5_dir/",
  num_cores = 2,
  qc = TRUE,
  save_dir = "~/output/",
  prefix = "experiment1",
  part_size = 40000,
  cleanup = FALSE
)
```

### Additional parameter (vs DRS)
| Parameter | Type | Description |
|---|---|---|
| `bam_file` | character | Path to BAM file containing aligned cDNA reads with basecalled sequences |

All other parameters are identical to `check_tails_dorado_DRS()`.
### cDNA-specific processing

The cDNA pipeline includes additional steps not present in the DRS pipeline:

1. **BAM splitting** (`split_bam_file_cdna()`): Splits the BAM file into manageable chunks
2. **Sequence extraction** (`extract_data_from_bam()`): Extracts basecalled sequences from BAM records
3. **Orientation detection** (`detect_orientation_single()` / `detect_orientation_multiple()`): Classifies each read as `polyA`, `polyT`, or `unidentified` using Dorado-style SSP/VNP primer matching against basecalled sequences
4. **Separate processing**: Poly(A) and poly(T) reads are processed independently through the signal analysis pipeline
5. **Result merging** (`merge_cdna_results()`): Combines poly(A) and poly(T) results into unified output tables
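To make the orientation idea concrete, here is a toy heuristic that tags reads by terminal homopolymer runs. Note that ninetails' actual `detect_orientation_*()` functions use Dorado-style SSP/VNP primer matching, not this shortcut:

```r
# Toy orientation heuristic for cDNA reads: a read ending in a long A run
# is tagged polyA, one starting with a long T run is tagged polyT.
# Purely illustrative; real detection matches SSP/VNP primer sequences.
classify_orientation <- function(seq, min_run = 10) {
  if (grepl(paste0("A{", min_run, ",}$"), seq)) return("polyA")
  if (grepl(paste0("^T{", min_run, ",}"), seq)) return("polyT")
  "unidentified"
}

classify_orientation(paste0("ACGTACGT", strrep("A", 25)))  # "polyA"
classify_orientation(paste0(strrep("T", 25), "ACGTACGT"))  # "polyT"
classify_orientation("ACGTACGTACGT")                       # "unidentified"
```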
## Guppy Legacy Pipeline

> **Warning**
> This pipeline works only with Guppy basecaller versions 6.0.0 and lower. It is maintained for backward compatibility and will not be further optimised. For new analyses, use the Dorado DRS pipeline.

`check_tails_guppy()` is the legacy function for DRS data basecalled with Guppy using fast5 format and Nanopolish poly(A) coordinates.
```r
results <- ninetails::check_tails_guppy(
  polya_data = system.file('extdata', 'test_data',
                           'nanopolish_output.tsv',
                           package = 'ninetails'),
  sequencing_summary = system.file('extdata', 'test_data',
                                   'sequencing_summary.txt',
                                   package = 'ninetails'),
  workspace = system.file('extdata', 'test_data',
                          'basecalled_fast5',
                          package = 'ninetails'),
  num_cores = 2,
  basecall_group = 'Basecall_1D_000',
  pass_only = TRUE,
  save_dir = '~/output/'
)
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `polya_data` | character/data.frame | required | Path to Nanopolish polya output or tailfindr output |
| `sequencing_summary` | character/data.frame | required | Path to sequencing summary file |
| `workspace` | character | required | Path to directory with multi-fast5 files |
| `num_cores` | integer | 1 | Number of CPU cores |
| `basecall_group` | character | "Basecall_1D_000" | Fast5 hierarchy level |
| `pass_only` | logical | TRUE | Include only "PASS" reads from Nanopolish QC |
| `qc` | logical | TRUE | Label terminal positions with "-WARN" suffix |
| `save_dir` | character | required | Output directory |
| `part_size` | integer | 1000000 | Max rows per processing chunk |
Note: For tailfindr compatibility, see `vignette("tailfindr")`.
## Output structure

All pipelines return a named list with two data frames:

### read_classes

Complete accounting of all reads in the analysis.
| Column | Description |
|---|---|
| `readname` | Unique read identifier |
| `contig` | Reference sequence/transcript |
| `polya_length` | Estimated poly(A) tail length (nt) |
| `qc_tag` | Quality tag from basecaller QC |
| `class` | Classification: `decorated`, `blank`, or `unclassified` |
| `comments` | 3-letter code explaining the classification (see below) |
| `tail_type` | `polyA` or `polyT` (cDNA pipeline only) |
### nonadenosine_residues

Modification-level detail for `decorated` reads only. Each row represents one detected non-adenosine residue.
| Column | Description |
|---|---|
| `readname` | Unique read identifier |
| `contig` | Reference sequence/transcript |
| `prediction` | Predicted nucleotide: `C`, `G`, or `U` |
| `est_nonA_pos` | Estimated position within the poly(A) tail (nucleotides from the 3' end) |
| `polya_length` | Total tail length (nt) |
| `qc_tag` | Quality tag |
| `tail_type` | `polyA` or `polyT` (cDNA pipeline only) |
Note: A single read can have multiple rows in `nonadenosine_residues` if it contains more than one non-A residue.
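For instance, residues per read can be tallied directly from this table. The mock data frame below simply reuses the documented columns:

```r
# Mock nonadenosine_residues table with the documented columns;
# r1 carries two non-A residues, r2 carries one
residues <- data.frame(
  readname = c("r1", "r1", "r2"),
  prediction = c("C", "U", "G"),
  est_nonA_pos = c(12, 47, 30),
  polya_length = c(88, 88, 120)
)

# One row per residue, so counting rows per read gives residues per read
residues_per_read <- table(residues$readname)
residues_per_read
#> r1 r2
#>  2  1
```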
## Classification codes

### Dorado pipelines (DRS and cDNA)

| Class | Code | Explanation |
|---|---|---|
| `decorated` | `YAY` | Non-adenosine residue detected |
| `blank` | `MAU` | No signal deviation; pure poly(A) tail |
| `blank` | `MPU` | Signal deviation present but predicted as adenosine |
| `unclassified` | `IRL` | Poly(A) tail too short (< 10 nt) |
| `unclassified` | `UNM` | Read unmapped |
| `unclassified` | `BAC` | Invalid coordinates (`poly_tail_start = 0`) |
### Guppy legacy pipeline

| Class | Code | Explanation |
|---|---|---|
| `decorated` | `YAY` | Move transition present, non-A detected |
| `blank` | `MAU` | Move transition absent, non-A undetected |
| `blank` | `MPU` | Move transition present, non-A undetected |
| `unclassified` | `QCF` | Nanopolish QC failed |
| `unclassified` | `NIN` | Not included (`pass_only = TRUE`) |
| `unclassified` | `IRL` | Insufficient read length |
## Memory considerations

> **Warning**
> Signal transformations can place a heavy load on memory, especially for full sequencing runs with millions of reads.

To manage memory:

- Adjust `part_size` to control reads per chunk. Default is 40,000 (Dorado) or 1,000,000 (Guppy). Reduce to 10,000–20,000 if experiencing memory issues.
- Use `cleanup = TRUE` to remove intermediate files after processing completes.
- Monitor system memory during the GAF creation and CNN prediction steps, which are the most memory-intensive.
- For very large runs (>1M reads), consider splitting input files manually and processing in batches.
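Putting the first two points together, a memory-lean invocation might look like this (paths are placeholders and the parameter values are suggestions, not requirements):

```r
# Memory-lean invocation: smaller chunks plus automatic cleanup.
# Replace the paths with your own data.
results <- ninetails::check_tails_dorado_DRS(
  dorado_summary = "path/to/dorado_alignment_summary.txt",
  pod5_dir = "path/to/pod5_dir/",
  num_cores = 2,
  qc = TRUE,
  save_dir = "~/output/",
  part_size = 10000,   # reduced from the 40,000 default
  cleanup = TRUE       # drop intermediates once processing completes
)
```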
## Performance tips

- **Parallelization**: Set `num_cores` to the number of available CPU cores minus one. The pipeline parallelizes signal extraction, feature engineering, and GAF creation across chunks.
- **Disk space**: Intermediate files (signal data, feature lists, GAF matrices) can be large. Ensure sufficient free disk space in `save_dir`. Use `cleanup = TRUE` to remove intermediates automatically.
- **GPU acceleration**: If TensorFlow is configured with GPU support, the CNN classification step benefits from GPU acceleration. This is the single largest speedup available.
- **SSD storage**: Using SSD storage for `pod5_dir` and `save_dir` significantly reduces I/O bottlenecks during signal extraction.
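The `num_cores` recommendation above can be computed with base R's `parallel` package:

```r
library(parallel)

# Use the available cores minus one, but never fewer than one.
# na.rm guards against detectCores() returning NA on some platforms.
n_cores <- max(1L, detectCores() - 1L, na.rm = TRUE)
n_cores
```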
## Next steps

After running the pipeline:

1. **Postprocess**: Load, annotate, merge, and summarize results — `vignette("postprocessing")`
2. **Visualize**: Generate static plots — `vignette("plotting")`
3. **Inspect signals**: Examine raw squiggles — `vignette("signal_inspection")`
4. **Explore interactively**: Launch the Shiny dashboard — `vignette("shiny_app")`
