Skip to contents

This vignette describes functions for loading, processing, annotating, and analyzing ninetails output data. All functions documented here operate on the two output tables produced by the main pipelines (check_tails_dorado_DRS(), check_tails_dorado_cDNA(), or check_tails_guppy()):

  • read_classes: Per-read classification (decorated/blank/unclassified)
  • nonadenosine_residues: Modification-level detail for decorated reads

For visualization of these data, see vignette("plotting") and vignette("shiny_app").

Reading files into R

Reading single files

# Read read_classes output
class_path <- "/path/to/read_classes.tsv"
class_data <- ninetails::read_class_single(class_path)

# Read nonadenosine_residues output
residue_path <- "/path/to/nonadenosine_residues.tsv"
residue_data <- ninetails::read_residue_single(residue_path)
Function Input Output
read_class_single() Path to a single read_classes file Data frame with readname, contig, polya_length, qc_tag, class, comments
read_residue_single() Path to a single nonadenosine_residues file Data frame with readname, contig, prediction, est_nonA_pos, polya_length, qc_tag

Reading multiple files

For experiments with multiple samples, use read_class_multiple() and read_residue_multiple() with a metadata table. The metadata table must contain columns sample_name, group, class_path, and residue_path.

# Define metadata table
samples_table <- data.frame(
  sample_name = c("sample_1", "sample_2", "sample_3", "sample_4"),
  group = c("control", "control", "treatment", "treatment"),
  class_path = c(
    "/path/to/sample_1/read_classes.tsv",
    "/path/to/sample_2/read_classes.tsv",
    "/path/to/sample_3/read_classes.tsv",
    "/path/to/sample_4/read_classes.tsv"
  ),
  residue_path = c(
    "/path/to/sample_1/nonadenosine_residues.tsv",
    "/path/to/sample_2/nonadenosine_residues.tsv",
    "/path/to/sample_3/nonadenosine_residues.tsv",
    "/path/to/sample_4/nonadenosine_residues.tsv"
  )
)

# Read all files at once
class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)

The output data frames include additional sample_name and group columns (as factors), which are used as grouping variables in downstream analysis and plotting.

Using a YAML configuration file

Alternatively, provide metadata in a YAML configuration file. This is the same format used by the Shiny dashboard (see vignette("shiny_app")):

config <- yaml::yaml.load_file("config.yml")
samples_table <- data.frame(t(sapply(config$samples, unlist)))
rownames(samples_table) <- NULL

class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)

Example config.yml:

samples:
  sample_1:
    sample_name: sample_1
    group: control
    class_path: /path/to/sample_1/read_classes.tsv
    residue_path: /path/to/sample_1/nonadenosine_residues.tsv
  sample_2:
    sample_name: sample_2
    group: treatment
    class_path: /path/to/sample_2/read_classes.tsv
    residue_path: /path/to/sample_2/nonadenosine_residues.tsv

Annotation with biomaRt

annotate_with_biomart() adds gene symbols and Ensembl transcript IDs to ninetails output tables by querying biomaRt. This enables transcript-level filtering by human-readable gene names instead of raw contig identifiers.

# Annotate class data
class_data <- ninetails::annotate_with_biomart(
  class_data,
  species = "mmusculus"
)

# Annotate residue data
residue_data <- ninetails::annotate_with_biomart(
  residue_data,
  species = "mmusculus"
)

Parameters

Parameter Type Default Description
data data.frame required Ninetails output table with a contig column
species character required Species identifier (see below)

Supported species

Identifier Species
"mmusculus" Mus musculus
"hsapiens" Homo sapiens
"athaliana" Arabidopsis thaliana
"scerevisiae" Saccharomyces cerevisiae
"celegans" Caenorhabditis elegans
"tbrucei" Trypanosoma brucei

See function documentation for other species and custom biomaRt datasets.

Output columns added

Column Description
symbol Gene symbol (e.g., Actb, Gapdh)
ensembl_transcript_id_short Ensembl transcript ID without version suffix

Note: Annotation should be performed before merging, summarizing, or launching the dashboard, since downstream functions and the dashboard use the symbol column for transcript labels when available.


Correcting classification

Sometimes nucleotides from AT-rich transcript ends are misidentified as poly(A) tails. Use reclassify_ninetails_data() to minimize segmentation artifacts by comparing detected residue positions against known transcript sequences:

ninetails_data <- ninetails::reclassify_ninetails_data(
  residue_data      = residue_data,
  class_data        = class_data,
  grouping_factor   = "sample_name",
  transcript_column = "ensembl_transcript_id_short",
  ref               = "mmusculus"
)

# Retrieve corrected data frames
class_data <- ninetails_data[[1]]
residue_data <- ninetails_data[[2]]

Parameters

Parameter Type Default Description
residue_data data.frame required Non-A residue table
class_data data.frame required Read classification table
grouping_factor character required Grouping column (e.g., "sample_name")
transcript_column character required Column with transcript IDs
ref character required Reference species (same identifiers as annotate_with_biomart())

Note: Apply this function before further analysis. It requires annotated data (i.e., run annotate_with_biomart() first).


Merging results

Combine class and residue data into a single table with one row per read. Decorated reads gain a nonA_residues column summarizing all detected positions.

merged_tables <- ninetails::merge_nonA_tables(
  class_data  = class_data,
  residue_data = residue_data,
  pass_only   = TRUE
)

Parameters

Parameter Type Default Description
class_data data.frame required Read classification table
residue_data data.frame required Non-A residue table
pass_only logical TRUE If TRUE, include only QC-passed reads

Output

The merged table contains all columns from class_data plus a nonA_residues column. For decorated reads, this column contains all non-A positions from 5’ to 3’ separated by : (e.g., "C:G" for a read with a cytidine and a guanosine). For blank and unclassified reads, the column is NA.

Spreading residue columns

spread_nonA_residues() reshapes the merged table to create separate columns for each non-A position, which can be useful for per-position analysis:

spread_table <- ninetails::spread_nonA_residues(
  merged_nonA_tables = merged_tables
)

The output adds columns nonA_1, nonA_2, etc., each containing the residue type at that position (or NA if the read has fewer modifications).


Counting and summarizing

Count by class

# Simple counts
class_counts <- ninetails::count_class(class_data)

# Grouped by sample
class_counts <- ninetails::count_class(
  class_data,
  grouping_factor = "sample_name"
)

# Detailed (by comments code) vs simplified (by class)
counts_detailed <- ninetails::count_class(class_data, detailed = TRUE)
counts_simple <- ninetails::count_class(class_data, detailed = FALSE)

Parameters

Parameter Type Default Description
class_data data.frame required Read classification table
grouping_factor character NA Grouping column
detailed logical TRUE If TRUE, count by comment code; if FALSE, count by class

Count residues

residue_counts <- ninetails::count_residues(
  residue_data,
  grouping_factor = "sample_name"
)

Count non-A abundance

Count how many reads have one, two, or three or more non-A residues:

abundance <- ninetails::count_nonA_abundance(
  residue_data,
  grouping_factor = "sample_name"
)

Summarizing by transcript

Generate comprehensive summary statistics per transcript:

summarized <- ninetails::summarize_nonA(
  merged_nonA_tables    = merged_tables,
  summary_factors       = "group",
  transcript_id_column  = "ensembl_transcript_id_short"
)

Parameters

Parameter Type Default Description
merged_nonA_tables data.frame required Output from merge_nonA_tables()
summary_factors character required Grouping column(s) for the summary
transcript_id_column character NULL Column with transcript IDs (optional)

Output columns

Column Description
Grouping columns As specified by summary_factors
contig / transcript ID Transcript identifier
n_reads Total reads mapping to the transcript
n_decorated Number of decorated reads
n_blank Number of blank reads
n_C_counts / n_G_counts / n_U_counts Reads containing C, G, or U residues
n_C_hits / n_G_hits / n_U_hits Total C, G, or U residue occurrences
mean_polya_length Mean poly(A) tail length
median_polya_length Median poly(A) tail length

Statistical analysis

Fisher’s exact test

Compare non-adenosine frequencies between two conditions:

fisher_results <- ninetails::calculate_fisher(
  ninetails_data,
  grouping_factor = "condition",
  comparison = c("control", "treatment")
)

Parameters

Parameter Type Description
ninetails_data data.frame Class data or merged table
grouping_factor character Column defining conditions
comparison character(2) Two condition names to compare

The function returns a data frame with Fisher’s exact test results (p-value, odds ratio, confidence interval) per transcript.


Typical workflow

library(ninetails)

# 1. Load data from multiple samples
samples <- data.frame(
  sample_name = c("WT_1", "WT_2", "KO_1", "KO_2"),
  group = c("WT", "WT", "KO", "KO"),
  class_path = c("WT_1_classes.tsv", "WT_2_classes.tsv",
                 "KO_1_classes.tsv", "KO_2_classes.tsv"),
  residue_path = c("WT_1_residues.tsv", "WT_2_residues.tsv",
                   "KO_1_residues.tsv", "KO_2_residues.tsv")
)

class_data <- ninetails::read_class_multiple(samples)
residue_data <- ninetails::read_residue_multiple(samples)

# 2. Annotate with gene symbols
class_data <- ninetails::annotate_with_biomart(class_data, species = "mmusculus")
residue_data <- ninetails::annotate_with_biomart(residue_data, species = "mmusculus")

# 3. Reclassify to remove artifacts
ninetails_data <- ninetails::reclassify_ninetails_data(
  residue_data = residue_data,
  class_data = class_data,
  grouping_factor = "sample_name",
  transcript_column = "ensembl_transcript_id_short",
  ref = "mmusculus"
)

class_data <- ninetails_data[[1]]
residue_data <- ninetails_data[[2]]

# 4. Count classifications
class_counts <- ninetails::count_class(class_data, grouping_factor = "group")

# 5. Merge and summarize
merged <- ninetails::merge_nonA_tables(class_data, residue_data)
summary <- ninetails::summarize_nonA(merged, summary_factors = "group")

# 6. Statistical comparison
fisher <- ninetails::calculate_fisher(
  class_data,
  grouping_factor = "group",
  comparison = c("WT", "KO")
)

# 7. Explore interactively
ninetails::launch_signal_browser(
  config = "config.yml"
)

Summary of postprocessing functions

Function Description
read_class_single() Load a single read_classes file
read_class_multiple() Load multiple read_classes files with metadata
read_residue_single() Load a single nonadenosine_residues file
read_residue_multiple() Load multiple nonadenosine_residues files with metadata
annotate_with_biomart() Add gene symbols and transcript IDs via biomaRt
reclassify_ninetails_data() Correct misclassifications from AT-rich ends
correct_class_data() Apply corrections to class table
correct_residue_data() Apply corrections to residue table
correct_labels() Fix label inconsistencies
merge_nonA_tables() Combine class and residue data (one row per read)
spread_nonA_residues() Reshape merged table with per-position columns
count_class() Count reads by class or comment code
count_residues() Count residues by type
count_nonA_abundance() Count reads with 1, 2, 3+ non-A residues
summarize_nonA() Per-transcript summary statistics
calculate_fisher() Fisher’s exact test between conditions