Data postprocessing • ninetails

This vignette describes functions for loading, processing, annotating, and analyzing ninetails output data. All functions documented here operate on the two output tables produced by the main pipelines (check_tails_dorado_DRS(), check_tails_dorado_cDNA(), or check_tails_guppy()):

read_classes: Per-read classification (decorated/blank/unclassified)
nonadenosine_residues: Modification-level detail for decorated reads

For visualization of these data, see vignette("plotting") and vignette("shiny_app").

Reading files into R

Reading single files

# Read read_classes output
class_path <- "/path/to/read_classes.tsv"
class_data <- ninetails::read_class_single(class_path)

# Read nonadenosine_residues output
residue_path <- "/path/to/nonadenosine_residues.tsv"
residue_data <- ninetails::read_residue_single(residue_path)

Function	Input	Output
`read_class_single()`	Path to a single `read_classes` file	Data frame with `readname`, `contig`, `polya_length`, `qc_tag`, `class`, `comments`
`read_residue_single()`	Path to a single `nonadenosine_residues` file	Data frame with `readname`, `contig`, `prediction`, `est_nonA_pos`, `polya_length`, `qc_tag`

Reading multiple files

For experiments with multiple samples, use read_class_multiple() and read_residue_multiple() with a metadata table. The metadata table must contain columns sample_name, group, class_path, and residue_path.

# Define metadata table
samples_table <- data.frame(
  sample_name = c("sample_1", "sample_2", "sample_3", "sample_4"),
  group = c("control", "control", "treatment", "treatment"),
  class_path = c(
    "/path/to/sample_1/read_classes.tsv",
    "/path/to/sample_2/read_classes.tsv",
    "/path/to/sample_3/read_classes.tsv",
    "/path/to/sample_4/read_classes.tsv"
  ),
  residue_path = c(
    "/path/to/sample_1/nonadenosine_residues.tsv",
    "/path/to/sample_2/nonadenosine_residues.tsv",
    "/path/to/sample_3/nonadenosine_residues.tsv",
    "/path/to/sample_4/nonadenosine_residues.tsv"
  )
)

# Read all files at once
class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)

The output data frames include additional sample_name and group columns (as factors), which are used as grouping variables in downstream analysis and plotting.

Using a YAML configuration file

Alternatively, provide metadata in a YAML configuration file. This is the same format used by the Shiny dashboard (see vignette("shiny_app")):

config <- yaml::yaml.load_file("config.yml")
samples_table <- data.frame(t(sapply(config$samples, unlist)))
rownames(samples_table) <- NULL

class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)

Example config.yml:

samples:
  sample_1:
    sample_name: sample_1
    group: control
    class_path: /path/to/sample_1/read_classes.tsv
    residue_path: /path/to/sample_1/nonadenosine_residues.tsv
  sample_2:
    sample_name: sample_2
    group: treatment
    class_path: /path/to/sample_2/read_classes.tsv
    residue_path: /path/to/sample_2/nonadenosine_residues.tsv

Annotation with biomaRt

annotate_with_biomart() adds gene symbols and Ensembl transcript IDs to ninetails output tables by querying biomaRt. This enables transcript-level filtering by human-readable gene names instead of raw contig identifiers.

# Annotate class data
class_data <- ninetails::annotate_with_biomart(
  class_data,
  species = "mmusculus"
)

# Annotate residue data
residue_data <- ninetails::annotate_with_biomart(
  residue_data,
  species = "mmusculus"
)

Parameters

Parameter	Type	Default	Description
`data`	data.frame	required	Ninetails output table with a `contig` column
`species`	character	required	Species identifier (see below)

Supported species

Identifier	Species
`"mmusculus"`	Mus musculus
`"hsapiens"`	Homo sapiens
`"athaliana"`	Arabidopsis thaliana
`"scerevisiae"`	Saccharomyces cerevisiae
`"celegans"`	Caenorhabditis elegans
`"tbrucei"`	Trypanosoma brucei

See function documentation for other species and custom biomaRt datasets.

Output columns added

Column	Description
`symbol`	Gene symbol (e.g., `Actb`, `Gapdh`)
`ensembl_transcript_id_short`	Ensembl transcript ID without version suffix

Note: Annotation should be performed before merging, summarizing, or launching the dashboard, since downstream functions and the dashboard use the symbol column for transcript labels when available.

Correcting classification

Sometimes nucleotides from AT-rich transcript ends are misidentified as poly(A) tails. Use reclassify_ninetails_data() to minimize segmentation artifacts by comparing detected residue positions against known transcript sequences:

ninetails_data <- ninetails::reclassify_ninetails_data(
  residue_data      = residue_data,
  class_data        = class_data,
  grouping_factor   = "sample_name",
  transcript_column = "ensembl_transcript_id_short",
  ref               = "mmusculus"
)

# Retrieve corrected data frames
class_data <- ninetails_data[[1]]
residue_data <- ninetails_data[[2]]

Parameters

Parameter	Type	Default	Description
`residue_data`	data.frame	required	Non-A residue table
`class_data`	data.frame	required	Read classification table
`grouping_factor`	character	required	Grouping column (e.g., `"sample_name"`)
`transcript_column`	character	required	Column with transcript IDs
`ref`	character	required	Reference species (same identifiers as `annotate_with_biomart()`)

Note: Apply this function before further analysis. It requires annotated data (i.e., run annotate_with_biomart() first).

Merging results

Combine class and residue data into a single table with one row per read. Decorated reads gain a nonA_residues column summarizing all detected positions.

merged_tables <- ninetails::merge_nonA_tables(
  class_data  = class_data,
  residue_data = residue_data,
  pass_only   = TRUE
)

Parameters

Parameter	Type	Default	Description
`class_data`	data.frame	required	Read classification table
`residue_data`	data.frame	required	Non-A residue table
`pass_only`	logical	`TRUE`	If `TRUE`, include only QC-passed reads

Output

The merged table contains all columns from class_data plus a nonA_residues column. For decorated reads, this column contains all non-A positions from 5’ to 3’ separated by : (e.g., "C:G" for a read with a cytidine and a guanosine). For blank and unclassified reads, the column is NA.

Spreading residue columns

spread_nonA_residues() reshapes the merged table to create separate columns for each non-A position, which can be useful for per-position analysis:

spread_table <- ninetails::spread_nonA_residues(
  merged_nonA_tables = merged_tables
)

The output adds columns nonA_1, nonA_2, etc., each containing the residue type at that position (or NA if the read has fewer modifications).

Counting and summarizing

Count by class

# Simple counts
class_counts <- ninetails::count_class(class_data)

# Grouped by sample
class_counts <- ninetails::count_class(
  class_data,
  grouping_factor = "sample_name"
)

# Detailed (by comments code) vs simplified (by class)
counts_detailed <- ninetails::count_class(class_data, detailed = TRUE)
counts_simple <- ninetails::count_class(class_data, detailed = FALSE)

Parameters

Parameter	Type	Default	Description
`class_data`	data.frame	required	Read classification table
`grouping_factor`	character	`NA`	Grouping column
`detailed`	logical	`TRUE`	If `TRUE`, count by comment code; if `FALSE`, count by class

Count residues

residue_counts <- ninetails::count_residues(
  residue_data,
  grouping_factor = "sample_name"
)

Count non-A abundance

Count how many reads have one, two, or three or more non-A residues:

abundance <- ninetails::count_nonA_abundance(
  residue_data,
  grouping_factor = "sample_name"
)

Summarizing by transcript

Generate comprehensive summary statistics per transcript:

summarized <- ninetails::summarize_nonA(
  merged_nonA_tables    = merged_tables,
  summary_factors       = "group",
  transcript_id_column  = "ensembl_transcript_id_short"
)

Parameters

Parameter	Type	Default	Description
`merged_nonA_tables`	data.frame	required	Output from `merge_nonA_tables()`
`summary_factors`	character	required	Grouping column(s) for the summary
`transcript_id_column`	character	`NULL`	Column with transcript IDs (optional)

Output columns

Column	Description
Grouping columns	As specified by `summary_factors`
`contig` / transcript ID	Transcript identifier
`n_reads`	Total reads mapping to the transcript
`n_decorated`	Number of decorated reads
`n_blank`	Number of blank reads
`n_C_counts` / `n_G_counts` / `n_U_counts`	Reads containing C, G, or U residues
`n_C_hits` / `n_G_hits` / `n_U_hits`	Total C, G, or U residue occurrences
`mean_polya_length`	Mean poly(A) tail length
`median_polya_length`	Median poly(A) tail length

Statistical analysis

Fisher’s exact test

Compare non-adenosine frequencies between two conditions:

fisher_results <- ninetails::calculate_fisher(
  ninetails_data,
  grouping_factor = "condition",
  comparison = c("control", "treatment")
)

Parameters

Parameter	Type	Description
`ninetails_data`	data.frame	Class data or merged table
`grouping_factor`	character	Column defining conditions
`comparison`	character(2)	Two condition names to compare

The function returns a data frame with Fisher’s exact test results (p-value, odds ratio, confidence interval) per transcript.

Typical workflow

library(ninetails)

# 1. Load data from multiple samples
samples <- data.frame(
  sample_name = c("WT_1", "WT_2", "KO_1", "KO_2"),
  group = c("WT", "WT", "KO", "KO"),
  class_path = c("WT_1_classes.tsv", "WT_2_classes.tsv",
                 "KO_1_classes.tsv", "KO_2_classes.tsv"),
  residue_path = c("WT_1_residues.tsv", "WT_2_residues.tsv",
                   "KO_1_residues.tsv", "KO_2_residues.tsv")
)

class_data <- ninetails::read_class_multiple(samples)
residue_data <- ninetails::read_residue_multiple(samples)

# 2. Annotate with gene symbols
class_data <- ninetails::annotate_with_biomart(class_data, species = "mmusculus")
residue_data <- ninetails::annotate_with_biomart(residue_data, species = "mmusculus")

# 3. Reclassify to remove artifacts
ninetails_data <- ninetails::reclassify_ninetails_data(
  residue_data = residue_data,
  class_data = class_data,
  grouping_factor = "sample_name",
  transcript_column = "ensembl_transcript_id_short",
  ref = "mmusculus"
)

class_data <- ninetails_data[[1]]
residue_data <- ninetails_data[[2]]

# 4. Count classifications
class_counts <- ninetails::count_class(class_data, grouping_factor = "group")

# 5. Merge and summarize
merged <- ninetails::merge_nonA_tables(class_data, residue_data)
summary <- ninetails::summarize_nonA(merged, summary_factors = "group")

# 6. Statistical comparison
fisher <- ninetails::calculate_fisher(
  class_data,
  grouping_factor = "group",
  comparison = c("WT", "KO")
)

# 7. Explore interactively
ninetails::launch_signal_browser(
  config = "config.yml"
)

Summary of postprocessing functions

Function	Description
`read_class_single()`	Load a single read_classes file
`read_class_multiple()`	Load multiple read_classes files with metadata
`read_residue_single()`	Load a single nonadenosine_residues file
`read_residue_multiple()`	Load multiple nonadenosine_residues files with metadata
`annotate_with_biomart()`	Add gene symbols and transcript IDs via biomaRt
`reclassify_ninetails_data()`	Correct misclassifications from AT-rich ends
`correct_class_data()`	Apply corrections to class table
`correct_residue_data()`	Apply corrections to residue table
`correct_labels()`	Fix label inconsistencies
`merge_nonA_tables()`	Combine class and residue data (one row per read)
`spread_nonA_residues()`	Reshape merged table with per-position columns
`count_class()`	Count reads by class or comment code
`count_residues()`	Count residues by type
`count_nonA_abundance()`	Count reads with 1, 2, 3+ non-A residues
`summarize_nonA()`	Per-transcript summary statistics
`calculate_fisher()`	Fisher’s exact test between conditions