
Extract modification-centered signal fragments from a poly(A) tail
Source: R/ninetails_core_functions.R
split_tail_centered.Rd
Finds regions of the poly(A) tail signal that may contain non-adenosine residues and extracts 100-point signal fragments, each centered on the candidate modification.
Arguments
- readname
Character string. Name of the given read (UUID) within the analyzed dataset.
- tail_feature_list
List object produced by create_tail_feature_list. Must contain per-read entries with tail_signal, tail_moves, and tail_pseudomoves.
Value
A nested list where each element represents one candidate modification region. Each element is itself a list with:
- chunk_sequence
Numeric vector of raw signal values (length 100)
- chunk_start_pos
Integer. Start index of the chunk in the original signal
- chunk_end_pos
Integer. End index of the chunk in the original signal
- chunk_moves
Numeric vector. Basecaller moves for the chunk region
Element names follow the pattern <readname>_<chunk_index>.
Positions are indexed from the 3' end.
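The nested structure described above can be flattened for downstream use. The snippet below is a minimal sketch that works on a mock object with the documented shape; the read name, positions, and signal values are invented for illustration and do not come from real output.

```r
# Mock return value with the documented fields; all values are made up.
set.seed(42)
mock_chunks <- list(
  read1_1 = list(
    chunk_sequence  = round(rnorm(100, mean = 650, sd = 20)),
    chunk_start_pos = 120L,
    chunk_end_pos   = 219L,
    chunk_moves     = rbinom(100, size = 1, prob = 0.1)
  ),
  read1_2 = list(
    chunk_sequence  = round(rnorm(100, mean = 650, sd = 20)),
    chunk_start_pos = 400L,
    chunk_end_pos   = 499L,
    chunk_moves     = rbinom(100, size = 1, prob = 0.1)
  )
)

# Stack the 100-point fragments into a matrix, one row per chunk.
signal_matrix <- do.call(rbind, lapply(mock_chunks, `[[`, "chunk_sequence"))

# Collect the chunk coordinates into a data frame.
positions <- data.frame(
  chunk = names(mock_chunks),
  start = vapply(mock_chunks, `[[`, integer(1), "chunk_start_pos"),
  end   = vapply(mock_chunks, `[[`, integer(1), "chunk_end_pos"),
  row.names = NULL
)
```

Row names of signal_matrix follow the <readname>_<chunk_index> element names, so each row can be traced back to its chunk.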
Details
Candidate modification regions are identified based on two criteria:
a significant raw signal distortion (recorded as a pseudomove by the
thresholding algorithm) and a state transition (move == 1) recorded
by Guppy. If a given signal chunk contains only move == 0 values,
that chunk is dropped from the analysis, because the distortion is
most likely caused by a sequencing artifact rather than by a non-A
residue.
If a candidate modification lies near either end of the signal (3' or 5'), the missing upstream or downstream values are imputed from the most frequent values in the entire signal.
The extraction procedure is as follows:
1. Run-length encoding (RLE) of the pseudomove vector
2. Filtering runs of pseudomoves with length >= 5
3. Centering a 100-point window on the midpoint of each qualifying run
4. Imputing NAs at boundaries with draws from the 5 most frequent signal values
5. Removing chunks where basecaller moves are all zero (likely artifacts)
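The steps above can be sketched in base R as follows. This is an illustrative approximation, not the package's implementation: the helper name extract_centered_chunks_sketch and its arguments are hypothetical, and it assumes that qualifying runs are runs of non-zero pseudomoves, that signal, moves, and pseudomoves have equal length, and that window bounds are reported unclamped.

```r
# Hypothetical helper illustrating the documented steps; not the ninetails code.
extract_centered_chunks_sketch <- function(signal, moves, pseudomoves,
                                           readname = "read",
                                           window = 100L, min_run = 5L) {
  # Step 1: run-length encode the pseudomove vector.
  runs <- rle(pseudomoves)
  run_ends <- cumsum(runs$lengths)
  run_starts <- run_ends - runs$lengths + 1L

  # Step 2: keep non-zero runs of at least min_run points (assumption).
  keep <- runs$values != 0 & runs$lengths >= min_run
  centers <- (run_starts[keep] + run_ends[keep]) %/% 2L

  # Up to five most frequent signal values, used for boundary imputation.
  top_vals <- as.numeric(names(sort(table(signal), decreasing = TRUE)))
  top_vals <- top_vals[seq_len(min(5L, length(top_vals)))]

  chunks <- list()
  for (i in seq_along(centers)) {
    # Step 3: center a fixed-size window on the run midpoint.
    start <- centers[i] - window %/% 2L + 1L
    end <- start + window - 1L
    idx <- start:end
    inside <- idx >= 1L & idx <= length(signal)

    # Step 4: fill positions outside the signal with draws from top_vals.
    chunk_signal <- rep(NA_real_, window)
    chunk_signal[inside] <- signal[idx[inside]]
    n_missing <- sum(!inside)
    if (n_missing > 0) {
      draws <- sample.int(length(top_vals), n_missing, replace = TRUE)
      chunk_signal[!inside] <- top_vals[draws]
    }

    # Step 5: drop chunks whose in-range moves are all zero.
    chunk_moves <- moves[idx[inside]]
    if (all(chunk_moves == 0)) next

    chunks[[paste0(readname, "_", i)]] <- list(
      chunk_sequence  = chunk_signal,
      chunk_start_pos = start,  # may be < 1 near the signal start in this sketch
      chunk_end_pos   = end,    # may exceed length(signal) near the signal end
      chunk_moves     = chunk_moves
    )
  }
  chunks
}

# Toy input: three pseudomove runs, only one of which should survive.
set.seed(1)
toy_signal <- as.numeric(sample(600:700, 300, replace = TRUE))
toy_pseudomoves <- rep(0, 300)
toy_pseudomoves[5:10] <- 1     # long run near the start, overlaps a move
toy_pseudomoves[200:206] <- 1  # long run, but no moves in its window
toy_pseudomoves[280:282] <- 1  # too short to qualify
toy_moves <- rep(0, 300)
toy_moves[8] <- 1

toy_chunks <- extract_centered_chunks_sketch(
  toy_signal, toy_moves, toy_pseudomoves, readname = "r1"
)
```

In this toy input the first run sits close to the start of the signal, so part of its window falls outside the data and is imputed; the second run is discarded because its window contains no move == 1; the third run is shorter than five points and never becomes a candidate.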
See also
create_tail_feature_list for preparing the input,
create_tail_chunk_list for batch segmentation,
split_tail_centered_dorado for the Dorado-specific version