
Extracts fragments of poly(A) tail signal (Dorado mode) containing potential modifications along with their delimitation (positional indices; coordinates) within the tail.
Source: R/ninetails_core_functions_dorado_DRS.R
split_tail_centered_dorado.Rd

This function processes raw poly(A) tail signal and Dorado-derived pseudomoves to identify and extract signal segments (chunks) potentially corresponding to modified positions (e.g., non-A residues). Each extracted chunk spans 100 signal points, centered on the midpoint of a pseudomove run.
Arguments
- readname
character string. Name of the given read within the analyzed dataset.
- tail_feature_list
list object produced by create_tail_features_list_dorado, the Dorado-tail feature extraction function. Must contain $tail_signal (numeric vector) and $tail_pseudomoves (integer vector) for each read.
Value
A nested list where each element corresponds to a signal fragment. Each fragment is itself a list with three entries:
* chunk_sequence: numeric vector of raw signal values
* chunk_start_pos: integer, start index of the chunk
* chunk_end_pos: integer, end index of the chunk
Details
In the Dorado pipeline, moves are not used:
* retrieving them from BAM files is computationally expensive
* processing is non-intuitive
Instead, only pseudomoves are considered. As a safeguard against Dorado’s tendency to extend poly(A) boundaries into the transcript body, the last 3 pseudomove values are forced to 0. This prevents misclassification of transcript nucleotides as part of the tail.
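The boundary safeguard can be illustrated with a minimal sketch. The helper name `zero_tail_end` and the example pseudomove values are hypothetical and serve only to show the idea; the package's internal code may differ.

```r
# Hypothetical pseudomove vector for a single tail (illustrative values only)
pseudomoves <- c(0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L)

# Hypothetical helper: force the last `n` pseudomove values to 0,
# guarding against empty vectors and vectors shorter than `n`
zero_tail_end <- function(pm, n = 3L) {
  len <- length(pm)
  if (len == 0L) return(pm)
  idx <- seq.int(max(1L, len - n + 1L), len)
  pm[idx] <- 0L
  pm
}

masked <- zero_tail_end(pseudomoves)
```

In this example the trailing run of three nonzero values is zeroed, so it can no longer be picked up as a candidate region near the tail/transcript boundary.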
Candidate modification regions are detected by:
* run-length encoding (RLE) of the pseudomove vector
* filtering runs of pseudomoves with length ≥ 5
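The detection step can be sketched with base R's `rle()`. This is an illustration under stated assumptions rather than the package's implementation: the example vector is made up, and the sketch assumes that only runs of nonzero pseudomoves count as candidates and that the midpoint is taken by integer division.

```r
# Hypothetical pseudomove vector (illustrative values only)
pseudomoves <- c(0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
                 -1L, -1L, 0L, 0L, 0L, 0L, 0L, 0L)

# Run-length encode and recover the start/end index of every run
runs <- rle(pseudomoves)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1L

# Keep runs of length >= 5; excluding zero-valued runs is an assumption
keep <- runs$lengths >= 5L & runs$values != 0L

candidates <- data.frame(
  start = run_start[keep],
  end = run_end[keep],
  midpoint = (run_start[keep] + run_end[keep]) %/% 2L
)
```

Here the run of six `1` values (positions 3 to 8) passes the filter, while the short `-1` run and the long run of zeros do not.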
Extracted fragments are padded/imputed if they extend beyond signal boundaries:
* upstream/downstream missing values (NAs) are replaced
* imputation is based on random draws from the 5 most frequent signal values
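A minimal sketch of the padding step follows. The helper name `extract_chunk`, the example signal, and the exact handling of out-of-range indices are assumptions made for illustration; only the idea of drawing replacements from the 5 most frequent signal values comes from the description above.

```r
set.seed(1)  # only to make the illustration reproducible

# Hypothetical tail signal with a handful of dominant values
signal <- c(rep(700, 30), rep(705, 25), rep(710, 20),
            rep(695, 15), rep(690, 10), 650, 800)

# Hypothetical helper: extract signal[start:end], imputing out-of-range positions
extract_chunk <- function(signal, start, end) {
  idx <- seq.int(start, end)
  valid <- idx >= 1L & idx <= length(signal)

  # Positions outside the signal stay NA for now
  chunk <- rep(NA_real_, length(idx))
  chunk[valid] <- signal[idx[valid]]

  # The (up to) five most frequent signal values
  top5 <- as.numeric(names(head(sort(table(signal), decreasing = TRUE), 5)))

  # Replace NAs with random draws from those values; indexing via sample.int()
  # avoids sample()'s special behaviour for a length-one numeric vector
  n_missing <- sum(is.na(chunk))
  if (n_missing > 0L) {
    draws <- sample.int(length(top5), n_missing, replace = TRUE)
    chunk[is.na(chunk)] <- top5[draws]
  }
  chunk
}

# A 100-point window that starts 10 positions before the signal begins
chunk <- extract_chunk(signal, -9L, 90L)
```

The first 10 positions of `chunk` are imputed, and the remaining 90 are copied unchanged from the start of the signal.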
The function returns a list object (nested), where each element represents one candidate modification region, containing:
* `chunk_sequence`: the raw signal subsequence (length = 100, imputed if needed)
* `chunk_start_pos`: starting index of the subsequence
* `chunk_end_pos`: ending index of the subsequence
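To make the shape of the return value concrete, the sketch below builds one such element by hand and shows how its entries can be accessed. The helper `make_chunk`, the element name `chunk_1`, the example signal, and the centering convention (50 points before the midpoint, 49 after) are all hypothetical; the function's real element names and offsets may differ.

```r
# Hypothetical signal, long enough that no padding is needed
signal <- seq(600, by = 0.5, length.out = 300)

# Hypothetical helper: a 100-point window centered on `midpoint`
make_chunk <- function(signal, midpoint, width = 100L) {
  start <- midpoint - width %/% 2L
  end <- start + width - 1L
  list(
    chunk_sequence = signal[start:end],
    chunk_start_pos = start,
    chunk_end_pos = end
  )
}

# Nested list with one element per candidate region
chunks <- list(chunk_1 = make_chunk(signal, midpoint = 150L))

# Accessing the entries of the first fragment
length(chunks$chunk_1$chunk_sequence)
chunks$chunk_1$chunk_start_pos
chunks$chunk_1$chunk_end_pos
```

Because the start and end positions are stored alongside the signal values, each fragment can be mapped back to its coordinates within the tail.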