
Extracts features of poly(A) tails containing only A nucleotides for training-set preparation.
Source: R/ninetails_training_dataset_production_functions.R
create_tail_feature_list_A.Rd
Training-set variant of the feature extraction wrapper designed
exclusively for A-only signals. It processes all reads in parallel via
extract_tail_data_trainingset, then applies an inverse
filtering criterion: only reads whose pseudomove vectors contain no
run of 4 or more consecutive non-zero values are retained. This
ensures the resulting dataset represents pure homopolymer A tails
without modification artefacts.
Usage
create_tail_feature_list_A(
  nanopolish,
  sequencing_summary,
  workspace,
  num_cores,
  basecall_group,
  pass_only = TRUE
)
Arguments
- nanopolish
Character string. Full path of the .tsv file produced by nanopolish polya.
- sequencing_summary
Character string. Full path of the .txt file with the sequencing summary.
- workspace
Character string. Full path of the directory containing basecalled multi-Fast5 files.
- num_cores
Numeric [1]. Number of physical cores to use. Do not exceed 1 less than the number of cores at your disposal.
- basecall_group
Character string ["Basecall_1D_000"]. Name of the level in the Fast5 file hierarchy from which the data should be extracted.
- pass_only
Logical [TRUE]. If TRUE, only reads tagged by nanopolish as "PASS" are retained. Otherwise, reads tagged as "PASS" or "SUFFCLIP" are included.
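One way to pick a value for num_cores that respects the "one less than available" advice is sketched below. It relies on parallel::detectCores() from base R; the variable names are illustrative and not part of the package.

```r
# Count physical cores; detectCores() may return NA on some platforms.
physical_cores <- parallel::detectCores(logical = FALSE)
if (is.na(physical_cores)) {
  physical_cores <- 1L
}

# Leave one core free, but never drop below a single worker.
num_cores <- max(1L, physical_cores - 1L)
```

The resulting num_cores can then be passed to create_tail_feature_list_A().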
Value
A named list with one element:
- tail_feature_list
Named list of per-read feature lists (as returned by
extract_tail_data_trainingset) containing only reads with pure A tails.
Always assign this returned list to a variable; printing the full list to the console may crash the R session.
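Because printing the full list is discouraged, the result can be inspected through summaries instead. The sketch below uses a mock object that only mimics the documented outer shape (a named list with a tail_feature_list element); the inner field names tail_signal and tail_pseudomoves are placeholders chosen for illustration and may not match the fields actually returned.

```r
# Mock result with the documented outer structure (inner fields are assumed).
mock_result <- list(
  tail_feature_list = list(
    read_1 = list(tail_signal = rnorm(1000), tail_pseudomoves = integer(1000)),
    read_2 = list(tail_signal = rnorm(1500), tail_pseudomoves = integer(1500))
  )
)

# Number of retained reads.
n_reads <- length(mock_result$tail_feature_list)

# First few read names only.
head(names(mock_result$tail_feature_list))

# Compact structural overview, limited to two nesting levels
# and at most three entries per list.
str(mock_result, max.level = 2, list.len = 3)
```

The same calls apply to a real result once it has been assigned to a variable.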
Details
The inverse filtering uses a sliding-window approach
(stats::embed) to detect runs of >= 4 consecutive non-zero
pseudomoves. Reads that pass this filter (i.e. have no such runs)
are collected as the A-only reference set. Unlike
create_tail_feature_list_trainingset, the returned list
does not include zeromoved_readnames or
nonpseudomoved_readnames categories.
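The sliding-window check described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the package's actual code: has_nonzero_run is a hypothetical helper, and the toy vectors are made-up values rather than real pseudomove data.

```r
# Hypothetical helper: TRUE when `pseudomoves` contains a run of at least
# `run_length` consecutive non-zero values.
has_nonzero_run <- function(pseudomoves, run_length = 4) {
  # A vector shorter than the window cannot contain such a run;
  # stats::embed() would also raise an error in that case.
  if (length(pseudomoves) < run_length) {
    return(FALSE)
  }
  # Each row of `windows` holds `run_length` consecutive values.
  windows <- stats::embed(pseudomoves, run_length)
  # A row made up entirely of non-zero values marks a qualifying run.
  any(rowSums(windows != 0) == run_length)
}

# Toy pseudomove vectors (made-up values).
toy_reads <- list(
  read_a = c(0, 0, 1, 0, 0, -1, 0, 0),  # isolated non-zero values only
  read_b = c(0, 1, 1, 1, 1, 0, 0, 0),   # run of four non-zero values
  read_c = c(0, 0, 0, 0, 0, 0, 0, 0)    # all zeros
)

# Inverse filter: keep only reads WITHOUT a qualifying run.
is_modified <- vapply(toy_reads, has_nonzero_run, logical(1))
a_only_reads <- toy_reads[!is_modified]
names(a_only_reads)
```

Here read_b is dropped because it contains four consecutive non-zero values, while read_a and read_c are kept as A-only candidates.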
See also
create_tail_feature_list_trainingset for the
non-A variant,
create_tail_chunk_list_A for the next pipeline step,
extract_tail_data_trainingset for per-read extraction,
prepare_trainingset for the top-level wrapper.