Filters out signals of a given nucleotide type for neural network training-set preparation.

Top-level convenience wrapper that orchestrates the complete training-set production pipeline for a single nucleotide category. Depending on the selected nucleotide, it chains the appropriate feature extraction, chunk splitting, filtering, and GAF creation functions into a single call.

Usage

prepare_trainingset(
  nucleotide,
  nanopolish,
  sequencing_summary,
  workspace,
  num_cores = 1,
  basecall_group = "Basecall_1D_000",
  pass_only = TRUE
)

Arguments

nucleotide: Character. One of "A", "C", "G", or "U". Defines the type of signal filtering applied to produce training data for the desired nucleotide context.
nanopolish: Character string. Full path of the .tsv file produced by nanopolish polya.
sequencing_summary: Character string. Full path of the .txt file with the sequencing summary.
workspace: Character string. Full path of the directory containing basecalled multi-Fast5 files.
num_cores: Numeric [1]. Number of physical cores to use. Use at most one fewer than the number of cores available on your machine.
basecall_group: Character string ["Basecall_1D_000"]. Name of the level in the Fast5 file hierarchy from which the data should be extracted.
pass_only: Logical [TRUE]. If TRUE, only reads tagged by nanopolish as "PASS" are retained. Otherwise, reads tagged as "PASS" or "SUFFCLIP" are included.
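
One possible way to pick a value for num_cores that follows the advice above is sketched below. It uses parallel::detectCores() from base R's parallel package; the fallback to a single core is an assumption made for this illustration, not behaviour of prepare_trainingset itself.

```r
# Sketch: leave one physical core free for the rest of the system.
# detectCores() may return NA on some platforms, so fall back to 1 in that case.
physical <- parallel::detectCores(logical = FALSE)
num_cores <- if (is.na(physical)) 1L else max(1L, physical - 1L)
num_cores
```

The resulting value can then be passed as the num_cores argument.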

Value

A named list of GAF matrices, with each element named <read_ID>_<index>. Always assign the returned list to a variable; printing the full list to the console may crash the R session.
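
Because the list should not be printed in full, it may help to inspect it through summaries instead. The sketch below uses a small hypothetical stand-in list (gaf_list, with made-up read IDs and toy 4 x 4 matrices) rather than real output; the actual dimensions of the GAF matrices are not stated here and are not implied by this example.

```r
# Hypothetical stand-in for the returned list: elements named <read_ID>_<index>.
gaf_list <- list(
  readA_1 = matrix(0, nrow = 4, ncol = 4),
  readA_2 = matrix(0, nrow = 4, ncol = 4),
  readB_1 = matrix(0, nrow = 4, ncol = 4)
)

# Summaries that avoid printing every matrix.
n_chunks <- length(gaf_list)        # number of GAF matrices
first_names <- head(names(gaf_list)) # a few element names
first_dim <- dim(gaf_list[[1]])      # shape of the first matrix

# Recover the read ID by dropping the trailing _<index> from each name.
read_ids <- sub("_[0-9]+$", "", names(gaf_list))
chunks_per_read <- table(read_ids)
```

The same calls (length, names, dim, table) apply to a real result assigned from prepare_trainingset.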

Details

The internal pipeline differs by nucleotide:

"A": Uses the A-only branch: create_tail_feature_list_A → create_tail_chunk_list_A → create_gaf_list_A.
"C": Uses the non-A branch with value = -1 (valley filtering): create_tail_feature_list_trainingset → create_tail_chunk_list_trainingset → filter_nonA_chunks_trainingset → create_gaf_list.
"G": Uses the non-A branch with value = 1 (peak filtering).
"U": Uses the non-A branch with value = -1 (valley filtering), same filter direction as C.
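
The mapping above can be summarised as a small lookup. The following is only an illustrative sketch: describe_branch is a hypothetical helper written for this page, not a function of the package, and it returns a description of the branch rather than running any of the pipeline steps.

```r
# Hypothetical helper: reports which branch and filter value apply to a nucleotide.
describe_branch <- function(nucleotide) {
  # Rejects anything other than the four supported nucleotides.
  nucleotide <- match.arg(nucleotide, c("A", "C", "G", "U"))
  switch(nucleotide,
    A = list(branch = "A-only", value = NA_real_), # no peak/valley filter step
    C = list(branch = "non-A", value = -1),        # valley filtering
    G = list(branch = "non-A", value = 1),         # peak filtering
    U = list(branch = "non-A", value = -1)         # valley filtering, as for C
  )
}

g_info <- describe_branch("G")
```

Here g_info$branch is "non-A" and g_info$value is 1, matching the "G" entry in the list above.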

Examples

if (FALSE) { # \dontrun{

gaf_list <- prepare_trainingset(
  nucleotide = "A",
  nanopolish = '/path/to/file',
  sequencing_summary = '/path/to/file',
  workspace = '/path/to/guppy/workspace',
  num_cores = 10,
  basecall_group = 'Basecall_1D_000',
  pass_only = TRUE)

} # }