Skip to contents

Extracts tail features of RNA reads from multi-Fast5 files basecalled by Guppy and poly(A) tail characteristics (coordinates) produced by nanopolish polya function. Processing is parallelized across reads using foreach and doSNOW. A progress bar is displayed during extraction.

Usage

create_tail_feature_list(
  nanopolish,
  sequencing_summary,
  workspace,
  num_cores,
  basecall_group,
  pass_only = TRUE
)

Arguments

nanopolish

Character string or data frame. Full path of the .tsv file produced by nanopolish polya function, or an in-memory data frame.

sequencing_summary

Character string or data frame. Full path of the .txt file with sequencing summary, or an in-memory data frame.

workspace

Character string. Full path of the directory containing the basecalled multi-Fast5 files.

num_cores

Numeric. Number of physical cores to use in processing. Do not exceed 1 less than the number of cores at your disposal.

basecall_group

Character string. Name of the level in the Fast5 file hierarchy from which data should be extracted (e.g., "Basecall_1D_000").

pass_only

Logical. If TRUE (default), only reads tagged by nanopolish as "PASS" are taken into consideration. If FALSE, reads tagged as "PASS" and "SUFFCLIP" are both included.

Value

A named list with three elements:

tail_feature_list

Named list of per-read tail features. Each element contains fast5_filename, tail_signal, tail_moves, and tail_pseudomoves (see extract_tail_data).

zeromoved_readnames

Character vector. Read IDs discarded because all basecaller moves in their tail region were zero.

nonpseudomoved_readnames

Character vector. Read IDs discarded because their pseudomove chain was too short (< 5 consecutive positions) to indicate a potential modification.

Always assign the returned list to a variable. Printing the full output to the console may crash your R session.

Details

After extraction, reads with zero-moved tails and reads that do not satisfy the pseudomove condition (minimum run length of 5) are filtered out and their identifiers are stored separately for downstream classification.

See also

extract_polya_data for reading nanopolish and sequencing summary data, extract_tail_data for single-read extraction, create_tail_chunk_list for downstream chunk segmentation

Examples

if (FALSE) { # \dontrun{

tfl <- ninetails::create_tail_feature_list(
  nanopolish = system.file('extdata',
                           'test_data',
                           'nanopolish_output.tsv',
                           package = 'ninetails'),
  sequencing_summary = system.file('extdata',
                                   'test_data',
                                   'sequencing_summary.txt',
                                   package = 'ninetails'),
  workspace = system.file('extdata',
                          'test_data',
                          'basecalled_fast5',
                          package = 'ninetails'),
  num_cores = 2,
  basecall_group = 'Basecall_1D_000',
  pass_only = TRUE
)

} # }