Annotate

Description

The annotate command is the first in the chain of commands for processing MAS-seq data. Given an array model (e.g. ‘mas15’), annotate determines an optimal annotation (Viterbi path) for the adapters and unknown sequences in each read. The Viterbi path is automatically computed for both the forward and reverse-complement orientations of the read, and the path with the highest overall maximum likelihood is considered to be the correct one. Longbow then appends this information to the read’s auxillary BAM tags for use with downstream Longbow tools (e.g. filter, segment).

Several models are available for use with the --model argument. They are:

name	description	version
mas15	The standard MAS-seq 15-element array model	1.0.0
mas10	The MAS-seq 10-element array model	1.0.0
mas8prototype	The prototype MAS-seq 8-element array	1.0.0
slide-seq	The Slide-seq 15-element array model	0.0.1
bulk15	The 15-element bulk RNA model	(experimental)

By default, the annotate command streams its results to stdout for use with other Longbow commands or tools (e.g. samtools) via Unix pipes.

This command is parallelizable on a per-read level. On a system with N cores, it will attempt to use N-1 cores by default.

Optionally, this command can make use of a PacBio index (.pbi) file, which specifies the total number of reads in the file and can therefore be useful for accurate progress reporting. The .pbi file has no effect on speed or accuracy of results; it is solely for convenience in progress logging.

Command help

$ longbow annotate --help
Usage: longbow annotate [OPTIONS] INPUT_BAM

  Annotate reads in a BAM file with segments from the model.

Options:
  -v, --verbosity LVL    Either CRITICAL, ERROR, WARNING, INFO or DEBUG
  -p, --pbi PATH         BAM .pbi index file
  -t, --threads INTEGER  number of threads to use (0 for all)  [default: # Processors - 1]
  -o, --output-bam PATH  annotated bam output  [default: stdout]
  -m, --model TEXT       The model to use for annotation.  If the given value
                         is a pre-configured model name, then that model will
                         be used.  Otherwise, the given value will be treated
                         as a file name and Longbow will attempt to read in
                         the file and create a LibraryModel from it.  Longbow
                         will assume the contents are the configuration of a
                         LibraryModel as per LibraryModel.to_json().
                         [default: mas15]
  -c, --chunk TEXT       Process a single chunk of data (e.g. specify '2/4' to
                         process the second of four equally-sized chunks
                         across the dataset)
  --max-length INTEGER   Maximum length of a read to process.  Reads beyond
                         this length will not be annotated.  [default: 60000]
  --min-rq FLOAT         Minimum ccs-determined read quality for a read to be
                         annotated.  CCS read quality range is [-1,1].
                         [default: -2.0]
  --help                 Show this message and exit.

Example

$ longbow annotate -o annotated.bam tests/test_data/mas15_test_input.bam
[INFO 2021-08-12 09:55:07 annotate] Invoked via: longbow annotate -o annotated.bam tests/test_data/mas15_test_input.bam
[INFO 2021-08-12 09:55:07 annotate] Running with 11 worker subprocess(es)
[INFO 2021-08-12 09:55:07 annotate] Using The standard MAS-seq 15 array element model.
[INFO 2021-08-12 09:55:10 annotate] Annotating 8 reads
Progress: 100%|████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.26 read/s]
[INFO 2021-08-12 09:55:12 annotate] Annotated 8 reads with 557 total sections.
[INFO 2021-08-12 09:55:12 annotate] Done. Elapsed time: 5.35s. Overall processing rate: 1.50 reads/s.