The RF Count module is the core component of the framework. It can process any number of SAM/BAM files to calculate per-base RT-stops/mutations and read coverage on each transcript.

Usage

To list the required parameters, simply type:

$ rf-count -h
Parameter Type Description
-p or --processors int Number of processors (threads) to use (Default: 1)
-wt or --working-threads int Number of working threads to use for each instance of SAMTools/Bowtie (Default: 1).
Note: RF Count executes 1 instance of SAMTools for each processor specified by -p. At least -p <processors> * -wt <threads> processors are required.
-o or --output-dir string Output directory for writing counts in RC (RNA Count) format (Default: rf_count/)
-ow or --overwrite Overwrites the output directory if it already exists
-t or --tmp-dir string Path to a directory in which temporary files are created (Default: /tmp)
Note: If the provided directory does not exist, it will be created
-nm or --no-mapped-count Disables counting of total mapped reads
Note: This option must be avoided when processing SAM/BAM files from Ψ-seq/Pseudo-seq and 2OMe-seq experiments.
-s or --samtools string Path to samtools executable (Default: assumes samtools is in PATH)
-r or --sorted In case SAM/BAM files are passed, assumes that they are already sorted lexicographically by transcript ID, and numerically by position
-t5 or --trim-5prime int[,int] Comma separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0)
Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam, will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads)
Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files
-fh or --from-header Instead of providing the number of bases trimmed from 5'-end of reads through the -t5 (or --trim-5prime) parameter, RF Count will try to guess it automatically from the header of the provided SAM/BAM files
-f or --fasta string Path to a FASTA file containing the reference transcripts
Note #1: Transcripts in this file must match transcripts in SAM/BAM file headers
Note #2: This can be omitted if a Bowtie index is specified by -bi (or --bowtie-index)
-po or --paired-only When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered
-pp or --properly-paired When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered
-i or --include-clipped Include reads that have been soft/hard-clipped at their 5'-end when calculating RT-stops
Note: The default behavior is to exclude soft/hard-clipped reads. When this option is active, the RT-stop position is considered to be the position preceding the clipped bases. This option has no effect when -m (or --count-mutations) is enabled.
-mq or --map-quality int Minimum mapping quality to consider a read (Default: 10)
-co or --coverage-only Only calculates per-base coverage (disables RT-stops/mutations count)
-m or --count-mutations Enables mutations count instead of RT-stops count (for SHAPE-MaP/DMS-MaPseq)
Mutation count mode options
-q or --min-quality int Minimum quality score value to consider a mutation (Phred+33, requires -m, Default: 20)
-es or --eval-surrounding When considering a mutation/indel, also evaluates the quality of surrounding bases (±1 nt)
Note: the quality score threshold set by -q (or --min-quality) also applies to these bases
-nd or --no-deletions Ignores deletions
-ni or --no-insertions Ignores insertions
-na or --no-ambiguous Ignores ambiguously mapped deletions
Note: the default behavior is to re-align them to their left-most valid position (or to their right-most valid position if -ra has been specified)
-ra or --right-align Re-aligns ambiguously mapped deletions to their right-most valid position
-md or --max-deletion-len int Ignores deletions longer than this number of nucleotides (Default: 10)
-me or --max-edit-distance float Discards reads with editing distance frequency higher than this threshold (0<m≤1, Default: 0.15 [15%])
-eq or --median-quality int Median quality score threshold for discarding low-quality reads (Phred+33, Default: 20)
-cc or --collapse-consecutive Collapses consecutive mutations/indels toward the 3'-most one (recommended for SHAPE-MaP experiments)
-mc or --max-collapse-distance int Maximum distance between consecutive mutations/indels to allow collapsing (requires -cc, ≥0, Default: 2)
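
As a rough illustration of how the options above fit together, the following Python sketch assembles an rf-count command line and checks two constraints stated in the notes: the -p * -wt processor requirement, and the rule that -t5 takes either a single value or one value per input file. The helper names (required_processors, build_rf_count_argv) and the file names are hypothetical; only the flags themselves come from the table above.

```python
def required_processors(processors=1, working_threads=1):
    """Minimum number of processors implied by -p and -wt."""
    return processors * working_threads


def build_rf_count_argv(inputs, fasta, processors=1, working_threads=1,
                        trim_5prime=(0,), count_mutations=False):
    """Build an argument list for rf-count (hypothetical helper)."""
    # -t5 accepts a single value (used for all files) or one value per file,
    # given in the same order as the input files.
    if len(trim_5prime) not in (1, len(inputs)):
        raise ValueError("-t5 needs one value, or one value per input file")

    argv = [
        "rf-count",
        "-p", str(processors),
        "-wt", str(working_threads),
        "-f", fasta,
        "-t5", ",".join(str(n) for n in trim_5prime),  # comma separated, no spaces
    ]
    if count_mutations:
        argv.append("-m")  # count mutations instead of RT-stops
    return argv + list(inputs)


argv = build_rf_count_argv(["file1.bam", "file2.bam"], "transcripts.fa",
                           processors=2, working_threads=4,
                           trim_5prime=(0, 5))
print(" ".join(argv))
print("processors needed:", required_processors(2, 4))
```

The resulting list can be handed to subprocess.run on a machine where rf-count is installed.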


Deletions re-alignment in mutational profiling-based methods

Mutational profiling (MaP) methods for RNA structure analysis rely on the ability of certain reverse transcriptase enzymes to read through sites of SHAPE/DMS modification under specific reaction conditions. Some of them (e.g. SuperScript II) can introduce deletions when encountering a SHAPE/DMS-modified residue. During read mapping, the aligner often reports a single alignment for the deletion, although many equally-scoring alignments are possible.
To avoid counting ambiguously aligned deletions, which can introduce noise in the measured structural signal, RF Count performs a re-alignment step that detects these deletions and either re-aligns or discards them:

[Figure: Ambiguous deletions]

For more information, please refer to Smola et al., 2015 (PMID: 26426499).
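
The notion of an ambiguously aligned deletion can be made concrete with a short Python sketch. This is not RF Count's actual implementation: it assumes 0-based coordinates, a read that otherwise matches the reference around the deletion, and hypothetical function names. A deletion is ambiguous whenever it can be slid left or right along the reference without changing the sequence that remains.

```python
def deletion_bounds(ref, start, length):
    """Left-most and right-most start positions of an equivalent deletion.

    The deletion removes ref[start:start + length] (0-based, assumed).
    """
    left = start
    # Sliding left by one keeps the result unchanged if the base entering the
    # deletion equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def realign_deletion(ref, start, length, right_align=False, discard_ambiguous=False):
    """Mimic the documented options: default left-align, -ra, and -na."""
    left, right = deletion_bounds(ref, start, length)
    if left != right and discard_ambiguous:
        return None  # analogous to -na: ignore ambiguous deletions
    return right if right_align else left  # analogous to -ra when right_align


ref = "ACGGGT"
# Deleting any one of the three G's (positions 2, 3, 4) yields "ACGGT".
print(deletion_bounds(ref, 3, 1))
print(realign_deletion(ref, 3, 1))
print(realign_deletion(ref, 3, 1, right_align=True))
```

A deletion for which both bounds coincide has a single valid placement and is left untouched.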

Handling of mutations/indels

A quick look at the many parameters of RF Count makes it clear that different parameter combinations can produce very different outcomes. The scheme below illustrates how RF Count behaves under different parameter combinations (dots mark the sites to which mutations are assigned):

[Figure: RF Count MaP handling]
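
To give a feel for what collapsing consecutive mutations means, here is a minimal Python sketch of one possible interpretation of -cc and -mc. Several points are assumptions rather than documented behavior: positions are taken to increase in the 5' to 3' direction, "distance" is taken to be the number of unmutated bases between two mutated positions, and runs are chained (each mutation is compared with the previous one in the run). The function name is hypothetical.

```python
def collapse_consecutive(positions, max_distance=2):
    """Keep only the 3'-most mutation of each run of nearby mutations.

    positions: mutated positions within one read.
    max_distance: largest gap (unmutated bases in between) that still
                  counts as "consecutive" (assumed meaning of -mc).
    """
    collapsed = []
    for pos in sorted(positions):
        if collapsed and pos - collapsed[-1] - 1 <= max_distance:
            collapsed[-1] = pos   # same run: move the call toward the 3' end
        else:
            collapsed.append(pos)  # start a new run
    return collapsed


mutations = [3, 4, 5, 10, 20, 22]
print(collapse_consecutive(mutations))                  # default distance of 2
print(collapse_consecutive(mutations, max_distance=0))  # adjacent sites only
```

Under these assumptions, a larger distance merges more sites into a single call, which is why the choice of -mc can change the resulting profile noticeably.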

RC (RNA Count) format

RF Count produces an RC (RNA Count) file for each analyzed sample. RC files are proprietary binary files that store each transcript's sequence, per-base RT-stop/mutation counts, and per-base read coverage, together with the total number of mapped reads. These files can be indexed for fast random access.
Each entry in a RC file is structured as follows:

Field Description Type
len_transcript_id Length of the transcript ID (plus 1, including NULL) uint32_t
transcript_id Transcript ID (NULL terminated) char[len_transcript_id]
len_seq Length of sequence uint32_t
seq 4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first) uint8_t[(len_seq+1)/2]
counts Transcript's per base RT-stops (or mutations) uint32_t[len_seq]
coverage Transcript's per base coverage uint32_t[len_seq]
nt Transcript's mapped reads uint64_t
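
The entry layout above can be read with Python's struct module. The sketch below is an illustration built only from the field table and the little-endian convention of the format, not a reference parser; the function name and the synthetic entry are hypothetical.

```python
import io
import struct

ALPHABET = "ACGTN"  # 4-bit codes 0..4, as in the field table


def read_rc_entry(stream):
    """Read one transcript entry from a binary stream positioned at its start."""
    (id_len,) = struct.unpack("<I", stream.read(4))
    # The ID is NULL terminated; id_len includes the terminator.
    transcript_id = stream.read(id_len).rstrip(b"\x00").decode("ascii")
    (seq_len,) = struct.unpack("<I", stream.read(4))
    packed = stream.read((seq_len + 1) // 2)
    # Two bases per byte, high nybble first; ignore the padding nybble, if any.
    sequence = "".join(
        ALPHABET[(packed[i // 2] >> (4 if i % 2 == 0 else 0)) & 0x0F]
        for i in range(seq_len)
    )
    counts = struct.unpack(f"<{seq_len}I", stream.read(4 * seq_len))
    coverage = struct.unpack(f"<{seq_len}I", stream.read(4 * seq_len))
    (mapped_reads,) = struct.unpack("<Q", stream.read(8))
    return {
        "id": transcript_id,
        "sequence": sequence,
        "counts": list(counts),
        "coverage": list(coverage),
        "mapped_reads": mapped_reads,
    }


# Synthetic entry for transcript "tx1" with sequence ACGTN (5 bases, 3 bytes).
raw = (
    struct.pack("<I", 4) + b"tx1\x00"
    + struct.pack("<I", 5) + bytes([0x01, 0x23, 0x40])
    + struct.pack("<5I", 0, 1, 2, 3, 4)
    + struct.pack("<5I", 9, 9, 9, 9, 9)
    + struct.pack("<Q", 42)
)
entry = read_rc_entry(io.BytesIO(raw))
print(entry["id"], entry["sequence"], entry["counts"], entry["mapped_reads"])
```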

The end of each RC file (EOF) stores the total number of mapped reads, and is structured as follows:

Field Description Type
n Total experiment mapped reads uint64_t
version RC file version uint16_t
marker EOF marker (\x5b\x65\x6f\x66\x72\x63\x5d) char[7]

The current RC standard's version is 1.
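
A quick way to validate an RC file is to inspect this trailer. The Python sketch below assumes that the three fields occupy the last 17 bytes of the file (8 + 2 + 7) with no padding; the function name and the synthetic data are hypothetical.

```python
import io
import struct

RC_EOF_MARKER = b"[eofrc]"        # \x5b\x65\x6f\x66\x72\x63\x5d
TRAILER = struct.Struct("<QH7s")  # n, version, marker; little-endian, 17 bytes


def read_rc_trailer(stream):
    """Return (total mapped reads, RC version) from the end of an RC stream."""
    stream.seek(-TRAILER.size, io.SEEK_END)
    total_reads, version, marker = TRAILER.unpack(stream.read(TRAILER.size))
    if marker != RC_EOF_MARKER:
        raise ValueError("EOF marker not found: not an RC file, or truncated")
    return total_reads, version


# Synthetic stream: some entry bytes followed by a version 1 trailer.
blob = b"\x00" * 10 + TRAILER.pack(123456, 1, RC_EOF_MARKER)
total_reads, version = read_rc_trailer(io.BytesIO(blob))
print(total_reads, version)
```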
RCI (RC Index) files enable random access to transcript data within RC files.
The RCI index is structured as follows:

Field Description Type
len_transcript_id Length of the transcript ID (plus 1, including NULL) uint32_t
transcript_id Transcript ID (NULL terminated) char[len_transcript_id]
offset Offset of transcript in the RC file uint64_t

Note: All values are stored in little-endian byte order.
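
The index layout lends itself to a simple lookup table. The Python sketch below assumes that an RCI file is a plain concatenation of the records described above, with no extra header or trailer; this is an assumption, as are the function name and the synthetic records.

```python
import io
import struct


def read_rci_index(stream):
    """Map transcript IDs to their byte offsets in the companion RC file."""
    index = {}
    while True:
        header = stream.read(4)
        if len(header) < 4:  # end of the index (assumed: no trailer)
            break
        (id_len,) = struct.unpack("<I", header)
        # The ID is NULL terminated; id_len includes the terminator.
        transcript_id = stream.read(id_len).rstrip(b"\x00").decode("ascii")
        (offset,) = struct.unpack("<Q", stream.read(8))
        index[transcript_id] = offset
    return index


# Two synthetic index records.
raw_index = (
    struct.pack("<I", 4) + b"tx1\x00" + struct.pack("<Q", 0)
    + struct.pack("<I", 4) + b"tx2\x00" + struct.pack("<Q", 69)
)
index = read_rci_index(io.BytesIO(raw_index))
print(index)
```

To fetch a single transcript, seek to index[transcript_id] in the RC file and read one entry from that position.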