The RF Count module is the core component of the framework. It can process any number of SAM/BAM files to calculate per-base RT-stops/mutations and read coverage on each transcript.

Usage

To list the required parameters, simply type:

$ rf-count -h
Parameter Type Description
-p or --processors int Number of processors (threads) to use (Default: 1)
-wt or --working-threads int Number of working threads to use for each instance of SAMTools/Bowtie (Default: 1).
Note: RF Count executes 1 instance of SAMTools for each processor specified by -p, so at least -p <processors> * -wt <threads> processors are required.
-o or --output-dir string Output directory for writing counts in RC (RNA Count) format (Default: rf_count/)
-ow or --overwrite Overwrites the output directory if it already exists
-t or --tmp-dir string Path to a directory for the creation of temporary files (Default: /tmp)
Note: If the provided directory does not exist, it will be created
-nm or --no-mapped-count Disables counting of total mapped reads
Note: This option must be avoided when processing SAM/BAM files from Ψ-seq/Pseudo-seq and 2OMe-seq experiments.
-s or --samtools string Path to samtools executable (Default: assumes samtools is in PATH)
-r or --sorted In case SAM/BAM files are passed, assumes that they are already sorted lexicographically by transcript ID, and numerically by position
-t5 or --trim-5prime int[,int] Comma separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0)
Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam, will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads)
Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files
-fh or --from-header Instead of providing the number of bases trimmed from 5'-end of reads through the -t5 (or --trim-5prime) parameter, RF Count will try to guess it automatically from the header of the provided SAM/BAM files
-f or --fasta string Path to a FASTA file containing the reference transcripts
Note #1: Transcripts in this file must match transcripts in SAM/BAM file headers
Note #2: This can be omitted if a Bowtie index is specified by -bi (or --bowtie-index)
-mf or --mask-file string Path to a mask file
-po or --paired-only When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered
-pp or --properly-paired When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered
-i or --include-clipped Include reads that have been soft/hard-clipped at their 5'-end when calculating RT-stops
Note: The default behavior is to exclude soft/hard-clipped reads. When this option is active, the RT-stop position is considered to be the position preceding the clipped bases. This option has no effect when -m (or --count-mutations) is enabled.
-mq or --map-quality int Minimum mapping quality to consider a read (Default: 10)
-co or --coverage-only Only calculates per-base coverage (disables RT-stops/mutations count)
-m or --count-mutations Enables mutations count instead of RT-stops count (for SHAPE-MaP/DMS-MaPseq)
Mutation count mode options
-q or --min-quality int Minimum quality score value to consider a mutation (Phred+33, requires -m, Default: 20)
-es or --eval-surrounding When considering a mutation/indel, also evaluates the quality of surrounding bases (±1 nt)
Note: the quality score threshold set by -q (or --min-quality) also applies to these bases
-nd or --no-deletions Ignores deletions
-ni or --no-insertions Ignores insertions
-na or --no-ambiguous Ignores ambiguously mapped deletions
Note: the default behavior is to re-align them to their left-most valid position (or to their right-most valid position if -ra has been specified)
-ra or --right-align Re-aligns ambiguously mapped deletions to their right-most valid position
-md or --max-deletion-len int Ignores deletions longer than this number of nucleotides (Default: 10)
-me or --max-edit-distance float Discards reads with editing distance frequency higher than this threshold (0<m≤1, Default: 0.15 [15%])
-eq or --median-quality int Median quality score threshold for discarding low-quality reads (Phred+33, Default: 20)
-cc or --collapse-consecutive Collapses consecutive mutations/indels toward the 3'-most one (recommended for SHAPE-MaP experiments)
-mc or --max-collapse-distance int Maximum distance between consecutive mutations/indels to allow collapsing (requires -cc, ≥0, Default: 2)


Deletions re-alignment in mutational profiling-based methods

Mutational profiling (MaP) methods for RNA structure analysis rely on the ability of certain reverse transcriptase enzymes to read through sites of SHAPE/DMS modification under specific reaction conditions. Some of these enzymes (e.g. SuperScript II) can introduce deletions when they encounter a SHAPE/DMS-modified residue. During read mapping, the aligner often reports a single alignment for the deletion, although many equally-scoring alignments are possible.
Ambiguously aligned deletions can introduce noise in the measured structural signal. To avoid counting them, RF Count performs a deletion re-alignment step that detects these deletions and either re-aligns or discards them:

[Figure: Ambiguous deletions]

For more information, please refer to Smola et al., 2015 (PMID: 26426499).
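
The idea of left-most and right-most valid positions can be illustrated with a short Python sketch. This is a generic illustration of deletion shifting, not RF Count's actual code: the function names `deletion_bounds` and `realign` are hypothetical, coordinates are 0-based, and the `right_align` and `discard_ambiguous` arguments only mimic what -ra and -na are described as doing above.

```python
def deletion_bounds(ref, start, length):
    """Return the left-most and right-most start positions (0-based) at which
    deleting `length` bases from `ref` yields the same sequence as deleting
    ref[start:start + length]."""
    left = start
    # Shifting left by one is neutral when the base entering the deletion
    # equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Same reasoning for shifting right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def realign(ref, start, length, right_align=False, discard_ambiguous=False):
    """Pick a single position for the deletion, or None if it is discarded.
    (Illustrative only: mirrors the documented -ra / -na behaviors.)"""
    left, right = deletion_bounds(ref, start, length)
    if left != right and discard_ambiguous:
        return None
    return right if right_align else left


# A single-base deletion inside a run of four A's can sit at any of them.
print(deletion_bounds("GCAAAATG", 3, 1))
```

A deletion is ambiguous whenever the two bounds differ; an unambiguous deletion is returned unchanged.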

Handling of mutations/indels

RF Count exposes many parameters, and different combinations can produce very different outcomes. The following scheme illustrates how RF Count behaves under different parameter combinations (dots mark the sites to which mutations are assigned):

[Figure: RF Count MaP handling]
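
As a rough illustration of what collapsing toward the 3'-most site could look like, here is a minimal Python sketch. It rests on assumptions that the text above does not confirm: positions are transcript coordinates increasing toward the 3'-end, "distance" is taken as the plain difference between two positions, and chains of nearby sites are merged transitively. The function name `collapse_consecutive` is hypothetical, and RF Count may define distance or chaining differently.

```python
def collapse_consecutive(positions, max_distance=2):
    """Collapse runs of nearby mutation sites onto the 3'-most site of each run.

    ASSUMPTION: two sites belong to the same run when the difference between
    their positions is <= max_distance. This is an illustrative definition,
    not necessarily the one used by RF Count.
    """
    collapsed = []
    for pos in sorted(positions):
        if collapsed and pos - collapsed[-1] <= max_distance:
            # Extend the current run: keep only its 3'-most position.
            collapsed[-1] = pos
        else:
            # Start a new run.
            collapsed.append(pos)
    return collapsed


print(collapse_consecutive([10, 11, 13, 20, 30, 31]))
```

Under these assumptions, the sites 10, 11 and 13 form one run and are reported as a single event at 13, while 20 stays on its own.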

RC (RNA Count) format

RF Count produces an RC (RNA Count) file for each analyzed sample. RC files are proprietary binary files that store each transcript's sequence, per-base RT-stop/mutation counts, per-base read coverage, and the total number of mapped reads. These files can be indexed for fast random access.
Each entry in a RC file is structured as follows:

Field Description Type
len_transcript_id Length of the transcript ID (plus 1, including NULL) uint32_t
transcript_id Transcript ID (NULL terminated) char[len_transcript_id]
len_seq Length of sequence uint32_t
seq 4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first) uint8_t[(len_seq+1)/2]
counts Transcript's per base RT-stops (or mutations) uint32_t[len_seq]
coverage Transcript's per base coverage uint32_t[len_seq]
nt Transcript's mapped reads uint64_t
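
To make the entry layout concrete, the following Python sketch writes and reads back a single entry using only the standard `struct` module. It follows the field table literally and has not been validated against files produced by RF Count. The helper names (`pack_seq`, `unpack_seq`, `write_entry`, `read_entry`) are hypothetical, and the zero padding of the unused low nybble for odd-length sequences is an assumption.

```python
import io
import struct

ALPHABET = "ACGTN"  # codes 0..4, as in the table above


def pack_seq(seq):
    """Pack bases two per byte, high nybble first."""
    codes = [ALPHABET.index(base) for base in seq]
    if len(codes) % 2:
        codes.append(0)  # ASSUMPTION: unused low nybble is zero-padded
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack_seq(packed, length):
    """Decode `length` bases from a 4-bit packed sequence."""
    bases = []
    for i in range(length):
        byte = packed[i // 2]
        code = byte >> 4 if i % 2 == 0 else byte & 0x0F
        bases.append(ALPHABET[code])
    return "".join(bases)


def write_entry(stream, transcript_id, seq, counts, coverage, nt):
    """Serialize one entry; all integers are little-endian ('<')."""
    tid = transcript_id.encode("ascii") + b"\x00"  # NULL-terminated ID
    stream.write(struct.pack("<I", len(tid)))
    stream.write(tid)
    stream.write(struct.pack("<I", len(seq)))
    stream.write(pack_seq(seq))
    stream.write(struct.pack(f"<{len(seq)}I", *counts))
    stream.write(struct.pack(f"<{len(seq)}I", *coverage))
    stream.write(struct.pack("<Q", nt))


def read_entry(stream):
    """Parse one entry starting at the current stream position."""
    (id_len,) = struct.unpack("<I", stream.read(4))
    transcript_id = stream.read(id_len).rstrip(b"\x00").decode("ascii")
    (seq_len,) = struct.unpack("<I", stream.read(4))
    seq = unpack_seq(stream.read((seq_len + 1) // 2), seq_len)
    counts = list(struct.unpack(f"<{seq_len}I", stream.read(4 * seq_len)))
    coverage = list(struct.unpack(f"<{seq_len}I", stream.read(4 * seq_len)))
    (nt,) = struct.unpack("<Q", stream.read(8))
    return {"id": transcript_id, "seq": seq, "counts": counts,
            "coverage": coverage, "nt": nt}


# Round trip through an in-memory buffer.
buf = io.BytesIO()
write_entry(buf, "tx1", "ACGTN", [0, 1, 2, 3, 4], [5, 5, 5, 5, 5], 7)
buf.seek(0)
entry = read_entry(buf)
```

The in-memory buffer stands in for a real file; for actual data, the tools shipped with the framework remain the reference implementation.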

The end of each RC file (EOF) stores the total number of mapped reads, and is structured as follows:

Field Description Type
n Total experiment mapped reads uint64_t
version RC file version uint16_t
marker EOF marker (\x5b\x65\x6f\x66\x72\x63\x5d) char[7]
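
The three EOF fields add up to 8 + 2 + 7 = 17 bytes, so they can be decoded from the tail of the file in one step. The sketch below is a minimal Python example based only on the table above, assuming little-endian values; the names `TRAILER` and `parse_trailer` are hypothetical.

```python
import struct

EOF_MARKER = b"\x5b\x65\x6f\x66\x72\x63\x5d"  # spells "[eofrc]" in ASCII
TRAILER = struct.Struct("<QH7s")  # n (uint64), version (uint16), marker (7 bytes)


def parse_trailer(data):
    """Decode the final 17 bytes of an RC file whose content is in `data`."""
    if len(data) < TRAILER.size:
        raise ValueError("data is too short to contain an RC EOF block")
    n, version, marker = TRAILER.unpack(data[-TRAILER.size:])
    if marker != EOF_MARKER:
        raise ValueError("missing RC EOF marker")
    return n, version


# Synthetic example: arbitrary payload followed by a valid EOF block.
blob = b"payload" + TRAILER.pack(1234, 1, EOF_MARKER)
total_reads, version = parse_trailer(blob)
```

Checking the marker before trusting `n` is a cheap way to detect truncated files.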

The current RC standard's version is 1.
RCI (RC Index) files enable random access to transcript data within RC files.
The RCI index is structured as follows:

Field Description Type
len_transcript_id Length of the transcript ID (plus 1, including NULL) uint32_t
transcript_id Transcript ID (NULL terminated) char[len_transcript_id]
offset Offset of transcript in the RC file uint64_t
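
An index with this layout can be loaded into a dictionary and used to seek directly to a transcript in the RC file. The Python sketch below assumes that an RCI file consists solely of back-to-back records of the form shown above, with no header or trailer, which the table does not state explicitly. The function name `read_rci` is hypothetical.

```python
import struct


def read_rci(data):
    """Return {transcript_id: offset} from the raw bytes of an RCI file.

    ASSUMPTION: the file contains nothing but consecutive index records.
    """
    index = {}
    pos = 0
    while pos < len(data):
        (id_len,) = struct.unpack_from("<I", data, pos)
        pos += 4
        transcript_id = data[pos:pos + id_len].rstrip(b"\x00").decode("ascii")
        pos += id_len
        (offset,) = struct.unpack_from("<Q", data, pos)
        pos += 8
        index[transcript_id] = offset
    return index


# Synthetic two-record index.
raw = b"".join(
    struct.pack("<I", len(tid) + 1) + tid + b"\x00" + struct.pack("<Q", off)
    for tid, off in [(b"tx1", 0), (b"tx2", 59)]
)
index = read_rci(raw)
```

With the index in memory, a reader can call `seek(index["tx2"])` on the open RC file and parse a single entry from there.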

Information

All values are forced to be in little-endian byte-order.


Mask file

The mask file allows excluding specific transcript regions from being counted. This is particularly useful when performing targeted MaP analyses, in order to mask the primer pairing regions.
The mask file is composed of one or more lines. Each line reports a transcript ID followed by one or more regions to mask, given either as base ranges (1-based, inclusive) or as the nucleotide sequence of the region. Fields are separated by commas or semicolons, which can be mixed within a line:

Transcript_1;AGCGTATTAGCGATGCGATGCGA;25-38;504-551
Transcript_2,331-402,AUAUGGAUCGGACG,984-1008
Transcript_3;GUUACAUUCGA,98-123;47-68


Transcript regions specified in the mask file will have both counts and coverage set to 0 in the resulting RC file.
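
The mask file syntax is simple enough to parse in a few lines. The Python sketch below splits each line on commas and semicolons, separates numeric ranges from sequences, and zeroes the masked positions of a per-base array. It assumes that transcript IDs contain neither commas nor semicolons; the function names `parse_mask_line` and `apply_mask` are hypothetical, and locating sequence entries within the transcript is left out.

```python
import re


def parse_mask_line(line):
    """Split a mask line into (transcript_id, ranges, sequences)."""
    fields = [field for field in re.split(r"[,;]", line.strip()) if field]
    transcript_id, *items = fields
    ranges, sequences = [], []
    for item in items:
        match = re.fullmatch(r"(\d+)-(\d+)", item)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
        else:
            sequences.append(item)  # region given as a nucleotide sequence
    return transcript_id, ranges, sequences


def apply_mask(values, ranges):
    """Return a copy of `values` with 1-based inclusive ranges set to 0."""
    masked = list(values)
    for start, end in ranges:
        for i in range(start - 1, min(end, len(masked))):
            masked[i] = 0
    return masked


tid, ranges, seqs = parse_mask_line("Transcript_2,331-402,AUAUGGAUCGGACG,984-1008")
```

The same `apply_mask` call would be used on both the counts and the coverage arrays, matching the behavior described above.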