The RF Count module is the core component of the framework. It can process any number of SAM/BAM files to calculate per-base RT-stops/mutations and read coverage on each transcript.

Usage

To list the required parameters, simply type:

$ rf-count -h
| Parameter | Type | Description |
| --- | --- | --- |
| -p or --processors | int | Number of processors (threads) to use (Default: 1) |
| -wt or --working-threads | int | Number of working threads to use for each instance of SAMTools/Bowtie (Default: 1). Note: RF Count executes 1 instance of SAMTools for each processor specified by -p. At least -p <processors> * -wt <threads> processors are required. |
| -t or --tmp-dir | string | Path to a directory for temporary files (Default: /tmp). Note: If the provided directory does not exist, it will be created. |
| -o or --output-dir | string | Output directory for writing counts in RC (RNA Count) format (Default: rf_count/) |
| -ow or --overwrite | | Overwrites the output directory if it already exists |
| -nm or --no-mapped-count | | Disables counting of total mapped reads. Note: This option must be avoided when processing SAM/BAM files from Ψ-seq/Pseudo-seq and 2OMe-seq experiments. |
| -s or --samtools | string | Path to the samtools executable (Default: assumes samtools is in PATH) |
| -r or --sorted | | In case SAM/BAM files are passed, assumes that they are already sorted lexicographically by transcript ID, and numerically by position |
| -t5 or --trim-5prime | int[,int] | Comma-separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0). Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads). Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files. |
| -fh or --from-header | | Instead of taking the number of bases trimmed from the 5'-end of reads from the -t5 (or --trim-5prime) parameter, RF Count will try to guess it automatically from the header of the provided SAM/BAM files |
| -f or --fasta | string | Path to a FASTA file containing the reference transcripts. Note #1: Transcripts in this file must match transcripts in SAM/BAM file headers. Note #2: This can be omitted if a Bowtie index is specified by -bi (or --bowtie-index). |
| -po or --paired-only | | When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered |
| -pp or --properly-paired | | When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered |
| -i or --include-clipped | | Include reads that have been soft/hard-clipped at their 5'-end when calculating RT-stops. Note: The default behavior is to exclude soft/hard-clipped reads. When this option is active, the RT-stop position is considered to be the position preceding the clipped bases. This option has no effect when -m (or --count-mutations) is enabled. |
| -m or --count-mutations | | Enables mutation counting instead of RT-stop counting (for SHAPE-MaP/DMS-MaPseq) |
| -mq or --min-quality | int | Minimum quality score value to consider a mutation (Phred+33, Default: 20) |
| -nd or --no-deletions | | Disables counting unambiguously mapped deletions as mutations (requires -m) |
| -md or --max-deletion-len | int | Ignores deletions longer than this number of nucleotides (requires -m, Default: 3) |
| -me or --min-edit-distance | int | Discards reads with fewer than this number of mutations/deletions (Default: 0) |
| -co or --coverage-only | | Only calculates per-base coverage (disables RT-stop/mutation counting) |
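
The notes on -p/-wt and -t5 above can be illustrated with a small Python sketch that assembles an rf-count command line. The helper names (`required_threads`, `expand_trim5`, `build_command`) are hypothetical and not part of RF Count; only flags listed in the table are used, and rejecting a -t5 list whose length does not match the number of files is an assumption about sensible behavior, not documented behavior of the tool.

```python
def required_threads(processors=1, working_threads=1):
    """Threads needed: one SAMTools instance per -p processor, each using -wt threads."""
    return processors * working_threads


def expand_trim5(values, n_files):
    """Pair -t5 values with input files; a single value applies to every file."""
    values = list(values)
    if len(values) == 1:
        return values * n_files
    if len(values) != n_files:
        # Assumption: a mismatched list is treated as an error in this sketch.
        raise ValueError("need one -t5 value, or one value per input file")
    return values


def build_command(files, fasta, processors=1, working_threads=1,
                  trim5=(0,), count_mutations=False):
    """Return the rf-count argument list (hypothetical helper, nothing is executed)."""
    trims = expand_trim5(trim5, len(files))
    cmd = ["rf-count",
           "-p", str(processors),
           "-wt", str(working_threads),
           "-f", fasta,
           "-t5", ",".join(str(t) for t in trims)]
    if count_mutations:
        cmd.append("-m")  # count mutations instead of RT-stops
    return cmd + list(files)
```

For instance, `build_command(["file1.bam", "file2.bam"], "ref.fa", trim5=(0, 5))` yields the arguments of `rf-count -p 1 -wt 1 -f ref.fa -t5 0,5 file1.bam file2.bam`, and `required_threads(4, 2)` shows that -p 4 with -wt 2 needs at least 8 processors.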


Deletions re-alignment in mutational profiling-based methods

Mutational profiling (MaP) methods for RNA structure analysis rely on the ability of certain reverse transcriptases to read through sites of SHAPE/DMS modification under specific reaction conditions. Some of these enzymes (e.g. SuperScript II) can introduce deletions when they encounter a SHAPE/DMS-modified residue. During read mapping, the aligner often reports a single possible alignment of the deletion, although many equally scoring alignments are possible.
To avoid counting ambiguously aligned deletions, which can introduce noise into the measured structural signal, RF Count performs a deletion re-alignment step to detect and discard them:

ATTACGCGGATCTACGAAAGCTTTACGGACGGTAC     # Reference
ATTACGCGGATCTACGA-AGCTTTACGGACGGTAC     # Alignment

ATTACGCGGATCTACGA|AGCTTTACGGACGGTAC     # Sequence surrounding deletion

# Slide the deletion along sequence     # Extract surrounding sequence
ATTACGCGGATC-ACGAAAGCTTTACGGACGGTAC     ATTACGCGGATC|ACGAAAGCTTTACGGACGGTAC #1
ATTACGCGGATCT-CGAAAGCTTTACGGACGGTAC     ATTACGCGGATCT|CGAAAGCTTTACGGACGGTAC #2
ATTACGCGGATCTA-GAAAGCTTTACGGACGGTAC     ATTACGCGGATCTA|GAAAGCTTTACGGACGGTAC #3
ATTACGCGGATCTAC-AAAGCTTTACGGACGGTAC     ATTACGCGGATCTAC|AAAGCTTTACGGACGGTAC #4
ATTACGCGGATCTACG-AAGCTTTACGGACGGTAC     ATTACGCGGATCTACG|AAGCTTTACGGACGGTAC #5
ATTACGCGGATCTACGAA-GCTTTACGGACGGTAC     ATTACGCGGATCTACGAA|GCTTTACGGACGGTAC #6
ATTACGCGGATCTACGAAA-CTTTACGGACGGTAC     ATTACGCGGATCTACGAAA|CTTTACGGACGGTAC #7
ATTACGCGGATCTACGAAAG-TTTACGGACGGTAC     ATTACGCGGATCTACGAAAG|TTTACGGACGGTAC #8

# Compare surrounding sequence from each slid deletion to that from the original alignment
ATTACGCGGATCTACGA|AGCTTTACGGACGGTAC     # Original alignment
ATTACGCGGATCTACG|AAGCTTTACGGACGGTAC     # 5
ATTACGCGGATCTACGAA|GCTTTACGGACGGTAC     # 6

# Concatenate surrounding sequences
ATTACGCGGATCTACGAAGCTTTACGGACGGTAC      # Original alignment
ATTACGCGGATCTACGAAGCTTTACGGACGGTAC      # 5
ATTACGCGGATCTACGAAGCTTTACGGACGGTAC      # 6

# Deletion is discarded because it is NOT unambiguously aligned

For more information, please refer to Smola et al., 2015 (PMID: 26426499).
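
The sliding procedure illustrated above can be sketched in Python as follows. This is a minimal illustration rather than RF Count's actual implementation: the function names and the `window` parameter (how far the deletion is slid in each direction) are assumptions, and positions are 0-based.

```python
def equivalent_deletion_positions(reference, start, length=1, window=5):
    """Return other start positions where deleting `length` bases
    produces exactly the same sequence as the reported deletion."""
    # Sequence obtained by concatenating the bases surrounding the deletion.
    original = reference[:start] + reference[start + length:]
    lo = max(0, start - window)
    hi = min(len(reference) - length, start + window)
    return [pos for pos in range(lo, hi + 1)
            if pos != start
            and reference[:pos] + reference[pos + length:] == original]


def is_unambiguous(reference, start, length=1, window=5):
    """A deletion is unambiguous when no slid position gives the same sequence."""
    return not equivalent_deletion_positions(reference, start, length, window)
```

With the reference from the example, the reported deletion sits at 0-based position 17, in the middle of the AAA run. `equivalent_deletion_positions(reference, 17)` finds positions 16 and 18 (alignments #5 and #6 above), so the deletion is ambiguous and would be discarded. Deleting the G at position 19 instead has no equivalent placement.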

RC (RNA Count) format

RF Count produces an RC (RNA Count) file for each analyzed sample. RC files are proprietary binary files that store each transcript's sequence, per-base RT-stop/mutation counts, and per-base read coverage. These files can be indexed for fast random access.
Each entry in a RC file is structured as follows:

| Field | Description | Type |
| --- | --- | --- |
| len_transcript_id | Length of the transcript ID (plus 1, including NULL) | uint32_t |
| transcript_id | Transcript ID (NULL terminated) | char[len_transcript_id] |
| len_seq | Length of sequence | uint32_t |
| seq | 4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first) | uint8_t[(len_seq+1)/2] |
| counts | Transcript's per base RT-stops (or mutations) | uint32_t[len_seq] |
| coverage | Transcript's per base coverage | uint32_t[len_seq] |
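
As an illustration of the entry layout above, the following Python sketch packs and reads back a single entry with the standard `struct` module. The function names are hypothetical. The byte order is not stated in this section, so little-endian (`<`) is an assumption, as is the use of 0 for the padding nibble when the sequence length is odd.

```python
import io
import struct

ALPHABET = "ACGTN"  # 'ACGTN' -> codes 0..4


def pack_seq(seq):
    """4-bit encode a sequence, two bases per byte, high nybble first."""
    codes = [ALPHABET.index(base) for base in seq]
    if len(codes) % 2:
        codes.append(0)  # assumption: odd-length sequences are padded with 0
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_seq(data, length):
    """Decode the first `length` bases from 4-bit packed bytes."""
    nibbles = []
    for byte in data:
        nibbles.extend((byte >> 4, byte & 0x0F))
    return "".join(ALPHABET[n] for n in nibbles[:length])


def pack_entry(transcript_id, seq, counts, coverage):
    """Serialize one entry (little-endian byte order is an assumption)."""
    tid = transcript_id.encode("ascii") + b"\x00"  # NULL-terminated ID
    n = len(seq)
    return (struct.pack("<I", len(tid)) + tid
            + struct.pack("<I", n) + pack_seq(seq)
            + struct.pack(f"<{n}I", *counts)
            + struct.pack(f"<{n}I", *coverage))


def read_entry(stream):
    """Read one entry from a binary stream positioned at its first byte."""
    (id_len,) = struct.unpack("<I", stream.read(4))
    transcript_id = stream.read(id_len).rstrip(b"\x00").decode("ascii")
    (n,) = struct.unpack("<I", stream.read(4))
    seq = unpack_seq(stream.read((n + 1) // 2), n)
    counts = list(struct.unpack(f"<{n}I", stream.read(4 * n)))
    coverage = list(struct.unpack(f"<{n}I", stream.read(4 * n)))
    return transcript_id, seq, counts, coverage


# Round trip through an in-memory buffer.
entry = pack_entry("tx1", "ACGTN", [0, 1, 2, 3, 4], [9, 9, 9, 9, 9])
decoded = read_entry(io.BytesIO(entry))
```

Under these assumptions, a 5-base transcript with a 3-character ID occupies 4 + 4 + 4 + 3 + 20 + 20 = 55 bytes.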

The end of each RC file stores the total number of mapped reads (a uint64_t packed as 2 x uint32_t), followed by an EOF marker, and is structured as follows:

| Field | Description | Type |
| --- | --- | --- |
| n1 | Total experiment mapped reads >> 32 | uint32_t |
| n2 | Total experiment mapped reads & 0xFFFFFFFF | uint32_t |
| marker | EOF marker (\x5b\x65\x66\x72\x74\x63\x5d) | char[7] |
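
The split of the mapped read count into two 32-bit halves can be sketched in Python as follows. The function names are hypothetical, and little-endian byte order is again an assumption, since this section does not specify it.

```python
import struct

EOF_MARKER = b"\x5b\x65\x66\x72\x74\x63\x5d"  # the 7 ASCII bytes "[efrtc]"


def pack_footer(total_mapped):
    """Pack the mapped read count as high and low 32-bit halves, then the marker."""
    n1 = total_mapped >> 32         # upper 32 bits
    n2 = total_mapped & 0xFFFFFFFF  # lower 32 bits
    return struct.pack("<II", n1, n2) + EOF_MARKER


def read_footer(data):
    """Recover the mapped read count from the last 15 bytes of RC file content."""
    tail = data[-15:]
    if len(tail) != 15 or tail[8:] != EOF_MARKER:
        raise ValueError("EOF marker not found")
    n1, n2 = struct.unpack("<II", tail[:8])
    return (n1 << 32) | n2
```

For a count of 5,000,000,000 reads, which does not fit in 32 bits, n1 is 1 and n2 is 705,032,704.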

RCI (RC Index) files enable random access to transcript data within RC files.
The RCI index is structured as follows:

| Field | Description | Type |
| --- | --- | --- |
| len_transcript_id | Length of the transcript ID (plus 1, including NULL) | uint32_t |
| transcript_id | Transcript ID (NULL terminated) | char[len_transcript_id] |
| offset | Offset of transcript in the RC file | uint32_t |
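
A reader for this index might look like the Python sketch below. It assumes that an RCI file is a plain concatenation of such records with no additional header or footer, and that values are little-endian; neither is stated in this section. The function names are hypothetical.

```python
import struct


def pack_index_record(transcript_id, offset):
    """Serialize one index record (little-endian byte order is an assumption)."""
    tid = transcript_id.encode("ascii") + b"\x00"  # NULL-terminated ID
    return struct.pack("<I", len(tid)) + tid + struct.pack("<I", offset)


def read_index(data):
    """Parse concatenated index records into a {transcript_id: offset} dict."""
    offsets = {}
    pos = 0
    while pos < len(data):
        (id_len,) = struct.unpack_from("<I", data, pos)
        pos += 4
        transcript_id = data[pos:pos + id_len].rstrip(b"\x00").decode("ascii")
        pos += id_len
        (offset,) = struct.unpack_from("<I", data, pos)
        pos += 4
        offsets[transcript_id] = offset
    return offsets
```

Random access then amounts to looking up a transcript's offset in the dict, seeking to that byte in the RC file, and reading a single entry from there.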