batch_seqstructhmm

Trains multiple Hidden Markov Models for the sequence-structure binding preferences of a given set of RNA-binding proteins. The models are trained on sequences and structures in FASTA format located in a given data directory. During training, statistics about the models are printed to stdout. In every iteration, the current model and a visualization of the model are stored in the batch directory. Each training process terminates when no significant progress has been made for three iterations.

usage: batch_seqstructhmm [-h] [--cores CORES]
                          [--structure_type STRUCTURE_TYPE]
                          [--motif_length MOTIF_LENGTH] [--baum_welch]
                          [--flexibility FLEXIBILITY]
                          [--block_size BLOCK_SIZE] [--threshold THRESHOLD]
                          [--termination_interval TERMINATION_INTERVAL]
                          data_directory proteins batch_directory
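
The usage line above can be turned into a concrete call. The following is a minimal sketch in Python that only assembles the argument list and does not run anything. The helper name build_command, the protein names PROT_A and PROT_B, and the directory names are hypothetical placeholders; only the flags and the argument order are taken from the usage line.

```python
import shlex

def build_command(data_directory, proteins, batch_directory,
                  cores=None, motif_length=None):
    """Assemble an argv list for batch_seqstructhmm (nothing is executed)."""
    argv = ["batch_seqstructhmm"]
    if cores is not None:
        argv += ["--cores", str(cores)]
    if motif_length is not None:
        argv += ["--motif_length", str(motif_length)]
    # The protein list is passed as ONE argument: names separated by
    # whitespace, so it needs quoting when typed in a shell.
    argv += [data_directory, " ".join(proteins), batch_directory]
    return argv

# Hypothetical example values, chosen only for illustration.
argv = build_command("data", ["PROT_A", "PROT_B"], "batch_out", cores=4)
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) would avoid shell quoting altogether, since the protein list then stays a single argument.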

Positional Arguments

data_directory data directory; must contain the sequence files under fasta/<protein>/positive.fasta and structure files under <structure_type>/<protein>/positive.txt
proteins list of RNA-binding proteins to analyze (surrounded by quotation marks, separated by whitespace)
batch_directory directory for batch output
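
Because the data directory must follow a fixed layout, it can help to check it before starting a long batch run. The sketch below is not part of the tool; missing_input_files is a hypothetical helper that only mirrors the two path patterns given for data_directory, and PROT_A is a placeholder protein name.

```python
import tempfile
from pathlib import Path

def missing_input_files(data_directory, proteins, structure_type="shapes"):
    """Return the expected input files that do not exist."""
    root = Path(data_directory)
    missing = []
    for protein in proteins:
        expected = [
            root / "fasta" / protein / "positive.fasta",       # sequences
            root / structure_type / protein / "positive.txt",  # structures
        ]
        missing += [path for path in expected if not path.is_file()]
    return missing

# Demo on a temporary directory that holds only the sequence file.
with tempfile.TemporaryDirectory() as tmp:
    seq = Path(tmp) / "fasta" / "PROT_A" / "positive.fasta"
    seq.parent.mkdir(parents=True)
    seq.touch()
    missing = missing_input_files(tmp, ["PROT_A"])
    missing_names = [p.relative_to(tmp).as_posix() for p in missing]

print(missing_names)  # the structure file is reported as missing
```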

Named Arguments

--cores, -c number of cores to use (if not given, all cores are used)
--structure_type, -s structure type to use; must match the location of the structure files (see the data_directory argument above) (default: shapes)

--motif_length, -n length of the motifs to be found (default: 6)

--baum_welch, -b initialize the models with a Baum-Welch optimized sequence motif (default: yes)

--flexibility, -f greediness of the Gibbs sampler: model parameters are sampled from among the top f configurations (default: f=10); set f to 0 to include all possible configurations
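
The effect of f can be illustrated with a small sketch. This is not the tool's sampler: the function names, the score dictionary, and in particular the choice to sample in proportion to each configuration's score are assumptions made for illustration. Only the rule "keep the top f configurations, or all of them when f is 0" comes from the option description.

```python
import random

def candidate_pool(scores, flexibility=10):
    """Return the configurations eligible for sampling, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # flexibility == 0 means: do not truncate, keep every configuration.
    return ranked if flexibility == 0 else ranked[:flexibility]

def sample_configuration(scores, flexibility=10, rng=random):
    """Draw one configuration from the pool.

    Assumption: weights are proportional to the (positive) scores.
    """
    pool = candidate_pool(scores, flexibility)
    weights = [scores[config] for config in pool]
    return rng.choices(pool, weights=weights, k=1)[0]

# Hypothetical scores for three configurations.
scores = {"a": 0.5, "b": 0.3, "c": 0.2}
print(candidate_pool(scores, flexibility=2))
print(candidate_pool(scores, flexibility=0))
```

With f=1 the sampler becomes fully greedy and always returns the best configuration; larger values of f trade greediness for exploration.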

--block_size number of sequences to be held out in each iteration (default: 1)

--threshold, -t the iterative algorithm terminates if this reduction in sequence-structure log-likelihood is not reached in any of the last 3 measurements (default: 10.0)
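
One way to read the termination rule is sketched below. It is an interpretation, not the tool's code: should_terminate is a hypothetical function, and it assumes that progress is measured as the absolute change between consecutive log-likelihood measurements. The log-likelihood values are made up for illustration.

```python
def should_terminate(log_likelihoods, threshold=10.0, window=3):
    """Return True if none of the last `window` measurements changed the
    log-likelihood by at least `threshold` (assumed reading of the rule)."""
    if len(log_likelihoods) < window + 1:
        return False  # not enough measurements to judge progress yet
    recent = log_likelihoods[-(window + 1):]
    changes = [abs(later - earlier)
               for earlier, later in zip(recent, recent[1:])]
    return all(change < threshold for change in changes)

# Last three changes are 5, 3 and 1: all below the threshold of 10.
print(should_terminate([-500.0, -400.0, -395.0, -392.0, -391.0]))
# One of the last three changes is 20, so training would continue.
print(should_terminate([-500.0, -400.0, -380.0, -377.0, -376.0]))
```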

--termination_interval, -i produce output every <i> iterations (default: i=100)