batch_seqstructhmm

Trains multiple Hidden Markov Models for the sequence-structure binding preferences of a given set of RNA-binding proteins. The models are trained on sequences and structures in FASTA format located in a given data directory. During training, statistics about the models are printed to stdout. In every iteration, the current model and a visualization of the model are stored in the batch directory. Each training process terminates when no significant progress has been made for three iterations.

usage: batch_seqstructhmm [-h] [--cores CORES]
                          [--structure_type STRUCTURE_TYPE]
                          [--motif_length MOTIF_LENGTH] [--baum_welch]
                          [--flexibility FLEXIBILITY]
                          [--block_size BLOCK_SIZE] [--threshold THRESHOLD]
                          [--termination_interval TERMINATION_INTERVAL]
                          data_directory proteins batch_directory
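As an illustration of the usage line above, the following minimal sketch assembles an argument list for the command without running it. The helper name `build_command`, the directory names `data` and `batch_out`, and the protein names `PROTEIN_A` and `PROTEIN_B` are placeholders invented for this example; only the command name and the flags come from the usage text.

```python
import shlex


def build_command(data_directory, proteins, batch_directory,
                  cores=None, motif_length=None):
    """Assemble an argv list for batch_seqstructhmm (hypothetical helper).

    Nothing is executed here; the list can be handed to subprocess.run.
    """
    cmd = ["batch_seqstructhmm"]
    if cores is not None:
        cmd += ["--cores", str(cores)]
    if motif_length is not None:
        cmd += ["--motif_length", str(motif_length)]
    # The proteins are passed as ONE argument: names separated by whitespace.
    cmd += [data_directory, " ".join(proteins), batch_directory]
    return cmd


# Placeholder paths and protein names, for illustration only.
cmd = build_command("data", ["PROTEIN_A", "PROTEIN_B"], "batch_out", cores=4)
print(shlex.join(cmd))  # shell-quoted form, e.g. for copying into a terminal
```

Keeping the command as a list avoids quoting problems: the whitespace-separated protein names stay a single argument, which is what the quotation marks achieve on the command line.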

Positional Arguments

`data_directory`	data directory; must contain the sequence files under fasta/<protein>/positive.fasta and structure files under <structure_type>/<protein>/positive.txt
`proteins`	list of RNA-binding proteins to analyze (surrounded by quotation marks, separated by whitespace)
`batch_directory`
	directory for batch output
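The directory layout required for `data_directory` can be checked before starting a long training run. The sketch below is not part of the tool; `expected_inputs` and `missing_inputs` are hypothetical helper names, and the only facts taken from the description above are the two path patterns `fasta/<protein>/positive.fasta` and `<structure_type>/<protein>/positive.txt`.

```python
from pathlib import Path


def expected_inputs(data_directory, proteins, structure_type="shapes"):
    """List the input files the documented layout requires for each protein."""
    root = Path(data_directory)
    paths = []
    for protein in proteins:
        # Sequence file: fasta/<protein>/positive.fasta
        paths.append(root / "fasta" / protein / "positive.fasta")
        # Structure file: <structure_type>/<protein>/positive.txt
        paths.append(root / structure_type / protein / "positive.txt")
    return paths


def missing_inputs(data_directory, proteins, structure_type="shapes"):
    """Return the expected input files that do not exist on disk."""
    return [path
            for path in expected_inputs(data_directory, proteins, structure_type)
            if not path.is_file()]
```

Calling `missing_inputs("data", ["PROTEIN_A"])` with placeholder names returns an empty list when both files for that protein are present.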

Named Arguments

`--cores, -c`	number of cores to use (if not given, all cores are used)
`--structure_type, -s`
	structure type to use; must match the location of the structure files (see the data_directory argument above). Default: “shapes”
`--motif_length, -n`
	length of the motifs to be found. Default: 6
`--baum_welch, -b`
	whether to initialize the models with a Baum-Welch optimized sequence motif. Default: True
`--flexibility, -f`
	greediness of the Gibbs sampler: model parameters are sampled from among the top f configurations; set f to 0 to include all possible configurations. Default: 10
`--block_size`	number of sequences to be held out in each iteration. Default: 1
`--threshold, -t`
	the iterative algorithm terminates if this reduction in sequence-structure log-likelihood is not reached in any of the last 3 measurements. Default: 10.0
`--termination_interval, -i`
	produce output every <i> iterations. Default: 100
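The stopping rule described for `--threshold` can be sketched as follows. This is an interpretation, not the tool's implementation: it assumes that progress is the increase in log-likelihood between consecutive measurements, and the function name `should_terminate` and its `window` parameter are invented for this example.

```python
def should_terminate(loglikelihoods, threshold=10.0, window=3):
    """Return True if none of the last `window` improvements reached `threshold`.

    `loglikelihoods` holds one value per measurement, oldest first.
    Assumption: progress is the increase between consecutive measurements.
    """
    if len(loglikelihoods) < window + 1:
        # Not enough measurements yet to form `window` improvements.
        return False
    recent = loglikelihoods[-(window + 1):]
    improvements = [later - earlier
                    for earlier, later in zip(recent, recent[1:])]
    return all(delta < threshold for delta in improvements)
```

Under this reading, a single measurement that improves by at least the threshold is enough to keep training going for the current window.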