preprocess_dataset

Pipeline for preparing a CLIP-Seq dataset in BED format. The pipeline consists of the following steps:
1 - Filter BED file
2 - Elongate BED file for later structure prediction
3 - Fetch genomic sequences for elongated BED file
4 - Produce FASTA file with genomic sequences in viewpoint format
5 - Secondary structure prediction with RNAshapes
6 - Secondary structure prediction with RNAstructures
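The idea behind step 4 can be sketched in a few lines of Python. The sketch assumes the common viewpoint convention in which the original binding site is written in uppercase and the flanking context added by elongation is written in lowercase; this page does not spell out the convention, and the function name `to_viewpoint` is hypothetical and not part of this script. It also assumes that both flanks have the full elongation length, which may not hold for sites near a chromosome end.

```python
def to_viewpoint(sequence: str, elongation: int) -> str:
    """Lowercase the flanks added by elongation, uppercase the binding site.

    Assumption: the sequence was fetched for an interval that was extended
    by exactly `elongation` nucleotides on each side.
    """
    if elongation < 0:
        raise ValueError("elongation must be non-negative")
    if elongation == 0:
        # No flanks were added, so the whole sequence is the binding site.
        return sequence.upper()
    if len(sequence) <= 2 * elongation:
        raise ValueError("sequence is not longer than the two flanks")
    left = sequence[:elongation].lower()
    core = sequence[elongation:-elongation].upper()
    right = sequence[-elongation:].lower()
    return left + core + right


# Toy sequence with a flank of 3 nucleotides on each side.
print(to_viewpoint("AACCGGTTAACC", 3))  # aacCGGTTAacc
```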

DEPENDENCIES: This script requires bedtools (shuffle, slop, getfasta), RNAshapes, and RNAstructures.

A working directory and a dataset name (e.g., the protein name) have to be given. The output files can be found in:
- <workingdir>/fasta/<dataset_name>/positive.fasta - genomic sequences in viewpoint format
- <workingdir>/shapes/<dataset_name>/positive.txt - secondary structures of genomic sequence (predicted by RNAshapes)
- <workingdir>/structures/<dataset_name>/positive.txt - secondary structures of genomic sequence (predicted by RNAstructures)
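After a run, the output locations listed above can be derived from the working directory and dataset name. The following is a minimal sketch; the helper name `expected_outputs` and the example values `work` and `PTBP1` are hypothetical.

```python
from pathlib import Path


def expected_outputs(working_dir: str, dataset_name: str) -> dict:
    """Return the documented output paths for the positive set."""
    base = Path(working_dir)
    return {
        "fasta": base / "fasta" / dataset_name / "positive.fasta",
        "shapes": base / "shapes" / dataset_name / "positive.txt",
        "structures": base / "structures" / dataset_name / "positive.txt",
    }


# Report which of the documented files are present after a run.
for kind, path in expected_outputs("work", "PTBP1").items():
    status = "found" if path.exists() else "missing"
    print(f"{kind}: {path} ({status})")
```

If a structure prediction step was disabled, the corresponding file would presumably be reported as missing.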

For classification, a negative set of binding sites with shuffled coordinates can be generated with the --generate_negative option. For this option, gene boundaries are required and need to be given as --genome_genes. They can be downloaded, e.g., from the UCSC table browser (http://genome.ucsc.edu/cgi-bin/hgTables). Choose the most recent GENCODE track (currently GENCODE Gene V24lift37->Basic (for hg19) and All GENCODE V24->Basic (for hg38)) and ‘BED’ as output format.
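To illustrate what shuffling coordinates within gene boundaries can look like, the sketch below assembles a `bedtools shuffle` command that restricts the shuffled sites to the intervals of a gene BED file via `-incl`. The exact flags this script passes to bedtools are not documented here, so this is an assumption rather than a description of its behaviour. The helper name `shuffle_command` and all file names are hypothetical, and the command is only printed, not executed.

```python
import shlex


def shuffle_command(sites_bed, genome_sizes, genes_bed, seed=None):
    """Build a bedtools shuffle call that places sites inside gene intervals."""
    cmd = [
        "bedtools", "shuffle",
        "-i", sites_bed,       # binding sites to shuffle
        "-g", genome_sizes,    # chromosome sizes file
        "-incl", genes_bed,    # only place shuffled sites within these intervals
    ]
    if seed is not None:
        # A fixed seed makes the shuffled set reproducible.
        cmd += ["-seed", str(seed)]
    return cmd


print(shlex.join(shuffle_command("positive.bed", "hg19.chrom.sizes",
                                 "gencode_basic.bed", seed=1)))
```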

usage: preprocess_dataset [-h] [--disable_filtering] [--disable_RNAshapes]
                          [--disable_RNAstructure] [--generate_negative]
                          [--min_score MIN_SCORE] [--min_length MIN_LENGTH]
                          [--max_length MAX_LENGTH] [--elongation ELONGATION]
                          [--genome_genes GENOME_GENES] [--skip_check]
                          working_dir dataset_name input genome genome_sizes

Positional Arguments

`working_dir`	working/output directory
`dataset_name`	dataset name
`input`	input file in .bed format
`genome`	reference genome in FASTA format
`genome_sizes`	chromosome sizes of reference genome (e.g. from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes)
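The positional arguments are passed in the order shown in the usage line. A minimal sketch of assembling such a call from Python follows; the helper name `build_command` and all file names are hypothetical, and the command is only printed here. Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, provided the script and its dependencies are installed.

```python
import shlex


def build_command(working_dir, dataset_name, bed, genome, genome_sizes,
                  genome_genes=None):
    """Assemble a preprocess_dataset call following the usage line."""
    cmd = ["preprocess_dataset"]
    if genome_genes is not None:
        # Negative set generation requires gene boundaries.
        cmd += ["--generate_negative", "--genome_genes", genome_genes]
    cmd += [working_dir, dataset_name, bed, genome, genome_sizes]
    return cmd


cmd = build_command("work", "PTBP1", "sites.bed", "hg19.fa",
                    "hg19.chrom.sizes", genome_genes="gencode_basic.bed")
print(shlex.join(cmd))
```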

Named Arguments

`--disable_filtering, -f`
	skip the filtering step. Default: False
`--disable_RNAshapes`
	skip secondary structure prediction with RNAshapes. Default: False
`--disable_RNAstructure`
	skip secondary structure prediction with RNAstructures. Default: False
`--generate_negative, -n`
	generate a negative set for classification. Default: False
`--min_score`	filtering: minimum score for binding site. Default: 0.0
`--min_length`	filtering: minimum binding site length. Default: 8
`--max_length`	filtering: maximum binding site length. Default: 75
`--elongation`	elongation: span for up- and downstream elongation of binding sites. Default: 20
`--genome_genes`	negative set generation: gene boundaries
`--skip_check, -s`
	skip check for installed prerequisites. Default: False
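The filtering and elongation parameters can be illustrated with a short Python sketch. It assumes that all three thresholds are inclusive and that elongated coordinates are clipped to the chromosome, as `bedtools slop` does; whether the script treats the bounds exactly this way is not stated on this page. The function names `passes_filter` and `elongate` are hypothetical.

```python
def passes_filter(start, end, score,
                  min_score=0.0, min_length=8, max_length=75):
    """Keep a binding site if its score and length are within the thresholds.

    BED intervals are 0-based and half-open, so the length is end - start.
    Assumption: all bounds are inclusive.
    """
    length = end - start
    return score >= min_score and min_length <= length <= max_length


def elongate(start, end, chrom_size, elongation=20):
    """Extend a site up- and downstream, clipped to the chromosome bounds."""
    return max(0, start - elongation), min(chrom_size, end + elongation)


print(passes_filter(100, 130, 5.0))   # True: length 30, score 5.0
print(passes_filter(100, 105, 5.0))   # False: length 5 is below min_length
print(elongate(10, 40, 1000))         # (0, 60): clipped at the chromosome start
```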