preprocess_dataset

Pipeline for the preparation of a CLIP-Seq dataset in BED format. The pipeline consists of the following steps:

1 - Filter BED file
2 - Elongate BED file for later structure prediction
3 - Fetch genomic sequences for elongated BED file
4 - Produce FASTA file with genomic sequences in viewpoint format
5 - Secondary structure prediction with RNAshapes
6 - Secondary structure prediction with RNAstructures
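Step 4 refers to the viewpoint format without defining it. The following is a minimal sketch, assuming the GraphProt-style convention in which the original binding site (the viewpoint) is written in uppercase and the elongated flanks in lowercase. Both this convention and the function name `to_viewpoint` are assumptions for illustration, not taken from this script.

```python
def to_viewpoint(sequence: str, elongation: int) -> str:
    """Write the flanks in lowercase and the original binding site in uppercase.

    Assumes `sequence` was elongated by `elongation` nucleotides on both
    sides and was not clipped at a chromosome end (an assumption).
    """
    if elongation < 0:
        raise ValueError("elongation must be non-negative")
    if len(sequence) < 2 * elongation:
        raise ValueError("sequence is shorter than the two flanks combined")
    site_end = len(sequence) - elongation
    left = sequence[:elongation].lower()
    site = sequence[elongation:site_end].upper()
    right = sequence[site_end:].lower()
    return left + site + right


print(to_viewpoint("ACGTACGT", 2))
```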

DEPENDENCIES: This script requires bedtools (shuffle, slop, getfasta), RNAshapes, and RNAstructures.

A working directory and a dataset name (e.g., the protein name) have to be given. The output files can be found in:

- <workingdir>/fasta/<dataset_name>/positive.fasta - genomic sequences in viewpoint format
- <workingdir>/shapes/<dataset_name>/positive.txt - secondary structures of genomic sequence (predicted by RNAshapes)
- <workingdir>/structures/<dataset_name>/positive.txt - secondary structures of genomic sequence (predicted by RNAstructures)
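The output layout above can be expressed as a small helper for downstream scripts that need to locate the files. This is a sketch only; `output_paths` is a hypothetical name, and the directory and dataset names in the usage line are placeholders.

```python
from pathlib import Path


def output_paths(working_dir: str, dataset_name: str) -> dict:
    """Return the documented output locations for one dataset."""
    base = Path(working_dir)
    return {
        "fasta": base / "fasta" / dataset_name / "positive.fasta",
        "shapes": base / "shapes" / dataset_name / "positive.txt",
        "structures": base / "structures" / dataset_name / "positive.txt",
    }


# Placeholder working directory and dataset name.
for kind, path in output_paths("work", "PUM2").items():
    print(kind, path.as_posix())
```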

For classification, a negative set of binding sites with shuffled coordinates can be generated with the --generate_negative option. For this option, gene boundaries are required and need to be given as --genome_genes. They can be downloaded e.g. from the UCSC table browser (http://genome.ucsc.edu/cgi-bin/hgTables). Choose the most recent GENCODE track (currently GENCODE Gene V24lift37->Basic (for hg19) and All GENCODE V24->Basic (for hg38)) and 'BED' as output format.
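How the script calls bedtools internally is not documented here. As a rough sketch, shuffling binding sites within gene boundaries could be done with `bedtools shuffle` and its `-i`, `-g`, and `-incl` options. The helper name `shuffle_command` and the file names are hypothetical, and the exact options used by the script are an assumption.

```python
def shuffle_command(positive_bed: str, genome_sizes: str, genome_genes: str) -> list:
    """Build a bedtools call that relocates sites to random positions inside genes.

    -i    intervals to shuffle (the positive binding sites)
    -g    chromosome sizes file
    -incl restrict the shuffled positions to these intervals (gene boundaries)
    """
    return [
        "bedtools", "shuffle",
        "-i", positive_bed,
        "-g", genome_sizes,
        "-incl", genome_genes,
    ]


cmd = shuffle_command("positive.bed", "hg19.chrom.sizes", "gencode_genes.bed")
print(" ".join(cmd))
# To execute it, pass `cmd` to subprocess.run(cmd, check=True, stdout=outfile)
# with `outfile` opened for writing; bedtools writes the shuffled BED to stdout.
```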

usage: preprocess_dataset [-h] [--disable_filtering] [--disable_RNAshapes]
                          [--disable_RNAstructure] [--generate_negative]
                          [--min_score MIN_SCORE] [--min_length MIN_LENGTH]
                          [--max_length MAX_LENGTH] [--elongation ELONGATION]
                          [--genome_genes GENOME_GENES] [--skip_check]
                          working_dir dataset_name input genome genome_sizes
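As an illustration of the usage line above, the following sketch assembles a command line from Python. The helper `build_command` and all file names are hypothetical; only the flags and the order of the positional arguments come from the usage text.

```python
import shlex


def build_command(working_dir, dataset_name, bed, genome, genome_sizes,
                  generate_negative=False, genome_genes=None, elongation=None):
    """Assemble a preprocess_dataset call as an argument list."""
    cmd = ["preprocess_dataset"]
    if generate_negative:
        # The negative set needs gene boundaries, passed via --genome_genes.
        if genome_genes is None:
            raise ValueError("--generate_negative requires --genome_genes")
        cmd += ["--generate_negative", "--genome_genes", genome_genes]
    if elongation is not None:
        cmd += ["--elongation", str(elongation)]
    # Positional arguments, in the order given by the usage line.
    cmd += [working_dir, dataset_name, bed, genome, genome_sizes]
    return cmd


cmd = build_command("work", "PUM2", "PUM2.bed", "hg19.fa", "hg19.chrom.sizes",
                    generate_negative=True, genome_genes="gencode_genes.bed")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the prerequisites are installed.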

Positional Arguments

working_dir working/output directory
dataset_name dataset name
input input file in .bed format
genome reference genome in FASTA format
genome_sizes chromosome sizes of reference genome (e.g. from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes)

Named Arguments

--disable_filtering, -f

skip the filtering step

Default: False

--disable_RNAshapes

skip secondary structure prediction with RNAshapes

Default: False

--disable_RNAstructure

skip secondary structure prediction with RNAstructures

Default: False

--generate_negative, -n

generate a negative set for classification

Default: False

--min_score

filtering: minimum score for binding site

Default: 0.0

--min_length

filtering: minimum binding site length

Default: 8

--max_length

filtering: maximum binding site length

Default: 75
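The three filtering options can be pictured with the sketch below. It assumes a tab-separated BED line with the score in the fifth column, a site length of end minus start, and inclusive bounds; these details and the name `keep_site` are assumptions, since the script's exact comparison rules are not documented here.

```python
def keep_site(bed_line: str, min_score: float = 0.0,
              min_length: int = 8, max_length: int = 75) -> bool:
    """Decide whether one BED record passes the score and length filters."""
    fields = bed_line.rstrip("\n").split("\t")
    start, end = int(fields[1]), int(fields[2])
    score = float(fields[4])      # BED column 5 holds the score
    length = end - start          # BED intervals are half-open
    return score >= min_score and min_length <= length <= max_length


records = [
    "chr1\t100\t120\tsite1\t5.0\t+",   # length 20, kept
    "chr1\t100\t105\tsite2\t5.0\t+",   # length 5, too short
    "chr1\t100\t180\tsite3\t5.0\t-",   # length 80, too long
]
kept = [r for r in records if keep_site(r)]
print(len(kept))
```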

--elongation

elongation: span for up- and downstream elongation of binding sites

Default: 20
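The effect of the elongation span on a single interval can be sketched as follows. This mirrors the behaviour of `bedtools slop -b`, which extends both ends and clips at the chromosome boundaries. That the script uses exactly this call is an assumption, and `elongate` is a hypothetical helper name.

```python
def elongate(start: int, end: int, chrom_size: int, elongation: int = 20):
    """Extend an interval on both sides, clipped to [0, chrom_size]."""
    new_start = max(0, start - elongation)
    new_end = min(chrom_size, end + elongation)
    return new_start, new_end


print(elongate(100, 120, chrom_size=1000))   # site in the middle of a chromosome
print(elongate(5, 30, chrom_size=1000))      # clipped at the chromosome start
print(elongate(980, 995, chrom_size=1000))   # clipped at the chromosome end
```

Sites near a chromosome end come out shorter than the full elongated length, which matters for any later step that assumes equal flanks on both sides.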

--genome_genes

negative set generation: gene boundaries in BED format (required with --generate_negative)

--skip_check, -s

skip check for installed prerequisites

Default: False