preprocess_dataset
Pipeline for the preparation of a CLIP-Seq dataset in BED format. The pipeline consists of the following steps:
1 - Filter BED file
2 - Elongate BED file for later structure prediction
3 - Fetch genomic sequences for elongated BED file
4 - Produce FASTA file with genomic sequences in viewpoint format
5 - Secondary structure prediction with RNAshapes
6 - Secondary structure prediction with RNAstructures
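Steps 1 and 2 can be pictured with the following minimal Python sketch. It is not the script's own code: the `Site` tuple, the function names, the inclusive thresholds, and the clamping of elongated coordinates to the chromosome bounds (which is how bedtools slop behaves) are assumptions made for illustration. The default values are the ones documented for --min_score, --min_length, --max_length, and --elongation.

```python
from typing import Dict, List, NamedTuple


class Site(NamedTuple):
    """One BED6 record (hypothetical helper type for this sketch)."""
    chrom: str
    start: int    # 0-based, inclusive (BED convention)
    end: int      # 0-based, exclusive
    name: str
    score: float
    strand: str


def filter_sites(sites: List[Site], min_score: float = 0.0,
                 min_length: int = 8, max_length: int = 75) -> List[Site]:
    """Keep sites whose score and length pass the thresholds.

    Inclusive bounds are an assumption; the documentation does not say.
    """
    return [
        s for s in sites
        if s.score >= min_score and min_length <= s.end - s.start <= max_length
    ]


def elongate(site: Site, span: int, chrom_sizes: Dict[str, int]) -> Site:
    """Extend a site by `span` nt on both sides, clamped to the chromosome."""
    return site._replace(
        start=max(0, site.start - span),
        end=min(chrom_sizes[site.chrom], site.end + span),
    )


# Made-up example records, not real CLIP-Seq data.
sites = [
    Site("chr1", 100, 130, "a", 5.0, "+"),  # length 30: kept
    Site("chr1", 200, 205, "b", 9.0, "-"),  # length 5: shorter than min_length
    Site("chr1", 10, 40, "c", 2.0, "+"),    # close to the chromosome start
]
chrom_sizes = {"chr1": 1000}

kept = filter_sites(sites)
elongated = [elongate(s, 20, chrom_sizes) for s in kept]
```

In this example, site `c` cannot be extended by the full 20 nt upstream, so its start is clamped to 0. This is why the chromosome sizes file is needed in addition to the genome itself.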
DEPENDENCIES: This script requires bedtools (shuffle, slop, getfasta), RNAshapes, and RNAstructures.
A working directory and a dataset name (e.g., the protein name) have to be given. The output files can be found in:
- <workingdir>/fasta/<dataset_name>/positive.fasta - genomic sequences in viewpoint format
- <workingdir>/shapes/<dataset_name>/positive.txt - secondary structures of genomic sequence (predicted by RNAshapes)
- <workingdir>/structures/<dataset_name>/positive.txt - secondary structures of genomic sequence (predicted by RNAstructures)
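The viewpoint format is not spelled out on this page. The sketch below assumes the convention known from GraphProt, in which the binding site itself (the viewpoint) is written in uppercase and the elongated context in lowercase. The function name `to_viewpoint` and the symmetric flank length are hypothetical and serve only to illustrate that convention.

```python
def to_viewpoint(sequence: str, flank: int) -> str:
    """Lowercase `flank` nt on each side and uppercase the middle part.

    Assumes the same flank length on both sides, which does not hold
    for sites that were truncated at a chromosome boundary.
    """
    if flank < 0 or 2 * flank > len(sequence):
        raise ValueError("flank does not fit the sequence")
    core_end = len(sequence) - flank
    left = sequence[:flank].lower()
    core = sequence[flank:core_end].upper()
    right = sequence[core_end:].lower()
    return left + core + right


# Made-up sequence: 3 nt of context on each side of a 6 nt binding site.
example = to_viewpoint("ACGTACGTACGT", 3)
```

Here `example` is `acgTACGTAcgt`: the six central nucleotides form the viewpoint and the outer three on each side are context that is only used for structure prediction.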
For classification, a negative set of binding sites with shuffled coordinates can be generated with the --generate_negative option. For this option, gene boundaries are required and need to be given as --genome_genes. They can be downloaded e.g. from the UCSC table browser (http://genome.ucsc.edu/cgi-bin/hgTables). Choose the most recent GENCODE track (currently GENCODE Gene V24lift37->Basic (for hg19) and All GENCODE V24->Basic (for hg38)) and ‘BED’ as output format.
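Since bedtools shuffle is listed as a dependency, shuffling within gene boundaries could look roughly like the sketch below. The exact flags the script passes are not documented here, so this is an assumption: `-i`, `-g`, `-incl`, and `-seed` are real bedtools shuffle options, but the helper `shuffle_command` and all file names are hypothetical.

```python
from typing import List, Optional


def shuffle_command(sites_bed: str, genome_sizes: str, genes_bed: str,
                    seed: Optional[int] = None) -> List[str]:
    """Build a bedtools shuffle call that places sites inside gene regions."""
    cmd = [
        "bedtools", "shuffle",
        "-i", sites_bed,       # binding sites to relocate
        "-g", genome_sizes,    # chromosome sizes file
        "-incl", genes_bed,    # only place shuffled sites within these regions
    ]
    if seed is not None:
        cmd += ["-seed", str(seed)]  # makes the shuffling reproducible
    return cmd


# Hypothetical file names, for illustration only.
cmd = shuffle_command("positive.bed", "hg19.chrom.sizes", "gencode_genes.bed", seed=42)
```

The resulting list can be passed to `subprocess.run(cmd, check=True, stdout=...)` to write the shuffled coordinates to a file. Restricting the shuffle to gene regions keeps the negative sites in transcribed sequence, which is the reason the gene boundaries are required.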
usage: preprocess_dataset [-h] [--disable_filtering] [--disable_RNAshapes]
[--disable_RNAstructure] [--generate_negative]
[--min_score MIN_SCORE] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--elongation ELONGATION]
[--genome_genes GENOME_GENES] [--skip_check]
working_dir dataset_name input genome genome_sizes
Positional Arguments
working_dir | working/output directory |
dataset_name | dataset name |
input | input file in .bed format |
genome | reference genome in FASTA format |
genome_sizes | chromosome sizes of reference genome (e.g. from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes) |
Named Arguments
--disable_filtering, -f | skip the filtering step. Default: False |
--disable_RNAshapes | skip secondary structure prediction with RNAshapes. Default: False |
--disable_RNAstructure | skip secondary structure prediction with RNAstructures. Default: False |
--generate_negative, -n | generate a negative set for classification. Default: False |
--min_score | filtering: minimum score for binding site. Default: 0.0 |
--min_length | filtering: minimum binding site length. Default: 8 |
--max_length | filtering: maximum binding site length. Default: 75 |
--elongation | elongation: span for up- and downstream elongation of binding sites. Default: 20 |
--genome_genes | negative set generation: gene boundaries |
--skip_check, -s | skip check for installed prerequisites. Default: False |
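As an illustration of how the arguments fit together, the following Python sketch assembles one possible invocation, with the options placed before the positional arguments as in the usage line. The dataset name, the file names, and the chosen option values are placeholders, not recommendations.

```python
import shlex

# All paths and the dataset name below are hypothetical placeholders.
args = [
    "preprocess_dataset",
    "--min_score", "1.0",           # stricter than the default of 0.0
    "--elongation", "20",           # the documented default, given explicitly
    "--generate_negative",          # also build the shuffled negative set
    "--genome_genes", "gencode_genes.bed",
    "work",                         # working_dir
    "PTBP1",                        # dataset_name
    "PTBP1_sites.bed",              # input
    "hg19.fa",                      # genome
    "hg19.chrom.sizes",             # genome_sizes
]

# Render the argument list as a single shell-quoted command line.
command_line = shlex.join(args)
```

Passing `args` to `subprocess.run(args, check=True)` would start the pipeline, provided that bedtools, RNAshapes, and RNAstructures are installed. `shlex.join` requires Python 3.8 or later.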