Tutorial
========

This tutorial introduces ssHMM and explains how to use it.

0. Installation
-------------------------------------

To begin this tutorial, first install ssHMM as described in the :ref:`installation` section. To make sure that the installation was successful, check whether you can execute the ssHMM scripts:

.. code-block:: bash

    preprocess_dataset -h
    train_seqstructhmm -h
    batch_seqstructhmm -h

If all goes well, you should see the help messages of the ssHMM scripts.

1. Download a CLIP-Seq dataset
------------------------------

Now we need a CLIP-Seq dataset to work with. On GitHub we provide a repository of 25 CLIP-Seq and 24 synthetic datasets (https://github.molgen.mpg.de/heller/ssHMM_data). For this tutorial, we use the PUM2 CLIP-Seq dataset and download the pre-processed files for sequence and structure:

.. code-block:: bash

    cd /home/myuser
    mkdir clipseq
    cd clipseq
    wget https://github.molgen.mpg.de/raw/heller/ssHMM_data/master/clip-seq/fasta/PUM2/positive.fasta
    wget https://github.molgen.mpg.de/raw/heller/ssHMM_data/master/clip-seq/shapes/PUM2/positive.txt

2. Start Docker image (if installed with Docker)
------------------------------------------------

If ssHMM is installed as a Docker image, you first have to start the image:

.. code-block:: bash

    docker run -t -i -v /home/myuser/:/home/myuser/ hellerd/sshmm

This boots the ssHMM image and opens a command line to control the running container. The ``-v`` option makes the home directory (containing the ``clipseq`` directory) available from within the container. Continue with the tutorial by running all commands in the container.

3. Inspect the dataset
----------------------

Let's have a look at the two files we downloaded:

.. code-block:: bash

    head /home/myuser/clipseq/positive.fasta

.. code-block:: bash

    >chr6:89794035-89794147(+)
    aaaaaattacatacaaacagCTTGTATTATATTTTATATTTTGTAAATACTGTATACCATGTATTATGTGTATATTGTTCATACTTGAGAGGtatattatagttttgttatg
    >chr10:102767488-102767578(+)
    cacccaggtttatggcctcgTTTTCACTTGTATATTTTTCACACTGTAAATTTCTTGTACAAACCCAAAGaaaaaattaaaaaaaatttt
    >chr2:99234790-99234904(+)
    taactgtgtcaacagtattgTGAAGTGATCATTTCTTGTAAAACTTGTAAATAAACTATCATCTTTGTAGATATCTTAAAGGTGTAAAGTTTGCaaatttgaagaaatatatat
    >chr12:49521563-49521638(-)
    gtgatcatgtcttttccatgTGTACCTGTAATATTTTTCCATCATATCTCAAAGTaaagtcattaacatcaaaag

The FASTA file contains the nucleotide sequences of the PUM2 binding sites as determined by a CLIP-Seq experiment. Each binding site occupies two lines of the FASTA file. The first line (beginning with ``>``) specifies the genomic location of the site, while the second line contains the genomic sequence.

.. code-block:: bash

    head /home/myuser/clipseq/positive.txt

.. code-block:: bash

    >chr6:89794035-89794147(+)
    EEEEEEEEESSSSSSSSIISSISSSSISSSSSSSSIIISSIISSSIIISSISSSSSSSSSSHHHSSSSISSSSSSISSIIISSSIISSSSSSSSSSISSSSSSSSSSSISSS 0.008824
    >chr10:102767488-102767578(+)
    EEEEESSSSHHHHHSSSSMMMMMMMMMSSSSSSSHHHHHHHHHHHHHHHHHHHHHSSSSSSSEEEEEEEEEEEEEEEEEEEEEEEEEEEE 0.0312072
    EEEEESSSSHHHHHSSSSMMMMMMMMMSSSSSSSHHHHHHHHHHHHHHHHHHHHHSSSSSSSMMMMMMMMMMSSSSSSHHHHHHSSSSSS 0.0163077
    >chr2:99234790-99234904(+)
    EESSSSSHHHHSSSSSMMMMMMMMMMMMMMSSSSSSSIIISSSISSSSSSSSIIIIIIISSSSSSSSISSHHHHSSSSSSSSSSIIIISSSSSSSSISSSISSSSSSSEEEEEE 0.0677326
    EESSSSSHHHHSSSSSMMMSSSSHHHHHSSSSSSSSSIIISSSISSSSSSSSIIIIIIISSSSSSSSISSHHHHSSSSSSSSSSIIIISSSSSSSSISSSISSSSSEEEEEEEE 0.0031042
    >chr12:49521563-49521638(-)
    SSSSSIISSISSSSIIISSSSSHHHHHHHHHHHHHHHHHHHHSSSSSIIISSSSSSIISSSSSEEEEEEEEEEEE 0.098404

The structure file contains the predicted secondary structures of the binding sites. The predictions were made with the ``RNAshapes`` tool. Again, the lines starting with ``>`` specify the genomic location of a binding site. The subsequent lines contain the predicted structural context of each nucleotide in the FASTA file. Note that these structure sequences have the same length as the nucleotide sequences from the FASTA file we looked at before.

.. _train:

4. Training ssHMM on the dataset
--------------------------------

Now we can train ssHMM on the CLIP-Seq dataset we downloaded:

.. code-block:: bash

    cd /home/myuser
    mkdir results
    train_seqstructhmm clipseq/positive.fasta clipseq/positive.txt -o results/

This creates a new directory ``results`` and starts the training of ssHMM using the ``train_seqstructhmm`` script. ``train_seqstructhmm`` has two mandatory arguments: the sequence file and the structure file. We pass the files that we downloaded and additionally tell ssHMM to write its output into the new ``results`` directory. For a description of all arguments of ``train_seqstructhmm``, see its :ref:`reference`.

While the ``train_seqstructhmm`` script runs, it writes information to the standard output. For more information on the output, refer to the :ref:`output` section. When the script finishes, it prints messages to standard output that look similar to:

.. code-block:: bash

    2017-02-23 11:45:33,381 - main_logger - INFO - Terminate model after 7000 iterations.
    2017-02-23 11:45:33,381 - main_logger - INFO - Completed training. Write sequence logos..
    2017-02-23 11:45:35,675 - main_logger - INFO - Completed writing sequence logos. Print model graph..
    2017-02-23 11:45:35,987 - main_logger - INFO - Printed model graph: ./job_170223_114151/final_graph.png. Write model file..
    2017-02-23 11:45:35,992 - main_logger - INFO - Wrote model file: ./job_170223_114151/final_model.xml
    2017-02-23 11:45:35,992 - main_logger - INFO - Finished ssHMM successfully.

These lines tell you how many iterations the training took (7000) and where you can find the graph and the XML file of the trained model.

5. Inspect the trained model
----------------------------

.. hint::

    If you ran ssHMM in the Docker container, it is now time to exit from the container. As the ``results`` directory is a subdirectory of ``/home/myuser``, the training results can also be found on your host machine. Exiting from the Docker container is easy:

    .. code-block:: bash

        exit

We can now have a look at the model graph. See :ref:`trainingoutput` for an explanation of what the model graph shows.

.. image:: images/model_graph.png

Congratulations, you have finished our tutorial! Check out the :ref:`reference` section for more information about the ssHMM scripts.
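
As an optional extra, the note from step 3 that structure sequences and nucleotide sequences have the same length can be checked with a few lines of Python. The following is only a minimal sketch and not part of ssHMM: the function names ``parse_fasta``, ``parse_shapes`` and ``find_length_mismatches`` are hypothetical, and the code assumes the file layout shown in step 3 (a ``>`` header line followed either by a sequence or by one or more structure lines that each end with a number).

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Append, so sequences wrapped over several lines also work.
            records[header] += line
    return records


def parse_shapes(text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Map each header to its list of (structure, number) pairs.

    Assumption: every non-header line is '<structure> <number>',
    as in the excerpt of positive.txt shown in step 3.
    """
    records: Dict[str, List[Tuple[str, float]]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            structure, value = line.split()
            records[header].append((structure, float(value)))
    return records


def find_length_mismatches(
    sequences: Dict[str, str],
    shapes: Dict[str, List[Tuple[str, float]]],
) -> List[str]:
    """Return headers that lack structures or whose lengths differ."""
    problems: List[str] = []
    for header, sequence in sequences.items():
        structures = shapes.get(header)
        if not structures:
            problems.append(header)
            continue
        if any(len(s) != len(sequence) for s, _ in structures):
            problems.append(header)
    return problems


if __name__ == "__main__":
    # Made-up toy data for illustration, not taken from the PUM2 dataset.
    fasta_text = ">site1\nACGU\n>site2\nACG\n"
    shapes_text = ">site1\nSSHH 0.5\nEEHH 0.1\n>site2\nSSSS 0.2\n"
    print(find_length_mismatches(parse_fasta(fasta_text),
                                 parse_shapes(shapes_text)))
```

To apply the sketch to the downloaded dataset, read ``positive.fasta`` and ``positive.txt`` into strings and pass them to the two parsers. An empty result list means that every structure line has the same length as its nucleotide sequence.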