High-Performance Computing
High-Performance Computing (HPC) provides the storage and computational power needed to support analysis of genomic, transcriptomic, and proteomic data.
Christened Zuputo (meaning ‘The Brain’), the WACCBIP HPC Core is a Dell EMC High Performance Computing Cluster powered by Intel processors, with 6 compute nodes comprising 216 CPU cores and 12 NVIDIA Tesla P100 GPU cards. The cluster uses Mellanox EDR InfiniBand interconnect technology and is managed by the ROCKS cluster distribution software.
About Zuputo
| System Name | Zuputo |
| --- | --- |
| Nodes | 6 |
| Processor | 2x Intel® Xeon® (Broadwell) @ 2.4 GHz |
| CPU Cores | 216 |
| RAM | 1536 GB |
| GPUs | 12 x NVIDIA Tesla P100 16 GB |
| Interconnect | Mellanox EDR InfiniBand |
| Storage | 274 TB Lustre |
| Cluster Manager | ROCKS cluster distribution software |
| Launch Date | June 22, 2018 |
Access to the HPC facility involves two levels of authorisation:
- Authorisation for access by an organisation
- Authorisation for access by individuals within that organisation
For WACCBIP to grant access to the HPC facility, please follow the instructions below:
- Read the WACCBIP HPC Terms and Conditions of Use
- Download the application form and fill in all details.
- Once completed, sign and email the application form to hpc@ug.edu.gh
Once received, your application will be reviewed and a decision made to grant or deny access as appropriate. You will be notified of the decision.
We look forward to working with you.
The use of High-Performance Computing (HPC) resources is subject to the IT Policies of University of Ghana Computing Systems (UGCS). In addition, the following policies shall also govern the use of the HPC facilities to avoid misuse of computing resources. Applicants are required to agree to follow these guidelines before they are given a user account that will grant them access to the system. Legal or administrative action(s) shall be taken against any user who violates the rules.
HPC Account Policy
- The HPC infrastructure is meant to be accessible only to registered users who have agreed to the stipulated terms and conditions.
- All users shall be responsible for protecting their accounts from unauthorized access.
- Users are not allowed to share their credentials with others under any condition, including via .rhosts files.
- Users are prohibited from using the credentials of others to access the HPC system.
- The University of Ghana reserves the right to:
  - Create an HPC account
  - Modify an account
  - Deactivate, close or suspend an HPC account if it is determined to have been used for purposes other than those allowed
  - Accept or reject any request for an HPC account
HPC Resource Usage Policy
- HPC resources must be used for academic and research purposes only.
- The HPC system should not be used for activities that lead to personal or private (commercial) benefit.
- Use of the HPC system for activities that violate the laws of Ghana is prohibited.
- Resources must not be used to support illegal or malicious activities, including but not limited to: (a) production of biological, nuclear or chemical weapons, or any activity capable of producing any other type of weapon; (b) acts of terrorism.
- Users are only permitted to access the login node. This should be done via a secure communication channel such as Secure Shell (SSH). Access to other nodes, such as the compute nodes, master nodes, storage nodes and management nodes, is prohibited.
- All job submission shall occur from the login node, which is intended for light interactive work only. Heavy computational work must be submitted to the compute nodes via the job scheduler (OpenPBS). Users shall not under any condition access the compute nodes directly.
- The use of HPC resources to access, store or manipulate unauthorized data or programs, whether remotely or on the HPC system, is prohibited.
- Users shall not under any circumstance access, store, upload, download, transmit or create sexually explicit or pornographic material on/from/to the HPC system.
- This account policy is subject to renewal and amendment.
Maintenance Policy
- UGCS shall inform users of HPC system maintenance via email messages. However, UGCS reserves the right to shut down the HPC system without prior notice in case of maintenance or emergency.
Security Policy
- Users are obliged to report any computer security flaws or weaknesses, incidents of misuse or violation of the HPC user account policies to the System Administrator.
- Users shall not under any circumstance perform any activity that will lead to circumvention of security or administrative measures and policies.
- Users shall not purposely make copies of system or configuration files (such as the password file or MOM files) for any unauthorized activity.
- Downloading or executing programs that reveal security flaws of the HPC system is prohibited.
Monitoring and Privacy
- The University of Ghana reserves the right to perform the following activities without a user’s prior notice, knowledge or consent:
  - Monitor all accounts, activities, resources, data and files of the HPC system and networks at any time
  - Retain copies of data, messages or other files stored, accessed, transferred or modified in, or transferred from, the HPC system
  - Share all information gathered with government or law-enforcement agencies
- Users have no explicit or implicit expectation of privacy while using their user account.
Data Usage Policy
- Modification of data on the HPC system without authorization is prohibited.
- Users shall not upload or store data which violates any copyright agreements/laws.
- University of Ghana reserves the right to remove or stop the transfer of any data at any time after a user account has expired, been deleted or no longer has a research association with the University.
- Users are responsible for keeping their data intact.
Data Retention Policy
- UG reserves the right to make copies of any data for backup purposes, to retain such data indefinitely, to stop backing up data, or to delete data.
User Data / Account
- HPC user accounts will be deactivated if they are inactive (no logins) for 6 months. Files associated with the deactivated account (home/username) will be retained for 3 months after deactivation. During these 3 months, users can request reactivation of their account.
- Files in temp_dir are deleted after 28 days (from the date they were last accessed).
- User accounts that are associated with projects will be deactivated once the project expires. However, the data will be retained for a maximum period of 12 months after which UG reserves the right to reclaim storage space by permanently deleting the data.
- Users have the option to arrange for a permanent data storage by contacting the System Administrator.
User Abuse Policy
- An abusive user is defined by UGCS as a user who makes it difficult for others to use the system by: (a) circumventing controls such as the job scheduling policy; (b) making shared files accessible to others; (c) manipulating the system to their benefit.
If a user is found to be abusing the system the following sanctions shall be applied:
First Incident
- A notification to desist will be sent to the user by e-mail, phone or text message.
- A user’s job may be terminated, if necessary, to return the system to normal.
- A user’s ability to submit/run jobs or store data will be suspended.
- A user’s ability to submit/run jobs or store data will be returned, once the user stops the act(s).
Second incident
- If a user repeats the abusive act(s), their superior/manager/supervisor shall be informed of the act.
- All other measures in first incident will be applied.
Third incident
- User’s account shall be locked for 3 – 12 months, after which UGCS shall decide whether to enable the account or disable the account permanently.
Acknowledgement
Any work that involved the use of Zuputo should be referenced with the following details:
- Conference name
- Title of talk/poster/presentation/book/article
- Author(s)
- Journal, volume,
- Pages or editors
- Address
- Publisher
- Year
- DOI
All jobs are run using the PBS scheduler, a piece of software that lets you specify the resources needed for a task.
Resources that you can request include: nodes, processors, GPUs, RAM and walltime.
Jobs can be run in either of two ways: interactively or in batch mode.
An interactive PBS job can be requested for any activity that needs the same environment as a batch job. After a successful request, you will get a prompt and access to all the resources assigned to the job. This is very handy when testing or debugging, for example when running MPI jobs across multiple nodes.
There are two ways you can submit an interactive job to PBS: By including all of the options on the command line, or by listing the options in a PBS script and submitting the script with the command-line option specifying an interactive job.
Submitting an interactive job from the command line
To submit from the command line, you need to specify all of the PBS options that would normally appear as directives in a PBS script (without the #PBS prefix). Use the qsub command with the -I argument:
$ qsub -I
For interactive jobs you can still add options such as the resource options for number of nodes, processors, memory and walltime, and the -V option to ensure that all the nodes get the correct environment. The -I flag signals that the job should run as an interactive job. (Note: for long commands, a backslash (\) at the end of a line indicates that the following line is a continuation of it.)
$ qsub -I -V -q workq -l nodes=2:ppn=2 -l mem=1GB -l walltime=2:00:00
The above example requests an interactive job in the workq queue with the following resources: two nodes with two processors each, 1 GB of memory, and a walltime of two hours. The prompt will change to a message saying the job is waiting to start, followed by a prompt on the first of the assigned nodes. You can omit the -q option, because the system automatically allocates the job to a queue.
qsub: waiting for job 12345678.master1.local to start
qsub: job 12345678.master1.local ready
To add GPUs, add the ngpus argument:
$ qsub -I -V -q workq -l nodes=2:ppn=2 -l mem=1GB -l walltime=4:00:00 -l ngpus=2
If at some point before the interactive job has started you decide you do not want it, pressing Ctrl-C will cancel it.
When you have completed the work for which you requested the interactive job, simply log out of the compute node, either with exit or with logout, and you will return to the login node prompt.
[user@login ~]$ exit
[user@login ~]$ logout
Submitting an interactive job using a file
To recreate the same interactive job as above, you could create a file, say job.pbs, with the following lines in it:
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l mem=1GB
#PBS -l walltime=2:00:00
#PBS -l ngpus=2
then submit the job using
$ qsub -I job.pbs
To run a batch job on the HPC cluster, you will need to set up a job submission script. This is a Portable Batch System (PBS) file: a simple text file that defines the commands and cluster resources used for the job. A PBS file can easily be created or edited with a UNIX editor such as emacs, pico or vi. If you write your script in a Windows environment and upload it to the server, convert the file with the dos2unix utility.
For example, if your newly created file is job.pbs, format it with the command:
dos2unix job.pbs
Submitting a Job
In order to use the HPC compute nodes, you must first log in to the server and submit a PBS job. Do not run jobs on the login node. The qsub command submits a job to the PBS queue and requests the resources for it, the qdel command deletes an already submitted job, and the qstat command checks the status of a job already in the PBS queue. To simplify submitting a job, you can create a PBS script and use the qsub and qstat commands to interact with the PBS queue.
Creating a PBS Script
To set the parameters for your job, you can create a control file that contains the commands to be executed. Typically, this is in the form of a PBS script. This script is then submitted to PBS using the qsub command.
Here is a sample PBS file, named myjob.pbs, followed by an explanation of each line of the file.
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l mem=1GB
#PBS -l ngpus=2
#PBS -l walltime=00:00:45
cd /home/waccbip/test/
echo 'hello world' > myfile.txt
- The first line in the file identifies which shell will be used for the job. In this example, bash is used but csh or other valid shells would also work.
- The second line specifies the number of nodes and processors desired for this job. In this example, one node with two processors is being requested.
- The third line specifies the RAM the user requires for the job. In this example, 1 GB of RAM has been requested.
- The fourth line states how many GPUs are needed. In this example, 2 GPUs are requested.
- The fifth line in the PBS file states how much wall-clock time is being requested. The format is hh:mm:ss. In this example 45 seconds of wall time have been requested.
- The sixth line tells the HPC cluster to access the directory where the data is located for this job. In this example, the cluster is instructed to change the directory to the /home/waccbip/test/ directory.
- The seventh line saves the text ‘hello world’ to a text file called myfile.txt.
Submitting the job using the qsub command
To submit your job without requesting additional resources, issue the command
$qsub myjob.pbs
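When the job is accepted, qsub prints its job identifier, for example 12345678.master1.local (the ID shown here is only illustrative). You can then monitor the job with qstat and, if necessary, delete it with qdel:
$qstat -u <username>
$qdel 12345678.master1.local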
Requesting for additional wall time
If you need more or less wall time for a submitted job's execution, use the qsub command to make the request.
The example script above requested 45 seconds. If you realize that you need more time, say 3 minutes, you can make the change using this command:
$qsub -l walltime=0:03:00 myjob.pbs
3 minutes will be requested from PBS. Note: If a job does not finish within the specified time, it will be terminated.
Requesting Nodes and Processors
Sometimes you may wish to alter the number of nodes and processors requested for a job. This can also be done by using the qsub command. In the example script, we have requested one node with two processors, or one dual-processor node.
If you realize later that you need three HPC nodes for your job but you are going to use only one of the two processors on each node, then use the following command:
$qsub -l walltime=0:03:00,nodes=3 myjob.pbs
If you want to use both processors on each HPC node, you should use the following command:
$qsub -l walltime=0:05:00,nodes=3:ppn=2 myjob.pbs
Requesting GPU nodes
You can submit jobs directly to the graphical processing unit (GPU) cluster by specifying the number of GPUs your job requires in your PBS request.
For example, if you need four nodes with 16 cores per node and two GPUs per node, then your command should look like this:
$qsub -l walltime=0:03:00,nodes=4:ppn=16 -l ngpus=2 myjob.pbs
Several software packages are installed on the cluster. These packages can be accessed and used via module files. Modules allow you to specify which versions of the packages you want.
To list all available module files use the command below
module avail
You can also list the versions available for a particular software package, or all module files whose names contain a specified string. For example, to find all versions available for python, use:
module avail python/
To load a module (replace modulename with the name of your module of interest):
module load modulename
For example, to load Python version 3.5.5 use this command:
module load python/3.5.5
You can also load multiple modules with a single command, e.g.
module load samtools bwa python
To unload the module execute the command:
module unload modulename
For example, to unload Python version 3.5.5:
module unload python/3.5.5
If you want to unload all loaded modules in your environment, execute the command:
module purge
List all loaded module files by executing the command:
module list
If you don’t find your required software in the list of modules, you can use Anaconda to install the package, or send a request to hpc@ug.edu.gh.
Getting more information
If you want to get a brief description of a module use the following commands:
module whatis modulename
module help modulename
Module collections
Sometimes you might want to run two or more sets of modules separately due to conflicts. You might also have a long list of modules to load every time you log on to the cluster.
In such situations module collections can be used. Module collections allow you to create saved environments that remember a set of modules to load together.
First, load all your modules and then create the saved environment using the command:
module save
The set of modules will be saved as your default set. If you want to save a set as a non-default, assign the environment a name:
module save environment-name
To load all the modules in the set(your default), use the command:
module restore
To load a non-default list, just add the environment name
module restore environment-name
Get a list of your saved environments using:
module savelist
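As a worked example (the module names here are purely illustrative; check module avail for what is actually installed), you could save a frequently used set of modules and restore it in later sessions:
module load samtools bwa python
module save alignment-tools
module purge
module restore alignment-tools
module list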
System-wide software packages are available via the environment modules. However, you might not find all the software you need there. To install additional software packages, use a conda environment.
Conda is a package manager that makes software installation easy. It can be used to install standalone software such as bwa, gromacs and R. It is also possible to install packages that depend on other software; for example, numpy and tensorflow, which work with python, can be installed using conda (the python dependency must, of course, be installed first).
In conda, software packages are installed via channels, which are software repositories. An example of such a channel is Bioconda, which contains over 3,000 bioinformatics packages. The list of available software can be found here.
1. Load anaconda
module load anaconda/3
2. Setting up channels
Conda comes with a default channel, but this might not contain the software you need. Other popular channels are bioconda and conda-forge. Add channels using the command:
conda config --add channels channel-name
Let's add the channels bioconda and conda-forge:
conda config --add channels bioconda
conda config --add channels conda-forge
To view channels, use the command:
conda config --get channels
Read more about setting channels here
To remove a channel, say conda-forge, use the command below:
conda config --remove channels conda-forge
3. Install packages
Once your channels have been enabled, you can install any package available on the channels into the current conda environment.
conda install samtools
If you experience any issue (usually a permission issue), you need to activate an environment before installing the software:
source activate environment
conda install samtools
You can also create a new environment and install the software package.
conda create -n alignment muscle
To install a package from a specific channel use:
conda install -c channel-name package-name
An example is the installation of gromacs from the bioconda channel:
conda install -c bioconda gromacs
List all installed packages using the command:
conda list
To list all environments use:
conda env list
Resources
To transfer files to and from Zuputo, the scp command is used. The format is
scp -P <port number> <source> <destination>
Let's look at some examples.
Transferring files from Zuputo to a local machine
If the path to the file on the server is myhome/file, then to transfer it to the local machine:
scp -P <port number> <username>@zuputo.ug.edu.gh:myhome/file <destination on local machine>
To transfer the entire contents of a directory to your local machine, add the -r flag:
scp -P <port number> -r <username>@zuputo.ug.edu.gh:myhome/directory <destination on local machine>
Transferring files from a local machine to Zuputo
The procedure is the same as above, but this time the destination becomes <username>@zuputo.ug.edu.gh:myhome/directory, as shown in the example below.
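For example, to upload a single file from your current local directory to your home area on Zuputo (the file name myresults.csv is only illustrative; replace the port, username and paths with your own):
scp -P <port number> myresults.csv <username>@zuputo.ug.edu.gh:myhome/directory
To upload an entire directory, add the -r flag:
scp -P <port number> -r mydata <username>@zuputo.ug.edu.gh:myhome/directory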
Step 1: Preparing the conda environment
Load anaconda
module load anaconda
, and check which python executable is used, e.g.
which python
/opt/apps/anaconda3/bin/python
Anaconda provides ready-to-use python packages, but it is also useful to use a conda environment because:
1. It creates separate environments in which all package dependencies can be installed without affecting other environment configurations;
2. It allows users to install packages without administrative permission.
After the anaconda module is loaded, create your environment. When creating the environment you can add some packages to be installed:
conda create --name myenv pip numpy
The above command creates an environment called myenv while installing pip and numpy at the same time.
To use the environment, you need to activate it with the command:
source activate myenv
To deactivate the environment use:
source deactivate
Note: Before you do all these activities you should load anaconda
Step 2: Installing Python Packages
After creating the environment, python packages are installed using pip (other methods are available).
Activate your environment
Install packages using pip
pip install <package>
It is possible to install two or more packages in a single command:
pip install <package1> <package2> <package3>
Note
The created conda environment is saved in your home directory under the path $HOME/.conda/envs. Environments are organised in separate subfolders, so each time a new package is installed the relevant files are created in the environment's own subfolder. Conda environments take space from your home directory quota.
You can check where the packages are using the command
which <package>
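For example, after activating myenv (which was created with pip above), the path should point inside your conda environments folder; the output below is only illustrative:
which pip
~/.conda/envs/myenv/bin/pip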
Step 3: Submitting your job
First, create your python script.
Then create a job submission script to run the python script, as in the sketch below.
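A minimal sketch of such a submission script is shown below. The names run_python.pbs, myscript.py and myenv are assumptions for illustration; adjust the requested resources to your own needs.
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l mem=2GB
#PBS -l walltime=1:00:00
# move to the directory the job was submitted from
cd $PBS_O_WORKDIR
# load anaconda and activate the environment created in Step 1
module load anaconda/3
source activate myenv
# run the python script
python myscript.py
Save this as run_python.pbs and submit it with:
$qsub run_python.pbs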
Introduction
Genome annotation is performed on a draft genome assembly. The main aim of genome annotation is to add biological information to the newly assembled genome, which is otherwise “blank” as far as information is concerned.
For a bacterial genome we will be interested in the following information (non-exhaustive at this point):
– protein-coding genes, i.e. promoter regions, ribosome binding sites, coding sequences, etc.
– non-coding genes (such as tRNA, rRNA, etc.)
How do we perform Annotation?
There are a number of tools and pipelines that have been developed specifically for this purpose (including RAST and Prokka). In this tutorial we look at the use of prokka for annotation. Now let's dig in.
- Prepare your job submission script
#!/bin/bash
#PBS -l nodes=1:ppn=10
#PBS -l mem=32GB
- Load prokka into your environment
module load bioinfo/prokka/1.12
- Download fasta file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/013/925/GCA_000013925.2_ASM1392v2/GCA_000013925.2_ASM1392v2_genomic.fna.gz
- Unzip it
gunzip GCA_000013925.2_ASM1392v2_genomic.fna.gz
- Annotate with prokka
prokka --cpus 10 --kingdom Bacteria --prefix Agy99 --genus Mycobacterium --locustag GCA_000013925 GCA_000013925.2_ASM1392v2_genomic.fna
- There should be a folder called Agy99 created. cd to it
cd Agy99
In the folder 12 files will be created.
Agy99.err, Agy99.faa, Agy99.ffn, Agy99.fna, Agy99.fsa, Agy99.gbk, Agy99.gff, Agy99.log, Agy99.sqn, Agy99.tbl, Agy99.tsv, Agy99.txt
Now we can search and count some genes
Count the number of annotations created:
grep '>' Agy99.ffn | wc -l
Count the number of tRNAs:
grep 'tRNA' Agy99.ffn | wc -l
Count the number of hypothetical proteins:
grep 'hypothetical protein' Agy99.faa | wc -l
Congratulations. You have successfully performed a genome annotation. CHEERS!!!!!!!!!!
Installation
There is a system-wide installation which can be loaded via the module file.
A user-specific installation must be done via the anaconda package:
conda install -c conda-forge -c bioconda -c defaults prokka
(Note: This must be done after activating your conda environment)
RNA-seq analysis enables us to identify genes or transcripts that are differentially expressed across samples. It also makes it possible to identify novel transcripts and multiple isoforms of a gene. This tutorial provides step-by-step instructions on how to perform such an analysis.
Step 1: Prepare your analysis environment
Load anaconda package
module load anaconda/3
Add channels
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
create a new environment
conda create -n rna-seq r-essentials r-base python=3.7
Activate the environment
source activate rna-seq
Install the packages
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda stringtie
conda install -c bioconda gffcompare
conda install -c bioconda bioconductor-biocinstaller
conda install -c conda-forge r-gert
conda install -c conda-forge r-devtools
conda install -c r r-dplyr
Open R terminal
R
Install R packages
devtools::install_github("alyssafrazee/RSkittleBrewer")
install.packages('BiocManager')
BiocManager::install('ballgown')
BiocManager::install('genefilter')
BiocManager::install('cowplot')
Step 2: Data acquisition
Download the data available from here. (data size is about 2GB)
wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
Unpack the data (Data will be stored in chrX_data)
tar xvfz chrX_data.tar.gz
The data will be extracted to a folder called chrX_data, which contains the subfolders samples, indexes, genome and genes. You can check this using the command:
tree chrX_data
Data description
- Paired-end sequencing reads from the human chromosome X, located in chrX_data/samples/*.fastq.gz. These will be our raw reads.
- A reference genome (chromosome X) located in chrX_data/genome.chrX.fa.
- A description of the data, stored in the file geuvadis_phenodata.csv. We will refer to this as the phenotypic data. It will be used in the differential analysis step and is usually something you create yourself for your own samples.
- Reference genome annotations, located in chrX_data/genes/chrX.gtf.
- Index files. We will however generate our own index files.
Step 3: Start Analysis
Map reads to reference genome. This is done to assign reads to specific regions in the genome. Prior to mapping we will have to first index the reference genome
Create a folder index
mkdir index
Extract exon and splice-site information from the genome annotation
extract_splice_sites.py chrX_data/genes/chrX.gtf > index/chrX.ss
extract_exons.py chrX_data/genes/chrX.gtf > index/chrX.exon
Confirm that these files have been created
head -n 5 index/chrX.ss
chrX 276393 281481 +
chrX 281683 284166 +
chrX 284313 288732 +
chrX 288868 290647 +
chrX 290775 291498 +
head -n 5 index/chrX.exon
chrX 276323 276393 +
chrX 281393 281683 +
chrX 284166 284313 +
chrX 288732 288868 +
chrX 290647 290775 +
Build the index using hisat2. The syntax for building an index with hisat2 is:
hisat2-build -p <number_of_threads> --ss <splice_sites> --exon <exon> <reference.fa> <index_basename>
Note: the --ss and --exon flags can be omitted if a gene annotation is unavailable.
Now it's time to build the index:
hisat2-build -p 10 --ss index/chrX.ss --exon index/chrX.exon chrX_data/genome/chrX.fa index/chrX_index
Confirm that the index files have been created
ls -1 index
chrX_index.1.ht2
chrX_index.2.ht2
chrX_index.3.ht2
chrX_index.4.ht2
chrX_index.5.ht2
chrX_index.6.ht2
chrX_index.7.ht2
chrX_index.8.ht2
chrX.exon
chrX.ss
Before we do the mapping let's find out how many samples we are dealing with.
ls chrX_data/samples/*chrX_1*.gz |wc -l
12
Syntax for hisat2 mapping
hisat2 -p <number_of_threads> --dta -x <index_basename> -1 <Read1> -2 <Read2> -S <outputname_for_generated_samfile>
We create a folder where all the generated SAM files will be stored:
mkdir map
Now let's do the mapping for all the samples
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S map/ERR188044_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S map/ERR188104_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S map/ERR188234_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S map/ERR188245_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S map/ERR188257_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S map/ERR188273_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S map/ERR188337_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S map/ERR188383_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S map/ERR188401_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S map/ERR188428_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S map/ERR188454_chrX.sam
hisat2 -p 8 --dta -x index/chrX_index -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S map/ERR204916_chrX.sam
Convert the SAM files to sorted BAM files using samtools. This is done because BAM files are compressed and more efficient to work with than SAM files.
samtools sort -@ 8 -o map/ERR188044_chrX.bam map/ERR188044_chrX.sam
samtools sort -@ 8 -o map/ERR188104_chrX.bam map/ERR188104_chrX.sam
samtools sort -@ 8 -o map/ERR188234_chrX.bam map/ERR188234_chrX.sam
samtools sort -@ 8 -o map/ERR188245_chrX.bam map/ERR188245_chrX.sam
samtools sort -@ 8 -o map/ERR188257_chrX.bam map/ERR188257_chrX.sam
samtools sort -@ 8 -o map/ERR188273_chrX.bam map/ERR188273_chrX.sam
samtools sort -@ 8 -o map/ERR188337_chrX.bam map/ERR188337_chrX.sam
samtools sort -@ 8 -o map/ERR188383_chrX.bam map/ERR188383_chrX.sam
samtools sort -@ 8 -o map/ERR188401_chrX.bam map/ERR188401_chrX.sam
samtools sort -@ 8 -o map/ERR188428_chrX.bam map/ERR188428_chrX.sam
samtools sort -@ 8 -o map/ERR188454_chrX.bam map/ERR188454_chrX.sam
samtools sort -@ 8 -o map/ERR204916_chrX.bam map/ERR204916_chrX.sam
Let's delete the SAM files to clear up storage space.
rm map/*.sam
Assemble transcripts using stringtie
# create assembly per sample using 8 threads
stringtie map/ERR188044_chrX.bam -l ERR188044 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188044_chrX.gtf
stringtie map/ERR188104_chrX.bam -l ERR188104 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188104_chrX.gtf
stringtie map/ERR188234_chrX.bam -l ERR188234 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188234_chrX.gtf
stringtie map/ERR188245_chrX.bam -l ERR188245 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188245_chrX.gtf
stringtie map/ERR188257_chrX.bam -l ERR188257 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188257_chrX.gtf
stringtie map/ERR188273_chrX.bam -l ERR188273 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188273_chrX.gtf
stringtie map/ERR188337_chrX.bam -l ERR188337 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188337_chrX.gtf
stringtie map/ERR188383_chrX.bam -l ERR188383 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188383_chrX.gtf
stringtie map/ERR188401_chrX.bam -l ERR188401 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188401_chrX.gtf
stringtie map/ERR188428_chrX.bam -l ERR188428 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188428_chrX.gtf
stringtie map/ERR188454_chrX.bam -l ERR188454 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188454_chrX.gtf
stringtie map/ERR204916_chrX.bam -l ERR204916 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR204916_chrX.gtf
Merge transcripts from all samples
We first put the names of the assembled transcript files into a single file
ls assembly/*.gtf > mergelist.txt
cat mergelist.txt
assembly/ERR188044_chrX.gtf
assembly/ERR188104_chrX.gtf
assembly/ERR188234_chrX.gtf
assembly/ERR188245_chrX.gtf
assembly/ERR188257_chrX.gtf
assembly/ERR188273_chrX.gtf
assembly/ERR188337_chrX.gtf
assembly/ERR188383_chrX.gtf
assembly/ERR188401_chrX.gtf
assembly/ERR188428_chrX.gtf
assembly/ERR188454_chrX.gtf
assembly/ERR204916_chrX.gtf
The syntax for merging is:
stringtie --merge -p <number_of_threads> -G <genome_annotation> -o <outputfile> <textfile_containing_list_of_transcripts>
We can now merge with stringtie
stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf mergelist.txt
Let's confirm the operation was successful
cat stringtie_merged.gtf | head -n 5
# StringTie version 2.1.4
chrX StringTie transcript 323525 324046 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
chrX StringTie exon 323525 323661 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";
chrX StringTie exon 323905 324046 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2";
We can find how many transcripts were identified.
cat stringtie_merged.gtf |cut -d$'\t' -f 3|grep 'transcript'|wc -l
4602
We can also inspect how the transcripts identified by stringtie compare with those in the reference annotation, using the gffcompare tool:
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
cat merged.stats
# gffcompare v0.11.5 | Command line was:
#gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
#
#= Summary for dataset: stringtie_merged.gtf
# Query mRNAs : 4602 in 1410 loci (4225 multi-exon transcripts)
# (664 multi-transcript loci, ~3.3 transcripts per locus)
# Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)
# Super-loci w/ reference transcripts: 1065
#----------------- | Sensitivity | Precision |
Base level: 100.0 | 76.9 |
Exon level: 98.2 | 73.9 |
Intron level: 100.0 | 78.1 |
Intron chain level: 100.0 | 43.9 |
Transcript level: 99.2 | 45.3 |
Locus level: 98.9 | 74.8 |
Matching intron chains: 1856
Matching transcripts: 2085
Matching loci: 1074
Missed exons: 0/8804 ( 0.0%)
Novel exons: 1862/12363 ( 15.1%)
Missed introns: 0/7946 ( 0.0%)
Novel introns: 780/10175 ( 7.7%)
Missed loci: 0/1086 ( 0.0%)
Novel loci: 345/1410 ( 24.5%)
Total union super-loci across all input datasets: 1410
4602 out of 4602 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
The high sensitivity at the transcript level indicates that a large percentage of the stringtie transcripts match those in the reference annotation. The lower precision value indicates that many of the transcripts identified by stringtie are not in the reference annotation; these transcripts are either novel or false positives.
The novel exons, introns, and loci indicate how many of the sites were not found in the reference annotation.
Estimate transcript abundance
mkdir abundance
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188044/ERR188044_chrX.gtf map/ERR188044_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188104/ERR188104_chrX.gtf map/ERR188104_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188234/ERR188234_chrX.gtf map/ERR188234_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188245/ERR188245_chrX.gtf map/ERR188245_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188257/ERR188257_chrX.gtf map/ERR188257_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188273/ERR188273_chrX.gtf map/ERR188273_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188337/ERR188337_chrX.gtf map/ERR188337_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188383/ERR188383_chrX.gtf map/ERR188383_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188401/ERR188401_chrX.gtf map/ERR188401_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188428/ERR188428_chrX.gtf map/ERR188428_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR188454/ERR188454_chrX.gtf map/ERR188454_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o abundance/ERR204916/ERR204916_chrX.gtf map/ERR204916_chrX.bam
Differential expression
To run the differential expression analysis we use some R packages. First we start R and load the relevant packages.
R
library(devtools)
library(ballgown)
library(genefilter)
library(dplyr)
library(RSkittleBrewer)
We load the phenotype data
pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv")
#Let's show information for first 6 samples
head(pheno_data)
ids sex population
1 ERR188044 male YRI
2 ERR188104 male YRI
3 ERR188234 female YRI
4 ERR188245 female GBR
5 ERR188257 male GBR
6 ERR188273 female YRI
Load the expression data (i.e. transcript abundance estimates) using the ballgown library. We specify three parameters: dataDir, which specifies where the estimates are stored; samplePattern, which indicates a name pattern present in the sample names; and pData, which specifies the phenotypic information (we loaded it as pheno_data).
bg_chrX <- ballgown(dataDir = 'abundance', samplePattern = 'ERR', pData=pheno_data)
We can check some information about the loaded abundances:
class(bg_chrX)
[1] "ballgown"
attr(,"package")
[1] "ballgown"
bg_chrX
ballgown instance with 3491 transcripts and 12 samples
We can display expression levels for genes, transcripts, exons and introns using the functions gexpr(), texpr(), eexpr() and iexpr() respectively.
To get the gene expression levels
head(gexpr(bg_chrX), 3)
FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245
MSTRG.1 9.153040 0 7.683855 0
MSTRG.10 8.467854 0 0.000000 0
MSTRG.100 0.000000 0 0.000000 0
FPKM.ERR188257 FPKM.ERR188273 FPKM.ERR188337 FPKM.ERR188383
MSTRG.1 0 0 2.417433 0
MSTRG.10 0 0 0.000000 0
MSTRG.100 0 0 0.000000 0
To get transcript expression levels
head(texpr(bg_chrX),3)
FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257
1 8.467854 0 0.00000 0.00000 0.00000
2 0.000000 0 13.99012 14.15741 27.59774
3 0.000000 0 0.00000 0.00000 53.61156
FPKM.ERR188273 FPKM.ERR188337 FPKM.ERR188383 FPKM.ERR188401 FPKM.ERR188428
1 0.00000 0.00000 0.00000 4.871071 0.00
2 0.00000 0.00000 0.00000 21.825052 0.00
3 18.42071 83.10745 27.19952 0.000000 21.08
FPKM.ERR188454 FPKM.ERR204916
1 0.00000 0.00000
2 0.00000 5.52904
3 74.33337 41.66899
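Exon- and intron-level expression values can be inspected in the same way, for example:
head(eexpr(bg_chrX), 3)
head(iexpr(bg_chrX), 3)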
We now filter out low-abundance genes/transcripts, i.e. those with few counts. Filtering improves the statistical power of the differential expression analysis. We use a variance filter to remove transcripts with low variance (1 or less):
bg_chrX_filt=subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)
# 1,352 transcripts were filtered out
Let's check how many transcripts were filtered out, by comparing how many transcripts are in the unfiltered and the filtered expression data:
bg_chrX_filt
ballgown instance with 3250 transcripts and 12 samples
bg_chrX
ballgown instance with 4602 transcripts and 12 sample
If you do the maths, 1,352 transcripts were filtered out.
Finally, we perform differential expression analysis to reveal transcripts that show a significant difference across groups. This is done using the stattest function. The parameters we use are feature, covariate, adjustvars, getFC and meas. Our feature of interest is the transcript, and we are interested in its differential expression between males and females, so our covariate is the sex group. Transcripts might also be differentially expressed among different populations (confounders), but since population is not our group of interest we correct for it using the adjustvars parameter. We also want the confounder-adjusted fold-change values, so we set getFC to TRUE. Finally, ballgown reports expression values in different measures, but we will use FPKM as the meas. Testing for differential expression at exons and introns can also be done by changing the feature parameter.
Let's perform the differential analysis on transcripts. The results will be returned in a data frame.
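Mirroring the gene-level stattest call used later in this tutorial, the transcript-level test is run as follows:
results_transcripts <- stattest(bg_chrX_filt, feature="transcript", covariate="sex", adjustvars=c("population"), getFC=TRUE, meas="FPKM")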
Let's check the first 6 transcripts and their values
head(results_transcripts)
feature id fc pval qval
1 transcript 1 1.7614945 0.2498894 0.8639175
2 transcript 2 0.8250292 0.8471261 0.9783267
3 transcript 3 0.6453535 0.7493203 0.9554001
4 transcript 4 2.7530478 0.1076059 0.8356592
5 transcript 5 1.4828120 0.5873142 0.8934238
6 transcript 6 0.4398648 0.2000345 0.8639175
To check how many of the transcripts have a q-value < 0.05, we use the command:
table(results_transcripts$qval < 0.05)
FALSE TRUE
3237 13
Let's perform the differential analysis on genes:
results_genes <- stattest(bg_chrX_filt,feature="gene",covariate="sex",adjustvars = c("population"),getFC=TRUE, meas="FPKM")
Let's check the first 6 genes
head(results_genes)
feature id fc pval qval
1 gene MSTRG.1 0.8590737 0.8277795 0.9766457
2 gene MSTRG.10 1.8171603 0.2566553 0.7974964
3 gene MSTRG.100 1.4437594 0.4360308 0.8692654
4 gene MSTRG.1000 1.4264802 0.1599910 0.7207850
5 gene MSTRG.1001 1.1644814 0.4806989 0.8692654
6 gene MSTRG.1002 0.9779751 0.9519060 0.9867803
Find out how many of the genes have a qvalue < 0.05
table(results_genes$qval<0.05)
FALSE TRUE
889 9
The results_transcripts data frame does not have gene IDs, so we add that information to it:
results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)
Let's confirm that the IDs have been added
head(results_transcripts)
geneNames geneIDs feature id fc pval qval
1 . MSTRG.10 transcript 1 1.7614945 0.2498894 0.8639175
2 PLCXD1 MSTRG.11 transcript 2 0.8250292 0.8471261 0.9783267
3 . MSTRG.11 transcript 3 0.6453535 0.7493203 0.9554001
4 . MSTRG.11 transcript 4 2.7530478 0.1076059 0.8356592
5 . MSTRG.12 transcript 5 1.4828120 0.5873142 0.8934238
6 . MSTRG.12 transcript 6 0.4398648 0.2000345 0.8639175
Let's arrange the results from the smallest P value to the largest
results_transcripts = arrange(results_transcripts,pval)
head(results_transcripts)
results_genes = arrange(results_genes,pval)
head(results_genes)
We can now save the results to a file (csv recommended)
write.csv(results_transcripts, "fc_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "fc_gene_results.csv", row.names=FALSE)
Let's subset the transcripts and genes that are detected as differentially expressed at qval < 0.05:
subset_transcripts <- subset(results_transcripts,results_transcripts$qval<0.05)
subset_genes <- subset(results_genes,results_genes$qval<0.05)
Save the subsets to a file
write.csv(subset_transcripts, "fc_transcript_subset.csv", row.names=FALSE)
write.csv(subset_genes, "fc_gene_subset.csv", row.names=FALSE)
All the differential expression results have now been written to CSV files. Let's confirm that the files have been saved successfully.
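One simple way to check, using base R (a minimal illustration; any directory listing will do):
list.files(pattern = "\\.csv$")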
Thousands of bacterial genome sequences are publicly available. In order to understand the genomic characteristics of these organisms and identify which genes are distinct and common among genomes of interest, it is important to perform a comparative analysis. However, the analysis is often tedious and challenging. This tutorial seeks to address this issue by providing a step-by-step tutorial on how to perform comparative analysis of bacterial genomes on the Zuputo cluster.
The first step is to prepare the analysis environment. We do this by loading Anaconda:
module load anaconda/3
Configure software channels
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels daler
conda config --add channels defaults
Download the Analysis pipeline
git clone https://github.com/vappiah/bacterial-genomics-tutorial.git
Change directory to the downloaded folder:
cd bacterial-genomics-tutorial
Create conda environment. Packages are listed in the environment.yaml file
conda env create -f environment.yaml
Download the polishing tool pilon
mkdir apps
wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar -O apps/pilon.jar
Activate environment
source activate bacterial-genomics-tutorial
Give execution rights for all scripts
chmod +x *.{py,sh,pl}
TIME FOR ANALYSIS
Step 1: Download data
./download_data.sh
Step 2: Perform de novo assembly using spades
./assemble.sh
Step 3: Polish the draft assembly using pilon. This is meant to improve the draft assembly. The scaffolds will be used; you can also modify the script to use the contigs and compare the results.
./polish.sh
Step 4: Perform QC for both the raw assembly and the polished assembly
./qc_assembly.sh
QUAST report showing the quality of draft assembly
Step 5: Reorder contigs against a reference genome using ragtag to generate a draft genome
./reorder_contigs.sh
Step 6: Perform a multi locus sequence typing using MLST software
./mlst.sh
Step 7: Check for antimicrobial resistance genes using abricate
./amr.sh
Step 8: Annotate the draft genome using prokka
./annotate.sh
Get some statistics on the annotation. Features such as genes and CDS will be counted and displayed. This python script requires you to specify the folder where the annotations were saved, i.e. P7741:
python get_annotat_stats.py P7741
Step 9: Generate a dendrogram using dRep
./dendogram.sh
Step 10: Perform pangenome analysis using Roary. Input files are in GFF (version 3) format. It is recommended to use Prokka-generated GFF files, so we generate the GFFs for the files in the genome folder by re-annotating with Prokka, using the get_genome_gffs script:
./get_genome_gffs.sh
Then perform pangenome analysis
./get_pangenome.sh
# Get the gene summary for three of the organisms. The defaults are P7741, Agy99 and SGL03; feel free to change them. A Venn diagram will be generated (gene_summary.jpg)
python get_gene_summary.py P7741 Agy99 SGL03 pangenome/gene_presence_absence.csv
Step 11: Visualize your genome by generating ring structures using BRIG. This can be done by following the video tutorial here.
You may want to combine the analysis results into a zip file so you can download and view them locally, after which you can use scp to copy the archive to your machine.
zip -r results QC_ASSEMBLY P7741* mlst.csv amr.summary.tab dendogram pangenome *.jpg results
# Now that you have performed a bacterial comparative genome analysis, it's time to apply your skills to real-world data.
# Good luck and see you next time.
Citation
You just have to include a statement such as:
Analysis was done using scripts written by Vincent Appiah, University of Ghana
Gene Expression Analysis Pipeline
Gene expression analysis enables us to identify genes or transcripts that show significant changes across conditions. Setting up a computational environment to perform this task can often be daunting. This tutorial attempts to reduce that burden by providing a number of ready-made scripts for performing a gene expression analysis. This document is based on a tutorial by Dave Tang. In this tutorial we make things a bit easier by automating a number of tasks and providing materials that can be used on other datasets by entering a few pieces of information. A general workflow for gene expression analysis is shown in the figure below; a detailed description of the stages is not provided, because that is beyond the scope of this document.
How does the pipeline work?
The entire process begins with creating a config file template, entering the parameters in the config file, and executing the pipeline.
To begin the analysis, you first need to load the pipeline via the environmental module.
module load bioinfo/pipelines/dge_1
Then create the config file template with the command below
run_dge create-config
The next stage is to enter your parameters in the config file. Read more in the Parameters section
After the parameters have been set, you proceed with the execution of the pipeline using the command below:
run_dge config.config
Parameters
Parameters are set in the config file. What we did earlier was to prepare a template in which we will enter the parameters. Twelve parameters have to be set (one of them has a default value of "None"). What you have to do is enter the right information on the right-hand side of the equals sign.
Let's take a look at the parameters.
outputdir : this is the directory where you want your results to be saved
ref_gtf: this is your GTF file name. The GTF is the annotated reference genome and is needed for creating the splice sites. You have to specify the full path as well as the file name. E.g. if your GTF file is named "mygtf.gtf" and is located in the directory "/home/user/myfiles", then the ref_gtf parameter will be "/home/user/myfiles/mygtf.gtf". In the config file it will be
ref_gtf= "/home/user/myfiles/mygtf.gtf"
ref_fasta: this is the reference file in fasta format. Its parameter is set in the same way as ref_gtf.
files: directory which contains all your sample files.
seqtype: an RNA-seq experiment can be either paired-end (both forward and reverse strands are sequenced) or single-end (only the forward strand is sequenced). For paired-end data the parameter should be 'paired' or 'p'; for single-end data, 'single' or 's'.
threads: this depends on your computer architecture. On our HPC we recommend a maximum of 36 threads.
phenodata: this file contains the experiment design. It includes columns for file names, conditions and confounders. A sample can be found here.
covariate: this is the column of interest (it must be included in the phenodata file).
confounders: factors that could affect the results (for example, in measuring the mortality of a disease, one has to take age into account, as it can also lead to the death of an individual; age is therefore a confounding factor). Confounders must also be present in the phenodata file.
split_point: this is used to extract names for identification. Usually the file names are used in the experiment design found in the phenodata file. However, file names have extensions (e.g. .fastq, .gz) which are absent in the phenodata, so the split point indicates the range of positions in the file name that should be used. Consider the file name ERR215671_r1.fastq; in the experimental design we have ERR215671. The part of the file name we are interested in is ERR215671, which runs from position 1 to 9, so our split_point value will be 9.
ext: this is used to select sample files. If your file name is ERR2157671.fastq, then the extension is fastq. The condition for the extension is that it must be the last group of characters in the file name after the dot (.).
mode: used to perform a stage-wise analysis. In this pipeline we classify the entire activity into 3 stages: indexing, mapping and differential expression. There are, however, four modes, listed below; a sketch of a filled-in config file follows the list.
i – indexing
m – mapping
d – differential expression
a – all (i.e. i, m and d)
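For orientation, a filled-in config might look roughly like the sketch below. This is only an illustration: every value shown is hypothetical, and the template generated by run_dge create-config is the authoritative format, so follow its exact syntax.
outputdir= "/home/user/results"
ref_gtf= "/home/user/myfiles/mygtf.gtf"
ref_fasta= "/home/user/myfiles/genome.fa"
files= "/home/user/myfiles/samples"
seqtype= "paired"
threads= 16
phenodata= "/home/user/myfiles/phenodata.csv"
covariate= "condition"
confounders= "None"
split_point= 9
ext= "fastq.gz"
mode= "a"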
Demonstration
Step 1: Load the pipeline
module load bioinfo/pipelines/dge_1
Step 2: Create blank config file
run_dge create-config
Step 3:
Enter your parameters in the config file
Step 4: Run analysis
run_dge “config file”
If the config file is named config.config and is located in the directory /home/user/files, then the command becomes:
run_dge /home/user/files/config.config
Step 5: Check your output directory and find the file dge_results.csv
The file can be downloaded for downstream analysis (more on this later), for example with scp as shown below.
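For example, from your local machine you could copy the results with scp (placeholders as in the file-transfer section; the output directory path is whatever you set in the config):
scp -P <port number> <username>@zuputo.ug.edu.gh:<outputdir>/dge_results.csv .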
Demonstration with sample data
There is a command to download sample data for analysis. It downloads files to a directory named data in your current working directory. In addition, it also prepares the config file for you. Use the commands below:
module load bioinfo/pipelines/dge_1
run_dge download-sample-data
You can then execute the analysis pipeline.
run_dge ./data/config.config
Results will be saved as dge_results.csv in the data directory.
What tools are used in this pipeline?
HISAT2 v2.1 – indexing and mapping
samtools v1.9 – sorting the mapped files
stringtie v2.0.6 – transcriptome assembly and quantification
gffcompare – comparing the StringTie transcripts to known transcripts
Ballgown (R package) – differential expression
What are the limitations of this pipeline?
The pipeline was designed to work only in a Linux environment, mainly because most of the tools (samtools, hisat2 and stringtie) only run on Unix platforms.
The pipeline was tested with a two-condition experiment, and we therefore assume that is what it can handle at the moment. In future we will include a feature to handle experiments with three or more conditions.
The pipeline was tested with a single-timepoint dataset. In the future we will include a feature to handle time-course analysis.
Synopsis: The pangenome is the entire set of genes found across all strains within a clade (in this tutorial, ‘clade’ refers to all the genomes on which the analysis is being performed). The pangenome of a species comprises the core genome, containing genes present in all strains within the clade, the accessory/dispensable genes, and strain-specific genes.
In this tutorial we will perform a pangenome analysis on a raw bacterial genome, specifically the Mycobacterium ulcerans P7741 strain, together with other strains with known sequence data. This tutorial should be done in a Linux environment.
PREPARE SOFTWARE TOOLS
We will use a number of tools to perform the analysis, and we will use the conda environment to download and install them. The procedure for installing tools is documented here.
Create and activate your conda environment. We are using the environment bioinformatics
source activate bioinformatics
It's now time to dig in.
STEP 1: DOWNLOAD THE DATA
Our data is hosted on the SRA database. Since we already know the download link, we use wget to get it:
wget https://sra-pub-src-1.s3.amazonaws.com/ERR3336325/all_minion.fastq.1
Rename the sample to ERR3336325
mv all_minion.fastq.1 ERR3336325.untrimmed.fastq
STEP 2: TRIM ADAPTERS
We will first trim adapters from the Oxford Nanopore data using the porechop tool (install it with anaconda). We will use 18 threads:
threads=18
porechop -i ERR3336325.untrimmed.fastq -o ERR3336325.fastq --format fastq -t $threads
STEP 3: DE NOVO ASSEMBLY
We then perform a de novo assembly on our sample ERR3336325. The sample we are using was sequenced on an Oxford Nanopore MinION device, so we will use a tool called flye.
Activate your conda environment and install flye (view the procedure here)
source activate bioinformatics
Before we start the de novo assembly, let's create our output directory:
mkdir output
Now execute flye with 10 threads:
flye --nano-raw ERR3336325.fastq -o output -g 5.6m -t 10 -i 2
Once the de novo assembly is done, the contigs will be located in output as assembly.fasta. Let's check:
ls output
STEP 4: ANNOTATION
After the de novo assembly we annotate the contigs using the tool prokka:
prokka --cpus 10 --outdir ERR3336325 --prefix ERR3336325 output/assembly.fasta
When this completes, a folder called ERR3336325 will have been created, containing 12 files:
ERR3336325.err, ERR3336325.faa, ERR3336325.ffn, ERR3336325.fna, ERR3336325.fsa, ERR3336325.gbk, ERR3336325.gff, ERR3336325.log, ERR3336325.sqn, ERR3336325.tbl, ERR3336325.tsv, ERR3336325.txt
Check with the ls command
ls ERR3336325
The files include sequence data for nucleotides (.fna) and amino acids (.faa) as well as annotation files (.gbk and .gff).
We can look through the results for some interesting information, such as the number of genes identified as well as counts of proteins, RNAs, etc.
Let's get how many annotations were made in the genome:
grep '>' ERR3336325/ERR3336325.ffn | wc -l
7002
Let's get the number of protein-coding genes:
grep '>' ERR3336325/ERR3336325.faa | wc -l
6929
Let's get the number of hypothetical proteins in the genome:
grep 'hypothetical protein' ERR3336325/ERR3336325.faa | wc -l
4497
Find the number of operons:
grep -i 'operon' ERR3336325/ERR3336325.ffn | wc -l
0
STEP 5: PANGENOME ANALYSIS
Pangenome analysis will be done using the tool roary
This analysis will make use of gff files.
At the moment we only have gff file for our sample ERR3336325. This is located in the prokka results directory as ERR3336325.gff.
Let's make a directory to keep the gffs that will be used for the analysis:
mkdir gffs
Copy the ERR3336325 sample and verify:
cp ERR3336325/ERR3336325.gff gffs
ls gffs
We would like to perform this analysis together with other related strains (i.e. Agy99, S4018, CSURQ0185, Harvey and Liflandii).
Note that roary uses gff files generated by prokka for the pangenome analysis, so we have to prepare gff files for the other strains we will be using. This involves first downloading the fasta files and then preparing the gff files using prokka.
Let's download the FASTA files for these genomes
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/013/925/GCA_000013925.2_ASM1392v2/GCA_000013925.2_ASM1392v2_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/870/585/GCA_001870585.1_ASM187058v1/GCA_001870585.1_ASM187058v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/524/035/GCA_000524035.1_ASM52403v1/GCA_000524035.1_ASM52403v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/506/705/GCA_902506705.1_Q0185/GCA_902506705.1_Q0185_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/026/445/GCA_000026445.2_ASM2644v2/GCA_000026445.2_ASM2644v2_genomic.fna.gz
Let's uncompress the files
gunzip GCA_000013925.2_ASM1392v2_genomic.fna.gz
gunzip GCA_001870585.1_ASM187058v1_genomic.fna.gz
gunzip GCA_000524035.1_ASM52403v1_genomic.fna.gz
gunzip GCA_902506705.1_Q0185_genomic.fna.gz
gunzip GCA_000026445.2_ASM2644v2_genomic.fna.gz
Let's prepare the GFF files using Prokka
prokka --cpus 10 --kingdom Bacteria --outdir Agy99 --genus Mycobacterium GCA_000013925.2_ASM1392v2_genomic.fna
prokka --cpus 10 --kingdom Bacteria --outdir S4018 --genus Mycobacterium GCA_001870585.1_ASM187058v1_genomic.fna
prokka --cpus 10 --kingdom Bacteria --outdir Harvey --genus Mycobacterium GCA_000524035.1_ASM52403v1_genomic.fna
prokka --cpus 10 --kingdom Bacteria --outdir CSURQ0185 --genus Mycobacterium GCA_902506705.1_Q0185_genomic.fna
prokka --cpus 10 --kingdom Bacteria --outdir Liflandii --genus Mycobacterium GCA_000026445.2_ASM2644v2_genomic.fna
The GFF files will be located in the individual output directories.
Verify with the ls command.
ls Agy99
ls S4018
ls Harvey
ls CSURQ0185
ls Liflandii
Now, let's copy those GFF files to our gffs directory, renaming them after their strains
cp Agy99/PROKKA*.gff gffs/Agy99.gff
cp S4018/PROKKA*.gff gffs/S4018.gff
cp Harvey/PROKKA*.gff gffs/Harvey.gff
cp CSURQ0185/PROKKA*.gff gffs/CSURQ0185.gff
cp Liflandii/PROKKA*.gff gffs/Liflandii.gff
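Equivalently, since each output directory is named after its strain, the copies can be done in a single shell loop (a small sketch):
for strain in Agy99 S4018 Harvey CSURQ0185 Liflandii; do cp "$strain"/PROKKA*.gff gffs/"$strain".gff; done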
Let's verify that all the GFF files are present in the gffs directory
ls gffs
It's time for the main course. We assume that you already have Roary installed and set up.
We generate a pangenome with a core gene alignment, using 10 threads.
roary -f roaryresult -p 10 -e -n -v --mafft gffs/*.gff
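Roary writes several outputs into the roaryresult directory; summary_statistics.txt gives a quick breakdown of core, soft-core, shell and cloud genes and is worth checking before plotting:
cat roaryresult/summary_statistics.txt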
Visualizing the results
This involves two stages:
– generating a tree (in Newick format) using FastTree
– plotting the tree using the roary_plots.py Python script
FastTree -nt -gtr roaryresult/core_gene_alignment.aln > roaryresult/mytree.newick
./roary_plots.py --labels roaryresult/mytree.newick roaryresult/gene_presence_absence.csv
mv *.png roaryresult
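If roary_plots.py is not already present in your working directory, it can be fetched from the contrib folder of the Roary repository (URL assumed; the script typically also needs Python with biopython, pandas, matplotlib and seaborn installed):
wget https://raw.githubusercontent.com/sanger-pathogens/Roary/master/contrib/roary_plots/roary_plots.py
chmod +x roary_plots.py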
Once this is complete, three PNG files will be created.
ls roaryresult
View those plots to understand your data. Other useful outputs will be located in the roaryresult directory.
Synopsis: This tutorial demonstrates each step of a GATK pipeline for calling variants from whole-exome data. It follows the GATK Best Practices. We assume that you are familiar with the Linux environment.
We first load our tools
module load bioinfo/samtools/1.9
module load bioinfo/gatk/4.1
module load bioinfo/bwa/0.7
module load java/8
module load R/3.4.3
module load bioinfo/tabix/1.10.2
module load bioinfo/bgzip/1.10.2
picard=/opt/apps/bioinfo/picard/2.23.1/picard.jar
Let's download the reference and annotation files. We will use the hg19 bundle from the Broad Institute.
mkdir hg19bundle
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_omni*
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/hapmap_3.3*
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G*
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence*
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138*
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
wget --directory-prefix=hg19bundle ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19*
We process the downloaded files.
Unzip the reference genome
gunzip hg19bundle/ucsc.hg19.*.gz
Unzip the annotation files (after some experimentation, this proved to be the most reliable approach)
gunzip hg19bundle/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz
gunzip hg19bundle/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
gunzip hg19bundle/dbsnp_138.hg19.vcf.idx.gz
gunzip hg19bundle/dbsnp_138.hg19.vcf.gz
gunzip hg19bundle/hapmap_3.3.hg19.sites.vcf.idx.gz
gunzip hg19bundle/hapmap_3.3.hg19.sites.vcf.gz
gunzip hg19bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
gunzip hg19bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
gunzip hg19bundle/1000G_omni2.5.hg19.sites.vcf.idx.gz
gunzip hg19bundle/1000G_omni2.5.hg19.sites.vcf.gz
Let's set the reference and annotation file names
ref=hg19bundle/ucsc.hg19.fasta
GCONF=hg19bundle/1000G_phase1.snps.high_confidence.hg19.sites.vcf
DBSNP=hg19bundle/dbsnp_138.hg19.vcf
HAPMAP=hg19bundle/hapmap_3.3.hg19.sites.vcf
MILLS=hg19bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
OMNI=hg19bundle/1000G_omni2.5.hg19.sites.vcf
We will use 18 threads for this task
threads=18
Let us download the exome raw sequence reads. The data comes from the 1000 Genomes Project repository.
mkdir reads
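The exact download command is not shown here. As an illustrative sketch only, the reads would follow the usual 1000 Genomes FTP layout, where <SAMPLE_ID> is a placeholder you must look up for run SRR100027 on the FTP site:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/<SAMPLE_ID>/sequence_read/SRR100027_1.filt.fastq.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/<SAMPLE_ID>/sequence_read/SRR100027_2.filt.fastq.gz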
Let's move the raw reads into the reads directory and rename them
mv SRR100027_1.filt.fastq.gz reads/read1.fastq.gz
mv SRR100027_2.filt.fastq.gz reads/read2.fastq.gz
We also download the capture file which contains the regions of the genome that were sequenced
wget --directory-prefix=hg19bundle ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/20130108.exome.targets.bed
Set the location of the target regions (the file was downloaded into hg19bundle above)
targetregion=hg19bundle/20130108.exome.targets.bed
Let's set the location of the Java temp directory
javatmp=tmp
mkdir $javatmp
Set path to the data input and output files
read1=reads/read1.fastq.gz
read2=reads/read2.fastq.gz
outdir=outdir
mkdir $outdir
Now that we have all the tools and resources set up, let's start the variant calling.
Step 1: We begin our pipeline by indexing the reference genome for BWA
bwa index $ref
Step 2: Create the FASTA index with samtools
samtools faidx $ref
Note: the GATK bundle provides some index files for the reference, so if you downloaded them you can skip the corresponding steps.
For this tutorial you can skip Step 2.
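GATK also needs a sequence dictionary (.dict) next to the reference. The bundle download above should include it; if it is missing, one way to create it (a sketch using the Picard jar set earlier) is:
java -jar $picard CreateSequenceDictionary R=$ref O=hg19bundle/ucsc.hg19.dict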
Step 3: Align the FASTQ files to the reference genome. We use hg19 as the reference genome.
Aligning also requires that we add read groups, so let's define them
PLATFORM="illumina"
sampleid=S01
(If you are running several samples, you should generate a unique ID for each sample)
samplename=NGR
bwa mem -t $threads -R "@RG\\tID:${sampleid}\\tPL:$PLATFORM\\tPU:None\\tLB:None\\tSM:${samplename}" $ref $read1 $read2 > $outdir/reads.sam
We convert the SAM file to a BAM file
samtools view -h $outdir/reads.sam -O BAM -o $outdir/reads.bam
Sort the BAM file
samtools sort -O BAM --reference $ref -@ $threads -o $outdir/reads.sorted.bam $outdir/reads.bam
Index the BAM file
samtools index $outdir/reads.sorted.bam
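Before moving on, it is worth a quick check that the alignment looks sensible; samtools flagstat reports total, mapped and properly paired read counts:
samtools flagstat $outdir/reads.sorted.bam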
Step 4. Variant Calling with GATK.
We first mark duplicates with Picard. This step is computationally expensive; GATK's Spark-based MarkDuplicatesSpark module can be used to speed it up (a sketch follows the command below).
java -jar $picard MarkDuplicates \
    I=$outdir/reads.sorted.bam \
    O=$outdir/mark_dup.bam \
    M=$outdir/dup_metrics.txt
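As mentioned above, a sketch of the Spark-based alternative (assuming GATK 4.1+; by default it runs Spark locally across the available cores) would be:
gatk MarkDuplicatesSpark -I $outdir/reads.sorted.bam -O $outdir/mark_dup.bam -M $outdir/dup_metrics.txt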
Next we recalibrate base quality scores (BQSR) against the known-sites resources defined earlier, then apply the recalibration
gatk BaseRecalibrator -I $outdir/mark_dup.bam -R $ref --known-sites "$GCONF" --known-sites "$DBSNP" --known-sites "$MILLS" -O $outdir/reads.recal.table
gatk ApplyBQSR -R $ref -I $outdir/mark_dup.bam -O $outdir/reads.recal.bam -bqsr $outdir/reads.recal.table
Finally, we call variants with HaplotypeCaller, restricted to the exome target regions
gatk --java-options "-Xmx4g" HaplotypeCaller -R $ref -I $outdir/reads.recal.bam -L $targetregion --dbsnp $DBSNP --native-pair-hmm-threads $threads --lenient true -O $outdir/reads.g.vcf
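The tabix and bgzip modules loaded at the start can be used to compress and index the resulting VCF for downstream tools (an optional step):
bgzip -c $outdir/reads.g.vcf > $outdir/reads.g.vcf.gz
tabix -p vcf $outdir/reads.g.vcf.gz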
- Kwofie, S.K.; Enninful, K.S.; Yussif, J.A.; Asante, L.A.; Adjei, M.; Kan-Dapaah, K.; Tiburu, E.K.; Mensah, W.A.; Miller, W.A., III; Mosi, L.; Wilson, M.D. (2019). Molecular Informatics Studies of the Iron-Dependent Regulator (ideR) Reveal Potential Novel Anti-Mycobacterium ulcerans Natural Product-Derived Compounds. Molecules, 24, 2299.
- Kwofie, S. K.; Broni, E.; Teye, J.; Quansah, E.; Issah, I.; Wilson, M. D.; Miller, W. A.; Tiburu, E. K; Bonney, J. H. K. (2019). Pharmacoinformatics-Based Identification of Potential Bioactive Compounds against Ebola Virus Protein VP24. Computers in Biology and Medicine, 2019, 103414. https://doi.org/10.1016/j.compbiomed.2019.103414.
- Kwofie, S.K.; Dankwa, B.; Enninful, K.S.; Adobor, C.; Broni, E.; Ntiamoah, A.; Wilson, M.D. (2019). Molecular Docking and Dynamics Simulation Studies Predict Munc18b as a Target of Mycolactone: A Plausible Mechanism for Granule Exocytosis Impairment in Buruli Ulcer Pathogenesis. Toxins, 11, 181.
- Kwofie, S. K., Dankwa, B., Odame, E. A., Agamah, F. E., Doe, L. P. A., Teye, J., Agyapong, O., Miller, W.A., III, Mosi, L., Wilson, M.D. (2018). In Silico Screening of Isocitrate Lyase for Novel Anti-Buruli Ulcer Natural Products Originating from Africa. Molecules (Basel, Switzerland), 23(7)
- Morang’a CM, Amenga-Etego L, Bah SY, Appiah V, Amuzu DS, Amoako N, Abugri J, Oduro AR, Cunnington AJ, Awandare GA, Otto TD (2020). Machine learning approaches classify clinical malaria outcomes based on haematological parameters. medRxiv 2020.09.23.20200220; https://doi.org/10.1101/2020.09.23.20200220