RNA-Seq Analysis Workflow with HISAT2 Aligner


A Snakemake workflow for RNA-seq analysis, using the HISAT2 aligner.

Get the HISAT2 index for human:

wget https://cloud.biohpc.swmed.edu/index.php/s/grch38/download
mv download grch38.tar.gz
tar -xvzf grch38.tar.gz

Link the location in the Snakefile, e.g.:

GENOME="/cluster/home/michalo/project_michalo/hisat/grch38/genome"
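Note that GENOME is the index basename, not a file: HISAT2 expects the path shared by the genome.*.ht2 files, without the .N.ht2 suffix. A small helper (hypothetical, not part of the workflow) can derive it from the extracted directory:

```python
import glob
import os

def hisat2_index_basename(index_dir):
    """Return the HISAT2 index prefix in index_dir, i.e. the path shared
    by the *.ht2 files, or None if no index is found there."""
    hits = sorted(glob.glob(os.path.join(index_dir, "*.1.ht2")))
    if not hits:
        return None
    # strip the trailing ".1.ht2" to get the basename HISAT2 expects
    return hits[0][: -len(".1.ht2")]
```

The returned prefix is what goes into the GENOME constant.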

Get the GTF annotation:

wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
gunzip Homo_sapiens.GRCh38.99.gtf.gz

Link the GTF in the Snakefile, e.g.:

GTF="/cluster/home/michalo/project_michalo/hg38/Homo_sapiens.GRCh38.99.gtf"

Software required:

If you want to run the workflow locally, the software it uses (Trimmomatic, HISAT2, Subread, SAMtools) must be installed and runnable from the command line.
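One quick way to verify this is a small PATH check with Python's shutil.which. The command names below match the ones used in the rules; Trimmomatic may instead be invoked through its JAR, in which case it will not show up on PATH:

```python
import shutil

def missing_tools(tools):
    """Return the tools from the list that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Command names called by the workflow rules (assumed to be on PATH):
required = ["trimmomatic", "hisat2", "samtools", "featureCounts"]
for tool in missing_tools(required):
    print("missing:", tool)
```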

Adapting

The paths to the genome index, the GTF, and the adapter file need to be set as Python constants in the Snakefile. If needed, also set the paths to the software commands and the Trimmomatic JAR; it is recommended to have these on the executable or Java path, e.g. by setting the corresponding environment variables.
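For illustration, the constants the rules refer to might sit at the top of the Snakefile like this. The paths are the example values from this page; TRIMFILE and CORES are assumed names taken from the snippets below:

```python
# Example constants at the top of the Snakefile -- adjust paths to your system.
GENOME = "/cluster/home/michalo/project_michalo/hisat/grch38/genome"           # HISAT2 index basename
GTF = "/cluster/home/michalo/project_michalo/hg38/Homo_sapiens.GRCh38.99.gtf"  # gene annotation
TRIMFILE = "adapters.fa"  # adapter sequences for Trimmomatic's ILLUMINACLIP
CORES = "8"               # kept as a string: the rules splice it into shell commands by concatenation
```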

Running

Create a run directory containing the Snakefile, adapters.fa, and your fastq.gz files in a "data" subdirectory. Update the Snakefile as described above (locations of the genome index and the GTF annotation), then:

dry run

snakemake -np

normal run

snakemake -p

run on the cluster

Make snakemake available in the cluster environment, e.g.:

module load gcc/8.2.0 python/3.10.4

LSF

snakemake -p -j 999 --cluster-config cluster.json --cluster "bsub -W {cluster.time} -n {cluster.n}"

SLURM

# change times in cluster.json to HH:MM:SS
snakemake -p -j 999 --cluster-config cluster.json --cluster "sbatch --time {cluster.time} -n {cluster.n}"
snakemake -p -j 999 --cluster-config cluster.json --cluster "sbatch --time {cluster.time} -n 1 --cpus-per-task={cluster.n}"
snakemake -p -j 999 --cluster-config cluster.json --cluster "sbatch --time {cluster.time} -n 1 --cpus-per-task={cluster.n} --mem-per-cpu={cluster.mem}"
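The commands above read per-rule resources from cluster.json through the {cluster.time}, {cluster.n} and {cluster.mem} placeholders. A minimal file with only a default section could be generated like this (the values are illustrative, not tuned for this workflow):

```python
import json

# Default resources applied to every rule; the keys match {cluster.time},
# {cluster.n} and {cluster.mem} in the bsub/sbatch command templates.
cluster_config = {
    "__default__": {
        "time": "04:00:00",   # HH:MM:SS format for SLURM
        "n": 8,
        "mem": "4096"
    }
}

with open("cluster.json", "w") as fh:
    json.dump(cluster_config, fh, indent=2)
```

Per-rule sections named after individual rules can be added next to `__default__` to override these values.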

SLURM with containers

Running the workflow with containers from the Galaxy software stack requires passing the external folders to snakemake as Singularity bind arguments. The containers will be downloaded into the .snakemake folder.

 snakemake -p -j 999 --use-singularity --cluster-config cluster.json \
 --cluster "sbatch --time {cluster.time} -n 1 --cpus-per-task={cluster.n}" \
 --singularity-args "--bind /cluster/scratch/michalo/Anthony_RNA/:/mnt2 --bind /cluster/home/michalo/project_michalo/hisat/grch38/:/genomes --bind /cluster/home/michalo/project_michalo/hg38/:/annots"
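The bind string simply maps host folders to container paths. If the set of folders changes often, it can be assembled programmatically; a sketch, using the paths from the command above:

```python
def singularity_bind_args(binds):
    """Build a --singularity-args value from a {host_path: container_path} mapping."""
    return " ".join("--bind %s:%s" % (host, container) for host, container in binds.items())

args = singularity_bind_args({
    "/cluster/scratch/michalo/Anthony_RNA/": "/mnt2",
    "/cluster/home/michalo/project_michalo/hisat/grch38/": "/genomes",
    "/cluster/home/michalo/project_michalo/hg38/": "/annots",
})
```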

Code Snippets

Snakefile lines 53-60:
run:
    shell(
    'module load gdc \n'+
    'module load java \n'+
    'module load trimmomatic \n'+
    'echo {input} \n'+
    'trimmomatic SE -phred33 {input} {output} ILLUMINACLIP:'+TRIMFILE+':2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36'
    )
Snakefile lines 68-72:
run:
    shell(
    'module load gcc/4.8.2 gdc python/2.7.11 hisat2/2.1.0 \n'+
    'echo {input} \n'+
    'hisat2 -q -p '+CORES+' -x '+GENOME+' -U {input} -S mapped_reads/{wildcards.sample}.sam \n')
Snakefile lines 80-83:
run:
    shell(
    'module load samtools \n'+
    'samtools view -@ '+CORES+' -bS {input} > {output} ')
Snakefile lines 90-93:
shell:
    "module load samtools \n"
    "samtools sort -@ 24 -T sorted_reads/{wildcards.sample} "
    "-O bam {input} > {output}"
Snakefile lines 100-102:
shell:
    "module load samtools \n"
    "samtools index {input}"
Snakefile lines 109-110:
shell:
    "touch secondary_analysis/final_marker_bai.txt"
Snakefile lines 117-118:
shell:
    "stringtie --rf -o {output} -p 24 {input}"
Snakefile lines 125-126:
shell:
    "touch secondary_analysis/final_marker_string.txt"
Snakefile lines 134-135:
shell:
    "module load legacy gcc/4.8.2 python/2.7.6 samtools/1.1 boost/1.55.0 eigen/3.2.1 cufflinks/2.2.1 \n"
Snakefile lines 145-146:
shell:
    "touch secondary_analysis/final_marker_cuff.txt"
Snakefile lines 154-156:
shell:
    "module load subread \n"
    "featureCounts -M -f --fraction -s 2 -T 24 -t gene -g gene_id -a "+GTF+" -o {output} {input}"
Snakefile lines 164-176:
run:
    import pandas
    import glob

    filez = glob.glob('secondary_analysis/*.cnt')
    # gene IDs from the first file (first column of featureCounts output)
    t1 = pandas.read_table(filez[0], header=1)
    tout = t1.iloc[:, 0]
    for f in filez:
        t1 = pandas.read_table(f, header=1)
        # column 7 (index 6) holds the counts for this sample
        tout = pandas.concat([tout, t1.iloc[:, 6]], axis=1)
        print(f)

    tout.to_csv('secondary_analysis/counts.csv')
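The merging rule above can be exercised on synthetic data. This sketch assumes the featureCounts layout the rule relies on: one comment line, then a header, with the gene ID in the first column and the counts in the seventh:

```python
import glob
import os
import tempfile

import pandas

# Build two fake featureCounts outputs: comment line, header row, then
# Geneid ... count, where the count is the 7th column (index 6).
tmp = tempfile.mkdtemp()
for name, counts in [("s1", [10, 0]), ("s2", [3, 7])]:
    with open(os.path.join(tmp, name + ".cnt"), "w") as fh:
        fh.write("# Program:featureCounts\n")
        fh.write("Geneid\tChr\tStart\tEnd\tStrand\tLength\t%s\n" % name)
        for gene, c in zip(["g1", "g2"], counts):
            fh.write("%s\tchr1\t1\t100\t+\t100\t%d\n" % (gene, c))

# Same logic as the rule: gene IDs once, then one count column per file.
filez = sorted(glob.glob(os.path.join(tmp, "*.cnt")))
tout = pandas.read_table(filez[0], header=1).iloc[:, 0]
for f in filez:
    t1 = pandas.read_table(f, header=1)
    tout = pandas.concat([tout, t1.iloc[:, 6]], axis=1)
```

The resulting table has a Geneid column followed by one count column per sample.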



Maintainers: public
URL: https://github.com/michalogit/snake_hisat
Name: snake_hisat
Version: 1


Copyright: Public Domain
License: None
