snakemake pipeline from transcripts to annotated protein sequences


About

This is a snakemake wrapper around a number of well-known, standardised tools to annotate a set of transcripts using sequence-homology and protein-domain methods.

What it does

Briefly, this workflow extracts the longest coding sequence per transcript (keeping the longest isoform per gene) from a genome annotation, translates these into proteins, and proceeds with a multi-evidence functional annotation (EggNOG mapper, InterProScan, BLAST best reciprocal hits, and OrthoFinder). If needed, the pipeline downloads and prepares the necessary databases for these steps.
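For orientation, here is a minimal manual sketch of the conceptual order of the main steps, using the same tools that the rules shown further below call. The file names (genome.fasta, annotation.gff3, transcripts.fna, predicted.pep) are illustrative placeholders; the actual pipeline chains these steps through Snakemake rules and the parameters in config.yaml.

# manual sketch of the main steps (illustrative file names; the pipeline runs these as Snakemake rules)
agat_sp_keep_longest_isoform.pl --gff annotation.gff3 -o longest_isoforms.gff3  # keep the longest isoform per gene
gffread -w transcripts.fna -g genome.fasta longest_isoforms.gff3                # extract transcript sequences
TransDecoder.LongOrfs -t transcripts.fna                                        # find candidate ORFs
TransDecoder.Predict -t transcripts.fna --single_best_only                      # predict one protein per transcript
emapper.py -i predicted.pep --itype proteins --output predicted.pep             # EggNOG mapper annotation
interproscan.sh -i predicted.pep -f tsv                                         # InterProScan domain annotation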

workflow graphical DAG

The pipeline ultimately generates two final output files: one with the functional annotation of all the predicted protein-coding genes, and one with the functional annotation of the subset of genes that are high-confidence transcription factors.

The user can then proceed with their own downstream analyses and use any of the intermediate or derived outputs of this workflow.

Input

The necessary inputs to run this workflow are:

  • A .fasta file of a genome of interest,

  • A .gtf annotation of the genes of the genome of interest,

  • A config.yaml file with all the necessary parameters, specifically:

    • The name of the organism (this will be used to tag all the files in the run),

    • The path to the genome of interest,

    • The path to the gtf of interest,

    • Parameters for the different tools (otherwise the workflow will use the ones specified by default)

The user can use the pre-included config.yaml file as a template for their own run.
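As an illustration only, a minimal config could look like the sketch below, written here from the shell. The key names are assumptions for illustration; the pre-included config.yaml in the repository is the authoritative template.

# sketch of a minimal config.yaml (key names are illustrative; adapt them to the bundled template)
cat > config.yaml <<'EOF'
organism: my_species          # name used to tag all files of the run
genome: data/genome.fasta     # path to the genome of interest
gtf: data/annotation.gtf      # path to the gene annotation
# tool-specific parameters can be added here; otherwise the defaults are used
EOF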

Alternatively, the user can provide a set of transcripts in a transcripts.fna file at the root directory of the workflow, or a set of protein sequences at the path transdecoder/predicted.pep. Snakemake will recognise these and start the workflow from these files.
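For example, following those path conventions, the run can be seeded with your own files (the source file names on the left are placeholders):

# start from your own transcripts instead of a genome + GTF
cp my_transcripts.fasta gene_annot/transcripts.fna
# or start from already-predicted protein sequences
mkdir -p gene_annot/transdecoder
cp my_proteins.fasta gene_annot/transdecoder/predicted.pep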

How to run

NOTE: At the moment the workflow uses checkpoints (empty files whose names are recognised by snakemake) as a means to connect rules. If one wishes to restart the workflow or resume from a specific step, please consider removing the checkpoints/ directory or any specific checkpoint of interest.
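For example (the specific checkpoint file name below is hypothetical; list the checkpoints/ directory to see the names actually used in your run):

# full restart: remove all checkpoints
rm -rf checkpoints/
# resume from a specific step: remove only the checkpoint of interest (hypothetical file name)
rm -f checkpoints/eggnog_annotation.done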

First of all, make sure that you have snakemake and conda accessible from your running environment. You can always create a conda environment with snakemake in it (this is how I do it myself).

# create a conda environment with snakemake
mamba create -n snakemake_venv -c bioconda snakemake
# activate the environment
conda activate snakemake_venv

After cloning this repository, prepare the config file and the genome and annotation files.

Then, move to the root directory of the workflow and run snakemake like this:

# move to the directory of the workflow
cd gene_annot/
# run snakemake
snakemake --use-conda

Optionally, you can add options such as --cores NUMCORES to use a specific number of cores, and --rerun-incomplete to re-run any steps that may have failed in a previous attempt. Please refer to the snakemake documentation for more information.
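For example, a typical invocation combining these options could be:

# run with conda environments, a fixed number of cores, and retry of incomplete jobs
snakemake --use-conda --cores 12 --rerun-incomplete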

Code Snippets

BLASTP of the TransDecoder longest ORFs against the Swiss-Prot database:
shell:
	'''
	blastp -query transdecoder/transdecoder_dir/longest_orfs.pep \
	-db {params.db_dir}/uniprot_sprot.fasta \
	-max_target_seqs 1 \
	-outfmt 6 \
	-evalue 1e-5 \
	-num_threads 10 > {output}
	'''

Download of the EggNOG mapper databases:
shell:
	'''
	download_eggnog_data.py -y -H -f  -d {params.database} 2> {log} && \
	touch {output}
	'''

Download and preparation of the Pfam-A HMM database:
shell:
	'''
	wget {params.hmmerURL} -P {params.db_dir} && \
	gunzip -f {params.db_dir}/Pfam-A.hmm.gz && \
	hmmpress {params.db_dir}/Pfam-A.hmm && \
	touch {output}
	'''

Download and preparation of the Swiss-Prot BLAST database:
shell:
	'''
	wget {params.swissprotURL} -P {params.db_dir} && \
	gunzip -f {params.db_dir}/uniprot_sprot.fasta.gz && \
	makeblastdb -dbtype prot -in {params.db_dir}/uniprot_sprot.fasta && \
	touch {output}
	'''

Download of the AnimalTFDB resource:
shell:
	'''
	wget {params.animaltfdbURL} && touch {output}
	'''

EggNOG mapper annotation of the predicted proteins:
	shell:
		'''
		emapper.py \
		--cpu 12 \
        -i {input.pep} \
		--itype proteins \
        {params.eggnog} \
        --output ./predicted.pep && \
		touch {output}
		'''

Indexing of the genome with samtools:
shell:
	'''
	samtools faidx {input.genome} \
	&& touch {output} 
	'''

Extraction of transcript sequences with gffread:
shell:
	'''
	gffread -w {output.transcripts} -g {input.genome} {input.gtf} && \
	touch {output.done}
	'''

Pfam domain search of the longest ORFs with hmmscan:
shell:
	'''
	hmmscan --cpu 12  \
	--domtblout {output} \
	{params.db_dir}/Pfam-A.hmm \
	transdecoder/transdecoder_dir/longest_orfs.pep
	'''

InterProScan annotation of the predicted proteins:
shell:
	'''
	mkdir -p {params.outdir} && \
	{params.ipr_path}/interproscan.sh -i {input.predictedpep} \
	-f tsv \
	-d {params.outdir} \
	--cpu 12 \
	-goterms \
	--appl Pfam,PANTHER,SUPERFAMILY \
	2> {log} && \
	touch {output}
	'''

Extraction of Pfam-based transcription factor candidates from the InterProScan results:
shell:
	'''
	python3 workflow/scripts/parse_iprfile.py {params.outdir}/predicted.pep.tsv {params.pfamTFDB} {output}
	'''

Extraction of SUPERFAMILY-based transcription factor candidates from the InterProScan results:
shell:
	'''
	python3 workflow/scripts/parse_iprfile.py {params.outdir}/predicted.pep.tsv {params.sfamTFDB} {output}
	'''

Selection of the longest isoform per gene with AGAT:
shell:
	'''
	agat_sp_keep_longest_isoform.pl --gff {input} -o {output} \
	2> {log}
	'''

Extraction of the longest ORFs with TransDecoder:
shell:
	'''
	TransDecoder.LongOrfs -t {input.transcripts} \
	--output_dir ./transdecoder/transdecoder_dir/ \
	&& touch {output}
	'''

Preparation of the OrthoFinder proteomes directory:
shell:
	'''
	mkdir -p orthofinder/ && \
	ln -s {params.proteomes_directory}/* orthofinder/ && \
	touch {output}
	'''

Linking of the predicted proteins into the OrthoFinder directory:
shell:
	'''
	ln -s ../{input.predictedpep} {output}
	'''

OrthoFinder run:
shell:
	'''
	orthofinder {params.userparams} \
	-f {params.proteomes_directory} && \
	touch {output}
	'''

Conversion of the GTF annotation to GFF3 with AGAT:
shell:
	'''
	agat_convert_sp_gxf2gxf.pl --gff {input.gtf} -o {output.gff3} \
	2> {log}
	'''

Creation of a BLAST database from the predicted proteins:
shell:
	'''
	makeblastdb -dbtype prot -in {input.predictedpep} && \
	touch {output}
	'''

BLAST best reciprocal hits between the predicted proteins and Swiss-Prot:
shell:
	'''
	blastp -db {params.db_dir}/uniprot_sprot.fasta -query {input.predictedpep} \
	{params.rbh1} | \
	tee rbh_1_2_orig.tsv | \
	cut -f1,2 > rbh_1_2.tsv && \
	blastp -db {input.predictedpep} -query {params.db_dir}/uniprot_sprot.fasta \
	{params.rbh2} | \
	tee rbh_2_1_orig.tsv | \
	cut -f1,2 > rbh_2_1.tsv && \
	grep -f rbh_2_1.tsv rbh_1_2.tsv > {output}
	'''

Protein prediction with TransDecoder.Predict:
	shell:
		'''
		TransDecoder.Predict -t {input.transcripts} \
        --retain_pfam_hits {input.pfam} \
        --retain_blastp_hits {input.blastp} \
        --single_best_only \
		--output_dir {params.outdir} && \
		mv transcripts.fna.transdecoder* {params.outdir} && \
		awk {params.awkexpr1:q} {params.outdir}/transcripts.fna.transdecoder.pep > {output.transdecoderTSV} && \
		awk {params.awkexpr2:q} {params.outdir}/transcripts.fna.transdecoder.pep | awk {params.awkexpr3:q} > {output.predictedpep} && \
		touch {output.transdecoderDone}
		'''

The parse_iprfile.py script used to filter InterProScan results against a list of database identifiers:
import argparse

# Parse the command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('input_file', help='path to the interproscan results tsv')
parser.add_argument('db_file', help='path to the database')
parser.add_argument('output_file', help='path to the output file')
args = parser.parse_args()

# Open the input file for reading
with open(args.input_file, "r") as f:
    # Read in the lines of the file as a list of strings
    lines = f.readlines()

# Open the db file for reading
with open(args.db_file, "r") as f:
    # Read in the lines of the file as a list of strings and strip any whitespace
    values = [line.strip() for line in f.readlines()]

# Filter the lines of the input file based on whether any of their columns match any of the values in the db file
filtered_lines = []
for line in lines:
    columns = line.strip().split('\t')
    for column in columns:
        if column in values:
            filtered_lines.append(line)
            break  # no need to check remaining columns once a match is found

# Write the filtered lines to the output file
with open(args.output_file, "w") as f:
    f.writelines(filtered_lines)

Maintainers: public
URL: https://github.com/apposada/gene_annot
Name: gene_annot
Version: 1
Copyright: Public Domain
License: None
