snakemake pipeline from transcripts to annotated protein sequences


About

This is a snakemake wrapper around a number of well-known, standardised tools to annotate a set of transcripts using sequence-homology and protein-domain methods.

What it does

Briefly, this workflow extracts the longest coding sequence per transcript (keeping the longest isoform per gene) from a genome annotation, translates these into proteins, and proceeds with a multi-evidence functional annotation (EggNOG mapper, InterProScan, BLAST best reciprocal hits, and OrthoFinder). If needed, the pipeline downloads and prepares the necessary databases for these steps.
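For orientation, here is a minimal manual sketch of the conceptual order of the main steps, using the same tools that the rules shown further below call. The file names (genome.fasta, annotation.gff3, transcripts.fna, predicted.pep) are illustrative placeholders; the actual pipeline chains these steps through Snakemake rules and the parameters in config.yaml.

# manual sketch of the main steps (illustrative file names; the pipeline runs these as Snakemake rules)
agat_sp_keep_longest_isoform.pl --gff annotation.gff3 -o longest_isoforms.gff3  # keep the longest isoform per gene
gffread -w transcripts.fna -g genome.fasta longest_isoforms.gff3                # extract transcript sequences
TransDecoder.LongOrfs -t transcripts.fna                                        # find candidate ORFs
TransDecoder.Predict -t transcripts.fna --single_best_only                      # predict one protein per transcript
emapper.py -i predicted.pep --itype proteins --output predicted.pep             # EggNOG mapper annotation
interproscan.sh -i predicted.pep -f tsv                                         # InterProScan domain annotation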

workflow graphical DAG

The pipeline ultimately generates two final output files: one with the functional annotation of all the predicted protein-coding genes, and one with the functional annotation of the subset of genes that are high-confidence transcription factors.

The user can then proceed with their own downstream analyses and use any of the intermediate or derived outputs of this workflow.

Input

The necessary inputs to run this workflow are:

  • A .fasta file of a genome of interest,

  • A .gtf annotation of the genes of the genome of interest,

  • A config.yaml file with all the necessary parameters, specifically:

    • The name of the organism (this will be used to tag all the files in the run),

    • The path to the genome of interest,

    • The path to the gtf of interest,

    • Parameters for the different tools (otherwise the workflow will use the ones specified by default)

The user can use the pre-included config.yaml file as a template for their own run.
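As an illustration only, a minimal config could look like the sketch below, written here from the shell. The key names are assumptions for illustration; the pre-included config.yaml in the repository is the authoritative template.

# sketch of a minimal config.yaml (key names are illustrative; adapt them to the bundled template)
cat > config.yaml <<'EOF'
organism: my_species          # name used to tag all files of the run
genome: data/genome.fasta     # path to the genome of interest
gtf: data/annotation.gtf      # path to the gene annotation
# tool-specific parameters can be added here; otherwise the defaults are used
EOF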

Alternatively, the user can provide a set of transcripts in a transcripts.fna file at the root directory of the workflow, or a set of protein sequences at the path transdecoder/predicted.pep. Snakemake will recognise these and start the workflow from these files.
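For example, following those path conventions, the run can be seeded with your own files (the source file names on the left are placeholders):

# start from your own transcripts instead of a genome + GTF
cp my_transcripts.fasta gene_annot/transcripts.fna
# or start from already-predicted protein sequences
mkdir -p gene_annot/transdecoder
cp my_proteins.fasta gene_annot/transdecoder/predicted.pep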

How to run

NOTE: At the moment the workflow uses checkpoints (empty files whose names are recognised by snakemake) as a means to connect rules. If one wishes to restart the workflow or resume from a specific step, please consider removing the checkpoints/ directory or any specific checkpoint of interest.
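For example (the specific checkpoint file name below is hypothetical; list the checkpoints/ directory to see the names actually used in your run):

# full restart: remove all checkpoints
rm -rf checkpoints/
# resume from a specific step: remove only the checkpoint of interest (hypothetical file name)
rm -f checkpoints/eggnog_annotation.done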

First of all, make sure that you have snakemake and conda accessible from your running environment. You can always create a conda environment with snakemake in it (this is how I do it myself).

# create a conda environment with snakemake
mamba create -n snakemake_venv -c bioconda snakemake
# activate the environment
conda activate snakemake_venv

After cloning this repository, prepare the config file and the genome and annotation files.

Then, move to the root directory of the workflow and run snakemake like this:

# move to the directory of the workflow
cd gene_annot/
# run snakemake
snakemake --use-conda

Optionally, you can add options such as --cores NUMCORES to use a specific number of cores, and --rerun-incomplete to re-run any steps that may have failed in a previous attempt. Please refer to the snakemake documentation for more information.
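For example, a typical invocation combining these options could be:

# run with conda environments, a fixed number of cores, and retry of incomplete jobs
snakemake --use-conda --cores 12 --rerun-incomplete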

Code Snippets

BLASTP of the TransDecoder longest ORFs against the Swiss-Prot database:
shell:
	'''
	blastp -query transdecoder/transdecoder_dir/longest_orfs.pep \
	-db {params.db_dir}/uniprot_sprot.fasta \
	-max_target_seqs 1 \
	-outfmt 6 \
	-evalue 1e-5 \
	-num_threads 10 > {output}
	'''

Download of the EggNOG mapper databases:
shell:
	'''
	download_eggnog_data.py -y -H -f  -d {params.database} 2> {log} && \
	touch {output}
	'''

Download and preparation of the Pfam-A HMM database:
shell:
	'''
	wget {params.hmmerURL} -P {params.db_dir} && \
	gunzip -f {params.db_dir}/Pfam-A.hmm.gz && \
	hmmpress {params.db_dir}/Pfam-A.hmm && \
	touch {output}
	'''

Download and preparation of the Swiss-Prot BLAST database:
shell:
	'''
	wget {params.swissprotURL} -P {params.db_dir} && \
	gunzip -f {params.db_dir}/uniprot_sprot.fasta.gz && \
	makeblastdb -dbtype prot -in {params.db_dir}/uniprot_sprot.fasta && \
	touch {output}
	'''

Download of the AnimalTFDB resource:
shell:
	'''
	wget {params.animaltfdbURL} && touch {output}
	'''

EggNOG mapper annotation of the predicted proteins:
	shell:
		'''
		emapper.py \
		--cpu 12 \
        -i {input.pep} \
		--itype proteins \
        {params.eggnog} \
        --output ./predicted.pep && \
		touch {output}
		'''

Indexing of the genome with samtools:
shell:
	'''
	samtools faidx {input.genome} \
	&& touch {output} 
	'''

Extraction of transcript sequences with gffread:
shell:
	'''
	gffread -w {output.transcripts} -g {input.genome} {input.gtf} && \
	touch {output.done}
	'''

Pfam domain search of the longest ORFs with hmmscan:
shell:
	'''
	hmmscan --cpu 12  \
	--domtblout {output} \
	{params.db_dir}/Pfam-A.hmm \
	transdecoder/transdecoder_dir/longest_orfs.pep
	'''

InterProScan annotation of the predicted proteins:
shell:
	'''
	mkdir -p {params.outdir} && \
	{params.ipr_path}/interproscan.sh -i {input.predictedpep} \
	-f tsv \
	-d {params.outdir} \
	--cpu 12 \
	-goterms \
	--appl Pfam,PANTHER,SUPERFAMILY \
	2> {log} && \
	touch {output}
	'''

Extraction of Pfam-based transcription factor candidates from the InterProScan results:
shell:
	'''
	python3 workflow/scripts/parse_iprfile.py {params.outdir}/predicted.pep.tsv {params.pfamTFDB} {output}
	'''

Extraction of SUPERFAMILY-based transcription factor candidates from the InterProScan results:
shell:
	'''
	python3 workflow/scripts/parse_iprfile.py {params.outdir}/predicted.pep.tsv {params.sfamTFDB} {output}
	'''

Selection of the longest isoform per gene with AGAT:
shell:
	'''
	agat_sp_keep_longest_isoform.pl --gff {input} -o {output} \
	2> {log}
	'''

Extraction of the longest ORFs with TransDecoder:
shell:
	'''
	TransDecoder.LongOrfs -t {input.transcripts} \
	--output_dir ./transdecoder/transdecoder_dir/ \
	&& touch {output}
	'''

Preparation of the OrthoFinder proteomes directory:
shell:
	'''
	mkdir -p orthofinder/ && \
	ln -s {params.proteomes_directory}/* orthofinder/ && \
	touch {output}
	'''

Linking of the predicted proteins into the OrthoFinder directory:
shell:
	'''
	ln -s ../{input.predictedpep} {output}
	'''

OrthoFinder run:
shell:
	'''
	orthofinder {params.userparams} \
	-f {params.proteomes_directory} && \
	touch {output}
	'''

Conversion of the GTF annotation to GFF3 with AGAT:
shell:
	'''
	agat_convert_sp_gxf2gxf.pl --gff {input.gtf} -o {output.gff3} \
	2> {log}
	'''

Creation of a BLAST database from the predicted proteins:
shell:
	'''
	makeblastdb -dbtype prot -in {input.predictedpep} && \
	touch {output}
	'''

BLAST best reciprocal hits between the predicted proteins and Swiss-Prot:
shell:
	'''
	blastp -db {params.db_dir}/uniprot_sprot.fasta -query {input.predictedpep} \
	{params.rbh1} | \
	tee rbh_1_2_orig.tsv | \
	cut -f1,2 > rbh_1_2.tsv && \
	blastp -db {input.predictedpep} -query {params.db_dir}/uniprot_sprot.fasta \
	{params.rbh2} | \
	tee rbh_2_1_orig.tsv | \
	cut -f1,2 > rbh_2_1.tsv && \
	grep -f rbh_2_1.tsv rbh_1_2.tsv > {output}
	'''

Protein prediction with TransDecoder.Predict:
	shell:
		'''
		TransDecoder.Predict -t {input.transcripts} \
        --retain_pfam_hits {input.pfam} \
        --retain_blastp_hits {input.blastp} \
        --single_best_only \
		--output_dir {params.outdir} && \
		mv transcripts.fna.transdecoder* {params.outdir} && \
		awk {params.awkexpr1:q} {params.outdir}/transcripts.fna.transdecoder.pep > {output.transdecoderTSV} && \
		awk {params.awkexpr2:q} {params.outdir}/transcripts.fna.transdecoder.pep | awk {params.awkexpr3:q} > {output.predictedpep} && \
		touch {output.transdecoderDone}
		'''

The parse_iprfile.py script used to filter InterProScan results against a list of database identifiers:
import argparse

# Parse the command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('input_file', help='path to the interproscan results tsv')
parser.add_argument('db_file', help='path to the database')
parser.add_argument('output_file', help='path to the output file')
args = parser.parse_args()

# Open the input file for reading
with open(args.input_file, "r") as f:
    # Read in the lines of the file as a list of strings
    lines = f.readlines()

# Open the db file for reading
with open(args.db_file, "r") as f:
    # Read in the lines of the file as a list of strings and strip any whitespace
    values = [line.strip() for line in f.readlines()]

# Filter the lines of the input file based on whether any of their columns match any of the values in the db file
filtered_lines = []
for line in lines:
    columns = line.strip().split('\t')
    for column in columns:
        if column in values:
            filtered_lines.append(line)
            break  # no need to check remaining columns once a match is found

# Write the filtered lines to the output file
with open(args.output_file, "w") as f:
    f.writelines(filtered_lines)

Maintainers: public
URL: https://github.com/apposada/gene_annot
Name: gene_annot
Version: 1
Copyright: Public Domain
License: None
