Pipeline to extract protein domain clusters from sequence data

public public 1yr ago 0 bookmarks

Introduction

This pipeline is designed to automate the prediction of protein domains given only the protein's sequence, based on the method described in (Granata et al. , 2017). The pipeline is written in Snakemake framework (Koster and Rahmann, 2012) to allow easy execution both in compute clusters and in local machines and to ensure reproducibility of the generated data. We also strive to make installation as convenient as possible, by having most of our dependencies available on conda.

The pipeline consists of three main processes:

  • Generation of multiple sequence alignment using an iterative protein sequence search tool called hhblits

  • Prediction of direct coupling between every pair of protein residues using gplmDCA (Feinauer et al. , 2014; & Ekeberg et al. , 2013)

  • Generation of protein domains clusters using spectral clustering (Ponzoni, 2015)

This pipeline can be run end-to-end — i.e. from FASTA sequence input toa table of every residue position and their corresponding domain membership — or can be started in from intermediate files.

This repository also include RMarkdown files we use for our downstream analyses which aims to inspect the possible relationship between the number and identity of ExAC mutations (Lek et al. , 2016) and pathogenic variants.

Dependencies

Python, Conda, MATLAB

Setup

Before using the pipeline, you will need to install snakemake on top of the dependencies listed above

conda install -c bioconda -c conda-forge snakemake

Also, hhblits requires a database to be downloaded for which they recommend UniClust30 . Download this database into the db/ directory.

Running the pipeline on local machine

On the local machine you could simply run the pipeline by firstly populating the fasta_list.dat file with linebreak-delimieted name of your fasta files (without the .fa suffix). You would also need to modify the parameters in params.json .

Once those are done, you can run

snakemake -k --use-conda

Running the pipeline on a compute cluster

For parallelise implementation on a compute cluster, you would need to create a configuration file called cluster.json , for example

{ "__default__": { "partition": "general", "mem": "1G", "time": "00:30:00" }
}

then, simply run

snakemake -k -j 100 --use-conda --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {threads}"

where -j 100 means that you only allow a maxiumum of 100 jobs to be run simultaneously on the cluster.

Results

The result along with all intermediate and log files can be found in the results/ directory, where the final result (every residue's domain membership in the protein) is found in results/clustering .

Analysis

Data

Files used to preprocess the clinvar, ExAC and PanCanAtlas variations can be found in downstream_analysis/clinvar_data , downstream_analysis/exac_data and downstream_analysis/pancanatlas_data . The Rmd files in each folder explains what the other files are used for. The link to these datasets can be found here:

  • Clinvar: ftp://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20190715_papu.vcf.gz

  • ExAC: ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz

  • PanCanAtlas: http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc

Preliminary Analysis

Analysis of the first 6,924 proteins can be found in downstream_analysis/Analysis/Analysis_exac.Rmd , which also explains what the other files in that directory is used for.

Bibliography

Feinauer, C., Skwark, M. J., Pagnani, A., & Aurell, E. (2014). Improving contact prediction along three dimensions. PLoS computational biology, 10(10), e1003847.

Granata, D., Ponzoni, L., Micheletti, C., & Carnevale, V. (2017). Patterns of coevolving amino acids unveil structural and dynamical domains. Proceedings of the National Academy of Sciences, 114(50), E10612-E10621.

Köster, J., & Rahmann, S. (2012). Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522.

Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., ... & Tukiainen, T. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285.

Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to
infer Potts models. Physical Review E, 87(1), 012707.

Ponzoni, L., Polles, G., Carnevale, V., & Micheletti, C. (2015). SPECTRUS: A dimensionality reduction approach for identifying dynamical
domains in protein complexes from limited structural datasets. Structure, 23(8), 1516-1525.

Code Snippets

 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
library(dplyr)
library(ggplot2)

main <- function(){
  # Accepting Command line Argument
  args <- commandArgs(trailingOnly = TRUE)
  dir.name <- args[1]
  output.clustering <- args[2]
  output.quality <- args[3]

  # Plotting clustering of a protein sequence
  ## read every cluster labels in a directory 
  name.array <- list.files(dir.name, pattern = 'final_labels_kmed-*')
  df <- data.frame()
  for (i in 1:length(name.array)){
    df.single <- read.csv(paste(dir.name, name.array[i], sep = "/"), header = FALSE, sep = " ")
    df.single$V3 <- gsub(".dat", "", gsub("final_labels_kmed-", "", name.array[i])) %>% as.numeric()
    df.single$V2 <- df.single$V2 %>% as.factor
    colnames(df.single) <- c("residue_position", "membership", "q")
    df = rbind(df, df.single) 
  }
  df = df[order(df$q,df$residue_position),]
  df = df[df$q <= 10,] #only plot the top 10 

  ## define coloring scheme for 10 clusterings 
  #### TODO: make this variable dynamic to the number of clusters detected #### 
  colors = c("#4e79a7", "#59a14f", "#9c755f", "#f28e2b", "#edc948", "#e15759", "#b07aa1",
             "#76b6b2", "#ff9da7", 	"#000000", "#bab0ac")

  ## use scatter plot with square dots
  df1 = df; df1$q <- df$q + .05
  df2 = df; df2$q <- df$q + .1
  df3 = df; df3$q <- df$q + .15
  df4 = df; df4$q <- df$q + .2

  df.combined = rbind(df, df1, df2, df3, df4)

  colnames(df.combined) <- c("residue_position", "membership", "q")


  options(bitmapType='cairo') 
  png(output.clustering)
  print(ggplot(df.combined, aes(x=residue_position, y=q, color=membership)) + 
    geom_point(size=1.3, shape=15) + 
    scale_color_manual(values=colors) + labs(x="residue position", y="number of clusters"))
  dev.off()


  # Plot quality score against number of clusters
  system(sprintf("sed -i 's/^[ \t]*//' %s/quality_score.dat", dir.name))
  df.qualscore <- read.csv(paste(dir.name, "quality_score.dat", sep = "/"), header = FALSE, strip.white=TRUE, sep = " ")
  colnames(df.qualscore) <- c("q", "median_quality_score", "mean_quality_score", "prefactor1", "prefactor2")
  df.qualscore = df.qualscore[df.qualscore$q <= 10,]

  png(output.quality)
  print(ggplot(df.qualscore, aes(x=q, y=median_quality_score)) + geom_line() + geom_point() + 
    labs(x="number of clusters", y="quality score"))
  dev.off()
}

main()
34
35
shell: 
	"hhblits -cpu {params.cpu_per_tasks} -n {params.n_of_iterations} -e {params.e_value} -i {input.fa} -oa3m {output} -d {params.database} 1> {log.summary} 2> {log.err}"
42
43
shell: 
	"cat {input.msa} | ./filter_by_gap.sh > {output}"
50
51
52
shell:
	"cat {input} | ./reformat_a3m_fa.py |"
	"sed 's/-/./g' > {output}"
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
	shell:
		"""
		if [ $(cat {input} | grep -c "^>") -lt 100 ]; then 
			echo "Stopping job due to not enough sequence in the MSA (< 100) "
			sample=$(echo {input} | sed 's/.*\///' | sed 's/.filtered.reformatted.a3m//'); echo $sample
			sed -i '/$sample/d' sample_list.dat && echo "$sample FAILED: TOO FEW MSA SEQUENCES" >> fasta_done.dat
			exit 1 
		else cp {input} {output}
		fi
		"""
"""
rule conserve_scoring:
	input:
		"results/msa/{sample}.filtered.count.a3m"
	output:
		"results/covscore/{sample}.covscore"
	shell: 
	... 
"""
86
87
88
89
90
91
92
93
shell: 
	"""
	cat {input.fa} | sed -n '1,2p' >> results/msa_stats/{sample}.msa.stats
	echo 'Protein Sequence Length:' >> results/msa_stats/{sample}.msa.stats 
	cat {input.fa} | sed -n '2p' | wc -c >> results/msa_stats/{sample}.msa.stats
	echo 'MSA count' >> results/msa_stats/{sample}.msa.stats
	wc -l {input.count} >> results/msa_stats/{sample}.msa.stats
	"""
109
110
111
112
113
114
shell: 
	"""
	wait=$((1 + RANDOM % 10)).$((1 + RANDOM % 6))
	sleep $wait # randomly wait to avoid job clashes 
	matlab -nodisplay -nosplash -r \"gplmDCA_asymmetric('{input.msa}', '{output}', {params.lambda_h}, {params.lambda_J}, {params.lambda_chi}, {params.reweighting_threshold}, {params.nr_of_cores}, {params.M})\" 2> {log}
	"""
SnakeMake From line 109 of master/Snakefile
123
124
125
126
shell: 
	"""
	./cluster_GN.sh {input} {output} > {log}	
	"""
SnakeMake From line 123 of master/Snakefile
137
138
139
140
141
142
143
144
145
146
shell:
	"""
	sample=$(echo {input.dca} | sed 's/.*\///' | sed 's/.gplmDCA//')
	./cluster_spectrus.sh {input.dca} $sample {output.clust} results/clustering_stats/ > {output.stats} &&
	sed -i '/$sample/d' sample_list.dat && echo "$sample OK" >> fasta_done.dat
	rm -rf results/clustering_stats/results_$sample
	mkdir -p {output.toplot}
	mv results_"$sample".temp/* {output.toplot}
	rmdir results_"$sample".temp/
	""" 
SnakeMake From line 137 of master/Snakefile
156
157
158
159
shell: """
	mkdir -p results/plots;
	Rscript plot_graphics.R {input.SC} {output.clust} {output.qual}
	"""
SnakeMake From line 156 of master/Snakefile
ShowHide 8 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/kristoforusbryant/domain-clustering-pipeline
Name: domain-clustering-pipeline
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...