Pipeline to extract protein domain clusters from sequence data
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Introduction
This pipeline is designed to automate the prediction of protein domains given only the protein's sequence, based on the method described in (Granata et al. , 2017). The pipeline is written in Snakemake framework (Koster and Rahmann, 2012) to allow easy execution both in compute clusters and in local machines and to ensure reproducibility of the generated data. We also strive to make installation as convenient as possible, by having most of our dependencies available on conda.
The pipeline consists of three main processes:
-
Generation of multiple sequence alignment using an iterative protein sequence search tool called hhblits
-
Prediction of direct coupling between every pair of protein residues using gplmDCA (Feinauer et al. , 2014; & Ekeberg et al. , 2013)
-
Generation of protein domains clusters using spectral clustering (Ponzoni, 2015)
This pipeline can be run end-to-end — i.e. from FASTA sequence input toa table of every residue position and their corresponding domain membership — or can be started in from intermediate files.
This repository also include RMarkdown files we use for our downstream analyses which aims to inspect the possible relationship between the number and identity of ExAC mutations (Lek et al. , 2016) and pathogenic variants.
Dependencies
Python, Conda, MATLAB
Setup
Before using the pipeline, you will need to install snakemake on top of the dependencies listed above
conda install -c bioconda -c conda-forge snakemake
Also, hhblits requires a database to be downloaded for which they recommend
UniClust30
. Download this database into the
db/
directory.
Running the pipeline on local machine
On the local machine you could simply run the pipeline by firstly populating the
fasta_list.dat
file with linebreak-delimieted name of your fasta files (without the .fa suffix). You would also need to modify the parameters in
params.json
.
Once those are done, you can run
snakemake -k --use-conda
Running the pipeline on a compute cluster
For parallelise implementation on a compute cluster, you would need to create a configuration file called
cluster.json
, for example
{ "__default__": { "partition": "general", "mem": "1G", "time": "00:30:00" }
}
then, simply run
snakemake -k -j 100 --use-conda --cluster-config cluster.json --cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {threads}"
where
-j 100
means that you only allow a maxiumum of 100 jobs to be run simultaneously on the cluster.
Results
The result along with all intermediate and log files can be found in the
results/
directory, where the final result (every residue's domain membership in the protein) is found in
results/clustering
.
Analysis
Data
Files used to preprocess the clinvar, ExAC and PanCanAtlas variations can be found in
downstream_analysis/clinvar_data
,
downstream_analysis/exac_data
and
downstream_analysis/pancanatlas_data
. The Rmd files in each folder explains what the other files are used for. The link to these datasets can be found here:
-
Clinvar: ftp://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20190715_papu.vcf.gz
-
ExAC: ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/ExAC.r0.3.1.sites.vep.vcf.gz
-
PanCanAtlas: http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc
Preliminary Analysis
Analysis of the first 6,924 proteins can be found in
downstream_analysis/Analysis/Analysis_exac.Rmd
, which also explains what the other files in that directory is used for.
Bibliography
Feinauer, C., Skwark, M. J., Pagnani, A., & Aurell, E. (2014). Improving contact prediction along three dimensions. PLoS computational biology, 10(10), e1003847.
Granata, D., Ponzoni, L., Micheletti, C., & Carnevale, V. (2017). Patterns of coevolving amino acids unveil structural and dynamical domains. Proceedings of the National Academy of Sciences, 114(50), E10612-E10621.
Köster, J., & Rahmann, S. (2012). Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522.
Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T., ... & Tukiainen, T. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285.
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to
infer Potts models. Physical Review E, 87(1), 012707.
Ponzoni, L., Polles, G., Carnevale, V., & Micheletti, C. (2015). SPECTRUS: A dimensionality reduction approach for identifying dynamical
domains in protein complexes from limited structural datasets. Structure, 23(8), 1516-1525.
Code Snippets
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | library(dplyr) library(ggplot2) main <- function(){ # Accepting Command line Argument args <- commandArgs(trailingOnly = TRUE) dir.name <- args[1] output.clustering <- args[2] output.quality <- args[3] # Plotting clustering of a protein sequence ## read every cluster labels in a directory name.array <- list.files(dir.name, pattern = 'final_labels_kmed-*') df <- data.frame() for (i in 1:length(name.array)){ df.single <- read.csv(paste(dir.name, name.array[i], sep = "/"), header = FALSE, sep = " ") df.single$V3 <- gsub(".dat", "", gsub("final_labels_kmed-", "", name.array[i])) %>% as.numeric() df.single$V2 <- df.single$V2 %>% as.factor colnames(df.single) <- c("residue_position", "membership", "q") df = rbind(df, df.single) } df = df[order(df$q,df$residue_position),] df = df[df$q <= 10,] #only plot the top 10 ## define coloring scheme for 10 clusterings #### TODO: make this variable dynamic to the number of clusters detected #### colors = c("#4e79a7", "#59a14f", "#9c755f", "#f28e2b", "#edc948", "#e15759", "#b07aa1", "#76b6b2", "#ff9da7", "#000000", "#bab0ac") ## use scatter plot with square dots df1 = df; df1$q <- df$q + .05 df2 = df; df2$q <- df$q + .1 df3 = df; df3$q <- df$q + .15 df4 = df; df4$q <- df$q + .2 df.combined = rbind(df, df1, df2, df3, df4) colnames(df.combined) <- c("residue_position", "membership", "q") options(bitmapType='cairo') png(output.clustering) print(ggplot(df.combined, aes(x=residue_position, y=q, color=membership)) + geom_point(size=1.3, shape=15) + scale_color_manual(values=colors) + labs(x="residue position", y="number of clusters")) dev.off() # Plot quality score against number of clusters system(sprintf("sed -i 's/^[ \t]*//' %s/quality_score.dat", dir.name)) df.qualscore <- read.csv(paste(dir.name, "quality_score.dat", sep = "/"), header = FALSE, strip.white=TRUE, sep = " ") colnames(df.qualscore) <- c("q", "median_quality_score", "mean_quality_score", "prefactor1", "prefactor2") df.qualscore = df.qualscore[df.qualscore$q <= 10,] png(output.quality) print(ggplot(df.qualscore, aes(x=q, y=median_quality_score)) + geom_line() + geom_point() + labs(x="number of clusters", y="quality score")) dev.off() } main() |
34 35 | shell: "hhblits -cpu {params.cpu_per_tasks} -n {params.n_of_iterations} -e {params.e_value} -i {input.fa} -oa3m {output} -d {params.database} 1> {log.summary} 2> {log.err}" |
42 43 | shell: "cat {input.msa} | ./filter_by_gap.sh > {output}" |
50 51 52 | shell: "cat {input} | ./reformat_a3m_fa.py |" "sed 's/-/./g' > {output}" |
59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | shell: """ if [ $(cat {input} | grep -c "^>") -lt 100 ]; then echo "Stopping job due to not enough sequence in the MSA (< 100) " sample=$(echo {input} | sed 's/.*\///' | sed 's/.filtered.reformatted.a3m//'); echo $sample sed -i '/$sample/d' sample_list.dat && echo "$sample FAILED: TOO FEW MSA SEQUENCES" >> fasta_done.dat exit 1 else cp {input} {output} fi """ """ rule conserve_scoring: input: "results/msa/{sample}.filtered.count.a3m" output: "results/covscore/{sample}.covscore" shell: ... """ |
86 87 88 89 90 91 92 93 | shell: """ cat {input.fa} | sed -n '1,2p' >> results/msa_stats/{sample}.msa.stats echo 'Protein Sequence Length:' >> results/msa_stats/{sample}.msa.stats cat {input.fa} | sed -n '2p' | wc -c >> results/msa_stats/{sample}.msa.stats echo 'MSA count' >> results/msa_stats/{sample}.msa.stats wc -l {input.count} >> results/msa_stats/{sample}.msa.stats """ |
109 110 111 112 113 114 | shell: """ wait=$((1 + RANDOM % 10)).$((1 + RANDOM % 6)) sleep $wait # randomly wait to avoid job clashes matlab -nodisplay -nosplash -r \"gplmDCA_asymmetric('{input.msa}', '{output}', {params.lambda_h}, {params.lambda_J}, {params.lambda_chi}, {params.reweighting_threshold}, {params.nr_of_cores}, {params.M})\" 2> {log} """ |
123 124 125 126 | shell: """ ./cluster_GN.sh {input} {output} > {log} """ |
137 138 139 140 141 142 143 144 145 146 | shell: """ sample=$(echo {input.dca} | sed 's/.*\///' | sed 's/.gplmDCA//') ./cluster_spectrus.sh {input.dca} $sample {output.clust} results/clustering_stats/ > {output.stats} && sed -i '/$sample/d' sample_list.dat && echo "$sample OK" >> fasta_done.dat rm -rf results/clustering_stats/results_$sample mkdir -p {output.toplot} mv results_"$sample".temp/* {output.toplot} rmdir results_"$sample".temp/ """ |
156 157 158 159 | shell: """ mkdir -p results/plots; Rscript plot_graphics.R {input.SC} {output.clust} {output.qual} """ |
Support
- Future updates
Related Workflows





