CADD scripts release for offline scoring. For more information about CADD, please visit our website.
Combined Annotation Dependent Depletion (CADD)
CADD is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome (currently supported builds: GRCh37/hg19 and GRCh38/hg38).
Details about CADD, including features in the latest version, the different genome builds and how we envision the use case of CADD are described in our latest manuscript:
Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M.
CADD: predicting the deleteriousness of variants throughout the human genome.
Nucleic Acids Res. 2018 Oct 29. doi: 10.1093/nar/gky1016 .
PubMed PMID: 30371827 .
The original manuscript describing the method and its features was published in Nature Genetics in 2014:
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J.
A general framework for estimating the relative pathogenicity of human genetic variants.
Nat Genet. 2014 Feb 2. doi: 10.1038/ng.2892 .
PubMed PMID: 24487276 .
We provide pre-computed CADD-based scores (C-scores) for all 8.6 billion possible single nucleotide variants (SNVs) of the reference genome, as well as for all SNVs and insertion/deletion variants (InDels) from population-wide whole-genome variant releases. Short InDels can also be scored on our website.
Please check our website for updates and further information.
Offline Installation
This section describes how users can set up CADD version 1.6 on their own system. Please note that this requires between 100 GB and 1 TB of disk space and at least 12 GB of RAM.
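Before starting a long download, it can help to confirm that the target machine meets these requirements. The following is a minimal sketch, assuming a Linux system with `/proc/meminfo` and a POSIX `df`; the function names are illustrative and not part of CADD.

```shell
# Sketch: pre-flight check for disk space and RAM before installing CADD.
# Function names are illustrative assumptions, not part of the CADD scripts.

# Print the free space (whole GB, rounded down) of the filesystem holding directory $1.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Print the total RAM in whole GB (rounded down). Linux only: reads /proc/meminfo.
total_ram_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Return 0 if directory $1 has at least $2 GB free and the machine has at least $3 GB RAM.
check_resources() {
    local dir=$1 min_disk_gb=$2 min_ram_gb=$3
    if [ "$(free_disk_gb "$dir")" -lt "$min_disk_gb" ]; then
        echo "less than $min_disk_gb GB free in $dir" >&2
        return 1
    fi
    if [ "$(total_ram_gb)" -lt "$min_ram_gb" ]; then
        echo "less than $min_ram_gb GB of RAM" >&2
        return 1
    fi
    return 0
}
```

For example, `check_resources . 100 12` checks the current directory against the lower bounds stated above. The kernel usually reports slightly less memory than is physically installed, so a machine sold as 12 GB may need a threshold of 11 here.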
Prerequisite
- conda
# can be installed like this
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -p $HOME/miniconda2 -b
export PATH=$HOME/miniconda2/bin:$PATH
- snakemake (installed via conda)
conda install -c conda-forge -c bioconda snakemake mamba
Note2: If you are using an existing conda installation, please make sure it is version >= 4.4.0. Also make sure to use snakemake >= 4.0, as some command line parameters are not available in earlier versions.
Note3: We are also installing mamba here. In principle it should also work without mamba; in that case, add
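The version requirements in the note above can be checked with a small helper. This is a sketch that assumes GNU `sort` with the `-V` (version sort) option is available; `version_ge` is an illustrative name, not something shipped with CADD.

```shell
# Sketch: compare installed tool versions against the minimums named in the note.
# version_ge is an illustrative helper; it assumes GNU sort with -V support.

# Return 0 if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (commented out so that this block only defines the helper):
#   conda_version=$(conda --version | awk '{ print $2 }')   # "conda 4.9.2" -> "4.9.2"
#   snakemake_version=$(snakemake --version)
#   version_ge "$conda_version" 4.4.0 || echo "conda is older than 4.4.0" >&2
#   version_ge "$snakemake_version" 4.0 || echo "snakemake is older than 4.0" >&2
```

Version sort is used instead of a plain string comparison so that, for instance, 5.10.0 is correctly treated as newer than 5.9.0.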
--conda-frontend conda
to line 216 in install.sh
Setup
- load/move the zipped CADD archive to its destination folder and unzip.
unzip CADD.zip
-
from here, you can either install everything separately or run the script
install.sh
to install using a brief installation dialog (see below).
Install script
This is the easier way of installing CADD, just run:
./install.sh
You first state which parts you want to install (the environments as well as at least one genome build including annotation tracks are necessary for a quick start) and the script should manage loading and unpacking the necessary files.
Manual installation
Running CADD depends on four big building blocks (plus the repository containing this README which we assume you already downloaded):
-
snakemake
-
dependencies
-
genome annotations
-
prescored variants
Installing dependencies
As of this version, dependencies have to be installed via conda and snakemake. This is because we are using two different environments for python2 and python3.
snakemake test/input.tsv.gz --use-conda --conda-create-envs-only --conda-prefix envs \
--configfile config/config_GRCh38_v1.6.yml --snakefile Snakefile
Please note that we are installing both conda environments in the CADD subdirectory
envs
via
--conda-prefix envs
. If you do not want this behavior (we do this in order to not install the environments in all active directories you run CADD from), adjust or remove this parameter.
Installing annotations
Both versions of CADD (for the different genome builds) rely on a large number of genomic annotations. Depending on which genome build you require, you can get them from our website (be careful where you put them, as these are really big files and have identical filenames) via:
# for GRCh37 / hg19
wget -c https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/annotationsGRCh37_v1.6.tar.gz
# for GRCh38 / hg38
wget -c https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/annotationsGRCh38_v1.6.tar.gz
As those files are about 100 and 200 GB in size, downloads can take a long time (depending on your internet connection). We recommend running the download in the background and using a tool (like
wget -c
mentioned above) that allows you to continue an interrupted download.
To make sure you downloaded the files correctly, we recommend downloading md5 hash files from our website (e.g.
wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/annotationsGRCh38_v1.6.tar.gz.md5
) and checking for completeness (via
md5sum -c annotationsGRCh38_v1.6.tar.gz.md5
).
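The download and verification steps above can be combined into one helper. This is a sketch under the assumption that each `.md5` file lists the archive by its bare file name, which is why the check runs from the archive's directory; `verify_download` and `fetch_and_verify` are illustrative names.

```shell
# Sketch: resume-capable download followed by an md5 check.
# Both function names are illustrative assumptions, not part of CADD.

# Return 0 if file $1 matches the checksum recorded in "$1.md5".
# md5sum is run from the file's directory because the .md5 file is assumed
# to reference the archive by its bare file name.
verify_download() {
    local file=$1
    ( cd "$(dirname "$file")" && md5sum -c "$(basename "$file").md5" >/dev/null 2>&1 )
}

# Download the archive at URL $1 into the current directory (resuming a partial
# download if one exists), fetch the matching .md5 file, and verify the archive.
fetch_and_verify() {
    local url=$1 name
    name=$(basename "$url")
    wget -c "$url" \
        && wget -O "$name.md5" "$url.md5" \
        && verify_download "$name"
}
```

Calling `fetch_and_verify` with one of the annotation URLs listed above returns a non-zero status if either download or the checksum comparison fails, so it can be re-run until it succeeds. To keep it running after logout, the calling script can be started under `nohup` or inside a terminal multiplexer.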
The annotation files are finally put in the folder
data/annotations
and unpacked:
cd data/annotations
tar -zxvf annotationsGRCh37_v1.6.tar.gz
mv GRCh37 GRCh37_v1.4
tar -zxvf annotationsGRCh38_v1.6.tar.gz
cd $OLDPWD
Installing prescored files
At this point you are ready to go, but if you want a faster version of CADD, you can download the prescored files from our website (see section Downloads for a list of available files). Please note that these files can be very big. The files are (together with their respective tabix indices) put in the folders
no_anno
or
incl_anno
depending on the file under
data/prescored/${GENOME_BUILD}_${VERSION}/
and will be automatically detected by the
CADD.sh
script.
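Because the prescored files are only usable together with their tabix indices, a quick consistency check of the folder layout described above can save a failed run. The sketch below assumes the index sits next to each file under the default tabix name `<file>.tbi`; `missing_tabix_indices` is an illustrative name.

```shell
# Sketch: list prescored files that have no tabix index next to them.
# Arguments: $1 CADD directory, $2 genome build (e.g. GRCh38), $3 version (e.g. v1.6).
# Assumes indices use the default tabix naming "<file>.tbi".
missing_tabix_indices() {
    local base="$1/data/prescored/${2}_${3}"
    local sub f
    for sub in no_anno incl_anno; do
        for f in "$base/$sub"/*.tsv.gz; do
            [ -e "$f" ] || continue        # the glob matched nothing
            [ -e "$f.tbi" ] || echo "$f"   # report files without an index
        done
    done
}
```

Running `missing_tabix_indices . GRCh38 v1.6` from the CADD directory prints nothing when every prescored file has an index.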
Running CADD
You run CADD via the script
CADD.sh
which technically only requires a vcf or vcf.gz input file as its last argument. You can further specify the genome build via
-g
, CADD version via
-v
(deprecated, the new version of the scripts only supports v1.6), request a fully annotated output (
-a
flag) and specify a separate output file via
-o
(otherwise the input file name with the extension
.tsv.gz
is used). For example:
./CADD.sh test/input.vcf
./CADD.sh -a -g GRCh37 -o output_inclAnno_GRCh37.tsv.gz test/input.vcf
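When several VCF files need to be scored, the calls above can be wrapped in a loop. The following sketch uses only the `-g` and `-o` options described here; the output naming scheme (`<name>_<build>.tsv.gz`) and both function names are illustrative assumptions, not CADD conventions.

```shell
# Sketch: score several VCF files with explicit output names.
# The naming scheme and function names are illustrative assumptions.

# Build an output name from input VCF $1 and genome build $2,
# e.g. "dir/sample.vcf.gz" + "GRCh38" -> "sample_GRCh38.tsv.gz".
output_name_for() {
    local name
    name=$(basename "$1")
    name=${name%.gz}     # drop a trailing .gz, if present
    name=${name%.vcf}    # drop the .vcf extension
    printf '%s_%s.tsv.gz\n' "$name" "$2"
}

# Score every VCF listed after the genome build; stop at the first failure.
# Must be called from the CADD directory, where CADD.sh lives.
score_all() {
    local build=$1 vcf
    shift
    for vcf in "$@"; do
        ./CADD.sh -g "$build" -o "$(output_name_for "$vcf" "$build")" "$vcf" || return 1
    done
}
```

For example, `score_all GRCh38 sampleA.vcf sampleB.vcf.gz` would write `sampleA_GRCh38.tsv.gz` and `sampleB_GRCh38.tsv.gz`. Passing `-o` explicitly keeps results for different builds from overwriting each other.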
You can test whether your CADD is set up properly by comparing to the example files in the
test
directory.
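One way to do this comparison is to decompress both files and compare the data lines. The sketch below ignores lines starting with `#`, on the assumption that header lines may differ between runs while the scored variants should not; `same_scores` is an illustrative name.

```shell
# Sketch: compare two gzip- or bgzip-compressed CADD output files.
# Lines starting with "#" are skipped (assumption: only data lines must match).
# Returns 0 if the data lines are identical, non-zero otherwise.
same_scores() {
    local tmp status
    tmp=$(mktemp) || return 2
    gzip -dc "$1" | grep -v '^#' > "$tmp"
    gzip -dc "$2" | grep -v '^#' | cmp -s - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

A call such as `same_scores my_output.tsv.gz <example file>`, where the second argument is whichever example file in the test directory matches the options you ran with, then reports a mismatch through its exit status.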
Update
Version 1.6 includes some changes in comparison to v1.5. Besides the obvious switch of the pipeline to a Snakemake workflow, which became necessary due to the ongoing issues with
conda activate
, the new models for v1.6 are extended by more specialized annotations for splicing variants, as well as a few minor changes in some other annotations (most prominent: fixed gerp for GRCh38) and changes in consequence categories which make these scripts incompatible with CADD v1.4 and v1.5. If you are still using those versions, please use
version 1.5 of this repository
.
Copyright
Copyright (c) University of Washington, Hudson-Alpha Institute for
Biotechnology and Berlin Institute of Health 2013-2020. All rights reserved.
Permission is hereby granted, to all non-commercial users and licensees of CADD
(Combined Annotation Dependent Framework, licensed by the University of
Washington) to obtain copies of this software and associated documentation
files (the "Software"), to use the Software without restriction, including
rights to use, copy, modify, merge, and distribute copies of the Software. The
above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Code Snippets
shell:
    '''
    zcat {input} > {output}
    '''

shell:
    '''
    cat {input} \
    | python $CADD/src/scripts/VCF2vepVCF.py \
    | sort -k1,1 -k2,2n -k4,4 -k5,5 \
    | uniq > {output}
    '''

shell:
    '''
    # Prescoring
    echo '## Prescored variant file' > {output.prescored};
    if [ -d $CADD/{config[PrescoredFolder]} ]
    then
        for PRESCORED in $(ls $CADD/{config[PrescoredFolder]}/*.tsv.gz)
        do
            cat {input} \
            | python $CADD/src/scripts/extract_scored.py --header \
                -p $PRESCORED --found_out={output.prescored}.tmp \
            > {input}.tmp;
            cat {output.prescored}.tmp >> {output.prescored}
            mv {input}.tmp {input};
        done;
        rm {output.prescored}.tmp
    fi
    mv {input} {output.novel}
    '''

shell:
    '''
    cat {input} \
    | vep --quiet --cache --offline --dir $CADD/{config[VEPpath]} \
        --buffer 1000 --no_stats --species homo_sapiens \
        --db_version={config[EnsemblDB]} --assembly {config[GenomeBuild]} \
        --format vcf --regulatory --sift b --polyphen b --per_gene --ccds --domains \
        --numbers --canonical --total_length --vcf --force_overwrite --output_file STDOUT \
    | python $CADD/src/scripts/annotateVEPvcf.py \
        -c $CADD/{config[ReferenceConfig]} \
    | gzip -c > {output}
    '''

shell:
    '''
    zcat {input} \
    | python $CADD/src/scripts/trackTransformation.py -b \
        -c $CADD/{config[ImputeConfig]} -o {output} --noheader;
    '''

shell:
    '''
    python $CADD/src/scripts/predictSKmodel.py \
        -i {input.impute} -m $CADD/{config[Model]} -a {input.anno} \
    | python $CADD/src/scripts/max_line_hierarchy.py --all \
    | python $CADD/src/scripts/appendPHREDscore.py \
        -t $CADD/{config[ConversionTable]} > {output};

    if [ "{config[Annotation]}" = 'False' ]
    then
        cat {output} | cut -f {config[Columns]} | uniq > {output}.tmp
        mv {output}.tmp {output}
    fi
    '''

shell:
    '''
    (
        echo "{config[Header]}";
        head -n 1 {input.novel};
        cat {input.pre} {input.novel} \
        | grep -v "^#" \
        | sort -k1,1 -k2,2n -k3,3 -k4,4 || true;
    ) | bgzip -c > {output};
    '''
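To illustrate what the last snippet does outside of Snakemake, the following sketch reproduces its merge step on plain, uncompressed files: it prints a fixed header line and the first line of the file of newly scored variants, then all non-comment lines of both files sorted by columns 1 to 4 (column 2 numerically). `merge_scored` and its arguments are illustrative stand-ins for the rule's `{config[Header]}`, `{input.pre}` and `{input.novel}`, and the `bgzip` compression step is left out.

```shell
# Sketch: merge prescored and newly scored variants as in the final rule,
# without compression. Arguments: $1 header text, $2 prescored file, $3 novel file.
# Function and argument names are illustrative assumptions.
merge_scored() {
    echo "$1"
    head -n 1 "$3"
    cat "$2" "$3" \
        | grep -v '^#' \
        | sort -k1,1 -k2,2n -k3,3 -k4,4 || true
}
```

The `|| true` mirrors the original rule: `grep -v` exits with a non-zero status when no data lines remain, and this keeps that case from being treated as a failure.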