CADD-SV – a framework to score the effect of structural variants

public public 1yr ago Version: v1.1 0 bookmarks

CADD-SV

CADD-SV – a framework to score the effect of structural variants

Here, we describe CADD-SV, a method to retrieve a wide set of annotations in the range and vicinity of a SV. Our tool computes summary statistics and uses a trained machine learning model to differentiate deleterious from neutral variants. In training, we use human and chimpanzee derived alleles as proxy-neutral and contrast them with matched simulated variants as proxy-pathogenic. This approach has proven powerful in the interpretation of SNVs (CADD, https://cadd.gs.washington.edu). We show that CADD-SV scores correlate with known pathogenic variants in individual genomes and allelic diversity.

Pre-requirements

Conda

The pipeline depends on Snakemake , a workflow management system that wraps up all scripts and runs them highly automated, in various environments (workstations, clusters, grid, or cloud). Further, we use Conda as software/dependency management tool. Conda can install snakemake and all neccessary software with its dependencies automatically. Conda installation guidelines can be found here:

https://conda.io/projects/conda/en/latest/user-guide/install/index.html

Snakemake

After installing Conda, you install Snakemake using Conda and the environment.yaml provided in this repository. For this purpose, please clone or download and uncompress the repository first. Then change into the root folder of the local repository.

git clone https://github.com/kircherlab/CADD-SV
cd CADD-SV

We will now initiate the Conda environment, which we will need for getting the Snakemake workflow invoked. Using this environment ( run.caddsv ) snakemake will be installed

conda env create -n run.caddsv --file environment.yaml

The second conda environment ( envs/SV.yml ), containing all packages and tools to run CADD-SV, will be installed automatically during the first run. This can take some time.

Annotations

CADD-SV depends on various annotations to provide the model with its necessary input features. CADD-SV automatically retrieves and transforms these annotations (see Snakefile) and combines them in bed-format at /desired-sv-set/matrix.bed

Annotations can be downloaded and expanded individually. However, to run CADD-SV with the pre-trained model and to minimize runtime and memory failures use the annotation sets as stored at https://kircherlab.bihealth.org/download/CADD-SV/

wget https://kircherlab.bihealth.org/download/CADD-SV/v1.0/dependencies.tar.gz
tar -xf dependencies.tar.gz

Config

Almost ready to go. After you prepared the files above, you may need to adjust the name of your dataset in the config.yml .

List of required input files

  • Models and scripts as cloned from this GIT repository

  • Annotations in the annotations/ folder

  • CADD-SV scores SV in a coordinate sorted BED format on the GRCh38 genome build. The type of SV needs to be included for each variant in the 4th column. We recommend to split files containing more than 10,000 SVs into smaller files. An example input file can be found in input/ . The file needs to have the suffix id_ . If you plan to process variants from another genome build or SVs in VCF format, see below.

Running the pipeline

Ready to go! If you run the pipeline on a cluster see the cluster.json for an estimate of minimum resource requirements for the individual jobs. Note that this depends on your dataset size so you may have to adjust this.

To start the pipeline:

conda activate run.caddsv
# dry run to see if everything works
snakemake --use-conda --configfile config.yml -j 4 -n
# run the pipeline
snakemake --use-conda --configfile config.yml -j 4

Output files

The pipeline outputs your SV set containing all annotations in BED format in a folder named output containing the CADD-SV and two raw scores in rows 6-8. Further information about individual annotations are kept in a subfolder named after your input dataset.

Further Information

Annotations

CADD-SV integrates different annotations, here some links to its annotation sources. A complete list can be found as Suppl. Table 1 of the manuscript/pre-print.

Integrated Scores

CADD (https://krishna.gs.washington.edu/download/CADD/bigWig/)
LINSIGHT (http://compgen.cshl.edu/LINSIGHT/LINSIGHT.bw)

Species conservation and constraint metrics

PhastCons (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/)
Syntenic regions (http://webclu.bio.wzw.tum.de/cgi-bin/syntenymapper/get-species-list.py)
GERP score (http://mendel.stanford.edu/SidowLab/downloads/gerp/)

Population and disease constraint metrics

pLI score (ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/manuscript_data/forweb_cleaned_exac_r03_march16_z_data_pLI.txt.gz)
Conserved coding regions (CCR) (https://www.nature.com/articles/s41588-018-0294-6?WT.feed_name=subjects_population-genetics)
DDD Happloinsufficiency (https://decipher.sanger.ac.uk/files/downloads/HI_Predictions_Version3.bed.gz)

Epigenetic and regulatory activity

Encode Features such as Histon Modifications and DNase and RNase-seq (https://www.encodeproject.org/help/batch-download/)
GC content (http://hgdownload.cse.ucsc.edu/gbdb/hg38/bbi/gc5BaseBw/gc5Base.bw)
ChromHMM states of ENCODE cell lines (http://compbio.mit.edu/ChromHMM/)

3D genome organization

CTCF (http://genome.cshlp.org/content/suppl/2012/08/28/22.9.1680.DC1/Table_S2_Location_of_ChIP-seq_binding_positions_in_19_cell_lines.txt)
Encode-HiC data (https://www.encodeproject.org/search/?type=Experiment&assay_term_name=HiC&replicates.library.biosample.donor.organism.scientific_name=Homo%20sapiens&status=released)
Enhancer-promoter-links from FOCS (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930446/)
Frequently Interacting Regulatory Elements (FIREs) (https://www.sciencedirect.com/science/article/pii/S2211124716314814)
Directionality index for HiC data from various datasets (https://www.genomegitar.org/processed-data.html)
DeepC saliencies score (http://userweb.molbiol.ox.ac.uk/public/rschwess/container_for_ucsc/data/deepC/saliency_scores/saliencies_merged_gm12878_5kb.bw)

Gene and element annotations

Ensembl-gff3 genebuild 96 (ftp://ftp.ensembl.org/pub/release-96/gff3/homo_sapiens/Homo_sapiens.GRCh38.96.chr.gff3.gz)
Fantom5 enhancers (https://zenodo.org/record/556775#.Xkz3G0oo-70)

Converting VCF and other genome builds

If you want to score SVs in a VCF format or your SVs are not in GRCh38 genomebuild coordinates: We provide an environment to handle this. It uses the SURVIVOR tools (https://github.com/fritzsedlazeck/SURVIVOR).

conda env create -n prepBED --file envs/prepBED.yml

To convert your VCF into BED format run:

conda activate prepBED
SURVIVOR vcftobed input.vcf 0 -1 output.bed
cut -f1,2,6,11 output.bed > beds/set_id.bed

To lift hg19 coordinates to GRCh38 apply the following steps:

conda activate prepBED
liftOver beds/setname_hg19_id.bed dependencies/hg19ToHg38.over.chain.gz beds/setname_id.bed beds/setname_unlifted.bed

Code Snippets

69
70
71
72
shell:
    """
    cut -f1,2,3 {input} > {output}
    """
82
83
84
85
shell:
    """
    sed 's/^chr\\|%$//g' {input} > {output}
    """
 97
 98
 99
100
101
102
shell:
    """
    bedtools merge -i {input.nochr} > {output.nochr}
    bedtools merge -i {input.wchr} > {output.wchr}

    """
114
115
116
117
shell:
    """
    bedtools coverage -b {input.anno} -a {input.bed} > {output}
    """
128
129
130
131
shell:
    """
    bedtools coverage -b {input.anno} -a {input.bed} > {output}
    """
143
144
145
146
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output.t1}
    """
159
160
161
162
shell:
    """
    tabix {input.anno} -R {input.merg} | awk '{{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \\"\\";"; }}' | gtf2bed - | cut -f1,2,3,8,10  > {output}
    """
174
175
176
177
shell:
    """
    paste <( bedtools coverage -b <(grep "exon" {input.anno}) -a {input.bed} | cut -f 1,2,3,5 ) <(bedtools coverage -b <(grep "transcript" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "gene" {input.anno} ) -a {input.bed} | cut -f 5) <( bedtools coverage -b <( grep "start_codon" {input.anno} ) -a {input.bed} | cut -f 5 ) <( bedtools coverage -b <(grep "stop_codon" {input.anno}) -a {input.bed} | cut -f 5) <(bedtools coverage -b <(grep "three_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "five_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "CDS" {input.anno}) -a {input.bed} | cut -f 5 ) > {output}
    """
191
192
193
shell:
    """
    grep "exon" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} |cut -f 1,2,3,9 |paste - <(grep "gene" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9)|paste - <(grep "start_codon" {input.anno} |bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9 ) >> {output}   """
205
206
207
208
shell:
    """
    bedtools map -b {input.anno} -a {input.bed} -c 4 -o distinct > {output}
    """
220
221
222
223
shell:
    """
    Rscript --vanilla scripts/PLIextract.R {input.gn} {input.pli} {output}
    """
SnakeMake From line 220 of master/Snakefile
235
236
237
238
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy8.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,4,5,5,6,6,7,7,8,8 -o max,sum,max,sum,max,sum,max,sum,max,sum; done < {input.bed}) > {output}
    """
249
250
251
252
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output}
    """
SnakeMake From line 249 of master/Snakefile
267
268
269
270
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy5_nochr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o max; done < {input.bed}) > {output}
    """
281
282
283
284
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output}
    """
SnakeMake From line 281 of master/Snakefile
296
297
298
299
300
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk '{{ total+=$4 }}END{{ printf("%f",total) }}'); done < {input.bed}) > {output}

    """
SnakeMake From line 296 of master/Snakefile
315
316
317
318
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.ep} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
330
331
332
333
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4,4 -o max,min > {output}
    """
341
342
shell:
    "paste {input} > {output}"
SnakeMake From line 341 of master/Snakefile
354
355
356
357
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
369
370
371
372
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
381
382
shell:
    "paste {input} > {output}"
SnakeMake From line 381 of master/Snakefile
394
395
396
397
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
408
409
410
411
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4 -o max > {output}
    """
423
424
425
426
427
shell:
    """
    bedtools map -a {input.bed} \
    -b {input.anno} -c 4,4 -o max,min > {output}
    """
435
436
shell:
    "paste {input} > {output}"
SnakeMake From line 435 of master/Snakefile
445
446
447
448
shell:
    """
    cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print m}}' > {output.mingg}
    cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}}' > {output.maxgg}"""
SnakeMake From line 445 of master/Snakefile
459
460
461
462
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4 -o mean > {output}
    """
473
474
475
476
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4_nochr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output}
    """
489
490
shell:
    "tabix {input.anno} -R {input.merg} | cat annotations/dummy5_nochr.bed - | bedtools map -a {input.bed} -b - -c 4,4 -o max,sum > {output}"
498
499
shell:
    "paste {input} > {output}"
SnakeMake From line 498 of master/Snakefile
511
512
513
514
shell:
    """
    (while read -r line; do bedtools map -b <(cat annotations/dummy_chrhmm.bed; tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 -o max; done < {input.bed}) > {output}
    """
528
529
530
531
shell:
    """
    bedtools coverage -a {input.bed} -b {input.anno} -counts  > {output}
    """
542
543
544
545
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 5 -o max > {output}
    """
556
557
558
559
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk 'BEGIN{{ maxVal="." }}{{ if ((maxVal == ".") || ($4 > maxVal)) {{ maxVal=$4 }} }}END{{ print maxVal }}'); done < {input.bed}) > {output}
    """
SnakeMake From line 556 of master/Snakefile
595
596
597
598
shell:
    """
    paste <(cut -f1-11 {input.cadd}) <(cut -f4 {input.cadd2}) <(cut -f4 {input.ccr}) <(cut -f4-28 {input.chromHMM}) <(cut -f4,5,7 {input.ctcf}) <(cut -f1 {input.di_min}) <(cut -f1 {input.di_max}) <(cut -f4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65 {input.encode}) <(cut -f4-7 {input.ep}) <(cut -f4,5,9,10,14,15,19,20,24,25 {input.fire}) <(cut -f4 {input.gc}) <(cut -f4-11 {input.gm}) <(cut -f4 {input.gerp}) <(cut -f4 {input.gerp2}) <(cut -f4,5,6,7,11,12,13,14,18,19,20,21,25,26,27,28 {input.hic}) <(cut -f4-7 {input.hesc}) <(cut -f4-7 {input.microsyn}) <(cut -f4 {input.mpc}) <(cut -f4 {input.pli}) <(cut -f4,5,6 {input.g_dist}) <(cut -f4 {input.remapTF}) <(cut -f4 {input.f5}) <(cut -f4 {input.hi}) <(cut -f4 {input.deepc}) <(cut -f4,5,7 {input.ultrac}) <(cut -f4 {input.linsight})| cat {input.header} - > {output}
    """
SnakeMake From line 595 of master/Snakefile
616
617
618
619
620
621
shell:
    """
    bedtools flank -i {input.bed} -g {input.genome} -l 100 -r 0 > {output.wchr}
    bedtools flank -i {input.bed} -g {input.genome} -l 100 -r 0 | sed 's/^chr\|%$//g' > {output.nchr}

    """
633
634
635
636
637
638
shell:
    """
    bedtools merge -i {input.nchr} > {output.nchr}
    bedtools merge -i {input.wchr} > {output.wchr}

    """
650
651
652
653
shell:
    """
    bedtools coverage -b {input.anno} -a {input.bed} > {output}
    """
664
665
666
667
shell:
    """
    bedtools coverage -b {input.anno} -a {input.bed} > {output}
    """
679
680
681
682
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output}
    """
695
696
697
698
shell:
    """
    tabix {input.anno} -R {input.merg} | awk '{{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \\"\\";"; }}' | gtf2bed - | cut -f1,2,3,8,10  > {output}
    """
710
711
712
713
shell:
    """
    paste <( bedtools coverage -b <(grep "exon" {input.anno}) -a {input.bed} | cut -f 1,2,3,5 ) <(bedtools coverage -b <(grep "transcript" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "gene" {input.anno} ) -a {input.bed} | cut -f 5) <( bedtools coverage -b <( grep "start_codon" {input.anno} ) -a {input.bed} | cut -f 5 ) <( bedtools coverage -b <(grep "stop_codon" {input.anno}) -a {input.bed} | cut -f 5) <(bedtools coverage -b <(grep "three_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "five_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "CDS" {input.anno}) -a {input.bed} | cut -f 5 ) > {output}
    """
727
728
729
shell:
    """
    grep "exon" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} |cut -f 1,2,3,9 |paste - <(grep "gene" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9)|paste - <(grep "start_codon" {input.anno} |bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9 ) >> {output}   """
741
742
743
744
shell:
    """
    bedtools map -b {input.anno} -a {input.bed} -c 4 -o distinct > {output}
    """
756
757
758
759
shell:
    """
    Rscript --vanilla scripts/PLIextract.R {input.gn} {input.pli} {output}
    """
SnakeMake From line 756 of master/Snakefile
771
772
773
774
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy8.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,4,5,5,6,6,7,7,8,8 -o max,sum,max,sum,max,sum,max,sum,max,sum; done < {input.bed}) > {output}
    """
785
786
787
788
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output}
    """
SnakeMake From line 785 of master/Snakefile
804
805
806
807
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy5_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o max; done < {input.bed}) > {output}
    """
818
819
820
821
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output}
    """
SnakeMake From line 818 of master/Snakefile
833
834
835
836
837
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk '{{ total+=$4 }}END{{ printf("%f",total) }}'); done < {input.bed}) > {output}        

    """
SnakeMake From line 833 of master/Snakefile
852
853
854
855
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.ep} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
867
868
869
870
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4,4 -o max,min > {output}
    """
878
879
shell:
    "paste {input} > {output}"
SnakeMake From line 878 of master/Snakefile
891
892
893
894
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
906
907
908
909
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
920
921
shell:
    "paste {input} > {output}"
SnakeMake From line 920 of master/Snakefile
933
934
935
936
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
947
948
949
950
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4 -o max > {output}
    """
962
963
964
965
966
shell:
    """
    bedtools map -a {input.bed} \
    -b {input.anno} -c 4,4 -o max,min > {output}
    """
974
975
shell:
    "paste {input} > {output}"
SnakeMake From line 974 of master/Snakefile
984
985
986
987
988
shell:
    """
    cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print m}}' > {output.mingg}
    cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}}' > {output.maxgg}
    """
SnakeMake From line 984 of master/Snakefile
 999
1000
1001
1002
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4 -o mean > {output}
    """
1013
1014
1015
1016
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output}
    """
1029
1030
shell:
    "tabix {input.anno} -R {input.merg} | cat annotations/dummy5_nchr.bed - | bedtools map -a {input.bed} -b - -c 4,4 -o max,sum > {output}"
1038
1039
shell:
    "paste {input} > {output}"
SnakeMake From line 1038 of master/Snakefile
1051
1052
1053
1054
shell:
    """
    (while read -r line; do bedtools map -b <(cat annotations/dummy_chrhmm.bed; tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 -o max; done < {input.bed}) > {output}
    """
1068
1069
1070
1071
shell:
    """
    bedtools coverage -a {input.bed} -b {input.anno} -counts  > {output}
    """
1082
1083
1084
1085
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 5 -o max > {output}
    """
1096
1097
1098
1099
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk 'BEGIN{{ maxVal="." }}{{ if ((maxVal == ".") || ($4 > maxVal)) {{ maxVal=$4 }} }}END{{ print maxVal }}'); done < {input.bed}) > {output}
    """
SnakeMake From line 1096 of master/Snakefile
1135
1136
1137
1138
shell:
    """
    paste <(cut -f1-11 {input.cadd}) <(cut -f4 {input.cadd2}) <(cut -f4 {input.ccr}) <(cut -f4-28 {input.chromHMM}) <(cut -f4,5,7 {input.ctcf}) <(cut -f1 {input.di_min}) <(cut -f1 {input.di_max}) <(cut -f4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65 {input.encode}) <(cut -f4-7 {input.ep}) <(cut -f4,5,9,10,14,15,19,20,24,25 {input.fire}) <(cut -f4 {input.gc}) <(cut -f4-11 {input.gm}) <(cut -f4 {input.gerp}) <(cut -f4 {input.gerp2}) <(cut -f4,5,6,7,11,12,13,14,18,19,20,21,25,26,27,28 {input.hic}) <(cut -f4-7 {input.hesc}) <(cut -f4-7 {input.microsyn}) <(cut -f4 {input.mpc}) <(cut -f4 {input.pli}) <(cut -f4,5,6 {input.g_dist}) <(cut -f4 {input.remapTF}) <(cut -f4 {input.f5}) <(cut -f4 {input.hi}) <(cut -f4 {input.deepc}) <(cut -f4,5,7 {input.ultrac}) <(cut -f4 {input.linsight})| cat {input.header} - > {output}
    """
SnakeMake From line 1135 of master/Snakefile
1156
1157
1158
1159
1160
1161
shell:
    """
    bedtools flank -i {input.bed} -g {input.genome} -l 0 -r 100 | bedtools sort > {output.wchr}
    bedtools flank -i {input.bed} -g {input.genome} -l 0 -r 100 | bedtools sort | sed 's/^chr\|%$//g' > {output.nchr}

    """
1173
1174
1175
1176
1177
1178
shell:
    """
    bedtools merge -i {input.nchr} > {output.nchr}
    bedtools merge -i {input.wchr} > {output.wchr}

    """
1190
1191
1192
1193
shell:
    """
    bedtools coverage -b {input.anno} -a {input.bed} > {output}
    """
1204
1205
1206
1207
shell:
    """
    bedtools coverage -b {input.anno} -a {input.bed} > {output}
    """
1219
1220
1221
1222
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output}
    """
1235
1236
1237
1238
shell:
    """
    tabix {input.anno} -R {input.merg} | awk '{{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \\"\\";"; }}' | gtf2bed - | cut -f1,2,3,8,10  > {output}
    """
1250
1251
1252
1253
shell:
    """
    paste <( bedtools coverage -b <(grep "exon" {input.anno}) -a {input.bed} | cut -f 1,2,3,5 ) <(bedtools coverage -b <(grep "transcript" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "gene" {input.anno} ) -a {input.bed} | cut -f 5) <( bedtools coverage -b <( grep "start_codon" {input.anno} ) -a {input.bed} | cut -f 5 ) <( bedtools coverage -b <(grep "stop_codon" {input.anno}) -a {input.bed} | cut -f 5) <(bedtools coverage -b <(grep "three_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "five_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "CDS" {input.anno}) -a {input.bed} | cut -f 5 ) > {output}
    """
1267
1268
1269
shell:
    """
    grep "exon" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} |cut -f 1,2,3,9 |paste - <(grep "gene" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9)|paste - <(grep "start_codon" {input.anno} |bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9 ) >> {output}   """
1281
1282
1283
1284
shell:
    """
    bedtools map -b {input.anno} -a {input.bed} -c 4 -o distinct > {output}
    """
1296
1297
1298
1299
shell:
    """
    Rscript --vanilla scripts/PLIextract.R {input.gn} {input.pli} {output}
    """
SnakeMake From line 1296 of master/Snakefile
1311
1312
1313
1314
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy8.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,4,5,5,6,6,7,7,8,8 -o max,sum,max,sum,max,sum,max,sum,max,sum; done < {input.bed}) > {output}
    """
1325
1326
1327
1328
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output}
    """
SnakeMake From line 1325 of master/Snakefile
1344
1345
1346
1347
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy5_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o max; done < {input.bed}) > {output}
    """
1358
1359
1360
1361
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output}
    """
SnakeMake From line 1358 of master/Snakefile
1373
1374
1375
1376
1377
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk '{{ total+=$4 }}END{{ printf("%f",total) }}'); done < {input.bed}) > {output}

    """
SnakeMake From line 1373 of master/Snakefile
1392
1393
1394
1395
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.ep} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
1407
1408
1409
1410
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4,4 -o max,min > {output}
    """
1418
1419
shell:
    "paste {input} > {output}"
SnakeMake From line 1418 of master/Snakefile
1431
1432
1433
1434
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
1446
1447
1448
1449
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
1462
1463
shell:
    "paste {input} > {output}"
SnakeMake From line 1462 of master/Snakefile
1475
1476
1477
1478
shell:
    """
    bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1}
    """
1489
1490
1491
1492
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4 -o max > {output}
    """
1504
1505
1506
1507
1508
shell:
    """
    bedtools map -a {input.bed} \
    -b {input.anno} -c 4,4 -o max,min > {output}
    """
1516
1517
shell:
    "paste {input} > {output}"
SnakeMake From line 1516 of master/Snakefile
1526
1527
1528
1529
shell:
    """
    cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print m}}' > {output.mingg}
    cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}}' > {output.maxgg}"""
SnakeMake From line 1526 of master/Snakefile
1540
1541
1542
1543
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 4 -o mean > {output}
    """
1554
1555
1556
1557
shell:
    """
    (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output}
    """
1570
1571
shell:
    "tabix {input.anno} -R {input.merg} | cat annotations/dummy5_nchr.bed - | bedtools map -a {input.bed} -b - -c 4,4 -o max,sum > {output}"
1579
1580
shell:
    "paste {input} > {output}"
SnakeMake From line 1579 of master/Snakefile
1592
1593
1594
1595
shell:
    """
    (while read -r line; do bedtools map -b <(cat annotations/dummy_chrhmm.bed; tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 -o max; done < {input.bed}) > {output}
    """
1609
1610
1611
1612
shell:
    """
    bedtools coverage -a {input.bed} -b {input.anno} -counts  > {output}
    """
1623
1624
1625
1626
shell:
    """
    bedtools map -a {input.bed} -b {input.anno} -c 5 -o max > {output}
    """
1637
1638
1639
1640
shell:
    """
    (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk 'BEGIN{{ maxVal="." }}{{ if ((maxVal == ".") || ($4 > maxVal)) {{ maxVal=$4 }} }}END{{ print maxVal }}'); done < {input.bed}) > {output}
    """
SnakeMake From line 1637 of master/Snakefile
1676
1677
1678
1679
shell:
    """
    paste <(cut -f1-11 {input.cadd}) <(cut -f4 {input.cadd2}) <(cut -f4 {input.ccr}) <(cut -f4-28 {input.chromHMM}) <(cut -f4,5,7 {input.ctcf}) <(cut -f1 {input.di_min}) <(cut -f1 {input.di_max}) <(cut -f4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65 {input.encode}) <(cut -f4-7 {input.ep}) <(cut -f4,5,9,10,14,15,19,20,24,25 {input.fire}) <(cut -f4 {input.gc}) <(cut -f4-11 {input.gm}) <(cut -f4 {input.gerp}) <(cut -f4 {input.gerp2}) <(cut -f4,5,6,7,11,12,13,14,18,19,20,21,25,26,27,28 {input.hic}) <(cut -f4-7 {input.hesc}) <(cut -f4-7 {input.microsyn}) <(cut -f4 {input.mpc}) <(cut -f4 {input.pli}) <(cut -f4,5,6 {input.g_dist}) <(cut -f4 {input.remapTF}) <(cut -f4 {input.f5}) <(cut -f4 {input.hi}) <(cut -f4 {input.deepc}) <(cut -f4,5,7 {input.ultrac}) <(cut -f4 {input.linsight})| cat {input.header} - > {output}
    """
SnakeMake From line 1676 of master/Snakefile
1699
1700
1701
1702
shell:
    """
    Rscript --vanilla scripts/scoring.R {params.name} {input.span} {input.flank_up} {input.flank_down} {output}
    """
SnakeMake From line 1699 of master/Snakefile
1714
1715
1716
1717
shell:
    """
    bedtools sort -i {input.score} | cat {input.header} - > {output}
    """
ShowHide 106 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://cadd-sv.bihealth.org
Name: cadd-sv
Version: v1.1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...