CADD-SV – a framework to score the effect of structural variants
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
CADD-SV
CADD-SV – a framework to score the effect of structural variants
Here, we describe CADD-SV, a method to retrieve a wide set of annotations in the range and vicinity of a SV. Our tool computes summary statistics and uses a trained machine learning model to differentiate deleterious from neutral variants. In training, we use human and chimpanzee derived alleles as proxy-neutral and contrast them with matched simulated variants as proxy-pathogenic. This approach has proven powerful in the interpretation of SNVs (CADD, https://cadd.gs.washington.edu). We show that CADD-SV scores correlate with known pathogenic variants in individual genomes and allelic diversity.
Pre-requirements
Conda
The pipeline depends on Snakemake , a workflow management system that wraps up all scripts and runs them highly automated, in various environments (workstations, clusters, grid, or cloud). Further, we use Conda as software/dependency management tool. Conda can install snakemake and all neccessary software with its dependencies automatically. Conda installation guidelines can be found here:
https://conda.io/projects/conda/en/latest/user-guide/install/index.html
Snakemake
After installing Conda, you install Snakemake using Conda and the
environment.yaml
provided in this repository. For this purpose, please clone or download and uncompress the repository first. Then change into the root folder of the local repository.
git clone https://github.com/kircherlab/CADD-SV
cd CADD-SV
We will now initiate the Conda environment, which we will need for getting the Snakemake workflow invoked. Using this environment (
run.caddsv
) snakemake will be installed
conda env create -n run.caddsv --file environment.yaml
The second conda environment (
envs/SV.yml
), containing all packages and tools to run CADD-SV, will be installed automatically during the first run. This can take some time.
Annotations
CADD-SV depends on various annotations to provide the model with its necessary input features. CADD-SV automatically retrieves and transforms these annotations (see Snakefile) and combines them in bed-format at
/desired-sv-set/matrix.bed
Annotations can be downloaded and expanded individually. However, to run CADD-SV with the pre-trained model and to minimize runtime and memory failures use the annotation sets as stored at https://kircherlab.bihealth.org/download/CADD-SV/
wget https://kircherlab.bihealth.org/download/CADD-SV/v1.0/dependencies.tar.gz
tar -xf dependencies.tar.gz
Config
Almost ready to go. After you prepared the files above, you may need to adjust the name of your dataset in the
config.yml
.
List of required input files
-
Models and scripts as cloned from this GIT repository
-
Annotations in the
annotations/
folder -
CADD-SV scores SV in a coordinate sorted BED format on the GRCh38 genome build. The type of SV needs to be included for each variant in the 4th column. We recommend to split files containing more than 10,000 SVs into smaller files. An example input file can be found in
input/
. The file needs to have the suffixid_
. If you plan to process variants from another genome build or SVs in VCF format, see below.
Running the pipeline
Ready to go! If you run the pipeline on a cluster see the
cluster.json
for an estimate of minimum resource requirements for the individual jobs. Note that this depends on your dataset size so you may have to adjust this.
To start the pipeline:
conda activate run.caddsv
# dry run to see if everything works
snakemake --use-conda --configfile config.yml -j 4 -n
# run the pipeline
snakemake --use-conda --configfile config.yml -j 4
Output files
The pipeline outputs your SV set containing all annotations in BED format in a folder named
output
containing the CADD-SV and two raw scores in rows 6-8.
Further information about individual annotations are kept in a subfolder named after your input dataset.
Further Information
Annotations
CADD-SV integrates different annotations, here some links to its annotation sources. A complete list can be found as Suppl. Table 1 of the manuscript/pre-print.
Integrated Scores
CADD (https://krishna.gs.washington.edu/download/CADD/bigWig/)
LINSIGHT (http://compgen.cshl.edu/LINSIGHT/LINSIGHT.bw)
Species conservation and constraint metrics
PhastCons (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/)
Syntenic regions (http://webclu.bio.wzw.tum.de/cgi-bin/syntenymapper/get-species-list.py)
GERP score (http://mendel.stanford.edu/SidowLab/downloads/gerp/)
Population and disease constraint metrics
pLI score (ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/manuscript_data/forweb_cleaned_exac_r03_march16_z_data_pLI.txt.gz)
Conserved coding regions (CCR) (https://www.nature.com/articles/s41588-018-0294-6?WT.feed_name=subjects_population-genetics)
DDD Happloinsufficiency (https://decipher.sanger.ac.uk/files/downloads/HI_Predictions_Version3.bed.gz)
Epigenetic and regulatory activity
Encode Features such as Histon Modifications and DNase and RNase-seq (https://www.encodeproject.org/help/batch-download/)
GC content (http://hgdownload.cse.ucsc.edu/gbdb/hg38/bbi/gc5BaseBw/gc5Base.bw)
ChromHMM states of ENCODE cell lines (http://compbio.mit.edu/ChromHMM/)
3D genome organization
CTCF (http://genome.cshlp.org/content/suppl/2012/08/28/22.9.1680.DC1/Table_S2_Location_of_ChIP-seq_binding_positions_in_19_cell_lines.txt)
Encode-HiC data (https://www.encodeproject.org/search/?type=Experiment&assay_term_name=HiC&replicates.library.biosample.donor.organism.scientific_name=Homo%20sapiens&status=released)
Enhancer-promoter-links from FOCS (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930446/)
Frequently Interacting Regulatory Elements (FIREs) (https://www.sciencedirect.com/science/article/pii/S2211124716314814)
Directionality index for HiC data from various datasets (https://www.genomegitar.org/processed-data.html)
DeepC saliencies score (http://userweb.molbiol.ox.ac.uk/public/rschwess/container_for_ucsc/data/deepC/saliency_scores/saliencies_merged_gm12878_5kb.bw)
Gene and element annotations
Ensembl-gff3 genebuild 96 (ftp://ftp.ensembl.org/pub/release-96/gff3/homo_sapiens/Homo_sapiens.GRCh38.96.chr.gff3.gz)
Fantom5 enhancers (https://zenodo.org/record/556775#.Xkz3G0oo-70)
Converting VCF and other genome builds
If you want to score SVs in a VCF format or your SVs are not in GRCh38 genomebuild coordinates: We provide an environment to handle this. It uses the SURVIVOR tools (https://github.com/fritzsedlazeck/SURVIVOR).
conda env create -n prepBED --file envs/prepBED.yml
To convert your VCF into BED format run:
conda activate prepBED
SURVIVOR vcftobed input.vcf 0 -1 output.bed
cut -f1,2,6,11 output.bed > beds/set_id.bed
To lift hg19 coordinates to GRCh38 apply the following steps:
conda activate prepBED
liftOver beds/setname_hg19_id.bed dependencies/hg19ToHg38.over.chain.gz beds/setname_id.bed beds/setname_unlifted.bed
Code Snippets
69 70 71 72 | shell: """ cut -f1,2,3 {input} > {output} """ |
82 83 84 85 | shell: """ sed 's/^chr\\|%$//g' {input} > {output} """ |
97 98 99 100 101 102 | shell: """ bedtools merge -i {input.nochr} > {output.nochr} bedtools merge -i {input.wchr} > {output.wchr} """ |
114 115 116 117 | shell: """ bedtools coverage -b {input.anno} -a {input.bed} > {output} """ |
128 129 130 131 | shell: """ bedtools coverage -b {input.anno} -a {input.bed} > {output} """ |
143 144 145 146 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output.t1} """ |
159 160 161 162 | shell: """ tabix {input.anno} -R {input.merg} | awk '{{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \\"\\";"; }}' | gtf2bed - | cut -f1,2,3,8,10 > {output} """ |
174 175 176 177 | shell: """ paste <( bedtools coverage -b <(grep "exon" {input.anno}) -a {input.bed} | cut -f 1,2,3,5 ) <(bedtools coverage -b <(grep "transcript" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "gene" {input.anno} ) -a {input.bed} | cut -f 5) <( bedtools coverage -b <( grep "start_codon" {input.anno} ) -a {input.bed} | cut -f 5 ) <( bedtools coverage -b <(grep "stop_codon" {input.anno}) -a {input.bed} | cut -f 5) <(bedtools coverage -b <(grep "three_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "five_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "CDS" {input.anno}) -a {input.bed} | cut -f 5 ) > {output} """ |
191 192 193 | shell: """ grep "exon" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} |cut -f 1,2,3,9 |paste - <(grep "gene" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9)|paste - <(grep "start_codon" {input.anno} |bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9 ) >> {output} """ |
205 206 207 208 | shell: """ bedtools map -b {input.anno} -a {input.bed} -c 4 -o distinct > {output} """ |
220 221 222 223 | shell: """ Rscript --vanilla scripts/PLIextract.R {input.gn} {input.pli} {output} """ |
235 236 237 238 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy8.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,4,5,5,6,6,7,7,8,8 -o max,sum,max,sum,max,sum,max,sum,max,sum; done < {input.bed}) > {output} """ |
249 250 251 252 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output} """ |
267 268 269 270 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy5_nochr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o max; done < {input.bed}) > {output} """ |
281 282 283 284 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output} """ |
296 297 298 299 300 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk '{{ total+=$4 }}END{{ printf("%f",total) }}'); done < {input.bed}) > {output} """ |
315 316 317 318 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.ep} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
330 331 332 333 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4,4 -o max,min > {output} """ |
341 342 | shell: "paste {input} > {output}" |
354 355 356 357 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
369 370 371 372 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
381 382 | shell: "paste {input} > {output}" |
394 395 396 397 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
408 409 410 411 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4 -o max > {output} """ |
423 424 425 426 427 | shell: """ bedtools map -a {input.bed} \ -b {input.anno} -c 4,4 -o max,min > {output} """ |
435 436 | shell: "paste {input} > {output}" |
445 446 447 448 | shell: """ cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print m}}' > {output.mingg} cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}}' > {output.maxgg}""" |
459 460 461 462 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4 -o mean > {output} """ |
473 474 475 476 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4_nochr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output} """ |
489 490 | shell: "tabix {input.anno} -R {input.merg} | cat annotations/dummy5_nochr.bed - | bedtools map -a {input.bed} -b - -c 4,4 -o max,sum > {output}" |
498 499 | shell: "paste {input} > {output}" |
511 512 513 514 | shell: """ (while read -r line; do bedtools map -b <(cat annotations/dummy_chrhmm.bed; tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 -o max; done < {input.bed}) > {output} """ |
528 529 530 531 | shell: """ bedtools coverage -a {input.bed} -b {input.anno} -counts > {output} """ |
542 543 544 545 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 5 -o max > {output} """ |
556 557 558 559 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk 'BEGIN{{ maxVal="." }}{{ if ((maxVal == ".") || ($4 > maxVal)) {{ maxVal=$4 }} }}END{{ print maxVal }}'); done < {input.bed}) > {output} """ |
595 596 597 598 | shell: """ paste <(cut -f1-11 {input.cadd}) <(cut -f4 {input.cadd2}) <(cut -f4 {input.ccr}) <(cut -f4-28 {input.chromHMM}) <(cut -f4,5,7 {input.ctcf}) <(cut -f1 {input.di_min}) <(cut -f1 {input.di_max}) <(cut -f4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65 {input.encode}) <(cut -f4-7 {input.ep}) <(cut -f4,5,9,10,14,15,19,20,24,25 {input.fire}) <(cut -f4 {input.gc}) <(cut -f4-11 {input.gm}) <(cut -f4 {input.gerp}) <(cut -f4 {input.gerp2}) <(cut -f4,5,6,7,11,12,13,14,18,19,20,21,25,26,27,28 {input.hic}) <(cut -f4-7 {input.hesc}) <(cut -f4-7 {input.microsyn}) <(cut -f4 {input.mpc}) <(cut -f4 {input.pli}) <(cut -f4,5,6 {input.g_dist}) <(cut -f4 {input.remapTF}) <(cut -f4 {input.f5}) <(cut -f4 {input.hi}) <(cut -f4 {input.deepc}) <(cut -f4,5,7 {input.ultrac}) <(cut -f4 {input.linsight})| cat {input.header} - > {output} """ |
616 617 618 619 620 621 | shell: """ bedtools flank -i {input.bed} -g {input.genome} -l 100 -r 0 > {output.wchr} bedtools flank -i {input.bed} -g {input.genome} -l 100 -r 0 | sed 's/^chr\|%$//g' > {output.nchr} """ |
633 634 635 636 637 638 | shell: """ bedtools merge -i {input.nchr} > {output.nchr} bedtools merge -i {input.wchr} > {output.wchr} """ |
650 651 652 653 | shell: """ bedtools coverage -b {input.anno} -a {input.bed} > {output} """ |
664 665 666 667 | shell: """ bedtools coverage -b {input.anno} -a {input.bed} > {output} """ |
679 680 681 682 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output} """ |
695 696 697 698 | shell: """ tabix {input.anno} -R {input.merg} | awk '{{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \\"\\";"; }}' | gtf2bed - | cut -f1,2,3,8,10 > {output} """ |
710 711 712 713 | shell: """ paste <( bedtools coverage -b <(grep "exon" {input.anno}) -a {input.bed} | cut -f 1,2,3,5 ) <(bedtools coverage -b <(grep "transcript" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "gene" {input.anno} ) -a {input.bed} | cut -f 5) <( bedtools coverage -b <( grep "start_codon" {input.anno} ) -a {input.bed} | cut -f 5 ) <( bedtools coverage -b <(grep "stop_codon" {input.anno}) -a {input.bed} | cut -f 5) <(bedtools coverage -b <(grep "three_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "five_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "CDS" {input.anno}) -a {input.bed} | cut -f 5 ) > {output} """ |
727 728 729 | shell: """ grep "exon" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} |cut -f 1,2,3,9 |paste - <(grep "gene" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9)|paste - <(grep "start_codon" {input.anno} |bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9 ) >> {output} """ |
741 742 743 744 | shell: """ bedtools map -b {input.anno} -a {input.bed} -c 4 -o distinct > {output} """ |
756 757 758 759 | shell: """ Rscript --vanilla scripts/PLIextract.R {input.gn} {input.pli} {output} """ |
771 772 773 774 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy8.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,4,5,5,6,6,7,7,8,8 -o max,sum,max,sum,max,sum,max,sum,max,sum; done < {input.bed}) > {output} """ |
785 786 787 788 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output} """ |
804 805 806 807 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy5_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o max; done < {input.bed}) > {output} """ |
818 819 820 821 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output} """ |
833 834 835 836 837 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk '{{ total+=$4 }}END{{ printf("%f",total) }}'); done < {input.bed}) > {output} """ |
852 853 854 855 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.ep} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
867 868 869 870 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4,4 -o max,min > {output} """ |
878 879 | shell: "paste {input} > {output}" |
891 892 893 894 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
906 907 908 909 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
920 921 | shell: "paste {input} > {output}" |
933 934 935 936 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
947 948 949 950 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4 -o max > {output} """ |
962 963 964 965 966 | shell: """ bedtools map -a {input.bed} \ -b {input.anno} -c 4,4 -o max,min > {output} """ |
974 975 | shell: "paste {input} > {output}" |
984 985 986 987 988 | shell: """ cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print m}}' > {output.mingg} cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}}' > {output.maxgg} """ |
999 1000 1001 1002 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4 -o mean > {output} """ |
1013 1014 1015 1016 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output} """ |
1029 1030 | shell: "tabix {input.anno} -R {input.merg} | cat annotations/dummy5_nchr.bed - | bedtools map -a {input.bed} -b - -c 4,4 -o max,sum > {output}" |
1038 1039 | shell: "paste {input} > {output}" |
1051 1052 1053 1054 | shell: """ (while read -r line; do bedtools map -b <(cat annotations/dummy_chrhmm.bed; tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 -o max; done < {input.bed}) > {output} """ |
1068 1069 1070 1071 | shell: """ bedtools coverage -a {input.bed} -b {input.anno} -counts > {output} """ |
1082 1083 1084 1085 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 5 -o max > {output} """ |
1096 1097 1098 1099 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk 'BEGIN{{ maxVal="." }}{{ if ((maxVal == ".") || ($4 > maxVal)) {{ maxVal=$4 }} }}END{{ print maxVal }}'); done < {input.bed}) > {output} """ |
1135 1136 1137 1138 | shell: """ paste <(cut -f1-11 {input.cadd}) <(cut -f4 {input.cadd2}) <(cut -f4 {input.ccr}) <(cut -f4-28 {input.chromHMM}) <(cut -f4,5,7 {input.ctcf}) <(cut -f1 {input.di_min}) <(cut -f1 {input.di_max}) <(cut -f4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65 {input.encode}) <(cut -f4-7 {input.ep}) <(cut -f4,5,9,10,14,15,19,20,24,25 {input.fire}) <(cut -f4 {input.gc}) <(cut -f4-11 {input.gm}) <(cut -f4 {input.gerp}) <(cut -f4 {input.gerp2}) <(cut -f4,5,6,7,11,12,13,14,18,19,20,21,25,26,27,28 {input.hic}) <(cut -f4-7 {input.hesc}) <(cut -f4-7 {input.microsyn}) <(cut -f4 {input.mpc}) <(cut -f4 {input.pli}) <(cut -f4,5,6 {input.g_dist}) <(cut -f4 {input.remapTF}) <(cut -f4 {input.f5}) <(cut -f4 {input.hi}) <(cut -f4 {input.deepc}) <(cut -f4,5,7 {input.ultrac}) <(cut -f4 {input.linsight})| cat {input.header} - > {output} """ |
1156 1157 1158 1159 1160 1161 | shell: """ bedtools flank -i {input.bed} -g {input.genome} -l 0 -r 100 | bedtools sort > {output.wchr} bedtools flank -i {input.bed} -g {input.genome} -l 0 -r 100 | bedtools sort | sed 's/^chr\|%$//g' > {output.nchr} """ |
1173 1174 1175 1176 1177 1178 | shell: """ bedtools merge -i {input.nchr} > {output.nchr} bedtools merge -i {input.wchr} > {output.wchr} """ |
1190 1191 1192 1193 | shell: """ bedtools coverage -b {input.anno} -a {input.bed} > {output} """ |
1204 1205 1206 1207 | shell: """ bedtools coverage -b {input.anno} -a {input.bed} > {output} """ |
1219 1220 1221 1222 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output} """ |
1235 1236 1237 1238 | shell: """ tabix {input.anno} -R {input.merg} | awk '{{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \\"\\";"; }}' | gtf2bed - | cut -f1,2,3,8,10 > {output} """ |
1250 1251 1252 1253 | shell: """ paste <( bedtools coverage -b <(grep "exon" {input.anno}) -a {input.bed} | cut -f 1,2,3,5 ) <(bedtools coverage -b <(grep "transcript" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "gene" {input.anno} ) -a {input.bed} | cut -f 5) <( bedtools coverage -b <( grep "start_codon" {input.anno} ) -a {input.bed} | cut -f 5 ) <( bedtools coverage -b <(grep "stop_codon" {input.anno}) -a {input.bed} | cut -f 5) <(bedtools coverage -b <(grep "three_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "five_prime_utr" {input.anno}) -a {input.bed} | cut -f 5 ) <(bedtools coverage -b <(grep "CDS" {input.anno}) -a {input.bed} | cut -f 5 ) > {output} """ |
1267 1268 1269 | shell: """ grep "exon" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} |cut -f 1,2,3,9 |paste - <(grep "gene" {input.anno} | bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9)|paste - <(grep "start_codon" {input.anno} |bedtools closest -d -t first -b stdin -a {input.bed} | cut -f 9 ) >> {output} """ |
1281 1282 1283 1284 | shell: """ bedtools map -b {input.anno} -a {input.bed} -c 4 -o distinct > {output} """ |
1296 1297 1298 1299 | shell: """ Rscript --vanilla scripts/PLIextract.R {input.gn} {input.pli} {output} """ |
1311 1312 1313 1314 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy8.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,4,5,5,6,6,7,7,8,8 -o max,sum,max,sum,max,sum,max,sum,max,sum; done < {input.bed}) > {output} """ |
1325 1326 1327 1328 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output} """ |
1344 1345 1346 1347 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}')| cat annotations/dummy5_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o max; done < {input.bed}) > {output} """ |
1358 1359 1360 1361 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | wc -l); done < {input.bed}) > {output} """ |
1373 1374 1375 1376 1377 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk '{{ total+=$4 }}END{{ printf("%f",total) }}'); done < {input.bed}) > {output} """ |
1392 1393 1394 1395 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.ep} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
1407 1408 1409 1410 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4,4 -o max,min > {output} """ |
1418 1419 | shell: "paste {input} > {output}" |
1431 1432 1433 1434 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
1446 1447 1448 1449 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
1462 1463 | shell: "paste {input} > {output}" |
1475 1476 1477 1478 | shell: """ bedtools closest -d -t first -a {input.bed} -b {input.hic} | Rscript --vanilla scripts/annotateHIC.R stdin {output.o1} """ |
1489 1490 1491 1492 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4 -o max > {output} """ |
1504 1505 1506 1507 1508 | shell: """ bedtools map -a {input.bed} \ -b {input.anno} -c 4,4 -o max,min > {output} """ |
1516 1517 | shell: "paste {input} > {output}" |
1526 1527 1528 1529 | shell: """ cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print m}}' > {output.mingg} cut -f 4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65,69,70,74,75,79,80,84,85,89,90 {input} | awk '{{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}}' > {output.maxgg}""" |
1540 1541 1542 1543 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 4 -o mean > {output} """ |
1554 1555 1556 1557 | shell: """ (while read -r line; do bedtools map -b <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') | cat annotations/dummy4_nchr.bed - ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4 -o mean; done < {input.bed}) > {output} """ |
1570 1571 | shell: "tabix {input.anno} -R {input.merg} | cat annotations/dummy5_nchr.bed - | bedtools map -a {input.bed} -b - -c 4,4 -o max,sum > {output}" |
1579 1580 | shell: "paste {input} > {output}" |
1592 1593 1594 1595 | shell: """ (while read -r line; do bedtools map -b <(cat annotations/dummy_chrhmm.bed; tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3+1}}') ) -a <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 -o max; done < {input.bed}) > {output} """ |
1609 1610 1611 1612 | shell: """ bedtools coverage -a {input.bed} -b {input.anno} -counts > {output} """ |
1623 1624 1625 1626 | shell: """ bedtools map -a {input.bed} -b {input.anno} -c 5 -o max > {output} """ |
1637 1638 1639 1640 | shell: """ (while read -r line; do paste <(echo $line | awk 'BEGIN{{ OFS="\t" }}{{ print $1,$2,$3}}') <(tabix {input.anno} $(echo $line | awk '{{ print $1":"$2"-"$3}}') | awk 'BEGIN{{ maxVal="." }}{{ if ((maxVal == ".") || ($4 > maxVal)) {{ maxVal=$4 }} }}END{{ print maxVal }}'); done < {input.bed}) > {output} """ |
1676 1677 1678 1679 | shell: """ paste <(cut -f1-11 {input.cadd}) <(cut -f4 {input.cadd2}) <(cut -f4 {input.ccr}) <(cut -f4-28 {input.chromHMM}) <(cut -f4,5,7 {input.ctcf}) <(cut -f1 {input.di_min}) <(cut -f1 {input.di_max}) <(cut -f4,5,9,10,14,15,19,20,24,25,29,30,34,35,39,40,44,45,49,50,54,55,59,60,64,65 {input.encode}) <(cut -f4-7 {input.ep}) <(cut -f4,5,9,10,14,15,19,20,24,25 {input.fire}) <(cut -f4 {input.gc}) <(cut -f4-11 {input.gm}) <(cut -f4 {input.gerp}) <(cut -f4 {input.gerp2}) <(cut -f4,5,6,7,11,12,13,14,18,19,20,21,25,26,27,28 {input.hic}) <(cut -f4-7 {input.hesc}) <(cut -f4-7 {input.microsyn}) <(cut -f4 {input.mpc}) <(cut -f4 {input.pli}) <(cut -f4,5,6 {input.g_dist}) <(cut -f4 {input.remapTF}) <(cut -f4 {input.f5}) <(cut -f4 {input.hi}) <(cut -f4 {input.deepc}) <(cut -f4,5,7 {input.ultrac}) <(cut -f4 {input.linsight})| cat {input.header} - > {output} """ |
1699 1700 1701 1702 | shell: """ Rscript --vanilla scripts/scoring.R {params.name} {input.span} {input.flank_up} {input.flank_down} {output} """ |
1714 1715 1716 1717 | shell: """ bedtools sort -i {input.score} | cat {input.header} - > {output} """ |
Support
- Future updates
Related Workflows





