A repo for the Twist HiFi capture/snakemake workflow


About

This is a Snakemake workflow for analyzing targeted HiFi sequence datasets. The workflow was designed for Twist gene panels sequenced on PacBio Sequel IIe or Revio systems. Learn more in this App Note. Please note that this workflow is still in development; updates to the input/output files and workflow steps may affect the behavior of the program.

Getting started

Dependencies

This workflow uses conda to install dependencies and set up the compute environment. It has been tested with conda v22+ on a SLURM cluster; for a single machine we recommend 16 cores with 4 GB RAM per core (64 GB total memory). Otherwise, use at your own risk.

Installation

# create the base conda environment
$ conda create \
 --channel conda-forge \
 --channel bioconda \
 --prefix ./conda_env \
 python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow

Quick start

We recommend using this GRCh38 reference. Refer to the FAQ for more details on input files and formats.

# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into reference directory
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Run full workflow starting with demultiplexing for batch <batch_name>
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# If you do not have a probes BED, pass <target_bed> again in place of <probe_bed>; this still allows HS metrics to be generated
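
For example, with hypothetical file names (the reference, BED, and movie paths below are placeholders), a first run might look like this. The samtools call assumes samtools is available in your environment; the workflow itself only needs the .fai to exist.

# index the reference if you do not already have reference.fasta.fai (paths are placeholders)
$ samtools faidx reference/reference.fasta
# launch the full workflow for a batch named "batch01"
$ sbatch workflow/run_snakemake.sh batch01 biosamples.csv hifi_reads.bam targets.bed probes.bed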

Understanding output

BAM and VCF files

Haplotagged, aligned BAM files and phased VCF files are available in the whatshap subdirectory for each sample. These files are produced using DeepVariant and WhatsHap. See the FAQ for the detailed directory structure.

If pbsv is used in the pipeline, a VCF of structural variants can be found in the pbsv subdirectory for each sample.
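
As a quick sanity check on a finished sample, WhatsHap can summarize phase blocks directly from the phased VCF, and haplotagged reads carry an HP tag in the BAM. The file names below are illustrative; see FAQ 8 for the actual directory layout.

# summarize phase blocks in a sample's phased VCF (illustrative paths)
$ whatshap stats batches/<batch_name>/<sample_id>/whatshap/<sample_id>.phased.vcf.gz
# count haplotagged reads (HP tag) in the haplotagged BAM
$ samtools view batches/<batch_name>/<sample_id>/whatshap/<sample_id>.haplotagged.bam | grep -c 'HP:i:'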

Demultiplexed output

The demux directory contains unmapped BAM files for each sample, named by barcode ID. The directory also contains all of the output from lima, PacBio's demultiplexing software.
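
To see how many HiFi reads each barcode/sample received, you can count records in the demultiplexed BAMs (assumes samtools is on your PATH; the batch path is a placeholder):

# count reads per demultiplexed BAM
$ for b in batches/<batch_name>/demux/*.bam; do echo -e "$b\t$(samtools view -c $b)"; done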

Target enrichment stats

See the stats directory for many useful tables and summary figures. The file hs_metrics_consolidated_quickview.tsv is a good place to start to understand run performance. Definitions of each statistic can be found here.
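
The quickview file is a plain tab-delimited table, so it can be browsed directly on the command line, for example:

# pretty-print the consolidated quickview table (path shown for a typical batch)
$ column -t -s$'\t' batches/<batch_name>/stats/hs_metrics_consolidated_quickview.tsv | less -S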

read_categories.png gives a graphical summary of read lengths and the percentage of reads in each category (on-target, off-target, unmapped, PCR duplicates, and failed demultiplexing).

FAQ

1) What is the format for the biosample_csv file?

This file provides a mapping of barcodes to samples. It is used to demultiplex samples, trim adapter sequences, and rename files with the sample ID. The file is comma-delimited: the first column is a pair of barcode IDs (forward and reverse, which must match headers in the barcode FASTA) and the second column is a sample ID (which must be unique). The CSV must contain a header row, but the header names can be arbitrary.

Here is an example:

Barcode,Biosample
UDI_1_A01_f--UDI_1_A01_r,HG002
UDI_2_B01_f--UDI_2_B01_r,HG003
UDI_3_C01_f--UDI_3_C01_r,HG004
UDI_4_D01_f--UDI_4_D01_r,HG005
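
Because sample IDs must be unique, a quick check like the following can catch duplicates before launching (the CSV name is a placeholder and a header row is assumed):

# print any duplicated sample IDs in column 2
$ tail -n +2 biosamples.csv | cut -d, -f2 | sort | uniq -d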

2) What is the file format for HiFi reads?

Please use hifi_reads.bam, the native format for PacBio HiFi data, which is an unmapped BAM file ("ubam"). Learn more here.
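
If you want to confirm that a file is an unmapped HiFi BAM before running, the header and flags are easy to inspect (assumes samtools is available; the file name is a placeholder):

# list read groups in the header and count unaligned records (flag 4)
$ samtools view -H hifi_reads.bam | grep '^@RG'
$ samtools view -c -f 4 hifi_reads.bam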

3) Can I change the barcode file?

By default the pipeline uses the TruSeq barcodes from Twist. You can change your barcode file by providing a different FASTA file in workflow/barcodes. For example, the barcode file "Sequel_96_barcodes_v1.IDT_pad.fasta" was used for this dataset and in this preprint. Adjust the path to the barcode file in workflow/config.yaml and make sure you update your biosample_csv file accordingly.
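
The barcode IDs in your biosample_csv must match headers in the barcode FASTA exactly. A sketch of a pre-flight check, using placeholder file names:

# list barcode IDs referenced in the CSV that are missing from the FASTA headers
$ comm -23 \
    <(awk -F, 'NR>1 {split($1,a,"--"); print a[1]; print a[2]}' biosamples.csv | sort -u) \
    <(grep '^>' workflow/barcodes/<your_barcodes>.fasta | sed 's/^>//; s/ .*//' | sort -u)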

4) What is the format of the target.bed file?

This is a standard BED file that specifies the coordinates of the regions of interest. It is used by HS Metrics to compute target coverage, on-target mapping rate, and other statistics.

Here is an example targets.bed:

chr1	155234451	155244627	GBA
chr5	70049638	70078522	SMN2
chr5	70925030	70953942	SMN1
chr22	42126498	42130810	CYP2D6
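
Rows must be tab-delimited, and the QC plots use the fourth column as the target name, so a quick format check is worthwhile (the file name is a placeholder; sort -c only reports whether the file is coordinate-sorted):

# check for coordinate sorting and for rows with fewer than 4 columns
$ sort -k1,1 -k2,2n -c targets.bed
$ awk -F'\t' 'NF < 4 {print "line " NR " has fewer than 4 columns"}' targets.bed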

5) What is a probes.bed file? What if I don't have one?

The probes.bed file specifies the coordinates of the probes used to prepare the target capture library. The file usually contains hundreds or even thousands of probes, depending on the size of the panel. You may not have access to the probe design because it is proprietary; in that case, you can use the target.bed file in place of the probes.bed file (you will pass target.bed twice when you call the workflow command).

Here is an example probes.bed:

chr1	48037726	48037846
chr1	48038919	48039039
chr1	48040111	48040231
chr1	48041516	48041636
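
For example, when no probe design is available, the Quick start command simply repeats the target BED (file names are placeholders):

# no probes.bed: pass the target BED twice so HS metrics can still be computed
$ sbatch workflow/run_snakemake.sh batch01 biosamples.csv hifi_reads.bam targets.bed targets.bed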

6) What software is used in the pipeline?

python v3.9+
conda v22+
lima v2.5+
pbmarkdups v1+
pbmm2 v1.7+
DeepVariant v1.4+
WhatsHap v1.1+
GLnexus v1.4.1+
pbsv v2.8+
htslib v1.13+
hsmetrics v1.3.1
PharmCAT
pangu v0.2.1
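
Most of the tools above report their version on the command line, which can help when checking what conda resolved for a rule environment (these flags are standard for the listed tools, but run them inside the environment that the rule actually uses):

# spot-check tool versions
$ lima --version
$ pbmm2 --version
$ pbsv --version
$ whatshap --version
$ tabix --version | head -1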

7) What are the steps in the workflow?

  1. Demultiplex HiFi reads with lima

  2. Mark PCR-duplicate HiFi reads by sample with pbmarkdups

  3. Align HiFi reads to the reference with pbmm2

  4. Call small variants with DeepVariant v1.4.0

  5. Phase small variants with WhatsHap

  6. Haplotag aligned BAMs with WhatsHap

  7. Call structural variants with pbsv

  8. Jointly call all variants (excluding pbsv) with GLnexus

  9. [optionally] Call PGx star (*) alleles with PharmCAT and pangu

  10. [optionally] Annotate the output gVCF with dbSNP or another database containing variant IDs

  11. Run QC reports, including HsMetrics

8) What is the full directory structure?

.
├── cluster_logs                      # slurm stderr/stdout logs
├── reference
│   ├── reference.chr_lengths.txt     # cut -f1,2 reference.fasta.fai > reference.chr_lengths.txt
│   ├── reference.fasta
│   └── reference.fasta.fai
├── batches
│   └── <batch_name>                  # batch_id
│       ├── benchmarks/               # cpu time per task
│       ├── demux/                    # demultiplexed hifi reads
│       ├── glnexus/                  # intermediate cohort vcf files
│       ├── logs/                     # per-rule stdout/stderr logs
│       ├── picard/                   # interval lists for hsmetrics
│       ├── stats/                    # batch-wide collated reports, including HS Metrics summary
│       ├── whatshap_cohort/          # joint-called, phased SNVs
│       ├── merged_gvcf/              # [optional] annotated gvcf with all batch samples included
│       ├── <sample_id 1>/            # per-sample results, one directory for each sample
│       :   ...
│       └── <sample_id n>/
│           ├── coverage/             # read coverage by target beds
│           ├── deepvariant/          # intermediate DV vcf, incl g.vcf per sample
│           ├── hs_metrics/           # picard hsmetrics for this sample
│           ├── markdup/              # unaligned reads with PCR dups marked
│           ├── pangu/                # [optional] HiFi CYP2D6 star-calling results
│           ├── pbsv/                 # structural variant calls
│           ├── pharmcat/             # [optional] pharmcat results
│           ├── read_metrics/         # per-read information
│           └── whatshap/             # phased small variants; merged haplotagged alignments
│
└── workflow                          # clone of this repo

Detailed run guidance

# create the base conda environment
$ conda create \
 --channel conda-forge \
 --channel bioconda \
 --prefix ./conda_env \
 python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow
# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into the reference directory
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Annotation [optional]
# drop your annotation file and index into annotation
# and adjust the [annotate][variants] path in workflow/config.yaml
# ensure that [annotate][gVCF] is set to True in workflow/config.yaml
# PharmCAT [optional]
# Set [pharmcat][run_analysis] to True in workflow/config.yaml
# run the full workflow including demux/markdup/mapping from a single HiFi movie for batch <batch_name>
# If you do not have a probes BED, pass <target_bed> again in place of <probe_bed>; this still allows HS metrics to be generated
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# run just variant calling and phasing for a set of BAMs already demultiplexed, markdup'd, and mapped in SMRT Link (SL)
# <hifi_reads> can be a directory of bams or a textfile with one bam path per line (fofn)
$ sbatch workflow/run_snakemake_SLmapped.sh <batch_name> <hifi_reads> <target_bed> [<probe_bed>]
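
If you already have mapped, duplicate-marked BAMs from SMRT Link, a file-of-filenames (fofn) for <hifi_reads> is just one BAM path per line; a minimal sketch with placeholder paths:

# build a fofn listing one BAM path per line
$ find /path/to/SL_mapped_bams -name '*.bam' | sort > batch01.fofn
# then run variant calling and phasing only
$ sbatch workflow/run_snakemake_SLmapped.sh batch01 batch01.fofn targets.bed probes.bed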

Code Snippets

shell:
    '''
    (bcftools convert --gvcf2vcf \
     -f {input.reference} \
     -R {params.variants} \
     -Oz \
     -o {output} \
     {input.gvcf}) > {log} 2>&1
    '''
shell:
    '''
    (bcftools annotate -c ID \
                       -R {params.region} \
                       -a {params.variants} \
                       -Oz -o {output} \
                       {input.gvcf}) > {log} 2>&1
    '''
shell:
    '''
    (bcftools merge \
              -R {params.region} \
              -Oz -o {output} \
              {input.vcf} {params.variants}) > {log} 2>&1
    '''
shell:
    '''
    (bcftools merge \
              -R {params.region} \
              -Oz -o {output} \
              {input.gvcfs}) > {log} 2>&1
    '''
shell:
    "tabix {params} {input} > {log} 2>&1"
shell:
    """
    (/opt/deepvariant/bin/make_examples \
        --add_hp_channel \
        --alt_aligned_pileup=diff_channels \
        --min_mapping_quality=1 \
        --parse_sam_aux_fields \
        --partition_size=25000 \
        --max_reads_per_partition=600 \
        --phase_reads \
        --pileup_image_width {params.pileup_image_width} \
        --norealign_reads \
        --sort_by_haplotypes \
        --track_ref_reads \
        --vsc_min_fraction_indels {params.vsc_min_fraction_indels} \
        --mode calling \
        --ref {input.reference} \
        --reads {input.bam} \
        --examples {params.examples} \
        --gvcf {params.gvcf} \
        --task {params.shard}) > {log} 2>&1
    """
shell:
    """
    (echo "CUDA_VISIBLE_DEVICES=" $CUDA_VISIBLE_DEVICES; \
     /opt/deepvariant/bin/call_variants \
        --outfile {output} \
        --examples {params.examples} \
        --checkpoint {params.model}) > {log} 2>&1
    """
shell:
    """
    (/opt/deepvariant/bin/postprocess_variants \
        --ref {input.reference} \
        --infile {input.tfrecord} \
        --outfile {output.vcf} \
        --nonvariant_site_tfrecord_path {params.nonvariant_tfrecs} \
        --gvcf_outfile {output.gvcf}) > {log} 2>&1
    """
shell:
    "(bcftools stats --threads 3 {params} {input} > {output}) > {log} 2>&1"
shell:
    '''
    (mkdir -p {output.odir}
    lima {params.filters} \
         --split-named \
         --dump-removed \
         -j {threads} \
         --log-level {params.loglevel} \
         --biosample-csv {input.biosamples} \
         {input.reads} {input.barcodes} {params.odir}/demultiplex.bam) > {log} 2>&1
    '''        
run:
    missing = sample2barcode.keys() - params.samples
    if len( missing ):
        with open( f"batches/{batch}/demux_FAIL.txt", 'w' ) as ofile:
            for sample in sorted( missing ):
                ofile.write( f'{sample2barcode[sample]},{sample}\n' ) 
shell:
    '''
    (bedtools merge -i <(bedtools slop -b 10000 -i {input.targets} -g {input.chr_len} | sort -k1,1 -k2,2n) \
                    -d {params.distance} -c 4 -o first > {output}) 2> {log}
    '''
shell:
    """
    (rm -rf {output.scratch_dir} && \
    glnexus_cli --threads {threads} \
                --dir {output.scratch_dir} \
                --config DeepVariant_unfiltered \
                --bed {input.bed} \
                {input.gvcf} > {output.bcf}) 2> {log}
    """
shell: 
    '''
    (bcftools view {params} {input} -o {output}) > {log} 2>&1
    '''
shell: 
    '''
    (tabix {params.extra} {input.vcf} {params.region} > {params.vcf} && bgzip {params.vcf}) > {log} 2>&1
    '''
shell:
    """
    (whatshap phase {params.extra} \
        --output {output} \
        --reference {input.reference} \
        {input.vcf} {input.phaseinput}) > {log} 2>&1
    """
shell: 
    '''
    (bcftools concat {params} -o {output} {input.calls}) > {log} 2>&1
    '''
shell:
    """
    (whatshap stats \
        --gtf {output.gtf} \
        --tsv {output.tsv} \
        --block-list {output.blocklist} \
        --chr-lengths {input.chr_lengths} \
        {input.vcf}) > {log} 2>&1
    """
wrapper:
    "v1.10.0/bio/picard/createsequencedictionary"
wrapper:
    "v1.10.0/bio/picard/bedtointervallist"
wrapper:
    "v1.3.1/bio/picard/collecthsmetrics"
shell:
    '''
    awk 'BEGIN     {{ FS=OFS="\\t" }} \
         /METRICS/ {{getline; print;
                     getline; split(FILENAME,s,"/"); $67=s[3]; print}}' {input} \
    | awk 'NR==1 || !/^BAIT_SET/' > {output}
    '''
run:
    import csv
    from operator import itemgetter
    quickgetter = itemgetter(*quickviewColumns)
    with open( input[0], newline='' ) as hsmetrics,\
         open( output[0], 'w', newline='' ) as quickview:
        reader = csv.DictReader( hsmetrics, dialect='unix', delimiter='\t')
        writer = csv.DictWriter( quickview, quickviewColumns, dialect='unix', delimiter='\t', quoting=0 )
        writer.writeheader()
        for row in reader:
            writer.writerow( dict( zip( quickviewColumns, quickgetter(row) ) ) )
shell:
    "(pangu -m capture -p {params.prefix} {input.bam}) > {log} 2>&1"
shell:
    '''
    awk 'BEGIN {{OFS="\t"}} !($2 ~ /\//) {{$2=$2"/[]"}} 1' {input} > {output}
    '''
shell:
    """
    (pbmm2 align --num-threads {threads} \
        --preset {params.preset} \
        --sample {params.sample} \
        --log-level {params.loglevel} \
        {params.extra} \
        {input.reference} \
        {input.query} \
        {output.bam}) > {log} 2>&1
    """
shell:
    """
    (pbsv discover {params.extra} \
        --log-level {params.loglevel} \
        --tandem-repeats {input.tr_bed} \
        {input.bam} {output}) > {log} 2>&1
    """
SnakeMake From line 20 of rules/pbsv.smk
shell:
    """
    (pbsv call {params.extra} \
        --log-level {params.loglevel} \
        --num-threads {threads} \
        {input.reference} {input.svsigs} {output}) > {log} 2>&1
    """
SnakeMake From line 47 of rules/pbsv.smk
shell:
    '''
    (/pharmcat/pharmcat_vcf_preprocessor.py \
        --missing-to-ref \
        -vcf {input.vcf} \
        -refFna {input.reference} \
        -refVcf {params.regions} \
        -bf {params.basefile} \
        -o {params.odir} ) > {log} 2>&1
    '''
shell:
    '''
    (bedtools coverage \
              -sorted \
              -g {input.genome} \
              -f 1 \
              -header \
              -mean \
              -a {input.vcf} \
              -b {input.bam} \
     | ( sed  -u '/^#CHROM/q' ; awk '$11 >= {params.mincov}' | cut -f1-10 ) > {output}) > {log} 2>&1
    '''
shell:
    '''
    (/pharmcat/pharmcat \
        -vcf {input.vcf} \
        -reporterJson \
        -po {input.cyp2d6} \
        -o {params.odir}) > {log} 2>&1
    '''
shell:
    '''
    (pbmarkdup --log-level {params.loglevel} \
               -j {threads} \
               {params.options} \
               {input} {output.markdup}
    samtools index {output.markdup}) > {log} 2>&1
    '''
shell:
    '''
    pbindex {input}
    '''
shell:
    '''
    (bedtools intersect -u -a {input.ensmbl} -b {input.targets} > {output}) > {log} 2>&1
    '''
shell:
    '''
    cp {input} {output}
    '''
shell:
    '''
    samtools view -hbF {params.filter} {input.bam} > {output.filtered}
    #by element count of reads
    bedtools intersect -a {input.bed} -b {output.filtered} -c > {output.by_elem}
    #by read count of elements
    bedtools intersect -a {output.filtered} -b {input.bed} -bed -c \
    | awk -v elem={params.elem} \
          'BEGIN {{ OFS="," ; print "readname,chr,start,stop,"elem }} \
           {{ print $4,$1,$2,$3,$NF }}' \
    | ( sed -u 1q; sort ) > {output.by_read} 
    '''
shell:
    '''
    bedtools intersect -a <(samtools view -hbF {params.filter} {input.bam}) -b {input.bed} > {output.ontarget}
    bedtools genomecov -bga -ibam {output.ontarget} > {output.bg}
    bedtools intersect -a {output.bg} -b {input.bed} -wb \
    | awk -v s={params.sample} \
          'BEGIN {{ OFS = ","; print "sample,chr,start,stop,coverage,target" }} \
          {{ print s,$1,$2,$3,$4,$8 }}' \
    > {output.cov} 
    '''
shell:
    '''
    awk -F, -v s={params.sample} -v low={params.lowcov} \
        'BEGIN {{ OFS=","; print "sample,target,totalBp,lowcovBp,fracLow" }} 
         NR>1 {{ 
                span = $4 - $3; 
                split( $6, t, "_" ); target = t[1];
                total += span; 
                subset[target] += span; 
                if( $5 <= low ) {{ 
                                    lowcov += span; 
                                    lowsubset[target] += span; 
                                  }} 
               }} 
         END {{ 
                for ( tg in subset ) {{ print s, tg, subset[tg], lowsubset[tg], lowsubset[tg]/subset[tg]; }} 
                print s,"all",total,lowcov,lowcov/total; 
             }}' {input} > {output}
    '''
SnakeMake From line 106 of rules/qc_cov.smk
shell:
    '''
    tail -qn1 {input} \
    | awk -F, \
        'BEGIN {{ OFS=","; print "sample,target,totalBp,lowcovBp,fracLow" }} \
        {{ \
            total += $3; \
            lowcov += $4; \
            print; \
        }} \
        END {{ \
                print "Total","allTargets",total,lowcov,lowcov/total;
            }}' \
    > {output}
    '''
SnakeMake From line 135 of rules/qc_cov.smk
shell:
    '''
    samtools view -F {params.filter} {input.bam} \
    | awk 'BEGIN {{ OFS="," ; print "readname,length,qual,ismapped,duplicates,softclip" }} \
           {{ if( match( $0, /ds:i:[0-9]/ ) ) {{ split( substr( $0, RSTART, RLENGTH ),d,":"); dup=d[3] }} \
              else                            {{ dup=0 }} \
           }} ; \
           {{ soft = match( $6, /[0-9]+S/ ) ? \
                     substr( $6, RSTART, RLENGTH-1 ) : \
                     0 \
           }} ; \
           {{ mapped = !and($2,4) }} ; \
           {{ match( $0, /rq:f:[0-9.]+/ ); \
              split( substr( $0, RSTART, RLENGTH ),q,":") ; \
              print $1, length($10), q[3], mapped, dup, soft \
           }}' \
    | ( sed -u 1q; sort) > {output.stats}
    '''
shell:
    ''' 
    bedtools intersect -a <(samtools view -hbF {params.filter} {input.bam}) -b {input.bed} -bed -loj \
    | awk 'BEGIN {{ print "readname,target" }} \
           {{ print $4","$NF }}' \
    | ( sed -u 1q; sort -t, -u -k1,1 ) > {output.target}
    ''' 
shell:
    '''
    samtools view -f 1024 {input} \
    | awk -v s={params.sample} '{{print s","length($10)",Duplicates"}}' \
    > {output}
    '''
shell:
    ''' 
    awk ' BEGIN {{ print "sample,length,category" }}
          {{ print }}' {input} > {output}
    '''
SnakeMake From line 226 of rules/qc_cov.smk
script:
    'scripts/merge_read_metrics.py'
SnakeMake From line 243 of rules/qc_cov.smk
shell:
    '''
    awk 'NR==1 || FNR>1' {input} > {output}
    '''
SnakeMake From line 253 of rules/qc_cov.smk
shell:
    '''
    awk 'NR==1 || FNR>1' {input} > {output}
    '''
SnakeMake From line 265 of rules/qc_cov.smk
shell:
    '''
    awk -v min={params.low} '$NF <= min {{ print $1, $2, $3, $4 ~ /^[0-9]+$/ ? "" : $4 }}' {input} \
    | sort -k1,1 -k2,2n \
    | uniq -c \
    | awk 'BEGIN {{ OFS=","; print "chr,start,stop,target,samplesDropped" }} \
                 {{ print $2,$3,$4,$5,$1 }}' \
    > {output}
    '''
SnakeMake From line 279 of rules/qc_cov.smk
shell:
    '''
    bedtools nuc -fi {input.ref} -bed {input.bed} \
    | awk 'BEGIN {{ OFS=","; print "chr,start,stop,target,frac_gc" }} \
           NR>1  {{ print $1,$2,$3,$4,$(NF-7) }}' \
    > {output}
    '''
script:
    'scripts/plot_read_data.py'
SnakeMake From line 327 of rules/qc_cov.smk
script:
    'scripts/plot_coverage.py'
SnakeMake From line 339 of rules/qc_cov.smk
script:
    'scripts/plot_multi_coverage.py'
SnakeMake From line 351 of rules/qc_cov.smk
script:
    'scripts/plot_multi_reads.py'
SnakeMake From line 375 of rules/qc_cov.smk
script:
    'scripts/plot_read_cats.py'
import pandas as pd

res = pd.concat( [ pd.read_csv( csv, index_col='readname' )
                   for csv in snakemake.input ],
                  axis=1 )\
        .reset_index()
res.insert( 0, 'sample', snakemake.params.sample )
res.to_csv( snakemake.output[0], index=False )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=200

coverage = pd.read_csv( snakemake.input[0] )

order = sorted(coverage.target.unique())
maxCols = int( len(order) ** 0.5 ) + 1

plt.figure(figsize=(40,40))

g = sns.FacetGrid(data=coverage,sharex=False,
                  col='target',col_wrap=maxCols,col_order=order,
                  hue='target')

g.map(plt.plot, 'start','coverage')

g.set_xlabels('chr start pos')
g.set_ylabels('Coverage')

g.savefig(f'{snakemake.params.odir}/coverage_by_target.png', dpi=DPI)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

DPI=400

data = pd.read_csv( snakemake.input[0] )
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)

####
#multi-sample coverage per target
####

order = sorted(data.target.unique())
ncols = ceil(len(order)**.5)

g = sns.FacetGrid(data=data,sharex=False,
                  col='target',col_wrap=ncols,col_order=order,
                  hue='sample')

g.map(plt.plot, 'start','coverage')\
 .set_xlabels('chr start pos')\
 .set_ylabels('Coverage')\
 .add_legend()\
 .savefig(f'{snakemake.params.odir}/multi_coverage_by_target.png',dpi=DPI)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=400

readCsv       = snakemake.input.csv
targetBed     = snakemake.input.bed
targetBuffer  = int(snakemake.params.buffer)
targetPerPlot = int(snakemake.params.targetsPerPanel)
outDir        = snakemake.params.odir

# function for partitioning target sets into readable subplots
def partition(lst, size):
    for i in range(0, len(lst), size):
        yield lst[i:i+size]

targets = pd.read_csv(targetBed,
                      sep='\t',
                      names=['chr','start','stop','target'])\
            .set_index('target')
#Set length of target region
targets['tlength'] = targets.eval('stop - start + 2 * @targetBuffer')

data = pd.read_csv(readCsv)
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)

####
#mean base coverage
####
print('writing mean base cov')
pdata = data.query(' target != "off-target" ')\
            .groupby( ['target','sample'] )\
            .length.sum().reset_index()

pdata['meanBaseCoverage'] = pdata.length / pdata.target.map(targets.tlength)

# assign groups to partition targets
order = sorted(pdata.target.unique())
grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp }
pdata[ 'plotGroup' ] = pdata.target.map(grps)

g = sns.catplot(data=pdata,
                x='target',sharex=False,
                y='meanBaseCoverage',
                col='plotGroup', col_wrap=2,
                kind='box',aspect=2)
g.set_xticklabels(rotation=45);
# remove plotgroup labels
for ax in g.axes.flatten():
    ax.set_title('')
plt.tight_layout()
g.savefig(f'{outDir}/mean_base_coverage.png',dpi=DPI)
plt.clf()

g = sns.catplot(data=pdata,
                x='target', sharex=False,
                y='meanBaseCoverage',
                col='plotGroup', col_wrap=2,
                hue='sample',
                kind='strip',aspect=2 )
                #facet_kws=dict(legend_out=True))
g.set_xticklabels(rotation=45);
for ax in g.axes.flatten():
    ax.set_title('')
sns.move_legend(g, "upper left", bbox_to_anchor=(1, 0.75), frameon=False)
plt.tight_layout()
g.savefig(f'{outDir}/mean_base_coverage_by_sample.png',dpi=DPI)
plt.clf()


####
#dedup rate by target
####
print('writing dedup_rate')
def dedup_rate(d):
    dups = d.duplicates.sum()
    return  dups / ( dups + len(d) ) 


pdata = data.groupby(['target','sample'])\
            .apply(dedup_rate)\
            .rename('Duplication Rate')\
            .reset_index()
order = sorted(pdata.target.unique())
grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp }
pdata[ 'plotGroup' ] = pdata.target.map(grps)

g = sns.catplot(data=pdata,
                x='target', sharex=False,
                col='plotGroup',col_wrap=2,
                y='Duplication Rate',
                kind='box',aspect=2)
g.set_xticklabels(rotation=45);
for ax in g.axes.flatten():
    ax.set_title('')
plt.tight_layout()
g.savefig(f'{outDir}/dedup_rate_by_target.png',dpi=DPI)
plt.clf()

####
#readlength by target
####
#TODO better represent dups by replicating length dup times
print('writing dedup_length')
pdata = data

order = sorted(pdata.target.unique())
grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp }
pdata[ 'plotGroup' ] = pdata.target.map(grps)
g = sns.catplot( data=pdata,
                 x='target', sharex=False,
                 col='plotGroup',col_wrap=2,
                 y='length',
                 kind='violin',aspect=2)
g.set_xticklabels(rotation=45);
for ax in g.axes.flatten():
    ax.set_title('')
plt.tight_layout()
g.savefig(f'{outDir}/dedup_length_by_target.png',dpi=DPI)
plt.clf()

####
#readlength by target
####
print('writing readlength by target')
pdata = data
wrap  = int( len(order) ** 0.5 + 1 )         
g = sns.FacetGrid(data=pdata,
                  hue='sample',
                  col='target',col_wrap=wrap,
                  col_order=order,sharey=False)
order = sorted(pdata.target.unique())

g.map(sns.kdeplot,'length')
g.set_xlabels('Read Length')
g.set_ylabels('HiFi Reads')
g.add_legend()
g.savefig(f'{outDir}/readlength_hist_by_target.png',dpi=DPI)
plt.clf()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

allDemuxReads = pd.read_csv( snakemake.input.readCsv )
limaReport    = pd.read_csv( snakemake.input.lima, sep='\t', usecols=['ReadLengths','PassedFilters'] )
dups          = pd.read_csv( snakemake.input.dups, usecols=['sample','length','category'] )
outdir        = snakemake.params.odir

#load data
onTarget  = allDemuxReads.query('target != "."')\
                         .assign(category='OnTarget Unique')[['sample','category','length']]

offTarget = allDemuxReads.query('target == "." and ismapped==1')\
                         .assign(category='OffTarget Unique')[['sample','category','length']]

unmapped  = allDemuxReads.query('ismapped==0')\
                         .assign(category='Unmapped Unique')[['sample','category','length']]

noDemux   = limaReport.query('PassedFilters==0')\
                      .assign(category='Failed Demux')\
                      .drop('PassedFilters',axis=1).rename(columns={'ReadLengths':'length'})

#merge
allData = pd.concat([onTarget,offTarget,unmapped,dups,noDemux],ignore_index=True)

#plot
hue_order=['OnTarget Unique','OffTarget Unique','Unmapped Unique','Duplicates','Failed Demux']


f, (ax0, ax1) = plt.subplots(1, 2, 
                             figsize=(10,8),
                             gridspec_kw={'width_ratios': [1, 5]})

#histogram
sns.histplot(data=allData,
             x='length',
             hue_order=hue_order,
             hue='category',
             multiple='stack',
             binrange=(0,15000),
             ax=ax1)

#side plot
data = allData.groupby('category').size().reindex(hue_order[::-1]).rename('reads').reset_index()
data['yieldFrac'] = data.reads.transform(lambda x:x/x.sum()).fillna(0)

bottom=0
colors = sns.color_palette()[:5][::-1]

for i,row in data.iterrows():
    ax0.bar(1,row.yieldFrac,bottom=bottom,color=colors[i],alpha=0.75)
    ax0.annotate(f'{row.yieldFrac:.2f}',(1,bottom + .5*row.yieldFrac),ha='center',va='center')
    bottom += row.yieldFrac

ax0.set_xticks([])
ax0.set_ylabel('Total Fraction')

plt.tight_layout()

f.savefig(f'{outdir}/read_categories.png',dpi=400)

# write readlength legnth stats table
allData.fillna('noSampleData')\
       .groupby(['category','sample'])\
       .length.describe().dropna().astype(int)\
       .to_csv(f'{outdir}/read_length_by_sample.csv')
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=200

readCsv      = snakemake.input.csv
targetBed    = snakemake.input.bed
targetBuffer = int(snakemake.params.buffer)
outDir       = snakemake.params.odir


targets = pd.read_csv(targetBed,
                      sep='\t',
                      names=['chr','start','stop','target'])\
            .set_index('target')
#Set length of target region
targets['tlength'] = targets.eval('stop - start + 2 * @targetBuffer')

data = pd.read_csv(readCsv)
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)


####
#mean base coverage
####

pdata = data.query(' target != @ot ')\
            .groupby( 'target' )\
            .length.sum().reset_index()

try:
    pdata['meanBaseCoverage'] = pdata.length / pdata.target.map(targets.tlength)
except pd.errors.InvalidIndexError as e:
    print(f'Error: Is your target name list unique?\n\n{e}')
    sys.exit()

order = sorted(pdata.target)
g = sns.catplot(data=pdata,
                x='target',order=order,
                y='meanBaseCoverage',
                kind='bar',aspect=2)
plt.xticks(rotation=45)
g.savefig(f'{outDir}/mean_base_coverage_by_target.png',dpi=DPI)
plt.clf()

###
#sample readlength
###
pdata = data.query( ' target != @ot ')

g = sns.displot(pdata.length,kind='hist',kde=True)
g.savefig(f'{outDir}/readlength_hist.png',dpi=DPI)
plt.clf()

####
#readlength by target
###

pdata = data

order = sorted(pdata.target.unique())
g = sns.FacetGrid(data=pdata,
                  xlim=(0,10000),
                  col='target',col_wrap=7,
                  col_order=order,
                  hue='sample',
                  sharey=False)

g.map(sns.kdeplot,'length')

g.set_xlabels('Read Length')
g.set_ylabels('HiFi Reads')
g.add_legend()
g.savefig(f'{outDir}/readlength_hist_by_target.png',dpi=DPI)
plt.clf()

####
#readlength by target
####
pdata = data

order = sorted(pdata.target.unique())
g = sns.catplot( data=pdata,
                 x='target',
                 y='length',
                 order=order,
                 kind='violin',
                 aspect=2)
plt.xticks(rotation=45);
g.savefig(f'{outDir}/readlength_violins_by_target.png',dpi=DPI)
plt.clf()

####
#dedup count by target
####
pdata = data

order = sorted(pdata.target.unique())

g = sns.catplot( data=pdata,
                 x='target',
                 y='duplicates',
                 order=order,
                 kind='box',aspect=2)
plt.xticks(rotation=45);
g.savefig(f'{outDir}/dedup_count_by_target.png',dpi=DPI)
plt.clf()


####
#dedup rate by target
####
def dedup_rate(d):
    dups = d.duplicates.sum()
    return  dups / ( dups + len(d) ) 


pdata = data.groupby('target')\
            .apply(dedup_rate)\
            .rename('Duplication Rate')\
            .reset_index()

order = sorted(pdata.target)
g = sns.catplot(data=pdata,
                x='target',order=order,
                y='Duplication Rate',
                kind='bar',aspect=2)
plt.xticks(rotation=45)
g.savefig(f'{outDir}/dedup_rate_by_target.png',dpi=DPI)
plt.clf()

####
#rq by target
###
pdata = data

order = sorted(pdata.target.unique())

g = sns.catplot( data=pdata,
                 x='target',
                 y='qual',
                 order=order,
                 kind='box',aspect=2)
plt.xticks(rotation=45)
g.savefig(f'{outDir}/readqual_by_target.png',dpi=DPI)
plt.clf()

###
#off target length
###
pdata = data.query( ' target == @ot ')

g = sns.displot(pdata.length,kind='hist',kde=True)
g.savefig(f'{outDir}/off-target_length_hist.png',dpi=DPI)
shell:
    """
    (whatshap phase {params.extra} \
        --output {output} \
        --reference {input.reference} \
        {input.vcf} \
        {input.phaseinput}) > {log} 2>&1
    """
shell:
    """
    (whatshap stats \
        --gtf {output.gtf} \
        --tsv {output.tsv} \
        --block-list {output.blocklist} \
        --chr-lengths {input.chr_lengths} \
        {input.vcf}) > {log} 2>&1
    """
shell:
    """
    (whatshap haplotag {params} \
        --output {output} \
        --reference {input.reference} \
        {input.vcf} {input.bam}) > {log} 2>&1
    """
shell:
    "(samtools index -@ 3 {input}) > {log} 2>&1"
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "[email protected]"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard BedToIntervalList"
        " {java_opts} {extra}"
        " --INPUT {snakemake.input.bed}"
        " --SEQUENCE_DICTIONARY {snakemake.input.dict}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output}"
        " {log}"
    )
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard CreateSequenceDictionary"
        " {java_opts} {extra}"
        " --REFERENCE {snakemake.input[0]}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output[0]}"
        " {log}"
    )
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard CollectHsMetrics"
        " {java_opts} {extra}"
        " --INPUT {snakemake.input.bam}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output[0]}"
        " --REFERENCE_SEQUENCE {snakemake.input.reference}"
        " --BAIT_INTERVALS {snakemake.input.bait_intervals}"
        " --TARGET_INTERVALS {snakemake.input.target_intervals}"
        " {log}"
    )
