A repo for the Twist HiFi capture/snakemake workflow


About

This is a Snakemake workflow for analyzing targeted HiFi sequence datasets. The workflow was designed for Twist gene panels sequenced on PacBio Sequel IIe or Revio systems. Learn more in this App Note. Please note that this workflow is still in development; updates to the input/output files and workflow steps may affect the behavior of the program.

Getting started

Dependencies

This workflow uses conda to install dependencies and set up the compute environment. It has been tested with conda v22+ on a SLURM cluster; for a single machine we recommend 16 cores with 4 GB RAM per core (64 GB total memory). Otherwise, use at your own risk.

Installation

# create the base conda environment
$ conda create \
 --channel conda-forge \
 --channel bioconda \
 --prefix ./conda_env \
 python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow

Quick start

We recommend using this GRCh38 reference. Refer to the FAQ for more details on input files and formats.

# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into reference directory
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Run full workflow starting with demultiplexing for batch <batch_name>
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# If you do not have a probes BED, pass <target_bed> again in place of <probe_bed>; this still allows HS metrics to be generated
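
For example, with hypothetical file names (the reference, BED, and movie paths below are placeholders), a first run might look like this. The samtools call assumes samtools is available in your environment; the workflow itself only needs the .fai to exist.

# index the reference if you do not already have reference.fasta.fai (paths are placeholders)
$ samtools faidx reference/reference.fasta
# launch the full workflow for a batch named "batch01"
$ sbatch workflow/run_snakemake.sh batch01 biosamples.csv hifi_reads.bam targets.bed probes.bed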

Understanding output

BAM and VCF files

Haplotagged, aligned BAM files and phased VCF files are available in the whatshap subdirectory for each sample. These files are produced using DeepVariant and WhatsHap. See the FAQ for the detailed directory structure.

If pbsv is used in the pipeline, a VCF of structural variants can be found in the pbsv subdirectory for each sample.
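
As a quick sanity check on a finished sample, WhatsHap can summarize phase blocks directly from the phased VCF, and haplotagged reads carry an HP tag in the BAM. The file names below are illustrative; see FAQ 8 for the actual directory layout.

# summarize phase blocks in a sample's phased VCF (illustrative paths)
$ whatshap stats batches/<batch_name>/<sample_id>/whatshap/<sample_id>.phased.vcf.gz
# count haplotagged reads (HP tag) in the haplotagged BAM
$ samtools view batches/<batch_name>/<sample_id>/whatshap/<sample_id>.haplotagged.bam | grep -c 'HP:i:'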

Demultiplexed output

The demux directory contains unmapped BAM files for each sample, named by barcode ID. The directory also contains all of the output from lima, PacBio's demultiplexing software.
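
To see how many HiFi reads each barcode/sample received, you can count records in the demultiplexed BAMs (assumes samtools is on your PATH; the batch path is a placeholder):

# count reads per demultiplexed BAM
$ for b in batches/<batch_name>/demux/*.bam; do echo -e "$b\t$(samtools view -c $b)"; done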

Target enrichment stats

See the stats directory for many useful tables and summary figures. The file hs_metrics_consolidated_quickview.tsv is a good place to start to understand run performance. Definitions of each statistic can be found here.
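
The quickview file is a plain tab-delimited table, so it can be browsed directly on the command line, for example:

# pretty-print the consolidated quickview table (path shown for a typical batch)
$ column -t -s$'\t' batches/<batch_name>/stats/hs_metrics_consolidated_quickview.tsv | less -S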

read_categories.png gives a graphical summary of read lengths and the percentage of reads in each category (on-target, off-target, unmapped, PCR duplicates, and failed demultiplexing).

FAQ

1) What is the format for the biosample_csv file?

This file provides a mapping of barcodes to samples. It is used to demultiplex samples, trim adapter sequences, and rename files with the sample ID. The file is comma-delimited: the first column is a pair of barcode IDs (forward and reverse, which must match headers in the barcode FASTA) and the second column is a sample ID (which must be unique). The CSV must contain a header row, but the header names can be arbitrary.

Here is an example:

Barcode,Biosample
UDI_1_A01_f--UDI_1_A01_r,HG002
UDI_2_B01_f--UDI_2_B01_r,HG003
UDI_3_C01_f--UDI_3_C01_r,HG004
UDI_4_D01_f--UDI_4_D01_r,HG005
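
Because sample IDs must be unique, a quick check like the following can catch duplicates before launching (the CSV name is a placeholder and a header row is assumed):

# print any duplicated sample IDs in column 2
$ tail -n +2 biosamples.csv | cut -d, -f2 | sort | uniq -d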

2) What is the file format for HiFi reads?

Please use hifi_reads.bam, the native format for PacBio HiFi data, which is an unmapped BAM file ("ubam"). Learn more here.
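
If you want to confirm that a file is an unmapped HiFi BAM before running, the header and flags are easy to inspect (assumes samtools is available; the file name is a placeholder):

# list read groups in the header and count unaligned records (flag 4)
$ samtools view -H hifi_reads.bam | grep '^@RG'
$ samtools view -c -f 4 hifi_reads.bam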

3) Can I change the barcode file?

By default the pipeline uses the TruSeq barcodes from Twist. You can change your barcode file by providing a different FASTA file in workflow/barcodes. For example, the barcode file "Sequel_96_barcodes_v1.IDT_pad.fasta" was used for this dataset and in this preprint. Adjust the path to the barcode file in workflow/config.yaml and make sure you update your biosample_csv file accordingly.
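
The barcode IDs in your biosample_csv must match headers in the barcode FASTA exactly. A sketch of a pre-flight check, using placeholder file names:

# list barcode IDs referenced in the CSV that are missing from the FASTA headers
$ comm -23 \
    <(awk -F, 'NR>1 {split($1,a,"--"); print a[1]; print a[2]}' biosamples.csv | sort -u) \
    <(grep '^>' workflow/barcodes/<your_barcodes>.fasta | sed 's/^>//; s/ .*//' | sort -u)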

4) What is the format of the target.bed file?

This is a standard BED file that specifies the coordinates of the regions of interest. It is used by HS Metrics to compute target coverage, on-target mapping rate, and other statistics.

Here is an example targets.bed:

chr1	155234451	155244627	GBA
chr5	70049638	70078522	SMN2
chr5	70925030	70953942	SMN1
chr22	42126498	42130810	CYP2D6
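
Rows must be tab-delimited, and the QC plots use the fourth column as the target name, so a quick format check is worthwhile (the file name is a placeholder; sort -c only reports whether the file is coordinate-sorted):

# check for coordinate sorting and for rows with fewer than 4 columns
$ sort -k1,1 -k2,2n -c targets.bed
$ awk -F'\t' 'NF < 4 {print "line " NR " has fewer than 4 columns"}' targets.bed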

5) What is a probes.bed file? What if I don't have one?

The probes.bed file specifies the coordinates of the probes used to prepare the target capture library. The file usually contains hundreds or even thousands of probes, depending on the size of the panel. You may not have access to the probe design because it is proprietary; in that case, you can use the target.bed file in place of the probes.bed file (you will pass target.bed twice when you call the workflow command).

Here is an example probes.bed:

chr1	48037726	48037846
chr1	48038919	48039039
chr1	48040111	48040231
chr1	48041516	48041636
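
For example, when no probe design is available, the Quick start command simply repeats the target BED (file names are placeholders):

# no probes.bed: pass the target BED twice so HS metrics can still be computed
$ sbatch workflow/run_snakemake.sh batch01 biosamples.csv hifi_reads.bam targets.bed targets.bed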

6) What software is used in the pipeline?

python v3.9+
conda v22+
lima v2.5+
pbmarkdups v1+
pbmm2 v1.7+
DeepVariant v1.4+
WhatsHap v1.1+
GLnexus v1.4.1+
pbsv v2.8+
htslib v1.13+
hsmetrics v1.3.1
PharmCAT
pangu v0.2.1
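
Most of the tools above report their version on the command line, which can help when checking what conda resolved for a rule environment (these flags are standard for the listed tools, but run them inside the environment that the rule actually uses):

# spot-check tool versions
$ lima --version
$ pbmm2 --version
$ pbsv --version
$ whatshap --version
$ tabix --version | head -1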

7) What are the steps in the workflow?

  1. Demultiplex HiFi reads with lima

  2. Mark PCR-duplicate HiFi reads by sample with pbmarkdups

  3. Align HiFi reads to the reference with pbmm2

  4. Call small variants with DeepVariant v1.4.0

  5. Phase small variants with WhatsHap

  6. Haplotag aligned BAMs with WhatsHap

  7. Call structural variants with pbsv

  8. Jointly call all variants (excluding pbsv) with GLnexus

  9. [optionally] Call PGx star (*) alleles with PharmCAT and pangu

  10. [optionally] Annotate the output gVCF with dbSNP or another database containing variant IDs

  11. Run QC reports, including HsMetrics

8) What is the full directory structure?

.
├── cluster_logs                      # slurm stderr/stdout logs
├── reference
│   ├── reference.chr_lengths.txt     # cut -f1,2 reference.fasta.fai > reference.chr_lengths.txt
│   ├── reference.fasta
│   └── reference.fasta.fai
├── batches
│   └── <batch_name>                  # batch_id
│       ├── benchmarks/               # cpu time per task
│       ├── demux/                    # demultiplexed hifi reads
│       ├── glnexus/                  # intermediate cohort vcf files
│       ├── logs/                     # per-rule stdout/stderr logs
│       ├── picard/                   # interval lists for hsmetrics
│       ├── stats/                    # batch-wide collated reports, including HS Metrics summary
│       ├── whatshap_cohort/          # joint-called, phased SNVs
│       ├── merged_gvcf/              # [optional] annotated gvcf with all batch samples included
│       ├── <sample_id 1>/            # per-sample results, one directory for each sample
│       :   ...
│       └── <sample_id n>/
│           ├── coverage/             # read coverage by target beds
│           ├── deepvariant/          # intermediate DV vcf, incl g.vcf per sample
│           ├── hs_metrics/           # picard hsmetrics for this sample
│           ├── markdup/              # unaligned reads with PCR dups marked
│           ├── pangu/                # [optional] HiFi CYP2D6 star-calling results
│           ├── pbsv/                 # structural variant calls
│           ├── pharmcat/             # [optional] pharmcat results
│           ├── read_metrics/         # per-read information
│           └── whatshap/             # phased small variants; merged haplotagged alignments
│
└── workflow                          # clone of this repo

Detailed run guidance

# create the base conda environment
$ conda create \
 --channel conda-forge \
 --channel bioconda \
 --prefix ./conda_env \
 python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow
# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into the reference directory
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Annotation [optional]
# drop your annotation file and index into annotation
# and adjust the [annotate][variants] path in workflow/config.yaml
# ensure that [annotate][gVCF] is set to True in workflow/config.yaml
# PharmCAT [optional]
# Set [pharmcat][run_analysis] to True in workflow/config.yaml
# run the full workflow including demux/markdup/mapping from a single HiFi movie for batch <batch_name>
# If you do not have a probes BED, pass <target_bed> again in place of <probe_bed>; this still allows HS metrics to be generated
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# run just variant calling and phasing for a set of BAMs already demultiplexed, markdup'd, and mapped in SMRT Link (SL)
# <hifi_reads> can be a directory of bams or a textfile with one bam path per line (fofn)
$ sbatch workflow/run_snakemake_SLmapped.sh <batch_name> <hifi_reads> <target_bed> [<probe_bed>]
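
If you already have mapped, duplicate-marked BAMs from SMRT Link, a file-of-filenames (fofn) for <hifi_reads> is just one BAM path per line; a minimal sketch with placeholder paths:

# build a fofn listing one BAM path per line
$ find /path/to/SL_mapped_bams -name '*.bam' | sort > batch01.fofn
# then run variant calling and phasing only
$ sbatch workflow/run_snakemake_SLmapped.sh batch01 batch01.fofn targets.bed probes.bed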

Code Snippets

shell:
    '''
    (bcftools convert --gvcf2vcf \
     -f {input.reference} \
     -R {params.variants} \
     -Oz \
     -o {output} \
     {input.gvcf}) > {log} 2>&1
    '''
shell:
    '''
    (bcftools annotate -c ID \
                       -R {params.region} \
                       -a {params.variants} \
                       -Oz -o {output} \
                       {input.gvcf}) > {log} 2>&1
    '''
shell:
    '''
    (bcftools merge \
              -R {params.region} \
              -Oz -o {output} \
              {input.vcf} {params.variants}) > {log} 2>&1
    '''
shell:
    '''
    (bcftools merge \
              -R {params.region} \
              -Oz -o {output} \
              {input.gvcfs}) > {log} 2>&1
    '''
shell:
    "tabix {params} {input} > {log} 2>&1"
shell:
    """
    (/opt/deepvariant/bin/make_examples \
        --add_hp_channel \
        --alt_aligned_pileup=diff_channels \
        --min_mapping_quality=1 \
        --parse_sam_aux_fields \
        --partition_size=25000 \
        --max_reads_per_partition=600 \
        --phase_reads \
        --pileup_image_width {params.pileup_image_width} \
        --norealign_reads \
        --sort_by_haplotypes \
        --track_ref_reads \
        --vsc_min_fraction_indels {params.vsc_min_fraction_indels} \
        --mode calling \
        --ref {input.reference} \
        --reads {input.bam} \
        --examples {params.examples} \
        --gvcf {params.gvcf} \
        --task {params.shard}) > {log} 2>&1
    """
shell:
    """
    (echo "CUDA_VISIBLE_DEVICES=" $CUDA_VISIBLE_DEVICES; \
     /opt/deepvariant/bin/call_variants \
        --outfile {output} \
        --examples {params.examples} \
        --checkpoint {params.model}) > {log} 2>&1
    """
shell:
    """
    (/opt/deepvariant/bin/postprocess_variants \
        --ref {input.reference} \
        --infile {input.tfrecord} \
        --outfile {output.vcf} \
        --nonvariant_site_tfrecord_path {params.nonvariant_tfrecs} \
        --gvcf_outfile {output.gvcf}) > {log} 2>&1
    """
shell:
    "(bcftools stats --threads 3 {params} {input} > {output}) > {log} 2>&1"
shell:
    '''
    (mkdir -p {output.odir}
    lima {params.filters} \
         --split-named \
         --dump-removed \
         -j {threads} \
         --log-level {params.loglevel} \
         --biosample-csv {input.biosamples} \
         {input.reads} {input.barcodes} {params.odir}/demultiplex.bam) > {log} 2>&1
    '''        
run:
    missing = sample2barcode.keys() - params.samples
    if len( missing ):
        with open( f"batches/{batch}/demux_FAIL.txt", 'w' ) as ofile:
            for sample in sorted( missing ):
                ofile.write( f'{sample2barcode[sample]},{sample}\n' ) 
shell:
    '''
    (bedtools merge -i <(bedtools slop -b 10000 -i {input.targets} -g {input.chr_len} | sort -k1,1 -k2,2n) \
                    -d {params.distance} -c 4 -o first > {output}) 2> {log}
    '''
shell:
    """
    (rm -rf {output.scratch_dir} && \
    glnexus_cli --threads {threads} \
                --dir {output.scratch_dir} \
                --config DeepVariant_unfiltered \
                --bed {input.bed} \
                {input.gvcf} > {output.bcf}) 2> {log}
    """
shell: 
    '''
    (bcftools view {params} {input} -o {output}) > {log} 2>&1
    '''
shell: 
    '''
    (tabix {params.extra} {input.vcf} {params.region} > {params.vcf} && bgzip {params.vcf}) > {log} 2>&1
    '''
shell:
    """
    (whatshap phase {params.extra} \
        --output {output} \
        --reference {input.reference} \
        {input.vcf} {input.phaseinput}) > {log} 2>&1
    """
shell: 
    '''
    (bcftools concat {params} -o {output} {input.calls}) > {log} 2>&1
    '''
shell:
    """
    (whatshap stats \
        --gtf {output.gtf} \
        --tsv {output.tsv} \
        --block-list {output.blocklist} \
        --chr-lengths {input.chr_lengths} \
        {input.vcf}) > {log} 2>&1
    """
wrapper:
    "v1.10.0/bio/picard/createsequencedictionary"
wrapper:
    "v1.10.0/bio/picard/bedtointervallist"
wrapper:
    "v1.3.1/bio/picard/collecthsmetrics"
shell:
    '''
    awk 'BEGIN     {{ FS=OFS="\\t" }} \
         /METRICS/ {{getline; print;
                     getline; split(FILENAME,s,"/"); $67=s[3]; print}}' {input} \
    | awk 'NR==1 || !/^BAIT_SET/' > {output}
    '''
run:
    import csv
    from operator import itemgetter
    quickgetter = itemgetter(*quickviewColumns)
    with open( input[0], newline='' ) as hsmetrics,\
         open( output[0], 'w', newline='' ) as quickview:
        reader = csv.DictReader( hsmetrics, dialect='unix', delimiter='\t')
        writer = csv.DictWriter( quickview, quickviewColumns, dialect='unix', delimiter='\t', quoting=0 )
        writer.writeheader()
        for row in reader:
            writer.writerow( dict( zip( quickviewColumns, quickgetter(row) ) ) )
shell:
    "(pangu -m capture -p {params.prefix} {input.bam}) > {log} 2>&1"
shell:
    '''
    awk 'BEGIN {{OFS="\t"}} !($2 ~ /\//) {{$2=$2"/[]"}} 1' {input} > {output}
    '''
shell:
    """
    (pbmm2 align --num-threads {threads} \
        --preset {params.preset} \
        --sample {params.sample} \
        --log-level {params.loglevel} \
        {params.extra} \
        {input.reference} \
        {input.query} \
        {output.bam}) > {log} 2>&1
    """
shell:
    """
    (pbsv discover {params.extra} \
        --log-level {params.loglevel} \
        --tandem-repeats {input.tr_bed} \
        {input.bam} {output}) > {log} 2>&1
    """
SnakeMake From line 20 of rules/pbsv.smk
shell:
    """
    (pbsv call {params.extra} \
        --log-level {params.loglevel} \
        --num-threads {threads} \
        {input.reference} {input.svsigs} {output}) > {log} 2>&1
    """
SnakeMake From line 47 of rules/pbsv.smk
shell:
    '''
    (/pharmcat/pharmcat_vcf_preprocessor.py \
        --missing-to-ref \
        -vcf {input.vcf} \
        -refFna {input.reference} \
        -refVcf {params.regions} \
        -bf {params.basefile} \
        -o {params.odir} ) > {log} 2>&1
    '''
shell:
    '''
    (bedtools coverage \
              -sorted \
              -g {input.genome} \
              -f 1 \
              -header \
              -mean \
              -a {input.vcf} \
              -b {input.bam} \
     | ( sed  -u '/^#CHROM/q' ; awk '$11 >= {params.mincov}' | cut -f1-10 ) > {output}) > {log} 2>&1
    '''
shell:
    '''
    (/pharmcat/pharmcat \
        -vcf {input.vcf} \
        -reporterJson \
        -po {input.cyp2d6} \
        -o {params.odir}) > {log} 2>&1
    '''
shell:
    '''
    (pbmarkdup --log-level {params.loglevel} \
               -j {threads} \
               {params.options} \
               {input} {output.markdup}
    samtools index {output.markdup}) > {log} 2>&1
    '''
shell:
    '''
    pbindex {input}
    '''
shell:
    '''
    (bedtools intersect -u -a {input.ensmbl} -b {input.targets} > {output}) > {log} 2>&1
    '''
shell:
    '''
    cp {input} {output}
    '''
shell:
    '''
    samtools view -hbF {params.filter} {input.bam} > {output.filtered}
    #by element count of reads
    bedtools intersect -a {input.bed} -b {output.filtered} -c > {output.by_elem}
    #by read count of elements
    bedtools intersect -a {output.filtered} -b {input.bed} -bed -c \
    | awk -v elem={params.elem} \
          'BEGIN {{ OFS="," ; print "readname,chr,start,stop,"elem }} \
           {{ print $4,$1,$2,$3,$NF }}' \
    | ( sed -u 1q; sort ) > {output.by_read} 
    '''
shell:
    '''
    bedtools intersect -a <(samtools view -hbF {params.filter} {input.bam}) -b {input.bed} > {output.ontarget}
    bedtools genomecov -bga -ibam {output.ontarget} > {output.bg}
    bedtools intersect -a {output.bg} -b {input.bed} -wb \
    | awk -v s={params.sample} \
          'BEGIN {{ OFS = ","; print "sample,chr,start,stop,coverage,target" }} \
          {{ print s,$1,$2,$3,$4,$8 }}' \
    > {output.cov} 
    '''
shell:
    '''
    awk -F, -v s={params.sample} -v low={params.lowcov} \
        'BEGIN {{ OFS=","; print "sample,target,totalBp,lowcovBp,fracLow" }} 
         NR>1 {{ 
                span = $4 - $3; 
                split( $6, t, "_" ); target = t[1];
                total += span; 
                subset[target] += span; 
                if( $5 <= low ) {{ 
                                    lowcov += span; 
                                    lowsubset[target] += span; 
                                  }} 
               }} 
         END {{ 
                for ( tg in subset ) {{ print s, tg, subset[tg], lowsubset[tg], lowsubset[tg]/subset[tg]; }} 
                print s,"all",total,lowcov,lowcov/total; 
             }}' {input} > {output}
    '''
SnakeMake From line 106 of rules/qc_cov.smk
shell:
    '''
    tail -qn1 {input} \
    | awk -F, \
        'BEGIN {{ OFS=","; print "sample,target,totalBp,lowcovBp,fracLow" }} \
        {{ \
            total += $3; \
            lowcov += $4; \
            print; \
        }} \
        END {{ \
                print "Total","allTargets",total,lowcov,lowcov/total;
            }}' \
    > {output}
    '''
SnakeMake From line 135 of rules/qc_cov.smk
shell:
    '''
    samtools view -F {params.filter} {input.bam} \
    | awk 'BEGIN {{ OFS="," ; print "readname,length,qual,ismapped,duplicates,softclip" }} \
           {{ if( match( $0, /ds:i:[0-9]/ ) ) {{ split( substr( $0, RSTART, RLENGTH ),d,":"); dup=d[3] }} \
              else                            {{ dup=0 }} \
           }} ; \
           {{ soft = match( $6, /[0-9]+S/ ) ? \
                     substr( $6, RSTART, RLENGTH-1 ) : \
                     0 \
           }} ; \
           {{ mapped = !and($2,4) }} ; \
           {{ match( $0, /rq:f:[0-9.]+/ ); \
              split( substr( $0, RSTART, RLENGTH ),q,":") ; \
              print $1, length($10), q[3], mapped, dup, soft \
           }}' \
    | ( sed -u 1q; sort) > {output.stats}
    '''
shell:
    ''' 
    bedtools intersect -a <(samtools view -hbF {params.filter} {input.bam}) -b {input.bed} -bed -loj \
    | awk 'BEGIN {{ print "readname,target" }} \
           {{ print $4","$NF }}' \
    | ( sed -u 1q; sort -t, -u -k1,1 ) > {output.target}
    ''' 
shell:
    '''
    samtools view -f 1024 {input} \
    | awk -v s={params.sample} '{{print s","length($10)",Duplicates"}}' \
    > {output}
    '''
shell:
    ''' 
    awk ' BEGIN {{ print "sample,length,category" }}
          {{ print }}' {input} > {output}
    '''
SnakeMake From line 226 of rules/qc_cov.smk
script:
    'scripts/merge_read_metrics.py'
SnakeMake From line 243 of rules/qc_cov.smk
shell:
    '''
    awk 'NR==1 || FNR>1' {input} > {output}
    '''
SnakeMake From line 253 of rules/qc_cov.smk
shell:
    '''
    awk 'NR==1 || FNR>1' {input} > {output}
    '''
SnakeMake From line 265 of rules/qc_cov.smk
shell:
    '''
    awk -v min={params.low} '$NF <= min {{ print $1, $2, $3, $4 ~ /^[0-9]+$/ ? "" : $4 }}' {input} \
    | sort -k1,1 -k2,2n \
    | uniq -c \
    | awk 'BEGIN {{ OFS=","; print "chr,start,stop,target,samplesDropped" }} \
                 {{ print $2,$3,$4,$5,$1 }}' \
    > {output}
    '''
SnakeMake From line 279 of rules/qc_cov.smk
shell:
    '''
    bedtools nuc -fi {input.ref} -bed {input.bed} \
    | awk 'BEGIN {{ OFS=","; print "chr,start,stop,target,frac_gc" }} \
           NR>1  {{ print $1,$2,$3,$4,$(NF-7) }}' \
    > {output}
    '''
script:
    'scripts/plot_read_data.py'
SnakeMake From line 327 of rules/qc_cov.smk
script:
    'scripts/plot_coverage.py'
SnakeMake From line 339 of rules/qc_cov.smk
script:
    'scripts/plot_multi_coverage.py'
SnakeMake From line 351 of rules/qc_cov.smk
script:
    'scripts/plot_multi_reads.py'
SnakeMake From line 375 of rules/qc_cov.smk
script:
    'scripts/plot_read_cats.py'
import pandas as pd

res = pd.concat( [ pd.read_csv( csv, index_col='readname' )
                   for csv in snakemake.input ],
                  axis=1 )\
        .reset_index()
res.insert( 0, 'sample', snakemake.params.sample )
res.to_csv( snakemake.output[0], index=False )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=200

coverage = pd.read_csv( snakemake.input[0] )

order = sorted(coverage.target.unique())
maxCols = int( len(order) ** 0.5 ) + 1

plt.figure(figsize=(40,40))

g = sns.FacetGrid(data=coverage,sharex=False,
                  col='target',col_wrap=maxCols,col_order=order,
                  hue='target')

g.map(plt.plot, 'start','coverage')

g.set_xlabels('chr start pos')
g.set_ylabels('Coverage')

g.savefig(f'{snakemake.params.odir}/coverage_by_target.png', dpi=DPI)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

DPI=400

data = pd.read_csv( snakemake.input[0] )
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)

####
#multi-sample coverage per target
####

order = sorted(data.target.unique())
ncols = ceil(len(order)**.5)

g = sns.FacetGrid(data=data,sharex=False,
                  col='target',col_wrap=ncols,col_order=order,
                  hue='sample')

g.map(plt.plot, 'start','coverage')\
 .set_xlabels('chr start pos')\
 .set_ylabels('Coverage')\
 .add_legend()\
 .savefig(f'{snakemake.params.odir}/multi_coverage_by_target.png',dpi=DPI)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=400

readCsv       = snakemake.input.csv
targetBed     = snakemake.input.bed
targetBuffer  = int(snakemake.params.buffer)
targetPerPlot = int(snakemake.params.targetsPerPanel)
outDir        = snakemake.params.odir

# function for partitioning target sets into readable subplots
def partition(lst, size):
    for i in range(0, len(lst), size):
        yield lst[i:i+size]

targets = pd.read_csv(targetBed,
                      sep='\t',
                      names=['chr','start','stop','target'])\
            .set_index('target')
#Set length of target region
targets['tlength'] = targets.eval('stop - start + 2 * @targetBuffer')

data = pd.read_csv(readCsv)
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)

####
#mean base coverage
####
print('writing mean base cov')
pdata = data.query(' target != "off-target" ')\
            .groupby( ['target','sample'] )\
            .length.sum().reset_index()

pdata['meanBaseCoverage'] = pdata.length / pdata.target.map(targets.tlength)

# assign groups to partition targets
order = sorted(pdata.target.unique())
grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp }
pdata[ 'plotGroup' ] = pdata.target.map(grps)

g = sns.catplot(data=pdata,
                x='target',sharex=False,
                y='meanBaseCoverage',
                col='plotGroup', col_wrap=2,
                kind='box',aspect=2)
g.set_xticklabels(rotation=45);
# remove plotgroup labels
for ax in g.axes.flatten():
    ax.set_title('')
plt.tight_layout()
g.savefig(f'{outDir}/mean_base_coverage.png',dpi=DPI)
plt.clf()

g = sns.catplot(data=pdata,
                x='target', sharex=False,
                y='meanBaseCoverage',
                col='plotGroup', col_wrap=2,
                hue='sample',
                kind='strip',aspect=2 )
                #facet_kws=dict(legend_out=True))
g.set_xticklabels(rotation=45);
for ax in g.axes.flatten():
    ax.set_title('')
sns.move_legend(g, "upper left", bbox_to_anchor=(1, 0.75), frameon=False)
plt.tight_layout()
g.savefig(f'{outDir}/mean_base_coverage_by_sample.png',dpi=DPI)
plt.clf()


####
#dedup rate by target
####
print('writing dedup_rate')
def dedup_rate(d):
    dups = d.duplicates.sum()
    return  dups / ( dups + len(d) ) 


pdata = data.groupby(['target','sample'])\
            .apply(dedup_rate)\
            .rename('Duplication Rate')\
            .reset_index()
order = sorted(pdata.target.unique())
grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp }
pdata[ 'plotGroup' ] = pdata.target.map(grps)

g = sns.catplot(data=pdata,
                x='target', sharex=False,
                col='plotGroup',col_wrap=2,
                y='Duplication Rate',
                kind='box',aspect=2)
g.set_xticklabels(rotation=45);
for ax in g.axes.flatten():
    ax.set_title('')
plt.tight_layout()
g.savefig(f'{outDir}/dedup_rate_by_target.png',dpi=DPI)
plt.clf()

####
#readlength by target
####
#TODO better represent dups by replicating length dup times
print('writing dedup_length')
pdata = data

order = sorted(pdata.target.unique())
grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp }
pdata[ 'plotGroup' ] = pdata.target.map(grps)
g = sns.catplot( data=pdata,
                 x='target', sharex=False,
                 col='plotGroup',col_wrap=2,
                 y='length',
                 kind='violin',aspect=2)
g.set_xticklabels(rotation=45);
for ax in g.axes.flatten():
    ax.set_title('')
plt.tight_layout()
g.savefig(f'{outDir}/dedup_length_by_target.png',dpi=DPI)
plt.clf()

####
#readlength by target
####
print('writing readlength by target')
pdata = data
wrap  = int( len(order) ** 0.5 + 1 )         
g = sns.FacetGrid(data=pdata,
                  hue='sample',
                  col='target',col_wrap=wrap,
                  col_order=order,sharey=False)
order = sorted(pdata.target.unique())

g.map(sns.kdeplot,'length')
g.set_xlabels('Read Length')
g.set_ylabels('HiFi Reads')
g.add_legend()
g.savefig(f'{outDir}/readlength_hist_by_target.png',dpi=DPI)
plt.clf()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

allDemuxReads = pd.read_csv( snakemake.input.readCsv )
limaReport    = pd.read_csv( snakemake.input.lima, sep='\t', usecols=['ReadLengths','PassedFilters'] )
dups          = pd.read_csv( snakemake.input.dups, usecols=['sample','length','category'] )
outdir        = snakemake.params.odir

#load data
onTarget  = allDemuxReads.query('target != "."')\
                         .assign(category='OnTarget Unique')[['sample','category','length']]

offTarget = allDemuxReads.query('target == "." and ismapped==1')\
                         .assign(category='OffTarget Unique')[['sample','category','length']]

unmapped  = allDemuxReads.query('ismapped==0')\
                         .assign(category='Unmapped Unique')[['sample','category','length']]

noDemux   = limaReport.query('PassedFilters==0')\
                      .assign(category='Failed Demux')\
                      .drop('PassedFilters',axis=1).rename(columns={'ReadLengths':'length'})

#merge
allData = pd.concat([onTarget,offTarget,unmapped,dups,noDemux],ignore_index=True)

#plot
hue_order=['OnTarget Unique','OffTarget Unique','Unmapped Unique','Duplicates','Failed Demux']


f, (ax0, ax1) = plt.subplots(1, 2, 
                             figsize=(10,8),
                             gridspec_kw={'width_ratios': [1, 5]})

#histogram
sns.histplot(data=allData,
             x='length',
             hue_order=hue_order,
             hue='category',
             multiple='stack',
             binrange=(0,15000),
             ax=ax1)

#side plot
data = allData.groupby('category').size().reindex(hue_order[::-1]).rename('reads').reset_index()
data['yieldFrac'] = data.reads.transform(lambda x:x/x.sum()).fillna(0)

bottom=0
colors = sns.color_palette()[:5][::-1]

for i,row in data.iterrows():
    ax0.bar(1,row.yieldFrac,bottom=bottom,color=colors[i],alpha=0.75)
    ax0.annotate(f'{row.yieldFrac:.2f}',(1,bottom + .5*row.yieldFrac),ha='center',va='center')
    bottom += row.yieldFrac

ax0.set_xticks([])
ax0.set_ylabel('Total Fraction')

plt.tight_layout()

f.savefig(f'{outdir}/read_categories.png',dpi=400)

# write readlength legnth stats table
allData.fillna('noSampleData')\
       .groupby(['category','sample'])\
       .length.describe().dropna().astype(int)\
       .to_csv(f'{outdir}/read_length_by_sample.csv')
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=200

readCsv      = snakemake.input.csv
targetBed    = snakemake.input.bed
targetBuffer = int(snakemake.params.buffer)
outDir       = snakemake.params.odir


targets = pd.read_csv(targetBed,
                      sep='\t',
                      names=['chr','start','stop','target'])\
            .set_index('target')
#Set length of target region
targets['tlength'] = targets.eval('stop - start + 2 * @targetBuffer')

data = pd.read_csv(readCsv)
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)


####
#mean base coverage
####

pdata = data.query(' target != @ot ')\
            .groupby( 'target' )\
            .length.sum().reset_index()

try:
    pdata['meanBaseCoverage'] = pdata.length / pdata.target.map(targets.tlength)
except pd.errors.InvalidIndexError as e:
    print(f'Error: Is your target name list unique?\n\n{e}')
    sys.exit()

order = sorted(pdata.target)
g = sns.catplot(data=pdata,
                x='target',order=order,
                y='meanBaseCoverage',
                kind='bar',aspect=2)
plt.xticks(rotation=45)
g.savefig(f'{outDir}/mean_base_coverage_by_target.png',dpi=DPI)
plt.clf()

###
#sample readlength
###
pdata = data.query( ' target != @ot ')

g = sns.displot(pdata.length,kind='hist',kde=True)
g.savefig(f'{outDir}/readlength_hist.png',dpi=DPI)
plt.clf()

####
#readlength by target
###

pdata = data

order = sorted(pdata.target.unique())
g = sns.FacetGrid(data=pdata,
                  xlim=(0,10000),
                  col='target',col_wrap=7,
                  col_order=order,
                  hue='sample',
                  sharey=False)

g.map(sns.kdeplot,'length')

g.set_xlabels('Read Length')
g.set_ylabels('HiFi Reads')
g.add_legend()
g.savefig(f'{outDir}/readlength_hist_by_target.png',dpi=DPI)
plt.clf()

####
#readlength by target
####
pdata = data

order = sorted(pdata.target.unique())
g = sns.catplot( data=pdata,
                 x='target',
                 y='length',
                 order=order,
                 kind='violin',
                 aspect=2)
plt.xticks(rotation=45);
g.savefig(f'{outDir}/readlength_violins_by_target.png',dpi=DPI)
plt.clf()

####
#dedup count by target
####
pdata = data

order = sorted(pdata.target.unique())

g = sns.catplot( data=pdata,
                 x='target',
                 y='duplicates',
                 order=order,
                 kind='box',aspect=2)
plt.xticks(rotation=45);
g.savefig(f'{outDir}/dedup_count_by_target.png',dpi=DPI)
plt.clf()


####
#dedup rate by target
####
def dedup_rate(d):
    dups = d.duplicates.sum()
    return  dups / ( dups + len(d) ) 


pdata = data.groupby('target')\
            .apply(dedup_rate)\
            .rename('Duplication Rate')\
            .reset_index()

order = sorted(pdata.target)
g = sns.catplot(data=pdata,
                x='target',order=order,
                y='Duplication Rate',
                kind='bar',aspect=2)
plt.xticks(rotation=45)
g.savefig(f'{outDir}/dedup_rate_by_target.png',dpi=DPI)
plt.clf()

####
#rq by target
###
pdata = data

order = sorted(pdata.target.unique())

g = sns.catplot( data=pdata,
                 x='target',
                 y='qual',
                 order=order,
                 kind='box',aspect=2)
plt.xticks(rotation=45)
g.savefig(f'{outDir}/readqual_by_target.png',dpi=DPI)
plt.clf()

###
#off target length
###
pdata = data.query( ' target == @ot ')

g = sns.displot(pdata.length,kind='hist',kde=True)
g.savefig(f'{outDir}/off-target_length_hist.png',dpi=DPI)
shell:
    """
    (whatshap phase {params.extra} \
        --output {output} \
        --reference {input.reference} \
        {input.vcf} \
        {input.phaseinput}) > {log} 2>&1
    """
shell:
    """
    (whatshap stats \
        --gtf {output.gtf} \
        --tsv {output.tsv} \
        --block-list {output.blocklist} \
        --chr-lengths {input.chr_lengths} \
        {input.vcf}) > {log} 2>&1
    """
shell:
    """
    (whatshap haplotag {params} \
        --output {output} \
        --reference {input.reference} \
        {input.vcf} {input.bam}) > {log} 2>&1
    """
shell:
    "(samtools index -@ 3 {input}) > {log} 2>&1"
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "[email protected]"
__license__ = "MIT"

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard BedToIntervalList"
        " {java_opts} {extra}"
        " --INPUT {snakemake.input.bed}"
        " --SEQUENCE_DICTIONARY {snakemake.input.dict}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output}"
        " {log}"
    )
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts


extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard CreateSequenceDictionary"
        " {java_opts} {extra}"
        " --REFERENCE {snakemake.input[0]}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output[0]}"
        " {log}"
    )
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard CollectHsMetrics"
        " {java_opts} {extra}"
        " --INPUT {snakemake.input.bam}"
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output[0]}"
        " --REFERENCE_SEQUENCE {snakemake.input.reference}"
        " --BAIT_INTERVALS {snakemake.input.bait_intervals}"
        " --TARGET_INTERVALS {snakemake.input.target_intervals}"
        " {log}"
    )
