WES HLA Typing based on multiple alternative tools


Workflow Diagram

Scope of this workflow

This workflow enables the concurrent analysis of WES or WGS data with several publicly available software tools to derive HLA haplotypes from such data.

Currently available software tools

  • xHLA

    Xie, C., Yeo, Z. X., Wong, M., Piper, J., Long, T., Kirkness, E. F., ... & Brady, C. (2017). Fast and accurate HLA typing from short-read next-generation sequence data with xHLA. Proceedings of the National Academy of Sciences, 114(30), 8059-8064.

    The workflow maps the reads against hg38 without alt contigs using bwa mem, as instructed by the authors. The mapped reads are then sorted and indexed using samtools.

    The workflow utilizes the [Docker Image][2] provided by the authors to perform the actual HLA typing.

  • HLA-VBSeq

    Nariai, N., Kojima, K., Saito, S., Mimori, T., Sato, Y., Kawai, Y., ... & Nagasaki, M. (2015, December). HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data. In BMC genomics (Vol. 16, No. S2, p. S7). BioMed Central.

    Wang, Y. Y., Mimori, T., Khor, S. S., Gervais, O., Kawai, Y., Hitomi, Y., ... & Nagasaki, M. (2019). HLA-VBSeq v2: improved HLA calling accuracy with full-length Japanese class-I panel. Human Genome Variation, 6(1), 1-5.

    The workflow maps the reads against hg19 without alt contigs. The authors' instructions merely state to "map against hg19" without further specifics, but mapping against hg19 with alt contigs yielded very poor typing results with missing HLA class I genes; the workflow therefore uses hg19 without alt contigs.

    HLA-VBSeq released two reference database versions:

    • v1 database based on IMGT/HLA database, Release 3.15.0

    • v2 database based on IMGT/HLA database Release 3.31.0 and Japanese HLA reference dataset

  • OptiType

    Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M., & Kohlbacher, O. (2014). OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30(23), 3310-3316.

    The workflow invokes the [OptiType snakemake wrapper][1] without prior filtering of reads.

  • HLA-LA

    Dilthey, A. T., Mentzer, A. J., Carapito, R., Cutland, C., Cereb, N., Madhi, S. A., ... & Phillippy, A. M. (2019). HLA*LA - HLA typing from linearly projected graph alignments. Bioinformatics, 35(21), 4394-4396.

    The workflow uses reads mapped against the human genome (hg38) without alt contigs as input for HLA-LA. A corresponding reference txt file for HLA-LA is part of this workflow repository. The preprocessed graph directory PRG_MHC_GRCh38_withIMGT can either be placed manually in typing/hla_la/hla_la.graphs/ or it will be downloaded and preprocessed automatically (a minimal sketch of the manual placement is shown after this list).

    The workflow uses the [HLA-LA bioconda package][3] for graph preprocessing and HLA typing.

  • arcasHLA

    Orenbuch, R., Filip, I., Comito, D., Shaman, J., Pe’er, I., & Rabadan, R. (2020). arcasHLA: high-resolution HLA typing from RNAseq. Bioinformatics, 36(1), 33-40.

    The workflow maps RNAseq reads against the human genome (hg38) without alt contigs using the STAR aligner with default parameters. It then invokes the 'extract' and 'genotype' subcommands provided by arcasHLA.

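For the manual placement mentioned in the HLA-LA section above, the PRG_MHC_GRCh38_withIMGT package simply needs to be unpacked into typing/hla_la/hla_la.graphs/ before the workflow is started. Below is a minimal sketch of that step, assuming the archive has already been downloaded into the working directory as PRG_MHC_GRCh38_withIMGT.tar.gz (the file name and location are assumptions, not prescribed by the workflow). Note that a freshly unpacked archive still needs the prepareGraph step, which the automatic download rule performs (see the hla_la snippets below), whereas a graph directory copied from a previous run is already preprocessed.

    # Minimal sketch: manually place the HLA-LA graph package.
    # Assumes PRG_MHC_GRCh38_withIMGT.tar.gz was downloaded beforehand.
    import tarfile
    from pathlib import Path

    archive = Path("PRG_MHC_GRCh38_withIMGT.tar.gz")   # assumed local download
    graphs_dir = Path("typing/hla_la/hla_la.graphs")

    graphs_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        # Creates typing/hla_la/hla_la.graphs/PRG_MHC_GRCh38_withIMGT/
        tar.extractall(path=graphs_dir)
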
Usage

  1. Install snakemake

    conda install -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n snakemake snakemake
    conda activate snakemake
    
  2. Clone the MultiHLA repository

    git clone https://github.com/lkuchenb/MultiHLA.git hla_typing
    cd hla_typing
    
  3. Put the input files in place
    MultiHLA comes with a predefined folder structure:

    • dataset/

      A dataset is defined as a set of samples. Place a TSV file here for every dataset with the following three named columns:

      SampleName  FileNameR1                              FileNameR2
      Donor1      SEQ_D1_DAT_01_S53_L001_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L001_R2_001.fastq.gz
      Donor1      SEQ_D1_DAT_01_S53_L002_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L002_R2_001.fastq.gz
      Donor2      SEQ_D2_DAT_01_S54_L001_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L001_R2_001.fastq.gz
      Donor2      SEQ_D2_DAT_01_S54_L002_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L002_R2_001.fastq.gz
      Donor3      SEQ_D3_DAT_01_S55_L001_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L001_R2_001.fastq.gz
      Donor3      SEQ_D3_DAT_01_S55_L002_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L002_R2_001.fastq.gz
      

      FASTQ files have to come in gzipped pairs and be named {prefix}_R[12]{suffix}.fastq.gz. A sample can be covered by an arbitrary number of FASTQ pairs (at least one). A sketch that checks these conventions is shown after this list.

    • fastq/

      Place the FASTQ files as listed in your dataset sheet here.

    • ref/

      Place or link the required human genome references here as described for each supported method, otherwise they will be automatically downloaded.

    • trim/

      This is an output folder. It will be filled with adapter-trimmed versions of the provided FASTQ files.

    • typing/{method}/

      This is an output folder. It will be filled with subfolders for each method.

    • workflow/

      This folder contains the workflow code.
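
    The sketch below illustrates the dataset sheet and FASTQ naming conventions described for dataset/ and fastq/ above. It is a hypothetical helper, not part of MultiHLA; the column names and the {prefix}_R[12]{suffix}.fastq.gz pattern are taken from the description above, and the sheet path in the example call is only illustrative.

      # Hypothetical check (not part of MultiHLA): validate a dataset sheet
      # against the fastq/ folder and the expected file naming scheme.
      import csv
      import re
      from pathlib import Path

      PAIR_RE = re.compile(r'.+_R[12].*\.fastq\.gz$')

      def check_sheet(sheet_path, fastq_dir="fastq"):
          with open(sheet_path, newline="") as handle:
              for row in csv.DictReader(handle, delimiter="\t"):
                  for key in ("FileNameR1", "FileNameR2"):
                      name = row[key]
                      if not PAIR_RE.match(name):
                          raise ValueError(f"{name} does not match the naming scheme")
                      if not (Path(fastq_dir) / name).exists():
                          raise FileNotFoundError(f"{name} (sample {row['SampleName']}) not found in {fastq_dir}/")

      check_sheet("dataset/dataset_1.tsv")   # illustrative sheet path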

  4. Run the workflow

    Invoke snakemake with snakemake --use-conda --use-singularity. This enables snakemake to automatically install dependencies into conda environments that are created on the fly and also allows the container-based jobs to run. To process all samples of a dataset, for example the dataset dataset_1 described in datasets/dataset_1.tsv, use

    snakemake --use-conda --use-singularity typing/dataset_1.all.multihla
    

    Memory and run time requirements for each job are noted in their resources (mem_mb and time).
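
    To run several datasets in one invocation, one target of the form typing/{dataset}.all.multihla can be requested per dataset sheet. The snippet below is a small convenience sketch, not part of the workflow; it assumes the sheets live in the dataset/ folder described above and simply mirrors the command shown above via subprocess.

    # Convenience sketch (not part of MultiHLA): request all datasets at once.
    import subprocess
    from pathlib import Path

    # One typing/{dataset}.all.multihla target per sheet in dataset/
    targets = [f"typing/{sheet.stem}.all.multihla"
               for sheet in sorted(Path("dataset").glob("*.tsv"))]

    subprocess.run(["snakemake", "--use-conda", "--use-singularity", *targets], check=True)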

Code Snippets

shell:
    """
    base="$PWD"
    log="$base"/{log:q}

    # DOWNLOAD AND UNPACK ARCASHLA
    mkdir -p {output.outdir:q} &>"$log"
    cd {output.outdir:q} &>>"$log"
    tar --strip-components=1 -xf "$base"/{input:q} &>>"$log"

    # OBTAIN REFERENCE
    ./arcasHLA reference --version {wildcards.refversion:q} &>>"$log"
    """
shell:
    """
    rm -rf {output.temp} &>>{log.cli:q}
    mkdir -p {output.temp:q} &>{log.cli:q}
    {input.arcas_bin:q} extract -t {threads} --paired -o {output.temp:q} --log {log.main:q} {input.bam:q} &>>{log.cli:q}
    mv {output.temp:q}/*1.fq.gz {output.r1:q} &>>{log.cli:q}
    mv {output.temp:q}/*2.fq.gz {output.r2:q} &>>{log.cli:q}
    """
shell:
    """
    rm -vrf {output.temp} &>>{log.cli:q}
    mkdir -vp {output.temp:q} &>{log.cli:q}
    {input.arcas_bin:q} genotype -t {threads} -o {output.temp:q} --log {log.main:q} {input.r1:q} {input.r2:q} &>>{log.cli:q}
    mv -v {output.temp:q}/*genes.json {output.genes:q} &>>{log.cli}
    mv -v {output.temp:q}/*genotype.json {output.genotype:q} &>>{log.cli}
    mv -v {output.temp:q}/*alignment.p {output.align:q} &>>{log.cli}
    """
script:
    '../scripts/compare.py'
shell:
    """
    logpath="$PWD/{log}"
    origpath="$PWD"
    date >"$logpath"

    # CHECK DOWNLOADED FILE
    md5sum < {input:q} 2>>"$logpath" | fgrep -q {params.checksum} || echo 'Checksum failure for PRG_MHC_GRCh38_withIMGT.tar.gz' | tee -a "$logpath" >&2

    # UNPACK DOWNLOADED FILE
    cd 'typing/hla_la/hla_la.graphs' &>>"$logpath"
    tar xf "$origpath"/{input:q} &>>"$logpath"

    # PREPARE THE INFERENCE GRAPH
    "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA --action prepareGraph --PRG_graph_dir "$PWD"/PRG_MHC_GRCh38_withIMGT &>>"$logpath"
    """
shell:
    """
    LOGPATH="$PWD/{log}"
    date > "$LOGPATH"
    env >> "$LOGPATH"
    # Check if a proper conda environment is available
    if [ -z "${{CONDA_PREFIX+_}}" ] || [ ! -d "$CONDA_PREFIX"/opt/hla-la ]; then
        echo "--use-conda is required to run the HLA-LA rule." | tee -a "$LOGPATH" >&2
        exit 1
    fi

    # Create and prepare the temp directory
    rm -rf {params.tempdir:q}
    mkdir -p {params.tempdir:q}/src {params.tempdir:q}/bin >>"$LOGPATH"

    # PREPARE DIRECTORY SRC
    cd {params.tempdir:q}/src
    ln -s "$CONDA_PREFIX"/opt/hla-la/src/* . &>>"$LOGPATH"
    mv HLA-LA.pl HLA-LA.pl~
    cp "$CONDA_PREFIX"/opt/hla-la/src/HLA-LA.pl . &>>"$LOGPATH"
    cd ..

    # PREPARE DIRECTORY BIN
    ln -s "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA bin/ &>>"$LOGPATH"

    # PREPARE DIRECTORY GRAPHS
    ln -s ../../../{input.graphs:q} graphs &>>"$LOGPATH"

    # RUN MAIN PROGRAM
    ./src/HLA-LA.pl --sampleID sample --BAM ../../../{input.bam:q} --workingDir "$PWD" --graph PRG_MHC_GRCh38_withIMGT &>>"$LOGPATH"

    # MOVE RESULTS INTO PLACE
    cd ../../../
    rm -f {output.outdir:q} &>>"$LOGPATH"
    mv {params.tempdir:q}/sample {output.outdir:q} &>>"$LOGPATH"
    ln -s ../../{output.outdir:q}/hla/R1_bestguess_G.txt {output.main:q} &>>"$LOGPATH"

    # REMOVE TEMPORARY DIRECTORY
    rm -rf {params.tempdir:q} &>>"$LOGPATH"
    """
shell:
    """
    # Extract a list of read names that were aligned to HLA loci (HLA-A, B, C, DM, DO, DP, DQ, DR, E, F, G, H, J, K, L, P, V, MIC, and TAP)
    samtools view -@ {threads} {input.bam:q} chr6:29907037-29915661 chr6:31319649-31326989 chr6:31234526-31241863 \
        chr6:32914391-32922899 chr6:32900406-32910847 chr6:32969960-32979389 chr6:32778540-32786825 \
        chr6:33030346-33050555 chr6:33041703-33059473 chr6:32603183-32613429 chr6:32707163-32716664 \
        chr6:32625241-32636466 chr6:32721875-32733330 chr6:32405619-32414826 chr6:32544547-32559613 \
        chr6:32518778-32554154 chr6:32483154-32559613 chr6:30455183-30463982 chr6:29689117-29699106 \
        chr6:29792756-29800899 chr6:29793613-29978954 chr6:29855105-29979733 chr6:29892236-29899009 \
        chr6:30225339-30236728 chr6:31369356-31385092 chr6:31460658-31480901 chr6:29766192-29772202 \
        chr6:32810986-32823755 chr6:32779544-32808599 chr6:29756731-29767588 \
        2> {log:q} | awk '{{print $1}}' | LC_ALL=C sort | uniq > {output.locus_ids:q}

    echo "'samtools view' finished extracting chromosome 6 reads." >>{log:q}

    mkdir -p {output.idx_dir:q}

    java -jar -Xmx32g -Xms32g '{params[hla_vbseq_bam_name_index_jar]}' index {input.bam:q} --indexFile {output.idx_dir:q}/index &>>{log:q}
    echo "bamNameIndex.jar [1/2] finished." >>{log:q}

    java -jar '{params[hla_vbseq_bam_name_index_jar]}' search {input.bam:q} --indexFile {output.idx_dir:q}/index --name {output.locus_ids:q} --output {output.partial_sam:q} &>>{log:q}
    echo "bamNameIndex.jar [2/2] finished." >>{log:q}

    picard SamToFastq I={output.partial_sam:q} F={output.partial_fq_1:q} F2={output.partial_fq_2:q} &>>{log:q}
    echo "picard SamToFastq [1/2] finished." >>{log:q}

    samtools view -@ {threads} -bh -f 12 {input.bam:q} >{output.bam_unmapped:q} 2>>{log:q}
    echo "'samtools view' finished extracting unmapped reads." >>{log:q}

    picard SamToFastq I={output.bam_unmapped:q} F={output.unmapped_fq_1:q} F2={output.unmapped_fq_2:q} &>>{log:q}
    echo "picard SamToFastq [2/2] finished." >>{log:q}

    cat {output.partial_fq_1:q} {output.unmapped_fq_1:q} >{output.cat_fq_1} 2>>{log:q}
    cat {output.partial_fq_2:q} {output.unmapped_fq_2:q} >{output.cat_fq_2} 2>>{log:q}
    echo "Generating input FASTQ files finished." >>{log:q}

    bwa index '{params[hla_vbseq_ref_fa]}' &>>{log:q}
    bwa mem -t {threads} -P -L 10000 -a '{params[hla_vbseq_ref_fa]}' {output.cat_fq_1} {output.cat_fq_2} >{output.sam_final:q} 2>>{log:q}
    echo "Mapping against HLA reference finished." >>{log:q}

    # Run VBSeq
    java -jar '{params[hla_vbseq_main_jar]}'  '{params[hla_vbseq_ref_fa]}' {output.sam_final:q} {output.result_txt:q} --alpha_zero 0.01 --is_paired &>>{log:q}
    echo "VBSeq finished." >>{log:q}

    # Call alleles
    python '{params[hla_vbseq_call_hla_digits]}' -v {output.result_txt:q} -a '{params[hla_vbseq_ref_txt]}'  -r 140 -d 4 --ispaired >{output.result_tsv:q} 2>>{log:q}
    echo "Calling HLA alleles finished." >>{log:q}

    echo "" >>{log:q}
    echo "Rule 'hla_vbseq_typing' finished." >>{log:q}
    """
wrapper:
    '0.65.0/bio/bwa-mem2/index'
From line 43 of map/bwa.smk
wrapper:
    "0.65.0/bio/samtools/index"
From line 63 of map/bwa.smk
wrapper:
    "0.65.0/bio/bwa-mem2/mem"
From line 88 of map/bwa.smk
shell:
    """
    touch {output:q}
    """
From line 105 of map/bwa.smk
wrapper:
    "0.67.0/bio/star/index"
From line 48 of map/star.smk
wrapper:
    "0.67.0/bio/star/align"
From line 67 of map/star.smk
run:
    # Parse JSON output of xHLA
    import json
    with open(input[0], 'r') as infile:
        alleles = json.load(infile)['hla']['alleles']

    # Write multihla output
    import collections
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file=outfile)
        data = collections.defaultdict(lambda : [])
        for allele in alleles:
            gene, allele = allele.split('*')
            data[gene].append(allele)
        for gene, alleles in data.items():
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\txHLA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file = outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\txHLA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file = outfile)
            else:
                raise RuntimeError(f'xHLA: More than two alleles reported for gene {gene}.')
script:
    '../scripts/concat_tables.py'
run:
    import csv
    from collections import defaultdict
    ddict = defaultdict(lambda : set())
    with open(input[0], 'r') as infile:
        reader = csv.DictReader(infile, delimiter = '\t')
        for rec in reader:
            # extract first two fields and drop gene name
            allele = ':'.join(rec['Allele'].split(':')[:2]).split('*')[1]
            # record allele
            ddict[rec['Locus']].add(allele)
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file = outfile)
        for gene, alleles in ddict.items():
            alleles = list(alleles)
            if len(alleles)==2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-LA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file = outfile)
            elif len(alleles)==1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-LA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file = outfile)
            else:
                raise RuntimeError('hla_la_conversion expects either one or two alleles to be reported per gene')
script:
    '../scripts/concat_tables.py'
run:
    from csv import DictReader
    with open(input[0], 'r') as infile:
        reader = DictReader(infile, delimiter = '\t')
        with open(output[0], 'w') as outfile:
            for record in reader:
                gene = record["Gene"]
                allele_1 = record["Allele1"].split('*')[1]
                if record["Allele2"]:
                    allele_2 = record["Allele2"].split('*')[1]
                    print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-VBSeq\t{params.opts}\t{gene}\t{allele_1}\t{allele_2}', file = outfile)
                else:
                    print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-VBSeq\t{params.opts}\t{gene}\t{allele_1}\t{allele_1}', file = outfile)
script:
    '../scripts/concat_tables.py'
run:
    import csv
    from collections import defaultdict
    ddict = defaultdict(lambda : set())
    with open(input[0], 'r') as infile:
        reader = csv.DictReader(infile, delimiter = '\t')
        for rec in reader:
            if ddict:
                raise RuntimeError('optitype_conversion requires the result file to contain exactly one record')
            ddict['A'] = { rec['A1'].split('*')[1], rec['A2'].split('*')[1] }
            ddict['B'] = { rec['B1'].split('*')[1], rec['B2'].split('*')[1] }
            ddict['C'] = { rec['C1'].split('*')[1], rec['C2'].split('*')[1] }
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file = outfile)
        for gene, alleles in ddict.items():
            alleles = list(alleles)
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tOptiType\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file = outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tOptiType\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file = outfile)
            else:
                raise RuntimeError('optitype_conversion expects either one or two alleles to be reported per gene')
script:
    '../scripts/concat_tables.py'
script:
    '../scripts/concat_tables.py'
wrapper:
    "0.66.0/bio/optitype"
shell:
    """
    # Run trimgalore
    trim_galore --output_dir trim --fastqc -j {threads} --paired {input[0]:q} {input[1]:q} &>{log:q}

    # trimgalore compresses output only if input was compressed
    [ "{params.input_ext}" = ".gz" ] || gzip trim/*.fq &>>{log:q}

    # Move output into place
    mv -f {params.r1_tmp:q} {output.r1:q}
    mv -f {params.r2_tmp:q} {output.r2:q}
    mv -f {params.r1fastqc_tmp:q} {output.r1fastqc:q}
    mv -f {params.r2fastqc_tmp:q} {output.r2fastqc:q}
    """
script:
    '../scripts/hg_filter_noalt.py'
shell:
    """
    mkdir -p {output.bindir:q}
    mkdir -p {output.tdir:q}

    # This is a hack to work around the fact that xHLA does not allow any
    # control over the number of threads it consumes - it will always consume
    # everything the machine reports. Two components seem to use
    # multithreading, the invocation of diamond and the lpsolver in R. Here
    # we fix the diamond issue by integrating a wrapper script called
    # 'diamond' in the PATH that sets the max threads parameter of diamond
    # accordingly. Currently there is no fix for the R code, which makes
    # this rule vulnerable to cluster engines that enforce the SMP limit.

    echo '#!/bin/bash' > {output.bindir:q}/diamond
    echo 'shift' >> {output.bindir:q}/diamond
    echo '/usr/bin/diamond blastx -p {threads} "$@"' >> {output.bindir:q}/diamond
    chmod +x {output.bindir:q}/diamond
    export PATH="{output.bindir}:$PATH"

    export LC_ALL=C
    ecode=0
    if python /opt/bin/run.py --sample_id sample.$HOSTNAME.$$ --input_bam_path {input.bam:q} --output_path {output.tdir:q} &>{log:q}
    then
        mv -v {output.tdir:q}/report-sample.$HOSTNAME.$$-hla.json {output.json:q} &>>{log:q}
    else
        ecode=1
    fi
    rm -rfv hla-sample.$HOSTNAME.$$ &>>{log:q}
    exit $ecode
    """
import pandas as pd

# Read the ground truth data
gtruth = pd.read_csv(snakemake.input.gtruth, sep='\t')
gtruth['Dataset'] = snakemake.wildcards.dataset
gtruth['Method'] = 'ground_truth'
gtruth['Options'] = ''

# Read the typing results
typing = pd.read_csv(snakemake.input.typing, sep='\t')

# Concatenate and pivot the data
compare = pd.concat([gtruth, typing])

# Create result column with set of called alleles per gene and drop original, individual columns
compare['Result'] = compare.apply(axis='columns', func=lambda x : {x.Allele1, x.Allele2})
compare.drop(columns=['Allele1','Allele2'], inplace = True)

# Pivot the table to have one column per method, drop constant column index level
compare = compare.pivot_table(index=['Dataset','Sample','Gene'], columns=['Method','Options'], aggfunc=lambda x : set.union(*x))
compare = compare.droplevel(0, axis='columns')

# Make the ground truth the first column
compare = compare[ [c for c in compare.columns if c[0]=='ground_truth'] + [c for c in compare.columns if c[0]!='ground_truth'] ]

def annotate(row, method):
    if pd.isna(row[method]) or len(row[method])==0:
        return 'no result'
    gt = row[('ground_truth', '')]
    if row[method] == gt:
        return 'correct'
    if pd.isna(gt) or len(gt)==0:
        return f'no ground truth: {row[method]}'
    if row[method] & gt: # Intersection
        return f'partial match: {row[method]}'
    else:
        return f'mismatch: {row[method]}'

# Create copy with annotated version of results
annotated = compare.copy()
for col in [ c for c in annotated.columns if c[0] != 'ground_truth' ]:
    annotated[col] = annotated.apply(lambda x : annotate(x, col), axis = 'columns')

# Write results
compare.to_csv(snakemake.output.raw, sep='\t', index = True)
annotated.to_csv(snakemake.output.annotated, sep='\t', index = True)
header = False
with open(snakemake.output[0], 'w') as outfile:
    for inpath in snakemake.input:
        with open(inpath, 'r') as infile:
            if header:
                next(infile)
            else:
                header = True
            for row in infile:
                print(row, file = outfile, end = '')
from Bio import SeqIO
import gzip

whitelist = [ 'chr' + e for e in [ str(num) for num in range(1,23) ] + ['X','Y'] ]

with open(snakemake.output[0], 'w') as outfile:
    with gzip.open(snakemake.input[0], 'rt') as infile:
        generator = ( record for record in SeqIO.parse(infile, "fasta") if record.id in whitelist )
        SeqIO.write(generator, outfile, 'fasta')
__author__ = "Christopher Schröder, Patrik Smeds"
__copyright__ = "Copyright 2020, Christopher Schröder, Patrik Smeds"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided.")
elif len(snakemake.input) > 1:
    raise ValueError("Please provide exactly one reference genome as input.")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

shell("bwa-mem2 index" " {prefix}" " {snakemake.input[0]}" " {log}")
__author__ = "Christopher Schröder, Johannes Köster, Julian de Ruiter"
__copyright__ = (
    "Copyright 2020, Christopher Schröder, Johannes Köster and Julian de Ruiter"
)
__email__ = "[email protected] [email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa-mem2 mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
outdir = os.path.dirname(snakemake.output[0])

# get sequencing type
seq_type = snakemake.params.get("sequencing_type", "dna")
seq_type = "--{}".format(seq_type)

# check if non-default config.ini is used
config = snakemake.params.get("config", "")
if any(config):
    config = "--config {}".format(config)

shell(
    "(OptiTypePipeline.py"
    " --input {snakemake.input.reads}"
    " --outdir {outdir}"
    " --prefix {snakemake.wildcards.sample}"
    " {seq_type}"
    " {config}"
    " {extra})"
    " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)

URL: https://github.com/lkuchenb/MultiHLA
Name: multihla
Version: 1
Copyright: Public Domain
License: MIT License