WES HLA Typing based on multiple alternative tools


Workflow Diagram

Scope of this workflow

This workflow enables the concurrent analysis of WES or WGS data with several publicly available software tools to derive HLA haplotypes from such data.

Currently available software tools

  • xHLA

    Xie, C., Yeo, Z. X., Wong, M., Piper, J., Long, T., Kirkness, E. F., ... & Brady, C. (2017). Fast and accurate HLA typing from short-read next-generation sequence data with xHLA. Proceedings of the National Academy of Sciences, 114(30), 8059-8064.

    The workflow maps the reads against hg38 without alt contigs using bwa mem, as instructed by the authors. The mapped reads are then sorted and indexed using samtools.

    The workflow utilizes the [Docker Image][2] provided by the authors to perform the actual HLA typing.

  • HLA-VBSeq

    Nariai, N., Kojima, K., Saito, S., Mimori, T., Sato, Y., Kawai, Y., ... & Nagasaki, M. (2015, December). HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data. In BMC genomics (Vol. 16, No. S2, p. S7). BioMed Central.

    Wang, Y. Y., Mimori, T., Khor, S. S., Gervais, O., Kawai, Y., Hitomi, Y., ... & Nagasaki, M. (2019). HLA-VBSeq v2: improved HLA calling accuracy with full-length Japanese class-I panel. Human Genome Variation, 6(1), 1-5.

    The workflow maps the reads against hg19 without alt contigs. The authors' instructions merely state to "map against hg19" without further specifics, but mapping against hg19 with alt contigs yielded very poor typing results with missing HLA class I genes; the workflow therefore uses hg19 without alt contigs.

    HLA-VBSeq released two reference database versions:

    • v1 database based on IMGT/HLA database, Release 3.15.0

    • v2 database based on IMGT/HLA database Release 3.31.0 and Japanese HLA reference dataset

  • OptiType

    Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M., & Kohlbacher, O. (2014). OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30(23), 3310-3316.

    The workflow invokes the [OptiType snakemake wrapper][1] without prior filtering of reads.

  • HLA-LA

    Dilthey, A. T., Mentzer, A. J., Carapito, R., Cutland, C., Cereb, N., Madhi, S. A., ... & Phillippy, A. M. (2019). HLA*LA - HLA typing from linearly projected graph alignments. Bioinformatics, 35(21), 4394-4396.

    The workflow uses reads mapped against the human genome (hg38) without alt contigs as input for HLA-LA. A corresponding reference txt file for HLA-LA is part of this workflow repository. The preprocessed graph directory PRG_MHC_GRCh38_withIMGT can either be placed manually in typing/hla_la/hla_la.graphs/ or it will be downloaded and preprocessed automatically (a minimal sketch of the manual placement is shown after this list).

    The workflow uses the [HLA-LA bioconda package][3] for graph preprocessing and HLA typing.

  • arcasHLA

    Orenbuch, R., Filip, I., Comito, D., Shaman, J., Pe’er, I., & Rabadan, R. (2020). arcasHLA: high-resolution HLA typing from RNAseq. Bioinformatics, 36(1), 33-40.

    The workflow maps RNAseq reads against the human genome (hg38) without alt contigs using the STAR aligner with default parameters. It then invokes the 'extract' and 'genotype' subcommands provided by arcasHLA.

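For the manual placement mentioned in the HLA-LA section above, the PRG_MHC_GRCh38_withIMGT package simply needs to be unpacked into typing/hla_la/hla_la.graphs/ before the workflow is started. Below is a minimal sketch of that step, assuming the archive has already been downloaded into the working directory as PRG_MHC_GRCh38_withIMGT.tar.gz (the file name and location are assumptions, not prescribed by the workflow). Note that a freshly unpacked archive still needs the prepareGraph step, which the automatic download rule performs (see the hla_la snippets below), whereas a graph directory copied from a previous run is already preprocessed.

    # Minimal sketch: manually place the HLA-LA graph package.
    # Assumes PRG_MHC_GRCh38_withIMGT.tar.gz was downloaded beforehand.
    import tarfile
    from pathlib import Path

    archive = Path("PRG_MHC_GRCh38_withIMGT.tar.gz")   # assumed local download
    graphs_dir = Path("typing/hla_la/hla_la.graphs")

    graphs_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        # Creates typing/hla_la/hla_la.graphs/PRG_MHC_GRCh38_withIMGT/
        tar.extractall(path=graphs_dir)
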
Usage

  1. Install snakemake

    conda install -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n snakemake snakemake
    conda activate snakemake
    
  2. Clone the MultiHLA repository

    git clone https://github.com/lkuchenb/MultiHLA.git hla_typing
    cd hla_typing
    
  3. Put the input files in place
    MultiHLA comes with a predefined folder structure:

    • dataset/

      A dataset is defined as a set of samples. Place a TSV file here for every dataset with the following three named columns:

      SampleName  FileNameR1                              FileNameR2
      Donor1      SEQ_D1_DAT_01_S53_L001_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L001_R2_001.fastq.gz
      Donor1      SEQ_D1_DAT_01_S53_L002_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L002_R2_001.fastq.gz
      Donor2      SEQ_D2_DAT_01_S54_L001_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L001_R2_001.fastq.gz
      Donor2      SEQ_D2_DAT_01_S54_L002_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L002_R2_001.fastq.gz
      Donor3      SEQ_D3_DAT_01_S55_L001_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L001_R2_001.fastq.gz
      Donor3      SEQ_D3_DAT_01_S55_L002_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L002_R2_001.fastq.gz
      

      FASTQ files have to come in gzipped pairs and be named {prefix}_R[12]{suffix}.fastq.gz. A sample can be covered by an arbitrary number of FASTQ pairs (at least one). A sketch that checks these conventions is shown after this list.

    • fastq/

      Place the FASTQ files as listed in your dataset sheet here.

    • ref/

      Place or link the required human genome references here as described for each supported method, otherwise they will be automatically downloaded.

    • trim/

      This is an output folder. It will be filled with adapter-trimmed versions of the provided FASTQ files.

    • typing/{method}/

      This is an output folder. It will be filled with subfolders for each method.

    • workflow/

      This folder contains the workflow code.
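
    The sketch below illustrates the dataset sheet and FASTQ naming conventions described for dataset/ and fastq/ above. It is a hypothetical helper, not part of MultiHLA; the column names and the {prefix}_R[12]{suffix}.fastq.gz pattern are taken from the description above, and the sheet path in the example call is only illustrative.

      # Hypothetical check (not part of MultiHLA): validate a dataset sheet
      # against the fastq/ folder and the expected file naming scheme.
      import csv
      import re
      from pathlib import Path

      PAIR_RE = re.compile(r'.+_R[12].*\.fastq\.gz$')

      def check_sheet(sheet_path, fastq_dir="fastq"):
          with open(sheet_path, newline="") as handle:
              for row in csv.DictReader(handle, delimiter="\t"):
                  for key in ("FileNameR1", "FileNameR2"):
                      name = row[key]
                      if not PAIR_RE.match(name):
                          raise ValueError(f"{name} does not match the naming scheme")
                      if not (Path(fastq_dir) / name).exists():
                          raise FileNotFoundError(f"{name} (sample {row['SampleName']}) not found in {fastq_dir}/")

      check_sheet("dataset/dataset_1.tsv")   # illustrative sheet path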

  4. Run the workflow

    Invoke snakemake with snakemake --use-conda --use-singularity. This enables snakemake to automatically install dependencies into conda environments that are created on the fly and also allows the container-based jobs to run. To process all samples of a dataset, for example the dataset dataset_1 described in datasets/dataset_1.tsv, use

    snakemake --use-conda --use-singularity typing/dataset_1.all.multihla
    

    Memory and run time requirements for each job are noted in their resources (mem_mb and time).
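
    To run several datasets in one invocation, one target of the form typing/{dataset}.all.multihla can be requested per dataset sheet. The snippet below is a small convenience sketch, not part of the workflow; it assumes the sheets live in the dataset/ folder described above and simply mirrors the command shown above via subprocess.

    # Convenience sketch (not part of MultiHLA): request all datasets at once.
    import subprocess
    from pathlib import Path

    # One typing/{dataset}.all.multihla target per sheet in dataset/
    targets = [f"typing/{sheet.stem}.all.multihla"
               for sheet in sorted(Path("dataset").glob("*.tsv"))]

    subprocess.run(["snakemake", "--use-conda", "--use-singularity", *targets], check=True)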

Code Snippets

shell:
    """
    base="$PWD"
    log="$base"/{log:q}

    # DOWNLOAD AND UNPACK ARCASHLA
    mkdir -p {output.outdir:q} &>"$log"
    cd {output.outdir:q} &>>"$log"
    tar --strip-components=1 -xf "$base"/{input:q} &>>"$log"

    # OBTAIN REFERENCE
    ./arcasHLA reference --version {wildcards.refversion:q} &>>"$log"
    """
shell:
    """
    rm -rf {output.temp} &>>{log.cli:q}
    mkdir -p {output.temp:q} &>{log.cli:q}
    {input.arcas_bin:q} extract -t {threads} --paired -o {output.temp:q} --log {log.main:q} {input.bam:q} &>>{log.cli:q}
    mv {output.temp:q}/*1.fq.gz {output.r1:q} &>>{log.cli:q}
    mv {output.temp:q}/*2.fq.gz {output.r2:q} &>>{log.cli:q}
    """
shell:
    """
    rm -vrf {output.temp} &>>{log.cli:q}
    mkdir -vp {output.temp:q} &>{log.cli:q}
    {input.arcas_bin:q} genotype -t {threads} -o {output.temp:q} --log {log.main:q} {input.r1:q} {input.r2:q} &>>{log.cli:q}
    mv -v {output.temp:q}/*genes.json {output.genes:q} &>>{log.cli}
    mv -v {output.temp:q}/*genotype.json {output.genotype:q} &>>{log.cli}
    mv -v {output.temp:q}/*alignment.p {output.align:q} &>>{log.cli}
    """
script:
    '../scripts/compare.py'
shell:
    """
    logpath="$PWD/{log}"
    origpath="$PWD"
    date >"$logpath"

    # CHECK DOWNLOADED FILE
    md5sum < {input:q} 2>>"$logpath" | fgrep -q {params.checksum} || echo 'Checksum failure for PRG_MHC_GRCh38_withIMGT.tar.gz' | tee -a "$logpath" >&2

    # UNPACK DOWNLOADED FILE
    cd 'typing/hla_la/hla_la.graphs' &>>"$logpath"
    tar xf "$origpath"/{input:q} &>>"$logpath"

    # PREPARE THE INFERENCE GRAPH
    "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA --action prepareGraph --PRG_graph_dir "$PWD"/PRG_MHC_GRCh38_withIMGT &>>"$logpath"
    """
shell:
    """
    LOGPATH="$PWD/{log}"
    date > "$LOGPATH"
    env >> "$LOGPATH"
    # Check if a proper conda environment is available
    if [ -z "${{CONDA_PREFIX+_}}" ] || [ ! -d "$CONDA_PREFIX"/opt/hla-la ]; then
        echo "--use-conda is required to run the HLA-LA rule." | tee -a "$LOGPATH" >&2
        exit 1
    fi

    # Create and prepare the temp directory
    rm -rf {params.tempdir:q}
    mkdir -p {params.tempdir:q}/src {params.tempdir:q}/bin >>"$LOGPATH"

    # PREPARE DIRECTORY SRC
    cd {params.tempdir:q}/src
    ln -s "$CONDA_PREFIX"/opt/hla-la/src/* . &>>"$LOGPATH"
    mv HLA-LA.pl HLA-LA.pl~
    cp "$CONDA_PREFIX"/opt/hla-la/src/HLA-LA.pl . &>>"$LOGPATH"
    cd ..

    # PREPARE DIRECTORY BIN
    ln -s "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA bin/ &>>"$LOGPATH"

    # PREPARE DIRECTORY GRAPHS
    ln -s ../../../{input.graphs:q} graphs &>>"$LOGPATH"

    # RUN MAIN PROGRAM
    ./src/HLA-LA.pl --sampleID sample --BAM ../../../{input.bam:q} --workingDir "$PWD" --graph PRG_MHC_GRCh38_withIMGT &>>"$LOGPATH"

    # MOVE RESULTS INTO PLACE
    cd ../../../
    rm -f {output.outdir:q} &>>"$LOGPATH"
    mv {params.tempdir:q}/sample {output.outdir:q} &>>"$LOGPATH"
    ln -s ../../{output.outdir:q}/hla/R1_bestguess_G.txt {output.main:q} &>>"$LOGPATH"

    # REMOVE TEMPORARY DIRECTORY
    rm -rf {params.tempdir:q} &>>"$LOGPATH"
    """
shell:
    """
    # Extract a list of read names that were aligned to HLA loci (HLA-A, B, C, DM, DO, DP, DQ, DR, E, F, G, H, J, K, L, P, V, MIC, and TAP)
    samtools view -@ {threads} {input.bam:q} chr6:29907037-29915661 chr6:31319649-31326989 chr6:31234526-31241863 \
        chr6:32914391-32922899 chr6:32900406-32910847 chr6:32969960-32979389 chr6:32778540-32786825 \
        chr6:33030346-33050555 chr6:33041703-33059473 chr6:32603183-32613429 chr6:32707163-32716664 \
        chr6:32625241-32636466 chr6:32721875-32733330 chr6:32405619-32414826 chr6:32544547-32559613 \
        chr6:32518778-32554154 chr6:32483154-32559613 chr6:30455183-30463982 chr6:29689117-29699106 \
        chr6:29792756-29800899 chr6:29793613-29978954 chr6:29855105-29979733 chr6:29892236-29899009 \
        chr6:30225339-30236728 chr6:31369356-31385092 chr6:31460658-31480901 chr6:29766192-29772202 \
        chr6:32810986-32823755 chr6:32779544-32808599 chr6:29756731-29767588 \
        2> {log:q} | awk '{{print $1}}' | LC_ALL=C sort | uniq > {output.locus_ids:q}

    echo "'samtools view' finished extracting chromosome 6 reads." >>{log:q}

    mkdir -p {output.idx_dir:q}

    java -jar -Xmx32g -Xms32g '{params[hla_vbseq_bam_name_index_jar]}' index {input.bam:q} --indexFile {output.idx_dir:q}/index &>>{log:q}
    echo "bamNameIndex.jar [1/2] finished." >>{log:q}

    java -jar '{params[hla_vbseq_bam_name_index_jar]}' search {input.bam:q} --indexFile {output.idx_dir:q}/index --name {output.locus_ids:q} --output {output.partial_sam:q} &>>{log:q}
    echo "bamNameIndex.jar [2/2] finished." >>{log:q}

    picard SamToFastq I={output.partial_sam:q} F={output.partial_fq_1:q} F2={output.partial_fq_2:q} &>>{log:q}
    echo "picard SamToFastq [1/2] finished." >>{log:q}

    samtools view -@ {threads} -bh -f 12 {input.bam:q} >{output.bam_unmapped:q} 2>>{log:q}
    echo "'samtools view' finished extracting unmapped reads." >>{log:q}

    picard SamToFastq I={output.bam_unmapped:q} F={output.unmapped_fq_1:q} F2={output.unmapped_fq_2:q} &>>{log:q}
    echo "picard SamToFastq [2/2] finished." >>{log:q}

    cat {output.partial_fq_1:q} {output.unmapped_fq_1:q} >{output.cat_fq_1} 2>>{log:q}
    cat {output.partial_fq_2:q} {output.unmapped_fq_2:q} >{output.cat_fq_2} 2>>{log:q}
    echo "Generating input FASTQ files finished." >>{log:q}

    bwa index '{params[hla_vbseq_ref_fa]}' &>>{log:q}
    bwa mem -t {threads} -P -L 10000 -a '{params[hla_vbseq_ref_fa]}' {output.cat_fq_1} {output.cat_fq_2} >{output.sam_final:q} 2>>{log:q}
    echo "Mapping against HLA reference finished." >>{log:q}

    # Run VBSeq
    java -jar '{params[hla_vbseq_main_jar]}'  '{params[hla_vbseq_ref_fa]}' {output.sam_final:q} {output.result_txt:q} --alpha_zero 0.01 --is_paired &>>{log:q}
    echo "VBSeq finished." >>{log:q}

    # Call alleles
    python '{params[hla_vbseq_call_hla_digits]}' -v {output.result_txt:q} -a '{params[hla_vbseq_ref_txt]}'  -r 140 -d 4 --ispaired >{output.result_tsv:q} 2>>{log:q}
    echo "Calling HLA alleles finished." >>{log:q}

    echo "" >>{log:q}
    echo "Rule 'hla_vbseq_typing' finished." >>{log:q}
    """
wrapper:
    '0.65.0/bio/bwa-mem2/index'
From line 43 of map/bwa.smk
wrapper:
    "0.65.0/bio/samtools/index"
From line 63 of map/bwa.smk
wrapper:
    "0.65.0/bio/bwa-mem2/mem"
From line 88 of map/bwa.smk
shell:
    """
    touch {output:q}
    """
From line 105 of map/bwa.smk
wrapper:
    "0.67.0/bio/star/index"
From line 48 of map/star.smk
wrapper:
    "0.67.0/bio/star/align"
From line 67 of map/star.smk
run:
    # Parse JSON output of xHLA
    import json
    with open(input[0], 'r') as infile:
        alleles = json.load(infile)['hla']['alleles']

    # Write multihla output
    import collections
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file=outfile)
        data = collections.defaultdict(lambda : [])
        for allele in alleles:
            gene, allele = allele.split('*')
            data[gene].append(allele)
        for gene, alleles in data.items():
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\txHLA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file = outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\txHLA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file = outfile)
            else:
                raise RuntimeError(f'xHLA: More than two alleles reported for gene {gene}.')
script:
    '../scripts/concat_tables.py'
run:
    import csv
    from collections import defaultdict
    ddict = defaultdict(lambda : set())
    with open(input[0], 'r') as infile:
        reader = csv.DictReader(infile, delimiter = '\t')
        for rec in reader:
            # extract first two fields and drop gene name
            allele = ':'.join(rec['Allele'].split(':')[:2]).split('*')[1]
            # record allele
            ddict[rec['Locus']].add(allele)
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file = outfile)
        for gene, alleles in ddict.items():
            alleles = list(alleles)
            if len(alleles)==2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-LA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file = outfile)
            elif len(alleles)==1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-LA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file = outfile)
            else:
                raise RuntimeError('hla_la_conversion expects either one or two alleles to be reported per gene')
script:
    '../scripts/concat_tables.py'
run:
    from csv import DictReader
    with open(input[0], 'r') as infile:
        reader = DictReader(infile, delimiter = '\t')
        with open(output[0], 'w') as outfile:
            for record in reader:
                gene = record["Gene"]
                allele_1 = record["Allele1"].split('*')[1]
                if record["Allele2"]:
                    allele_2 = record["Allele2"].split('*')[1]
                    print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-VBSeq\t{params.opts}\t{gene}\t{allele_1}\t{allele_2}', file = outfile)
                else:
                    print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-VBSeq\t{params.opts}\t{gene}\t{allele_1}\t{allele_1}', file = outfile)
script:
    '../scripts/concat_tables.py'
run:
    import csv
    from collections import defaultdict
    ddict = defaultdict(lambda : set())
    with open(input[0], 'r') as infile:
        reader = csv.DictReader(infile, delimiter = '\t')
        for rec in reader:
            if ddict:
                raise RuntimeError('optitype_conversion requires the result file to contain exactly one record')
            ddict['A'] = { rec['A1'].split('*')[1], rec['A2'].split('*')[1] }
            ddict['B'] = { rec['B1'].split('*')[1], rec['B2'].split('*')[1] }
            ddict['C'] = { rec['C1'].split('*')[1], rec['C2'].split('*')[1] }
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file = outfile)
        for gene, alleles in ddict.items():
            alleles = list(alleles)
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tOptiType\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file = outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tOptiType\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file = outfile)
            else:
                raise RuntimeError('optitype_conversion expects either one or two alleles to be reported per gene')
script:
    '../scripts/concat_tables.py'
script:
    '../scripts/concat_tables.py'
wrapper:
    "0.66.0/bio/optitype"
shell:
    """
    # Run trimgalore
    trim_galore --output_dir trim --fastqc -j {threads} --paired {input[0]:q} {input[1]:q} &>{log:q}

    # trimgalore compresses output only if input was compressed
    [ "{params.input_ext}" = ".gz" ] || gzip trim/*.fq &>>{log:q}

    # Move output into place
    mv -f {params.r1_tmp:q} {output.r1:q}
    mv -f {params.r2_tmp:q} {output.r2:q}
    mv -f {params.r1fastqc_tmp:q} {output.r1fastqc:q}
    mv -f {params.r2fastqc_tmp:q} {output.r2fastqc:q}
    """
script:
    '../scripts/hg_filter_noalt.py'
shell:
    """
    mkdir -p {output.bindir:q}
    mkdir -p {output.tdir:q}

    # This is a hack to work around the fact that xHLA does not allow any
    # control over the number of threads it consumes - it will always consume
    # everything the machine reports. Two components seem to use
    # multithreading, the invocation of diamond and the lpsolver in R. Here
    # we fix the diamond issue by integrating a wrapper script called
    # 'diamond' in the PATH that sets the max threads parameter of diamond
    # accordingly. Currently there is no fix for the R code, which makes
    # this rule vulnerable to cluster engines that enforce the SMP limit.

    echo '#!/bin/bash' > {output.bindir:q}/diamond
    echo 'shift' >> {output.bindir:q}/diamond
    echo '/usr/bin/diamond blastx -p {threads} "$@"' >> {output.bindir:q}/diamond
    chmod +x {output.bindir:q}/diamond
    export PATH="{output.bindir}:$PATH"

    export LC_ALL=C
    ecode=0
    if python /opt/bin/run.py --sample_id sample.$HOSTNAME.$$ --input_bam_path {input.bam:q} --output_path {output.tdir:q} &>{log:q}
    then
        mv -v {output.tdir:q}/report-sample.$HOSTNAME.$$-hla.json {output.json:q} &>>{log:q}
    else
        ecode=1
    fi
    rm -rfv hla-sample.$HOSTNAME.$$ &>>{log:q}
    exit $ecode
    """
import pandas as pd

# Read the ground truth data
gtruth = pd.read_csv(snakemake.input.gtruth, sep='\t')
gtruth['Dataset'] = snakemake.wildcards.dataset
gtruth['Method'] = 'ground_truth'
gtruth['Options'] = ''

# Read the typing results
typing = pd.read_csv(snakemake.input.typing, sep='\t')

# Concatenate and pivot the data
compare = pd.concat([gtruth, typing])

# Create result column with set of called alleles per gene and drop original, individual columns
compare['Result'] = compare.apply(axis='columns', func=lambda x : {x.Allele1, x.Allele2})
compare.drop(columns=['Allele1','Allele2'], inplace = True)

# Pivot the table to have one column per method, drop constant column index level
compare = compare.pivot_table(index=['Dataset','Sample','Gene'], columns=['Method','Options'], aggfunc=lambda x : set.union(*x))
compare = compare.droplevel(0, axis='columns')

# Make the ground truth the first column
compare = compare[ [c for c in compare.columns if c[0]=='ground_truth'] + [c for c in compare.columns if c[0]!='ground_truth'] ]

def annotate(row, method):
    if pd.isna(row[method]) or len(row[method])==0:
        return 'no result'
    gt = row[('ground_truth', '')]
    if row[method] == gt:
        return 'correct'
    if pd.isna(gt) or len(gt)==0:
        return f'no ground truth: {row[method]}'
    if row[method] & gt: # Intersection
        return f'partial match: {row[method]}'
    else:
        return f'mismatch: {row[method]}'

# Create copy with annotated version of results
annotated = compare.copy()
for col in [ c for c in annotated.columns if c[0] != 'ground_truth' ]:
    annotated[col] = annotated.apply(lambda x : annotate(x, col), axis = 'columns')

# Write results
compare.to_csv(snakemake.output.raw, sep='\t', index = True)
annotated.to_csv(snakemake.output.annotated, sep='\t', index = True)
header = False
with open(snakemake.output[0], 'w') as outfile:
    for inpath in snakemake.input:
        with open(inpath, 'r') as infile:
            if header:
                next(infile)
            else:
                header = True
            for row in infile:
                print(row, file = outfile, end = '')
from Bio import SeqIO
import gzip

whitelist = [ 'chr' + e for e in [ str(num) for num in range(1,23) ] + ['X','Y'] ]

with open(snakemake.output[0], 'w') as outfile:
    with gzip.open(snakemake.input[0], 'rt') as infile:
        generator = ( record for record in SeqIO.parse(infile, "fasta") if record.id in whitelist )
        SeqIO.write(generator, outfile, 'fasta')
__author__ = "Christopher Schröder, Patrik Smeds"
__copyright__ = "Copyright 2020, Christopher Schröder, Patrik Smeds"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided.")
elif len(snakemake.input) > 1:
    raise ValueError("Please provide exactly one reference genome as input.")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

shell("bwa-mem2 index" " {prefix}" " {snakemake.input[0]}" " {log}")
__author__ = "Christopher Schröder, Johannes Köster, Julian de Ruiter"
__copyright__ = (
    "Copyright 2020, Christopher Schröder, Johannes Köster and Julian de Ruiter"
)
__email__ = "[email protected] [email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa-mem2 mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
outdir = os.path.dirname(snakemake.output[0])

# get sequencing type
seq_type = snakemake.params.get("sequencing_type", "dna")
seq_type = "--{}".format(seq_type)

# check if non-default config.ini is used
config = snakemake.params.get("config", "")
if any(config):
    config = "--config {}".format(config)

shell(
    "(OptiTypePipeline.py"
    " --input {snakemake.input.reads}"
    " --outdir {outdir}"
    " --prefix {snakemake.wildcards.sample}"
    " {seq_type}"
    " {config}"
    " {extra})"
    " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)

URL: https://github.com/lkuchenb/MultiHLA
Name: multihla
Version: 1
Copyright: Public Domain
License: MIT License