Scope of this workflow
This workflow enables the concurrent analysis of WES or WGS data with several publicly available software tools in order to derive HLA haplotypes from such data.
Currently available software tools
- xHLA

  Xie, C., Yeo, Z. X., Wong, M., Piper, J., Long, T., Kirkness, E. F., ... & Brady, C. (2017). Fast and accurate HLA typing from short-read next-generation sequence data with xHLA. Proceedings of the National Academy of Sciences, 114(30), 8059-8064.

  The workflow maps the reads against hg38 without alt contigs using `bwa mem`, as instructed by the authors. The mapped reads are then sorted and indexed using samtools. The workflow utilizes the [Docker Image][2] provided by the authors to perform the actual HLA typing; a sketch of these steps is shown below.
- HLA-VBSeq

  Nariai, N., Kojima, K., Saito, S., Mimori, T., Sato, Y., Kawai, Y., ... & Nagasaki, M. (2015, December). HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data. In BMC Genomics (Vol. 16, No. S2, p. S7). BioMed Central.

  Wang, Y. Y., Mimori, T., Khor, S. S., Gervais, O., Kawai, Y., Hitomi, Y., ... & Nagasaki, M. (2019). HLA-VBSeq v2: improved HLA calling accuracy with full-length Japanese class-I panel. Human Genome Variation, 6(1), 1-5.

  The workflow maps the reads against hg19 without alt contigs (see the sketch after this list). The authors' instructions merely state to "map against hg19" without any further specifics, but mapping against hg19 with alt contigs yielded very poor typing results with missing HLA class I genes; the workflow therefore uses hg19 without alt contigs.

  HLA-VBSeq released two reference database versions:

  - v1, based on IMGT/HLA database Release 3.15.0
  - v2, based on IMGT/HLA database Release 3.31.0 and the Japanese HLA reference dataset
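  Condensed to plain commands, the HLA-VBSeq branch looks roughly like the following sketch. File names are placeholders, only one of the many HLA regions is shown, and the full pipeline (read-name indexing, FASTQ conversion, allele calling) appears under Code Snippets:

  ```bash
  # Collect reads mapped to HLA loci on chr6 (one region shown) plus unmapped read pairs
  samtools view sample.hg19_noalt.bam chr6:29907037-29915661 | awk '{print $1}' | sort -u > hla_reads.txt
  samtools view -bh -f 12 sample.hg19_noalt.bam > unmapped.bam

  # Re-map the candidate reads against the HLA allele reference and run VBSeq
  bwa index hla_ref_v2.fasta
  bwa mem -t 8 -P -L 10000 -a hla_ref_v2.fasta cand_1.fq cand_2.fq > hla.sam
  java -jar HLAVBSeq.jar hla_ref_v2.fasta hla.sam result.txt --alpha_zero 0.01 --is_paired
  ```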
- OptiType

  Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M., & Kohlbacher, O. (2014). OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30(23), 3310-3316.

  The workflow invokes the [OptiType snakemake wrapper][1] without prior filtering of reads.
- HLA-LA

  Dilthey, A. T., Mentzer, A. J., Carapito, R., Cutland, C., Cereb, N., Madhi, S. A., ... & Phillippy, A. M. (2019). HLA*LA - HLA typing from linearly projected graph alignments. Bioinformatics, 35(21), 4394-4396.

  The workflow uses reads mapped against the human genome (hg38) without alt contigs as input for HLA-LA. A corresponding reference txt file for HLA-LA is part of this workflow repository. The preprocessed graph directory `PRG_MHC_GRCh38_withIMGT` can either be placed manually in `typing/hla_la/hla_la.graphs/` or it will be downloaded and preprocessed automatically. The workflow uses the [HLA-LA bioconda package][3] for graph preprocessing and HLA typing.
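  If you place the graph manually, the steps roughly correspond to this sketch, assuming a locally downloaded archive named `PRG_MHC_GRCh38_withIMGT.tar.gz` (mirroring the download rule under Code Snippets):

  ```bash
  # Unpack the graph archive into the expected location
  mkdir -p typing/hla_la/hla_la.graphs
  tar -xf PRG_MHC_GRCh38_withIMGT.tar.gz -C typing/hla_la/hla_la.graphs

  # One-time graph preparation via the bioconda installation of HLA-LA
  "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA --action prepareGraph \
      --PRG_graph_dir typing/hla_la/hla_la.graphs/PRG_MHC_GRCh38_withIMGT
  ```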
- arcasHLA

  Orenbuch, R., Filip, I., Comito, D., Shaman, J., Pe'er, I., & Rabadan, R. (2020). arcasHLA: high-resolution HLA typing from RNAseq. Bioinformatics, 36(1), 33-40.

  The workflow maps RNAseq reads against the human genome (hg38) without alt contigs using the STAR aligner with default parameters. It then invokes the 'extract' and 'genotype' subtools provided by arcasHLA.
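  Reduced to plain commands (placeholder file and directory names; the actual rules appear under Code Snippets), the two arcasHLA steps are roughly:

  ```bash
  # Extract read pairs overlapping HLA loci from the STAR alignment
  arcasHLA extract -t 8 --paired -o extracted/ sample.bam

  # Genotype from the extracted pairs (output FASTQ names follow arcasHLA's *1.fq.gz/*2.fq.gz convention)
  arcasHLA genotype -t 8 -o genotype/ extracted/sample.extracted.1.fq.gz extracted/sample.extracted.2.fq.gz
  ```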
Usage
- Install snakemake

  ```bash
  conda install -c conda-forge mamba
  mamba create -c conda-forge -c bioconda -n snakemake snakemake
  conda activate snakemake
  ```
- Clone the MultiHLA repository

  ```bash
  git clone https://github.com/lkuchenb/MultiHLA.git hla_typing
  cd hla_typing
  ```
- Put the input files in place

  MultiHLA comes with a predefined folder structure (a setup sketch follows this list):
  - dataset/

    A dataset is defined as a set of samples. Place a TSV file here for every dataset with the following three named columns:

    ```
    SampleName  FileNameR1                              FileNameR2
    Donor1      SEQ_D1_DAT_01_S53_L001_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L001_R2_001.fastq.gz
    Donor1      SEQ_D1_DAT_01_S53_L002_R1_001.fastq.gz  SEQ_D1_DAT_01_S53_L002_R2_001.fastq.gz
    Donor2      SEQ_D2_DAT_01_S54_L001_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L001_R2_001.fastq.gz
    Donor2      SEQ_D2_DAT_01_S54_L002_R1_001.fastq.gz  SEQ_D2_DAT_01_S54_L002_R2_001.fastq.gz
    Donor3      SEQ_D3_DAT_01_S55_L001_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L001_R2_001.fastq.gz
    Donor3      SEQ_D3_DAT_01_S55_L002_R1_001.fastq.gz  SEQ_D3_DAT_01_S55_L002_R2_001.fastq.gz
    ```

    FASTQ files have to come in gzipped pairs and be named `{prefix}_R[12]{suffix}.fastq.gz`. A sample can be covered by an arbitrary number of FASTQ pairs (at least one).
  - fastq/

    Place the FASTQ files as listed in your dataset sheet here.

  - ref/

    Place or link the required human genome references here as described for each supported method; otherwise they will be downloaded automatically.

  - trim/

    This is an output folder. It will be filled with adapter-trimmed versions of the provided FASTQ files.

  - typing/{method}/

    This is an output folder. It will be filled with subfolders for each method.

  - workflow/

    This folder contains the workflow code.
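  For example, setting up a dataset could look like this (the paths and the dataset name `dataset_1` are placeholders):

  ```bash
  cp /path/to/dataset_1.tsv dataset/    # dataset sheet with SampleName/FileNameR1/FileNameR2 columns
  ln -s /path/to/raw/*.fastq.gz fastq/  # FASTQ pairs referenced in the sheet
  ```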
- Run the workflow

  Invoke snakemake using `snakemake --use-conda --use-singularity`. This enables snakemake to automatically install dependencies into conda environments that are created on the fly and also enables the container based jobs to run. To process all samples of a dataset, for example the dataset `dataset_1` described in `datasets/dataset_1.tsv`, use `snakemake --use-conda --use-singularity typing/dataset_1.all.multihla`.

  Memory and run time requirements for each job are noted in their resources (`mem_mb` and `time`).
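  For example, these resource annotations can be enforced in a local run like this (an illustrative invocation; adjust the values to your machine):

  ```bash
  # Run with up to 4 parallel jobs and an overall memory budget of 64 GB
  snakemake --use-conda --use-singularity --jobs 4 --resources mem_mb=64000 typing/dataset_1.all.multihla
  ```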
Code Snippets
```
shell:
    """
    base="$PWD"
    log="$base"/{log:q}

    # DOWNLOAD AND UNPACK ARCASHLA
    mkdir -p {output.outdir:q} &>"$log"
    cd {output.outdir:q} &>>"$log"
    tar --strip-components=1 -xf "$base"/{input:q} &>>"$log"

    # OBTAIN REFERENCE
    ./arcasHLA reference --version {wildcards.refversion:q} &>>"$log"
    """
```
```
shell:
    """
    rm -rf {output.temp} &>>{log.cli:q}
    mkdir -p {output.temp:q} &>{log.cli:q}
    {input.arcas_bin:q} extract -t {threads} --paired -o {output.temp:q} --log {log.main:q} {input.bam:q} &>>{log.cli:q}
    mv {output.temp:q}/*1.fq.gz {output.r1:q} &>>{log.cli:q}
    mv {output.temp:q}/*2.fq.gz {output.r2:q} &>>{log.cli:q}
    """
```
```
shell:
    """
    rm -vrf {output.temp} &>>{log.cli:q}
    mkdir -vp {output.temp:q} &>{log.cli:q}
    {input.arcas_bin:q} genotype -t {threads} -o {output.temp:q} --log {log.main:q} {input.r1:q} {input.r2:q} &>>{log.cli:q}
    mv -v {output.temp:q}/*genes.json {output.genes:q} &>>{log.cli}
    mv -v {output.temp:q}/*genotype.json {output.genotype:q} &>>{log.cli}
    mv -v {output.temp:q}/*alignment.p {output.align:q} &>>{log.cli}
    """
```
```
script:
    '../scripts/compare.py'
```
```
shell:
    """
    logpath="$PWD/{log}"
    origpath="$PWD"
    date >"$logpath"

    # CHECK DOWNLOADED FILE
    md5sum < {input:q} 2>>"$logpath" | fgrep -q {params.checksum} || echo 'Checksum failure for PRG_MHC_GRCh38_withIMGT.tar.gz' | tee -a "$logpath" >&2

    # UNPACK DOWNLOADED FILE
    cd 'typing/hla_la/hla_la.graphs' &>>"$logpath"
    tar xf "$origpath"/{input:q} &>>"$logpath"

    # PREPARE THE INFERENCE GRAPH
    "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA --action prepareGraph --PRG_graph_dir "$PWD"/PRG_MHC_GRCh38_withIMGT &>>"$logpath"
    """
```
```
shell:
    """
    LOGPATH="$PWD/{log}"
    date > "$LOGPATH"
    env >> "$LOGPATH"

    # Check if a proper conda environment is available
    if [ -z "${{CONDA_PREFIX+_}}" ] || [ ! -d "$CONDA_PREFIX"/opt/hla-la ]; then
        echo "--use-conda is required to run the HLA-LA rule." | tee -a "$LOGPATH" >&2
        exit 1
    fi

    # Create and prepare the temp directory
    rm -rf {params.tempdir:q}
    mkdir -p {params.tempdir:q}/src {params.tempdir:q}/bin >>"$LOGPATH"

    # PREPARE DIRECTORY SRC
    cd {params.tempdir:q}/src
    ln -s "$CONDA_PREFIX"/opt/hla-la/src/* . &>>"$LOGPATH"
    mv HLA-LA.pl HLA-LA.pl~
    cp "$CONDA_PREFIX"/opt/hla-la/src/HLA-LA.pl . &>>"$LOGPATH"
    cd ..

    # PREPARE DIRECTORY BIN
    ln -s "$CONDA_PREFIX"/opt/hla-la/bin/HLA-LA bin/ &>>"$LOGPATH"

    # PREPARE DIRECTORY GRAPHS
    ln -s ../../../{input.graphs:q} graphs &>>"$LOGPATH"

    # RUN MAIN PROGRAM
    ./src/HLA-LA.pl --sampleID sample --BAM ../../../{input.bam:q} --workingDir "$PWD" --graph PRG_MHC_GRCh38_withIMGT &>>"$LOGPATH"

    # MOVE RESULTS INTO PLACE
    cd ../../../
    rm -f {output.outdir:q} &>>"$LOGPATH"
    mv {params.tempdir:q}/sample {output.outdir:q} &>>"$LOGPATH"
    ln -s ../../{output.outdir:q}/hla/R1_bestguess_G.txt {output.main:q} &>>"$LOGPATH"

    # REMOVE TEMPORARY DIRECTORY
    rm -rf {params.tempdir:q} &>>"$LOGPATH"
    """
```
```
shell:
    """
    # Extract a list of read names that were aligned to HLA loci (HLA-A, B, C, DM, DO, DP, DQ, DR, E, F, G, H, J, K, L, P, V, MIC, and TAP)
    samtools view -@ {threads} {input.bam:q} chr6:29907037-29915661 chr6:31319649-31326989 chr6:31234526-31241863 \
        chr6:32914391-32922899 chr6:32900406-32910847 chr6:32969960-32979389 chr6:32778540-32786825 \
        chr6:33030346-33050555 chr6:33041703-33059473 chr6:32603183-32613429 chr6:32707163-32716664 \
        chr6:32625241-32636466 chr6:32721875-32733330 chr6:32405619-32414826 chr6:32544547-32559613 \
        chr6:32518778-32554154 chr6:32483154-32559613 chr6:30455183-30463982 chr6:29689117-29699106 \
        chr6:29792756-29800899 chr6:29793613-29978954 chr6:29855105-29979733 chr6:29892236-29899009 \
        chr6:30225339-30236728 chr6:31369356-31385092 chr6:31460658-31480901 chr6:29766192-29772202 \
        chr6:32810986-32823755 chr6:32779544-32808599 chr6:29756731-29767588 \
        2> {log:q} | awk '{{print $1}}' | LC_ALL=C sort | uniq > {output.locus_ids:q}
    echo "'samtools view' finished extracting chromosome 6 reads." >>{log:q}

    mkdir -p {output.idx_dir:q}
    java -jar -Xmx32g -Xms32g '{params[hla_vbseq_bam_name_index_jar]}' index {input.bam:q} --indexFile {output.idx_dir:q}/index &>>{log:q}
    echo "bamNameIndex.jar [1/2] finished." >>{log:q}

    java -jar '{params[hla_vbseq_bam_name_index_jar]}' search {input.bam:q} --indexFile {output.idx_dir:q}/index --name {output.locus_ids:q} --output {output.partial_sam:q} &>>{log:q}
    echo "bamNameIndex.jar [2/2] finished." >>{log:q}

    picard SamToFastq I={output.partial_sam:q} F={output.partial_fq_1:q} F2={output.partial_fq_2:q} &>>{log:q}
    echo "picard SamToFastq [1/2] finished." >>{log:q}

    samtools view -@ {threads} -bh -f 12 {input.bam:q} >{output.bam_unmapped:q} 2>>{log:q}
    echo "'samtools view' finished extracting unmapped reads." >>{log:q}

    picard SamToFastq I={output.bam_unmapped:q} F={output.unmapped_fq_1:q} F2={output.unmapped_fq_2:q} &>>{log:q}
    echo "picard SamToFastq [2/2] finished." >>{log:q}

    cat {output.partial_fq_1:q} {output.unmapped_fq_1:q} >{output.cat_fq_1} 2>>{log:q}
    cat {output.partial_fq_2:q} {output.unmapped_fq_2:q} >{output.cat_fq_2} 2>>{log:q}
    echo "Generating input FASTQ files finished." >>{log:q}

    bwa index '{params[hla_vbseq_ref_fa]}' &>>{log:q}
    bwa mem -t {threads} -P -L 10000 -a '{params[hla_vbseq_ref_fa]}' {output.cat_fq_1} {output.cat_fq_2} >{output.sam_final:q} 2>>{log:q}
    echo "Mapping against HLA reference finished." >>{log:q}

    # Run VBSeq
    java -jar '{params[hla_vbseq_main_jar]}' '{params[hla_vbseq_ref_fa]}' {output.sam_final:q} {output.result_txt:q} --alpha_zero 0.01 --is_paired &>>{log:q}
    echo "VBSeq finished." >>{log:q}

    # Call alleles
    python '{params[hla_vbseq_call_hla_digits]}' -v {output.result_txt:q} -a '{params[hla_vbseq_ref_txt]}' -r 140 -d 4 --ispaired >{output.result_tsv:q} 2>>{log:q}
    echo "Calling HLA alleles finished." >>{log:q}

    echo "" >>{log:q}
    echo "Rule 'hla_vbseq_typing' finished." >>{log:q}
    """
```
```
wrapper:
    '0.65.0/bio/bwa-mem2/index'
```
```
wrapper:
    "0.65.0/bio/samtools/index"
```
```
wrapper:
    "0.65.0/bio/bwa-mem2/mem"
```
```
shell:
    """
    touch {output:q}
    """
```
```
wrapper:
    "0.67.0/bio/star/index"
```
```
wrapper:
    "0.67.0/bio/star/align"
```
```
run:
    # Parse JSON output of xHLA
    import json
    with open(input[0], 'r') as infile:
        alleles = json.load(infile)['hla']['alleles']

    # Write multihla output
    import collections
    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file=outfile)
        data = collections.defaultdict(lambda: [])
        for allele in alleles:
            gene, allele = allele.split('*')
            data[gene].append(allele)
        for gene, alleles in data.items():
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\txHLA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file=outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\txHLA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file=outfile)
            else:
                raise RuntimeError(f'xHLA: More than two alleles reported for gene {gene}.')
```
```
script:
    '../scripts/concat_tables.py'
```
```
run:
    import csv
    from collections import defaultdict

    ddict = defaultdict(lambda: set())
    with open(input[0], 'r') as infile:
        reader = csv.DictReader(infile, delimiter='\t')
        for rec in reader:
            # extract first two fields and drop gene name
            allele = ':'.join(rec['Allele'].split(':')[:2]).split('*')[1]
            # record allele
            ddict[rec['Locus']].add(allele)

    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file=outfile)
        for gene, alleles in ddict.items():
            alleles = list(alleles)
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-LA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file=outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-LA\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file=outfile)
            else:
                raise RuntimeError('hla_la_conversion expects either one or two alleles to be reported per gene')
```
```
script:
    '../scripts/concat_tables.py'
```
```
run:
    from csv import DictReader
    with open(input[0], 'r') as infile:
        reader = DictReader(infile, delimiter='\t')
        with open(output[0], 'w') as outfile:
            for record in reader:
                gene = record["Gene"]
                allele_1 = record["Allele1"].split('*')[1]
                if record["Allele2"]:
                    allele_2 = record["Allele2"].split('*')[1]
                    print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-VBSeq\t{params.opts}\t{gene}\t{allele_1}\t{allele_2}', file=outfile)
                else:
                    print(f'{wildcards.dataset}\t{wildcards.sample}\tHLA-VBSeq\t{params.opts}\t{gene}\t{allele_1}\t{allele_1}', file=outfile)
```
```
script:
    '../scripts/concat_tables.py'
```
```
run:
    import csv
    from collections import defaultdict

    ddict = defaultdict(lambda: set())
    with open(input[0], 'r') as infile:
        reader = csv.DictReader(infile, delimiter='\t')
        for rec in reader:
            if ddict:
                raise RuntimeError('optitype_conversion requires the result file to contain exactly one record')
            ddict['A'] = {rec['A1'].split('*')[1], rec['A2'].split('*')[1]}
            ddict['B'] = {rec['B1'].split('*')[1], rec['B2'].split('*')[1]}
            ddict['C'] = {rec['C1'].split('*')[1], rec['C2'].split('*')[1]}

    with open(output[0], 'w') as outfile:
        print('Dataset\tSample\tMethod\tOptions\tGene\tAllele1\tAllele2', file=outfile)
        for gene, alleles in ddict.items():
            alleles = list(alleles)
            if len(alleles) == 2:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tOptiType\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[1]}', file=outfile)
            elif len(alleles) == 1:
                print(f'{wildcards.dataset}\t{wildcards.sample}\tOptiType\t{params.opts}\t{gene}\t{alleles[0]}\t{alleles[0]}', file=outfile)
            else:
                raise RuntimeError('optitype_conversion expects either one or two alleles to be reported per gene')
```
```
script:
    '../scripts/concat_tables.py'
```
```
script:
    '../scripts/concat_tables.py'
```
```
wrapper:
    "0.66.0/bio/optitype"
```
```
shell:
    """
    # Run trimgalore
    trim_galore --output_dir trim --fastqc -j {threads} --paired {input[0]:q} {input[1]:q} &>{log:q}

    # trimgalore compresses output only if input was compressed
    [ "{params.input_ext}" = ".gz" ] || gzip trim/*.fq &>>{log:q}

    # Move output into place
    mv -f {params.r1_tmp:q} {output.r1:q}
    mv -f {params.r2_tmp:q} {output.r2:q}
    mv -f {params.r1fastqc_tmp:q} {output.r1fastqc:q}
    mv -f {params.r2fastqc_tmp:q} {output.r2fastqc:q}
    """
```
```
script:
    '../scripts/hg_filter_noalt.py'
```
```
shell:
    """
    mkdir -p {output.bindir:q}
    mkdir -p {output.tdir:q}

    # This is a hack to circumvent that xHLA does not allow for any control
    # over the number of threads it consumes - it will always consume
    # everything the machine reports. Two components seem to use
    # multithreading, the invocation of diamond and the lpsolver in R. Here
    # we fix the diamond issue by integrating a wrapper script called
    # 'diamond' in the PATH that sets the max threads parameter of diamond
    # accordingly. Currently there is no fix for the R code, which makes
    # this rule vulnerable to cluster engines that enforce the SMP limit.
    echo '#!/bin/bash' > {output.bindir:q}/diamond
    echo 'shift' >> {output.bindir:q}/diamond
    echo '/usr/bin/diamond blastx -p {threads} "$@"' >> {output.bindir:q}/diamond
    chmod +x {output.bindir:q}/diamond
    export PATH="{output.bindir}:$PATH"

    export LC_ALL=C

    ecode=0
    if python /opt/bin/run.py --sample_id sample.$HOSTNAME.$$ --input_bam_path {input.bam:q} --output_path {output.tdir:q} &>{log:q}
    then
        mv -v {output.tdir:q}/report-sample.$HOSTNAME.$$-hla.json {output.json:q} &>>{log:q}
    else
        ecode=1
    fi

    rm -rfv hla-sample.$HOSTNAME.$$ &>>{log:q}
    exit $ecode
    """
```
```python
import pandas as pd

# Read the ground truth data
gtruth = pd.read_csv(snakemake.input.gtruth, sep='\t')
gtruth['Dataset'] = snakemake.wildcards.dataset
gtruth['Method'] = 'ground_truth'
gtruth['Options'] = ''

# Read the typing results
typing = pd.read_csv(snakemake.input.typing, sep='\t')

# Concatenate and pivot the data
compare = pd.concat([gtruth, typing])

# Create result column with set of called alleles per gene and drop original, individual columns
compare['Result'] = compare.apply(axis='columns', func=lambda x: {x.Allele1, x.Allele2})
compare.drop(columns=['Allele1', 'Allele2'], inplace=True)

# Pivot the table to have one column per method, drop constant column index level
compare = compare.pivot_table(index=['Dataset', 'Sample', 'Gene'], columns=['Method', 'Options'], aggfunc=lambda x: set.union(*x))
compare = compare.droplevel(0, axis='columns')

# Make the ground truth the first column
compare = compare[
    [c for c in compare.columns if c[0] == 'ground_truth'] +
    [c for c in compare.columns if c[0] != 'ground_truth']
]

def annotate(row, method):
    if pd.isna(row[method]) or len(row[method]) == 0:
        return 'no result'
    gt = row[('ground_truth', '')]
    if row[method] == gt:
        return 'correct'
    if pd.isna(gt) or len(gt) == 0:
        return f'no ground truth: {row[method]}'
    if row[method] & gt:  # Intersection
        return f'partial match: {row[method]}'
    else:
        return f'mismatch: {row[method]}'

# Create copy with annotated version of results
annotated = compare.copy()
for col in [c for c in annotated.columns if c[0] != 'ground_truth']:
    annotated[col] = annotated.apply(lambda x: annotate(x, col), axis='columns')

# Write results
compare.to_csv(snakemake.output.raw, sep='\t', index=True)
annotated.to_csv(snakemake.output.annotated, sep='\t', index=True)
```
```python
header = False
with open(snakemake.output[0], 'w') as outfile:
    for inpath in snakemake.input:
        with open(inpath, 'r') as infile:
            if header:
                next(infile)
            else:
                header = True
            for row in infile:
                print(row, file=outfile, end='')
```
```python
from Bio import SeqIO
import gzip

whitelist = ['chr' + e for e in [str(num) for num in range(1, 23)] + ['X', 'Y']]

with open(snakemake.output[0], 'w') as outfile:
    with gzip.open(snakemake.input[0], 'rt') as infile:
        generator = (record for record in SeqIO.parse(infile, "fasta") if record.id in whitelist)
        SeqIO.write(generator, outfile, 'fasta')
```
```python
__author__ = "Christopher Schröder, Patrik Smeds"
__copyright__ = "Copyright 2020, Christopher Schröder, Patrik Smeds"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided.")
elif len(snakemake.input) > 1:
    raise ValueError("Please provide exactly one reference genome as input.")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

shell(
    "bwa-mem2 index"
    " {prefix}"
    " {snakemake.input[0]}"
    " {log}"
)
```
```python
__author__ = "Christopher Schröder, Johannes Köster, Julian de Ruiter"
__copyright__ = (
    "Copyright 2020, Christopher Schröder, Johannes Köster and Julian de Ruiter"
)
__email__ = "[email protected], [email protected], [email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa-mem2 mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
```
```python
__author__ = "Jan Forster"
__copyright__ = "Copyright 2020, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"

import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

outdir = os.path.dirname(snakemake.output[0])

# get sequencing type
seq_type = snakemake.params.get("sequencing_type", "dna")
seq_type = "--{}".format(seq_type)

# check if non-default config.ini is used
config = snakemake.params.get("config", "")
if any(config):
    config = "--config {}".format(config)

shell(
    "(OptiTypePipeline.py"
    " --input {snakemake.input.reads}"
    " --outdir {outdir}"
    " --prefix {snakemake.wildcards.sample}"
    " {seq_type}"
    " {config}"
    " {extra})"
    " {log}"
)
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
```
```python
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)
```