A Snakemake workflow for assembling bacterial genomes according to the standard operating procedure in the Clavel Lab and assessing their quality in accordance with the MIMAG and SeqCode criteria.
Usage
First, the different conda environments can be installed using the following command:
snakemake --use-conda --conda-create-envs-only
Then, the workflow can be used with the default configuration file and 9 threads as follows:
snakemake -c 9 --use-conda
Or using a custom one:
snakemake --configfile config/er2.yaml -c 9 --use-conda
Notes
- The default rerun behavior of Snakemake reruns the workflow upon any change in input, code, parameters, modification time, or software environment. Because the workflow contains checkpoints, this implies that everything downstream of the plasmid extraction is rerun every time. In a situation where the user knows that some steps do not need to run again (e.g., plasmid extraction, contigs curation), the flag `--rerun-triggers {mtime,params,software-env,code}` (instead of the default `--rerun-triggers {mtime,params,input,software-env,code}`) can be added, at the user's own responsibility (see the example command after these notes).
- By default, the assembly uses only the paired reads, but the unpaired reads (quality-filtered, with adapters and PhiX removed) can now be used when needed (e.g., low-quality reverse reads that bring the paired read count too low). To enable this feature, switch to the dedicated git branch:
git checkout with-unpaired
and run the workflow according to the Usage section.
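For example, a rerun that deliberately ignores input changes could be launched as follows (a sketch combining the options shown above; the thread count is the example value from the Usage section):
snakemake -c 9 --use-conda --rerun-triggers mtime params software-env code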
Input/Output
The documentation provides more details on the configuration of the workflow, but in summary:
Input:
- A tab-separated table indicating the isolate name and the paths to the paired-end FASTQ files (see the example below)
- A configuration file
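For illustration, a minimal sample table could look as follows (tab-separated; the `isolate` column name is the one used by the workflow scripts, while the two FASTQ column names shown here are assumptions for this sketch):

```
isolate	forward_file	reverse_file
CLAJMH5	/DATA/hibc_fastq/CLAJMH5_S16_R1_001.fastq.gz	/DATA/hibc_fastq/CLAJMH5_S16_R2_001.fastq.gz
```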
Output:
- The genome of the isolate (with the plasmid sequences, if any) as a gzipped FASTA file in results/genome/isolate.combined.fa.gz
- The plasmid sequences of the isolate as a gzipped FASTA file in results/plasmids/isolate.plasmids.fa.gz
- A comma-separated table (in results) summarising the metadata of the generated genomes, with for example:
  - paths to the previous files and their md5sums
  - genome quality metrics and the software used to calculate them (e.g., completion, contamination, tRNAs, 16S, 23S, 5S)
  - boolean flags indicating which genome quality tests passed or failed

The columns of the output table are detailed in the next section with examples.
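Because the summary is a plain CSV file, it can also be inspected programmatically, for example with pandas (a minimal sketch; the exact file name under results is an assumption):

```python
import pandas as pd

# Load the aggregated summary table; the file name is an assumption
genomes = pd.read_csv("results/summary.csv")

# Keep only the genomes that passed all MIMAG/SeqCode quality flags
high_quality = genomes[genomes["assembly_qual"] == "High-quality draft"]
print(high_quality[["isolate", "genome_length", "N50", "coverage"]])
```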
Output table metadata details
| Metadata | Definition | Example |
| --- | --- | --- |
| `isolate` | Identifier of the isolate provided in the sample table | CLAXXH123 |
| `genome_md5` | Checksum of the FASTA file of the genome | db2cd4eff6d40c463998851f43b15670 |
| `assembly_qual` | Assembly quality: "High-quality draft" if all quality flags are valid, or "Manual review:" followed by the quality flags that were invalid | Manual review: are_contigs_less_100 |
| `genome_length` | Size of the genome in nucleotides, based on the total length of all contigs in the assembly | 4839229 |
| `number_contig` | Number of contigs in the assembly | 134 |
| `N50` | Length such that contigs of this length or longer contain half of the assembled bases | 83304 |
| `number_contig_below_1kb` | Number of small contigs, shorter than 1000 nucleotides | 202 |
| `max_contig_length` | Length of the longest contig | 291966 |
| `coverage` | Ratio of the basepair count of the raw sequences over the basepair count of the assembly (see the worked example after this table) | 394 |
| `assembly_software` | Name of the software used for the assembly | SPAdes |
| `compl_score` | Percentage of estimated marker-based completion of the assembly | 99.46 |
| `compl_software` | Name of the software used for assessing the assembly completion | checkm |
| `contam_score` | Percentage of estimated marker-based contamination of the assembly | 0 |
| `contam_software` | Name of the software used for assessing the assembly contamination | checkm |
| `16S_SSU_rRNA_length` | Semicolon-separated list of lengths of the detected 16S rRNA sequences, sorted in decreasing order; 0 when no sequences are detected | 1393;1037;597;528 |
| `SSU_recover_software` | Name of the software used for detecting the 16S rRNA sequences | metaxa2 |
| `23S_LSU_rRNA_length` | Semicolon-separated list of lengths of the detected 23S rRNA sequences, sorted in decreasing order; 0 when no sequences are detected | 1637;795;528 |
| `LSU_recover_software` | Name of the software used for detecting the 23S rRNA sequences | metaxa2 |
| `trnas` | Number of unique essential tRNAs detected in the assembly among the following twenty: Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr and Val | 20 |
| `trna_ext_software` | Name of the software used for detecting tRNA sequences | tRNAscan-SE |
| `5S_rRNA_length` | Semicolon-separated list of lengths of the detected 5S rRNA sequences, sorted in decreasing order; 0 when no sequences are detected | 110;110;106;105 |
| `archive_file` | Path to the archive of the genome file | results/genome/CLAJMH5.genome.fa.gz |
| `archive_file_md5` | Checksum of the archive of the genome file | db2cd4eff6d40c463998851f43b15670 |
| `forward_file` | Path to the forward/R1 raw sequence file | /DATA/hibc_fastq/CLAJMH5_S16_R1_001.fastq.gz |
| `forward_file_md5` | Checksum of the forward raw sequence file | 1ff2885fe846135980b481996d8e1da6 |
| `reverse_file` | Path to the reverse/R2 raw sequence file | /DATA/hibc_fastq/CLAJMH5_S16_R2_001.fastq.gz |
| `reverse_file_md5` | Checksum of the reverse raw sequence file | 3acd93ba96b97aac4da7756c83f2f990 |
| `sequence_count` | Number of reads in the raw library (count of forward/R1 reads for paired-end data) | 6314677 |
| `basepairs_count` | Total number of basepairs in the raw library (both orientations for paired-end data) | 1907032454 |
| `average_length` | Average read length of the raw sequences in nucleotides, as reported by seqkit stats | 151 |
| `sequence_count_qual` | Number of reads in the library after quality filtering | 4837051 |
| `basepairs_count_qual` | Total number of basepairs after quality filtering | 1324608613 |
| `adapters_file` | Path to the FASTA file containing the adapter sequences | resources/NexteraPE-PE.fa |
| `plasmid_length` | Semicolon-separated list of lengths of the detected plasmid sequences, sorted in decreasing order; 0 when no sequences are detected | 7059 |
| `is_compl_grtr_90` | Logical flag indicating whether the completion (compl_score) is greater than 90% | TRUE |
| `is_contam_less_5` | Logical flag indicating whether the contamination (contam_score) is below 5% | TRUE |
| `is_coverage_grtr_10` | Logical flag indicating whether the coverage is greater than 10x | TRUE |
| `are_contigs_less_100` | Logical flag indicating whether the number of contigs (number_contig) is below 100 | FALSE |
| `is_N50_grtr_25kb` | Logical flag indicating whether the N50 is longer than 25 000 nucleotides | TRUE |
| `is_max_contig_grtr_100kb` | Logical flag indicating whether the longest contig (max_contig_length) is longer than 100 000 nucleotides | TRUE |
| `is_trnas_grtr_18` | Logical flag indicating whether the number of tRNAs is greater than 18 | TRUE |
| `is_SSU_grtr_0` | Logical flag indicating whether at least one 16S/SSU rRNA sequence was detected (length > 0) | TRUE |
| `is_LSU_grtr_0` | Logical flag indicating whether at least one 23S/LSU rRNA sequence was detected (length > 0) | TRUE |
| `is_5S_grtr_0` | Logical flag indicating whether at least one 5S rRNA sequence was detected (length > 0) | TRUE |
| `assembly_date` | Date on which the assembly was generated | 2022-10-12 |
| `workflow_version` | Git tag of the genome assembly workflow used. If not exactly a tagged version like v5.2, the number of commits past the tag and the commit hash are appended | v5.2-1-g4dae043 |
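As a worked illustration of how the derived columns fit together, the following sketch recomputes the coverage and the `assembly_qual` verdict from the example values in the table above (it mirrors the logic of the write_summary_table.py script shown under Code Snippets, not the exact workflow code; the rRNA flags are omitted for brevity):

```python
# Example values taken from the table above
basepairs_count = 1_907_032_454  # basepairs in the raw library (both orientations)
genome_length = 4_839_229        # total length of all contigs in the assembly

# Coverage: raw basepairs over assembly basepairs, rounded
coverage = round(basepairs_count / genome_length)  # -> 394

# Quality flags behind assembly_qual
hq_criteria = {
    "is_compl_grtr_90": 99.46 > 90,                 # compl_score
    "is_contam_less_5": 0 < 5,                      # contam_score
    "is_coverage_grtr_10": coverage > 10,
    "are_contigs_less_100": 134 < 100,              # number_contig: fails
    "is_N50_grtr_25kb": 83_304 > 25_000,
    "is_max_contig_grtr_100kb": 291_966 > 100_000,
    "is_trnas_grtr_18": 20 > 18,                    # trnas
}
failed = [name for name, ok in hq_criteria.items() if not ok]
assembly_qual = (
    "High-quality draft" if not failed else "Manual review: " + ", ".join(failed)
)
print(assembly_qual)  # Manual review: are_contigs_less_100
```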
Changelog
- v7.0: Concatenate genome and plasmid sequences (if any). Add plasmid protein annotations.
- v6.1: Fix conda environments setup by updating the Snakemake wrappers.
- v6.0: Checksums for both the genome archive and the FASTA file are computed. Improved the table. Assembly with unpaired reads is possible.
- v5.3.2: Fixes for conda environments.
- v5.3: Update bakta (fix protein predictions) and minor updates of checkm, metaxa2 and SPAdes. Better documentation of the columns of the output table.
- v5.2: Recovered the behavior of a length of 0 bp when LSU/SSU/plasmids are not detected.
- v5.1: Fix issue in the plasmid length summary. Remove bakta compliant mode.
- v5.0: Better summary with lengths of (multiple) 16S, 23S, 5S and plasmids. Set the minimum contig size to 500 bp. Remove MDMcleaner step. Fix incorrect N50 and contig number by QUAST. Note on tRNAs.
- v4.2: Handle exceptions from MDMcleaner. Fix the plasmid workflow to not trigger unneeded rules.
- v4.1: Fix issue with the input function to keep the workflow flexible with the plasmid assembly.
- v4.0: Fix bugs, such as Recycler failing with an unconnected plasmid assembly graph, or missing gene names in annotations. Add the assembly date and the workflow version to the summary table. Better documentation regarding the rerun behavior of Snakemake.
- v3.2: Make sure to use at least version v1.5 of bakta.
- v3.1: Reduce disk space used by flagging fast-to-generate output files as temporary (e.g., trimmed sequences, BAM file).
- v3.0: Improved README (especially for the configuration in config) and fixed bugs. Failed quality criteria are correctly displayed in the aggregated table, plasmids are produced only when detected, and a length of 0 bp is output when LSU/SSU are not detected.
- v2.0: Extended the assembly quality assessment to evaluate the compliance with the MIMAG and SeqCode criteria and output an aggregated summary table.
- v1.1: Added a quality control by CheckM.
- v1.0: Initial conversion of the Standard Operating Procedure used in the lab to a Snakemake workflow.
Code Snippets
```
shell:
    """
    spades.py -t {threads} --careful \
    -1 {input[0]} -2 {input[1]} \
    -o results/assembly/{wildcards.isolate} &> {log}
    """
```

```
shell:
    "cat {input} | gzip > {output}"
```

```
shell:
    """
    trimmomatic PE -threads {threads} -phred33 \
    {input.r1} {input.r2} \
    {output.r1} {output.r1_unpaired} \
    {output.r2} {output.r2_unpaired} \
    ILLUMINACLIP:{params.illumina_clip}:2:30:10 \
    {params.extra} &> {log}
    """
```

```
wrapper:
    "v1.21.0/bio/bbtools/bbduk"
```

```
shell:
    """
    plasmidspades.py -t {threads} --careful \
    -1 {input[0]} -2 {input[1]} \
    -o results/plasmid_reconstruction/{wildcards.isolate} &> {log}
    """
```

```
shell:
    """
    make_fasta_from_fastg.py -g {input} -o {output} &> {log}
    """
```

```
wrapper:
    "v1.21.0/bio/bwa/index"
```

```
wrapper:
    "v1.21.0/bio/bwa/mem"
```

```
wrapper:
    "v1.21.0/bio/samtools/view"
```

```
wrapper:
    "v1.21.0/bio/samtools/sort"
```

```
wrapper:
    "v1.21.0/bio/samtools/index"
```

```
wrapper:
    "v1.21.0/bio/bbtools/bbduk"
```

```
shell:
    """
    (seqkit fx2tab --name --length --threads {threads} {input} 1> {output} 2> {log} ) || (cat /dev/null > {output})
    """
```

```
shell:
    """
    seqkit seq --remove-gaps --min-len {params.min_contig_length} --threads {threads} {input} 1> {output} 2> {log}
    """
```

```
shell:
    """
    seqkit replace --pattern 'NODE(_[0-9]*_length_[0-9]*)_cov.*' \
    --replacement '{wildcards.isolate}_contig$1' --line-width 0 \
    {input} > {output}
    """
```

```
shell:
    """
    grep '>' {input} | sed 's/>//' |\
    awk '{{OFS="\t"; print $0,$0,"contig","linear","-"}}' > {output}
    """
```

```
shell:
    """
    grep '>' {input} | sed 's/>//' |\
    awk '{{OFS="\t"; print $0,$0,"plasmid","circular","-"}}' > {output}
    """
```

```
shell:
    "cat {input} > {output}"
```

```
shell:
    """
    checkm lineage_wf {params.flag} \
    --extension fa \
    --tab_table --file {output} \
    --threads {threads} \
    results/quality_check/{wildcards.isolate}/ \
    results/quality_check/{wildcards.isolate}/checkm/ &> {log}
    """
```

```
shell:
    """
    bakta_db download --output {output}
    """
```

```
shell:
    """
    bakta --db {params.db}/db/ \
    --prefix {wildcards.isolate} \
    --locus-tag {wildcards.isolate} \
    --output {params.outdir} \
    --replicons {input.replicon} \
    --threads {threads} {input.fasta} &> {log}
    """
```

```
script:
    "../scripts/check_tRNAs_5S.py"
```

```
shell:
    """
    metaxa2_x -i {input} \
    -o SSU-{wildcards.isolate} -f fasta \
    -g ssu --mode genome \
    --table F --fasta T --graphical F --summary F \
    --cpu {threads} &> {log}
    """
```

```
shell:
    """
    seqkit replace -p 'NODE' -r '{wildcards.isolate}_SSU_16S_rRNA NODE' --line-width 0 --threads {threads} {input.seq} > {output}
    """
```

```
shell:
    """
    metaxa2_x -i {input} \
    -o LSU-{wildcards.isolate} -f fasta \
    -g lsu --mode genome \
    --table F --fasta T --graphical F --summary F \
    --cpu {threads} &> {log}
    """
```

```
shell:
    """
    seqkit replace -p 'NODE' -r '{wildcards.isolate}_LSU_23S_rRNA NODE' --line-width 0 --threads {threads} {input.seq} > {output}
    """
```

```
shell:
    """
    mv {input.SSU} {output.SSU}
    mv {input.LSU} {output.LSU}
    """
```

```
script:
    "../scripts/check_LSU_SSU.py"
```

```
shell:
    """
    seqkit stats -T --threads {threads} {input} 1> {output} 2> {log}
    """
```

```
script:
    "../scripts/write_coverage_and_metrics.py"
```

```
shell:
    """
    quast --threads {threads} --labels "{wildcards.isolate}.raw,{wildcards.isolate}.final" \
    --no-icarus --min-contig 0 --output-dir {output.quast_dir} {input.raw_assembly} {input.final_assembly} > {log}
    """
```

```
shell:
    "md5sum {input} 1> {output} 2> {log}"
```

```
shell:
    "md5sum {input} 1> {output} 2> {log}"
```
```
# Checksum of the uncompressed genome: md5sum prints "-" for stdin,
# which sed replaces with the input path (minus the .gz suffix)
shell:
    "zcat {input}| md5sum| sed -e 's#-#{input}#' -e 's/.gz$//' 1> {output} 2> {log}"
```
```
script:
    "../scripts/write_summary_table.py"
```

```
script:
    "../scripts/aggregate_summaries.py"
```
```python
import sys
import pandas as pd
from functools import reduce

sys.stderr = open(snakemake.log[0], "w")

# Read the summary CSV file of each processed isolate and transpose the long table into a wide one
loaded_df = [pd.read_csv(csv, index_col=0).T for csv in snakemake.input]

# Concatenate all the genomes
aggregate_df = pd.concat(loaded_df)

# Make sure the isolate column does not remain in the index
aggregate_df = aggregate_df.rename_axis('isolate').reset_index()

# Write the aggregated table to file
aggregate_df.to_csv(snakemake.output[0], index=False)
```
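For reference, each per-isolate summary consumed here is written in long format by write_summary_table.py (one `field,value` row per metadata column, produced by its final `genome_csv.T.to_csv(snakemake.output[0], header=False)` call), hence the transposition above. An illustrative excerpt, using the example values from the metadata table:

```
isolate,CLAJMH5
genome_md5,db2cd4eff6d40c463998851f43b15670
assembly_qual,Manual review: are_contigs_less_100
genome_length,4839229
```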
```python
import sys
import os
import re
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Initialize a dictionary of subunit lengths
subunit_length = {}

# Read the reports giving the lengths of all the SSU (16S) and LSU (23S) FASTA sequences detected
for subunit in ['SSU', 'LSU']:
    if os.path.getsize(snakemake.input[subunit]) > 0:
        # Read the report
        ssu_lsu = pd.read_table(snakemake.input[subunit], names = ['file', 'length'])
        # Extract the lengths and sort in descending order
        ssu_lsu_lengths = ssu_lsu.loc[:,'length'].sort_values(ascending = False).tolist()
        # Format the lengths as strings and concatenate. No concatenation if only one element
        subunit_length[subunit+'_bp'] = ';'.join([ str(x) for x in ssu_lsu_lengths ])
    else: # if the report is empty
        subunit_length[subunit+'_bp'] = "0"

# Export the length dictionary as a dataframe
subunit_length_df = pd.DataFrame(subunit_length, index=[snakemake.wildcards.isolate])
subunit_length_df.to_csv(snakemake.output[0])
```
```python
import sys
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Read Bakta tabular annotations
annotations = pd.read_table(snakemake.input[0], sep="\t", header=2)
# Cast the column as string to fix an AttributeError when no gene names are found
# ("Can only use .str accessor with string values")
annotations['Gene'] = annotations['Gene'].astype(str)
# Keep a subset with the tRNAs in the predicted gene name; NaN counts as no match
annotations_trnas = annotations[annotations['Gene'].str.contains('_trna', na=False)]

# Dictionary of counts of the number of occurrences of each provided tRNA
trnas = {x: annotations_trnas['Gene'].str.count(x+'_trna').sum() for x in snakemake.params.tRNAs}
# How many tRNAs were detected?
how_many = sum([trnas[x]>0 for x in trnas.keys()])

# Subset the annotations with the predicted 5S rRNA genes
five_s = annotations[annotations.Gene == '5S_rrna']
# Compute the lengths
five_s_lengths = five_s.Stop - five_s.Start
# Extract the lengths and sort in descending order
five_s_lengths = five_s_lengths.sort_values(ascending = False).tolist()
# Format the lengths as strings and concatenate. No concatenation if only one element
if five_s_lengths:
    five_s_lengths = ';'.join([ str(x) for x in five_s_lengths ])
else:
    five_s_lengths = '0'

# Export the detailed counts of tRNAs as a dataframe
detailled_df = pd.DataFrame(trnas, index=[snakemake.wildcards.isolate])
detailled_df.to_csv(snakemake.output.details)

# Export the summary count dictionary of tRNAs and 5S as a dataframe
summary_df = pd.merge(
    pd.DataFrame({'tRNAs': how_many}, index=[snakemake.wildcards.isolate]),
    pd.DataFrame({'5S_rRNA_length': five_s_lengths}, index=[snakemake.wildcards.isolate]),
    left_index=True, right_index=True)
summary_df.to_csv(snakemake.output.summary)
```
```python
import sys
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Initialize a dictionary of metrics
metrics = {}

# Read the seqkit stats table for raw sequences R1 (and possibly R2)
raw_df = pd.read_table(snakemake.input['raw'], sep="\t")
# Gather metrics on raw sequences
metrics['sequence_count'] = raw_df.at[0, 'num_seqs']
metrics['basepairs_count'] = raw_df.loc[:, 'sum_len'].sum()  # if paired end data
metrics['average_length'] = raw_df.at[0, 'avg_len']

# Read the seqkit stats table for trimmed/quality sequences R1 (and possibly R2)
trimmed_df = pd.read_table(snakemake.input['trimmed'], sep="\t")
# Gather metrics on trimmed/quality filtered sequences
metrics['sequence_count_qual'] = trimmed_df.at[0, 'num_seqs']
metrics['basepairs_count_qual'] = trimmed_df.loc[:, 'sum_len'].sum()  # if paired end data

# Read the seqkit stats table for the assembly
assembly_df = pd.read_table(snakemake.input['assembly'], sep="\t")

# Compute the coverage as the basepairs count of the raw sequences over the basepairs count of the assembly
metrics['coverage'] = round(metrics['basepairs_count'] / assembly_df.at[0, 'sum_len'])

# Export the metrics dictionary as a dataframe
metrics_df = pd.DataFrame(metrics, index=[snakemake.wildcards.isolate])
metrics_df.to_csv(snakemake.output[0])
```
```python
import sys
import os
import pandas as pd
from functools import reduce
from datetime import date
import subprocess

sys.stderr = open(snakemake.log[0], "w")

# Read the csv files of basepairs metrics and SSU/LSU extraction
dict_csv = {
    key: pd.read_csv(snakemake.input[key], index_col=0)
    for key in ['metrics', 'ssu_lsu', 'trnas_5s']
}
# Read the tsv files of the sample table provided in the configuration file
# and the outputs from the genome quality assessment tools
dict_table = {
    key: pd.read_table(snakemake.input[key], sep='\t')
    for key in ['samples', 'checkm', 'quast']
}
# Read the checksums of the raw fastq files and the genome fasta file
dict_md5 = {
    key: pd.read_table(snakemake.input[key], sep='\s+', names=['md5', 'file'])
    for key in ['fastq_md5', 'archive_md5', 'genome_md5']
}

# Keep the csv as a list
list_csv = list(dict_csv.values())

# Append the samples table to the csv list of genome metrics, SSU/LSU and tRNAs
dict_table['samples'].set_index('isolate', inplace=True)
list_csv.append(dict_table['samples'])

# Merge the tables based on the index, that is the isolate name
merged = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True), list_csv)

# Add quality metrics from the dictionaries containing the tables
merged['compl_score'] = float(dict_table['checkm'].loc[
    dict_table['checkm']['Bin Id'] == snakemake.wildcards.isolate + '.genome', 'Completeness'])
merged['compl_software'] = 'checkm'
merged['contam_score'] = float(dict_table['checkm'].loc[
    dict_table['checkm']['Bin Id'] == snakemake.wildcards.isolate + '.genome', 'Contamination'])
merged['contam_software'] = 'checkm'

# Set the index for the QUAST data
dict_table['quast'].set_index('Assembly', inplace=True)

# Add assembly metrics
merged['N50'] = dict_table['quast'].loc[snakemake.wildcards.isolate + '.final', 'N50']
merged['genome_length'] = dict_table['quast'].loc[snakemake.wildcards.isolate + '.final', 'Total length']
merged['number_contig'] = dict_table['quast'].loc[snakemake.wildcards.isolate + '.final', '# contigs (>= 0 bp)']
merged['number_contig_below_1kb'] = dict_table['quast'].loc[snakemake.wildcards.isolate + '.raw', '# contigs (>= 0 bp)'] - dict_table['quast'].loc[snakemake.wildcards.isolate + '.raw', '# contigs (>= 1000 bp)']
merged['max_contig_length'] = dict_table['quast'].loc[snakemake.wildcards.isolate + '.final', 'Largest contig']

# Add software information
merged['SSU_recover_software'] = merged['LSU_recover_software'] = 'metaxa2'
merged['trna_ext_software'] = 'tRNAscan-SE'
merged['assembly_software'] = 'SPAdes'

# Add the location and checksums of the genome
merged['archive_file'] = dict_md5['archive_md5'].at[0, 'file']
merged['archive_file_md5'] = dict_md5['archive_md5'].at[0, 'md5']
merged['genome_md5'] = dict_md5['genome_md5'].at[0, 'md5']

# Add information about the raw sequences: location, checksums, metrics
merged['forward_file_md5'] = dict_md5['fastq_md5'].at[0, 'md5']
merged['reverse_file_md5'] = dict_md5['fastq_md5'].at[1, 'md5']

# Rename columns
merged.rename(columns={'SSU_bp': '16S_SSU_rRNA_length',
                       'LSU_bp': '23S_LSU_rRNA_length',
                       'tRNAs': 'trnas'}, inplace=True)

# Compute criteria of the MIMAG and SeqCode
hq_criteria = {
    'is_compl_grtr_90': merged.at[snakemake.wildcards.isolate, 'compl_score'] > 90,
    'is_contam_less_5': merged.at[snakemake.wildcards.isolate, 'contam_score'] < 5,
    'is_coverage_grtr_10': merged.at[snakemake.wildcards.isolate, 'coverage'] > 10,
    'are_contigs_less_100': merged.at[snakemake.wildcards.isolate, 'number_contig'] < 100,
    'is_N50_grtr_25kb': merged.at[snakemake.wildcards.isolate, 'N50'] > 25000,
    'is_max_contig_grtr_100kb': merged.at[snakemake.wildcards.isolate, 'max_contig_length'] > 100000,
    'is_trnas_grtr_18': merged.at[snakemake.wildcards.isolate, 'trnas'] > 18,
    'is_SSU_grtr_0': merged.at[snakemake.wildcards.isolate, '16S_SSU_rRNA_length'] != "0",
    'is_LSU_grtr_0': merged.at[snakemake.wildcards.isolate, '23S_LSU_rRNA_length'] != "0",
    'is_5S_grtr_0': merged.at[snakemake.wildcards.isolate, '5S_rRNA_length'] != "0",
}

# Assess which criteria of the MIMAG and SeqCode did not pass
to_check = [x for x in hq_criteria.keys() if not hq_criteria[x]]
criteria_false = ', '.join(to_check)

# Indicate the assembly quality depending on the compliance with the MIMAG and SeqCode criteria
merged['assembly_qual'] = 'High-quality draft' if all(hq_criteria.values()) else 'Manual review: ' + criteria_false

# Add the flags of the criteria to the main table
merged = pd.merge(merged,
                  pd.DataFrame(hq_criteria, index=[snakemake.wildcards.isolate]),
                  left_index=True, right_index=True)

# Add the plasmid lengths if a non-empty file was provided
if os.path.getsize(snakemake.input['plasmids']) > 0:
    # Read the report
    plasmids = pd.read_table(snakemake.input['plasmids'], names=['file', 'length'])
    # Extract the lengths and sort in descending order
    plasmids = plasmids.loc[:, 'length'].sort_values(ascending=False).tolist()
    # Format the lengths as strings and concatenate. No concatenation if only one element
    merged['plasmid_length'] = ';'.join([str(x) for x in plasmids])
else: # if the report is empty
    merged['plasmid_length'] = '0'

# Add the adapters file used
merged['adapters_file'] = snakemake.params.adapters

# Reorder columns
genome_csv = merged.reindex(columns=['genome_md5', 'assembly_qual', 'genome_length',
    'number_contig', 'N50', 'number_contig_below_1kb', 'max_contig_length',
    'coverage', 'assembly_software', 'compl_score', 'compl_software',
    'contam_score', 'contam_software', '16S_SSU_rRNA_length', 'SSU_recover_software',
    '23S_LSU_rRNA_length', 'LSU_recover_software', 'trnas', 'trna_ext_software',
    '5S_rRNA_length', 'archive_file', 'archive_file_md5', 'forward_file',
    'forward_file_md5', 'reverse_file', 'reverse_file_md5', 'sequence_count',
    'basepairs_count', 'average_length', 'sequence_count_qual', 'basepairs_count_qual',
    'adapters_file', 'plasmid_length', 'is_compl_grtr_90', 'is_contam_less_5',
    'is_coverage_grtr_10', 'are_contigs_less_100', 'is_N50_grtr_25kb',
    'is_max_contig_grtr_100kb', 'is_trnas_grtr_18', 'is_SSU_grtr_0',
    'is_LSU_grtr_0', 'is_5S_grtr_0'])

# Add the date when the assembly was done
genome_csv['assembly_date'] = date.today()

# Get the version of the workflow (and the last commit if no tags are present) and add it to the table
try:
    get_tag = subprocess.check_output("git describe --always", shell=True, text=True)
    genome_csv['workflow_version'] = get_tag.strip()
except subprocess.CalledProcessError:
    genome_csv['workflow_version'] = "NA"
    print("Either git is not installed or the workflow is not run in the git repository")

# Convert the index to a column 'isolate' for clarity
# src: https://stackoverflow.com/a/63926255
genome_csv = genome_csv.rename_axis('isolate').reset_index()

# Write the table to file (transposed: one "column,value" row per metadata field)
genome_csv.T.to_csv(snakemake.output[0], header=False)
```
```python
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "in={}".format(snakemake.input.sample)
    trimmed = "out={}".format(snakemake.output.trimmed)
else:
    reads = "in={} in2={}".format(*snakemake.input.sample)
    trimmed = "out={} out2={}".format(*snakemake.output.trimmed)

singleton = snakemake.output.get("singleton", "")
if singleton:
    singleton = f"outs={singleton}"

discarded = snakemake.output.get("discarded", "")
if discarded:
    discarded = f"outm={discarded}"

stats = snakemake.output.get("stats", "")
if stats:
    stats = f"stats={stats}"

shell(
    "bbduk.sh {java_opts} t={snakemake.threads} "
    "{reads} "
    "{adapters} "
    "{extra} "
    "{trimmed} {singleton} {discarded} "
    "{stats} "
    "{log}"
)
```
```python
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os.path import splitext
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be used as input!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
```
```python
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"

import tempfile
from os import path

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts

# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)

index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"
elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"
    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"
elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"
else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import tempfile
from pathlib import Path

from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

with tempfile.TemporaryDirectory() as tmpdir:
    tmp_prefix = Path(tmpdir) / "samtools_fastq.sort_"
    shell(
        "samtools sort {samtools_opts} {extra} -T {tmp_prefix} {snakemake.input[0]} {log}"
    )
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)

shell("samtools view {samtools_opts} {extra} {snakemake.input[0]} {region} {log}")
```