A Snakemake workflow assembling bacterial genomes according to the standard operating procedure in the Clavel Lab


A Snakemake workflow for assembling bacterial genomes according to the standard operating procedure in the Clavel Lab, and for assessing their quality in accordance with the MIMAG and SeqCode criteria.

Usage

First, the conda environments required by the workflow can be installed using the following command:

snakemake --use-conda --conda-create-envs-only

Then, the workflow can be used with the default configuration file and 9 threads as follows:

snakemake -c 9 --use-conda

Or with a custom configuration file:

snakemake --configfile config/er2.yaml -c 9 --use-conda

Notes

  1. By default, Snakemake reruns the workflow for any change in input files, code, parameters, modification times or software environments. Because of the presence of checkpoints, this implies that everything downstream of the plasmid extraction is rerun every time. When the user knows that some steps do not need to be run again (e.g., plasmid extraction, contigs curation), the flag --rerun-triggers {mtime,params,software-env,code} (i.e., the default --rerun-triggers {mtime,params,input,software-env,code} without the input trigger) can be added at the user's responsibility, for example: snakemake -c 9 --use-conda --rerun-triggers mtime params software-env code

  2. By default, the assembly uses only the paired reads, but the unpaired reads (quality-filtered, adapter- and PhiX-removed) can now be included when needed (e.g., when low-quality reverse reads bring the number of paired reads too low). To enable this feature, switch to the dedicated git branch with git checkout with-unpaired and run the workflow as described in the Usage section.

Input/Output

The documentation provides more details on the configuration of the workflow, but in summary:

Input:

  • A tab-separated table indicating the isolate name and the paths to the paired-end FASTQ files (a sketch of such a table follows this list)

  • A configuration file
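For illustration, such a sample table could be created with pandas as sketched below; only the isolate column name is confirmed by the workflow scripts, and the read-path column names and output path are hypothetical (see the documentation for the exact schema):

import pandas as pd

# Hypothetical sample table: 'isolate' is the identifier used throughout the
# workflow; the FASTQ path column names below are illustrative only
samples = pd.DataFrame({
    "isolate": ["CLAXXH123"],
    "forward": ["/DATA/fastq/CLAXXH123_R1.fastq.gz"],
    "reverse": ["/DATA/fastq/CLAXXH123_R2.fastq.gz"],
})
samples.to_csv("config/samples.tsv", sep="\t", index=False)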

Output:

  • The genome of the isolate (with the plasmid sequences, if any) as a gzipped FASTA file in results/genome/isolate.combined.fa.gz

  • The plasmid sequences of the isolate as a gzipped FASTA file in results/plasmids/isolate.plasmids.fa.gz

  • A comma-separated table (in results) summarising the metadata of the generated genomes, including for example:

    • paths to the previous files and their md5sums

    • genome quality metrics and the software used to calculate them (e.g., completion, contamination, tRNAs, 16S, 23S, 5S)

    • boolean flags indicating which genome quality tests passed or failed

The columns of the output table are detailed in the next section with examples.
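For instance, the genomes that failed one or more quality criteria can be listed from the aggregated table with pandas; a minimal sketch, assuming the aggregated table was written to results/summary.csv (the actual file name depends on the configuration):

import pandas as pd

summary = pd.read_csv("results/summary.csv")  # hypothetical output path

# Genomes passing all MIMAG/SeqCode checks are labelled 'High-quality draft';
# the others start with 'Manual review:' followed by the failed flags
to_review = summary[summary["assembly_qual"] != "High-quality draft"]
print(to_review[["isolate", "assembly_qual"]])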

Output table metadata details

Metadata Definition Example
isolate Identifier of the isolate provided in the sample table CLAXXH123
genome_md5 Checksum of the FASTA file of the genome db2cd4eff6d40c463998851f43b15670
assembly_qual Assembly quality: “High-quality draft” if all quality flags are valid, or “Manual review:” followed by the quality flags that failed Manual review: are_contigs_less_100
genome_length Size of the genome in nucleotides, based on the total length of all contigs in the assembly 4839229
number_contig Number of contigs in the assembly 134
N50 Contig length such that contigs of at least this length contain 50% of the assembled bases 83304
number_contig_below_1kb Number of small contigs, shorter than 1000 nucleotides 202
max_contig_length Length of the longest contig 291966
coverage Ratio of the basepairs count of the raw sequences over the basepairs count of the assembly (e.g., 1907032454 / 4839229 ≈ 394) 394
assembly_software Name of the software used for the assembly SPAdes
compl_score Percentage of estimated marker-based completion of the assembly 99.46
compl_software Name of the software used for assessing the assembly completion checkm
contam_score Percentage of estimated marker-based contamination of the assembly 0
contam_software Name of the software used for assessing the assembly contamination checkm
16S_SSU_rRNA_length Semicolon-separated list of the lengths of the detected 16S rRNA sequences, sorted in decreasing order; 0 when no sequence is detected 1393;1037;597;528
SSU_recover_software Name of the software used for detecting the 16S rRNA sequences metaxa2
23S_LSU_rRNA_length Semicolon-separated list of the lengths of the detected 23S rRNA sequences, sorted in decreasing order; 0 when no sequence is detected 1637;795;528
LSU_recover_software Name of the software used for detecting the 23S rRNA sequences metaxa2
trnas Number of unique essential tRNAs detected in the assembly among the following twenty: Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr and Val. 20
trna_ext_software Name of the software used for detecting tRNAs sequences tRNAscan-SE
5S_rRNA_length Semicolon-separated list of the lengths of the detected 5S rRNA sequences, sorted in decreasing order; 0 when no sequence is detected 110;110;106;105
archive_file Path to the archive of the genome file results/genome/CLAJMH5.genome.fa.gz
archive_file_md5 Checksum of the archive of the genome file db2cd4eff6d40c463998851f43b15670
forward_file Path to the forward/R1 raw sequence file /DATA/hibc_fastq/CLAJMH5_S16_R1_001.fastq.gz
forward_file_md5 Checksum of the forward raw sequence file 1ff2885fe846135980b481996d8e1da6
reverse_file Path to the reverse/R2 raw sequence file /DATA/hibc_fastq/CLAJMH5_S16_R2_001.fastq.gz
reverse_file_md5 Checksum of the reverse raw sequence file 3acd93ba96b97aac4da7756c83f2f990
sequence_count Number of reads in the library (sequencing depth) 6314677
basepairs_count Total number of basepairs in the raw library 1907032454
average_length The average read length estimated as basepairs_count divided by sequence_count 151
sequence_count_qual Number of reads in the library (sequencing depth) after quality filtering 4837051
basepairs_count_qual Total number of basepairs after quality filtering 1324608613
adapters_file Path to the FASTA file containing the adapters sequences resources/NexteraPE-PE.fa
plasmid_length Semicolon-separated list of the lengths of the detected plasmid sequences, sorted in decreasing order; 0 when no sequence is detected 7059
is_compl_grtr_90 Logical flag indicating whether the completion, in compl_score, is greater than 90% TRUE
is_contam_less_5 Logical flag indicating whether the contamination, in contam_score, is less than 5% TRUE
is_coverage_grtr_10 Logical flag indicating whether the coverage is greater than 10x TRUE
are_contigs_less_100 Logical flag indicating whether the number of contigs, in number_contig, is below 100 FALSE
is_N50_grtr_25kb Logical flag indicating whether the N50 is longer than 25 000 nucleotides TRUE
is_max_contig_grtr_100kb Logical flag indicating whether the longest contig, in max_contig_length, is longer than 100 000 nucleotides TRUE
is_trnas_grtr_18 Logical flag indicating whether the number of tRNAs is greater than 18 TRUE
is_SSU_grtr_0 Logical flag indicating whether at least one 16S/SSU rRNA sequence was detected (length greater than 0) TRUE
is_LSU_grtr_0 Logical flag indicating whether at least one 23S/LSU rRNA sequence was detected (length greater than 0) TRUE
is_5S_grtr_0 Logical flag indicating whether at least one 5S rRNA sequence was detected (length greater than 0) TRUE
assembly_date Date on which the assembly was generated 2022-10-12
workflow_version Git tag of the genome assembly workflow used. If not a round version like v5.2, then the commit hash past the round version is added v5.2-1-g4dae043
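Note that genome_md5 is computed on the uncompressed FASTA (zcat followed by md5sum in the workflow), whereas archive_file_md5 covers the gzipped archive itself. Both can be reproduced outside the workflow; a minimal sketch in Python, using the example path from the table above:

import gzip
import hashlib

archive = "results/genome/CLAJMH5.genome.fa.gz"

# archive_file_md5: checksum of the gzipped archive itself
with open(archive, "rb") as fh:
    archive_md5 = hashlib.md5(fh.read()).hexdigest()

# genome_md5: checksum of the uncompressed FASTA content
with gzip.open(archive, "rb") as fh:
    genome_md5 = hashlib.md5(fh.read()).hexdigest()

print(archive_md5, genome_md5)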

Changelog

  • v7.0: Concatenate genome and plasmid sequences (if any). Add plasmid protein annotations.

  • v6.1: Fix conda environments setup by updating snakemake wrappers.

  • v6.0: Checksums for both the genome archive and FASTA file are computed. Improved the table. Assembly with unpaired reads is possible.

  • v5.3.2: Fixes for conda environments

  • v5.3: Update bakta (fix protein predictions) and minor updates of checkm, metaxa2 and SPAdes. Better documentation of the columns of the output table.

  • v5.2: Restored the behavior of reporting a length of 0 bp when LSU/SSU/plasmids are not detected.

  • v5.1: Fix issue in plasmid length summary. Remove bakta compliant mode.

  • v5.0: Better summary with lengths of (multiple) 16S, 23S, 5S and plasmids. Set the minimum contig size to 500 bp. Remove MDMcleaner step. Fix incorrect N50 and contig number by QUAST. Note on tRNAs.

  • v4.2: Handle exceptions from MDMcleaner. Fix plasmid workflow to not trigger unneeded rules.

  • v4.1: Fix issue with the input function to keep the workflow flexible with the plasmid assembly

  • v4.0: Fix bugs such as issue with Recycler failing with unconnected plasmid assembly graph or missing gene names in annotations. Add assembly date and workflow version to the summary table. Better documentation regarding the rerun behavior of Snakemake

  • v3.2: Make sure to use at least the version v1.5 of bakta

  • v3.1: Reduce disk space used by flagging fast-to-generate output files as temporary (e.g., trimmed sequences, BAM file)

  • v3.0: Improved README (especially for the configuration in config) and fixed bugs. Failed quality criteria are correctly displayed in the aggregated table, plasmids are produced only when detected, and a length of 0 bp is output when LSU/SSU are not detected.

  • v2.0: Extended the assembly quality assessment to evaluate the compliance to the MIMAG and SeqCode criteria and output an aggregated summary table

  • v1.1: Added a quality control by CheckM

  • v1.0: Initial conversion of the Standard Operating Procedure used in the lab to a Snakemake workflow

Code Snippets

# Assemble the paired reads with SPAdes in careful mode
shell:
    """
    spades.py -t {threads} --careful \
        -1 {input[0]} -2 {input[1]} \
        -o results/assembly/{wildcards.isolate} &> {log}
    """
# Concatenate the input files into a gzipped archive
shell:
    "cat {input} | gzip > {output}"
# Trim adapters and quality-filter the paired reads with Trimmomatic
shell:
    """
    trimmomatic PE -threads {threads} -phred33 \
    {input.r1} {input.r2} \
    {output.r1} {output.r1_unpaired} \
    {output.r2} {output.r2_unpaired} \
    ILLUMINACLIP:{params.illumina_clip}:2:30:10 \
    {params.extra} &> {log}
    """
wrapper:
    "v1.21.0/bio/bbtools/bbduk"
# Reconstruct plasmid sequences with plasmidSPAdes
shell:
    """
    plasmidspades.py -t {threads} --careful \
        -1 {input[0]} -2 {input[1]} \
        -o results/plasmid_reconstruction/{wildcards.isolate} &> {log}
    """
# Convert the plasmid assembly graph (FASTG) to FASTA
shell:
    """
    make_fasta_from_fastg.py -g {input} -o {output} &> {log}
    """
wrapper:
    "v1.21.0/bio/bwa/index"
wrapper:
    "v1.21.0/bio/bwa/mem"
wrapper:
    "v1.21.0/bio/samtools/view"
wrapper:
    "v1.21.0/bio/samtools/sort"
wrapper:
    "v1.21.0/bio/samtools/index"
wrapper:
    "v1.21.0/bio/bbtools/bbduk"
# List sequence names and lengths with seqkit; fall back to an empty file if none are found
shell:
    """
    (seqkit fx2tab --name --length --threads {threads} {input} 1> {output} 2> {log} ) || (cat /dev/null > {output})
    """
# Remove gaps and discard contigs shorter than the configured minimum length
shell:
    """
    seqkit seq --remove-gaps --min-len {params.min_contig_length} --threads {threads} {input} 1> {output} 2> {log}
    """
# Rename the SPAdes contig headers with isolate-specific identifiers
shell:
    """
    seqkit replace --pattern 'NODE(_[0-9]*_length_[0-9]*)_cov.*' \
    --replacement '{wildcards.isolate}_contig$1' --line-width 0 \
    {input} > {output}
    """
# Build a tab-separated replicon table marking each contig as linear
shell:
    """
    grep '>' {input} | sed 's/>//' |\
            awk '{{OFS="\t"; print $0,$0,"contig","linear","-"}}' > {output}
    """
# Build a tab-separated replicon table marking each plasmid as circular
shell:
    """
    grep '>' {input} | sed 's/>//' |\
            awk '{{OFS="\t"; print $0,$0,"plasmid","circular","-"}}' > {output}
    """
# Concatenate the input files
shell:
    "cat {input} > {output}"
# Estimate completeness and contamination with CheckM
shell:
    """
    checkm lineage_wf {params.flag} \
    --extension fa \
    --tab_table --file {output} \
    --threads {threads} \
    results/quality_check/{wildcards.isolate}/ \
    results/quality_check/{wildcards.isolate}/checkm/ &> {log}
    """
# Download the bakta annotation database
shell:
    """
    bakta_db download --output {output}
    """
# Annotate the genome with bakta using the replicon table
shell:
    """
    bakta --db {params.db}/db/ \
    --prefix {wildcards.isolate} \
    --locus-tag {wildcards.isolate} \
    --output {params.outdir} \
    --replicons {input.replicon} \
    --threads {threads} {input.fasta} &> {log}
    """
script:
    "../scripts/check_tRNAs_5S.py"
# Detect SSU (16S) rRNA sequences with Metaxa2
shell:
    """
    metaxa2_x -i {input} \
        -o SSU-{wildcards.isolate} -f fasta \
        -g ssu --mode genome \
        --table F --fasta T --graphical F --summary F \
        --cpu {threads} &> {log}
    """
# Prefix the detected 16S sequence headers with the isolate name
shell:
    """
    seqkit replace -p 'NODE' -r '{wildcards.isolate}_SSU_16S_rRNA NODE' --line-width 0 --threads {threads} {input.seq} > {output}
    """
# Detect LSU (23S) rRNA sequences with Metaxa2
shell:
    """
    metaxa2_x -i {input} \
        -o LSU-{wildcards.isolate} -f fasta \
        -g lsu --mode genome \
        --table F --fasta T --graphical F --summary F \
        --cpu {threads} &> {log}
    """
# Prefix the detected 23S sequence headers with the isolate name
shell:
    """
    seqkit replace -p 'NODE' -r '{wildcards.isolate}_LSU_23S_rRNA NODE' --line-width 0 --threads {threads} {input.seq} > {output}
    """
# Move the extracted SSU and LSU sequences to their final locations
shell:
    """
    mv {input.SSU} {output.SSU}
    mv {input.LSU} {output.LSU}
    """
script:
    "../scripts/check_LSU_SSU.py"
# Compute sequence statistics in tabular form with seqkit
shell:
    """
    seqkit stats -T --threads {threads} {input} 1> {output} 2> {log}
    """
script:
    "../scripts/write_coverage_and_metrics.py"
# Compare the raw and final assemblies with QUAST
shell:
    """
    quast --threads {threads} --labels "{wildcards.isolate}.raw,{wildcards.isolate}.final" \
    --no-icarus --min-contig 0 --output-dir {output.quast_dir} {input.raw_assembly} {input.final_assembly} > {log}
    """
# Compute the MD5 checksum of the input file
shell:
    "md5sum {input} 1> {output} 2> {log}"
shell:
    "md5sum {input} 1> {output} 2> {log}"
# Compute the MD5 checksum of the uncompressed FASTA inside the gzipped archive
shell:
    "zcat {input}| md5sum| sed -e 's#-#{input}#' -e 's/.gz$//' 1> {output} 2> {log}"
script:
    "../scripts/write_summary_table.py"
script:
    "../scripts/aggregate_summaries.py"
# scripts/aggregate_summaries.py
import sys
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Read the summary CSV file of each processed isolate and transpose the long table into a wide table
loaded_df = [pd.read_csv(csv, index_col=0).T for csv in snakemake.input]

# Concatenate all the genomes
aggregate_df = pd.concat(loaded_df)

# Make sure the isolate column does not remain in the index
aggregate_df = aggregate_df.rename_axis('isolate').reset_index()

# Write the aggregated table to file
aggregate_df.to_csv(snakemake.output[0], index=False)
# scripts/check_LSU_SSU.py
import sys
import os
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Initialize a dictionary of subunit length
subunit_length = {}

# Read the report listing the lengths of all the SSU (16S) and LSU (23S) FASTA sequences detected
for subunit in ['SSU', 'LSU']:
    if os.path.getsize(snakemake.input[subunit]) > 0:
        # Read the report
        ssu_lsu = pd.read_table(snakemake.input[subunit], names = ['file', 'length'])
        # Extract the lengths and sort in descending order
        ssu_lsu_lengths = ssu_lsu.loc[:,'length'].sort_values(ascending = False).tolist()
        # Format the lengths as string and concatenate. No concatenation if only one element
        subunit_length[subunit+'_bp'] = ';'.join([ str(x) for x in ssu_lsu_lengths ])
    else: # if the report is empty
        subunit_length[subunit+'_bp'] = "0"
# Export the length dictionary as a dataframe
subunit_length_df = pd.DataFrame(subunit_length, index=[snakemake.wildcards.isolate])
subunit_length_df.to_csv(snakemake.output[0])
# scripts/check_tRNAs_5S.py
import sys
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Read Bakta tabular annotations
annotations = pd.read_table(snakemake.input[0], sep="\t", header=2)
# Casting the column as string to fix a AttributeError when no gene names are found
#   Can only use .str accessor with string values
annotations['Gene']=annotations['Gene'].astype(str)
# Keep the annotations whose predicted gene name contains '_trna' (NaN entries are treated as non-matching)
annotations_trnas = annotations[annotations['Gene'].str.contains('_trna', na=False)]

# Dictionary counts of the number of occurrences of each provided tRNA
trnas = {x: annotations_trnas['Gene'].str.count(x+'_trna').sum() for x in snakemake.params.tRNAs}

# How many tRNAs were detected?
how_many = sum([trnas[x]>0 for x in trnas.keys()])

# Subset the annotations with the predicted 5S rRNA genes
five_s = annotations[annotations.Gene == '5S_rrna']
# Compute the length
five_s_lengths = five_s.Stop - five_s.Start
# Extract the lengths and sort in descending order
five_s_lengths = five_s_lengths.sort_values(ascending = False).tolist()
# Format the lengths as strings and concatenate. No concatenation if only one element
if five_s_lengths:
    five_s_lengths = ';'.join([ str(x) for x in five_s_lengths ])
else:
    five_s_lengths = '0'



# Export the detailed counts of tRNAs as a dataframe
detailed_df = pd.DataFrame(trnas, index=[snakemake.wildcards.isolate])
detailed_df.to_csv(snakemake.output.details)


# Export the summary count dictionary of tRNAs and 5S as a dataframe
summary_df = pd.merge(
        pd.DataFrame({'tRNAs': how_many}, index=[snakemake.wildcards.isolate]),
        pd.DataFrame({'5S_rRNA_length': five_s_lengths}, index=[snakemake.wildcards.isolate]),
        left_index=True, right_index=True)
summary_df.to_csv(snakemake.output.summary)
# scripts/write_coverage_and_metrics.py
import sys
import pandas as pd

sys.stderr = open(snakemake.log[0], "w")

# Initialize a dictionary of metrics
metrics = {}

# Read the seqkit stats table for raw sequences R1 (and possibly R2)
raw_df = pd.read_table(snakemake.input['raw'], sep="\t")

# Gather metrics on raw sequences
metrics['sequence_count'] = raw_df.at[0, 'num_seqs']
metrics['basepairs_count'] = raw_df.loc[:, 'sum_len'].sum() # if paired end data
metrics['average_length'] = raw_df.at[0, 'avg_len']

# Read the seqkit stats table for trimmed/quality sequences R1 (and possibly R2)
trimmed_df = pd.read_table(snakemake.input['trimmed'], sep="\t")

# Gather metrics on trimmed/quality filtered sequences
metrics['sequence_count_qual'] = trimmed_df.at[0, 'num_seqs']
metrics['basepairs_count_qual'] = trimmed_df.loc[:, 'sum_len'].sum() # if paired end data

# Read the seqkit stats table for the assembly
assembly_df = pd.read_table(snakemake.input['assembly'], sep="\t")

# Compute the coverage as the basepairs count of raw sequence over the basepair count of the assembly
metrics['coverage'] = round(metrics['basepairs_count'] / assembly_df.at[0, 'sum_len'])

# Export the metrics dictionary as a dataframe
metrics_df = pd.DataFrame(metrics, index=[snakemake.wildcards.isolate])
metrics_df.to_csv(snakemake.output[0])
# scripts/write_summary_table.py
import sys
import os
import pandas as pd
from functools import reduce
from datetime import date
import subprocess

sys.stderr = open(snakemake.log[0], "w")

# Read the csv files of basepairs metrics and SSU/LSU extraction
dict_csv = { key: pd.read_csv(snakemake.input[key], index_col=0) for key in ['metrics', 'ssu_lsu', 'trnas_5s'] }
# Read the TSV files: the sample table provided in the configuration file and the outputs of the genome quality assessment tools
dict_table = { key: pd.read_table(snakemake.input[key], sep='\t') for key in ['samples', 'checkm', 'quast'] }
# Read the checksums of the raw fastq files and the genome fasta file
dict_md5 = { key: pd.read_table(snakemake.input[key], sep=r'\s+', names=['md5', 'file']) for key in ['fastq_md5', 'archive_md5', 'genome_md5'] }


# Keep the csv as a list
list_csv = list(dict_csv.values())
# Index the samples table by isolate and append it to the list of genome metrics, SSU/LSU and tRNAs tables
dict_table['samples'].set_index('isolate', inplace=True)
list_csv.append(dict_table['samples'])

# Merge the tables on their shared index, i.e., the isolate name
merged = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True), list_csv)

# Add quality metrics from the dictionaries containing the tables
merged['compl_score'] = float(dict_table['checkm'].loc[ dict_table['checkm']['Bin Id']==snakemake.wildcards.isolate+'.genome', 'Completeness'])
merged['compl_software'] = 'checkm'
merged['contam_score'] = float(dict_table['checkm'].loc[ dict_table['checkm']['Bin Id']==snakemake.wildcards.isolate+'.genome', 'Contamination'])
merged['contam_software'] = 'checkm'

# Set the index for the QUAST data
dict_table['quast'].set_index('Assembly', inplace=True)

# Add assembly metrics
merged['N50'] = dict_table['quast'].loc[snakemake.wildcards.isolate+'.final', 'N50']
merged['genome_length'] = dict_table['quast'].loc[snakemake.wildcards.isolate+'.final', 'Total length']
merged['number_contig'] = dict_table['quast'].loc[snakemake.wildcards.isolate+'.final', '# contigs (>= 0 bp)']
merged['number_contig_below_1kb'] = dict_table['quast'].loc[snakemake.wildcards.isolate+'.raw', '# contigs (>= 0 bp)'] - dict_table['quast'].loc[snakemake.wildcards.isolate+'.raw', '# contigs (>= 1000 bp)']
merged['max_contig_length'] =  dict_table['quast'].loc[snakemake.wildcards.isolate+'.final','Largest contig']

# Add software information
merged['SSU_recover_software'] = merged['LSU_recover_software'] = 'metaxa2'
merged['trna_ext_software'] = 'tRNAscan-SE'
merged['assembly_software'] = 'SPAdes'

# Add the location and checksums of the genome
merged['archive_file'] = dict_md5['archive_md5'].at[0, 'file']
merged['archive_file_md5'] = dict_md5['archive_md5'].at[0, 'md5']
merged['genome_md5'] = dict_md5['genome_md5'].at[0, 'md5']

# Add information about the raw sequences: location, checksums, metrics
merged['forward_file_md5'] = dict_md5['fastq_md5'].at[0, 'md5']
merged['reverse_file_md5'] = dict_md5['fastq_md5'].at[1, 'md5']


# Rename columns
merged.rename(columns={'SSU_bp':'16S_SSU_rRNA_length', 'LSU_bp':'23S_LSU_rRNA_length', 'tRNAs':'trnas'}, inplace=True)


# Compute criteria of the MIMAG and SeqCode
hq_criteria = {
    'is_compl_grtr_90': merged.at[snakemake.wildcards.isolate, 'compl_score'] > 90,
    'is_contam_less_5': merged.at[snakemake.wildcards.isolate, 'contam_score'] < 5,
    'is_coverage_grtr_10': merged.at[snakemake.wildcards.isolate, 'coverage'] > 10,
    'are_contigs_less_100': merged.at[snakemake.wildcards.isolate, 'number_contig'] < 100,
    'is_N50_grtr_25kb': merged.at[snakemake.wildcards.isolate, 'N50'] > 25000,
    'is_max_contig_grtr_100kb': merged.at[snakemake.wildcards.isolate, 'max_contig_length'] > 100000,
    'is_trnas_grtr_18': merged.at[snakemake.wildcards.isolate, 'trnas'] > 18,
    'is_SSU_grtr_0': merged.at[snakemake.wildcards.isolate, '16S_SSU_rRNA_length'] != "0",
    'is_LSU_grtr_0': merged.at[snakemake.wildcards.isolate, '23S_LSU_rRNA_length'] != "0",
    'is_5S_grtr_0': merged.at[snakemake.wildcards.isolate, '5S_rRNA_length'] != "0",
}

# Assess which criteria of the MIMAG and SeqCode did not pass
to_check = [x for x in hq_criteria.keys() if not hq_criteria[x]]
criteria_false = ', '.join(to_check)
# Indicate the assembly quality based on compliance with the MIMAG and SeqCode criteria
merged['assembly_qual'] = 'High-quality draft' if all(hq_criteria.values()) else 'Manual review: ' + criteria_false

# Add the flags of the criteria to the main table
merged = pd.merge(merged, pd.DataFrame(hq_criteria, index = [snakemake.wildcards.isolate]),
                  left_index=True, right_index=True)

# Add the plasmids lengths if a non-empty file was provided
if os.path.getsize(snakemake.input['plasmids']) > 0:
    # Read the report
    plasmids = pd.read_table(snakemake.input['plasmids'], names = ['file', 'length'])
    # Extract the lengths and sort in descending order
    plasmids = plasmids.loc[:,'length'].sort_values(ascending = False).tolist()
    # Format the lengths as string and concatenate. No concatenation if only one element
    merged['plasmid_length'] = ';'.join([ str(x) for x in plasmids ])
else: # if the report is empty
    merged['plasmid_length'] = '0'


# Add the adapters file used
merged['adapters_file'] = snakemake.params.adapters

# Reorder columns
genome_csv = merged.reindex(columns=['genome_md5',
                'assembly_qual','genome_length',
                'number_contig', 'N50',
                'number_contig_below_1kb', 'max_contig_length',
                'coverage', 'assembly_software',
                'compl_score', 'compl_software',
                'contam_score', 'contam_software',
                '16S_SSU_rRNA_length', 'SSU_recover_software',
                '23S_LSU_rRNA_length', 'LSU_recover_software',
                'trnas', 'trna_ext_software', '5S_rRNA_length',
                'archive_file', 'archive_file_md5',
                'forward_file', 'forward_file_md5',
                'reverse_file', 'reverse_file_md5',
                'sequence_count', 'basepairs_count', 'average_length',
                'sequence_count_qual', 'basepairs_count_qual', 'adapters_file', 'plasmid_length',
                'is_compl_grtr_90', 'is_contam_less_5',
                'is_coverage_grtr_10', 'are_contigs_less_100',
                'is_N50_grtr_25kb','is_max_contig_grtr_100kb',
                'is_trnas_grtr_18',
                'is_SSU_grtr_0', 'is_LSU_grtr_0', 'is_5S_grtr_0'])

# Add the date when the assembly was done
genome_csv['assembly_date'] = date.today()
# Get the version of the workflow (and the last commit if no tags are present) and add to the table
try:
    get_tag = subprocess.check_output("git describe --always", shell=True, text=True)
    genome_csv['workflow_version'] = get_tag.strip()
except subprocess.CalledProcessError:
    genome_csv['workflow_version'] = "NA"
    print("Either git is not installed or the workflow is not run in the git repository")

# Convert the index to a column 'isolate' for clarity
# src: https://stackoverflow.com/a/63926255
genome_csv = genome_csv.rename_axis('isolate').reset_index()

# Write the table to file
genome_csv.T.to_csv(snakemake.output[0], header=False)
# Snakemake wrapper: v1.21.0/bio/bbtools/bbduk
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "in={}".format(snakemake.input.sample)
    trimmed = "out={}".format(snakemake.output.trimmed)
else:
    reads = "in={} in2={}".format(*snakemake.input.sample)
    trimmed = "out={} out2={}".format(*snakemake.output.trimmed)


singleton = snakemake.output.get("singleton", "")
if singleton:
    singleton = f"outs={singleton}"


discarded = snakemake.output.get("discarded", "")
if discarded:
    discarded = f"outm={discarded}"


stats = snakemake.output.get("stats", "")
if stats:
    stats = f"stats={stats}"


shell(
    "bbduk.sh {java_opts} t={snakemake.threads} "
    "{reads} "
    "{adapters} "
    "{extra} "
    "{trimmed} {singleton} {discarded} "
    "{stats} "
    "{log}"
)
# Snakemake wrapper: v1.21.0/bio/bwa/index
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"

from os.path import splitext

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be provided as input!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", splitext(snakemake.output.idx[0])[0])

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
# Snakemake wrapper: v1.21.0/bio/bwa/mem
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "koester@jimmy.harvard.edu, julianderuiter@gmail.com"
__license__ = "MIT"


import tempfile
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)


index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]


# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")


if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))


# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"

elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"

else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
# Snakemake wrapper: v1.21.0/bio/samtools/index
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
# Snakemake wrapper: v1.21.0/bio/samtools/sort
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import tempfile
from pathlib import Path
from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts


samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


with tempfile.TemporaryDirectory() as tmpdir:
    tmp_prefix = Path(tmpdir) / "samtools_fastq.sort_"

    shell(
        "samtools sort {samtools_opts} {extra} -T {tmp_prefix} {snakemake.input[0]} {log}"
    )
# Snakemake wrapper: v1.21.0/bio/samtools/view
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(snakemake)
extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)


shell("samtools view {samtools_opts} {extra} {snakemake.input[0]} {region} {log}")
URL: https://github.com/ClavelLab/genome-assembly
Version: v7.0
License: MIT License