MAGqual (v0.1.1-beta) is a command-line tool that evaluates the quality of metagenomic bins and generates the recommended metadata in line with the MIMAG standards (as outlined in Bowers et al. 2017).
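The completeness/contamination cut-offs behind those MIMAG tiers can be sketched as follows. This is a minimal illustration of the thresholds the pipeline applies; the real tool additionally checks rRNA and tRNA content (via Bakta annotation) before calling a bin High Quality:

```python
def mimag_tier(completeness: float, contamination: float) -> str:
    """Assign a MIMAG quality tier from CheckM-style estimates (in percent).

    Note: MIMAG High Quality additionally requires complete 5S/16S/23S rRNAs
    and at least 18 tRNAs; MAGqual checks those using Bakta annotations.
    """
    if completeness > 90 and contamination < 5:
        return "High Quality"
    if completeness >= 50 and contamination < 10:
        return "Medium Quality"
    if contamination < 10:
        return "Low Quality"
    return "Failed"
```

For example, a bin at 95% completeness and 2% contamination lands in the High Quality tier, while one at 70%/8% is Medium Quality.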

MAGqual Set-Up

Here is a step-by-step guide on how to install MAGqual from the GitHub repository at https://github.com/ac1513/MAGqual.

Step 1: Clone the MAGqual repository

Open a command-line interface (CLI) or terminal on your computer or computer cluster. Change to the directory where you want to install MAGqual. Then, run the following command to clone the MAGqual repository from GitHub:

git clone https://github.com/ac1513/MAGqual.git

This will create a copy of the MAGqual repository on your computer.

Step 2: Install dependencies

In order to run MAGqual, the following dependencies are required:

  • Conda (package manager)

  • Snakemake v6.17.1 or higher (workflow manager)

Miniconda can be installed by following the instructions for your system in the Miniconda documentation.

Once Miniconda is installed, Snakemake can be installed by following the instructions in the Snakemake documentation.

MAGqual will handle the other dependencies through Conda environments created while the pipeline is running.

Additional Notes:

  • MAGqual is compatible with Python 3.10.1 or higher (as of April 2023).

  • Pull the latest changes from the MAGqual repository regularly to get updates and bug fixes.

  • Refer to the MAGqual documentation in the GitHub repository for more information on how to use the tool and interpret the results.

Congratulations! You have successfully installed MAGqual on your system.

Running MAGqual

Quick start guide

Once you have created an environment with Snakemake installed, you can run MAGqual with the following command:

python MAGqual.py --asm assembly.fa --bins bins_dir/

  • --asm corresponds to the location of the assembly used to generate the metagenome bins

  • --bins is the location of the directory containing the metagenomic bins to be run through MAGqual (in fasta format)

Full Help documentation:

usage: MAGqual.py [-h] -a ASSEMBLY -b BINDIR [-p PREFIX] [-j JOBS] [--cluster CLUSTER] [--checkmdb CHECKMDB] [--baktadb BAKTADB]

Required: python MAGqual.py -a/--asm assembly.fa -b/--bins bins_dir/

options:
  -h, --help            show this help message and exit
  -a ASSEMBLY, --asm ASSEMBLY
                        location of the assembly used to generate the bins (Required)
  -b BINDIR, --bins BINDIR
                        path to the directory containing the bins to run through MAGqual (Required)
  -p PREFIX, --prefix PREFIX
                        prefix for the MAGqual run, default = MAGqual_YYYYMMDD
  -j JOBS, --jobs JOBS  the number of cores to be used or, if running on a HPC, the number of
                        jobs to be run concurrently, default = 1
  --cluster CLUSTER     OPTIONAL: the type of cluster to run MAGqual on a HPC system (available
                        options: slurm); don't use if running MAGqual locally
  --checkmdb CHECKMDB   OPTIONAL: location of a pre-installed database for CheckM
  --baktadb BAKTADB     OPTIONAL: location of a pre-installed database for Bakta; note this must
                        be v5.0 or above

Additional Options

  • -p / --prefix : Specify a prefix for the output files (Default = MAGqual_YYYYMMDD)

  • -j / --jobs : If running locally, this is the number of cores to be used; if using the --cluster option on a HPC queue, it is the number of jobs to be run concurrently (Default = 1)
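For reference, the default prefix encodes the run date. A minimal sketch of how a MAGqual_YYYYMMDD-style default could be generated (illustrative only; `default_prefix` is a hypothetical helper, not MAGqual's actual code):

```python
from datetime import date

def default_prefix(today=None):
    """Build a default run prefix of the form MAGqual_YYYYMMDD (illustrative helper)."""
    today = today or date.today()
    return f"MAGqual_{today:%Y%m%d}"
```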

Optional

CheckM and Bakta databases

MAGqual will handle the installation of both CheckM and Bakta, however if you have previously used either of these tools it is possible to speed up the process and save file space by specifying the location of pre-downloaded databases.

CheckM database

If you already have the CheckM database downloaded you can specify its location with the parameter --checkmdb to skip the download; otherwise MAGqual will download the required CheckM database for you (the most recent release, dated 2015-01-16).

Bakta database

If no Bakta database is provided, MAGqual will automatically download the lightweight Bakta database (the full database is large and can take a long time to download). NOTE: MAGqual uses Bakta v1.7.0, which requires a Bakta database of v5.0 or above.
However, for more accurate MAG annotation we recommend downloading the full database (following the instructions in the Bakta documentation) and passing its location to MAGqual with the parameter --baktadb. This database can then be reused for each subsequent MAGqual run.

Running on a computing cluster

MAGqual can be integrated into a HPC queuing system using the following option:

  • --cluster : Current option: slurm. Runs MAGqual with options configured for a HPC cluster with a queuing architecture.
    Slurm is currently the only queueing system integrated into MAGqual; if this doesn't work for you, it is possible to run MAGqual on other queuing systems through the Snakemake framework - see Running on different queuing architecture below.

For those familiar with Snakemake

It is possible (and encouraged) to further tweak MAGqual parameters if you are familiar with Snakemake.
The config file: config/config.yaml can be edited directly for common parameters.
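For illustration, such a config file typically holds the run inputs. The keys below are hypothetical - check the repository's config/config.yaml for the actual names:

```yaml
# Hypothetical illustration only - the real key names live in config/config.yaml
assembly: "assembly.fa"      # assembly used to generate the bins
bindir: "bins_dir/"          # directory of bins in fasta format
prefix: "MAGqual_20230401"   # run prefix
```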
The pipeline can then be run from the MAGqual directory using the basic Snakemake command:

snakemake --use-conda -j 1

This command can then be extended with any of the command-line options Snakemake allows - see the Snakemake documentation for details.

Running on different queuing architecture

Snakemake provides an easy way to run this pipeline on a computing cluster. We have provided support for a HPC with a Slurm queuing system; however, this configuration is unlikely to work for everyone. Rule-specific cluster information is held in config/cluster.json - you can read more about the format of this file in the Snakemake documentation. This file can be edited, as can the command used to submit the pipeline.
NOTE: When using the --cluster slurm option with MAGqual.py, the following is added to the Snakemake command:

--cluster-config config/cluster.json --cluster "sbatch -t {cluster.time} -c {cluster.threads} --mem-per-cpu {cluster.mem}"
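A cluster.json consumed by that submission template might look like the following. The values and the "checkm" rule name are hypothetical; only the keys referenced in the sbatch template above (time, threads, mem) are assumed:

```json
{
    "__default__": {
        "time": "01:00:00",
        "threads": 1,
        "mem": "4G"
    },
    "checkm": {
        "time": "04:00:00",
        "threads": 8,
        "mem": "16G"
    }
}
```

The `__default__` entry applies to every rule unless a rule-specific entry (like the hypothetical "checkm" one) overrides it.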

Code Snippets

import pandas as pd
import glob
import argparse
import os
from shutil import copyfile
import seaborn as sns
import matplotlib.pyplot as plt

def qual_cluster(comp, cont):
    """Assign a MIMAG quality tier from completeness/contamination estimates (%)."""
    if (comp > 90) and (cont < 5):
        qual = "High Quality"
    elif (comp >= 50) and (cont < 10):
        qual = "Medium Quality"
    elif (comp < 50) and (cont < 10):
        qual = "Low Quality"
    else:
        qual = "Failed"
    return qual

def gen_qual(comp, cont):
    """Flag bins meeting the quality score threshold: completeness - 5*contamination >= 50."""
    if (comp - (cont * 5)) >= 50:
        genome = "y"
    else:
        genome = "n"
    return genome

def bsearch(bakta_loc, cluster):
    """Pull assembly statistics and tool versions from a Bakta summary file."""
    length = count = N50 = 0
    software = db = ""
    loc = str(bakta_loc) + str(cluster) + '/' + str(cluster) + '.txt'
    for name in glob.glob(loc):
        bakta_file = name
        with open(bakta_file, 'r') as bakta_log:
            for line in bakta_log:
                if "Length:" in line:
                    length = int(line.split(":")[1])
                if "Count:" in line:
                    count = int(line.split(":")[1])
                if "N50:" in line:
                    N50 = int(line.split(":")[1])
                if "Software:" in line:
                    software = line.split(":")[1].strip()
                if "Database" in line:
                    db = line.split(":")[1].strip()
    bakta_v = "Bakta " + software + " DB " + db
    row_dict = {"length": length, "contigs": count, "N50": N50, "bakta_v": bakta_v}
    return pd.Series(row_dict)

parser = argparse.ArgumentParser(description='')
parser.add_argument('output', help='output directory for the organised bins', type=str)
parser.add_argument('checkm_log', help='checkm output log file (tab separated)', type=str)
parser.add_argument('bakta_loc', help='directory containing all bakta output for all clusters', type=str)
parser.add_argument('seqkit_log', help='file containing the seqkit output for all clusters', type=str)
parser.add_argument('bin_loc', help='directory containing fasta files for all clusters', type=str)
parser.add_argument('jobid', help='prefix for current jobs', type=str)

args = parser.parse_args()
output = args.output
checkm_log = args.checkm_log
bakta_loc = args.bakta_loc
seqkit_log = args.seqkit_log
bin_loc = args.bin_loc
job_id = args.jobid

comp_software = "CheckM v.1.0.13"
comp_approach = "Marker gene"

colour_dict = dict({'High Quality':'#50C5B7',
                  'Near Complete':'#9CEC5B',
                  'Medium Quality': '#F0F465',
                  'Low Quality': "#F8333C",
                  'Failed': '#646E78'})

# =============================================================================
# CHECKM PARSE
# =============================================================================
checkm_df = pd.read_csv(checkm_log, sep = "\t")
checkm_df['Quality'] = checkm_df.apply(lambda x: qual_cluster(x['Completeness'], x['Contamination']), axis=1)
checkm_df[['Size_bp', 'No_contigs', 'N50_length', '16S_Software']] = checkm_df.apply(lambda x: bsearch(bakta_loc, x["Bin Id"]), axis = 1)

checkm_df = checkm_df.set_index("Bin Id")

high_clusters = checkm_df[checkm_df['Quality'].str.contains("High Quality")].index.values.tolist()
med_qual_clusters = checkm_df[checkm_df['Quality'].str.contains("Medium Quality")].index.values.tolist()
low_qual_clusters = checkm_df[checkm_df['Quality'].str.contains("Low Quality")].index.values.tolist()
NA = checkm_df[checkm_df['Quality'].str.contains("Failed")].index.values.tolist()
all_clusters = high_clusters + med_qual_clusters + low_qual_clusters + NA

checkm_df = checkm_df.drop(checkm_df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 9]], axis=1)

# =============================================================================
# SEQKIT PARSE
# =============================================================================
seqkit_df = pd.read_csv(seqkit_log, sep = "\t")
seqkit_df = seqkit_df[["file", "max_len"]]
seqkit_df["file"] = seqkit_df["file"].str.replace('.fasta','', regex=True)
seqkit_df["file"] = seqkit_df["file"].str.split("/").str[-1]
seqkit_df.set_index("file", inplace=True)
seqkit_df.rename(columns={"max_len":"Max_contig_length"}, inplace = True)
checkm_df = pd.merge(checkm_df, seqkit_df, left_index=True, right_index=True, how="left")

# =============================================================================
# BAKTA PARSE
# =============================================================================

high_qual_clusters= []
near_comp_clusters = []
r16s_comp_clusters = []
trna_num = {}
for cluster in all_clusters:
    loc = str(bakta_loc) + cluster + '/' + cluster + '.tsv'
    for name in glob.glob(loc):
        bakta_file = name
        with open(bakta_file, 'r') as bakta_in:
            trna_set = set()
            rna_set = set()
            rrna_16S = "N"
            amino_acids = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu",
                           "Gly", "His", "Ile", "Leu", "Lys", "Met", "Phe",
                           "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
            for line in bakta_in:
                # Record which of the 20 standard tRNAs have been annotated
                for aa in amino_acids:
                    if "tRNA-" + aa in line:
                        trna_set.add("tRNA-" + aa)
                # Only count complete (non-partial) rRNA genes
                if ("5S ribosomal RNA" in line) and ("partial" not in line):
                    rna_set.add('5S')
                if ("16S ribosomal RNA" in line) and ("partial" not in line):
                    rna_set.add('16S')
                    rrna_16S = "Y"
                if ("23S ribosomal RNA" in line) and ("partial" not in line):
                    rna_set.add('23S')
        if rrna_16S == "Y":
            r16s_comp_clusters.append(cluster)
        if cluster in high_clusters:
            if (len(trna_set) >= 18) and (len(rna_set) == 3):
                high_qual_clusters.append(cluster)
            else:
                near_comp_clusters.append(cluster) # adds high qual that fail trna/rna
        trna_num.update({cluster : len(trna_set)})


# =============================================================================
# Add these new qualities to the checkm dataframe        
# =============================================================================

checkm_df.loc[checkm_df.index.isin(near_comp_clusters), "Quality"] = "Near Complete"

checkm_df.loc[checkm_df.index.isin(r16s_comp_clusters), "16S_Recovered"] = "Y"
checkm_df["16S_Recovered"] = checkm_df["16S_Recovered"].fillna("N")

# Sort out legend order
label_order = ["High Quality", "Near Complete", "Medium Quality", "Low Quality", "Failed"]
labels_all = checkm_df["Quality"].unique().tolist()

# Keep only the labels that actually appear, preserving the preferred order
# (removing items from label_order while iterating over it would skip entries)
labels_list = [item for item in label_order if item in labels_all]

# =============================================================================
# Basic plot
# =============================================================================

checkm_df["Purity"] = 100 - checkm_df["Contamination"]
plt.figure(figsize=(15, 10))
ax = sns.scatterplot(data = checkm_df, x="Completeness", y="Purity",
                hue='Quality', size = 'Size_bp', sizes=(20, 800), alpha = 0.6, 
                palette=colour_dict, hue_order= labels_list)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.xlabel('Completeness (%)', size = 24)
plt.ylabel('Purity (%)', size = 24)
plt.xlim(0)
plt.legend(prop={'size': 20})
plt.tight_layout()
plt.tick_params(labelsize=18)

plt.savefig(job_id + '_mag_qual.png')
# =============================================================================
# COPYING FILES INTO QUAL DIRECTORIES
# =============================================================================

location = bin_loc
new_loc = output + "/" + job_id + "/"

# Copy each bin's fasta into a directory named after its quality tier
qual_dirs = {"high_qual": high_qual_clusters,
             "near_comp": near_comp_clusters,
             "med_qual": med_qual_clusters,
             "low_qual": low_qual_clusters,
             "failed": NA}

for subdir, clusters in qual_dirs.items():
    os.makedirs(new_loc + subdir, exist_ok=True)
    for cluster in clusters:
        file = location + cluster + ".fasta"
        copyfile(file, new_loc + subdir + "/" + cluster + ".fasta")

# =============================================================================
# OUTPUT CREATED HERE
# =============================================================================

magqual_df = checkm_df[["Quality", "Completeness", "Contamination", "16S_Recovered", "16S_Software", "Size_bp", "No_contigs", "N50_length", "Max_contig_length"]].copy()
magqual_df["tRNA_Extracted"] = pd.Series(trna_num)
magqual_df["tRNA_Software"] = magqual_df["16S_Software"]
magqual_df["Completeness_Approach"] = comp_approach
magqual_df["Completeness_Software"] = comp_software

magqual_df = magqual_df.reindex(columns=["Quality", "Completeness", "Contamination", "Completeness_Software","Completeness_Approach", "16S_Recovered", "16S_Software", "tRNA_Extracted", "tRNA_Software", "Size_bp", "No_contigs", "N50_length", "Max_contig_length"])

magqual_df.to_csv("analysis/" + job_id + "_mag_qual_statistics.csv")

print("-" * 12)
print(" NUMBER MAGs")
print("-" * 12)
print("High Quality:", len(high_qual_clusters))
print("Near Complete:", len(near_comp_clusters))
print("Med Quality:", len(med_qual_clusters))
print("Low Quality:", len(low_qual_clusters))
print("Failed:", len(NA), "\n")
print("-" * 12)
print(" MAG IDs")
print("-" * 12)
print("High Quality: " , ", ".join([str(x) for x in high_qual_clusters]))
print("Near Complete: ", ", ".join([str(x) for x in near_comp_clusters]))
print("Med Quality: ", ", ".join([str(x) for x in med_qual_clusters]))
print("Low Quality: ", ", ".join([str(x) for x in low_qual_clusters]))
print("Failed: ", ", ".join([str(x) for x in NA]))
shell:
    """
    if [ ! -z {params.database} ] && [ -d {params.database} ]; then 
        cd databases/
        rm -r bakta/
        cd ..
        ln -sTf {params.database} databases/bakta
    else
        wget https://zenodo.org/record/7669534/files/db-light.tar.gz
        mkdir -p databases/bakta/
        tar -xzf db-light.tar.gz -C databases/bakta/ --strip-components 1
        rm db-light.tar.gz
        amrfinder_update --force_update --database databases/bakta/amrfinderplus-db/
    fi
    """
shell:
    """
    bakta --db={params.database} --output {params.dir} --prefix {params.prefix} --threads {resources.threads} {input.clusters}
    """
shell:
    """
    if [ ! -z {params.database} ] && [ -d {params.database} ]; then 
        cd databases
        rm -r checkm/
        cd ..
        ln -sTf {params.database} databases/checkm
    else
        wget https://zenodo.org/record/7401545/files/checkm_data_2015_01_16.tar.gz
        mkdir -p databases/checkm/
        tar -xzf checkm_data_2015_01_16.tar.gz -C databases/checkm/
        rm checkm_data_2015_01_16.tar.gz
    fi
    """
shell:
    """
    checkm_db={params.database}
    echo ${{checkm_db}} | checkm data setRoot ${{checkm_db}}
    checkm lineage_wf -f {output.log} --tab_table -x fasta -t {resources.threads} {params.input} {params.out}
    """
shell:
    """
    seqkit stats -aT {input.clusters} > {output.tsv}
    """
shell:
    """
    python workflow/scripts/python/qual_parse.py {params.out_loc} {input.checkm} {params.bakta} {input.seqkit} {params.bin_loc} {params.jobid} > {output.txt}
    """