Snakemake workflow to consolidate basic genome assembly benchmarking


This is a small Snakemake workflow that lets you quickly gather basic genome assembly statistics so that you can compare assembly methods. Summary statistics include N50s, scaffold/contig counts, BUSCO completeness, and K* statistics.

This is intended for a single genome as it relies on a single BUSCO lineage to function.

Most dependencies are handled via conda in this workflow.

If you need help getting Snakemake itself to run, I'd suggest the Snakemake docs!

Usage

This was designed on a SLURM cluster with some unique partitioning rules, so you might need to tweak the resources directives in some of the Snakefile rules!
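For reference, a typical invocation might look something like the following; the profile name here is just a placeholder for whatever SLURM profile you use, and on Snakemake 8+ conda integration may be enabled via --software-deployment-method conda instead of --use-conda:

    snakemake --use-conda --cores 16 --profile slurm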

To run this workflow, edit config/config.yaml to point to your hifi_reads file (or Illumina reads; I haven't seen much difference!). These reads are used to create the genomescope2 plot and to compute the merfin histograms and completeness scores. Also supply an assemblies.csv file under config/; this should contain two columns: assembly, which will be used as the name of that assembly (no special characters!), and path, which is just the path to that assembly. Finally, supply a busco_lineage; this is passed to a rule which downloads that lineage and uses it for all later analysis. Multiple lineages are not supported. An example layout is sketched below.
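A minimal example with placeholder paths (check the repository's config/config.yaml for the exact key names):

    # config/config.yaml
    hifi_reads: /path/to/hifi_reads.fastq.gz
    busco_lineage: eudicots_odb10

and config/assemblies.csv:

    assembly,path
    hifiasm,/path/to/hifiasm_assembly.fa
    canu,/path/to/canu_assembly.fa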

Output

There are three primary outputs of this pipeline:

Genomescope2 profile

(figure: genomescope2 profile plot)

This is just the plot from genomescope2. Just realised as I'm typing this that I never made the connection between the genomescope2 kcov value and its later use in merfin... For now you'll have to run the workflow twice: once to get the plot and the kcov value, and then again after editing the -peak value in the merfin rules to match. Might fix this in future! A possible fix is sketched below.
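A minimal sketch of how the two could be wired together, assuming a hypothetical merfin_peak entry in config/config.yaml that you set to the kcov value reported by genomescope2, would be to swap the hard-coded -peak 48.8 in the merfin rules for a params value:

    params:
        peak = config["merfin_peak"]  # hypothetical config key holding the genomescope2 kcov value
    shell:
        "merfin -completeness -sequence {input.assembly} -readmers {input.readmers} "
        "-prob {input.lookup} -peak {params.peak} 2> {output}"

This isn't part of the current workflow; for now, edit the two -peak values by hand.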

Merfin histograms

(figure: merfin K* histograms)

Looking good, hifiasm!

These are KDE density plots of the K* histograms from merfin. I've restricted the output plot's x-axis to (-100, 100) for better visualisation; you might want to change this.

Summary table

This pipeline also produces a summary table (.csv):

| Completeness | # Scaffolds | Scaffold N50 | # Contigs | Contig N50 | Total Length | BUSCO Complete | assembly    |
|--------------|-------------|--------------|-----------|------------|--------------|----------------|-------------|
| 0.93825      | 1667        | 46.3 MB      | 1667      | 46.3 MB    | 753.3 MB     | 98.5           | hifiasm     |
| 0.93548      | 1666        | 46.3 MB      | 1666      | 46.3 MB    | 751.2 MB     | 98.5           | hifiasm_hic |
| 0.88635      | 991         | 22.1 MB      | 991       | 22.1 MB    | 713.2 MB     | 98.5           | LJA         |
| 0.9037       | 4043        | 18.8 MB      | 4043      | 18.8 MB    | 803.0 MB     | 98.5           | HiCanu      |
| 0.71632      | 1490        | 2.8 MB       | 1490      | 2.8 MB     | 728.3 MB     | 94.2           | Canu        |
| 0.82445      | 1591        | 2.4 MB       | 1628      | 2.2 MB     | 679.9 MB     | 98.5           | MaSuRCA     |

It'll look something like this.

Code Snippets

Combine the per-assembly summary tables into a single .csv:
run:
    # gather the per-assembly summary CSVs and concatenate them into one table
    dfs = []

    for csv in input:
        df = pd.read_csv(csv)
        dfs.append(df)

    dfs_concat = pd.concat(dfs, ignore_index = True)
    dfs_concat.to_csv(output.csv, index = False)
Build the per-assembly summary table from the merfin completeness output and the BUSCO report:
run:
    summary_statistics = {}

    # merfin reports an overall COMPLETENESS line; pull out the score
    with open(input.completeness) as file:
        for line in file:
            if line.startswith("COMPLETENESS"):
                summary_statistics["Completeness"] = [line.split()[1]]

    # BUSCO's JSON report also carries the basic assembly statistics
    with open(input.busco[0]) as file:
        data = json.load(file)
        summary_statistics["# Scaffolds"] = [data["results"]["Number of scaffolds"]]
        summary_statistics["Scaffold N50"] = [human_readable(int(data["results"]["Scaffold N50"]))]
        summary_statistics["# Contigs"] = [data["results"]["Number of contigs"]]
        summary_statistics["Contig N50"] = [human_readable(int(data["results"]["Contigs N50"]))]
        summary_statistics["Total Length"] = [human_readable(int(data["results"]["Total length"]))]
        summary_statistics["BUSCO Complete"] = [data["results"]["Complete"]]

    summary_statistics["assembly"] = [wildcards.assembly]

    summary_df = pd.DataFrame.from_dict(summary_statistics)
    summary_df.to_csv(output.csv, index = False)
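The human_readable helper used above is defined elsewhere in the Snakefile; a minimal sketch of such a formatter, assuming it turns base-pair counts into the MB strings shown in the summary table, might look like:

    def human_readable(n):
        # e.g. 46300000 -> "46.3 MB"
        for unit, scale in (("GB", 10**9), ("MB", 10**6), ("KB", 10**3)):
            if n >= scale:
                return f"{n / scale:.1f} {unit}"
        return f"{n} bp"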
Download the BUSCO lineage:
shell:
    "busco --download {params.lineage} 2> {log}"
Run BUSCO on each assembly:
shell:
    "busco -f -c 8 -i {input.path} -l {params.lineage} -o {params.path} -m genome 2> {log}"
Compute merfin completeness for each assembly:
shell:
    "merfin -completeness -sequence {input.assembly} -readmers {input.readmers} -prob {input.lookup} -peak 48.8 2> {output}"
Generate the merfin K* histogram for each assembly:
shell:
    """
    merfin -hist -sequence {input.assembly} -readmers {input.readmers} -prob {input.lookup} -peak 48.8 -output {output} 2> {log}
    # tag each histogram row with the assembly name so the combined plot can tell assemblies apart
    sed -i 's/$/\t{wildcards.assembly}/' {output}
    """
Plot the combined K* histograms:
run:
    # combine the per-assembly K* histograms (columns: x, weight, assembly name)
    frames = []
    for file in input:
        frames.append(pd.read_table(file, header = None))

    df = pd.concat(frames)
    df = df.rename(columns = {0: 'x', 1: 'weight', 2: 'method'})
    df = df.reset_index()

    # weighted KDE of K* values, one curve per assembly
    sns.kdeplot(data = df, x = "x", weights = "weight", hue = "method", gridsize=10000)
    plt.xlim(-100, 100)
    plt.xlabel("K*")
    plt.savefig(output.plot)
Build a meryl k-mer database (k=21) from the reads, used by merfin:
shell:
    "meryl count k=21 {input.reads} output {output} 2> {log}"
Count read k-mers with KMC:
shell:
    "kmc -k21 -t10 -m64 -ci1 -cs10000 {input.reads} results/reads $TMPDIR 2> {log}"
Export the KMC k-mer histogram:
shell:
    "kmc_tools transform results/reads histogram {output} -cx10000 2> {log}"
Run genomescope2 on the k-mer histogram:
shell:
    "genomescope2 --fitted_hist -i {input} -o results/genomescope -k 21 2> {log}"


Repository: https://github.com/SwiftSeal/assembly_olympics