Snakemake workflow to consolidate basic genome assembly benchmarking

This is a small Snakemake workflow that quickly gathers basic genome assembly statistics so that you can compare assembly methods. Summary statistics include N50s, scaffold and contig counts, BUSCO completeness, and K* statistics.

This is intended for a single genome as it relies on a single BUSCO lineage to function.

Most dependencies are handled via conda in this workflow.

If you need support getting Snakemake to run, I'd suggest the docs!

Usage

This was designed on a SLURM cluster with some unique partitioning rules. You might need to tweak the resources options in some of the Snakefile rules!

To run this workflow, edit config/config.yaml to point to your hifi_reads file (Illumina reads also work; I haven't seen much difference!). These reads are used to create the genomescope2 plot, the merfin histogram and the completeness scores. Also supply an assemblies.csv file under config with two columns: assembly, which is used as the name of that assembly (no special characters!), and path, which is the path to that assembly. Finally, set busco_lineage; it is passed to a rule that downloads that lineage and uses it for all later analysis. Multiple lineages are not supported.
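As a rough illustration of the assemblies.csv layout described above, here is a minimal sketch that reads such a file and checks the two expected columns. The function name `read_assemblies`, the allowed-character rule (letters, digits and underscores) and the example paths are assumptions made for illustration; they are not part of the workflow itself.

```python
import csv
import io
import re

# Assumption: "no special chars" is read here as letters, digits and underscores only.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def read_assemblies(handle):
    """Return {assembly_name: path} from an open assemblies.csv handle."""
    reader = csv.DictReader(handle)
    missing = {"assembly", "path"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"assemblies.csv is missing column(s): {sorted(missing)}")

    assemblies = {}
    for row in reader:
        name = row["assembly"].strip()
        if not SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsafe assembly name: {name!r}")
        if name in assemblies:
            raise ValueError(f"duplicate assembly name: {name!r}")
        assemblies[name] = row["path"].strip()
    return assemblies


# Hypothetical file content; the paths are placeholders.
example = io.StringIO(
    "assembly,path\n"
    "hifiasm,assemblies/hifiasm.fa\n"
    "hifiasm_hic,assemblies/hifiasm_hic.fa\n"
)
print(read_assemblies(example))
```

Running a check like this before launching the workflow catches a misnamed column or an awkward assembly name early, before any cluster jobs are submitted.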

Output

There are three primary outputs of this pipeline:

Genomescope2 profile

[Figure: genomescope2 plot]

This is just the plot from genomescope2. I realised while writing this that I never connected the genomescope2 kcov value to its later use in merfin. You'll have to run the workflow twice: once to get the plot and the kcov value, and then again after editing the -peak value in the merfin rules to match. Might fix this in future!
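One possible way to avoid hand-editing the hard-coded -peak value is to build the merfin command from a single parameter. The sketch below is only an illustration of that idea: the helper name `merfin_command` and the file names in the example are hypothetical, and the flags are limited to those that already appear in the Snakefile rules.

```python
import shlex


def merfin_command(mode, assembly, readmers, lookup, peak, output=None):
    """Build a merfin command line using the same flags as the Snakefile rules.

    `peak` is the kcov value read off the genomescope2 result.
    """
    if mode not in ("completeness", "hist"):
        raise ValueError(f"unsupported mode: {mode}")
    args = [
        "merfin", f"-{mode}",
        "-sequence", assembly,
        "-readmers", readmers,
        "-prob", lookup,
        "-peak", str(peak),
    ]
    if output is not None:
        args += ["-output", output]
    # Quote each argument so unusual paths survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


# Placeholder file names; 48.8 is the value currently hard-coded in the rules.
print(merfin_command("completeness", "asm.fa", "reads.meryl", "lookup_table.txt", 48.8))
```

In a Snakefile the same value could be stored once in the config (under a key of your choosing) and passed to both merfin rules, so the second run only needs a config change.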

Merfin histograms

[Figure: merfin histograms]

Looking good hifiasm!

These are the KDE density plots of the K* values from merfin. I've restricted the x axis of the output plot to (-100, 100) for better visibility; you might want to change this.

Summary table

This pipeline also produces a summary table as a .csv file:

Completeness	# Scaffolds	Scaffold N50	# Contigs	Contig N50	Total Length	BUSCO Complete	assembly
0.93825	1667	46.3 MB	1667	46.3 MB	753.3 MB	98.5	hifiasm
0.93548	1666	46.3 MB	1666	46.3 MB	751.2 MB	98.5	hifiasm_hic
0.88635	991	22.1 MB	991	22.1 MB	713.2 MB	98.5	LJA
0.9037	4043	18.8 MB	4043	18.8 MB	803.0 MB	98.5	HiCanu
0.71632	1490	2.8 MB	1490	2.8 MB	728.3 MB	94.2	Canu
0.82445	1591	2.4 MB	1628	2.2 MB	679.9 MB	98.5	MaSuRCA

It'll look something like this.
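For reference, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The workflow takes its N50 values from the BUSCO JSON output rather than computing them itself, so the sketch below only illustrates the definition; the function name `n50` and the toy lengths are made up for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Toy lengths, not real assembly data.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 at a similar total length generally points to a more contiguous assembly, which is why it sits next to the scaffold and contig counts in the table.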

Code Snippets

run:
    dfs = []

    for csv in input:
        df = pd.read_csv(csv)
        dfs.append(df)

    dfs_concat = pd.concat(dfs, ignore_index = True)
    dfs_concat.to_csv(output.csv, index = False)

SnakeMake From line 32 of workflow/Snakefile

run:
    summary_statistics = {}
    with open(input.completeness) as file:
        for line in file:
            if line.startswith("COMPLETENESS"):
                summary_statistics["Completeness"] = [line.split()[1]]

    with open(input.busco[0]) as file:
        data = json.load(file)
        summary_statistics["# Scaffolds"] = [data["results"]["Number of scaffolds"]]
        summary_statistics["Scaffold N50"] = [human_readable(int(data["results"]["Scaffold N50"]))]
        summary_statistics["# Contigs"] = [data["results"]["Number of contigs"]]
        summary_statistics["Contig N50"] = [human_readable(int(data["results"]["Contigs N50"]))]
        summary_statistics["Total Length"] = [human_readable(int(data["results"]["Total length"]))]
        summary_statistics["BUSCO Complete"] = [data["results"]["Complete"]]

    summary_statistics["assembly"] = [wildcards.assembly]

    summary_df = pd.DataFrame.from_dict(summary_statistics)
    summary_df.to_csv(output.csv, index = False)

SnakeMake BUSCO From line 48 of workflow/Snakefile
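The `human_readable` helper called in the snippet above is not shown on this page. A minimal sketch of what such a helper might look like follows, assuming decimal (base 1000) units and one decimal place, which is consistent with values such as "46.3 MB" in the summary table. The actual implementation in the repository may differ.

```python
def human_readable(n_bases):
    """Format a base count with a decimal unit suffix, e.g. 46300000 -> '46.3 MB'.

    Assumption: base-1000 units and one decimal place, inferred from the summary table.
    """
    value = float(n_bases)
    for unit in ("B", "KB", "MB", "GB"):
        # Stop at the first unit that keeps the value below 1000, or at the largest unit.
        if value < 1000 or unit == "GB":
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_readable(46_300_000))
```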

shell:
    "busco --download {params.lineage} 2> {log}"

SnakeMake From line 82 of workflow/Snakefile

shell:
    "busco -f -c 8 -i {input.path} -l {params.lineage} -o {params.path} -m genome 2> {log}"

SnakeMake From line 102 of workflow/Snakefile

shell:
    "merfin -completeness -sequence {input.assembly} -readmers {input.readmers} -prob {input.lookup} -peak 48.8 2> {output}"

SnakeMake Merfin From line 118 of workflow/Snakefile

shell:
    """
    merfin -hist -sequence {input.assembly} -readmers {input.readmers} -prob {input.lookup} -peak 48.8 -output {output} 2> {log}
    sed -i 's/$/\t{wildcards.assembly}/' {output}
    """

SnakeMake Merfin From line 136 of workflow/Snakefile

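run:
    # Read each per-assembly merfin histogram (tab-separated, no header row).
    frames = []
    for file in input:
        frames.append(pd.read_table(file, header = None))

    # Columns: K* value, weight, and the assembly name appended by sed in the histogram rule.
    df = pd.concat(frames)
    df = df.rename(columns = {0: 'x', 1: 'weight', 2: 'method'})
    # Concatenation leaves duplicate index labels; reset them so seaborn gets a unique index.
    df = df.reset_index(drop = True)
    sns.kdeplot(data = df, x = "x", weights = "weight", hue = "method", gridsize=10000)
    # Restrict the x axis for readability; widen this if your K* values spread further.
    plt.xlim(-100, 100)
    plt.xlabel("K*")
    plt.savefig(output.plot)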

SnakeMake From line 147 of workflow/Snakefile

shell:
    "meryl count k=21 {input.reads} output {output} 2> {log}"

SnakeMake Meryl From line 173 of workflow/Snakefile

shell:
    "kmc -k21 -t10 -m64 -ci1 -cs10000 {input.reads} results/reads $TMPDIR 2> {log}"

SnakeMake KMC From line 190 of workflow/Snakefile

shell:
    "kmc_tools transform results/reads histogram {output} -cx10000 2> {log}"

SnakeMake KMC From line 207 of workflow/Snakefile

shell:
    "genomescope2 --fitted_hist -i {input} -o results/genomescope -k 21 2> {log}"

SnakeMake GenomeScope 2.0 From line 230 of workflow/Snakefile

Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: https://github.com/SwiftSeal/assembly_olympics

Name: assembly_olympics

Version: 1

License: None

Keywords:

BUSCO GenomeScope 2.0 KMC Merfin Meryl Snakemake
