Snakemake workflow to consolidate basic genome assembly benchmarking
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This is a small snakemake workflow that allows you to quickly gather basic genome assembly statistics so that you can compare methods. Summary statistics include N50s, scaffold/contig #s, BUSCO completeness, as well as K* statistics .
This is intended for a single genome as it relies on a single BUSCO lineage to function.
Most dependencies are handled via conda in this workflow.
If you need support getting snakemake to run, I'd suggest the docs !
Usage
This was designed on a SLURM cluster with some unique partitioning rules.
You might need to tweak the
resources
options in some of the
Snakefile
rules!
To run this workflow, edit
config/config.yaml
to point to your
hifi_reads
file (or illumina, I haven't seen much difference!).
These reads are used to create the genomescope2 plot and are used to create the merfin histogram and completeness scores.
Also supply an
assemblies.csv
file under config - this should contain two columns,
assembly
which will be used as the name of that assembly (no special chars!) and
path
which is just the path to that assembly.
Also supply a
busco_lineage
, this is passed onto a rule which downloads that lineage and uses it for all later analysis.
Multiple lineages are not supported.
Output
There are three primary outputs of this pipeline:
Genomescope2 profile
This is just the plot from genomescope2.
Just realised as I'm typing this that I never made a connection between the genomescope2 kcov value, and it's later use in merfin...
You'll have to run this twice, once to get the plot and the kcov value, and then again but edit the
-peak
value in the merfin rule to this.
Might fix this in future!
Merfin histograms
Looking good hifiasm!
These are the KDE density histograms from merfin. I've restricted the output plot x axis to (-100,100) for better vis - you might want to change this.
Summary table
This pipeline also produces a summary table
.csv
:
Completeness | # Scaffolds | Scaffold N50 | # Contigs | Contig N50 | Total Length | BUSCO Complete | assembly |
---|---|---|---|---|---|---|---|
0.93825 | 1667 | 46.3 MB | 1667 | 46.3 MB | 753.3 MB | 98.5 | hifiasm |
0.93548 | 1666 | 46.3 MB | 1666 | 46.3 MB | 751.2 MB | 98.5 | hifiasm_hic |
0.88635 | 991 | 22.1 MB | 991 | 22.1 MB | 713.2 MB | 98.5 | LJA |
0.9037 | 4043 | 18.8 MB | 4043 | 18.8 MB | 803.0 MB | 98.5 | HiCanu |
0.71632 | 1490 | 2.8 MB | 1490 | 2.8 MB | 728.3 MB | 94.2 | Canu |
0.82445 | 1591 | 2.4 MB | 1628 | 2.2 MB | 679.9 MB | 98.5 | MaSuRCA |
It'll look something like this.
Code Snippets
32 33 34 35 36 37 38 39 40 | run: dfs = [] for csv in input: df = pd.read_csv(csv) dfs.append(df) dfs_concat = pd.concat(dfs, ignore_index = True) dfs_concat.to_csv(output.csv, index = False) |
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | run: summary_statistics = {} with open(input.completeness) as file: for line in file: if line.startswith("COMPLETENESS"): summary_statistics["Completeness"] = [line.split()[1]] with open(input.busco[0]) as file: data = json.load(file) summary_statistics["# Scaffolds"] = [data["results"]["Number of scaffolds"]] summary_statistics["Scaffold N50"] = [human_readable(int(data["results"]["Scaffold N50"]))] summary_statistics["# Contigs"] = [data["results"]["Number of contigs"]] summary_statistics["Contig N50"] = [human_readable(int(data["results"]["Contigs N50"]))] summary_statistics["Total Length"] = [human_readable(int(data["results"]["Total length"]))] summary_statistics["BUSCO Complete"] = [data["results"]["Complete"]] summary_statistics["assembly"] = [wildcards.assembly] summary_df = pd.DataFrame.from_dict(summary_statistics) summary_df.to_csv(output.csv, index = False) |
82 83 | shell: "busco --download {params.lineage} 2> {log}" |
102 103 | shell: "busco -f -c 8 -i {input.path} -l {params.lineage} -o {params.path} -m genome 2> {log}" |
118 119 | shell: "merfin -completeness -sequence {input.assembly} -readmers {input.readmers} -prob {input.lookup} -peak 48.8 2> {output}" |
136 137 138 139 140 | shell: """ merfin -hist -sequence {input.assembly} -readmers {input.readmers} -prob {input.lookup} -peak 48.8 -output {output} 2> {log} sed -i 's/$/\t{wildcards.assembly}/' {output} """ |
147 148 149 150 151 152 153 154 155 156 157 158 | run: frames = [] for file in input: frames.append(pd.read_table(file, header = None)) df = pd.concat(frames) df = df.rename(columns = {0: 'x', 1: 'weight', 2: 'method'}) df = df.reset_index() sns.kdeplot(data = df, x = "x", weights = "weight", hue = "method", gridsize=10000) plt.xlim(-100, 100) plt.xlabel("K*") plt.savefig(output.plot) |
173 174 | shell: "meryl count k=21 {input.reads} output {output} 2> {log}" |
190 191 | shell: "kmc -k21 -t10 -m64 -ci1 -cs10000 {input.reads} results/reads $TMPDIR 2> {log}" |
207 208 | shell: "kmc_tools transform results/reads histogram {output} -cx10000 2> {log}" |
230 231 | shell: "genomescope2 --fitted_hist -i {input} -o results/genomescope -k 21 2> {log}" |
Support
- Future updates