BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers within a few hours


Introduction

MOF-Search is a pipeline for BLAST-like search across all pre-2019 bacteria from ENA (the 661k collection) on ordinary desktop and laptop computers.

The central idea enabling search at such a scale is phylogenetic compression - a technique that uses estimated evolutionary history to guide the compression and search of large collections of microbial genomes using existing algorithms and data structures. In short, the input data are reorganized according to the topology of the estimated phylogenies, which makes them highly locally compressible even with basic techniques. Existing software packages for compression, indexing, and search - in this case XZ, COBS, and Minimap2 - are then used as low-level tools. The resulting performance gains come from a wide range of benefits of phylogenetic compression, including easy parallelization, small memory requirements, small database size, better memory locality, and better branch prediction.
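As a minimal illustration of the idea (the file names here are hypothetical and not part of MOF-Search), concatenating genomes in the leaf order of an estimated phylogeny places similar sequences next to each other, so that even a generic compressor such as XZ can exploit the redundancy:

     # Hypothetical sketch of phylogenetic compression:
     # phylo_leaf_order.txt lists one genome FASTA path per line, in the
     # left-to-right leaf order of the estimated tree.
     cat $(cat phylo_leaf_order.txt) > batch.fa
     xz -9 -T0 batch.fa   # batch.fa.xz is typically much smaller than an
                          # archive of the same genomes in arbitrary order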

For more information about phylogenetic compression and the implementation details of MOF-Search, see the corresponding paper (including its supplementary material) and the associated website of the whole MOF framework.

Citation

K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

Installation

Step 1: Install dependencies

MOF-Search is implemented as a Snakemake pipeline, using the Conda system to manage all non-standard dependencies. To function smoothly, we recommend having Conda together with GNU Make, Python (>=3.7), Snakemake (>=6.2.0), and Mamba (>=0.20.0); the benchmarking scripts additionally require GNU Time (available as gtime on macOS).

The last three packages can be installed using Conda by

 conda install -y "python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"

Step 2: Clone the repository

 git clone https://github.com/karel-brinda/mof-search
 cd mof-search

Step 3: Run a simple test

Run make test to ensure the pipeline works for the sample queries and just 3 batches. This will also install additional dependencies using Conda or Mamba, such as COBS, SeqTK, and Minimap2.

Notes:

  • make test should return 0 (success), and the following message should appear at the end of the execution, confirming that the test produced the expected output:

     Files output/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz and data/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz are identical
    
  • If the test did not produce the expected output and you obtained an error message such as

     Files output/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz and data/backbone19Kbp.fa differ
     make: *** [Makefile:21: test] Error 1

    you should investigate the cause before proceeding.

Step 4: Download the database

Run make download to download all the remaining phylogenetically compressed assemblies and COBS k-mer indexes for the 661k-HQ collection from Zenodo. The downloaded files will then be located in the asms/ and cobs/ directories (see the listing after the notes).

Notes:

  • The compressed assemblies comprise all the genomes from the 661k collection. The COBS indexes comprise only those genomes that passed quality control.

  • For file accessions, see the MOF website.
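For instance, after a successful download, each of the two directories should contain one xz-compressed file per batch (the file names below are only illustrative):

     ls asms/   # e.g., batch_xyz.tar.xz - phylogenetically compressed assemblies
     ls cobs/   # e.g., batch_xyz.xz     - compressed COBS k-mer indexes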

Usage

Step 1: Copy or symlink your queries

Remove the default test files or your old files from the queries/ directory, and copy or symlink (recommended) your query files there; see the example after the notes. The supported input formats are FASTA and FASTQ, possibly gzipped. All query files will be preprocessed and merged together.

Notes:

  • All query names have to be unique among all query files.

  • All non-ACGT characters in your query sequences will be translated to A.
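For example (all paths are illustrative):

     # remove the default test queries
     rm -f queries/*
     # symlink your own FASTA/FASTQ files (optionally gzipped)
     ln -s /path/to/my_reads.fastq.gz queries/
     ln -s /path/to/my_assembly.fa queries/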

Step 2: Adjust configuration

Edit the config.yaml file for your desired search. All available options are documented directly there.
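After editing, you can verify the effective configuration, with all comments stripped, by running:

     make viewconf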

Step 3: Clean up intermediate files

Run make clean to clean the intermediate files from the previous runs. This includes COBS matching files, alignment files, and various reports.

Step 4: Run the pipeline

Simply run make, which will execute Snakemake with the corresponding parameters. If you want to run the pipeline step by step, run make match followed by make map.
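That is:

     make          # run the whole pipeline (the default rule)
     # or, step by step:
     make match    # match queries using COBS (queries -> candidates)
     make map      # map candidates to assemblies (candidates -> alignments)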

Step 5: Analyze your results

Check the output files in the output/ directory.

If the results don't correspond to what you expected and you need to re-adjust your search parameters, go back to Step 2. If only the mapping step is affected by the changes, you can proceed more rapidly by manually removing the files in intermediate/03_map and output/ and running make map directly.
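The output files are xz-compressed; for a quick inspection, you can decompress them on the fly (the file name below is the one produced by the test run):

     xzcat output/backbone19Kbp___ecoli_reads_1___ecoli_reads_2___gc01_1kl.sam_summary.xz | head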

Additional information

List of workflow commands

MOF-Search is executed via GNU Make, which handles all parameters and passes them to Snakemake.

Here's a list of all implemented commands (to be executed as make {command}):

######################
## General commands ##
######################
 all Run everything (the default rule)
 test Quick test using 3 batches
 help Print help messages
 clean Clean intermediate search files
 cleanall Clean all generated and downloaded files
####################
## Pipeline steps ##
####################
 conda Create the conda environments
 download Download the assemblies and COBS indexes
 match Match queries using COBS (queries -> candidates)
 map Map candidates to assemblies (candidates -> alignments)
###############
## Reporting ##
###############
 viewconf View configuration without comments
 report Generate Snakemake report
##########
## Misc ##
##########
 cluster_slurm Submit to a SLURM cluster
 cluster_lsf_test Submit the test pipeline to LSF cluster
 cluster_lsf Submit to LSF cluster
 format Reformat Python and Snakemake files

Directories

  • asms/, cobs/ Downloaded assemblies and COBS indexes

  • queries/ Queries, to be provided within one or more FASTA files (.fa)

  • intermediate/ Intermediate files

    • 00_cobs Decompressed COBS indexes (tmp)

    • 01_match COBS matches

    • 02_filter Filtered candidates

    • 03_map Minimap2 alignments

    • fixed_queries Sanitized queries

  • output/ Results

Running on a cluster

Running on a cluster is much faster, as the jobs produced by this pipeline are quite light and usually start running as soon as they are scheduled.

For LSF clusters:

  1. Test whether the pipeline works on an LSF cluster: make cluster_lsf_test;

  2. Configure your queries and run the full pipeline: make cluster_lsf.

Known limitations

  • When the number of queries is too high, the auxiliary Python scripts start to use too much memory, which may result in swapping. Try to keep the number of queries moderate and, ideally, their names short. If you have tens, hundreds, or more query files, concatenate them all into one before running mof-search, as shown below.
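A minimal example (the file names are illustrative):

     cat queries_part_*.fa > all_queries.fa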

License

MIT

Contacts

Code Snippets
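The key helper script and selected rules from the pipeline's Snakefile are reproduced below.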

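The benchmarking wrapper scripts/benchmark.py, used by the Snakefile rules below to measure the wall-clock time, CPU usage, memory, and I/O of individual commands: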
# scripts/benchmark.py: run a command under GNU time and write the collected
# statistics (wall-clock time, CPU, memory, I/O) into a TSV log file.
import sys
import argparse
from pathlib import Path
import subprocess
import datetime


def get_args():
    parser = argparse.ArgumentParser(description='Benchmark a command.')
    parser.add_argument('command', type=str, help='The command to be benchmarked')
    parser.add_argument('--log', type=str, required=True, help='Path to the log file with benchmark statistics.')
    args = parser.parse_args()
    return args


def get_time_command():
    # GNU time is required for the -f format string used below; on macOS it
    # is available as gtime (e.g., from the gnu-time package).
    if sys.platform == "linux":
        time_command = "/usr/bin/time"
    elif sys.platform == "darwin":
        time_command = "gtime"
    else:
        raise Exception("Unsupported OS")
    return time_command


def main():
    args = get_args()
    log_file = Path(args.log)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    tmp_log_file = Path(f"{log_file}.tmp")
    # When benchmarking an entire Snakemake run, a helper script additionally
    # tracks the system-wide RAM usage (see below).
    is_benchmarking_pipeline = args.command.split()[0] == "snakemake"

    with open(log_file, "w") as log_fh:
        formatted_command = " ".join(args.command.replace("\\\n", " ").strip().split())
        print(f"# Benchmarking command: {formatted_command}", file=log_fh)
        header = [
            "real(s)", "sys(s)", "user(s)", "percent_CPU", "max_RAM(kb)", "FS_inputs", "FS_outputs",
            "elapsed_time_alt(s)"
        ]
        if is_benchmarking_pipeline:
            header.append("max_delta_system_RAM(kb)")
        print("\t".join(header), file=log_fh)

    time_command = get_time_command()
    benchmark_command = f'{time_command} -o {tmp_log_file} -f "%e\t%S\t%U\t%P\t%M\t%I\t%O"'

    start_time = datetime.datetime.now()
    main_process = subprocess.Popen(f'{benchmark_command} {args.command}', shell=True)
    if is_benchmarking_pipeline:
        RAM_tmp_log_file = Path(f"{log_file}.RAM.tmp")
        RAM_benchmarking_process = subprocess.Popen(
            [sys.executable, "scripts/get_RAM_usage.py",
             str(RAM_tmp_log_file),
             str(main_process.pid)])
    return_code = main_process.wait()
    if return_code:
        raise subprocess.CalledProcessError(return_code,
                                            main_process.args,
                                            output=main_process.stdout,
                                            stderr=main_process.stderr)

    end_time = datetime.datetime.now()
    elapsed_seconds = (end_time - start_time).total_seconds()
    with open(tmp_log_file) as log_fh_tmp, open(log_file, "a") as log_fh:
        log_line = log_fh_tmp.readline().strip()
        log_line += f"\t{elapsed_seconds}"

        if is_benchmarking_pipeline:
            RAM_benchmarking_process.kill()
            with open(RAM_tmp_log_file) as RAM_tmp_log_fh:
                RAM_usage = RAM_tmp_log_fh.readline().strip()
            log_line += f"\t{RAM_usage}"
            RAM_tmp_log_file.unlink()

        print(log_line, file=log_fh)

    tmp_log_file.unlink()


if __name__ == "__main__":
    main()
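
Downloading a phylogenetically compressed batch as a .tar.xz archive and verifying its integrity (Snakefile, from line 253):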
shell:
    """
    curl "{params.url}/{wildcards.batch}.tar.xz"  > {output.xz}
    scripts/test_xz.py {output.xz}
    """
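
Downloading a single xz-compressed file from a direct URL and verifying it (Snakefile, from line 271):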
shell:
    """
    curl "{params.url}"  > {output.xz}
    scripts/test_xz.py {output.xz}
    """
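
Sanitizing a query file: SeqTK converts it to uppercase FASTA, and awk replaces all non-ACGT characters on the sequence lines (Snakefile, from line 301):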
shell:
    """
    seqtk seq -A -U -C {input.original_query} \\
            | awk '{{if(NR%2==1){{print $0;}}else{{gsub(/[^ACGT]/, \"{params.base_to_replace}\"); print;}}}}' \\
        > {output.fixed_query}
    """
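
Concatenating all sanitized query files into a single file (Snakefile, from line 321):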
shell:
    """
    cat {input} > {output}
    """
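
Decompressing a COBS index batch, with the run benchmarked by scripts/benchmark.py (Snakefile, from line 345):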
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/decompress_cobs/{wildcards.batch}.txt \\
        'xzcat --no-sparse --ignore-check "{input.xz}" > "{params.cobs_index_tmp}" \\
        && mv "{params.cobs_index_tmp}" "{output.cobs_index}"'
    """
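
Querying a COBS index and post-processing the matches to keep only the best hits (Snakefile, from line 380):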
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/run_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
        'cobs query \\
                {params.load_complete} \\
                -t {params.kmer_thres} \\
                -T {threads} \\
                -i {input.cobs_index} \\
                -f {input.fa} \\
            | ./scripts/postprocess_cobs.py -n {params.nb_best_hits} \\
            | gzip --fast \\
            > {output.match}'
    """
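
Matching with a streaming decompression of the COBS index, or alternatively decompressing the index first and removing it afterwards (Snakefile, from line 426):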
shell:
    """
    if [ {params.streaming} = 1 ]
    then
        ./scripts/benchmark.py --log logs/benchmarks/run_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
        './scripts/run_cobs_streaming.sh {params.kmer_thres} {threads} "{input.compressed_cobs_index}" {params.uncompressed_batch_size} "{input.fa}" \\
                | ./scripts/postprocess_cobs.py -n {params.nb_best_hits} \\
                | gzip --fast\\
                > {output.match}'
    else
        mkdir -p {params.decompression_dir}
        ./scripts/benchmark.py --log logs/benchmarks/decompress_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
            'xzcat "{input.compressed_cobs_index}" > "{params.cobs_index_tmp}" \\
            && mv "{params.cobs_index_tmp}" "{params.cobs_index}"'
        ./scripts/benchmark.py --log logs/benchmarks/run_cobs/{wildcards.batch}____{wildcards.qfile}.txt \\
            'cobs query \\
                    {params.load_complete} \\
                    -t {params.kmer_thres} \\
                    -T {threads} \\
                    -i "{params.cobs_index}" \\
                    -f "{input.fa}" \\
                | ./scripts/postprocess_cobs.py -n {params.nb_best_hits} \\
                | gzip --fast\\
                > {output.match}'
        rm -v "{params.cobs_index}"
    fi
    """
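
Filtering the COBS matches down to the best candidates per query (Snakefile, from line 477):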
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/translate_matches/translate_matches___{wildcards.qfile}.txt \\
        './scripts/filter_queries.py \\
                -n {params.nb_best_hits} \\
                -q {input.fa} \\
                {input.all_matches} \\
            > {output.fa} 2>{log}'
    """
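
Aligning the queries to the candidate assemblies of a batch with Minimap2; SAM header lines are stripped from the output (Snakefile, from line 506):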
shell:
    """
    xzcat data/661k_batches.txt.xz \\
        | grep {wildcards.batch} \\
        | cut -f2 \\
        > {params.refs_tmp}

    ./scripts/benchmark.py --log logs/benchmarks/batch_align_minimap2/{wildcards.batch}____{wildcards.qfile}.txt \\
        './scripts/batch_align.py \\
                --minimap-preset {params.minimap_preset} \\
                --threads {threads} \\
                --extra-params=\"{params.minimap_extra_params}\" \\
                --accessions {params.refs_tmp} \\
                {params.pipe} \\
                {input.asm} \\
                {input.qfa} \\
            2>{log} \\
            | {{ grep -Ev "^@" || true; }} \\
            | gzip --fast\\
            > {output.sam}'

    rm -f {params.refs_tmp}
    """
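
Aggregating the per-batch SAM files into a single pseudo-SAM file (Snakefile, from line 539):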
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/aggregate_sams/aggregate_sams___{wildcards.qfile}.txt \\
        './scripts/aggregate_sams.sh {input.sam} \\
            > {output.pseudosam}'
    """
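
Computing the final per-query statistics from the merged queries and the aggregated alignments (Snakefile, from line 558):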
shell:
    """
    ./scripts/benchmark.py --log logs/benchmarks/aggregate_sams/final_stats___{wildcards.qfile}.txt \\
        './scripts/final_stats.py {input.concatenated_query} {input.pseudosam} \\
            > {output.stats}'
    """