Snakemake pipeline for basic processing of metagenomic data from the lab
Overview
Snakemake pipeline for basic processing of metagenomic data from the lab. It accepts raw fastq files of metagenomic data, quality-filters them, removes reads that map to the host genome, then builds assemblies of each sample and generates a sourmash profile. The current version also generates a taxonomic profile of each sample using MetaPhlAn3. Modules currently under development will handle automated binning procedures as well as strain-level profiling.
Quick Start Guide
Install
First, clone this GitHub repository:
$ git clone https://github.com/CUMoellerLab/sn-mg-pipeline.git
$ cd sn-mg-pipeline
We recommend installing and using mamba:
$ conda install -c conda-forge mamba
Then create a conda environment with the Snakemake version used by this workflow:
$ mamba env create -n snakemake -f resources/env/snakemake.yaml
$ conda activate snakemake
Update Files
Now you can update three files located in the ./resources/config directory.

The first two files are samples.txt and units.txt. The samples.txt file is your basic metadata file: each row represents a sample in your dataset, and each column contains the corresponding information about that sample. The first column should be named "Sample" and should contain the name of each sample. Any additional columns are not used at this step.
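For example, a minimal samples.txt, assuming a tab-delimited layout (the sample names and the extra "Treatment" column are made up for illustration; only the "Sample" column is required here):

Sample	Treatment
Mouse_01	Control
Mouse_02	HighFat
Mouse_03	Control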
The units.txt file should have only four columns, and each row should correspond to a sample found in the samples.txt file. The first column, "Sample", should be all or a subset of the "Sample" column in samples.txt. The second column, "Unit", should denote which analysis block each sample belongs to; in our case we use sequencing run/lane, but you may use other information based on your experimental design. The third and fourth columns (named "R1" and "R2", respectively) should give the full file paths to the forward and reverse fastq files for that sample. An example layout is shown below.
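A matching units.txt, again assuming tab-delimited columns, might then look like this (run names and file paths are placeholders, not real files):

Sample	Unit	R1	R2
Mouse_01	run1_lane1	/path/to/Mouse_01_R1.fastq.gz	/path/to/Mouse_01_R2.fastq.gz
Mouse_02	run1_lane1	/path/to/Mouse_02_R1.fastq.gz	/path/to/Mouse_02_R2.fastq.gz
Mouse_03	run2_lane1	/path/to/Mouse_03_R1.fastq.gz	/path/to/Mouse_03_R2.fastq.gz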
The last file to update is the config.yaml file. This is where you select the parameters for each step in the analysis pipeline; refer to the documentation of each tool for more information. Also, be sure to change the NCBI GenBank accession number to your host genome of interest.

NOTE: You can select which metagenomic assembler to use under the "assemblers:" header. The current options are metaSPAdes and MEGAHIT. Simply delete the assembler you don't want to use; otherwise both will run.
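As a rough sketch of what these settings might look like in config.yaml (the key names below are illustrative assumptions, not copied from the shipped file; edit the existing keys in your checkout rather than pasting this in):

# hypothetical excerpt -- check the actual key names in resources/config/config.yaml
host_accession: GCA_000001635.9   # NCBI GenBank accession of the host genome (here: mouse GRCm39)
assemblers:
  - metaspades
  - megahit                       # keep both to run both, or delete one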
Run the Pipeline
After you have updated the files described above, you can start the pipeline. First run:
$ conda install -n base -c conda-forge mamba
Then begin the run using:
$ snakemake --cores 8 --use-conda
The first time you run this, it may take longer while the conda environments are set up. Be sure to select the appropriate number of cores for your analysis.
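If you want to preview the planned jobs before committing compute time, a dry run works here as in any Snakemake workflow (these are standard Snakemake flags, not pipeline-specific options):

$ snakemake -n --cores 8 --use-conda    # dry run: list the jobs that would be executed
$ snakemake --cores 8 --use-conda       # actual run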
Code Snippets
shell:
    """
    # Make temporary output directory
    mkdir -p {params.temp_dir}

    # run the metaspades assembly
    metaspades.py --threads {threads} \
      -o {params.temp_dir}/ \
      --memory $(({resources.mem_mb}/1024)) \
      --pe1-1 {input.fastq1} \
      --pe1-2 {input.fastq2} \
      2> {log} 1>&2

    # move and rename the contigs file into a permanent directory
    mv {params.temp_dir}/contigs.fasta {output.contigs}

    rm -rf {params.temp_dir}
    """
shell:
    """
    megahit -t {threads} \
      -o {params.temp_dir}/ \
      --memory $(({resources.mem_mb}*1024*1024)) \
      -1 {input.fastq1} \
      -2 {input.fastq2} \
      2> {log} 1>&2

    # move and rename the contigs file into a permanent directory
    mv {params.temp_dir}/final.contigs.fa {output.contigs}

    rm -rf {params.temp_dir}
    """
shell:
    """
    quast.py \
      -o {params.outdir} \
      -t {threads} \
      {input}
    touch {output.report}
    """
wrapper:
    "v1.7.0/bio/multiqc"
shell:
    """
    metaquast.py \
      -r {params.refs} \
      -o {params.outdir} \
      -t {threads} \
      {params.extra} \
      {input}
    """
wrapper:
    "0.72.0/bio/multiqc"
shell:
    """
    jgi_summarize_bam_contig_depths --outputDepth {output.coverage_table} {input.bams} 2> {log}
    """
shell:
    """
    metabat2 {params.extra} --numThreads {threads} \
      --inFile {input.contigs} \
      --outFile {params.basename} \
      --abdFile {input.coverage_table} \
      --minContig {params.min_contig_length} \
      2> {log} 1>&2
    """
shell:
    """
    samtools coverage {input.bams} | \
      tail -n +2 | \
      sort -k1 | \
      cut -f1,6 > {output.coverage_table} 2> {log}
    """
run:
    with open(output.abund_list, 'w') as f:
        for fp in input:
            f.write('%s\n' % fp)
shell:
    """
    mkdir -p {output.bins}

    run_MaxBin.pl -thread {threads} -prob_threshold {params.prob} \
      -min_contig_length {params.min_contig_length} {params.extra} \
      -contig {input.contigs} \
      -abund_list {input.abund_list} \
      -out {params.basename} 2> {log} 1>&2
    """
shell:
    """
    cut_up_fasta.py {input.contigs} \
      -c {params.chunk_size} \
      -o {params.overlap_size} \
      --merge_last \
      -b {output.bed} > {output.contigs_10K} 2> {log}
    """
shell:
    """
    concoct_coverage_table.py {input.bed} \
      {input.bam} > {output.coverage_table} 2> {log}
    """
shell:
    """
    concoct --threads {threads} -l {params.min_contig_length} \
      --composition_file {input.contigs_10K} \
      --coverage_file {input.coverage_table} \
      -b {params.bins} 2> {log} 1>&2

    mv output/binning/concoct/{wildcards.mapper}/run_concoct/{wildcards.contig_sample}/{wildcards.contig_sample}_bins_clustering_gt{params.min_contig_length}.csv \
       output/binning/concoct/{wildcards.mapper}/run_concoct/{wildcards.contig_sample}/{wildcards.contig_sample}_bins_clustering.csv
    """
shell:
    """
    merge_cutup_clustering.py {input.bins} > {output.merged} 2> {log}
    """
shell:
    """
    mkdir -p {output.fasta_bins}

    extract_fasta_bins.py \
      {input.original_contigs} \
      {input.clustering_merged} \
      --output_path {output.fasta_bins} \
      2> {log}
    """
shell:
    """
    {params.bt2b_command} --threads {threads} \
      {input.contigs} {params.indexbase} 2> {log}
    """
shell:
    """
    # Map reads against reference genome
    {params.bt2_command} {params.extra} -p {threads} -x {params.ref} \
      -1 {input.reads[0]} -2 {input.reads[1]} \
      2> {log} | samtools view -bS - > {output.aln}
    """
shell:
    """
    minimap2 -d {output.index} {input.contigs} -t {threads} 2> {log}
    """
shell:
    """
    # Map reads against contigs
    minimap2 -a {input.db} {input.reads} -x {params.x} -K {params.k} -t {threads} \
      2> {log} | samtools view -bS - > {output.aln}
    """
shell:
    """
    samtools sort -o {output.bam} -@ {threads} {input.aln} 2> {log}
    samtools index -b -@ {threads} {output.bam} 2>> {log}
    """
shell:
    """
    # get stem file path
    stem={output.report}
    stem=${{stem%.report.txt}}

    # run Kraken to align reads against reference genomes
    kraken2 {input.fastq1} {input.fastq2} \
      --db {params.db} \
      --paired \
      --gzip-compressed \
      --only-classified-output \
      --threads {threads} \
      --report {output.report} \
      --output - \
      2> {log}

    # run Bracken to re-estimate abundance at given rank
    if [[ ! -z {params.levels} ]]
    then
      IFS=',' read -r -a levels <<< "{params.levels}"
      for level in "${{levels[@]}}"
      do
        bracken \
          -d {params.bracken_db} \
          -i {output.report} \
          -t 10 \
          -l $(echo $level | head -c 1 | tr a-z A-Z) \
          -o $stem.redist.$level.txt \
          2>> {log} 1>&2
      done
    fi
    """
shell:
    """
    perl resources/scripts/kraken2-translate.pl {input} > {input}.temp
    ktImportText -o {output} {input}.temp
    rm {input}.temp
    """
shell:
    """
    if test -f "{output}/mpa_latest"; then
      touch {output}
      echo "DB already installed at {output}"
    else
      metaphlan --install --bowtie2db {output} \
        2> {log} 1>&2
    fi
    """
shell:
    """
    metaphlan {input.fastq1},{input.fastq2} \
      --input_type fastq \
      --nproc {threads} {params.other} \
      --bowtie2db {input.db_path} \
      --bowtie2out {output.bt2} \
      -s {output.sam} \
      -o {output.profile} \
      2> {log} 1>&2
    """
shell:
    """
    merge_metaphlan_tables.py {input} \
      -o {output.merged_abundance_table} \
      2> {log} 1>&2
    """
shell:
    """
    sourmash sketch dna \
      -p k={params.k},scaled={params.scaled} \
      {params.extra} \
      -o {output} \
      --merge \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash compare \
      --output {output.dm} \
      --csv {output.csv} \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash plot --pdf --labels \
      --output-dir {output} \
      {input} 2> {log} 1>&2
    """
run:
    df = pd.read_csv(input[0], header=0, encoding='unicode_escape')
    df.index = df.columns

    # test file sizes
    pf_seqs = []
    for fp in df.columns:
        print(fp)
        with gzip.open(fp, 'rb') as f:
            for i, l in enumerate(f):
                pass
        seqs = (i + 1) / 4
        print(seqs)
        if params['min_seqs'] <= seqs <= params['max_seqs']:
            pf_seqs.append(fp)

    df_filt = df.loc[pf_seqs, pf_seqs]
    labels = [os.path.basename(x) for x in pf_seqs]
    dm = DistanceMatrix(1 - df_filt.values)

    print("The imported distance matrix has "
          "{} elements.".format(len(labels)))
    print("Selecting 2 to {} prototypes.\n".format(len(labels) - 1))

    proto_dict = {}
    for k in range(2, len(labels)):
        # run prototypeSelection function
        prototypes = prototype_selection_destructive_maxdist(dm, k)
        proto_dict[k] = [labels[int(x)] for x in prototypes]

    with open(output[0], 'w') as outfile:
        dump(proto_dict, outfile, default_flow_style=False)

    with open(log[0], "w") as logfile:
        logfile.write("Running the run_prototypeSelection.py script.\n"
                      "The imported distance matrix has {0} elements.\n"
                      "Selecting 2 to "
                      "{1} prototypes.\n".format(df.shape[1], len(labels) - 1))
wrapper:
    "0.72.0/bio/fastqc"
wrapper:
    "0.17.4/bio/cutadapt/pe"
wrapper:
    "0.72.0/bio/fastqc"
shell:
    "cat {input} > {output}"
shell:
    """
    bowtie2-build --threads {threads} {params.extra} \
      {input.reference} {params.indexbase} 2> {log} 1>&2
    """
shell:
    """
    # Map reads against reference genome
    bowtie2 -p {threads} -x {params.ref} \
      -1 {input.fastq1} -2 {input.fastq2} \
      --un-conc-gz {wildcards.sample}_nonhost \
      --no-unal \
      2> {log} | samtools view -bS - > {output.host}

    # rename nonhost samples
    mv {wildcards.sample}_nonhost.1 output/qc/host_filter/nonhost/{wildcards.sample}.R1.fastq.gz
    mv {wildcards.sample}_nonhost.2 output/qc/host_filter/nonhost/{wildcards.sample}.R2.fastq.gz
    """
wrapper:
    "0.72.0/bio/fastqc"
wrapper:
    "v1.7.0/bio/multiqc"
shell:
    """
    Fasta_to_Scaffolds2Bin.sh \
      -i {input.bins} \
      -e fa > {output.scaffolds2bin}
    """
shell:
    """
    Fasta_to_Scaffolds2Bin.sh \
      -i {input.bins} \
      -e fasta > {output.scaffolds2bin}
    """
shell:
    """
    Fasta_to_Scaffolds2Bin.sh \
      -i {input.bins} \
      -e fa > {output.scaffolds2bin}
    """
shell:
    """
    DAS_Tool \
      --bins {input.metabat2},{input.maxbin2},{input.concoct} \
      --contigs {input.contigs} \
      --outputbasename {params.basename} \
      --labels metabat2,maxbin2,concoct \
      --write_bins 1 \
      --write_bin_evals 1 \
      --threads {threads} \
      --search_engine {params.search_engine}
    """
run:
    sample = wildcards.contig_sample

    fasta_dir = join(dirname(input[0]), sample + '_DASTool_bins')
    output_dir = dirname(output.done)

    fasta_files = glob(join(fasta_dir, '*.fa'))

    # copy each DAS Tool bin into the output directory, prefixed with the sample name
    for file in fasta_files:
        copyfile(file, join(output_dir, sample + '_' + basename(file)))

    # create the flag file declared as output so Snakemake marks the job complete
    open(output.done, 'w').close()
shell:
    """
    sourmash sketch dna \
      -p k={params.k},scaled={params.scaled} \
      {params.extra} \
      -o {output} \
      --merge \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash compare \
      --output {output.dm} \
      --csv {output.csv} \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash plot --pdf --labels \
      --output-dir {output} \
      {input} 2> {log} 1>&2
    """
run:
    df = pd.read_csv(input[0], header=0, encoding='unicode_escape')
    df.index = df.columns

    # test file sizes
    pf_seqs = []
    for fp in df.columns:
        print(fp)
        with gzip.open(fp, 'rb') as f:
            for i, l in enumerate(f):
                pass
        seqs = (i + 1) / 4
        print(seqs)
        if params['min_seqs'] <= seqs <= params['max_seqs']:
            pf_seqs.append(fp)

    df_filt = df.loc[pf_seqs, pf_seqs]
    labels = [os.path.basename(x) for x in pf_seqs]
    dm = DistanceMatrix(1 - df_filt.values)

    print("The imported distance matrix has "
          "{} elements.".format(len(labels)))
    print("Selecting 2 to {} prototypes.\n".format(len(labels) - 1))

    proto_dict = {}
    for k in range(2, len(labels)):
        # run prototypeSelection function
        prototypes = prototype_selection_destructive_maxdist(dm, k)
        proto_dict[k] = [labels[int(x)] for x in prototypes]

    with open(output[0], 'w') as outfile:
        dump(proto_dict, outfile, default_flow_style=False)

    with open(log[0], "w") as logfile:
        logfile.write("Running the run_prototypeSelection.py script.\n"
                      "The imported distance matrix has {0} elements.\n"
                      "Selecting 2 to "
                      "{1} prototypes.\n".format(df.shape[1], len(labels) - 1))
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
Support
- Future updates