Snakemake pipeline for basic processing of metagenomic data from the lab


Overview

Snakemake pipeline for basic processing of metagenomic data from the lab. It accepts raw fastq files of metagenomic data, quality filters them, removes reads that map to the host genome, then builds an assembly of each sample and generates a sourmash profile. The current version also generates a taxonomic profile of each sample using MetaPhlAn3. Modules that are currently under development will handle automated binning procedures as well as strain-level profiling.

Quick Start Guide

Install

First, clone this GitHub repository:

$ git clone https://github.com/CUMoellerLab/sn-mg-pipeline.git
$ cd sn-mg-pipeline

We recommend installing and using mamba:

$ conda install -c conda-forge mamba

Then install the Snakemake version used by this workflow into a dedicated environment using mamba:

$ mamba env create -n snakemake -f resources/env/snakemake.yaml
$ conda activate snakemake

Update Files

Next, update the three files located in the ./resources/config directory.

The first two files are samples.txt and units.txt. The samples.txt file is your basic metadata file, with each row representing a sample in your dataset and each column containing the corresponding information about that sample. The first column should be named "Sample" and should contain the name of each sample. Any additional columns are not used at this step.
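
For example, a minimal samples.txt might look like the following (the sample names and the extra "Treatment" column are hypothetical, and the delimiter should match the example files shipped in ./resources/config):

Sample     Treatment
Mouse_A    Control
Mouse_B    Antibiotic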

The units.txt file should have exactly four columns, and each row should correspond to a sample found in the samples.txt file. The first column, "Sample", should contain all or a subset of the values in the "Sample" column of samples.txt. The second column, "Unit", should denote which analysis block each sample belongs to. In our case we use sequencing run/lane, but you may use other information based on your experimental design. The third and fourth columns (named "R1" and "R2", respectively) should contain the full file paths to the forward and reverse fastq files for that sample.
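
A matching units.txt might look like this (unit labels and file paths are hypothetical):

Sample     Unit         R1                               R2
Mouse_A    run1_lane1   /data/raw/Mouse_A_R1.fastq.gz    /data/raw/Mouse_A_R2.fastq.gz
Mouse_B    run1_lane1   /data/raw/Mouse_B_R1.fastq.gz    /data/raw/Mouse_B_R2.fastq.gz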

The last file to update is the config.yaml file. This is where you select the parameters for each step in the analysis pipeline; refer to the documentation for each individual tool for more information. Also, be sure to change the NCBI GenBank assembly accession to that of your host of interest.

NOTE: You can select which metagenomic assembler you want to use under the "assemblers:" header. The current options are metaSPAdes and MEGAHIT. Simply delete the assembler you don't want to use; otherwise, both will run.
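
As a rough sketch of what these entries might look like (the key names and values below are illustrative only; keep the key names already present in the shipped config.yaml), the assembler selection and host accession could read:

assemblers:
  - metaspades
  - megahit

host_accession: GCA_000001635.9   # e.g. the mouse GRCm39 GenBank assembly; replace with your host of interest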

Run the Pipeline

After you have updated the files described above, you can start the pipeline. First run:

$ conda install -n base -c conda-forge mamba

Then begin the run using:

$ snakemake --cores 8 --use-conda

The first time you run this, it may take longer because Snakemake has to build the conda environments. Be sure to select the appropriate number of cores for your analysis.
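
If you would like to preview the jobs that will be scheduled before committing compute time, Snakemake's standard dry-run flag can be added to the same command:

$ snakemake -n --cores 8 --use-conda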

Code Snippets

shell:
    """
    # Make temporary output directory
    mkdir -p {params.temp_dir}

    # run the metaspades assembly
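    # (metaSPAdes' --memory limit is given in gigabytes, hence the mem_mb/1024 conversion)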
    metaspades.py --threads {threads} \
      -o {params.temp_dir}/ \
      --memory $(({resources.mem_mb}/1024)) \
      --pe1-1 {input.fastq1} \
      --pe1-2 {input.fastq2} \
      2> {log} 1>&2

    # move and rename the contigs file into a permanent directory
    mv {params.temp_dir}/contigs.fasta {output.contigs}
    rm -rf {params.temp_dir}
    """
shell:
    """
    megahit -t {threads} \
      -o {params.temp_dir}/ \
      --memory $(({resources.mem_mb}*1024*1024)) \
      -1 {input.fastq1} \
      -2 {input.fastq2} \
      2> {log} 1>&2

    # move and rename the contigs file into a permanent directory
    mv {params.temp_dir}/final.contigs.fa {output.contigs}
    rm -rf {params.temp_dir}

    """
shell:
    """
    quast.py \
      -o {params.outdir} \
      -t {threads} \
      {input}

    touch {output.report}
    """
wrapper:
    "v1.7.0/bio/multiqc"
shell:
    """
    metaquast.py \
      -r {params.refs} \
      -o {params.outdir} \
      -t {threads} \
      {params.extra} \
      {input}
    """
wrapper:
    "0.72.0/bio/multiqc"
shell:
    """
        jgi_summarize_bam_contig_depths --outputDepth {output.coverage_table} {input.bams} 2> {log}
    """
shell:
    """
        metabat2 {params.extra} --numThreads {threads} \
        --inFile {input.contigs} \
        --outFile {params.basename} \
        --abdFile {input.coverage_table} \
        --minContig {params.min_contig_length} \
        2> {log} 1>&2
    """
shell:
    """
      samtools coverage {input.bams} | \
      tail -n +2 | \
      sort -k1 | \
      cut -f1,6 > {output.coverage_table} 2> {log}
   """
run:
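    # write one abundance/coverage table path per line; this list file is later
    # passed to run_MaxBin.pl via its -abund_list option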
    with open(output.abund_list, 'w') as f:
        for fp in input:
            f.write('%s\n' % fp)
shell:
    """
        mkdir -p {output.bins}

        run_MaxBin.pl -thread {threads} -prob_threshold {params.prob} \
        -min_contig_length {params.min_contig_length} {params.extra} \
        -contig {input.contigs} \
        -abund_list {input.abund_list} \
        -out {params.basename} \
        2> {log} 1>&2
    """
shell:
    """
      cut_up_fasta.py {input.contigs} \
      -c {params.chunk_size} \
      -o {params.overlap_size} \
      --merge_last \
      -b {output.bed} > {output.contigs_10K} 2> {log}
    """
shell:
    """
      concoct_coverage_table.py {input.bed} \
      {input.bam} > {output.coverage_table} 2> {log}
    """
shell:
    """
        concoct --threads {threads} -l {params.min_contig_length} \
        --composition_file {input.contigs_10K} \
        --coverage_file {input.coverage_table} \
        -b {params.bins} \
        2> {log} 1>&2

        mv output/binning/concoct/{wildcards.mapper}/run_concoct/{wildcards.contig_sample}/{wildcards.contig_sample}_bins_clustering_gt{params.min_contig_length}.csv \
           output/binning/concoct/{wildcards.mapper}/run_concoct/{wildcards.contig_sample}/{wildcards.contig_sample}_bins_clustering.csv
    """
shell:
    """
        merge_cutup_clustering.py {input.bins} > {output.merged} 2> {log}
    """
shell:
    """
        mkdir -p {output.fasta_bins}
        extract_fasta_bins.py \
        {input.original_contigs} \
        {input.clustering_merged} \
        --output_path {output.fasta_bins} \
        2> {log}
    """
shell:
    """
    {params.bt2b_command} --threads {threads} \
    {input.contigs} {params.indexbase} 2> {log}
    """
shell:
    """
    # Map reads against reference genome
    {params.bt2_command} {params.extra} -p {threads} -x {params.ref} \
      -1 {input.reads[0]} -2 {input.reads[1]} \
      2> {log} | samtools view -bS - > {output.aln}

    """
shell:
    """
    minimap2 -d {output.index} {input.contigs} -t {threads} 2> {log}

    """
shell:
    """
    # Map reads against contigs
    minimap2 -a {input.db} {input.reads} -x {params.x} -K {params.k} -t {threads} \
    2> {log} | samtools view -bS - > {output.aln}

    """
shell:
    """
    samtools sort -o {output.bam} -@ {threads} {input.aln} 2> {log}
    samtools index -b -@ {threads} {output.bam} 2>> {log}
    """
shell:
    """
      # get stem file path
      stem={output.report}
      stem=${{stem%.report.txt}}

      # run Kraken to align reads against reference genomes
      kraken2 {input.fastq1} {input.fastq2} \
        --db {params.db} \
        --paired \
        --gzip-compressed \
        --only-classified-output \
        --threads {threads} \
        --report {output.report} \
        --output - \
        2> {log}

      # run Bracken to re-estimate abundance at given rank
      if [[ ! -z {params.levels} ]]
      then
        IFS=',' read -r -a levels <<< "{params.levels}"
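        # Bracken's -l flag takes a single-letter taxonomic rank code, so the
        # first letter of each requested level is upper-cased below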
        for level in "${{levels[@]}}"
        do
          bracken \
            -d {params.bracken_db} \
            -i {output.report} \
            -t 10 \
            -l $(echo $level | head -c 1 | tr a-z A-Z) \
            -o $stem.redist.$level.txt \
            2>> {log} 1>&2
        done
      fi
      """
shell:
    """
    perl resources/scripts/kraken2-translate.pl {input} > {input}.temp
    ktImportText -o {output} {input}.temp
    rm {input}.temp
    """
shell:
    """
    if test -f "{output}/mpa_latest"; then
        touch {output}
        echo "DB already installed at {output}"
    else
        metaphlan --install --bowtie2db {output} \
            2> {log} 1>&2
    fi
    """
shell:
    """
    metaphlan {input.fastq1},{input.fastq2} \
    --input_type fastq \
    --nproc {threads} {params.other} \
    --bowtie2db {input.db_path} \
    --bowtie2out {output.bt2}  \
    -s {output.sam}  \
    -o {output.profile} \
    2> {log} 1>&2
    """
shell:
    """
    merge_metaphlan_tables.py {input} \
    -o {output.merged_abundance_table} \
    2> {log} 1>&2
    """
shell:
    """
    sourmash sketch dna \
    -p k={params.k},scaled={params.scaled}  \
    {params.extra} \
    -o {output} \
    --merge \
    {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash compare \
    --output {output.dm} \
    --csv {output.csv} \
    {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash plot --pdf --labels \
    --output-dir {output} \
    {input} 2> {log} 1>&2
    """
run:
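    # NOTE: this run block relies on names imported elsewhere in the Snakefile
    # (not shown in this snippet) -- likely pandas as pd, gzip, os, yaml's dump,
    # scikit-bio's DistanceMatrix, and prototype_selection_destructive_maxdist.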
    df = pd.read_csv(input[0], header=0, encoding= 'unicode_escape')
    df.index = df.columns

    # test file sizes

    pf_seqs = []
    for fp in df.columns:
        # print(depth)
        print(fp)
        with gzip.open(fp, 'rb') as f:
            for i, l in enumerate(f):
                pass
        seqs = (i + 1) / 4
        print(seqs)
        if params['min_seqs'] <= seqs <= params['max_seqs']:
            pf_seqs.append(fp)

    df_filt = df.loc[pf_seqs, pf_seqs]

    labels = [os.path.basename(x) for x in pf_seqs]

    dm = DistanceMatrix(1 - df_filt.values)

    print("The imported distance matrix has "
          "{} elements.".format(len(labels)))
    print("Selecting 2 to {} prototypes.\n".format(len(labels) - 1))

    proto_dict = {}

    for k in range(2, len(labels)):
        # run prototypeSelection function
        prototypes = prototype_selection_destructive_maxdist(dm, k)
        proto_dict[k] = [labels[int(x)] for x in prototypes]

    with open(output[0], 'w') as outfile:
        dump(proto_dict, outfile, default_flow_style=False)

    with open(log[0], "w") as logfile:
        logfile.write("Running the run_prototypeSelection.py script.\n"
                      "The imported distance matrix has {0} elements.\n"
                      "Selecting 2 to "
                      "{1} prototypes.\n".format(df.shape[1],
                                                 len(labels) - 1))
wrapper:
    "0.72.0/bio/fastqc"
wrapper:
    "0.17.4/bio/cutadapt/pe"
wrapper:
    "0.72.0/bio/fastqc"
shell: "cat {input} > {output}"
shell:
    """
    bowtie2-build --threads {threads} {params.extra} \
    {input.reference} {params.indexbase} 2> {log} 1>&2
    """
shell:
    """
    # Map reads against reference genome
    bowtie2 -p {threads} -x {params.ref} \
      -1 {input.fastq1} -2 {input.fastq2} \
      --un-conc-gz {wildcards.sample}_nonhost \
      --no-unal \
      2> {log} | samtools view -bS - > {output.host}

    # rename nonhost samples
    mv {wildcards.sample}_nonhost.1 output/qc/host_filter/nonhost/{wildcards.sample}.R1.fastq.gz
    mv {wildcards.sample}_nonhost.2 output/qc/host_filter/nonhost/{wildcards.sample}.R2.fastq.gz
    """
wrapper:
    "0.72.0/bio/fastqc"
wrapper:
    "v1.7.0/bio/multiqc"
shell:
    """
        Fasta_to_Scaffolds2Bin.sh \
        -i {input.bins} \
        -e fa > {output.scaffolds2bin}
    """
shell:
    """
        Fasta_to_Scaffolds2Bin.sh \
        -i {input.bins} \
        -e fasta > {output.scaffolds2bin}
    """
shell:
    """
        Fasta_to_Scaffolds2Bin.sh \
        -i {input.bins} \
        -e fa > {output.scaffolds2bin}
    """
shell:
    """
        DAS_Tool \
        --bins {input.metabat2},{input.maxbin2},{input.concoct} \
        --contigs {input.contigs} \
        --outputbasename {params.basename} \
        --labels metabat2,maxbin2,concoct \
        --write_bins 1 \
        --write_bin_evals 1 \
        --threads {threads} \
        --search_engine {params.search_engine}
    """
run:
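    # NOTE: this run block relies on names imported elsewhere in the Snakefile
    # (not shown in this snippet) -- likely os.path's join, dirname, and
    # basename, glob's glob, and shutil's copyfile.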
    sample = wildcards.contig_sample 
    fasta_dir = join(dirname(input[0]),
                     sample + '_DASTool_bins')
    output_dir = dirname(output.done)

    fasta_files = glob(join(fasta_dir, '*.fa'))

    for file in fasta_files:
        copyfile(file,
                 join(output.out,
                      sample + '_' + basename(file)))
shell:
    """
    sourmash sketch dna \
    -p k={params.k},scaled={params.scaled}  \
    {params.extra} \
    -o {output} \
    --merge \
    {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash compare \
    --output {output.dm} \
    --csv {output.csv} \
    {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash plot --pdf --labels \
    --output-dir {output} \
    {input} 2> {log} 1>&2
    """
run:
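    # NOTE: this run block relies on names imported elsewhere in the Snakefile
    # (not shown in this snippet) -- likely pandas as pd, gzip, os, yaml's dump,
    # scikit-bio's DistanceMatrix, and prototype_selection_destructive_maxdist.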
    df = pd.read_csv(input[0], header=0, encoding= 'unicode_escape')
    df.index = df.columns

    # test file sizes

    pf_seqs = []
    for fp in df.columns:
        print(fp)
        with gzip.open(fp, 'rb') as f:
            for i, l in enumerate(f):
                pass
        seqs = (i + 1) / 4
        print(seqs)
        if params['min_seqs'] <= seqs <= params['max_seqs']:
            pf_seqs.append(fp)

    df_filt = df.loc[pf_seqs, pf_seqs]

    labels = [os.path.basename(x) for x in pf_seqs]

    dm = DistanceMatrix(1 - df_filt.values)

    print("The imported distance matrix has "
          "{} elements.".format(len(labels)))
    print("Selecting 2 to {} prototypes.\n".format(len(labels) - 1))

    proto_dict = {}

    for k in range(2, len(labels)):
        # run prototypeSelection function
        prototypes = prototype_selection_destructive_maxdist(dm, k)
        proto_dict[k] = [labels[int(x)] for x in prototypes]

    with open(output[0], 'w') as outfile:
        dump(proto_dict, outfile, default_flow_style=False)

    with open(log[0], "w") as logfile:
        logfile.write("Running the run_prototypeSelection.py script.\n"
                      "The imported distance matrix has {0} elements.\n"
                      "Selecting 2 to "
                      "{1} prototypes.\n".format(df.shape[1],
                                                 len(labels) - 1))
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)

URL: https://github.com/CUMoellerLab/sn-mg-pipeline
Name: sn-mg-pipeline
License: MIT License