Snakemake pipeline for basic processing of metagenomic data from the lab
Overview
Snakemake pipeline for basic processing of metagenomic data from the lab. It accepts raw fastq files of metagenomic data, quality-filters them, removes reads that map to the host genome, then builds assemblies of each sample and generates a sourmash profile. The current version also generates a taxonomic profile of each sample using MetaPhlAn3. Modules currently under development will handle automated binning procedures as well as strain-level profiling.
Quick Start Guide
Install
First, clone this GitHub repository:
$ git clone https://github.com/CUMoellerLab/sn-mg-pipeline.git
$ cd sn-mg-pipeline
We recommend installing and using mamba:
$ conda install -c conda-forge mamba
Then create a conda environment with the Snakemake version used by this workflow:
$ mamba env create -n snakemake -f resources/env/snakemake.yaml
$ conda activate snakemake
Update Files
Now you can update three files located in the ./resources/config directory.

The first two files are samples.txt and units.txt. The samples.txt file is your basic metadata file: each row represents a sample in your dataset, and each column contains the corresponding information about that sample. The first column should be named "Sample" and should contain the name of each sample. Any additional columns are not used at this step.
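For example, a minimal samples.txt, assuming a tab-delimited layout (the sample names and the extra "Treatment" column are made up for illustration; only the "Sample" column is required here):

Sample	Treatment
Mouse_01	Control
Mouse_02	HighFat
Mouse_03	Control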
The units.txt file should have only four columns, and each row should correspond to a sample found in the samples.txt file. The first column, "Sample", should be all or a subset of the "Sample" column in samples.txt. The second column, "Unit", should denote which analysis block each sample belongs to; in our case we use sequencing run/lane, but you may use other information based on your experimental design. The third and fourth columns (named "R1" and "R2", respectively) should give the full file paths to the forward and reverse fastq files for that sample. An example layout is shown below.
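A matching units.txt, again assuming tab-delimited columns, might then look like this (run names and file paths are placeholders, not real files):

Sample	Unit	R1	R2
Mouse_01	run1_lane1	/path/to/Mouse_01_R1.fastq.gz	/path/to/Mouse_01_R2.fastq.gz
Mouse_02	run1_lane1	/path/to/Mouse_02_R1.fastq.gz	/path/to/Mouse_02_R2.fastq.gz
Mouse_03	run2_lane1	/path/to/Mouse_03_R1.fastq.gz	/path/to/Mouse_03_R2.fastq.gz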
The last file to update is the config.yaml file. This is where you select the parameters for each step in the analysis pipeline; refer to the documentation of each tool for more information. Also, be sure to change the NCBI GenBank accession number to your host genome of interest.

NOTE: You can select which metagenomic assembler to use under the "assemblers:" header. The current options are metaSPAdes and MEGAHIT. Simply delete the assembler you don't want to use; otherwise both will run.
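As a rough sketch of what these settings might look like in config.yaml (the key names below are illustrative assumptions, not copied from the shipped file; edit the existing keys in your checkout rather than pasting this in):

# hypothetical excerpt -- check the actual key names in resources/config/config.yaml
host_accession: GCA_000001635.9   # NCBI GenBank accession of the host genome (here: mouse GRCm39)
assemblers:
  - metaspades
  - megahit                       # keep both to run both, or delete one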
Run the Pipeline
After you have updated the files described above, you can start the pipeline. First run:
$ conda install -n base -c conda-forge mamba
Then begin the run using:
$ snakemake --cores 8 --use-conda
The first time you run this, it may take longer while the conda environments are set up. Be sure to select the appropriate number of cores for your analysis.
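If you want to preview the planned jobs before committing compute time, a dry run works here as in any Snakemake workflow (these are standard Snakemake flags, not pipeline-specific options):

$ snakemake -n --cores 8 --use-conda    # dry run: list the jobs that would be executed
$ snakemake --cores 8 --use-conda       # actual run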
Code Snippets
shell:
    """
    # Make temporary output directory
    mkdir -p {params.temp_dir}

    # run the metaspades assembly
    metaspades.py --threads {threads} \
      -o {params.temp_dir}/ \
      --memory $(({resources.mem_mb}/1024)) \
      --pe1-1 {input.fastq1} \
      --pe1-2 {input.fastq2} \
      2> {log} 1>&2

    # move and rename the contigs file into a permanent directory
    mv {params.temp_dir}/contigs.fasta {output.contigs}

    rm -rf {params.temp_dir}
    """
shell:
    """
    megahit -t {threads} \
      -o {params.temp_dir}/ \
      --memory $(({resources.mem_mb}*1024*1024)) \
      -1 {input.fastq1} \
      -2 {input.fastq2} \
      2> {log} 1>&2

    # move and rename the contigs file into a permanent directory
    mv {params.temp_dir}/final.contigs.fa {output.contigs}

    rm -rf {params.temp_dir}
    """
shell:
    """
    quast.py \
      -o {params.outdir} \
      -t {threads} \
      {input}
    touch {output.report}
    """
wrapper:
    "v1.7.0/bio/multiqc"
shell:
    """
    metaquast.py \
      -r {params.refs} \
      -o {params.outdir} \
      -t {threads} \
      {params.extra} \
      {input}
    """
wrapper:
    "0.72.0/bio/multiqc"
shell:
    """
    jgi_summarize_bam_contig_depths --outputDepth {output.coverage_table} {input.bams} 2> {log}
    """
shell:
    """
    metabat2 {params.extra} --numThreads {threads} \
      --inFile {input.contigs} \
      --outFile {params.basename} \
      --abdFile {input.coverage_table} \
      --minContig {params.min_contig_length} \
      2> {log} 1>&2
    """
shell:
    """
    samtools coverage {input.bams} | \
      tail -n +2 | \
      sort -k1 | \
      cut -f1,6 > {output.coverage_table} 2> {log}
    """
run:
    with open(output.abund_list, 'w') as f:
        for fp in input:
            f.write('%s\n' % fp)
shell:
    """
    mkdir -p {output.bins}

    run_MaxBin.pl -thread {threads} -prob_threshold {params.prob} \
      -min_contig_length {params.min_contig_length} {params.extra} \
      -contig {input.contigs} \
      -abund_list {input.abund_list} \
      -out {params.basename} 2> {log} 1>&2
    """
shell:
    """
    cut_up_fasta.py {input.contigs} \
      -c {params.chunk_size} \
      -o {params.overlap_size} \
      --merge_last \
      -b {output.bed} > {output.contigs_10K} 2> {log}
    """
shell:
    """
    concoct_coverage_table.py {input.bed} \
      {input.bam} > {output.coverage_table} 2> {log}
    """
shell:
    """
    concoct --threads {threads} -l {params.min_contig_length} \
      --composition_file {input.contigs_10K} \
      --coverage_file {input.coverage_table} \
      -b {params.bins} 2> {log} 1>&2

    mv output/binning/concoct/{wildcards.mapper}/run_concoct/{wildcards.contig_sample}/{wildcards.contig_sample}_bins_clustering_gt{params.min_contig_length}.csv \
       output/binning/concoct/{wildcards.mapper}/run_concoct/{wildcards.contig_sample}/{wildcards.contig_sample}_bins_clustering.csv
    """
shell:
    """
    merge_cutup_clustering.py {input.bins} > {output.merged} 2> {log}
    """
shell:
    """
    mkdir -p {output.fasta_bins}

    extract_fasta_bins.py \
      {input.original_contigs} \
      {input.clustering_merged} \
      --output_path {output.fasta_bins} \
      2> {log}
    """
shell:
    """
    {params.bt2b_command} --threads {threads} \
      {input.contigs} {params.indexbase} 2> {log}
    """
shell:
    """
    # Map reads against reference genome
    {params.bt2_command} {params.extra} -p {threads} -x {params.ref} \
      -1 {input.reads[0]} -2 {input.reads[1]} \
      2> {log} | samtools view -bS - > {output.aln}
    """
shell:
    """
    minimap2 -d {output.index} {input.contigs} -t {threads} 2> {log}
    """
shell:
    """
    # Map reads against contigs
    minimap2 -a {input.db} {input.reads} -x {params.x} -K {params.k} -t {threads} \
      2> {log} | samtools view -bS - > {output.aln}
    """
shell:
    """
    samtools sort -o {output.bam} -@ {threads} {input.aln} 2> {log}
    samtools index -b -@ {threads} {output.bam} 2>> {log}
    """
shell:
    """
    # get stem file path
    stem={output.report}
    stem=${{stem%.report.txt}}

    # run Kraken to align reads against reference genomes
    kraken2 {input.fastq1} {input.fastq2} \
      --db {params.db} \
      --paired \
      --gzip-compressed \
      --only-classified-output \
      --threads {threads} \
      --report {output.report} \
      --output - \
      2> {log}

    # run Bracken to re-estimate abundance at given rank
    if [[ ! -z {params.levels} ]]
    then
      IFS=',' read -r -a levels <<< "{params.levels}"
      for level in "${{levels[@]}}"
      do
        bracken \
          -d {params.bracken_db} \
          -i {output.report} \
          -t 10 \
          -l $(echo $level | head -c 1 | tr a-z A-Z) \
          -o $stem.redist.$level.txt \
          2>> {log} 1>&2
      done
    fi
    """
shell:
    """
    perl resources/scripts/kraken2-translate.pl {input} > {input}.temp
    ktImportText -o {output} {input}.temp
    rm {input}.temp
    """
shell:
    """
    if test -f "{output}/mpa_latest"; then
      touch {output}
      echo "DB already installed at {output}"
    else
      metaphlan --install --bowtie2db {output} \
        2> {log} 1>&2
    fi
    """
shell:
    """
    metaphlan {input.fastq1},{input.fastq2} \
      --input_type fastq \
      --nproc {threads} {params.other} \
      --bowtie2db {input.db_path} \
      --bowtie2out {output.bt2} \
      -s {output.sam} \
      -o {output.profile} \
      2> {log} 1>&2
    """
shell:
    """
    merge_metaphlan_tables.py {input} \
      -o {output.merged_abundance_table} \
      2> {log} 1>&2
    """
shell:
    """
    sourmash sketch dna \
      -p k={params.k},scaled={params.scaled} \
      {params.extra} \
      -o {output} \
      --merge \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash compare \
      --output {output.dm} \
      --csv {output.csv} \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash plot --pdf --labels \
      --output-dir {output} \
      {input} 2> {log} 1>&2
    """
run:
    df = pd.read_csv(input[0], header=0, encoding='unicode_escape')
    df.index = df.columns

    # test file sizes
    pf_seqs = []
    for fp in df.columns:
        print(fp)
        with gzip.open(fp, 'rb') as f:
            for i, l in enumerate(f):
                pass
        seqs = (i + 1) / 4
        print(seqs)
        if params['min_seqs'] <= seqs <= params['max_seqs']:
            pf_seqs.append(fp)

    df_filt = df.loc[pf_seqs, pf_seqs]
    labels = [os.path.basename(x) for x in pf_seqs]
    dm = DistanceMatrix(1 - df_filt.values)

    print("The imported distance matrix has "
          "{} elements.".format(len(labels)))
    print("Selecting 2 to {} prototypes.\n".format(len(labels) - 1))

    proto_dict = {}
    for k in range(2, len(labels)):
        # run prototypeSelection function
        prototypes = prototype_selection_destructive_maxdist(dm, k)
        proto_dict[k] = [labels[int(x)] for x in prototypes]

    with open(output[0], 'w') as outfile:
        dump(proto_dict, outfile, default_flow_style=False)

    with open(log[0], "w") as logfile:
        logfile.write("Running the run_prototypeSelection.py script.\n"
                      "The imported distance matrix has {0} elements.\n"
                      "Selecting 2 to "
                      "{1} prototypes.\n".format(df.shape[1], len(labels) - 1))
wrapper:
    "0.72.0/bio/fastqc"
wrapper:
    "0.17.4/bio/cutadapt/pe"
wrapper:
    "0.72.0/bio/fastqc"
shell:
    "cat {input} > {output}"
shell:
    """
    bowtie2-build --threads {threads} {params.extra} \
      {input.reference} {params.indexbase} 2> {log} 1>&2
    """
shell:
    """
    # Map reads against reference genome
    bowtie2 -p {threads} -x {params.ref} \
      -1 {input.fastq1} -2 {input.fastq2} \
      --un-conc-gz {wildcards.sample}_nonhost \
      --no-unal \
      2> {log} | samtools view -bS - > {output.host}

    # rename nonhost samples
    mv {wildcards.sample}_nonhost.1 output/qc/host_filter/nonhost/{wildcards.sample}.R1.fastq.gz
    mv {wildcards.sample}_nonhost.2 output/qc/host_filter/nonhost/{wildcards.sample}.R2.fastq.gz
    """
wrapper:
    "0.72.0/bio/fastqc"
wrapper:
    "v1.7.0/bio/multiqc"
shell:
    """
    Fasta_to_Scaffolds2Bin.sh \
      -i {input.bins} \
      -e fa > {output.scaffolds2bin}
    """
shell:
    """
    Fasta_to_Scaffolds2Bin.sh \
      -i {input.bins} \
      -e fasta > {output.scaffolds2bin}
    """
shell:
    """
    Fasta_to_Scaffolds2Bin.sh \
      -i {input.bins} \
      -e fa > {output.scaffolds2bin}
    """
shell:
    """
    DAS_Tool \
      --bins {input.metabat2},{input.maxbin2},{input.concoct} \
      --contigs {input.contigs} \
      --outputbasename {params.basename} \
      --labels metabat2,maxbin2,concoct \
      --write_bins 1 \
      --write_bin_evals 1 \
      --threads {threads} \
      --search_engine {params.search_engine}
    """
run:
    sample = wildcards.contig_sample

    fasta_dir = join(dirname(input[0]), sample + '_DASTool_bins')
    output_dir = dirname(output.done)

    fasta_files = glob(join(fasta_dir, '*.fa'))

    # copy each DAS Tool bin into the output directory, prefixed with the sample name
    for file in fasta_files:
        copyfile(file, join(output_dir, sample + '_' + basename(file)))

    # create the flag file declared as output so Snakemake marks the job complete
    open(output.done, 'w').close()
shell:
    """
    sourmash sketch dna \
      -p k={params.k},scaled={params.scaled} \
      {params.extra} \
      -o {output} \
      --merge \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash compare \
      --output {output.dm} \
      --csv {output.csv} \
      {input} 2> {log} 1>&2
    """
shell:
    """
    sourmash plot --pdf --labels \
      --output-dir {output} \
      {input} 2> {log} 1>&2
    """
run:
    df = pd.read_csv(input[0], header=0, encoding='unicode_escape')
    df.index = df.columns

    # test file sizes
    pf_seqs = []
    for fp in df.columns:
        print(fp)
        with gzip.open(fp, 'rb') as f:
            for i, l in enumerate(f):
                pass
        seqs = (i + 1) / 4
        print(seqs)
        if params['min_seqs'] <= seqs <= params['max_seqs']:
            pf_seqs.append(fp)

    df_filt = df.loc[pf_seqs, pf_seqs]
    labels = [os.path.basename(x) for x in pf_seqs]
    dm = DistanceMatrix(1 - df_filt.values)

    print("The imported distance matrix has "
          "{} elements.".format(len(labels)))
    print("Selecting 2 to {} prototypes.\n".format(len(labels) - 1))

    proto_dict = {}
    for k in range(2, len(labels)):
        # run prototypeSelection function
        prototypes = prototype_selection_destructive_maxdist(dm, k)
        proto_dict[k] = [labels[int(x)] for x in prototypes]

    with open(output[0], 'w') as outfile:
        dump(proto_dict, outfile, default_flow_style=False)

    with open(log[0], "w") as logfile:
        logfile.write("Running the run_prototypeSelection.py script.\n"
                      "The imported distance matrix has {0} elements.\n"
                      "Selecting 2 to "
                      "{1} prototypes.\n".format(df.shape[1], len(labels) - 1))
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])
    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
Support
- Future updates