:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data

public public 7mo ago Version: v1.0.5 0 bookmarks

Note An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data.

metawrapfigs_final4 001

metaGEM is a Snakemake workflow that integrates an array of existing bioinformatics and metabolic modeling tools, for the purpose of predicting metabolic interactions within bacterial communities of microbiomes. From whole metagenome shotgun datasets, metagenome assembled genomes (MAGs) are reconstructed, which are then converted into genome-scale metabolic models (GEMs) for in silico simulations. Additional outputs include abundance estimates, taxonomic assignment, growth rate estimation, pangenome analysis, and eukaryotic MAG identification.

⚙️ Installation

You can start using metaGEM on your cluster with just one line of code with the mamba package manager

mamba create -n metagem -c bioconda metagem

This will create an environment called metagem and start installing dependencies. Please consult the config/README.md page for more detailed setup instructions.

🔧 Usage

Clone this repo

git clone https://github.com/franciscozorrilla/metaGEM.git && cd metaGEM/workflow

Run metaGEM without any arguments to see usage instructions:

bash metaGEM.sh
Usage: bash metaGEM.sh [-t|--task TASK] 
 [-j|--nJobs NUMBER OF JOBS] 
 [-c|--cores NUMBER OF CORES] 
 [-m|--mem GB RAM] 
 [-h|--hours MAX RUNTIME]
 [-l|--local]
 Options:
 -t, --task Specify task to complete:
 SETUP
 createFolders
 downloadToy
 organizeData
 check
 CORE WORKFLOW
 fastp 
 megahit 
 crossMapSeries
 kallistoIndex
 crossMapParallel
 kallisto2concoct 
 concoct 
 metabat
 maxbin 
 binRefine 
 binReassemble 
 extractProteinBins
 carveme
 memote
 organizeGEMs
 smetana
 extractDnaBins
 gtdbtk
 abundance
 BONUS
 grid
 prokka
 roary
 eukrep
 eukcc
 VISUALIZATION (in development)
 stats
 qfilterVis
 assemblyVis
 binningVis
 taxonomyVis
 modelVis
 interactionVis
 growthVis
 -j, --nJobs Specify number of jobs to run in parallel
 -c, --nCores Specify number of cores per job
 -m, --mem Specify memory in GB required for job
 -h, --hours Specify number of hours to allocated to job runtime
 -l, --local Run jobs on local machine for non-cluster usage

🧉 Try it now

You can set up and use metaGEM on the cloud by following along the google colab notebook.

Please note that google colab does not provide the computational resources necessary to fully run metaGEM on a real dataset. This notebook demonstrates how to set up and use metaGEM by perfoming the first steps in the workflow on a toy dataset.

💩 Tutorials

metaGEM can be used to explore your own gut microbiome sequencing data from at-home-test-kit services such as unseen bio . The following tutorial showcases the metaGEM workflow on two unseenbio samples.

For an introductory metabolic modeling tutorial, refer to the resources compiled for the EMBOMicroCom: Metabolite and species dynamics in microbial communities workshop in 2022.

For a more advanced tutorial, check out the resources we put together for the SymbNET: from metagenomics to metabolic interactions course in 2022.

🏛️ Wiki

Refer to the wiki for additional usage tips, frequently asked questions, and implementation details.

📦 Datasets

  • You can access the metaGEM-generated results for the publication here .
 🧪 Small communities of gut microbes from lab cultures 💩 Real gut microbiome samples from Swedish diabetes paper 🪴 Plant-associated soil samples from Chinese rhizobiome study 🌏 Bulk-soil samples from Australian biodiversity analysis 🌊 Ocean water samples from global TARA Oceans expeditions
  • Additionally, you can access metaGEM-generated results from a reanalysis of recently published ancient metagenomes here .

🐍 Workflow

Core

  1. Quality filter reads with fastp

  2. Assembly with megahit

  3. Draft bin sets with CONCOCT , MaxBin2 , and MetaBAT2

  4. Refine & reassemble bins with metaWRAP

  5. Taxonomic assignment with GTDB-tk

  6. Relative abundances with bwa and samtools

  7. Reconstruct & evaluate genome-scale metabolic models with CarveMe and memote

  8. Species metabolic coupling analysis with SMETANA

Bonus

  1. Growth rate estimation with GRiD , SMEG or CoPTR

  2. Pangenome analysis with roary

  3. Eukaryotic draft bins with EukRep and EukCC

🏗️ Active Development

If you want to see any new additional or alternative tools incorporated into the metaGEM workflow please raise an issue or create a pull request. Snakemake allows workflows to be very flexible, so adding new rules is as easy as filling out the following template and adding it to the Snakefile:

rule package-name:
 input:
 rules.rulename.output
 output:
 f'{config["path"]["root"]}/{config["folder"]["X"]}/{{IDs}}/output.file'
 message:
 """
 Helpful and descriptive message detailing goal of this rule/package.
 """
 shell:
 """
 # Well documented command line instructions go here
 # Load conda environment 
 set +u;source activate {config[envs][package]};set -u;
 # Run tool
 package-name -i {input} -o {output}
 """

🖇️ Publications

The metaGEM workflow has been used in the following publications:

Plastic-degrading potential across the global microbiome correlates with recent pollution trends
J Zrimec, M Kokina, S Jonasson, F Zorrilla, A Zelezniak
MBio, 2021
Competition-cooperation in the chemoautotrophic ecosystem of Movile Cave: first metagenomic approach on sediments
Chiciudean, I., Russo, G., Bogdan, D.F. et al.
Environmental Microbiome, 2022
The National Ecological Observatory Network’s soil metagenomes: assembly and basic analysis
Werbin ZR, Hackos B, Lopez-Nava J et al.
F1000Research, 2022

🍾 Please cite

metaGEM: reconstruction of genome scale metabolic models directly from metagenomes
Francisco Zorrilla, Filip Buric, Kiran R Patil, Aleksej Zelezniak
Nucleic Acids Research, 2021; gkab815, https://doi.org/10.1093/nar/gkab815

📲 Contact

Please reach out with any comments, concerns, or discussions regarding metaGEM .

Code Snippets

29
30
31
32
shell:
    """
    echo "Gathering {input} ... "
    """
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
shell:
    """
    cd {input}
    echo -e "Setting up result folders in the following work directory: $(echo {input}) \n"

    # Generate folders.txt by extracting folder names from config.yaml file
    paste config.yaml |cut -d':' -f2|tail -n +4|head -n 25|sed '/^$/d' > folders.txt # NOTE: hardcoded numbers (tail 4, head 25) for folder names, increase number as new folders are introduced.

    while read line;do 
        echo "Creating $line folder ... "
        mkdir -p $line;
    done < folders.txt

    echo -e "\nDone creating folders. \n"

    rm folders.txt
    """
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
shell:
    """
    cd {config[path][root]}/{config[folder][data]}

    # Download each link in download_toydata.txt
    echo -e "\nBegin downloading toy dataset ... "
    while read line;do 
        wget $line;
    done < {input}
    echo -e "\nDone donwloading dataset."

    # Rename downloaded files, this is only necessary for toy dataset (will cause error if used for real dataset)
    echo -ne "\nRenaming downloaded files ... "
    for file in *;do 
        mv $file ./$(echo $file|sed 's/?download=1//g'|sed 's/_/_R/g');
    done
    echo -e " done. \n"

    # Organize data into sample specific sub-folders

    echo -ne "Generating list of unique sample IDs ... "
    for file in *.gz; do 
        echo $file; 
    done | sed 's/_.*$//g' | sed 's/.fastq.gz//g' | uniq > ID_samples.txt
    echo -e " done.\n $(less ID_samples.txt|wc -l) samples identified."

    echo -ne "\nOrganizing downloaded files into sample specific sub-folders ... "
    while read line; do 
        mkdir -p $line; 
        mv $line*.gz $line; 
    done < ID_samples.txt
    echo " done."

    rm ID_samples.txt
    """
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
shell:
    """
    cd {input}

    echo -ne "\nGenerating list of unique sample IDs ... "

    # Create list of unique sample IDs
    for file in *.gz; do 
        echo $file; 
    done | sed 's/_[^_]*$//g' | sed 's/.fastq.gz//g' | uniq > ID_samples.txt

    echo -e " done.\n $(less ID_samples.txt|wc -l) samples identified.\n"

    # Create folder and move corresponding files for each sample

    echo -ne "\nOrganizing dataset into sample specific sub-folders ... "
    while read line; do 
        mkdir -p $line; 
        mv $line*.gz $line; 
    done < ID_samples.txt
    echo -e " done. \n"

    rm ID_samples.txt
    """
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
shell:
    """
    # Activate metagem environment
    echo -e "Activating {config[envs][metagem]} conda environment ... "
    set +u;source activate {config[envs][metagem]};set -u;

    # This is just to make sure that output folder exists
    mkdir -p $(dirname {output.R1})

    # Make job specific scratch dir
    idvar=$(echo $(basename $(dirname {output.R1}))|sed 's/_R1.fastq.gz//g')
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}}

    # Copy files
    echo -e "Copying {input.R1} and {input.R2} to {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}} ... "
    cp {input.R1} {input.R2} .

    echo -e "Appending .raw to temporary input files to avoid name conflict ... "
    for file in *.gz; do mv -- "$file" "${{file}}.raw.gz"; done

    # Run fastp
    echo -n "Running fastp ... "
    fastp --thread {config[cores][fastp]} \
        -i *R1*raw.gz \
        -I *R2*raw.gz \
        -o $(basename {output.R1}) \
        -O $(basename {output.R2}) \
        -j $(dirname {output.R1})/$(echo $(basename $(dirname {output.R1}))).json \
        -h $(dirname {output.R1})/$(echo $(basename $(dirname {output.R1}))).html

    # Move output files to root dir
    echo -e "Moving output files $(basename {output.R1}) and $(basename {output.R2}) to $(dirname {output.R1})"
    mv $(basename {output.R1}) $(basename {output.R2}) $(dirname {output.R1})

    # Warning
    echo -e "Note that you must manually clean up these temporary directories if your scratch directory points to a static location instead of variable with a job specific location ... "

    # Done message
    echo -e "Done quality filtering sample ${{idvar}}"
    """
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
shell:
    """
    # Activate metagem env
    set +u;source activate {config[envs][metagem]};set -u;

    # Make sure stats folder exists
    mkdir -p $(dirname {output.text})

    # Move into qfiltered folder
    cd {input}

    # Read and summarize files
    echo -e "\nGenerating quality filtering results file qfilter.stats: ... "
    for folder in */;do
        for file in $folder*json;do

            # Define sample
            ID=$(echo $file|sed 's|/.*$||g')

            # Reads before filtering
            readsBF=$(head -n 25 $file|grep total_reads|cut -d ':' -f2|sed 's/,//g'|head -n 1)

            # Reads after filtering
            readsAF=$(head -n 25 $file|grep total_reads|cut -d ':' -f2|sed 's/,//g'|tail -n 1)

            # Bases before filtering
            basesBF=$(head -n 25 $file|grep total_bases|cut -d ':' -f2|sed 's/,//g'|head -n 1)

            # Bases after filtering
            basesAF=$(head -n 25 $file|grep total_bases|cut -d ':' -f2|sed 's/,//g'|tail -n 1)

            # Q20 bases before filtering
            q20BF=$(head -n 25 $file|grep q20_rate|cut -d ':' -f2|sed 's/,//g'|head -n 1)

            # Q20 bases after filtering
            q20AF=$(head -n 25 $file|grep q20_rate|cut -d ':' -f2|sed 's/,//g'|tail -n 1)

            # Q30 bases before filtering
            q30BF=$(head -n 25 $file|grep q30_rate|cut -d ':' -f2|sed 's/,//g'|head -n 1)

            # Q30 bases after filtering
            q30AF=$(head -n 25 $file|grep q30_rate|cut -d ':' -f2|sed 's/,//g'|tail -n 1)

            # Percentage of reads kept after filtering
            percent=$(awk -v RBF="$readsBF" -v RAF="$readsAF" 'BEGIN{{print RAF/RBF}}' )

            # Write values to qfilter.stats file
            echo "$ID $readsBF $readsAF $basesBF $basesAF $percent $q20BF $q20AF $q30BF $q30AF" >> qfilter.stats

            # Print values
            echo "Sample $ID retained $percent * 100 % of reads ... "
        done
    done

    echo "Done summarizing quality filtering results ... \nMoving to /stats/ folder and running plotting script ... "
    mv qfilter.stats {config[path][root]}/{config[folder][stats]}

    # Move to stats folder
    cd {config[path][root]}/{config[folder][stats]}

    # Run script for quality filter visualization
    Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][qfilterVis]}
    echo "Done. "

    # Remove duplicate/extra plot
    rm Rplots.pdf
    """
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Make sure that output folder exists
    mkdir -p $(dirname {output})

    # Make job specific scratch dir
    idvar=$(echo $(basename $(dirname {output})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][assemblies]}/${{idvar}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][assemblies]}/${{idvar}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][assemblies]}/${{idvar}}

    # Copy files
    echo -n "Copying qfiltered reads to {config[path][scratch]}/${{idvar}} ... "
    cp {input.R1} {input.R2} .
    echo "done. "

    # Run megahit
    echo -n "Running MEGAHIT ... "
    megahit -t {config[cores][megahit]} \
        --presets {config[params][assemblyPreset]} \
        --verbose \
        --min-contig-len {config[params][assemblyMin]} \
        -1 $(basename {input.R1}) \
        -2 $(basename {input.R2}) \
        -o tmp;
    echo "done. "

    # Rename assembly
    echo "Renaming assembly ... "
    mv tmp/final.contigs.fa contigs.fasta

    # Remove spaces from contig headers and replace with hyphens
    echo "Fixing contig header names: replacing spaces with hyphens ... "
    sed -i 's/ /-/g' contigs.fasta

    # Zip and move assembly to output folder
    echo "Zipping and moving assembly ... "
    gzip contigs.fasta
    mv contigs.fasta.gz $(dirname {output})

    # Done message
    echo -e "Done assembling quality filtered reads for sample ${{idvar}}"
    """
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
shell:
    """
    # Activate metagem env
    set +uo pipefail;source activate {config[envs][metagem]};set -u;

    # Make sure stats folder exists
    mkdir -p $(dirname {output.text})

    # Move into assembly folder
    cd {input}

    echo -e "\nGenerating assembly results file assembly.stats: ... "
    while read assembly;do

        # Define sample ID
        ID=$(echo $(basename $(dirname $assembly)))

        # Check if assembly file is empty
        check=$(zcat $assembly | head | wc -l)
        if [ $check -eq 0 ]
        then
            N=0
            L=0
        else
            N=$(zcat $assembly | grep -c ">");
            L=$(zcat $assembly | grep ">"|cut -d '-' -f4|sed 's/len=//'|awk '{{sum+=$1}}END{{print sum}}');
        fi

        # Write values to stats file
        echo $ID $N $L >> assembly.stats;

        # Print values to terminal
        echo -e "Sample $ID has a total of $L bp across $N contigs ... "
    done< <(find {input} -name "*.gz")

    echo "Done summarizing assembly results ... \nMoving to /stats/ folder and running plotting script ... "
    mv assembly.stats {config[path][root]}/{config[folder][stats]}

    # Move to stats folder
    cd {config[path][root]}/{config[folder][stats]}

    # Running assembly Vis R script
    Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][assemblyVis]}
    echo "Done. "

    # Remove unnecessary file
    rm Rplots.pdf
    """
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Create output folders
    mkdir -p {output.concoct}
    mkdir -p {output.metabat}
    mkdir -p {output.maxbin}

    # Make job specific scratch dir
    idvar=$(echo $(basename $(dirname {output.concoct})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][crossMap]}/${{idvar}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][crossMap]}/${{idvar}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][crossMap]}/${{idvar}}

    # Copy files
    cp {input.contigs} .

    # Define the focal sample ID, fsample: 
    # The one sample's assembly that all other samples' read will be mapped against in a for loop
    fsampleID=$(echo $(basename $(dirname {input.contigs})))
    echo -e "\nFocal sample: $fsampleID ... "

    echo "Renaming and unzipping assembly ... "
    mv $(basename {input.contigs}) $(echo $fsampleID|sed 's/$/.fa.gz/g')
    gunzip $(echo $fsampleID|sed 's/$/.fa.gz/g')

    echo -e "\nIndexing assembly ... "
    bwa index $fsampleID.fa

    for folder in {input.reads}/*;do 

            id=$(basename $folder)

            echo -e "\nCopying sample $id to be mapped against the focal sample $fsampleID ..."
            cp $folder/*.gz .

            # Maybe I should be piping the lines below to reduce I/O ?

            echo -e "\nMapping sample to assembly ... "
            bwa mem -t {config[cores][crossMap]} $fsampleID.fa *.fastq.gz > $id.sam

            echo -e "\nConverting SAM to BAM with samtools view ... " 
            samtools view -@ {config[cores][crossMap]} -Sb $id.sam > $id.bam

            echo -e "\nSorting BAM file with samtools sort ... " 
            samtools sort -@ {config[cores][crossMap]} -o $id.sort $id.bam

            echo -e "\nRunning jgi_summarize_bam_contig_depths script to generate contig abundance/depth file for maxbin2 input ... "
            jgi_summarize_bam_contig_depths --outputDepth $id.depth $id.sort

            echo -e "\nMoving depth file to sample $fsampleID maxbin2 folder ... "
            mv $id.depth {output.maxbin}

            echo -e "\nIndexing sorted BAM file with samtools index for CONCOCT input table generation ... " 
            samtools index $id.sort

            echo -e "\nRemoving temporary files ... "
            rm *.fastq.gz *.sam *.bam

    done

    nSamples=$(ls {input.reads}|wc -l)
    echo -e "\nDone mapping focal sample $fsampleID agains $nSamples samples in dataset folder."

    echo -e "\nRunning jgi_summarize_bam_contig_depths for all sorted bam files to generate metabat2 input ... "
    jgi_summarize_bam_contig_depths --outputDepth $id.all.depth *.sort

    echo -e "\nMoving input file $id.all.depth to $fsampleID metabat2 folder... "
    mv $id.all.depth {output.metabat}

    echo -e "Done. \nCutting up contigs to 10kbp chunks (default), not to be used for mapping!"
    cut_up_fasta.py -c {config[params][cutfasta]} -o 0 -m $fsampleID.fa -b assembly_c10k.bed > assembly_c10k.fa

    echo -e "\nSummarizing sorted and indexed BAM files with concoct_coverage_table.py to generate CONCOCT input table ... " 
    concoct_coverage_table.py assembly_c10k.bed *.sort > coverage_table.tsv

    echo -e "\nMoving CONCOCT input table to $fsampleID concoct folder"
    mv coverage_table.tsv {output.concoct}

    echo -e "\nRemoving intermediate sorted bam files ... "
    rm *.sort
    """
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Create output folder
    mkdir -p $(dirname {output})

    # Make job specific scratch dir
    sampleID=$(echo $(basename $(dirname {input})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][kallistoIndex]}/${{sampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][kallistoIndex]}/${{sampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][kallistoIndex]}/${{sampleID}}

    # Copy files
    echo -e "\nCopying and unzipping sample $sampleID assembly ... "
    cp {input} .

    # Rename files
    mv $(basename {input}) $(echo $sampleID|sed 's/$/.fa.gz/g')
    gunzip $(echo $sampleID|sed 's/$/.fa.gz/g')

    echo -e "\nCutting up assembly contigs >= 20kbp into 10kbp chunks ... "
    cut_up_fasta.py $sampleID.fa -c 10000 -o 0 --merge_last > contigs_10K.fa

    echo -e "\nCreating kallisto index ... "
    kallisto index contigs_10K.fa -i index.kaix

    mv index.kaix $(dirname {output})
    """
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Create output folder
    mkdir -p {output}

    # Make job specific scratch dir
    focal=$(echo $(basename $(dirname {input.index})))
    mapping=$(echo $(basename $(dirname {input.R1})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}}

    # Move into tmp dir
    cd {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}}

    # Copy files
    echo -e "\nCopying assembly index {input.index} and reads {input.R1} {input.R2} to $(pwd) ... "
    cp {input.index} {input.R1} {input.R2} .

    # Run kallisto
    echo -e "\nRunning kallisto ... "
    kallisto quant --threads {config[cores][crossMap]} --plaintext -i index.kaix -o . $(basename {input.R1}) $(basename {input.R2})

    # Zip file
    echo -e "\nZipping abundance file ... "
    gzip abundance.tsv

    # Move mapping file out output folder
    mv abundance.tsv.gz {output}

    # Cleanup temp folder
    echo -e "\nRemoving temporary directory {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}} ... "
    cd -
    rm -r {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}}
    """
596
597
598
599
shell:
    """
    echo "Gathering cross map jobs ..." 
    """
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Create output folder
    mkdir -p $(dirname {output})

    # Make job specific scratch dir
    sampleID=$(echo $(basename $(dirname {input.contigs})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][concoct]}/${{sampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][concoct]}/${{sampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][concoct]}/${{sampleID}}

    # Copy files
    cp {input.contigs} {input.table} .

    echo "Unzipping assembly ... "
    gunzip $(basename {input.contigs})

    echo -e "Done. \nCutting up contigs (default 10kbp chunks) ... "
    cut_up_fasta.py -c {config[params][cutfasta]} -o 0 -m $(echo $(basename {input.contigs})|sed 's/.gz//') > assembly_c10k.fa

    echo -e "\nRunning CONCOCT ... "
    concoct --coverage_file $(basename {input.table}) \
        --composition_file assembly_c10k.fa \
        -b $(basename $(dirname {output})) \
        -t {config[cores][concoct]} \
        -c {config[params][concoct]}

    echo -e "\nMerging clustering results into original contigs ... "
    merge_cutup_clustering.py $(basename $(dirname {output}))_clustering_gt1000.csv > $(basename $(dirname {output}))_clustering_merged.csv

    echo -e "\nExtracting bins ... "
    mkdir -p $(basename {output})
    extract_fasta_bins.py $(echo $(basename {input.contigs})|sed 's/.gz//') $(basename $(dirname {output}))_clustering_merged.csv --output_path $(basename {output})

    # Move final result files to output folder
    mv $(basename {output}) *.txt *.csv $(dirname {output})
    """
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Create output folder
    mkdir -p {output}

    # Make job specific scratch dir
    fsampleID=$(echo $(basename $(dirname {input.assembly})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}}

    # Copy files to tmp
    cp {input.assembly} {input.depth}/*.all.depth .

    # Unzip assembly
    gunzip $(basename {input.assembly})

    # Run metabat2
    echo -e "\nRunning metabat2 ... "
    metabat2 -i contigs.fasta -a *.all.depth -s {config[params][metabatMin]} -v --seed {config[params][seed]} -t 0 -m {config[params][minBin]} -o $(basename $(dirname {output}))

    # Move result files to output dir
    mv *.fa {output}
    """
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Create output folder
    mkdir -p $(dirname {output})

    # Make job specific scratch dir
    fsampleID=$(echo $(basename $(dirname {input.assembly})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}}

    # Copy files to tmp
    cp -r {input.assembly} {input.depth}/*.depth .

    echo -e "\nUnzipping assembly ... "
    gunzip $(basename {input.assembly})

    echo -e "\nGenerating list of depth files based on crossMapSeries rule output ... "
    find . -name "*.depth" > abund.list

    echo -e "\nRunning maxbin2 ... "
    run_MaxBin.pl -thread {config[cores][maxbin]} -contig contigs.fasta -out $(basename $(dirname {output})) -abund_list abund.list

    # Clean up un-needed files
    rm *.depth abund.list contigs.fasta

    # Move files into output dir
    mkdir -p $(basename {output})
    while read bin;do mv $bin $(basename {output});done< <(ls|grep fasta)
    mv * $(dirname {output})
    """
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
shell:
    """
    # Activate metawrap environment
    set +u;source activate {config[envs][metawrap]};set -u;

    # Create output folder
    mkdir -p {output}

    # Make job specific scratch dir
    fsampleID=$(echo $(basename $(dirname {input.concoct})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][refined]}/${{fsampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][refined]}/${{fsampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][refined]}/${{fsampleID}}

    # Copy files to tmp
    echo "Copying bins from CONCOCT, metabat2, and maxbin2 to {config[path][scratch]} ... "
    cp -r {input.concoct} {input.metabat} {input.maxbin} .

    echo "Renaming bin folders to avoid errors with metaWRAP ... "
    mv $(basename {input.concoct}) $(echo $(basename {input.concoct})|sed 's/-bins//g')
    mv $(basename {input.metabat}) $(echo $(basename {input.metabat})|sed 's/-bins//g')
    mv $(basename {input.maxbin}) $(echo $(basename {input.maxbin})|sed 's/-bins//g')

    echo "Running metaWRAP bin refinement module ... "
    metaWRAP bin_refinement -o . \
        -A $(echo $(basename {input.concoct})|sed 's/-bins//g') \
        -B $(echo $(basename {input.metabat})|sed 's/-bins//g') \
        -C $(echo $(basename {input.maxbin})|sed 's/-bins//g') \
        -t {config[cores][refine]} \
        -m {config[params][refineMem]} \
        -c {config[params][refineComp]} \
        -x {config[params][refineCont]}

    rm -r $(echo $(basename {input.concoct})|sed 's/-bins//g') $(echo $(basename {input.metabat})|sed 's/-bins//g') $(echo $(basename {input.maxbin})|sed 's/-bins//g') work_files
    mv * {output}
    """
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
shell:
    """
    # Activate metawrap environment
    set +u;source activate {config[envs][metawrap]};set -u;

    # Prevents spades from using just one thread
    export OMP_NUM_THREADS={config[cores][reassemble]}

    # Create output folder
    mkdir -p {output}

    # Make job specific scratch dir
    fsampleID=$(echo $(basename $(dirname {input.R1})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][reassembled]}/${{fsampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][reassembled]}/${{fsampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][reassembled]}/${{fsampleID}}

    # Copy files to tmp   
    cp -r {input.refinedBins}/metawrap_*_bins {input.R1} {input.R2} .

    echo "Running metaWRAP bin reassembly ... "
    metaWRAP reassemble_bins --parallel -o $(basename {output}) \
        -b metawrap_*_bins \
        -1 $(basename {input.R1}) \
        -2 $(basename {input.R2}) \
        -t {config[cores][reassemble]} \
        -m {config[params][reassembleMem]} \
        -c {config[params][reassembleComp]} \
        -x {config[params][reassembleCont]}

    # Cleaning up files
    rm -r metawrap_*_bins
    rm -r $(basename {output})/work_files
    rm *.fastq.gz 

    # Move results to output folder
    mv * $(dirname {output})
    """
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
shell:
    """
    # Activate metagem env
    set +u;source activate {config[envs][metagem]};set -u;

    # Read CONCOCT bins
    echo "Generating concoct_bins.stats file containing bin ID, number of contigs, and length ... "
    cd {input}/{config[folder][concoct]}
    for folder in */;do 

        # Define sample name
        var=$(echo $folder|sed 's|/||g');
        for bin in $folder*concoct-bins/*.fa;do 

            # Define bin name
            name=$(echo $bin | sed "s|^.*/|$var.bin.|g" | sed 's/.fa//g');

            # Count contigs
            N=$(less $bin | grep -c ">");

            # Sum length
            L=$(less $bin |grep ">"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')

            # Print values to terminal and write to stats file
            echo "Reading bin $bin ... Contigs: $N , Length: $L "
            echo $name $N $L >> concoct_bins.stats;
        done;
    done
    mv *.stats {input}/{config[folder][reassembled]}

    # Read MetaBAT2 bins
    echo "Generating metabat_bins.stats file containing bin ID, number of contigs, and length ... "
    cd {input}/{config[folder][metabat]}
    for folder in */;do 

        # Define sample name
        var=$(echo $folder | sed 's|/||');
        for bin in $folder*metabat-bins/*.fa;do 

            # Define bin name
            name=$(echo $bin|sed 's/.fa//g'|sed 's|^.*/||g'|sed "s/^/$var./g"); 

            # Count contigs
            N=$(less $bin | grep -c ">");

            # Sum length
            L=$(less $bin |grep ">"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')

            # Print values to terminal and write to stats file
            echo "Reading bin $bin ... Contigs: $N , Length: $L "
            echo $name $N $L >> metabat_bins.stats;
        done;
    done
    mv *.stats {input}/{config[folder][reassembled]}

    # Read MaxBin2 bins
    echo "Generating maxbin_bins.stats file containing bin ID, number of contigs, and length ... "
    cd {input}/{config[folder][maxbin]}
    for folder in */;do
        for bin in $folder*maxbin-bins/*.fasta;do 

            # Define bin name
            name=$(echo $bin | sed 's/.fasta//g' | sed 's|^.*/||g');

            # Count contigs
            N=$(less $bin | grep -c ">");

            # Sum length
            L=$(less $bin |grep ">"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')

            # Print values to terminal and write to stats file
            echo "Reading bin $bin ... Contigs: $N , Length: $L "
            echo $name $N $L >> maxbin_bins.stats;
        done;
    done
    mv *.stats {input}/{config[folder][reassembled]}

    # Read metaWRAP refined bins
    echo "Generating refined_bins.stats file containing bin ID, number of contigs, and length ... "
    cd {input}/{config[folder][refined]}
    for folder in */;do 

        # Define sample name 
        samp=$(echo $folder | sed 's|/||');
        for bin in $folder*metawrap_*_bins/*.fa;do 

            # Define bin name
            name=$(echo $bin | sed 's/.fa//g'|sed 's|^.*/||g'|sed "s/^/$samp./g");

            # Count contigs
            N=$(less $bin | grep -c ">");

            # Sum length
            L=$(less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}')

            # Print values to terminal and write to stats file
            echo "Reading bin $bin ... Contigs: $N , Length: $L "
            echo $name $N $L >> refined_bins.stats;
        done;
    done

    # Compile CONCOCT, MetaBAT2, MaxBin2, and metaWRAP checkM files
    echo "Generating CheckM summary files across samples: concoct.checkm, metabat.checkm, maxbin.checkm, and refined.checkm ... "
    for folder in */;do 

        # Define sample name
        var=$(echo $folder|sed 's|/||g');

        # Write values to checkm files
        paste $folder*concoct.stats|tail -n +2 | sed "s/^/$var.bin./g" >> concoct.checkm
        paste $folder*metabat.stats|tail -n +2 | sed "s/^/$var./g" >> metabat.checkm
        paste $folder*maxbin.stats|tail -n +2 >> maxbin.checkm
        paste $folder*metawrap_*_bins.stats|tail -n +2|sed "s/^/$var./g" >> refined.checkm
    done 
    mv *.stats *.checkm {input}/{config[folder][reassembled]}

    # Read metaWRAP reassembled bins
    echo "Generating reassembled_bins.stats file containing bin ID, number of contigs, and length ... "
    cd {input}/{config[folder][reassembled]}
    for folder in */;do 

        # Define sample name 
        samp=$(echo $folder | sed 's|/||');
        for bin in $folder*reassembled_bins/*.fa;do 

            # Define bin name
            name=$(echo $bin | sed 's/.fa//g' | sed 's|^.*/||g' | sed "s/^/$samp./g");
            N=$(less $bin | grep -c ">");

            # Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)
            if [[ $name == *.strict ]] || [[ $name == *.permissive ]];then
                L=$(less $bin |grep ">"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}')
            else
                L=$(less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}')
            fi

            # Print values to terminal and write to stats file
            echo "Reading bin $bin ... Contigs: $N , Length: $L "
            echo $name $N $L >> reassembled_bins.stats;
        done;
    done
    echo "Done reading metawrap reassembled bins ... "

    # Read metaWRAP reassembled checkM file
    echo "Generating CheckM summary file reassembled.checkm across samples for reassembled bins ... "
    for folder in */;do 

        # Define sample name
        var=$(echo $folder|sed 's|/||g');

        # Write values to checkM file
        paste $folder*reassembled_bins.stats|tail -n +2|sed "s/^/$var./g";
    done >> reassembled.checkm
    echo "Done generating all statistics files for binning results ... running plotting script ... "

    # Move files and cd to stats folder
    mv *.stats *.checkm {config[path][root]}/{config[folder][stats]}
    cd {config[path][root]}/{config[folder][stats]}

    # Run Rscript
    Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][binningVis]}

    # Delete redundant pdf file
    rm Rplots.pdf
    """
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
shell:
    """     
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Make sure output folder exists
    mkdir -p {output}

    # Make job specific scratch dir
    sampleID=$(echo $(basename $(dirname {input.R1})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][abundance]}/${{sampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][abundance]}/${{sampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][abundance]}/${{sampleID}}

    # Copy files
    echo -e "\nCopying quality filtered paired end reads and generated MAGs to {config[path][scratch]} ... "
    cp {input.R1} {input.R2} {input.bins}/* .

    echo -e "\nConcatenating all bins into one FASTA file ... "
    cat *.fa > $(basename {output}).fa

    echo -e "\nCreating bwa index for concatenated FASTA file ... "
    bwa index $(basename {output}).fa

    echo -e "\nMapping quality filtered paired end reads to concatenated FASTA file with bwa mem ... "
    bwa mem -t {config[cores][abundance]} $(basename {output}).fa \
        $(basename {input.R1}) $(basename {input.R2}) > $(basename {output}).sam

    echo -e "\nConverting SAM to BAM with samtools view ... "
    samtools view -@ {config[cores][abundance]} -Sb $(basename {output}).sam > $(basename {output}).bam

    echo -e "\nSorting BAM file with samtools sort ... "
    samtools sort -@ {config[cores][abundance]} -o $(basename {output}).sort.bam $(basename {output}).bam

    echo -e "\nExtracting stats from sorted BAM file with samtools flagstat ... "
    samtools flagstat $(basename {output}).sort.bam > map.stats

    echo -e "\nCopying sample_map.stats file to root/abundance/sample for bin concatenation and deleting temporary FASTA file ... "
    cp map.stats {output}/$(basename {output})_map.stats
    rm $(basename {output}).fa

    echo -e "\nRepeat procedure for each bin ... "
    for bin in *.fa;do

        echo -e "\nSetting up temporary sub-directory to map against bin $bin ... "
        mkdir -p $(echo "$bin"| sed "s/.fa//")

        # Move bin into subirectory
        mv $bin $(echo "$bin"| sed "s/.fa//")
        cd $(echo "$bin"| sed "s/.fa//")

        echo -e "\nCreating bwa index for bin $bin ... "
        bwa index $bin

        echo -e "\nMapping quality filtered paired end reads to bin $bin with bwa mem ... "
        bwa mem -t {config[cores][abundance]} $bin \
            ../$(basename {input.R1}) ../$(basename {input.R2}) > $(echo "$bin"|sed "s/.fa/.sam/")

        echo -e "\nConverting SAM to BAM with samtools view ... "
        samtools view -@ {config[cores][abundance]} -Sb $(echo "$bin"|sed "s/.fa/.sam/") > $(echo "$bin"|sed "s/.fa/.bam/")

        echo -e "\nSorting BAM file with samtools sort ... "
        samtools sort -@ {config[cores][abundance]} -o $(echo "$bin"|sed "s/.fa/.sort.bam/") $(echo "$bin"|sed "s/.fa/.bam/")

        echo -e "\nExtracting stats from sorted BAM file with samtools flagstat ... "
        samtools flagstat $(echo "$bin"|sed "s/.fa/.sort.bam/") > $(echo "$bin"|sed "s/.fa/.map/")

        echo -e "\nAppending bin length to bin.map stats file ... "
        echo -n "Bin Length = " >> $(echo "$bin"|sed "s/.fa/.map/")

        # Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)
        if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]] || [[ $bin == *.s.fa ]] || [[ $bin == *.p.fa ]];then
            less $bin |grep ">"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
        else
            less $bin |grep ">"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo "$bin"|sed "s/.fa/.map/")
        fi

        paste $(echo "$bin"|sed "s/.fa/.map/")

        echo -e "\nCalculating abundance for bin $bin ... "
        echo -n "$bin"|sed "s/.fa//" >> $(echo "$bin"|sed "s/.fa/.abund/")
        echo -n $'\t' >> $(echo "$bin"|sed "s/.fa/.abund/")

        X=$(less $(echo "$bin"|sed "s/.fa/.map/")|grep "mapped ("|awk -F' ' '{{print $1}}')
        Y=$(less $(echo "$bin"|sed "s/.fa/.map/")|tail -n 1|awk -F' ' '{{print $4}}')
        Z=$(less "../map.stats"|grep "mapped ("|awk -F' ' '{{print $1}}')
        awk -v x="$X" -v y="$Y" -v z="$Z" 'BEGIN{{print (x/y/z) * 1000000}}' >> $(echo "$bin"|sed "s/.fa/.abund/")

        paste $(echo "$bin"|sed "s/.fa/.abund/")

        echo -e "\nRemoving temporary files for bin $bin ... "
        rm $bin
        cp $(echo "$bin"|sed "s/.fa/.map/") {output}
        mv $(echo "$bin"|sed "s/.fa/.abund/") ../
        cd ..
        rm -r $(echo "$bin"| sed "s/.fa//")
    done

    echo -e "\nDone processing all bins, summarizing results into sample.abund file ... "
    cat *.abund > $(basename {output}).abund

    echo -ne "\nSumming calculated abundances to obtain normalization value ... "
    norm=$(less $(basename {output}).abund |awk '{{sum+=$2}}END{{print sum}}');
    echo $norm

    echo -e "\nGenerating column with abundances normalized between 0 and 1 ... "
    awk -v NORM="$norm" '{{printf $1"\t"$2"\t"$2/NORM"\\n"}}' $(basename {output}).abund > abundance.txt

    rm $(basename {output}).abund
    mv abundance.txt $(basename {output}).abund

    mv $(basename {output}).abund {output}
    """
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Make sure output folder exists
    mkdir -p {output}

    # Make job specific scratch dir
    sampleID=$(echo $(basename $(dirname {input})))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][classification]}/${{sampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][classification]}/${{sampleID}}

    # Move into scratch dir
    cd {config[path][scratch]}/{config[folder][classification]}/${{sampleID}}

    # Copy files
    echo -e "\nCopying files to tmp dir ... "
    cp -r {input} .

    # In case you GTDBTk is not properly configured you may need to export the GTDBTK_DATA_PATH variable,
    # Simply uncomment the following line and fill in the path to your GTDBTk database:
    # export GTDBTK_DATA_PATH=/path/to/the/gtdbtk/database/you/downloaded

    # Run GTDBTk
    gtdbtk classify_wf --genome_dir $(basename {input}) --out_dir GTDBTk -x fa --cpus {config[cores][gtdbtk]}

    mv GTDBTk/* {output}
    """
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
shell:
    """
    set +u;source activate {config[envs][metagem]};set -u

    # Generate summary abundance file

    cd {input.abundance}
    for folder in */;do
        # Define sample ID
        sample=$(echo $folder|sed 's|/||g')
        # Same as in taxonomyVis rule, modify bin names by adding sample ID and shortening metaWRAP naming scheme (orig/permissive/strict)
        paste $sample/$sample.abund | sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed "s/^/$sample./g" >> abundance.stats
    done
    mv abundance.stats {config[path][root]}/{config[folder][stats]}

    # Generate summary taxonomy file

    cd {input.taxonomy}
    # Summarize GTDBTk output across samples
    for folder in */;do 
        samp=$(echo $folder|sed 's|^.*/||');
        cat $folder/classify/*summary.tsv| sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed "s/^/$samp./g";
    done > GTDBTk.stats
    # Clean up stats file
    header=$(head -n 1 GTDBTk.stats | sed 's/^.*\.//g')
    sed -i '/other_related_references(genome_id,species_name,radius,ANI,AF)/d' GTDBTk.stats
    sed -i "1i$header" GTDBTk.stats
    mv GTDBTk.stats {config[path][root]}/{config[folder][stats]}

    cd {config[path][root]}/{config[folder][stats]}
    Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][compositionVis]}
    """
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
shell:
    """
    # Move to root directory
    cd {config[path][root]}

    # Make sure protein bins folder exists
    mkdir -p {config[folder][proteinBins]}

    echo -e "Begin moving and renaming ORF annotated protein fasta bins from reassembled_bins/ to protein_bins/ ... \n"
    for folder in reassembled_bins/*/;do 
        #Loop through each sample
        echo "Copying bins from sample $(echo $(basename $folder)) ... "
        for bin in $folder*reassembled_bins.checkm/bins/*;do 
            # Loop through each bin
            var=$(echo $bin/genes.faa | sed 's|reassembled_bins/||g'|sed 's|/reassembled_bins.checkm/bins||'|sed 's|/genes||g'|sed 's|/|_|g'|sed 's/permissive/p/g'|sed 's/orig/o/g'|sed 's/strict/s/g');
            cp $bin/*.faa {config[path][root]}/{config[folder][proteinBins]}/$var;
        done;
    done
    """
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
shell:
    """
    # Activate metagem environment
    set +u;source activate {config[envs][metagem]};set -u;

    # Make sure output folder exists
    mkdir -p $(dirname {output})

    # Make job specific scratch dir
    binID=$(echo $(basename {input})|sed 's/.faa//g')
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][GEMs]}/${{binID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][GEMs]}/${{binID}}

    # Move into tmp dir
    cd {config[path][scratch]}/{config[folder][GEMs]}/${{binID}}

    # Copy files
    cp {input.bin} {input.media} .

    echo "Begin carving GEM ... "
    carve -g {config[params][carveMedia]} \
        -v \
        --mediadb $(basename {input.media}) \
        --fbc2 \
        -o $(echo $(basename {input.bin}) | sed 's/.faa/.xml/g') $(basename {input.bin})

    echo "Done carving GEM. "
    [ -f *.xml ] && mv *.xml $(dirname {output})
    """
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
shell:
    """
    set +u;source activate {config[envs][metagem]};set -u;
    cd {input}

    echo -e "\nBegin reading models ... \n"
    while read model;do 
        id=$(echo $(basename $model)|sed 's/.xml//g'); 
        mets=$(less $model| grep "species id="|cut -d ' ' -f 8|sed 's/..$//g'|sort|uniq|wc -l);
        rxns=$(less $model|grep -c 'reaction id=');
        genes=$(less $model|grep 'fbc:geneProduct fbc:id='|grep -vic spontaneous);
        echo "Model: $id has $mets mets, $rxns reactions, and $genes genes ... "
        echo "$id $mets $rxns $genes" >> GEMs.stats;
    done< <(find . -name "*.xml")

    echo -e "\nDone generating GEMs.stats summary file, moving to stats/ folder and running modelVis.R script ... "
    mv GEMs.stats {config[path][root]}/{config[folder][stats]}
    cd {config[path][root]}/{config[folder][stats]}

    Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][modelVis]}
    rm Rplots.pdf # Delete redundant pdf file
    echo "Done. "
    """
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
shell:
    """
    echo -e "\nCopying GEMs from specified input directory to {config[path][scratch]} ... "
    cp -r {input} {config[path][scratch]}

    cd {config[path][scratch]}
    mkdir ecfiles

    while read model; do

        # Read E.C. numbers from  each sbml file and write to a unique file, note that grep expression is hardcoded for specific GEM batches           
        less $(basename {input})/$model|
            grep 'EC Number'| \
            sed 's/^.*: //g'| \
            sed 's/<.*$//g'| \
            sed '/-/d'|sed '/N\/A/d' | \
            sort|uniq -c \
        > ecfiles/$model.ec

        echo -ne "Reading E.C. numbers in model $model, unique E.C. : "
        ECNUM=$(less ecfiles/$model.ec|wc -l)
        echo $ECNUM

    done< <(ls $(basename {input}))

    echo -e "\nMoving ecfiles folder back to {config[path][root]}"
    mv ecfiles {config[path][root]}
    cd {config[path][root]}

    echo -e "\nCreating sorted unique file EC.summary for easy EC inspection ... "
    cat ecfiles/*.ec|awk '{{print $NF}}'|sort|uniq -c > EC.summary

    paste EC.summary

    """
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
shell:
    """
    cd {input}
    for folder in */;do
        echo -n "Creating GEM subfolder for sample $folder ... "
        mkdir -p ../{config[folder][GEMs]}/$folder;
        echo -n "moving GEMs ... "
        mv ../{config[folder][GEMs]}/$(echo $folder|sed 's|/||')_*.xml ../{config[folder][GEMs]}/$folder;
        echo "done. "
    done
    """
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
shell:
    """
    # Activate metagem env
    set +u;source activate {config[envs][metagem]};set -u

    # Make sure output folder exists
    mkdir -p $(dirname {output})

    # Make job specific scratch dir
    sampleID=$(echo $(basename {input}))
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][SMETANA]}/${{sampleID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][SMETANA]}/${{sampleID}}

    # Move to tmp dir
    cd {config[path][scratch]}/{config[folder][SMETANA]}/${{sampleID}}

    # Copy media db and GEMs
    cp {config[path][root]}/{config[folder][scripts]}/{config[scripts][carveme]} {input}/*.xml .

    # Run SMETANA        
    smetana -o $(basename {input}) --flavor fbc2 \
        --mediadb media_db.tsv -m {config[params][smetanaMedia]} \
        --detailed \
        --solver {config[params][smetanaSolver]} -v *.xml

    # Copy results to output folder
    cp *.tsv $(dirname {output})
    """
1462
1463
1464
1465
1466
1467
1468
1469
1470
shell:
    """
    cd {input}
    mv media_db.tsv ../scripts/
    cat *.tsv|sed '/community/d' > smetana.all
    less smetana.all |cut -f2|sort|uniq > media.txt
    ll|grep tsv|awk '{print $NF}'|sed 's/_.*$//g'>samples.txt
    while read sample;do echo -n "$sample ";while read media;do var=$(less smetana.all|grep $sample|grep -c $media); echo -n "$var " ;done < media.txt; echo "";done < samples.txt > sampleMedia.stats
    """
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
shell:
    """
    # Activate metagem env
    set +u;source activate {config[envs][metagem]};set -u

    # Make sure output folder exists
    mkdir -p {output}

    # Make job specific scratch dir
    gemID=$(echo $(basename {input})|sed 's/.xml//g')
    echo -e "\nCreating temporary directory {config[path][scratch]}/{config[folder][memote]}/${{gemID}} ... "
    mkdir -p {config[path][scratch]}/{config[folder][memote]}/${{gemID}}

    # Move to tmp dir
    cd {config[path][scratch]}/{config[folder][memote]}/${{gemID}}

    # Copy GEM to tmp
    cp {input} .

    # Uncomment the following line in case errors are raised about missing git module,
    # also ensure that module name matches that of your cluster
    # module load git

    # Run memote
    memote report snapshot --skip test_find_metabolites_produced_with_closed_bounds --skip test_find_metabolites_consumed_with_closed_bounds --skip test_find_metabolites_not_produced_with_open_bounds --skip test_find_metabolites_not_consumed_with_open_bounds --skip test_find_incorrect_thermodynamic_reversibility --filename $(echo $(basename {input})|sed 's/.xml/.html/') *.xml
    memote run --skip test_find_metabolites_produced_with_closed_bounds --skip test_find_metabolites_consumed_with_closed_bounds --skip test_find_metabolites_not_produced_with_open_bounds --skip test_find_metabolites_not_consumed_with_open_bounds --skip test_find_incorrect_thermodynamic_reversibility *.xml

    # Rename output file with sample ID
    mv result.json.gz $(echo $(basename {input})|sed 's/.xml/.json.gz/')

    # Move results to output folder
    mv *.gz *.html {output}
    """
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
shell:
    """
    set +u;source activate {config[envs][metagem]};set -u

    cp -r {input.bins} {input.R1} {input.R2} {config[path][scratch]}
    cd {config[path][scratch]}

    cat *.gz > $(basename $(dirname {input.bins})).fastq.gz
    rm $(basename {input.R1}) $(basename {input.R2})

    mkdir MAGdb out
    update_database -d MAGdb -g $(basename {input.bins}) -p MAGdb
    rm -r $(basename {input.bins})

    grid multiplex -r . -e fastq.gz -d MAGdb -p -c 0.2 -o out -n {config[cores][grid]}

    rm $(basename $(dirname {input.bins})).fastq.gz
    mkdir {output}
    mv out/* {output}
    """
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
shell:
    """
    # Move into root dir
    cd {config[path][root]}

    # Make sure dnaBins folder exists
    mkdir -p {config[folder][dnaBins]}

    # Copy files
    echo -e "Begin copying and renaming dna fasta bins from reassembled_bins/ to dna_bins/ ... \n"
    for folder in reassembled_bins/*/;do
        # Loop through each sample
        sample=$(echo $(basename $folder));
        mkdir -p {config[path][root]}/{config[folder][dnaBins]}/$sample
        echo "Copying bins from sample $sample ... "
        for bin in $folder*reassembled_bins/*;do 
            # Loop through each bin
            var=$(echo $bin| sed 's|reassembled_bins/||g'|sed 's|/|_|g'|sed 's/permissive/p/g'|sed 's/orig/o/g'|sed 's/strict/s/g');
            cp $bin {config[path][root]}/{config[folder][dnaBins]}/$sample/$var;
        done;
    done
    """
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
shell:
    """
    set +u;source activate {config[envs][prokkaroary]};set -u
    mkdir -p $(dirname $(dirname {output}))
    mkdir -p $(dirname {output})

    cp {input} {config[path][scratch]}
    cd {config[path][scratch]}

    id=$(echo $(basename {input})|sed "s/.fa//g")
    prokka -locustag $id --cpus {config[cores][prokka]} --centre MAG --compliant -outdir prokka/$id -prefix $id $(basename {input})

    mv prokka/$id $(dirname {output})
    """
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
shell:
    """
    set +u;source activate {config[envs][prokkaroary]};set -u
    mkdir -p $(dirname {output})
    cd {config[path][scratch]}
    cp -r {input} .

    roary -s -p {config[cores][roary]} -i {config[params][roaryI]} -cd {config[params][roaryCD]} -f yes_al -e -v $(basename {input})/*.gff
    cd yes_al
    create_pan_genome_plots.R 
    cd ..
    mkdir -p {output}

    mv yes_al/* {output}
    """
1630
1631
1632
1633
1634
shell:
    """
    mkdir -p $(dirname {output.gff})
    prodigal -i <(gunzip -c {input}) -o {output.gff} -a {output.faa} -d {output.fna} -p meta  &> {output.log}
    """
1644
1645
wrapper:
    "https://github.com/snakemake/snakemake-wrappers/raw/0.80.1/bio/diamond/blastp"
ShowHide 20 more snippets with no or duplicated tags.

Free

Created: 7mo ago
Updated: 7mo ago
Maitainers: public
URL: https://franciscozorrilla.github.io/metaGEM/
Name: metagem
Version: v1.0.5
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Accessed: 4
Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...