A snakemake workflow for metagenomic projects

Version: v2.3.2

A workflow for metagenomic projects

Overview

A snakemake workflow for paired- and/or single-end whole-genome shotgun metagenomic data.

You can use this workflow for tasks such as:

  • read-trimming and QC

  • taxonomic classification

  • assembly

  • functional and taxonomic annotation

  • metagenomic binning

See the Wiki pages for instructions on how to run the workflow.

Installation

From GitHub

  1. Check out the latest version:
git clone https://github.com/NBISweden/nbis-meta.git

or download a tarball of the latest release from the release page.

  2. Install and activate the workflow environment:
conda env create -f environment.yml
conda activate nbis-meta
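
With the environment active you can, for example, do a dry-run to check your setup before starting a full run. This is only a minimal sketch: the config file path and core count below are placeholders, see the Wiki for the documented options.

snakemake --use-conda -n
snakemake --use-conda --cores 4 --configfile config/config.yaml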

From DockerHub

To pull the latest Docker image with all dependencies and source code from DockerHub, run:

docker pull nbisweden/nbis-meta

See the Wiki for instructions on how to run the workflow with Docker.
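
As a rough sketch, you could mount your project directory into the container and call snakemake from there; the mount point and the snakemake arguments here are assumptions, not the documented invocation:

docker run --rm -v "$(pwd)":/analysis -w /analysis nbisweden/nbis-meta \
    snakemake --cores 4 -n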

Code Snippets

shell:
    """
    prodigal -i {input} -d {output.genes} -a {output.faa} -o {output.gff} \
        -f gff -p meta 2>{log}
    """
shell:
    """
    tRNAscan-SE -G -b {output.bed} -o {output.file} -a {output.fasta} \
        --thread {threads} {input} >{log} 2>&1
    """
shell:
    """
    # Get the Rfam tarball
    curl -o {output.tar} {params.url}/Rfam.tar.gz > {log} 2>&1
    # Extract only rfams of interest
    tar -C {params.dir} -zxf {output.tar} {params.rfams}
    cat {params.dir}/*.cm > {output.cm}

    # Get release
    curl -o {output.readme} {params.url}/README 2>/dev/null
    grep -m 1 Release {output.readme} > {output.version}

    # Get clans
    curl {params.url}/Rfam.clanin 2>/dev/null | egrep -w \
        "CL0011[123]" > {output.clanin}
    """
shell:
    """
    cmpress {input} > {log} 2>&1
    """
shell:
    """
    cmscan --cpu {threads} --oskip --rfam --cut_ga --nohmmonly \
        --tblout {output} --fmt 2 --clanin {input.cl} {params.db} \
        {input.fastafile} > /dev/null 2>{log}
    """
shell:
    """
    curl -s -L -o {output.hmmfile}.gz {params.ftp}/Pfam-A.hmm.gz
    curl -s -L -o {output.datfile}.gz {params.ftp}/Pfam-A.hmm.dat.gz
    curl -s -L -o {output.versionfile}.gz {params.ftp}/Pfam.version.gz

    gunzip {output.hmmfile}.gz
    gunzip {output.datfile}.gz
    gunzip {output.versionfile}.gz
    """
shell:
    """
    curl -s -L -o {output.clanfile}.gz {params.ftp}/database_files/clan.txt.gz
    curl -s -L -o {output.info}.gz {params.ftp}/Pfam-A.clans.tsv.gz

    gunzip {output.clanfile}.gz
    gunzip {output.info}.gz
    """
shell:
    """
    hmmpress {input.hmmfile} > {log} 2>&1
    """
shell:
    """
    pfam_scan.pl -fasta {input[0]} -dir {params.dir} -cpu {threads} \
        -outfile {params.tmp_out} >{log} 2>&1
    mv {params.tmp_out} {output[0]}
    """
script:
    "../scripts/annotation_utils.py"
shell:
    """
    download_eggnog_data.py --data_dir {params.data_dir} -y > {log} 2>&1
    egrep -o "emapperdb-[0-9].[0-9].[0-9]" {log} > {output.version}
    """
shell:
    """
    python {params.src} download {params.outdir} > {log} 2>&1
    """
shell:
    """
    mkdir -p {params.tmpdir}
    emapper.py {params.flags} --cpu {threads} --temp_dir {params.tmpdir} \
    -i {input[0]} -o {params.out} --output_dir {params.tmpdir} \
        --data_dir {params.resource_dir} >{log} 2>&1
    mv {params.tmp_out}.emapper.seed_orthologs {output[0]}
    rm -rf {params.tmpdir}
    """
shell:
    """
    if [ -z ${{SLURM_JOB_ID+x}} ]; then SLURM_JOB_ID="emapper_annotate_hits_uppmax"; fi
    #Copy eggnog.db
    mkdir -p /dev/shm/$SLURM_JOB_ID
    cp {params.resource_dir}/eggnog.db {params.resource_dir}/eggnog_proteins.dmnd /dev/shm/$SLURM_JOB_ID

    emapper.py {params.flags} --cpu {threads} -o {params.out} \
        --annotate_hits_table {input[0]} --usemem \
        --data_dir /dev/shm/$SLURM_JOB_ID >{log} 2>&1
    rm -rf /dev/shm/$SLURM_JOB_ID
    """
shell:
    """
    emapper.py {params.flags} --cpu {threads} -o {params.out} \
        --annotate_hits_table {input[0]} --usemem \
        --data_dir {params.resource_dir} >{log} 2>&1
    """
script:
    "../scripts/annotation_utils.py"
shell:
     """
     curl -L -o {params.tar} \
        https://card.mcmaster.ca/latest/data >{log} 2>&1
     tar -C {params.dir} -xf {params.tar} ./card.json
     # Store download date in versionfile
     date > {output.version}
     rm {params.tar}
     """
shell:
    """
    mkdir -p {params.tmpdir}
    rgi load -i {input.db} --local > {log} 2>&1
    sed 's/*//g' {input.faa} > {params.faa}
    rgi main -i {params.faa} -o {params.out} \
        -n {threads} {params.settings} >>{log} 2>>{log}
    rm -r {params.tmpdir}
    """
script:
    "../scripts/annotation_utils.py"
script:
    "../scripts/assembly_utils.py"
shell:
    """
    # Create directories
    mkdir -p {params.tmp}
    # Clean output dir
    #rm -rf {params.output_dir}/*
    # Clean temp dir
    rm -rf {params.tmp}/*
    # Only use single-end if present
    if [ -s {input.se} ]; then
        single="-s {input.se}"
    else
        single=""
    fi
    metaspades.py \
        -t {threads} -1 {input.R1} -2 {input.R2} $single \
        -o {params.tmp} > {log} 2>&1

    # If set to keep intermediate contigs, move to intermediate folder before deleting
    if [ "{config[metaspades][keep_intermediate]}" == "True" ]; then
        mkdir -p {params.intermediate_contigs}
        cp -r {params.tmp}/K* {params.intermediate_contigs}
    fi
    if [ "{config[metaspades][keep_corrected]}" == "True" ]; then
        mkdir -p {params.corrected}
        cp -r {params.tmp}/corrected {params.corrected}
    fi

    # Clear intermediate contigs
    rm -rf {params.tmp}/K*
    # Clear corrected reads dir
    rm -rf {params.tmp}/corrected
    # Sync tmp output to outdir before removing
    cp -r {params.tmp}/* {params.output_dir}
    rm -rf {params.tmp}
    mv {params.output_dir}/scaffolds.fasta {params.output_dir}/final_contigs.fa
    """
script:
    "../scripts/assembly_utils.py"
shell:
    """
    mkdir -p {config[paths][temp]}
    rm -rf {params.tmp}
    # Only use paired-end if present
    if [ -s {input.R1} ]; then
        R1=$(cat {input.R1})
        R2=$(cat {input.R2})
        paired="-1 $R1 -2 $R2"
    else
        paired=""
    fi
    # Only use single-end if present
    se=$(cat {input.se})
    if [ -s {input.se} ]; then
        single="-r $se"
    else
        single=""
    fi

    # Run Megahit
    megahit -t {threads} $paired $single -o {params.tmp} \
        {params.additional_settings} >{log} 2>&1

    # Sync intermediate contigs if asked for
    if [ "{config[megahit][keep_intermediate]}" == "True" ]; then
        mkdir -p {params.intermediate_contigs}
        cp -r {params.tmp}/intermediate_contigs/* {params.intermediate_contigs}
    fi

    # Cleanup intermediate
    rm -rf {params.tmp}/intermediate_contigs

    # Sync tmp output to outdir before removing
    cp -r {params.tmp}/* {params.output_dir}
    rm -rf {params.tmp}
    mv {params.output_dir}/final.contigs.fa {params.output_dir}/final_contigs.fa
    """
script:
    "../scripts/assembly_utils.py"
shell:
    """
    bowtie2-build \
        --large-index \
        --threads {threads} \
        {params.prefix} \
        {params.prefix} > /dev/null 2>&1
    """
shell:
    """
    bowtie2 {params.setting} -p {threads} -x {params.prefix} -1 {input.R1} -2 {input.R2} 2> {log} \
        | samtools view -bh - | samtools sort - -o {params.temp_bam}
    samtools index {params.temp_bam}
    mv {params.temp_bam} {output.bam}
    mv {params.temp_bam}.bai {output.bai}
    """
shell:
    """
    bowtie2 {params.setting} -p {threads} -x {params.prefix} \
        -U {input.se} 2>{log} | samtools view -bh - | samtools sort - -o {params.temp_bam}
    samtools index {params.temp_bam}
    mv {params.temp_bam} {output.bam}
    mv {params.temp_bam}.bai {output.bai}
    """
shell:
    """
    for f in {input} ;
    do
        al=$(samtools \
            flagstat \
            $f | grep " mapped (" | cut -f2 -d '('| cut -f1 -d ' ')
        n=$(basename $f | sed 's/_[ps]e{params.post}.bam//g')
        echo -e "$n\t$al" >> {output}
    done
    """
script:
    "../scripts/assembly_utils.py"
shell:
    """
    jgi_summarize_bam_contig_depths \
        --outputDepth {output.depth} {input.bam} >{log} 2>&1
    """
shell:
    """
    metabat2 -i {input.fa} -a {input.depth} -m {wildcards.l} -t {threads} \
        -o {params.n} > {log} 2>&1
    """
shell:
    """
    set +e
    mkdir -p {params.dir}
    mkdir -p {params.tmp_dir}
    run_MaxBin.pl -markerset {params.markerset} -contig {input} \
        {params.reads} -min_contig_length {wildcards.l} -thread {threads} \
        -out {params.tmp_dir}/maxbin >{log} 2>{log}
    exitcode=$?
    if [ $exitcode -eq 255 ]; then
        exit 0
    else
        # Rename fasta files
        ls {params.tmp_dir} | grep ".fasta" | while read f;
        do
            mv {params.tmp_dir}/$f {params.dir}/${{f%.fasta}}.fa
        done
        # Move output from temporary dir
        ls {params.tmp_dir} | while read f;
        do
            mv {params.tmp_dir}/$f {params.dir}/
        done
    fi
    # Clean up
    rm -r {params.tmp_dir}
    """
shell:
    """
    for f in {input.bam} ;
        do
            n=$(basename $f);
            s=$(echo -e $n | sed 's/_[ps]e{params.p}.bam//g');
            echo $s;
        done > {params.samplenames}
    concoct_coverage_table.py \
        --samplenames {params.samplenames} \
        {input.bed} {input.bam} > {output.cov}
    rm {params.samplenames}
    """
shell:
    """
    cut_up_fasta.py -b {output.bed} -c 10000 -o 0 -m {input.fa} \
        > {output.fa} 2>{log}
    """
shell:
    """
    concoct -t {threads} --coverage_file {input.cov} -l {params.length} \
        --composition_file {input.fa} -b {params.basename}/ >/dev/null 2>&1
    """
shell:
    """
    merge_cutup_clustering.py {input[0]} > {output[0]} 2> {log}
    """
shell:
    """
    mkdir -p {params.tmp_dir}
    extract_fasta_bins.py {input[0]} {input[1]} \
        --output_path {params.tmp_dir} 2> {log}
    ls {params.tmp_dir} | egrep "[0-9].fa" | while read f;
    do
        mv {params.tmp_dir}/$f {params.dir}/concoct.$f
    done
    """
script:
    "../scripts/binning_utils.py"
script:
    "../scripts/binning_utils.py"
run:
    df=concatenate(input, index=-2)
    df.to_csv(output[0], sep="\t", index=True)
shell:
    """
    # Download
    curl -L https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz -o {params.tar} -s
    # Extract
    tar -C {params.dir} -xf {params.tar}
    # Set root
    checkm data setRoot {params.dir} > {log} 2>&1
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output.tsv}
        touch {output.ms}
    else
        checkm taxonomy_wf -t {threads} -x {params.suff} -q \
            --tab_table -f {output.tsv} \
            {params.rank} {params.taxon} {params.indir} {params.outdir} \
            > {log} 2>&1
        ln -s {params.taxon}.ms {output.ms}
    fi
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 0 ] ; then
        echo "NO BINS FOUND" > {output.tsv}
        touch {output.ms}
    else
        checkm lineage_wf -t {threads} --pplacer_threads {threads} \
            -x {params.suff} {params.tree} -q \
            --tab_table -f {output.tsv} \
            {params.indir} {params.outdir} \
            > {log} 2>&1
    fi
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output.tsv}
    else
        checkm qa -o 2 --tab_table -f {output.tsv} \
            {input.ms} {params.dir} > {log} 2>&1
    fi
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output}
    else
        checkm coverage -x fa -t {threads} {params.dir} \
            {output} {input.bam} > {log} 2>&1
    fi
    """
script:
    "../scripts/binning_utils.py"
shell:
    """
    lines=$(wc -l {input.stats} | cut -f1 -d ' ')
    cov_lines=$(wc -l {input.cov} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output}
    elif [ $cov_lines == 0 ] ; then
        echo "NO READS MAPPED" > {output}
    else
        checkm profile -f {output} --tab_table {input.cov} > {log} 2>&1
    fi
    """
run:
    df=concatenate(input, index=-3)
    df.to_csv(output.tsv, sep="\t", index=True)
run:
    df=concatenate(input, index=-3)
    df.to_csv(output.tsv, sep="\t", index=True)
shell:
    """
    curl -L -o {params.tar} {params.url} > {log} 2>&1
    tar xzf {params.tar} -C {params.dir} --strip 1 > {log} 2>&1
    """
shell:
    """
    bins=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $bins == 0 ] ; then
        echo "NO BINS FOUND" > {output}
    else
        export PYTHONPATH=$(which python)
        export GTDBTK_DATA_PATH={params.dbdir}
        gtdbtk classify_wf -x {params.suff} --out_dir {params.outdir} \
            --cpus {threads} --pplacer_cpus {threads} \
            --genome_dir {params.indir} > {log} 2>&1
    fi
    """
run:
    summaries=[]
    for f in input:
        gtdb_dir=os.path.dirname(f)
        for m in ["bac120", "ar122"]:
            summary=gtdb_dir+"/gtdbtk.{}.summary.tsv".format(m)
            if os.path.exists(summary):
                summaries.append(summary)
    df=concatenate(summaries, index=-3)
    df.to_csv(output.summary, sep="\t")
shell:
    """
    bins=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $bins == 0 ] ; then
        touch {output}
    else
        cat {params.gtdbtk_dir}/gtdbtk.*.summary.tsv | cut -f1 -d ';' | grep -v "user_genome" | \
            while read line;
            do
                d=$(echo -e "$line" | cut -f2)
                g=$(echo -e "$line" | cut -f1)
                if [ "$d" == "d__Bacteria" ]; then
                    k="bac"
                else
                    k="arc"
                fi
                barrnap --kingdom $k {params.indir}/$g.fa > {params.outdir}/out 2>>{log}
                lines=$(wc -l {params.outdir}/out | cut -f1 -d ' ')
                if [ $lines -gt 1 ]; then
                    egrep -v "^#" {params.outdir}/out | sed "s/$/;genome=$g/g" >> {output}
                else
                    touch {output}
                fi
            done
        rm {params.outdir}/out
    fi
    """
script:
    "../scripts/binning_utils.py"
shell:
    """
    bins=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $bins == 0 ] ; then
        touch {output}
    else
        echo -e "Name\ttRNA#\ttRNA_Begin\ttRNA_End\ttRNA_type\tAnti_Codon\tIntron_Begin\tIntron_End\tInf_Score\tNote\tBin_Id" > {output}
        cat {params.gtdbtk_dir}/gtdbtk.*.summary.tsv | cut -f1 -d ';' | grep -v "user_genome" | \
            while read line;
            do
                d=$(echo -e "$line" | cut -f2)
                g=$(echo -e "$line" | cut -f1)
                if [ "$d" == "d__Bacteria" ]; then
                    model="-B"
                else
                    model="-A"
                fi
                tRNAscan-SE $model --quiet --thread {threads} \
                    {params.indir}/$g.fa | tail -n +4 | sed "s/$/\t$g/g" >> {output}
            done
    fi
    """
script:
    "../scripts/binning_utils.py"
run:
    df=concatenate(input.trna, index=-3)
    df.to_csv(output.trna, sep="\t", index=True)
    df=concatenate(input.rrna, index=-3)
    df.to_csv(output.rrna, sep="\t", index=True)
script:
    "../scripts/binning_utils.py"
script:
    "../scripts/binning_utils.py"
shell:
    """
    fastANI --rl {input[0]} --ql {input[1]} -k {params.k} -t {threads} \
        --fragLen {params.frag_len} --minFraction {params.fraction} \
        --matrix -o {output[0]} > {log} 2>&1
    """
script:
    "../scripts/binning_utils.py"
shell:
     """
     mkdir -p {params.tmpdir}
     curl -L -o {params.tar} {params.url} > {log} 2>&1
     tar -C {params.tmpdir} -xf {params.tar}
     hashfile=$(find {params.tmpdir}/ -name "hash.k2d")
     hashdir=$(dirname $hashfile)
     mv $hashdir/* {params.dir}
     rm -r {params.tar} {params.tmpdir}
     """
shell:
    """
    kraken2-build --standard --db {params.dir} --threads {threads} > {log.build} 2>&1
    kraken2-build --clean {params.dir} > {log.clean} 2>&1
    """
shell:
    """
    kraken2 {params.mem} --db {params.db} --output {output[0]} \
        --report {output[1]} --gzip-compressed \
        --threads {threads} --paired {input.R1} {input.R2} > {log} 2>&1
    """
shell:
    """
    kraken2 {params.mem} --db {params.db} --output {output[0]} \
        --report {output[1]} --gzip-compressed \
        --threads {threads} {input.se} > {log} 2>&1
    """
shell:
    """
    curl -o {params.tar} {params.url} > {log} 2>&1
    tar -C {params.dir} -xf {params.tar} >>{log} 2>&1
    rm {params.tar}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    centrifuge -k {params.k} -x {params.prefix} -1 {input.R1} -2 {input.R2} \
        -S {params.tmp_out} -p {threads} --report-file {params.tmp_report} \
        > {log} 2>&1
    mv {params.tmp_out} {output[0]}
    mv {params.tmp_report} {output[1]}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    centrifuge -k {params.k} -U {input.se} -x {params.prefix} \
        -S {params.tmp_out} -p {threads} --report-file {params.tmp_report} \
        > {log} 2>&1
    mv {params.tmp_out} {output[0]}
    mv {params.tmp_report} {output[1]}
    """
shell:
    """
    centrifuge-kreport --min-score {params.min_score} -x {params.prefix} \
        {input.f} > {output[0]}
    """
shell:
    """
    metaphlan --install --bowtie2db {params.dir} \
        --nproc {threads} -x {params.index} >{log} 2>&1
    """
shell:
    """
    metaphlan {input.R1},{input.R2} --bowtie2db {params.dir} --add_viruses \
        --force --nproc {threads} --input_type fastq -o {output.tsv} \
         --bowtie2out {output.bt2} > {log} 2>&1
    """
shell:
    """
    metaphlan {input.se} --bowtie2db {params.dir} --add_viruses --force \
        --nproc {threads} --input_type fastq -o {output.tsv} \
         --bowtie2out {output.bt2} > {log} 2>&1
    """
shell:
    """
    merge_metaphlan_tables.py {input} > {output}
    """
script:
    "../scripts/classification_utils.py"
shell:
    """
    ktImportTaxonomy -t 1 -m 2 -o {output} -tax {params.dbdir} \
        {params.input_string} > {log} 2>&1
    """
shell:
    """
    ktUpdateTaxonomy.sh {params.taxdir} >{log} 2>&1
    """
shell:
    """
    ktImportTaxonomy -t 5 -m 3 -tax {params.tax} -o {output[0]} \
        {input[0]},{wildcards.sample}_{wildcards.unit} > {log} 2>&1
    """
shell:
     """
     ktImportTaxonomy \
        -t 5 -m 3 -tax {params.tax} -o {output[0]} \
        {params.input_string} > {log} 2>&1
     """
shell:
     """
     curl -L -s -o {params.tar} {params.url}
     tar -C {params.outdir} -xf {params.tar}
     rm {params.tar}
     """
shell:
     """
     seqtk sample -s {wildcards.s} {input} \
        {params.example_dataset_size} | gzip -c > {output}
     """
run:
    link(input[0], output[0])
shell:
     """
     curl -L -o {output} {params.url} > {log} 2>&1
     """
shell:
    """
    indexdb_rna --ref {input.fasta},{input.fasta} > {log} 2>&1
    """
shell:
    """
    mkdir -p {params.scratch}
    # Unzip to scratch dir
    gunzip -c {input.R1} > {params.R1_unzipped}
    gunzip -c {input.R2} > {params.R2_unzipped}
    # Merge
    merge-paired-reads.sh {params.R1_unzipped} {params.R2_unzipped} \
        {params.merged} >{log} 2>&1
    # Move output
    mv {params.merged} {output}
    # Clean up
    rm {params.R1_unzipped} {params.R2_unzipped}
    """
shell:
     """
     mkdir -p {params.scratch}
     # Run SortMeRNA
     sortmerna --blast 1 --log -v --fastx --ref {params.ref_string} \
        --reads {input.fastq} -a {threads} --{params.paired_strategy} \
        --aligned {params.aligned_prefix} --other {params.other_prefix} \
        {params.score_params} >{log} 2>&1

     mv {params.aligned_prefix}.fastq {output.aligned}
     mv {params.aligned_prefix}.log {log}
     mv {params.other_prefix}.fastq {output.other}
     """
shell:
    """
    mkdir -p {params.tmpdir}
    unmerge-paired-reads.sh {input.aligned} {params.R1} {params.R2} >{log} 2>&1
    gzip {params.R1}
    gzip {params.R2}
    mv {params.R1}.gz {output.R1}
    mv {params.R2}.gz {output.R2}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    unmerge-paired-reads.sh {input.other} {params.R1} {params.R2} >{log} 2>&1
    gzip {params.R1}
    gzip {params.R2}
    mv {params.R1}.gz {output.R1}
    mv {params.R2}.gz {output.R2}
    """
shell:
    """
    gunzip -c {input} > {output} 2>{log}
    """
shell:
    """
    mkdir -p {params.scratch}
    # Run SortMeRNA
    sortmerna --blast 1 --log -v --fastx --ref {params.ref_string} \
        --reads {input.fastq} -a {threads} --other {params.other_prefix} \
        --aligned {params.aligned_prefix} {params.score_params} \
        >{log} 2>&1

    mv {params.aligned_prefix}.fastq {output.aligned}
    mv {params.aligned_prefix}.log {log}
    mv {params.other_prefix}.fastq {output.other}
    """
shell:
    """
    gzip {input.fastq} 2>{log}
    """
shell:
    """
    gzip {input.fastq} 2>{log}
    """
run:
    link(input.R1, output.R1)
    link(input.R2, output.R2)
run:
    link(input.se, output.se)
shell:
    """
    trimmomatic PE \
        -threads {threads} \
        {input.R1} {input.R2} \
        {output.R1P} {output.R1U} \
        {output.R2P} {output.R2U} \
        {params.trim_string} \
        2>{log.R1log}
    sed \
        's/{wildcards.sample}_{wildcards.unit}_R1/{wildcards.sample}_{wildcards.unit}_R2/g' \
        {log.R1log} > {log.R2log}
    """
shell:
    """
    trimmomatic SE \
        -threads {threads} \
        {input} \
        {output} \
        {params.trim_string} \
        2>{log}
    """
shell:
    """
    cutadapt \
        {params.extra_params} \
        -e {params.error_rate} \
        -a {params.adapter} \
        -A {params.rev_adapter} \
        -o {output.fastq1} \
        -p {output.fastq2} \
        -j {threads} \
        {input.R1} {input.R2} > {log.R1log} 2>{log.err}
    cp {log.R1log} {log.R2log}
    """
shell:
    """
    cutadapt \
        {params.extra_params} \
        -e {params.error_rate} \
        -a {params.adapter} \
        -j {threads} \
        -o {output} {input} > {log}
    """
shell:
    """
    curl \
        -L \
        -o {output}.gz \
        -s \
        {params.url_base}/GCF_000819615.1_ViralProj14015_genomic.fna.gz \
        > {log} 2>&1
    gunzip {output}.gz
    """
shell:
    """
    bowtie2-build \
        --threads {threads} \
        {input.fasta} {params.prefix} >{log} 2>&1
    """
shell:
    """
    mkdir -p {params.tmp_out}
    bowtie2 \
        {params.setting} \
        -p {threads} \
        -x {params.prefix} \
        -1 {input.R1} \
        -2 {input.R2} \
        --un-conc-gz \
        {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_R%.filtered.fastq.gz > /dev/null 2>{log}
    mv {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_R1.filtered.fastq.gz {output.R1}
    mv {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_R2.filtered.fastq.gz {output.R2}
    """
shell:
    """
    mkdir -p {params.tmp_out}
    bowtie2 \
        {params.setting} \
        -p {threads} \
        -x {params.prefix} \
        --un-gz \
        {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_se.filtered.fastq.gz {input.se} > /dev/null 2>{log}
    mv {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_se.filtered.fastq.gz {output.se}
    """
shell:
    """
    gunzip -c {input.R1} > {params.R1_intmp}
    gunzip -c {input.R2} > {params.R2_intmp}
    echo {params.R1_intmp} > {params.file_list}
    echo {params.R2_intmp} >> {params.file_list}
    fastuniq -i {params.file_list} -t q -o {params.R1_outtmp} \
        -p {params.R2_outtmp} >{log} 2>&1
    gzip -c {params.R1_outtmp} > {output.R1}
    gzip -c {params.R2_outtmp} > {output.R2}
    rm {params.R1_intmp} {params.R1_outtmp} {params.R2_intmp} {params.R2_outtmp}
    rm {params.file_list}
    """
run:
    link(input.se, output.se)
shell:
    """
    fastqc -q --noextract -o {params.dir} {input} >{log} 2>&1
    """
shell:
    """
    multiqc -f -c {params.config} -n samples_report.html \
        -o {params.output_dir} {input} >{log} 2>{log}
    """
script:
    "../scripts/quantification_utils.py"
shell:
    """
    mkdir -p {params.temp_dir}
    # Fix bam header
    samtools view -H {input} | egrep -v "^@PG" > {params.header}
    samtools reheader -P {params.header} {input} > {params.rehead_bam}
    # Set memory max
    mem="-Xmx$((6 * {threads}))g"
    java -Xms2g $mem -XX:ParallelGCThreads={threads} \
        -jar $CONDA_PREFIX/share/picard-*/picard.jar MarkDuplicates \
        I={params.rehead_bam} M={output[2]} O={params.temp_bam} REMOVE_DUPLICATES=TRUE \
        USE_JDK_DEFLATER=TRUE USE_JDK_INFLATER=TRUE ASSUME_SORT_ORDER=coordinate \
        PROGRAM_RECORD_ID=null ADD_PG_TAG_TO_READS=FALSE 2> {log}
    # Re sort the bam file using samtools
    samtools_threads="$(({threads} - 1))"
    samtools sort -@ $samtools_threads -o {params.temp_sort_bam} {params.temp_bam} > /dev/null 2>&1
    # Index the bam file
    samtools index {params.temp_sort_bam}
    mv {params.temp_sort_bam} {output[0]}
    mv {params.temp_sort_bam}.bai {output[1]}
    rm {params.temp_bam} {params.rehead_bam} {params.header}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id -M \
        {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
    """
script:
    "../scripts/quantification_utils.py"
script:
    "../scripts/quantification_utils.py"
script:
    "../scripts/edger.R"
script:
    "../scripts/quantification_utils.py"
script:
    "../scripts/edger.R"
script:
    "../scripts/metagenomeseq.R"
script:
    "../scripts/quantification_utils.py"
shell:
    """
    contigtax download taxonomy -t {params.taxdir} >{log} 2>&1
    """
shell:
    """
    contigtax download {wildcards.db} --tmpdir {params.tmpdir} \
        -d {params.dldir} --skip_idmap >{log} 2>{log}
    """
shell:
    """
    contigtax download idmap -d {params.dldir} > {log} 2>&1
    """
shell:
    """
    contigtax format -m {output.idmap} --tmpdir {params.tmpdir} \
        {input.fasta} {output.fasta} > {log} 2>&1
    """
shell:
    """
    contigtax format --tmpdir {params.tmpdir} {input.fasta} \
        {output.fasta} > {log} 2>&1
    """
shell:
    """
    # If an idmap file is available, use it to create an updated idmap file
    if [ -e {params.dir}/idmap.tsv.gz ] ; then
        contigtax update {input.idmap} {params.dir}/idmap.tsv.gz \
            {output.idmap} > {log} 2>&1
    # Otherwise, just create a symlink
    else
        cd {params.dir}
        ln -s $(basename {input.idmap}) $(basename {output.idmap})
    fi
    """
shell:
     """
     contigtax build -d {output} -p {threads} {input.fasta} \
        {input.idmap} {input.nodes} >{log} 2>&1
     """
shell:
    """
    contigtax search {params.settings} -p {threads} \
        --tmpdir {params.tmpdir} -l {params.min_len} \
        {input.fasta} {input.db} {output} >{log} 2>&1
    """
shell:
     """
     contigtax assign {params.settings} -p {threads} -m rank_lca \
        --reportranks {params.taxonomy_ranks} -t {params.taxdir} \
        {input.tsv} {output} > {log} 2>&1
     """
shell:
    """
    curl -L -v -o {output.lca}.gz {params.url} > {log} 2>&1
    grep filename {log} | cut -f2 -d ';' > {output.version}
    gunzip {output.lca}.gz
    """
shell:
    """
    sourmash compute --singleton --scaled {params.frac} \
        -k {params.k} -o {output} {input} > {log} 2>&1
    """
shell:
    """
    sourmash lca classify --db {input.db} --scaled {params.frac} \
        --query {input.sig} -o {output.csv} > {log} 2>&1
    """
script:
    "../scripts/taxonomy_utils.py"
script:
    "../scripts/taxonomy_utils.py"
import pandas as pd


def orf2feat(d, val_name="vals", regex=""):
    import re
    keys = []
    vals = []
    for key, val in d.items():
        for item in val.split(","):
            if regex != "":
                m = re.search(regex, item)
                if m == None:
                    continue
                item = m.group()
            keys.append(key)
            vals.append(item)
    df = pd.DataFrame(data={"orf": keys, val_name: vals})
    df.set_index("orf", inplace=True)
    return df


def parse_emapper(sm):
    from pandas.errors import EmptyDataError
    db = sm.wildcards.db
    db_att = {'kos': {'val_name': 'ko', 'regex': r"K\d{5}"},
              'pathways': {'val_name': 'pathway', 'regex': r"map\d{5}"},
              'modules': {'val_name': 'module', 'regex': ""},
              'enzymes': {'val_name': 'enzyme', 'regex': ""}}
    df = pd.read_csv(sm.input.annotations, sep="\t", index_col=0)
    df.rename(columns={'KEGG_ko': 'kos', 'KEGG_Pathway': 'pathways',
                          'KEGG_Module': 'modules', 'EC': 'enzymes'},
               inplace=True)
    df.fillna("-", inplace=True)
    d = df.loc[df[db] != "-", db].to_dict()
    how = "inner"
    try:
        info_df = pd.read_csv(sm.input.info, sep="\t", index_col=0)
    except EmptyDataError:
        info_df = pd.DataFrame()
        how = "right"
    annot = orf2feat(d, val_name=db_att[db]["val_name"],
                     regex=db_att[db]["regex"])
    annot = pd.merge(info_df, annot, left_index=True,
                     right_on=db_att[db]["val_name"], how=how)
    annot.to_csv(sm.output[0], sep="\t", index=True, header=True)


def parse_rgi(sm):
    annot = pd.read_csv(sm.input.txt, sep="\t", index_col=0)
    annot = annot.loc[:, ["Model_ID", "AMR Gene Family", "Resistance Mechanism"]]
    annot.loc[:, "Model_ID"] = ["RGI_{}".format(x) for x in annot.Model_ID]
    annot.rename(index=lambda x: x.split(" ")[0], inplace=True)
    annot.to_csv(sm.output.tsv, sep="\t", index=True)


def parse_pfam(sm):
    annot = pd.read_csv(sm.input[0], comment="#", header=None, sep=" +",
                        usecols=[0, 5, 7, 14], engine="python",
                        names=["orf", "pfam", "pfam_type", "pfam_clan"])
    clans = pd.read_csv(sm.input[1], header=None, names=["clan", "clan_name"],
                        usecols=[0, 3], sep="\t")
    info = pd.read_csv(sm.input[2], header=None,
                       names=["pfam", "clan", "pfam_name"], usecols=[0, 1, 4],
                       sep="\t")
    # Strip suffix for pfams
    annot.loc[:, "pfam"] = [x.split(".")[0] for x in annot.pfam]
    # Select unique orf->pfam mappings
    # TODO: This masks multiple occurrences of domains on the same orf. Figure out if this is wanted or not.
    # Merge with pfam info and clan info
    annot = annot.groupby(["orf", "pfam"]).first().reset_index()
    annot = pd.merge(annot, info, left_on="pfam", right_on="pfam")
    annot = pd.merge(annot, clans, left_on="clan", right_on="clan", how="left")
    annot.fillna("No_clan", inplace=True)
    annot = annot.loc[:, ["orf", "pfam", "pfam_name", "clan", "clan_name"]]
    annot.sort_values("orf", inplace=True)
    # Write to file
    annot.to_csv(sm.output[0], sep="\t", index=False)


def main(sm):
    toolbox = {"parse_pfam": parse_pfam,
               "parse_emapper": parse_emapper,
               "parse_rgi": parse_rgi}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
import pandas as pd
from Bio.SeqIO import parse
from pathlib import Path


# file io

def fasta2bed(sm):
    with open(sm.input[0], 'r') as fhin, open(sm.output[0], 'w') as fhout:
        for record in parse(fhin, "fasta"):
            fhout.write("{}\t{}\t{}\n".format(record.id, 0, len(record)))


# statistics

def store_lengths(f, minlen=False):
    """
    Reads lengths of contigs from fasta
    :param f: fasta file
    :param minlen: minimum length to store
    :return: pandas DataFrame of lengths
    """
    r = {}
    for record in parse(f, "fasta"):
        if minlen:
            if len(record.seq) < minlen:
                continue
        r[record.id] = len(record.seq)
    df = pd.DataFrame(r, index=["length"]).T
    return df


def size_distribute(df, lengths=None):
    """
    Calculates the distribution of an assembly in length bins

    For each <l> in <lengths> calculate for contigs >= <l>:
     n = the number of contigs
     s = the total length in bp
     p = the fraction of lengths / total assembly size
    :param df: pandas DataFrame of lengths
    :param lengths: intervals at which to calculate stats
    :return: pandas DataFrame
    """
    if lengths is None:
        lengths = [0, 100, 250, 500, 1000, 2500, 5000, 10000, 15000, 20000,
                   25000, 30000, 35000, 40000, 45000, 50000, 75000, 100000,
                   125000, 150000, 200000, 250000, 500000]
    size_dist = {}
    for i, l in enumerate(lengths):
        if len(df.loc[df.length >= l]) == 0:
            break
        n = len(df.loc[df.length >= l])
        s = int(df.loc[df.length >= l].sum())
        p = int(df.loc[df.length >= l].sum()) / float(df.sum()) * 100
        size_dist[i] = {"min_length": l, "num_contigs": n, "total_length": s,
                        "%": p}
    size_dist_df = pd.DataFrame(size_dist).T
    size_dist_df = size_dist_df[
        ["min_length", "num_contigs", "total_length", "%"]]
    return size_dist_df


def calculate_n_stats(df):
    """
    Calculates n50 and n90 statistics from a list of lengths
    :param df: pandas DataFrame of contig lengths
    :return:
    """
    df.sort_values("length", inplace=True, ascending=True)
    size = int(df.sum())
    N50_length = N90_length = 0
    cumulative = 0
    for contig in df.index:
        l = df.loc[contig, "length"]
        cumulative += l
        if float(cumulative) >= 0.5 * size and not N50_length:
            N50_length = l
        elif float(cumulative) >= 0.1 * size and not N90_length:
            N90_length = l
    return N50_length, N90_length


def calculate_length_stats(df):
    """
    Calculates length statistics from a dataframe
    :param df: pandas DataFrame with contig lengths
    :return:
    """
    contigs = len(df)
    total_size = int(df.sum())
    min_length = int(df["length"].min())
    max_length = int(df["length"].max())
    avg_length = float(df["length"].mean())
    median_length = float(df["length"].median())
    return contigs, total_size, min_length, max_length, avg_length, median_length


def generate_stat_df(contig_lengths):
    """
    Generates statistics from a dataframe of contig lengths
    :param contig_lengths: pandas DataFrame
    :return:
    """
    index = ["contigs", "total_size_bp", "min_length", "max_length",
             "avg_length", "median_length", "N50_length", "N90_length"]
    stat_items = calculate_length_stats(contig_lengths)
    n50_length, n90_length = calculate_n_stats(contig_lengths)
    stat_df = pd.DataFrame([stat_items[0], stat_items[1], stat_items[2],
                            stat_items[3], stat_items[4], stat_items[5],
                            n50_length, n90_length], index=index).T
    return stat_df


def stats(sm):
    """
    Reads a list of assembly fasta files and generates statistics

    :param sm: snakemake object
    :return:
    """
    stat_result = pd.DataFrame()
    sizedist_result = pd.DataFrame()
    for f in sm.input.fa:
        p = Path(f)
        name = p.parent.name
        contig_lengths = store_lengths(f)
        stat_df = generate_stat_df(contig_lengths)
        size_dist = size_distribute(contig_lengths)
        stat_df["assembly"] = [name]*len(stat_df)
        size_dist["assembly"] = [name]*len(size_dist)
        stat_result = pd.concat([stat_result, stat_df])
        sizedist_result = pd.concat([sizedist_result,size_dist])
    stat_result = stat_result[["assembly", "contigs", "total_size_bp",
                               "min_length", "max_length", "avg_length",
                               "median_length", "N50_length", "N90_length"]]
    stat_result.to_csv(sm.output[0], sep="\t", index=False)
    sizedist_result = sizedist_result[["assembly", "min_length",
                                       "num_contigs", "total_length", "%"]]
    sizedist_result.to_csv(sm.output[1], sep="\t", index=False)

# assembly input

def metaspades_input(sm):
    """
    Generates fastq files to use as input for metaspades assembler

    :param sm: snakemake object
    :return:
    """
    from common import rename_records
    files = {"R1": [], "R2": [], "se": []}
    assembly_dict = sm.params.assembly
    # Collect all files belonging to the assembly group
    for sample in assembly_dict.keys():
        for unit in assembly_dict[sample]:
            for pair in assembly_dict[sample][unit].keys():
                files[pair].append(
                    assembly_dict[sample][unit][pair][0])
    # Rename and concatenate reads (required for Metaspades)
    with open(sm.output.R1, 'w') as fh1, open(sm.output.R2, 'w') as fh2, open(
        sm.output.se, 'w') as fhse:
        i = 0
        for f in files["R1"]:
            f2 = files["R2"][i]
            fh1 = rename_records(f, fh1, i)
            fh2 = rename_records(f2, fh2, i)
            i += 1
        for i, f in enumerate(files["se"], start=i):
            fhse = rename_records(f, fhse, i)


def megahit_input(sm):
    """
    Genereate input lists for megahit assembler

    :param sm: snakemake object
    :return:
    """
    files = {"R1": [], "R2": [], "se": []}
    assembly_dict = sm.params.assembly
    for sample in assembly_dict.keys():
        for unit in assembly_dict[sample]:
            for pair in assembly_dict[sample][unit].keys():
                files[pair].append(assembly_dict[sample][unit][pair][0])
    with open(sm.output.R1, 'w') as fh1, \
        open(sm.output.R2, 'w') as fh2, \
        open(sm.output.se, 'w') as fhse:
        fh1.write(",".join(files["R1"]))
        fh2.write(",".join(files["R2"]))
        fhse.write(",".join(files["se"]))


def main(sm):
    toolbox = {"assembly_stats": stats,
               "generate_megahit_input": megahit_input,
               "generate_metaspades_input": metaspades_input,
               "fasta2bed": fasta2bed}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
import pandas as pd
from glob import glob
import os
from os.path import join as opj, basename, splitext
import numpy as np


# bin info

def contig_map(sm):
    """
    Generates a map of bin->contig id
    :param sm: snakemake object
    :return:
    """
    from Bio.SeqIO import parse
    files = glob(opj(sm.params.dir, "*.fa"))
    if len(files) == 0:
        with open(sm.output[0], 'w') as fhout:
            pass
        return
    # open the output file once so that the mappings from all bins are kept
    with open(sm.output[0], 'w') as fhout:
        for f in files:
            genome, _ = splitext(basename(f))
            with open(f, 'r') as fhin:
                for record in parse(fhin, "fasta"):
                    fhout.write("{}\t{}\n".format(genome, record.id))


# bin stats

def n50(lengths):
    """
    Calculate N50 stats
    :param lengths:
    :return:
    """
    cumulative = 0
    size = sum(lengths)
    for l in sorted(lengths):
        cumulative += l
        if float(cumulative) / size >= 0.5:
            return l


def bin_stats(f):
    """
    Generate bin statistics
    :param f:
    :return:
    """
    from Bio.SeqIO import parse
    size = 0
    gc = 0
    contig_lengths = []
    for record in parse(f, "fasta"):
        l = len(record)
        seq = record.seq.upper()
        g = seq.count("G")
        c = seq.count("C")
        gc += g + c
        size += l
        contig_lengths.append(l)
    gc_f = round(float(gc) / size * 100, 2)
    size_mb = size / 1000000
    mean_l = round(np.mean(contig_lengths), 2)
    median_l = round(np.median(contig_lengths), 2)
    min_l = np.min(contig_lengths)
    max_l = np.max(contig_lengths)
    n50_l = n50(contig_lengths)
    return {'bp': size, 'GC': gc_f, 'Mbp': round(size_mb, 2),
            'mean_contig': mean_l, 'median_contig': median_l,
            'min_contig': min_l, 'max_contig': max_l, 'n50': n50_l,
            'contigs': len(contig_lengths)}


def calculate_bin_stats(files):
    """
    Calls bin statistics for each file
    :param files:
    :return:
    """
    stats = {}
    for f in files:
        name = basename(f)
        name, _ = splitext(name)
        stats[name] = bin_stats(f)
    return stats


def binning_stats(sm):
    """
    Main function for calculating bin statistics
    :param sm:
    :return:
    """
    files = glob(opj(sm.params.dir, "*.fa"))
    if len(files) == 0:
        with open(sm.output[0], 'w') as fh:
            fh.write("No bins found\n")
        return
    else:
        stats = calculate_bin_stats(files)
        cols = ["bp", "Mbp", "GC", "contigs", "n50", "mean_contig",
                "median_contig", "min_contig", "max_contig"]
        df = pd.DataFrame(stats).T[cols]
        df.index.name = "bin"
        df.sort_values("bp", ascending=False, inplace=True)
        df.to_csv(sm.output[0], sep="\t", index=True, header=True)


# checkm utils

def remove_checkm_zerocols(sm):
    """
    Reads checkm coverage and removes samples with no reads mapped to contigs

    :param sm:
    :return:
    """
    df = pd.read_csv(sm.input[0], header=0, sep="\t")
    if df.shape[0] == 0:
        with open(sm.output[0], 'w') as fh:
            fh.write("NO BINS FOUND\n")
        return
    # base columns are independent of samples
    base_columns = ["Sequence Id", "Bin Id", "Sequence length (bp)"]
    # get all columns with mapped read counts
    cols = [x for x in df.columns if "Mapped reads" in x]
    # sum counts
    df_sum = df.loc[:, cols].sum()
    # get columns with zero reads mapped
    zero_cols = df_sum.loc[df_sum == 0].index
    cols_to_drop = []
    # find suffix of zero cols
    for c in list(zero_cols):
        suffix = ".{}".format(c.split(".")[-1])
        if suffix == ".Mapped reads":
            suffix = ""
        cols_to_drop += (
            "Mapped reads{s},Bam Id{s},Coverage{s}".format(s=suffix)).split(",")
    df_checked = df.drop(cols_to_drop, axis=1)
    # check that there are remaining columns
    diff_cols = set(df_checked.columns).difference(base_columns)
    if len(diff_cols) > 0:
        df_checked.to_csv(sm.output[0], sep="\t", index=False, header=True)
    else:
        with open(sm.output[0], 'w') as fhout:
            pass

# bin annotation

def count_rrna(sm):
    """
    Counts rRNA genes in bins
    :param sm:
    :return:
    """
    df = pd.read_csv(sm.input[0], sep="\t", usecols=[0, 2, 8], header=None,
                     names=["contig", "type", "fields"])
    # If empty dataframe, just write an empty file
    if df.shape[0] == 0:
        table = pd.DataFrame(columns=["5S_rRNA", "16S_rRNA", "23S_rRNA"])
        table.index.name = "Bin Id"
        table.to_csv(sm.output[0], sep="\t", header=True)
        return
    types = [x.split(";")[0].split("=")[-1] for x in df.fields]
    bins = [x.split(";")[-1].split("=")[-1] for x in df.fields]
    _df = pd.DataFrame(data={'rRNA_type': types, 'Bin_Id': bins})
    dfc = _df.reset_index().groupby(["Bin_Id", "rRNA_type"]).count()
    table = dfc.pivot_table(columns="rRNA_type", index="Bin_Id", fill_value=0)[
        "index"]
    table.index.name = table.columns.name = ""

    missing = set(["16S_rRNA", "23S_rRNA", "5S_rRNA"]).difference(table.columns)
    if len(missing) > 0:
        table = pd.merge(table, pd.DataFrame(columns=missing, index=table.index,
                                             data=0), left_index=True,
                         right_index=True)
    table = table.loc[:, ["5S_rRNA", "16S_rRNA", "23S_rRNA"]]
    table.index.name = "Bin_Id"
    table.to_csv(sm.output[0], sep="\t", index=True)


def count_trna(sm):
    """
    Counts tRNA genes in bins
    :param sm:
    :return:
    """
    df = pd.read_csv(sm.input[0], sep="\t")
    dfc = df.groupby(["tRNA_type", "Bin_Id"]).count().reset_index().loc[:,
          ["tRNA_type", "tRNA#", "Bin_Id"]]
    table = dfc.pivot_table(columns="tRNA_type", index="Bin_Id")[
        "tRNA#"].fillna(0)
    table.index.name = table.columns.name = ""
    table.to_csv(sm.output[0], sep="\t", index=True)

    total = {}
    for m in table.index:
        c = len(table.loc[m, table.loc[m] > 0])
        total[m] = c
    table = pd.DataFrame(total, index=["tRNAs"]).T
    table.index.name = "Bin_Id"
    table.to_csv(sm.output[1], sep="\t", index=True)


# genome clustering

def fetch_genome(ftp_base, outfile):
    """
    Performs the ftp fetching
    :param ftp_base: Base url to RefSeq/GenBank ftp for the genome
    :param outfile: Output filename
    :return:
    """
    from urllib import request
    ftp_base = ftp_base.rstrip("/")
    n = os.path.basename(ftp_base)
    fna = opj(ftp_base, "{}_genomic.fna.gz".format(n))
    r = request.urlretrieve(fna, outfile)
    return r


def unzip(z, o):
    """
    Unzips a zipped fasta file and removes the zipped file
    :param z: input gzip file
    :param o: output plain text file
    :return:
    """
    import gzip as gz
    with gz.open(z, 'rt') as fhin, open(o, 'w') as fhout:
        fhout.write(fhin.read())
    os.remove(z)


def download_ref_genome(sm):
    """
    Fetches reference genomes for clustering with fastANI
    :param sm: snakemake object
    :return:
    """
    fetch_genome(sm.params.ftp_base, "{}.gz".format(sm.output[0]))
    unzip("{}.gz".format(sm.output[0]), sm.output[0])


def generate_bin_list(input, outdir, min_completeness, max_contamination):
    """
    Generates a list of bins to use for fastANI
    Also symlinks each bin file into the fastANI folder
    :param input: List of checkm genome statistics files
    :param outdir: Output directory to store symlinks
    :param min_completeness: Minimum completeness level (%)
    :param max_contamination: Maximum contamination level (%)
    :return:
    """
    genomes = []
    for f in input:
        with open(f, 'r') as fh:
            if fh.readline().rstrip() == "NO BINS FOUND":
                continue
        items = f.split("/")
        # extract wildcards from file path
        binner, assembly, l = items[-5], items[-4], items[-3]
        bindir = os.path.dirname(os.path.dirname(f))
        # get absolute path for bin directory
        abs_in = os.path.abspath(bindir)
        # read the checkm summary file
        df = pd.read_csv(f, sep="\t", index_col=0)
        # filter to at least <min_completeness> and at most <max_contamination>
        df = df.loc[(df.Completeness >= min_completeness) & (df.Contamination <= max_contamination)]
        if df.shape[0] == 0:
            continue
        # generate a unique suffix for each bin
        uniq_suffix = "{assembly}.{l}".format(assembly=assembly, l=l)
        # make a map of the bin id and the unique suffix
        idmap = dict(zip(df.index,
                         ["{x}.{s}".format(x=x, s=uniq_suffix) for x in
                          df.index]))
        # create symlink in the output path for each bin id that points
        # to the original fasta file
        for bin_id, uniq_id in idmap.items():
            src = opj(abs_in, "{}.fa".format(bin_id))
            dst = opj(outdir, "{}.fa".format(uniq_id))
            if os.path.exists(dst):
                os.remove(dst)
            os.symlink(src, dst)
            genomes.append(dst)
    return genomes


def generate_ref_list(input, outdir):
    """
    Generates a list of reference genomes
    Also symlinks each reference file into the fastANI folder

    :param input:
    :param outdir:
    :return:
    """
    genomes = []
    for f in input:
        basename = os.path.basename(f)
        src = os.path.abspath(f)
        dst = opj(outdir, basename)
        if os.path.exists(dst):
            os.remove(dst)
        os.symlink(src, dst)
        genomes.append(dst)
    return genomes


def write_list(genomes, output):
    """
    Writes the fastANI lists to file
    :param genomes: list of genomes to use for fastANI
    :param output:
    :return:
    """
    with open(output, 'w') as fh:
        for g in genomes:
            fh.write("{}\n".format(g))
    return genomes


def generate_fastANI_lists(sm):
    """
    Main function for generating fastANI lists
    :param sm:
    :return:
    """
    bins = generate_bin_list(sm.input.bins, sm.params.outdir,
                             sm.params.completeness,
                             sm.params.contamination)
    refs = generate_ref_list(sm.input.refs, sm.params.outdir)
    genomes = bins + refs
    write_list(genomes, sm.output[0])
    genomes.reverse()
    write_list(genomes, sm.output[1])


def check_pairs(pairs, min_frags):
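    """
    Builds a dictionary of genome pairs sharing at least <min_frags> aligned
    fragments, used to decide which fastANI distances to keep
    :param pairs: pandas DataFrame with query, ref, ANI, aligned, total columns
    :param min_frags: minimum aligned fragments required to link two genomes
    :return: dictionary mapping each genome to genomes it may be compared with
    """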
    allowed_pairs = {}
    for i in pairs.index:
        q = pairs.loc[i, "query"]
        r = pairs.loc[i, "ref"]
        for key in [q, r]:
            if not key in allowed_pairs.keys():
                allowed_pairs[key] = []
        if pairs.loc[i, "aligned"] >= min_frags:
            allowed_pairs[q].append(r)
            allowed_pairs[r].append(q)
    return allowed_pairs


def fastani2dist(mat, txt, min_frags):
    """
    Converts the fastANI out.txt.matrix file to a pandas DataFrame
    :param mat: Distance matrix file
    :param txt: Pairwise output table with ANI and aligned + total fragments
    :param min_frags: Minimum aligned fragments to compare two genomes
    :return:
    """
    # read the pairwise table
    pairs = pd.read_csv(txt, header=None,
                        sep="\t",
                        names=["query", "ref", "ANI", "aligned", "total"])
    for key in ["query", "ref"]:
        pairs[key] = [x.split("/")[-1].replace(".fna", "").replace(".fa", "")
                      for x in pairs[key]]
    allowed_pairs = check_pairs(pairs, min_frags)
    genomes = list(pd.read_table(mat, index_col=0, skiprows=1, sep="\t",
                                 header=None, usecols=[0]).index)
    genomes = [os.path.splitext(os.path.basename(g))[0] for g in genomes]
    r = {}
    with open(mat, 'r') as fh:
        for i, line in enumerate(fh):
            if i == 0:
                continue
            line = line.rstrip()
            items = line.rsplit("\t")
            genome = os.path.splitext(os.path.basename(items[0]))[0]
            r[genome] = {}
            for j, item in enumerate(items[1:]):
                genome2 = genomes[j]
                # check that the pairing is allowed
                if item == "NA" or genome not in allowed_pairs[genome2] or genome2 not in allowed_pairs[genome]:
                    item = np.nan
                else:
                    item = float(item)
                r[genome][genome2] = item
    df = pd.DataFrame(r)
    df.fillna(0, inplace=True)
    return 1-df.div(100)


def cluster(linkage):
    """
    Cluster all genomes based on established linkages using networkx

    :param linkage: Dictionary of dictionaries
    :return: A dictionary with cluster index and list of genomes
    """
    import networkx as nx
    g = nx.from_dict_of_dicts(linkage)
    clustered = []
    clusters = {}
    clust_num = 1
    for n in g.nodes():
        if n in clustered:
            continue
        c = [n]
        edges = list(nx.dfs_edges(g, n))
        for e in edges:
            n1, n2 = e
            clustered += [n1, n2]
            c += [n1, n2]
        c = list(set(c))
        clusters[clust_num] = c[:]
        clust_num += 1
    return clusters


def generate_linkage(dist_mat, max_dist):
    """
    Create a nested dictionary linking genomes if their distance is within
    a certain threshold.

    :param dist_mat: pandas DataFrame with distances
    :param max_dist: maximum allowed distance to link genomes
    :return: a nested dictionary
    """
    linkage = {}
    for i in range(len(dist_mat.index)):
        g1 = dist_mat.index[i]
        if g1 not in linkage:
            linkage[g1] = {}
        for j in range(i + 1, len(dist_mat.columns)):
            g2 = dist_mat.columns[j]
            if g2 not in linkage:
                linkage[g2] = {}
            distance = dist_mat.iloc[i, j]
            if distance <= max_dist:
                linkage[g1][g2] = ""
                linkage[g2][g1] = ""
    return linkage


def cluster_genomes(sm):
    """
    Main function to run clustering of genomes

    :param sm: snakemake object
    :return:
    """
    dist = fastani2dist(sm.input.mat, sm.input.txt, sm.params.minfrags)
    linkage = generate_linkage(dist, sm.params.thresh)
    clusters = cluster(linkage)
    write_clusters(clusters, sm.output[0])


def write_clusters(clusters, outfile):
    """
    Sorts clusters by size and writes to file
    :param clusters: Dictionary of clusters with genomes as a list
    :param outfile: Output file path
    :return:
    """
    import operator
    # Calculate cluster sizes
    cluster_sizes = {}
    for clust_num, l in clusters.items():
        cluster_sizes[clust_num] = len(l)
    # Sort clusters by sizes
    sorted_clusters = sorted(cluster_sizes.items(), key=operator.itemgetter(1),
                             reverse=True)
    # Write table
    with open(outfile, 'w') as fh:
        for i, item in enumerate(sorted_clusters, start=1):
            old_num = item[0]
            for g in clusters[old_num]:
                fh.write("Cluster{}\t{}\n".format(i, g))


def main(sm):
    toolbox = {"contig_map": contig_map, "count_tRNA": count_trna,
               "remove_checkm_zerocols": remove_checkm_zerocols,
               "count_rRNA": count_rrna, "binning_stats": binning_stats,
               "download_ref_genome": download_ref_genome,
               "generate_fastANI_lists": generate_fastANI_lists,
               "cluster_genomes": cluster_genomes}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
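To make the distance-to-cluster step above concrete, here is a minimal standalone sketch (not part of the workflow) run on a made-up 3x3 distance matrix; the genome names and the 0.05 distance threshold (roughly 95% ANI) are illustrative only. The script builds a linkage dictionary and walks it with dfs_edges; the sketch uses a plain networkx Graph and connected_components, which yields the same grouping.

import pandas as pd
import networkx as nx

# made-up distance matrix (1 - ANI/100) for two bins and one reference
genomes = ["binA", "binB", "refC"]
dist = pd.DataFrame([[0.00, 0.02, 0.30],
                     [0.02, 0.00, 0.28],
                     [0.30, 0.28, 0.00]],
                    index=genomes, columns=genomes)
max_dist = 0.05

# link genomes whose pairwise distance is within the threshold
g = nx.Graph()
g.add_nodes_from(genomes)
for i, g1 in enumerate(dist.index):
    for j in range(i + 1, len(dist.columns)):
        if dist.iloc[i, j] <= max_dist:
            g.add_edge(g1, dist.columns[j])

# clusters are the connected components of the linkage graph
clusters = {n: sorted(c)
            for n, c in enumerate(nx.connected_components(g), start=1)}
print(clusters)  # {1: ['binA', 'binB'], 2: ['refC']}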
import pandas as pd


def metaphlan2krona(sm):
    """
    Creates a krona-input table from metaphlan (standard format) results
    :param sm: snakemake object
    :return:
    """
    df = pd.read_csv(sm.input[0], sep="\t", header=None, comment="#",
                     index_col=0, usecols=[0, 1, 2],
                     names=["lineage", "taxids", "%"])
    # Extract species level
    df_sp = df.loc[df.index.str.contains("s__")]
    # Create new dataframe with taxids \t %
    data = dict(zip([x.split("|")[-1] for x in df_sp.taxids], df_sp["%"]))
    df_out = pd.DataFrame(data, index=["%"]).T
    df_out.to_csv(sm.output[0], sep="\t", index=True, header=False)


def main(sm):
    toolbox = {"metaphlan2krona_table": metaphlan2krona}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
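A toy illustration of the metaphlan2krona transformation above (the lineage strings, taxids and percentages below are shortened, made-up values, not real MetaPhlAn output): keep species-level rows, take the last taxid in the pipe-separated lineage, and write taxid and percent for Krona.

import pandas as pd

# two made-up rows mimicking the three columns read from MetaPhlAn output
df = pd.DataFrame(
    {"taxids": ["2|1239", "2|1239|186801|1000"],
     "%": [60.0, 60.0]},
    index=["k__Bacteria|p__Firmicutes",
           "k__Bacteria|p__Firmicutes|g__Foo|s__Foo_bar"])

# keep species-level rows and write taxid <TAB> percent for Krona
df_sp = df.loc[df.index.str.contains("s__")]
data = dict(zip([x.split("|")[-1] for x in df_sp.taxids], df_sp["%"]))
out = pd.DataFrame(data, index=["%"]).T
print(out.to_csv(sep="\t", header=False))  # -> 1000<TAB>60.0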
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
source("workflow/scripts/common.R")

library(edgeR)
method <- snakemake@params$method
input <- snakemake@input[[1]]
output <- snakemake@output[[1]]

# Read the counts
x <- read.delim(input, row.names = 1, sep = "\t", header = TRUE)
# Get sample names
sample_names <- colnames(x)[unlist(lapply(x, is.numeric))]
# Extract row names
xrownames <- row.names(x)[row.names(x)!="Unclassified"]
# Get info names
info_names <- colnames(x)[unlist(lapply(x, is.character))]
# Remove unclassified
if ("Unclassified" %in% row.names(x)){
    x <- x[row.names(x)!="Unclassified", ]
}

x <- as.data.frame(x, row.names=xrownames)
colnames(x) <- append(info_names, sample_names)

x_num <- process_data(x, output)

if (method %in% c("TMM", "RLE")) {
    # Create DGE
    obj <- DGEList(x_num)
    # Calculate norm factors
    obj <- calcNormFactors(obj, method = method)
    # Calculate cpms
    norm <- cpm(obj, normalized.lib.sizes = TRUE)
} else if (method == "RPKM") {
    # Extract gene length column
    gene_length <- x_num$Length
    x_num <- x_num[, colnames(x_num) != "Length"]
    obj <- DGEList(x_num)
    # Calculate norm factors
    obj <- calcNormFactors(obj, method = "TMM")
    # Calculate RPKM
    norm <- rpkm(obj, normalized.lib.sizes = TRUE, gene.length = gene_length)
}

# Add info columns back
norm <- cbind(str_cols(x), norm)
if (length(info_names)>0) {
    colnames(norm) <- append(info_names, sample_names)
}
# Convert back to a data frame
norm <- as.data.frame(norm)
row.names(norm) <- xrownames

write.table(x = norm, file = output, quote = FALSE, sep="\t")
R, from line 3 of scripts/edger.R
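For reference, a rough Python sketch (with made-up counts and gene lengths) of the CPM and RPKM arithmetic that underlies the edgeR calls above; edgeR additionally rescales the library sizes by the TMM/RLE normalization factors, which is not reproduced here.

import pandas as pd

# hypothetical counts for two genes in two samples
counts = pd.DataFrame({"sample1": [100, 900], "sample2": [10, 90]},
                      index=["gene1", "gene2"])
lengths = pd.Series([500, 2000], index=counts.index)  # gene lengths in bp

# counts per million: scale each sample by its library size
cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
# RPKM: additionally scale by gene length in kb
rpkm = cpm.div(lengths / 1e3, axis=0)
print(cpm)
print(rpkm)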
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(metagenomeSeq)
source("workflow/scripts/common.R")

method <- snakemake@params$method
input <- snakemake@input[[1]]
output <- snakemake@output[[1]]
normalize <- TRUE
# Read the counts
x <- read.delim(input, row.names = 1, sep = "\t", header = TRUE)
# Get sample names
sample_names <- colnames(x)[unlist(lapply(x, is.numeric))]
# Extract row names
xrownames <- row.names(x)[row.names(x)!="Unclassified"]
# Get info names
info_names <- colnames(x)[unlist(lapply(x, is.character))]

# Remove unclassified
if ("Unclassified" %in% row.names(x)){
    x <- x[row.names(x)!="Unclassified", ]
}

x <- as.data.frame(x, row.names=xrownames)
colnames(x) <- append(info_names, sample_names)

# Returns a vector in the case of 1 sample only
x_num <- process_data(x, output)

# If only one sample, set normalize=FALSE
if (length(sample_names) == 1) {
    print("ONLY ONE SAMPLE! WILL NOT RUN CSS")
    normalize <- FALSE
    x_num <- as.data.frame(x_num, row.names = rownames(x))
    colnames(x_num) <- sample_names
}

# Turn data into a new MRexperiment object
obj <- newMRexperiment(x_num)
too_few_features <- FALSE
# In cases with only one sample, check whether there are enough features to run
# CSS
if (is.null(ncol(x_num))) {
    if (length(x_num) <= 1) {
        too_few_features <- TRUE
    }
} else { # Do the same type of check for many samples
    smat <- lapply(1:ncol(x_num), function(i) {
        sort(x_num[which(x_num[, i]>0),i], decreasing = TRUE)
    })
    if (any(sapply(smat,length)==1)) {
        too_few_features <- TRUE
    }
}
if (too_few_features == TRUE) {
    fh <- file(snakemake@output[[1]])
    writeLines(c("WARNING: Sample with one or zero features",
                 "Cumulative Sum Scaling failed for sample"), fh)
    close(fh)
    quit()
}

# Normalize
norm <- MRcounts(obj, norm = normalize)

# Add info columns back
norm <- cbind(str_cols(x), norm)
colnames(norm) <- append(info_names, sample_names)

# Convert back to a data frame
norm <- as.data.frame(norm)

# Set sample names
colnames(norm)[unlist(lapply(norm, is.numeric))] <- sample_names

# Write output
write.table(x = norm, file = output, quote = FALSE, sep="\t")
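A conceptual Python sketch of cumulative sum scaling (CSS) with made-up counts and a fixed quantile (p = 0.5). metagenomeSeq chooses the quantile per dataset via cumNormStat and scales by sl = 1000 in MRcounts; this sketch is only meant to illustrate the idea, not to reproduce the package.

import pandas as pd

counts = pd.DataFrame({"s1": [10, 0, 5, 100], "s2": [1, 2, 3, 50]},
                      index=["f1", "f2", "f3", "f4"])

def css_factor(col, p=0.5):
    # sum of counts at or below the p-th quantile of the non-zero counts
    nonzero = col[col > 0]
    q = nonzero.quantile(p)
    return nonzero[nonzero <= q].sum()

factors = counts.apply(css_factor)
norm = counts.div(factors, axis=1) * 1000
print(norm)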
import pandas as pd


def write_featurefile(sm, score=".", group="gene_id", phase="."):
    """
    Takes a prodigal GFF file and turns it into a file for use with
    featureCounts
    :param sm: snakemake object
    :param score: dummy score column
    :param group: identifier for featurecounts
    :param phase: dummy phase column
    :return:
    """
    with open(sm.input[0], 'r') as fhin, open(sm.output[0], 'w') as fhout:
        for line in fhin:
            line = line.rstrip()
            if line.startswith("#"):
                continue
            items = line.split("\t")
            contig = items[0]
            source = items[1]
            method = items[2]
            start = items[3]
            stop = items[4]
            _strand = items[6]
            geneid = items[8]
            if _strand in ['+1', '1', '+']:
                strand = '+'
            else:
                strand = '-'
            gene_id = "{} {}\n".format(group, geneid.split(";")[0].split("=")[-1])
            fhout.write("\t".join([contig, source, method, start, stop, score,
                                   strand, phase, gene_id]))
    return


def clean_featurecount(sm):
    """
    This cleans the featureCounts output table from format:
    # Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "etc."
    Geneid  Chr         Start   End     Strand  Length  path/to/bam/sample.bam
    1_1     k141_7581   1       459     -       459     1
    2_1     k141_0      469     714     +       246     2

    To format:
    gene_id         Length  sample
    k141_7581_1     459     1
    k141_0_1        246     2
    """
    df = pd.read_csv(sm.input[0], comment="#", sep="\t")
    # Extract gene number and combine with contig id
    df["gene_num"] = [x[1] for x in df.Geneid.str.split("_")]
    df.set_index(df.Chr.map(str) + "_" + df.gene_num, inplace=True)
    df.drop("gene_num", axis=1, inplace=True)
    df.index.name = 'gene_id'
    # Set sample and unit name from wildcards
    sample_unit = "{sample}_{unit}".format(sample=sm.wildcards.sample,
                                           unit=sm.wildcards.unit)
    df.columns = list(df.columns)[0:-1] + [sample_unit]
    # Extract length and counts
    df = df.loc[:, ["Length", sample_unit]]
    df.to_csv(sm.output[0], sep="\t")


def aggregate_featurecount(sm):
    """
    Aggregates cleaned featureCounts tables into one table per assembly

    :param sm: snakemake object
    :return:
    """
    df = pd.DataFrame()
    lmap = {}
    for f in sm.input:
        _df = pd.read_csv(f, sep="\t", index_col=0)
        lmap.update(_df.to_dict()["Length"])
        _df.drop("Length", axis=1, inplace=True)
        df = pd.merge(df, _df, right_index=True, left_index=True, how="outer")
    counts = pd.merge(df, pd.DataFrame(lmap, index=["Length"]).T,
                      left_index=True, right_index=True)
    counts.to_csv(sm.output[0], sep="\t")


def process_and_sum(q_df, annot_df):
    # Merge annotations and abundance
    # keep ORFs without annotation as "Unclassified"
    annot_q_df = pd.merge(annot_df, q_df, left_index=True, right_index=True,
                          how="right")
    annot_q_df.fillna("Unclassified", inplace=True)
    feature_cols = annot_df.columns
    annot_q_sum = annot_q_df.groupby(list(feature_cols)).sum().reset_index()
    annot_q_sum.set_index(feature_cols[0], inplace=True)
    return annot_q_sum


def sum_to_features(abundance, parsed):
    parsed_df = pd.read_csv(parsed, index_col=0, sep="\t")
    abundance_df = pd.read_csv(abundance, index_col=0, sep="\t")
    abundance_df.drop("Length", axis=1, inplace=True, errors="ignore")
    feature_sum = process_and_sum(abundance_df, parsed_df)
    return feature_sum


def count_features(sm):
    """
    Counts reads mapped to features such as KOs, PFAMs etc.

    :param sm:
    :return:
    """
    feature_sum = sum_to_features(sm.input.abund, sm.input.annot)
    feature_sum.to_csv(sm.output[0], sep="\t")


def sum_to_taxa(sm):
    """
    Takes taxonomic assignments per orf and abundance values (tpm or raw) and
    returns abundance values summed to unique combinations of ranks

    :param sm: snakemake object
    :return:
    """
    header = ["protein", "superkingdom", "phylum", "class", "order", "family",
              "genus", "species"]
    df = pd.read_csv(sm.input.tax[0], sep="\t", index_col=0, header=None,
                     names=header)
    abund_df = pd.read_csv(sm.input.abund, header=0, index_col=0, sep="\t")
    # Remove length column
    abund_df.drop("Length", axis=1, inplace=True, errors="ignore")
    taxa_abund = pd.merge(df, abund_df, right_index=True, left_index=True)
    taxa_abund_sum = taxa_abund.groupby(header[1:]).sum().reset_index()
    taxa_abund_sum.to_csv(sm.output[0], sep="\t", index=False)


def main(sm):
    toolbox = {"write_featurefile": write_featurefile,
               "clean_featurecount": clean_featurecount,
               "aggregate_featurecount": aggregate_featurecount,
               "count_features": count_features,
               "sum_to_taxa": sum_to_taxa}

    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
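As a standalone illustration of the clean_featurecount step, using the same values as the example in its docstring (the bam column is kept under its original name here; in the workflow it is renamed to {sample}_{unit}):

import pandas as pd

df = pd.DataFrame({"Geneid": ["1_1", "2_1"],
                   "Chr": ["k141_7581", "k141_0"],
                   "Length": [459, 246],
                   "path/to/bam/sample.bam": [1, 2]})
# combine the contig id with the gene number to form the gene_id index
df["gene_num"] = [x[1] for x in df.Geneid.str.split("_")]
df.set_index(df.Chr.map(str) + "_" + df.gene_num, inplace=True)
df.index.name = "gene_id"
df = df.loc[:, ["Length", "path/to/bam/sample.bam"]]
print(df)
#              Length  path/to/bam/sample.bam
# gene_id
# k141_7581_1     459                       1
# k141_0_1        246                       2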
import sys
import pandas as pd

def add_lower(df, ranks):
    """
    Propagates assignments from higher to lower taxonomic ranks,
    and adds a 'Unclassified.' prefix.
    :param df: pandas DataFrame
    :param ranks: ranks for which to propagate
    :return:
    """
    for i in df.index:
        last_known = df.loc[i, ranks[0]]
        for rank in ranks[1:]:
            if df.loc[i, rank] != "Unclassified":
                last_known = df.loc[i, rank]
            else:
                if last_known == "Unclassified":
                    df.loc[i, rank] = last_known
                else:
                    df.loc[i, rank] = "Unclassified.{}".format(last_known)
    return df


def contigtax_mash(sm):
    # Keep stats on assignments
    # resolved = cases where sourmash helped resolve assignments
    # transferred = cases where blast-based assignments were overwritten
    # added = cases where assignments from sourmash were added
    # total = total number of contigs
    stats = {'resolved': 0, 'transferred': 0, 'added': 0, 'total': 0}
    df1 = pd.read_csv(sm.input.smash, sep=",", header=0, index_col=0)
    stats['total'] = df1.shape[0]
    df2 = pd.read_csv(sm.input.contigtax[0], sep="\t", header=0, index_col=0)
    ranks = list(df2.columns)
    ranks.reverse()
    # Only use subset of contigs with matches
    df1 = df1.loc[df1["status"] == "found", df2.columns]
    df1.fillna("Unclassified", inplace=True)

    # Get common set of contigs
    common = set(df1.index).intersection(set(df2.index))
    for contig in common:
        s = df1.loc[contig]
        b = df2.loc[contig]
        for rank in ranks:
            # If sourmash has an assignment at this rank
            if s[rank] != "Unclassified":
                # If blast-based contains 'Unclassified',
                # mark contig as resolved
                if "Unclassified" in b[rank]:
                    stats['resolved'] += 1
                # Otherwise, mark contig as transferred
                else:
                    stats['transferred'] += 1
                # As soon as a contig has been transferred or resolved
                # we can stop the merge
                df2.loc[contig] = df1.loc[contig]
                break
            # If sourmash does not have an assignment at this rank
            else:
                # but blast-based does have an assignment,
                # then the blast-based is more resolved and we can stop
                # trying to merge
                if "Unclassified" not in b[rank]:
                    break
    # Get contigs in sourmash missing from blast
    missing1 = set(df1.index).difference(set(df2.index))
    if len(missing1) > 0:
        stats['added'] += len(missing1)
        df2 = pd.concat([df2, df1.loc[missing1]])
    df2 = add_lower(df2, df2.columns)
    df2.to_csv(sm.output[0], sep="\t")
    # Write to log
    with open(sm.log[0], 'w') as fhout:
        fhout.write("Total:       {}\n".format(stats['total']))
        fhout.write("Resolved:    {}\n".format(stats['resolved']))
        fhout.write("Transferred: {}\n".format(stats["transferred"]))
        fhout.write("Added:       {}\n".format(stats['added']))


def contigtax_assign_orfs(sm):
    """
    Transfers taxonomic assignments from contigs down to ORFs called on contigs
    :param sm: snakemake object
    :return:
    """
    gff_df = pd.read_csv(sm.input.gff, header=None, sep="\t", comment="#",
                         usecols=[0, 8], names=["contig", "id"])
    # Extract ORF ids by combining the contig id with the gene number from
    # the GFF attribute field
    ids = ["{}_{}".format(gff_df.loc[i, "contig"],
                          gff_df.loc[i, "id"].split(";")[0].split("_")[-1])
           for i in gff_df.index]
    gff_df.loc[:, "id"] = ids
    # Read taxonomy for contigs
    tax_df = pd.read_csv(sm.input.tax, header=0, sep="\t", index_col=0)
    # Merge dataframes
    orf_tax_df = pd.merge(gff_df, tax_df, left_on="contig",
                          right_index=True, how="outer")
    # When using 'outer' merging there may be contigs with a tax assignment
    # but no called ORF. Drop these rows, i.e. rows where 'id' is NaN
    # (NaN != NaN).
    orf_tax_df = orf_tax_df.loc[orf_tax_df["id"] == orf_tax_df["id"]]
    # Set Unclassified for NA values
    orf_tax_df.fillna("Unclassified", inplace=True)
    # Set index to ORF ids
    orf_tax_df.set_index("id", inplace=True)
    orf_tax_df.drop("contig", axis=1, inplace=True)
    orf_tax_df.to_csv(sm.output.tax[0], sep="\t", index=True, header=True)


def main(sm):
    toolbox = {"merge_contigtax_sourmash": contigtax_mash,
               "contigtax_assign_orfs": contigtax_assign_orfs}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
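A small demonstration of add_lower() on a made-up two-contig table, assuming the add_lower function above is in scope (for example pasted into the same session); the contig and taxon names are illustrative only.

import pandas as pd

df = pd.DataFrame(
    {"superkingdom": ["Bacteria", "Unclassified"],
     "phylum": ["Proteobacteria", "Unclassified"],
     "class": ["Unclassified", "Unclassified"]},
    index=["contig1", "contig2"])

print(add_lower(df, ["superkingdom", "phylum", "class"]))
# contig1: class becomes "Unclassified.Proteobacteria"
# contig2: stays "Unclassified" at every rank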