A snakemake workflow for metagenomic projects

Version: v2.3.2

A workflow for metagenomic projects

Overview

A snakemake workflow for paired- and/or single-end whole-genome shotgun metagenomic data.

You can use this workflow for tasks such as:

  • read-trimming and QC

  • taxonomic classification

  • assembly

  • functional and taxonomic annotation

  • metagenomic binning

See the Wiki pages for instructions on how to run the workflow.

Installation

From GitHub

  1. Check out the latest version:
git clone https://github.com/NBISweden/nbis-meta.git

or download a tarball of the latest release from the release page.

  2. Install and activate the workflow environment:
conda env create -f environment.yml
conda activate nbis-meta
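
With the environment active you can, for example, do a dry-run to check your setup before starting a full run. This is only a minimal sketch: the config file path and core count below are placeholders, see the Wiki for the documented options.

snakemake --use-conda -n
snakemake --use-conda --cores 4 --configfile config/config.yaml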

From DockerHub

To pull the latest Docker image with all dependencies and source code from DockerHub, run:

docker pull nbisweden/nbis-meta

See the Wiki for instructions on how to run the workflow with Docker.
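
As a rough sketch, you could mount your project directory into the container and call snakemake from there; the mount point and the snakemake arguments here are assumptions, not the documented invocation:

docker run --rm -v "$(pwd)":/analysis -w /analysis nbisweden/nbis-meta \
    snakemake --cores 4 -n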

Code Snippets

shell:
    """
    prodigal -i {input} -d {output.genes} -a {output.faa} -o {output.gff} \
        -f gff -p meta 2>{log}
    """
shell:
    """
    tRNAscan-SE -G -b {output.bed} -o {output.file} -a {output.fasta} \
        --thread {threads} {input} >{log} 2>&1
    """
shell:
    """
    # Get the Rfam tarball
    curl -o {output.tar} {params.url}/Rfam.tar.gz > {log} 2>&1
    # Extract only rfams of interest
    tar -C {params.dir} -zxf {output.tar} {params.rfams}
    cat {params.dir}/*.cm > {output.cm}

    # Get release
    curl -o {output.readme} {params.url}/README 2>/dev/null
    grep -m 1 Release {output.readme} > {output.version}

    # Get clans
    curl {params.url}/Rfam.clanin 2>/dev/null | egrep -w \
        "CL0011[123]" > {output.clanin}
    """
shell:
    """
    cmpress {input} > {log} 2>&1
    """
shell:
    """
    cmscan --cpu {threads} --oskip --rfam --cut_ga --nohmmonly \
        --tblout {output} --fmt 2 --clanin {input.cl} {params.db} \
        {input.fastafile} > /dev/null 2>{log}
    """
shell:
    """
    curl -s -L -o {output.hmmfile}.gz {params.ftp}/Pfam-A.hmm.gz
    curl -s -L -o {output.datfile}.gz {params.ftp}/Pfam-A.hmm.dat.gz
    curl -s -L -o {output.versionfile}.gz {params.ftp}/Pfam.version.gz

    gunzip {output.hmmfile}.gz
    gunzip {output.datfile}.gz
    gunzip {output.versionfile}.gz
    """
shell:
    """
    curl -s -L -o {output.clanfile}.gz {params.ftp}/database_files/clan.txt.gz
    curl -s -L -o {output.info}.gz {params.ftp}/Pfam-A.clans.tsv.gz

    gunzip {output.clanfile}.gz
    gunzip {output.info}.gz
    """
shell:
    """
    hmmpress {input.hmmfile} > {log} 2>&1
    """
shell:
    """
    pfam_scan.pl -fasta {input[0]} -dir {params.dir} -cpu {threads} \
        -outfile {params.tmp_out} >{log} 2>&1
    mv {params.tmp_out} {output[0]}
    """
script:
    "../scripts/annotation_utils.py"
shell:
    """
    download_eggnog_data.py --data_dir {params.data_dir} -y > {log} 2>&1
    egrep -o "emapperdb-[0-9].[0-9].[0-9]" {log} > {output.version}
    """
shell:
    """
    python {params.src} download {params.outdir} > {log} 2>&1
    """
shell:
    """
    mkdir -p {params.tmpdir}
    emapper.py {params.flags} --cpu {threads} --temp_dir {params.tmpdir} \
    -i {input[0]} -o {params.out} --output_dir {params.tmpdir} \
        --data_dir {params.resource_dir} >{log} 2>&1
    mv {params.tmp_out}.emapper.seed_orthologs {output[0]}
    rm -rf {params.tmpdir}
    """
shell:
    """
    if [ -z ${{SLURM_JOB_ID+x}} ]; then SLURM_JOB_ID="emapper_annotate_hits_uppmax"; fi
    #Copy eggnog.db
    mkdir -p /dev/shm/$SLURM_JOB_ID
    cp {params.resource_dir}/eggnog.db {params.resource_dir}/eggnog_proteins.dmnd /dev/shm/$SLURM_JOB_ID

    emapper.py {params.flags} --cpu {threads} -o {params.out} \
        --annotate_hits_table {input[0]} --usemem \
        --data_dir /dev/shm/$SLURM_JOB_ID >{log} 2>&1
    rm -rf /dev/shm/$SLURM_JOB_ID
    """
shell:
    """
    emapper.py {params.flags} --cpu {threads} -o {params.out} \
        --annotate_hits_table {input[0]} --usemem \
        --data_dir {params.resource_dir} >{log} 2>&1
    """
script:
    "../scripts/annotation_utils.py"
shell:
     """
     curl -L -o {params.tar} \
        https://card.mcmaster.ca/latest/data >{log} 2>&1
     tar -C {params.dir} -xf {params.tar} ./card.json
     # Store download date in versionfile
     date > {output.version}
     rm {params.tar}
     """
shell:
    """
    mkdir -p {params.tmpdir}
    rgi load -i {input.db} --local > {log} 2>&1
    sed 's/*//g' {input.faa} > {params.faa}
    rgi main -i {params.faa} -o {params.out} \
        -n {threads} {params.settings} >>{log} 2>>{log}
    rm -r {params.tmpdir}
    """
script:
    "../scripts/annotation_utils.py"
script:
    "../scripts/assembly_utils.py"
shell:
    """
    # Create directories
    mkdir -p {params.tmp}
    # Clean output dir
    #rm -rf {params.output_dir}/*
    # Clean temp dir
    rm -rf {params.tmp}/*
    # Only use single-end if present
    if [ -s {input.se} ]; then
        single="-s {input.se}"
    else
        single=""
    fi
    metaspades.py \
        -t {threads} -1 {input.R1} -2 {input.R2} $single \
        -o {params.tmp} > {log} 2>&1

    # If set to keep intermediate contigs, move to intermediate folder before deleting
    if [ "{config[metaspades][keep_intermediate]}" == "True" ]; then
        mkdir -p {params.intermediate_contigs}
        cp -r {params.tmp}/K* {params.intermediate_contigs}
    fi
    if [ "{config[metaspades][keep_corrected]}" == "True" ]; then
        mkdir -p {params.corrected}
        cp -r {params.tmp}/corrected {params.corrected}
    fi

    # Clear intermediate contigs
    rm -rf {params.tmp}/K*
    # Clear corrected reads dir
    rm -rf {params.tmp}/corrected
    # Sync tmp output to outdir before removing
    cp -r {params.tmp}/* {params.output_dir}
    rm -rf {params.tmp}
    mv {params.output_dir}/scaffolds.fasta {params.output_dir}/final_contigs.fa
    """
script:
    "../scripts/assembly_utils.py"
shell:
    """
    mkdir -p {config[paths][temp]}
    rm -rf {params.tmp}
    # Only use paired-end if present
    if [ -s {input.R1} ]; then
        R1=$(cat {input.R1})
        R2=$(cat {input.R2})
        paired="-1 $R1 -2 $R2"
    else
        paired=""
    fi
    # Only use single-end if present
    se=$(cat {input.se})
    if [ -s {input.se} ]; then
        single="-r $se"
    else
        single=""
    fi

    # Run Megahit
    megahit -t {threads} $paired $single -o {params.tmp} \
        {params.additional_settings} >{log} 2>&1

    # Sync intermediate contigs if asked for
    if [ "{config[megahit][keep_intermediate]}" == "True" ]; then
        mkdir -p {params.intermediate_contigs}
        cp -r {params.tmp}/intermediate_contigs/* {params.intermediate_contigs}
    fi

    # Cleanup intermediate
    rm -rf {params.tmp}/intermediate_contigs

    # Sync tmp output to outdir before removing
    cp -r {params.tmp}/* {params.output_dir}
    rm -rf {params.tmp}
    mv {params.output_dir}/final.contigs.fa {params.output_dir}/final_contigs.fa
    """
script:
    "../scripts/assembly_utils.py"
shell:
    """
    bowtie2-build \
        --large-index \
        --threads {threads} \
        {params.prefix} \
        {params.prefix} > /dev/null 2>&1
    """
shell:
    """
    bowtie2 {params.setting} -p {threads} -x {params.prefix} -1 {input.R1} -2 {input.R2} 2> {log} \
        | samtools view -bh - | samtools sort - -o {params.temp_bam}
    samtools index {params.temp_bam}
    mv {params.temp_bam} {output.bam}
    mv {params.temp_bam}.bai {output.bai}
    """
shell:
    """
    bowtie2 {params.setting} -p {threads} -x {params.prefix} \
        -U {input.se} 2>{log} | samtools view -bh - | samtools sort - -o {params.temp_bam}
    samtools index {params.temp_bam}
    mv {params.temp_bam} {output.bam}
    mv {params.temp_bam}.bai {output.bai}
    """
shell:
    """
    for f in {input} ;
    do
        al=$(samtools \
            flagstat \
            $f | grep " mapped (" | cut -f2 -d '('| cut -f1 -d ' ')
        n=$(basename $f | sed 's/_[ps]e{params.post}.bam//g')
        echo -e "$n\t$al" >> {output}
    done
    """
script:
    "../scripts/assembly_utils.py"
shell:
    """
    jgi_summarize_bam_contig_depths \
        --outputDepth {output.depth} {input.bam} >{log} 2>&1
    """
shell:
    """
    metabat2 -i {input.fa} -a {input.depth} -m {wildcards.l} -t {threads} \
        -o {params.n} > {log} 2>&1
    """
shell:
    """
    set +e
    mkdir -p {params.dir}
    mkdir -p {params.tmp_dir}
    run_MaxBin.pl -markerset {params.markerset} -contig {input} \
        {params.reads} -min_contig_length {wildcards.l} -thread {threads} \
        -out {params.tmp_dir}/maxbin >{log} 2>{log}
    exitcode=$?
    if [ $exitcode -eq 255 ]; then
        exit 0
    else
        # Rename fasta files
        ls {params.tmp_dir} | grep ".fasta" | while read f;
        do
            mv {params.tmp_dir}/$f {params.dir}/${{f%.fasta}}.fa
        done
        # Move output from temporary dir
        ls {params.tmp_dir} | while read f;
        do
            mv {params.tmp_dir}/$f {params.dir}/
        done
    fi
    # Clean up
    rm -r {params.tmp_dir}
    """
shell:
    """
    for f in {input.bam} ;
        do
            n=$(basename $f);
            s=$(echo -e $n | sed 's/_[ps]e{params.p}.bam//g');
            echo $s;
        done > {params.samplenames}
    concoct_coverage_table.py \
        --samplenames {params.samplenames} \
        {input.bed} {input.bam} > {output.cov}
    rm {params.samplenames}
    """
shell:
    """
    cut_up_fasta.py -b {output.bed} -c 10000 -o 0 -m {input.fa} \
        > {output.fa} 2>{log}
    """
shell:
    """
    concoct -t {threads} --coverage_file {input.cov} -l {params.length} \
        --composition_file {input.fa} -b {params.basename}/ >/dev/null 2>&1
    """
shell:
    """
    merge_cutup_clustering.py {input[0]} > {output[0]} 2> {log}
    """
shell:
    """
    mkdir -p {params.tmp_dir}
    extract_fasta_bins.py {input[0]} {input[1]} \
        --output_path {params.tmp_dir} 2> {log}
    ls {params.tmp_dir} | egrep "[0-9].fa" | while read f;
    do
        mv {params.tmp_dir}/$f {params.dir}/concoct.$f
    done
    """
script:
    "../scripts/binning_utils.py"
script:
    "../scripts/binning_utils.py"
run:
    df=concatenate(input, index=-2)
    df.to_csv(output[0], sep="\t", index=True)
shell:
    """
    # Download
    curl -L https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz -o {params.tar} -s
    # Extract
    tar -C {params.dir} -xf {params.tar}
    # Set root
    checkm data setRoot {params.dir} > {log} 2>&1
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output.tsv}
        touch {output.ms}
    else
        checkm taxonomy_wf -t {threads} -x {params.suff} -q \
            --tab_table -f {output.tsv} \
            {params.rank} {params.taxon} {params.indir} {params.outdir} \
            > {log} 2>&1
        ln -s {params.taxon}.ms {output.ms}
    fi
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 0 ] ; then
        echo "NO BINS FOUND" > {output.tsv}
        touch {output.ms}
    else
        checkm lineage_wf -t {threads} --pplacer_threads {threads} \
            -x {params.suff} {params.tree} -q \
            --tab_table -f {output.tsv} \
            {params.indir} {params.outdir} \
            > {log} 2>&1
    fi
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output.tsv}
    else
        checkm qa -o 2 --tab_table -f {output.tsv} \
            {input.ms} {params.dir} > {log} 2>&1
    fi
    """
shell:
    """
    lines=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output}
    else
        checkm coverage -x fa -t {threads} {params.dir} \
            {output} {input.bam} > {log} 2>&1
    fi
    """
script:
    "../scripts/binning_utils.py"
shell:
    """
    lines=$(wc -l {input.stats} | cut -f1 -d ' ')
    cov_lines=$(wc -l {input.cov} | cut -f1 -d ' ')
    if [ $lines == 1 ] ; then
        echo "NO BINS FOUND" > {output}
    elif [ $cov_lines == 0 ] ; then
        echo "NO READS MAPPED" > {output}
    else
        checkm profile -f {output} --tab_table {input.cov} > {log} 2>&1
    fi
    """
run:
    df=concatenate(input, index=-3)
    df.to_csv(output.tsv, sep="\t", index=True)
run:
    df=concatenate(input, index=-3)
    df.to_csv(output.tsv, sep="\t", index=True)
shell:
    """
    curl -L -o {params.tar} {params.url} > {log} 2>&1
    tar xzf {params.tar} -C {params.dir} --strip 1 > {log} 2>&1
    """
shell:
    """
    bins=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $bins == 0 ] ; then
        echo "NO BINS FOUND" > {output}
    else
        export PYTHONPATH=$(which python)
        export GTDBTK_DATA_PATH={params.dbdir}
        gtdbtk classify_wf -x {params.suff} --out_dir {params.outdir} \
            --cpus {threads} --pplacer_cpus {threads} \
            --genome_dir {params.indir} > {log} 2>&1
    fi
    """
run:
    summaries=[]
    for f in input:
        gtdb_dir=os.path.dirname(f)
        for m in ["bac120", "ar122"]:
            summary=gtdb_dir+"/gtdbtk.{}.summary.tsv".format(m)
            if os.path.exists(summary):
                summaries.append(summary)
    df=concatenate(summaries, index=-3)
    df.to_csv(output.summary, sep="\t")
shell:
    """
    bins=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $bins == 0 ] ; then
        touch {output}
    else
        cat {params.gtdbtk_dir}/gtdbtk.*.summary.tsv | cut -f1 -d ';' | grep -v "user_genome" | \
            while read line;
            do
                d=$(echo -e "$line" | cut -f2)
                g=$(echo -e "$line" | cut -f1)
                if [ "$d" == "d__Bacteria" ]; then
                    k="bac"
                else
                    k="arc"
                fi
                barrnap --kingdom $k {params.indir}/$g.fa > {params.outdir}/out 2>>{log}
                lines=$(wc -l {params.outdir}/out | cut -f1 -d ' ')
                if [ $lines -gt 1 ]; then
                    egrep -v "^#" {params.outdir}/out | sed "s/$/;genome=$g/g" >> {output}
                else
                    touch {output}
                fi
            done
        rm {params.outdir}/out
    fi
    """
script:
    "../scripts/binning_utils.py"
shell:
    """
    bins=$(wc -l {input.tsv} | cut -f1 -d ' ')
    if [ $bins == 0 ] ; then
        touch {output}
    else
        echo -e "Name\ttRNA#\ttRNA_Begin\ttRNA_End\ttRNA_type\tAnti_Codon\tIntron_Begin\tIntron_End\tInf_Score\tNote\tBin_Id" > {output}
        cat {params.gtdbtk_dir}/gtdbtk.*.summary.tsv | cut -f1 -d ';' | grep -v "user_genome" | \
            while read line;
            do
                d=$(echo -e "$line" | cut -f2)
                g=$(echo -e "$line" | cut -f1)
                if [ "$d" == "d__Bacteria" ]; then
                    model="-B"
                else
                    model="-A"
                fi
                tRNAscan-SE $model --quiet --thread {threads} \
                    {params.indir}/$g.fa | tail -n +4 | sed "s/$/\t$g/g" >> {output}
            done
    fi
    """
script:
    "../scripts/binning_utils.py"
run:
    df=concatenate(input.trna, index=-3)
    df.to_csv(output.trna, sep="\t", index=True)
    df=concatenate(input.rrna, index=-3)
    df.to_csv(output.rrna, sep="\t", index=True)
script:
    "../scripts/binning_utils.py"
script:
    "../scripts/binning_utils.py"
shell:
    """
    fastANI --rl {input[0]} --ql {input[1]} -k {params.k} -t {threads} \
        --fragLen {params.frag_len} --minFraction {params.fraction} \
        --matrix -o {output[0]} > {log} 2>&1
    """
script:
    "../scripts/binning_utils.py"
shell:
     """
     mkdir -p {params.tmpdir}
     curl -L -o {params.tar} {params.url} > {log} 2>&1
     tar -C {params.tmpdir} -xf {params.tar}
     hashfile=$(find {params.tmpdir}/ -name "hash.k2d")
     hashdir=$(dirname $hashfile)
     mv $hashdir/* {params.dir}
     rm -r {params.tar} {params.tmpdir}
     """
shell:
    """
    kraken2-build --standard --db {params.dir} --threads {threads} > {log.build} 2>&1
    kraken2-build --clean {params.dir} > {log.clean} 2>&1
    """
shell:
    """
    kraken2 {params.mem} --db {params.db} --output {output[0]} \
        --report {output[1]} --gzip-compressed \
        --threads {threads} --paired {input.R1} {input.R2} > {log} 2>&1
    """
shell:
    """
    kraken2 {params.mem} --db {params.db} --output {output[0]} \
        --report {output[1]} --gzip-compressed \
        --threads {threads} {input.se} > {log} 2>&1
    """
shell:
    """
    curl -o {params.tar} {params.url} > {log} 2>&1
    tar -C {params.dir} -xf {params.tar} >>{log} 2>&1
    rm {params.tar}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    centrifuge -k {params.k} -x {params.prefix} -1 {input.R1} -2 {input.R2} \
        -S {params.tmp_out} -p {threads} --report-file {params.tmp_report} \
        > {log} 2>&1
    mv {params.tmp_out} {output[0]}
    mv {params.tmp_report} {output[1]}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    centrifuge -k {params.k} -U {input.se} -x {params.prefix} \
        -S {params.tmp_out} -p {threads} --report-file {params.tmp_report} \
        > {log} 2>&1
    mv {params.tmp_out} {output[0]}
    mv {params.tmp_report} {output[1]}
    """
shell:
    """
    centrifuge-kreport --min-score {params.min_score} -x {params.prefix} \
        {input.f} > {output[0]}
    """
shell:
    """
    metaphlan --install --bowtie2db {params.dir} \
        --nproc {threads} -x {params.index} >{log} 2>&1
    """
shell:
    """
    metaphlan {input.R1},{input.R2} --bowtie2db {params.dir} --add_viruses \
        --force --nproc {threads} --input_type fastq -o {output.tsv} \
         --bowtie2out {output.bt2} > {log} 2>&1
    """
shell:
    """
    metaphlan {input.se} --bowtie2db {params.dir} --add_viruses --force \
        --nproc {threads} --input_type fastq -o {output.tsv} \
         --bowtie2out {output.bt2} > {log} 2>&1
    """
shell:
    """
    merge_metaphlan_tables.py {input} > {output}
    """
script:
    "../scripts/classification_utils.py"
shell:
    """
    ktImportTaxonomy -t 1 -m 2 -o {output} -tax {params.dbdir} \
        {params.input_string} > {log} 2>&1
    """
shell:
    """
    ktUpdateTaxonomy.sh {params.taxdir} >{log} 2>&1
    """
shell:
    """
    ktImportTaxonomy -t 5 -m 3 -tax {params.tax} -o {output[0]} \
        {input[0]},{wildcards.sample}_{wildcards.unit} > {log} 2>&1
    """
shell:
     """
     ktImportTaxonomy \
        -t 5 -m 3 -tax {params.tax} -o {output[0]} \
        {params.input_string} > {log} 2>&1
     """
shell:
     """
     curl -L -s -o {params.tar} {params.url}
     tar -C {params.outdir} -xf {params.tar}
     rm {params.tar}
     """
shell:
     """
     seqtk sample -s {wildcards.s} {input} \
        {params.example_dataset_size} | gzip -c > {output}
     """
run:
    link(input[0], output[0])
shell:
     """
     curl -L -o {output} {params.url} > {log} 2>&1
     """
shell:
    """
    indexdb_rna --ref {input.fasta},{input.fasta} > {log} 2>&1
    """
shell:
    """
    mkdir -p {params.scratch}
    # Unzip to scratch dir
    gunzip -c {input.R1} > {params.R1_unzipped}
    gunzip -c {input.R2} > {params.R2_unzipped}
    # Merge
    merge-paired-reads.sh {params.R1_unzipped} {params.R2_unzipped} \
        {params.merged} >{log} 2>&1
    # Move output
    mv {params.merged} {output}
    # Clean up
    rm {params.R1_unzipped} {params.R2_unzipped}
    """
shell:
     """
     mkdir -p {params.scratch}
     # Run SortMeRNA
     sortmerna --blast 1 --log -v --fastx --ref {params.ref_string} \
        --reads {input.fastq} -a {threads} --{params.paired_strategy} \
        --aligned {params.aligned_prefix} --other {params.other_prefix} \
        {params.score_params} >{log} 2>&1

     mv {params.aligned_prefix}.fastq {output.aligned}
     mv {params.aligned_prefix}.log {log}
     mv {params.other_prefix}.fastq {output.other}
     """
shell:
    """
    mkdir -p {params.tmpdir}
    unmerge-paired-reads.sh {input.aligned} {params.R1} {params.R2} >{log} 2>&1
    gzip {params.R1}
    gzip {params.R2}
    mv {params.R1}.gz {output.R1}
    mv {params.R2}.gz {output.R2}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    unmerge-paired-reads.sh {input.other} {params.R1} {params.R2} >{log} 2>&1
    gzip {params.R1}
    gzip {params.R2}
    mv {params.R1}.gz {output.R1}
    mv {params.R2}.gz {output.R2}
    """
shell:
    """
    gunzip -c {input} > {output} 2>{log}
    """
shell:
    """
    mkdir -p {params.scratch}
    # Run SortMeRNA
    sortmerna --blast 1 --log -v --fastx --ref {params.ref_string} \
        --reads {input.fastq} -a {threads} --other {params.other_prefix} \
        --aligned {params.aligned_prefix} {params.score_params} \
        >{log} 2>&1

    mv {params.aligned_prefix}.fastq {output.aligned}
    mv {params.aligned_prefix}.log {log}
    mv {params.other_prefix}.fastq {output.other}
    """
shell:
    """
    gzip {input.fastq} 2>{log}
    """
shell:
    """
    gzip {input.fastq} 2>{log}
    """
run:
    link(input.R1, output.R1)
    link(input.R2, output.R2)
run:
    link(input.se, output.se)
shell:
    """
    trimmomatic PE \
        -threads {threads} \
        {input.R1} {input.R2} \
        {output.R1P} {output.R1U} \
        {output.R2P} {output.R2U} \
        {params.trim_string} \
        2>{log.R1log}
    sed \
        's/{wildcards.sample}_{wildcards.unit}_R1/{wildcards.sample}_{wildcards.unit}_R2/g' \
        {log.R1log} > {log.R2log}
    """
shell:
    """
    trimmomatic SE \
        -threads {threads} \
        {input} \
        {output} \
        {params.trim_string} \
        2>{log}
    """
shell:
    """
    cutadapt \
        {params.extra_params} \
        -e {params.error_rate} \
        -a {params.adapter} \
        -A {params.rev_adapter} \
        -o {output.fastq1} \
        -p {output.fastq2} \
        -j {threads} \
        {input.R1} {input.R2} > {log.R1log} 2>{log.err}
    cp {log.R1log} {log.R2log}
    """
shell:
    """
    cutadapt \
        {params.extra_params} \
        -e {params.error_rate} \
        -a {params.adapter} \
        -j {threads} \
        -o {output} {input} > {log}
    """
shell:
    """
    curl \
        -L \
        -o {output}.gz \
        -s \
        {params.url_base}/GCF_000819615.1_ViralProj14015_genomic.fna.gz \
        > {log} 2>&1
    gunzip {output}.gz
    """
shell:
    """
    bowtie2-build \
        --threads {threads} \
        {input.fasta} {params.prefix} >{log} 2>&1
    """
shell:
    """
    mkdir -p {params.tmp_out}
    bowtie2 \
        {params.setting} \
        -p {threads} \
        -x {params.prefix} \
        -1 {input.R1} \
        -2 {input.R2} \
        --un-conc-gz \
        {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_R%.filtered.fastq.gz > /dev/null 2>{log}
    mv {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_R1.filtered.fastq.gz {output.R1}
    mv {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_R2.filtered.fastq.gz {output.R2}
    """
shell:
    """
    mkdir -p {params.tmp_out}
    bowtie2 \
        {params.setting} \
        -p {threads} \
        -x {params.prefix} \
        --un-gz \
        {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_se.filtered.fastq.gz {input.se} > /dev/null 2>{log}
    mv {params.tmp_out}/{wildcards.sample}_{wildcards.unit}_se.filtered.fastq.gz {output.se}
    """
shell:
    """
    gunzip -c {input.R1} > {params.R1_intmp}
    gunzip -c {input.R2} > {params.R2_intmp}
    echo {params.R1_intmp} > {params.file_list}
    echo {params.R2_intmp} >> {params.file_list}
    fastuniq -i {params.file_list} -t q -o {params.R1_outtmp} \
        -p {params.R2_outtmp} >{log} 2>&1
    gzip -c {params.R1_outtmp} > {output.R1}
    gzip -c {params.R2_outtmp} > {output.R2}
    rm {params.R1_intmp} {params.R1_outtmp} {params.R2_intmp} {params.R2_outtmp}
    rm {params.file_list}
    """
run:
    link(input.se, output.se)
shell:
    """
    fastqc -q --noextract -o {params.dir} {input} >{log} 2>&1
    """
shell:
    """
    multiqc -f -c {params.config} -n samples_report.html \
        -o {params.output_dir} {input} >{log} 2>{log}
    """
script:
    "../scripts/quantification_utils.py"
shell:
    """
    mkdir -p {params.temp_dir}
    # Fix bam header
    samtools view -H {input} | egrep -v "^@PG" > {params.header}
    samtools reheader -P {params.header} {input} > {params.rehead_bam}
    # Set memory max
    mem="-Xmx$((6 * {threads}))g"
    java -Xms2g $mem -XX:ParallelGCThreads={threads} \
        -jar $CONDA_PREFIX/share/picard-*/picard.jar MarkDuplicates \
        I={params.rehead_bam} M={output[2]} O={params.temp_bam} REMOVE_DUPLICATES=TRUE \
        USE_JDK_DEFLATER=TRUE USE_JDK_INFLATER=TRUE ASSUME_SORT_ORDER=coordinate \
        PROGRAM_RECORD_ID=null ADD_PG_TAG_TO_READS=FALSE 2> {log}
    # Re sort the bam file using samtools
    samtools_threads="$(({threads} - 1))"
    samtools sort -@ $samtools_threads -o {params.temp_sort_bam} {params.temp_bam} > /dev/null 2>&1
    # Index the bam file
    samtools index {params.temp_sort_bam}
    mv {params.temp_sort_bam} {output[0]}
    mv {params.temp_sort_bam}.bai {output[1]}
    rm {params.temp_bam} {params.rehead_bam} {params.header}
    """
shell:
    """
    mkdir -p {params.tmpdir}
    featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id -M \
        {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
    """
script:
    "../scripts/quantification_utils.py"
script:
    "../scripts/quantification_utils.py"
script:
    "../scripts/edger.R"
script:
    "../scripts/quantification_utils.py"
script:
    "../scripts/edger.R"
script:
    "../scripts/metagenomeseq.R"
script:
    "../scripts/quantification_utils.py"
shell:
    """
    contigtax download taxonomy -t {params.taxdir} >{log} 2>&1
    """
shell:
    """
    contigtax download {wildcards.db} --tmpdir {params.tmpdir} \
        -d {params.dldir} --skip_idmap >{log} 2>{log}
    """
shell:
    """
    contigtax download idmap -d {params.dldir} > {log} 2>&1
    """
shell:
    """
    contigtax format -m {output.idmap} --tmpdir {params.tmpdir} \
        {input.fasta} {output.fasta} > {log} 2>&1
    """
shell:
    """
    contigtax format --tmpdir {params.tmpdir} {input.fasta} \
        {output.fasta} > {log} 2>&1
    """
shell:
    """
    # If an idmap file is available, use it to create an updated idmap file
    if [ -e {params.dir}/idmap.tsv.gz ] ; then
        contigtax update {input.idmap} {params.dir}/idmap.tsv.gz \
            {output.idmap} > {log} 2>&1
    # Otherwise, just create a symlink
    else
        cd {params.dir}
        ln -s $(basename {input.idmap}) $(basename {output.idmap})
    fi
    """
shell:
     """
     contigtax build -d {output} -p {threads} {input.fasta} \
        {input.idmap} {input.nodes} >{log} 2>&1
     """
shell:
    """
    contigtax search {params.settings} -p {threads} \
        --tmpdir {params.tmpdir} -l {params.min_len} \
        {input.fasta} {input.db} {output} >{log} 2>&1
    """
shell:
     """
     contigtax assign {params.settings} -p {threads} -m rank_lca \
        --reportranks {params.taxonomy_ranks} -t {params.taxdir} \
        {input.tsv} {output} > {log} 2>&1
     """
shell:
    """
    curl -L -v -o {output.lca}.gz {params.url} > {log} 2>&1
    grep filename {log} | cut -f2 -d ';' > {output.version}
    gunzip {output.lca}.gz
    """
shell:
    """
    sourmash compute --singleton --scaled {params.frac} \
        -k {params.k} -o {output} {input} > {log} 2>&1
    """
shell:
    """
    sourmash lca classify --db {input.db} --scaled {params.frac} \
        --query {input.sig} -o {output.csv} > {log} 2>&1
    """
script:
    "../scripts/taxonomy_utils.py"
script:
    "../scripts/taxonomy_utils.py"
import pandas as pd


def orf2feat(d, val_name="vals", regex=""):
    import re
    keys = []
    vals = []
    for key, val in d.items():
        for item in val.split(","):
            if regex != "":
                m = re.search(regex, item)
                if m == None:
                    continue
                item = m.group()
            keys.append(key)
            vals.append(item)
    df = pd.DataFrame(data={"orf": keys, val_name: vals})
    df.set_index("orf", inplace=True)
    return df


def parse_emapper(sm):
    from pandas.errors import EmptyDataError
    db = sm.wildcards.db
    db_att = {'kos': {'val_name': 'ko', 'regex': r"K\d{5}"},
              'pathways': {'val_name': 'pathway', 'regex': r"map\d{5}"},
              'modules': {'val_name': 'module', 'regex': ""},
              'enzymes': {'val_name': 'enzyme', 'regex': ""}}
    df = pd.read_csv(sm.input.annotations, sep="\t", index_col=0)
    df.rename(columns={'KEGG_ko': 'kos', 'KEGG_Pathway': 'pathways',
                          'KEGG_Module': 'modules', 'EC': 'enzymes'},
               inplace=True)
    df.fillna("-", inplace=True)
    d = df.loc[df[db] != "-", db].to_dict()
    how = "inner"
    try:
        info_df = pd.read_csv(sm.input.info, sep="\t", index_col=0)
    except EmptyDataError:
        info_df = pd.DataFrame()
        how = "right"
    annot = orf2feat(d, val_name=db_att[db]["val_name"],
                     regex=db_att[db]["regex"])
    annot = pd.merge(info_df, annot, left_index=True,
                     right_on=db_att[db]["val_name"], how=how)
    annot.to_csv(sm.output[0], sep="\t", index=True, header=True)


def parse_rgi(sm):
    annot = pd.read_csv(sm.input.txt, sep="\t", index_col=0)
    annot = annot.loc[:, ["Model_ID", "AMR Gene Family", "Resistance Mechanism"]]
    annot.loc[:, "Model_ID"] = ["RGI_{}".format(x) for x in annot.Model_ID]
    annot.rename(index=lambda x: x.split(" ")[0], inplace=True)
    annot.to_csv(sm.output.tsv, sep="\t", index=True)


def parse_pfam(sm):
    annot = pd.read_csv(sm.input[0], comment="#", header=None, sep=" +",
                        usecols=[0, 5, 7, 14], engine="python",
                        names=["orf", "pfam", "pfam_type", "pfam_clan"])
    clans = pd.read_csv(sm.input[1], header=None, names=["clan", "clan_name"],
                        usecols=[0, 3], sep="\t")
    info = pd.read_csv(sm.input[2], header=None,
                       names=["pfam", "clan", "pfam_name"], usecols=[0, 1, 4],
                       sep="\t")
    # Strip suffix for pfams
    annot.loc[:, "pfam"] = [x.split(".")[0] for x in annot.pfam]
    # Select unique orf->pfam mappings
    # TODO: This masks multiple occurrences of domains on the same orf. Figure out if this is wanted or not.
    # Merge with pfam info and clan info
    annot = annot.groupby(["orf", "pfam"]).first().reset_index()
    annot = pd.merge(annot, info, left_on="pfam", right_on="pfam")
    annot = pd.merge(annot, clans, left_on="clan", right_on="clan", how="left")
    annot.fillna("No_clan", inplace=True)
    annot = annot.loc[:, ["orf", "pfam", "pfam_name", "clan", "clan_name"]]
    annot.sort_values("orf", inplace=True)
    # Write to file
    annot.to_csv(sm.output[0], sep="\t", index=False)


def main(sm):
    toolbox = {"parse_pfam": parse_pfam,
               "parse_emapper": parse_emapper,
               "parse_rgi": parse_rgi}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
import pandas as pd
from Bio.SeqIO import parse
from pathlib import Path


# file io

def fasta2bed(sm):
    with open(sm.input[0], 'r') as fhin, open(sm.output[0], 'w') as fhout:
        for record in parse(fhin, "fasta"):
            fhout.write("{}\t{}\t{}\n".format(record.id, 0, len(record)))


# statistics

def store_lengths(f, minlen=False):
    """
    Reads lengths of contigs from fasta
    :param f: fasta file
    :param minlen: minimum length to store
    :return: pandas DataFrame of lengths
    """
    r = {}
    for record in parse(f, "fasta"):
        if minlen:
            if len(record.seq) < minlen:
                continue
        r[record.id] = len(record.seq)
    df = pd.DataFrame(r, index=["length"]).T
    return df


def size_distribute(df, lengths=None):
    """
    Calculates the distribution of an assembly in length bins

    For each <l> in <lengths> calculate for contigs >= <l>:
     n = the number of contigs
     s = the total length in bp
     p = the fraction of lengths / total assembly size
    :param df: pandas DataFrame of lengths
    :param lengths: intervals at which to calculate stats
    :return: pandas DataFrame
    """
    if lengths is None:
        lengths = [0, 100, 250, 500, 1000, 2500, 5000, 10000, 15000, 20000,
                   25000, 30000, 35000, 40000, 45000, 50000, 75000, 100000,
                   125000, 150000, 200000, 250000, 500000]
    size_dist = {}
    for i, l in enumerate(lengths):
        if len(df.loc[df.length >= l]) == 0:
            break
        n = len(df.loc[df.length >= l])
        s = int(df.loc[df.length >= l].sum())
        p = int(df.loc[df.length >= l].sum()) / float(df.sum()) * 100
        size_dist[i] = {"min_length": l, "num_contigs": n, "total_length": s,
                        "%": p}
    size_dist_df = pd.DataFrame(size_dist).T
    size_dist_df = size_dist_df[
        ["min_length", "num_contigs", "total_length", "%"]]
    return size_dist_df


def calculate_n_stats(df):
    """
    Calculates n50 and n90 statistics from a list of lengths
    :param df: pandas DataFrame of contig lengths
    :return:
    """
    df.sort_values("length", inplace=True, ascending=True)
    size = int(df.sum())
    N50_length = N90_length = 0
    cumulative = 0
    for contig in df.index:
        l = df.loc[contig, "length"]
        cumulative += l
        if float(cumulative) >= 0.5 * size and not N50_length:
            N50_length = l
        elif float(cumulative) >= 0.1 * size and not N90_length:
            N90_length = l
    return N50_length, N90_length


def calculate_length_stats(df):
    """
    Calculates length statistics from a dataframe
    :param df: pandas DataFrame with contig lengths
    :return:
    """
    contigs = len(df)
    total_size = int(df.sum())
    min_length = int(df["length"].min())
    max_length = int(df["length"].max())
    avg_length = float(df["length"].mean())
    median_length = float(df["length"].median())
    return contigs, total_size, min_length, max_length, avg_length, median_length


def generate_stat_df(contig_lengths):
    """
    Generates statistics from a dataframe of contig lengths
    :param contig_lengths: pandas DataFrame
    :return:
    """
    index = ["contigs", "total_size_bp", "min_length", "max_length",
             "avg_length", "median_length", "N50_length", "N90_length"]
    stat_items = calculate_length_stats(contig_lengths)
    n50_length, n90_length = calculate_n_stats(contig_lengths)
    stat_df = pd.DataFrame([stat_items[0], stat_items[1], stat_items[2],
                            stat_items[3], stat_items[4], stat_items[5],
                            n50_length, n90_length], index=index).T
    return stat_df


def stats(sm):
    """
    Reads a list of assembly fasta files and generates statistics

    :param sm: snakemake object
    :return:
    """
    stat_result = pd.DataFrame()
    sizedist_result = pd.DataFrame()
    for f in sm.input.fa:
        p = Path(f)
        name = p.parent.name
        contig_lengths = store_lengths(f)
        stat_df = generate_stat_df(contig_lengths)
        size_dist = size_distribute(contig_lengths)
        stat_df["assembly"] = [name]*len(stat_df)
        size_dist["assembly"] = [name]*len(size_dist)
        stat_result = pd.concat([stat_result, stat_df])
        sizedist_result = pd.concat([sizedist_result,size_dist])
    stat_result = stat_result[["assembly", "contigs", "total_size_bp",
                               "min_length", "max_length", "avg_length",
                               "median_length", "N50_length", "N90_length"]]
    stat_result.to_csv(sm.output[0], sep="\t", index=False)
    sizedist_result = sizedist_result[["assembly", "min_length",
                                       "num_contigs", "total_length", "%"]]
    sizedist_result.to_csv(sm.output[1], sep="\t", index=False)

# assembly input

def metaspades_input(sm):
    """
    Generates fastq files to use as input for metaspades assembler

    :param sm: snakemake object
    :return:
    """
    from common import rename_records
    files = {"R1": [], "R2": [], "se": []}
    assembly_dict = sm.params.assembly
    # Collect all files belonging to the assembly group
    for sample in assembly_dict.keys():
        for unit in assembly_dict[sample]:
            for pair in assembly_dict[sample][unit].keys():
                files[pair].append(
                    assembly_dict[sample][unit][pair][0])
    # Rename and concatenate reads (required for Metaspades)
    with open(sm.output.R1, 'w') as fh1, open(sm.output.R2, 'w') as fh2, open(
        sm.output.se, 'w') as fhse:
        i = 0
        for f in files["R1"]:
            f2 = files["R2"][i]
            fh1 = rename_records(f, fh1, i)
            fh2 = rename_records(f2, fh2, i)
            i += 1
        for i, f in enumerate(files["se"], start=i):
            fhse = rename_records(f, fhse, i)


def megahit_input(sm):
    """
    Genereate input lists for megahit assembler

    :param sm: snakemake object
    :return:
    """
    files = {"R1": [], "R2": [], "se": []}
    assembly_dict = sm.params.assembly
    for sample in assembly_dict.keys():
        for unit in assembly_dict[sample]:
            for pair in assembly_dict[sample][unit].keys():
                files[pair].append(assembly_dict[sample][unit][pair][0])
    with open(sm.output.R1, 'w') as fh1, \
        open(sm.output.R2, 'w') as fh2, \
        open(sm.output.se, 'w') as fhse:
        fh1.write(",".join(files["R1"]))
        fh2.write(",".join(files["R2"]))
        fhse.write(",".join(files["se"]))


def main(sm):
    toolbox = {"assembly_stats": stats,
               "generate_megahit_input": megahit_input,
               "generate_metaspades_input": metaspades_input,
               "fasta2bed": fasta2bed}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
import pandas as pd
from glob import glob
import os
from os.path import join as opj, basename, splitext
import numpy as np


# bin info

def contig_map(sm):
    """
    Generates a map of bin->contig id
    :param sm: snakemake object
    :return:
    """
    from Bio.SeqIO import parse
    files = glob(opj(sm.params.dir, "*.fa"))
    if len(files) == 0:
        with open(sm.output[0], 'w') as fhout:
            pass
        return
    # open the output file once so that the mappings from all bins are kept
    with open(sm.output[0], 'w') as fhout:
        for f in files:
            genome, _ = splitext(basename(f))
            with open(f, 'r') as fhin:
                for record in parse(fhin, "fasta"):
                    fhout.write("{}\t{}\n".format(genome, record.id))


# bin stats

def n50(lengths):
    """
    Calculate N50 stats
    :param lengths:
    :return:
    """
    cumulative = 0
    size = sum(lengths)
    for l in sorted(lengths):
        cumulative += l
        if float(cumulative) / size >= 0.5:
            return l


def bin_stats(f):
    """
    Generate bin statistics
    :param f:
    :return:
    """
    from Bio.SeqIO import parse
    size = 0
    gc = 0
    contig_lengths = []
    for record in parse(f, "fasta"):
        l = len(record)
        seq = record.seq.upper()
        g = seq.count("G")
        c = seq.count("C")
        gc += g + c
        size += l
        contig_lengths.append(l)
    gc_f = round(float(gc) / size * 100, 2)
    size_mb = size / 1000000
    mean_l = round(np.mean(contig_lengths), 2)
    median_l = round(np.median(contig_lengths), 2)
    min_l = np.min(contig_lengths)
    max_l = np.max(contig_lengths)
    n50_l = n50(contig_lengths)
    return {'bp': size, 'GC': gc_f, 'Mbp': round(size_mb, 2),
            'mean_contig': mean_l, 'median_contig': median_l,
            'min_contig': min_l, 'max_contig': max_l, 'n50': n50_l,
            'contigs': len(contig_lengths)}


def calculate_bin_stats(files):
    """
    Calls bin statistics for each file
    :param files:
    :return:
    """
    stats = {}
    for f in files:
        name = basename(f)
        name, _ = splitext(name)
        stats[name] = bin_stats(f)
    return stats


def binning_stats(sm):
    """
    Main function for calculating bin statistics
    :param sm:
    :return:
    """
    files = glob(opj(sm.params.dir, "*.fa"))
    if len(files) == 0:
        with open(sm.output[0], 'w') as fh:
            fh.write("No bins found\n")
        return
    else:
        stats = calculate_bin_stats(files)
        cols = ["bp", "Mbp", "GC", "contigs", "n50", "mean_contig",
                "median_contig", "min_contig", "max_contig"]
        df = pd.DataFrame(stats).T[cols]
        df.index.name = "bin"
        df.sort_values("bp", ascending=False, inplace=True)
        df.to_csv(sm.output[0], sep="\t", index=True, header=True)


# checkm utils

def remove_checkm_zerocols(sm):
    """
    Reads checkm coverage and removes samples with no reads mapped to contigs

    :param sm:
    :return:
    """
    df = pd.read_csv(sm.input[0], header=0, sep="\t")
    if df.shape[0] == 0:
        with open(sm.output[0], 'w') as fh:
            fh.write("NO BINS FOUND\n")
        return
    # base columns are independent of samples
    base_columns = ["Sequence Id", "Bin Id", "Sequence length (bp)"]
    # get all columns with mapped read counts
    cols = [x for x in df.columns if "Mapped reads" in x]
    # sum counts
    df_sum = df.loc[:, cols].sum()
    # get columns with zero reads mapped
    zero_cols = df_sum.loc[df_sum == 0].index
    cols_to_drop = []
    # find suffix of zero cols
    for c in list(zero_cols):
        suffix = ".{}".format(c.split(".")[-1])
        if suffix == ".Mapped reads":
            suffix = ""
        cols_to_drop += (
            "Mapped reads{s},Bam Id{s},Coverage{s}".format(s=suffix)).split(",")
    df_checked = df.drop(cols_to_drop, axis=1)
    # check that there are remaining columns
    diff_cols = set(df_checked.columns).difference(base_columns)
    if len(diff_cols) > 0:
        df_checked.to_csv(sm.output[0], sep="\t", index=False, header=True)
    else:
        with open(sm.output[0], 'w') as fhout:
            pass

# bin annotation

def count_rrna(sm):
    """
    Counts rRNA genes in bins
    :param sm:
    :return:
    """
    df = pd.read_csv(sm.input[0], sep="\t", usecols=[0, 2, 8], header=None,
                     names=["contig", "type", "fields"])
    # If empty dataframe, just write an empty file
    if df.shape[0] == 0:
        table = pd.DataFrame(columns=["5S_rRNA", "16S_rRNA", "23S_rRNA"])
        table.index.name = "Bin Id"
        table.to_csv(sm.output[0], sep="\t", header=True)
        return
    types = [x.split(";")[0].split("=")[-1] for x in df.fields]
    bins = [x.split(";")[-1].split("=")[-1] for x in df.fields]
    _df = pd.DataFrame(data={'rRNA_type': types, 'Bin_Id': bins})
    dfc = _df.reset_index().groupby(["Bin_Id", "rRNA_type"]).count()
    table = dfc.pivot_table(columns="rRNA_type", index="Bin_Id", fill_value=0)[
        "index"]
    table.index.name = table.columns.name = ""

    missing = set(["16S_rRNA", "23S_rRNA", "5S_rRNA"]).difference(table.columns)
    if len(missing) > 0:
        table = pd.merge(table, pd.DataFrame(columns=missing, index=table.index,
                                             data=0), left_index=True,
                         right_index=True)
    table = table.loc[:, ["5S_rRNA", "16S_rRNA", "23S_rRNA"]]
    table.index.name = "Bin_Id"
    table.to_csv(sm.output[0], sep="\t", index=True)


def count_trna(sm):
    """
    Counts tRNA genes in bins
    :param sm:
    :return:
    """
    df = pd.read_csv(sm.input[0], sep="\t")
    dfc = df.groupby(["tRNA_type", "Bin_Id"]).count().reset_index().loc[:,
          ["tRNA_type", "tRNA#", "Bin_Id"]]
    table = dfc.pivot_table(columns="tRNA_type", index="Bin_Id")[
        "tRNA#"].fillna(0)
    table.index.name = table.columns.name = ""
    table.to_csv(sm.output[0], sep="\t", index=True)

    total = {}
    for m in table.index:
        c = len(table.loc[m, table.loc[m] > 0])
        total[m] = c
    table = pd.DataFrame(total, index=["tRNAs"]).T
    table.index.name = "Bin_Id"
    table.to_csv(sm.output[1], sep="\t", index=True)


# genome clustering

def fetch_genome(ftp_base, outfile):
    """
    Performs the ftp fetching
    :param ftp_base: Base url to RefSeq/GenBank ftp for the genome
    :param outfile: Output filename
    :return:
    """
    from urllib import request
    ftp_base = ftp_base.rstrip("/")
    n = os.path.basename(ftp_base)
    fna = opj(ftp_base, "{}_genomic.fna.gz".format(n))
    r = request.urlretrieve(fna, outfile)
    return r


def unzip(z, o):
    """
    Unzips a zipped fasta file and removes the zipped file
    :param z: input gzip file
    :param o: output plain text file
    :return:
    """
    import gzip as gz
    with gz.open(z, 'rt') as fhin, open(o, 'w') as fhout:
        fhout.write(fhin.read())
    os.remove(z)


def download_ref_genome(sm):
    """
    Fetches reference genomes for clustering with fastANI
    :param sm: snakemake object
    :return:
    """
    fetch_genome(sm.params.ftp_base, "{}.gz".format(sm.output[0]))
    unzip("{}.gz".format(sm.output[0]), sm.output[0])


def generate_bin_list(input, outdir, min_completeness, max_contamination):
    """
    Generates a list of bins to use for fastANI
    Also symlinks each bin file into the fastANI folder
    :param input: List of checkm genome statistics files
    :param outdir: Output directory to store symlinks
    :param min_completeness: Minimum completeness level (%)
    :param max_contamination: Maximum contamination level (%)
    :return:
    """
    genomes = []
    for f in input:
        with open(f, 'r') as fh:
            if fh.readline().rstrip() == "NO BINS FOUND":
                continue
        items = f.split("/")
        # extract wildcards from file path
        binner, assembly, l = items[-5], items[-4], items[-3]
        bindir = os.path.dirname(os.path.dirname(f))
        # get absolute path for bin directory
        abs_in = os.path.abspath(bindir)
        # read the checkm summary file
        df = pd.read_csv(f, sep="\t", index_col=0)
        # filter to at least <min_completeness> and at most <max_contamination>
        df = df.loc[(df.Completeness >= min_completeness) & (df.Contamination <= max_contamination)]
        if df.shape[0] == 0:
            continue
        # generate a unique suffix for each bin
        uniq_suffix = "{assembly}.{l}".format(assembly=assembly, l=l)
        # make a map of the bin id and the unique suffix
        idmap = dict(zip(df.index,
                         ["{x}.{s}".format(x=x, s=uniq_suffix) for x in
                          df.index]))
        # create symlink in the output path for each bin id that points
        # to the original fasta file
        for bin_id, uniq_id in idmap.items():
            src = opj(abs_in, "{}.fa".format(bin_id))
            dst = opj(outdir, "{}.fa".format(uniq_id))
            if os.path.exists(dst):
                os.remove(dst)
            os.symlink(src, dst)
            genomes.append(dst)
    return genomes


def generate_ref_list(input, outdir):
    """
    Generates a list of reference genomes
    Also symlinks each reference file into the fastANI folder

    :param input:
    :param outdir:
    :return:
    """
    genomes = []
    for f in input:
        basename = os.path.basename(f)
        src = os.path.abspath(f)
        dst = opj(outdir, basename)
        if os.path.exists(dst):
            os.remove(dst)
        os.symlink(src, dst)
        genomes.append(dst)
    return genomes


def write_list(genomes, output):
    """
    Writes the fastANI lists to file
    :param genomes: list of genomes to use for fastANI
    :param output:
    :return:
    """
    with open(output, 'w') as fh:
        for g in genomes:
            fh.write("{}\n".format(g))
    return genomes


def generate_fastANI_lists(sm):
    """
    Main function for generating fastANI lists
    :param sm:
    :return:
    """
    bins = generate_bin_list(sm.input.bins, sm.params.outdir,
                             sm.params.completeness,
                             sm.params.contamination)
    refs = generate_ref_list(sm.input.refs, sm.params.outdir)
    genomes = bins + refs
    write_list(genomes, sm.output[0])
    genomes.reverse()
    write_list(genomes, sm.output[1])


def check_pairs(pairs, min_frags):
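    """
    Builds a dictionary of genome pairs sharing at least <min_frags> aligned
    fragments, used to decide which fastANI distances to keep
    :param pairs: pandas DataFrame with query, ref, ANI, aligned, total columns
    :param min_frags: minimum aligned fragments required to link two genomes
    :return: dictionary mapping each genome to genomes it may be compared with
    """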
    allowed_pairs = {}
    for i in pairs.index:
        q = pairs.loc[i, "query"]
        r = pairs.loc[i, "ref"]
        for key in [q, r]:
            if not key in allowed_pairs.keys():
                allowed_pairs[key] = []
        if pairs.loc[i, "aligned"] >= min_frags:
            allowed_pairs[q].append(r)
            allowed_pairs[r].append(q)
    return allowed_pairs


def fastani2dist(mat, txt, min_frags):
    """
    Converts the fastANI out.txt.matrix file to a pandas DataFrame
    :param mat: Distance matrix file
    :param txt: Pairwise output table with ANI and aligned + total fragments
    :param min_frags: Minimum aligned fragments to compare two genomes
    :return:
    """
    # read the pairwise table
    pairs = pd.read_csv(txt, header=None,
                        sep="\t",
                        names=["query", "ref", "ANI", "aligned", "total"])
    for key in ["query", "ref"]:
        pairs[key] = [x.split("/")[-1].replace(".fna", "").replace(".fa", "")
                      for x in pairs[key]]
    allowed_pairs = check_pairs(pairs, min_frags)
    genomes = list(pd.read_table(mat, index_col=0, skiprows=1, sep="\t",
                                 header=None, usecols=[0]).index)
    genomes = [os.path.splitext(os.path.basename(g))[0] for g in genomes]
    r = {}
    with open(mat, 'r') as fh:
        for i, line in enumerate(fh):
            if i == 0:
                continue
            line = line.rstrip()
            items = line.rsplit("\t")
            genome = os.path.splitext(os.path.basename(items[0]))[0]
            r[genome] = {}
            for j, item in enumerate(items[1:]):
                genome2 = genomes[j]
                # check that the pairing is allowed
                if item == "NA" or genome not in allowed_pairs[genome2] or genome2 not in allowed_pairs[genome]:
                    item = np.nan
                else:
                    item = float(item)
                r[genome][genome2] = item
    df = pd.DataFrame(r)
    df.fillna(0, inplace=True)
    return 1-df.div(100)


def cluster(linkage):
    """
    Cluster all genomes based on established linkages using networkx

    :param linkage: Dictionary of dictionaries
    :return: A dictionary with cluster index and list of genomes
    """
    import networkx as nx
    g = nx.from_dict_of_dicts(linkage)
    clustered = []
    clusters = {}
    clust_num = 1
    for n in g.nodes():
        if n in clustered:
            continue
        c = [n]
        edges = list(nx.dfs_edges(g, n))
        for e in edges:
            n1, n2 = e
            clustered += [n1, n2]
            c += [n1, n2]
        c = list(set(c))
        clusters[clust_num] = c[:]
        clust_num += 1
    return clusters


def generate_linkage(dist_mat, max_dist):
    """
    Create a nested dictionary linking genomes if their distance is within
    a certain threshold.

    :param dist_mat: pandas DataFrame with distances
    :param max_dist: maximum allowed distance to link genomes
    :return: a nested dictionary
    """
    linkage = {}
    for i in range(len(dist_mat.index)):
        g1 = dist_mat.index[i]
        if g1 not in linkage:
            linkage[g1] = {}
        for j in range(i + 1, len(dist_mat.columns)):
            g2 = dist_mat.columns[j]
            if g2 not in linkage:
                linkage[g2] = {}
            distance = dist_mat.iloc[i, j]
            if distance <= max_dist:
                linkage[g1][g2] = ""
                linkage[g2][g1] = ""
    return linkage


def cluster_genomes(sm):
    """
    Main function to run clustering of genomes

    :param sm: snakemake object
    :return:
    """
    dist = fastani2dist(sm.input.mat, sm.input.txt, sm.params.minfrags)
    linkage = generate_linkage(dist, sm.params.thresh)
    clusters = cluster(linkage)
    write_clusters(clusters, sm.output[0])


def write_clusters(clusters, outfile):
    """
    Sorts clusters by size and writes to file
    :param clusters: Dictionary of clusters with genomes as a list
    :param outfile: Output file path
    :return:
    """
    import operator
    # Calculate cluster sizes
    cluster_sizes = {}
    for clust_num, l in clusters.items():
        cluster_sizes[clust_num] = len(l)
    # Sort clusters by sizes
    sorted_clusters = sorted(cluster_sizes.items(), key=operator.itemgetter(1),
                             reverse=True)
    # Write table
    with open(outfile, 'w') as fh:
        for i, item in enumerate(sorted_clusters, start=1):
            old_num = item[0]
            for g in clusters[old_num]:
                fh.write("Cluster{}\t{}\n".format(i, g))


def main(sm):
    toolbox = {"contig_map": contig_map, "count_tRNA": count_trna,
               "remove_checkm_zerocols": remove_checkm_zerocols,
               "count_rRNA": count_rrna, "binning_stats": binning_stats,
               "download_ref_genome": download_ref_genome,
               "generate_fastANI_lists": generate_fastANI_lists,
               "cluster_genomes": cluster_genomes}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
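To make the distance-to-cluster step above concrete, here is a minimal standalone sketch (not part of the workflow) run on a made-up 3x3 distance matrix; the genome names and the 0.05 distance threshold (roughly 95% ANI) are illustrative only. The script builds a linkage dictionary and walks it with dfs_edges; the sketch uses a plain networkx Graph and connected_components, which yields the same grouping.

import pandas as pd
import networkx as nx

# made-up distance matrix (1 - ANI/100) for two bins and one reference
genomes = ["binA", "binB", "refC"]
dist = pd.DataFrame([[0.00, 0.02, 0.30],
                     [0.02, 0.00, 0.28],
                     [0.30, 0.28, 0.00]],
                    index=genomes, columns=genomes)
max_dist = 0.05

# link genomes whose pairwise distance is within the threshold
g = nx.Graph()
g.add_nodes_from(genomes)
for i, g1 in enumerate(dist.index):
    for j in range(i + 1, len(dist.columns)):
        if dist.iloc[i, j] <= max_dist:
            g.add_edge(g1, dist.columns[j])

# clusters are the connected components of the linkage graph
clusters = {n: sorted(c)
            for n, c in enumerate(nx.connected_components(g), start=1)}
print(clusters)  # {1: ['binA', 'binB'], 2: ['refC']}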
import pandas as pd


def metaphlan2krona(sm):
    """
    Creates a krona-input table from metaphlan (standard format) results
    :param sm: snakemake object
    :return:
    """
    df = pd.read_csv(sm.input[0], sep="\t", header=None, comment="#",
                     index_col=0, usecols=[0, 1, 2],
                     names=["lineage", "taxids", "%"])
    # Extract species level
    df_sp = df.loc[df.index.str.contains("s__")]
    # Create new dataframe with taxids \t %
    data = dict(zip([x.split("|")[-1] for x in df_sp.taxids], df_sp["%"]))
    df_out = pd.DataFrame(data, index=["%"]).T
    df_out.to_csv(sm.output[0], sep="\t", index=True, header=False)


def main(sm):
    toolbox = {"metaphlan2krona_table": metaphlan2krona}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
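A toy illustration of the metaphlan2krona transformation above (the lineage strings, taxids and percentages below are shortened, made-up values, not real MetaPhlAn output): keep species-level rows, take the last taxid in the pipe-separated lineage, and write taxid and percent for Krona.

import pandas as pd

# two made-up rows mimicking the three columns read from MetaPhlAn output
df = pd.DataFrame(
    {"taxids": ["2|1239", "2|1239|186801|1000"],
     "%": [60.0, 60.0]},
    index=["k__Bacteria|p__Firmicutes",
           "k__Bacteria|p__Firmicutes|g__Foo|s__Foo_bar"])

# keep species-level rows and write taxid <TAB> percent for Krona
df_sp = df.loc[df.index.str.contains("s__")]
data = dict(zip([x.split("|")[-1] for x in df_sp.taxids], df_sp["%"]))
out = pd.DataFrame(data, index=["%"]).T
print(out.to_csv(sep="\t", header=False))  # -> 1000<TAB>60.0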
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
source("workflow/scripts/common.R")

library(edgeR)
method <- snakemake@params$method
input <- snakemake@input[[1]]
output <- snakemake@output[[1]]

# Read the counts
x <- read.delim(input, row.names = 1, sep = "\t", header = TRUE)
# Get sample names
sample_names <- colnames(x)[unlist(lapply(x, is.numeric))]
# Extract row names
xrownames <- row.names(x)[row.names(x)!="Unclassified"]
# Get info names
info_names <- colnames(x)[unlist(lapply(x, is.character))]
# Remove unclassified
if ("Unclassified" %in% row.names(x)){
    x <- x[row.names(x)!="Unclassified", ]
}

x <- as.data.frame(x, row.names=xrownames)
colnames(x) <- append(info_names, sample_names)

x_num <- process_data(x, output)

if (method %in% c("TMM", "RLE")) {
    # Create DGE
    obj <- DGEList(x_num)
    # Calculate norm factors
    obj <- calcNormFactors(obj, method = method)
    # Calculate cpms
    norm <- cpm(obj, normalized.lib.sizes = TRUE)
} else if (method == "RPKM") {
    # Extract gene length column
    gene_length <- x_num$Length
    x_num <- x_num[, colnames(x_num) != "Length"]
    obj <- DGEList(x_num)
    # Calculate norm factors
    obj <- calcNormFactors(obj, method = "TMM")
    # Calculate RPKM
    norm <- rpkm(obj, normalized.lib.sizes = TRUE, gene.length = gene_length)
}

# Add info columns back
norm <- cbind(str_cols(x), norm)
if (length(info_names)>0) {
    colnames(norm) <- append(info_names, sample_names)
}
# Convert back to a data frame
norm <- as.data.frame(norm)
row.names(norm) <- xrownames

write.table(x = norm, file = output, quote = FALSE, sep="\t")
R, from line 3 of scripts/edger.R
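For reference, a rough Python sketch (with made-up counts and gene lengths) of the CPM and RPKM arithmetic that underlies the edgeR calls above; edgeR additionally rescales the library sizes by the TMM/RLE normalization factors, which is not reproduced here.

import pandas as pd

# hypothetical counts for two genes in two samples
counts = pd.DataFrame({"sample1": [100, 900], "sample2": [10, 90]},
                      index=["gene1", "gene2"])
lengths = pd.Series([500, 2000], index=counts.index)  # gene lengths in bp

# counts per million: scale each sample by its library size
cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
# RPKM: additionally scale by gene length in kb
rpkm = cpm.div(lengths / 1e3, axis=0)
print(cpm)
print(rpkm)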
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(metagenomeSeq)
source("workflow/scripts/common.R")

method <- snakemake@params$method
input <- snakemake@input[[1]]
output <- snakemake@output[[1]]
normalize <- TRUE
# Read the counts
x <- read.delim(input, row.names = 1, sep = "\t", header = TRUE)
# Get sample names
sample_names <- colnames(x)[unlist(lapply(x, is.numeric))]
# Extract row names
xrownames <- row.names(x)[row.names(x)!="Unclassified"]
# Get info names
info_names <- colnames(x)[unlist(lapply(x, is.character))]

# Remove unclassified
if ("Unclassified" %in% row.names(x)){
    x <- x[row.names(x)!="Unclassified", ]
}

x <- as.data.frame(x, row.names=xrownames)
colnames(x) <- append(info_names, sample_names)

# Returns a vector in the case of 1 sample only
x_num <- process_data(x, output)

# If only one sample, set normalize=FALSE
if (length(sample_names) == 1) {
    print("ONLY ONE SAMPLE! WILL NOT RUN CSS")
    normalize <- FALSE
    x_num <- as.data.frame(x_num, row.names = rownames(x))
    colnames(x_num) <- sample_names
}

# Turn data into a new MRexperiment object
obj <- newMRexperiment(x_num)
too_few_features <- FALSE
# In cases with only one sample, check whether there are enough features to run
# CSS
if (is.null(ncol(x_num))) {
    if (length(x_num) <= 1) {
        too_few_features <- TRUE
    }
} else { # Do the same type of check for many samples
    smat <- lapply(1:ncol(x_num), function(i) {
        sort(x_num[which(x_num[, i]>0),i], decreasing = TRUE)
    })
    if (any(sapply(smat,length)==1)) {
        too_few_features <- TRUE
    }
}
if (too_few_features == TRUE) {
    fh <- file(snakemake@output[[1]])
    writeLines(c("WARNING: Sample with one or zero features",
                 "Cumulative Sum Scaling failed for sample"), fh)
    close(fh)
    quit()
}

# Normalize
norm <- MRcounts(obj, norm = normalize)

# Add info columns back
norm <- cbind(str_cols(x), norm)
colnames(norm) <- append(info_names, sample_names)

# Convert back to a data frame
norm <- as.data.frame(norm)

# Set sample names
colnames(norm)[unlist(lapply(norm, is.numeric))] <- sample_names

# Write output
write.table(x = norm, file = output, quote = FALSE, sep="\t")
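A conceptual Python sketch of cumulative sum scaling (CSS) with made-up counts and a fixed quantile (p = 0.5). metagenomeSeq chooses the quantile per dataset via cumNormStat and scales by sl = 1000 in MRcounts; this sketch is only meant to illustrate the idea, not to reproduce the package.

import pandas as pd

counts = pd.DataFrame({"s1": [10, 0, 5, 100], "s2": [1, 2, 3, 50]},
                      index=["f1", "f2", "f3", "f4"])

def css_factor(col, p=0.5):
    # sum of counts at or below the p-th quantile of the non-zero counts
    nonzero = col[col > 0]
    q = nonzero.quantile(p)
    return nonzero[nonzero <= q].sum()

factors = counts.apply(css_factor)
norm = counts.div(factors, axis=1) * 1000
print(norm)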
import pandas as pd


def write_featurefile(sm, score=".", group="gene_id", phase="."):
    """
    Takes a prodigal GFF file and turns it into a file for use with
    featureCounts
    :param sm: snakemake object
    :param score: dummy score column
    :param group: identifier for featurecounts
    :param phase: dummy phase column
    :return:
    """
    with open(sm.input[0], 'r') as fhin, open(sm.output[0], 'w') as fhout:
        for line in fhin:
            line = line.rstrip()
            if line.startswith("#"):
                continue
            items = line.split("\t")
            contig = items[0]
            source = items[1]
            method = items[2]
            start = items[3]
            stop = items[4]
            _strand = items[6]
            geneid = items[8]
            if _strand in ['+1', '1', '+']:
                strand = '+'
            else:
                strand = '-'
            gene_id = "{} {}\n".format(group, geneid.split(";")[0].split("=")[-1])
            fhout.write("\t".join([contig, source, method, start, stop, score,
                                   strand, phase, gene_id]))
    return


def clean_featurecount(sm):
    """
    This cleans the featureCounts output table from format:
    # Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "etc."
    Geneid  Chr         Start   End     Strand  Length  path/to/bam/sample.bam
    1_1     k141_7581   1       459     -       459     1
    2_1     k141_0      469     714     +       246     2

    To format:
    gene_id         Length  sample
    k141_7581_1     459     1
    k141_0_1        246     2
    """
    df = pd.read_csv(sm.input[0], comment="#", sep="\t")
    # Extract gene number and combine with contig id
    df["gene_num"] = [x[1] for x in df.Geneid.str.split("_")]
    df.set_index(df.Chr.map(str) + "_" + df.gene_num, inplace=True)
    df.drop("gene_num", axis=1, inplace=True)
    df.index.name = 'gene_id'
    # Set sample and unit name from wildcards
    sample_unit = "{sample}_{unit}".format(sample=sm.wildcards.sample,
                                           unit=sm.wildcards.unit)
    df.columns = list(df.columns)[0:-1] + [sample_unit]
    # Extract length and counts
    df = df.loc[:, ["Length", sample_unit]]
    df.to_csv(sm.output[0], sep="\t")


def aggregate_featurecount(sm):
    """
    Aggregates cleaned featureCounts tables into one table per assembly

    :param sm: snakemake object
    :return:
    """
    df = pd.DataFrame()
    lmap = {}
    for f in sm.input:
        _df = pd.read_csv(f, sep="\t", index_col=0)
        lmap.update(_df.to_dict()["Length"])
        _df.drop("Length", axis=1, inplace=True)
        df = pd.merge(df, _df, right_index=True, left_index=True, how="outer")
    counts = pd.merge(df, pd.DataFrame(lmap, index=["Length"]).T,
                      left_index=True, right_index=True)
    counts.to_csv(sm.output[0], sep="\t")


def process_and_sum(q_df, annot_df):
    # Merge annotations and abundance
    # keep ORFs without annotation as "Unclassified"
    annot_q_df = pd.merge(annot_df, q_df, left_index=True, right_index=True,
                          how="right")
    annot_q_df.fillna("Unclassified", inplace=True)
    feature_cols = annot_df.columns
    annot_q_sum = annot_q_df.groupby(list(feature_cols)).sum().reset_index()
    annot_q_sum.set_index(feature_cols[0], inplace=True)
    return annot_q_sum


def sum_to_features(abundance, parsed):
    parsed_df = pd.read_csv(parsed, index_col=0, sep="\t")
    abundance_df = pd.read_csv(abundance, index_col=0, sep="\t")
    abundance_df.drop("Length", axis=1, inplace=True, errors="ignore")
    feature_sum = process_and_sum(abundance_df, parsed_df)
    return feature_sum


def count_features(sm):
    """
    Counts reads mapped to features such as KOs, PFAMs etc.

    :param sm:
    :return:
    """
    feature_sum = sum_to_features(sm.input.abund, sm.input.annot)
    feature_sum.to_csv(sm.output[0], sep="\t")


def sum_to_taxa(sm):
    """
    Takes taxonomic assignments per orf and abundance values (tpm or raw) and
    returns abundance values summed to unique combinations of ranks

    :param sm: snakemake object
    :return:
    """
    header = ["protein", "superkingdom", "phylum", "class", "order", "family",
              "genus", "species"]
    df = pd.read_csv(sm.input.tax[0], sep="\t", index_col=0, header=None,
                     names=header)
    abund_df = pd.read_csv(sm.input.abund, header=0, index_col=0, sep="\t")
    # Remove length column
    abund_df.drop("Length", axis=1, inplace=True, errors="ignore")
    taxa_abund = pd.merge(df, abund_df, right_index=True, left_index=True)
    taxa_abund_sum = taxa_abund.groupby(header[1:]).sum().reset_index()
    taxa_abund_sum.to_csv(sm.output[0], sep="\t", index=False)


def main(sm):
    toolbox = {"write_featurefile": write_featurefile,
               "clean_featurecount": clean_featurecount,
               "aggregate_featurecount": aggregate_featurecount,
               "count_features": count_features,
               "sum_to_taxa": sum_to_taxa}

    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
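As a standalone illustration of the clean_featurecount step, using the same values as the example in its docstring (the bam column is kept under its original name here; in the workflow it is renamed to {sample}_{unit}):

import pandas as pd

df = pd.DataFrame({"Geneid": ["1_1", "2_1"],
                   "Chr": ["k141_7581", "k141_0"],
                   "Length": [459, 246],
                   "path/to/bam/sample.bam": [1, 2]})
# combine the contig id with the gene number to form the gene_id index
df["gene_num"] = [x[1] for x in df.Geneid.str.split("_")]
df.set_index(df.Chr.map(str) + "_" + df.gene_num, inplace=True)
df.index.name = "gene_id"
df = df.loc[:, ["Length", "path/to/bam/sample.bam"]]
print(df)
#              Length  path/to/bam/sample.bam
# gene_id
# k141_7581_1     459                       1
# k141_0_1        246                       2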
import sys
import pandas as pd

def add_lower(df, ranks):
    """
    Propagates assignments from higher to lower taxonomic ranks,
    and adds a 'Unclassified.' prefix.
    :param df: pandas DataFrame
    :param ranks: ranks for which to propagate
    :return:
    """
    for i in df.index:
        last_known = df.loc[i, ranks[0]]
        for rank in ranks[1:]:
            if df.loc[i, rank] != "Unclassified":
                last_known = df.loc[i, rank]
            else:
                if last_known == "Unclassified":
                    df.loc[i, rank] = last_known
                else:
                    df.loc[i, rank] = "Unclassified.{}".format(last_known)
    return df


def contigtax_mash(sm):
    # Keep stats on assignments
    # resolved = cases where sourmash helped resolve assignments
    # transferred = cases where blast-based assignments were overwritten
    # added = cases where assignments from sourmash were added
    # total = total number of contigs
    stats = {'resolved': 0, 'transferred': 0, 'added': 0, 'total': 0}
    df1 = pd.read_csv(sm.input.smash, sep=",", header=0, index_col=0)
    stats['total'] = df1.shape[0]
    df2 = pd.read_csv(sm.input.contigtax[0], sep="\t", header=0, index_col=0)
    ranks = list(df2.columns)
    ranks.reverse()
    # Only use subset of contigs with matches
    df1 = df1.loc[df1["status"] == "found", df2.columns]
    df1.fillna("Unclassified", inplace=True)

    # Get common set of contigs
    common = set(df1.index).intersection(set(df2.index))
    for contig in common:
        s = df1.loc[contig]
        b = df2.loc[contig]
        for rank in ranks:
            # If sourmash has an assignment at this rank
            if s[rank] != "Unclassified":
                # If blast-based contains 'Unclassified',
                # mark contig as resolved
                if "Unclassified" in b[rank]:
                    stats['resolved'] += 1
                # Otherwise, mark contig as transferred
                else:
                    stats['transferred'] += 1
                # As soon as a contig has been transferred or resolved
                # we can stop the merge
                df2.loc[contig] = df1.loc[contig]
                break
            # If sourmash does not have an assignment at this rank
            else:
                # but blast-based does have an assignment,
                # then the blast-based is more resolved and we can stop
                # trying to merge
                if "Unclassified" not in b[rank]:
                    break
    # Get contigs in sourmash missing from blast
    missing1 = set(df1.index).difference(set(df2.index))
    if len(missing1) > 0:
        stats['added'] += len(missing1)
        df2 = pd.concat([df2, df1.loc[missing1]])
    df2 = add_lower(df2, df2.columns)
    df2.to_csv(sm.output[0], sep="\t")
    # Write to log
    with open(sm.log[0], 'w') as fhout:
        fhout.write("Total:       {}\n".format(stats['total']))
        fhout.write("Resolved:    {}\n".format(stats['resolved']))
        fhout.write("Transferred: {}\n".format(stats["transferred"]))
        fhout.write("Added:       {}\n".format(stats['added']))


def contigtax_assign_orfs(sm):
    """
    Transfers taxonomic assignments from contigs down to ORFs called on contigs
    :param sm: snakemake object
    :return:
    """
    gff_df = pd.read_csv(sm.input.gff, header=None, sep="\t", comment="#",
                         usecols=[0, 8], names=["contig", "id"])
    # Extract ORF ids by combining the contig id with the gene number from
    # the GFF attribute field
    ids = ["{}_{}".format(gff_df.loc[i, "contig"],
                          gff_df.loc[i, "id"].split(";")[0].split("_")[-1])
           for i in gff_df.index]
    gff_df.loc[:, "id"] = ids
    # Read taxonomy for contigs
    tax_df = pd.read_csv(sm.input.tax, header=0, sep="\t", index_col=0)
    # Merge dataframes
    orf_tax_df = pd.merge(gff_df, tax_df, left_on="contig",
                          right_index=True, how="outer")
    # When using 'outer' merging there may be contigs with a tax assignment
    # but no called ORF. Drop these rows, i.e. rows where 'id' is NaN
    # (NaN != NaN).
    orf_tax_df = orf_tax_df.loc[orf_tax_df["id"] == orf_tax_df["id"]]
    # Set Unclassified for NA values
    orf_tax_df.fillna("Unclassified", inplace=True)
    # Set index to ORF ids
    orf_tax_df.set_index("id", inplace=True)
    orf_tax_df.drop("contig", axis=1, inplace=True)
    orf_tax_df.to_csv(sm.output.tax[0], sep="\t", index=True, header=True)


def main(sm):
    toolbox = {"merge_contigtax_sourmash": contigtax_mash,
               "contigtax_assign_orfs": contigtax_assign_orfs}
    toolbox[sm.rule](sm)


if __name__ == "__main__":
    main(snakemake)
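A small demonstration of add_lower() on a made-up two-contig table, assuming the add_lower function above is in scope (for example pasted into the same session); the contig and taxon names are illustrative only.

import pandas as pd

df = pd.DataFrame(
    {"superkingdom": ["Bacteria", "Unclassified"],
     "phylum": ["Proteobacteria", "Unclassified"],
     "class": ["Unclassified", "Unclassified"]},
    index=["contig1", "contig2"])

print(add_lower(df, ["superkingdom", "phylum", "class"]))
# contig1: class becomes "Unclassified.Proteobacteria"
# contig2: stays "Unclassified" at every rank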