Pipeline to find unique primers for genome bins in metagenomes

This pipeline designs primers that are unique to a genome or genome bin after metagenome assembly and binning. The primers can then be used to verify that the genome or genome bin is real by running PCR on the same DNA extract that the metagenome was sequenced from. Another use case is matching cultured isolates to genome bins.

Authors

Jiarong Guo (@jiarong)

Usage

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.

In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, if available, its DOI (see above).

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config.yaml.

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake -n

Execute the workflow locally via

snakemake --cores $N

using $N cores or run it in a cluster environment via

snakemake --cluster qsub --jobs 100

or

snakemake --drmaa --jobs 100

See the Snakemake documentation for further details.

Testing

Code Snippets

shell: """
    cd {wildcards.sample} \
    && mkdir -p maxbin \
    && run_MaxBin.pl -thread {threads} \
        -max_iteration 100 \
        -contig {wildcards.pid}_contigs.fasta.gz \
        -abund {wildcards.pid}_contigcoverage.tsv \
        -out maxbin/maxbin
"""

SnakeMake SqueezeMeta MaxBin From line 38 of master/Snakefile

shell: """
    cd {wildcards.sample}
    rm -rf mb.cm.tmpdir
    checkm lineage_wf -x fasta \
        -f mb.cm.out \
        -t {threads} \
        maxbin \
        mb.cm.tmpdir \
      || rm -rf mb.cm.tmpdir
    checkm tree_qa mb.cm.tmpdir -o 1 -f mb.cm.tax --tab_table
"""

SnakeMake CheckM MaxBin From line 59 of master/Snakefile

shell: """
    cd {wildcards.sample}
    ## gather
    mash screen -p {threads} {DBDIR}/refseq.genomes.k21s1000.msh {input} \
      | sort -g -r > mb.mashscreen.tsv
    mash screen -w -p {threads} {DBDIR}/refseq.genomes.k21s1000.msh {input} \
      | sort -g -r > mb.mashscreenw.tsv
"""

SnakeMake Mash From line 79 of master/Snakefile

shell: """
    # only analyze the TOPN most abundant bins (e.g. the top 10)
    if [ {wildcards.idx} -gt {TOPN} ]; then exit 0; fi
    cd {wildcards.sample}/maxbin
    ## search
    # -m: minimum kmer count, 1 for genome or bins, >2 for reads
    f=maxbin.{wildcards.idx}.fasta
    mash sketch -p {threads} -k 21 -m 1 $f
    # sort -k 3 --> pick 3rd col as key for sorting
    mash dist -p {threads} {DBDIR}/refseq.genomes.k21s1000.msh $f.msh \
      | sort -g -k 3 \
      > $f.mashdist.tsv
"""

SnakeMake Mash MaxBin From line 95 of master/Snakefile
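
The `mash dist` table written by the snippet above is tab-separated; according to the Mash documentation its columns are reference ID, query ID, Mash distance, p-value, and shared hashes (for example `950/1000`), which is why the snippet sorts on column 3. The following is a minimal Python sketch for pulling the closest references out of such a file. The names `MashHit` and `top_mash_hits` and the sample rows are illustrative assumptions, not part of the pipeline.

```python
import csv
import io
from typing import List, NamedTuple


class MashHit(NamedTuple):
    reference: str
    query: str
    distance: float
    p_value: float
    shared_hashes: str


def top_mash_hits(handle, n: int = 5) -> List[MashHit]:
    """Return the n hits with the smallest Mash distance."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        ref, query, dist, pval, shared = row[:5]
        hits.append(MashHit(ref, query, float(dist), float(pval), shared))
    return sorted(hits, key=lambda h: h.distance)[:n]


# Made-up rows in the five-column layout, for illustration only.
example = io.StringIO(
    "refA.fna\tmaxbin.001.fasta\t0.25\t1e-10\t10/1000\n"
    "refB.fna\tmaxbin.001.fasta\t0.01\t0\t800/1000\n"
)
best = top_mash_hits(example, n=1)
print(best[0].reference, best[0].distance)
```

In practice the handle would be an opened `maxbin.NNN.fasta.mashdist.tsv` file rather than an in-memory string.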

shell: """
    cd {wildcards.sample}
    ## gather
    refseq_masher contains {input} \
    -o mb.rmcontain.csv \
    --output-type csv \
    -i 0.5 \
    -p {threads}
"""

SnakeMake refseq_masher From line 117 of master/Snakefile

shell: """
    # only analyze the TOPN most abundant bins (e.g. the top 10)
    if [ {wildcards.idx} -gt {TOPN} ]; then exit 0; fi
    cd {wildcards.sample}/maxbin
    ## search
    # -n: number of top matches to report
    # -m: minimum kmer count, 1 for genome or bins, >2 for reads
    refseq_masher matches maxbin.{wildcards.idx}.fasta \
        -o maxbin.{wildcards.idx}.fasta.rmmatch.csv \
        --output-type csv \
        -p {threads} \
        -n 10 \
        -m 1
"""

SnakeMake MaxBin refseq_masher From line 134 of master/Snakefile

shell: """
    if [ {wildcards.idx} -gt {TOPN} ]; then exit 0; fi
    cd {wildcards.sample}/maxbin
    Seqfile_list=(maxbin.???.fasta)
    f=maxbin.{wildcards.idx}.fasta
    Other=${{Seqfile_list[@]/${{f}}}}
    python {SCRIPTDIR}/get-uniqkmer.py $f \
        --ref $Other \
        -k {K} \
        -x 4e9 -N 4 \
        --x2 1e8 --N2 4 \
      > $f.uniqkmer \
      || rm $f.uniqkmer
"""

SnakeMake MaxBin From line 153 of master/Snakefile
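
The snippet above passes one bin as the target and all remaining bins as `--ref`, so the output should contain k-mers found only in the target bin. The internals of `get-uniqkmer.py` are not shown here, so the following is only a conceptual sketch using exact Python sets. Treating a k-mer and its reverse complement as the same k-mer (the `canonical` helper) is an assumption, and all function names are hypothetical.

```python
from typing import Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq: str, k: int) -> Set[str]:
    """Collect canonical k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            found.add(canonical(kmer))
    return found


def unique_kmers(target: Iterable[str], others: Iterable[str], k: int) -> Set[str]:
    """K-mers present in the target sequences but in none of the others."""
    target_set: Set[str] = set()
    for seq in target:
        target_set |= kmers(seq, k)
    other_set: Set[str] = set()
    for seq in others:
        other_set |= kmers(seq, k)
    return target_set - other_set


# Toy sequences, for illustration only.
result = unique_kmers(["ACGTTGCA"], ["ACGTAAAA"], k=4)
print(sorted(result))
```

Exact sets are fine for toy input but do not scale to whole metagenomes, which is presumably why the pipeline's scripts take table-size options such as `-x` and `-N`.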

shell: """
    cat {input} \
      | python {SCRIPTDIR}/get-uniqkmer-nothread.py - \
        --load {DBDIR}/refseq.20.bf \
        -k {K} --x2 1e6 --N2 4  \
      > {wildcards.sample}/maxbin/combined_uniqkmer.uniq2ref \
      || rm {wildcards.sample}/maxbin/combined_uniqkmer.uniq2ref

"""

SnakeMake From line 177 of master/Snakefile
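
The `--load {DBDIR}/refseq.20.bf` argument and the table options (`-x`, `-N`, `--x2`, `--N2`) suggest that the reference k-mers are held in a Bloom-filter-like structure, although that is an assumption because the script itself is not shown. As a rough illustration of the idea, the sketch below implements a small Bloom filter with `num_tables` hash tables of `table_size` slots each and uses it to discard candidate k-mers that may occur in a reference. The class and its parameters are hypothetical and use one byte per slot for simplicity.

```python
import hashlib
from typing import Iterator, Tuple


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, table_size: int, num_tables: int) -> None:
        self.table_size = table_size
        self.num_tables = num_tables
        self.tables = [bytearray(table_size) for _ in range(num_tables)]

    def _positions(self, item: str) -> Iterator[Tuple[int, int]]:
        # One independent hash per table, obtained by salting BLAKE2b.
        for i in range(self.num_tables):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield i, int.from_bytes(digest, "little") % self.table_size

    def add(self, item: str) -> None:
        for table, pos in self._positions(item):
            self.tables[table][pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.tables[t][p] for t, p in self._positions(item))


# Toy reference and candidates, for illustration only.
reference = BloomFilter(table_size=1000, num_tables=4)
for kmer in ["ACGT", "CGTA"]:
    reference.add(kmer)

candidates = ["ACGT", "TTGC"]
kept = [k for k in candidates if k not in reference]
print(kept)
```

Because a Bloom filter can report false positives, a filter of this kind may discard a few k-mers that are in fact unique, but it never keeps a k-mer that was added to the reference.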

shell: """
    cd {wildcards.sample}/maxbin
    f=maxbin.{wildcards.idx}.fasta.uniqkmer
    python {SCRIPTDIR}/get-uniqkmer-nothread.py $f \
        --ref combined_uniqkmer.uniq2ref \
        -k {K} -x 1e6 -N 4 \
        --x2 1e6 --N2 4  \
      >  $f.uniq2ref \
      || rm $f.uniq2ref 
"""

SnakeMake MaxBin From line 194 of master/Snakefile

shell: """
    cd {wildcards.sample}/maxbin
    f=maxbin.{wildcards.idx}.fasta
    echo "*** output results for $f"
    python {SCRIPTDIR}/kmer2primer.py {SCRIPTDIR}/params.config \
      $f $f.uniqkmer.uniq2ref \
      | python {SCRIPTDIR}/filter-primer-individual.py \
        {SCRIPTDIR}/params.config - \
      | python {SCRIPTDIR}/filter-primer-pair.py {SCRIPTDIR}/params.config - \
      | tee $f.uniqkmer.uniq2ref.primer
"""
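
The final snippet filters primers first individually and then as pairs, with criteria read from `params.config`, which is not shown here. The sketch below illustrates the kind of checks such filters commonly apply: GC content, a rough Wallace-rule melting temperature (2 °C per A/T and 4 °C per G/C), the longest homopolymer run, and the melting-temperature difference within a pair. All function names and thresholds are hypothetical and are not taken from the pipeline's configuration.

```python
import re


def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C using the Wallace rule."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_homopolymer(primer: str) -> int:
    """Length of the longest run of a single repeated base."""
    runs = re.finditer(r"A+|C+|G+|T+", primer)
    return max((len(m.group(0)) for m in runs), default=0)


def passes_individual(primer, gc_range=(0.4, 0.6), tm_range=(50, 65), max_run=4):
    """Per-primer filter; the default thresholds are illustrative only."""
    return (
        gc_range[0] <= gc_fraction(primer) <= gc_range[1]
        and tm_range[0] <= wallace_tm(primer) <= tm_range[1]
        and longest_homopolymer(primer) <= max_run
    )


def passes_pair(forward, reverse, max_tm_diff=5):
    """Pair filter: both primers should melt at similar temperatures."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff


# Toy primers, for illustration only.
fwd = "ACGTACGTACGTACGTACGT"
rev = "ACGTACGTACGTACGTACGA"
print(passes_individual(fwd), passes_individual("A" * 20), passes_pair(fwd, rev))
```

Real primer design usually relies on more accurate nearest-neighbor melting-temperature models and also checks for hairpins and primer dimers; the Wallace rule is used here only to keep the example short.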