pipeline to find unique primer for genome bins in metagenomes

public public 1yr ago 0 bookmarks

This pipeline designs primers that are unique to a genome/genome bin after metagenome assemlby and binning, which can be further used to verify the genome/genome bin are real by running PCR using the primers in the same DNA extract as the metagenome are sequenced from. Another use case is to match isolates from culturing to genome bins.

Authors

  • Jiarong Guo (@jiarong)

Usage

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release . If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, if available, its DOI (see above).

Step 2: Configure workflow

Configure the workflow according to your needs via editing the file config.yaml .

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake -n

Execute the workflow locally via

snakemake --cores $N

using $N cores or run it in a cluster environment via

snakemake --cluster qsub --jobs 100

or

snakemake --drmaa --jobs 100

See the Snakemake documentation for further details.

Testing

Code Snippets

38
39
40
41
42
43
44
45
46
shell: """
    cd {wildcards.sample} \
    && mkdir -p maxbin \
    && run_MaxBin.pl -thread {threads} \
        -max_iteration 100 \
        -contig {wildcards.pid}_contigs.fasta.gz \
        -abund {wildcards.pid}_contigcoverage.tsv \
        -out maxbin/maxbin
"""
59
60
61
62
63
64
65
66
67
68
69
shell: """
    cd {wildcards.sample}
    rm -rf mb.cm.tmpdir
    checkm lineage_wf -x fasta \
        -f mb.cm.out \
        -t {threads} \
        maxbin \
        mb.cm.tmpdir \
      || rm -rf mb.cm.tmpdir
    checkm tree_qa mb.cm.tmpdir -o 1 -f mb.cm.tax --tab_table
"""
79
80
81
82
83
84
85
86
shell: """
    cd {wildcards.sample}
    ## gather
    mash screen -p {threads} {DBDIR}/refseq.genomes.k21s1000.msh {input} \
      | sort -g -r > mb.mashscreen.tsv
    mash screen -w -p {threads} {DBDIR}/refseq.genomes.k21s1000.msh {input} \
      | sort -g -r > mb.mashscreenw.tsv
"""
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
shell: """
    # only pick top 10 most abundant to analyze
    if [ {wildcards.idx} -gt {TOPN} ]; then exit 0; fi
    cd {wildcards.sample}/maxbin
    ## search
    # -m: minimum kmer count, 1 for genome or bins, >2 for reads
    f=maxbin.{wildcards.idx}.fasta
    mash sketch -p {threads} -k 21 -m 1 $f
    # sort -k 3 --> pick 3rd col as key for sorting
    mash dist -p {threads} {DBDIR}/refseq.genomes.k21s1000.msh $f.msh \
      | sort -g -k 3 \
      > $f.mashdist.tsv
"""
117
118
119
120
121
122
123
124
125
shell: """
    cd {wildcards.sample}
    ## gather
    refseq_masher contains {input} \
    -o mb.rmcontain.csv \
    --output-type csv \
    -i 0.5 \
    -p {threads}
"""
134
135
136
137
138
139
140
141
142
143
144
145
shell: """
    # only pick top 10 most abundant to analyze
    if [ {wildcards.idx} -gt {TOPN} ]; then exit 0; fi
    cd {wildcards.sample}/maxbin
    ## search
    refseq_masher matches maxbin.{wildcards.idx}.fasta \
        -o maxbin.{wildcards.idx}.fasta.rmmatch.csv \
        --output-type csv \
        -p {threads} \
        -n 10 \ top n matches
        -m 1  # minimum kmer count, 1 for genome or bins, >2 for reads
"""
153
154
155
156
157
158
159
160
161
162
163
164
165
166
shell: """
    if [ {wildcards.idx} -gt {TOPN} ]; then exit 0; fi
    cd {wildcards.sample}/maxbin
    Seqfile_list=(maxbin.???.fasta)
    f=maxbin.{wildcards.idx}.fasta
    Other=${{Seqfile_list[@]/${{f}}}}
    python {SCRIPTDIR}/get-uniqkmer.py $f \
        --ref $Other \
        -k {K} \
        -x 4e9 -N 4 \
        --x2 1e8 --N2 4 \
      > $f.uniqkmer \
      || rm $f.uniqkmer
"""
177
178
179
180
181
182
183
184
185
shell: """
    cat {input} \
      | python {SCRIPTDIR}/get-uniqkmer-nothread.py - \
        --load {DBDIR}/refseq.20.bf \
        -k {K} --x2 1e6 --N2 4  \
      > {wildcards.sample}/maxbin/combined_uniqkmer.uniq2ref \
      || rm {wildcards.sample}/maxbin/combined_uniqkmer.uniq2ref

"""
SnakeMake From line 177 of master/Snakefile
194
195
196
197
198
199
200
201
202
203
shell: """
    cd {wildcards.sample}/maxbin
    f=maxbin.{wildcards.idx}.fasta.uniqkmer
    python {SCRIPTDIR}/get-uniqkmer-nothread.py $f \
        --ref combined_uniqkmer.uniq2ref \
        -k {K} -x 1e6 -N 4 \
        --x2 1e6 --N2 4  \
      >  $f.uniq2ref \
      || rm $f.uniq2ref 
"""
213
214
215
216
217
218
219
220
221
222
223
shell: """
    cd {wildcards.sample}/maxbin
    f=maxbin.{wildcards.idx}.fasta
    echo "*** output results for $f"
    python  $Scriptdir/kmer2primer.py $Scriptdir/params.config \
      $f $f.uniqkmer.uniq2ref \
      | python $Scriptdir/filter-primer-individual.py \
        $Scriptdir/params.config - \
      | python $Scriptdir/filter-primer-pair.py $Scriptdir/params.config - \
      | tee $f.uniqkmer.uniq2ref.primer
"""
ShowHide 6 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/jiarong/uniqprimer
Name: uniqprimer
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...