Orochi: Metagenomic Analysis Pipeline for DNA Samples


Orochi is a Snakemake-based metagenomic analysis pipeline that takes DNA samples from assembly through binning to functional and taxonomic annotation, primarily by orchestrating a series of external analysis tools in an automated and reproducible way.
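
As a rough illustration of the pattern the pipeline follows, a typical rule wraps a single external tool in a pinned conda environment with its own log file. The rule below is a minimal sketch (the rule name, paths, and env file are hypothetical, not taken from Orochi itself), modelled on the seqtk length-filter snippets shown further down:

    # Minimal sketch of the per-tool rule pattern; names and paths are illustrative only.
    rule length_filter_example:
        input: "scratch/assembly/contigs.fasta"            # hypothetical path
        output: "scratch/assembly/contigs.min1000.fasta"   # hypothetical path
        params:
            length=1000
        conda: "envs/seqtk.yaml"                            # hypothetical env file
        log: "logs/length_filter_example.log"
        shell: "seqtk seq -L {params.length} {input} > {output} 2> {log}"

With rules like this in place, the whole workflow is driven by a single Snakemake invocation such as "snakemake --use-conda --cores 16", with sample and tool settings supplied via the workflow's config file.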

Documentation

https://pipelines.pages.bioinf.nioo.knaw.nl/orochi/

Code Snippets

The snippets below are excerpts from the workflow's Snakemake rule files and helper scripts.

shell:
    "cat {input} > {output} 2> {log}"
shell:
    """
    awk -v RS='>[^\\n]+\\n' 'length() <= 2000 {{printf "%s", prt $0}} {{prt = RT}}' {input} > {output} 2> {log}
    """
shell: 
   "seqtk seq -L {params.length} {input}  > {output} 2> {log}"
shell:
    "cd-hit-est -i {input} -o {output} -T 90 -M 500000 -c 0.99 -n 10 > {log}"
shell:
    "seqtk seq -C {input} | seqtk rename - contig_ > {output} 2> {log}"
shell:
    "toAmos -s {input} -o {output} 2> {log}"
shell:
    """
    cd .snakemake/conda 
    find . > pathfiles.txt
    grep '/minimus2$' pathfiles.txt > minimus2_location.txt
    python ../../src/scripts/minimus2_replacement.py 
    touch ../../.minimus_replaced.txt
    """
shell:
    "minimus2 `file={input.file}; echo ${{file%.*}}` -D OVERLAP=100 MINID=95"
shell:
    "cat {input} > {output} 2> {log}"
shell: "cat {input.short} {input.merged} > {output} 2> {log}"
shell: "cut_up_fasta.py {input} -c 10000 -o 0 --merge_last -b contigs_10K.bed > {output}"
    shell: "cut_up_fasta.py {input} -c 10000 -o 0 --merge_last -b contigs_10K.bed > {output}"
"""
rule concoct:
    input:
        contigs="scratch/assembly/megahit/minimus2/secondary.contigs.fasta",
        coverage="results/stats/coverage/coverage.tsv"
    output: "results/binning/concoct/clustering_merged.csv"
    params:
        outdir=lambda wildcards, output: os.path.dirname(str(output))
    log: "logs/concoct/concoct.log"
    conda: "../../../envs/concoct.yaml"
    shell: "concoct --composition_file {input.contigs} --coverage_file {input.coverage} -b {params.outdir}"

rule concoct_write_bins:
    input:
        contigs="scratch/assembly/megahit/minimus2/secondary.contigs.fasta",
        clusters="results/binning/concoct/clustering_merged.csv"
    output: "results/binning/concoct/bin1.fasta"
    params:
        outdir=lambda wildcards, output: os.path.dirname(str(output))
    log: "logs/concoct/concoct_write_bins.log"
    conda: "../../../envs/concoct.yaml"
    shell:
        """
shell:
    """
    DAS_Tool -i {input.metabat},{input.vamb} \
             -l metabat,vamb \
             -c {input.contigs} \
             -o {params.outprefix} \
             --write_bins 1 \
             --search_engine diamond \
             -t 24 \
             2> {log}
    """
shell: "python src/scripts/assembly_rmdup.py"
shell: 
    """
    mkdir -p results/binning/metabat
    jgi_summarize_bam_contig_depths --outputDepth {output.depth} {input.bam}
    metabat2 -i {input.contigs} -a {output.depth} -o {params.prefix} -v > {log}
    """
shell: "python src/scripts/mmgenome_metabat.py"
    shell: "seqtk seq -L {params.length} {input}  > {output}"
"""

rule concatenate:
    input: expand("scratch/assembly/megahit/{treatment}/{kmers}/final.contigs.fa", treatment=config["treatment"], kmers=config["assembly-klist"])
    output: "scratch/binning/vamb/catalogue.fna.gz"
    conda: "../../../envs/vamb.yaml"
    log: "logs/vamb/concatenate.log"
    shell:     
        "concatenate.py {output} {input} 2> {log}"
    #Just cat with extras to make it more suitable to VAMB

rule vamb_map:
    input: 
        catalogue="scratch/binning/vamb/catalogue.fna.gz",
        forward="scratch/host_filtering/{sample}_R1.fastq" if config['host_removal'] \
             else "scratch/filter/{sample}_R1.fq",
        rev="scratch/host_filtering/{sample}_R2.fastq" if config['host_removal'] \
             else "scratch/filter/{sample}_R2.fq"
    output: temp("scratch/binning/vamb/{sample}.bam")
    conda: "../../../envs/minimap2.yaml"
    log: "logs/vamb/vamb_map_{sample}.log"
    shell:
        """
    shell: 
        "rm -rf results/binning/vamb;"
        "vamb --outdir results/binning/vamb --fasta {input.catalogue} --bamfiles scratch/binning/vamb/*.bam -o C --minfasta 200000 2> {log}"

"""
rule vamb_write_bins:
    input:
        clusters="results/binning/vamb/clusters.tsv",
        contigs="scratch/assembly/megahit/minimus2/secondary.contigs.fasta"
    output: "results/binning/vamb/bins/bin1.fasta"
    params:
        outdir="results/binning/vamb"
    conda:
        "../../../envs/vamb.yaml"
    run:
        # Note: requires 'import vamb' and a 'filtered_bins' dict produced by an
        # upstream clustering step; neither is defined in this snippet.
        with open(input.clusters, 'w') as file:
            vamb.cluster.write_clusters(file, filtered_bins)
        keptcontigs = set.union(*filtered_bins.values())
        with open(input.contigs, 'rb') as file:
            fastadict = vamb.vambtools.loadfasta(file, keep=keptcontigs)
        bindir = params.outdir
        vamb.vambtools.write_bins(bindir, filtered_bins, fastadict, maxbins=500)
"""
shell: 
   "seqtk seq -L {params.length} {input}  > {output} 2> {log}"
shell:
    "antismash --cb-general --cb-knownclusters --cb-subclusters --asf --genefinding-tool prodigal-m --output-dir {params.outdir} --cpus {threads} {input}"
script:
    "../../../scripts/antismash_get_bgcs.py"
shell:
    "coverm contig --methods count --mapper minimap2-sr --proper-pairs-only -1 {input.forward} -2 {input.rev} --reference {input.bgcs} --threads {threads} 2> {log} > {output}"
    shell:
        """
        git clone https://git.wur.nl/medema-group/BiG-SCAPE.git
        cd BiG-SCAPE
        wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz && gunzip Pfam-A.hmm.gz
        hmmpress Pfam-A.hmm
        python bigscape.py -i ../{params.inputdir} -o ../{params.outdir} -c {threads} --mode glocal --mibig --include_singletons --mix
        """


"""
if config['big']=='bigslice':
   rule bigslice:
        input: "scratch/annotation/antismash/secondary.contigs.gbk"
        params:
            inputdir="scratch/annotation/antismash",
            outdir="scratch/annotation/{big}"
        output:
        conda:
            "../../../envs/bigslice.yaml"
        log:
        threads:
        shell:
            "download_bigslice_hmmdb"
            "bigslice -i <{params.inputdir}> <{params.outdir}>"

rule rast:
# Uses the myRAST batch processor to upload the BiG-SCAPE output (clusters), retrieve RAST IDs, and download the corresponding faa and txt files in batch.

rule prepare_corason:
# Arranges BiG-SCAPE and RAST output into the input directory structure required by CORASON.

rule corason:
    input:
    output:
    container:
        "docker://nselem/corason"
    log:
    threads:
    shell:
"""
shell: "cp {input} {params.outdir} 2> {log}"
shell: "emapper.py --dmnd_db {params.db} -m diamond --no_annot --no_file_comments --cpu {threads} -i {input} -o {input}"
shell:
    """
    emapper.py --annotate_hits_table {input.diamond} --no_file_comments --cpu {threads} --data_dir {params.db} -o {input.seq}
    """
shell:
    """
    input=({input.forward})
    if [ "${{input##*.}}" == "bz2" ]; then
        pbzip2 -p{threads} -dc {input.forward}  > {output.forward} 2> {log.forward}
        pbzip2 -p{threads} -dc {input.rev}  > {output.rev} 2> {log.rev}
    elif [ "${{input##*.}}" == "gz" ]; then
        pigz -p {threads} -dc {input.forward}  > {output.forward} 2> {log.forward}
        pigz -p {threads} -dc {input.rev}  > {output.rev} 2> {log.rev}
    else
        cp {input.forward} {output.forward}
        cp {input.rev} {output.rev}
    fi
    """
shell:"""bbduk.sh in={input.forward} in2={input.rev} out={output.forward} out2={output.rev} \
trimpolygright=1 \
entropy=0.6 entropywindow=50 entropymask=f \
qtrim=rl trimq={params.quality} \
minlength=51 \
ref={params.ref} ktrim=r \
stats={output.stats} \
t={threads} 2> {log}"""
shell: "bbmap.sh ref={params.ref} in1={input.forward} in2={input.rev} outu1={output.forward} outu2={output.rev} t={threads} 2> {log}"
shell: "bwa index {input}"
shell: "bwa mem -t {threads} {params.refindex} {input.forward} {input.rev} -o {output} 2> {log}"
shell: "samtools flagstat {input} > {output} 2> {log}"
shell: "samtools view -b -f 4 {input} > {output} 2> {log}"
shell: "samtools sort {input} > {output}"
shell: "bamToFastq -i {input}  -fq {output.forward} -fq2 {output.rev} 2> {log}"
shell: "samtools view -b -F 4 {input} > {output} 2> {log}"
shell: "samtools sort {input} > {output}"
shell: "bamToFastq -i {input}  -fq {output.forward} -fq2 {output.rev} 2> {log}"
shell:
    """CAT contigs -c {input} -o {params.prefix} -d {params.db} -t {params.tax} --nproc {threads} --sensitive --force --compress --verbose --index_chunks 1 --top 11 --I_know_what_Im_doing"""
shell: "CAT add_names -i {input} -o {output} -t {params.tax} --only_official > {log} 2>&1"
shell: "CAT summarise -c {input.contigs} -i {input.classification} -o {output} > {log} 2>&1"
shell:
    "cat {input} > {output}"
shell: 
   "seqtk seq -L {params.length} {input}  > {output}"
shell:
    "cd-hit-est -i {input} -o {output} -T 90 -M 500000 -c 0.99 -n 10 > {log}"
shell:
    """awk '/^>/ {{print ">contig_" ++i; next}}{{print}}' < {input} > {output}"""
shell:
    "toAmos -s {input} -o {output}"
shell:
    "minimus2 `file={input}; echo ${{file%.*}}` -D OVERLAP=100 MINID=95"
shell:
    "cat {input} > {output}"
shell:
    """
    groopm parse {params.db} {input.contigs} scratch/coverm/bamfiles/*.bam -t {threads}
    groopm core {params.db} -t {threads}
    groopm refine {params.db} -t {threads}
    groopm recruit {params.db} -t {threads}
    groopm extract {params.db} {input.contigs} -o {params.outdir} -t {threads}
    """
shell: "bamm parse -c {output} -m pmean -b {input}"
shell: """echo "contig\tLength\t{params.sample}" > {output} && tail -n +2 {input} >> {output}"""
shell: """
    zcat {input} | prodigal -d {output.nucleotide} -a {output.orfs} -i /dev/stdin -m -o {log} -p meta -q
    cut -f1 -d ' ' {output.orfs} > {output.orfscleaned}
"""
shell: """
    hmmsearch --tblout {output.prediction} --cut_tc --notextw ~/install/mmgenome/scripts/essential.hmm {input} > {log}
    echo "scaffold orf hmm.id" > {output.essential}
    tail -n+4 {output.prediction} | sed "s/ * / /g" | cut -f1,4 -d " " | sed "s/_/ /" >> {output.essential}
    """
run:
    shell("grep -v '^#' {input.prediction} | cut -f1 -d ' ' > {output.posorfs}")
    shell("perl ~/install/mmgenome/scripts/extract.using.header.list.pl -l {output.posorfs} -s {input.orfs} -o {output.faa}")
shell: """
    blastp -query {input.faa} -db /data/db/blast/nr/20150311/nr -evalue 1e-5 -num_threads {threads} -max_target_seqs 5 -outfmt 5 -out {output.blast}
    # Here we need to run MEGAN first
    java -Xmx32G -Djava.awt.headless=true -Duser.language=en -Duser.region=US -cp '/data/tools/MEGAN/6.10.8/jars/MEGAN.jar:/data/tools/MEGAN/6.10.8/jars/data.jar' megan.tools.Blast2LCA -i {output.blast} -f BlastXML -ms 50 -me 0.01 -top 50 -a2t /data/db/megan/prot_acc2tax-Oct2017X1.abin
    """
run:
    import re
    import sys

    minscore = int(params.minscore)
    out = open(output.taxonomy, 'w')

    filteredtax = {}
    # Parse the taxonomy string and apply score filter
    for line in open(input.taxonomy):
        taxdict = {}
        s = line.split(';')
        for i in range(2,len(s),2):                
            try:
                level, value = s[i].strip().split('__')
                if int(s[i+1]) >= minscore:
                    taxdict[level] = s[i].strip()
            except (ValueError, IndexError):
                pass
        taxonomy = ["root","cellular organisms"]
        for level in ["d","p","c","o","f","g","s"]:
            taxonomy.append(taxdict.setdefault(level, "unclassified"))

        taxstring = ";".join(taxonomy)
        genestring = ";".join(s[0:1])

        out.write("%s\t%s\n" % (genestring,taxstring))
    out.close()
shell: "perl ~/src/orochi/scripts/hmm.majority.vote.pl -i {input.taxonomy} -o {output.essential}"
script:
    "../../mmgenome_load_data.R"
import json

def get_location(location):
    ## https://github.com/ohmeta/oral-assembly/blob/master/notebook/antismash.ipynb
    loc_ = location.split(":")
    return int(loc_[0].lstrip("[")), int(loc_[1].rstrip("]"))

json_dict = json.load(open(snakemake.input[0]))

outfile = open(snakemake.output[0],"w")

for record in json_dict['records']:
    for feat in record['features']:
        if feat['type'] == "region":
            start, stop = get_location(feat['location'])
            # Write the BGC region with a descriptive header line.
            outfile.write(">%s_%s_%s_%s | %s\n%s\n" % (record['id'], feat['qualifiers']['product'][0], start, stop, feat['qualifiers']['product'][0], record['seq']['data'][start:stop]))
            # Note: only the first "region" feature of each record is extracted.
            break
outfile.close()
# Remove duplicated contig headers from the merged assembly (assumes a
# two-line-per-record FASTA: header line followed by a single sequence line).
with open('scratch/assembly/megahit/minimus2/all.merged.contigs.fasta') as f:
    lines = f.readlines()

# Header lines sit at even indices, sequence lines at odd indices.
rango = list(range(0, len(lines), 2))
headers = list()
seqs = list()
for i in rango:
    headers.append(lines[i].rstrip('\n'))
    seqs.append(lines[i+1].rstrip('\n'))

# Compare only the first whitespace-separated token of each header.
filtered_headers = list()
for i in range(len(headers)):
    prueba = headers[i].split(" ")
    filtered_headers.append(prueba[0])

# Keep the first occurrence of every header; record discarded duplicates.
setOfElems = set()
filtered_seqs = list()
filtered_complete_headers = list()
repetidos_descartados = list()
for elem in range(len(filtered_headers)):
    if filtered_headers[elem] in setOfElems:
        repetidos_descartados.append(filtered_headers[elem])
    else:
        setOfElems.add(filtered_headers[elem])
        filtered_seqs.append(seqs[elem])
        filtered_complete_headers.append(headers[elem])

with open('scratch/assembly/megahit/minimus2/filtered_assembly.fasta', "a") as new_file:
    for i in range(len(filtered_complete_headers)):
        new_file.write(filtered_complete_headers[i] + '\n')
        new_file.write(filtered_seqs[i] + '\n')
# Rebuild the minimus2 script located in the conda environment (path taken from
# minimus2_location.txt), adding DELTAFILTER/SHOWCOORDS definitions and a
# nucmer --maxmatch step, then replace the original file.
import os

with open('minimus2_location.txt', 'r') as f:
    lineas = f.readlines()
    path_to_read = lineas[0]

# Start from an empty replacement file.
with open('minimus2', 'w') as new:
    pass

with open('minimus2', 'a') as new:
    with open(path_to_read[0:-1], 'r') as f:
        lineas = f.readlines()
        # Copy the first 45 lines of the original script unchanged.
        i = 0
        while i < 45:
            new.write(lineas[i])
            i = i+1
        correct_line = lineas[43]
    # Derive the binary directory from the definition on line 44 and append the
    # DELTAFILTER and SHOWCOORDS definitions.
    correct_path = correct_line.split("=")
    path = correct_path[1]
    path = path[0:-1]
    new.write('DELTAFILTER = ' + path + '/delta-filter' + '\n')
    new.write('SHOWCOORDS = ' + path + '/show-coords' + '\n')

with open('minimus2', 'a') as new:
    with open(path_to_read[0:-1], 'r') as f:
        lineas = f.readlines()
        # Copy lines 48-57, then write a step-20 nucmer command with --maxmatch.
        i = 47
        while i > 46 and i < 57:
            new.write(lineas[i])
            i = i+1
    new.write('20: $(NUCMER) --maxmatch -c $(OVERLAP) $(REFSEQ) $(QRYSEQ) -p $(PREFIX)' + '\n')

with open('minimus2', 'a') as new:
    with open(path_to_read[0:-1], 'r') as f:
        lineas = f.readlines()
        # Copy the remainder of the original script (lines 59-85).
        i = 58
        while i > 57 and i < 85:
            new.write(lineas[i])
            i = i+1

# Make the patched script executable and move it into place.
os.chmod("minimus2", 0o777)
os.replace("minimus2", path_to_read[0:-1])
# Build a scaffold-to-bin table from the MetaBAT bin FASTA files.
import os
from Bio import SeqIO

bindir = "results/binning/metabat/bins/"
fileslist = os.listdir(bindir)

with open(os.path.join(bindir, "metabat.bins.txt"), "a") as outfile:
    outfile.write("scaffold" + "\t" + "bin")
    for i in fileslist:
        bin_seqs = SeqIO.parse(open(os.path.join(bindir, i)), 'fasta')
        for fasta in bin_seqs:
            name, sequence = fasta.id, str(fasta.seq)
            # Bin name: file name with dots removed and the trailing 'fa' stripped.
            outfile.write('\n' + name + "\t" + i.replace('.', '')[:-2])
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
trimmer = " ".join(snakemake.params.trimmer)

shell("trimmomatic PE {snakemake.params.extra} "
      "{snakemake.input.r1} {snakemake.input.r2} "
      "{snakemake.output.r1} {snakemake.output.r1_unpaired} "
      "{snakemake.output.r2} {snakemake.output.r2_unpaired} "
      "{trimmer} "
      "{log}")

Maintainers: public
URL: https://github.com/nioo-knaw/orochi
Name: orochi
Version: 0.1
Copyright: Public Domain
License: GNU General Public License v3.0