Contamination Detection in the Olive Genome: A Snakemake Workflow on AWS

public public 1yr ago 0 bookmarks

This workflow includes analyses to find contamination in the domesticated olive genome. The results are presented in this preprint .

This workflow was run on a amazon aws instance (ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20171026.1 (ami-1c1d217c)). To run this workflow, you must first install and configure miniconda (or anaconda if you prefer).

Install miniconda

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Run through the prompts.

Configure miniconda and create an environment

conda config --add channels r
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n snakemake python==3.6 snakemake=4.3.0 pysam=0.13 pandas=0.20.3 numpy=1.13.3 sourmash=2.0.0a2
conda activate snakemake

Then, run the workflow:

snakemake --use-conda -k

Currently, the all rule dynamic("outputs/sylvestris/blast/tab/{contig_names_sylv}.tab") in the Snakefile is commented out. This means that the pipeline will only output Oe6 relevant files. To output sylvestris files, uncomment the rule, and comment the Oe6 rule.

Code Snippets

10
11
12
13
    shell:'''
    wget -O {output.gz} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz
    gunzip -c {output.gz} > {output.uncmp}
	'''
21
22
23
24
25
26
shell:'''
wget -O {output.sangz} ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/678/115/GCA_001678115.1_ASM167811v1/GCA_001678115.1_ASM167811v1_genomic.fna.gz 
gunzip -c {output.sangz} > {output.sanuncmp}
wget -O {output.ex150gz} ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/721/785/GCA_000721785.1_Aureobasidium_pullulans_var._pullulans_EXF-150_assembly_version_1.0/GCA_000721785.1_Aureobasidium_pullulans_var._pullulans_EXF-150_assembly_version_1.0_genomic.fna.gz
gunzip -c {output.ex150gz} > {output.ex150uncmp}
'''
34
35
36
shell:'''
nucmer --mum {input.Oe6} {input.san} -p outputs/aur-pul-nucmer/Oe6-APvarSan
'''
42
43
44
shell:'''
delta-filter -l 500 -q {input} > {output} 
'''
50
51
52
shell:'''
show-coords -c -l -L 500 -r -T {input} > {output}
'''
60
61
62
shell:'''
nucmer --mum {input.Oe6} {input.ex150} -p outputs/aur-pul-nucmer/Oe6-APvarEx
'''
68
69
70
shell:'''
delta-filter -l 500 -q {input} > {output}
'''
76
77
78
shell:'''
show-coords -c -l -L 500 -r -T {input} > {output}
'''
13
14
15
16
    shell:'''
    	wget -O {output.gz} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz 
    	gunzip -c {output.gz} > {output.uncmp}
	'''
22
23
24
25
    shell:'''
		wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea%3E1kb_scaffolds.gz
		gunzip -c {output.gz} > {output.uncmp}
    '''
31
32
33
shell:'''
	samtools faidx {input}
'''
39
40
41
shell:'''
	samtools faidx {input}
'''
53
54
55
56
57
58
59
60
61
62
63
run:
    fasta = pysam.Fastafile(filename = input.genome)
    f=open(input.contigs,'r')
    for line in f.readlines():
        # strip white space
        line = line.strip()
        # use pysam to grab the line of interest
        fasta_of_interest = fasta.fetch(line)
        with open(f"outputs/Oe6/suspicious_contigs/{line}.fa", "w") as text_file:
            # write content of fasta file with appropriate header and sequence
            text_file.write(f">{line}\n{fasta_of_interest}")    
71
72
73
74
75
76
77
78
79
80
81
run:
    fasta = pysam.Fastafile(filename = input.genome)
    f=open(input.contigs,'r')
    for line in f.readlines():
        # strip white space
        line = line.strip()
        # use pysam to grab the line of interest
        fasta_of_interest = fasta.fetch(line)
        with open(f"outputs/sylvestris/suspicious_contigs/{line}.fa", "w") as text_file:
            # write content of fasta file with appropriate header and sequence
            text_file.write(f">{line}\n{fasta_of_interest}")    
87
88
89
90
91
shell:'''
    cd inputs/blast_db
	wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz'
	cat nt.*.tar.gz | tar -zxvi -f - -C .
'''
 99
100
101
shell:'''
	blastn -query {input.contig} -db inputs/blast_db/nt -outfmt 11 -out {output}
'''
109
110
111
shell:'''
	blastn -query {input.contig} -db inputs/blast_db/nt -outfmt 11 -out {output}
'''
117
118
119
shell:'''
	blast_formatter -archive {input} -outfmt 6 -out {output}
'''
125
126
127
shell:'''
	blast_formatter -archive {input} -outfmt 6 -out {output}
'''
 8
 9
10
11
    shell:'''
	wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea_all_scaffolds.fa.gz 
	gunzip -c {output.gz} > {output.uncmp}
	'''
17
18
19
20
21
    shell:'''
    wget -O {output.tgz} http://busco.ezlab.org/datasets/embryophyta_odb9.tar.gz
	mkdir -p {output.db}
    tar -xf {output.tgz} --strip-components=1 -C {output.db}
    '''
28
29
30
31
    shell:'''
	run_busco -i {input.genome} -o wild_olive_busco -l {input.busco_db} -m geno
        mv run_wild_olive_busco {output}
    '''
 8
 9
10
11
    shell:'''
    wget -O {output.gz} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz 
    gunzip -c {output.gz} > {output.uncmp}
	'''
17
18
19
20
    shell:'''
	wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea_chromosome+unchromosome.gz 
	gunzip -c {output.gz} > {output.uncmp}
	'''
28
29
30
shell:'''
nucmer --mum {input.sylv} {input.Oe6} -p outputs/olive_genomes_nucmer/sylvester_santander_nucmer
'''
36
37
38
shell:'''
delta-filter -l 500 -q {input} > {output}
'''
44
45
46
shell:'''
show-coords -c -l -L 500 -r -T {input} > {output}
'''
6
7
8
    shell:'''
    wget -O {output} http://denovo.cnag.cat/genomes/olive/download/Oe6/OE6A.pep.fa 
	'''
14
15
16
17
    shell:'''
	wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea.gene.pep.final.chr_and_chrUn_noTE.gz 
	gunzip -c {output.gz} > {output.uncmp}
	'''
32
33
34
35
    shell:'''
	orthofinder -f inputs/peptides -og
	cp -a inputs/peptides/Results*/. {output}
	'''
 8
 9
10
shell:'''
	wget -O {output} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz
'''
14
15
16
    shell:'''
		wget -O {output} http://olivegenome.org/genome_datasets/Olea_europaea%3E1kb_scaffolds.gz
	'''
22
23
24
25
shell:'''
# compute tetranucleotide frequency of scaffolds
	sourmash compute -k 31 --scaled 10000 -o {output} {input}
'''
31
32
33
34
shell:'''
# compute tetranucleotide frequency of scaffolds
	sourmash compute -k 31 --scaled 10000 -o {output} {input}
'''
38
39
40
shell:'''
	wget -O inputs/genbank_lca/genbank-k31.lca.json.gz https://osf.io/zskb9/download
'''
48
49
50
shell:'''
	sourmash lca classify --db {input.genbank} --query {input.sig} -o {output}
'''
58
59
60
shell:'''
sourmash lca classify --db {input.genbank} --query {input.sig} -o {output}
'''
ShowHide 32 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/taylorreiter/olive_genome
Name: olive_genome
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: BSD 3-Clause "New" or "Revised" License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...