
PhyloFunDB pipeline

Pipeline for specific gene database construction and update.

Introduction

The pipeline is based on the following workflow:

[Figure: Db_pipeline – workflow diagram for database construction]

Before starting, you can download a FrameBot database (protein sequences) for your specific gene. For some genes no FunGene database is available, in which case the FrameBot step can be skipped.

Download the reference database for FrameBot from FunGene: http://fungene.cme.msu.edu/

In FunGene, set the parameters that suit your gene, for example:

min size aa = 600

hmm coverage = 99

To remove duplicated sequences based on the protein sequence, you can use the software seqkit:

conda create -n seqkit -c bioconda seqkit
conda activate seqkit
seqkit rmdup {gene}.fungene.fasta -s -o {gene}.fungene.clean.fasta

Getting started

1. Log on to the machine where you will analyse your data, e.g. a server

2. Create a local copy of the pipeline in a project folder

git clone https://github.com/nioo-knaw/PhyloFunDB.git

3. Enter the pipeline folder with:

cd PhyloFunDB

4. The configuration of the pipeline is set in the file config.yaml. Adjust the settings (an example follows the list):

gene: gene name
full_name: "protein full name"
minlength: minimum sequence length
cutoff_otu: cutoff for OTU clustering (generally found in the literature)
cutoff_dm: cutoff for the distance matrix (in general, 0.25 is good enough)
framebot_db: false if there is no FrameBot reference database, otherwise true
update: false, as you want to create a new database
mindate: only needed when you update the database
maxdate: only needed when you update the database
path_to_tree: "only needed when you update the database"
path_to_seqs: "only needed when you update the database"
path_to_db: "only needed when you update the database"
path_to_tax: "only needed when you update the database"
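
A minimal example of a filled-in config.yaml for a new build. The values below are illustrative assumptions (hypothetical gene, protein name, and cutoffs), not recommendations - take the cutoffs from the literature for your gene:

gene: nirK
full_name: "copper-containing nitrite reductase"   # example protein name
minlength: 450          # example minimum length in bp
cutoff_otu: 0.18        # example value; take this from the literature
cutoff_dm: 0.25
framebot_db: true       # set to false if FunGene has no database for your gene
update: false           # building a new database
mindate:                # leave the update-only fields empty
maxdate:
path_to_tree: ""
path_to_seqs: ""
path_to_db: ""
path_to_tax: ""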

5. Check if the config file is correct and which steps will be run

snakemake -n

6. Run the pipeline. -j specifies the number of threads; --use-conda lets Snakemake install the required tools through conda. Optionally run this in a tmux session (see the sketch below).

snakemake -j 8 --use-conda
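
For long runs, it can help to start the pipeline inside a detachable tmux session, so it keeps running if your connection drops. A minimal example:

tmux new -s phylofundb        # open a named session
snakemake -j 8 --use-conda    # run the pipeline inside it
# detach with Ctrl-b d; reattach later with:
tmux attach -t phylofundb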

Updating the old pipeline

Some time after building your specific gene database, you can update it with the newest sequences uploaded to the NCBI database, setting a date range for downloading new sequences and adding the new OTUs to the reference tree. You only need to adjust the options in the config.yaml file.

The update pipeline is based on the following workflow:

[Figure: Update_pipeline – workflow diagram for the database update]

The most recent sequences within the date range are downloaded, processed, and added to the initial reference tree.

Getting started

1. Log on to the machine where you will analyse your data, e.g. a server

2. Create a local copy of the pipeline in a project folder, or enter the folder created previously if you have already built a database.

git clone https://github.com/nioo-knaw/PhyloFunDB.git

3. Enter the pipeline folder with:

cd PhyloFunDB

4. Adjust the settings in the file config.yaml (an example follows the list):

gene: gene name
full_name: "protein full name"
minlength: minimum sequence length
cutoff_otu: cutoff for OTU clustering (generally found in the literature)
cutoff_dm: cutoff for the distance matrix (in general, 0.25 is good enough)
framebot_db: false if there is no FrameBot reference database, otherwise true
update: true, very important
mindate: the day after the sequences in the initial database were downloaded (yyyy/mm/dd)
maxdate: current day (yyyy/mm/dd)
path_to_tree: "path_to_the_reference_tree_of_the_database"
path_to_seqs: "path_to_the_sequences_used_to_build_the_reference_tree_of_the_database" - the new sequences will be aligned to the sequences in the tree
path_to_db: "path_to_the_fasta_file_of_the_full_database"
path_to_tax: "path_to_the_taxonomy_file_of_the_full_database"
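
For example, an update run for a hypothetical nirK database could use the settings below. The dates and file paths are illustrative - point them at the files produced by your initial build:

gene: nirK
full_name: "copper-containing nitrite reductase"
minlength: 450
cutoff_otu: 0.18
cutoff_dm: 0.25
framebot_db: true
update: true                  # must be true for the update workflow
mindate: 2022/01/01           # day after the initial download
maxdate: 2023/06/30           # today
path_to_tree: "results/nirK.tree.treefile"
path_to_seqs: "results/nirK.tree.fasta"
path_to_db: "results/nirK.db.fasta"
path_to_tax: "results/nirK.db.taxonomy"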

5. Check if the config file is correct and which steps will be run

snakemake -n -s Snakefile.update

6. Run the pipeline. -j specifies the number of threads; --use-conda lets Snakemake install the required tools through conda. Optionally run this in a tmux session, as shown earlier.

snakemake -j 8 -s Snakefile.update --use-conda

Refining sequence taxonomy

After generating your database files, you still need to check the taxonomy/clustering of the sequences in the phylogenetic tree and improve the unassigned/unclassified ones.

It is also possible to download metadata for all sequences using NCBI's Entrez Direct utilities and add that information to the taxonomy string:

conda create -n entrez -c bioconda entrez-direct
conda activate entrez

Example for the nirK gene:

esearch -db nucleotide -query "nirK[gene]" | efetch -format gpc | xtract -insd source organism mol_type strain country isolation_source | sort | uniq > metadata_nirk.txt

After having your tree ready and metadata downloaded (optional):

1. Check whether there are cultivated representatives in the OTU groups - the OTU representative sequence is not always a cultivated/known organism, and the program cannot distinguish this. Check the file "interm/{gene}.aligned.good.filter.unique.pick.good.filter.an.{cutoff_otu}.rep.names"

2. Look at the tree – check if there are defined clades in the literature

3. You can match the sequences with their full taxonomy strings downloaded from NCBI using the VLOOKUP function (Excel) - this makes checking and formatting the taxonomy file easier.

4. After checking and refining/correcting the taxonomies of the OTUs (down to genus level; work on species/strains in the full taxonomy list), expand the taxonomy to all sequences in each OTU group (remember that not all sequences in the database are in the tree, only the OTU representatives) - use the expand_taxonomy.R script

5. In the full taxonomy file, check whether the cultivated representatives have the correct taxonomy (Excel). Add the species and environmental origin to the last level of the taxonomy

6. Some formatting requirements that have to be observed (see the sketch after this list):

  • The names and number of sequences in the .fasta and .taxonomy files must match

  • Formatting depends on the downstream software – remove all spaces and avoid special characters

    • for mothur, taxonomy strings should end with ";"

    • for qiime2, taxonomy strings should end without ";" - you also have to remove the gaps from the sequences
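
A quick sketch of these checks and fixes with standard shell tools, assuming hypothetical file names nirk.db.fasta and nirk.db.taxonomy (adapt them to your own files):

# the two counts must be identical
grep -c ">" nirk.db.fasta
wc -l < nirk.db.taxonomy

# mothur: append a trailing ';' to taxonomy strings that lack one
sed '/;$/!s/$/;/' nirk.db.taxonomy > nirk.mothur.taxonomy

# qiime2: strip the trailing ';' and remove alignment gaps from the sequences
sed 's/;$//' nirk.db.taxonomy > nirk.qiime2.taxonomy
sed '/^>/!s/[-.]//g' nirk.db.fasta > nirk.qiime2.fasta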

Code Snippets

R script that matches the FASTA sequence names with their NCBI taxonomy by accession number:

library("tidyr")

#get fasta sequence names
#get taxonomy file
#isolate accession number from fasta seqnames
#add column with accession numbers to fasta seqname files
#isolate accession number from taxonomy file
#merge taxonomy and fasta seqnames files by accession number column
#final file with only sequence names and associated taxonomy

gene.names<-read.table(file=snakemake@input[["fasta"]])
gene.tax<-read.table(file=snakemake@input[["tax"]], sep="\t")
gene.names.columns<-separate(data = gene.names, col = V1, into = c("accession", "rest"), sep = "\\_")
gene.names$accession<-gene.names.columns$accession
gene.tax.columns<-separate(data = gene.tax, col = V1, into = c("accession", "rest"), sep = "\\.")
gene.taxonomy<-merge(gene.names,gene.tax.columns, by.x = "accession", by.y = "accession", all.x = FALSE)
gene.taxonomy.final<-gene.taxonomy[,c(2,4)]

write.table(gene.taxonomy.final, file=snakemake@output[["final_tax"]], sep="\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

Snakemake shell rule that downloads the gene sequences from NCBI (CDS coordinates first, then the corresponding FASTA), with or without a date range:

shell:"""
       if [[ {config[update]} == True ]]; then
           esearch -db nucleotide -query "{params.gene}[gene]" -mindate {params.mindate} -maxdate {params.maxdate} | \
           efetch -format gpc | \
           xtract -pattern INSDFeature -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDInterval_accession -element INSDInterval_from -element INSDInterval_to | \
           sort -u -k1,1 | \
           uniq | \
           awk 'NF<4' | \
           xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' > {output}
       else
           esearch -db nucleotide -query "{params.gene}[gene]" | \
           efetch -format gpc | \
           xtract -pattern INSDFeature -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDInterval_accession -element INSDInterval_from -element INSDInterval_to | \
           sort -u -k1,1 | \
           uniq | \
           awk 'NF<4' | \
           xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' > {output}
       fi
       """

Shell rule that downloads accession numbers and taxonomy strings from NCBI:

shell:"""
       if [[ {config[update]} == True ]]; then
           esearch -db nucleotide -query "{params.gene}[gene]" -mindate {params.mindate} -maxdate {params.maxdate} | \
           efetch -format gpc | \
           xtract -pattern INSDSeq -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDSeq_accession-version -element INSDSeq_taxonomy |\
           sort -u -k1,1 |\
           uniq > {output}
      else
           esearch -db nucleotide -query "{params.gene}[gene]" | \
           efetch -format gpc | \
           xtract -pattern INSDSeq -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDSeq_accession-version -element INSDSeq_taxonomy |\
           sort -u -k1,1 |\
           uniq > {output}
      fi
     """

Reformat the sequence names with sed:

shell:
    """
    sed -e 's/[.]/-/' -e 's/ /-/g' -e 's/_//g' {input} | \
    stdbuf -o0 cut -d "-" -f 1,4,5| \
    sed -e 's/-/_/g' -e 's/[.]//g' -e 's/[,]//g' > {output}
    """

List the names of sequences flagged as UNVERIFIED:

shell:
    """
    set +o pipefail; grep UNVERIFIED {input} | \
    stdbuf -o0 cut -c 2- > {output}
    """

Remove the UNVERIFIED sequences with mothur:

shell:
    '''
    mothur "#remove.seqs(accnos={input.accnos}, fasta={input.fasta})"
    '''

Trim sequences below the minimum length or containing ambiguous bases:

shell:
    '''
    mothur "#trim.seqs(fasta={input}, minlength={params.minlength}, maxambig=0, processors={threads})"
    '''

Correct frameshifts with FrameBot:

shell:
    "FrameBot framebot -o {params} -N {input.db_framebot} {input.fasta}"
(From line 133 of master/Snakefile)

Align the sequences with MAFFT:

shell:
    "mafft --thread {threads} --auto {input} >{output}"

Screen the alignment, optimizing the start and end positions:

shell:
    '''
    mothur "#screen.seqs(fasta={input}, optimize=start-end, criteria=96, processors={threads})"
    '''

Filter the alignment columns:

shell:
    '''
    mothur "#filter.seqs(fasta={input}, vertical=T, trump=., processors={threads})"
    '''

Dereplicate identical sequences:

shell:
    '''
    mothur "#unique.seqs(fasta={input})"
    '''            

Detect chimeras with VSEARCH:

shell:
    '''
    mothur "#chimera.vsearch(fasta={input.fasta}, name={input.name})"
    '''

Remove the chimeric sequences:

shell:
    '''
    mothur "#remove.seqs(accnos={input.accnos}, fasta={input.fasta}, name={input.name})"
    '''

Screen the dereplicated alignment again:

shell:'''
        mothur "#screen.seqs(fasta={input.fasta}, name={input.name}, optimize=start-end, criteria=96, processors={threads})"
        '''

Filter the alignment again:

shell:
    '''
    mothur "#filter.seqs(fasta={input}, vertical=T, trump=., processors={threads})"
    '''

Compute the pairwise distance matrix:

shell:
    '''
    mothur "#dist.seqs(fasta={input}, cutoff={config[cutoff_dm]}, processors={threads})"
    '''

Cluster the sequences into OTUs (average neighbor):

shell:
    '''
    mothur "#cluster(column={input.column}, name={input.name}, method=average, cutoff={config[cutoff_dm]})"
    '''

Get the OTU representative sequences:

shell:
    '''
    mothur "#get.oturep(column={input.column}, fasta={input.fasta}, name={input.name}, list={input.list}, cutoff={config[cutoff_otu]})"
    '''

Restore the full sequence set with deunique.seqs:

shell:
    '''
    mothur "#deunique.seqs(fasta={input.fasta}, name={input.name}, outputdir=./results)"
    '''

Format the taxonomy strings (remove underscores and spaces, append ';'):

shell:
    """
    sed -e 's/_//g' -e 's/ //g' -e 's/$/;/' {input} > {output}
    """
(From line 309 of master/Snakefile)

Extract the sequence names from the FASTA file:

shell:
    """
    grep ">" {input} | stdbuf -o0 cut -c 2- > {output}
    """
(From line 319 of master/Snakefile)

Attach the taxonomy to the sequence names with the renaming.R script:

script:
    "renaming.R"
(From line 332 of master/Snakefile)

Build the reference tree with IQ-TREE (ModelFinder, SH-aLRT and ultrafast bootstrap):

shell:
   "iqtree -s {input}  -m MFP -alrt 1000 -bb 1000 -nt {threads} -pre {params}"

Concatenate the new sequences with the sequences used to build the reference tree:

shell:
    """
    cat {input.fasta_new} {input.fasta} > {output}
    """
(From line 356 of master/Snakefile)

Re-align the combined sequence set with MAFFT:

shell:
    """
    mafft --thread {threads} --auto {input} >{output}
    """

Place the new sequences onto the reference tree with RAxML's evolutionary placement algorithm (-f v):

shell:
    """
    raxmlHPC -f v -s {input.fasta} -t {input.tree} -m GTRCAT -H -n {params}
    """

Concatenate the new sequences with the full database FASTA:

shell:
    """
    cat {input.fasta_new} {input.fasta_db} > {output}
    """
(From line 396 of master/Snakefile)

URL: https://github.com/nioo-knaw/PhyloFunDB
License: GNU General Public License v3.0