
PhyloFunDB pipeline

Pipeline for specific gene database construction and update.

Introduction

The pipeline is based on the following workflow:

[Figure: Db_pipeline – workflow diagram for database construction]

Before starting, you can download a FrameBot database (protein sequences) for your specific gene. For some genes no FunGene database is available, in which case the FrameBot step can be skipped.

Download the reference database for FrameBot from FunGene: http://fungene.cme.msu.edu/

In FunGene, set the parameters that suit your gene, for example:

min size aa = 600

hmm coverage = 99

To remove duplicated sequences based on the protein sequence, you can use the software seqkit:

conda create -n seqkit -c bioconda seqkit
conda activate seqkit
seqkit rmdup {gene}.fungene.fasta -s -o {gene}.fungene.clean.fasta

Getting started

1. Log on to the machine where you will analyse your data, e.g. a server

2. Create a local copy of the pipeline in a project folder

git clone https://github.com/nioo-knaw/PhyloFunDB.git

3. Enter the pipeline folder with:

cd PhyloFunDB

4. The configuration of the pipeline is set in the file config.yaml. Adjust the settings (an example follows the list):

gene: gene name
full_name: "protein full name"
minlength: minimum sequence length
cutoff_otu: cutoff for OTU clustering (generally found in the literature)
cutoff_dm: cutoff for the distance matrix (in general, 0.25 is good enough)
framebot_db: false if there is no FrameBot reference database, otherwise true
update: false, as you want to create a new database
mindate: only needed when you update the database
maxdate: only needed when you update the database
path_to_tree: "only needed when you update the database"
path_to_seqs: "only needed when you update the database"
path_to_db: "only needed when you update the database"
path_to_tax: "only needed when you update the database"
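
A minimal example of a filled-in config.yaml for a new build. The values below are illustrative assumptions (hypothetical gene, protein name, and cutoffs), not recommendations - take the cutoffs from the literature for your gene:

gene: nirK
full_name: "copper-containing nitrite reductase"   # example protein name
minlength: 450          # example minimum length in bp
cutoff_otu: 0.18        # example value; take this from the literature
cutoff_dm: 0.25
framebot_db: true       # set to false if FunGene has no database for your gene
update: false           # building a new database
mindate:                # leave the update-only fields empty
maxdate:
path_to_tree: ""
path_to_seqs: ""
path_to_db: ""
path_to_tax: ""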

5. Check if the config file is correct and which steps will be run

snakemake -n

6. Run the pipeline. -j specifies the number of threads; --use-conda lets Snakemake install the required tools through conda. Optionally run this in a tmux session (see the sketch below).

snakemake -j 8 --use-conda
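
For long runs, it can help to start the pipeline inside a detachable tmux session, so it keeps running if your connection drops. A minimal example:

tmux new -s phylofundb        # open a named session
snakemake -j 8 --use-conda    # run the pipeline inside it
# detach with Ctrl-b d; reattach later with:
tmux attach -t phylofundb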

Updating the old pipeline

Some time after building your specific gene database, you can update it with the newest sequences uploaded to the NCBI database, setting a date range for downloading new sequences and adding the new OTUs to the reference tree. You only need to adjust the options in the config.yaml file.

The update pipeline is based on the following workflow:

[Figure: Update_pipeline – workflow diagram for the database update]

The most recent sequences within the date range are downloaded, processed, and added to the initial reference tree.

Getting started

1. Log on to the machine where you will analyse your data, e.g. a server

2. Create a local copy of the pipeline in a project folder, or enter the folder created previously if you have already built a database.

git clone https://github.com/nioo-knaw/PhyloFunDB.git

3. Enter the pipeline folder with:

cd PhyloFunDB

4. Adjust the settings in the file config.yaml (an example follows the list):

gene: gene name
full_name: "protein full name"
minlength: minimum sequence length
cutoff_otu: cutoff for OTU clustering (generally found in the literature)
cutoff_dm: cutoff for the distance matrix (in general, 0.25 is good enough)
framebot_db: false if there is no FrameBot reference database, otherwise true
update: true, very important
mindate: the day after the sequences in the initial database were downloaded (yyyy/mm/dd)
maxdate: current day (yyyy/mm/dd)
path_to_tree: "path_to_the_reference_tree_of_the_database"
path_to_seqs: "path_to_the_sequences_used_to_build_the_reference_tree_of_the_database" - the new sequences will be aligned to the sequences in the tree
path_to_db: "path_to_the_fasta_file_of_the_full_database"
path_to_tax: "path_to_the_taxonomy_file_of_the_full_database"
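
For example, an update run for a hypothetical nirK database could use the settings below. The dates and file paths are illustrative - point them at the files produced by your initial build:

gene: nirK
full_name: "copper-containing nitrite reductase"
minlength: 450
cutoff_otu: 0.18
cutoff_dm: 0.25
framebot_db: true
update: true                  # must be true for the update workflow
mindate: 2022/01/01           # day after the initial download
maxdate: 2023/06/30           # today
path_to_tree: "results/nirK.tree.treefile"
path_to_seqs: "results/nirK.tree.fasta"
path_to_db: "results/nirK.db.fasta"
path_to_tax: "results/nirK.db.taxonomy"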

5. Check if the config file is correct and which steps will be run

snakemake -n -s Snakefile.update

6. Run the pipeline. -j specifies the number of threads; --use-conda lets Snakemake install the required tools through conda. Optionally run this in a tmux session, as shown earlier.

snakemake -j 8 -s Snakefile.update --use-conda

Refining sequence taxonomy

After generating your database files, you still need to check the taxonomy/clustering of the sequences in the phylogenetic tree and improve the unassigned/unclassified ones.

It is also possible to download metadata for all sequences using NCBI's Entrez Direct utilities and add that information to the taxonomy string:

conda create -n entrez -c bioconda entrez-direct
conda activate entrez

Example for the nirK gene:

esearch -db nucleotide -query "nirK[gene]" | efetch -format gpc | xtract -insd source organism mol_type strain country isolation_source | sort | uniq > metadata_nirk.txt

After having your tree ready and metadata downloaded (optional):

1. Check whether there are cultivated representatives in the OTU groups - the OTU representative sequence is not always a cultivated/known organism, and the program cannot distinguish this. Check the file "interm/{gene}.aligned.good.filter.unique.pick.good.filter.an.{cutoff_otu}.rep.names"

2. Look at the tree – check if there are defined clades in the literature

3. You can match the sequences with their full taxonomy strings downloaded from NCBI using the VLOOKUP function (Excel) - this makes checking and formatting the taxonomy file easier.

4. After checking and refining/correcting the taxonomies of the OTUs (down to genus level; work on species/strains in the full taxonomy list), expand the taxonomy to all sequences in each OTU group (remember that not all sequences in the database are in the tree, only the OTU representatives) - use the expand_taxonomy.R script

5. In the full taxonomy file, check whether the cultivated representatives have the correct taxonomy (Excel). Add the species and environmental origin to the last level of the taxonomy

6. Some formatting requirements that have to be observed (see the sketch after this list):

  • The names and number of sequences in the .fasta and .taxonomy files must match

  • Formatting depends on the downstream software – remove all spaces and avoid special characters

    • for mothur, taxonomy strings should end with ";"

    • for qiime2, taxonomy strings should end without ";" - you also have to remove the gaps from the sequences
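
A quick sketch of these checks and fixes with standard shell tools, assuming hypothetical file names nirk.db.fasta and nirk.db.taxonomy (adapt them to your own files):

# the two counts must be identical
grep -c ">" nirk.db.fasta
wc -l < nirk.db.taxonomy

# mothur: append a trailing ';' to taxonomy strings that lack one
sed '/;$/!s/$/;/' nirk.db.taxonomy > nirk.mothur.taxonomy

# qiime2: strip the trailing ';' and remove alignment gaps from the sequences
sed 's/;$//' nirk.db.taxonomy > nirk.qiime2.taxonomy
sed '/^>/!s/[-.]//g' nirk.db.fasta > nirk.qiime2.fasta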

Code Snippets

R script that matches the FASTA sequence names with their NCBI taxonomy by accession number:

library("tidyr")

#get fasta sequence names
#get taxonomy file
#isolate accession number from fasta seqnames
#add column with accession numbers to fasta seqname files
#isolate accession number from taxonomy file
#merge taxonomy and fasta seqnames files by accession number column
#final file with only sequence names and associated taxonomy

gene.names<-read.table(file=snakemake@input[["fasta"]])
gene.tax<-read.table(file=snakemake@input[["tax"]], sep="\t")
gene.names.columns<-separate(data = gene.names, col = V1, into = c("accession", "rest"), sep = "\\_")
gene.names$accession<-gene.names.columns$accession
gene.tax.columns<-separate(data = gene.tax, col = V1, into = c("accession", "rest"), sep = "\\.")
gene.taxonomy<-merge(gene.names,gene.tax.columns, by.x = "accession", by.y = "accession", all.x = FALSE)
gene.taxonomy.final<-gene.taxonomy[,c(2,4)]

write.table(gene.taxonomy.final, file=snakemake@output[["final_tax"]], sep="\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

Snakemake shell rule that downloads the gene sequences from NCBI (CDS coordinates first, then the corresponding FASTA), with or without a date range:

shell:"""
       if [[ {config[update]} == True ]]; then
           esearch -db nucleotide -query "{params.gene}[gene]" -mindate {params.mindate} -maxdate {params.maxdate} | \
           efetch -format gpc | \
           xtract -pattern INSDFeature -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDInterval_accession -element INSDInterval_from -element INSDInterval_to | \
           sort -u -k1,1 | \
           uniq | \
           awk 'NF<4' | \
           xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' > {output}
       else
           esearch -db nucleotide -query "{params.gene}[gene]" | \
           efetch -format gpc | \
           xtract -pattern INSDFeature -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDInterval_accession -element INSDInterval_from -element INSDInterval_to | \
           sort -u -k1,1 | \
           uniq | \
           awk 'NF<4' | \
           xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' > {output}
       fi
       """

Shell rule that downloads accession numbers and taxonomy strings from NCBI:

shell:"""
       if [[ {config[update]} == True ]]; then
           esearch -db nucleotide -query "{params.gene}[gene]" -mindate {params.mindate} -maxdate {params.maxdate} | \
           efetch -format gpc | \
           xtract -pattern INSDSeq -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDSeq_accession-version -element INSDSeq_taxonomy |\
           sort -u -k1,1 |\
           uniq > {output}
      else
           esearch -db nucleotide -query "{params.gene}[gene]" | \
           efetch -format gpc | \
           xtract -pattern INSDSeq -if INSDFeature_key -equals CDS -and INSDQualifier_value -equals {params.gene} -or INSDQualifier_value -contains '{params.full_name}' -element INSDSeq_accession-version -element INSDSeq_taxonomy |\
           sort -u -k1,1 |\
           uniq > {output}
      fi
     """

Reformat the sequence names with sed:

shell:
    """
    sed -e 's/[.]/-/' -e 's/ /-/g' -e 's/_//g' {input} | \
    stdbuf -o0 cut -d "-" -f 1,4,5| \
    sed -e 's/-/_/g' -e 's/[.]//g' -e 's/[,]//g' > {output}
    """

List the names of sequences flagged as UNVERIFIED:

shell:
    """
    set +o pipefail; grep UNVERIFIED {input} | \
    stdbuf -o0 cut -c 2- > {output}
    """

Remove the UNVERIFIED sequences with mothur:

shell:
    '''
    mothur "#remove.seqs(accnos={input.accnos}, fasta={input.fasta})"
    '''

Trim sequences below the minimum length or containing ambiguous bases:

shell:
    '''
    mothur "#trim.seqs(fasta={input}, minlength={params.minlength}, maxambig=0, processors={threads})"
    '''

Correct frameshifts with FrameBot:

shell:
    "FrameBot framebot -o {params} -N {input.db_framebot} {input.fasta}"
(From line 133 of master/Snakefile)

Align the sequences with MAFFT:

shell:
    "mafft --thread {threads} --auto {input} >{output}"

Screen the alignment, optimizing the start and end positions:

shell:
    '''
    mothur "#screen.seqs(fasta={input}, optimize=start-end, criteria=96, processors={threads})"
    '''

Filter the alignment columns:

shell:
    '''
    mothur "#filter.seqs(fasta={input}, vertical=T, trump=., processors={threads})"
    '''

Dereplicate identical sequences:

shell:
    '''
    mothur "#unique.seqs(fasta={input})"
    '''            

Detect chimeras with VSEARCH:

shell:
    '''
    mothur "#chimera.vsearch(fasta={input.fasta}, name={input.name})"
    '''

Remove the chimeric sequences:

shell:
    '''
    mothur "#remove.seqs(accnos={input.accnos}, fasta={input.fasta}, name={input.name})"
    '''

Screen the dereplicated alignment again:

shell:'''
        mothur "#screen.seqs(fasta={input.fasta}, name={input.name}, optimize=start-end, criteria=96, processors={threads})"
        '''

Filter the alignment again:

shell:
    '''
    mothur "#filter.seqs(fasta={input}, vertical=T, trump=., processors={threads})"
    '''

Compute the pairwise distance matrix:

shell:
    '''
    mothur "#dist.seqs(fasta={input}, cutoff={config[cutoff_dm]}, processors={threads})"
    '''

Cluster the sequences into OTUs (average neighbor):

shell:
    '''
    mothur "#cluster(column={input.column}, name={input.name}, method=average, cutoff={config[cutoff_dm]})"
    '''

Get the OTU representative sequences:

shell:
    '''
    mothur "#get.oturep(column={input.column}, fasta={input.fasta}, name={input.name}, list={input.list}, cutoff={config[cutoff_otu]})"
    '''

Restore the full sequence set with deunique.seqs:

shell:
    '''
    mothur "#deunique.seqs(fasta={input.fasta}, name={input.name}, outputdir=./results)"
    '''

Format the taxonomy strings (remove underscores and spaces, append ';'):

shell:
    """
    sed -e 's/_//g' -e 's/ //g' -e 's/$/;/' {input} > {output}
    """
(From line 309 of master/Snakefile)

Extract the sequence names from the FASTA file:

shell:
    """
    grep ">" {input} | stdbuf -o0 cut -c 2- > {output}
    """
(From line 319 of master/Snakefile)

Attach the taxonomy to the sequence names with the renaming.R script:

script:
    "renaming.R"
(From line 332 of master/Snakefile)

Build the reference tree with IQ-TREE (ModelFinder, SH-aLRT and ultrafast bootstrap):

shell:
   "iqtree -s {input}  -m MFP -alrt 1000 -bb 1000 -nt {threads} -pre {params}"

Concatenate the new sequences with the sequences used to build the reference tree:

shell:
    """
    cat {input.fasta_new} {input.fasta} > {output}
    """
(From line 356 of master/Snakefile)

Re-align the combined sequence set with MAFFT:

shell:
    """
    mafft --thread {threads} --auto {input} >{output}
    """

Place the new sequences onto the reference tree with RAxML's evolutionary placement algorithm (-f v):

shell:
    """
    raxmlHPC -f v -s {input.fasta} -t {input.tree} -m GTRCAT -H -n {params}
    """

Concatenate the new sequences with the full database FASTA:

shell:
    """
    cat {input.fasta_new} {input.fasta_db} > {output}
    """
(From line 396 of master/Snakefile)

URL: https://github.com/nioo-knaw/PhyloFunDB
License: GNU General Public License v3.0