GEN-ERA toolbox suite of Nextflow-Singularity workflows

Version 1

The GEN-ERA toolbox is a suite of Nextflow-Singularity workflows designed for comparative genomics of bacteria and small eukaryotes. Without any installation, it allows researchers to download, assemble and bin (meta)genomes from short or long reads. The suite also performs orthology inference and maximum likelihood phylogenomic analyses (with bootstrap and jackknife support), and it can infer SSU rRNA phylogenies constrained by a ribosomal phylogenomic tree. Average nucleotide identity, GTDB identification and metabolic modelling are also included in the toolbox.

BCCM GEN-ERA tools repository

Please visit the wiki for tutorials and access to the tools: https://github.com/Lcornet/GENERA/wiki

NEWS

Phylogeny.nf has been updated and now uses RAxML v8. The fast method is no longer used for jackknife; the workflow now uses the ML best tree instead.

Information about the GEN-ERA project

Please visit
https://bccm.belspo.be/content/bccm-collections-genomic-era

GEN-ERA project final report

Pierre Becker, Luc Cornet, Elizabet D’hooge, Ilse Cleenwerck, Oren Tzfadia, Leen Rigouts, Wim Mulders, Heide-Marie Daniel, Annick Wilmotte, Denis Baurain. BCCM collections in the genomic era. Final Report.
Brussels: Belgian Science Policy Office, 2022. 40 p. (BRAIN-be 2.0, Belgian Research Action through Interdisciplinary Networks)
https://www.belspo.be/belspo/brain2-be/projects/FinalReports/BCCMGENERA_FinRep.pdf

Publications

ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies.
Léonard, R. R., Leleu, M., Vlierberghe, M. V., Cornet, L., Kerff, F., and Baurain, D. (2021).
PeerJ 9, e11348. doi:10.7717/peerj.11348.
https://peerj.com/articles/11348/

The taxonomy of the Trichophyton rubrum complex: a phylogenomic approach.
Cornet, L., D’hooge, E., Magain, N., Stubbe, D., Packeu, A., Baurain, D., and Becker, P. (2021).
Microbial Genomics 7, 000707. doi:10.1099/mgen.0.000707.
https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000707

ORPER: A Workflow for Constrained SSU rRNA Phylogenies.
Cornet, L., Ahn, A.-C., Wilmotte, A., and Baurain, D. (2021).
Genes 12, 1741. doi:10.3390/genes12111741.
https://www.mdpi.com/2073-4425/12/11/1741/html

AMAW: automated gene annotation for non-model eukaryotic genomes.
Meunier, L., Baurain, D., and Cornet, L. (2021).
https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1

Phylogenomic analyses of Snodgrassella isolates from honeybees and bumblebees reveals taxonomic and functional diversity.
Cornet, L., Cleenwerck, I., Praet, J., Leonard, R., Vereecken, N. J., Michez, D., Smagghe, G., Baurain, D., and Vandamme, P. (2021).
https://doi.org/10.1128/msystems.01500-21

Contamination detection in genomic data: more is not enough.
Cornet, L., and Baurain, D. (2022).
Genome Biology 23, 60.
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02619-9

The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics.
Cornet, L., Durieu, B., Baert, F., D’hooge, E., Colignon, D., Meunier, L., Lupo, V., Cleenwerck, I., Daniel, H.-M., Rigouts, L., Sirjacobs, D., Declerck, D., Vandamme, P., Wilmotte, A., Baurain, D., and Becker, P. (2022).
https://www.biorxiv.org/content/10.1101/2022.10.20.513017v1

CRitical Assessment of genomic COntamination detection at several Taxonomic ranks (CRACOT).
Cornet, L., Lupo, V., Declerck, S., and Baurain, D. (2022).
https://www.biorxiv.org/content/10.1101/2022.11.14.516442v1

Copyright and License

Code Snippets

"""
setup-taxdir.pl --taxdir=$workingdir
echo $workingdir > taxdump_path.txt
"""

NextFlow From line 146 of Nextflow/Genome-downloader.nf

"""
echo $workingdir > taxdump_path.txt
"""

NextFlow From line 156 of Nextflow/Genome-downloader.nf

"""
echo $taxdump > taxdump_path.txt
"""

NextFlow From line 168 of Nextflow/Genome-downloader.nf
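
The three snippets above all record the taxonomy directory in taxdump_path.txt so that later steps can read it back with \$(<taxdump_path.txt). In a Nextflow script string, $workingdir and $taxdump are filled in by Nextflow before the shell runs, while the escaped \$(...) reaches the shell as an ordinary command substitution; $(<file) is the bash shorthand for $(cat file). The sketch below illustrates that hand-off in plain shell, using the portable cat form. The helper names save_taxdir and load_taxdir and the path /data/taxdump are assumptions made for illustration and are not part of the workflow.

```shell
#!/bin/sh
# Sketch only: save_taxdir and load_taxdir are hypothetical helpers,
# and /data/taxdump is an example path.

# Write the taxonomy directory path to a file (default: taxdump_path.txt).
save_taxdir() {
    printf '%s\n' "$1" > "${2:-taxdump_path.txt}"
}

# Print the stored path; command substitution strips the trailing newline.
load_taxdir() {
    cat "${1:-taxdump_path.txt}"
}

workdir=$(mktemp -d)
save_taxdir "/data/taxdump" "$workdir/taxdump_path.txt"
taxdir=$(load_taxdir "$workdir/taxdump_path.txt")
echo "--taxdir=$taxdir"   # the option string a later step would pass on
rm -rf "$workdir"
```

Storing the path in a file rather than in a variable lets separate processes share it, since each Nextflow process runs in its own shell.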

"""
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -O refseq_sum.txt
$companion refseq_sum.txt --mode=sum
grep -v "#" refseq_sum-filt.txt | cut -f1 > GCF.list
fetch-tax.pl GCF.list  --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
grep -v "#" refseq_sum-filt.txt | cut -f20 > ftp.list
grep -v "#" refseq_sum-filt.txt | cut -f20 | cut -f10 -d"/" > names.list
for f in `cat ftp.list `; do echo "/"; done > slash.list
for f in `cat ftp.list `; do echo "_genomic.fna.gz"; done > end1.list
for f in `cat ftp.list `; do echo "wget "; done > get.list
for f in `cat ftp.list `; do echo " -O "; done > out.list
for f in `cat ftp.list `; do echo ".fna.gz"; done > end2.list
cut -f1,2 -d"_" names.list > id.list
paste get.list ftp.list slash.list names.list end1.list out.list id.list end2.list > ftp.sh
sed -i -e 's/\t//g' ftp.sh
echo "RefSeq metadata" >> Genome-downloader.log
"""

NextFlow From line 193 of Nextflow/Genome-downloader.nf
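
The snippet above assembles each download command by pasting eight column files together and then deleting the tabs. As a reading aid, the same command line can be produced in one step per URL. This is a minimal sketch, not the workflow's own code: the function name build_wget_line is hypothetical, the URL is only an illustration, and the sketch takes the last path component of the URL where the original takes the tenth "/"-separated field, which is equivalent only for URLs of that depth without a trailing slash.

```shell
#!/bin/sh
# Sketch only: build_wget_line is a hypothetical helper.
# Input: an assembly directory URL (column 20 of the NCBI assembly summary).
# Output: one wget command that saves the genome as <accession>.fna.gz.
build_wget_line() {
    url=$1
    name=${url##*/}                               # last path component
    id=$(printf '%s\n' "$name" | cut -d_ -f1,2)   # accession, e.g. GCF_000005845.2
    printf 'wget %s/%s_genomic.fna.gz -O %s.fna.gz\n' "$url" "$name" "$id"
}

# Illustrative URL; any assembly directory with the same layout works.
build_wget_line "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
```

Each output line has the same shape as a line of ftp.sh: the source file is named after the full assembly name, while the local copy keeps only the accession.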

"""
echo "Add Refseq Genomes NOT activated" > GCF.tax
echo "Add Refseq Genomes NOT activated" > ftp.sh
echo "Add Refseq metadata NOT activated" >> Genome-downloader.log
"""

NextFlow From line 213 of Nextflow/Genome-downloader.nf

"""
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt -O genbank_sum.txt
$companion genbank_sum.txt --mode=sum
grep -v "#" genbank_sum-filt.txt | cut -f1 > GCA.list
fetch-tax.pl GCA.list  --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
grep -v "#" genbank_sum-filt.txt | cut -f20 > ftp.list
grep -v "#" genbank_sum-filt.txt | cut -f20 | cut -f10 -d"/" > names.list
for f in `cat ftp.list `; do echo "/"; done > slash.list
for f in `cat ftp.list `; do echo "_genomic.fna.gz"; done > end1.list
for f in `cat ftp.list `; do echo "wget "; done > get.list
for f in `cat ftp.list `; do echo " -O "; done > out.list
for f in `cat ftp.list `; do echo ".fna.gz"; done > end2.list
cut -f1,2 -d"_" names.list > id.list
paste get.list ftp.list slash.list names.list end1.list out.list id.list end2.list > GCA-ftp.sh
sed -i -e 's/\t//g' GCA-ftp.sh
echo "Add GenBank metadata activated" >> Genome-downloader.log
"""

NextFlow From line 242 of Nextflow/Genome-downloader.nf

"""
echo "Add GenBank Genomes NOT activated" > GCA.tax
echo "Add GenBank Genomes NOT activated" > GCA-ftp.sh
echo "Add GenBank metadata NOT activated" >> Genome-downloader.log
"""

NextFlow From line 262 of Nextflow/Genome-downloader.nf

"""
#Produce list of GCF IDs with reference group and taxa levels
$companion GCF.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCF.refgroup.uniq`; do grep \$f ftp.sh; done > reduce-ftp.sh
bash reduce-ftp.sh
gunzip *.gz
find *.fna | cut -f1,2 -d"." > fna.list
for f in `cat fna.list`; do inst-abbr-ids.pl \$f*.fna --id-regex=:DEF --id-prefix=\$f; done
echo "Add RefSeq Genomes, abbr mode" >> Genome-downloader.log
"""

NextFlow From line 296 of Nextflow/Genome-downloader.nf
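
The loop above runs grep once per accession to pull the matching download commands out of ftp.sh. A possible alternative, shown here only as a sketch, is a single grep call that reads all accessions from the ID file. The function name select_downloads and the example.org URLs are assumptions for illustration. Unlike the loop, this version prints lines in the order of the command file and prints each line once; like the loop, it matches substrings, so an accession that is a prefix of another would still match both.

```shell
#!/bin/sh
# Sketch only: select_downloads is a hypothetical helper.
# -F treats every ID as a fixed string, so the dot in GCF_000005845.2 is literal.
# -f reads one pattern per line from the ID file.
select_downloads() {
    ids_file=$1
    commands_file=$2
    grep -F -f "$ids_file" "$commands_file"
}

workdir=$(mktemp -d)
printf '%s\n' "GCF_000005845.2" > "$workdir/GCF.refgroup.uniq"
printf '%s\n' \
    "wget ftp://example.org/A/GCF_000005845.2_ASM584v2_genomic.fna.gz -O GCF_000005845.2.fna.gz" \
    "wget ftp://example.org/B/GCF_000006945.2_ASM694v2_genomic.fna.gz -O GCF_000006945.2.fna.gz" \
    > "$workdir/ftp.sh"
select_downloads "$workdir/GCF.refgroup.uniq" "$workdir/ftp.sh"   # keeps only the first command
rm -rf "$workdir"
```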

"""
#Produce list of GCF IDs with reference group and taxa levels
$companion GCF.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCF.refgroup.uniq`; do grep \$f ftp.sh; done > reduce-ftp.sh
bash reduce-ftp.sh
gunzip *.gz
echo "Add RefSeq Genomes, non abbr mode" >> Genome-downloader.log
"""

NextFlow From line 308 of Nextflow/Genome-downloader.nf

"""
echo "Add RefSeq Genomes NOT activated" > FALSER-abbr.fna
echo "Add RefSeq Genomes NOT activated" > reduce-ftp.sh
echo "GCF_FALSE" > GCF.refgroup.uniq
echo "Add RefSeq Genomes NOT activated" >> Genome-downloader.log
"""

NextFlow From line 320 of Nextflow/Genome-downloader.nf

"""
#Produce list of GCA IDs with reference group and taxa levels
$companion GCA.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCA.refgroup.uniq`; do grep \$f GCA-ftp.sh; done > GCA-reduce-ftp.sh
bash GCA-reduce-ftp.sh
gunzip *.gz
find *.fna | cut -f1,2 -d"." > fna.list
for f in `cat fna.list`; do inst-abbr-ids.pl \$f*.fna --id-regex=:DEF --id-prefix=\$f; done
# Placeholder (FALSE*) files so that the workflow can proceed; they are deleted in a later step
echo "Add GenBank Genomes activated" > FALSE-abbr.fna
echo "Add GenBank Genomes activated" > FALSE-GCA-reduce-ftp.sh
echo "Add GenBank Genomes activated, abbr mode" >> Genome-downloader.log
"""

NextFlow From line 354 of Nextflow/Genome-downloader.nf
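
In the abbr-mode snippets, the value passed to --id-prefix is derived from each file name with cut -f1,2 -d".", which keeps the first two dot-separated fields. For files named ACCESSION.VERSION.fna this is the same as removing the .fna extension. The sketch below shows that derivation on its own; accession_from_filename is a hypothetical helper, the file name is illustrative, and the two approaches differ for names that contain additional dots.

```shell
#!/bin/sh
# Sketch only: accession_from_filename is a hypothetical helper.
# GCA_000005845.2.fna -> GCA_000005845.2
accession_from_filename() {
    base=${1##*/}                  # drop any leading directory
    printf '%s\n' "${base%.fna}"   # drop the .fna extension
}

accession_from_filename "GCA_000005845.2.fna"
```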

"""
#Produce list of GCA IDs with reference group and taxa levels
$companion GCA.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCA.refgroup.uniq`; do grep \$f GCA-ftp.sh; done > GCA-reduce-ftp.sh
bash GCA-reduce-ftp.sh
gunzip *.gz
# Placeholder (FALSE*) files so that the workflow can proceed; they are deleted in a later step
echo "Add GenBank Genomes activated" > FALSE-abbr.fna
echo "Add GenBank Genomes activated" > FALSE-GCA-reduce-ftp.sh
echo "Add GenBank Genomes activated, non abbr mode" >> Genome-downloader.log
"""

NextFlow From line 370 of Nextflow/Genome-downloader.nf

"""
echo "Add GenBank Genomes NOT activated" > FALSE-abbr.fna
echo "Add GenBank Genomes NOT activated" > GCA-reduce-ftp.sh
echo "Add GenBank Genomes NOT activated" >> Genome-downloader.log
"""

NextFlow From line 385 of Nextflow/Genome-downloader.nf

"""
# Delete the placeholder (FALSE*) GenBank files
rm -rf FALSE*
mkdir GEN
mv *.fna GEN/
dRep dereplicate DREP -g GEN/*.fna -p $cpu
mkdir DEREPLICATED/
mv DREP/dereplicated_genomes/*.fna DEREPLICATED/
echo "Drep Dereplication activated" >> Genome-downloader.log
"""

NextFlow dRep From line 417 of Nextflow/Genome-downloader.nf

"""
# Delete the placeholder (FALSE*) GenBank files
rm -rf FALSE*
mkdir GEN
mv *.fna GEN/
dRep dereplicate DREP -g GEN/*.fna -p $cpu --ignoreGenomeQuality
mkdir DEREPLICATED/
mv DREP/dereplicated_genomes/*.fna DEREPLICATED/
echo "Drep Dereplication activated" >> Genome-downloader.log
"""

NextFlow dRep From line 429 of Nextflow/Genome-downloader.nf

"""
# Delete the placeholder (FALSE*) GenBank files
rm -rf FALSE*
mkdir DEREPLICATED/
mv *.fna DEREPLICATED/
echo "Drep Dereplication not activated" >> Genome-downloader.log
"""

NextFlow From line 443 of Nextflow/Genome-downloader.nf

"""
#log part
echo test > Genome-downloader.log
#Merge ftp
cat ftp.sh GCA-ftp.sh > merge-ftp.sh
grep -v 'activated' merge-ftp.sh > temp; mv temp merge-ftp.sh
# Collect the GCA/GCF accession numbers
cp DEREPLICATED/*.fna .
find *.fna > fna.list
sed -i -e 's/-abbr.fna//g' fna.list
sed -i -e 's/.fna//g' fna.list
rm -rf *.fna
#Get ftp part
for f in `cat fna.list`; do grep \$f merge-ftp.sh; done > prot-ftp.sh
sed -i -e 's/_genomic.fna.gz/_protein.faa.gz/g' prot-ftp.sh
sed -i -e 's/fna.gz/faa.gz/g' prot-ftp.sh
bash prot-ftp.sh
find *.gz -type f -empty -print -delete
gunzip *.gz
mkdir PROT
mv *.faa PROT/
"""

NextFlow From line 475 of Nextflow/Genome-downloader.nf
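
The protein step above reuses the genome download commands and rewrites them with two sed substitutions: the remote file name changes from _genomic.fna.gz to _protein.faa.gz, and the local output name changes from .fna.gz to .faa.gz. The sketch below applies the same two rewrites to a single sample command. The function name to_protein_commands and the URL are illustrative assumptions; the bracket expressions [.] are used so that each dot matches literally without backslashes.

```shell
#!/bin/sh
# Sketch only: to_protein_commands is a hypothetical helper.
# Reads genome download commands on stdin and writes protein download
# commands on stdout.
to_protein_commands() {
    sed -e 's/_genomic[.]fna[.]gz/_protein.faa.gz/g' \
        -e 's/[.]fna[.]gz/.faa.gz/g'
}

printf '%s\n' \
    "wget ftp://example.org/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz -O GCF_000005845.2.fna.gz" \
    | to_protein_commands
```

Not every assembly necessarily provides a protein file, which is presumably why the original step deletes empty .gz files before running gunzip.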

"""
#log part
echo "Prot download NOT activated" > Genome-downloader.log
mkdir PROT
echo "Prot download NOT activated" > PROT/info.faa
"""

NextFlow From line 500 of Nextflow/Genome-downloader.nf

"""
mv DEREPLICATED/*.fna .
#log part
echo "Genome-downloader started at `date`" > Genome-downloader.log
echo "Genomes downloaded: " >> Genome-downloader.log
find *.fna | wc -l >> Genome-downloader.log
echo "Genome-downloader, version: " >> Genome-downloader.log
echo "3.0.0 " >> Genome-downloader.log
#copy part
#find *.fna | cut -f1 -d'-' > GC.list
find *.fna  > GC.list
mkdir GENOMES/
mv *.fna GENOMES/
sed -i -e 's/.fna//g' GC.list
fetch-tax.pl GC.list  --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
mv GC.tax Genomes.taxomonomy
mkdir PROTEINS
mv PROT/*.faa PROTEINS/
"""