GEN-ERA toolbox suite of Nextflow-Singularity workflows

The GEN-ERA toolbox is a suite of Nextflow-Singularity workflows designed for comparative genomics of bacteria and small eukaryotes. Without any installation, it allows researchers to download, assemble and bin (meta)genomes from short or long reads. The suite also performs orthology inference and maximum likelihood phylogenomic analyses (with bootstrap or jackknife support), and can infer an SSU rRNA phylogeny constrained by a ribosomal phylogenomic tree. Average nucleotide identity, GTDB identification and metabolic modelling are also included in the toolbox.

BCCM GEN-ERA tools repository

Please visit the wiki for tutorials and access to the tools: https://github.com/Lcornet/GENERA/wiki

NEWS

Phylogeny.nf has been updated and now uses RAxML v8. The fast method is no longer used for jackknife; the ML best tree is used instead.

Information about the GEN-ERA project

Please visit
https://bccm.belspo.be/content/bccm-collections-genomic-era

GEN-ERA project final report

Pierre Becker, Luc Cornet, Elizabet D’hooge, Ilse Cleenwerck, Oren Tzfadia, Leen Rigouts, Wim Mulders, Heide-Marie Daniel, Annick Wilmotte, Denis Baurain. BCCM collections in the genomic era. Final Report.
Brussels: Belgian Science Policy Office, 2022. 40 p. (BRAIN-be 2.0 (Belgian Research Action through Interdisciplinary Networks))
https://www.belspo.be/belspo/brain2-be/projects/FinalReports/BCCMGENERA_FinRep.pdf

Publications

  1. ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies.
    Léonard, R. R., Leleu, M., Vlierberghe, M. V., Cornet, L., Kerff, F., and Baurain, D. (2021).
    PeerJ 9, e11348. doi:10.7717/peerj.11348.
    https://peerj.com/articles/11348/

  2. The taxonomy of the Trichophyton rubrum complex: a phylogenomic approach.
    Cornet, L., D’hooge, E., Magain, N., Stubbe, D., Packeu, A., Baurain, D., and Becker, P. (2021).
    Microbial Genomics 7, 000707. doi:10.1099/mgen.0.000707.
    https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000707

  3. ORPER: A Workflow for Constrained SSU rRNA Phylogenies.
    Cornet, L., Ahn, A.-C., Wilmotte, A., and Baurain, D. (2021).
    Genes 12, 1741. doi:10.3390/genes12111741.
    https://www.mdpi.com/2073-4425/12/11/1741/html

  4. AMAW: automated gene annotation for non-model eukaryotic genomes.
    Meunier, L., Baurain, D., Cornet, L. (2021)
    https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1

  5. Phylogenomic analyses of Snodgrassella isolates from honeybees and bumblebees reveals taxonomic and functional diversity.
    Cornet, L., Cleenwerck, I., Praet, J., Leonard, R., Vereecken, N.J., Michez, D., Smagghe, G., Baurain, D., Vandamme, P. (2021)
    https://doi.org/10.1128/msystems.01500-21

  6. Contamination detection in genomic data: more is not enough.
    Cornet, L. and Baurain, D. (2022).
    Genome Biology 23, 60.
    https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02619-9

  7. The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics
    Cornet, L., Durieu, B., Baert, F., D’hooge, E., Colignon, D., Meunier, L., Lupo, V., Cleenwerck, I., Daniel, H.-M., Rigouts, L., Sirjacobs, D., Declerck, D., Vandamme, P., Wilmotte, A., Baurain, D., and Becker, P. (2022).
    https://www.biorxiv.org/content/10.1101/2022.10.20.513017v1

  8. CRitical Assessment of genomic COntamination detection at several Taxonomic ranks (CRACOT)
    Cornet, L., Lupo, V., Declerck, S., Baurain, D. (2022).
    https://www.biorxiv.org/content/10.1101/2022.11.14.516442v1

Copyright and License

This software is copyright (c) 2017-2021 by University of Liege / Sciensano / BCCM collection by Luc CORNET. This is free software; you can redistribute it and/or modify it.

Code Snippets

Lines 146-149:
"""
setup-taxdir.pl --taxdir=$workingdir
echo $workingdir > taxdump_path.txt
"""
Lines 156-158:
"""
echo $workingdir > taxdump_path.txt
"""
Lines 168-170:
"""
echo $taxdump > taxdump_path.txt
"""
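The three variants above all end the same way: the taxonomy directory is written to taxdump_path.txt, and later steps read it back when they need a --taxdir value. The following is a minimal sketch of that hand-off, not the workflow's own code. The directory name is a made-up stand-in, and `$(cat file)` is used as the portable spelling of the bash-specific `$(<file)` that appears in the later snippets.

```shell
#!/bin/sh
set -eu

# Hypothetical taxonomy directory; in the workflow this is $workingdir or $taxdump.
workdir=$(mktemp -d)
taxdump="$workdir/taxdump"
mkdir -p "$taxdump"

# Producer step: record the directory once in a small text file.
printf '%s\n' "$taxdump" > "$workdir/taxdump_path.txt"

# Consumer step: read the path back where --taxdir is needed.
taxdir=$(cat "$workdir/taxdump_path.txt")
echo "--taxdir=$taxdir"
```

Passing the path through a file lets every later process pick up the same directory regardless of which of the three variants produced it.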
Lines 193-209:
"""
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -O refseq_sum.txt
$companion refseq_sum.txt --mode=sum
grep -v "#" refseq_sum-filt.txt | cut -f1 > GCF.list
fetch-tax.pl GCF.list  --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
grep -v "#" refseq_sum-filt.txt | cut -f20 > ftp.list
grep -v "#" refseq_sum-filt.txt | cut -f20 | cut -f10 -d"/" > names.list
for f in `cat ftp.list `; do echo "/"; done > slash.list
for f in `cat ftp.list `; do echo "_genomic.fna.gz"; done > end1.list
for f in `cat ftp.list `; do echo "wget "; done > get.list
for f in `cat ftp.list `; do echo " -O "; done > out.list
for f in `cat ftp.list `; do echo ".fna.gz"; done > end2.list
cut -f1,2 -d"_" names.list > id.list
paste get.list ftp.list slash.list names.list end1.list out.list id.list end2.list > ftp.sh
sed -i -e 's/\t//g' ftp.sh
echo "RefSeq metadata" >> Genome-downloader.log
"""
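The snippet above assembles each download command by pasting together eight helper lists and then deleting the tabs. As a sketch only, the same command lines can be produced in one awk pass. The input row below is invented, the example.org path is a placeholder, and the assembly name is taken from the last path component rather than from field 10 as in the original.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
summary="$workdir/refseq_sum-filt.txt"

# Made-up stand-in for the filtered summary: one comment line, then one row
# with 20 tab-separated columns (accession in column 1, FTP path in column 20).
awk 'BEGIN {
  print "# comment line"
  row = "GCF_000000001.1"
  for (i = 2; i <= 19; i++) row = row "\tna"
  print row "\tftp://example.org/genomes/all/GCF/000/000/001/GCF_000000001.1_ASM1v1"
}' > "$summary"

# One pass: skip comment lines, derive the assembly name (last path component)
# and the short accession (first two "_"-separated tokens), emit the command.
awk -F '\t' '!/^#/ {
  n = split($20, part, "/")
  name = part[n]
  split(name, tok, "_")
  printf "wget %s/%s_genomic.fna.gz -O %s_%s.fna.gz\n", $20, name, tok[1], tok[2]
}' "$summary" > "$workdir/ftp.sh"

cat "$workdir/ftp.sh"
```

Note one difference in behaviour: the sketch skips only lines that start with `#`, whereas `grep -v "#"` drops any line containing that character.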
Lines 213-217:
"""
echo "Add Refseq Genomes NOT activated" > GCF.tax
echo "Add Refseq Genomes NOT activated" > ftp.sh
echo "Add Refseq metadata NOT activated" >> Genome-downloader.log
"""
Lines 242-258:
"""
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt -O genbank_sum.txt
$companion genbank_sum.txt --mode=sum
grep -v "#" genbank_sum-filt.txt | cut -f1 > GCA.list
fetch-tax.pl GCA.list  --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
grep -v "#" genbank_sum-filt.txt | cut -f20 > ftp.list
grep -v "#" genbank_sum-filt.txt | cut -f20 | cut -f10 -d"/" > names.list
for f in `cat ftp.list `; do echo "/"; done > slash.list
for f in `cat ftp.list `; do echo "_genomic.fna.gz"; done > end1.list
for f in `cat ftp.list `; do echo "wget "; done > get.list
for f in `cat ftp.list `; do echo " -O "; done > out.list
for f in `cat ftp.list `; do echo ".fna.gz"; done > end2.list
cut -f1,2 -d"_" names.list > id.list
paste get.list ftp.list slash.list names.list end1.list out.list id.list end2.list > GCA-ftp.sh
sed -i -e 's/\t//g' GCA-ftp.sh
echo "Add GenBank metadata activated" >> Genome-downloader.log
"""
Lines 262-266:
"""
echo "Add GenBank Genomes NOT activated" > GCA.tax
echo "Add GenBank Genomes NOT activated" > GCA-ftp.sh
echo "Add GenBank metadata NOT activated" >> Genome-downloader.log
"""
Lines 296-305:
"""
#Produce list of GCF IDs with reference group and taxa levels
$companion GCF.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCF.refgroup.uniq`; do grep \$f ftp.sh; done > reduce-ftp.sh
bash reduce-ftp.sh
gunzip *.gz
find *.fna | cut -f1,2 -d"." > fna.list
for f in `cat fna.list`; do inst-abbr-ids.pl \$f*.fna --id-regex=:DEF --id-prefix=\$f; done
echo "Add RefSeq Genomes, abbr mode" >> Genome-downloader.log
"""
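The step above selects the download commands for the retained accessions by running one grep per ID. A possible alternative, shown here as a sketch on invented data, is a single `grep -F -f` call that reads all accessions from the list file. With `-F` the dot in an accession is matched literally instead of as a regex wildcard. The output keeps the order of ftp.sh rather than the order of the ID list.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Made-up inputs: the selected accessions and the full list of download commands.
printf '%s\n' 'GCF_000000002.1' > "$workdir/GCF.refgroup.uniq"
cat > "$workdir/ftp.sh" <<'EOF'
wget ftp://example.org/a/GCF_000000001.1_A/GCF_000000001.1_A_genomic.fna.gz -O GCF_000000001.1.fna.gz
wget ftp://example.org/b/GCF_000000002.1_B/GCF_000000002.1_B_genomic.fna.gz -O GCF_000000002.1.fna.gz
EOF

# -F: treat each accession as a fixed string; -f: read the patterns from a file.
grep -F -f "$workdir/GCF.refgroup.uniq" "$workdir/ftp.sh" > "$workdir/reduce-ftp.sh"

cat "$workdir/reduce-ftp.sh"
```

grep exits with a non-zero status when nothing matches, so under `set -e` an empty selection would stop the script; whether that is desirable depends on how the calling process handles empty selections.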
Lines 308-315:
"""
#Produce list of GCF IDs with reference group and taxa levels
$companion GCF.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCF.refgroup.uniq`; do grep \$f ftp.sh; done > reduce-ftp.sh
bash reduce-ftp.sh
gunzip *.gz
echo "Add RefSeq Genomes, non abbr mode" >> Genome-downloader.log
"""
Lines 320-325:
"""
echo "Add RefSeq Genomes NOT activated" > FALSER-abbr.fna
echo "Add RefSeq Genomes NOT activated" > reduce-ftp.sh
echo "GCF_FALSE" > GCF.refgroup.uniq
echo "Add RefSeq Genomes NOT activated" >> Genome-downloader.log
"""
Lines 354-366:
"""
#Produce list of GCA IDs with reference group and taxa levels
$companion GCA.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCA.refgroup.uniq`; do grep \$f GCA-ftp.sh; done > GCA-reduce-ftp.sh
bash GCA-reduce-ftp.sh
gunzip *.gz
find *.fna | cut -f1,2 -d"." > fna.list
for f in `cat fna.list`; do inst-abbr-ids.pl \$f*.fna --id-regex=:DEF --id-prefix=\$f; done
#Create placeholder (false) GenBank files so that the workflow can proceed
echo "Add GenBank Genomes activated" > FALSE-abbr.fna
echo "Add GenBank Genomes activated" > FALSE-GCA-reduce-ftp.sh
echo "Add GenBank Genomes activated, abbr mode" >> Genome-downloader.log
"""
Lines 370-380:
"""
#Produce list of GCA IDs with reference group and taxa levels
$companion GCA.tax --mode=fetch --taxa=$taxa --refgroup=$group
for f in `cat GCA.refgroup.uniq`; do grep \$f GCA-ftp.sh; done > GCA-reduce-ftp.sh
bash GCA-reduce-ftp.sh
gunzip *.gz
#Create placeholder (false) GenBank files so that the workflow can proceed
echo "Add GenBank Genomes activated" > FALSE-abbr.fna
echo "Add GenBank Genomes activated" > FALSE-GCA-reduce-ftp.sh
echo "Add GenBank Genomes activated, non abbr mode" >> Genome-downloader.log
"""
Lines 385-389:
"""
echo "Add GenBank Genomes NOT activated" > FALSE-abbr.fna
echo "Add GenBank Genomes NOT activated" > GCA-reduce-ftp.sh
echo "Add GenBank Genomes NOT activated" >> Genome-downloader.log
"""
Lines 417-426:
"""
#Delete false GenBank files
rm -rf FALSE*
mkdir GEN
mv *.fna GEN/
dRep dereplicate DREP -g GEN/*.fna -p $cpu
mkdir DEREPLICATED/
mv DREP/dereplicated_genomes/*.fna DEREPLICATED/
echo "Drep Dereplication activated" >> Genome-downloader.log
"""
Lines 429-438:
"""
#Delete false GenBank files
rm -rf FALSE*
mkdir GEN
mv *.fna GEN/
dRep dereplicate DREP -g GEN/*.fna -p $cpu --ignoreGenomeQuality
mkdir DEREPLICATED/
mv DREP/dereplicated_genomes/*.fna DEREPLICATED/
echo "Drep Dereplication activated" >> Genome-downloader.log
"""
Lines 443-449:
"""
#Delete false GenBank files
rm -rf FALSE*
mkdir DEREPLICATED/
mv *.fna DEREPLICATED/
echo "Drep Dereplication not activated" >> Genome-downloader.log
"""
Lines 475-496:
"""
#log part
echo test > Genome-downloader.log
#Merge ftp
cat ftp.sh GCA-ftp.sh > merge-ftp.sh
grep -v 'activated' merge-ftp.sh > temp; mv temp merge-ftp.sh
#Collect GCA/GCF accession numbers
cp DEREPLICATED/*.fna .
find *.fna > fna.list
sed -i -e 's/-abbr.fna//g' fna.list
sed -i -e 's/.fna//g' fna.list
rm -rf *.fna
#Get ftp part
for f in `cat fna.list`; do grep \$f merge-ftp.sh; done > prot-ftp.sh
sed -i -e 's/_genomic.fna.gz/_protein.faa.gz/g' prot-ftp.sh
sed -i -e 's/fna.gz/faa.gz/g' prot-ftp.sh
bash prot-ftp.sh
find *.gz -type f -empty -print -delete
gunzip *.gz
mkdir PROT
mv *.faa PROT/
"""
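The protein step above reuses the genome download commands: it merges the RefSeq and GenBank lists, drops the "NOT activated" placeholder lines, rewrites the file suffixes, and removes empty archives before decompression. The sketch below replays that logic on invented data without touching the network. The file names and the example.org path are placeholders, and the sed expressions escape the dots and anchor the output suffix, which the original does not.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Made-up inputs: one real-looking command and one placeholder line.
printf '%s\n' 'wget ftp://example.org/a/GCF_000000001.1_A/GCF_000000001.1_A_genomic.fna.gz -O GCF_000000001.1.fna.gz' > "$workdir/ftp.sh"
printf '%s\n' 'Add GenBank Genomes NOT activated' > "$workdir/GCA-ftp.sh"

# Merge both lists, drop placeholder lines, then point each command at the
# protein FASTA instead of the genomic one (dots escaped, suffix anchored).
cat "$workdir/ftp.sh" "$workdir/GCA-ftp.sh" \
  | grep -v 'activated' \
  | sed -e 's/_genomic\.fna\.gz/_protein.faa.gz/' -e 's/\.fna\.gz$/.faa.gz/' \
  > "$workdir/prot-ftp.sh"
cat "$workdir/prot-ftp.sh"

# Simulate one failed (zero-byte) and one successful download, then remove
# the empty archive before decompression, as the find call above does.
: > "$workdir/GCF_000000009.1.faa.gz"
printf 'data' > "$workdir/GCF_000000001.1.faa.gz"
find "$workdir" -name '*.gz' -type f -empty -print -delete
```

Removing zero-byte archives first matters because gunzip fails on them, and not every genome has a protein file on the server.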
Lines 500-505:
"""
#log part
echo "Prot download NOT activated" > Genome-downloader.log
mkdir PROT
echo "Prot download NOT activated" > PROT/info.faa
"""
Lines 532-550:
"""
mv DEREPLICATED/*.fna .
#log part
echo "Genome-downloader started at `date`" > Genome-downloader.log
echo "Genomes downloaded: " >> Genome-downloader.log
find *.fna | wc -l >> Genome-downloader.log
echo "Genome-downloader, version: " >> Genome-downloader.log
echo "3.0.0 " >> Genome-downloader.log
#copy part
#find *.fna | cut -f1 -d'-' > GC.list
find *.fna  > GC.list
mkdir GENOMES/
mv *.fna GENOMES/
sed -i -e 's/.fna//g' GC.list
fetch-tax.pl GC.list  --taxdir=\$(<taxdump_path.txt) --item-type=taxid --levels=phylum class order family genus species
mv GC.tax Genomes.taxomonomy
mkdir PROTEINS
mv PROT/*.faa PROTEINS/
"""
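In the final step above, GC.list is derived from the genome file names by listing the files and then stripping the extension with `sed 's/.fna//g'`, where the unescaped dot matches any character. A sketch of an equivalent approach using `basename`, which removes only the trailing suffix, is given below. The genome file names are invented for illustration.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/GENOMES"

# Made-up genome files standing in for the dereplicated assemblies.
: > "$workdir/GENOMES/GCF_000000001.1.fna"
: > "$workdir/GENOMES/GCA_000000002.1.fna"

# basename strips the directory and only the trailing ".fna" suffix.
for f in "$workdir"/GENOMES/*.fna; do
  basename "$f" .fna
done > "$workdir/GC.list"

cat "$workdir/GC.list"
```

For accession-style names both approaches give the same list; the difference only shows up if a name happens to contain "fna" somewhere other than the extension.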

URL: https://github.com/Lcornet/GENERA
Name: gen-era-toolbox
Version: Version 1
Copyright: Public Domain
License: GNU Affero General Public License v3.0