MGnify genomes analysis pipeline

A pipeline that performs taxonomic and functional annotation and generates a catalogue from a set of isolate and/or metagenome-assembled genomes (MAGs), using the workflow described in the following publication:

Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Richardson L, Rogers AB, Sakharova E, Salazar GA and Finn RD. (2023) MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. J Mol Biol. doi: https://doi.org/10.1016/j.jmb.2023.168016

Detailed information about existing MGnify catalogues: https://docs.mgnify.org/src/docs/genome-viewer.html

Code Snippets

"""
amrfinder --plus \
-n ${fna} \
-p ${faa} \
-g ${gff} \
-d ${params.amrfinder_plus_db} \
-a prokka \
--output ${cluster}_amrfinderplus.tsv \
--threads ${task.cpus}
"""
"""
annotate_gff.py \
-g ${gff} \
-i ${ips_annotations_tsv} \
-e ${eggnog_annotations_tsv} \
-r ${ncrna_tsv} \
${crisprcas_flag} ${sanntis_flag} ${amrfinder_flag}
"""
"""
touch ${gff.simpleName}_annotated.gff
"""
"""
bracken-build -d ${kraken_db} \
-t ${task.cpus} \
-l ${read_length}
"""
"""
checkm lineage_wf -t ${task.cpus} -x fa --tab_table ${assemblies_folder} checkm_output

# to csv #
checkm2csv.py -i checkm_output > checkm_quality.csv
"""
"""
touch checkm_quality.csv
"""
"""
classify_folders.py -g ${genomes_folder} --text-file ${text_file}

# Clean any empty directories #
find many_genomes -type d -empty -print -delete
find one_genome -type d -empty -print -delete

mv many_genomes pg
mv one_genome sg
"""
"""
get_core_genes.py \
-i ${panaroo_gen_preabs} \
-o ${cluster_name}.core_genes.txt
"""
"""
CRISPRCasFinder.pl -i $fasta \
-so /opt/CRISPRCasFinder/sel392v2.so \
-def G \
-drpt /opt/CRISPRCasFinder/supplementary_files/repeatDirection.tsv \
-outdir crisprcasfinder_results

echo "Running post-processing"

process_crispr_results.py \
--tsv-report crisprcasfinder_results/TSV/Crisprs_REPORT.tsv \
--gffs crisprcasfinder_results/GFF/*gff \
--tsv-output crisprcasfinder_results/${fasta.baseName}_crisprcasfinder.tsv \
--gff-output crisprcasfinder_results/${fasta.baseName}_crisprcasfinder.gff \
--gff-output-hq crisprcasfinder_results/${fasta.baseName}_crisprcasfinder_hq.gff \
--fasta $fasta
"""
"""
cmscan \
--cpu ${task.cpus} \
--tblout overlapped_${fasta.baseName} \
--hmmonly \
--clanin ${rfam_ncrna_models}/Rfam.clanin \
--fmt 2 \
--cut_ga \
--noali \
-o /dev/null \
${rfam_ncrna_models}/Rfam.cm \
${fasta}

# De-overlap #
grep -v " = " overlapped_${fasta.baseName} > ${fasta.baseName}.ncrna.deoverlap.tbl
"""
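
The de-overlap step above relies on the overlap column that `cmscan --fmt 2` writes to the table: hits marked `=` overlap a better-scoring hit, and `grep -v " = "` drops those rows. A minimal Python sketch of the same filter follows. The function name `drop_overlapped_hits` and the shortened sample rows are illustrative only and are not part of the pipeline.

```python
def drop_overlapped_hits(tblout_lines):
    """Keep every line that does not contain ' = ', mirroring grep -v " = "."""
    return [line for line in tblout_lines if " = " not in line]


# Hypothetical, heavily shortened rows; real tables have many more columns.
rows = [
    "#idx target accession query olp",
    "1 5S_rRNA RF00001 contig_1 * ",
    "2 tRNA RF00005 contig_1 ^ ",
    "3 tmRNA RF00023 contig_1 = ",
]
kept = drop_overlapped_hits(rows)
for line in kept:
    print(line)
```

As with `grep -v`, comment lines pass through untouched, so the header is kept in the de-overlapped table.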
"""
shopt -s extglob

RESULTS_FOLDER=results_folder
FASTA=${fasta}
CM_DB=${cm_models}

BASENAME=\$(basename "\${FASTA}")
FILENAME="\${BASENAME%.*}"

mkdir "\${RESULTS_FOLDER}"

echo "[ Detecting rRNAs ] "

for CM_FILE in "\${CM_DB}"/*.cm; do
    MODEL=\$(basename "\${CM_FILE}")
    echo "Running cmsearch for \${MODEL}..."
    cmsearch -Z 1000 \
        --hmmonly \
        --cut_ga --cpu ${task.cpus} \
        --noali \
        --tblout "\${RESULTS_FOLDER}/\${FILENAME}_\${MODEL}.tblout" \
        "\${CM_FILE}" "\${FASTA}" 1> "\${RESULTS_FOLDER}/\${FILENAME}_\${MODEL}.out"
done

echo "Concatenating results..."
cat "\${RESULTS_FOLDER}/\${FILENAME}"_*.tblout > "\${RESULTS_FOLDER}/\${FILENAME}.tblout"

echo "Removing overlaps..."
cmsearch-deoverlap.pl \
--maxkeep \
--clanin "\${CM_DB}/ribo.claninfo" \
"\${RESULTS_FOLDER}/\${FILENAME}.tblout"

mv "\${FILENAME}.tblout.deoverlapped" "\${RESULTS_FOLDER}/\${FILENAME}.tblout.deoverlapped"

echo "Parsing final results..."
parse_rRNA-bacteria.py -i \
"\${RESULTS_FOLDER}/\${FILENAME}.tblout.deoverlapped" 1> "\${RESULTS_FOLDER}/\${FILENAME}_rRNAs.out"

rRNA2seq.py -d \
"\${RESULTS_FOLDER}/\${FILENAME}.tblout.deoverlapped" \
-i "\${FASTA}" 1> "\${RESULTS_FOLDER}/\${FILENAME}_rRNAs.fasta"

echo "[ Detecting tRNAs ]"
tRNAscan-SE -B -Q \
-m "\${RESULTS_FOLDER}/\${FILENAME}_stats.out" \
-o "\${RESULTS_FOLDER}/\${FILENAME}_trna.out" "\${FASTA}"

parse_tRNA.py -i "\${RESULTS_FOLDER}/\${FILENAME}_stats.out" 1> "\${RESULTS_FOLDER}/\${FILENAME}_tRNA_20aa.out"

echo "Completed"
"""
"""
dRep dereplicate -g ${genomes_directory}/*.fa \
-p ${task.cpus} \
-pa 0.9 \
-sa 0.95 \
-nc 0.30 \
-cm larger \
-comp 50 \
-con 5 \
-extraW ${extra_weights_table} \
--genomeInfo ${checkm_csv} \
drep_output

tar -czf drep_data_tables.tar.gz drep_output/data_tables
"""
"""
mkdir -p drep_output/data_tables
touch drep_output/data_tables/Cdb.csv
touch drep_output/data_tables/Mdb.csv
touch drep_output/data_tables/Sdb.csv
"""
Nextflow: from line 51 of modules/drep.nf (the block above)
"""
emapper.py -i ${fasta} \
--database ${eggnog_db} \
--dmnd_db ${eggnog_diamond_db} \
--data_dir ${eggnog_data_dir} \
-m diamond \
--no_file_comments \
--cpu ${task.cpus} \
--no_annot \
--dbmem \
-o ${fasta.baseName}
"""
"""
emapper.py \
--data_dir ${eggnog_data_dir} \
--no_file_comments \
--cpu ${task.cpus} \
--annotate_hits_table ${annotation_hit_table} \
--dbmem \
-o ${annotation_hit_table.baseName}
"""
"""
touch eggnog-output.emapper.seed_orthologs

echo "#query	seed_ortholog	evalue	score	eggNOG_OGs	max_annot_lvl	COG_category	Description	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs" > eggnog-output.emapper.seed_orthologs

echo "MGYG000000012_00001	948106.AWZT01000053_gene1589	1.1e-63	199.0	COG5654@1|root,COG5654@2|Bacteria,1N6P3@1224|Proteobacteria,2VSGY@28216|Betaproteobacteria,1KFU4@119060|Burkholderiaceae	28216|Betaproteobacteria	S	RES	-	-	-	-	-	-	-	-	-	-	-	-	RES" >> eggnog-output.emapper.seed_orthologs

echo "MGYG000000001_00001	948106.AWZT01000053_gene1589	1.1e-63	199.0	COG5654@1|root,COG5654@2|Bacteria,1N6P3@1224|Proteobacteria,2VSGY@28216|Betaproteobacteria,1KFU4@119060|Burkholderiaceae	28216|Betaproteobacteria	S	RES	-	-	-	-	-	-	-	-	-	-	-	-	RES" >> eggnog-output.emapper.seed_orthologs

echo "MGYG000000020_00001	948106.AWZT01000053_gene1589	1.1e-63	199.0	COG5654@1|root,COG5654@2|Bacteria,1N6P3@1224|Proteobacteria,2VSGY@28216|Betaproteobacteria,1KFU4@119060|Burkholderiaceae	28216|Betaproteobacteria	S	RES	-	-	-	-	-	-	-	-	-	-	-	-	RES" >> eggnog-output.emapper.seed_orthologs
"""
"""
touch eggnog-output.emapper.annotations
echo "#query	seed_ortholog	evalue	score	eggNOG_OGs	max_annot_lvl	COG_category	Description	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs" > eggnog-output.emapper.annotations

echo "MGYG000000012_00001	59538.XP_005971304.1	7.97e-152	431.0	COG0101@1|root,KOG4393@2759|Eukaryota,39RAQ@33154|Opisthokonta,3BK4Y@33208|Metazoa,3D27W@33213|Bilateria,48A93@7711|Chordata,494G6@7742|Vertebrata,3J2WS@40674|Mammalia 33208|Metazoa	J	synthase-like 1 -	GO:0001522	-	-	-	-	-	-	-	-	-	-	DSPc,Laminin_G_3,PseudoU_synth_1" >> eggnog-output.emapper.annotations

echo "MGYG000000001_00001	 59538.XP_005971304.1	7.97e-152	431.0	COG0101@1|root,KOG4393@2759|Eukaryota,39RAQ@33154|Opisthokonta,3BK4Y@33208|Metazoa,3D27W@33213|Bilateria,48A93@7711|Chordata,494G6@7742|Vertebrata,3J2WS@40674|Mammalia 33208|Metazoa	J	synthase-like 1 -	GO:0001522	-	-	-	-	-	-	-	-	-	-	DSPc,Laminin_G_3,PseudoU_synth_1" >> eggnog-output.emapper.annotations

echo "MGYG000000020_00001	59538.XP_005971304.1	7.97e-152	431.0	COG0101@1|root,KOG4393@2759|Eukaryota,39RAQ@33154|Opisthokonta,3BK4Y@33208|Metazoa,3D27W@33213|Bilateria,48A93@7711|Chordata,494G6@7742|Vertebrata,3J2WS@40674|Mammalia 33208|Metazoa	J	synthase-like 1 -	GO:0001522	-	-	-	-	-	-	-	-	-	-	DSPc,Laminin_G_3,PseudoU_synth_1" >> eggnog-output.emapper.annotations
"""
"""
filter_qs50.py -i ${genomes} -c ${checkm_csv} --filter
"""
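
The internals of `filter_qs50.py` are not shown on this page. Judging by its name, it presumably applies the widely used genome quality score (completeness minus five times contamination) and keeps genomes that score above 50. The Python sketch below illustrates that assumed rule only. The function names, the strict `>` comparison and the sample values are assumptions, not the script's actual code.

```python
def quality_score(completeness, contamination):
    """Assumed QS definition: completeness - 5 * contamination (both in percent)."""
    return completeness - 5.0 * contamination


def passes_qs50(completeness, contamination, threshold=50.0):
    """Return True when the assumed quality score exceeds the threshold."""
    return quality_score(completeness, contamination) > threshold


# Hypothetical CheckM-style values: (completeness, contamination).
examples = {"genome_a": (95.0, 2.0), "genome_b": (60.0, 3.0)}
kept = [name for name, (comp, cont) in examples.items() if passes_qs50(comp, cont)]
print(kept)
```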
"""
functional_annotations_summary.py \
-f ${cluster_rep_faa} \
-i ${ips_annotation_tsvs} \
-e ${eggnog_annotation_tsvs} \
-k ${kegg_classes}
"""
"""
touch ${cluster_rep_faa.baseName}_annotation_coverage.tsv
touch ${cluster_rep_faa.baseName}_kegg_classes.tsv
touch ${cluster_rep_faa.baseName}_kegg_modules.tsv
touch ${cluster_rep_faa.baseName}_cazy_summary.tsv
touch ${cluster_rep_faa.baseName}_cog_summary.tsv
"""
"""
cut -f1 ${mmseqs_100_cluster_tsv} | sort -u > rep_list.txt

mkdir gene_catalogue

cp ${mmseqs_100_cluster_tsv} gene_catalogue/clusters.tsv

# Make the catalogue #
seqtk subseq \
${cluster_reps_ffn} \
rep_list.txt > gene_catalogue/gene_catalogue-100.ffn
"""
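
The gene catalogue step above takes the unique first-column IDs from the cluster table (`cut -f1 | sort -u`) and pulls the matching records out of the FASTA file with `seqtk subseq`. The same idea can be sketched in Python as below. Both function names and the toy data are hypothetical, and the sketch works on in-memory lists of lines rather than on files.

```python
def representative_ids(cluster_tsv_lines):
    """Unique first-column values, sorted (the cut -f1 | sort -u step)."""
    ids = {line.rstrip("\n").split("\t")[0] for line in cluster_tsv_lines if line.strip()}
    return sorted(ids)


def subset_fasta(fasta_lines, wanted_ids):
    """Keep only records whose ID (first word after '>') is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep:
            out.append(line)
    return out


# Toy data: representative<TAB>member rows and a three-record FASTA.
clusters = ["g1\tg1\n", "g1\tg2\n", "g3\tg3\n"]
fasta = [">g1 desc\n", "ACGT\n", ">g2\n", "TTTT\n", ">g3\n", "GGCC\n"]
reps = representative_ids(clusters)
catalogue = subset_fasta(fasta, reps)
print("".join(catalogue), end="")
```

Python sorts by code point, whereas `sort -u` follows the active locale, so the order of `rep_list.txt` may differ; the set of IDs is the same.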
"""
generate_extra_weight_table.py \
-d ${genomes_folder} \
-o extra_weight_table.txt ${args}
"""
"""
touch extra_weight_table.txt
"""
"""
generate_summary_json.py \
--annot-cov ${coverage_summary} \
--gff ${annotated_gff} \
--metadata ${metadata} \
--biome "${biome}" \
--species-faa ${cluster_rep_faa} \
--species-name ${cluster} ${args} \
--output-file ${cluster}.json
"""
"""
touch ${cluster}.json
"""
"""
GTDBTK_DATA_PATH=/opt/gtdbtk_refdata \
gtdbtk classify_wf \
--cpus ${task.cpus} \
--pplacer_cpus ${task.cpus} \
--genome_dir genomes_dir \
--extension fna \
--skip_ani_screen \
--out_dir gtdbtk_results

tar -czf gtdbtk_results.tar.gz gtdbtk_results
"""
"""
mkdir gtdbtk_results

mkdir -p gtdbtk_results/classify
touch gtdbtk_results/classify/gtdbtk.bac120.summary.tsv
touch gtdbtk_results/classify/gtdbtk.ar53.summary.tsv

echo "user_genome	classification	fastani_reference	fastani_reference_radius	fastani_taxonomy	fastani_ani	fastani_af	closest_placement_reference	closest_placement_radius	closest_placement_taxonomy	closest_placement_ani	closest_placement_af	pplacer_taxonomy	classification_method	note	other_related_references(genome_id,species_name,radius,ANI,AF)	msa_percent	translation_table	red_value	warnings" > gtdbtk_results/classify/gtdbtk.bac120.summary.tsv

for file in $drep_folder/*
do
    GENOME=\$(basename \$file .fna)
    echo "\$GENOME	d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__Rothia mucilaginosa_B	GCF_001548235.1	95	d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__Rothia mucilaginosa_B	95.51	0.96	GCF_000175615.1	95	d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__Rothia mucilaginosa	94.5	0.94	d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Micrococcaceae;g__Rothia;s__	ANI	topological placement and ANI have incongruent species assignments	GCF_000269965.1, s__Bifidobacterium infantis, 95.0, 94.8, 0.77	97.9	11	N/A	N/A" >> gtdbtk_results/classify/gtdbtk.bac120.summary.tsv
done
"""
"""
gunc run -t ${task.cpus} \
-i ${fasta} \
-r ${gunc_db}

### gunc contaminated genomes ###
awk '{if(\$8 > 0.45 && \$9 > 0.05 && \$12 > 0.5) print\$1}' GUNC.*.maxCSS_level.tsv | grep -v "pass.GUNC" >gunc_contaminated.txt

# gunc_contaminated.txt may be empty, which means the genome is OK
# If gunc_contaminated.txt lists this genome, GUNC flagged it as contaminated

### check completeness ###

# remove header
tail -n +2 "${genomes_checkm}" > genomes.csv

### get notcompleted genomes ###
cat genomes.csv | tr ',' '\t' | awk '{if(\$2 < 90)print\$1}' > notcompleted.txt

grep -f gunc_contaminated.txt notcompleted.txt > bad.txt || true
# If bad.txt is not empty, the genome failed both the completeness and the GUNC filters

### final decision ###

if [ -s bad.txt ]; then
    touch ${fasta.baseName}_gunc_empty.txt
else
    touch ${fasta.baseName}_gunc_complete.txt
fi
"""
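
The script above rejects a genome only when two conditions hold together: the `awk` test on columns 8, 9 and 12 of the GUNC maxCSS table flags it, and column 2 of the CheckM CSV (the completeness, going by the script's comments) is below 90. A minimal Python sketch of that decision follows. The parameter names are assumptions about what those columns hold, and the sketch takes the values for a single genome instead of matching names with `grep -f`.

```python
def flagged_by_gunc(col8, col9, col12):
    """Mirror of the awk test: $8 > 0.45 && $9 > 0.05 && $12 > 0.5."""
    return col8 > 0.45 and col9 > 0.05 and col12 > 0.5


def fails_filters(col8, col9, col12, completeness):
    """A genome is 'bad' only if GUNC flags it AND its completeness is below 90."""
    return flagged_by_gunc(col8, col9, col12) and completeness < 90


def output_name(basename, bad):
    """File the script touches: *_gunc_empty.txt when bad, else *_gunc_complete.txt."""
    return f"{basename}_gunc_empty.txt" if bad else f"{basename}_gunc_complete.txt"


# Hypothetical values for one genome.
bad = fails_filters(0.6, 0.1, 0.8, completeness=85.0)
print(output_name("MGYG000000001", bad))
```

A genome flagged by GUNC but at least 90% complete still gets the `_gunc_complete.txt` marker under this logic.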
"""
samtools faidx ${fasta}
"""
"""
touch ${fasta.simpleName}.fai
"""
"""
interproscan.sh \
-cpu ${task.cpus} \
-dp \
--goterms \
-pa \
-f TSV \
--input ${faa_fasta} \
-o ${faa_fasta.baseName}.IPS.tsv
"""
"""
gunzip -c ${msa_fasta_gz} > ${output_prefix}_alignment.faa

iqtree -T 8 \
-s ${output_prefix}_alignment.faa \
--prefix iqtree.${output_prefix}

cp iqtree.${output_prefix}.treefile ${output_prefix}_iqtree.nwk
cp ${msa_fasta_gz} ${output_prefix}_alignment.faa.gz
"""
"""
# Prepare the GTDB inputs #
cat ${gtdbtk_concatenated} | grep -v \"user_genome\" | cut -f1-2 > kraken_taxonomy_temp.tsv

while read line; do
    NAME=\$(echo \$line | cut -d ' ' -f1 | cut -d '.' -f1)
    echo \$line | sed "s/__\\;/__\$NAME\\;/g" | sed "s/s__\$/s__\$NAME/g"
done < kraken_taxonomy_temp.tsv > kraken_taxonomy.tsv

sed -i "s/ /\t/" kraken_taxonomy.tsv

gtdbToTaxonomy.pl \
--infile kraken_taxonomy.tsv \
--sequence-dir reps_fa/ \
--output-dir kraken_intermediate

mkdir ${kraken_db_name}

cp -r kraken_intermediate/taxonomy ${kraken_db_name}
"""
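
The `while read` loop above fills every empty rank in a GTDB lineage with the genome name, so that unclassified levels still get a distinct label before the Kraken taxonomy is built. The two `sed` substitutions can be sketched in Python as follows. The function name `fill_empty_ranks` and the sample lineage are hypothetical, and the sketch handles the lineage string alone, whereas the shell loop edits the whole line.

```python
def fill_empty_ranks(genome_id, lineage):
    """Replace empty ranks ('x__;' or a trailing 's__') with the genome name."""
    name = genome_id.split(".")[0]                # cut -d '.' -f1
    lineage = lineage.replace("__;", f"__{name};")  # sed "s/__;/__NAME;/g"
    if lineage.endswith("s__"):                   # sed "s/s__$/s__NAME/"
        lineage += name
    return lineage


# Hypothetical lineage with an empty genus and species.
filled = fill_empty_ranks("MGYG000000001.fa", "d__Bacteria;f__Micrococcaceae;g__;s__")
print(filled)
```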
"""
kraken2-build \
--add-to-library ${cluster_fna_tax_annotated} \
--db ${kraken_db_path}
"""
"""
kraken2-build --build \
--db ${kraken_db_path} \
--threads ${task.cpus}
"""
"""
cat ${kraken_db}/library/added/*.fna > ${kraken_db}/library/library.fna

cp "${kraken_db}"/taxonomy/prelim_map.txt ${kraken_db}/library
"""
"""
mash2nwk1.R -m ${mash}

mv trees/mashtree.nwk ${mash.baseName}.nwk
"""
"""
mash sketch -o all_genomes.msh ${genomes_fasta.join( ' ' )}
"""
"""
merge_ncbi_ena.py --ncbi ${ncbi_genomes} \
--ncbi-csv ${ncbi_genomes_checkm} \
--ena ${ena_genomes} \
--ena-csv ${ena_genomes_checkm} \
--outname merged_genomes
"""
"""
mkdir merged_genomes
touch merged_genomes.csv
"""
"""
create_metadata_table.py \
--genomes-dir genomes_dir \
--extra-weight-table ${extra_weights_tsv} \
--checkm-results ${check_results_tsv} \
--rna-results rRNA_outs \
--naming-table ${name_mapping_tsv} \
--clusters-table ${clusters_tsv} \
--taxonomy ${gtdb_summary_tsv} \
--ftp-name ${ftp_name} \
--ftp-version ${ftp_version} \
--geo ${geo_metadata} ${args} \
--outfile genomes-all_metadata.tsv
"""
"""
timestamp() {
    date +"%H:%M:%S"
}
echo "\$(timestamp) [mmseqs script] Creating MMseqs database"

mmseqs createdb ${faa_file} mmseqs.db

echo "\$(timestamp) [mmseqs script] Clustering MMseqs with linclust with options --min-seq-id ${id_threshold} and -c ${cov_threshold}"

mmseqs linclust \
mmseqs.db \
mmseqs_cluster.db \
mmseqs-tmp --min-seq-id ${id_threshold} \
--threads ${task.cpus} \
-c ${cov_threshold} \
--cov-mode 1 \
--cluster-mode 2 \
--kmer-per-seq 80

echo "\$(timestamp) [mmseqs script] Parsing output to create FASTA file of all sequences"

mmseqs createseqfiledb mmseqs.db \
mmseqs_cluster.db \
mmseqs_cluster_seq \
--threads ${task.cpus}

mmseqs result2flat mmseqs.db \
mmseqs.db \
mmseqs_cluster_seq \
mmseqs_cluster.fa

echo "\$(timestamp) [mmseqs script] Parsing output to create TSV file with cluster membership"

mmseqs createtsv mmseqs.db \
mmseqs.db \
mmseqs_cluster.db \
protein_catalogue-${threshold_rounded}.tsv \
--threads ${task.cpus}

echo "\$(timestamp) [mmseqs script] Parsing output to create FASTA file of representative sequences"

mmseqs result2repseq \
mmseqs.db \
mmseqs_cluster.db \
mmseqs_cluster_rep \
--threads ${task.cpus}

mmseqs result2flat \
mmseqs.db \
mmseqs.db \
mmseqs_cluster_rep \
protein_catalogue-${threshold_rounded}.faa \
--use-fasta-header

# Create a tarball with all the mmseq files
tar -cv mmseqs* | gzip > mmseq_${threshold_rounded}_outdir.tar.gz

tar -cv protein_catalogue-${threshold_rounded}.faa \
protein_catalogue-${threshold_rounded}.tsv | gzip > protein_catalogue-${threshold_rounded}.tar.gz
"""
"""
panaroo \
-t ${task.cpus} \
-i ${gff_files.join( ' ' )} \
-o ${cluster_name}_panaroo \
--clean-mode strict \
--merge_paralogs \
--core_threshold 0.90 \
--threshold 0.90 \
--family_threshold 0.5 \
--no_clean_edges

mv ${cluster_name}_panaroo/pan_genome_reference.fa ${cluster_name}_panaroo/${cluster_name}.pan-genome.fna

tar -czf ${cluster_name}_panaroo.tar.gz ${cluster_name}_panaroo
"""
"""
per_genome_annotations.py \
--ips ${ips_annotations_tsv} \
--eggnog ${eggnog_annotations_tsv} \
--rep-list ${species_reps_csv} \
--mmseqs-tsv ${mmseq_tsv} \
-c ${task.cpus} \
-o output_folder
"""
"""
phylo_tree_generator.py --table ${gtdb_taxonomy_tsv} --out phylo_tree.json
"""
"""
cat ${fasta} | tr '-' ' ' > ${fasta.baseName}_cleaned.fasta

prokka ${fasta.baseName}_cleaned.fasta \
--cpus ${task.cpus} \
--kingdom 'Bacteria' \
--outdir ${fasta.baseName}_prokka \
--prefix ${fasta.baseName} \
--force \
--locustag ${fasta.baseName}
"""
"""
rename_fasta.py -d ${genomes} \
-p ${genomes_prefix} \
-i ${start_number} \
--max ${max_number} \
-t name_mapping.tsv \
-o renamed_genomes \
--csv ${check_csv}
"""
"""
mkdir renamed_genomes
touch name_mapping.tsv
touch renamed_${check_csv.baseName}_checkm.txt
"""
"""
gunzip -c ${interproscan_tsv} > interproscan.tsv
sanntis \
--ip-file interproscan.tsv \
--outfile ${cluster_name}_sanntis.gff \
${prokka_gbk}
"""
"""
sanntis \
--ip-file ${interproscan_tsv} \
--outfile ${cluster_name}_sanntis.gff \
${prokka_gbk}
"""
"""
split_drep.py --cdb ${cdb_csv} --mdb ${mdb_csv} --sdb ${sdb_csv} -o split_output
"""
"""
mv ${interproscan_annotations} protein_catalogue-90_InterProScan.tsv
mv ${eggnog_annotations} protein_catalogue-90_eggNOG.tsv

gunzip -c ${mmseq_90_tarball} > protein_catalogue-90.tar

rm ${mmseq_90_tarball}

tar -uf protein_catalogue-90.tar protein_catalogue-90_InterProScan.tsv protein_catalogue-90_eggNOG.tsv

gzip protein_catalogue-90.tar
"""
"""
rm -f gunc_failed.txt || true
touch gunc_failed.txt
for GUNC_FAILED in failed_gunc/*; do
    name=\$(basename \$GUNC_FAILED)
    genome_name="\${name%"_gunc_empty.txt"}"
    echo \$genome_name >> gunc_failed.txt
done
"""
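
The loop above recovers each genome name by stripping the `_gunc_empty.txt` suffix from the marker files, using the shell expansion `${name%...}`, which leaves a name unchanged when the suffix is absent. An equivalent Python sketch is shown below; the function name and the sample file names are illustrative only.

```python
def failed_genome_names(filenames, suffix="_gunc_empty.txt"):
    """Strip the marker suffix when present; leave other names unchanged."""
    names = []
    for name in filenames:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        names.append(name)
    return names


# Hypothetical contents of the failed_gunc/ directory.
markers = ["MGYG000000001_gunc_empty.txt", "MGYG000000020_gunc_empty.txt"]
failed = failed_genome_names(markers)
print("\n".join(failed))
```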

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/EBI-Metagenomics/genomes-pipeline.git
Name: mgnify-genomes-analysis-pipeline
Version: Version 1
Copyright: Public Domain
License: GNU General Public License v3.0