Phylogenomics tutorial based on BUSCO genes


Disclaimer: To follow the demo and make the most of it, it helps to have some basic skills with running software tools and manipulating files using the Unix shell command line. It also assumes you have Docker installed on your computer (tested with Docker version 18.09.7, build 2d0083d, on Ubuntu 18.04).

Introduction

We will be reconstructing the phylogenetic relationships of some (iconic) vertebrates based on previously published whole genome data. The list of species we will be including in the analyses, and the URLs for the data download, can be found in this table.

All software used in the demo is deposited as Docker images on Dockerhub and all data is freely and publicly available.

The workflow we will demonstrate is as follows:

  • Download genomes from Genbank

  • Identify complete BUSCO genes in each of the genomes

  • Pre-filter orthology/BUSCO groups

  • For each BUSCO group:

    • build alignment

    • trim alignment

    • identify model of protein evolution

    • infer phylogenetic tree (ML)

  • construct supermatrix from individual gene alignments

  • infer phylogenomic tree with partitions corresponding to the original gene alignments using ML

  • map internode certainty (IC) onto the phylogenomic tree

Let's begin

Before you get going I suggest you download this repository, so you have all the scripts you'll need. Ideally you'd do it through git.

(user@host)-$ git clone https://github.com/chrishah/phylogenomics_intro_vertebrata.git

Then move into the newly cloned directory, and get ready.

(user@host)-$ cd phylogenomics_intro_vertebrata

1.) Download data from Genbank

What's the first species of vertebrate that pops into your head? Latimeria chalumnae perhaps? Let's see if someone has already attempted to sequence its genome. NCBI Genbank is usually a good place to start. Surf to the webpage and have a look. And indeed, we are lucky.

Let's get it downloaded. Note that the (user@host)-$ part of the code below just mimics a command line prompt. This will look different on each computer. The command you actually need to execute is the part after the prompt, so only, e.g., mkdir assemblies:

#First make a directory and enter it
(user@host)-$ mkdir assemblies
(user@host)-$ mkdir assemblies/Latimeria_chalumnae
(user@host)-$ cd assemblies/Latimeria_chalumnae
#use the wget program to download the genome
(user@host)-$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/225/785/GCF_000225785.1_LatCha1/GCF_000225785.1_LatCha1_genomic.fna.gz
#decompress for future use
(user@host)-$ gunzip GCF_000225785.1_LatCha1_genomic.fna.gz
#leave the directory for now
(user@host)-$ cd ../..

We have compiled a list of published genomes that we will be including in our analyses - the actual text file ships with the repository here: data/samples.csv. You don't have to download them all now, but do another few just as practice. You can do them one by one or use your scripting skills (a for loop) to get them all in one go.

To keep things clean I'd suggest downloading each genome into a separate directory named after the binomial (connected with underscores, rather than spaces; see the first column of the file data/samples.csv), following the same logic as in the example for L. chalumnae above - a possible loop is sketched below.
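If you want to script it, here is a minimal sketch of such a loop. Note the assumptions: the binomial is in the first column of data/samples.csv and the download URL in the second - check the actual file (including a possible header line) and adjust accordingly.

#!/bin/bash
#sketch of a download loop - the column layout of data/samples.csv is an assumption
while IFS="," read -r species url rest
do
 mkdir -p assemblies/${species}
 wget -P assemblies/${species} "${url}"
 gunzip -f assemblies/${species}/*.gz
done < data/samples.csv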

2.) Run BUSCO on each assembly

In order to identify a defined set of genes in all of our genomes we could use BUSCO, i.e. run it on each of the downloaded genomes.

Attention

Since these genomes are relatively large, BUSCO takes quite a while to run, so this step has already been done for you.

A reduced representation of the BUSCO results for each species ships with the repository in the directory results/orthology/busco/busco_runs .
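For reference only (nothing to run here - the results are precomputed): a single BUSCO run through Singularity might look roughly like the sketch below. The container (ezlabgva/busco) and its tag are assumptions on my part, not necessarily what was used to produce the precomputed results.

#hypothetical example of a single BUSCO run - not needed in this tutorial
(user@host)-$ singularity exec docker://ezlabgva/busco:v5.4.7_cv1 \
 busco -i assemblies/Latimeria_chalumnae/GCF_000225785.1_LatCha1_genomic.fna \
 -l vertebrata_odb10 -m genome -c 4 -o Latimeria_chalumnae_busco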

Take a few minutes to explore the reports.

3.) Prefiltering of BUSCO groups

Now, assuming that we ran BUSCO across a number of genomes, we're going to select a set of BUSCO genes to be included in our phylogenomic analyses. Let's get an overview.

We have a script to produce a matrix of presence/absence of BUSCO genes across multiple species. Let's try it out. In this tutorial we'll be using Docker containers through Singularity.

(user@host)-$ singularity exec docker://reslp/biopython_plus:1.77 \
 bin/extract_busco_table.py \
 --hmm results/orthology/busco/busco_set/vertebrata_odb10/hmms \
 --busco_results results/orthology/busco/busco_runs/ \
 -o busco_table.tsv

The resulting file busco_table.tsv can be found in your current directory.

ATTENTION

When calling singularity as above it will download the corresponding container from the cloud. This is very convenient, but might in some instances take a bit of time. If you are doing this exercise as part of a course you might be provided with local copies of the images to save some time. Please wait here to get instructions.

The following command would download the image and save it to a local *.sif file.

(user@host)-$ singularity pull docker://reslp/biopython_plus:1.77
(user@host)-$ ls -hrlt

Note that the previous command singularity pull .. downloaded a file called biopython_plus_1.77.sif - this is a physical representation of the image. Note also the naming scheme: Singularity names the file by combining the image name with the version (the part after the ':') via a '_', so 'biopython_plus:1.77' becomes 'biopython_plus_1.77.sif'.
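You can reproduce the naming scheme in the shell, if you like:

(user@host)-$ image="reslp/biopython_plus:1.77"
(user@host)-$ echo "$(basename ${image} | tr ':' '_').sif"
biopython_plus_1.77.sif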

To use the *.sif file instead of always querying the cloud, the command from before could be adjusted to:

(user@host)-$ singularity exec biopython_plus_1.77.sif \
 bin/extract_busco_table.py \
 --hmm results/orthology/busco/busco_set/vertebrata_odb10/hmms \
 --busco_results results/orthology/busco/busco_runs/ \
 -o busco_table.tsv

ATTENTION

If you're doing this as part of a course, all images may have been downloaded for you already.

Please check for example: ls ~/Share/Singularity_images/ or ask the instructors. Then, in all subsequent singularity calls, please use the local images rather than querying the cloud, so instead of:

singularity exec docker://reslp/biopython_plus:1.77 ..

adjust to

singularity exec ~/Share/Singularity_images/biopython_plus_1.77.sif ..

Moving on..

Next, we want to identify all genes that are present in at least 20 of our 25 taxa and collect the sequences from all species into a single fasta file per gene. We made a script for that - see below. Please make sure to also follow the instructions with respect to creating new directories, where suggested!

(user@host)-$ mkdir -p by_gene/raw
(user@host)-$ singularity exec docker://reslp/biopython_plus:1.77 \
 bin/create_sequence_files.py \
 --busco_table busco_table.tsv \
 --busco_results results/orthology/busco/busco_runs \
 --cutoff 0.5 \
 --outdir by_gene/raw \
 --minsp 20 \
 --type aa \
 --gene_statistics gene_stats.txt \
 --genome_statistics genome_statistics.txt 

A bunch of files have been created in your current directory ( gene_stats.txt and genome_statistics.txt ) and also in the directory by_gene/raw (per-gene fasta files).

4.) For each BUSCO group

For each of the BUSCOs that passed the prefiltering we want to:

  • do multiple sequence alignment

  • filter the alignment, i.e. remove ambiguous/problematic positions

  • build a phylogenetic tree

Let's go over a possible solution step by step for gene 409625at7742.

Perform multiple sequence alignment with clustalo.

#alignment with clustalo
(user@host)-$ mkdir by_gene/aligned
(user@host)-$ singularity exec docker://reslp/clustalo:1.2.4 \
 clustalo \
 -i by_gene/raw/409625at7742_all.fas \
 -o by_gene/aligned/409625at7742.clustalo.fasta \
 --threads=2

We can then look at the alignment result ( by_gene/aligned/409625at7742.clustalo.fasta ). There are a number of programs available to do that, e.g. MEGA, Jalview, AliView, or you can do it online. A link to the upload client for the NCBI Multiple Sequence Alignment Viewer is here (I suggest opening it in a new tab). Download the alignment ( by_gene/aligned/409625at7742.clustalo.fasta ) to your local computer, upload the file to the online tool, press the 'Close' button, and have a look.

What do you think? It's actually quite messy..

Let's move on to score and filter the alignment, using TrimAl.

#alignment trimming with trimal
(user@host)-$ mkdir by_gene/trimmed
(user@host)-$ singularity exec docker://reslp/trimal:1.4.1 \
 trimal \
 -in by_gene/aligned/409625at7742.clustalo.fasta \
 -out by_gene/trimmed/409625at7742.clustalo.trimal.fasta \
 -gappyout

Try opening the upload dialog for the Alignment viewer in a new tab and uploading the new file ( by_gene/trimmed/409625at7742.clustalo.trimal.fasta ). What do you think? The algorithm has removed quite a bit at the ends of the original alignment, reducing it to only ~100 positions, but these look mostly ok at first glance.

Now, let's infer an ML tree with IQtree.

#ML inference with IQTree
(user@host)-$ mkdir -p by_gene/phylogeny/409625at7742
(user@host)-$ singularity exec docker://reslp/iqtree:2.0.7 \
 iqtree \
 -s by_gene/trimmed/409625at7742.clustalo.trimal.fasta \
 --prefix by_gene/phylogeny/409625at7742/409625at7742 \
 -m MFP --seqtype AA -T 2 -bb 1000

The best scoring Maximum Likelihood tree can be found in the file: by_gene/phylogeny/409625at7742/409625at7742.treefile.

The tree is in the Newick tree format. There are a number of programs that allow you to view and manipulate trees in this format. You can also do it online, for example through iTOL , EMBL's online tree viewer. There are others, e.g. ETE3 , icytree , or trex . You can try them out, but first let's have a quick look at the terminal.

(user@host)-$ cat by_gene/phylogeny/409625at7742/409625at7742.treefile
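The content of the file is a single line in Newick notation: nested parentheses define the clades and the numbers after the colons are branch lengths. If the format is new to you, here is a minimal made-up example:

#a minimal Newick tree: two clades of two taxa each, with branch lengths
((Taxon_A:0.1,Taxon_B:0.2):0.05,(Taxon_C:0.3,Taxon_D:0.4):0.05);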

Go to iTOL , select 'Upload a tree' and copy/paste the textual representation of our tree into the 'Tree text' field.

What do you think? Does it look right? Remember that this is based on one gene, and different genes may tell different stories depending on their history, which may be affected by many factors.

Well done!

5.) Run the process for multiple genes

Now, let's say we want to go over these steps for multiple genes, say these:

  • 359032at7742

  • 413149at7742

  • 409719at7742

  • 406935at7742

A for loop would do the job, right? See the code below. Do you manage to add the tree inference step, too? It's not in there yet (a possible addition is sketched after the loop).

#!/bin/bash
for gene in $(echo "359032at7742 413149at7742 409719at7742 406935at7742")
do
 echo -e "\n$(date)\t$gene"
 echo -e "$(date)\taligning"
 singularity exec docker://reslp/clustalo:1.2.4 clustalo -i by_gene/raw/${gene}_all.fas -o by_gene/aligned/${gene}.clustalo.fasta --threads=2
 echo -e "$(date)\ttrimming"
 singularity exec docker://reslp/trimal:1.4.1 trimal -in by_gene/aligned/${gene}.clustalo.fasta -out by_gene/trimmed/${gene}.clustalo.trimal.fasta -gappyout
 echo -e "$(date)\tDone"
done
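In case you want to check your attempt: the missing step could mirror the single-gene IQtree call from above, e.g. something like this inside the loop (a sketch - not necessarily identical to the solution in backup/bygene.sh):

 echo -e "$(date)\tinferring tree"
 mkdir -p by_gene/phylogeny/${gene}
 singularity exec docker://reslp/iqtree:2.0.7 iqtree -s by_gene/trimmed/${gene}.clustalo.trimal.fasta --prefix by_gene/phylogeny/${gene}/${gene} -m MFP --seqtype AA -T 2 -bb 1000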

Prepare your code in a script bygene.sh .

A possible solution for the script (including the tree inference) can be found here: backup/bygene.sh , or if you'd been asked to use local singularity images check out the solution here: backup/bygene_local.sh .

Run your script, e.g. like so:

(user@host)-$ chmod a+x bygene.sh #make sure it's executable
(user@host)-$ ./bygene.sh

If something went wrong and you want to continue anyway you can get the data you'd have produced in the previous step by copying it from our backup directory.

(user@host)-$ rsync -avpuzP backup/by_gene/* by_gene

Well done! Now you should have five trees - one for each of the genes. Just to double-check:

(user@host)-$ ls -1 by_gene/phylogeny/*/*treefile
by_gene/phylogeny/359032at7742/359032at7742.treefile
by_gene/phylogeny/406935at7742/406935at7742.treefile
by_gene/phylogeny/409625at7742/409625at7742.treefile
by_gene/phylogeny/409719at7742/409719at7742.treefile
by_gene/phylogeny/413149at7742/413149at7742.treefile

Now, then, let's infer an ML tree using a supermatrix of all 5 genes that we have processed so far. Conveniently, you just need to point IQtree at a directory that contains multiple alignments and it will do the rest. In our case we use the trimmed alignments. Be aware, though, that in order for IQtree to match up the right sequences in the supermatrix you have to use the same names in all individual alignment files - a quick check is sketched below.
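A quick way to check is to count how often each sequence name occurs across the trimmed alignments - each species should appear at most once per gene, i.e. at most five times here:

(user@host)-$ grep -h ">" by_gene/trimmed/*.fasta | sort | uniq -c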

(user@host)-$ singularity exec docker://reslp/iqtree:2.0.7 \
 iqtree \
 -s by_gene/trimmed/ \
 --prefix five_genes \
 -m MFP --seqtype AA -T 2 -bb 1000 

This will run for about 10 minutes. You can check out the result five_genes.treefile once it's done.

(user@host)-$ cat five_genes.treefile #or try backup/five_genes.treefile instead if you had trouble

Now, we can also try to build a species tree from the 5 individual gene trees using ASTRAL .

TASK

First bring the individual gene trees together into one file. Let's call the file trees.txt .
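One way to do this, given the treefile locations listed above:

(user@host)-$ cat by_gene/phylogeny/*/*.treefile > trees.txt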

Then run ASTRAL.

(user@host)-$ singularity exec docker://reslp/astral:5.7.1 \
 java -jar /ASTRAL-5.7.1/Astral/astral.5.7.1.jar \
 -i trees.txt -o species_tree.astral.tre 

Have a look at the result.

(user@host)-$ cat species_tree.astral.tre #or try backup/species_tree.astral.tre instead if you had trouble

Instead of looking at the plain text representation you can also explore the trees e.g. via iTOL .

Congratulations, you've just built your first phylogenomic tree(s)!!!

6.) Automate the workflow with Snakemake

A very neat way of handling this kind of thing is Snakemake .

The bare minimum you'll need to create a Snakemake workflow is a so-called Snakefile. The repository ships with files called Snakemake_intro/Snakefile_local and Snakemake_intro/Snakefile_cloud . These files contain the instructions for running a basic workflow with Snakemake. Let's have a look.

(user@host)-$ less Snakemake_intro/Snakefile_local #exit less with 'q' - the version to use local images
(user@host)-$ less Snakemake_intro/Snakefile_cloud #exit less with 'q' - the version to use images from the cloud

In the Snakefile you'll see 'rules' (that's what individual steps in the analyses are called in the Snakemake world), some of which should look familiar, because we just ran them manually, and then again within a simple for loop. Filenames etc. are replaced with variables, but other than that the commands should be recognizable.

Spend some time to explore the file.
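To get a quick overview of the rule names (in standard Snakemake syntax every rule definition starts with the keyword 'rule'), you could simply grep for them:

(user@host)-$ grep "^rule" Snakemake_intro/Snakefile_local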

Snakemake should be installed on your system - check back with your instructors if you're doing this as part of a course. An easy way to get it set up is through conda. If you haven't set it up yet, we provide some instructions here.
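For reference, a typical conda-based setup (assuming conda is already installed; the channels follow the general Snakemake installation recommendations) might look like this:

(user@host)-$ conda create -n snakemake -c conda-forge -c bioconda snakemake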

Assuming you've set up a conda environment called snakemake (this will usually be the case if you do this as part of a course), in order to run Snakemake you first need to enter this environment.

(user@host)-$ conda activate snakemake
(snakemake) (user@host)-$ snakemake -h

Now, let's do a Snakemake 'dry-run', providing a specific target file, and see what happens. You'll first need to put a file called Snakefile in place - this is the default file name Snakemake expects.

(user@host)-$ cp Snakemake_intro/Snakefile_local Snakefile #or cp Snakemake_intro/Snakefile_cloud Snakefile if you prefer to query the cloud
(user@host)-$ snakemake -n -rp \
 auto/trimmed/193525at7742.clustalo.trimal.fasta

Now, you could extend the analyses to further genes.

(user@host)-$ snakemake -n -rp \
 auto/trimmed/193525at7742.clustalo.trimal.fasta \
 auto/trimmed/406935at7742.clustalo.trimal.fasta

Actual running would happen if you remove the -n flag. Note that I've also added another flag ( --use-singularity ), which tells Snakemake to use containers for certain rules if so indicated in the Snakefile .

(user@host)-$ snakemake -rp --use-singularity \
 auto/trimmed/193525at7742.clustalo.trimal.fasta \
 auto/trimmed/406935at7742.clustalo.trimal.fasta

Now try the following:

  • Check what happens if you run the above command once again.

  • remove the file auto/trimmed/406935at7742.clustalo.trimal.fasta and rerun the command. Neat, no?

See if you can get it to run also for gene id 378120at7742.
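If you want to compare notes, following the same pattern the call would presumably be:

(user@host)-$ snakemake -rp --use-singularity \
 auto/trimmed/378120at7742.clustalo.trimal.fasta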

TASK

Try to extend the Snakefile to also include:

  • per gene phylogenetic inference (see above)

  • supermatrix tree inference using the same 5 genes as before

Have fun playing around with this for a while ;-)

Solutions can be found in these files:

  • backup/Snakefile_with_ml_local

  • backup/Snakefile_with_ml

A Snakefile that would do the full analyses using all genes that are present in the directory by_gene/raw/ can be found here: backup/Snakefile_with_ml_from_dir .

Try it out:

(user@host)-$ snakemake -n -rp --use-singularity \
 -s backup/Snakefile_with_ml_from_dir

Create yourself a DAG (a directed acyclic graph of the jobs) to see what the current status of the workflow is.

(user@host)-$ snakemake -n --dag -s backup/Snakefile_with_ml_from_dir | dot -Tpdf > dag.with_ml_from_dir.pdf

The result will look something like this.

7.) Full automation (OPTIONAL)

The current repository is actually a snapshot of phylociraptor. This is a pipeline for automating the entire process of phylogenomic analyses from BUSCO genes (for now).

In the base directory of this repository you could resume an analysis as shown below. If there is time we'll talk about the setup a little bit.

The main things you need are:

  • config file data/config.vertebrata_minimal.yaml

  • sample file data/vertebrata_minimal.csv

A few steps were already run for you - see the file data/preparation.md

#get table
./phylociraptor orthology -t serial=2 --config-file data/config.vertebrata_minimal.yaml
#filter-orthology
./phylociraptor filter-orthology -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose
#align
./phylociraptor align -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose
#filter align
./phylociraptor filter-align -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose
#modeltest
./phylociraptor modeltest -t serial=2 --config-file data/config.vertebrata_minimal.yaml
#ml tree
./phylociraptor mltree -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose
#speciestree
./phylociraptor speciestree -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose
#figure
./phylociraptor report --config-file data/config.vertebrata_minimal.yaml 
./phylociraptor report --figure --config-file data/config.vertebrata_minimal.yaml

Code Snippets

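# bin/create_sequence_files.py - creates the per-gene fasta files used in step 3 above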
import os
import sys
import pandas as pd
from Bio import SeqIO
import argparse
import tarfile
from io import StringIO
from io import TextIOWrapper

if sys.version_info[0] < 3:
    raise Exception("Must be using Python 3")

pars = argparse.ArgumentParser(prog="create_sequence_files.py", description = """This script will create fasta files for all the buscos from all species with >% of buscos present""", epilog = """written by Philipp Resl""")
pars.add_argument('--busco_table', dest="busco_table", required=True, help="Path to BUSCO table.")
pars.add_argument('--busco_results', dest="busco_results", required=True, help="Results directory containing all BUSCO runs.")
pars.add_argument('--cutoff', dest="cutoff", default=0, required=True, help="Percent cutoff %% for BUSCO presence. Species below this threshold will be excluded.")
pars.add_argument('--outdir', dest="outdir", required=True, help="Path to output directory.")
pars.add_argument('--minsp', dest="minsp", required=True, help="Minimum number of species which have to be present to keep the sequences.")
pars.add_argument('--type' , dest="type", required=True, help="Type of sequences (aa or nu).")
pars.add_argument('--genome_statistics' , dest="genome_stats", required=True, help="Path to genome statistics output file.")
pars.add_argument('--gene_statistics' , dest="gene_stats", required=True, help="Path to gene statistics output file.")
pars.add_argument('--batchID' , dest="batchid", default=1, type=int, help="Batch ID (start for subsampling)")
pars.add_argument('--nbatches', dest="nbatches", default=1, type=int, help="Total number of batches (step size for subsampling)")

args=pars.parse_args()

extension=""
if args.type == "nu":
	extension = ".fna"
else:
	extension = ".faa" 


busco_overview = pd.read_csv(args.busco_table, sep="\t")
genomes = os.listdir(args.busco_results)
#print(busco_overview)
print("Settings:")
print("cutoff: ", args.cutoff)
print("minsp: ", args.minsp)
print("type: ", args.type)
print("outdir: ", args.outdir)
print("batchID: %i / %i" %(args.batchid, args.nbatches))

species_list = busco_overview.species.tolist()
print("Original number of species:", len(species_list))
#print(species_list)
#first remove species with too low busco coverage
busco_overview = busco_overview.set_index("species")

genome_file = open(args.genome_stats, "w")
for sp in species_list:
	if busco_overview.loc[sp, "percent_complete"] < float(args.cutoff):
		out = sp + "\tFAILED" + "\t" + str(busco_overview.loc[sp, "percent_complete"]) + "\t" + str(float(args.cutoff))
		print(out, file=genome_file)
		busco_overview = busco_overview.drop([sp])
	else:
		out = sp + "\tOK" + "\t" + str(busco_overview.loc[sp, "percent_complete"]) + "\t" + str(float(args.cutoff)) 
		print(out, file=genome_file)
species_list =  list(busco_overview.index)
print("Species remaining after applying cutoff:", len(species_list))
genome_file.close()

#now loop through each busco and extract sequence for each species
buscos = list(busco_overview.columns.values)
buscos.remove("percent_complete")

target=int(args.batchid)
open(args.gene_stats, "w").close() # truncate the gene statistics file; it is reopened in append mode for each processed BUSCO below
for i in range(len(buscos)):
	j = i+1
#	print("i: %i; j: %i; target: %i" %(i,j, target))
	if j != target:
#		print("Don't do anything (i: %i)" %i)
		continue
	gene_file = open(args.gene_stats, "a")
	target+=args.nbatches
	busco = buscos[i]
	print("Processing: " + busco)
	numseqs = 0
	outstring = ""
	for species in species_list:
		tar_file_content = open(args.busco_results + "/" + species + "/run_busco/single_copy_busco_sequences.txt", "r")
		for line in tar_file_content:
			line = line.strip()
			if busco+extension in line:
				path_to_busco_file = line.split(" ")[-1]	
				tf = tarfile.open(args.busco_results + "/" + species + "/run_busco/single_copy_busco_sequences.tar", "r")
				#print(path_to_busco_file)
				#print(tf.extractfile(path_to_busco_file))
				tar_file_content = TextIOWrapper(tf.extractfile(path_to_busco_file))
				if tar_file_content:
					with TextIOWrapper(tf.extractfile(path_to_busco_file)) as handle:
						for seq_record in SeqIO.parse(handle, "fasta"):
							name = ">" +species+"\n"
							#outfile.write(name)
							#outfile.write(str(seq_record.seq)+"\n")
							outstring = outstring + name
							outstring = outstring + str(seq_record.seq) + "\n"
				else:
					print(busco, "not found for", species)
					continue
	if outstring.count(">") >= int(args.minsp):	# only keep sequences if total number is larger than specified cutoff above.		
		print(busco + "\t" + "OK" + "\t" + str(outstring.count(">")) +"\t" + str(int(args.minsp)), file=gene_file)
		outfile = open(args.outdir+"/"+busco+"_all.fas", "w")
		outfile.write(outstring)
		outfile.close()
	else:
		print(busco + "\t" + "FAILED" + "\t" + str(outstring.count(">")) +"\t" + str(int(args.minsp)), file=gene_file)
	gene_file.close()
#old working code when busco sequences are not tared.
"""
	for species in species_list:
		for genome in genomes: # this loop is to get the correct directory name, it is very unelegant
			#print(args.busco_results+"/"+genome+"/run_busco/single_copy_busco_sequences/"+busco+extension)
			if species == genome:
				try:
					seqfile = open(args.busco_results + "/" + genome + "/run_busco/single_copy_busco_sequences/" + busco + extension, "r")
					for seq_record in SeqIO.parse(seqfile, "fasta"):
						name = ">" +species+"\n"
						#outfile.write(name)
						#outfile.write(str(seq_record.seq)+"\n")
						outstring = outstring + name
						outstring = outstring + str(seq_record.seq) + "\n"
					seqfile.close()
				except: # skip missing buscos
					continue
	if outstring.count(">") >= int(args.minsp):	# only keep sequences if total number is larger than specified cutoff above.		
		print(busco + "\t" + "OK" + "\t" + str(outstring.count(">")) +"\t" + str(int(args.minsp)), file=gene_file)
		outfile = open(args.outdir+"/"+busco+"_all.fas", "w")
		outfile.write(outstring)
		outfile.close()
	else:
		print(busco + "\t" + "FAILED" + "\t" + str(outstring.count(">")) +"\t" + str(int(args.minsp)), file=gene_file)
gene_file.close()
"""
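
# bin/extract_busco_table.py - builds the BUSCO presence/absence table used in step 3 above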
import os
import sys
import argparse


if sys.version_info[0] < 3:
    raise Exception("Must be using Python 3")

pars = argparse.ArgumentParser(prog="extract_busco_table.py", description = """This script will get all busco3 hmms and look in the busco results all specified genomes for the presence of the files""", epilog = """written by Philipp Resl""")
pars.add_argument('--hmm', dest="hmm_dir", required=True, help="Directory of the BUSCO hmms.")
pars.add_argument('--busco_results', dest="busco_results", required=True, help="Results directory containing all BUSCO runs.")
pars.add_argument('-o', dest="out", required=True, help="BUSCO table output file.")
args=pars.parse_args()

hmms = os.listdir(args.hmm_dir)
hmms = [hmm[:-len(".hmm")] if hmm.endswith(".hmm") else hmm for hmm in hmms] # remove the .hmm extension (str.strip would remove characters, not the suffix)

genomes = os.listdir(args.busco_results)

outfile = open(args.out, "w")
header = "species\t"
header += "\t".join(hmms)
header += "\tpercent_complete" 
print(header, file= outfile)
for species in genomes:
	ones = 0
	zeros = 0
	outstring = species
	print("Extracting HMMs for", species, file=sys.stderr)
	try:
		busco_listing_file = open(os.path.join(args.busco_results, species, "run_busco", "single_copy_busco_sequences.txt"), "r") # os.path.join tolerates a missing trailing slash in --busco_results
		buscos = []
		for line in busco_listing_file:
			line = line.strip()
			line = line.split(" ")[-1]
			line = line.split("/")[-1]
			if ".faa" in line: # only take aa files, this should be enough
				buscos.append(line)
		buscos = [busco[:-len(".faa")] if busco.endswith(".faa") else busco for busco in buscos] # remove the .faa extension
		busco_listing_file.close()
		for hmm in hmms:
			if hmm in buscos:
				outstring += "\t"
				outstring += "1"
				ones +=1
			else:
				outstring += "\t"
				outstring += "0"
				zeros +=1
		percent = ones / (ones+zeros)
		outstring += "\t"
		outstring += str(percent)
		print(outstring, file=outfile)
	except:
		out = species + " not found. Skipped.\n"
		print(out, file=sys.stderr)
		continue
outfile.close()
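
# bin/filter_alignments.py - removes alignments (or individual samples) with duplicated sequences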
import os
import sys
import argparse
from Bio import SeqIO

if sys.version_info[0] < 3:
    raise Exception("Must be using Python 3")

pars = argparse.ArgumentParser(prog="filter_alignments.py", description = """This script will remove alignments with duplicated sequences.""", epilog = """written by Philipp Resl""")
pars.add_argument('--alignments', dest="align", required=True, help="alignment files")
pars.add_argument('--outdir', dest="outdir", required=True, help="output directory")
pars.add_argument('--per_sample', dest="perseq", action='store_true', help="if set, only samples with duplicated sequences will be removed. Else the whole alignment")
pars.add_argument('--min-parsimony', nargs="?", dest="minpars", default=0, const=0, help="The minimum number of parsimony informative sites an alignment must have to be kept.")
pars.add_argument('--statistics-file', nargs="?", dest="statfile", const="", help="Statistics file which contains alignment information (incl. parsimony informative sites; this file needs to be created with concat)")
pars.add_argument('--minsp', dest="minsp", action="store", default=3, type=int, help="Number of taxa to be present in alignment at least (default: 3)")
args=pars.parse_args()

algn_list = args.align.split(" ")

stats_dict = {}
if args.statfile:
	with open(args.statfile, "r") as stats:
		stats.readline()
		for stat in stats:
			stats_dict[stat.split("\t")[0]] = int(stat.split("\t")[5])	
for al in algn_list:
	if os.stat(al).st_size == 0:
		print(os.path.basename(al), "\tEMPTY")
		continue
	if stats_dict and stats_dict[os.path.basename(al)] < int(args.minpars):
		print(os.path.basename(al), "\tNOT_INFORMATIVE")
		continue

	seqfile = open(al, "r")
	sequences = []
	for seq_record in SeqIO.parse(seqfile, "fasta"):
		sequences.append(seq_record)
	names_list = [seq.id for seq in sequences]
	if len(names_list) < args.minsp:
		print(os.path.basename(al), "\tTOO_FEW_SEQUENCES")
		continue
	if len(names_list) == len(set(names_list)):
		print("Create symlink: ", args.outdir+"/"+al.split("/")[-1], file=sys.stderr)
		os.symlink(al, args.outdir+"/"+al.split("/")[-1])
		print(os.path.basename(al), "\tPASS")
	else:
		print("Warning: File %s contains duplicated sequence IDs!" % al, file=sys.stderr)
		if (args.perseq == True):	
			print("Duplicated sequences will be removed and copy of file will be made.", file=sys.stderr)
			dups = [name for name in names_list if names_list.count(name)>1]
			print("Warning: Will remove %d sequences" % len(dups), file=sys.stderr)
			dups = set(dups)
			sequences = [sequence for sequence in sequences if sequence.id not in dups]
			if len(sequences) < args.minsp:
				print("Warning: %s file will be discarded (too few sequences after duplicate removal)" % al, file=sys.stderr)
				print(os.path.basename(al), "\tTOO_FEW_SEQUENCES")
				continue
			outfile = open(args.outdir+"/"+al.split("/")[-1], "w")
			for sequence in sequences:
				print(">"+sequence.id, file=outfile)
				print(sequence.seq, file=outfile)					
			outfile.close()
			print(os.path.basename(al), "\tREMOVE_DUPLICATED_SEQUENCES")
		else:
			print("Warning: %s file will be discarded" % al, file=sys.stderr)
			print(os.path.basename(al), "\tREMOVED")	
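
# genome_download.py - downloads genomes from NCBI Genbank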
import sys, os, time
import pandas as pd
from Bio import Entrez
import subprocess
from subprocess import PIPE
import argparse
from urllib.request import HTTPError
import urllib.request, urllib.error, urllib.parse
import urllib

pars = argparse.ArgumentParser(prog="genome_download.py", description = """This script will download complete genomes from NCBI genbank""", epilog = """written by Philipp Resl""")
pars.add_argument('--genomes', dest="genomes", help="List of genomes to be downloaded separated by commas (,). Required.")
pars.add_argument('--outdir', dest="outdir",default="", help="output directory. Default: current directory")
pars.add_argument('--entrez_email', dest="email", required=True, help="Email to be used for communication with the NCBI Entrez databases. Required.")
pars.add_argument('--api_key', dest="api_key", help="API key to be used for communication with the NCBI Entrez databases. Optional; not yet implemented.")
pars.add_argument('--overview-only', dest="over", action='store_true', help="Download only the NCBI overview, then stop")
pars.add_argument('--batch', dest="batch", default="", help="Batch number.")
args=pars.parse_args()

def now():
	return time.strftime("%Y-%m-%d %H:%M") + " -"


if not args.outdir.endswith("/"):
	args.outdir = args.outdir + "/"
wd = os.getcwd() + "/"
#args.outdir = wd + args.outdir
print(now(), "Using ouput directory:", args.outdir)

if not os.path.isfile(args.outdir+"assembly_summary_genbank.txt"):
	print(now(), "Downloading genome overview...")	
	try:
		outstream = subprocess.run(["wget","-q", "ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt", "-P", args.outdir], stdout = PIPE, stderr = PIPE)	
		if outstream.returncode == 0:
			print(now(), "Download finished successfully.")
			output = ""
			with open(args.outdir+"assembly_summary_genbank.txt","r") as handle: # this needs to be done because this file contains two comment head lines.
				handle.readline() # read first line so that pointer is at second line
				for line in handle:
					if line.startswith("#"):
						line = line[2:]
					output += line
			outfile = open(args.outdir+"assembly_summary_genbank.txt","w")
			outfile.write(output) 
			outfile.close()
		else:
			print(now(), "Something went wrong during download")
			sys.exit(1)
	except:
		print(now(), "An error occurred during download")
		print(outstream.stderr.decode())
		print(outstream.stdout.decode())
		sys.exit(1)
else:
	print(now(), "NCBI genome database file is already present. Will not download")


if args.over == True:
	print(now(), "Only genome overview was downloaded, will stop now.")
	sys.exit(0)
# parse command line arguments
genomes = args.genomes
genomes = genomes.split(",")
Entrez.email = args.email

# read NCBI genome database csv
print(now(), "Reading NCBI genome database file")
data = pd.read_csv(args.outdir + "assembly_summary_genbank.txt", sep="\t", error_bad_lines=False, low_memory=False, na_values="")
data["ftp_path"] = data["ftp_path"].astype(str)


#def check_taxonomy(sp, taxid):
#	print(now(), "Will test if", sp, "has taxid", taxid)
#	entrez_handle = Entrez.esummary(db="taxonomy", id=str(taxid))
#	#entrez_handle = Entrez.esearch(db="Taxonomy", term=sp)
#	record = Entrez.read(entrez_handle)
#	entrez_handle.close()
#	print(record)
#	if record[0]["ScientificName"] == sp:
#		print(now(),  taxid, "seems to be correct for", sp)
#		return True
#	else:
#		print(now(),  taxid, "seems to be wrong for", sp)
#		print(now(), "You can check manually if", sp, "=",record[0]["ScientificName"], ".")
#		return False	

def get_taxid(sp):
	tryagain=True
	while tryagain:
		try:
			entrez_handle = Entrez.esearch(db="Taxonomy", term=sp)
			record = Entrez.read(entrez_handle)
			entrez_handle.close()
			if len(record["IdList"]) == 0:
				return
			if record["IdList"][0]:
				return str(record["IdList"][0])
		except HTTPError as e:
			if e.code == 429:
				print(now(), "There are currently too many requests to the NCBI database, will try again in 30 seconds.")
				time.sleep(30)
			else:
				print(now(), "A HTTP Error occurred. Will stop trying.")
				tryagain = False
		except: # other errors
			tryagain = False
	return		

def check_already_downloaded(file):
	if os.path.isfile(file):
		return True
	else:
		return False

def download(download_data, genome):
	# added for debugging purpose, there seems to be an index error with this genome
	if download_data.empty:
		print(now(), "The dataframe is empty. Nothing will be downloaded. Please check manually.")
		return "failed"
	url = download_data.iloc[0]["ftp_path"]
	genome = genome.replace(" ", "_")
	filename = url.split("/")[-1]
	download_url_genome = url + "/" + filename + "_genomic.fna.gz"
	print(now(), "Will try to download genome from:", download_url_genome)
	if check_already_downloaded(args.outdir + genome + "_genomic_genbank.fna.gz"):
		print(now(), "A genome has already been downloaded. Will not download again.")
		return "success"
	else:
		tryagain = True
		while tryagain:
			try:
				#outstream = subprocess.run(["wget", download_url_genome, "-P", args.outdir], stderr=subprocess.STDOUT, stdout=subprocess.PIPE, universal_newlines=True)
				urllib.request.urlretrieve(download_url_genome, args.outdir+"/"+genome+"_genomic_genbank.fna.gz")
				print(now(), "Download finished successfully.")
				print(now(), "Writing meta information file.")
				download_data.to_csv(args.outdir + genome + "_db_genbank.tsv", sep="\t")
				tryagain = False #No need to continue, all is done.
				#if outstream.returncode == 0:
				#	print(now(), "Download finished successfully.")
				#	outstream = subprocess.run(["mv", args.outdir + filename + "_genomic.fna.gz", args.outdir + genome + "_genomic_genbank.fna.gz"])
				#	print(now(), "Writing meta information file.")
				#	download_data.to_csv(args.outdir + genome +"_db_genbank.tsv", sep="\t")
				#	tryagain = False # No need to try again, everything has been downloaded.
				#else:
				#	print(now(), "Something went wrong during download")
				#	print(outstream.stdout)
				#	return "failed"
			except HTTPError as e:
				if e.code == 429:
					print(now(), "There are currently too many requests to the NCBI database. Will try again in 30 seconds.")
					time.sleep(30)
				else:
					print(now(), "A HTTP error occurred. Will stop. Maybe the file you are trying to download does not exist? (eg. an older accession)")
					return "failed"
			except Exception as e:
				print(now(), "An error occurred during download.")
				print(now(), "{}".format(e))
				return "failed"
	print(now(), "done")
	return "success"

overview = {}
for genome in genomes:
	print("\n----------------------------------", genome.split("=")[0], "----------------------------------")
	if check_already_downloaded(args.outdir+genome+"_genomic_genbank.fna.gz"):
		print(now(), "A genome has already been downloaded. Will skip")
		overview[genome] = "success"
		continue

	#if specific accession is provided via 'web=accession'
	if "=" in genome:
		accession = genome.split("=")[1]
		print(now(), "A specific accession '"+accession+"' has been provided")
		if data.assembly_accession.str.fullmatch(accession).any():
			species_data = data[data.assembly_accession == accession]
			#print(species_data.values.tolist())
		else:
			print(now(), "The accession wasn't found in the current list of genomes. Will check if there is an archived version.")
			#print(genome.split("=")[1].split(".")[0])
			accession_basename=genome.split("=")[1].split(".")[0]
			if data.assembly_accession.str.startswith(accession_basename).any():
				print(now(), "It looks like this is indeed an older version, will check if it is still available.")
				bool_series = data.assembly_accession.str.startswith(accession_basename, na = False)
				species_data = data[bool_series]
				ftp_path = "/".join(species_data.iloc[0]["ftp_path"].split("/")[0:-1]) 	
				response = urllib.request.urlopen(ftp_path)
				content = response.read().decode('UTF-8')
				found = False
				folder=""
				for line in content.split('"'):
					if line.startswith(genome.split("=")[1]):
						print(now(), "Found folder for accession", line)
						folder = line			
						found = True
						break
				if found == False:
					print(now(), "Could not resolve specified accession version. Please check manually, maybe it does not exist any more?")
					overview[genome.split("=")[0]] = "failed"
					continue
				else:	
					ftp_path += "/"+folder.strip("/")
					species_data["ftp_path"] = ftp_path
					species_data["assembly_accession"] = genome.split("=")[1] 
					# the information below has to be removed because it may be different for older accessions
					# there is currently no way to get this directly from NCBI without creating more HTTP requests
					species_data["version_status"] = "replaced"
					species_data["seq_rel_date"] = "na"
					species_data["refseq_category"] = "na"
					species_data["wgs_master"] = "na"
					species_data["asm_name"] = "na"
					#print(species_data.values.tolist())
			else:
				print(now(), "Could not resolve specified accession version. Please check manually for typos.")
				overview[genome.split("=")[0]] = "failed"
				continue
		overview[genome.split("=")[0].replace("_", " ")] = download(species_data, genome.split("=")[0])
		continue

	genome = genome.replace("_", " ")
	taxid = get_taxid(genome)
	if taxid:
		print(now(), "taxid for",genome,"is",taxid)
	else:
		print(now(), "Unable to find taxid for", genome, ". Was the name misspelled?")
		overview[genome] = "failed"
		continue
	print(now(),"Checking number of available genomes...")
	#species_data = data[data["organism_name"] == genome]
	species_data = data[data.organism_name.str.contains(genome)]
	dates = list(species_data["seq_rel_date"])
	dates.sort()
	print(now(), "Found", species_data.shape[0], "assemblies for", genome)
	if (species_data.shape[0] == 0):
		continue
	if (species_data.shape[0] == 1):
		if str(species_data.iloc[0]["taxid"]) == taxid:
			print(now(), "taxids of species and available genome entry match. Will proceed to genome download...")
			overview[genome] = download(species_data, genome)
			continue
		else:
			print(now(), "The genome availabe for", genome, "with taxid:",species_data.iloc[0]["taxid"],"seems to have a different taxid as the species specified:", taxid, ". Nothing will be downloaded, please check manually and adjust the name (eg. the specific strain could be missing.")
			overview[genome] = "failed"
			continue
	else:
		print(now(), "Will check if all assemblies have the same taxid...")
		if len(set(list(species_data["taxid"]))) == 1:
			print(now(), "All genomes have the same the same taxid.", list(species_data["taxid"])[0])
			if not str(list(species_data["taxid"])[0]) == taxid:
				print(now(), "The genome availabe for", genome, "with taxid:",species_data.iloc[0]["taxid"],"seems to have a different taxid as the species specified:", taxid, ". Nothing will be downloaded, please check manually and adjust the name. This can happen if the genome is deposited under an infraspecific name such as strains, subspecies etc.")
				overview[genome] = "failed"
				#print(now(), "taxid:", list(species_data["taxid"])[0], "and species name", genome, "do not match. Will skip this species")
				continue	
			else:
				if "reference genome" in list(species_data["refseq_category"]):
					print(now(), "Found reference genome for", genome)
					overview[genome] = download(species_data[species_data.refseq_category == "reference genome"], genome)
					continue
				elif "representative genome" in list(species_data["refseq_category"]):
					print(now(), "Found representative genome for", genome)
					overview[genome] = download(species_data[species_data.refseq_category == "representative genome"], genome)
					continue
				else:
					print(now(), "There is no reference or representative genome available for %s, will download latest genome..." % genome)
					dates = list(species_data["seq_rel_date"])
					latest_date = dates[-1]
					print(now(), "Latest genome of",genome,"was released:", latest_date, ". Will try to download this genome...")
					overview[genome] = download(species_data[species_data.seq_rel_date == latest_date], genome)
					continue
		else:
			print(now(), "Potential genomes have multiple different taxids. Will try to resolve this...")
			taxids = [str(item) for item in set(list(species_data["taxid"]))]
			if taxid not in taxids:
				print(now(), "tax id of species is not in the list of taxon ids with potential genomes. Will not download anything. Please resolve manually.")
				overview[genome] = "failed"
				continue
			taxid_species_data = species_data[species_data["taxid"] == int(taxid)]
			if (taxid_species_data.shape[0] == 1):
				print(now(), "A single genome with the correct taxid remaining.Will try to download this genome now...")
				overview[genome] = download(taxid_species_data[taxid_species_data.taxid == int(taxid)], genome)
				continue
			else:
				print(now(), taxid_species_data.shape[0], "genomes with correct taxid remaining. Checking for reference genome ...")
				if "reference genome" in list(taxid_species_data["refseq_category"]):
					print(now(), "Found reference genome for", genome)
					overview[genome] = download(taxid_species_data[taxid_species_data.refseq_category == "reference genome"], genome)
					continue
				elif "representative genome" in list(taxid_species_data["refseq_category"]):
					print(now(), "Found representative genome for", genome)
					overview[genome] = download(taxid_species_data[taxid_species_data.refseq_category == "representative genome"], genome)
					continue
				else:
					print(now(), "There is no reference or representative genome available for, will download latest genome...", genome)
					dates = list(taxid_species_data["seq_rel_date"])
					latest_date = dates[-1]
					print(now(), "Latest genome of",genome,"was released:", latest_date, ". Will try to download this genome...")
					overview[genome] = download(taxid_species_data[taxid_species_data.seq_rel_date == latest_date], genome)
					continue
	# All possible cases should be covered by the code above, in case something goes wrong nonetheless, treat genome as failed:
	print(now(), "An unspecified problem occured for species", genome, ". Nothing was downloaded")
	overview[genome] = "failed"

print(overview)
print(now(), "Writing overview statistics files")
statsfile = open(args.outdir + "download_overview_"+str(args.batch)+".txt", "w")
successfile = open(args.outdir + "successfully_downloaded_"+str(args.batch)+".txt", "w")
failedfile = open(args.outdir + "not_downloaded_"+str(args.batch)+".txt", "w")
for sp in overview.keys():
	print(sp.replace(" ","_"), ",", overview[sp], sep="", file=statsfile)
	if overview[sp] == "success":
		print(sp.replace(" ","_"), file=successfile)
	if overview[sp] == "failed":
		print(sp.replace(" ","_"), file=failedfile)
statsfile.close()
successfile.close()
failedfile.close()
shell:
	"""
	clustalo -i {input.sequence_file} --threads={threads} $(if [[ "{params}" != "None" ]]; then echo {params}; fi) 1> {output.alignment} 2> {log}
	"""
shell:
	"""
	mafft --thread {threads} $(if [[ "{params}" != "None" ]]; then echo {params}; fi) {input.sequence_file} 1> {output.alignment} 2> {log}
	"""
shell:
	"""
	muscle -super5 {input.sequence_file} -threads {threads} $(if [[ "{params}" != "None" ]]; then echo {params}; fi) -output {output.alignment} 2> {log}
	"""
shell:
	"""
	touch {output.checkpoint}
	"""
shell:
	"""
	# here the ids for the alignments need to be filtered as well first. maybe this can be changed in the concat.py script, so that an id file is not needed anymore.
	concat.py -i $(ls -1 {params.wd}/results/alignments/full/{wildcards.aligner}/* | sed -n '{wildcards.batch}~{params.nbatches}p' | tr '\\n' ' ') -t <(ls -1 {params.wd}/results/orthology/busco/busco_runs/) --runmode concat -o results/statistics/ --biopython --statistics --seqtype {params.datatype} --noseq
	mv results/statistics/statistics.txt {output.statistics_alignment}
	# make this output tab delimited so it is easier to parse
	ovstats="{params.alignment_method}"
	if [[ "{wildcards.aligner}" == "mafft" ]]; then 
		ovstats="${{ovstats}}\t{params.mafft_alignment_params}"
	fi
	if [[ "{wildcards.aligner}" == "clustalo" ]]; then 
		ovstats="${{ovstats}}\t{params.clustalo_alignment_params}"
	fi
	ovstats="${{ovstats}}\t{params.pars_sites}"
	echo -e $ovstats > {output.overview_statistics}
	"""
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part 2 (align) done." >> results/statistics/runlog.txt
	"""
# the snippet above is from line 140 of rules/align.smk
shell:
	"""
	echo "$(date) - concatenate {wildcards.aligner}-{wildcards.alitrim}: Will use bootstrap cutoff {wildcards.bootstrap} before creating concatenated alignment" >> results/statistics/runlog.txt
	mkdir -p results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/algn
	for gene in $(echo "{params.genes}") 
	do
		cp {params.wd}/results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}/"$gene"_aligned_trimmed.fas results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/algn
	done
	# Now run concat:
	concat.py -d results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/algn --runmode concat -o results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim} --biopython --statistics

	# Now convert alignment to phylip:
	python -c "from Bio import AlignIO; alignment = AlignIO.read(open('{output.alignment}'), 'fasta'); print(' '+str(len(alignment))+' '+str(alignment.get_alignment_length())); print(''.join([str(seq.id)+'   '+str(seq.seq)+'\\n' for seq in alignment]));" > {output.phylip_alignment}
	# and stockholm (pfam) format:
	python -c "from Bio import AlignIO; alignment = AlignIO.read(open('{output.alignment}'), 'fasta'); print(alignment.format('stockholm'));" > {output.stockholm_alignment}	
	"""
shell:
	"""
	trimal {params.trimmer} -in {input.alignment} -out {output.trimmed_alignment}
	"""
shell:
	"""
	if [[ -d results/alignments/trimmed/{wildcards.aligner}-aliscore/{params.busco} ]]; then rm -rf results/alignments/trimmed/{wildcards.aligner}-aliscore/{params.busco}; fi
	mkdir -p results/alignments/trimmed/{wildcards.aligner}-aliscore/{params.busco}
	cd results/alignments/trimmed/{wildcards.aligner}-aliscore/{params.busco}
	ln -s -f  {params.wd}/{input.alignment} {params.busco}_aligned.fas 
	Aliscore.pl {params.trimmer} -i {params.busco}_aligned.fas &> aliscore_{params.busco}.log || true

	if [[ -f {params.busco}_aligned.fas_List_random.txt ]]; then
		echo "$(date) - Found aliscore output file. Check results for BUSCO: {params.busco}" >> {params.wd}/results/statistics/runlog.txt
		if [[ $(cat {params.busco}_aligned.fas_List_random.txt | head -n 1 | grep '[0-9]' -c) != 0 ]]; then
			echo "$(date) - Aliscore identified positions to remove. Running ALICUT for BUSCO: {params.busco}" >> {params.wd}/results/statistics/runlog.txt
			ALICUT.pl -s &> alicut_{params.busco}.log
		fi
	fi

	# this check is included because alicut very rarely does not produce an output file.
	# in this case an empty file will be touched. This is necessary so the rule does not fail
	# The empty file will later be excluded again in the next rule.
	if [[ ! -f {params.wd}/results/alignments/trimmed/{wildcards.aligner}-aliscore/{params.busco}/ALICUT_{params.busco}_aligned.fas ]]; then
		echo "$(date) - The ALICUT output appears to be empty. Will touch an empty file so the pipeline will continue. Check results for BUSCO: {params.busco}" >> {params.wd}/results/statistics/runlog.txt
		touch {params.wd}/{output.trimmed_alignment}
	else
		if [[ "$(cat ALICUT_{params.busco}_aligned.fas | grep -v ">" | sed 's/-//g' | grep "^$" | wc -l)" -gt 0 ]]; then
			echo "$(date) - Alignment of BUSCO: {params.busco} contains empty sequence after aliscore/alicut. This sequence will be removed." >> {params.wd}/results/statistics/runlog.txt
			cp ALICUT_{params.busco}_aligned.fas ALICUT_{params.busco}_aligned.fas_tmp
			cat ALICUT_{params.busco}_aligned.fas_tmp | perl -ne 'chomp; $h=$_; $s=<>; chomp($s); $check=$s; $check=~s/-//g; if (length($check) > 0){{print "$h\n$s\n"}}' > ALICUT_{params.busco}_aligned.fas
		fi
		mv ALICUT_{params.busco}_aligned.fas {params.wd}/{output.trimmed_alignment}
	fi
	"""
shell:
	"""
	# here the ids for the alignments need to be filtered as well first. maybe this can be changed in the concat.py script, so that an id file is not needed anymore.
	concat.py -i $(ls -1 {params.wd}/results/alignments/trimmed/{wildcards.aligner}-{wildcards.alitrim}/*.fas | sed -n '{wildcards.batch}~{params.nbatches}p' | tr '\\n' ' ') -t <(ls -1 {params.wd}/results/orthology/busco/busco_runs) --runmode concat -o results/statistics/ --biopython --statistics --seqtype {params.datatype} --noseq
	mv results/statistics/statistics.txt {output.statistics_trimmed}
	"""
	shell:
		"""
		if [[ -d results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim} ]]; then
			rm -rf results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}
		fi
		mkdir -p results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}

		# concatenate the statistics files from the individual batches (for some reason snakemake complained if I did it all in one step, so this looks a bit ugly now, but it runs)
		cat results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/stats_trimmed_{wildcards.alitrim}_{wildcards.aligner}-* > results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_temp-{wildcards.alitrim}_{wildcards.aligner}.txt
		head -n 1 results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_temp-{wildcards.alitrim}_{wildcards.aligner}.txt > results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_{wildcards.alitrim}_{wildcards.aligner}.txt
		cat results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_temp-{wildcards.alitrim}_{wildcards.aligner}.txt | grep -v "^alignment" >> {output.trim_info}
		rm results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_temp-{wildcards.alitrim}_{wildcards.aligner}.txt

		for file in results/alignments/trimmed/{wildcards.aligner}-{wildcards.alitrim}/*.fas;
		do
                        if [[ -s {params.wd}/$file ]]; then
				if [[ "$(cat {params.wd}/$file | grep ">" -c)" -lt {params.minsp} ]]; then
					echo -e "$file\tTOO_FEW_SEQUENCES" 2>&1 | tee -a {output.filter_info}
                                	echo "$(date) - File $file contains less than {params.minsp} sequences after trimming with {params.trimming_method}. This file will not be used for tree reconstruction." >> {params.wd}/results/statistics/runlog.txt
                                        	continue
                        	fi
				if [[ $(grep "$(basename $file)" {output.trim_info} | cut -f 6) -ge {params.min_pars_sites} ]]
				then
					echo -e "$file\tPASS" 2>&1 | tee -a {output.filter_info}
					ln -s {params.wd}/$file {params.wd}/results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}/
				else
					echo -e "$file\tNOT_INFORMATIVE" 2>&1 | tee -a {output.filter_info}
				fi

#                                python bin/filter_alignments.py --alignments {params.wd}/$file --outdir "{params.wd}/results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}" --statistics-file results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_{wildcards.alitrim}_{wildcards.aligner}.txt --min-parsimony {params.min_pars_sites} --minsp {params.minsp} >> {output.filter_info}
                        else #do nothing if file is empty (happens rarely when ALICUT fails)
                                continue
                        fi
		done
		# remove unnecessary statistics file
#		rm results/statistics/trim-{wildcards.aligner}-{wildcards.alitrim}/statistics_trimmed_{wildcards.alitrim}_{wildcards.aligner}.txt

		echo "$(date) - Number of alignments ({wildcards.aligner}): $(ls results/alignments/full/{wildcards.aligner}/*.fas | wc -l)" >> results/statistics/runlog.txt
		echo "$(date) - Number of trimmed alignments ({wildcards.aligner} - {wildcards.alitrim}): $(ls results/alignments/trimmed/{wildcards.aligner}-{wildcards.alitrim}/*.fas | wc -l)" >> results/statistics/runlog.txt
		echo "$(date) - Number of alignments ({wildcards.aligner} - {wildcards.alitrim}) after filtering: $(ls results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}/*.fas | wc -l)" >> results/statistics/runlog.txt
		"""
shell:
	"""
	# here the ids for the alignments need to be filtered as well first. maybe this can be changed in the concat.py script, so that an id file is not needed anymore.
	concat.py -i $(ls -1 {params.wd}/results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}/*.fas | sed -n '{wildcards.batch}~{params.nbatches}p' | tr '\\n' ' ') -t <(ls -1 {params.wd}/results/orthology/busco/busco_runs) --runmode concat -o results/statistics/ --biopython --statistics --seqtype {params.datatype} --noseq
	mv results/statistics/statistics.txt {output.statistics_filtered}
	"""
shell:
	"""
	echo "combo\t$(head -n 1 {input.filtered[0]} | cut -f 1,4-)" > {output.stats}
	for f in {input.filtered}; do combo=$(echo $f | cut -d "/" -f 3 | sed 's/filter-//'); tail -n +2 $f | sed "s/^/$combo\t/"; done | cut -f 1,2,4- >> {output.stats}
	# gather aligner and trimming combinations and corresponding settings for statistics:	
	combs="{params.algn_trim}"
	echo "aligner\ttrimmer\ttrimming_params\tdupseq\tcutoff\tminsp\tseq_type\tmin_parsimony_sites" > results/statistics/trim-filter-overview.txt
	for combination in ${{combs}}; do
		if [[ $combination == *"trimal"* ]]; then
			out=$combination"__{params.trimal_params}"
		elif [[ $combination == *"aliscore"* ]]; then
			out=$combination"__{params.aliscore_params}"
		else
			out=$combination"__no parameters"
		fi
		out=$out"__{params.dupseq}__{params.cutoff}__{params.minsp}__{params.seq_type}__{params.min_parsimony_sites}"
		echo $out | sed 's/__/\t/g' >> results/statistics/trim-filter-overview.txt
	done  
	echo "$(date) - Pipeline part filter_align done." >> results/statistics/runlog.txt
	touch {output.checkpoint}
	"""	
shell:
	""" 
	if [[ ! -d {output.sequence_dir} ]]; then mkdir -p {output.sequence_dir}; fi
	# remove files in case there are already some:
	rm -f {output.sequence_dir}/*
	python bin/create_sequence_files.py --type {params.seqtype} --busco_table {input.table} --busco_results {params.busco_dir} --cutoff {params.cutoff} --outdir {output.sequence_dir} --minsp {params.minsp} --genome_statistics {output.genome_statistics} --gene_statistics {output.gene_statistics} --batchID {wildcards.batch} --nbatches {params.nbatches} &> {log}
	touch {output.checkpoint}   
	"""
shell:
	"""
	if [[ -d results/orthology/busco/busco_sequences_deduplicated ]]; then
		rm -rf results/orthology/busco/busco_sequences_deduplicated
	fi
	mkdir results/orthology/busco/busco_sequences_deduplicated

	rm -f {output.statistics}

	if [[ {params.dupseq} == "persample" ]];
	then
		echo "$(date) - BUSCO files will be filtered on a per-sample basis. This could lower the number of species in the final tree." >> results/statistics/runlog.txt
	else
		echo "$(date) - BUSCO files will be filtered on a per BUSCO gene basis. This could lower the number of genes used to calculate the final tree." >> results/statistics/runlog.txt
	fi

	for file in results/orthology/busco/busco_sequences/*/*.fas;
		do
		if [[ {params.dupseq} == "persample" ]];
		then
			# per sequence filtering
			python bin/filter_alignments.py --alignments {params.wd}/$file --outdir "{params.wd}/results/orthology/busco/busco_sequences_deduplicated" --per_sample --minsp {params.minsp} >> {output.statistics}
		else
			# whole alignment filtering
			python bin/filter_alignments.py --alignments {params.wd}/$file --outdir "{params.wd}/results/orthology/busco/busco_sequences_deduplicated" >> {output.statistics}
		fi
	done
	#gather runtime statistics currently not activated. This needs some more work.
	#for file in $(ls results/statistics/benchmarks/busco/run_busco_*); do printf $file"\t"; sed '2q;d' results/statistics/benchmarks/busco/$file; done > results/statistics/benchmark_all_busco_runs.bench	

	# get statistics files for report. This is still a bit hacky and could be solved better, especially for the genomes file:
	cat results/statistics/orthology_filtering/orthology_filtering_gene_* > results/statistics/orthology_filtering_gene_statistics.txt
	files=$(ls results/statistics/orthology_filtering/orthology_filtering_genomes_*)
	cat $(echo $files | awk '{{print $1}}') > results/statistics/orthology_filtering_genomes_statistics.txt		

	echo "$(date) - Number of BUSCO sequence files: $(ls results/orthology/busco/busco_sequences/*/*.fas | wc -l)" >> results/statistics/runlog.txt
	echo "$(date) - Number of deduplicated BUSCO sequence files: $(ls results/orthology/busco/busco_sequences_deduplicated/*.fas | wc -l)" >> results/statistics/runlog.txt
	touch {output.checkpoint}
	"""
shell:
	"""
	echo "$(date) - Pipeline part filter-orthology done." >> results/statistics/runlog.txt
	touch {output}
	"""
shell:
	"""
	if [[ ! -d "results/modeltest/{wildcards.aligner}-{wildcards.alitrim}/{wildcards.busco}" ]]; then mkdir -p results/modeltest/{wildcards.aligner}-{wildcards.alitrim}/{wildcards.busco}; fi
	iqtree -s {input.alignment} -msub nuclear --prefix {params.wd}/results/modeltest/{wildcards.aligner}-{wildcards.alitrim}/{params.busco}/{params.busco} -nt {threads} -m MFP $(if [[ "{params.bb}" != "None" ]]; then echo "-bb {params.bb}"; fi) $(if [[ "{params.seed}" != "None" ]]; then echo "-seed {params.seed}"; fi) {params.additional_params}
	touch {output.checkpoint}
	"""
	shell:
		"""
		# echo "name\tmodel" > {output.best_models}
		for gene in {params.genes}
		do
			model=$(cat results/modeltest/{wildcards.aligner}-{wildcards.alitrim}/$gene/$gene.log | grep "Best-fit model:" | tail -n 1 | awk -F ":" '{{print $2}}' | awk -F " " '{{print $1}}')
			# failsafe - a missing best-fit model in the log should not happen
			if [[ -z "$model" ]]
			then
				echo "ERROR: No model detected for gene: $gene" >&2
			else
				printf "$gene\t" >> {output.best_models}
				echo $model >> {output.best_models}
			fi

			#now calculate the mean bootstrap for each tree
			bootstrapvalues=$(grep -E '\)[0-9]+:' -o results/modeltest/{wildcards.aligner}-{wildcards.alitrim}/$gene/$gene.treefile | sed 's/)//' | sed 's/://' | tr '\n' '+' | sed 's/+$//')
#			bootstrapsum=$(echo "$bootstrapvalues" | bc)
			bootstrapsum=$(echo "$bootstrapvalues" | perl -ne 'chomp; @bs=split(/\+/); $sum=0; for (@bs){{$sum=$sum+$_}}; print "$sum"')

			totalbootstraps=$(echo "$bootstrapvalues" | sed 's/[0-9]//g' | wc -c)
#			meanbootstrap=$(echo "($bootstrapsum)/$totalbootstraps" | bc)
			meanbootstrap=$(( bootstrapsum / totalbootstraps ))
			echo -e "$gene\t{wildcards.aligner}\t{wildcards.alitrim}\t$bootstrapsum\t$totalbootstraps\t$meanbootstrap" | tee -a {output.genetree_filter_stats}
		done
		touch {output.checkpoint}

		"""
Snakemake, from line 106 of rules/model.smk
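The bookkeeping above exploits the fact that IQ-TREE writes support values directly after the closing bracket of each internal node in the newick string, so grep -E '\)[0-9]+:' -o extracts exactly those numbers. A toy example with three support values (95, 87 and 100, so the mean is 94):

(user@host)-$ echo "((A,B)95:0.1,(C,D)87:0.2)100:0.0;" | grep -oE '\)[0-9]+:' | tr -d '):' | awk '{s+=$1; n++} END{print s/n}'
94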
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part modeltest (model) done." >> results/statistics/runlog.txt
	"""
Snakemake, from line 142 of rules/model.smk
shell:
	"""
	mkdir -p log
	dir=results/busco/{params.species}
	# prepare a stripped-down version of the augustus config directory.
	# this is introduced to lower the number of files.
	mkdir augustus
	cp -R {params.augustus_config_in_container}/cgp augustus
	cp {params.augustus_config_in_container}/config.ini augustus
	cp -R {params.augustus_config_in_container}/extrinsic augustus
	cp -R {params.augustus_config_in_container}/model augustus
	cp -R {params.augustus_config_in_container}/profile augustus
	mkdir augustus/species

	if [ -d {params.augustus_config_in_container}/species/{params.sp} ]
	then
		cp -R {params.augustus_config_in_container}/species/{params.sp} augustus/species
	fi		

	export AUGUSTUS_CONFIG_PATH=$(pwd)/augustus
	#echo $AUGUSTUS_CONFIG_PATH

	# handle gzipped and other assemblies differently
	if echo $(file $(readlink -f "{input.assembly}")) | grep compressed ;
	then
		fullname=$(basename "{input.assembly}")
		filename="${{fullname%.*}}"
		gunzip -c $(readlink -f "{input.assembly}") > "$filename"
	else
		filename="{input.assembly}"
	fi
	echo "Assembly used for BUSCO is "$filename 2>&1 | tee {log}
	run_busco -i $filename -f --out busco -c {threads} -sp {params.sp} --lineage_path {input.busco_set} -m {params.mode} {params.additional_params} 2>&1 | tee -a {log}

	# do some cleanup to save space
	echo -e "\\n[$(date)]\\tCleaning up after BUSCO to save space" 2>&1 | tee -a {log}
	basedir=$(pwd)
	cd run_busco
	mkdir software_outputs
	mv *_output software_outputs
	$basedir/bin/tar_folder.sh $basedir/{output.output} software_outputs 2>&1 | tee -a $basedir/{log}
	cd ..
	tar -pcf {output.single_copy_buscos} -C run_busco/ single_copy_busco_sequences
	tar -tvf {output.single_copy_buscos} > {output.single_copy_buscos_tarlist} 2>&1 | tee -a {log}
	#move output files:
	mv run_busco/full_table_busco.tsv {output.full_table}
	mv run_busco/short_summary_busco.txt {output.short_summary}
	mv run_busco/missing_busco_list_busco.tsv {output.missing_busco_list}
	#mv run_busco/single_copy_busco_sequences {output.single_copy_buscos}

	buscos=$(tail -n +6 results/orthology/busco/busco_runs/{params.species}/run_busco/full_table_busco.tsv | cut -f 2 | sort | uniq -c | tr '\\n' ' ' | sed 's/ $/\\n/g')
	name="{params.species}"
	echo "$(date) $name $buscos" >> results/statistics/runlog.txt

	touch {output.checkpoint}
	"""
shell:
	"""
	mkdir -p log
	dir=results/busco/{params.species}
	# prepare a stripped-down version of the augustus config directory.
	# this is introduced to lower the number of files.
	mkdir augustus
	cp -R {params.augustus_config_in_container}/cgp augustus
	cp -R {params.augustus_config_in_container}/extrinsic augustus
	cp -R {params.augustus_config_in_container}/model augustus
	cp -R {params.augustus_config_in_container}/profile augustus
	mkdir augustus/species
	cp -R {params.augustus_config_in_container}/species/generic augustus/species/

	if [ -d {params.augustus_config_in_container}/species/{params.sp} ]
	then
		cp -R {params.augustus_config_in_container}/species/{params.sp} augustus/species
	fi		

	export AUGUSTUS_CONFIG_PATH=$(pwd)/augustus
	#export AUGUSTUS_SCRIPTS_PATH=/usr/local/bin/ #might be necessary in ezlabgva/busco:v5.2.1_cv1 image
	#export AUGUSTUS_BIN_PATH=/usr/local/bin/
	#echo $AUGUSTUS_CONFIG_PATH

	# handle gzipped and other assemblies differently
	if [[ "{input.assembly}" =~ \.gz$ ]]
	then
		fullname=$(basename "{input.assembly}")
		filename="${{fullname%.*}}"
		gunzip -c $(readlink -f "{input.assembly}") > "$filename"
	else
		filename="{input.assembly}"
	fi
	echo "Assembly used for BUSCO is "$filename 2>&1 | tee {log}
	busco -i $filename -f --out {params.species} -c {threads} $(if [[ "{params.sp}" != "None" ]]; then echo "--augustus --augustus_species {params.sp}"; fi) --lineage_dataset $(pwd)/{input.busco_set} -m {params.mode} {params.additional_params} 2>&1 | tee -a {log}
	# do some cleanup to save space
	echo -e "\\n[$(date)]\\tCleaning up after BUSCO to save space" 2>&1 | tee -a {log}
	basedir=$(pwd)
	cd {params.species}/run_{params.set}
	mkdir software_outputs
	mv *_output software_outputs
	$basedir/bin/tar_folder.sh $basedir/{output.output} software_outputs 2>&1 | tee -a $basedir/{log}
	cd ..
	$basedir/bin/tar_folder.sh $basedir/{output.logs} logs 2>&1 | tee -a $basedir/{log}
	cd ..
	tar -pcf {output.single_copy_buscos} -C {params.species}/run_{params.set}/busco_sequences single_copy_busco_sequences 
	tar -tvf {output.single_copy_buscos} > {output.single_copy_buscos_tarlist} 2>&1 | tee -a $basedir/{log}

	#move output files:
	mv {params.species}/run_{params.set}/full_table.tsv {output.full_table}
	mv {params.species}/run_{params.set}/short_summary.txt {output.short_summary}
	mv {params.species}/run_{params.set}/missing_busco_list.tsv {output.missing_busco_list}

	buscos=$(tail -n +6 {output.full_table} | cut -f 2 | sort | uniq -c | tr '\\n' ' ' | sed 's/ $/\\n/g')
	name="{params.species}"
	echo "$(date) $name $buscos" >> results/statistics/runlog.txt

	#touch checkpoint
	touch {output.checkpoint}
	"""
shell:
	"""
	touch {output}
	"""
shell:
	"""
	python bin/extract_busco_table.py --hmm {input.busco_set}/hmms --busco_results {params.busco_dir} -o {output.busco_table}
	echo "species\tcomplete\tsingle_copy\tduplicated\tfragmented\tmissing\ttotal" > results/statistics/busco_summary.txt
	for file in $(ls results/orthology/busco/busco_runs/*/run_busco/short_summary_busco.txt);
	do  
		name=$(echo $file | sed 's#results/busco/##' | sed 's#/run_busco/short_summary_busco.txt##'); 
		printf $name; cat $file | grep -P '\t\d' | awk -F "\t" '{{printf "\t"$2}}' | awk '{{print}}'; done >> results/statistics/busco_summary.txt
	"""
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part 1 (orthology) done." >> results/statistics/runlog.txt
	"""
shell:
	"""
	quicktree -in a {input} > {output}
	"""
shell:
	"""
	echo "$(date) - Pipeline part ntree (njtree) done." >> results/statistics/runlog.txt
	touch {output}
	"""
shell:
	"""
	# First extract information on buscos:
	echo "species\tcomplete\tsingle_copy\tduplicated\tfragmented\tmissing\ttotal" > results/report/busco_summary.txt
	for file in $(ls results/busco/*/run_busco/short_summary_busco.txt);do  name=$(echo $file | sed 's#results/busco/##' | sed 's#/run_busco/short_summary_busco.txt##'); printf $name; cat $file | grep -P '\t\d' | awk -F "\t" '{{printf "\t"$2}}' | awk '{{print}}'; done >> results/report/busco_summary.txt
	Rscript bin/report.R {params.busco_set} {params.width} {params.height} {input.busco_table} results/report/busco_summary.txt {output.busco_table} {output.busco_summary}
	"""
shell:
	"""
	python bin/genome_download.py --entrez_email {params.email} --outdir {params.wd}/results/downloaded_genomes/ --overview-only &> {log}
	"""
shell:
	"""
	if [[ ! -f results/statistics/runlog.txt ]]; then if [[ ! -d results/statistics ]]; then mkdir -p results/statistics; fi; touch results/statistics/runlog.txt; fi
	if [[ "{params.species}" != "" ]]; then
		echo "$(date) - Setup: Will download species now - batch: {params.batch}" >> results/statistics/runlog.txt
		python bin/genome_download.py --entrez_email {params.email} --outdir {params.wd}/results/downloaded_genomes/ --genomes {params.species} --batch {params.batch} 2>&1 | tee {log}
	else
		# need to touch these files, since they are usually produced by the python script.
		touch {output.success}
		touch {output.download_overview}
		touch {output.failed}
		echo "$(date) - Setup: No species to download." >> results/statistics/runlog.txt
	fi
	if [[ ! -d results/checkpoints/genome_download ]]; then mkdir -p results/checkpoints/genome_download; fi # sometimes this folder is not created, better do it to be safe.
	touch {output.checkpoint}
	"""
Snakemake, from line 105 of rules/setup.smk
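The download helper can also be run by hand with the same flags the rule uses; the e-mail address, species name and batch number below are placeholders (species names follow the underscore-connected binomial convention of data/samples.csv):

(user@host)-$ python bin/genome_download.py --entrez_email you@example.org --outdir $(pwd)/results/downloaded_genomes/ --genomes Latimeria_chalumnae --batch 1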
shell:
	"""
	if [[ -z "{input}" ]]; then
		echo "All genomes local. Will touch empty output files."
		echo "name\tid\tassembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category\ttaxid\tspecies_taxid\torganism_name\tinfraspecific_name\tisolate\tversion_status\tassembly_level\trelease_type\tgenome_rep\tseq_rel_date\tasm_name\tsubmitter\tgbrs_paired_asm\tpaired_asm_comp\tftp_path\texcluded_from_refseq\trelation_to_type_material" > {output.statistics}
		touch {output.success}
		touch {output.failed}
		touch {output.overview}
	else
		cat {params.success} > {output.success}
		cat {params.failed} > {output.failed}	
		cat {params.overview} > {output.overview}
		echo "name\tid\tassembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category\ttaxid\tspecies_taxid\torganism_name\tinfraspecific_name\tisolate\tversion_status\tassembly_level\trelease_type\tgenome_rep\tseq_rel_date\tasm_name\tsubmitter\tgbrs_paired_asm\tpaired_asm_comp\tftp_path\texcluded_from_refseq\trelation_to_type_material" > {output.statistics}
		for file in $(ls results/downloaded_genomes/*_db_genbank.tsv); 
			do
				echo "---- Adding info ----"
				echo "file: "$file
				species=$(echo $file | awk -F '_db_' '{{print $1}}' | awk -F '/' '{{print $(NF)}}')

				echo "species: "$species
				if [[ -f "results/downloaded_genomes/"$species"_genomic_genbank.fna.gz" ]]; then 
					output=$(sed -n 2p $file)
					echo $species"\t""$output" >> {output.statistics}
				fi
			done
	fi
	"""
Snakemake, from line 138 of rules/setup.smk
shell:
	"""
	mkdir -p results/assemblies
	for spe in $(grep ",success" {input.overview}); do
		sp=$(echo $spe | awk -F',' '{{print $1}}')
		if [[ -f {params.wd}/results/assemblies/"$sp".fasta.gz ]]; then
			continue
		else
			link="{params.wd}/results/downloaded_genomes/"$sp"_genomic_genbank.fna.gz"
			echo $link
			if [[ ! -f "$link" ]]; then
				echo "$sp" >> {output.statistics} 
				continue
			else
				ln -s $link {params.wd}/results/assemblies/"$sp".fasta.gz
			fi
		fi
	done	
	for spe in {params.local_species}; do
		sparr=(${{spe//,/ }})
		if [[ -L {params.wd}/results/assemblies/"${{sparr[0]}}".fasta ]]; then
			echo "${{sparr[0]}}" >> {output.statistics_local}
			continue
		else
			echo "${{sparr[0]}}" >> {output.statistics_local}
			ln -s {params.wd}/"${{sparr[1]}}" {params.wd}/results/assemblies/"${{sparr[0]}}".fasta
		fi
	done
	if [[ ! -f {output.statistics} ]]; then touch {output.statistics}; fi
	if [[ ! -f {output.statistics_local} ]]; then touch {output.statistics_local}; fi
	touch {output.checkpoint}
	"""
Snakemake, from line 180 of rules/setup.smk
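The local-species loop relies on bash pattern substitution, ${spe//,/ }, to turn each comma-separated name,path pair into a two-element array (the braces are doubled in the rule only because of Snakemake's placeholder syntax). A minimal demonstration with invented values:

(user@host)-$ spe="Homo_sapiens,data/assemblies/hsap.fasta"
(user@host)-$ sparr=(${spe//,/ })
(user@host)-$ echo "name: ${sparr[0]} - assembly: ${sparr[1]}"
name: Homo_sapiens - assembly: data/assemblies/hsap.fasta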
shell:
	"""
	echo -e "[$(date)]\\tBUSCO set specified: {params.set}" 2>&1 | tee {log}
	if [ -d {output.busco_set} ]; then rm -rf {output.busco_set}; fi
	mkdir {output.busco_set}

	if [ "{params.busco_version}" == "3.0.2" ]
	then
		base_url="https://busco.ezlab.org/v3/datasets"
		echo -e "[$(date)]\\tDownloading .." 2>&1 | tee -a {log}
		wget -q -c $base_url/{params.set}.tar.gz -O - --no-check-certificate | tar -xz --strip-components 1 -C {output.busco_set}/
	elif [ "{params.busco_version}" == "5.2.1" ]
	then
		base_url="https://busco-data.ezlab.org/v5/data/lineages"
		current=$(curl -s $base_url/ | grep "{params.set}" | cut -d ">" -f 2 | sed 's/<.*//')
		echo -e "[$(date)]\\tCurrent version is: $current" 2>&1 | tee -a {log}
		echo -e "[$(date)]\\tDownloading .." 2>&1 | tee -a {log}
		wget -q -c $base_url/$current -O - --no-check-certificate | tar -xz --strip-components 1 -C {output.busco_set}/
	else
		echo -e "\\n######\\nPlease specify a valid BUSCO version in your config file - supported are '3.0.2' and '5.0.2'\\n######" 2>&1 | tee -a {log}
		exit 1
	fi

	echo -ne "[$(date)]\\tDone!\\n" 2>&1 | tee -a {log}
	touch {output.checkpoint}
	"""
shell:
	"""
	touch {output}
	mkdir -p results/statistics
	touch "results/statistics/runlog.txt"
	echo "$(date) - phylociraptor setup done." >> results/statistics/runlog.txt
	"""
Snakemake, from line 264 of rules/setup.smk
shell:
	"""
	touch {output}
	"""
Snakemake, from line 279 of rules/setup.smk
shell:
	"""
	touch {output}
	mkdir -p results/statistics
	touch "results/statistics/runlog.txt"
	echo "$(date) - Pipeline setup done." >> results/statistics/runlog.txt
	"""
shell:
	"""
	touch {output}
	"""
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part 1 (orthology) done." >> results/statistics/runlog.txt
	"""
shell:
	"""
	echo "$(date) - Pipeline part filter-orthology done." >> results/statistics/runlog.txt
	touch {output}
	"""
Snakemake, from line 105 of rules/Snakefile
shell:
	"""
	touch {output}	
	echo "$(date) - Pipeline part 2 (align) done." >> results/statistics/runlog.txt
	"""
Snakemake, from line 116 of rules/Snakefile
shell:
	"""
	echo "$(date) - Pipeline part filter_align done." >> results/statistics/runlog.txt
	touch {output}
	"""
Snakemake, from line 128 of rules/Snakefile
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part modeltest (model) done." >> results/statistics/runlog.txt
	"""
Snakemake, from line 140 of rules/Snakefile
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part 3 (tree) done." >> results/statistics/runlog.txt
	"""
Snakemake, from line 152 of rules/Snakefile
shell:
	"""
	echo "$(date) - Speciestree reconstruction done." >> results/statistics/runlog.txt
	touch {output}
	"""
Snakemake, from line 163 of rules/Snakefile
shell:
	"""
	rm -rf {output.trees}

	if [[ "{params.include}" == "None" ]]
	then
		echo "$(date) - phylociraptor will use gene tree filtering based on average bootstrap support value of {wildcards.bootstrap}." >> {params.wd}/results/statistics/runlog.txt
		for gene in $(echo "{params.genes}")
		do
				cat results/modeltest/{wildcards.aligner}-{wildcards.alitrim}/$gene/*.treefile >> {output.trees}
		done
	else
		cat $(cat {params.include}) > {output.trees}
	fi
	touch {output.checkpoint}
	"""
shell:
	"""
	java -jar /ASTRAL-5.7.1/Astral/astral.5.7.1.jar -i {input.trees} -o {output.species_tree} $(if [[ "{params.seed}" != "None" ]]; then echo "-s {params.seed}"; fi)
	statistics_string="astral\t{wildcards.aligner}\t{wildcards.alitrim}\t{wildcards.bootstrap}\t$(cat {input.trees} | wc -l)\t$(cat {output.species_tree})"
	echo -e $statistics_string > {params.wd}/{output.statistics}	
	touch {output.checkpoint}
	"""
shell:
	"""

	echo "$(date) - Speciestree reconstruction done." >> results/statistics/runlog.txt
	touch {output}
	"""
shell:
	"""
	if [[ -f {params.wd}/{params.models} && -f {params.wd}/checkpoints/modeltest.done ]]; then
		echo "$(date) - 'phylociraptor model' finished successfully before. Will run raxml with best models." >> {params.wd}/results/statistics/runlog.txt
		awk 'FNR==NR{{a[$1"_aligned_trimmed.fas"]=$2;next}}{{print $0"\\t"a[$1]}}' {params.models} results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/statistics.txt | awk -F"\\t" 'NR>1{{split($1,b,"_"); print $9", " b[1]"="$2"-"$3}}' > results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/partitions_unformated.txt
	else
		echo "$(date) - 'phylociraptor model' was NOT run before. Will run raxml with GTR or PROTGTR depending on input data type." >> {params.wd}/results/statistics/runlog.txt
		if [[ {params.datatype} == "aa" ]]; then
			awk '{{print $0"\\tPROTGTR"}}' {input} | awk -F"\\t" 'NR>1{{split($1,b,"_"); print $9", " b[1]"="$2"-"$3}}' > results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/partitions_unformated.txt
		else
			awk '{{print $0"\\tGTR"}}' {input} | awk -F"\\t" 'NR>1{{split($1,b,"_"); print $9", " b[1]"="$2"-"$3}}' > results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/partitions_unformated.txt
		fi
	fi
	# correct some model names to make them raxml compatible:
	# it is not quite clear which models are compatible. During more extensive tests additional problems should show up
	cat results/phylogeny-{wildcards.bootstrap}/concatenate/{wildcards.aligner}-{wildcards.alitrim}/partitions_unformated.txt | sed 's/JTTDCMut/JTT-DCMut/' > {output.partitions}
	touch {output.partitions}
	"""
Snakemake, from line 24 of rules/tree.smk
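The awk construct doing the heavy lifting is the classic two-file join: FNR==NR is only true while awk reads the first file (the best-model table), so that block fills the lookup array a; each line of the second file is then printed with its model appended. Reduced to its core, without the key suffix and the reshaping of the real rule (toy files):

(user@host)-$ printf "g1\tJTT\ng2\tWAG\n" > models.txt
(user@host)-$ printf "g1\t300\ng2\t450\n" > stats.txt
(user@host)-$ awk 'FNR==NR{a[$1]=$2; next}{print $0"\t"a[$1]}' models.txt stats.txt
g1	300	JTT
g2	450	WAG

The second awk of the rule then reshapes each joined line into a raxml-ng partition entry of the form "JTT, gene1=1-300".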
shell:
	"""
	cp {input.alignment} {output.alignment}
	cp {input.partitions} {output.partitions}
	cd results/phylogeny-{wildcards.bootstrap}/raxml/{wildcards.aligner}-{wildcards.alitrim}
	raxml-ng --msa {params.wd}/{output.alignment} $(if [[ "{params.seed}" != "None" ]]; then echo "--seed {params.seed}"; fi) --prefix raxmlng --threads {threads} --bs-trees {params.bs} --model {params.wd}/{output.partitions} --all {params.additional_params}
	statistics_string="raxmlng\t{wildcards.aligner}\t{wildcards.alitrim}\t{params.bs}\t{wildcards.bootstrap}\t$(cat {params.wd}/{output.partitions} | wc -l)\t$(cat raxmlng.raxml.bestTree)"
	echo -e $statistics_string > {params.wd}/{output.statistics}
	touch {params.wd}/{output.checkpoint}
	"""
shell:
	"""
	mkdir {output.algn}
	echo "$(date) - prepare_iqtree {wildcards.aligner}-{wildcards.alitrim}: Will use bootstrap cutoff ({wildcards.bootstrap}) before creating concatenated alignment" >> {params.wd}/results/statistics/runlog.txt
	for gene in $(echo "{params.genes}")
	do
		cp {params.wd}/results/alignments/filtered/{wildcards.aligner}-{wildcards.alitrim}/"$gene"_aligned_trimmed.fas {output.algn}/
	done

	echo "Will create NEXUS partition file with model information now." >> {params.wd}/results/statistics/runlog.txt
	echo "#nexus" > {output.nexus}
	echo "begin sets;" >> {output.nexus}
	i=1
	charpart=""
	for gene in $(echo "{params.genes}")
	do
		cat {params.wd}/{params.models} | grep $gene | awk -v i=$i '{{print "charset part"i" = algn/"$1"_aligned_trimmed.fas:*;"}}' >> {output.nexus}
		charpart=$charpart$(cat {params.wd}/{params.models} | grep $gene | awk -v i=$i '{{printf($2":part"i", ")}}' | sed 's/\\(.*\\), /\\1, /')
		i=$((i+1))
	done
	echo "charpartition mine = "$charpart";" >> {output.nexus}
	echo "end;" >> {output.nexus} concat.nex
	echo "$(date) - nexus file for iqtree written." >> {params.wd}/results/statistics/runlog.txt
	"""
shell:
	"""
	cd results/phylogeny-{wildcards.bootstrap}/iqtree/{wildcards.aligner}-{wildcards.alitrim}/
	iqtree -p concat.nex --prefix concat -bb {params.bb} -nt {threads} $(if [[ "{params.seed}" != "None" ]]; then echo "-seed {params.seed}"; fi) {params.additional_params} 
	statistics_string="iqtree\t{wildcards.aligner}\t{wildcards.alitrim}\t{params.bb}\t{wildcards.bootstrap}\t$(ls algn | wc -l)\t$(cat concat.contree)"
	echo -e $statistics_string > {params.wd}/{output.statistics}	
	touch {params.wd}/{output.checkpoint}
	"""
shell:
	"""
	touch {output}
	echo "$(date) - Pipeline part 3 (tree) done." >> results/statistics/runlog.txt
	"""
Snakemake, from line 201 of rules/tree.smk

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/chrishah/phylogenomics_intro_vertebrata
Name: phylogenomics_intro_vertebrata
Version: 1
Copyright: Public Domain
License: None