gvclass: Taxonomic Classification of Giant Viruses and MAGs

Version 0.9.4 (March 7, 2023)

Giant viruses are abundant, diverse, and frequently found in environmental microbiomes. gvclass assigns taxonomy to putative giant virus contigs or metagenome-assembled genomes (GVMAGs). It uses a conservative approach based on the consensus of single-protein trees built from up to 9 giant virus orthologous groups (GVOGs). These 9 GVOGs are conserved across different lineages of the viral phylum Nucleocytoviricota.

Running gvclass

Requirements

  • Conda environment with snakemake; see the installation guide: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

  • Input is a directory that contains single contigs or MAGs as nucleic acid (.fna) or protein (.faa) sequences

  • File extensions .fna or .faa

  • Recommended length for single contigs is 50 kb, with a minimum of 20 kb

  • No special characters (".", ";", ":") in the file base name; "_" or "-" are okay

  • No whitespace in sequence headers

  • Recommended sequence header format if an faa file is provided: <filename>|<proteinid>

  • Input will be checked and reformatted if necessary
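
A minimal sketch (not part of gvclass, which checks and reformats input itself as stated above) of how a query directory could be pre-checked against these requirements before starting the workflow; the helper name and invocation are assumptions for illustration only.

import os
import re
import sys
from Bio import SeqIO


def check_querydir(querydir):
    """Pre-check file extensions, file base names, and sequence headers."""
    ok = True
    for fname in sorted(os.listdir(querydir)):
        base, ext = os.path.splitext(fname)
        if ext not in (".fna", ".faa"):
            print(f"{fname}: extension must be .fna or .faa")
            ok = False
            continue
        if not re.fullmatch(r"[A-Za-z0-9_-]+", base):
            print(f"{fname}: file base name should only contain letters, digits, '_' or '-'")
            ok = False
        for record in SeqIO.parse(os.path.join(querydir, fname), "fasta"):
            if " " in record.description.strip():
                print(f"{fname}: whitespace in sequence header '{record.description}'")
                ok = False
                break
    return ok


if __name__ == "__main__":
    sys.exit(0 if check_querydir(sys.argv[1]) else 1)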

IMG/VR

  • Upload your metagenome-assembled genome or single contig to IMG/VR using the gvclass feature

Snakemake workflow

  • Clone the repository
git clone https://github.com/NeLLi-team/gvclass
  • Test gvclass using the provided giant virus assemblies
snakemake -j 24 --use-conda --config querydir="example"
  • If this completes successfully, run it using your own directory of query genomes
snakemake -j <number of processes> --use-conda --config querydir="<path to query dir>"

Docker / Shifter

  • Will be provided soon

Interpretation of the results

  • The classification result is summarized in a tab-separated file <query name>.summary.tab
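
A minimal sketch (assuming pandas and a hypothetical file name) of loading this summary table; the column names follow the summarize script under Code Snippets below.

import pandas as pd

# load the per-query summary written by gvclass (hypothetical file name)
df = pd.read_csv("querygenome.summary.tab", sep="\t")

# one row per stringency level ("gte1", "gte2", "gte3", "majority");
# keep only rows with an NCLDV domain call
ncldv = df[df["domain"].str.startswith("d_NCLDV")]
print(ncldv[["query", "stringency", "order", "family", "genus", "avgdist"]])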

Taxonomy assignments

  • Taxonomy assignments are provided at different taxonomic levels.

  • To yield an assignment, all nearest neighbors in the GVOG phylogenetic trees have to be in agreement.

  • Depending on the number of identified unique GVOGs, an assignment is provided if at least 1 GVOG (stringency "gte1"), 2 GVOGs (stringency "gte2"), or 3 GVOGs (stringency "gte3") were found.

  • Less stringent is the "majority" approach, in which more than 50% of the identified markers have to yield the same taxonomy to enable an assignment (see the sketch after this list).

  • If taxonomy assignments are not in agreement at a lower taxonomic level (e.g. species, genus, or family), the next higher taxonomic level is evaluated, up to the domain level.

  • March 2023: Added an experimental XGBoost classifier based on kmers. It assigns input sequences (fna) to cellular domains, phages, or NCLDV, together with an order-level assignment (if NCLDV).
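
A toy sketch of the "majority" consensus rule described above; the actual implementation is the most_frequent()/get_final_tax() logic in the summarize script under Code Snippets, and the taxon names here are only illustrative.

from collections import Counter

def consensus(taxstrings):
    """Return a taxon if more than 50% of markers agree, otherwise no assignment ("_")."""
    if not taxstrings:
        return "_"
    taxon, count = Counter(taxstrings).most_common(1)[0]
    return taxon if count > len(taxstrings) / 2 else "_"

print(consensus(["Imitervirales", "Imitervirales", "Algavirales"]))  # Imitervirales (2 of 3 agree)
print(consensus(["Imitervirales", "Algavirales"]))                   # "_" (no majority)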

Contamination

  • Giant virus genomes typically encode fewer than 10 out of a set of 56 universal cellular housekeeping genes (UNI56). Higher UNI56 counts indicate cellular contamination or giant virus sequences located on host contigs.

    • UNI56u (unique counts), UNI56t (total counts), and UNI56df (duplication factor) are provided and can be used for further quality filtering
  • Giant virus genomes typically have a GVOG7 duplication factor (GVOG7 is a subset of GVOG9) below 3. Higher GVOG7 duplication factors indicate the presence of mixed viral populations.

    • GVOG7u (unique counts), GVOG7t (total counts), and GVOG7df (duplication factor) are provided and can be used for further quality filtering
  • Giant viruses may break any of these rules; thus, gvclass does not perform automatic quality filtering based on marker gene counts (a manual filtering sketch follows this list).
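
A minimal sketch of such manual filtering on a gvclass summary table, using the rule-of-thumb thresholds described above (UNI56u below 10, GVOG7df below 3); the file name is hypothetical.

import pandas as pd

df = pd.read_csv("querygenome.summary.tab", sep="\t")  # hypothetical file name
clean = df[(df["UNI56u"] < 10) & (df["GVOG7df"] < 3)]
flagged = df[(df["UNI56u"] >= 10) | (df["GVOG7df"] >= 3)]
print(f"{len(clean)} entries pass, {len(flagged)} flagged for possible contamination or mixed populations")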

Benchmarking

  • Will be provided soon

Citation

  • Will be provided soon

References

  1. Schulz F, Roux S, Paez-Espino D, Jungbluth S, Walsh DA, Denef VJ, McMahon KD, Konstantinidis KT, Eloe-Fadrosh EA, Kyrpides NC, Woyke T. Giant virus diversity and host interactions through global metagenomics. Nature. 2020 Feb;578(7795):432-6.

  2. Aylward FO, Moniruzzaman M, Ha AD, Koonin EV. A phylogenomic framework for charting the diversity and evolution of giant viruses. PLoS biology. 2021 Oct 27;19(10):e3001430.

Acknowledgements

GVClass was developed by the New Lineages of Life Group at the DOE Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.

Code Snippets

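
# Likely workflow/scripts/00reformat.py (script names inferred from the shell rules at the bottom of this page):
# reformat protein headers to <filename>|<proteinid> and write basic stats for faa input.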
import click
from Bio import SeqIO
import pandas as pd
import os


@click.command()
@click.option('--input', '-i', type=click.Path(exists=True), help='Input FASTA file')
@click.option('--output', '-o', type=click.Path(), help='Output FASTA file')
@click.option('--stats', '-s', type=click.Path(), help='Genome stats outfile')
def main(input, output, stats):
    # read in the input file
    records = list(SeqIO.parse(input, "fasta"))

    # modify the sequence IDs and descriptions
    protcount = 0
    for record in records:
        protcount += 1
        if "|" in record.id:
            record.id = record.id.split("|")[-1]
        record.id = f"{os.path.splitext(os.path.basename(input))[0]}|{record.id}"
        record.description = ""

    # write the reformatted sequences to the output file
    SeqIO.write(records, output, "fasta")

    # calculate statistics and write them to the stats file
    stats_dict = {
        'query': os.path.splitext(os.path.basename(input))[0],
        'contigs': 'no_fna',
        'LENbp': 'no_fna',
        'GCperc': 'no_fna',
        'genecount': protcount,
        'CODINGperc': 'no_fna',
        'ttable': 'no_fna'
    }
    pd.DataFrame.from_dict(stats_dict, orient='index').T.to_csv(stats, sep="\t", index=False)

if __name__ == '__main__':
    main()
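
# Likely workflow/scripts/01genecalling_single.py: prodigal gene calling with several translation tables,
# keeping the table that yields the highest coding density.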
import click
import os
import sys
import shutil
import glob
from subprocess import call
from Bio import SeqIO
import pandas as pd


def calc_stats(fnain, faain):
    """
    Calculate coding density, assembly size, GC%
    """
    cumulative_gc = 0
    cumulative_len = 0
    cumulative_len_aa = 0
    gene_count = 0
    contigs = 0

    for seq_record in SeqIO.parse(fnain, "fasta"):
        cumulative_gc += seq_record.seq.upper().count('G') + seq_record.seq.upper().count('C')
        cumulative_len += len(seq_record.seq)
        contigs += 1

    for seq_record in SeqIO.parse(faain, "fasta"):
        cumulative_len_aa += len(seq_record.seq)
        gene_count += 1

    return [contigs, cumulative_len, round(cumulative_gc / cumulative_len * 100, 2), gene_count, round(cumulative_len_aa * 3 / cumulative_len * 100, 2)]


def check_filename(fnafile):
    """
    Input has to be .fna with no additional . in filename
    """
    if len(fnafile.split(".")) > 2:
        print("fnafile name contains additional '.'")
        sys.exit()
    if fnafile.split(".")[1] != "fna":
        print("fnafile ending needs to be .fna")
        sys.exit()


def rename_header(bestcode, finalfaa, gffout):
    """
    Rename sequence header to ><filename>|<proteinid>
    """
    records = []
    infilebase = os.path.splitext(os.path.basename(bestcode))[0]

    for seq_record in SeqIO.parse(bestcode, "fasta"):
        headerbase = seq_record.description.split("|")[0]
        if headerbase != infilebase:
            seq_record.id = f"{infilebase}|{str(seq_record.id).split()[0]}"
            seq_record.description = ""
            records.append(seq_record)
        elif headerbase == infilebase:
            seq_record.id = str(seq_record.id).split()[0]
            seq_record.description = ""
            records.append(seq_record)

    SeqIO.write(records, finalfaa, "fasta")
    # keep only best gff
    shutil.copy(gffout + "_code" + bestcode.split("_code")[-1], gffout)


def run_genecalling_codes(fnafile, code, gffout):
    """
    Genecalling with prodigal and defined tt
    """
    faaout = gffout.replace(".gff", ".faa")
    if code > 0:
        run_command = f"prodigal -n -q -f gff -i {fnafile} -a {faaout}_code{code} -o {gffout}_code{code} -g {code}"
        call(run_command, shell=True)
    else:  # output for snakemake with -p meta
        run_command = f"prodigal -n -q -f gff -p meta -i {fnafile} -a {faaout}_codemeta -o {gffout}"
        call(run_command, shell=True)
        shutil.copy(gffout, gffout + "_codemeta")


@click.command()
@click.option('--fnafile', '-f', type=click.Path(exists=True), help='Input FNA file')
@click.option('--gffout', '-g', type=click.Path(), help='Output GFF file')
@click.option('--genecalling_statsout', '-gs', '--genecalling-statsout', type=click.Path(), help='Genecalling statistics file')
@click.option('--summary_statsout', '-ss', type=click.Path(), help='Genome stats outfile')
@click.option('--finalfaa', '-fa', type=click.Path(), help='Output FASTA file')
def main(fnafile, gffout, genecalling_statsout, summary_statsout, finalfaa):
    # check input
    check_filename(fnafile)
    # gene calling
    # create output for snakemake
    codes = [0, 1, 4, 6, 11] # pass these from snakemake config instead
    for code in codes:
        run_genecalling_codes(fnafile, code, gffout)
    stats_dict = {}
    faaoutdir = os.path.dirname(gffout)
    for faain in glob.glob(os.path.join(faaoutdir, "*.faa_code*")):
        genome_id = os.path.basename(faain)
        stats_dict[genome_id] = calc_stats(fnafile, faain)

    # gene calling stats to select ttable that yields highest coding density
    stats_df = pd.DataFrame.from_dict(stats_dict, columns=["contigs", "LENbp", "GCperc", "genecount", "CODINGperc"], orient="index")
    stats_df = stats_df.sort_values("CODINGperc", ascending=False)
    stats_df["ttable"] = stats_df.index.map(lambda x: x.split("_")[-1])
    stats_df.insert(0, "query", stats_df.index.map(lambda x: os.path.splitext(x)[0]))
    bestcode = os.path.join(faaoutdir, stats_df.index[0])  # genome id with code appended that had highest coding density
    stats_df.to_csv(genecalling_statsout, sep="\t", index=None)
    stats_df.head(1).to_csv(summary_statsout, sep="\t", index=None)

    # rename headers and cleanup
    rename_header(bestcode, finalfaa, gffout)
    for tempout in glob.glob(f"{gffout}_code*"):
        os.remove(tempout)
        faaout = tempout.replace(".gff", ".faa")
        if os.path.isfile(faaout):
            os.remove(faaout)


if __name__ == '__main__':
    main()
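
# Likely workflow/scripts/01predict.py: experimental kmer-based XGBoost classifier
# (domain-level prediction, plus order-level prediction for NCLDV).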
import click
import pandas as pd
import pickle
from Bio import SeqIO
from Bio.Seq import Seq
from sklearn.preprocessing import LabelEncoder


def get_label_mapping(mapping_file):
    """Returns a dictionary mapping encoded labels to their original values"""
    with open(mapping_file) as f:
        return {int(line.split('\t')[0]): line.strip().split('\t')[1] for line in f}


def get_best_kmers(kmers_file):
    """Returns a list of the best kmers"""
    with open(kmers_file) as f:
        return [line.strip() for line in f.readlines()]


def count_kmers(kmers, fna):
    """Counts the occurrences of kmers in a given FASTA file"""
    kmer_counts_genome_dict = {x:0 for x in kmers}
    total_seqs = sum(1 for _ in SeqIO.parse(fna, "fasta"))
    with click.progressbar(SeqIO.parse(fna, "fasta"), label='Counting kmers', length=total_seqs) as records:
        for seq_record in records:
            for kmer in kmers:
                kmer_counts_genome_dict[kmer] += Seq(seq_record.seq).count_overlap(kmer)
    return kmer_counts_genome_dict


def predict(model, kmers, labels, queryseq):
    # Load XGBoost model
    xgbmodel = pickle.load(open(model, 'rb'))

    # Load label mapping
    inv_mapping = get_label_mapping(labels)

    # Load and preprocess query sequence counts
    kmers_list = get_best_kmers(kmers)
    kmer_count_all_dict = {queryseq.split("/")[-1].split(".f")[0]: count_kmers(kmers_list, queryseq)}
    df_qcounts  =  pd.DataFrame.from_dict(kmer_count_all_dict).T.fillna(0)
    df_qcounts = df_qcounts.div(df_qcounts.sum(axis=1), axis=0)
    df_qcounts = df_qcounts.reset_index().drop(['index'], axis=1)

    # Ensure correct order of features
    cols_when_model_builds = xgbmodel.get_booster().feature_names
    df_qcounts = df_qcounts[cols_when_model_builds]

    # Make prediction and print result
    y_pred = xgbmodel.predict(df_qcounts.head(1))
    prediction = inv_mapping[y_pred[0]]
    prob_val = xgbmodel.predict_proba(df_qcounts.head(1))
    print (prob_val)
    proba_dict = {inv_mapping[x]:[prob_val[0,x]] for x,y in inv_mapping.items()}
    return prediction, proba_dict


@click.command()
@click.option('--queryseq', '-q', type=click.Path(exists=True), required=True, help='Path to query sequence')
@click.option('--model', '-m', type=click.Path(exists=True), required=True, help='Path to prebuilt model')
@click.option('--kmers', '-k', type=click.Path(exists=True), required=True, help='Path to kmer file')
@click.option('--labels', '-l', type=click.Path(exists=True), required=True, help='Path to labels')
@click.option('--out', '-o', type=click.Path(exists=False), required=True, help='Path to prediction output')
def main(queryseq, model, kmers, labels, out):
    # get domain level prediction
    prediction_domain, proba_dict_domain = predict(model, kmers, labels, queryseq)
    # if NCDLV then also get order level prediction
    if prediction_domain == "NCLDV":
        model_order = model.replace("domain", "order")
        kmers_order = kmers.replace("domain", "order")
        labels_order = labels.replace("domain", "order")
        prediction_order, proba_dict_order = predict(model_order, kmers_order, labels_order, queryseq)
    else:
        prediction_order = ""
        proba_dict_order = {}
    with open(out, "w") as outfile:
        outfile.write(f"{queryseq.split('/')[-1].split('.f')[0]}\t{prediction_order}___{prediction_domain}")
    proba_dict = {**proba_dict_domain, **proba_dict_order}
    df_proba = pd.DataFrame(proba_dict)
    df_proba["query"] = queryseq.split('/')[-1].split('.f')[0]
    df_proba = df_proba.set_index("query")
    #print (df_proba)
    #df_proba.to_csv("test.tab", sep="\t", index=None)
    df_proba.to_csv(f"{out}.proba", sep="\t", index=True)

if __name__ == "__main__":
    main()
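
# Likely workflow/scripts/02hmmsearch.py: hmmsearch against the combined GVOG9 or UNI56 models
# and per-model marker counting.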
import click
import subprocess
import sys
import pandas as pd


def run_cmd(cmd):
    """Runs a command in the shell and prints standard output and error"""
    sp = subprocess.Popen(cmd, shell=True)
    std_out, std_err = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


def get_models(modelsin):
    """
    Generate list of all model names
    Some models might have no hits in hmmout, but they should be displayed in count matrix
    """
    models = []
    with open(modelsin, "r") as f:
        models = [line.split()[-1].rstrip() for line in f if line.startswith("NAME")]
    return models


def get_markercompleteness(models, hmmout, query):
    """Returns a dictionary of marker count for each model"""
    count_dict = {x: 0 for x in models}
    seen = []
    with open(hmmout, "r") as f:
        lines = [line.rstrip() for line in f if not line.startswith("#")]
        for line in lines:
            if line.split()[0] not in seen:
                count_dict[line.split()[3]] += 1
                seen.append(line.split()[0])
    count_dict = {query: count_dict}
    return count_dict


@click.command()
@click.option('--queryfaa', '-q', type=click.Path(exists=True), help='Input query FASTA file')
@click.option('--modelscombined', '-m', type=click.Path(exists=True), help='Combined HMM model file')
@click.option('--hmmout', '-h', type=click.Path(), help='Output file for HMM hits')
@click.option('--target', '-t', type=str, help='Target database (UNI56 or UNIREF90)')
@click.option('--countout', '-c', type=click.Path(), help='Output file for marker gene counts')
def main(queryfaa, modelscombined, hmmout, target, countout):
    ### hmmsearch ###
    cutoff = "-E 1e-10"
    if target == "UNI56":
        cutoff = "--cut_ga"
    hmmsearch = ["hmmsearch" +
                 " --noali" +
                 " " + cutoff + " " +
                 " --domtblout " + hmmout +
                 " --cpu 4 " + modelscombined +
                 " " + queryfaa]
    run_cmd(hmmsearch)

    ### get counts ###
    query = queryfaa.split("/")[-1].split(".")[0]
    models = get_models(modelscombined)
    count_dict = get_markercompleteness(models, hmmout, query)
    pd.DataFrame.from_dict(count_dict).T.to_csv(countout, sep="\t")


if __name__ == '__main__':
    main()
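
# Likely workflow/scripts/03extract_qhits.py: extract query proteins that hit a given marker model.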
import os
from Bio import SeqIO
import shutil
from collections import defaultdict
import click

@click.command()
@click.option('--hmmout', '-h', required=True, help='Path to HMM output file.')
@click.option('--queryfaa', '-q', required=True, help='Path to query protein FASTA file.')
@click.option('--queryhitsfaa', '-o', required=True, help='Path to output file containing hits.')
def main(hmmout, queryfaa, queryhitsfaa):
    outmodel = os.path.splitext(os.path.basename(queryhitsfaa))[0]
    hits_model_dict = get_qhits(hmmout, outmodel)
    export_hitsfaa(hits_model_dict, queryfaa, queryhitsfaa)

def export_hitsfaa(hits_model_dict, queryfaa, queryhitsfaa):
    query_ids = set()
    with click.progressbar(hits_model_dict.values(), label='Exporting hits to file') as bar1:
        for hits in bar1:
            query_ids |= set(hits)
    with open(queryhitsfaa, "w") as outfile:
        with click.progressbar(SeqIO.parse(queryfaa, "fasta"), label='Writing hits to file') as bar2:
            for seq_record in bar2:
                if seq_record.description in query_ids:
                    SeqIO.write(seq_record, outfile, "fasta")

def get_qhits(hmmout, outmodel):
    hits_model_dict = defaultdict(set)
    with open(hmmout, "r") as infile:
        with click.progressbar(infile, label='Parsing HMM output') as bar:
            for line in bar:
                if not line.startswith("#") and outmodel in line:
                    hits_model_dict[line.split()[3]].add(line.split()[0])
    return hits_model_dict

if __name__ == '__main__':
    main()
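
# Likely workflow/scripts/04blastp_reduce_merge.py: diamond blastp to pick the closest reference sequences
# and merge them with the query hits for alignment.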
import click
import subprocess
from Bio import SeqIO
from collections import defaultdict
import os


def run_cmd(cmd):
    sp = subprocess.Popen(cmd,
        #stderr=subprocess.PIPE,
        #stdout=subprocess.PIPE,
        shell=True)
    std_out, std_err = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


def run_blastp(queryfaa, refdb, blastpout):
    # diamond blastp with e-value 1e-10 and 30% query/subject coverage cutoffs
    dblastp = ["diamond blastp --threads 8 "
               "--quiet --outfmt 6 "
               "--more-sensitive --query-cover 30 "
               "--subject-cover 30 "
               "-e 1e-10 "
               "-d {} -q {} -o {}".format(refdb, queryfaa, blastpout)]
    run_cmd(dblastp)


def parse_blastp(blastpout):
    queryids_hits_dict = defaultdict(list)
    seen = []
    # there could be multiple queries
    try:
        num_queries = len(set([line.split()[0] for line in open(blastpout)]))
        with open(blastpout, "r") as infile:
            for line in infile:
                queryname = line.split()[0]
                subjectname = line.split()[1]
                pident = float(line.split()[2])
                # Yield equal number of subject ids per query
                if subjectname not in seen and len(queryids_hits_dict[queryname]) <= 100/num_queries and pident!=100:
                    queryids_hits_dict[queryname].append(subjectname)
                    seen.append(subjectname)
    except:
        pass
    # single list with all 100 top subject ids
    besthits = [x for v in queryids_hits_dict.values() for x in v]
    return besthits


def reduce_ref(queryfaa, reffaa, reduced_reffaa_merged, refdb, blastpout):
    run_blastp(queryfaa, refdb, blastpout)
    besthits = parse_blastp(blastpout)
    seenids = []
    outseqs = []
    for seq_record in SeqIO.parse(queryfaa, "fasta"):
        if seq_record.id not in seenids:
            outseqs.append(seq_record)
            seenids.append(seq_record.id)
    for seq_record in SeqIO.parse(reffaa, "fasta"):
        if seq_record.description in besthits and seq_record.id not in seenids:
            outseqs.append(seq_record)
            seenids.append(seq_record.id)
    SeqIO.write(outseqs, reduced_reffaa_merged, "fasta")


@click.command()
@click.option('--queryfaa', '-q', required=True, help='Query sequences from hmmsearch for a single marker')
@click.option('--reffaa', '-r', required=True, help='Ref seqs for a single marker')
@click.option('--refdb', '-d', required=True, help='Diamond formatted db for a single marker')
@click.option('--blastpout', '-b', required=True, help='Output file name for diamond blastp')
@click.option('--reduced_reffaa_merged', '-o', required=True, help='Output file name for reduced reference sequences')
def main(queryfaa, reffaa, refdb, blastpout, reduced_reffaa_merged):
    if os.path.getsize(queryfaa) > 0:
        reduce_ref(queryfaa, reffaa, reduced_reffaa_merged, refdb, blastpout)
    else: 
        print (f"{queryfaa} is empty, no hits, omitting blast")

if __name__ == '__main__':
    main()
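
# Likely workflow/scripts/05align_trim.py: align merged sequences with mafft and trim with trimal.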
import subprocess
import click


def run_cmd(cmd):
    sp = subprocess.Popen(cmd,
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=True)
    (std_out, std_err) = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


@click.command()
@click.option("--queryfaamerged", "-q", required=True, type=click.Path(exists=True), help="Input fasta file containing merged fasta sequences")
@click.option("--alnout", "-a", required=True, type=click.Path(), help="Output fasta file for aligned sequences")
@click.option("--trimout", "-t", required=True, type=click.Path(), help="Output fasta file for trimmed sequences")
def main(queryfaamerged, alnout, trimout):
    mafftaln = f"mafft --quiet --thread 4 {queryfaamerged} > {alnout}"
    trimaln = f"trimal -gt 0.1 -in {alnout} -out {trimout}"
    for cmd in [mafftaln, trimaln]:
        run_cmd(cmd)

if __name__ == "__main__":
    main()
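
# Likely workflow/scripts/06build_tree.py: build a protein tree with FastTree (LG model).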
import click
import subprocess


def run_cmd(cmd):
    sp = subprocess.Popen(cmd,
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=True)
    (std_out, std_err) = sp.communicate()
    print('std_err: ', std_err)
    print('std_out: ', std_out)


@click.command()
@click.option("--aln", "-a", type=click.Path(exists=True), help='Alignment file in FASTA format')
@click.option("--treeout", "-t",  type=click.Path(), help='Output file for the tree')
def main(aln, treeout):
    runtree = f"fasttree -lg < {aln} > {treeout}"
    run_cmd(runtree)


if __name__ == "__main__":
    main()
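
# Likely workflow/scripts/07get_nn_aln.py: nearest neighbor per marker based on alignment identity.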
import glob
from collections import Counter
from Bio import AlignIO
import click

def get_nn(aln, query):
    hd_dict_paralogs = {}
    for qseq in AlignIO.read(aln, "fasta"):
        if qseq.id.split("|")[0] == query:
            hd_dict = {}
            for rseq in AlignIO.read(aln, "fasta"):
                if rseq.id.split("|")[0] != query:
                    hd = hamming_distance(qseq.seq, rseq.seq)
                    if hd > 30:
                        hd_dict[rseq.id] = hd
            nn = max(hd_dict, key=hd_dict.get)
            hd_dict_paralogs[nn + ";;;" + qseq.id] = hd_dict[nn]
    return hd_dict_paralogs


def flattenkeys(dict1, labels_dict, taxlevel):
    if taxlevel == "order":
        level = 3
    elif taxlevel == "family":
        level = 2
    elif taxlevel == "genus":
        level = 1
    dictlist = [dict1[model] for model in dict1]
    flattened = list({k: v for d in dictlist for k, v in d.items()}.keys())
    return [labels_dict[tax.split("|")[0]][level] for tax in flattened]


def hamming_distance(qseq, rseq):
    """
    Return percent identity between two aligned sequences.
    Counting matches (not mismatches) makes the hd > 30 cutoff a 30% identity
    threshold and lets max() in get_nn pick the nearest neighbor.
    """
    matches = sum(aa1 == aa2 for aa1, aa2 in zip(qseq, rseq))
    return matches / len(qseq) * 100

@click.command()
@click.option("--query", "-q", required=True)
@click.option("--aln_outpath", "-a", required=True)
@click.option("--outname", "-o", required=True)
@click.option("--ncldv_labels", "-l", required=True)
def main(query, aln_outpath, outname, ncldv_labels):
    hd_dict_all = {}
    hd_dict_all_best = {}
    for aln in glob.glob(aln_outpath + "*.mafft01"):
        modelname = aln.split("/")[-1].split(".")[0]
        try:
            hd_dict_all[modelname] = get_nn(aln, query)
            hd_dict_all_best[modelname] = {max(hd_dict_all[modelname], key=hd_dict_all[modelname].get) : hd_dict_all[modelname][max(hd_dict_all[modelname], key=hd_dict_all[modelname].get)]}
        except:
            pass

    labels_dict = {}
    with open(ncldv_labels, "r") as infile:
        for line in infile:
            taxid, lineage = line.strip().split("\t")
            labels_dict[taxid] = lineage.split("|")

    final_aln_dict = Counter(flattenkeys(hd_dict_all_best, labels_dict, "order"))
    aln_result = "|".join([f"{x}:{y}" for x, y in final_aln_dict.items()])

    with open(outname, "w") as outfile:
        if len(aln_result) > 0:
            outfile.write(f"{query}\t{aln_result}\taln\n")
        else:
            outfile.write(f"{query}\tno_hits\taln\n")
        for model, value in hd_dict_all.items():
            value = dict(sorted(value.items(), key=lambda item: item[1], reverse=True))
            outfile.write("\n".join([f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k, v in value.items()]) + "\n")

if __name__ == "__main__":
    main()
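
# Likely workflow/scripts/08get_nn_tree.py: nearest neighbor per marker based on tree distances (ete3).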
import glob
import click
from ete3 import Tree
from collections import defaultdict
from collections import Counter
import pandas as pd


def flattenkeys_to_list(dict1, labels_dict, taxlevel):
    if taxlevel=="order":
        level = 3
    elif taxlevel=="family":
        level = 2
    elif taxlevel=="genus":
        level = 1
    dictlist = [  dict1[model] for model in dict1.keys() ]
    flattened = list({k: v for d in dictlist for k, v in d.items()}.keys())
    return [labels_dict[tax.split("|")[0]][level] for tax in flattened]


def unpack_family(node, leaves, query, seen):
    """
    get all leaves under a parental node
    """
    children = node.get_children()
    for child in children:
        childname = str(child).replace("--","").replace("/-", "").replace("\-", "").replace("\n","")
        if child.is_leaf() and childname.split("|")[0] != query:
                seen.append(childname)
                leaves.append(childname)
        elif not child.is_leaf() and childname != query and childname not in seen:
            seen.append(childname)
            unpack_family(child, leaves, query, seen) 
        elif child.is_leaf() and childname == query:
            seen.append(childname)
        else:
            pass
    if len(leaves)==0:
        parentnode = node.up
        unpack_family(parentnode, leaves, query, seen) 
    return leaves


@click.command()
@click.option('-q', '--query', required=True, help='Query to search in the tree.')
@click.option('-t', '--tree_outpath', required=True, help='Path to the tree directory.')
@click.option('-o', '--outname', required=True, help='Output file name.')
@click.option('-l', '--ncldv_labels', required=True, help='Path to the NCBI taxonomy labels file.')
def main(query, tree_outpath, outname, ncldv_labels):

    def get_closestrelative(query, leaves, node, tree):
        distance = 10
        closestrelative = "nd"
        for child in leaves:
            if child != query:
                newdist = t.get_distance(child, query, topology_only = False)
                if newdist < distance:
                    distance = newdist
                    closestrelative = child
                else:
                    pass
        return closestrelative, distance


    def get_neighbor(t, modelname, query):
        tree_dict_paralogs = {} # include nearest neighbor for all paralogs
        for node in t.traverse():
            if node.name.split("|")[0] == query:
                queryintree = node.name
                print (queryintree)
                # move up one node in the tree
                # collect all terminal leaves under the parent node
                leaves = []
                seen = []
                leaves = unpack_family(node, leaves, query, seen)
                print (leaves)
                closestrelative, distance = get_closestrelative(queryintree, leaves, node, t)
                print (closestrelative)
                tree_dict_paralogs[closestrelative + ";;;" + queryintree] = distance
        #closestrelativep = min(tree_dict_paralogs, key=tree_dict_paralogs.get) # get only ref that is most similar from all paralogs = shorted distance
        #return {closestrelativep : tree_dict_paralogs[closestrelativep]} # return only ref that is most similar
        return tree_dict_paralogs


    labels_dict = {}
    with open(ncldv_labels, "r") as infile:
        for line in infile:
            taxid = line.split("\t")[0]
            lineage = line.split("\t")[1].replace("\n","").split("|")
            labels_dict[taxid] = lineage

    # all distances including paralogs
    tree_dict_all = {}
    # dict with only single paralog that had shortest distance to reference
    tree_dict_all_best = {}
    for ftree in glob.glob(tree_outpath + "/*.FTWAG"):
        try:
            t = Tree(ftree)
            modelname = ftree.split("/")[-1].split(".")[0]
            print (f"{modelname}")
            tree_dict_all[modelname] = get_neighbor(t, modelname, query)
            print (f"{tree_dict_all}")
            tree_dict_all_best[modelname] = {min(tree_dict_all[modelname], key=tree_dict_all[modelname].get) : tree_dict_all[modelname][min(tree_dict_all[modelname], key=tree_dict_all[modelname].get)]}
        except:
            print (f"Something wrong with {ftree}")
            pass

    # this is the summary result, show number of best hits per tax string
    # cutoff for order: tree dist 2, aln 30 idperc --> needs further evaluation

    print (f"{tree_dict_all_best}@@@@@@@@{query}")
    final_tree_dict = Counter(flattenkeys_to_list(tree_dict_all_best, labels_dict, "order"))
    tree_result = "|".join([":".join([str(x) for x in pair]) for pair in list(final_tree_dict.items())])

    with open(outname, "w") as outfile:
        if len(tree_result) > 0:
            outfile.write(f"{query}\t{tree_result}\ttree\n")
        else:
            outfile.write(f"{query}\tno_hits\ttree\n")
        for model, value in tree_dict_all.items():
            # print lowest distance first
            value = dict(sorted(value.items(), key=lambda item: item[1], reverse=False))
            #outfile.write("\n".join(f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}" for k,v in value.items()) + "\n")
            outfile.write("\n".join([
                f"{model}\t{k.split(';;;')[1]}\t{k.split(';;;')[0]}\t{'|'.join(labels_dict[k.split('|')[0]])}\t{v}"
                for k, v in value.items()]) + "\n")


if __name__ == "__main__":
    main()
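
# Likely workflow/scripts/09summarize.py: combine tree-based nearest neighbors, marker counts,
# query stats, and the classifier output into <query name>.summary.tab.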
import click
import pandas as pd
import glob
from collections import Counter


def parse_result(tabout):
    results = []
    with open(tabout) as infile:
        for line in infile:
            line = line.strip().split()
            if line[0].startswith("GVOG"):
                if line[0] == "GVOGm0461" and float(line[-1]) < 1.8:
                    results.append(line)
                elif line[0] != "GVOGm0461" and float(line[-1]) < 2.2:
                    results.append(line)
                else:
                    print(f"{line[0]} tree distance to neighbor above threshold, hit removed")
    return results

def process_gvog9(gvog9_count):
    try:
        gvog9_models = ["GVOGm0003","GVOGm0013","GVOGm0022","GVOGm0023","GVOGm0054","GVOGm0172","GVOGm0461","GVOGm0760","GVOGm0890"]
        gvog7_models = ["GVOGm0013","GVOGm0023","GVOGm0054","GVOGm0172","GVOGm0461","GVOGm0760","GVOGm0890"]
        gvogsout_df = pd.read_csv(gvog9_count, sep="\t", index_col=0)
        gvogsout_df['GVOG7u'] = (gvogsout_df[gvog7_models] > 0).sum(axis=1)
        gvogsout_df['GVOG7t'] = (gvogsout_df[gvog7_models]).sum(axis=1)
        gvogsout_df['GVOG9u'] = (gvogsout_df[gvog9_models] > 0).sum(axis=1)
        gvogsout_df['GVOG9t'] = (gvogsout_df[gvog9_models]).sum(axis=1)
        gvogsout_df['GVOG7df'] = gvogsout_df['GVOG7t'] / gvogsout_df['GVOG7u'] if gvogsout_df['GVOG7t'].any() else 0
        return list(gvogsout_df.GVOG9u)[0], list(gvogsout_df.GVOG9t)[0], list(gvogsout_df.GVOG7u)[0], list(gvogsout_df.GVOG7t)[0], list(gvogsout_df.GVOG7df)[0], list(gvogsout_df.GVOGm0003)[0]
    except:
        return 0,0,0,0,0,0


def process_uni56(uni56_count):
    try:
        UNI56out_df = pd.read_csv(uni56_count, sep="\t", index_col=0)
        UNI56out_models = list(UNI56out_df.columns)
        UNI56out_df['UNI56u'] = (UNI56out_df[UNI56out_models] > 0).sum(axis=1)
        UNI56out_df['UNI56t'] = (UNI56out_df[UNI56out_models]).sum(axis=1)
        UNI56out_df['UNI56df'] = UNI56out_df['UNI56t'] / UNI56out_df['UNI56u'] if UNI56out_df['UNI56t'].any() else 0
        return list(UNI56out_df.UNI56u)[0], list(UNI56out_df.UNI56t)[0], list(UNI56out_df.UNI56df)[0]
    except:
        return 0, 0, 0


def most_frequent(taxstrings, taxlevel):
    freq = Counter(taxstrings)
    if len(freq) == 1:
        if len(taxstrings) >= 2:
            return taxstrings[0]
        else:
            return "_"
    most_common = freq.most_common(1)[0]
    if most_common[1] > len(taxstrings) // 2:
        return most_common[0]
    elif taxlevel == "domain":
        return "-".join(sorted(list(set(taxstrings))))
    else:
        return "_"

def tax_annotation(row, level_other, level_ncldv):
    if row["subject"].split("__")[0] in ["EUK", "ARC", "BAC", "PHAGE"]:
        return  row["subject"].split("__")[0] + "__" + row["taxannot"].split("|")[level_other]
    else:
        return row["taxannot"].split("|")[level_ncldv]

def tax_domains(row):
    if row["subject"].split("__")[0] in ["EUK", "ARC", "BAC", "PHAGE"]:
        return  row["subject"].split("__")[0]
    elif row["taxannot"].split("|")[5] == "Nucleocytoviricota":
        return "NCLDV"
    else:
        return "CONFLICT"

def get_final_tax(df, query, stringency_s):
    df_q = df[df.queryname == query]
    tax_levels = ["species", "genus", "family", "order", "class", "phylum", "domain"]
    tax_lists = [list(df_q[tax]) for tax in tax_levels]
    final_tax = [query]
    if stringency_s == "majority":
        for level, lst in zip(tax_levels, tax_lists):
            final_tax.append(f"{level[0]}_{most_frequent(lst, level)}")
    else:
        stringency = int(stringency_s.replace("gte", ""))
        for level, lst in zip(tax_levels, tax_lists):
            if len(set(lst)) == 1:
                final_tax.append(f"{level[0]}_{lst[0]}")
            elif len(lst) >= stringency:
                final_tax.append(f"{level[0]}__")
            else:
                final_tax.extend(["missing_markers"] * (7 - len(final_tax)))
                break
        else:
            final_tax.extend(["missing_markers"] * (7 - len(final_tax)))
        if len(set(tax_lists[-1])) == 1:
            final_tax[-1] = f"{tax_levels[-1][0]}_{tax_lists[-1][0]}"
        else:
            for lst, domain in zip(tax_lists[:-1], df_q["domain"]):
                if domain == "EUK" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-EUK"
                    break
                elif domain == "BAC" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-BAC"
                    break
                elif domain == "ARC" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-ARC"
                    break
                elif domain == "ARC" and "BAC" in lst:
                    final_tax[-1] = "d_ARC-BAC"
                    break
                elif domain == "ARC" and "EUK" in lst:
                    final_tax[-1] = "d_ARC-EUK"
                    break
                elif domain == "BAC" and "EUK" in lst:
                    final_tax[-1] = "d_BAC-EUK"
                    break
                elif domain == "PHAGE" and "NCLDV" in lst:
                    final_tax[-1] = "d_NCLDV-PHAGE"
                    break
            else:
                final_tax[-1] = "d__"
    final_tax.append(stringency_s)
    final_tax.append(df_q["distance"].mean())
    return final_tax


def summarize(df):
    try:
        # Split the query column and extract the queryname
        df["queryname"] = df["query"].str.split("|", expand=True)[0]
        df["species"] =  df.apply(lambda row: tax_annotation(row, 6, -1), axis=1)
        df["genus"] =  df.apply(lambda row: tax_annotation(row, 5, 1), axis=1)
        df["family"] =  df.apply(lambda row: tax_annotation(row, 4, 2), axis=1)
        df["order"] =  df.apply(lambda row: tax_annotation(row, 3, 3), axis=1)
        df["class"] =  df.apply(lambda row: tax_annotation(row, 2, 4), axis=1)
        df["phylum"] =  df.apply(lambda row: tax_annotation(row, 1, 5), axis=1)
        df["domain"] =  df.apply(tax_domains, axis=1)
        df = df[["queryname","species", "genus","family","order", "class", "phylum", "domain", "GVOG", "distance"]]
        # Get the top hit
        df_tophit = df.drop_duplicates(subset=['GVOG'])
        # Get results for different stringencies
        stringencies = ["gte1", "gte2", "gte3", "majority"]
        allresults = [get_final_tax(df_tophit, list(df.queryname)[0], stringency) for stringency in stringencies]
        # Create results dataframe
        df_results = pd.DataFrame(allresults, columns=["query", "species", "genus", "family", "order", "class", "phylum", "domain", "stringency", "avgdist"])
        return df_results
    except:
        pass


@click.command()
@click.option('-n', '--nn_tree', required=True, help='Path to the nearest neighbor tree file.')
@click.option('-g', '--gvog9_count', required=True, help='Path to the GVOG9 count file.')
@click.option('-u', '--uni56_count', required=True, help='Path to the UNI56 count file.')
@click.option('-q', '--querystats', required=True, help='Path to the query stats file.')
@click.option('-c', '--xgb_out', required=True, help='Path to the classifier output file.')
@click.option('-s', '--summary_out', required=True, help='Path to the output summary file.')
def main(nn_tree, gvog9_count, uni56_count, querystats, xgb_out, summary_out):
    try:
        query = summary_out.split("/")[-1].split(".")[0]
        treehits = []
        treehits.extend(parse_result(nn_tree))
        querystats_df = pd.read_csv(querystats, sep="\t")
        with open(xgb_out) as f:
            prediction = [line.split('\t')[1].strip() for line in f if line.strip()]
        if len(treehits) > 0:
            df_tree = pd.DataFrame(treehits, columns=["GVOG", "query", "subject", "taxannot", "distance"])
            df_tree["distance"] = df_tree["distance"].astype(float)
            df_results_tree = summarize(df_tree)
            GVOG9u, GVOG9t, GVOG7u, GVOG7t, GVOG7df, MCP = process_gvog9(gvog9_count)
            UNI56u, UNI56t, UNI56df = process_uni56(uni56_count)
            df_results_tree["GVOG9u"] = GVOG9u
            df_results_tree["GVOG9t"] = GVOG9t
            df_results_tree["GVOG7u"] = GVOG7u
            df_results_tree["GVOG7t"] = GVOG7t
            df_results_tree["GVOG7df"] = GVOG7df
            df_results_tree["MCP"] = MCP
            df_results_tree["UNI56u"] = UNI56u
            df_results_tree["UNI56t"] = UNI56t
            df_results_tree["UNI56df"] = UNI56df
            if len(prediction)>0:
                df_results_tree["xgb"] = prediction[0]
            else:
                df_results_tree["xgb"] = "no_fna"
            df_results_tree = pd.merge(df_results_tree, querystats_df, on='query')
            df_results_tree.to_csv(summary_out, sep="\t", index=False)
        elif len(prediction)>0:
            allresults = [[query, "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", prediction[0], "no_hits", "missing_markers", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"]]
            cols = ["query", "species", "genus", "family", "order", "class", "phylum", "domain", "xgb", "stringency", "avgdist", "GVOG9u", "GVOG9t", "GVOG7u", "GVOG7t", "GVOG7df", "MCP", "UNI56u", "UNI56t", "UNI56df"]
            df_results_tree = pd.DataFrame(allresults, columns=cols)
            df_results_tree = pd.merge(df_results_tree, querystats_df, on='query')
            df_results_tree.to_csv(summary_out, sep="\t", index=False)
        else:
            allresults = [[query, "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "missing_markers", "no_fna", "no_hits", "missing_markers", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"]]
            cols = ["query", "species", "genus", "family", "order", "class", "phylum", "domain", "xgb", "stringency", "avgdist", "GVOG9u", "GVOG9t", "GVOG7u", "GVOG7t", "GVOG7df", "MCP", "UNI56u", "UNI56t", "UNI56df"]
            df_results_tree = pd.DataFrame(allresults, columns=cols)
            df_results_tree = pd.merge(df_results_tree, querystats_df, on='query')
            df_results_tree.to_csv(summary_out, sep="\t", index=False)
    except:
        print(f"Empty output for {summary_out.split('/')[-1].split('.')[0]}")
        with open(summary_out, "w") as outfile:
            pass

if __name__ == '__main__':
    main()
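
# Likely workflow/scripts/10combinedout.py: combine per-query summaries, retaining NCLDV assignments
# at the highest available stringency.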
import click
import os.path
import glob
import pandas as pd


@click.command()
@click.option('-r', '--resultsdirparent', type=click.Path(exists=True), help='Parent directory of the summary files.')
@click.option('-o', '--combinedout', type=click.Path(), help='Output file name.')
def main(resultsdirparent, combinedout):
    """
    Retain only annotation for d_NCLDV at highest available stringency.
    """
    results = []
    for sumout in glob.glob(resultsdirparent + "*/*.summary.tab"):
        try:
            df = pd.read_csv(sumout, index_col=None, header=0, sep="\t")
            results.append(df)
        except:
            print (sumout + " is empty")
    df_results_f = pd.concat(results, axis=0, ignore_index=True)
    df_results_f = df_results_f.fillna(0)
    # also output d_NCLDV-EUK, d_NCLDV-BAC, d_NCLDV-ARC, d_NCLDV-PHAGE
    df_results_f_gt3 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte3')]
    df_results_f_gt2 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte2')]
    df_results_f_gt1 = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'gte1')]
    df_results_f_majority = df_results_f[(df_results_f['domain'].str.startswith('d_NCLDV')) & (df_results_f['stringency'] == 'majority')]
    df_results_f_xgb = df_results_f[(df_results_f['xgb'].str.endswith('NCLDV'))]
    df_results_f_combined = pd.concat([df_results_f_gt3, df_results_f_gt2, df_results_f_gt1, df_results_f_majority, df_results_f_xgb])
    df_results_f_combined = df_results_f_combined.drop_duplicates(subset=['query'])
    df_results_f_combined.to_csv(combinedout, sep='\t', index=None)


if __name__ == '__main__':
    main()
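
# Likely workflow/scripts/11cleanup.py: compress the results directory to tar.gz and remove it.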
import click
import tarfile
import os.path
import shutil

@click.command()
@click.option('-r', '--resultsdir', type=click.Path(exists=True), help='Directory to be compressed.')
@click.option('-o', '--tarout', type=click.Path(), help='Output tarfile name.')
def main(resultsdir, tarout):
    """
    Compress a directory to a tar.gz file and delete the original directory.
    """
    with tarfile.open(tarout, "w:gz") as tar:
        tar.add(resultsdir, arcname=os.path.basename(resultsdir))
        try:
            shutil.rmtree(resultsdir)
        except OSError as e:
            print(f"Error: {resultsdir} : {e.strerror}")

if __name__ == '__main__':
    main()
shell:
  """
  (python workflow/scripts/00reformat.py -i {input} -o {output[0]} -s {output[1]}) &> {log}
  """
shell:
  """
  python workflow/scripts/01genecalling_single.py -f {input} -g {output.gffout} -gs {output.genecalling_statsout} -ss {output.best_statsout} -fa {output.faafinalout} &> {log}
  cp {input} {output.fnafinalout}
  """
  shell:
    """
    if [ -f {params.infilebase}.fna ]; then
    python workflow/scripts/01predict.py -q {params.infilebase}.fna -m {input.xgbD} -k {input.featuresD} -l {input.mappingsD} -o {output} &> {log};
    else touch {output}; fi
    """


"""
Step 1 
Identify markers, extract, align
"""
shell:
  """
  (python workflow/scripts/02hmmsearch.py -q {input.queryfaa} -m {input.models} -h {output.GVOG9out} -t GVOG9 -c {output.hitcounts}) &> {log}
  """
shell:
  """
  (python workflow/scripts/02hmmsearch.py -q {input.queryfaa} -m {input.models} -h {output.UNI56out} -t UNI56 -c {output.hitcounts}) &> {log}
  """
shell:
  """
  python workflow/scripts/03extract_qhits.py -h {input.hmmout} -q {input.queryfaa} -o {output.queryhitsfaa}
  touch {output.queryhitsfaa}
  """
shell:
  """
  (python workflow/scripts/04blastp_reduce_merge.py -q {input.queryhitsfaa} -r {input.reffaa} -d {input.refdb} -b {output.blastpout} -o {output.mergedfaa}) >& {log}
  touch {output.blastpout} {output.mergedfaa}
  """
  shell:
    """
    (python workflow/scripts/05align_trim.py -q {input.mergedfaa} -a {output.aln} -t {output.trimmedaln}) >& {log}
    touch {output.aln} {output.trimmedaln}
    """


"""
Step 2 - Tree based
Build protein trees, get nearest neighbor in trees
"""
shell:
  """
  (python workflow/scripts/06build_tree.py -a {input.trimmedaln} -t {output.tree}) >& {log}
  touch {output.tree}
  """
shell: 
  """
  (python workflow/scripts/07get_nn_aln.py -q {params.queryname} -a {params.fdir}/queryrefs_aligned/ -o {output.aln_out} -l {labels}
  python workflow/scripts/08get_nn_tree.py -q {params.queryname} -t {params.fdir}/queryrefs_fasttree/ -o {output.tree_out} -l {labels}) &> {log}
  """
shell:
  """
  (python workflow/scripts/09summarize.py -n {input.nn_tree} -g {input.gvog9_count} -u {input.uni56_count} -q {input.querystats} -s {output} -c {input.xgbout}
  touch {output}) &> {log}
  """
shell:
  """
  python workflow/scripts/10combinedout.py -r {params.fdirp} -o {output.combined}
  touch {output.combined}
  """
shell:
  """
  #find {params.fdir} -type f -empty -print -delete
  python workflow/scripts/11cleanup.py -r {params.fdir} -o {output}
  """