# Podacris Genome Annotation (PGA)
TODO: Link manuscript when published
Genomic signatures of selection underpinning extremely rapid adaptation in island lizards
Authors: Maria Novosolov, Anthony Herrel, Anamaria Stambuk, Jesper Stenderup, James S. Santangelo, Iva Sabolić, Kristian E. Hanghøj, Thorfinn S. Korneliussen, Debora Y. C. Brandt, Rasmus Heller, Rasmus Nielsen, Morten E. Allentoft
## Description
This repository contains the code used for annotating the scaffold-level assembly of the Italian Wall Lizard (*Podacris siculus*). The pipeline takes the genome FASTA file as input and outputs the final functional annotation in GFF3 format.
In addition to the genome FASTA, the following dependencies/resources need to be obtained prior to running the pipeline:

- The pipeline uses GeneMark, which is licensed software. The license key can be obtained from here. Include the `.gm_key` in the `workflow/` directory containing the main Snakefile used for running the pipeline (see the placement sketch after this list).
- A copy of the RepBase repeat library (I used `RepBaseRepeatMaskerEdition-20181026.tar.gz`) from here. A license is required to use this database, so it could not be included. Once obtained, it needs to be included in the `resources/` directory.
- A copy of the InterProScan data needs to be obtained and included in the `resources/` directory. The following commands can be run to download and set up the database:

  ```bash
  wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/alt/interproscan-data-5.61-93.0.tar.gz
  tar -pxvzf interproscan-data-5.61-93.0.tar.gz
  chmod -R 777 interproscan-data-5.61-93.0
  ```
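For concreteness, a minimal sketch of where the two licensed files end up, assuming they were downloaded to `~/Downloads` (hypothetical paths; adjust to your system):

```bash
# Hypothetical download locations; adjust to your system
cp ~/Downloads/gm_key workflow/.gm_key                                # GeneMark key next to the main Snakefile
cp ~/Downloads/RepBaseRepeatMaskerEdition-20181026.tar.gz resources/  # RepBase tarball, left compressed
```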
## Using the pipeline
This pipeline requires Conda and Singularity:

- A minimal installation of Conda (i.e., Miniconda) can be installed by following the instructions for your platform here; a minimal install sketch follows this list.
- Installing Singularity requires admin privileges, but using Singularity to run pre-created containers does not. Installation instructions can be found here.
- All Singularity containers used in this pipeline are available in this public repository, though they will be automatically pulled and executed by the pipeline.
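As a convenience, here is a minimal sketch of a Miniconda installation, assuming Linux on x86_64 (other platforms differ; follow the official instructions linked above):

```bash
# Assumes Linux x86_64; see the official Miniconda instructions for other platforms
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
source "$HOME/miniconda3/bin/activate"
```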
Assuming Conda is installed, this repository's Conda environment can be replicated by running the following command:

```bash
conda env create -f environment.yaml -n prg
```

This will create a Conda environment named `prg` containing a minimal set of dependencies required to run the pipeline. After activating the environment (`conda activate prg`), the pipeline can be executed from the `workflow/` directory by running a command that looks something like:
```bash
snakemake --use-conda --use-singularity --singularity-args "--bind <path> --bind <path/to/interproscan/data>:/opt/interproscan-5.61-93.0/data" --configfile ../config/<configfile> --notemp -j <cores>
```
for local execution. Here, `<path>` is the path on the cluster from which files will be read/written (e.g., `/scratch`), `<path/to/interproscan/data>` is the full path to the InterProScan data in `resources/`, `<configfile>` is one of the config files in the `config/` directory that needs to be modified to match the paths on your system, and `<cores>` is the number of cores available for executing parallel processes.
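As an illustration, here is a filled-in version of the command above with hypothetical values (a clone at `/scratch/PGA`, a config file named `config.yaml`, and 32 cores; substitute your own paths):

```bash
# Hypothetical paths and core count; adjust to your system
snakemake --use-conda --use-singularity \
    --singularity-args "--bind /scratch --bind /scratch/PGA/resources/interproscan-5.61-93.0/data:/opt/interproscan-5.61-93.0/data" \
    --configfile ../config/config.yaml \
    --notemp -j 32
```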
## Code Snippets
```
# Unpack the RepBase tarball and add it to RepeatMasker's library directory
shell:
    """
    ( tar -xzf {input} -C {params.untar_dir} &&
        cp -r /opt/RepeatMasker/Libraries/* {output.libdir} &&
        ln -sf {output.dfam} {output.rm} &&
        addRepBase.pl --libdir {output.libdir} ) &> {log}
    """
```
```
# Build a RepeatModeler database from the genome FASTA
shell:
    """
    BuildDatabase -name {params.out} {input} &> {log}
    """
```
```
# Model repeats de novo with RepeatModeler, including LTR structural analysis
shell:
    """
    RepeatModeler -database {params.db_base} \
        -threads {threads} \
        -LTRStruct &> {log}
    """
```
```
# Extract tetrapod repeats from the RepeatMasker database and combine them with the de novo library
shell:
    """
    ( famdb.py -i {input.rm_db} families \
        --format fasta_name \
        --ancestors --descendants --include-class-in-name \
        'tetrapoda' > {params.rm_db_fasta} &&
        cat {params.rm_db_fasta} {input.tr_db} > {output} ) 2> {log}
    """
```
```
# Softmask the genome with RepeatMasker using the combined repeat library
shell:
    """
    ( RepeatMasker -e ncbi \
        -pa {threads} \
        -xsmall \
        -gc {params.gc} \
        -lib {input.lib} \
        -dir {params.outdir} \
        -gff {input.fasta} &&
        mv {params.outdir}/*.fasta.masked {output.fasta}
        mv {params.outdir}/*.cat.gz {output.cat}
        mv {params.outdir}/*.out {output.out}
        mv {params.outdir}/*.gff {output.gff}
        mv {params.outdir}/*.tbl {output.stats} ) &> {log}
    """
```
```
# Prefetch an SRA accession (RNA-seq evidence)
shell:
    """
    prefetch {wildcards.acc} -O {params.outdir} 2> {log}
    """
```
```
# Convert prefetched SRA accessions to paired FASTQ files
shell:
    """
    cd {params.outdir}
    fasterq-dump --split-3 \
        -e {threads} \
        --skip-technical \
        {wildcards.acc} 2> {log}
    """
```
```
shell:
    """
    gzip {input} 2> {log}
    """
```
```
# Build a STAR index from the softmasked genome
shell:
    """
    mkdir {output}
    STAR --runMode genomeGenerate \
        --genomeDir {output} \
        --genomeFastaFiles {input.masked_genome} \
        --runThreadN {threads} &> {log}
    """
```
```
# Align RNA-seq reads to the masked genome with STAR in two-pass mode
shell:
    """
    STAR --readFilesIn {input.R1} {input.R2} \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --twopassMode Basic \
        --genomeDir {input.star_build} \
        --runThreadN {threads} \
        --readFilesCommand zcat \
        --alignIntronMax 10000 \
        --outFileNamePrefix {params.out} &> {log}
    """
```
```
# Merge the per-accession BAMs and index the result
shell:
    """
    ( samtools merge -@ {threads} -r -o {output.bam} {input} && \
        samtools index {output.bam} ) 2> {log}
    """
```
```
# Download the partitioned OrthoDB vertebrate protein set
shell:
    """
    ( wget --no-check-certificate https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Vertebrata.fa.gz -P {params.outdir} &&
        gunzip {params.outdir}/Vertebrata.fa.gz ) 2> {log}
    """
```
```
# Download a reference proteome from NCBI Datasets and extract the protein FASTA
shell:
    """
    curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/{params.ref_seq}/download?include_annotation_type=PROT_FASTA&filename={wildcards.psp}.zip" -H "Accept: application/zip" &&
    unzip {wildcards.psp}.zip -d {params.outdir} &&
    cp {params.outdir}/ncbi_dataset/data/{params.ref_seq}/protein.faa {output}
    rm -rf {params.outdir}/ncbi_dataset
    """
```
```
shell:
    """
    curl --output {output} '{params.url}' 2> {log}
    """
```
```
shell:
    """
    cat {input} > {output}
    """
```
```
# Run BRAKER with protein evidence
shell:
    """
    braker.pl --genome {input.masked_genome} \
        --prot_seq {input.proteins} \
        --softmasking \
        --useexisting \
        --threads {threads} \
        --workingdir {params.outputdir} \
        --species "Podacris siculus prot" 2> {log}
    """
```
```
# Run BRAKER with RNA-seq evidence
shell:
    """
    braker.pl --genome {input.masked_genome} \
        --bam {input.bam} \
        --softmasking \
        --useexisting \
        --threads {threads} \
        --workingdir {params.outputdir} \
        --species "Podacris siculus rna" 2> {log}
    """
```
```
# Combine the protein- and RNA-based BRAKER predictions with TSEBRA
shell:
    """
    tsebra.py -g {input.prot_aug} \
        -k {input.rna_aug} \
        -c {params.config} \
        -e {input.hints_rna},{input.hints_prot} \
        -o {output} 2> {log}
    """
```
```
# Rename genes and transcripts with a consistent Psic prefix
shell:
    """
    rename_gtf.py --gtf {input} \
        --prefix Psic \
        --translation_tab {output.tab} \
        --out {output.gtf} 2> {log}
    """
```
```
# Filter exon, intron, gene, and transcript records out of the GTF
run:
    features = ['exon', 'intron', 'gene', 'transcript']
    with open(input[0], 'r') as fin:
        with open(output[0], 'w') as fout:
            lines = fin.readlines()
            for line in lines:
                sline = line.split('\t')
                if sline[2] in features:
                    pass
                else:
                    fout.write(line)
```
```
# Standardize the GTF with AGAT
shell:
    """
    agat_convert_sp_gxf2gxf.pl --gtf {input} \
        --output {output} &> {log}
    """
```
```
# Replace literal 'SEQ' tokens with 0
shell:
    """
    sed 's/SEQ/0/g' {input} > {output}
    """
```
```
# Drop all records for gene Psic_g38722
shell:
    """
    grep -v 'Psic_g38722' {input} > {output}
    """
```
```
# Sort and tidy the GFF3 with GenomeTools, retaining IDs
shell:
    """
    gt gff3 -sort -tidy -retainids {input} > {output} 2> {log}
    """
```
```
# Clean up attributes: strip transcript_id from gene features and fix
# transcript_id on alternative isoforms of mRNA features
run:
    import re
    with open(input[0], 'r') as fin:
        with open(output[0], 'w') as fout:
            lines = fin.readlines()
            for line in lines:
                if not line.startswith('#'):
                    sline = line.split('\t')
                    feature = sline[2]
                    if feature == 'gene':
                        # Remove transcript_id attribute from gene features
                        sline[8] = re.sub(r'(;transcript_id.*$)', '', sline[8])
                    elif feature == 'mRNA':
                        # Make sure transcript_id annotations for isoforms are correct (i.e., .t2, .t3, etc.)
                        # Use the ID attribute since transcript_id attributes are incorrectly incremented and will be replaced
                        id_pattern = r"(?<=ID=)(.*)(?=;Parent)"
                        ID = re.search(id_pattern, sline[8]).group(1)
                        if ID.endswith('.t1'):
                            # First isoforms are fine
                            pass
                        else:
                            # Alternative isoforms need transcript_id replaced with the ID
                            sline[8] = re.sub(r'(?<=;transcript_id=)(.*$)', ID, sline[8])
                    fout.write('\t'.join(sline))
                else:
                    fout.write(line)
```
```
# Extract protein sequences from the annotation
shell:
    """
    gffread -E -y {output} -g {input.ref} {input.gff} 2> {log}
    """
```
```
# Run InterProScan on the predicted proteins, including GO terms and pathways
shell:
    """
    interproscan.sh -i {input.prot} \
        -b {params.out_base} \
        -f xml,gff3 \
        -goterms \
        --pathways \
        --seqtype p \
        --cpu {threads} \
        --verbose &> {log}
    """
```
```
# Set up the funannotate database for the tetrapoda BUSCO lineage
shell:
    """
    funannotate setup --database {output} -b tetrapoda --force 2> {log}
    """
```
```
# Download the eggNOG-mapper data
shell:
    """
    mkdir -p {output}
    download_eggnog_data.py -y --data_dir {output}
    """
```
```
# Run eggNOG-mapper on the predicted proteins
shell:
    """
    emapper.py -i {input.prot} \
        --data_dir {input.db} \
        --cpu {threads} \
        --itype proteins \
        -o {params.out} \
        --override &> {log}
    """
```
```
# Functionally annotate the genome with funannotate, using the InterProScan and eggNOG results
shell:
    """
    mkdir -p {params.outdir}
    funannotate annotate \
        --gff {input.gff} \
        --fasta {input.ref} \
        --species "Podacris siculus" \
        --out {params.outdir} \
        --iprscan {input.iprs} \
        --eggnog {input.enm} \
        --force \
        --database {input.db} \
        --busco_db tetrapoda \
        --cpus {threads} 2> {log}
    """
```
```
# Fetch a remote resource (params.url) into the output directory
shell:
    """
    wget {params.url} --no-check-certificate -P {params.outdir}
    """
```
```
# Fix EC numbers, increment CDS IDs, and add locus tags
# (script reproduced at the end of this section)
script:
    "../scripts/python/fixEC_incrementCDS_addLocusTags.py"
```
```
# Sort the final GFF3
shell:
    """
    gff3_sort -g {input} -r -og {output} 2> {log}
    """
```
```
# Extract protein sequences from the final annotation
shell:
    """
    gffread -E -y {output} -g {input.ref} {input.gff} 2> {log}
    """
```
```
shell:
    """
    touch {output}
    """
```
```
# Assess assembly quality with QUAST
shell:
    """
    quast.py -o {output} \
        --threads {threads} \
        --split-scaffolds \
        --large \
        --k-mer-stats \
        {input.fasta} &> {log}
    """
```
```
# Assess genome completeness with BUSCO (tetrapoda_odb10 lineage)
shell:
    """
    busco -m genome \
        -i {input.fasta} \
        -o {params.out_name} \
        --out_path {params.out_path} \
        --lineage tetrapoda_odb10 \
        --force \
        --cpu {threads} &> {log}
    """
```
```
# Summarize functional annotation statistics with AGAT
shell:
    """
    agat_sp_functional_statistics.pl --gff {input.gff} \
        -gs {input.fasta} \
        --output {output} 2> {log}
    """
```
```
# Assess annotation completeness with BUSCO in protein mode
shell:
    """
    busco -m protein \
        -i {input} \
        -o {params.out_name} \
        --out_path {params.out_path} \
        --lineage tetrapoda_odb10 \
        --force \
        --cpu {threads} &> {log}
    """
```
```
shell:
    """
    touch {output}
    """
```
```
# Symlink the genome FASTA and index it with samtools
shell:
    """
    ln -sf {input} {output.fasta} &&
    samtools faidx {output.fasta}
    """
```
```
# Keep only scaffolds longer than 100 kb
shell:
    """
    cat {input.fai} | awk '$2 > 100000 {{print $1}}' > {output.scaffs} &&
    samtools faidx {input.fasta} -r {output.scaffs} > {output.fasta}
    """
```
```python
import re
import gffutils

# Create dictionary mapping EC numbers to products
EC_num_to_products = {}
with open(snakemake.input['ec'][0], 'r') as fin:
    lines = fin.readlines()
    for l in lines:
        if l.startswith('ID'):
            EC_number = l.split(' ')[1].strip()
        if l.startswith('DE'):
            prod = l.split(' ')[1].strip()
            if not EC_number in EC_num_to_products.keys():
                EC_num_to_products[EC_number] = prod
            else:
                # Append additional description lines to the existing product
                EC_num_to_products[EC_number] += prod
        else:
            pass

locus_tag_prefix = snakemake.params['locus_tag']

with open(snakemake.output[0], 'w') as fout:
    fout.write("##gff-version 3\n")
    gene_num = 1
    hyp_wEC_count = 0
    reassigned = 0
    for feat in gffutils.DataIterator(snakemake.input['gff']):
        if feat.featuretype == 'gene':
            # Reset CDS counter
            cds_count = 1
            # Add locus tag to genes
            feat.attributes["locus_tag"] = f'{locus_tag_prefix}_' + str(int(gene_num)).zfill(5)
            gene_num += 1
        elif feat.featuretype == 'mRNA':
            if 'product' in feat.attributes.keys() and 'EC_number' in feat.attributes.keys():
                # Rename EC_number
                feat.attributes['ec_number'] = [feat.attributes['EC_number'][0]]
                feat.attributes.pop('EC_number')
                # Reassign the product if the EC number has 4 digits; otherwise
                # delete the EC number and keep 'hypothetical protein'
                EC_number = feat.attributes['ec_number'][0]
                prod = feat.attributes['product'][0]
                if prod == 'hypothetical protein':
                    EC_split = EC_number.split('.')
                    if len(EC_split) < 4:
                        hyp_wEC_count += 1
                        feat.attributes.pop('ec_number')
                    elif len(EC_split) == 4:
                        reassigned += 1
                        new_product = EC_num_to_products[EC_number]
                        # Handle cases where EC numbers have been reassigned
                        if 'Transferred' in new_product:
                            pattern = r'(?<=:\s)(\d\.\d+\.\d+\.\d+(?=[\.|,]))'
                            new_ec = re.search(pattern, new_product).group(1)
                            new_product = EC_num_to_products[new_ec]
                        new_product = new_product.replace('.', '')
                        feat.attributes['product'] = [new_product]
                        # Add note to GFF about origin of product
                        feat.attributes['note'] = ["Funannotate product changed from hypothetical protein based on EC_number"]
                else:
                    EC_split = EC_number.split('.')
                    if len(EC_split) < 4:
                        new_ec = '.'.join(EC_split + ['-'] * (4 - len(EC_split)))
                        feat.attributes['ec_number'] = [new_ec]
        elif feat.featuretype == 'CDS':
            # Increment the CDS suffix to ensure unique IDs
            feat.attributes['ID'][0] = re.sub('cds', f'cds{cds_count}', feat.attributes['ID'][0])
            cds_count += 1
        else:
            pass
        fout.write(str(feat) + '\n')

print(f"{hyp_wEC_count} products were hypothetical and contained EC numbers")
print(f"{reassigned} hypothetical proteins have been reassigned based on their EC numbers")
```
## Support
- Future updates