Podacris Genome Annotation (PGA)

public public 1yr ago 0 bookmarks

Podacris Genome Annotation (PGA)

TODO: Link manuscript when published

Genomic signatures of selection underpinning extremely rapid adaptation in island lizards

Authors: Maria Novosolov, Anthony Herrel, Anamaria Stambuk, Jesper Stenderup, James S. Santangelo, Iva Sabolić, Kristian E. Hanghøj, Thorfinn S. Korneliussen, Debora Y. C. Brandt, Rasmus Heller, Rasmus Nielsen, Morten E. Allentoft

Description

This repository contains the code used for annotation scaffold-level assembly of the Italian Wall Lizard ( Podacris siculus ). The pipeline starts with input of the genome FASTA file. It outputs the final functional annotation in GFF3 format

In addition to the genome FASTA, the following dependencies/resources need to be obtained prior to running the pipeline.

  1. The pipeline uses GeneMark, which is licensed softwared. The license key can be obtained from here .

    1. Include the .gm_key in the workflow/ directory containing the main Snakefile used for running the pipeline
  2. A copy of the RepBase repeat library (I used RepBaseRepeatMaskerEdition-20181026.tar.gz ) from here . A license is required to use this database so it could not be included. Once obtained, it needs to be included in resources/ directory

  3. A copy of the InterProScan data needs to be obtained and included in the resources/ directory. The following commands can be run to download and setup the database:

    wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/alt/interproscan-data-5.61-93.0.tar.gz
    tar -xvzfp interproscan-data-5.61-93.0.tar.gz
    chmod -R 777 interproscan-data-5.61-93.0
    

Using the pipeline

This pipeline requires Conda and Singularity :

A minimal installation of Conda (i.e., Miniconda ) can be installed by following the instructions for your platform here

Installation of Singularity requires Admin privileges, but using Singularity to run pre-created containers does not. Installation instructions can be found here . All Singularity containers used in this pipeline are avalaible in this public repository , though they will be automatically pulled and executed by the pipeline. Assuming Conda is installed, the this repository's Conda environment can be replicated by running the following command:

conda env create -f environment.yaml -n prg

This will create a Conda environment named prg containing a minimal set of dependencies required to run the pipeline.

After activating the environment ( conda activate prg ), the pipeline can be executed from the workflow directory by running a command that looks something like:

snakemake --use-conda --use-singularity --singularity-args "--bind <path> --bind <path/to/interproscan/data>:/opt/interproscan-5.61-93.0/data" --configfile ../config/<configfile> --notemp -j <cores>

for local execution. Here, <path> is the path on the cluster from which files will be read/written (e.g., /scratch ), <path/to/interproscan/data> is the full path to the interproscan data in resources/ , <configfile> is one of the configfiles in the config directory that needs to be modified to match the paths on your system, and <cores> is the number of cores available for executing parallel processes.

Code Snippets

21
22
23
24
25
26
27
shell:
    """
    ( tar -xzf {input} -C {params.untar_dir} &&
     cp -r /opt/RepeatMasker/Libraries/* {output.libdir} &&
     ln -sf {output.dfam} {output.rm} &&
     addRepBase.pl --libdir {output.libdir} ) &> {log}
    """
41
42
43
44
shell:
    """
    BuildDatabase -name {params.out} {input} &> {log}
    """
61
62
63
64
65
66
shell:
    """
    RepeatModeler -database {params.db_base} \
        -threads {threads} \
        -LTRStruct &> {log}
    """
81
82
83
84
85
86
87
88
89
shell:
    """
    ( famdb.py -i {input.rm_db} families \
        --format fasta_name \
        --ancestors --descendants --include-class-in-name \
        'tetrapoda' > {params.rm_db_fasta} &&

    cat {params.rm_db_fasta} {input.tr_db} > {output} ) 2> {log}
    """
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
shell:
    """
    ( RepeatMasker -e ncbi \
        -pa {threads} \
        -xsmall \
        -gc {params.gc} \
        -lib {input.lib} \
        -dir {params.outdir} \
        -gff {input.fasta} &&

        mv {params.outdir}/*.fasta.masked {output.fasta}
        mv {params.outdir}/*.cat.gz {output.cat}
        mv {params.outdir}/*.out {output.out}
        mv {params.outdir}/*.gff {output.gff}
        mv {params.outdir}/*.tbl {output.stats} ) &> {log}
    """
141
142
143
144
shell:
    """
    prefetch {wildcards.acc} -O {params.outdir} 2> {log}
    """
160
161
162
163
164
165
166
167
shell:
    """
    cd {params.outdir}
    fasterq-dump --split-3 \
        -e {threads} \
        --skip-technical \
        {wildcards.acc} 2> {log}
    """
179
180
181
182
shell:
    """
    gzip {input} 2> {log} 
    """
195
196
197
198
199
200
201
202
shell:
    """
    mkdir {output}
    STAR --runMode genomeGenerate \
        --genomeDir {output} \
        --genomeFastaFiles {input.masked_genome} \
        --runThreadN {threads} &> {log}
    """
220
221
222
223
224
225
226
227
228
229
230
231
shell:
    """
    STAR --readFilesIn {input.R1} {input.R2} \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --twopassMode Basic \
        --genomeDir {input.star_build} \
        --runThreadN {threads} \
        --readFilesCommand zcat \
        --alignIntronMax 10000 \
        --outFileNamePrefix {params.out} &> {log} 
    """
245
246
247
248
249
shell:
    """
    ( samtools merge -@ {threads} -r -o {output.bam} {input} &&\
        samtools index {output.bam} ) 2> {log}
    """
264
265
266
267
268
shell:
    """
    ( wget --no-check-certificate https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Vertebrata.fa.gz -P {params.outdir} && 
    gunzip {params.outdir}/Vertebrata.fa.gz ) 2> {log}
    """
280
281
282
283
284
285
286
shell:
    """
    curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/{params.ref_seq}/download?include_annotation_type=PROT_FASTA&filename={wildcards.psp}.zip" -H "Accept: application/zip" &&
    unzip {wildcards.psp}.zip -d {params.outdir} &&
    cp  {params.outdir}/ncbi_dataset/data/{params.ref_seq}/protein.faa {output}
    rm -rf {params.outdir}/ncbi_dataset
    """
297
298
299
300
shell:
    """
    curl --output {output} '{params.url}' 2> {log}
    """
312
313
314
315
shell:
    """
    cat {input} > {output}
    """
332
333
334
335
336
337
338
339
340
341
shell:
    """
    braker.pl --genome {input.masked_genome} \
        --prot_seq {input.proteins} \
        --softmasking \
        --useexisting \
        --threads {threads} \
        --workingdir {params.outputdir} \
        --species "Podacris siculus prot" 2> {log}
    """ 
358
359
360
361
362
363
364
365
366
367
shell:
    """
    braker.pl --genome {input.masked_genome} \
        --bam {input.bam} \
        --softmasking \
        --useexisting \
        --threads {threads} \
        --workingdir {params.outputdir} \
        --species "Podacris siculus rna" 2> {log} 
    """
386
387
388
389
390
391
392
393
shell:
    """
    tsebra.py -g {input.prot_aug} \
        -k {input.rna_aug} \
        -c {params.config} \
        -e {input.hints_rna},{input.hints_prot} \
        -o {output} 2> {log} 
    """ 
406
407
408
409
410
411
412
shell:
    """
    rename_gtf.py --gtf {input} \
            --prefix Psic \
            --translation_tab {output.tab} \
            --out {output.gtf} 2> {log} 
    """
426
427
428
429
430
431
432
433
434
435
436
run:
    features = ['exon', 'intron', 'gene', 'transcript']
    with open(input[0], 'r') as fin:
        with open(output[0], 'w') as fout:
            lines = fin.readlines()
            for line in lines:
                sline = line.split('\t')
                if sline[2] in features:
                    pass
                else:
                    fout.write(line)
448
449
450
451
452
shell:
    """
    agat_convert_sp_gxf2gxf.pl --gtf {input} \
        --output {output} &> {log}
    """
462
463
464
465
shell:
    """
    sed 's/SEQ/0/g' {input} > {output}
    """
475
476
477
478
shell:
    """
    grep -v 'Psic_g38722' {input} > {output}
    """
490
491
492
493
shell:
    """
    gt gff3 -sort -tidy -retainids {input} > {output} 2> {log}
    """
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
run:
    with open(input[0], 'r') as fin:
        with open(output[0], 'w') as fout:
            lines = fin.readlines()
            for line in lines:
                if not line.startswith('#'):
                    sline = line.split('\t')
                    feature = sline[2]
                    if feature == 'gene':
                        #Remove transcript_id attribute from gene features
                        sline[8] = re.sub(r'(;transcript_id.*$)', '', sline[8])
                    elif feature == 'mRNA':
                        #Make sure transcript_id annotations for isoforms are correct (i.e., .t2, .t3, etc.)
                        #Use ID attribute since transcript_id attributes are incorrectly incremented and will be replaced
                        id_pattern = r"(?<=ID=)(.*)(?=;Parent)"
                        ID = re.search(id_pattern, sline[8]).group(1)
                        if ID.endswith('.t1'):
                            #First isoforms are fine
                            pass
                        else:
                            #Alternative isoforms need transcript_id replaced with ID
                            sline[8] = re.sub(r'(?<=;transcript_id=)(.*$)', ID, sline[8])
                    fout.write('\t'.join(sline))
                else:
                    fout.write(line) 
541
542
543
544
shell:
    """
    gffread -E -y {output} -g {input.ref} {input.gff} 2> {log}
    """
565
566
567
568
569
570
571
572
573
574
575
shell:
    """
    interproscan.sh -i {input.prot} \
        -b {params.out_base} \
        -f xml,gff3 \
        -goterms \
        --pathways \
        --seqtype p \
        --cpu {threads} \
        --verbose &> {log} 
    """ 
585
586
587
588
shell:
    """
    funannotate setup --database {output} -b tetrapoda --force 2> {log}
    """
597
598
599
600
601
shell:
    """
    mkdir -p {output}
    download_eggnog_data.py -y --data_dir {output} 
    """
617
618
619
620
621
622
623
624
625
shell:
    """
    emapper.py -i {input.prot} \
        --data_dir {input.db} \
        --cpu {threads} \
        --itype proteins \
        -o {params.out} \
        --override &> {log}
    """
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
shell:
    """
    mkdir -p {params.outdir}
    funannotate annotate \
        --gff {input.gff} \
        --fasta {input.ref} \
        --species "Podacris siculus" \
        --out {params.outdir} \
        --iprscan {input.iprs} \
        --eggnog {input.enm} \
        --force \
        --database {input.db} \
        --busco_db tetrapoda \
        --cpus {threads} 2> {log}
    """
676
677
678
679
shell:
    """
    wget {params.url} --no-check-certificate -P {params.outdir}
    """
694
695
script:
    "../scripts/python/fixEC_incrementCDS_addLocusTags.py"
707
708
709
710
shell:
    """
    gff3_sort -g {input} -r -og {output} 2> {log}
    """
723
724
725
726
shell:
    """
    gffread -E -y {output} -g {input.ref} {input.gff} 2> {log}
    """
733
734
735
736
shell:
    """
    touch {output}
    """
14
15
16
17
18
19
20
21
22
shell:
    """
    quast.py -o {output} \
            --threads {threads} \
            --split-scaffolds \
            --large \
            --k-mer-stats \
            {input.fasta} &> {log}
    """
38
39
40
41
42
43
44
45
46
47
shell:
    """
    busco -m genome \
            -i {input.fasta} \
            -o {params.out_name} \
            --out_path {params.out_path} \
            --lineage tetrapoda_odb10 \
            --force \
            --cpu {threads} &> {log}
    """
SnakeMake From line 38 of rules/qc.smk
60
61
62
63
64
65
shell:
    """
    agat_sp_functional_statistics.pl --gff {input.gff} \
            -gs {input.fasta} \
            --output {output} 2> {log}
    """
SnakeMake From line 60 of rules/qc.smk
81
82
83
84
85
86
87
88
89
90
shell:
    """
    busco -m protein \
            -i {input} \
            -o {params.out_name} \
            --out_path {params.out_path} \
            --lineage tetrapoda_odb10 \
            --force \
            --cpu {threads} &> {log}
    """
SnakeMake From line 81 of rules/qc.smk
101
102
103
104
shell:
    """
    touch {output}
    """
SnakeMake From line 101 of rules/qc.smk
11
12
13
14
15
shell:
    """
    ln -sf {input} {output.fasta} &&
    samtools faidx {output.fasta}
    """
29
30
31
32
33
shell:
    """
    cat {input.fai} | awk '$2 > 100000 {{print $1}}' > {output.scaffs}&&
    samtools faidx {input.fasta} -r {output.scaffs} > {output.fasta}
    """
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
import re
import gffutils

# Creat dictionary mapping products to EC numbers
EC_num_to_products = {}
with open(snakemake.input['ec'][0], 'r') as fin:
    lines = fin.readlines()
    for l in lines:
        if l.startswith('ID'):
            EC_number = l.split('   ')[1].strip()
        if l.startswith('DE'):
            prod = l.split('   ')[1].strip()
            if not EC_number in EC_num_to_products.keys():
                EC_num_to_products[EC_number] = prod
            else:
                EC_num_to_products[EC_number] += EC_num_to_products[EC_number] + prod
        else:
            pass

locus_tag_prefix = snakemake.params['locus_tag']
with open(snakemake.output[0], 'w') as fout:
    fout.write("##gff-version 3\n")
    gene_num = 1
    hyp_wEC_count = 0
    reassigned = 0
    for feat in gffutils.DataIterator(snakemake.input['gff']):
        if feat.featuretype == 'gene':
            # Set CDS counter
            cds_count = 1

            # Add locus tag to genes
            feat.attributes["locus_tag"] = f'{locus_tag_prefix}_' + str(int(gene_num)).zfill(5)
            gene_num += 1
        elif feat.featuretype == 'mRNA':
            if 'product' in feat.attributes.keys() and 'EC_number' in feat.attributes.keys():
                # Rename EC_number
                feat.attributes['ec_number'] = [feat.attributes['EC_number'][0]]
                feat.attributes.pop('EC_number')

                # Reassign product if EC Number to 4 digits, else delete EC Number and keep hypothetical
                EC_number = feat.attributes['ec_number'][0]
                prod = feat.attributes['product'][0]
                if prod == 'hypothetical protein':
                    EC_split = EC_number.split('.')
                    if len(EC_split) < 4:
                        hyp_wEC_count += 1
                        feat.attributes.pop('ec_number')
                    elif len(EC_split) == 4:
                        reassigned += 1
                        new_product = EC_num_to_products[EC_number]
                        # Handle cases where EC numbers have been reassigned
                        if 'Transferred' in new_product:
                            pattern = r'(?<=:\s)(\d\.\d+\.\d+\.\d+(?=[\.|,]))'
                            new_ec = re.search(pattern, new_product).group(1)
                            new_product = EC_num_to_products[new_ec]

                        new_product = new_product.replace('.', '')
                        feat.attributes['product'] = [new_product]

                        # Add note to GFF about origin of Product
                        feat.attributes['note'] = ["Funnanotate product changed from hypothetical protein based on EC_number"]
                else:
                    EC_split = EC_number.split('.')
                    if len(EC_split) < 4:
                        new_ec = '.'.join(EC_split + ['-'] * (4 - len(EC_split)))
                        feat.attributes['ec_number'] = [new_ec]
        elif feat.featuretype == 'CDS':
            # Assign increment CDS to ensure unique ID
            feat.attributes['ID'][0] = re.sub('cds', f'cds{cds_count}', feat.attributes['ID'][0])
            cds_count += 1
        else:
            pass
        fout.write(str(feat) + '\n')
print(f"{hyp_wEC_count} products were Hypothetical and contained EC Numbers")
print(f"{reassigned} hypothetical proteins have been reassigned based on their EC Numbers")
ShowHide 35 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/James-S-Santangelo/pga
Name: pga
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...