# Podacris Genome Annotation (PGA)
TODO: Link manuscript when published
Genomic signatures of selection underpinning extremely rapid adaptation in island lizards
Authors: Maria Novosolov, Anthony Herrel, Anamaria Stambuk, Jesper Stenderup, James S. Santangelo, Iva Sabolić, Kristian E. Hanghøj, Thorfinn S. Korneliussen, Debora Y. C. Brandt, Rasmus Heller, Rasmus Nielsen, Morten E. Allentoft
## Description
This repository contains the code used for annotating the scaffold-level assembly of the Italian Wall Lizard (*Podacris siculus*). The pipeline takes the genome FASTA file as input and outputs the final functional annotation in GFF3 format.
In addition to the genome FASTA, the following dependencies/resources need to be obtained prior to running the pipeline:

- The pipeline uses GeneMark, which is licensed software. The license key can be obtained from here. Include the `.gm_key` in the `workflow/` directory containing the main Snakefile used for running the pipeline (see the placement sketch after this list).
- A copy of the RepBase repeat library (I used `RepBaseRepeatMaskerEdition-20181026.tar.gz`) from here. A license is required to use this database, so it could not be included. Once obtained, it needs to be included in the `resources/` directory.
- A copy of the InterProScan data needs to be obtained and included in the `resources/` directory. The following commands can be run to download and set up the database:

  ```bash
  wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/alt/interproscan-data-5.61-93.0.tar.gz
  tar -pxvzf interproscan-data-5.61-93.0.tar.gz
  chmod -R 777 interproscan-data-5.61-93.0
  ```
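For concreteness, a minimal sketch of where the two licensed files end up, assuming they were downloaded to `~/Downloads` (hypothetical paths; adjust to your system):

```bash
# Hypothetical download locations; adjust to your system
cp ~/Downloads/gm_key workflow/.gm_key                                # GeneMark key next to the main Snakefile
cp ~/Downloads/RepBaseRepeatMaskerEdition-20181026.tar.gz resources/  # RepBase tarball, left compressed
```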
## Using the pipeline
This pipeline requires Conda and Singularity:

- A minimal installation of Conda (i.e., Miniconda) can be installed by following the instructions for your platform here; a minimal install sketch follows this list.
- Installing Singularity requires admin privileges, but using Singularity to run pre-created containers does not. Installation instructions can be found here.
- All Singularity containers used in this pipeline are available in this public repository, though they will be automatically pulled and executed by the pipeline.
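As a convenience, here is a minimal sketch of a Miniconda installation, assuming Linux on x86_64 (other platforms differ; follow the official instructions linked above):

```bash
# Assumes Linux x86_64; see the official Miniconda instructions for other platforms
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
source "$HOME/miniconda3/bin/activate"
```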
Assuming Conda is installed, this repository's Conda environment can be replicated by running the following command:

```bash
conda env create -f environment.yaml -n prg
```

This will create a Conda environment named `prg` containing a minimal set of dependencies required to run the pipeline. After activating the environment (`conda activate prg`), the pipeline can be executed from the `workflow/` directory by running a command that looks something like:
```bash
snakemake --use-conda --use-singularity --singularity-args "--bind <path> --bind <path/to/interproscan/data>:/opt/interproscan-5.61-93.0/data" --configfile ../config/<configfile> --notemp -j <cores>
```
for local execution. Here, `<path>` is the path on the cluster from which files will be read/written (e.g., `/scratch`), `<path/to/interproscan/data>` is the full path to the InterProScan data in `resources/`, `<configfile>` is one of the config files in the `config/` directory that needs to be modified to match the paths on your system, and `<cores>` is the number of cores available for executing parallel processes.
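As an illustration, here is a filled-in version of the command above with hypothetical values (a clone at `/scratch/PGA`, a config file named `config.yaml`, and 32 cores; substitute your own paths):

```bash
# Hypothetical paths and core count; adjust to your system
snakemake --use-conda --use-singularity \
    --singularity-args "--bind /scratch --bind /scratch/PGA/resources/interproscan-5.61-93.0/data:/opt/interproscan-5.61-93.0/data" \
    --configfile ../config/config.yaml \
    --notemp -j 32
```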
## Code Snippets
```
# Unpack the RepBase tarball and add it to RepeatMasker's library directory
shell:
    """
    ( tar -xzf {input} -C {params.untar_dir} &&
        cp -r /opt/RepeatMasker/Libraries/* {output.libdir} &&
        ln -sf {output.dfam} {output.rm} &&
        addRepBase.pl --libdir {output.libdir} ) &> {log}
    """
```
```
# Build a RepeatModeler database from the genome FASTA
shell:
    """
    BuildDatabase -name {params.out} {input} &> {log}
    """
```
```
# Model repeats de novo with RepeatModeler, including LTR structural analysis
shell:
    """
    RepeatModeler -database {params.db_base} \
        -threads {threads} \
        -LTRStruct &> {log}
    """
```
```
# Extract tetrapod repeats from the RepeatMasker database and combine them with the de novo library
shell:
    """
    ( famdb.py -i {input.rm_db} families \
        --format fasta_name \
        --ancestors --descendants --include-class-in-name \
        'tetrapoda' > {params.rm_db_fasta} &&
        cat {params.rm_db_fasta} {input.tr_db} > {output} ) 2> {log}
    """
```
```
# Softmask the genome with RepeatMasker using the combined repeat library
shell:
    """
    ( RepeatMasker -e ncbi \
        -pa {threads} \
        -xsmall \
        -gc {params.gc} \
        -lib {input.lib} \
        -dir {params.outdir} \
        -gff {input.fasta} &&
        mv {params.outdir}/*.fasta.masked {output.fasta}
        mv {params.outdir}/*.cat.gz {output.cat}
        mv {params.outdir}/*.out {output.out}
        mv {params.outdir}/*.gff {output.gff}
        mv {params.outdir}/*.tbl {output.stats} ) &> {log}
    """
```
```
# Prefetch an SRA accession (RNA-seq evidence)
shell:
    """
    prefetch {wildcards.acc} -O {params.outdir} 2> {log}
    """
```
```
# Convert prefetched SRA accessions to paired FASTQ files
shell:
    """
    cd {params.outdir}
    fasterq-dump --split-3 \
        -e {threads} \
        --skip-technical \
        {wildcards.acc} 2> {log}
    """
```
```
shell:
    """
    gzip {input} 2> {log}
    """
```
```
# Build a STAR index from the softmasked genome
shell:
    """
    mkdir {output}
    STAR --runMode genomeGenerate \
        --genomeDir {output} \
        --genomeFastaFiles {input.masked_genome} \
        --runThreadN {threads} &> {log}
    """
```
```
# Align RNA-seq reads to the masked genome with STAR in two-pass mode
shell:
    """
    STAR --readFilesIn {input.R1} {input.R2} \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --twopassMode Basic \
        --genomeDir {input.star_build} \
        --runThreadN {threads} \
        --readFilesCommand zcat \
        --alignIntronMax 10000 \
        --outFileNamePrefix {params.out} &> {log}
    """
```
```
# Merge the per-accession BAMs and index the result
shell:
    """
    ( samtools merge -@ {threads} -r -o {output.bam} {input} && \
        samtools index {output.bam} ) 2> {log}
    """
```
```
# Download the partitioned OrthoDB vertebrate protein set
shell:
    """
    ( wget --no-check-certificate https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Vertebrata.fa.gz -P {params.outdir} &&
        gunzip {params.outdir}/Vertebrata.fa.gz ) 2> {log}
    """
```
```
# Download a reference proteome from NCBI Datasets and extract the protein FASTA
shell:
    """
    curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/{params.ref_seq}/download?include_annotation_type=PROT_FASTA&filename={wildcards.psp}.zip" -H "Accept: application/zip" &&
    unzip {wildcards.psp}.zip -d {params.outdir} &&
    cp {params.outdir}/ncbi_dataset/data/{params.ref_seq}/protein.faa {output}
    rm -rf {params.outdir}/ncbi_dataset
    """
```
```
shell:
    """
    curl --output {output} '{params.url}' 2> {log}
    """
```
```
shell:
    """
    cat {input} > {output}
    """
```
```
# Run BRAKER with protein evidence
shell:
    """
    braker.pl --genome {input.masked_genome} \
        --prot_seq {input.proteins} \
        --softmasking \
        --useexisting \
        --threads {threads} \
        --workingdir {params.outputdir} \
        --species "Podacris siculus prot" 2> {log}
    """
```
```
# Run BRAKER with RNA-seq evidence
shell:
    """
    braker.pl --genome {input.masked_genome} \
        --bam {input.bam} \
        --softmasking \
        --useexisting \
        --threads {threads} \
        --workingdir {params.outputdir} \
        --species "Podacris siculus rna" 2> {log}
    """
```
```
# Combine the protein- and RNA-based BRAKER predictions with TSEBRA
shell:
    """
    tsebra.py -g {input.prot_aug} \
        -k {input.rna_aug} \
        -c {params.config} \
        -e {input.hints_rna},{input.hints_prot} \
        -o {output} 2> {log}
    """
```
```
# Rename genes and transcripts with a consistent Psic prefix
shell:
    """
    rename_gtf.py --gtf {input} \
        --prefix Psic \
        --translation_tab {output.tab} \
        --out {output.gtf} 2> {log}
    """
```
```
# Filter exon, intron, gene, and transcript records out of the GTF
run:
    features = ['exon', 'intron', 'gene', 'transcript']
    with open(input[0], 'r') as fin:
        with open(output[0], 'w') as fout:
            lines = fin.readlines()
            for line in lines:
                sline = line.split('\t')
                if sline[2] in features:
                    pass
                else:
                    fout.write(line)
```
```
# Standardize the GTF with AGAT
shell:
    """
    agat_convert_sp_gxf2gxf.pl --gtf {input} \
        --output {output} &> {log}
    """
```
```
# Replace literal 'SEQ' tokens with 0
shell:
    """
    sed 's/SEQ/0/g' {input} > {output}
    """
```
```
# Drop all records for gene Psic_g38722
shell:
    """
    grep -v 'Psic_g38722' {input} > {output}
    """
```
```
# Sort and tidy the GFF3 with GenomeTools, retaining IDs
shell:
    """
    gt gff3 -sort -tidy -retainids {input} > {output} 2> {log}
    """
```
```
# Clean up attributes: strip transcript_id from gene features and fix
# transcript_id on alternative isoforms of mRNA features
run:
    import re
    with open(input[0], 'r') as fin:
        with open(output[0], 'w') as fout:
            lines = fin.readlines()
            for line in lines:
                if not line.startswith('#'):
                    sline = line.split('\t')
                    feature = sline[2]
                    if feature == 'gene':
                        # Remove transcript_id attribute from gene features
                        sline[8] = re.sub(r'(;transcript_id.*$)', '', sline[8])
                    elif feature == 'mRNA':
                        # Make sure transcript_id annotations for isoforms are correct (i.e., .t2, .t3, etc.)
                        # Use the ID attribute since transcript_id attributes are incorrectly incremented and will be replaced
                        id_pattern = r"(?<=ID=)(.*)(?=;Parent)"
                        ID = re.search(id_pattern, sline[8]).group(1)
                        if ID.endswith('.t1'):
                            # First isoforms are fine
                            pass
                        else:
                            # Alternative isoforms need transcript_id replaced with the ID
                            sline[8] = re.sub(r'(?<=;transcript_id=)(.*$)', ID, sline[8])
                    fout.write('\t'.join(sline))
                else:
                    fout.write(line)
```
```
# Extract protein sequences from the annotation
shell:
    """
    gffread -E -y {output} -g {input.ref} {input.gff} 2> {log}
    """
```
```
# Run InterProScan on the predicted proteins, including GO terms and pathways
shell:
    """
    interproscan.sh -i {input.prot} \
        -b {params.out_base} \
        -f xml,gff3 \
        -goterms \
        --pathways \
        --seqtype p \
        --cpu {threads} \
        --verbose &> {log}
    """
```
```
# Set up the funannotate database for the tetrapoda BUSCO lineage
shell:
    """
    funannotate setup --database {output} -b tetrapoda --force 2> {log}
    """
```
```
# Download the eggNOG-mapper data
shell:
    """
    mkdir -p {output}
    download_eggnog_data.py -y --data_dir {output}
    """
```
```
# Run eggNOG-mapper on the predicted proteins
shell:
    """
    emapper.py -i {input.prot} \
        --data_dir {input.db} \
        --cpu {threads} \
        --itype proteins \
        -o {params.out} \
        --override &> {log}
    """
```
```
# Functionally annotate the genome with funannotate, using the InterProScan and eggNOG results
shell:
    """
    mkdir -p {params.outdir}
    funannotate annotate \
        --gff {input.gff} \
        --fasta {input.ref} \
        --species "Podacris siculus" \
        --out {params.outdir} \
        --iprscan {input.iprs} \
        --eggnog {input.enm} \
        --force \
        --database {input.db} \
        --busco_db tetrapoda \
        --cpus {threads} 2> {log}
    """
```
```
# Fetch a remote resource (params.url) into the output directory
shell:
    """
    wget {params.url} --no-check-certificate -P {params.outdir}
    """
```
```
# Fix EC numbers, increment CDS IDs, and add locus tags
# (script reproduced at the end of this section)
script:
    "../scripts/python/fixEC_incrementCDS_addLocusTags.py"
```
```
# Sort the final GFF3
shell:
    """
    gff3_sort -g {input} -r -og {output} 2> {log}
    """
```
```
# Extract protein sequences from the final annotation
shell:
    """
    gffread -E -y {output} -g {input.ref} {input.gff} 2> {log}
    """
```
```
shell:
    """
    touch {output}
    """
```
```
# Assess assembly quality with QUAST
shell:
    """
    quast.py -o {output} \
        --threads {threads} \
        --split-scaffolds \
        --large \
        --k-mer-stats \
        {input.fasta} &> {log}
    """
```
```
# Assess genome completeness with BUSCO (tetrapoda_odb10 lineage)
shell:
    """
    busco -m genome \
        -i {input.fasta} \
        -o {params.out_name} \
        --out_path {params.out_path} \
        --lineage tetrapoda_odb10 \
        --force \
        --cpu {threads} &> {log}
    """
```
```
# Summarize functional annotation statistics with AGAT
shell:
    """
    agat_sp_functional_statistics.pl --gff {input.gff} \
        -gs {input.fasta} \
        --output {output} 2> {log}
    """
```
```
# Assess annotation completeness with BUSCO in protein mode
shell:
    """
    busco -m protein \
        -i {input} \
        -o {params.out_name} \
        --out_path {params.out_path} \
        --lineage tetrapoda_odb10 \
        --force \
        --cpu {threads} &> {log}
    """
```
```
shell:
    """
    touch {output}
    """
```
```
# Symlink the genome FASTA and index it with samtools
shell:
    """
    ln -sf {input} {output.fasta} &&
    samtools faidx {output.fasta}
    """
```
```
# Keep only scaffolds longer than 100 kb
shell:
    """
    cat {input.fai} | awk '$2 > 100000 {{print $1}}' > {output.scaffs} &&
    samtools faidx {input.fasta} -r {output.scaffs} > {output.fasta}
    """
```
```python
import re
import gffutils

# Create dictionary mapping EC numbers to products
EC_num_to_products = {}
with open(snakemake.input['ec'][0], 'r') as fin:
    lines = fin.readlines()
    for l in lines:
        if l.startswith('ID'):
            EC_number = l.split(' ')[1].strip()
        if l.startswith('DE'):
            prod = l.split(' ')[1].strip()
            if not EC_number in EC_num_to_products.keys():
                EC_num_to_products[EC_number] = prod
            else:
                # Append additional description lines to the existing product
                EC_num_to_products[EC_number] += prod
        else:
            pass

locus_tag_prefix = snakemake.params['locus_tag']

with open(snakemake.output[0], 'w') as fout:
    fout.write("##gff-version 3\n")
    gene_num = 1
    hyp_wEC_count = 0
    reassigned = 0
    for feat in gffutils.DataIterator(snakemake.input['gff']):
        if feat.featuretype == 'gene':
            # Reset CDS counter
            cds_count = 1
            # Add locus tag to genes
            feat.attributes["locus_tag"] = f'{locus_tag_prefix}_' + str(int(gene_num)).zfill(5)
            gene_num += 1
        elif feat.featuretype == 'mRNA':
            if 'product' in feat.attributes.keys() and 'EC_number' in feat.attributes.keys():
                # Rename EC_number
                feat.attributes['ec_number'] = [feat.attributes['EC_number'][0]]
                feat.attributes.pop('EC_number')
                # Reassign the product if the EC number has 4 digits; otherwise
                # delete the EC number and keep 'hypothetical protein'
                EC_number = feat.attributes['ec_number'][0]
                prod = feat.attributes['product'][0]
                if prod == 'hypothetical protein':
                    EC_split = EC_number.split('.')
                    if len(EC_split) < 4:
                        hyp_wEC_count += 1
                        feat.attributes.pop('ec_number')
                    elif len(EC_split) == 4:
                        reassigned += 1
                        new_product = EC_num_to_products[EC_number]
                        # Handle cases where EC numbers have been reassigned
                        if 'Transferred' in new_product:
                            pattern = r'(?<=:\s)(\d\.\d+\.\d+\.\d+(?=[\.|,]))'
                            new_ec = re.search(pattern, new_product).group(1)
                            new_product = EC_num_to_products[new_ec]
                        new_product = new_product.replace('.', '')
                        feat.attributes['product'] = [new_product]
                        # Add note to GFF about origin of product
                        feat.attributes['note'] = ["Funannotate product changed from hypothetical protein based on EC_number"]
                else:
                    EC_split = EC_number.split('.')
                    if len(EC_split) < 4:
                        new_ec = '.'.join(EC_split + ['-'] * (4 - len(EC_split)))
                        feat.attributes['ec_number'] = [new_ec]
        elif feat.featuretype == 'CDS':
            # Increment the CDS suffix to ensure unique IDs
            feat.attributes['ID'][0] = re.sub('cds', f'cds{cds_count}', feat.attributes['ID'][0])
            cds_count += 1
        else:
            pass
        fout.write(str(feat) + '\n')

print(f"{hyp_wEC_count} products were hypothetical and contained EC numbers")
print(f"{reassigned} hypothetical proteins have been reassigned based on their EC numbers")
```
## Support
- Future updates