AugusMake: Snakemake-Based Gene Annotation Pipeline with Augustus


AugusMake - Augustus-based gene prediction with Snakemake

Guided by a genome file, a transcriptome is assembled de novo with Trinity. The workflow either uses RNA-Seq data provided by the user or automatically downloads paired-end (PE) reads from NCBI's SRA database given an accession number. PE reads are trimmed with Trimmomatic and their quality is assessed pre- and post-trimming with FastQC. Alternatively, the user can provide an already-assembled transcriptome. The completeness of the transcriptome is checked with BUSCO. Gene prediction can be done ab initio using just a genome file, i.e. no transcriptome is required, and/or with extrinsic hints derived from the transcriptomic data. Both modes can use pre-trained parameters shipped with Augustus, or the user can choose to train the program for a new species. Results from training should be used with caution and carefully checked by the user after the analyses are completed.

The rules for Augustus were developed following the protocol of Hoff & Stanke (2018), the developers of Augustus.

System requirements

Local machine

I recommend running the workflow on an HPC system, as the analyses are resource- and time-consuming.

  • If you don't have it yet, you need conda or miniconda on your machine. Follow these instructions.

    • After you are all set with conda, I highly (highly!) recommend installing mamba, a much faster package manager that replaces conda:

      • First, activate your conda base:

      conda activate base

      • Then, type:

      conda install -n base -c conda-forge mamba

  • Likewise, follow this tutorial to install Git if you don't have it.

HPC system

Follow the instructions from your cluster administrator regarding the loading of modules, such as a root distribution of Conda. Modules are typically used to set up environment variables and therefore have to be loaded within the jobscripts before anything else.

e.g.: module load anaconda3/2022.05
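
As an illustration, a minimal SGE jobscript that loads the module before using conda might look like the sketch below; the module name, queue, slot count, and environment name are only examples taken from elsewhere on this page and must be adapted to your cluster:

#!/bin/bash
#$ -cwd            # run in the current working directory
#$ -pe smp 4       # example: request 4 slots
#$ -q small.q      # example queue; adjust to your cluster

# Load the cluster's Anaconda module first, so that conda is available inside the job
module load anaconda3/2022.05

# Make "conda activate" usable in a non-interactive shell
source "$(conda info --base)/etc/profile.d/conda.sh"

# Activate the environment the job should run in (created in the steps below)
conda activate localconda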

Normally, the user doesn't have sudo rights to install anything into the root of the cluster. For example, you might want to work with a newer distribution of conda as your "new", local base, and (ideally) install and use mamba to replace conda as a package manager. To do so, first load the anaconda module and then create a new environment, here called localconda:

  1. module load anaconda3/2022.05

  2. conda create -n localconda -c conda-forge conda=22.9.0

  3. conda install -n localconda -c conda-forge mamba

  4. conda activate localconda

If you run conda env list you'll probably see something like this: /home/myusername/.conda/envs/localconda/
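
To check that mamba ended up in the new localconda environment, activate it and ask for the version:

conda activate localconda
mamba --version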

Installation

  1. Clone this repository

git clone https://github.com/juliawiggeshoff/AugusMake.git

  2. Activate your conda base

conda activate base

  • If you are working on a cluster or have your own "local", isolated environment you want to activate instead, use its name to activate it

conda activate localconda

  3. Install AugusMake into an isolated software environment by navigating to the directory where this repository was cloned and running:

conda env create --file environment.yaml

If you followed what was recommended in the System requirements, run this instead:

mamba env create --file environment.yaml

The AugusMake environment is now created.

  4. Always activate the environment before running the workflow

On a local machine:

conda activate AugusMake

If you are on a cluster and/or created the environment "within" another environment, you want to run this first:

conda env list

You will probably see something like this among your environments:

/home/myusername/.conda/envs/localconda/envs/AugusMake

From now on, you have to give this full path when activating the environment prior to running the workflow:

conda activate /home/myusername/.conda/envs/localconda/envs/AugusMake

Input data requirements

  1. Configuration file config/config.yaml

  2. Species Information table config/species_table.tsv

Detailed information on the required input files, including all three Augustus processing alternatives, can be found in config/README.md.
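
Purely as an illustration of the kind of information the workflow needs per species (genome, RNA-Seq reads or an SRA accession, and the chosen Augustus mode), a species table could look roughly like the hypothetical sketch below. The column names, order, and accepted values shown here are made up; the real format is defined in config/README.md and takes precedence:

# hypothetical layout only -- follow config/README.md for the real format
species        genome                      reads_or_accession   augustus_mode
species_one    resources/species_one.fa    SRR0000000           hints
species_two    resources/species_two.fa    -                    ab_initio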

Run AugusMake

Remember to always activate the environment first

conda activate AugusMake

or

conda activate /home/myusername/.conda/envs/localconda/envs/AugusMake

Before running the workflow locally or on an HPC system, make sure to do a "dry run" (--dry-run or -n) to see how many jobs will be run and whether any errors are flagged. Use --printshellcmds or -p for a full description of the rules, or --quiet or -q to just print a summary of the jobs:

snakemake --use-conda --cores 51 -q -n

Building DAG of jobs...
Job stats:
job count min threads max threads
--------------------- ------- ------------- -------------
aa2nonred 2 20 20
all 1 1 1
all_busco_flag 1 1 1
all_fasterq_dump_done 1 1 1
augustus_ab_initio 1 1 1
augustus_hints 4 1 1
augustus_training 2 1 1
 ...
prefetch 1 1 1
raw_fastqc 2 2 2
samtools 2 5 5
trimmed_fastqc 8 2 2
trimmomatic 2 15 15
total 69 1 20

You can use the total number of jobs to guide the value for --jobs, while keeping in mind not to allow too many jobs at once. A value of 25 to 30 jobs is usually a safe choice. See cluster execution.

Local machine execution

Not recommended unless you have plenty of storage and CPUs available (and time to wait...). Nevertheless, you can simply run it like this:

nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --cores 31 --max-threads 25 --rerun-incomplete > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &

Modify number of cores accordingly.

Cluster execution

Tailored to work with an SGE job scheduler. Modify as needed. See the Snakemake documentation for help.

Before the first execution of the workflow, you need to create the conda environments; otherwise, Snakemake fails:

snakemake --cores 8 --use-conda --conda-create-envs-only

To run the workflow:

nohup snakemake --keep-going --use-conda --verbose --printshellcmds --reason --nolock --jobs 15 --cores 51 --local-cores 15 --max-threads 25 --rerun-incomplete --cluster "qsub -terse -V -b y -j y -o snakejob_logs/ -cwd -pe smp {threads} -q small.q,medium.q,large.q -M [email protected] -m be" > nohup_AugusMake_$(date +"%F_%H_%M_%S").out &

Remember to:

  1. Create snakejob_logs in the working directory (see the command after this list)

  2. Modify [email protected]

  3. Change values for --jobs, --cores, --local-cores, and --max-threads accordingly

  4. Change -q queue name
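
Creating the log directory is a one-liner:

mkdir -p snakejob_logs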

Make sure you set a low value for --local-cores so that you don't take up too many resources on your head node. Likewise, it is not recommended to allow too many --jobs to run in parallel, for similar reasons. Lastly, I like to set a --max-threads limit to ensure no rule "hogs" too many threads. This is often necessary because the number of threads per rule is set as a percentage of the total resources provided by the user. e.g.: if you can and want to use 110 threads, a rule like genome_guided_trinity will "take" 40% of these, i.e. 44 threads, which is often overkill. By limiting the maximum number of threads to 25, other jobs can run simultaneously while Trinity still gets a "decent" number of threads (see the sketch below).
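
To make the arithmetic concrete, the sketch below replays the numbers from this example; the 40% share for genome_guided_trinity is the one quoted above, so check the workflow's rules for the exact percentages:

TOTAL_CORES=110        # what you pass via --cores
TRINITY_SHARE=40       # example: the rule requests 40% of the total
echo $(( TOTAL_CORES * TRINITY_SHARE / 100 ))   # 44 threads requested by the rule
# with --max-threads 25, Snakemake caps the rule at 25 threads instead,
# leaving the remaining cores free for other jobs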

If you named your configuration file differently, e.g. NEW_NAME, include the option --configfile config/NEW_NAME.yaml
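
For example, the dry run shown earlier would then be invoked as:

snakemake --use-conda --cores 51 --configfile config/NEW_NAME.yaml -q -n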

Main results

report.zip

A report.zip file is automatically generated upon successful completion of the workflow. ALWAYS include --nolock when running snakemake; otherwise, the report is not created automatically and an error is returned. If that happens, the user can still generate the report manually with snakemake --report report.zip afterward, but the automatic generation is obviously preferred.
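
If the report has to be generated manually, activate the environment as usual and run the report command from the working directory:

conda activate AugusMake
snakemake --report report.zip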

Charts visually representing the BUSCO results are included in report.zip. Each species has its own individual chart, and one chart combining all species is also available so that results can be compared between assemblies. Similarly, the FastQC report files are included for each species, pre- and post-trimming.

To be included: Augustus results

Code Snippets

Snakefile lines 134-136:
shell:
    "(prefetch {wildcards.accession} --output-directory {params.outdir_sra} && "
    "echo \"{output} DONE\") &> {log}"
Snakefile lines 152-156:
shell:
    "(fasterq-dump {input} --outdir {params.outdir} --threads {threads} && "
    "echo \"fasterq-dump\" finished extracting {input} && "
    "pigz -p {threads} {params.outdir}/{wildcards.accession}*.fastq && "
    "echo \"pigz finished compressing {output.fwd} {output.rev}\") &> {log}"
Snakefile lines 166-167:
shell:
    "echo {input.fwd} > {output} && echo {input.rev} > {output}"
Snakefile lines 175-176:
shell:
    "echo {input} > {output.all_SRA_downloads}"
Snakefile lines 214-216:
shell:
    "(zcat {input.fwd} | fastqc stdin:{wildcards.species}_fwd -o {params.fastqc_pretrim_outdir} -t {threads} && "
    "zcat {input.rev} | fastqc stdin:{wildcards.species}_rev -o {params.fastqc_pretrim_outdir} -t {threads}) &> {log}"
Snakefile lines 241-243:
shell:
    "trimmomatic PE -threads {threads} -phred33 {input.fwd} {input.rev} {output.fwd_p} {output.fwd_u} "
    "{output.rev_p} {output.rev_u} {params.clip}{params.adapter}{params.trim_params} {params.info_len} &> {log}"
Snakefile lines 265-267:
shell:
   "(zcat {input.trimmed} | fastqc stdin:{wildcards.species}_trimmed_{wildcards.suffix} "
   "-o {params.fastqc_posttrim_outdir} -t {threads}) &> {log}"
Snakefile lines 282-283:
shell:
    "gmap_build --dir {params.out_dir} {input.genome} --genomedb {params.prefix} &> {log}"
Snakefile lines 304-307:
shell:
    "(gsnap --gunzip --dir {params.out_dir} --db {params.prefix} "
    "{input.fwd_p} {input.rev_p} --nthreads {threads} --format sam --novelsplicing=1 "
    "--nofails --output-file {output.sam} --npaths=20) &> {log}"
Snakefile lines 321-325:
shell:
    "(samtools view -bo {output.bam} {input.sam} -@ {threads} && "
    "echo Converted sam file to bam && "
    "samtools sort -o {output.coordsortbam} {output.bam} -@ {threads} && "
    "echo Converted read alignments to a coordinate-sorted bam file) &> {log}"
Snakefile lines 341-343:
shell:
    "(Trinity --genome_guided_bam {input.coordsortbam} --genome_guided_max_intron 10000 "
    "--max_memory 50G --CPU {threads} --output {params.out_dir}) &> {log}"
Snakefile lines 411-413:
shell:
    "mkdir -p {params.flag_dir} && "
    "cp {input} {params.all_species_dir}"
Snakefile lines 421-422:
shell:
    "ls {input} > {output}"
Snakefile lines 521-525:
shell:
    "mkdir -p {params.outdir}; "
    "sed \"s/<__DATABASE__>/{params.path_db}/\" {params.template_config} | "
    "sed \"s/<__MIN_PERCENT_ALIGNED__>/0.8/\" | "
    "sed \"s/<__MIN_AVG_PER_ID__>/0.9/\" > {output.path_species_config}; "
Snakefile lines 563-573:
shell:
    "(grep \"complete\" {input.cds} | perl -pe 's/>(\S+).*/$1/' > {output.complete_orfs} && "
    "echo \"Created a list of complete ORFs: {output.complete_orfs}\" && "
    "grep -F -f {output.complete_orfs} {input.gff3} | "
    "grep -P \"(\tCDS\t|\texon\t)\" | perl -pe 's/cds\.//; s/\.exon\d+//;' > {output.training_set_complete_gff3} && "
    "echo \"Used the list with complete ORFs to search for them in the pasa filtered transcriptome mapping file, "
    "keeping only the exon and CDS entries; renamed exon entries to have the same identifier as CDS entries, "
    "generating training gene candidates: {output.training_set_complete_gff3}\" && "
    "cat {output.training_set_complete_gff3} | perl -pe 's/\t\S*(asmbl_ \d+).*/\t$1/' | "
    "sort -n -k 4 | sort -s -k 9 | sort -s -k 1,1 > {output.bonafide_gtf} && "
    "echo \"Sorted the training gene candidates: {output.bonafide_gtf}\") &> {log}"
Snakefile lines 587-590:
shell:
    "(computeFlankingRegion.pl {input.bonafide_gtf} && "
    "grep \"flanking_DNA\" {log} | xargs echo > {output.flanking_length} && "
    "echo \"Flanking region length for training gene structures listed in the GTF file: {output.flanking_length}\") &> {log}"
Snakefile lines 604-608:
shell:
    "(length=$(cut -d \" \" -f5 {input.flanking_length}) && "
    "gff2gbSmallDNA.pl {input.bonafide_gtf} {input.genome} \"$length\" {output.bonafide_gbff} && "
    "echo \"Converted genome file and gtf file into GenBank flatfile format using the flanking "
    "length \"$length\": {output.bonafide_gbff}\") &> {log}"
Snakefile lines 628-638:
shell:
    "(echo \"Peptide sequences for the final candidate ORFs generated by TransDecoder within the PASApipeline: {input.pep}\" && "
    "cat {input.pep} | grep -Pzo \">.*complete.*\\n[^\\n]*\\n(?:[^>].*\\n)*\" | sed -E 's/^>([^[:space:]]+).*/>\\1/' | "
    "sed '/^$/d' > {output.mod_headers_pep}; "
    "echo \"Filtered the peptide file to keep sequences corresponding to complete ORFs, "
    "while also simplifying the sequence headers: {output.mod_headers_pep}\" && "
    "aa2nonred.pl {output.mod_headers_pep} {output.nonred_pep} --cores {threads} && "
    "grep \">\" {output.nonred_pep} | sed \"s/^>//g\" > {output.nonred_pep_headers} && "
    "echo \"Generated a fasta file containing protein sequences that are less than 80% redundant "
    "with any other sequence in the set: {output.nonred_pep}\" && "
    "echo \"Stored information on the headers from the non-redudant sequences: {output.nonred_pep_headers}\") &> {log}"
Snakefile lines 653-666:
shell:
    "(cat {input.bonafide_gbff} | perl -ne 'if($_ =~ m/LOCUS\s+(\S+)\s/) {{$txLocus = $1;}} "
    "elsif($_ =~ m/\/gene=\\\"(\S+)\\\"/) {{$txInGb3{{$1}} = $txLocus}} if(eof()){{foreach(keys %txInGb3) "
    "{{print \"$_\\t$txInGb3{{$_}}\\n\";}}}}' > {output.gene_loci_tsv} && "
    "echo \"Generated a tab-separated file with information on gene names and their corresponding loci names: {output.gene_loci_tsv}\" && "
    "grep -f {input.nonred_pep_headers} {output.gene_loci_tsv} | cut -f2 > {output.nonred_loci_lst} && "
    "echo \"List of loci without redudant sequences on the amino acid level: {output.nonred_loci_lst}\" && "
    "filterGenesIn.pl {output.nonred_loci_lst} {input.bonafide_gbff} > {output.filtered_bonafide_gbff} && "
    "loci_orig=$(grep -c \"LOCUS\" {input.bonafide_gbff}) && "
    "loci_filt=$(grep -c \"LOCUS\" {output.filtered_bonafide_gbff}) && "
    "rmvd_loci=$(($loci_orig - $loci_filt)) && "
    "echo \"Removed \"$rmvd_loci\" genes from the original gbff file due to sequence redundancy on the amino acid level "
    "and generated a filtered gbff file: {output.filtered_bonafide_gbff}\" && "
    "echo \"Now ready to start training Augustus using \"$loci_filt\" genes from the species {wildcards.species}\") &> {log}"
Snakefile lines 687-708:
shell:
    "(if ! [ -f {params.augustus_config} ]; then "
    "new_species.pl --species={wildcards.species}; "
    "echo \"Created subdirectory for the new species {wildcards.species} with template parameter files: {params.augustus_config_dir}\"; "
    "fi && " 
    "etraining --species={wildcards.species} {input.filtered_bonafide_gbff} &&"
    "err_count=$(grep -c \"Variable stopCodonExcludedFromCDS set right\" {log}) && "
    "loci_filt=$(grep -c \"LOCUS\" {input.filtered_bonafide_gbff}) && "
    "cutoff=$(echo \"\"$loci_filt\"*0.50\" | bc -l | xargs printf \"%.0f\") && "
    "if [ \"$err_count\" -gt \"$cutoff\" ]; then "
    "echo \"More than half of the genes ($err_count) returned an error during etraining\"; "
    "echo \"Modifying value stopCodonExcludedFromCDS to true from {params.augustus_config}\"; "
    "sed -i \"s/stopCodonExcludedFromCDS false/stopCodonExcludedFromCDS true/\" {params.augustus_config}; "
    "else "
    "echo \"More than half of the genes are adequate to train augustus, with a total of $err_count displaying an error\"; "
    "echo \"Keeping value stopCodonExcludedFromCDS as false in {params.augustus_config}\"; "
    "fi && "
    "echo \"Running etraining again\" && "
    "etraining --species={wildcards.species} {input.filtered_bonafide_gbff} 2>&1 | grep \"in sequence\" | "
    "perl -pe 's/.*n sequence (\S+):.*/$1/' | sort -u > {output.error_genes_lst} && "
    "echo \"List with genes displaying errors, which will be removed from gbff file: {output.error_genes_lst}\" && "
    "filterGenes.pl {output.error_genes_lst} {input.filtered_bonafide_gbff} > {output.noerror_genes_gbff} && "
Snakefile lines 777-780:
shell:
    "(augustus --species={params.sp_name} {input.genome} --gff3=on --outfile={output.predicted_genes_gff3} --errfile={output.error_out} && "
    "echo \"Genes predicted ab initio: {output.predicted_genes_gff3}\" && "
    "echo \"Error messages: {output.error_out}\") &> {log}"
Snakefile lines 820-826:
shell:
    "(blat -noHead -minIdentity=92 {input.genome} {input.transcriptome} {output.psl} && "
    "echo \"cDNA aligned against genome with BLAT: {output.psl}\" && "
    "pslCDnaFilter -minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap {output.psl} {output.filt_psl} && "
    "echo \"Alignments filtered to obtain those that are potentially most useful for gene prediction: {output.filt_psl}\" && "
    "cat {output.filt_psl} | sort -n -k 16,16 | sort -s -k 14,14 > {output.sort_filt_psl} && "
    "echo \"Sorted filtered alignments according to start position and target sequence name: {output.sort_filt_psl}\") &> {log}"
Snakefile lines 837-839:
shell:
    "(blat2hints.pl --in={input.sort_filt_psl} --out={output.hints} --minintronlen=35 --trunkSS && "
    "echo \"Convert sort-filtered blat alignments in psl format to hints: {output.hints}\" ) &> {log}"
Snakefile lines 878-884:
shell:
    "(cp {params.augustus_orig_config} {output.augustus_new_config} && "
    "echo \"Augustus configuration file: {output.augustus_new_config}\" && "
    "augustus --species={params.sp_name} --extrinsicCfgFile={output.augustus_new_config} "
    "--hintsfile={input.hints} --allow_hinted_splicesites=atac "
    "--softmasking=off {input.genome} > {output.predicted_with_hints} && "
    "echo \"Genes predicted with extrinsic hints: {output.predicted_with_hints}\") &> {log}"