Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
-
Last modified: mån aug 28, 2023 01:12
-
Sign: JN
Description
The workflow (Fig. \ref{workflow}) will try to extract known genomic regions
(based on multiple- sequence alignments and HMMs; the
Reference
) from a
genome file (the
Genome
). The approach taken is essentially a search with
HMM's against a genome, with an extra step where an initial similarity search
is used to reduce the input data down to matching HMM's and genomic regions.
Both the known genomic regions (multiple nucleotide-sequence alignments in
fasta format), and the genome files (fasta format, one or several scaffolds)
must be provided by the user. If several genomes are provided, the workflow can
also collect each genomic region extracted from each genome (the
Fasta seq
files), and produce unaligned "gene" files. These can, for example, be the
input for further processing with a multiple-sequence alignment software.
{width=60%}
Installation and testing
Alt. 1: Manual installation
-
Install prerequisites (see section Software prerequisites for details). On a Debian-based GNU/Linux system (tested on Ubuntu Linux 20.04), this can be done using
$ sudo apt install build-essential git hmmer ncbi-blast+ pigz plast snakemake $ git clone https://github.com/nylander/split-fasta-seq.git $ cd split-fasta-seq/src/ && make && sudo make install
-
Clone birdscanner2 from GitHub:
$ git clone https://github.com/nylander/birdscanner2.git
-
Optional: Download example data (636 MB) and test the installation
$ cd birdscanner2 $ wget -O data.tgz "https://owncloud.nrm.se/index.php/s/d6EuUJSHlQwCN6X/download" $ tar xfz data.tgz && rm data.tgz $ snakemake -j -p --dry-run $ snakemake -j -p
Input
Place properly named genome files (fasta format, gzip compressed,
.gz
) in
data/genomes/
and properly named reference files (aligned fasta files,
.fas
) in
data/references/
. See section
Data
for details on the data
format.
Example set up:
data/
├── genomes
│ ├── Apa.gz
│ └── Bpa.gz
└── references ├── 1.fas └── 2.fas
Output
Output are written to the folder
results/
.
Example output:
results/
├── genes
│ ├── 1.fas
│ └── 2.fas
├── genomes
│ ├── Apa
│ │ ├── Apa.1.fas
│ │ └── Apa.2.fas
│ └── Bpa
│ ├── Bpa.1.fas
│ └── Bpa.2.fas
└── hmmer ├── Apa.hmmer.out.gz └── Bpa.hmmer.out.gz
Run
$ cd birdscanner2
$ snakemake -j -p
Data
Note: The pipeline is very picky about the format of both file names and file content. Safest option is to make sure they are OK before trying to run the analyses.
Indata
1. Genomes
Add compressed (gzip) genome files (contig files in fasta format, nucleotide
data) to the folder
data/genomes/
. Files need to be named
<name>.gz
. The
<name>
should contain no periods, and will be used in the output as part of
the fasta header for the extracted sequences. Examples:
apa_genome.gz
,
bpa.gz
(but not, e.g.,
apa.genome.fas.gz
,
bpa.tar.gz
, etc).
2. Reference alignments
Add reference sequence alignments (nucleotides, fasta format, file suffix
.fas
) in the folder
data/references/
. Each alignment file would represent
one genomic region ("gene").
The name of the alignment file will be used in downstream analyses, so they
should have names that are easy to parse (do not use spaces or special
characters, not even hyphens (
-
) in the file names). Examples:
myo.fas
,
odc.fas
,
988.fas
,
999.fas
, etc.
The fasta headers are also used in downstream analyses and should also be easy
to parse. Examples,
>Passe
,
>Ploceu
,
>Prunell
. Use underscores (
_
)
instead of hyphens (
-
). Fasta headers needs to be unique (i.e., no duplicates
in each individual alignment), but the number of sequences doesn't need to be
the same in all files.
2.2 Jarvis bird data
We also provide filtered versions of the "Jarvis data" (
Jarvis
et al
.
2015
). If you wish to use any of
these data sets, it is recommend to download and uncompress the data
(
references.tgz
) directly inside the
birdscanner2/data/
folder. Please see
the file
resources/Jarvis_et_al_2015/README.md
for full description.
Outdata
Individual
gene
files (fasta format) for each
genome
are written to the
folder
results/genome/<genome>/
, with the default name
<genome>.<gene>.fas
.
The fasta header contains the genome name:
><genome>
and some statistics from
the HMMer search.
The output files from the hmmer-search step is also saved as compressed text
files, one for each genome, as
results/hmmer/<genome>.hmmer.out.gz
.
The gene files are also concatenated and written to the folder
results/genes/
, one file for each gene (
results/genes/<gene1>.fas
,
results/genes/<gene2>.fas
, etc). The fasta headers contains the genome names:
><genome>
, and these files can be input to a software for doing
multiple-sequence alignments.
Concatenate output from separate runs
If different runs have been made
with the same references data
, then the
separate runs can be combined using the helper script
bs2-gather-genes.pl
. For example, if
genome
Apa.gz
and
Bpa.gz
have been run against the same set of references
at different occasions, the individual files in
results/genomes/Apa
and
results/genomes/Bpa
can be concatenated to fasta files ready for
multiple-sequence alignment:
$ bs2-gather-genes.pl --outdir=out /path/to/results/genomes/Apa /path/to/results/genomes/Bpa
The concatenated files are in folder
out/
. Note that not all genomes may have
the same number of output files in
results/genomes/
, hence the number of
sequences in the concatenated files may not be the same.
Software prerequisites
The workflow is tested on GNU/Linux (Ubuntu 20.04), and uses standard Linux
(bash) tools in addition to the main workflow manager
snakemake
. A list of
tools (and tested version) are given below.
See also section
Running birdscanner2 on UPPMAX
.
-
bash (5.0.17)
-
awk (5.0.1)
-
cat (8.30)
-
find (4.7.0)
-
grep (3.4)
-
sort (8.30)
-
-
python (3.8.2)
-
snakemake (5.10.0)
-
pigz (2.4)
-
makeblastdb (2.9.0+)
-
blastdbcmd (2.9.0+)
-
hmmbuild (3.3)
-
hmmpress (3.3)
-
nhmmer (3.3)
-
perl (5.30.0)
-
plast (2.3.2)
-
splitfast (Tue 14 Jan 2020)
-
bs2-fas-to-sto.pl (1.0)
-
bs2-parse-nhmmer.pl (1.0)
-
bs2-gather-genes.pl (1.0)
Softwares 13-15 are provided. Software requirements 1-10 can also be taken care of by the conda system by running the pipeline with commands
$ snakemake -j -p --use-conda
Note: This requires conda, and is currently mostly untested. Furthermore, softwares 11, and 12 still needs to be installed separately (currently not in any conda channels).
Running birdscanner2 on UPPMAX
1. Install software
On UPPMAX, most software are available as modules. However, the
plast
and
splitfast
programs need to be installed manually. For example (tested on
cluster
"Rackham"
, and
assuming
$HOME/bin
is in your
PATH
):
1.1. plast
# Alt. 1: Copy binary
$ wget https://github.com/PLAST-software/plast-library/releases/download/v2.3.2/plastbinary_linux_v2.3.2.tar.gz
$ tar xvzf plastbinary_linux_v2.3.2.tar.gz
$ cp plastbinary_linux_v2.3.2/build/bin/plast $HOME/bin/plast
# Alt. 2: Compile
$ module load cmake doxygen
$ git clone https://github.com/PLAST-software/plast-library.git
$ cd plast-library
$ git checkout stable
$ sed -i '98,99{s/^/#/}' CMakeLists.txt # if no cppunit, disable unittest
$ mkdir build
$ cd build
$ cmake -Wno-dev ..
$ make
$ cp bin/PlastCmd $HOME/bin/plast
1.2. splitfast
$ git clone https://github.com/nylander/split-fasta-seq.git
$ cd split-fasta-seq/src
$ make
$ cp splitfast $HOME/bin/
2. Clone birdscanner2
$ git clone https://github.com/nylander/birdscanner2.git
3. Add your uppmax account number
Manually edit the file
config/cluster.yaml
, and add
your
SNIC compute project account
number
(e.g., "snic2020-12-34"). For example:
$ sed -i 's/#SNICACCOUNT#/snic2020-12-34/' config/cluster.yaml
4. Add genome and reference data
See Section Indata
5. Run the initial data conversion
Note, this step is an ad-hoc step currently used to avoid submission of too
many snakemake rules to slurm. The expected output is a folder,
birdscanner2/run/tmp
, with fasta and Stockholm files. Please make sure to
replace the "snic1234-56-78" below with your SNIC account number.
$ sbatch -A snic1234-56-78 -t 60 workflow/scripts/bs2-convert.slurm.sh
6. Run the rest of the workflow
Replace the "snic1234-56-78" below with your SNIC account number.
$ sbatch -A snic1234-56-78 workflow/scripts/bs2-run.slurm.sh
-
6.1. Launch a screen session
$ screen -S birdscanner2
-
6.2. Load modules
$ module load bioinfo-tools \ hmmer/3.2.1-intel \ blast/2.9.0+ \ snakemake/5.10.0 \ pigz/2.4
-
6.3. Start snakemake (TODO: advice on number of jobs)
$ snakemake --profile slurm -j 50
-
6.4. Detach the screen session (Ctrl-A + Ctrl-D)
Re-attaching to the screen session is done with
screen -R birdscanner2
.
When the workflow is finished, remember to exit the screen session!
Jobs on the cluster can be monitored with command
jobinfo
. Or, perhaps better:
$ squeue --user=$USER -M rackham --format="%.8i %.50j %.8u %.8T %.10M %.10l %.9P %.6D %.16R"
Run time
A runtime example (output from
snakemake --report
after succesful run) is
given below. The input was two genomes (nseqs=2393, avg.len=463145, and
nseqs=2401, avg.len=461424, respectively), and reference data consisted of
7,979 multiple sequence alignments (avg.len=1491). The analysis was run on one
multi-core computer.
{width=40%}
The bottle-necks in the workflow are the similarity searches using
plast
, and
most noticeable, the search with
nhmmer
. These will take a long time. In
Figure \ref{runtime} we can see that the plast-search (
005_run_plast
) took
about 55 min per genome, and that the nhmmer-search (
O13_run_hmmer
) took around
20h per genome. Hence, the whole process of analysing the two genomes took
almost two days.
{width=40%}
Note: if one scrutinizes the creation dates of the output files for individual genomes, we see that processing of genomes are essentially serial (Fig. \ref{creationdate}). Work on parallelize this process on a single machine is currently in progress.
License and Copyright
Copyright (c) 2020-2023 Johan Nylander
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Code Snippets
39 40 | shell: "awk -v a=$(basename {input} .fas) '/>/{{$0=\">\"a\"__\"++i}}1' {input} > {output}" |
53 54 | shell: "perl workflow/scripts/bs2-fas-to-sto.pl {input} > {output}" |
69 70 | shell: "hmmbuild {params.hmmbuildtype} {output} {input}" |
81 82 | shell: "find run/tmp -type f -name \*.fasta -exec cat {{}} \+ > {output}" |
99 100 | shell: "splitfast -m {params.length} <(pigz -p {threads} -d -c {input}) > {output}" |
116 117 | shell: "makeblastdb -in {input} -dbtype {params.dbtype} {params.parse_seqids}" |
136 137 | shell: "plast -p {params.plastprog} -i {input.query} -d run/plast/{wildcards.genome}.split.fas -o {output} -a {threads} -max-hit-per-query {params.maxhitperquery} {params.bargraph}" |
155 156 157 158 159 160 161 162 | shell: """ awk '$4>{params.minlen}' {input} | \ sort -t$'\t' -k1,1 -k12rg | \ awk -F $'\t' '!x[$1]++' | \ awk -F $'\t' '{{print $2}}' | \ sort -u > {output} """ |
177 178 179 180 181 182 183 184 | shell: """ awk '$4>{params.minlen}' {input} | \ sort -t$'\t' -k1g -k12rg | \ awk -F $'\t' '!x[$1]++' | \ awk -F $'__' '{{print $1 ".hmm"}}' | \ sort -u > {output} """ |
201 202 | shell: "blastdbcmd -db {input.db} -dbtype {params.dbtype} -entry_batch {input.idfile} -outfmt {params.outfmt} -out {output}" |
214 215 216 217 | shell: """ while IFS= read -r line; do cat run/tmp/$line >> {output}; done < {input.refids} """ |
230 231 | shell: "hmmpress {input}" |
251 252 | shell: "{params.hmmerprog} {params.outfmt} --cpu {threads} --tblout {output} {input.hmm} {input.query} > /dev/null" |
270 271 | shell: "perl workflow/scripts/bs2-parse-nhmmer.pl -f {params.fastaheader} -p {params.prefix} -i {input.hmmer} -g {input.fas} -d {output} {params.stats}" |
285 286 | shell: "perl workflow/scripts/bs2-gather-genes.pl --outdir={output.dir} $(find results/genomes -type d)" |
301 302 | shell: "pigz -p {threads} -c {input} > {output}" |
309 310 | shell: "rm -rf run/" |
317 318 | shell: "rm -rf run/ results/" |
Support
- Future updates