Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Author
- Oksana Kotovskaya
Supervisors
-
Yury Barbitoff (Bioinformatics Institute)
-
Mikhail Skoblov (Research Center of Medical Genetics)
Introduction
Genetic variants leading to loss of function are not found in all genes. If a gene is found under selection pressure, protein truncation variants (PTV) are much less common in them (Cassa C., 2017). Most often, such genes have important functions, and a such catastrophic change in the protein leads to various diseases or death (Samocha K., 2014). In this work, we are interested in the case when the division of genes into conservative (that is, under selection) and non-conservative (that is, free from selection) becomes less unambiguous, namely, cases when non-conservative genes are found in relatively conservative genes. This work is devoted to implementation of algorithm to the search for such sequences.
Goal : to estimate the evolutionary conservativeness of individual regions within single ORF.
The key task to achieve this goal was to implement an algorithm based on the hidden Markov model (HMM), which allows to determine the conservativeness of individual regions of the protein-coding sequence (CDS).
Data
In this work, the GRCh37/hg19 assembly of the human genome was used (Frankish A. 2021). Sequence
.fa
and annotation
.gff3
were obtained from this
link
.
Data on existing human PTVs was provided by Yury Barbitoff (Skitchenko R.K., 2020) and was taken from the Genome Aggregation Database (GnomAD) v2.1.1 (Karczewski K.J. et al, 2020).
Also mutation rates in the trinucleotide context were taken from the work of Karczewski K.J.
et al
(2020).
Workflow
First, we needed to obtain data on the coding DNA sequence (CDS) for the gene of interest, namely the nucleotide sequence itself, the CDS coordinates, as well as information about the observed PTV for this gene. The data obtaining scheme is presented below.
Stage 1. Sequence extraction
We started by getting the CDS coordinates from
.gff3
file. To do this, we used a combination of
grep
,
agrep
and
awk
(presented in the file
get_cds.sh
). In order to calculate the mutation rate in the trinucleotide context (described below), we also needed nucleotides before and after each region of coding DNA sequence (CDS).
After that, we indexed the file .fa using
samtools faidx
(Danecek P., 2021):
! samtools faidx ./data/GRCh37.p13.genome.fa
The indexed file was intersected with the previously obtained coordinates to get sequence with
bedtools getfasta
(Quinlan A.R., 2010):
! bedtools getfasta -fi ./data/GRCh37.p13.genome.fa -bed {gene_name}.gff -name > {gene}.fa
In this way, we obtained the DNA sequences necessary to calculate the expected probability of PTV.
Stage 2. Counting of PTV mutation rate at the codon in a trinucleotide context
The next step is to calculate the mutation rate in the trinucleotide context for each codon. According to the work of Samocha K., 2014, the best context for determining the variability of a single nucleotide is the inclusion of both 5’ and 3’ flanking nucleotides.
We used mutation frequencies for each possible variant (G, T, C for A), specified in Supplement Materials to the work of Karczewski K.J., 2020.
Thus, for each nucleotide, we considered three possible substitutions in a trinucleotide context. At the same time, we consider this base as part of the codon and determine what such a variant leads to: synonymous, missense or nonsense mutation (
frequencies_count
function in file
freqcounter.py
).
Stage 3. Computation of PTV mutation rate per window
Next, we divided the gene into regions of fixed length, scanning window (
frequencies_per_window
function in file
freqcounter.py
). We did not fix
window size
it for all genes, because the length of genes varies very much, and in the case of too small values of the window size, it becomes problematic to resolve long non-conservative regions (if there are any). For each window, the sum of the expected frequencies was calculated (example for gene ARFGEF1 presented below).
The resulting distribution of mutation rates corresponds to the expected: the mutation level for missense mutations is higher than the mutation level for synonymous variants, while loss-of-function (LoF) mutations are lower than all. For further analysis, we needed LoF mutation rates $(U)$.
Stage 4. Counting observed PTV
Next, we calculated the sum of the observed variants (allele count, $n$) per each window, as well as the average value of allele number ($N$) using function
observed_ptv
in
ploffinder.py
.
.tsv
with high-confidence pLOF variants were provided by Yury Barbitoff (file
variants_coords_ac_an_name.tsv
).
For regions in which no protein truncation variants were detected, we took the $N$ value averaged over the entire gene (since we considered only genes containing at least one PTV).
Stage 5. Hidden Markov model
Based on the data obtained, we built a hidden Markov model (presented below).
For the sake of simplicity, we have built a model in which only two states are possible:
conservative
(Cons) and
non-conservative
(Not) (in the future, the number of states can be increased).
As observations, we choose the allele count per window $n$ (since it is finite, we can consider this quantity discrete).
To assess conservativeness, we used an estimation of the selective effect against heterozygous PTVs ($s_{het}$) that takes into account for each region the observed value of protein truncation variants (PTV) $n$, the allele numbers $N$ and the expected mutation rate.
Stage 5.1. Search of HMM parameters.
As mentioned above, we have chosen $s_{het}$ as an estimate of conservativeness. Similarly to the work of Cassa C. 2017, we assumed the observed distribution of PTV counts across $i$-th region:
$$P(n_i | \alpha, \beta; \nu_i) = \int P(n_i|s_{het};\nu_i) P(s_{het};\alpha, \beta)ds_{het}, $$
where $\nu_i = N_iU_i$ - expected genic PTV counts.
Further, $P(n_i|s_{het};\nu_i) = Pois(n_i, \lambda_i),\ \mbox{where}\ \lambda_i=\nu_i / s_{het}$.
$P(s_{het};\alpha, \beta) = IG(s_{het};\alpha, \beta)$ is inverse Gaussian distribution with mean $\alpha$ and shape $\beta$ parameters (calculated for the gene as a whole).
Thus,
$$P(n_i | \alpha, \beta; \nu_i) = \int Pois(n_i, \nu_i / s_{het}) IG(s_{het};\alpha, \beta) ds_{het} $$
We have chosen as the emission probabilities ($e_k(n_i)$ for $n_i$ in $k$-th state) the normalized probabilities obtained by taking a certain integral (
emission_probabilities
function in
emissionprobgetter.py
file):
$$ e_k(n_i) = \frac{P(n_i | \alpha, \beta; \nu_i; a, b)}{P(n_i | \alpha, \beta; \nu_i)} = \frac{\int\limits_{a}^{b} Pois(n_i, \nu_i / s_{het}) IG(s_{het};\alpha, \beta) ds_{het}}{P(n_i | \alpha, \beta; \nu_i)},$$
where $a, b = [0, 0.01]$ for $k=Not$ and $a, b = [0.01, 1]$ for $k=Cons$. The choice of such values is also due to the results obtained in the work Cassa C. 2017.
Since we had no assumptions about the transition probabilities, we used the Baum–Welch algorithm to find transition probabilities corresponding to the maximum likelihood of the model (function
baum_welch_algo
in file
hmmalg.py
).
Stage 5.2. Decoding sequence
The decoding of the path of states was carried out using the Viterbi algorithm (function
viterbi_algo
in file
hmmalg.py
), it consists in finding a path that also meets the maximum likelihood of the model.
Results and future plans
An algorithm based on the hidden Markov model for the estimation of conservativeness of individual regions of the protein-coding sequence was invented and implemented (
conservativeness_estimator
function in file
getconservativeness.py
gathers all the previous ones and gives the result). Also, a complete pipeline for finding conservative regions for one gene was implemented with
Snakemake
v6.10.0 (Mölder F. 2021). To use it, it's needed to download and index the genome, as described above, and, specifying the desired gene name, window size and initial approximations of the transition probabilities at top of
Snakefile
, run
Snakemake
.
The algorithm was tested on set of genes: mostly conservative, mostly non-conservative, presumably possessing regions of conservative, and it solves the first two cases and not so much sensitive for the third.
There is a graph below showing the dependence of the allele counts per window for the regions of the gene ARFGEF1. In the case of a large number of alleles, the algorithm with the specified parameters copes with finding the most conservative region.
In the future, it is planned to establish whether it is possible to fix transition probabilities based on the theory of population evolution. Also change the approach to calculating the allele number for regions where no PTV is observed (the current estimate is rather rough, as well as $a$, $b$ for emission probabilities).
Requirements for the pipeline :
-
agrep tool;
-
Python v3.9.10 (the requirements for computing packages are specified in
requirements.txt
); -
samtools v1.14;
-
bedtools v2.30.0;
-
Snakemake v6.10.0.
Literature
-
Cassa, C., Weghorn, D., Balick, D. et al . Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat Genet 49, 806–810 (2017).
-
Danecek, P., Bonfield, J.K., Liddle, J., Li, H. et al . Twelve years of SAMtools and BCFtools. GigaScience , 10(2), (2021).
-
Davis, R.I. and Lovell, B.C. Comparing and evaluating HMM ensemble training algorithms using train and test and condition number criteria. Pattern Anal. Appl. 6, 4, pp. 327–336, (2003).
-
Frankish, A., Diekhans, M., Jungreis, I., et al . GENCODE 2021. Nucleic Acids Research . 49(D1):D916-D923 (2021).
-
Karczewski, K.J., Francioli, L.C., Tiao, G. et al . The mutational constraint spectrum quantified from variation in 141,456 humans. Nature , 581, pp. 434–443 (2020).
-
Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J.. Sustainable data analysis with Snakemake. F1000Res 10, 33. (2021)
-
Quinlan, A.R., Hall, I.M., BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics , 26(6), pp. 841–842 (2010).
-
Samocha, K., Robinson, E., Sanders, S. et al . A framework for the interpretation of de novo mutation in human disease. Nat Genet 46, 944–950 (2014).
-
Skitchenko, R.K., Kornienko, J.S., Maksiutenko, E.M., Glotov, A.S., Predeus, A.V., Barbitoff, Y.A. Harnessing population-specific protein truncating variants to improve the annotation of loss-of-function alleles. bioRxiv 2020.08.17.254904, (2020).
Code Snippets
3 4 5 6 | zcat ./data/gencode.v19.annotation.gff3.gz | grep -w "$1" |\ agrep -w "CCDS;CDS" | grep -v "transcript_type=retained_intron" |\ grep -P "appris_principal|appris_candidate_longest" |\ awk -v FS="\t" -v OFS="\t" '{split($9, a, ";"); print $1, $2, a[1]":"a[7]":"$7, $4-1, $5+1, $6, $7, $8, $9}' |
15 16 17 18 19 20 21 22 23 24 25 26 27 | run: link_to_genefile = f'{input.genefile}' link_to_variants_file = f'{input.variants_file}' gene_name, coords_list, u_list, n_list, mean_allele_number_list, path = conservativeness_estimator(genefilename=link_to_genefile, variantsfilename=link_to_variants_file, window_size=window_size,transitions=transitions) with open(f'{output}','w') as out_f: writer = csv.writer(out_f,delimiter='\t') writer.writerow(['Gene', 'Start', 'Stop', 'sumFreqLoF', 'sumAC', 'meanAN', 'State']) for i in range(len(coords_list)): writer.writerow([gene, coords_list[i][0], coords_list[i][1], u_list[i], n_list[i], mean_allele_number_list[i], path[i]]) |
32 | shell: 'grep -w "{gene}" {input} > {output}' |
37 | shell: "bedtools getfasta -fi {input[0]} -bed {input[1]} -name > {output}" |
41 | shell: "sh get_cds.sh {gene} > {output}" |
Support
- Future updates
Related Workflows





