Final semester project at the Bioinformatics Institute.



Author

  • Oksana Kotovskaya

Supervisors

  • Yury Barbitoff (Bioinformatics Institute)

  • Mikhail Skoblov (Research Center of Medical Genetics)

Introduction

Loss-of-function genetic variants are not observed in all genes. If a gene is under selective pressure, protein-truncating variants (PTVs) are much less common in it (Cassa C., 2017). Such genes most often have important functions, and such a catastrophic change in the protein leads to various diseases or death (Samocha K., 2014). In this work, we are interested in cases where the division of genes into conservative (i.e. under selection) and non-conservative (i.e. free from selection) becomes less clear-cut, namely cases where non-conservative regions occur within relatively conservative genes. This work is devoted to the implementation of an algorithm for finding such sequences.

Goal: to estimate the evolutionary conservation of individual regions within a single ORF.

The key task towards this goal was to implement an algorithm based on a hidden Markov model (HMM) that determines the conservation of individual regions of a protein-coding sequence (CDS).

Data

In this work, the GRCh37/hg19 assembly of the human genome was used (Frankish A., 2021). The sequence (.fa) and annotation (.gff3) files were obtained from GENCODE.
Data on existing human PTVs were provided by Yury Barbitoff (Skitchenko R.K., 2020) and were taken from the Genome Aggregation Database (gnomAD) v2.1.1 (Karczewski K.J. et al., 2020).
Mutation rates in the trinucleotide context were also taken from the work of Karczewski K.J. et al. (2020).

Workflow

First, we needed to obtain data on the coding DNA sequence (CDS) for the gene of interest: the nucleotide sequence itself, the CDS coordinates, and information about the observed PTVs for this gene. The data extraction scheme is presented below.

[Figure: data extraction scheme]

Stage 1. Sequence extraction

We started by extracting the CDS coordinates from the .gff3 file, using a combination of grep, agrep and awk (see get_cds.sh). To calculate the mutation rate in the trinucleotide context (described below), we also needed the nucleotides immediately before and after each coding DNA sequence (CDS) region.
After that, we indexed the .fa file with samtools faidx (Danecek P., 2021):

samtools faidx ./data/GRCh37.p13.genome.fa

The indexed file was intersected with the previously obtained coordinates to extract the sequences with bedtools getfasta (Quinlan A.R., 2010):

bedtools getfasta -fi ./data/GRCh37.p13.genome.fa -bed {gene_name}.gff -name > {gene}.fa

In this way, we obtained the DNA sequences necessary to calculate the expected probability of PTV.

Stage 2. Computation of the PTV mutation rate per codon in a trinucleotide context

The next step is to calculate the mutation rate in the trinucleotide context for each codon. According to Samocha K., 2014, the best context for determining the variability of a single nucleotide includes both the 5' and 3' flanking nucleotides. We used the mutation frequency of each possible substitution (e.g. G, T or C for a reference A), as specified in the Supplementary Materials of Karczewski K.J., 2020.
Thus, for each nucleotide we considered the three possible substitutions in its trinucleotide context. At the same time, we treated the base as part of its codon and determined what the variant leads to: a synonymous, missense or nonsense mutation (the frequencies_count function in freqcounter.py).
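To make this step concrete, below is a minimal sketch of the idea (not the project's frequencies_count from freqcounter.py); the function names, the use of Biopython for codon translation, and the context_rates lookup table are illustrative assumptions:

# Illustrative sketch: classify each possible single-nucleotide substitution
# within a codon and sum the rates of nonsense-producing changes using a
# trinucleotide-context rate table.
from Bio.Seq import Seq  # Biopython is assumed to be available

def classify_substitution(codon: str, pos: int, alt: str) -> str:
    """Return 'synonymous', 'missense' or 'nonsense' for one base change."""
    alt_codon = codon[:pos] + alt + codon[pos + 1:]
    ref_aa, alt_aa = str(Seq(codon).translate()), str(Seq(alt_codon).translate())
    if alt_aa == "*" and ref_aa != "*":
        return "nonsense"
    return "synonymous" if alt_aa == ref_aa else "missense"

def codon_ptv_rate(seq: str, codon_start: int, context_rates: dict) -> float:
    """Sum the context-dependent rates of substitutions that create a stop codon.

    `seq` is assumed to include the flanking bases before and after the CDS
    (so codon_start >= 1); `context_rates` is a hypothetical mapping
    {(reference_trinucleotide, alt_base): mutation_rate} built from the
    supplementary table of Karczewski et al. (2020).
    """
    codon = seq[codon_start:codon_start + 3]
    total = 0.0
    for i in range(3):
        trinuc = seq[codon_start + i - 1:codon_start + i + 2]  # 5' and 3' flanks
        for alt in set("ACGT") - {codon[i]}:
            if classify_substitution(codon, i, alt) == "nonsense":
                total += context_rates.get((trinuc, alt), 0.0)
    return total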

Stage 3. Computation of PTV mutation rate per window

Next, we divided the gene into regions of fixed length, using a scanning window (the frequencies_per_window function in freqcounter.py). We did not fix the window size for all genes, because gene lengths vary greatly, and with too small a window it becomes difficult to resolve long non-conservative regions (if there are any). For each window, the sum of the expected frequencies was calculated (an example for the gene ARFGEF1 is presented below).

[Figure: expected per-window mutation rates for the gene ARFGEF1]

The resulting distribution of mutation rates matches expectations: the rate of missense mutations is higher than that of synonymous variants, while loss-of-function (LoF) mutations have the lowest rate of all. For further analysis, we needed the LoF mutation rates ($U$).
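A minimal sketch of the windowing step (the actual implementation is frequencies_per_window in freqcounter.py; the helper below assumes, for illustration, that per-codon expected rates are already available as a list):

# Illustrative sketch: sum per-codon expected PTV rates within consecutive
# fixed-length windows along the CDS; the last window may be shorter.
def rates_per_window(codon_rates, window_size):
    windows = []
    for start in range(0, len(codon_rates), window_size):
        chunk = codon_rates[start:start + window_size]
        windows.append(((start, start + len(chunk)), sum(chunk)))
    return windows

# Example: 10 codons with a window of 4 give windows of sizes 4, 4 and 2.
print(rates_per_window([1e-8] * 10, window_size=4))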

Stage 4. Counting observed PTVs

Next, we calculated the sum of the observed variants (allele count, $n$) for each window, as well as the average allele number ($N$), using the observed_ptv function in ploffinder.py. A .tsv file with high-confidence pLoF variants was provided by Yury Barbitoff (variants_coords_ac_an_name.tsv).
For regions in which no protein-truncating variants were detected, we took the $N$ value averaged over the entire gene (since we considered only genes containing at least one PTV).
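For illustration, counting the observed variants per window could look like the sketch below (the project's actual function is observed_ptv in ploffinder.py; the column names used here are assumptions about variants_coords_ac_an_name.tsv):

# Illustrative sketch: sum allele counts (AC) of high-confidence pLoF variants
# falling into each window and average the allele number (AN), falling back to
# the gene-wide mean AN for windows without any PTV.
import pandas as pd  # assumed to be available in the pipeline environment

def observed_ptv_per_window(variants_tsv, gene, windows, gene_mean_an):
    df = pd.read_csv(variants_tsv, sep="\t")
    df = df[df["gene"] == gene]  # column names are assumptions
    allele_counts, allele_numbers = [], []
    for start, stop in windows:
        hits = df[(df["pos"] >= start) & (df["pos"] < stop)]
        allele_counts.append(int(hits["AC"].sum()))
        allele_numbers.append(float(hits["AN"].mean()) if len(hits) else gene_mean_an)
    return allele_counts, allele_numbers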

Stage 5. Hidden Markov model

Based on the data obtained, we built a hidden Markov model (presented below).
For simplicity, we built a model in which only two states are possible: conservative (Cons) and non-conservative (Not); in the future, the number of states can be increased.

As observations, we chose the allele count per window, $n$ (since it is finite, we can treat this quantity as discrete).

To assess conservation, we used an estimate of the selective effect against heterozygous PTVs ($s_{het}$) that takes into account, for each region, the observed number of protein-truncating variants $n$, the allele number $N$ and the expected mutation rate.

[Figure: hidden Markov model for the evolutionary conservativeness of ORF regions]

Stage 5.1. Search for HMM parameters

As mentioned above, we chose $s_{het}$ as an estimate of conservation. Similarly to the work of Cassa C., 2017, we assumed the following distribution of the observed PTV count in the $i$-th region:

$$P(n_i | \alpha, \beta; \nu_i) = \int P(n_i|s_{het};\nu_i) P(s_{het};\alpha, \beta)ds_{het}, $$
where $\nu_i = N_i U_i$ is the expected PTV count for the $i$-th region.

Further, $P(n_i|s_{het};\nu_i) = Pois(n_i, \lambda_i),\ \mbox{where}\ \lambda_i=\nu_i / s_{het}$.
$P(s_{het};\alpha, \beta) = IG(s_{het};\alpha, \beta)$ is the inverse Gaussian distribution with mean $\alpha$ and shape $\beta$ (calculated for the gene as a whole). Thus,

$$P(n_i | \alpha, \beta; \nu_i) = \int Pois(n_i, \nu_i / s_{het}) IG(s_{het};\alpha, \beta) ds_{het} $$

As the emission probabilities ($e_k(n_i)$ for observation $n_i$ in the $k$-th state), we chose the normalized probabilities obtained by taking a definite integral (the emission_probabilities function in the emissionprobgetter.py file):

$$ e_k(n_i) = \frac{P(n_i | \alpha, \beta; \nu_i; a, b)}{P(n_i | \alpha, \beta; \nu_i)} = \frac{\int\limits_{a}^{b} Pois(n_i, \nu_i / s_{het}) IG(s_{het};\alpha, \beta) ds_{het}}{P(n_i | \alpha, \beta; \nu_i)},$$

where $[a, b] = [0, 0.01]$ for $k=Not$ and $[a, b] = [0.01, 1]$ for $k=Cons$. The choice of these values also follows from the results of Cassa C., 2017.
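A sketch of how such emission probabilities can be computed by numerical integration (the project's implementation is emission_probabilities in emissionprobgetter.py; the SciPy-based code below is an assumption, not the actual source):

# Illustrative sketch: integrate Pois(n, nu / s_het) * IG(s_het; alpha, beta)
# over [a, b] and normalize by the integral over the full range of s_het.
# An IG with mean alpha and shape beta corresponds to
# scipy.stats.invgauss(mu=alpha / beta, scale=beta).
from scipy import stats, integrate

def emission_probability(n, nu, alpha, beta, a, b, eps=1e-6):
    def integrand(s_het):
        lam = nu / s_het  # Poisson rate under mutation-selection balance
        return stats.poisson.pmf(n, lam) * stats.invgauss.pdf(s_het, alpha / beta, scale=beta)
    numerator, _ = integrate.quad(integrand, max(a, eps), b)  # eps avoids s_het = 0
    denominator, _ = integrate.quad(integrand, eps, 1.0)
    return numerator / denominator

# Example: emission probabilities of n = 3 in the two states.
e_not = emission_probability(3, nu=1.5, alpha=0.05, beta=0.2, a=0.0, b=0.01)
e_cons = emission_probability(3, nu=1.5, alpha=0.05, beta=0.2, a=0.01, b=1.0)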

Since we had no prior assumptions about the transition probabilities, we used the Baum–Welch algorithm to find the transition probabilities that maximize the likelihood of the model (the baum_welch_algo function in the hmmalg.py file).
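Below is a compact sketch of such a Baum–Welch step in which only the transition matrix is re-estimated while the per-window emission probabilities stay fixed (the actual baum_welch_algo in hmmalg.py may be organized differently):

# Illustrative sketch: scaled forward-backward recursions and re-estimation of
# the transition matrix for a two-state HMM with precomputed emissions.
import numpy as np

def baum_welch_transitions(emissions, trans, start, n_iter=50):
    """emissions: (T, K) matrix of e_k(n_t); trans: (K, K) initial transition
    matrix; start: initial state distribution. Returns the updated matrix."""
    T, K = emissions.shape
    trans = trans.copy()
    for _ in range(n_iter):
        # Forward pass with per-step scaling to avoid underflow.
        alpha = np.zeros((T, K)); scale = np.zeros(T)
        alpha[0] = start * emissions[0]
        scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = emissions[t] * (alpha[t - 1] @ trans)
            scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
        # Backward pass using the same scaling factors.
        beta = np.zeros((T, K)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (trans @ (emissions[t + 1] * beta[t + 1])) / scale[t + 1]
        # Expected transition counts (xi) and state occupancies (gamma).
        xi = np.zeros((K, K))
        for t in range(T - 1):
            xi += (alpha[t][:, None] * trans *
                   (emissions[t + 1] * beta[t + 1])[None, :]) / scale[t + 1]
        gamma = alpha * beta
        trans = xi / gamma[:-1].sum(axis=0)[:, None]
    return trans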

Stage 5.2. Decoding the state sequence

The decoding of the state path was carried out with the Viterbi algorithm (the viterbi_algo function in the hmmalg.py file), which finds the most probable sequence of hidden states given the model.
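For reference, a log-space Viterbi decoder for the same setting might look like this (the project's viterbi_algo in hmmalg.py may differ in interface):

# Illustrative sketch: Viterbi decoding in log space for a two-state HMM with
# precomputed per-window emission probabilities.
import numpy as np

def viterbi(emissions, trans, start):
    """Return the most probable sequence of state indices (e.g. 0 = Cons, 1 = Not)."""
    T, K = emissions.shape
    log_e, log_a, log_pi = np.log(emissions), np.log(trans), np.log(start)
    delta = np.zeros((T, K))            # best log-probability ending in each state
    psi = np.zeros((T, K), dtype=int)   # backpointers
    delta[0] = log_pi + log_e[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a  # scores[i, j]: from state i to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_e[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]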

Results and future plans

An algorithm based on a hidden Markov model for estimating the conservation of individual regions of a protein-coding sequence was designed and implemented (the conservativeness_estimator function in getconservativeness.py combines all the previous steps and produces the result). A complete pipeline for finding conservative regions of a single gene was also implemented with Snakemake v6.10.0 (Mölder F., 2021). To use it, one needs to download and index the genome as described above, specify the desired gene name, window size and initial approximations of the transition probabilities at the top of the Snakefile, and run Snakemake.

The algorithm was tested on a set of genes: mostly conservative, mostly non-conservative, and genes presumably containing conservative regions. It resolves the first two cases well, but is less sensitive in the third.

The graph below shows the allele counts per window along the gene ARFGEF1. When the number of alleles is large, the algorithm with the specified parameters successfully finds the most conservative region.

[Figure: per-window allele counts for the gene ARFGEF1]

In the future, we plan to establish whether the transition probabilities can be fixed based on the theory of population evolution. We also plan to change the approach to calculating the allele number for regions where no PTV is observed (the current estimate is rather rough), and to revisit the values of $a$ and $b$ used for the emission probabilities.

Requirements for the pipeline:

  • agrep tool;

  • Python v3.9.10 (the requirements for computing packages are specified in requirements.txt);

  • samtools v1.14;

  • bedtools v2.30.0;

  • Snakemake v6.10.0.

Literature

  • Cassa, C., Weghorn, D., Balick, D. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat Genet 49, 806–810 (2017).

  • Danecek, P., Bonfield, J.K., Liddle, J., Li, H. et al. Twelve years of SAMtools and BCFtools. GigaScience, 10(2) (2021).

  • Davis, R.I. and Lovell, B.C. Comparing and evaluating HMM ensemble training algorithms using train and test and condition number criteria. Pattern Anal. Appl. 6, 4, pp. 327–336 (2003).

  • Frankish, A., Diekhans, M., Jungreis, I., et al. GENCODE 2021. Nucleic Acids Research, 49(D1), D916–D923 (2021).

  • Karczewski, K.J., Francioli, L.C., Tiao, G. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581, pp. 434–443 (2020).

  • Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J. Sustainable data analysis with Snakemake. F1000Res 10, 33 (2021).

  • Quinlan, A.R., Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), pp. 841–842 (2010).

  • Samocha, K., Robinson, E., Sanders, S. et al. A framework for the interpretation of de novo mutation in human disease. Nat Genet 46, 944–950 (2014).

  • Skitchenko, R.K., Kornienko, J.S., Maksiutenko, E.M., Glotov, A.S., Predeus, A.V., Barbitoff, Y.A. Harnessing population-specific protein truncating variants to improve the annotation of loss-of-function alleles. bioRxiv 2020.08.17.254904 (2020).

Code Snippets

zcat ./data/gencode.v19.annotation.gff3.gz | grep -w "$1" |\
 agrep -w "CCDS;CDS" |  grep -v "transcript_type=retained_intron" |\
  grep -P "appris_principal|appris_candidate_longest" |\
   awk -v FS="\t" -v OFS="\t" '{split($9, a, ";"); print $1, $2, a[1]":"a[7]":"$7, $4-1, $5+1, $6, $7, $8, $9}'
Shell From line 3 of main/get_cds.sh
run:
    link_to_genefile = f'{input.genefile}'
    link_to_variants_file = f'{input.variants_file}'
    gene_name, coords_list, u_list, n_list, mean_allele_number_list, path = conservativeness_estimator(genefilename=link_to_genefile,
                                              variantsfilename=link_to_variants_file,
                                               window_size=window_size,transitions=transitions)
    with open(f'{output}','w') as out_f:
        writer = csv.writer(out_f,delimiter='\t')
        writer.writerow(['Gene', 'Start', 'Stop', 'sumFreqLoF', 'sumAC',
                         'meanAN', 'State'])
        for i in range(len(coords_list)):
            writer.writerow([gene, coords_list[i][0], coords_list[i][1], u_list[i], n_list[i],
                             mean_allele_number_list[i], path[i]])
SnakeMake From line 15 of main/Snakefile
shell: 'grep -w "{gene}" {input} > {output}'
SnakeMake From line 32 of main/Snakefile

shell: "bedtools getfasta -fi {input[0]} -bed {input[1]} -name > {output}"
SnakeMake From line 37 of main/Snakefile

shell: "sh get_cds.sh {gene} > {output}"
SnakeMake From line 41 of main/Snakefile

URL: https://github.com/ombystoma-young/evolutionary-constraint-within-orf