Accounting for Epistasis in PRSs Through the Coalescent

P. Fournier & F. Larribe (STATQAM — UQAM)

New Statistical Methods in Genetic Studies

June 2nd, 2022

PRS & Epistasis

Assumptions

Weight estimation is based on linear models

Assumes (among other things):

Additivity
Linearity

Epistasis?

Gene-Gene interaction
Major difficulty in the analysis of GWAS data [6]
Epistasis-aware models are possible; however, naïve ones are intractable ($\mathcal O(2^m)$ terms)
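
To make the count concrete, here is a small sketch under one reading of the bound: a model with one coefficient per subset of $m$ markers has $2^m$ terms, against $1 + m + \binom{m}{2}$ when interactions stop at pairs. The helper `n_model_terms` is a hypothetical name introduced only for this illustration.

```python
from math import comb
from typing import Optional


def n_model_terms(m: int, max_order: Optional[int] = None) -> int:
    """Count the terms (intercept, main effects, interactions) over m markers.

    With max_order=None every subset of markers gets its own coefficient,
    which gives 2**m terms in total.
    """
    top = m if max_order is None else min(max_order, m)
    return sum(comb(m, k) for k in range(top + 1))


# Pairwise-only versus full interaction models for a few marker counts.
for m in (10, 20, 40):
    print(m, n_model_terms(m, max_order=2), n_model_terms(m))
```

Already at $m = 40$ the full model has about $10^{12}$ terms, while the pairwise model has 821.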

That being said,

Some phenotypes are simple
Some forms of epistasis might be reflected in additive effects [7]

Epistasis-aware models

Interaction learning [8, 9]
Machine learning

For machine learning:

Marker selection is "the major factor that impacts on a machine learning model’s predictive performance" [10]
The mechanism through which markers affect the phenotype might not be known

Model-Free Genotype Based Prediction

Overview

Goal: compute the likelihood of phenotype $\varphi^*$ given the underlying genotype $h_0^*$.

Exploit information from a sample of paired haplotypes and phenotypes: $$ H_0 \bigtriangleup \Phi = \left\lbrace (h_0^1, \varphi_1), \ldots, (h_0^n, \varphi_n) \right\rbrace. $$

A bit of notation: $$ H_0^* = H_0 \cup \lbrace h_0^* \rbrace, \quad \Phi^* = \Phi \cup \lbrace \varphi^* \rbrace $$

Likelihood

Assuming unrelatedness, $$ L(\varphi^* | h_0^*, H_0, \Phi) \propto f(H_0^*, \Phi^*) $$

The law of total probability allows the introduction of evolutionary history in the form of genealogies: $$ f(H_0^*, \Phi^*) = \int_{\color{#4d7e65} \mathcal G} f(H_0^*, \Phi^* | {\color{#4d7e65} G}) g({\color{#4d7e65} G}) \text d{\color{#4d7e65} G} $$

Assumption: Conditional Independence

$$ (H_0^* | {\color{#4d7e65} G}) \perp (\Phi^* | {\color{#4d7e65} G}) $$

# Likelihood: Conditional Independence
Assuming conditional independence,
$$
f(H_0^*, \Phi^*)
= \int_{\mathcal G} f(H_0^* | G) f(\Phi^* | G) g(G) \text d G
$$
---
# Consistency
- A genealogy $G$ is said to be consistent with a set of sequences $H_0$ if the sample induced by $G$ is precisely $H_0$
- A given genealogy is consistent with **exactly 1** sample

Thus,
$$
f(H_0^* | G) = \mathbb I(G \text{ consistent with } H_0^*).
$$
---
# Likelihood: Sample Consistency
Let $\mathcal G_{H_0^*} \subseteq \mathcal G$ be the set of genealogies consistent with $H_0^*$. The likelihood simplifies to
$$
\begin{align*}
L(\varphi^*| H_0^*, \Phi) \propto f(H_0^*, \Phi^*)
&= \int_{\mathrlap{\mathcal G_{H_0^*}}} f(\Phi^* | G) g(G) \text d G\\\\
&= \int_{\mathrlap{\mathcal G_{H_0^*}}} f(\varphi^*, \Phi | G) g(G) \text d G
\end{align*}
$$

Notes:
- The last expression is useful for computation.
- Conclusion:
  - We avoid having to make assumptions about the mode of action of genotype on phenotype.
  - On the other hand, we have to specify
    1. a distribution for genealogies;
    2. a conditional distribution for phenotypes.
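
As a rough illustration of the last integral, the sketch below estimates $\int_{\mathcal G_{H_0^*}} f(\Phi^* | G) g(G) \text d G$ by plain Monte Carlo: draw genealogies from $g$, keep the consistent ones, and average the phenotype density. Everything here is an assumption made for illustration: the function name `mc_joint_density`, its three callbacks, and the toy stand-ins, which are not a coalescent model. Naive sampling like this would rarely hit a consistent genealogy on real data, so it only shows the structure of the computation.

```python
import random
from typing import Callable, TypeVar

Genealogy = TypeVar("Genealogy")


def mc_joint_density(
    sample_genealogy: Callable[[random.Random], Genealogy],
    is_consistent: Callable[[Genealogy], bool],
    phenotype_density: Callable[[Genealogy], float],
    n_draws: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate the integral of f(Phi* | G) g(G) over consistent genealogies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        g = sample_genealogy(rng)          # draw G from the genealogy law g
        if is_consistent(g):               # indicator playing the role of f(H_0* | G)
            total += phenotype_density(g)  # f(Phi* | G)
    return total / n_draws


# Toy stand-ins (hypothetical): a "genealogy" is a number in [0, 1),
# half of them count as consistent, and the phenotype density is constant.
estimate = mc_joint_density(
    sample_genealogy=lambda rng: rng.random(),
    is_consistent=lambda g: g < 0.5,
    phenotype_density=lambda g: 0.2,
)
print(estimate)  # the exact value of this toy integral is 0.5 * 0.2 = 0.1
```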

Computing the Likelihood

Context

Discrete phenotype: $\Phi^* \in \lbrace 0, 1 \rbrace^{n + 1}$
Quantity of interest: $L(\varphi^* = 1 | H_0^*, \Phi)$
No recombination between causal markers
$G \sim$ ARG [11]
$(\Phi^* | G) \sim$ ???

First question: how can the exact likelihood (not just up to a constant) be computed?

# Exact Likelihood
Efficient computation of the normalisation constant (Tonelli):
$$
\begin{align*}
f(H_0^*, \Phi)
&= {\color{#e65100} \int_{\mathrlap{\Omega_{\varphi^*}}}} f(\varphi^*, H_0^*, \Phi) \\;{\color{#e65100}\text d \varphi^*}\\\\
&= {\color{#e65100} \int_{\Omega_{\varphi^*}}} {\color{#9558b2} \int_{\mathrlap{\mathcal G_{H_0^*}}}} f(\varphi^*, \Phi | G) g(G) \\;{\color{#9558b2} \text d G} \\;{\color{#e65100} \text d \varphi^*}\\\\
&= {\color{#9558b2} \int_{\mathcal G_{H_0^*}}} {\color{#e65100} \int_{\mathrlap{\Omega_{\varphi^*}}}} f(\varphi^*, \Phi | G) g(G) \\;{\color{#e65100} \text d \varphi^*} \\;{\color{#9558b2} \text d G}
\end{align*}
$$

Notes:
- The integral w.r.t. $\varphi^*$ is often easier to compute than the one w.r.t. $G$. Example:
  - Binary trait => sum of 2 terms;
  - ARG => numerical integration.
- The joint density must be well-defined.
- Always works for discrete phenotypes (dominated convergence).
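
For a binary trait the swapped order of integration can be sketched as follows: the two-term sum over $\varphi^*$ is evaluated inside the pass over genealogies, so numerator and normalising constant are accumulated together. This is a minimal sketch under stated assumptions: `exact_likelihood` is a hypothetical name, the genealogies are assumed to be already consistent with $H_0^*$, and `weights` stands in for whatever quadrature or sampling weights approximate $g(G) \text d G$.

```python
from typing import Callable, Hashable, Sequence


def exact_likelihood(
    genealogies: Sequence[Hashable],
    weights: Sequence[float],
    cond_density: Callable[[int, Hashable], float],
) -> float:
    """Return L(phi* = 1 | H_0*, Phi) for a binary phenotype.

    cond_density(phi_star, g) plays the role of f(phi*, Phi | G = g).
    """
    numerator = 0.0
    normaliser = 0.0
    for g, w in zip(genealogies, weights):
        f1 = cond_density(1, g)
        f0 = cond_density(0, g)
        numerator += w * f1
        normaliser += w * (f0 + f1)  # inner sum over phi* in {0, 1}
    if normaliser <= 0.0:
        raise ValueError("joint density is zero on the supplied genealogies")
    return numerator / normaliser


# Toy example with two equally weighted genealogies (made-up densities).
table = {("a", 1): 0.2, ("a", 0): 0.2, ("b", 1): 0.6, ("b", 0): 0.0}
likelihood = exact_likelihood(["a", "b"], [0.5, 0.5], lambda phi, g: table[(g, phi)])
print(likelihood)  # (0.1 + 0.3) / (0.2 + 0.3) = 0.8 up to rounding
```
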
---
# Sample genealogy
![Example genealogy](assets/example.svg)

Conditional Density of the Phenotypes

For each marginal tree $T_i$, we compute the marginal density $f(\varphi^* = 1, \Phi | T_i)$.

Select tree $T^*$ based on absolute pointwise mutual information:

$$ T^* = \argmax_T \left\vert \text{pmi}(\Phi^*, T) \right\vert, \qquad \text{pmi}(\Phi^*, T) = \log \frac{f(\Phi^* | T)}{f(\Phi^*)} $$
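
The selection step can be sketched in a few lines, assuming the usual definition of pointwise mutual information as the log-ratio of the conditional density to the marginal one, and assuming all densities are strictly positive. The function name `select_tree` and the numbers in the example are made up for illustration.

```python
import math
from typing import Dict, Hashable


def select_tree(cond_densities: Dict[Hashable, float], marginal_density: float) -> Hashable:
    """Return the tree whose |pmi| with the phenotypes is largest.

    cond_densities maps each marginal tree T to f(Phi* | T);
    marginal_density is f(Phi*). Both are assumed strictly positive.
    """
    def abs_pmi(tree: Hashable) -> float:
        return abs(math.log(cond_densities[tree] / marginal_density))

    return max(cond_densities, key=abs_pmi)


# Hypothetical densities for three marginal trees.
densities = {"T1": 0.010, "T2": 0.002, "T3": 0.012}
best = select_tree(densities, marginal_density=0.008)
print(best)  # T2: its density is furthest from the marginal on the log scale
```

Taking the absolute value means that a tree under which the observed phenotypes are unusually unlikely is treated as just as informative as one under which they are unusually likely.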

Conditional Density II

$f(\Phi | T_i) \rightarrow f(\Phi)$ as ${\text{TMRCA}(T_i) \rightarrow \infty}$
$\varphi_k \sim \mathcal B(p)$, $\Phi \sim \mathcal B(n, p)$

Assume conditional independence given the parent node:

$$ f(\varphi_k | T_i, \Phi \setminus \lbrace \varphi_k \rbrace) = f(\varphi_k | p_{T_i}(k), \Phi\vert_{p_{T_i}(k)}) $$

where $p_{T_i}(k)$ is the parent of sequence $k$ in $T_i$ and ${\Phi\vert_x =\lbrace \varphi \in \Phi : \varphi \text{ descendant of } x \rbrace}$.
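
The restriction $\Phi\vert_{p_{T_i}(k)}$ can be computed directly from parent pointers. The sketch below assumes a tree stored as a `parent` dictionary (root mapped to `None`) and phenotypes stored per leaf; both layouts and the name `phenotypes_under_parent` are hypothetical. It also excludes $\varphi_k$ itself, matching the conditioning on $\Phi \setminus \lbrace \varphi_k \rbrace$.

```python
from typing import Dict, Hashable, Optional


def phenotypes_under_parent(
    parent: Dict[Hashable, Optional[Hashable]],
    phenotypes: Dict[Hashable, int],
    k: Hashable,
) -> Dict[Hashable, int]:
    """Phenotypes of the leaves (other than k) descending from the parent of k."""
    ancestor = parent[k]
    selected = {}
    for leaf, value in phenotypes.items():
        if leaf == k:
            continue
        node = parent[leaf]
        while node is not None:      # walk up from the leaf towards the root
            if node == ancestor:
                selected[leaf] = value
                break
            node = parent[node]
    return selected


# Toy tree: leaves 1 and 2 coalesce at "a"; "a" and leaf 3 coalesce at root "r".
parent = {1: "a", 2: "a", 3: "r", "a": "r", "r": None}
phenotypes = {1: 1, 2: 0, 3: 1}
print(phenotypes_under_parent(parent, phenotypes, 1))  # {2: 0}
print(phenotypes_under_parent(parent, phenotypes, 3))  # {1: 1, 2: 0}
```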

Conditional Density III

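$$ \begin{align*} &f(\Phi^* | T^* )\\ &\quad = f(\varphi^* | \Phi, T^*) \prod_{k = 1}^n f(\varphi_k| \Phi_{k - 1}, T^*)\\ &\quad = f(\varphi^* | p_{T^*}(*), \Phi\vert_{p_{T^*}(*)}) \prod_{k = 1}^n f(\varphi_k | p_{T^*}(k), \Phi\vert_{p_{T^*}(k)}) \end{align*} $$

Here $\Phi_{k - 1} = \lbrace \varphi_1, \ldots, \varphi_{k - 1} \rbrace$ (with $\Phi_0 = \emptyset$): the first equality is the chain rule and the second applies the conditional independence assumption.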

Conditional Density: Single Phenotype

$\alpha: \mathbb R_+ \to [0, 1]$ strictly increasing, with $\alpha(0) = 0$ and $\alpha(t) \to 1$ as $t \to \infty$
$t_k = \text{TMRCA}(\Phi\vert_{p_{T_i}(k)})$
$h$: U-shaped beta-binomial mass function

$$ f(\varphi_k | t_k, \Phi\vert_{p_{T_i}(k)}) = \alpha(t_k) f(\Phi) + (1 - \alpha(t_k)) h(\Phi). $$
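
A numerical sketch of this mixture follows, under several assumptions that are not in the slides: $\alpha(t) = 1 - e^{-\lambda t}$ is one possible choice satisfying the stated conditions, the densities are read as mass functions of the number of cases `k` among `n` phenotypes, and the beta-binomial parameters `a = b = 0.5` are picked only because values below 1 give a U shape. All function names are hypothetical.

```python
import math


def alpha(t: float, rate: float = 1.0) -> float:
    """Assumed mixing weight: 0 at t = 0, increasing to 1 as t grows."""
    return 1.0 - math.exp(-rate * t)


def log_beta(a: float, b: float) -> float:
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_pmf(k: int, n: int, p: float) -> float:
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """Beta-binomial mass function; U-shaped when a < 1 and b < 1."""
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def mixture_pmf(k: int, n: int, t: float, p: float,
                a: float = 0.5, b: float = 0.5, rate: float = 1.0) -> float:
    """alpha(t) * binomial + (1 - alpha(t)) * beta-binomial."""
    w = alpha(t, rate)
    return w * binomial_pmf(k, n, p) + (1.0 - w) * beta_binomial_pmf(k, n, a, b)


# Recent ancestor (small t): mass piles up at "all controls" or "all cases".
# Distant ancestor (large t): the mixture approaches the binomial.
for t in (0.0, 1.0, 10.0):
    print(t, [round(mixture_pmf(k, 10, t, p=0.3), 3) for k in (0, 5, 10)])
```

The intent captured here is that sequences sharing a recent ancestor tend to share a phenotype, while for a distant ancestor the tree carries little information and the marginal model takes over.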

References

1. Crouch, D. J. M., & Bodmer, W. F. (2020). Polygenic inheritance, GWAS, polygenic risk scores, and the search for functional variants. Proceedings of the National Academy of Sciences of the United States of America, 117(32), 18924–18933. https://doi.org/10.1073/pnas.2005634117
2. Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics, 157(4), 1819–1829. https://doi.org/10.1093/GENETICS/157.4.1819
3. Guindo-Martínez, M., et al. (2021). The impact of non-additive genetic associations on age-related complex diseases. Nature Communications, 12(1), 1–14. https://doi.org/10.1038/s41467-021-21952-4
4. Pozarickij, A., Williams, C., & Guggenheim, J. A. (2020). Non-additive (dominance) effects of genetic variants associated with refractive error and myopia. Molecular Genetics and Genomics, 295(4), 843. https://doi.org/10.1007/S00438-020-01666-W
5. Non Additive Genetic Effects Portal - Home. (n.d.). Retrieved May 25, 2022, from https://nage.hugeamp.org/
6. Furlong, L. I. (2013). Human diseases through the lens of network biology. Trends in Genetics, 29(3), 150–159. https://doi.org/10.1016/J.TIG.2012.11.004
7. Mäki-Tanila, A., & Hill, W. G. (2014). Influence of Gene Interaction on Complex Trait Variation with Multilocus Models. Genetics, 198(1), 355–367. https://doi.org/10.1534/GENETICS.114.165282
8. Massi, M. C., Franco, N. R., Ieva, F., Manzoni, A., Paganoni, A. M., & Zunino, P. (2020). High-Order Interaction Learning via Targeted Pattern Search. MOX Report 59/2020. Retrieved May 25, 2022, from https://www.mate.polimi.it/biblioteca/add/qmox/59-2020.pdf
9. Franco, N. R., Massi, M. C., et al. (2021). Development of a method for generating SNP interaction-aware polygenic risk scores for radiotherapy toxicity. Radiotherapy and Oncology, 159, 241–248. https://doi.org/10.1016/j.radonc.2021.03.024
10. Ho, D. S. W., Schierding, W., Wake, M., Saffery, R., & O’Sullivan, J. (2019). Machine learning SNP based prediction for precision medicine. Frontiers in Genetics, 10(MAR), 267. https://doi.org/10.3389/fgene.2019.00267
11. Griffiths, R. C., & Marjoram, P. (1996). An ancestral recombination graph. In P. Donnelly & S. Tavaré (Eds.), IMA Volume on Mathematical Population Genetics (pp. 257–270). Springer-Verlag, New York.