This post is to mark the end of GSoC 2022, but marks just the beginning of the project. As per the work submission guidelines, here are my contributions. The goal of my project is build submodules for conditional autoregressive models. I have worked on many aspects that will benefit the submodule, but have no pull requests that will be merged into the main branch by the end of the summer.

1 The gap - why we need conditional autoregressive models in PyMC

Conditional autoregressive (CAR) models are a class of spatial and spatio-temporal models that are commonly used in areas such as disease mapping and for inference within large datasets. They are amenable to use with large datasetys due to the often low-order "autoregressive" structure used to model the dependence between observations. For example, within a spatial CAR model, the spatial dependence of an area on other areas is often modelled as dependent on only the direct neighbours of that area. This induces a sparsity in the precision matrix between these spatial effects, ensuring more efficient inference than if the precision matrix was dense. Currently, packages such as CARBayesST, Nimble, and INLA support a wide variety of CAR models. However, CARBayesST and Nimble use samplers such as random-walk MCMC, which often suffer from either poor convergence or low effective sample size. INLA uses a series of numerical optimization routines to offer a fast approximation to the posterior, yet can be known to occassionally have poor accuracy.

2 The models

The CAR prior models the spatial dependence observed within an areal data set such that the random effect $\phi_i$ for area $i$ has the following conditional prior distribution $$ \phi_i | \boldsymbol{\phi}_{-i} \sim \mathcal{N} \left( \alpha\frac{\sum_{j \sim i} \phi_j}{n_i}, \frac{\tau^2}{n_i} \right), $$ where $j\sim i$ indicates the all neighbours of $i$, $n_i$ is the total number of neighbours for area $i$, and $\alpha\in(0, 1)$, $\tau^2$ are free. In practice the parameter $\alpha$ is difficult to estimate and the CAR prior is rarely used as part of a more complex statistical model.

The ICAR prior is equivalent to the CAR prior as $\alpha \rightarrow 1$. The ICAR models the spatial dependence observed within an areal data set such that the random effect $\phi_i$ for area $i$ has the following conditional prior distribution $$ \phi_i | \boldsymbol{\phi}_{-i} \sim \mathcal{N} \left( \frac{\sum_{j \sim i} \phi_j}{n_i}, \frac{\tau^2}{n_i} \right). $$ As the ICAR random effect captures variation spatial in nature, models that use the ICAR component often use an additional independent random effect vector $\boldsymbol{\epsilon}$ to capture added variation not explain by the ICAR component. This type of model is called the BYM model.

As there are two random effects for each area within a BYM model, the individual random effects are unidentifiable, and thus only the sum of the two random effects are identifiable. An alternative approach that circumvents this unidentifiability is the Leroux prior, denoted as $$ \phi_i | \boldsymbol{\phi}_{-i} \sim \mathcal{N} \left( \frac{\rho \sum_{j \sim i} \phi_j}{1 - \rho + \rho n_i}, \frac{\tau^2}{1 - \rho + \rho n_i} \right)\\ \rho \sim \text{Uniform}(0,1), $$ where $\rho$ is a scaling parameter such that when $\rho=0$, the random effects are independent, and when $\rho=1$, the Leroux prior is equivalent to the ICAR component.

Alternative approaches that look to circumvent the unidentifiability of the BYM modelling approach include a reparameterization called the BYM2 model.

3 Current state of play in PyMC

Currently, PyMC has a CAR distribution, and an open pull request for an ICAR distribution here. As it currently stands within the pull request for the ICAR distribution, the test coverage of the logp function is poor, and hence why the pull request has not been merged. The reason for the poor testing covergage is that the implementation of the ICAR distribution is based off of a singular Gaussian distribution, making it difficult to write tests for the logp with a different construction to the implementation of the distribution itself. Currently there exists no tutorial using the pm.CAR distribution as it currently stands within PyMC. Please see ongoing progress in this pull request.

4 The ICAR prior

4.1 Writing the ICAR prior as a pairwise difference

Often the log probabiliy of the ICAR distribution is expressed as a pairwise summation. For example, see this implementation [cite the Stan case studies] of the BYM model in Stan. Let the notation $i\sim j$ denote areas $i$ and $j$ as neighbours. When a summation is over the index $i\sim j$, it means that the summation is over all unique neighbour pairs in the data set. The adjacency matrix $\mathbf{W}\in R^{N\times N}$ has entries $w_{ij} = 1 $ if $i\sim j$, else $w_{ij}=0$. The elements $w_{ii}=0$. The matrix $\mathbf{D} \in R^{N\times N}$ is a diagonal matrix such that each diagonal element corresponds to the number of neighbours $n_i$ for area $i$. By definition, the matrix $\mathbf{D} - \mathbf{W}$ is diagonally dominant, thus positive semi-definite, with non-negative eigenvalues. Unfortunately, the matrix is singular because it is not strictly diagonally dominant. The matrix $\mathbf{I}$ is the identity. The log probability of the Leroux prior can be written as proportional to the following pairwise difference: $$ \log p(\boldsymbol{\phi}) = -\frac{1}{2}\boldsymbol{\phi}^T\big [\mathbf{D}-\mathbf{W}]\boldsymbol{\phi} + c\\ \propto -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \phi_i\big [\mathbf{D}-\mathbf{W}\big ]_{ij}\phi_j\\ = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \phi_i[\mathbf{D}-\mathbf{W} \big ]_{ij} \phi_j\\ = -\frac{1}{2}\bigg [\sum_{i=1}^N\sum_{j=1}^N \phi_i\phi_{ij}\bigg [ -\sum_{i=1}^N\sum_{j=1}^N \phi_i \phi_j w_{ij}\bigg ]\\ = -\frac{1}{2}\bigg [\sum_{i=1}^N \phi_i^2 d_{i} -2\sum_{i\sim j} \phi_i \phi_j\bigg ]\\ = -\frac{1}{2}\bigg [\sum_{i\sim j} \phi_i^2 + \phi_j^2 -2 \phi_i \phi_j\bigg ]\\ $$ The log-prior specified above assumes a mean of $0$ and a variance scaling of $1$.

4.2 Implementation of this logp term in PyMC

The pairwise difference can be written as a function (suppose it is called icar_pairwise_diff). The key to implementing the log probability of an ICAR prior within PyMC is to initialize the vector of random effects $\boldsymbol{\phi}$ with pm.Flat, which adds $0$ to the log probability of the graph, and then to evaluate the function icar_pairwise_diff using pm.Potential to add the required log probability value to the overall log probability of the graph.

# node 1 and 2 denote pairwise relations 
coords = {"num_areas": np.arange(N)}
phi = pm.Flat("phi", dims="num_areas")
pm.Potential("pairwise_sum", icar_pairwise_diff(phi, node1, node2))

4.3 Example notebook using the ICAR prior within a BYM model fit

Provided here is an implementation of the BYM model within PyMC using the pairwise difference formulation and a sum to zero constraint.

5 Leroux prior

5.1 Writing the Leroux prior as a pairwise difference

Seeing the ICAR prior expressed as a pairwise summation, this motivates us to find an expression for the Leroux prior as a pairwise summation, and thus we can use similar functionality above to add the Leroux prior in a PyMC. This is what we now do. Although the following algebraic manipulations are trivial, to the best of my knowledge, I never seen the Leroux prior expressed as a pairwise summation before. Let the notation $i\sim j$ denote areas $i$ and $j$ as neighbours. When a summation is over the index $i\sim j$, it means that the summation is over all unique neighbour pairs in the data set. The adjacency matrix $\mathbf{W}\in R^{N\times N}$ has entries $w_{ij} = 1 $ if $i\sim j$, else $w_{ij}=0$. The elements $w_{ii}=0$. The matrix $\mathbf{D} \in R^{N\times N}$ is a diagonal matrix such that each diagonal element corresponds to the number of neighbours $n_i$ for area $i$. By definition, the matrix $\mathbf{D} - \mathbf{W}$ is diagonally dominant, thus positive semi-definite, with non-negative eigenvalues. Unfortunately, the matrix is singular because it is not strictly diagonally dominant. The matrix $\mathbf{I}$ is the identity. The log probability of the Leroux prior can be written as proportional to the following pairwise difference: $$ \log p(\boldsymbol{\phi}) = -\frac{1}{2}\boldsymbol{\phi}^T\big [\rho(\mathbf{D}-\mathbf{W}) + (1-\rho)\mathbf{I})\big ]\boldsymbol{\phi} + c\\ \propto -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \phi_i\big [\rho(\mathbf{D}-\mathbf{W}) + (1-\rho)\mathbf{I})\big ]_{ij}\phi_j\\ = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \phi_i[\rho(\mathbf{D}-\mathbf{W}) \big ]_{ij} \phi_j-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \phi_i \big [(1-\rho)\mathbf{I})\big ]_{ij}\phi_j\\ = -\frac{1}{2}\rho\bigg [\sum_{i=1}^N\sum_{j=1}^N \phi_i\phi_{ij}\bigg [ -\sum_{i=1}^N\sum_{j=1}^N \phi_i \phi_j w_{ij}\bigg ] - \frac{1}{2}(1-\rho)\sum_{i=1}^N\phi_{i}^2\\ = -\frac{1}{2}\rho\bigg [\sum_{i=1}^N \phi_i^2 d_{i} -2\sum_{i\sim j} \phi_i \phi_j\bigg ] - \frac{1}{2}(1-\rho)\sum_{i=1}^N\phi_{i}^2\\ = -\frac{1}{2}\rho\bigg [\sum_{i\sim j} \phi_i^2 + \phi_j^2 -2 \phi_i \phi_j\bigg ] - \frac{1}{2}(1-\rho)\sum_{i=1}^N\phi_{i}^2\\ = -\frac{1}{2}\rho \sum_{i\sim j} (\phi_i - \phi_j)^2 - \frac{1}{2}(1-\rho)\sum_{i=1}^N\phi_{i}^2. $$ The log-prior specified above assumes a mean of $0$ and a variance scaling of $1$.

5.2 Implementation of this logp term into PyMC

The implementation is similar to the ICAR prior above except we evaluate a pm.Potential expression under a function that represents the pairwise summation given above for the pairwise Leroux prior.

5.3 Example notebook of a Leroux model fit

Provided here is an implementation of a model using the Leroux prior within PyMC using the pairwise difference formulation and a sum to zero constraint.

6 Extensions to spatio-temporal models

6.1 The linear time CAR model

We can now use the building block of the Leroux prior to fit models with more complex random effect structures towards spatio-temporal datasets within PyMC. A simple example od this is the linear time spatio-temporal conditional autoregressive model.

For this this model, data from area $i$ at time $t$ is modelled as $$ y_{i, t} \sim \text{Poisson}(\mu_{i, t})\\ \log \mu_{i, t} = \beta_0 + \beta_1 x_{i, t} + \log E_{i, t} +\psi_{i, t}, $$ where $\psi_{i, t}$ is the random effect, modelled as $$ \psi_{i, t}=\mu+\phi_{i}+(\alpha+\delta_i)\frac{t-\bar{t}}{T}, $$ where both the $\boldsymbol{\phi}$'s and $\boldsymbol{\delta}$'s have a Leroux prior. Effectively, this is a hierarchical model with random intercepts and random slopes, except that these two terms are "smoothed" over by the random intercepts and slopes respectively of their directly local neighbours.

6.2 Example notebook of the linear-time CAR model

Provided here is an implementation of the linear time spatio-temporal conditional autoregressive model PyMC using Leroux priors expressed with the pairwise difference formulation and a sum to zero constraint.

7 Future work

7.1 How I want the models discussed to be able to be used within PyMC

The current notebooks provided manually define random effects using $pm.Flat$, evaluate the pairwise difference formulation using calls to pm.Potential, manually apply a distributional constrant on the sum of the random effects, and define priors for the free parameters within the ICAR and the Leroux specifications. My next step is to create functionality such as pm.BYM, pm.Leroux, pm.BYM2 that completess all of the above mentioned steps, such that fitting these models becomes as simple as defining your random effect structure with a single call to something such as pm.Leroux.

Issues to consider with the proposed approach in 7.1

  • prior specification within the call to pm.Leroux (e.g., priors being defined for $\rho$, $\tau^2$ by implicitly calling pm.Leroux). If the user wants to define their own priors for $\rho$ and $\tau$, making sure the arguments are available that the pre-defined parameters can be linked with the call to pm.Leroux.
  • CAR models are hierarchical in nature, and so there is optionality to choose centered versus non centered parameterizations, and exploration is required into how these different parameterizations affect the performance of fitting such spatio-temporal models using the available inference algorithms.
  • sum-to-zero constraints are required on the random effects to ensure parameter identifiability with any intercept parameters. Need to determine the best way to satisfy this constraint for parameters defined by pm.Flat.

7.2 Addition of more spatiotemporal CAR models

Now that we can use Leroux priors within PyMC, the building blocks exist for extensions into according more complex, especially within the spatio-temporal model class. Here is an example.

The AR-2 spatio-temporal CAR model

Suppose we now observe $\mathbf{y}\in\mathbb{R}^{N\times T}$, $\mathbf{x}\in\mathbb{R}^{N\times T}$, $\mathbf{x}_{\text{pop}}\in\mathbb{R}^{N\times T}$, e.g., we have a sequence in time, of observations of the response, covariate, and offset for each area. We model the mean vector of the random effects at each time point as following an autoregressive process of order $2$. Ignoring the model terms that are not relevant to the Leroux prior, we get the following: $$ \psi_{t, i} = \phi_{t, i}\\ \boldsymbol{\phi}_t|\boldsymbol{\phi}_{t-1}, \boldsymbol{\phi}_{t-2} \sim \text{Normal}\bigg (\rho_{T_1}\boldsymbol{\phi}_{t-1}+\rho_{T_2}\boldsymbol{\phi}_{t-2}, \tau^2 \bigg (\rho_{s}[\mathbf{D}-\mathbf{W}]+(1-\rho_s)\mathbf{I}\bigg )\bigg )\\ \boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \sim \text{Normal}\bigg (\mathbf{0}, \tau^2\bigg (\rho_{s}[\mathbf{D}-\mathbf{W}]+(1-\rho_s)\mathbf{I}\bigg )\bigg )\\ \tau^2 \sim \text{Inverse-Gamma}(1, b)\\ \rho_S \sim \text{Uniform}(0, 1)\\ f(\rho_{T_1}, \rho_{T_2})\propto 1. $$

The AR-2 spatio-temporal CAR model as a pairwise summation

We can see that the vector of random effects for $t=1, 2$ have the same log-probability structure as the Leroux prior defined with no temporal component, and thus are proportional to $$ \log p(\boldsymbol{\phi}_{1}, \boldsymbol{\phi}_2)\propto -\frac{\tau}{2}\rho_S[\sum_{t=1}^2\sum_{j\sim i}(\phi_{t,i}-\phi_{t, j})^2] - \frac{\tau}{2}(1-\rho_S)\sum_{t=1}^2\sum_{j\sim i}(\phi_{t,i}^2). $$ We define the mean vector an time point $t$ as equal to $\boldsymbol{\mu_t}=\rho_{T_1}\boldsymbol{\phi}_{t-1}+\rho_{t-2}\boldsymbol{\phi}_{t-2}$. Thus, the log probability over the vectors of random effects from time points $t=3, 3, \ldots, T$, can be written as proportional to $$ \log p(\boldsymbol{\phi}_3, \boldsymbol{\phi}_4, \ldots, \boldsymbol{\phi}_T) \propto -\frac{\tau}{2}\rho_S \bigg[\sum_{t=3}^T \sum_{j\sim i}\phi_{t, i}-\phi_{t, j} -(\mu_{t, i}-\mu_{t, j})\bigg] -\frac{\tau}{2}(1-\rho_S) \bigg[\sum_{t=3}^T \sum_{i=1}^N(\phi_{t, i} - \mu_{t, i})^2\bigg]. $$

Added difficulties with such a model

The mean $\boldsymbol{\mu}_{t}$ of the random effects at each time $t$ must be computed using values of the random effects at previous time points. To compute such quantities without creating a double for loop within the pm.Model we must make use of functionality such as aesara.scan. Additionally, the sum to zero constraints for such models must hold for the random effects at each of the $T$ time points. This will potentially require looking at alternate ways to satisfy such constraints.

Additional models of interest

The CARBayesST package within R has implementations of numerous spatio-temporal CAR models, such as the aforementioned AR-2 spatio-temporal CAR model. We have now developed the ability to derive the log probability of the random effects of the majority of these models as a pairwise summation. Hopefully this will allow us to include such models using a possible model_type argument with the pm.Leroux, pm.BYM, and pm.BYM2 functions

7.4 Profiling of PyMC implementation compared to CARBayes, Nimble, and INLA

Lastly, once the ideas discussed above are added to the PyMC, it is important that we profile the computational efficiency and the inference accuracy of libraries that currently support the same class of models, such as CARBayesST, Nimble, and INLA.

I am very grateful for the awesome oppurtunity to learn from some awesome people in Bill and Chris. My hope in doing this experience was to improve my programming and software development skills. While the progress that I have made in this regard in the past few months was slower than I would like, I am excited for the potential use cases of the end product of the the work that we have done, once it exists and is available within the core PyMC library. Since first seeing Leroux priors used within spatio-temporal models, it has been a source of confusion as to the lack of availabile software that can sample these models using gradient-based samplers such as NUTS. I am happy with our progress in learning how to evaluate the log probability of such priors such that they can be sampled within modern probabilistic programming packages (like PyMC :)).