Log link in Poisson ash

Table of Contents

Introduction

We previously found several instances of genes which clearly exhibit unimodal expression variation (by inspection), but significantly depart from the estimated unimodal distribution

\begin{align*} x_i \mid s_i, \lambda_i &\sim \operatorname{Poisson}(s_i \lambda_i)\\ \lambda_i &\sim g(\cdot) = \sum_{k=1}^K w_k \operatorname{Uniform}(\lambda_0, a_k) \end{align*}

where \(\lambda_0\) is the mode and we abuse notation to allow endpoints \(a_k < \lambda_0\). One feature which appears to be common to these examples is estimating the density near zero wrong, such that the smallest randomized quantiles deviate from uniform.

Here, we investigate problems in fitting a unimodal assumption under the log link

\begin{align*} x_i \mid s_i, \lambda_i &\sim \operatorname{Poisson}(s_i \lambda_i)\\ \theta_i = \ln \lambda_i &\sim g(\cdot) = \sum_{k=1}^K w_k \operatorname{Uniform}(\theta_0, a_k) \end{align*}

Setup

import numpy as np
import pandas as pd
import rpy2.robjects
import rpy2.robjects.pandas2ri
import scipy.integrate as si
import scipy.special as sp
import scipy.stats as st

rpy2.robjects.pandas2ri.activate()
%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])
import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'
plt.rcParams['font.family'] = 'Nimbus Sans'

Derivations

Under the log link, we have

\begin{align*} x_i \mid s_i, \lambda_i &\sim \operatorname{Poisson}(s_i \lambda_i)\\ \theta_i = \ln \lambda_i &\sim \operatorname{Uniform}(a, b) \end{align*}

and

\begin{align*} \Pr(x_i = x) &= \frac{1}{b - a} \int_a^b \operatorname{Poisson}(x; \exp(\ln s_i + \theta_i))\,d\theta_i\\ &= \frac{1}{b - a} \int_a^b \frac{(s_i)^x}{\Gamma(x + 1)} \exp(\theta_i)^{x}\exp(-s_i \exp(\theta_i))\,d\theta_i\\ &= \frac{1}{x (b - a)} \int_{\exp(a)}^{\exp(b)} \frac{(s_i)^x}{\Gamma(x)} \lambda_i^{x - 1}\exp(-s_i \lambda_i)\,d\lambda_i\\ &= \frac{1}{x (b - a)} \left( F_\Gamma(\exp(b); x, s_i) - F_\Gamma(\exp(a); x, s_i) \right)\\ \Pr(x_i \leq x) &= \sum_{k=0}^{x} \frac{1}{k (b - a)} \left( F_\Gamma(\exp(b); k, s_i) - F_\Gamma(\exp(a); k, s_i) \right) \end{align*}

where \(F_\Gamma(\cdot; \alpha, \beta)\) denotes the CDF of the Gamma distribution with shape \(\alpha\) and rate \(\beta\). This derivation breaks down for \(\Pr(x_i = 0)\), because the Gamma distribution is not defined for \(\alpha = 0\), and because there is a factor of \(1 / x_i\). Instead,

\begin{align*} \Pr(x_i = 0) &= \frac{1}{b - a} \int_{\exp(a)}^{\exp(b)} \frac{\exp(-s_i \lambda_i)}{\lambda_i}\,d\lambda_i\\ &= \frac{1}{b - a} \int_{-s_i \exp(a)}^{-s_i \exp(b)} \frac{\exp(t)}{t}\,dt\\ &= \frac{\operatorname{Ei}(-s_i \exp(b)) - \operatorname{Ei}(-s_i \exp(a))}{b - a} \end{align*}

where \(\operatorname{Ei}\) is the exponential integral.

Compare the estimate of \(\ln \Pr(x_i = 0 \mid x_i+)\) using numerical integration, the special case analytic formula involving \(\operatorname{Ei}\), and introducing a pseudocount into the formula involving \(F_\Gamma\).

# Typical values
s = np.logspace(0, 6, 20)
a = np.log(2e-3)
b = np.log(3e-3)

# Numerical integration
px_0 = np.array([si.quad(lambda x: st.poisson(mu=s_j * np.exp(x)).pmf(0), a, b)[0] / (b - a) for s_j in s])

# Analytic
px_1 = (sp.expi(-s * np.exp(b)) - sp.expi(-s * np.exp(a))) / (b - a)

# Follow ashr
eps = 1e-5
F = st.gamma(a=eps, scale=1 / s).cdf
px_2 = (F(np.exp(b)) - F(np.exp(a))) / (eps * (b - a))

np.log(pd.DataFrame({'quad': px_0, 'Ei': px_1, 'F_gamma': px_2}, index=s))
quad          Ei    F_gamma
1.000000         -0.002466   -0.002466  -0.002521
2.069138         -0.005103   -0.005103  -0.005150
4.281332         -0.010558   -0.010558  -0.010598
8.858668         -0.021845   -0.021845  -0.021877
18.329807        -0.045193   -0.045193  -0.045218
37.926902        -0.093480   -0.093480  -0.093498
78.475997        -0.193290   -0.193290  -0.193301
162.377674       -0.399380   -0.399380  -0.399383
335.981829       -0.823968   -0.823968  -0.823964
695.192796       -1.694739   -1.694739  -1.694728
1438.449888      -3.464673   -3.464673  -3.464655
2976.351442      -7.008214   -7.008214  -7.008189
6158.482111     -13.999734  -13.999734 -13.999714
12742.749857    -27.858046  -27.858046       -inf
26366.508987    -55.813991  -55.813991       -inf
54555.947812   -112.910591 -112.910591       -inf
112883.789168  -230.288764 -230.288764       -inf
233572.146909  -472.390342 -472.390345       -inf
483293.023857         -inf        -inf       -inf
1000000.000000        -inf        -inf       -inf

Matthew Stephens provided an alternative derivation:

\begin{align*} \Pr(x_i = x) &= \frac{1}{b - a} \int_a^b \operatorname{Poisson}(x; \exp(\ln s_i + \theta_i))\,d\theta_i\\ &= \frac{1}{b - a} \int_{\exp(a)}^{\exp(b)} \frac{s_i^x}{\Gamma(x + 1)} \lambda_i^{x - 1}\exp(-s_i \lambda_i)\,d\lambda_i\\ &= \frac{1}{\Gamma(x + 1) (b - a)} \int_{s_i \exp(a)}^{s_i \exp(b)} u^{x-1} \exp(-u)\,du\\ &= \frac{1}{\Gamma(x + 1) (b - a)} \left( \Gamma(x, s_i \exp(a)) - \Gamma(x, s_i \exp(b)) \right) \end{align*}

where \(\Gamma(\cdot, \cdot)\) is the upper incomplete Gamma function. The resulting expression is defined for all \(x \geq 0\).

Results

Simulated data

Simulate some near-NB ZINB data.

dat <- readRDS("/scratch/midway2/aksarkar/modes/test-data/pois-mode-est.Rds")
with(dat, summary(MASS::glm.nb(x ~ offset(log(s))))$twologlik / 2)
-2008.50373054974

Fit ash_pois using identity link.

fit0 <- ashr::ash_pois(dat$x, dat$s, link="identity", mixcompdist="halfuniform")
fit0$loglik
-2008.27125771255

Implement autoselect.mixsd for log link.

Fit ash_pois using log link. (The problem turned out to be mode estimation limits.)

fit1 <- ashr::ash_pois(dat$x, dat$s, link="log", mixcompdist="halfuniform")
fit1$loglik
-2008.26590367036

Real data

Load a real data example which is clearly unimodal, but departs from the fitted distribution.

dat <- readRDS("/scratch/midway2/aksarkar/modes/test-data/b-cell-data.Rds")
query <- dat[dat["gene"] == "ENSG00000019582",]
data = pd.DataFrame(rpy2.robjects.r['readRDS']("/scratch/midway2/aksarkar/modes/test-data/b-cell-data.Rds"))
query = data[data["gene"] == "ENSG00000019582"]

Plot the empirical distribution of molecule counts.

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(query['count'], bins=np.arange(query['count'].max() + 1), color='k')
plt.xlabel('Number of molecules')
plt.ylabel('Number of cells')
plt.tight_layout()

b-cell-ex.png

Plot the distribution of MLEs \(x_i / s_i\).

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(query['count'] / query['size'], bins=100, color='k')
plt.xlabel('$\hat\lambda_i$')
plt.ylabel('Number of cells')
plt.tight_layout()

b-cell-ex2.png

As a baseline, fit a Gamma assumption.

with(query, summary(MASS::glm.nb(count ~ offset(log(size))))$twologlik / 2)
-31414.5999341206

Fit a unimodal assumption under the identity link.

fit0 <- ashr::ash_pois(query$count, query$size, link="identity", mixcompdist="halfuniform")
fit0$loglik
-31395.0538031464

Report the diagnostic test statistic and p-value.

zgrep -wm1 ENSG00000019582 /project2/mstephens/aksarkar/projects/singlecell-modes/data/gof/b_cells-unimodal.txt.gz

Fit a unimodal assumption under the log link.

fit1 <- ashr::ash_pois(query$count, query$size, link="log", mixcompdist="halfuniform")
fit1$loglik
-31394.8462180293

Make sure the marginal PMF is sensible. (This was not implemented correctly before.)

any(fit1$fitted_g$pi %*% ashr:::comp_dens_conv(fit1$fitted_g, fit1$data) > 1)
FALSE

any(ashr:::comp_dens_conv(fit1$fitted_g, fit1$data) > 1)
FALSE

Implement the diagnostic for the log link.

fx = fit1$fitted_g$pi %*% ashr:::comp_dens_conv(fit1$fitted_g, fit1$data)
N = nrow(query)
K = length(fit1$fitted_g$a)
## Important: this is F_j(x_{ij} - 1), needed for randomized quantiles
Fx_1 = matrix(rep(0, N * K), N, K)
for (i in 1:N) {
  for (k in 1:K) {
    if (fit1$fitted_g$pi[k] < 1e-8) {
      next
    }
    else if (query$count[i] == 0) {
      Fx_1[i, k] = 0
    }
    else if (fit1$fitted_g$a[k] == fit1$fitted_g$b[k]) {
      Fx_1[i, k] = ppois(query$count[i] - 1, lambda=query$size[i] * exp(fit1$fitted_g$a[k]))
    }
    else {
      ak = min(fit1$fitted_g$a[k], fit1$fitted_g$b[k])
      bk = max(fit1$fitted_g$a[k], fit1$fitted_g$b[k])
      grid = seq(0, query$count[i] - 1)
      Fx_1[i, k] = sum((expint::gammainc(grid, query$size[i] * exp(ak)) -
                          expint::gammainc(grid, query$size[i] * exp(bk))) /
                         (gamma(grid + 1) * (bk - ak)))
    }
  }
}
rpp = drop(Fx_1 %*% fit1$fitted_g$pi) + runif(n=N) * fx
ks.test(rpp, punif)$p.value
0.00683832270839435

Author: Abhishek Sarkar

Created: 2020-01-09 Thu 20:20

Validate