10 Estimating Models 10.1 Parameter optimization in linear and nonlinear models 10.3 Evaluating Goodness-of-fit

10.2 Statistical Formulation of Encoding Models

Let us denote the observed spike train data by $D$ . In general, $D$ could represent measurements from a population of neurons. To keep the arguments simple, we focus for the moment on a single neuron, but at the end of the section we will extend the approach to a population of interacting neurons. If the time bins are chosen smaller than the absolute refractory period, the discretized spike train is described by a sequence of scalar variables $n_{t}$ in an interval $0<t\leq T$

D=\{n_{1},n_{2},\dots,n_{T}\}\,,

(10.15)

cf. Fig. 9.6 in Chapter 9. If time bins are large, spike counts can take values larger than one.

A neural ‘encoding model’ is a model that assigns a conditional probability, $p(D|\mbox{\boldmath$x$})$ , to any possible neural response $D$ given a stimulus $x$ . The vector $\mbox{\boldmath$x$}_{t}$ can include the momentary stimulus presented at time $t$ , or more generally the concatenated spatiotemporal stimulus history up to time $t$ . Examples have been given in Section 10.1.

As emphasized in the introduction to this chapter, it is not feasible to directly measure this probability $p(D|\mbox{\boldmath$x$})$ for all stimulus-response pairs $(\mbox{\boldmath$x$},D)$ , simply because there are infinitely many potential stimuli. We therefore hypothesize some encoding model,

p(D|\mbox{\boldmath\(x\)},\theta)\,.

(10.16)

Here $\theta$ is a short-hand notation for the set of all model parameters. In the examples of the previous section, the model parameters are $\theta=\{\mbox{\boldmath$k$},b\}$ .

Our aim is to estimate the model parameters $\theta$ so that the model ‘fits’ the observed data $D$ . Once $\theta$ is in hand we may compute the desired response probabilities as

p(D|\mbox{\boldmath\(x\)})\approx p(D|\mbox{\boldmath\(x\)},\theta)\,,

(10.17)

that is, knowing $\theta$ allows us to interpolate between the observed (noisy) stimulus-response pairs, in order to predict the response probabilities for novel stimuli $x$ for which we have not yet observed any responses.

10.2.1 Parameter estimation

How do we find a good estimate for the parameters $\theta$ for a chosen model class? The general recipe is as follows. The first step is to introduce a model that makes sense biophysically, and incorporates our prior knowledge in a tractable manner. Next we write down the likelihood of the observed data given the model parameters, along with a prior distribution that encodes our prior beliefs about the model parameters. Finally, we compute the posterior distribution of the model parameters given the observed data, using Bayes’ rule, which states that

p(\theta|D)\propto p(D|\theta)p(\theta);

(10.18)

the left-hand side is the desired posterior distribution, while the right-hand side is just the product of the likelihood and the prior.

In the current setting, we need to write down the likelihood $p(D|X,\mbox{\boldmath$k$})$ of the observed spike data $D$ given the model parameter $k$ and the observed set of stimuli summarized in the matrix $X$ , and then we may employ standard likelihood optimization methods to obtain the maximum likelihood (ML) or maximum a posteriori (MAP) solutions for $k$ , defined by

	$\displaystyle{\rm ML:~{}}\hat{\mbox{\boldmath$k$}}_{\rm ML}$	$\displaystyle=$	$\displaystyle{\rm argmax}_{\mbox{\boldmath$k$}}\{p(D\|X,\mbox{\boldmath$k$})\}$		(10.19)
	$\displaystyle{\rm MAP:~{}}\hat{\mbox{\boldmath$k$}}_{\rm MAP}$	$\displaystyle=$	$\displaystyle{\rm argmax}_{\mbox{\boldmath$k$}}\{p(D\|X,\mbox{\boldmath$k$}% )\,p(\mbox{\boldmath$k$})\}$		(10.20)

where the maximization runs over all possible parameter choices.

We assume that spike counts per bin follow a conditional Poisson distribution, given $\rho(t)$ :

n_{t}\sim Poiss[\rho(t){\text{d}}t]\,;

(10.21)

cf. text and exercises of Chapter 7. For example, with the rate parameter of the Poisson distribution given by a GLM or SRM model $\rho(t)=f(\mbox{\boldmath$k$}\cdot\mbox{\boldmath$x$}_{t})$ , we have

p(D|X,\mbox{\boldmath\(k\)})=\,\prod_{t}\left\{{[f(\mbox{\boldmath\(k\)}\cdot% \mbox{\boldmath\(x\)}_{t}){\text{d}}t]^{n_{t}}\over(n_{t})!}\exp[-f(\mbox{% \boldmath\(x\)}_{t}\cdot\mbox{\boldmath\(k\)}){\text{d}}t]\right\}\,

(10.22)

Here $\Pi_{t}$ denotes the product over all time steps. We recall that, by definition, $n_{t}!=1$ for $n_{t}=0$ .

Our aim is to optimize the parameter $k$ . For a given observed spike train, the spike count numbers $n_{t}$ are fixed. In this case, $({\text{d}}t)^{n_{t}}/(n_{t})!$ is a constant which is irrelevant for the parameter optimization. If we work with a fixed time step ${\text{d}}t$ and drop all units, we can therefore reshuffle the terms and consider the logarithm of the above likelihood

{\rm log}p(D|X,\mbox{\boldmath\(k\)})=c_{0}+\sum_{t}\left\{{n_{t}}{\rm log}[f(% \mbox{\boldmath\(k\)}\cdot\mbox{\boldmath\(x\)}_{t})]-f(\mbox{\boldmath\(x\)}_% {t}\cdot\mbox{\boldmath\(k\)}){\text{d}}t\right\}\,.

(10.23)

If we choose the time step ${\text{d}}t$ shorter than the half-width of an action potential, say 1 ms, the spike count variable $n_{t}$ can only take the values zero or one. For small ${\text{d}}t$ , the likelihood of Eq. (10.22) is then identical to that of the SRM with escape noise, defined in Chapter 9; cf. Eqs. (9.10) and (9.15) - (9.17).

We don’t have an analytical expression for the maximum of the likelihood defined in Eq. (10.22), but nonetheless we can numerically optimize this function quite easily if we are willing to make two assumptions about the nonlinear function $f(.)$ . More precisely, if we assume that

(i) $f(u)$ is a convex (upward-curving) function of its scalar argument $u$ , and

(ii) $\log f(u)$ is concave (downward-curving) in $u$ ,

then the loglikelihood above is guaranteed to be a concave function of the parameter $k$ , since in this case the loglikelihood is just a sum of concave functions of $k$ (383).

This implies that the likelihood has no non-global maximum (also called local maximum). Therefore the maximum likelihood parameter $\hat{\mbox{\boldmath$k$}}_{\rm ML}$ may be found by numerical ascent techniques; cf. Fig. 10.2. Functions $f(.)$ satisfying these two constraints are easy to think of: for example, the standard linear rectifier and the exponential function both work.

Fitting model parameters proceeds as follows: we form the (augmented) matrix $X$ where each row is now

\mbox{\boldmath\(x\)}_{t}=(1,I_{t-1}{\text{d}}t,I_{t-2}{\text{d}}t,\dots,I_{t-% K}{\text{d}}t,n_{t-1},n_{t-2},\dots n_{t-J})\,.

(10.24)

Similarly, the parameter vector is in analogy to Eq. (10.10)

\mbox{\boldmath\(k\)}=(b,\kappa({\text{d}}t),\kappa(2{\text{d}}t)\dots,\kappa(% K{\text{d}}t),\eta({\text{d}}t),\eta(2{\text{d}}t),\dots,\eta(J{\text{d}}t));

(10.25)

here $b=u_{\rm rest}-\vartheta$ is a constant offset term which we want to optimize.

We then calculate the log-likelihood

\log p(D|X,\mbox{\boldmath\(k\)})=\sum_{t}\left(n_{t}\log f(X_{t}\cdot\mbox{% \boldmath\(k\)})-f(X_{t}\cdot\mbox{\boldmath\(k\)}){\text{d}}t\right)

(10.26)

and compute the ML or maximum a posteriori (MAP) solution for the model parameters $k$ by a concave optimization algorithm. Note that, while we still assume that the conditional spike count $n_{t}$ within a given short time bin is drawn from a one-dimensional $Poiss(\rho(t){\text{d}}t)$ distribution given $\rho(t)$ , the resulting model displays strong history effects (since $\rho(t)$ depends on the past spike trains) and therefore the output of the model, considered as a vector of counts $D=\{n_{t}\}$ , is no longer a Poisson process, unless $\mbox{\boldmath$\eta$}=0$ . Importantly, because of the refractory effects incorporated by a strong negative $\eta$ at small times, the spike count variable $n_{t}$ cannot take a value larger than $1$ if ${\text{d}}t$ is in the range of one or a few milliseconds. Therefore, we can expect that interspike intervals are correctly reproduced in the model; see Fig. 10.5 for an example application.

Finally, we may expand the definition of $X$ to include observations of other spike trains, and therefore develop GLMs not just of single spike trains, but network models of how populations of neurons encode information jointly (95; 379; 521; 399). The resulting model is summarized as follows: Spike counts are conditionally Poisson distributed given $\rho_{i}(t)$ $n_{i,t}\sim Poiss(\rho_{i}(t){\text{d}}t)$ with a firing rate

\rho_{i}(t)=f\bigg(\mbox{\boldmath\(k\)}_{i}\cdot\mbox{\boldmath\(x\)}_{t}+% \sum_{i^{\prime}\neq i,j}\epsilon_{i^{\prime},j}n_{i^{\prime},t-j}\bigg)\,.

(10.27)

Here, $\rho_{i}(t)$ denotes the instantaneous firing rate of the $i$ -th cell at time $t$ and $\mbox{\boldmath$k$}_{i}$ is the cell’s linear receptive field including spike-history effects; cf. Eq. (10.25). The net effect of a spike of neuron $i^{\prime}$ onto the membrane potential of neuron $i$ is summarized by $\epsilon_{i^{\prime},j}$ ; these terms are summed over all past spike activity $n_{i^{\prime},t-j}$ in the population of cells from which we are recording simultaneously. In the special case that we record from all neurons in the population, $\epsilon_{i^{\prime},j}$ can be interpreted as the excitatory or inhibitory postsynaptic potential caused by a spike of neuron $i^{\prime}$ a time $j\,{\text{d}}t$ earlier.

Example: Linear regression and voltage estimation

It may be helpful to draw an analogy to linear regression here. We want to show that the standard procedure of least-square minimization can be linked to statistical parameter estimation under the assumption of Gaussian noise.

We consider the linear voltage model of Eq. (10.1). We are interested in the temporal filter properties of the neuron when it is driven by a time-dependent input $I(t)$ . Let us set $\mbox{\boldmath$x$}_{t}=(I_{t},I_{t-1},\dots,I_{t-K})\,{\text{d}}t$ and $\mbox{\boldmath$k$}=(\kappa({\text{d}}t),\dots,\kappa(K\,{\text{d}}t))$ . If we assume that the discrete-time voltage measurements have a Gaussian distribution around the mean predicted by the model of Eq. (10.4), then we need to maximize the likelihood

\log p(D|X,\mbox{\boldmath\(k\)})=c_{1}-c_{2}\sum_{t}\left(u_{t}-(\mbox{% \boldmath\(k\)}\cdot\mbox{\boldmath\(x\)}_{t})\right)^{2},\vspace{-4mm}

(10.28)

where $X=(x_{1},x_{2},...,x_{T})$ is the matrix of observed stimuli, $c_{1},c_{2}$ are constants that do not depend on the parameter $k$ , and the sum in $t$ is over all observed time bins. Maximization yields $\mbox{\boldmath$k$}_{opt}=(X^{T}X)^{-1}\left(\sum_{t}u_{t}\mbox{\boldmath$x% $}_{t}/{\text{d}}t\right)\,$ which determines the time course of the filter $\kappa(s)$ that characterizes the passive membrane properties. The result is identical to Eq. (10.8):

\hat{\mbox{\boldmath\(k\)}}_{opt}=\hat{\mbox{\boldmath\(k\)}}_{\rm LS}\,.

(10.29)

10.2.2 Regularization: maximum penalized likelihood

In the linear regression case it is well-known that estimates of the components of the parameter vector $k$ can be quite noisy when the dimension of $k$ is large. The noisiness of the estimate $\hat{\mbox{\boldmath$k$}}_{ML}$ is roughly proportional to the dimensionality of $k$ (the number of parameters in $k$ that we need to estimate from data) divided by the total number of observed samples (382). The same ‘overfitting’ phenomenon (estimator variability increasing with number of parameters) occurs in the GLM context. A variety of methods have been introduced to ‘regularize’ the estimated $k$ , to incorporate prior knowledge about the shape and/or magnitude of the true $k$ to reduce the noise in $\hat{\mbox{\boldmath$k$}}_{ML}$ . One basic idea is to restrict $k$ to lie within a lower-dimensional subspace; we then employ the same fitting procedure to estimate the coefficients of $k$ within this lower-dimensional basis (model selection procedures may be employed to choose the dimensionality of this subspace (521)).

A slightly less restrictive approach is to maximize the posterior

p(\mbox{\boldmath\(k\)}|X,D)\propto p(D|X,\mbox{\boldmath\(k\)})p(\mbox{% \boldmath\(k\)})

(10.30)

(with $k$ allowed to take values in the full original basis), instead of the likelihood $p(D|X,\mbox{\boldmath$k$})$ ; here $p(\mbox{\boldmath$k$})$ encodes our a priori beliefs about the true underlying $k$ .

It is easy to incorporate this maxima a posteriori idea in the GLM context (383): we simply maximize

	$\displaystyle\log p(\mbox{\boldmath$k$}\|X,D)$	$\displaystyle=$	$\displaystyle c+\log p(\mbox{\boldmath$k$})+\log p(D\|X,\mbox{\boldmath$k$})$		(10.31)
		$\displaystyle=$	$\displaystyle c+\log p(\mbox{\boldmath$k$})+\sum_{t}\left(n_{t}\log f(\mbox{% \boldmath$x$}_{t}\cdot\mbox{\boldmath$k$})-f(\mbox{\boldmath$x$}_{t}% \cdot\mbox{\boldmath$k$}){\text{d}}t\right).$		(10.32)

Whenever $\log p(\mbox{\boldmath$k$})$ is a concave function of $k$ , this ‘penalized’ likelihood (where $\log p(\mbox{\boldmath$k$})$ acts to penalize improbable values of $k$ ) is a concave function of $k$ , and ascent-based maximization may proceed (with no local maximum) as before; cf. Fig. 10.4.

Fig. 10.4: Regularization. Because of the concavity, there is only a single global maximum of the log-likelihood which defines the optimal parameter choice $\mbox{\boldmath$k$}_{\rm ML}$ . Right: Example of regularization by a prior (dashed line) that favors a smaller value for $\mbox{\boldmath$k$}_{\rm MAP}$ .

Example: Linear regression and Gaussian prior

In the linear regression case, the computationally-simplest prior is a zero-mean Gaussian, $\log p(\mbox{\boldmath$k$})=c-\mbox{\boldmath$k$}^{T}A\mbox{\boldmath$k$% }/2$ , where $A$ is a positive definite matrix (the inverse covariance matrix). The Gaussian prior can be combined with the Gaussian noise model of Eq. (10.28). Maximizing the corresponding posterior analytically (452; 485) leads then directly to the regularized least-square estimator

\hat{\mbox{\boldmath\(k\)}}_{\rm RLS}=(X^{T}X+A)^{-1}\left(\sum_{t}u_{t}\mbox{% \boldmath\(x\)}_{t}/{\text{d}}t\right)\,.

(10.33)

10.2.3 Fitting Generalized Integrate-and-Fire models to Data

Suppose an experimentalist has injected a time-dependent current $I(t)$ into a neuron and has recorded with a second electrode the voltage $u^{\rm exp}(t)$ of the same neuron. The voltage trajectory contains spikes $S(t)=\sum_{f}\delta(t-t^{(f)})$ with firing times $t^{(1)},t^{(2)},...t^{(N)}$ . The natural approach would be to write down the joint likelihood of observing both the spike times and the subthreshold membrane potential (381). A simpler approach would be to maximize the likelihood of observing the spike separately from the likelihood of observing the membrane potential.

Given the input $I(t)$ and the spike train $S(t)$ , the voltage of a SRM is given by Eq. (10.9) and we can adjust the parameters of the filter $\kappa$ and $\eta$ so as to minimize the squared error Eq. (10.5). We now fix the parameters for the membrane potential and maximize the likelihood of observing the spike times given our model voltage trajectory $u(t)$ . We insert $u(t)$ into the escape rate function $\rho(t)=f(u(t)-\vartheta(t))$ which contains the parameters of the threshold

\vartheta(t)=\vartheta_{0}+\int_{0}^{\infty}\theta_{1}(s)S(t-s){\text{d}}s\,.

(10.34)

We then calculate the log-likelihood

\log p(D|X,\mbox{\boldmath\(k\)})=c+\sum_{t}\left(n_{t}\log f(X_{t}\cdot\mbox{% \boldmath\(k\)})-f(X_{t}\cdot\mbox{\boldmath\(k\)}){\text{d}}t\right)

(10.35)

and compute the ML or maximum a posteriori (MAP) solution for the model parameters $k$ (which are the parameters of the threshold - the subthreshold voltage parameters are already fixed) by an optimization algorithm for concave functions.

Fig. 10.5 shows an example application. Both voltage in the subthreshold regime and spike times are nicely reproduced. Therefore, we can expect that interspike intervals are correctly reproduced as well. In order to quantify the performance of neuron models, we need the develop criteria of ‘goodness of fit’ for subthreshold membrane potential, spike timings, and possibly higher-order spike train statistics. This is the topic of section 10.3; we will return to similar applications in the next chapter.

10.2.4 Extensions (*)

The GLM encoding framework described here can be extended in a number of important directions. We briefly describe two such directions here.

First, as we have described the GLM above, it may appear that the model is restricted to including only linear dependencies on the stimulus $\mbox{\boldmath$x$}_{t}$ , through the $\mbox{\boldmath$k$}\cdot\mbox{\boldmath$x$}_{t}$ term. However, if we modify our input matrix $X$ once again, to include nonlinear transformations $\mathcal{F}_{j}(\mbox{\boldmath$x$})$ of the stimulus $x$ , we may fit nonlinear models of the form

\rho(t)=f(\sum_{j}a_{j}\mathcal{F}_{j}(\mbox{\boldmath\(x\)}))

(10.36)

efficiently by maximizing the loglikelihood $\log p(D|X,\mbox{\boldmath$a$})$ with respect to the weight parameter $a$ (560; 12). Mathematically, the nonlinearities $\mathcal{F}_{j}(\mbox{\boldmath$x$})$ may take essentially arbitrary form; physiologically speaking, it is clearly wise to choose $\mathcal{F}_{j}(\mbox{\boldmath$x$})$ to reflect known facts about the anatomy and physiology of the system (e.g., $\mathcal{F}_{j}(\mbox{\boldmath$x$})$ might model inputs from a presynaptic layer whose responses are better-characterized than are those of the neuron of interest (450)).

Second, in many cases it is reasonable to include terms in $X$ that we may not be able to observe or calculate directly (e.g., intracellular noise, or the dynamical state of the network); fitting the model parameters in this case requires that we properly account for these ‘latent,’ unobserved variables in our likelihood. While inference in the presence of these hidden parameters is beyond the scope of this chapter, it is worth noting that this type of model may be fit tractably using generalizations of the methods described here, at the cost of increased computational complexity, but the benefit of enhanced model flexibility and realism (484; 563; 536).

Example: Estimating spike triggered currents and dynamic threshold

In Fig. 10.1A, we have suggested to estimate the filter $\kappa(s)$ by extracting its values $k_{j}=\kappa(j{\text{d}}t)$ at discrete equally spaced time steps: The integral $\int_{0}^{\infty}\kappa(s)\,I(t-s){\text{d}}s=\sum_{j=1}^{K}k_{l}I_{t-j}{\text% {d}}t$ is linear in the $K$ parameters.

However, the observables remain linear in the parameters ( $k_{j}$ ) if we set $\kappa(s)=\sum_{j=1}^{4}k_{j}\exp(-s/\tau_{j})$ with fixed time constants, e.g., $\tau_{j}=10^{j}$ ms. Again, the integral $\int_{0}^{\infty}\kappa(s)\,I(t-s){\text{d}}s=\sum_{j=1}^{4}k_{j}\int_{0}^{% \infty}\exp(-s/\tau_{j})I(t-s){\text{d}}s$ is linear in the parameters. The exponentials play the role of ‘basis functions’ $F_{j}$ .

Similarly, the threshold filter $\vartheta_{1}(s)$ or the spike-afterpotential $\eta(s)$ can be expressed with basis functions. A common choice is to take rectangular basis functions $F_{j}$ which take a value of unity on a finite interval $[t_{j},t_{j+1}]$ . Exponential spacing $t_{j}=2^{j-1}$ ms of time points allows us to cover a large time span with a small number of parameters. Regular spacing leads back to the naive discretization scheme.

	$\displaystyle\log p(\mbox{\boldmath\(k\)}\|X,D)$	$\displaystyle=$	$\displaystyle c+\log p(\mbox{\boldmath\(k\)})+\log p(D\|X,\mbox{\boldmath\(k\)})$		(10.31)
		$\displaystyle=$	$\displaystyle c+\log p(\mbox{\boldmath\(k\)})+\sum_{t}\left(n_{t}\log f(\mbox{% \boldmath\(x\)}_{t}\cdot\mbox{\boldmath\(k\)})-f(\mbox{\boldmath\(x\)}_{t}% \cdot\mbox{\boldmath\(k\)}){\text{d}}t\right).$		(10.32)