10 Estimating Models

10.2 Statistical Formulation of Encoding Models

Let us denote the observed spike train data by $D$. In general, $D$ could represent measurements from a population of neurons. To keep the arguments simple, we focus for the moment on a single neuron, but at the end of the section we will extend the approach to a population of interacting neurons. If the time bins are chosen smaller than the absolute refractory period, the discretized spike train is described by a sequence of scalar variables $n_t$ in an interval $0 < t \leq T$

$$D = \{n_1, n_2, \dots, n_T\}\,, \tag{10.15}$$

cf. Fig. 9.6 in Chapter 9. If time bins are large, spike counts can take values larger than one.

A neural ‘encoding model’ is a model that assigns a conditional probability, $p(D|\boldsymbol{x})$, to any possible neural response $D$ given a stimulus $\boldsymbol{x}$. The vector $\boldsymbol{x}_t$ can include the momentary stimulus presented at time $t$, or more generally the concatenated spatiotemporal stimulus history up to time $t$. Examples have been given in Section 10.1.

As emphasized in the introduction to this chapter, it is not feasible to directly measure this probability $p(D|\boldsymbol{x})$ for all stimulus-response pairs $(\boldsymbol{x}, D)$, simply because there are infinitely many potential stimuli. We therefore hypothesize some encoding model,

$$p(D|\boldsymbol{x}, \theta)\,. \tag{10.16}$$

Here $\theta$ is a short-hand notation for the set of all model parameters. In the examples of the previous section, the model parameters are $\theta = \{\boldsymbol{k}, b\}$.

Our aim is to estimate the model parameters $\theta$ so that the model ‘fits’ the observed data $D$. Once $\theta$ is in hand we may compute the desired response probabilities as

$$p(D|\boldsymbol{x}) \approx p(D|\boldsymbol{x}, \theta)\,, \tag{10.17}$$

that is, knowing $\theta$ allows us to interpolate between the observed (noisy) stimulus-response pairs, in order to predict the response probabilities for novel stimuli $\boldsymbol{x}$ for which we have not yet observed any responses.

10.2.1 Parameter estimation

How do we find a good estimate for the parameters $\theta$ for a chosen model class? The general recipe is as follows. The first step is to introduce a model that makes sense biophysically, and incorporates our prior knowledge in a tractable manner. Next we write down the likelihood of the observed data given the model parameters, along with a prior distribution that encodes our prior beliefs about the model parameters. Finally, we compute the posterior distribution of the model parameters given the observed data, using Bayes’ rule, which states that

$$p(\theta|D) \propto p(D|\theta)\, p(\theta)\,; \tag{10.18}$$

the left-hand side is the desired posterior distribution, while the right-hand side is just the product of the likelihood and the prior.

In the current setting, we need to write down the likelihood $p(D|X, \boldsymbol{k})$ of the observed spike data $D$ given the model parameter $\boldsymbol{k}$ and the observed set of stimuli summarized in the matrix $X$, and then we may employ standard likelihood optimization methods to obtain the maximum likelihood (ML) or maximum a posteriori (MAP) solutions for $\boldsymbol{k}$, defined by

$$\mathrm{ML:}\quad \hat{\boldsymbol{k}}_{\rm ML} = \mathrm{argmax}_{\boldsymbol{k}} \left\{ p(D|X, \boldsymbol{k}) \right\} \tag{10.19}$$
$$\mathrm{MAP:}\quad \hat{\boldsymbol{k}}_{\rm MAP} = \mathrm{argmax}_{\boldsymbol{k}} \left\{ p(D|X, \boldsymbol{k})\, p(\boldsymbol{k}) \right\} \tag{10.20}$$

where the maximization runs over all possible parameter choices.

We assume that spike counts per bin follow a conditional Poisson distribution, given $\rho(t)$:

$$n_t \sim \mathrm{Poiss}[\rho(t)\,\mathrm{d}t]\,; \tag{10.21}$$

cf. text and exercises of Chapter 7. For example, with the rate parameter of the Poisson distribution given by a GLM or SRM model $\rho(t) = f(\boldsymbol{k} \cdot \boldsymbol{x}_t)$, we have

$$p(D|X, \boldsymbol{k}) = \prod_t \left\{ \frac{[f(\boldsymbol{k} \cdot \boldsymbol{x}_t)\,\mathrm{d}t]^{n_t}}{n_t!} \exp[-f(\boldsymbol{k} \cdot \boldsymbol{x}_t)\,\mathrm{d}t] \right\}\,. \tag{10.22}$$

Here $\prod_t$ denotes the product over all time steps. We recall that, by definition, $n_t! = 1$ for $n_t = 0$.

Our aim is to optimize the parameter $\boldsymbol{k}$. For a given observed spike train, the spike count numbers $n_t$ are fixed. In this case, $(\mathrm{d}t)^{n_t}/n_t!$ is a constant which is irrelevant for the parameter optimization. If we work with a fixed time step $\mathrm{d}t$ and drop all units, we can therefore reshuffle the terms and consider the logarithm of the above likelihood

$$\log p(D|X, \boldsymbol{k}) = c_0 + \sum_t \left\{ n_t \log[f(\boldsymbol{k} \cdot \boldsymbol{x}_t)] - f(\boldsymbol{k} \cdot \boldsymbol{x}_t)\,\mathrm{d}t \right\}\,. \tag{10.23}$$

If we choose the time step $\mathrm{d}t$ shorter than the half-width of an action potential, say 1 ms, the spike count variable $n_t$ can only take the values zero or one. For small $\mathrm{d}t$, the likelihood of Eq. (10.22) is then identical to that of the SRM with escape noise, defined in Chapter 9; cf. Eqs. (9.10) and (9.15)–(9.17).
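As a concrete illustration, the log-likelihood of Eq. (10.23) can be evaluated in a few lines. The following sketch (Python with NumPy; the function name, array layout, and the choice $f = \exp$ are our own illustrative assumptions) will be reused in later sketches:

```python
import numpy as np

def glm_log_likelihood(k, X, n, dt):
    """Log-likelihood of Eq. (10.23), up to the constant c_0.

    k  : filter parameters, shape (p,)
    X  : design matrix with one stimulus row x_t per bin, shape (T, p)
    n  : observed spike counts per bin, shape (T,)
    dt : time step
    """
    # f(k . x_t), with f = exp as one choice satisfying the
    # convexity/log-concavity conditions discussed below
    rate = np.exp(X @ k)
    return np.sum(n * np.log(rate) - rate * dt)
```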

We don’t have an analytical expression for the maximum of the likelihood defined in Eq. (10.22), but nonetheless we can numerically optimize this function quite easily if we are willing to make two assumptions about the nonlinear function $f(\cdot)$. More precisely, if we assume that

(i) $f(u)$ is a convex (upward-curving) function of its scalar argument $u$, and

(ii) $\log f(u)$ is concave (downward-curving) in $u$,

then the loglikelihood above is guaranteed to be a concave function of the parameter $\boldsymbol{k}$, since in this case the loglikelihood is just a sum of concave functions of $\boldsymbol{k}$ (383).

This implies that the likelihood has no non-global maximum (also called local maximum). Therefore the maximum likelihood parameter $\hat{\boldsymbol{k}}_{\rm ML}$ may be found by numerical ascent techniques; cf. Fig. 10.2. Functions $f(\cdot)$ satisfying these two constraints are easy to think of: for example, the standard linear rectifier and the exponential function both work.
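For illustration, the concave maximization can then be handed to a generic gradient-based optimizer. A minimal sketch, assuming the hypothetical `glm_log_likelihood` helper above (since the objective is concave, minimizing its negative finds the global optimum):

```python
import numpy as np
from scipy.optimize import minimize

def fit_glm_ml(X, n, dt):
    """Maximum-likelihood fit of the GLM filter by concave optimization.

    Because the log-likelihood is concave, any optimum found by
    gradient-based ascent is the global maximum k_ML.
    """
    neg_ll = lambda k: -glm_log_likelihood(k, X, n, dt)
    res = minimize(neg_ll, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x   # estimate of k_ML
```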

Fitting model parameters proceeds as follows: we form the (augmented) matrix $X$ where each row is now

$$\boldsymbol{x}_t = (1,\, I_{t-1}\,\mathrm{d}t,\, I_{t-2}\,\mathrm{d}t,\, \dots,\, I_{t-K}\,\mathrm{d}t,\, n_{t-1},\, n_{t-2},\, \dots,\, n_{t-J})\,. \tag{10.24}$$

Similarly, in analogy to Eq. (10.10), the parameter vector is

$$\boldsymbol{k} = (b,\, \kappa(\mathrm{d}t),\, \kappa(2\,\mathrm{d}t),\, \dots,\, \kappa(K\,\mathrm{d}t),\, \eta(\mathrm{d}t),\, \eta(2\,\mathrm{d}t),\, \dots,\, \eta(J\,\mathrm{d}t))\,; \tag{10.25}$$

here $b = u_{\rm rest} - \vartheta$ is a constant offset term which we want to optimize.
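A minimal sketch of assembling this augmented matrix from a recorded current trace and binned spike counts (NumPy; the variable names and the zero-padding of bins before the recording starts are our own conventions):

```python
import numpy as np

def build_design_matrix(I, n, dt, K, J):
    """Design matrix of Eq. (10.24): constant, stimulus history, spike history.

    I, n : current samples and spike counts per bin, both shape (T,)
    K, J : number of stimulus-history and spike-history bins
    """
    T = len(n)
    X = np.zeros((T, 1 + K + J))
    X[:, 0] = 1.0                        # offset term b
    for j in range(1, K + 1):            # stimulus history I_{t-j} dt
        X[j:, j] = I[:T - j] * dt
    for j in range(1, J + 1):            # spike history n_{t-j}
        X[j:, K + j] = n[:T - j]
    return X
```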

We then calculate the log-likelihood

$$\log p(D|X, \boldsymbol{k}) = \sum_t \left( n_t \log f(X_t \cdot \boldsymbol{k}) - f(X_t \cdot \boldsymbol{k})\,\mathrm{d}t \right) \tag{10.26}$$

and compute the ML or maximum a posteriori (MAP) solution for the model parameters $\boldsymbol{k}$ by a concave optimization algorithm. Note that, while we still assume that the conditional spike count $n_t$ within a given short time bin is drawn from a one-dimensional $\mathrm{Poiss}(\rho(t)\,\mathrm{d}t)$ distribution given $\rho(t)$, the resulting model displays strong history effects (since $\rho(t)$ depends on the past spike trains) and therefore the output of the model, considered as a vector of counts $D = \{n_t\}$, is no longer a Poisson process, unless $\boldsymbol{\eta} = 0$. Importantly, because of the refractory effects incorporated by a strongly negative $\eta$ at small times, the spike count variable $n_t$ cannot take a value larger than $1$ if $\mathrm{d}t$ is in the range of one or a few milliseconds. Therefore, we can expect that interspike intervals are correctly reproduced in the model; see Fig. 10.5 for an example application.
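To illustrate the history dependence, the fitted model can be simulated forward bin by bin. A sketch assuming $f = \exp$ and a parameter vector split as in Eq. (10.25) (names are illustrative):

```python
import numpy as np

def simulate_glm(k, I, dt, K, J, seed=0):
    """Simulate spike counts from the fitted GLM, bin by bin.

    Because each bin's rate depends on the spikes just generated (through
    the spike-history weights eta), the output is not a Poisson process.
    """
    rng = np.random.default_rng(seed)
    b, kappa, eta = k[0], k[1:1 + K], k[1 + K:]    # split as in Eq. (10.25)
    T = len(I)
    n = np.zeros(T, dtype=int)
    for t in range(T):
        drive = b
        for j in range(1, min(K, t) + 1):
            drive += kappa[j - 1] * I[t - j] * dt  # stimulus history
        for j in range(1, min(J, t) + 1):
            drive += eta[j - 1] * n[t - j]         # spike history (refractoriness)
        n[t] = rng.poisson(np.exp(drive) * dt)
    return n
```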

Finally, we may expand the definition of $X$ to include observations of other spike trains, and therefore develop GLMs not just of single spike trains, but network models of how populations of neurons encode information jointly (95; 379; 521; 399). The resulting model is summarized as follows: spike counts are conditionally Poisson distributed given $\rho_i(t)$, i.e., $n_{i,t} \sim \mathrm{Poiss}(\rho_i(t)\,\mathrm{d}t)$, with a firing rate

$$\rho_i(t) = f\bigg( \boldsymbol{k}_i \cdot \boldsymbol{x}_t + \sum_{i' \neq i,\, j} \epsilon_{i',j}\, n_{i',t-j} \bigg)\,. \tag{10.27}$$

Here, $\rho_i(t)$ denotes the instantaneous firing rate of the $i$-th cell at time $t$ and $\boldsymbol{k}_i$ is the cell’s linear receptive field including spike-history effects; cf. Eq. (10.25). The net effect of a spike of neuron $i'$ onto the membrane potential of neuron $i$ is summarized by $\epsilon_{i',j}$; these terms are summed over all past spike activity $n_{i',t-j}$ in the population of cells from which we are recording simultaneously. In the special case that we record from all neurons in the population, $\epsilon_{i',j}$ can be interpreted as the excitatory or inhibitory postsynaptic potential caused by a spike of neuron $i'$ a time $j\,\mathrm{d}t$ earlier.
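A compact sketch of evaluating Eq. (10.27) for all cells in a single time bin (NumPy; the layout of the coupling array `eps` is our own illustrative choice):

```python
import numpy as np

def population_rates(t, K_filt, X, eps, n_hist, f=np.exp):
    """Firing rates of Eq. (10.27) for all N cells at time bin t.

    K_filt : per-cell receptive fields k_i, shape (N, p)
    X      : design matrix of stimulus rows x_t, shape (T, p)
    eps    : coupling weights eps[i, i', j], shape (N, N, J); the diagonal
             eps[i, i, :] is assumed zero, since Eq. (10.27) sums over
             i' != i only
    n_hist : spike counts of all cells, shape (N, T)
    """
    drive = K_filt @ X[t]                          # own stimulus drive k_i . x_t
    J = eps.shape[2]
    for j in range(1, J + 1):                      # coupling from past spikes
        if t - j >= 0:
            drive += eps[:, :, j - 1] @ n_hist[:, t - j]
    return f(drive)
```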

Example: Linear regression and voltage estimation

It may be helpful to draw an analogy to linear regression here. We want to show that the standard procedure of least-squares minimization can be linked to statistical parameter estimation under the assumption of Gaussian noise.

We consider the linear voltage model of Eq. (10.1). We are interested in the temporal filter properties of the neuron when it is driven by a time-dependent input $I(t)$. Let us set $\boldsymbol{x}_t = (I_t, I_{t-1}, \dots, I_{t-K})\,\mathrm{d}t$ and $\boldsymbol{k} = (\kappa(\mathrm{d}t), \dots, \kappa(K\,\mathrm{d}t))$. If we assume that the discrete-time voltage measurements have a Gaussian distribution around the mean predicted by the model of Eq. (10.4), then we need to maximize the likelihood

$$\log p(D|X, \boldsymbol{k}) = c_1 - c_2 \sum_t \left( u_t - (\boldsymbol{k} \cdot \boldsymbol{x}_t) \right)^2, \tag{10.28}$$

where $X = (\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T)$ is the matrix of observed stimuli, $c_1, c_2$ are constants that do not depend on the parameter $\boldsymbol{k}$, and the sum over $t$ runs over all observed time bins. Maximization yields $\boldsymbol{k}_{\rm opt} = (X^T X)^{-1} \left( \sum_t u_t \boldsymbol{x}_t / \mathrm{d}t \right)$, which determines the time course of the filter $\kappa(s)$ that characterizes the passive membrane properties. The result is identical to Eq. (10.8):

$$\hat{\boldsymbol{k}}_{\rm opt} = \hat{\boldsymbol{k}}_{\rm LS}\,. \tag{10.29}$$
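In code, this closed-form estimator amounts to solving the normal equations. A sketch (using `np.linalg.solve` instead of an explicit matrix inverse, a standard numerical-stability choice; names are ours):

```python
import numpy as np

def fit_filter_least_squares(X, u, dt):
    """Least-squares filter estimate of Eq. (10.8)/(10.29).

    X  : design matrix of stimulus-history rows x_t, shape (T, K)
    u  : measured membrane voltage per bin, shape (T,)
    """
    # k_LS = (X^T X)^{-1} (sum_t u_t x_t / dt)
    return np.linalg.solve(X.T @ X, X.T @ u / dt)
```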

10.2.2 Regularization: maximum penalized likelihood

In the linear regression case it is well known that estimates of the components of the parameter vector $\boldsymbol{k}$ can be quite noisy when the dimension of $\boldsymbol{k}$ is large. The noisiness of the estimate $\hat{\boldsymbol{k}}_{\rm ML}$ is roughly proportional to the dimensionality of $\boldsymbol{k}$ (the number of parameters in $\boldsymbol{k}$ that we need to estimate from data) divided by the total number of observed samples (382). The same ‘overfitting’ phenomenon (estimator variability increasing with the number of parameters) occurs in the GLM context. A variety of methods have been introduced to ‘regularize’ the estimated $\boldsymbol{k}$, i.e., to incorporate prior knowledge about the shape and/or magnitude of the true $\boldsymbol{k}$ in order to reduce the noise in $\hat{\boldsymbol{k}}_{\rm ML}$. One basic idea is to restrict $\boldsymbol{k}$ to lie within a lower-dimensional subspace; we then employ the same fitting procedure to estimate the coefficients of $\boldsymbol{k}$ within this lower-dimensional basis (model selection procedures may be employed to choose the dimensionality of this subspace (521)).

A slightly less restrictive approach is to maximize the posterior

$$p(\boldsymbol{k}|X, D) \propto p(D|X, \boldsymbol{k})\, p(\boldsymbol{k}) \tag{10.30}$$

(with $\boldsymbol{k}$ allowed to take values in the full original basis), instead of the likelihood $p(D|X, \boldsymbol{k})$; here $p(\boldsymbol{k})$ encodes our a priori beliefs about the true underlying $\boldsymbol{k}$.

It is easy to incorporate this maximum a posteriori idea in the GLM context (383): we simply maximize

$$\log p(\boldsymbol{k}|X, D) = c + \log p(\boldsymbol{k}) + \log p(D|X, \boldsymbol{k}) \tag{10.31}$$
$$= c + \log p(\boldsymbol{k}) + \sum_t \left( n_t \log f(\boldsymbol{x}_t \cdot \boldsymbol{k}) - f(\boldsymbol{x}_t \cdot \boldsymbol{k})\,\mathrm{d}t \right)\,. \tag{10.32}$$

Whenever $\log p(\boldsymbol{k})$ is a concave function of $\boldsymbol{k}$, this ‘penalized’ likelihood (where $\log p(\boldsymbol{k})$ acts to penalize improbable values of $\boldsymbol{k}$) is a concave function of $\boldsymbol{k}$, and ascent-based maximization may proceed (with no local maximum) as before; cf. Fig. 10.4.
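A sketch of this penalized fit under the zero-mean Gaussian prior discussed in the example below, reusing the hypothetical `glm_log_likelihood` helper from above (since $\log p(\boldsymbol{k}) = c - \boldsymbol{k}^T A \boldsymbol{k}/2$ is concave, the objective remains concave):

```python
import numpy as np
from scipy.optimize import minimize

def fit_glm_map(X, n, dt, A):
    """MAP fit of Eqs. (10.31)-(10.32) with a zero-mean Gaussian prior.

    A : prior inverse covariance (positive definite), shape (p, p);
        log p(k) = const - k^T A k / 2, concave in k.
    """
    neg_post = lambda k: 0.5 * k @ A @ k - glm_log_likelihood(k, X, n, dt)
    res = minimize(neg_post, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x   # estimate of k_MAP
```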

Fig. 10.4: Regularization. Left: Because of the concavity, there is only a single global maximum of the log-likelihood, which defines the optimal parameter choice $\boldsymbol{k}_{\rm ML}$. Right: Example of regularization by a prior (dashed line) that favors a smaller value for $\boldsymbol{k}_{\rm MAP}$.

Example: Linear regression and Gaussian prior

In the linear regression case, the computationally simplest prior is a zero-mean Gaussian, $\log p(\boldsymbol{k}) = c - \boldsymbol{k}^T A \boldsymbol{k}/2$, where $A$ is a positive definite matrix (the inverse covariance matrix). The Gaussian prior can be combined with the Gaussian noise model of Eq. (10.28). Maximizing the corresponding posterior analytically (452; 485) then leads directly to the regularized least-squares estimator

$$\hat{\boldsymbol{k}}_{\rm RLS} = (X^T X + A)^{-1} \left( \sum_t u_t \boldsymbol{x}_t / \mathrm{d}t \right)\,. \tag{10.33}$$
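Relative to the unregularized estimator sketched earlier, this is essentially a one-line change; a sketch (the identity-scaled default for $A$ is our illustrative choice, corresponding to a ridge penalty):

```python
import numpy as np

def fit_filter_regularized(X, u, dt, A=None, lam=1.0):
    """Regularized least-squares estimator of Eq. (10.33).

    A : prior inverse covariance; defaults to lam * identity (ridge penalty).
    """
    if A is None:
        A = lam * np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + A, X.T @ u / dt)
```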

10.2.3 Fitting Generalized Integrate-and-Fire models to Data

Suppose an experimentalist has injected a time-dependent current $I(t)$ into a neuron and has recorded with a second electrode the voltage $u^{\rm exp}(t)$ of the same neuron. The voltage trajectory contains spikes $S(t) = \sum_f \delta(t - t^{(f)})$ with firing times $t^{(1)}, t^{(2)}, \dots, t^{(N)}$. The natural approach would be to write down the joint likelihood of observing both the spike times and the subthreshold membrane potential (381). A simpler approach is to maximize the likelihood of observing the spikes separately from the likelihood of observing the membrane potential.

Given the input $I(t)$ and the spike train $S(t)$, the voltage of an SRM is given by Eq. (10.9), and we can adjust the parameters of the filters $\kappa$ and $\eta$ so as to minimize the squared error of Eq. (10.5). We now fix the parameters for the membrane potential and maximize the likelihood of observing the spike times given our model voltage trajectory $u(t)$. We insert $u(t)$ into the escape rate function $\rho(t) = f(u(t) - \vartheta(t))$, which contains the parameters of the threshold

$$\vartheta(t) = \vartheta_0 + \int_0^\infty \vartheta_1(s)\, S(t-s)\,\mathrm{d}s\,. \tag{10.34}$$

We then calculate the log-likelihood

$$\log p(D|X, \boldsymbol{k}) = c + \sum_t \left( n_t \log f(X_t \cdot \boldsymbol{k}) - f(X_t \cdot \boldsymbol{k})\,\mathrm{d}t \right) \tag{10.35}$$

and compute the ML or maximum a posteriori (MAP) solution for the model parameters $\boldsymbol{k}$ (which are the parameters of the threshold; the subthreshold voltage parameters are already fixed) by an optimization algorithm for concave functions.
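A sketch of this second step, assuming an exponential escape rate $f$ and a model voltage trace `u_model` already obtained from the regression step (the helper name and discretization are our own):

```python
import numpy as np
from scipy.optimize import minimize

def fit_threshold(u_model, n, dt, J):
    """Step 2: with u(t) fixed, fit the threshold parameters by concave ML.

    Escape rate rho(t) = exp(u(t) - vartheta(t)), with Eq. (10.34)
    discretized as vartheta_t = theta0 + sum_j theta1[j] * n_{t-j}.
    """
    T = len(n)
    H = np.zeros((T, J))                     # spike-history regressors
    for j in range(1, J + 1):
        H[j:, j - 1] = n[:T - j]

    def neg_log_likelihood(p):
        theta0, theta1 = p[0], p[1:]
        log_rate = u_model - theta0 - H @ theta1
        return -np.sum(n * log_rate - np.exp(log_rate) * dt)

    res = minimize(neg_log_likelihood, x0=np.zeros(1 + J), method="BFGS")
    return res.x                             # (vartheta_0, vartheta_1 values)
```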

Fig. 10.5 shows an example application. Both the voltage in the subthreshold regime and the spike times are nicely reproduced. Therefore, we can expect that interspike intervals are correctly reproduced as well. In order to quantify the performance of neuron models, we need to develop criteria of ‘goodness of fit’ for the subthreshold membrane potential, spike timings, and possibly higher-order spike train statistics. This is the topic of Section 10.3; we will return to similar applications in the next chapter.

10.2.4 Extensions (*)

The GLM encoding framework described here can be extended in a number of important directions. We briefly describe two such directions here.

First, as we have described the GLM above, it may appear that the model is restricted to including only linear dependencies on the stimulus $\boldsymbol{x}_t$, through the $\boldsymbol{k} \cdot \boldsymbol{x}_t$ term. However, if we modify our input matrix $X$ once again, to include nonlinear transformations $\mathcal{F}_j(\boldsymbol{x})$ of the stimulus $\boldsymbol{x}$, we may fit nonlinear models of the form

$$\rho(t) = f\Big( \sum_j a_j \mathcal{F}_j(\boldsymbol{x}) \Big) \tag{10.36}$$

efficiently by maximizing the loglikelihood $\log p(D|X, \boldsymbol{a})$ with respect to the weight parameter $\boldsymbol{a}$ (560; 12). Mathematically, the nonlinearities $\mathcal{F}_j(\boldsymbol{x})$ may take essentially arbitrary form; physiologically speaking, it is clearly wise to choose $\mathcal{F}_j(\boldsymbol{x})$ to reflect known facts about the anatomy and physiology of the system (e.g., $\mathcal{F}_j(\boldsymbol{x})$ might model inputs from a presynaptic layer whose responses are better characterized than are those of the neuron of interest (450)).

Second, in many cases it is reasonable to include terms in $X$ that we may not be able to observe or calculate directly (e.g., intracellular noise, or the dynamical state of the network); fitting the model parameters in this case requires that we properly account for these ‘latent,’ unobserved variables in our likelihood. While inference in the presence of these hidden parameters is beyond the scope of this chapter, it is worth noting that this type of model may be fit tractably using generalizations of the methods described here, at the cost of increased computational complexity but with the benefit of enhanced model flexibility and realism (484; 563; 536).

Example: Estimating spike-triggered currents and dynamic threshold

In Fig. 10.1A, we have suggested to estimate the filter $\kappa(s)$ by extracting its values $k_j = \kappa(j\,\mathrm{d}t)$ at discrete, equally spaced time steps: the integral $\int_0^\infty \kappa(s)\, I(t-s)\,\mathrm{d}s = \sum_{j=1}^K k_j I_{t-j}\,\mathrm{d}t$ is linear in the $K$ parameters.

However, the observables remain linear in the parameters $k_j$ if we set $\kappa(s) = \sum_{j=1}^4 k_j \exp(-s/\tau_j)$ with fixed time constants, e.g., $\tau_j = 10^j\,$ms. Again, the integral $\int_0^\infty \kappa(s)\, I(t-s)\,\mathrm{d}s = \sum_{j=1}^4 k_j \int_0^\infty \exp(-s/\tau_j)\, I(t-s)\,\mathrm{d}s$ is linear in the parameters. The exponentials play the role of ‘basis functions’ $F_j$.

Similarly, the threshold filter $\vartheta_1(s)$ or the spike-afterpotential $\eta(s)$ can be expressed with basis functions. A common choice is to take rectangular basis functions $F_j$ which take a value of unity on a finite interval $[t_j, t_{j+1}]$. Exponential spacing of the time points, $t_j = 2^{j-1}\,$ms, allows us to cover a large time span with a small number of parameters. Regular spacing leads back to the naive discretization scheme.
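A sketch of constructing such basis-function regressors for the filter $\kappa$, using the exponential basis with the fixed time constants of the example above (NumPy; names are ours):

```python
import numpy as np

def exp_basis_features(I, dt, taus=(10.0, 100.0, 1000.0, 10000.0)):
    """Regressors for kappa(s) = sum_j k_j exp(-s / tau_j).

    Column j holds the discretized integral of exp(-s / tau_j) I(t - s) ds,
    so the model output stays linear in the coefficients k_j.
    taus follow the text's example tau_j = 10^j (in ms, same units as dt).
    """
    T = len(I)
    lags = np.arange(1, T + 1) * dt                # s = dt, 2 dt, ...
    X = np.zeros((T, len(taus)))
    for col, tau in enumerate(taus):
        basis = np.exp(-lags / tau)                # F_j at the discrete lags
        conv = np.convolve(basis, I) * dt          # conv[t-1] = sum_j F_j(j dt) I[t-j] dt
        X[1:, col] = conv[:T - 1]                  # causal part; no history at t = 0
    return X
```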

Fig. 10.5: Comparing models with intracellular recordings. A. A noisy time-dependent current is used to stimulate the neurons experimentally (dashed line corresponds to zero current). B. Recording from the neuron (thin black line) shows membrane potential fluctuations and action potentials. Simulating an SRM (thick black line) with the same current, using parameters previously optimized on a different data set, shows similar membrane potential fluctuations (inset) and action potentials. Some of the spikes are missed, some are added, but most coincide with the recorded ones. C. Multiple repeated stimulations with the same current show the intrinsic variability of neural responses (the first 9 rows are recorded action potentials, indicated by thick black ticks). The variability is matched by the model (the last 9 rows are model action potentials). Data and models from Mensi et al. (338).