Building demand system from scratch

My personal notes on demand system in IO
study
Author

Hyoungchul Kim

Published

March 27, 2026

Notes

These are my personal notes on my understanding of demand systems and their estimation (especially focusing on logit-like models) in IO. Note that this is a living document and will be updated as I learn more.

What is a structural model?

This section is not about rigorously defining the meaning of a structural model. It is more about listing factors that I think are useful to keep in the back of your head when dealing with structural models. This is important because a lot of what we do in modeling demand systems in IO (at least that's what I think) is about specifying a structural model.

  • At the end of the day, the goal of a structural model is to specify an economic environment with the features that the researcher believes are important for understanding a certain behavior or outcome.
  • For example, why do some people choose insurance A while others choose insurance B (an outcome observed in the data)? Depending on the empirical context, the answer could be unobserved product quality or some search friction.
  • Data (observed variation) are what we use to pin down the features (e.g., "deep" parameters) of interest.
  • Having a probabilistic model (and thinking about unobservables) is important for two reasons:
    1. It is realistic: the econometrician cannot observe everything.
    2. We cannot fit the data without them. The data contain variation we cannot explain, and we need unobservable shocks to rationalize it.
  • In the end, it's all about finding the identifying variation for your parameters of interest (e.g., if your utility function includes a price variable and unobserved product quality, you cannot identify the price coefficient unless you have some variation in price, say over time, that is plausibly unrelated to the unobserved quality).

What is IO?

This section is not about formally defining the field of IO. I am just writing down some things IO research seems to focus on.

IO is about understanding market structure and how it affects equilibrium outcomes such as prices and quantities. By market structure, we mainly mean the various features of the supply side of the market (the firms). IO tries to understand the characteristics of the market structure and how it affects welfare (consumer surplus, profit) in equilibrium.

Old IO used to analyze empirical associations between market structure and outcomes across industries. But this was problematic because market structure is endogenous. This led to the game theory period. Game theory was useful because it provided tools to characterize the strategic interactions of the agents that make up the market.

New empirical IO tends to focus on a specific industry and employs economic theory and econometric methods to fully characterize the market of interest. In the process, the field has developed many useful tools (demand estimation, production function estimation, dynamic models, etc.) that are also helpful to non-IO researchers.

Why demand system?

So why care about the demand system? Understanding demand is crucial and helpful in itself, but demand can also tell us more than demand per se.

In IO, people are interested in market structure and competition. One quantitative measure that is useful in this respect is the markup, because markups tell us about the market power a firm has. Under perfect competition, firms set price equal to marginal cost. But many markets are not perfectly competitive, so measuring markups gives us some sense of how competitive the market is.

The problem is that we cannot directly observe markups. The markup is basically \(\frac{p - mc}{mc}\). From this expression you can clearly see the issue: we do not know marginal cost. We usually do not have data on the cost of producing a marginal unit (this is different from accounting cost; accounting cost is usually not sufficient to tell us about marginal cost).

So where does estimating demand fit in this context? Surprisingly, estimating demand can help us recover marginal cost and thus markup.

Differentiated product Bertrand

While differentiated-product Bertrand (competition in prices) is not the only setting IO economists work on, I feel it is the most intuitive setting for understanding why people care about demand.1

In the multi-product Bertrand problem, firm \(f\) solves \[\begin{align} \pi_f (\mathbf{p}) = \sum_{j \in J_f}(p_j - c_j) \cdot q_j(\mathbf{p}) \end{align}\] where \(f\) indexes firms, \(j\) indexes products, \(J_f\) is the set of products owned by firm \(f\), and \(\mathbf{p}\) is the price vector.

If you apply FOC, you get: \[\begin{align} 0 = q_j(\mathbf{p}) + \sum_{k \in J_f} (p_k - c_k) \frac{\partial q_k (\mathbf{p})}{\partial p_j} \end{align}\]

If you rearrange this with respect to price, you get a nice expression involving markups and the Lerner index, but for now we will not go there. For easier manipulation, I can write this FOC in matrix form: \[\begin{align} q(\mathbf{p}) = \Delta(\mathbf{p}) \cdot (\mathbf{p} - \mathbf{mc}) \end{align}\] where the \((j, k)\) entry of \(\Delta(\mathbf{p})\) is \(-\frac{\partial q_k (\mathbf{p})}{\partial p_j}\) if \(j\) and \(k\) are owned by the same firm, and 0 otherwise.

Then we can recover marginal cost by inverting the matrix of price derivatives: \(\mathbf{mc} = \mathbf{p} - \Delta(\mathbf{p})^{-1} q(\mathbf{p})\). And how can we get the price derivatives? Demand estimation!
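To make the inversion step concrete, here is a minimal numerical sketch. Every number below (prices, quantities, demand derivatives, ownership structure) is made up for illustration; in practice the derivative matrix would come from an estimated demand system.

```python
import numpy as np

# Hypothetical market with 3 products: firm A owns products 0 and 1,
# firm B owns product 2. All numbers are illustrative.

p = np.array([10.0, 12.0, 9.0])            # observed prices
q = np.array([100.0, 80.0, 120.0])         # observed quantities
# dqdp[j, k] = d q_j / d p_k, hypothetically recovered from demand estimation
dqdp = np.array([[-20.0,   5.0,   4.0],
                 [  6.0, -15.0,   3.0],
                 [  4.0,   3.0, -25.0]])
ownership = np.array([[1, 1, 0],
                      [1, 1, 0],
                      [0, 0, 1]])           # 1 if the same firm owns j and k

# Delta[j, k] = -dq_k/dp_j when j and k share an owner, 0 otherwise
Delta = -ownership * dqdp.T

# FOC in matrix form: q = Delta (p - mc)  =>  mc = p - Delta^{-1} q
mc = p - np.linalg.solve(Delta, q)
markup = (p - mc) / mc
print("marginal costs:", mc.round(2))       # [2.67 4.22 4.2 ] with these numbers
print("markups:", markup.round(2))
```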

However, there are challenges in estimating demand:

  1. Endogeneity of price
  2. So many substitution patterns to estimate (\(J\) products mean \(J^2\) parameters to estimate!)

Product space (mostly about AIDS model)

(This part contains notes I created with Claude after a back-and-forth self-education session with it.)

1. The Goal: What Are We Trying to Do?

We want a system of demand equations — one for each good — that satisfies the restrictions of consumer theory (adding up, homogeneity, symmetry, negative semidefiniteness of the Slutsky matrix) while being flexible enough to approximate arbitrary demand behavior. Ideally, the model should also be estimable with aggregate data on budget shares and prices.

The AIDS model (Deaton and Muellbauer, 1980) achieves this by starting from a specific class of preferences — the PIGLOG class — and deriving demand equations from them.


2. Background: Duality and the Expenditure Function

Consumer theory gives us two equivalent ways to describe preferences:

  • Direct utility: the consumer maximizes \(u(q)\) subject to \(p'q \leq x\), yielding Marshallian demands \(q_i(p, x)\).
  • Expenditure function: the minimum expenditure required to reach utility level \(u\) at prices \(p\), written \(e(p, u)\).

The expenditure function \(e(p, u)\) must satisfy several properties to be consistent with rational behavior:

  1. Homogeneous of degree 1 in prices: doubling all prices doubles the cost of reaching any utility level.
  2. Non-decreasing in \(p\) and \(u\): higher prices or higher utility require more spending.
  3. Concave in \(p\): reflects the consumer’s ability to substitute away from goods whose prices rise.

The key duality result is Shephard’s lemma: the Hicksian (compensated) demand for good \(i\) is the partial derivative of the expenditure function with respect to \(p_i\):

\[h_i(p, u) = \frac{\partial e(p, u)}{\partial p_i}\]

This means that if we can write down a valid expenditure function, we can derive demand equations by differentiation — and the resulting demands will automatically satisfy all the theoretical restrictions. This is the strategy AIDS follows.


3. The PIGLOG Class: What It Is and Why It Matters

3.1 Definition

PIGLOG stands for Price-Independent Generalized Logarithmic preferences. The class is defined by expenditure functions that take the following form:

\[\ln e(p, u) = (1 - u) \ln a(p) + u \ln b(p)\]

where:

  • \(a(p)\) is a function of prices alone, interpretable as the cost of “subsistence” or the expenditure function at the lowest utility level (\(u = 0\)).
  • \(b(p)\) is another function of prices alone, interpretable as the cost of “bliss” or the expenditure function at the highest utility level (\(u = 1\)).
  • \(u \in [0, 1]\) is utility, normalized to this interval.

So \(\ln e\) is a weighted average (in logs) of two price-dependent cost functions, with utility serving as the weight.

3.2 What “Price-Independent Generalized Logarithmic” Means

The name comes from two properties:

  • Logarithmic: the expenditure function is log-linear in utility. That is, \(\ln e\) is an affine function of \(u\). This is a generalization of the simple log utility case.
  • Price-Independent: the way utility enters the expenditure function does not depend on prices. The weights in the affine combination — \((1 - u)\) and \(u\) — are pure numbers, not functions of \(p\). The prices only affect the two anchor points \(a(p)\) and \(b(p)\).
  • Generalized: compared to simpler specifications (like homothetic preferences), PIGLOG allows non-homothetic behavior — budget shares can change with income.

3.3 Why PIGLOG? The Aggregation Argument

This is a crucial motivation that is often underappreciated. Muellbauer (1975, 1976) showed that for market-level demand to behave as if it were generated by a single “representative consumer,” preferences must belong to a restricted class. The PIGLOG class is the most general class of preferences for which:

\[W = f\left(\frac{x}{p}\right)\]

where \(W\) is the market budget share, \(x\) is aggregate (or average) expenditure, and \(p\) is a price index. In other words, market-level budget shares depend on a single scalar summary of the expenditure distribution — you don’t need to know the entire income distribution, just total expenditure (or mean expenditure).

This is a powerful result. Most demand systems estimated on aggregate data implicitly assume some form of aggregation consistency. PIGLOG makes this assumption explicit and provides the weakest sufficient conditions for it. If preferences are not PIGLOG, then aggregate demand behavior generally cannot be rationalized by a representative consumer, and market-level budget share equations are misspecified unless you model the entire income distribution.

3.4 What PIGLOG Rules Out

The PIGLOG restriction is not innocuous. It imposes a specific structure on how income (expenditure) affects demand:

  • Engel curves are rank-2: the budget share for each good is a function of \(\ln x\) (specifically, affine in \(\ln x\) after appropriate price deflation). This is more flexible than homothetic preferences (where budget shares are independent of income) but less flexible than completely unrestricted Engel curves.
  • No higher-order income effects: PIGLOG cannot capture, for example, goods that are necessities at low income, luxuries at middle income, and inferior at high income. The budget share is monotone in \(\ln(x/P)\) for any given set of prices.

In practice, the rank-2 restriction is often adequate for broad expenditure categories but may be too restrictive for highly disaggregated goods.


4. Specifying \(a(p)\) and \(b(p)\): From PIGLOG to AIDS

The PIGLOG structure tells us that \(\ln e = (1 - u) \ln a(p) + u \ln b(p)\). To get a concrete demand system, we need to choose functional forms for \(a(p)\) and \(b(p)\).

4.1 The AIDS Specification

Deaton and Muellbauer chose:

\[\ln a(p) = \alpha_0 + \sum_k \alpha_k \ln p_k + \frac{1}{2} \sum_k \sum_j \gamma_{kj}^* \ln p_k \ln p_j\]

\[\ln b(p) = \ln a(p) + \beta_0 \prod_k p_k^{\beta_k}\]

So \(a(p)\) is a flexible quadratic (in logs) function of prices, and \(\ln b(p)\) adds a Cobb-Douglas term \(\beta_0 \prod_k p_k^{\beta_k}\) to \(\ln a(p)\).

Substituting into the PIGLOG structure:

\[\ln e(p, u) = \alpha_0 + \sum_k \alpha_k \ln p_k + \frac{1}{2} \sum_k \sum_j \gamma_{kj}^* \ln p_k \ln p_j + u \beta_0 \prod_k p_k^{\beta_k}\]

4.2 Why These Specific Forms?

The choice of \(a(p)\) as a translog-like function is driven by the goal of second-order flexibility (explained in detail below). The quadratic-in-logs structure gives enough free parameters to match arbitrary first and second derivatives of any expenditure function at a point. The Cobb-Douglas form for the \(b(p)/a(p)\) ratio is chosen for tractability — it keeps the demand equations linear in parameters.


5. Deriving the Demand Equations

5.1 From Expenditure Function to Budget Shares

A fundamental property of expenditure functions is that the budget share of good \(i\) can be written as:

\[w_i = \frac{\partial \ln e(p, u)}{\partial \ln p_i}\]

This follows from Shephard’s lemma: \(h_i = \partial e / \partial p_i\), and the budget share is \(w_i = p_i h_i / e\), which equals \(\partial \ln e / \partial \ln p_i\).

5.2 Differentiating

Taking the log-price derivative of the expenditure function:

\[w_i = \alpha_i + \sum_j \gamma_{ij} \ln p_j + \beta_i u \beta_0 \prod_k p_k^{\beta_k}\]

where \(\gamma_{ij} = \frac{1}{2}(\gamma_{ij}^* + \gamma_{ji}^*)\) (symmetrizing the cross-price terms).
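As a sanity check on this differentiation (my own verification, not part of the original derivation), one can let a computer algebra system differentiate the AIDS cost function with respect to a log price in a two-good example and confirm the share equation above:

```python
import sympy as sp

# Differentiate the AIDS log expenditure function w.r.t. a log price
# for 2 goods, using already-symmetrized gammas.

t1, t2, u = sp.symbols("t1 t2 u")                 # t_k = ln p_k
a0, a1, a2 = sp.symbols("alpha0 alpha1 alpha2")
g11, g12, g22 = sp.symbols("gamma11 gamma12 gamma22")
b0, b1, b2 = sp.symbols("beta0 beta1 beta2")

ln_a = a0 + a1*t1 + a2*t2 + sp.Rational(1, 2)*(g11*t1**2 + 2*g12*t1*t2 + g22*t2**2)
cobb_douglas = b0 * sp.exp(b1*t1 + b2*t2)         # beta0 * prod_k p_k^beta_k
ln_e = ln_a + u * cobb_douglas

w1 = sp.expand(sp.diff(ln_e, t1))
print(w1)
# The result equals alpha1 + gamma11*t1 + gamma12*t2
#   + beta1*u*beta0*exp(beta1*t1 + beta2*t2),
# i.e., the share equation w_i above.
```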

5.3 Eliminating Utility

The expenditure function gives us \(\ln e(p, u)\), and in equilibrium the consumer spends their income: \(e(p, u) = x\). So \(\ln x = \ln e(p, u)\), which means:

\[u \beta_0 \prod_k p_k^{\beta_k} = \ln x - \ln a(p) = \ln(x / a(p))\]

More precisely, from the expenditure function:

\[\ln x = \alpha_0 + \sum_k \alpha_k \ln p_k + \frac{1}{2} \sum_k \sum_j \gamma_{kj} \ln p_k \ln p_j + u \beta_0 \prod_k p_k^{\beta_k}\]

So:

\[u \beta_0 \prod_k p_k^{\beta_k} = \ln x - \alpha_0 - \sum_k \alpha_k \ln p_k - \frac{1}{2} \sum_k \sum_j \gamma_{kj} \ln p_k \ln p_j\]

The right-hand side is \(\ln x - \ln a(p) = \ln(x / a(p))\).

But notice that \(\ln a(p)\) is the price index \(\ln P\):

\[\ln P = \alpha_0 + \sum_k \alpha_k \ln p_k + \frac{1}{2} \sum_k \sum_j \gamma_{kj} \ln p_k \ln p_j\]

5.4 The Final AIDS Budget Share Equation

Substituting back:

\[w_i = \alpha_i + \sum_j \gamma_{ij} \ln p_j + \beta_i \ln\left(\frac{x}{P}\right)\]

This is the AIDS demand system. Each good’s budget share is:

  • A constant \(\alpha_i\) (the share when all log prices are zero and real expenditure is one — interpretable as the “base” share).
  • A sum of log-price effects \(\gamma_{ij} \ln p_j\) (how the share responds to each price).
  • An income effect \(\beta_i \ln(x/P)\) (how the share responds to real expenditure, deflated by the price index \(P\)).

6. The Price Index and the Linear Approximation (LA/AIDS)

6.1 The Nonlinear Price Index

The “exact” AIDS price index is:

\[\ln P = \alpha_0 + \sum_k \alpha_k \ln p_k + \frac{1}{2} \sum_k \sum_j \gamma_{kj} \ln p_k \ln p_j\]

This is nonlinear in parameters (because \(\alpha_0\), \(\alpha_k\), and \(\gamma_{kj}\) appear both in the price index and in the share equations). Estimation requires iterative nonlinear methods.

6.2 Stone’s Price Index Approximation

Deaton and Muellbauer suggested replacing \(\ln P\) with Stone’s geometric price index:

\[\ln P^* = \sum_k w_k \ln p_k\]

where \(w_k\) are the observed budget shares. This yields the Linear Approximate AIDS (LA/AIDS), which is linear in parameters and can be estimated by standard linear methods (OLS, SUR, or ITSUR).

The LA/AIDS is by far the most commonly estimated version. However, there are known issues:

  • Stone’s index uses endogenous budget shares \(w_k\) on the right-hand side, potentially creating simultaneity.
  • The elasticities from LA/AIDS are not exactly the same as those from the exact AIDS, though the differences are typically small.
  • Various alternatives have been proposed (e.g., the Laspeyres, Paasche, or Tornqvist index), but Stone’s index remains the default.
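To fix ideas, here is a minimal LA/AIDS sketch on simulated data. Everything here (the fake data-generating process, the parameter values, and the equation-by-equation OLS) is purely illustrative; a real application would use SUR/ITSUR, impose the Section 8 restrictions, and worry about the endogeneity discussed in Section 10.

```python
import numpy as np

# LA/AIDS on simulated data: 3 goods, Stone price index, OLS per equation.

rng = np.random.default_rng(0)
T, n = 200, 3
ln_p = rng.normal(size=(T, n))                   # log prices
ln_x = rng.normal(loc=5.0, size=T)               # log total expenditure

# "True" parameters used only to generate fake shares (adding up holds)
alpha = np.array([0.4, 0.35, 0.25])
beta = np.array([0.05, -0.02, -0.03])
gamma = np.array([[ 0.06, -0.02, -0.04],
                  [-0.02,  0.05, -0.03],
                  [-0.04, -0.03,  0.07]])

ln_P0 = ln_p.mean(axis=1)                        # crude index just to simulate data
w = alpha + ln_p @ gamma.T + beta * (ln_x - ln_P0)[:, None]
w += rng.normal(scale=0.01, size=(T, n))
w = w / w.sum(axis=1, keepdims=True)             # force adding up

# LA/AIDS: replace ln P with Stone's index built from observed shares
ln_P_stone = (w * ln_p).sum(axis=1)
X = np.column_stack([np.ones(T), ln_p, ln_x - ln_P_stone])

for i in range(n - 1):                           # drop the last equation (adding up)
    coef, *_ = np.linalg.lstsq(X, w[:, i], rcond=None)
    print(f"good {i}: alpha={coef[0]:.3f}, gamma={coef[1:1+n].round(3)}, beta={coef[-1]:.3f}")
```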

7. Second-Order Flexibility: What It Means and Why It Matters

7.1 The Concept of a Flexible Functional Form

A functional form is called a first-order flexible form if it has enough free parameters to match the value and all first derivatives of an arbitrary function at a single point. It is called second-order flexible if it can also match all second derivatives at that point.

For a demand system, the “arbitrary function” is the true (unknown) expenditure function \(e^*(p, u)\), and the “point” is a particular price-utility vector \((p^0, u^0)\), often taken to be the sample mean.

7.2 What Second-Order Flexibility Requires

Consider an expenditure function with \(n\) goods. At a point, you need to match:

  • The value \(\ln e(p^0, u^0)\): 1 parameter.
  • First derivatives \(\partial \ln e / \partial \ln p_i\) for each \(i\): \(n\) parameters. These are the budget shares.
  • Second derivatives \(\partial^2 \ln e / \partial \ln p_i \partial \ln p_j\) for all \(i, j\): \(n(n+1)/2\) parameters (exploiting symmetry). These are the price elasticities (Slutsky terms).
  • Income derivatives \(\partial \ln e / \partial u\) and \(\partial^2 \ln e / \partial u \partial \ln p_i\): \(1 + n\) more parameters. These govern income effects.

In total, second-order flexibility requires roughly \(1 + n + n(n+1)/2 + 1 + n\) free parameters for an \(n\)-good system.

7.3 How AIDS Achieves Second-Order Flexibility

The AIDS expenditure function has the following free parameters:

  • \(\alpha_0\): 1 parameter.
  • \(\alpha_i\) for \(i = 1, \ldots, n\): \(n\) parameters.
  • \(\gamma_{ij}\) for \(i \leq j\): \(n(n+1)/2\) parameters.
  • \(\beta_0\): 1 parameter (it scales the utility-dependent term in the cost function).
  • \(\beta_i\) for \(i = 1, \ldots, n\): \(n\) parameters.

This count matches the requirement above. At any point \((p^0, u^0)\), you can choose these parameters so that the AIDS expenditure function matches the value, gradient, and Hessian of any arbitrary expenditure function. The AIDS model can therefore approximate any well-behaved demand system to a second-order approximation.

7.4 What Second-Order Flexibility Does and Does Not Mean

What it means:

  • Local approximation guarantee: at the approximation point, the AIDS model’s demand elasticities, income elasticities, and budget shares are exactly right (since it matches through second derivatives). Near that point, the approximation error is third-order — small for small deviations.
  • No a priori restrictions on elasticities: at the approximation point, the own-price elasticities, cross-price elasticities, and income elasticities are unrestricted (subject to adding up, homogeneity, and symmetry). The model does not force goods to be substitutes or complements.

What it does not mean:

  • Not a global guarantee: away from the approximation point, the AIDS model may be a poor approximation. It could violate concavity of the expenditure function (which means violations of the law of demand) at data points far from the base.
  • Not a claim about the true functional form: saying AIDS is second-order flexible is not saying the true preferences are PIGLOG. It is saying that whatever the true preferences are, AIDS can locally mimic them to second order.
  • Not unique: many other functional forms are also second-order flexible (the translog, the generalized Leontief, the quadratic expenditure system, etc.). AIDS is special not because of flexibility per se but because it combines flexibility with the budget-share formulation, aggregation properties, and tractability.

7.5 Comparison with Other Flexible Forms

| Model | Second-Order Flexible? | Budget Share Form? | Aggregation? | Notes |
|---|---|---|---|---|
| AIDS | Yes | Yes (linear in \(\ln p\), \(\ln(x/P)\)) | PIGLOG class | The main subject here |
| Translog (cost fn.) | Yes | Yes (in producer theory) | Limited | Common in production economics |
| Rotterdam | First-order only | Yes (in differences) | Restrictive | Simpler but less flexible |
| Quadratic AIDS (QUAIDS) | Yes (enhanced) | Yes | Extended PIGLOG | Adds \([\ln(x/P)]^2\) term |
| CBS | Yes | Hybrid | Limited | Combines features of AIDS and Rotterdam |

7.6 QUAIDS: When Second-Order Flexibility of AIDS Is Not Enough

Banks, Blundell, and Lewbel (1997) observed that Engel curves for some goods (e.g., alcohol, fuel) are not well-described by the linear \(\ln(x/P)\) term in AIDS — they exhibit curvature. The Quadratic AIDS (QUAIDS) adds a squared real-expenditure term:

\[w_i = \alpha_i + \sum_j \gamma_{ij} \ln p_j + \beta_i \ln\left(\frac{x}{P}\right) + \frac{\lambda_i}{b(p)} \left[\ln\left(\frac{x}{P}\right)\right]^2\]

This allows budget shares to be quadratic in log real expenditure, capturing goods that are luxuries at low income but necessities at high income (or vice versa). QUAIDS extends the PIGLOG class to a broader rank-3 expenditure function while maintaining much of AIDS’s tractability.


8. Theoretical Restrictions and Testing

8.1 Adding Up

Budget shares must sum to 1: \(\sum_i w_i = 1\) for all \(p\) and \(x\). This imposes:

\[\sum_i \alpha_i = 1, \qquad \sum_i \gamma_{ij} = 0 \; \forall j, \qquad \sum_i \beta_i = 0\]

These are parametric restrictions that can be imposed during estimation (typically by dropping one equation and recovering its parameters from the restrictions).

8.2 Homogeneity of Degree Zero

Demand should be unaffected by proportional scaling of all prices and income: \(q_i(\lambda p, \lambda x) = q_i(p, x)\). In the AIDS share equation, this requires:

\[\sum_j \gamma_{ij} = 0 \; \forall i\]

This says the sum of all price coefficients in each equation is zero — a proportional increase in all prices does not change any budget share (since real expenditure \(x/P\) also adjusts).

8.3 Slutsky Symmetry

The compensated cross-price effects must be symmetric: \(\partial h_i / \partial p_j = \partial h_j / \partial p_i\). In the AIDS parameterization, this requires:

\[\gamma_{ij} = \gamma_{ji}\]

This is a testable restriction. In the unrestricted model, \(\gamma_{ij}\) and \(\gamma_{ji}\) are separate parameters. Imposing symmetry reduces the parameter count and can be tested with a standard likelihood ratio or Wald test.

8.4 Negativity (Negative Semidefiniteness of the Slutsky Matrix)

The Slutsky substitution matrix must be negative semidefinite — own-price compensated effects must be non-positive. Unlike the above restrictions, negativity is an inequality constraint and cannot be imposed by simple parameter restrictions. It must be checked at each data point after estimation. Violations indicate that the model is locally inconsistent with utility maximization at those observations.


9. Elasticities from the AIDS Model

9.1 Expenditure (Income) Elasticity

From the budget share equation:

\[\eta_i = 1 + \frac{\beta_i}{w_i}\]

If \(\beta_i > 0\), good \(i\) is a luxury (\(\eta_i > 1\)): its budget share rises with income. If \(\beta_i < 0\), it is a necessity (\(\eta_i < 1\)). If \(\beta_i < -w_i\), it is an inferior good (\(\eta_i < 0\)).

9.2 Uncompensated (Marshallian) Price Elasticity

\[\epsilon_{ij} = -\delta_{ij} + \frac{\gamma_{ij}}{w_i} - \frac{\beta_i w_j}{w_i}\]

where \(\delta_{ij}\) is the Kronecker delta (\(1\) if \(i = j\), \(0\) otherwise).

9.3 Compensated (Hicksian) Price Elasticity

Applying the Slutsky equation (\(\epsilon_{ij}^* = \epsilon_{ij} + \eta_i w_j\)):

\[\epsilon_{ij}^* = -\delta_{ij} + \frac{\gamma_{ij}}{w_i} + w_j\]

These elasticities are computed at observed budget shares, so they vary across observations — a desirable property for a flexible model.
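Here is a small sketch of how these formulas are used in practice (the parameter values and shares are made up): compute the three elasticity types at a given share vector and check Slutsky negativity numerically, as Section 8.4 suggests.

```python
import numpy as np

# AIDS elasticities at a point, with illustrative parameters and shares.

w = np.array([0.5, 0.3, 0.2])                     # budget shares at the evaluation point
beta = np.array([0.05, -0.02, -0.03])
gamma = np.array([[-0.10,  0.06,  0.04],
                  [ 0.06, -0.09,  0.03],
                  [ 0.04,  0.03, -0.07]])         # symmetric, rows sum to 0

n = len(w)
delta = np.eye(n)                                 # Kronecker delta

eta = 1 + beta / w                                # expenditure elasticities
eps_marshall = -delta + gamma / w[:, None] - np.outer(beta / w, w)
eps_hicks = -delta + gamma / w[:, None] + w[None, :]   # = eps_marshall + eta_i * w_j

# Share-weighted compensated elasticities w_i * eps_hicks_ij = gamma_ij + w_i w_j - w_i delta_ij;
# this matrix should be negative semidefinite if the model is consistent with
# utility maximization at this point.
slutsky = w[:, None] * eps_hicks
eigvals = np.linalg.eigvalsh((slutsky + slutsky.T) / 2)
print("expenditure elasticities:", eta.round(3))
print("own-price (Marshallian) :", np.diag(eps_marshall).round(3))
print("Slutsky eigenvalues     :", eigvals.round(4))   # want all <= 0 (one is exactly 0)
```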


10. Estimation in Practice

10.1 System Estimation

AIDS is a system of \(n\) equations (one per good). Because the adding-up constraint means the \(n\)-th equation is redundant (its parameters are determined by the others), you estimate \(n - 1\) equations. The results are invariant to which equation is dropped.

Typical estimation methods:

  • Seemingly Unrelated Regressions (SUR): accounts for correlation in errors across equations.
  • Iterated SUR (ITSUR): iterates until convergence; numerically equivalent to maximum likelihood under normality.
  • Full Information Maximum Likelihood (FIML): directly maximizes the joint likelihood.

10.2 Endogeneity

Prices and expenditure are often endogenous:

  • Prices may be correlated with demand shocks (e.g., a supply shock affects both price and quantity).
  • Total expenditure \(x\) is chosen by the consumer and therefore jointly determined with budget shares.

Instruments are needed. For expenditure, total household income is the standard instrument. For prices, cost shifters, supply-side variables, or Hausman-type instruments (prices of the same good in other markets) are common.
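A bare-bones illustration of the instrumenting step, using manual 2SLS on simulated data for a single share equation. The variable names, instruments, and numbers are all hypothetical; the point is only the mechanics of replacing endogenous regressors with their projections on instruments.

```python
import numpy as np

# 2SLS for one LA/AIDS-style share equation: income instruments expenditure,
# a cost shifter instruments the own log price.

rng = np.random.default_rng(1)
T = 500
cost_shifter = rng.normal(size=T)
income = rng.normal(size=T)
demand_shock = rng.normal(scale=0.05, size=T)

ln_p = 0.8 * cost_shifter + 0.5 * demand_shock + rng.normal(scale=0.2, size=T)
ln_x_real = 0.9 * income + 0.3 * demand_shock + rng.normal(scale=0.2, size=T)
w = 0.4 - 0.10 * ln_p + 0.05 * ln_x_real + demand_shock   # "true" share equation

def two_sls(y, X, Z):
    """Standard 2SLS: project X on Z, then regress y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

X = np.column_stack([np.ones(T), ln_p, ln_x_real])          # endogenous regressors
Z = np.column_stack([np.ones(T), cost_shifter, income])     # instruments
print("OLS :", np.linalg.lstsq(X, w, rcond=None)[0].round(3))
print("2SLS:", two_sls(w, X, Z).round(3))   # should be closer to (0.4, -0.10, 0.05)
```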

10.3 Dynamics and Stationarity

Static AIDS assumes instantaneous adjustment. In practice, habits, adjustment costs, and infrequent purchasing create dynamics. Extensions include:

  • Error correction AIDS: models short-run dynamics around a long-run AIDS equilibrium.
  • Dynamic AIDS: includes lagged budget shares or lagged prices.
  • First-differenced AIDS: removes fixed effects and addresses nonstationarity.

11. AIDS vs. Random-Coefficient Logit: A Summary of the Tradeoff

| Dimension | AIDS | Random-Coefficient Logit |
|---|---|---|
| Substitution | Free parameters for every pair (\(\gamma_{ij}\)); unrestricted within the product set | Induced through characteristics; restricted by what you observe |
| Scalability | \(O(n^2)\) parameters; fails with many products | \(O(K^2)\) parameters where \(K\) = characteristics; scales to hundreds of products |
| New products | Cannot accommodate without re-specification | Natural: new product = new characteristic bundle |
| Theory | Derived from expenditure function; clean welfare analysis | Based on random utility; welfare via Small-Rosen surplus |
| Aggregation | PIGLOG class guarantees representative consumer | Requires simulation over heterogeneous consumers |
| Flexibility | Second-order flexible at a point | Not "flexible" in the Diewert sense, but can approximate rich patterns if characteristics are well-chosen |
| Income effects | Nonlinear Engel curves built in | Requires additional specification (e.g., log-income interactions) |
| Why substitution works | Does not explain; just estimates \(\gamma_{ij}\) | Explains via position in characteristic space |
| Typical application | Broad categories (food groups, fuel types, imports by origin) | Differentiated products (cars, cereals, insurance plans) |

12. Summary

The AIDS model derives demand equations from a specific expenditure function belonging to the PIGLOG class. The PIGLOG specification buys two things: exact aggregation (market-level budget shares can be treated as if generated by a representative consumer) and tractable demand equations that are linear in parameters. The cost is a restriction on the shape of Engel curves (rank-2, linear in log real expenditure).

The model is second-order flexible, meaning it has enough free parameters to locally approximate any demand system at a point — no a priori restrictions on own-price, cross-price, or income elasticities at the approximation point. This is a local property: the approximation may deteriorate away from the base point, and global regularity (concavity, monotonicity) is not guaranteed.

The fundamental tradeoff with characteristic-space models like random-coefficient logit is between the unrestricted substitution patterns of AIDS (which come at the cost of \(O(n^2)\) parameters and inability to handle new products) and the parsimonious, generalizable structure of random coefficients (which require the analyst to correctly specify the relevant product characteristics). For markets with few, stable products and aggregate data, AIDS is often the right tool. For markets with many differentiated products, product entry/exit, and the need for counterfactual analysis, random-coefficient logit dominates.

Statistical model of product differentiation: discrete choice model

Now we will move on to characteristic space. This assumes that consumers choose products based on their characteristics. While it might sound less intuitive than product space, this setup is useful because it solves the dimensionality issue. As long as the set of characteristics is sensible, an increase in the number of products does not lead to more parameters to estimate. Also, characteristic space is useful for handling cases such as new products, since a new product is just some bundle of characteristics already defined in the model.

But before we move on to the structural model, we will first focus on the statistical model that is used a lot in this setup: the discrete choice model.

btw, why do we start with indirect utility in discrete choice models?

Actually, this can be derived from the usual utility function. Suppose consumer maximizes:

\[ \max_{q_1, q_2, c} U(q_1, q_2, c) \]

subject to

\[\begin{align} p_1 q_1 + p_2 q_2 + c = m \\ q_1 q_2 = 0 \end{align}\]

\(c\) is the numeraire good with its price normalized to 1. To make this simpler, we will assume that each quantity has to be either 1 or 0.

Then we can get the conditional indirect utility function (conditioning on \(q_1 = 0\)):

\[ \max_{q_2, c} U(0, q_2, c) \]

subject to

\[ p_2 q_2 + c = m \]

There is a standard solution to this problem: demand functions \(q_2(p_2, m)\) and \(c(p_2, m)\). Plugging these back into the utility function, we get the conditional indirect utility function:

\[ V_2(p_2, m) = U(0, q_2(p_2, m), c(p_2, m)) = U(0, 1, m-p_2) \]

where the last equality holds because quantities are restricted to 0 or 1, so conditional on choosing good 2 we have \(q_2 = 1\) and \(c = m - p_2\).

Doing the same conditioning on \(q_2 = 0\), we can see that the solution to the discrete choice part of the decision problem is given by a choice between the two conditional indirect utility functions:

\[ \max_{j=\{1,2 \}} V_j (p_j, m) \]

This is the discrete choice model we usually see. We then usually add some random shocks to accommodate the variation in people's choices in real-world data.

Multinomial logit2

  • in period \(t\)
  • with \(J_t\) alternatives.
  • subscript individual agents by \(i\).
  • agent \(i\) chooses alternative \(j \in J_t\) with probability \(s_{ijt}\).
  • agent \(i\) receives utility \(U_{ijt}\) for choosing \(j\).
  • the choice set is exhaustive and the alternatives are mutually exclusive.

We might wonder why we don't impose some parametric form. Actually (at least for a start), we don't need to start the discrete choice model with a particular parametric form. We can be fully nonparametric.

Then we can write the probability \(s_{ij}\) as:

\[\begin{align} s_{ij} &= Pr (U_{ij} > U_{ik}, \quad \forall k \neq j)\\ &= Pr(V_{ij} + \varepsilon_{ij} > V_{ik} + \varepsilon_{ik}, \quad \forall k \neq j)\\ &= Pr(\varepsilon_{ij} - \varepsilon_{ik} > V_{ik} - V_{ij}, \quad \forall k \neq j)\\ &= \int I(\varepsilon_{ij} - \varepsilon_{ik} > V_{ik} - V_{ij}, \; \forall k \neq j) f(\varepsilon_i) \, d \varepsilon_i \end{align}\]

where \(f(\varepsilon_i)\) is the joint density of the \(J\)-vector of individual \(i\)'s unobserved utilities. You can already see that implementing this \(J\)-dimensional integration would be painful. To alleviate this, we usually assume some convenient distribution for the unobserved utility (T1EV, etc.).
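One way to see both the integral and why T1EV is convenient (a small illustration of my own, with made-up mean utilities): approximate the integral by simulating the unobserved utilities and compare the resulting choice frequencies with the closed-form logit shares that the T1EV assumption delivers.

```python
import numpy as np

# Simulate the J-dimensional choice probability integral and compare with
# the closed-form logit shares under i.i.d. Type I extreme value errors.

rng = np.random.default_rng(0)
V = np.array([1.0, 0.5, 0.0, -0.5])         # deterministic utilities for J = 4 options
R = 200_000                                  # number of simulation draws

eps = rng.gumbel(size=(R, len(V)))           # T1EV draws, one per alternative
choices = np.argmax(V + eps, axis=1)         # utility-maximizing choice per draw
sim_shares = np.bincount(choices, minlength=len(V)) / R

logit_shares = np.exp(V) / np.exp(V).sum()   # closed-form logit probabilities
print("simulated:", sim_shares.round(3))
print("logit    :", logit_shares.round(3))   # the two should nearly coincide
```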

In the multinomial logit, what is important is differences in utility, not levels. Adding the same constant to every alternative's utility is irrelevant, so individual-specific factors that do not vary across alternatives will not be identified.

One thing to note is that going from \(U_{ij} = V_{ij} + \varepsilon_{ij}\) to \(U_{ij} = \lambda V_{ij} + \lambda \varepsilon_{ij}\) (for \(\lambda > 0\)) doesn't change any choice. So we normalize the scale of utility by fixing the variance of the error term. This also implies that only ratios of coefficients are identified.

The reason people use T1EV is because it gives closed-form choice probabilities:

\[ s_{ij} = \frac{e^{V_{ij}}}{\sum_k e^{V_{ik}}}. \]

We can also get a nice closed form for the expected maximum utility (the logsum):

\[ E[\max_j U_{ij}] = \log \left( \sum_j \exp [V_{ij} ] \right) + C \]

This is nice, as the ex-ante expected utility of the best option (before the error terms are realized) has a simple closed form that does not depend on the particular error draws. It also allows simple computation of \(\Delta CS\) for consumer welfare (though not the level of CS itself).
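As a toy example of the \(\Delta CS\) calculation (the utilities and the price coefficient \(\alpha\) below are illustrative, and the constant \(C\) cancels when taking differences):

```python
import numpy as np

# Consumer surplus change from the logsum formula when one product's price
# rises by $1; alpha is the assumed marginal utility of income.

alpha = 0.2                                   # price coefficient (utility per dollar)
V_before = np.array([0.0, 1.0, 0.8])          # mean utilities, outside option at 0
V_after = np.array([0.0, 1.0, 0.8 - alpha * 1.0])   # product 2's price rises by $1

def logsum(V):
    return np.log(np.exp(V).sum())

delta_cs = (logsum(V_after) - logsum(V_before)) / alpha
print(f"change in consumer surplus per person: {delta_cs:.3f} dollars")
```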

btw, we also need to normalize one of the choices to provide a utility of zero (since adding a constant does not change the result). We usually set the outside option to have zero deterministic utility.

One interesting property of multinomial logit is that the ratio of choice probabilities for \(j\) and \(k\) depends only on \(j\) and \(k\) and not on any alternative \(l\). This is known as independence of irrelevant alternatives (IIA).

IIA arises because the unobserved error terms are assumed to be independent across alternatives.

The IIA Problem in Standard Logit in detail3

In a random utility model, individual \(i\)’s utility for alternative \(j\) is decomposed into an observed component \(V_{ij}\) and an unobserved component \(\varepsilon_{ij}\):

\[U_{ij} = V_{ij} + \varepsilon_{ij}\]

The standard logit model assumes that all \(\varepsilon_{ij}\) are drawn i.i.d. from a Type I extreme value distribution. This yields the familiar closed-form choice probability:

\[s_{ij} = \frac{e^{V_{ij}}}{\sum_k e^{V_{ik}}}\]

A direct consequence of this structure is the Independence of Irrelevant Alternatives (IIA) property: the ratio of choice probabilities between any two alternatives depends only on those two alternatives:

\[\frac{s_j}{s_k} = \frac{e^{V_j}}{e^{V_k}}\]

This ratio is unaffected by the presence or characteristics of any other alternative in the choice set. In the classic red bus/blue bus example, adding blue bus (which is essentially identical to red bus) leads the model to predict that blue bus steals market share proportionally from all existing alternatives, including car. In reality, blue bus should steal almost entirely from red bus, since they are near-perfect substitutes.

Why Better Observed Characteristics Don’t Fix IIA

A natural intuition is that if we include enough observed characteristics in \(V\) — say a “bus” dummy, travel time, cost, comfort — the model should be able to recognize that red bus and blue bus are similar and handle substitution correctly. This intuition is wrong, and the reason is fundamental.

No matter how richly you specify \(V\), the logit model always appends an independent error:

\[U_{\text{red bus}} = V_{\text{bus}} + \varepsilon_{\text{red}}\] \[U_{\text{blue bus}} = V_{\text{bus}} + \varepsilon_{\text{blue}}\]

Even if \(V\) is identical for both bus options, \(\varepsilon_{\text{red}}\) and \(\varepsilon_{\text{blue}}\) are independent by assumption. The model treats the unobserved idiosyncratic shock to red bus as having nothing to do with the shock to blue bus — just as if you were comparing bus to bicycle.

In reality, whatever unobserved reason makes someone prefer red bus (maybe they live near the bus stop, or they like the route) almost certainly applies to blue bus too. The unobserved errors should be highly correlated. But the logit model cannot represent this correlation — it is not a matter of adding more covariates to \(V\). You are only enriching the observed part while the unobserved part always remains, always has a fixed variance (normalized to \(\pi^2/6\)), and is always independent across alternatives.

In principle, if the unobserved component were genuinely tiny relative to \(V\), the IIA distortion would be small. But the logit model does not reward you for having great observables — the error variance is fixed by normalization — and in practice the unobserved component tends to be large enough that IIA creates real problems.

How Random Coefficients Solve the Problem

The mixed logit (random coefficients) model solves the IIA problem not by directly specifying how errors are correlated, but by inducing correlation through the observed characteristics.

Suppose people have heterogeneous valuations of travel time, so the coefficient \(\beta_i\) is random:

\[U_{ij} = \beta_i \cdot \text{time}_j + \varepsilon_{ij}\]

From the perspective of someone who only observes the population mean \(\bar{\beta}\), the effective unobserved component for alternative \(j\) is:

\[(\beta_i - \bar{\beta}) \cdot \text{time}_j + \varepsilon_{ij}\]

The first term, \((\beta_i - \bar{\beta}) \cdot \text{time}_j\), is the key. It is correlated across alternatives in proportion to how similar their values of \(\text{time}_j\) are. Since red bus and blue bus have nearly identical travel times, this component is nearly identical for both — creating exactly the correlation structure that standard logit lacks. Car has a very different travel time, so its effective error is only weakly correlated with the bus options.

The i.i.d. \(\varepsilon_{ij}\) terms are still present, but they are now dominated by the correlated component flowing through the random coefficient. The model no longer treats red bus and blue bus as independent in the unobserved dimension.

A Concrete Numerical Example

Consider three alternatives with the following travel times:

  • Red bus: 30 minutes
  • Blue bus: 32 minutes
  • Car: 10 minutes

Suppose the population mean coefficient is \(\bar{\beta} = -2\). Consider two individuals with different draws of \(\beta_i\).

Person A draws \(\beta_i = -4\) (hates travel time more than average), so \(\beta_i - \bar{\beta} = -2\). The unobserved component \((\beta_i - \bar{\beta}) \cdot \text{time}_j\) for each alternative is:

  • Red bus: \((-2) \times 30 = -60\)
  • Blue bus: \((-2) \times 32 = -64\)
  • Car: \((-2) \times 10 = -20\)

Red bus and blue bus are hit by nearly the same penalty (\(-60\) vs \(-64\)) because their travel times are close. Car receives a much smaller penalty (\(-20\)) because its travel time is very different.

Person B draws \(\beta_i = -1\) (dislikes travel time less than average), so \(\beta_i - \bar{\beta} = +1\). The unobserved component becomes:

  • Red bus: \((+1) \times 30 = +30\)
  • Blue bus: \((+1) \times 32 = +32\)
  • Car: \((+1) \times 10 = +10\)

Again, red bus and blue bus move nearly in lockstep — both receive a large boost — while car moves in the same direction but by a different magnitude.

Across the population, as \(\beta_i\) varies, the utilities of red bus and blue bus go up and down together because they share similar travel times. This means the people who switch to blue bus are predominantly the same people who already liked red bus. The model correctly understands that blue bus steals mostly from red bus, not proportionally from car. IIA is broken.
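A stylized simulation of this mechanism: the travel times follow the example above, while the intercepts, the coefficient distribution, and the draw count are my own illustrative choices, rescaled so that all three options get nontrivial shares.

```python
import numpy as np

# Mixed logit red bus / blue bus example: a random time coefficient makes the
# two bus utilities move together, so blue bus substitutes mostly with red bus.

rng = np.random.default_rng(42)
time = np.array([30.0, 32.0, 10.0])            # red bus, blue bus, car (minutes)
const = np.array([5.5, 5.5, 0.0])              # assumed alternative-specific constants
beta = rng.normal(loc=-0.2, scale=0.1, size=100_000)   # heterogeneous time coefficient

def mixed_logit_shares(idx):
    """Average the logit share formula over the beta draws for alternatives idx."""
    V = const[idx][None, :] + beta[:, None] * time[idx][None, :]
    expV = np.exp(V - V.max(axis=1, keepdims=True))
    return (expV / expV.sum(axis=1, keepdims=True)).mean(axis=0)

s_all = mixed_logit_shares([0, 1, 2])          # red bus, blue bus, car available
s_drop = mixed_logit_shares([0, 2])            # blue bus removed

print("with blue bus   :", s_all.round(3))
print("without blue bus:", s_drop.round(3))
# Blue bus's share flows mostly to red bus rather than proportionally to car,
# because draws of beta move the two bus utilities together.
```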

The Role of Observed Characteristics

For this mechanism to work well, you need enough observed variables to span the main dimensions along which alternatives are similar. The random coefficients can only generate correlation through the variables you include.

If you only have one observed variable (say travel time), the model captures correlation along that single dimension. Alternatives similar in travel time will have correlated utilities, but alternatives similar in some unmeasured way will not.

If you add more variables — cost, comfort, frequency, reliability — and allow their coefficients to be random, the model can capture correlation along all of those dimensions. Two alternatives that are similar on multiple observed characteristics will end up with highly correlated effective errors, producing realistic substitution patterns.

The honest limitation is that if there exists some important dimension of similarity that no observed variable captures — even approximately — random coefficients cannot help with that. The correlation must flow through something in the model.

In practice, however, mixed logit is remarkably flexible. Even rough proxies for the true sources of similarity tend to work reasonably well. And compared to nested logit, where the analyst must manually specify which alternatives belong in which group, mixed logit lets the substitution patterns emerge from the data — a significant advantage when the structure of similarity across alternatives is not obvious in advance.

What Adding Observed Characteristics Does (and Does Not) Do in Standard Logit

If adding observed characteristics does not fix IIA, one might wonder what the point of including them is in the first place. The answer is that they serve a different and still essential purpose: they improve the accuracy of the predicted choice probabilities by getting \(V_{ij}\) right.

Consider a transportation mode choice model where you initially include only travel cost. The model predicts choice probabilities based on cost differences alone. Now suppose you add travel time. The model can now distinguish between a cheap-but-slow bus and a cheap-and-fast subway. The predicted \(V\) for each alternative becomes more accurate, and the overall fit of the model improves. People who face long bus travel times will have lower predicted \(V_{\text{bus}}\), and the model will correctly predict a lower choice probability for them.

What does not change is the substitution pattern. Suppose the initial choice probabilities are 50% car, 30% red bus, 20% blue bus. Under IIA, if blue bus becomes unavailable, its 20% is redistributed proportionally: car goes to 62.5% and red bus goes to 37.5%. Adding more covariates changes the levels (maybe with travel time included, the probabilities become 40% car, 35% red bus, 25% blue bus), but the proportional substitution rule still applies. If blue bus is removed, car and red bus still gain in proportion to their original shares.

To illustrate with numbers, consider two scenarios in a standard logit:

Sparse model (cost only): \(V_{\text{car}} = -5\), \(V_{\text{red bus}} = -3\), \(V_{\text{blue bus}} = -3\). The predicted shares are approximately 6% car, 47% red bus, 47% blue bus. Remove blue bus, and the model predicts car rises to about 12% and red bus to about 88%. The ratio \(s_{\text{car}} / s_{\text{red bus}}\) stays fixed at its original value.

Rich model (cost + time + comfort): \(V_{\text{car}} = -2.7\), \(V_{\text{red bus}} = -3\), \(V_{\text{blue bus}} = -3.1\). The shares change (roughly 42% car, 31% red bus, 28% blue bus), but the substitution pattern is still proportional. Remove blue bus, and car and red bus gain in proportion to their original shares, 42:31.

In the rich model, the levels are more accurate, but the cross-elasticities are still constrained by IIA. The model still cannot capture that blue bus’s share should flow mostly to red bus. So adding observed variables improves what the model predicts but not how it reallocates when the choice set changes.
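A quick numeric check of the two scenarios above, confirming that the car-to-red-bus ratio is unchanged when blue bus is removed, regardless of how rich \(V\) is:

```python
import numpy as np

# Logit shares before and after removing blue bus, for both scenarios.

def logit_shares(V):
    expV = np.exp(np.asarray(V, dtype=float))
    return expV / expV.sum()

for label, V in [("sparse (cost only)", [-5.0, -3.0, -3.0]),
                 ("rich (cost+time+comfort)", [-2.7, -3.0, -3.1])]:
    s_full = logit_shares(V)                 # car, red bus, blue bus
    s_drop = logit_shares(V[:2])             # blue bus removed
    print(label)
    print("  with blue bus   :", s_full.round(3))
    print("  without blue bus:", s_drop.round(3))
    print("  car/red-bus ratio unchanged:",
          np.isclose(s_full[0] / s_full[1], s_drop[0] / s_drop[1]))
```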

This is precisely why random coefficients are needed. They take those same observed variables and use them to generate the correlation that governs substitution. The observed variables serve double duty in a mixed logit: they improve the mean utility \(V\) (just as in standard logit) and they induce realistic substitution patterns through preference heterogeneity.

Random Coefficients as Preference Heterogeneity

The motivation for random coefficients is often presented in terms of fixing IIA, but the more fundamental interpretation is that they capture preference heterogeneity — the fact that different people value the same characteristic differently.

In a fixed coefficient model, everyone has the same \(\beta\) for travel time. This implies that a one-minute reduction in travel time is worth the same to a busy executive as to a retired person with no schedule constraints. This is clearly unrealistic. In a random coefficient model, \(\beta_i\) varies across individuals, reflecting the reality that some people care intensely about time savings while others barely notice.

Consider a concrete example with two observed characteristics, travel time and cost:

\[U_{ij} = \beta_i^{\text{time}} \cdot \text{time}_j + \beta_i^{\text{cost}} \cdot \text{cost}_j + \varepsilon_{ij}\]

Suppose \(\beta_i^{\text{time}} \sim N(-2, 1.5^2)\) and \(\beta_i^{\text{cost}} \sim N(-3, 0.5^2)\).

The large standard deviation on time (1.5) relative to its mean (−2) tells us that people vary a lot in how much they care about travel time. Some people have \(\beta_i^{\text{time}}\) near zero (they don’t mind long trips) while others have \(\beta_i^{\text{time}}\) near \(-5\) (they strongly avoid slow options). This heterogeneity is substantively important: it tells you who benefits from time-saving investments, how the market segments, and how different groups would respond to policy changes.

The smaller standard deviation on cost (0.5) relative to its mean (−3) tells us that people are more uniform in how they respond to cost — nearly everyone dislikes paying more, and they dislike it by roughly similar amounts.

This preference heterogeneity is valuable in its own right, completely apart from IIA. It tells the analyst about the population, enables welfare calculations for different subgroups, and improves predictions about how people respond to changes. The IIA fix is a beneficial consequence of modeling this heterogeneity, not the sole motivation for it.

When Preference Heterogeneity Is Absent: Homogeneous Coefficients

A natural follow-up question is: what happens if preferences for some characteristic are actually homogeneous? If everyone truly values cost at \(\beta^{\text{cost}} = -3\) with no variation, does making that coefficient random matter?

The answer is no — and the data will tell you. If you estimate a mixed logit and allow \(\beta_i^{\text{cost}}\) to be random, the estimated standard deviation will come back near zero. The model is saying: “I tried to find heterogeneity on this dimension and there isn’t any.”

When the standard deviation is zero, the random coefficient collapses to a fixed coefficient:

\[(\beta_i^{\text{cost}} - \bar{\beta}^{\text{cost}}) \cdot \text{cost}_j \approx 0\]

This component contributes no variation across people and therefore no correlation across alternatives. Cost, in this scenario, does not help break IIA — but it still contributes to getting the mean utility \(V_{ij}\) right, just as it would in standard logit.

The variables that matter for breaking IIA are specifically those with meaningful preference heterogeneity. Consider a model with three random coefficients:

  • \(\beta_i^{\text{time}} \sim N(-2, 1.5^2)\) — large heterogeneity
  • \(\beta_i^{\text{comfort}} \sim N(1, 1.2^2)\) — large heterogeneity
  • \(\beta_i^{\text{cost}} \sim N(-3, 0.1^2)\) — near-homogeneous

In this model, time and comfort drive the substitution patterns because people vary substantially in how they value these characteristics. Cost contributes almost nothing to the correlation structure because everyone values it similarly. If two alternatives have similar travel times and comfort levels, they will have highly correlated effective errors — regardless of whether their costs are similar or different — because the heterogeneity is concentrated on time and comfort.

In practice, this means the analyst should think carefully about which coefficients to allow to be random. Making everything random adds computational burden. The coefficients that matter most are those where there are strong prior reasons to expect heterogeneity — characteristics where different types of people plausibly have very different valuations. For characteristics where preferences are likely uniform (everyone prefers lower cost, everyone prefers less pollution), a fixed coefficient may be sufficient, and the data will confirm this through a small estimated standard deviation.

A Bit More About What V Does and What ε Does

The Two Components of Utility in Standard Logit

In the standard logit model, individual \(n\)’s utility for alternative \(j\) is:

\[U_{nj} = V_{nj} + \varepsilon_{nj}\]

where \(V_{nj} = x_{nj}'\beta\) is the deterministic (observed) component and \(\varepsilon_{nj}\) is the random (unobserved) component, drawn i.i.d. from a Type I extreme value distribution.

These two components serve fundamentally different roles in the model, and confusing those roles is the source of a very natural — but ultimately incorrect — intuition about why “better observables should fix IIA.”


What V Does: Setting the Levels

The deterministic component \(V_{nj}\) controls market shares — the overall probability that each alternative is chosen. The logit choice probability is:

\[P_j = \frac{e^{V_j}}{\sum_k e^{V_k}}\]

If you enrich \(V_j\) by adding more and better observed characteristics — a bus dummy, travel time, cost, comfort, frequency, reliability — you improve the model’s ability to predict how popular each alternative is. You get the levels right.

This is valuable. A model that correctly predicts red bus has 40% market share, blue bus has 36%, and car has 24% is better than one that predicts equal shares. But getting the levels right is the only thing \(V\) does. It does not, and cannot, influence how consumers substitute between alternatives when conditions change.

What ε Does: Governing Substitution

The substitution pattern — how demand shifts across alternatives when a price changes or an alternative is added or removed — is entirely determined by the structure of the error terms \(\varepsilon_{nj}\).

In standard logit, the errors are i.i.d. This independence assumption means that the unobserved factors making someone prefer red bus are treated as statistically unrelated to the unobserved factors making them prefer blue bus. From the model’s perspective, whatever idiosyncratic, unmeasured reason draws someone toward red bus (maybe they live near the stop, maybe they like the driver) has zero correlation with their idiosyncratic draw toward blue bus.

This independence is what produces the IIA property: the ratio of choice probabilities between any two alternatives depends only on those two alternatives:

\[\frac{P_j}{P_k} = \frac{e^{V_j}}{e^{V_k}}\]

And the substitution pattern that follows from IIA is proportional substitution: when people leave one alternative, they redistribute to all other alternatives in proportion to those alternatives’ market shares.

The cross-elasticity of the probability of choosing \(j\) with respect to an attribute of alternative \(k\) is:

\[\frac{\partial P_j}{\partial x_k} \cdot \frac{x_k}{P_j} \propto P_k\]

This depends only on \(P_k\) — the market share of the alternative whose attribute changed. It does not depend on how similar \(j\) is to \(k\) in any characteristic dimension.
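A small finite-difference check of this claim (illustrative numbers; \(x\) here stands for whatever attribute of alternative \(k\) changes):

```python
import numpy as np

# The logit cross-elasticity w.r.t. alternative k's attribute is the same for
# every other alternative j, because it depends only on P_k.

beta = -0.5                          # assumed coefficient on the attribute
x = np.array([2.0, 3.0, 4.0])        # attribute levels for 3 alternatives
P = np.exp(beta * x) / np.exp(beta * x).sum()

k = 2                                # perturb alternative k's attribute
formula = -beta * x[k] * P[k]        # cross-elasticity for any j != k

h = 1e-6
x_new = x.copy()
x_new[k] += h
P_new = np.exp(beta * x_new) / np.exp(beta * x_new).sum()
fd_elasticities = (P_new - P) / h * (x[k] / P)   # elasticity of each P_j w.r.t. x_k

print("formula:", round(formula, 4))
print("finite differences for j != k:", fd_elasticities[:2].round(4))  # identical values
```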


The Natural (But Wrong) Intuition

Here is the intuition that trips people up: “If I specify \(V\) well enough that red bus and blue bus have very similar deterministic utilities, the model should recognize they are similar and generate appropriate substitution patterns.”

This feels right because it conflates two distinct things:

  1. Similar \(V\) values produce similar market shares. This is true. If \(V_{\text{red bus}} = 1.5\) and \(V_{\text{blue bus}} = 1.4\), they will have similar choice probabilities.

  2. Alternatives with similar market shares receive more of the substitution. This is also true under logit. When someone leaves red bus, more of them go to blue bus (share = 0.36) than to car (share = 0.24), simply because blue bus has a bigger share.

So putting these together, it seems like making \(V\) similar for similar products should generate the right substitution. The problem is that this logic proves too much — it would also predict the same substitution toward any alternative that happens to have the same market share, regardless of whether it is actually similar.


A Concrete Example: Why Similar V Values Are Not Enough

Consider a market with three alternatives. We have done an excellent job specifying \(V\) — we’ve included a bus dummy, travel time, cost, comfort, everything relevant we can observe:

| Alternative | \(V\) | Choice Probability |
|---|---|---|
| Red bus | 1.5 | 0.40 |
| Blue bus | 1.4 | 0.36 |
| Car | 1.0 | 0.24 |

The model correctly identifies that both buses are popular and car is less so. Now suppose red bus raises its price, and some red bus riders switch away. Under logit, the switchers redistribute proportionally:

  • Fraction going to blue bus: \(\frac{0.36}{0.36 + 0.24} = 60\%\)
  • Fraction going to car: \(\frac{0.24}{0.36 + 0.24} = 40\%\)

Blue bus absorbs more switchers than car. This seems reasonable — and your intuition says: “See, the model is sending more people to the similar alternative.” But now consider the following thought experiment.

The Thought Experiment: Replace Blue Bus with Bicycle

Suppose instead of blue bus, the third alternative is "bicycle," and suppose bicycle happens to have \(V = 1.4\) as well (perhaps it's free, healthy, and appeals to a segment of the population). The market shares are now:

| Alternative | \(V\) | Choice Probability |
|---|---|---|
| Red bus | 1.5 | 0.40 |
| Bicycle | 1.4 | 0.36 |
| Car | 1.0 | 0.24 |

Now red bus raises its price again. The substitution pattern is identical:

  • Fraction going to bicycle: \(\frac{0.36}{0.36 + 0.24} = 60\%\)
  • Fraction going to car: \(\frac{0.24}{0.36 + 0.24} = 40\%\)

The model sends 60% of red bus switchers to bicycle — exactly the same fraction it sent to blue bus in the previous scenario. But in reality, these situations are nothing alike. People leaving red bus because it got expensive should overwhelmingly switch to blue bus (it’s the same service in a different color), but they should not overwhelmingly switch to bicycle (a completely different mode of transportation).

The standard logit model cannot distinguish between these two scenarios because it only sees market shares. An alternative with a 36% share that is nearly identical to the focal product and an alternative with a 36% share that is completely different are treated the same way. The similarity of blue bus to red bus — encoded in their similar \(V\) values — bought us nothing beyond the correct share prediction.

Another Way to See It: The Full Red Bus / Blue Bus Problem

Take the classic setup. Initially there are two alternatives:

| Alternative | \(V\) | Choice Probability |
|---|---|---|
| Red bus | 1.0 | 0.50 |
| Car | 1.0 | 0.50 |

Now add blue bus with \(V_{\text{blue bus}} = 1.0\) (identical to red bus in all observed characteristics). The new probabilities are:

| Alternative | \(V\) | Choice Probability |
|---|---|---|
| Red bus | 1.0 | 0.333 |
| Blue bus | 1.0 | 0.333 |
| Car | 1.0 | 0.333 |

Car’s share drops from 0.50 to 0.333. But in reality, blue bus should steal almost entirely from red bus. The correct shares should be approximately:

| Alternative | Realistic Probability |
|---|---|
| Red bus | 0.25 |
| Blue bus | 0.25 |
| Car | 0.50 |

No matter how many variables you add to \(V\) to make \(V_{\text{red bus}}\) and \(V_{\text{blue bus}}\) identical, you cannot change this result. Making their \(V\) values identical makes their shares identical (which is correct), but the model still steals from car proportionally (which is wrong). The model is incapable of saying “blue bus is a substitute for red bus more than it is a substitute for car” — it can only say “blue bus is popular” or “blue bus is unpopular.”


Why Adding More Observed Characteristics to V Cannot Fix This

It is worth understanding exactly why enriching \(V\) fails, not just that it fails.

No matter how richly you specify \(V_{nj}\), the full utility is:

\[U_{nj} = V_{nj} + \varepsilon_{nj}\]

The error \(\varepsilon_{nj}\) always remains, always has a fixed variance (normalized to \(\pi^2/6\) by the Type I extreme value assumption), and is always independent across alternatives. You have no control over this.

Even if \(V\) is identical for both bus options, \(\varepsilon_{\text{red bus}}\) and \(\varepsilon_{\text{blue bus}}\) are independent. The model treats the unobserved idiosyncratic shock to red bus as having nothing to do with the shock to blue bus — exactly as if you were comparing bus to bicycle, or bus to airplane.

In reality, whatever unobserved reason makes someone prefer red bus — maybe they live near the bus stop, or they like the schedule — almost certainly applies to blue bus too. The unobserved errors should be highly correlated. But the logit model’s distributional assumption makes this correlation exactly zero, always and everywhere. This is not a matter of having insufficient data or a poorly specified \(V\). It is a structural feature of the model that no amount of covariate enrichment can change.

To summarize: \(V\) controls levels, \(\varepsilon\) controls substitution. You can improve \(V\) all you want — the substitution pattern remains proportional because \(\varepsilon\) remains i.i.d.


What It Would Take to Fix Substitution

To get realistic substitution patterns, you need to break the independence of the error terms. There are several ways to do this, each corresponding to a different model:

Nested logit introduces correlation within pre-specified groups. You manually assign alternatives to “nests” (e.g., a bus nest and a car nest), and errors are correlated within nests but independent across nests. This lets blue bus and red bus be closer substitutes. The limitation is that you must specify the nesting structure in advance, and it can only capture one dimension of similarity.

Mixed logit (random coefficients) induces correlation through observed characteristics, as discussed extensively in the companion document. By making the coefficient on travel time random, alternatives with similar travel times end up with correlated effective errors. Importantly, this works through the same variables that are in \(V\) — but through the variance of the coefficients, not through the mean. The mean \(\bar{\beta}\) sets levels (just like \(\beta\) in standard logit). The variance \(\Sigma\) generates correlation, and therefore substitution patterns. This is the mechanism that standard logit entirely lacks.

Probit allows a full covariance matrix on the error terms, giving maximum flexibility in substitution. But estimation is computationally demanding because the choice probabilities require evaluating high-dimensional normal integrals.

In all these cases, the fix involves modifying the covariance structure of the unobserved component — something that cannot be accomplished by working on \(V\) alone.


Summary: The Precise Division of Labor

| Component | What It Controls | What It Cannot Do |
|---|---|---|
| \(V_{nj}\) (deterministic) | Market shares: how popular each alternative is | Generate non-proportional substitution; reflect similarity between alternatives |
| \(\varepsilon_{nj}\) (random, i.i.d.) | Substitution pattern: always proportional under logit | Be modified by the analyst; be made correlated by adding variables to \(V\) |
| Random coefficients on \(V\) | Substitution pattern: non-proportional, driven by similarity in characteristic space | Capture similarity along unobserved dimensions not represented in \(V\) |

The core lesson is that in standard logit, the deterministic utility and the substitution pattern are completely decoupled. Getting \(V\) right is necessary for good share predictions but is irrelevant for substitution. Similar \(V\) values between two alternatives tell the model that they are equally popular, not that they are similar. Fixing substitution requires changing the unobserved part of the model, and that is exactly what mixed logit, nested logit, and probit do, each in its own way.

Nested logit

Mixed logit

BLP

Footnotes

  1. Notation I use here is mostly from Chris Conlon’s Grad IO lecture slides.↩︎

  2. This part follows mostly from Conlon's notes and Train's book. We will now consider the multinomial discrete choice setup:↩︎

  3. This was created after doing a back-and-forth self-education session with Claude Code.↩︎