BLP Demystified: From Basics to Brain-Busting Models!

Author

Hyoungchul Kim

Install necessary julia packages for later computation:

# You might need these commented codes to install packages
# using Pkg
# Pkg.add(["DataFrames", "CSV", "GLM", "Statistics", "LinearAlgebra", "Distributions", "NLopt", "FixedEffectModels", "RegressionTables", "DataFramesMeta", "Random", "StatsModels", "Optim", "StatsBase"])

1 BLP

This exercise estimates the demand-side BLP model.

1.1 Motivation

Why do this? Demand estimation is central to the IO literature because measuring market power is one of its core concerns. How do we quantify market power? Usually we use the markup as the measure. But it is hard to calculate markups directly because they depend on firms' cost functions, which are not observed. IO theory, however, shows that we can recover the markup from demand elasticities. Hence the importance of estimating demand.

1.2 Basic: McFadden (1974) style logit model

1.2.1 Model setup

We will first estimate a basic logit model with no unobserved demand shifters and no random coefficients. But let's first talk a bit about the background of this discrete choice model. Note that most of it is from Train (2009).

Even before McFadden (1974), the logit model had a long history of development. But McFadden (1974) provides a complete, well-defined econometric model that is consistent with the utility-maximizing behavior of individuals.

Individual \(i\)'s utility-maximizing behavior is captured by the following indirect utility:

\[ u_{ij} = \underbrace{x_j \beta + \alpha p_j}_{\delta_j} + \varepsilon_{ij} \]

where the mean utility of the outside option is normalized to zero. The idiosyncratic shock is i.i.d. and follows the Type 1 Extreme Value (T1EV) distribution. We assume there are products \(0, \ldots, J\), where \(0\) denotes the outside option. We can think of \(\delta_j\) as the mean utility from product \(j\): in this parameterization, \(\delta_j\) does not depend on \(i\).

Now let's do some math to derive the logit choice probabilities. One benefit of the logit model is that we can get a closed-form solution. We compute the probability that an individual chooses product \(j\) given \(p_j\) and \(x_j\).

\[\begin{align} P (u_{ij} \geq u_{ij'} \ \forall j' \neq j \mid x_j, p_j) &= P (x_j \beta + \alpha p_j + \varepsilon_{ij} \geq x_{j'}\beta + \alpha p_{j'} + \varepsilon_{ij'} \ \forall j' \neq j \mid x_j, p_j) \\ &= P ( \varepsilon_{ij'} \leq \varepsilon_{ij} + \delta_j - \delta_{j'} \ \forall j' \neq j). \end{align}\]

If \(\varepsilon_{ij}\) is taken as given, the last term is a statement about the T1EV cumulative distribution, \(F(\varepsilon) = e^{-e^{-\varepsilon}}\), evaluated at \(\varepsilon_{ij} + \delta_j - \delta_{j'}\). Since the shocks are i.i.d., we can express the last term as the product of the individual cumulative distributions (for brevity, we now denote the conditional logit choice probability by \(P_{ij}\)):

\[ P_{ij} \mid \varepsilon_{ij} = \prod_{j' \neq j} e^{ - e^{-(\varepsilon_{ij} + \delta_j - \delta_{j'})}}. \]

Since \(\varepsilon_{ij}\) is not actually given, we need to integrate over the density of \(\varepsilon_{ij}\):

\[ P_{ij} = \int \left(\prod_{j' \neq j} e^{ - e^{-(\varepsilon_{ij} + \delta_j - \delta_{j'})}} \right) e^{- \varepsilon_{ij}} e^{-e^{-\varepsilon_{ij}}} d \varepsilon_{ij}. \]

Now let's reduce this to a closed-form expression. Carrying out the integral (the algebra is standard; see Train (2009) for the full derivation) gives:

\[ P_{ij} = \frac{e^{\delta_{j}}}{\sum_{j'=0}^{J} e^{\delta_{j'}}} = \frac{e^{\delta_{j}}}{1 + \sum_{j'=1}^{J} e^{\delta_{j'}}}. \]

This could be understood as the predicted share function given the fixed values of the parameters.
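As a quick illustration, here is a minimal Julia sketch of this predicted share function (my own helper, not part of the original text): delta is the vector \((\delta_1, \ldots, \delta_J)\), and the 1 in the denominator is the outside option with \(\delta_0 = 0\).

# Minimal sketch: logit choice probabilities given mean utilities delta = (delta_1, ..., delta_J),
# with the outside option's mean utility normalized to zero (hence the 1 in the denominator)
logit_shares(delta) = exp.(delta) ./ (1 + sum(exp.(delta)))

# Example: three inside products
logit_shares([1.0, 0.5, -0.2])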

Note that this is a very simple model: we are not allowing for any unobserved product demand shifters that could affect the utility gained from the product. In fact, we are assuming that the econometrician fully observes all the variables that construct the mean utility. Thus there is not much econometrics involved. You can get the parameters as follows:

  1. Assuming you have data on market shares, match the observed quantities to \(P_{ij} \cdot M\), where \(M\) is the total market size.

  2. Since matching \(J\) market shares gives \(J\) equations, simple algebra recovers the mean utilities \(\delta_j\).

  3. Then you can run a nonlinear least squares that minimizes the sum of squared differences between observed and predicted shares across all products. This gives the parameters that best fit the data.

1.2.2 Adding unobserved demand shifters

We can add an unobserved variable \(\xi_j\), which can be thought of as a demand shifter for product \(j\). This makes the model flexible enough to handle the realistic situation where the econometrician cannot observe some components affecting the utility from a product. Most of what we did above does not change much. The only issue is the relationship between this unobserved term and the other parameters of interest: if there is endogeneity, we need an IV to estimate the price coefficient. In this section, we will do both cases (OLS and IV).

1.2.3 Computation (Following Berry (1994))

So how can we retrieve the parameters of interest? A naive approach would be nonlinear least squares, minimizing the sum of squared differences between predicted and observed shares across all products. The problem is that this direct approach is infeasible: you would need to know \(\xi_j\), which is unobservable.

This is where Berry (1994) comes in. He introduces a clever two-step estimation process.

Step 1: Inversion

Notation: Let \(\hat{s}_j (\delta)\) be predicted shares and let \(s_j\) be observed shares.1

Then you can use the system of equations from matching actual to predicted shares and invert them to get the mean utility. For this simple case, we can get the following equations:

\[ \delta_j = \log s_j - \log s_0, \quad j = 1, \ldots, J. \]

So this inversion recovers the mean utilities directly from the observed shares; a minimal sketch follows. After that, we move to the second step.
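A minimal Julia sketch of this inversion, reusing the hypothetical logit_shares helper from above (s are the observed inside shares, s0 the outside share):

# Berry (1994) inversion for the plain logit: delta_j = log(s_j) - log(s_0)
invert_logit(s, s0) = log.(s) .- log(s0)

# Sanity check: inverting model-generated shares recovers the mean utilities
delta = [1.0, 0.5, -0.2]
s = logit_shares(delta)
invert_logit(s, 1 - sum(s)) ≈ delta   # true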

Step 2: IV estimation

By definition, \(\delta_j = x_j \beta + \alpha p_j + \xi_j\), so we can run a regression to retrieve the parameters. I wrote IV, but this could be plain OLS if you are willing to assume that the unobserved term is uncorrelated with price.

1.2.4 Coding (with Julia)

Finally we will do some coding to get the result we just talked about.

using FixedEffectModels, DataFrames, CSV, RegressionTables 

# read in the data
otc = CSV.read("data/otc.csv", DataFrame)

# Run regressions
ols1 = reg(otc, @formula(ln_mkt_share_diff ~ price + promotion)) 
ols2 = reg(otc, @formula(ln_mkt_share_diff ~ price + promotion + fe(product)))
ols3 = reg(otc, @formula(ln_mkt_share_diff ~ price + promotion + fe(product) + fe(store)))
iv1 = reg(otc, @formula(ln_mkt_share_diff ~ (price ~ cost) + promotion))
iv2 = reg(otc, @formula(ln_mkt_share_diff ~ (price ~ cost) + promotion + fe(product)))

regtable(ols1, ols2, ols3, iv1, iv2, order = ["price"], drop = ["(Intercept)"], regression_statistics = [FStatIV, Nobs, R2],
  labels = Dict(
    "price" => "Price",
    "promotion" => "Promotion",
    "ln_mkt_share_diff" => "Log Mkt share difference"
  ))
## Some R codes that I followed

# m1 <- lm(ln_mkt_share_diff ~ price + promotion , data = otc)
# m2 <- lm(ln_mkt_share_diff ~ price + promotion + factor(product), data = otc)
# m3 <- lm(ln_mkt_share_diff ~ price + promotion + factor(product) + factor(store), data = otc)
# m4 <- ivreg(ln_mkt_share_diff ~ price + promotion | . - price + cost, data = otc)
# m5 <- ivreg(ln_mkt_share_diff ~ price + promotion + factor(product) | . - price + cost, data = otc)
# stargazer(m1, m2, m3, m4, m5, 
#           omit = c("product", "store"),
#           type = "text")

-----------------------------------------------------------------------------
                                        Log Mkt share difference             
                          ---------------------------------------------------
                              (1)        (2)       (3)         (4)        (5)
-----------------------------------------------------------------------------
Price                       0.020   -0.189**   -0.145*    0.069***      0.169
                          (0.014)    (0.059)   (0.059)     (0.015)    (0.115)
Promotion                   0.121     0.187*   0.201**       0.149   0.308***
                          (0.093)    (0.074)   (0.073)     (0.093)    (0.082)
-----------------------------------------------------------------------------
product Fixed Effects                    Yes       Yes                    Yes
store Fixed Effects                                Yes                       
-----------------------------------------------------------------------------
Estimator                     OLS        OLS       OLS          IV         IV
-----------------------------------------------------------------------------
Controls                      Yes                              Yes           
-----------------------------------------------------------------------------
First-stage F statistic                                  8,147.921    394.113
N                           1,056      1,056     1,056       1,056      1,056
R2                          0.003      0.440     0.456      -0.008      0.420
-----------------------------------------------------------------------------

1.2.5 Caveats

But we don't usually use this basic setup in IO, because the model is a bit too simple to capture reality. One well-known problem is the independence of irrelevant alternatives (IIA): the model does not deliver realistic demand elasticities. If you want to know more about it, google the famous red bus/blue bus story.
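To see the problem concretely, the pure logit model above implies the standard own- and cross-price elasticities

\[ \eta_{jj} \equiv \frac{\partial s_j}{\partial p_j}\frac{p_j}{s_j} = \alpha p_j (1 - s_j), \qquad \eta_{jk} \equiv \frac{\partial s_j}{\partial p_k}\frac{p_k}{s_j} = -\alpha p_k s_k \quad (k \neq j). \]

The cross-price elasticity with respect to product \(k\) is the same for every other product \(j\): a price cut pulls consumers from all other products in proportion to their market shares, no matter how similar the products are. That proportional-substitution property is exactly the red bus/blue bus problem.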

1.2.6 Solutions?

There are ways to alleviate this problem. One of them (which we will not discuss) is the nested logit: we define groups of products such that IIA holds within a group but not necessarily across groups. In the red bus/blue bus case, the two buses would be in the same group.

Another way is to enhance the random utility model into a logit model with random coefficients. In essence, this introduces consumer preference heterogeneity into the model by interacting consumer characteristics with product characteristics. The nuisance is that a closed-form expression for the choice probability is no longer obtainable, so we need numerical computation.

1.3 Advanced: Berry, Levinsohn, and Pakes (1995) (Random coefficients logit model)

We again start with the individual utility function. But now something is added (we will now also explicitly denote markets as \(t\)):

\[ u_{ijt} = x_{jt} \beta_{it} + \alpha p_{jt} + \xi_{jt} + \varepsilon_{ijt} \]

The difference is that the slope coefficients can now vary across individuals \(i\). For now, we assume \(\beta_{it}^k = \beta_0^k + \sigma_{k} v_{it}^k\), where \(k\) indexes the \(K\) product characteristics, \(\beta_0^k\) is the fixed taste for characteristic \(k\), and \(v_{it}^k\) is a random taste that follows a standard normal distribution.

Now we can expand the model:

\[\begin{align} u_{ijt} &= (x_{j1t}, \ldots, x_{jKt}) \cdot (\beta_{0}^1 + \sigma_1 v_{it}^1, \ldots, \beta_{0}^K + \sigma_K v_{it}^K)^T + \alpha p_{jt} + \xi_{jt} + \varepsilon_{ijt}\\ &= x_{jt}\beta_{0} + \sum_k x_{jkt} \sigma_{k}v_{it}^k + \alpha p_{jt} + \xi_{jt} + \varepsilon_{ijt}\\ &= \underbrace{x_{jt}\beta_{0} + \alpha p_{jt} + \xi_{jt}}_{\delta_{jt}} + \underbrace{\sum_k x_{jkt} \sigma_{k}v_{it}^k}_{\mu_{ijt}} + \varepsilon_{ijt}. \end{align}\]

We can easily see that this is just an extension of the basic random utility model: indirect utility is made up of the mean utility \(\delta_{jt}\), the individual-specific deviation \(\mu_{ijt}\), and the idiosyncratic shock \(\varepsilon_{ijt}\).

Now we make a simplification: we assume there is a single product characteristic, \(K = 1\). Using this and the T1EV assumption on the idiosyncratic shock, the aggregate share is

\[ s_{jt} = \int \frac{\exp(\delta_{jt} + x_{jt} \sigma_t v_{it})}{1 + \sum_{j'} \exp(\delta_{j't} + x_{j't} \sigma_t v_{it})} f(v_i)dv_i \]

The integral has no analytical solution in the random coefficient model, so we need to compute the integral by simulation. One way to do it is as follows:

\[ \hat{s}_{jt} = \frac{1}{ns} \sum_{i=1}^{ns} \frac{\exp(\delta_{jt} + x_{jt} \sigma_t v_{it})}{1 + \sum_{j'} \exp(\delta_{j't} + x_{j't} \sigma_t v_{it})} \]

where \(ns\) is number of random draws from \(v_i\).

Now we can see that the inversion we did before is no longer straightforward to implement, because there are additional parameters (the \(\sigma\)'s) whose values we do not know.

So in BLP, we need a nested estimation algorithm.

  1. In the outer loop, we iterate over different values of the parameters.

  2. In the inner loop, for a given parameter value, we do the inversion to get the mean utilities and evaluate the GMM objective function.

  3. We keep on doing this iteration until we get the parameters that minimize the GMM function.

Now let’s do some coding!

1.3.1 Another coding (with Julia)

This portion of the code is from here

###############################################################################
####  BLP fixed-point algorithm, inverting mkt shares to get mean utility  ####
###############################################################################
using CSV
using DataFrames
using GLM
using Statistics
using LinearAlgebra
using Distributions
using NLopt
otc = CSV.read("data/otc.csv", DataFrame)

ns = 500;
nmkt = maximum(otc.mkt);
mkt = unique(otc.mkt);
nprod = maximum(otc.product);

vi = quantile.(Normal(), collect(range(0.5/ns, step = 1/ns, length = ns)));
sigma = 1;

function calc_mkt_share_t(delta_t, sigma_t, x_t, vi_t)
    # Dimension: delta_t 11*1, sigma_t 1*1, x_t 11*1
    delta_t = delta_t .* ones(nprod, ns)
    mu_t = x_t*sigma_t*vi_t'
    numerator = exp.(delta_t .+ mu_t)
    denominator = ones(nprod, ns) .+ sum(numerator, dims = 1)
    mkt_share_t = mean(numerator./denominator, dims = 2)
end

function contraction_t(d0, sigma_t, x_t, vi_t, mkt_t, tol = 1e-5, maxiter = 1e5)
    obs_mkt_share_t = mkt_t.mkt_share
    d_old = d0
    normdiff = Inf
    iter = 0
    while normdiff > tol && iter <= maxiter
        model_mkt_share_t = calc_mkt_share_t(d_old, sigma_t, x_t, vi_t)
        d_new = d_old .+ log.(obs_mkt_share_t) .- log.(model_mkt_share_t)
        normdiff = maximum(norm.(d_new .- d_old))
        d_old = d_new
        iter += 1
    end
    return d_old
end

function calc_delta(sigma)
    delta_fp = zeros(nprod, nmkt);
    for t in mkt
        mkt_t = otc[otc.mkt .== t, :];
        x_t = ones(nprod, 1);
        delta_t = zeros(nprod, 1);
        sigma_t = sigma;
        vi_t = vi;
        delta_fp[:, t] = contraction_t(delta_t, sigma_t, x_t, vi_t, mkt_t);
    end
    return vec(delta_fp);
end

@time delta_fp = calc_delta(sigma);
mean(delta_fp)
std(delta_fp)
  1.348311 seconds (3.75 M allocations: 365.638 MiB, 3.21% gc time, 96.17% compilation time)
0.8537059827877238
################################################################
#### Estimate beta and sigma using GMM (cost as instrument) ####
################################################################
X = hcat(ones(nprod*nmkt, 1),
         otc.price, otc.promotion,
         otc.product_2, otc.product_3, otc.product_4, otc.product_5,
         otc.product_6, otc.product_7, otc.product_8, otc.product_9,
         otc.product_10, otc.product_11);
z = hcat(X, otc.cost);
Phi = z'*z/1056;
inv_Phi = inv(Phi);

function GMMObjFunc(theta2::Vector, grad::Vector)
    sigma = theta2[1]
    delta = calc_delta(sigma)
    theta1 = inv(X'*z*inv_Phi*z'*X)*X'*z*inv_Phi*z'*delta
    error = delta - X*theta1
    obj = error'*z*inv_Phi*z'*error
    return obj
end

opt = Opt(:LN_COBYLA, 1)
opt.xtol_rel = 1e-4
opt.lower_bounds = [0.00001]
opt.min_objective = GMMObjFunc
@time (minf,minx,ret) = optimize(opt, [1])

@show sigma = minx[1]
delta = calc_delta(sigma[1]);
theta1 = inv(X'*z*inv_Phi*z'*X)*X'*z*inv_Phi*z'*delta
 18.817741 seconds (20.18 M allocations: 109.887 GiB, 14.52% gc time, 0.73% compilation time)
sigma = minx[1] = 28.720209709193362
13-element Vector{Float64}:
 -71.52085329511755
  -1.0852292863458644
   0.22206483339059624
   1.903952993172239
   3.990433860485858
  -0.6194995526125144
   1.2498418962356002
   3.9120756908464607
  -2.12500554825884
  -1.1737916667530266
   0.2462099284926299
  -2.7352849052497947
   0.011924728064080292

1.3.2 My personal coding (again Julia)

I will try to write my own code in Julia, following the exercises in the Mixtape session on Demand Estimation. Note that I am simply copying the exercise questions from that site below.

2 Exercise 1

2.1 Introduction

This is the first of three exercises that will give you a solid foundation for doing BLP-style estimation. The running example is the same as in lecture: what if we halved an important product’s price?

2.2 Setup

Most of the computational heavy-lifting in these exercises will be done by the open-source Python package PyBLP. It is easiest to use PyBLP in Python, and the hints/solutions for the exercises will be given in Python. But for those who are more familiar with R, it is straightforward to call PyBLP from R with the reticulate package. It is technically possible to call PyBLP from other languages like Julia and MATLAB, but most users either use Python or R.

You should install PyBLP on top of the Anaconda Distribution. Anaconda comes pre-packaged with all of PyBLP’s dependencies and many more Python packages that are useful for statistical computing. Steps:

  1. Install Anaconda if you haven’t already. You may wish to create a new environment for just these exercises, but this isn’t strictly necessary.
  2. Install PyBLP. On the Anaconda command line, you can run the command pip install pyblp.

If you’re using Python, you have two broad options for how to do the coding exercises.

  • Use a Jupyter Notebook. The solutions to each exercise will be in a notebook. In general, notebooks are a good way to weave text and code for short exercises, and to distribute quick snippets of code with others.
  • Use an integrated development environment (IDE). Once you get beyond a few hundred lines of code, I strongly recommend using an IDE and not notebooks. For Python, I recommend VS Code or PyCharm. The former is free and the latter has a free community edition with all the features you’ll need for standard Python development. Both integrate well with Anaconda.

If using a notebook, you can right click and save the following notebook template: notebook.ipynb. If using an IDE, you can right click and save the following script template: script.py. Both import various packages used throughout the exercise.

#| eval: false
 
import pyblp
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

The notebook additionally configures these packages to reduce the amount of information printed to the screen.

#| eval: false
 
pyblp.options.digits = 3
pyblp.options.verbose = False
pd.options.display.precision = 3
pd.options.display.max_columns = 50

import IPython.display
IPython.display.display(IPython.display.HTML('<style>pre { white-space: pre !important; }</style>'))

Finally, both show how to load the data that we’ll be using today.

2.3 Data

Today, you’ll use the products.csv dataset, which is a simplified version of Nevo’s (2000) fake cereal data with less information and fewer derived columns. The data were motivated by real grocery store scanner data, but due to the proprietary nature of this type of data, the provided data are not entirely real. This dataset has been used as a standard example in much of the literature on BLP estimation.

Compared to typical datasets you might use in your own work, the number of observations in this example dataset is quite small. This helps with making these exercises run very fast, but in practice one would want more data than just a couple thousand data points to estimate a flexible model of demand. Typical datasets will also include many more product characteristics. This one only includes a couple to keep the length of the exercises manageable.

The data contains information about 24 breakfast cereals across 94 markets. Each row is a product-market pair. Each market has the same set of breakfast cereals, although with different prices and quantities. The columns in the data are as follows.

  • market (String): The city-quarter pair that defines markets \(t\) used in these exercises. The data were motivated by real cereal purchase data across 47 US cities in the first 2 quarters of 1988.
  • product (String): The firm-brand pair that defines products \(j\) used in these exercises. Each of 5 firms produces between 1 and 9 brands of cereal.
  • mushy (Binary): A dummy product characteristic equal to one if the product gets soggy in milk.
  • servings_sold (Float): Total quantity \(q_{jt}\) of servings of the product sold in a market, which will be used to compute market shares.
  • city_population (Float): Total population of the city, which will be used to define a market size.
  • price_per_serving (Float): The product's price \(p_{jt}\) used in these exercises.
  • price_instrument (Float): An instrument to handle price endogeneity in these exercises. Think of it as a cost-shifter, a Hausman instrument, or any other valid IV that we discussed in class.

Throughout the exercises, we use these data to estimate an increasingly flexible BLP-style model of demand for cereal. We will use predictions from this model to see how our running example, cutting the price of one cereal, affects demand for that cereal and for its substitutes.

3 Exercise 1

3.1 1. Describe the data

You can download products.csv from this link. To load it, you can use pd.read_csv. To look at a random sample of its rows, you can use .sample. To compute summary statistics for different columns, you can use .describe. Throughout these exercises, you’ll be given links to functions and methods that can be used to answer the questions. If you’re unsure about how to use them, you should click on the link, where there is typically example code lower down on the page.


using CSV, DataFrames, DataFramesMeta, StatsModels, LinearAlgebra, Random, Statistics, Distributions

df = CSV.read("data/raw/products.csv", DataFrame) 

describe(df)
7×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Union… Any Int64 DataType
1 market C01Q1 C65Q2 0 String7
2 product F1B04 F6B18 0 String7
3 mushy 0.333333 0 0.0 1 0 Int64
4 servings_sold 1.20296e6 4085.41 3.62285e5 9.85623e7 0 Float64
5 city_population 641995.0 173072 332943.0 7322564 0 Int64
6 price_per_serving 0.12574 0.0454871 0.123829 0.225728 0 Float64
7 price_instrument 0.0908116 -0.00265638 0.0890101 0.188043 0 Float64

3.2 2. Compute market shares

To transform observed quantities \(q_{jt}\) into market shares \(s_{jt} = q_{jt} / M_t\), we first need to define a market size \(M_t\). We’ll assume that the potential number of servings sold in a market is the city’s total population multiplied by 90 days in the quarter. Create a new column called market_size equal to city_population times 90. Note that this assumption is somewhat reasonable but also somewhat arbitrary. Perhaps a sizable portion of the population in a city would never even consider purchasing cereal. Or perhaps those who do tend to want more than one serving per day. In the third exercise, we’ll think more about how to discipline our market size assumption with data.

Next, compute a new column market_share equal to servings_sold divided by market_size. This gives our market shares \(s_{jt}\). We’ll also need the outside share \(s_{0t} = 1 - \sum_{j \in J_t} s_{jt}\). Create a new column outside_share equal to this expression. You can use .groupby to group by market and .transform('sum') to compute the within-market sum of inside shares. Compute summary statistics for your inside and outside shares. If you computed market shares correctly, the smallest outside share should be \(s_{0t} \approx 0.305\) and the largest should be \(s_{0t} \approx 0.815\).


@transform! df :market_size = :city_population .* 90

@transform! df :market_share = :servings_sold ./ :market_size

@chain df begin
  groupby(:market)
  @transform! :outside_share = 1 .- sum(:market_share)
end

describe(df.outside_share)
Summary Stats:
Length:         2256
Missing Count:  0
Mean:           0.524197
Std. Deviation: 0.109622
Minimum:        0.304575
1st Quartile:   0.439481
Median:         0.535808
3rd Quartile:   0.604251
Maximum:        0.815168
Type:           Float64

3.3 3. Estimate the pure logit model with OLS

Recall the pure logit estimating equation: \(\log(s_{jt} / s_{0t}) = \delta_{jt} = \alpha p_{jt} + x_{jt}' \beta + \xi_{jt}\). First, create a new column logit_delta equal to the left-hand side of this expression. You can use np.log to compute the log.


@transform! df :logit_delta = log.(:market_share) .- log.(:outside_share)

first(df, 5)
5×11 DataFrame
Row market product mushy servings_sold city_population price_per_serving price_instrument market_size market_share outside_share logit_delta
String7 String7 Int64 Float64 Int64 Float64 Float64 Int64 Float64 Float64 Float64
1 C01Q1 F1B04 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029
2 C01Q1 F1B06 1 362850.0 516259 0.114178 0.0725791 46463310 0.00780939 0.555225 -4.26405
3 C01Q1 F1B07 1 603768.0 516259 0.132391 0.101842 46463310 0.0129945 0.555225 -3.75485
4 C01Q1 F1B09 0 2.68092e5 516259 0.130344 0.104332 46463310 0.00576996 0.555225 -4.56671
5 C01Q1 F1B11 0 8.3328e5 516259 0.154823 0.121111 46463310 0.0179341 0.555225 -3.43267

Then, use the package of your choice to run an OLS regression of logit_delta on a constant, mushy, and price_per_serving. There are many packages for running OLS regressions in Python. One option is to use the formula interface for statsmodels. To use robust standard errors, you can specify cov_type='HC0' in OLS.fit.

Interpret your estimates. Your coefficient on price_per_serving should be around -7.48. In particular, can you re-express your estimate on mushy in terms of how much consumers are willing to pay for mushy, using your estimated price coefficient?


using FixedEffectModels, RegressionTables

ols = reg(df, @formula(logit_delta ~ mushy + price_per_serving), Vcov.robust())

regtable(ols, order = ["price_per_serving"], drop = ["(Intercept)"], regression_statistics = [Nobs, R2])

-------------------------------
                    logit_delta
-------------------------------
price_per_serving     -7.480***
                        (0.840)
mushy                     0.075
                        (0.054)
-------------------------------
Controls                    Yes
-------------------------------
N                         2,256
R2                        0.034
-------------------------------

3.4 4. Run the same regression with PyBLP

For the rest of the exercises, we'll use PyBLP to do our demand estimation. This isn't necessary for estimating the pure logit model, which can be done with linear regressions, but using PyBLP allows us to easily run our price-cut counterfactual and to make the model more flexible in subsequent days' exercises.

PyBLP requires that some key columns have specific names. You can use .rename to rename the following columns so that they can be understood by PyBLP.

  • market –> market_ids
  • product –> product_ids
  • market_share –> shares
  • price_per_serving –> prices

By default, PyBLP treats prices as endogenous, so it won't include them in its matrix of instruments. But the "instruments" for running an OLS regression are the same as the full set of regressors. So when running an OLS regression and not accounting for price endogeneity, we'll "instrument" for prices with prices themselves. We can do this by creating a new column demand_instruments0 equal to prices. PyBLP will recognize all columns that start with demand_instruments and end with 0, 1, 2, etc., as "excluded" instruments to be stacked with the exogenous characteristics to create the full set of instruments.

With the correct columns in hand, we can initialize our pyblp.Problem. To specify the same R-style formula for our regressors, use pyblp.Formulation. The full code should look like the following.

#| eval: false
 
ols_problem = pyblp.Problem(pyblp.Formulation('1 + mushy + prices'), product_data)

If you print(ols_problem), you’ll get information about the configured problem. There should be 94 markets (T), 2256 observations (N), 3 product characteristics (K1), and 3 total instruments (MD). You can verify that these instruments are simply the regressors by looking at ols_problem.products.X1 and ols_problem.products.ZD, comparing these with mushy and prices in your dataframe. For the full set of notation used by PyBLP, which is very close to the notation used in the lectures, see this page.

To estimate the configured problem, use .solve. Use method='1s' to just do 1-step GMM instead of the default 2-step GMM. In this case, this will just run a simple linear OLS regression. The full code should look like the following.

#| eval: false
 
ols_results = ols_problem.solve(method='1s')

Again, if you print(ols_results), you'll get estimates from the logit model. Make sure that your estimates are the same as those you got from your OLS regression. If you used 'HC0' standard errors as suggested above, your standard errors should also be the same.

3.5 5. Add market and product fixed effects

Since we expect price \(p_{jt}\) to be correlated with unobserved product quality \(\xi_{jt}\), we should be worried that our estimated \(\hat{\alpha}\) on price is biased. Since we have multiple observations per market and product, and prices vary both across and within markets, it is feasible for us to add both market and product fixed effects. If \(\xi_{jt} = \xi_j + \xi_t + \Delta\xi_{jt}\) and most of the correlation between \(p_{jt}\) and \(\xi_{jt}\) is due to correlation between \(p_{jt}\) and either \(\xi_j\) (product fixed effects) or \(\xi_t\) (market fixed effects), then explicitly accounting for these fixed effects during estimation should help reduce the bias of our \(\hat{\alpha}\).

The simplest way to add fixed effects is as dummy variables. We won’t do this today, but for your own reference, you could do this either by making a separate column for each possible market and product fixed effects and adding these to your formulation, or you could use the shorthand mushy + prices + C(market_ids) + C(product_ids). See pyblp.Formulation for different shorthands you can use. Since there are only 24 products and 94 markets for a total of 118 fixed effects, this approach is actually feasible in this case. But in a more realistic dataset with hundreds or thousands of products and markets, running an OLS regression with this many dummy variables starts to become computationally infeasible.

The alternative, which we’ll do today, is to “absorb” the fixed effects. For a single fixed effect, we could just de-mean our outcome variable and each of our regressors within the fixed effects levels, and then run our regression. For multiple fixed effects, we need to iteratively de-mean. PyBLP does this automatically if you specify absorb='C(market_ids) + C(product_ids)' in your formulation instead of adding these as dummy variables.
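Since I am working in Julia rather than PyBLP, here is a minimal sketch of what this absorption does, written against the Julia dataframe df above (columns :market and :product); the helper name demean_two_way is mine. The fe() terms in FixedEffectModels, used below, do the same thing much more efficiently.

using Statistics  # already loaded above

# Two-way fixed-effect absorption by iterative de-meaning: alternately subtract
# group means by market and by product until the variable stops changing.
function demean_two_way(v::AbstractVector, market, product; tol = 1e-10, maxiter = 1000)
    x = float.(v)
    for _ in 1:maxiter
        x_old = copy(x)
        for g in (market, product)
            for id in unique(g)
                idx = findall(==(id), g)
                x[idx] .-= mean(x[idx])
            end
        end
        maximum(abs.(x .- x_old)) < tol && break
    end
    return x
end

# Regressing the de-meaned outcome on the de-meaned price reproduces (up to numerical
# error) the coefficient from reg(df, @formula(... + fe(market) + fe(product))).
y_dm = demean_two_way(df.logit_delta, df.market, df.product)
p_dm = demean_two_way(df.price_per_serving, df.market, df.product)
alpha_within = (p_dm' * y_dm) / (p_dm' * p_dm)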

Since mushy is always either 1 or 0 for the same product across different markets, it’s collinear with product fixed effects, and you can drop it from your formula. Similarly, you can drop the constant. After dropping these, re-create your problem with absorbed fixed effects and re-solve it. Compare the new \(\hat{\alpha}\) estimate with the last one. You should now be getting a coefficient on price of around -28.6. Does its change suggest that price was positively or negatively correlated with unobserved product-level/market-level quality?


ols2 = reg(df, @formula(logit_delta ~ price_per_serving + fe(market) + fe(product)), Vcov.robust())

regtable(ols2, order = ["price_per_serving"], drop = ["(Intercept)"], regression_statistics = [Nobs, R2])

-----------------------------------
                        logit_delta
-----------------------------------
price_per_serving        -28.618***
                            (0.916)
-----------------------------------
market Fixed Effects            Yes
product Fixed Effects           Yes
-----------------------------------
N                             2,256
R2                            0.560
-----------------------------------

3.6 6. Add an instrument for price

Adding market and product fixed effects can be helpful, but since unobserved quality typically varies by both product and market, we really want to instrument for prices. The data comes with a column price_instrument that we should interpret as a valid instrument for price that satisfies the needed exclusion restriction. It could be a cost-shifter, a valid Hausman instrument, or similar.

Before using it, we should first run a first-stage regression to make sure that it’s a relevant instrument for price. To do so, use the same package you used above to run an OLS regression to run a second OLS regression of prices on price_instrument and your market and product fixed effects. If using the formula interface for statsmodels, you can use the same fixed effect shorthand as in PyBLP, with your full formula looking like prices ~ price_instrument + C(market_ids) + C(product_ids). Does price_instrument seem like a relevant instrument for prices?


# first stage
iv0 = reg(df, @formula(price_per_serving ~ price_instrument + fe(market) + fe(product)), Vcov.robust())

regtable(iv0, order = ["price_instrument"], drop = ["(Intercept)"], regression_statistics = [FStatIV, Nobs, R2])

-------------------------------------------
                          price_per_serving
-------------------------------------------
price_instrument                   0.877***
                                    (0.007)
-------------------------------------------
market Fixed Effects                    Yes
product Fixed Effects                   Yes
-------------------------------------------
First-stage F statistic                    
N                                     2,256
R2                                    0.964
-------------------------------------------

Now that we’ve checked relevance, we can set our demand_instruments0 column equal to price_instrument, re-create the problem, and re-solve it. You should get a new coefficient on price of around -30.6. Does the change in \(\hat{\alpha}\) suggest that price was positively or negatively correlated with \(\Delta\xi_{jt}\) in \(\xi_{jt} = \xi_j + \xi_t + \Delta\xi_{jt}\)?


# IV (2SLS) estimation of the pure logit equation
iv1 = reg(df, @formula(logit_delta ~ (price_per_serving ~ price_instrument) + fe(market) + fe(product)), Vcov.robust())

regtable(iv1, order = ["price_per_serving"], drop = ["(Intercept)"], regression_statistics = [FStatIV, Nobs, R2])

-------------------------------------
                          logit_delta
-------------------------------------
price_per_serving          -30.600***
                              (0.994)
-------------------------------------
market Fixed Effects              Yes
product Fixed Effects             Yes
-------------------------------------
First-stage F statistic    16,814.451
N                               2,256
R2                              0.559
-------------------------------------

3.7 7. Cut a price in half and see what happens

Now that we have our pure logit model estimated, we can run our counterfactual of interest: what if we halved an important product's price? We'll select a single market, the most recent quarter in the first city: C01Q2. Create a new dataframe called counterfactual_data by selecting data for just that market and inspect the data. We'll pretend that we're firm one, deciding whether to cut the price of our brand four's product, F1B04. In particular, we might be worried about cannibalization, i.e., how much this price cut will lead consumers of our other 8 brands of cereal in this market to substitute from their old choice to the new, cheaper cereal. Alternatively, we could be a regulator or academic interested in how taxing that product would affect demand in the market.

In your new dataframe with just data from C01Q2, create a new_prices column that is the same as prices but with the price of F1B04 cut in half. To do this, you could use DataFrame.loc. Then, use .compute_shares on your results from the last question, passing market_id='C01Q2' to only compute new market shares for our market of interest, and passing prices=counterfactual_data['new_prices'] to specify that prices should be set to the new prices. This function will re-compute market shares at the changed prices implied by the model’s estimates. Store them in a new_shares column.

Compute the percent change in shares for each product in the market. From firm one's perspective, do the estimates of cannibalization make sense? That is, do the signs of the percent changes for product F1B04 and for the other products make sense? Would you normally expect percent changes for other products to differ depending on how similar those products are to the one whose price is being changed?


counterfactual_data = @rsubset df :market == "C01Q2" 

@rtransform! df :new_prices = ifelse(:product == "F1B04", :price_per_serving / 2, :price_per_serving)

first(counterfactual_data, 5)
5×11 DataFrame
Row market product mushy servings_sold city_population price_per_serving price_instrument market_size market_share outside_share logit_delta
String7 String7 Int64 Float64 Int64 Float64 Float64 Int64 Float64 Float64 Float64
1 C01Q2 F1B04 1 2.99352e5 516259 0.0777177 0.0441138 46463310 0.00644276 0.502744 -4.35712
2 C01Q2 F1B06 1 6.56443e6 516259 0.141041 0.101043 46463310 0.141282 0.502744 -1.26932
3 C01Q2 F1B07 1 4.08387e6 516259 0.0726827 0.0164402 46463310 0.0878946 0.502744 -1.74394
4 C01Q2 F1B09 0 3.07654e5 516259 0.0769744 0.0498315 46463310 0.00662145 0.502744 -4.32977
5 C01Q2 F1B11 0 2.52179e6 516259 0.167009 0.132775 46463310 0.0542749 0.502744 -2.22602
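The exercise uses PyBLP's .compute_shares for this step. Since I am in Julia, below is a minimal sketch of the same pure logit prediction done by hand: shift each product's mean utility by \(\hat{\alpha}\) times its price change and recompute the logit shares. Note that the new_prices column above was added to the full df rather than to counterfactual_data, so the sketch rebuilds it for C01Q2 only; alpha_hat comes from the IV regression iv1 (about -30.6).

using DataFramesMeta, StatsBase  # both already loaded above; StatsBase provides coef

# Sketch: pure logit counterfactual shares for C01Q2 after halving F1B04's price
alpha_hat = coef(iv1)[1]   # assumes price_per_serving is the only (first) coefficient

cf = @rsubset df :market == "C01Q2"
@rtransform! cf :new_prices = ifelse(:product == "F1B04", :price_per_serving / 2, :price_per_serving)

# New mean utilities: old inversion delta plus alpha_hat times the price change
@rtransform! cf :new_delta = :logit_delta + alpha_hat * (:new_prices - :price_per_serving)

# Recompute logit shares (the outside option's mean utility stays at zero)
denom = 1 + sum(exp.(cf.new_delta))
@transform! cf :new_shares = exp.(:new_delta) ./ denom
@rtransform! cf :pct_change = 100 * (:new_shares / :market_share - 1)

sort(select(cf, :product, :market_share, :new_shares, :pct_change), :pct_change, rev = true)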

3.8 8. Compute demand elasticities

To better understand what’s going on, use .compute_elasticities, again specifying market_id='C01Q2', to compute price elasticities for our market of interest. These measure what the model predicts will happen to demand in percentage terms when there’s a 1% change in price of a product. The diagonal elements are own-price elasticities and the off-diagonal elements are cross-price elasticities. Does demand seem very elastic? Do the cross-price elasticities seem particularly reasonable?
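Since I am not using PyBLP here, below is a minimal Julia sketch of the pure logit elasticity matrix for C01Q2, built from the closed-form logit derivatives (own: \(\alpha p_j (1 - s_j)\); cross: \(-\alpha p_k s_k\)) and the alpha_hat from the previous sketch.

using LinearAlgebra  # already loaded above

# Sketch: pure logit price elasticity matrix for market C01Q2.
# Entry (j, k) is the elasticity of product j's share with respect to product k's price.
m = @rsubset df :market == "C01Q2"
s = m.market_share
p = m.price_per_serving
J = length(s)

elasticities = [j == k ? alpha_hat * p[j] * (1 - s[j]) : -alpha_hat * p[k] * s[k]
                for j in 1:J, k in 1:J]

# Own-price elasticities are on the diagonal; every off-diagonal entry within a column is
# identical, which is the IIA property of the pure logit showing up again.
diag(elasticities)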

3.9 Supplemental Questions

These questions will not be directly covered in lecture, but will be useful to think about when doing BLP-style estimation in your own work.

3.10 1. Try different standard errors

By default, PyBLP computed standard errors that are robust to heteroskedasticity. But we may be concerned that unobserved quality \(\xi_{jt}\) is systematically correlated across markets for a given product \(j\), or across products for a given market \(t\). Choose which one you think is more likely and try clustering your standard errors by that dimension. You can do this with se_type='clustered' in .solve, for which you’ll need a clustering_ids column in your product data. See how your standard error for \(\alpha\) changes.
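In the Julia workflow above, the analogous change with FixedEffectModels is to swap Vcov.robust() for Vcov.cluster(...). For example, clustering by product:

# Clustered standard errors by product (clustering by market works the same way)
iv1_cl = reg(df, @formula(logit_delta ~ (price_per_serving ~ price_instrument) + fe(market) + fe(product)),
             Vcov.cluster(:product))

regtable(iv1, iv1_cl, order = ["price_per_serving"], regression_statistics = [Nobs, R2])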

3.11 2. Compute confidence intervals for your counterfactual

Your estimate of \(\hat{\alpha}\) comes with a standard error, but your counterfactual demand predictions don’t. Ideally we’d like to not only have a point estimate for a counterfactual prediction, but also a measure (up to model misspecification) of how confident we are in these predictions. The easiest way to do this is with a “parametric bootstrap.” The intuition is we can draw from the estimated asymptotic distribution of our \(\hat{\alpha}\), and for each draw, re-compute demand, and see how demand responds to the same price cut.

You can do a parametric bootstrap with the .bootstrap method. Start with just a few draws (e.g., draws=100) and remember to set your seed so that you get the same draws every time you run the code. When new parameters are drawn, you get new .bootstrapped_shares, which take the place of your old shares. You can use the same .compute_shares method on the BootstrappedResults class, although you'll have to pass a prices argument with prices replicated along a new axis by as many draws as you have.

Once you have some bootstrapped shares, compute the same percent changes, and compute the 2.5th and 97.5th percentiles of these changes for each product. Are these 95% confidence intervals for your predictions particularly wide?
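Without PyBLP's .bootstrap, a hand-rolled parametric bootstrap for the pure logit model is easy: draw \(\alpha\) from its estimated asymptotic distribution, recompute the counterfactual shares from the earlier sketch for each draw, and take percentiles. A minimal sketch, reusing cf and alpha_hat from the price-cut sketch above:

using Random, Distributions, StatsBase

Random.seed!(42)

se_alpha = sqrt(vcov(iv1)[1, 1])                 # standard error of the price coefficient
draws = rand(Normal(alpha_hat, se_alpha), 100)

# For each draw of alpha, recompute the counterfactual percent change in shares in C01Q2
pct_changes = map(draws) do a
    new_delta = cf.logit_delta .+ a .* (cf.new_prices .- cf.price_per_serving)
    new_shares = exp.(new_delta) ./ (1 + sum(exp.(new_delta)))
    100 .* (new_shares ./ cf.market_share .- 1)
end

# 2.5th and 97.5th percentiles of the percent change, product by product
pct_mat = reduce(hcat, pct_changes)              # one column per bootstrap draw
ci = [quantile(pct_mat[j, :], [0.025, 0.975]) for j in 1:size(pct_mat, 1)]
DataFrame(product = cf.product, lower = first.(ci), upper = last.(ci))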

3.12 3. Impute marginal costs from pricing optimality

The canonical supply side of the BLP model assumes firms set prices in static Bertrand-Nash equilibrium. See this section for a quick summary using PyBLP notation. Given an estimated demand model and such assumptions about pricing, we can impute marginal costs \(c_{jt}\).

To do so, you first need to tell PyBLP what firms own what products. Create a new firm_ids column in your data, re-initialize your problem, and re-solve it. Then, you should be able to run the .compute_costs method to impute firms’ marginal cost of producing each cereal. Do these marginal costs look particularly reasonable? How might limitations of your demand model and supply model bias them? What would they and observed prices imply about firms’ markups and economic profits?
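With the pure logit estimates we can also do this by hand. Under multi-product Bertrand-Nash pricing, prices satisfy \(p - c = \Delta^{-1} s\), where \(\Delta_{jk} = -\partial s_k / \partial p_j\) if products \(j\) and \(k\) are sold by the same firm and zero otherwise. A minimal Julia sketch for C01Q2, reusing m, s, p, J, and alpha_hat from the elasticity sketch, and assuming (my reading of the firm-brand product codes) that the firm is the first two characters of the product id, e.g. "F1" in "F1B04":

# Sketch: impute marginal costs in C01Q2 from Bertrand-Nash pricing + pure logit demand
firm = first.(m.product, 2)        # assumed firm identifier parsed from the product code

# Pure logit share derivatives: ds_j/dp_j = alpha*s_j*(1-s_j), ds_j/dp_k = -alpha*s_j*s_k
dsdp = [j == k ? alpha_hat * s[j] * (1 - s[j]) : -alpha_hat * s[j] * s[k]
        for j in 1:J, k in 1:J]

# Ownership matrix: 1 if two products are sold by the same firm
own = [firm[j] == firm[k] ? 1.0 : 0.0 for j in 1:J, k in 1:J]

# First-order conditions: s = Delta * (p - c) with Delta_{jk} = -own_{jk} * ds_k/dp_j
Delta = -own .* dsdp'
costs = p .- Delta \ s

DataFrame(product = m.product, price = p, marginal_cost = costs, markup = p .- costs)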

3.13 4. Check your code by simulating data

Even experienced software developers make a lot of mistakes when writing code. Writing “unit tests” or “integration tests” that check whether the code you’ve written seems to be working properly is incredibly important when writing complicated code to estimate demand. Perhaps the most useful test you can write when doing demand estimation (or most other types of structural estimation) is the following.

  1. Simulate fake data under some true parameters.
  2. Estimate your model on the simulated data and make sure that you can recover the true parameters, up to sampling error.

If you do these steps many times, the resulting Monte Carlo experiment will also give you a good sense for the finite sample statistical properties of your estimator.

PyBLP’s Simulation class makes simulating data fairly straightforward. Its interface is similar to Problem, but you also specify your parameter estimates and structural errors. In addition to checking your code, you can also use this class for more complicated counterfactuals. After initializing your simulation, you can use .replace_endogenous to have PyBLP replace the prices \(p_{jt}\) and market shares \(s_{jt}\) with those that rationalize the chosen true parameters and other parts of the simulation. It does so by solving the pricing equilibrium. You’ll have to pass your imputed marginal costs via the costs argument.

Initialize a simulation of the pure logit model with the same product_data and the same estimated xi but with an \(\alpha\) somewhat different than the one you estimated. Make sure your chosen \(\alpha\) is negative, otherwise demand will be upward sloping and PyBLP will have trouble solving for equilibrium prices. To the estimated .xi you can add the estimated fixed effects .xi_fe, since the simulation class does not support fixed effects absorption.

Have PyBLP solve for prices and market shares, and use the resulting data to re-estimate your pure logit regression. See if you can get an estimated \(\hat{\alpha}\) close to the true one.
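PyBLP's Simulation class handles this for the full model; for the pure logit piece, here is a minimal self-contained Julia Monte Carlo sketch of the same idea (all parameter values and the data-generating process below are illustrative choices of mine): simulate markets with a known \(\alpha\), a cost shifter, and prices correlated with \(\xi\), then check that the IV logit regression recovers \(\alpha\).

using DataFrames, Distributions, Random, FixedEffectModels, RegressionTables

Random.seed!(1234)

# Illustrative true parameters and dimensions
alpha_true, beta_true = -3.0, 1.0
T, J = 200, 10

sim = DataFrame(market = repeat(1:T, inner = J), product = repeat(1:J, outer = T))
sim.x = rand(nrow(sim))                                            # exogenous characteristic
sim.xi = randn(nrow(sim))                                          # unobserved quality
sim.cost = 0.5 .+ rand(nrow(sim))                                  # excluded cost shifter
sim.price = sim.cost .+ 0.3 .* sim.xi .+ 0.1 .* randn(nrow(sim))   # price correlated with xi

# Pure logit shares implied by the true parameters, market by market
sim.delta = beta_true .* sim.x .+ alpha_true .* sim.price .+ sim.xi
sim = transform(groupby(sim, :market), :delta => (d -> exp.(d) ./ (1 + sum(exp.(d)))) => :share)
sim = transform(groupby(sim, :market), :share => (s -> 1 - sum(s)) => :outside)
sim.y = log.(sim.share) .- log.(sim.outside)

# OLS suffers from the built-in price-xi correlation; IV with the cost shifter
# should recover alpha_true up to sampling error
ols_sim = reg(sim, @formula(y ~ x + price))
iv_sim  = reg(sim, @formula(y ~ x + (price ~ cost)))
regtable(ols_sim, iv_sim)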

4 Exercise 2

4.1 Introduction

This is the second of three exercises that will give you a solid foundation for doing BLP-style estimation. We’ll continue with the same running example: what if we halved an important product’s price? Our goal today is to relax some of the unrealistic substitution patterns implied by the pure logit model by incorporating preference heterogeneity. To do so, we will use cross-market variation in our product and some new demographic data to estimate parameters that govern preference heterogeneity.

4.2 Setup

We’ll be continuing where we left off after the first exercise. You should just keep adding to your code, using its solutions if you had trouble with some parts.

4.3 Data

Today, you'll incorporate demographics.csv into estimation, which again is a simplified version of Nevo's (2000) demographic data with less information and fewer derived columns. The data were originally drawn from the Current Population Survey.

In your own work, when incorporating demographic data into estimation, you will want to sample from the whole Current Population Survey (or whatever survey/census data you are using), not just from a subset of it. The small size of today’s demographic data helps with distributing the data, but in practice you should ideally be sampling from a much larger dataset of demographic information. In your own work you will also want to incorporate more demographic variables than the one included in this dataset. Like the product data, in these exercises we only consider a few columns to keep the exercises a manageable length.

The demographic dataset contains information about 20 individuals drawn from the Current Population Survey for each of the 94 markets in the product data. Each row is a different individual. The columns in the data are as follows.

  • market (String): The city-quarter pair that defines markets \(t\) used in these exercises. The data were motivated by real cereal purchase data across 47 US cities in the first 2 quarters of 1988.
  • quarterly_income (Float): The quarterly income of the individual in dollars.

In today and tomorrow’s exercises, we will use these demographic data to introduce income-specific preference heterogeneity into our BLP model of demand for cereal and see how our counterfactual changes. By incorporating income, we will also be able to speak to distributional concerns: how will counterfactual changes in the market differentially affect high- and low-income consumers?

4.4 My BLP.jl (This section failed miserably…)

In this section, I will try my best to make a very flexible BLP estimation code.

First, load the two datasets we will use and see what they look like:

using CSV, DataFrames, DataFramesMeta, Random, LinearAlgebra, Statistics, Distributions, StatsBase
using FixedEffectModels, RegressionTables, Optim

# Load the necessary data
data_prod = df;

# rename certain columns to match the key arguments used in later functions

@rename! data_prod begin
  :market_id = :market
  :product_id = :product
  :price_x0 = :price_per_serving
  :mushy_x1 = :mushy
  :price_instrument_z0 = :price_instrument
  :s_obs = :market_share
  :s_out = :outside_share
end

first(data_prod, 5)
WARNING: using Optim.optimize in module Notebook conflicts with an existing identifier.
5×12 DataFrame
Row market_id product_id mushy_x1 servings_sold city_population price_x0 price_instrument_z0 market_size s_obs s_out logit_delta new_prices
String7 String7 Int64 Float64 Int64 Float64 Float64 Int64 Float64 Float64 Float64 Float64
1 C01Q1 F1B04 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044
2 C01Q1 F1B06 1 362850.0 516259 0.114178 0.0725791 46463310 0.00780939 0.555225 -4.26405 0.114178
3 C01Q1 F1B07 1 603768.0 516259 0.132391 0.101842 46463310 0.0129945 0.555225 -3.75485 0.132391
4 C01Q1 F1B09 0 2.68092e5 516259 0.130344 0.104332 46463310 0.00576996 0.555225 -4.56671 0.130344
5 C01Q1 F1B11 0 8.3328e5 516259 0.154823 0.121111 46463310 0.0179341 0.555225 -3.43267 0.154823
# Load the necessary data
data_demo = CSV.read("data/raw/demographics.csv", DataFrame)

@rename! data_demo begin
  :market_id = :market
  :log_income_y0 = :quarterly_income
end

@rtransform! data_demo :log_income_y0 = log(:log_income_y0)

first(data_demo, 5)
5×2 DataFrame
Row market_id log_income_y0
String7 Float64
1 C01Q1 8.58651
2 C01Q1 9.48285
3 C01Q1 8.40662
4 C01Q1 7.69358
5 C01Q1 6.42269

Next, we sample the demographic data to get 1,000 random draws for each market:

function sample_ind(df, n = 1000, seedn = 123)
    Random.seed!(seedn)

    grouped = groupby(df, :market_id)

    sampled_df = combine(grouped) do subdf
        sampled_rows = sample(eachrow(subdf), n; replace = true)
        DataFrame(sampled_rows)
    end

    sampled_df = transform(groupby(sampled_df, :market_id), eachindex => :ind_id)

    return sampled_df
end

# 94 markets x 1000 individual
data_demo = sample_ind(data_demo);


data_demo
# first(data_demo, 10)
94000×3 DataFrame
93975 rows omitted
Row market_id log_income_y0 ind_id
String7 Float64 Int64
1 C01Q1 8.39605 1
2 C01Q1 7.76281 2
3 C01Q1 7.75411 3
4 C01Q1 9.09651 4
5 C01Q1 9.09651 5
6 C01Q1 7.75411 6
7 C01Q1 8.29923 7
8 C01Q1 8.40662 8
9 C01Q1 8.47015 9
10 C01Q1 8.1857 10
11 C01Q1 7.77479 11
12 C01Q1 7.76281 12
13 C01Q1 9.09651 13
93989 C65Q2 8.46581 989
93990 C65Q2 7.4559 990
93991 C65Q2 8.68641 991
93992 C65Q2 7.5741 992
93993 C65Q2 7.91547 993
93994 C65Q2 8.98306 994
93995 C65Q2 8.45945 995
93996 C65Q2 8.95375 996
93997 C65Q2 7.80121 997
93998 C65Q2 7.17447 998
93999 C65Q2 7.58398 999
94000 C65Q2 5.13197 1000

Create the unobserved individual tastes \(v_{it}\):

function create_v!(df::DataFrame, node::Int)
    for i in 1:node
        colname = Symbol("node_y", i)
        df[!, colname] = randn(nrow(df))
    end
    return df
end

create_v!(data_demo, 1)
94000×4 DataFrame
93975 rows omitted
Row market_id log_income_y0 ind_id node_y1
String7 Float64 Int64 Float64
1 C01Q1 8.39605 1 0.808288
2 C01Q1 7.76281 2 -1.12207
3 C01Q1 7.75411 3 -1.10464
4 C01Q1 9.09651 4 -0.416993
5 C01Q1 9.09651 5 0.287588
6 C01Q1 7.75411 6 0.229819
7 C01Q1 8.29923 7 -0.421769
8 C01Q1 8.40662 8 -1.35559
9 C01Q1 8.47015 9 0.0694591
10 C01Q1 8.1857 10 -0.117323
11 C01Q1 7.77479 11 1.21928
12 C01Q1 7.76281 12 0.292914
13 C01Q1 9.09651 13 -0.0311481
93989 C65Q2 8.46581 989 1.45184
93990 C65Q2 7.4559 990 -0.873824
93991 C65Q2 8.68641 991 0.534542
93992 C65Q2 7.5741 992 0.487625
93993 C65Q2 7.91547 993 -1.12754
93994 C65Q2 8.98306 994 0.991109
93995 C65Q2 8.45945 995 -0.112785
93996 C65Q2 8.95375 996 -1.30939
93997 C65Q2 7.80121 997 -1.64881
93998 C65Q2 7.17447 998 0.645287
93999 C65Q2 7.58398 999 -0.0317194
94000 C65Q2 5.13197 1000 1.23878

Join all dataframes into one:


df1 = unique(@select data_prod :market_id :product_id)
df2 = unique(@select data_demo :ind_id)

# Cross join to mimic expand_grid
main_data = crossjoin(df1, df2)

leftjoin!(main_data, data_prod; on=[:market_id, :product_id])
leftjoin!(main_data, data_demo; on=[:market_id, :ind_id])

main_data
2256000×15 DataFrame
2255975 rows omitted
Row market_id product_id ind_id mushy_x1 servings_sold city_population price_x0 price_instrument_z0 market_size s_obs s_out logit_delta new_prices log_income_y0 node_y1
String7 String7 Int64 Int64? Float64? Int64? Float64? Float64? Int64? Float64? Float64? Float64? Float64? Float64? Float64?
1 C01Q1 F1B04 1 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 8.39605 0.808288
2 C01Q1 F1B04 2 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 7.76281 -1.12207
3 C01Q1 F1B04 3 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 7.75411 -1.10464
4 C01Q1 F1B04 4 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 9.09651 -0.416993
5 C01Q1 F1B04 5 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 9.09651 0.287588
6 C01Q1 F1B04 6 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 7.75411 0.229819
7 C01Q1 F1B04 7 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 8.29923 -0.421769
8 C01Q1 F1B04 8 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 8.40662 -1.35559
9 C01Q1 F1B04 9 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 8.47015 0.0694591
10 C01Q1 F1B04 10 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 8.1857 -0.117323
11 C01Q1 F1B04 11 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 7.77479 1.21928
12 C01Q1 F1B04 12 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 7.76281 0.292914
13 C01Q1 F1B04 13 1 5.76945e5 516259 0.0720879 0.0354837 46463310 0.0124172 0.555225 -3.80029 0.036044 9.09651 -0.0311481
2255989 C65Q2 F6B18 989 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 8.46581 1.45184
2255990 C65Q2 F6B18 990 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 7.4559 -0.873824
2255991 C65Q2 F6B18 991 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 8.68641 0.534542
2255992 C65Q2 F6B18 992 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 7.5741 0.487625
2255993 C65Q2 F6B18 993 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 7.91547 -1.12754
2255994 C65Q2 F6B18 994 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 8.98306 0.991109
2255995 C65Q2 F6B18 995 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 8.45945 -0.112785
2255996 C65Q2 F6B18 996 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 8.95375 -1.30939
2255997 C65Q2 F6B18 997 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 7.80121 -1.64881
2255998 C65Q2 F6B18 998 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 7.17447 0.645287
2255999 C65Q2 F6B18 999 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 7.58398 -0.0317194
2256000 C65Q2 F6B18 1000 0 4.43638e5 188082 0.127557 0.0815838 16927380 0.0262083 0.642477 -3.19925 0.127557 5.13197 1.23878

Now let's start constructing the BLP estimation algorithm! First, we construct functions that calculate the predicted market shares and perform the contraction mapping to get \(\delta\) for given \(\Pi\) and \(\Sigma\):

# pi: 2x1
# sigma: 2x1
# delta should be jt variable
 
function calc_mkt_share(df, delta, pi, sigma)
  df = leftjoin(df, delta; on=[:market_id, :product_id])

  # exponentiated utility for each individual-product pair
  data = @rtransform df :util = exp(:d + :price_x0 * sigma[1] * :node_y1 + :mushy_x1 * sigma[2] * :node_y1 + :price_x0 * pi[1] * :log_income_y0 + :mushy_x1 * pi[2] * :log_income_y0)

  # NOTE: :util is already exp(...), so exp.(:util) below exponentiates a second time.
  # This double exponentiation overflows/underflows and is the likely source of the
  # NaN and ~1e-55 predicted shares in the output below.
  gdp = groupby(data, [:market_id, :ind_id])
  data = @transform gdp :util_it = exp.(:util) ./ (sum(exp.(:util)) + 1)

  # average over the 1,000 simulated individuals per market to get predicted shares
  gdp = groupby(data, [:market_id, :product_id])
  data = @combine gdp :pred_sh_jt = sum(:util_it .* (1/1000))
end

  delta_fp = unique(df, [:market_id, :product_id])
  @select! delta_fp :market_id :product_id
  @rtransform! delta_fp :d = 0
calc_mkt_share(main_data, delta_fp, [1,1], [1,1])

# #d0 should be jt variable
# function contraction_map(df, d0, pi, sigma, tol = 1e-5, maxiter = 1e5)
#   obs_mkt_share = unique(df, [:market_id, :product_id])
#   @select! obs_mkt_share :market_id :product_id :s_obs
#   d_old = copy(d0)
#   normdiff = Inf
#   iter = 0
#   while normdiff > tol && iter <= maxiter
#     model_mkt_share = calc_mkt_share(df, d_old, pi, sigma)
#     df_sh = @select df :market_id :product_id :s_obs
#     model_mkt_share = leftjoin(model_mkt_share, df_sh, on=[:market_id, :product_id])

#     d_new = leftjoin(d_old, model_mkt_share; on= [:market_id, :product_id]) 
#     d_new = @rtransform d_new :d_iter = :d + log(:s_obs) - log(:pred_sh_jt)

#     normdiff = maximum(abs.(d_new.d_iter .- d_new.d))

#     # normdiff = maximum(norm.(d_new.d_iter .- d_new.d))
#     d_old = @select d_new begin
#       :market_id
#       :product_id
#       :d = :d_iter 
#     end 
#     iter += 1
#   end
#   return d_old
# end

# function calc_delta(df, pi, sigma)
#   delta_fp = unique(df, [:market_id, :product_id])
#   @select! delta_fp :market_id :product_id
#   @rtransform! delta_fp :d = 0

#   delta_jt = contraction_map(df, delta_fp, pi, sigma)

#   return delta_jt
# end

# @time delta_res = calc_delta(main_data, [1, 1], [1, 1])
# mean(delta_res)
2256×3 DataFrame
2231 rows omitted
Row market_id product_id pred_sh_jt
String7 String7 Float64
1 C01Q1 F1B04 NaN
2 C01Q1 F1B06 NaN
3 C01Q1 F1B07 NaN
4 C01Q1 F1B09 2.00559e-55
5 C01Q1 F1B11 2.39821e-55
6 C01Q1 F1B13 2.10254e-55
7 C01Q1 F1B17 NaN
8 C01Q1 F1B30 1.97595e-55
9 C01Q1 F1B45 2.30514e-55
10 C01Q1 F2B05 1.73479e-55
11 C01Q1 F2B08 2.03294e-55
12 C01Q1 F2B15 NaN
13 C01Q1 F2B16 NaN
2245 C65Q2 F2B16 NaN
2246 C65Q2 F2B19 2.79288e-6
2247 C65Q2 F2B26 2.61091e-6
2248 C65Q2 F2B28 NaN
2249 C65Q2 F2B40 2.79267e-6
2250 C65Q2 F2B48 2.64727e-6
2251 C65Q2 F3B06 NaN
2252 C65Q2 F3B14 2.64427e-6
2253 C65Q2 F4B02 3.01899e-6
2254 C65Q2 F4B10 2.69486e-6
2255 C65Q2 F4B12 2.53109e-6
2256 C65Q2 F6B18 2.65098e-6

4.5 My BLP.jl (standing on the shoulders of giants)

Okay, my previous attempt failed miserably. This is probably because it is hard to do the matrix computations in the dataframe/dplyr style I am used to from R. But then again, maybe I just did not do a good job. So I will try to write up code following others, while using DataFramesMeta syntax whenever doable, for transparency.

But this will take a lot of work. I will periodically come back to update my BLP. In the meantime, enjoy your life!

References

Berry, Steven. 1994. “Estimating Discrete-Choice Models of Product Differentiation.” The RAND Journal of Economics 25 (2): 242–62. http://www.jstor.org/stable/2555829.
Berry, Steven, James Levinsohn, and Ariel Pakes. 1995. “Automobile Prices in Market Equilibrium.” Econometrica 63 (4): 841–90. http://www.jstor.org/stable/2171802.
McFadden, Daniel. 1974. “Conditional Logit Analysis of Qualitative Choice Behavior.” In Frontiers in Econometrics, edited by Paul Zarembka, 105–42. New York: Academic Press.
Nevo, Aviv. 2000. “A Practitioner’s Guide to Estimation of Random-Coefficients Logit Models of Demand.” Journal of Economics & Management Strategy 9 (4): 513–48.
Train, Kenneth E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge University Press.

Footnotes

  1. You might have already noticed, but I kind of use variables without subscript as the vector of the variables. For example, \(\delta\) is just \((\delta_1, \ldots, \delta_J).\)↩︎