BLP Demystified: From Basics to Brain-Busting Models!
Install the necessary Julia packages for later computation:
# You might need this commented code to install packages
# using Pkg
# Pkg.add(["DataFrames", "CSV", "GLM", "Statistics", "LinearAlgebra", "Distributions", "NLopt", "FixedEffectModels", "RegressionTables", "DataFramesMeta", "Random", "StatsModels", "Optim", "StatsBase"])
1 BLP
This exercise estimates the demand-side BLP model.
1.1 Motivation
Why do this? Demand estimation is central to the IO literature because measuring market power is one of IO's core concerns. How do we quantify market power? Usually we use the markup as the measure. But it is hard to calculate the markup directly because it depends on the firm's cost function, which is not observed. IO theory shows, however, that we can recover the markup from the demand elasticity. Thus estimating demand is important.
1.2 Basic: McFadden (1974) style logit model
1.2.1 Model setup
We will first estimate a basic logit model with no unobserved demand shifters and no random coefficients. But let's first talk a bit about the background of this discrete choice model. Note that most of this follows Train (2009).
Even before McFadden (1974), the logit model had a long history of development. But McFadden (1974) provides a complete, well-defined econometric model that is consistent with the utility-maximizing behavior of individuals.
Individual \(i\)'s utility-maximizing behavior (indirect utility) can be specified as follows:
\[ u_{ij} = \underbrace{x_j \beta + \alpha p_j}_{\delta_j} + \varepsilon_{ij} \]
where the mean utility of the outside option is normalized to zero, and the idiosyncratic shock is i.i.d. and follows the Type 1 Extreme Value (T1EV) distribution. We assume there are products \(0, \ldots, J\), where \(0\) denotes the outside option. We can think of \(\delta_j\) as the mean utility from product \(j\), because in this parameterization \(\delta_j\) does not depend on \(i\).
Now let's do some math to derive the logit choice probabilities. One benefit of the logit model is that we can get a closed-form solution. We are going to compute the probability that an individual chooses product \(j\) given \(p_j\) and \(x_j\).
\[\begin{align} P (u_{ij} \geq u_{ij'} \; \forall j' \neq j \mid x_j, p_j) &= P (x_j \beta + \alpha p_j + \varepsilon_{ij} \geq x_{j'}\beta + \alpha p_{j'} + \varepsilon_{ij'} \; \forall j' \neq j \mid x_j, p_j) \\ &= P ( \varepsilon_{ij'} \leq \varepsilon_{ij} + \delta_j - \delta_{j'} \; \forall j' \neq j). \end{align}\]
If we treat \(\varepsilon_{ij}\) as given, the last term is a product of T1EV cumulative distribution functions, where \(F(\varepsilon_{ij}) = e^{-e^{- \varepsilon_{ij}}}\). Since we assumed the shocks are i.i.d., we can express the last term as the product of the individual cumulative distributions (for brevity, we will now denote the conditional logit choice probability as \(P_{ij}\)):
\[ P_{ij} \mid \varepsilon_{ij} = \prod_{j' \neq j} e^{ - e^{-(\varepsilon_{ij} + \delta_j - \delta_{j'})}}. \]
Since \(\varepsilon_{ij}\) is not actually given, we need to integrate it out over the density of \(\varepsilon_{ij}\):
\[ P_{ij} = \int \left(\prod_{j' \neq j} e^{ - e^{-(\varepsilon_{ij} + \delta_j - \delta_{j'})}} \right) e^{- \varepsilon_{ij}} e^{-e^{-\varepsilon_{ij}}} d \varepsilon_{ij}. \]
Now let’s get this into a closed-form expression:
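Filling in the intermediate steps (this is the standard derivation, following Train 2009): fold the \(j' = j\) term into the product, and then change variables with \(t = e^{-\varepsilon_{ij}}\):
\[\begin{align}
P_{ij} &= \int_{-\infty}^{\infty} e^{-\varepsilon_{ij}} \prod_{j'} \exp\!\left(-e^{-(\varepsilon_{ij} + \delta_j - \delta_{j'})}\right) d\varepsilon_{ij}
= \int_{-\infty}^{\infty} e^{-\varepsilon_{ij}} \exp\!\left(-e^{-\varepsilon_{ij}} \sum_{j'} e^{-(\delta_j - \delta_{j'})}\right) d\varepsilon_{ij} \\
&= \int_{0}^{\infty} \exp\!\left(-t \sum_{j'} e^{\delta_{j'} - \delta_j}\right) dt
= \frac{1}{\sum_{j'} e^{\delta_{j'} - \delta_j}}
= \frac{e^{\delta_j}}{\sum_{j'} e^{\delta_{j'}}},
\end{align}\]
where the first equality uses the fact that the \(j' = j\) term of the product equals \(e^{-e^{-\varepsilon_{ij}}}\), and the second line uses the change of variables \(t = e^{-\varepsilon_{ij}}\), \(dt = -e^{-\varepsilon_{ij}} d\varepsilon_{ij}\).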
As a result, we can get the closed-form expression:
\[ P_{ij} = \frac{e^{\delta_{j}}}{\sum_{j'} e^{\delta_{j'}}} \]
This could be understood as the predicted share function given the fixed values of the parameters.
Note that this is a very simple model because we are not allowing for any unobserved product demand shifters that could affect the utility gained from the product. In fact, we are assuming that the econometrician fully observes all the variables that construct the mean utility. Thus there is not much econometrics involved. You can get the parameters as follows:
- Assuming you have data on market shares, match the observed quantity sold of each product to \(P_{ij} \cdot M\), where \(M\) is the total market size.
- Since the \(J\) market shares give \(J\) equations, simple algebra recovers the mean utilities \(\delta_j\).
- Then run nonlinear least squares, minimizing the sum of squared differences between observed and predicted shares across all products; this gives the parameters that best fit the data (see the sketch below).
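To make the last step concrete, here is a minimal sketch in Julia of nonlinear least squares on shares. Everything below (the simulated data, the true parameter values, and the helper names `predicted_shares` and `nls_obj`) is made up for illustration and is not part of the original exercise.

```julia
# Minimal sketch: recover (beta, alpha) in the simple logit by nonlinear least
# squares on shares. All data below are simulated for illustration.
using Optim, Random

Random.seed!(1)
J = 10
x = randn(J)                      # product characteristic
p = 1.0 .+ rand(J)                # price
beta_true, alpha_true = 1.0, -2.0
delta_true = x .* beta_true .+ alpha_true .* p
s_obs = exp.(delta_true) ./ (1 + sum(exp.(delta_true)))  # "observed" shares

# predicted shares given candidate parameters theta = (beta, alpha)
function predicted_shares(theta)
    delta = x .* theta[1] .+ theta[2] .* p
    return exp.(delta) ./ (1 + sum(exp.(delta)))
end

# sum of squared differences between observed and predicted shares
nls_obj(theta) = sum((s_obs .- predicted_shares(theta)).^2)

res = optimize(nls_obj, [0.0, 0.0], NelderMead())
Optim.minimizer(res)   # should be close to (1.0, -2.0)
```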
1.2.2 Adding unobserved demand shifters
We can add an unobserved variable \(\xi_j\), which can be thought of as a demand shifter for product \(j\). This makes the model flexible enough to incorporate the realistic situation where the econometrician cannot observe some components that affect the utility from a product. Most of what we did above does not change much. The only problem is understanding how this unobserved term relates to the main parameters of interest: if there is endogeneity, we need an IV to estimate the parameters. In this section, we will do both cases (OLS, IV).
1.2.3 Computation (Following Berry (1994))
So how can we retrieve the parameters of interest? A naive approach would be nonlinear least squares, minimizing the sum of squared differences between predicted and observed shares across all products. The problem is that this direct approach is infeasible: you would need to know \(\xi_j\), and since it is unobservable, that is a problem.
This is where Berry (1994) comes in. He introduces a clever two-step estimation process.
Step 1: Inversion
Notation: Let \(\hat{s}_j (\delta)\) be predicted shares and let \(s_j\) be observed shares.1
Then you can use the system of equations from matching actual to predicted shares and invert them to get the mean utility. For this simple case, we can get the following equations:
\[ \delta_j = \log s_j - \log s_0, \quad j = 1, \ldots, J. \]
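To see where this comes from, note that with the outside option's mean utility normalized to \(\delta_0 = 0\), the logit share formula gives
\[ \frac{s_j}{s_0} = \frac{e^{\delta_j} \big/ \left(1 + \sum_{j'} e^{\delta_{j'}}\right)}{1 \big/ \left(1 + \sum_{j'} e^{\delta_{j'}}\right)} = e^{\delta_j} \quad \Longrightarrow \quad \delta_j = \log s_j - \log s_0. \]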
So this inversion gets us the value of the mean utility. Then we have the second step.
Step 2: IV estimation
By definition, we have \(\delta_j = x_j \beta + \alpha p_j + \xi_j\), so we can run this regression to retrieve the parameters. I wrote IV, but this could just be OLS if you can assume the unobserved term is uncorrelated with price.
1.2.4 Coding (with Julia)
Finally we will do some coding to get the result we just talked about.
using FixedEffectModels, DataFrames, CSV, RegressionTables

# read in the data
otc = CSV.read("data/otc.csv", DataFrame)

# Run regressions
ols1 = reg(otc, @formula(ln_mkt_share_diff ~ price + promotion))
ols2 = reg(otc, @formula(ln_mkt_share_diff ~ price + promotion + fe(product)))
ols3 = reg(otc, @formula(ln_mkt_share_diff ~ price + promotion + fe(product) + fe(store)))
iv1 = reg(otc, @formula(ln_mkt_share_diff ~ (price ~ cost) + promotion))
iv2 = reg(otc, @formula(ln_mkt_share_diff ~ (price ~ cost) + promotion + fe(product)))

regtable(ols1, ols2, ols3, iv1, iv2,
    order = ["price"], drop = ["(Intercept)"],
    regression_statistics = [FStatIV, Nobs, R2],
    labels = Dict(
        "price" => "Price",
        "promotion" => "Promotion",
        "ln_mkt_share_diff" => "Log Mkt share difference"
    ))

## Some R codes that I followed
# m1 <- lm(ln_mkt_share_diff ~ price + promotion , data = otc)
# m2 <- lm(ln_mkt_share_diff ~ price + promotion + factor(product), data = otc)
# m3 <- lm(ln_mkt_share_diff ~ price + promotion + factor(product) + factor(store), data = otc)
# m4 <- ivreg(ln_mkt_share_diff ~ price + promotion | . - price + cost, data = otc)
# m5 <- ivreg(ln_mkt_share_diff ~ price + promotion + factor(product) | . - price + cost, data = otc)
# stargazer(m1, m2, m3, m4, m5,
#           omit = c("product", "store"),
#           type = "text")
-----------------------------------------------------------------------------
Log Mkt share difference
---------------------------------------------------
(1) (2) (3) (4) (5)
-----------------------------------------------------------------------------
Price 0.020 -0.189** -0.145* 0.069*** 0.169
(0.014) (0.059) (0.059) (0.015) (0.115)
Promotion 0.121 0.187* 0.201** 0.149 0.308***
(0.093) (0.074) (0.073) (0.093) (0.082)
-----------------------------------------------------------------------------
product Fixed Effects Yes Yes Yes
store Fixed Effects Yes
-----------------------------------------------------------------------------
Estimator OLS OLS OLS IV IV
-----------------------------------------------------------------------------
Controls Yes Yes
-----------------------------------------------------------------------------
First-stage F statistic 8,147.921 394.113
N 1,056 1,056 1,056 1,056 1,056
R2 0.003 0.440 0.456 -0.008 0.420
-----------------------------------------------------------------------------
1.2.5 Caveats
But we don't usually use this basic setup in IO, because the model is a bit too simple to fully capture reality. One of the well-known problems is the independence of irrelevant alternatives (IIA). Basically, this means we don't get realistic demand elasticities (substitution patterns). If you want to know more about it, google the famous red bus, blue bus story.
1.2.6 Solutions?
There are some ways to alleviate this problem. One of them (which we will not discuss) is the nested logit. Basically, we define groups of products such that IIA holds within a group but may not hold across groups. In the red bus, blue bus case, the two buses would be in the same group.
Another way is to enhance the random utility model into a logit model with random coefficients. In essence, this introduces consumer preference heterogeneity into the model by interacting consumer tastes with product characteristics. The nuisance in this case is that a closed-form expression for the choice probability is no longer obtainable, so we need to do some numerical computation.
1.3 Advanced: Berry, Levinsohn, and Pakes (1995) (Random coefficients logit model)
We again start with the individual utility function. But now something is added (we will now also explicitly denote markets as \(t\)):
\[ u_{ijt} = x_{jt} \beta_{it} + \alpha p_{jt} + \xi_{jt} + \varepsilon_{ijt} \]
The difference is that the slope coefficients can now vary across individuals \(i\). For now, we will assume \(\beta_{it}^k = \beta_0^k + \sigma_{k} v_{it}^k\), where \(k\) indexes the characteristics (the dimensions of \(\beta\)), \(\beta_0^k\) is the fixed taste for characteristic \(k\), and \(v_{it}^k\) is a random taste that follows a standard normal distribution.
Now we can expand the model:
\[\begin{align} u_{ijt} &= (x_{j1t}, \ldots, x_{jKt}) \cdot (\beta_{0}^1 + \sigma_1 v_{it}^1, \ldots, \beta_{0}^K + \sigma_K v_{it}^K)^T + \alpha p_{jt} + \xi_{jt} + \varepsilon_{ijt}\\ &= x_{jt}\beta_{0} + \sum_k x_{jkt} \sigma_{k}v_{it}^k + \alpha p_{jt} + \xi_{jt} + \varepsilon_{ijt}\\ &= \underbrace{x_{jt}\beta_{0} + \alpha p_{jt} + \xi_{jt}}_{\delta_{jt}} + \underbrace{\sum_k x_{jkt} \sigma_{k}v_{it}^k}_{\mu_{ijt}} + \varepsilon_{ijt}. \end{align}\]
We can easily see that this is just an extension of what we did for the basic random utility model: indirect utility is made up of the mean utility \(\delta_{jt}\) and the individual-specific terms \(\mu_{ijt} + \varepsilon_{ijt}\).
Now we will make a simplification: we assume there is a single characteristic, \(K = 1\). Using this simplification, we can again use the assumption that the idiosyncratic shock follows T1EV to get the aggregate share:
\[ s_{jt} = \int \frac{\exp(\delta_{jt} + x_{jt} \sigma_t v_{it})}{1 + \sum_{j'} \exp(\delta_{j't} + x_{j't} \sigma_t v_{it})} f(v_i)\,dv_i \]
The integral has no analytical solution in the random coefficient model, so we need to compute the integral by simulation. One way to do it is as follows:
\[ \hat{s}_{jt} = \frac{1}{ns} \sum_{i=1}^{ns} \frac{\exp(\delta_{jt} + x_{jt} \sigma_t v_{it})}{1 + \sum_{j'} \exp(\delta_{j't} + x_{j't} \sigma_t v_{it})} \]
where \(ns\) is number of random draws from \(v_i\).
Now we can see that the inversion method we used before is not easy to implement, because we now have additional parameters whose values we do not know.
So in BLP, we need a nested estimation algorithm.
- In the outer loop, we iterate over different values of the nonlinear parameters.
- In the inner loop, for a given parameter value, we do the inversion to get the mean utility and evaluate the GMM objective function.
We keep iterating until we find the parameters that minimize the GMM objective.
Now let’s do some coding!
1.3.1 Another coding (with Julia)
This portion of the code is from here
###############################################################################
#### BLP fixed-point algorithm, inverting mkt shares to get mean utility ####
###############################################################################
using CSV
using DataFrames
using GLM
using Statistics
using LinearAlgebra
using Distributions
using NLopt

otc = CSV.read("data/otc.csv", DataFrame)

ns = 500;
nmkt = maximum(otc.mkt);
mkt = unique(otc.mkt);
nprod = maximum(otc.product);

vi = quantile.(Normal(), collect(range(0.5/ns, step = 1/ns, length = ns)));
sigma = 1;

function calc_mkt_share_t(delta_t, sigma_t, x_t, vi_t)
    # Dimension: delta_t 11*1, sigma_t 1*1, x_t 11*1
    delta_t = delta_t .* ones(nprod, ns)
    mu_t = x_t*sigma_t*vi_t'
    numerator = exp.(delta_t .+ mu_t)
    denominator = ones(nprod, ns) .+ sum(numerator, dims = 1)
    mkt_share_t = mean(numerator./denominator, dims = 2)
end

function contraction_t(d0, sigma_t, x_t, vi_t, mkt_t, tol = 1e-5, maxiter = 1e5)
    obs_mkt_share_t = mkt_t.mkt_share
    d_old = d0
    normdiff = Inf
    iter = 0
    while normdiff > tol && iter <= maxiter
        model_mkt_share_t = calc_mkt_share_t(d_old, sigma_t, x_t, vi_t)
        d_new = d_old .+ log.(obs_mkt_share_t) .- log.(model_mkt_share_t)
        normdiff = maximum(norm.(d_new .- d_old))
        d_old = d_new
        iter += 1
    end
    return d_old
end

function calc_delta(sigma)
    delta_fp = zeros(nprod, nmkt);
    for t in mkt
        mkt_t = otc[otc.mkt .== t, :];
        x_t = ones(nprod, 1);
        delta_t = zeros(nprod, 1);
        sigma_t = sigma;
        vi_t = vi;
        delta_fp[:, t] = contraction_t(delta_t, sigma_t, x_t, vi_t, mkt_t);
    end
    return vec(delta_fp);
end

@time delta_fp = calc_delta(sigma);
mean(delta_fp)
std(delta_fp)
1.348311 seconds (3.75 M allocations: 365.638 MiB, 3.21% gc time, 96.17% compilation time)
0.8537059827877238
################################################################
#### Estimate beta and sigma using GMM (cost as instrument) ####
################################################################
X = hcat(ones(nprod*nmkt, 1),
    otc.price, otc.promotion,
    otc.product_2, otc.product_3, otc.product_4, otc.product_5,
    otc.product_6, otc.product_7, otc.product_8, otc.product_9,
    otc.product_10, otc.product_11);
z = hcat(X, otc.cost);
Phi = z'*z/1056;
inv_Phi = inv(Phi);

function GMMObjFunc(theta2::Vector, grad::Vector)
    sigma = theta2[1]
    delta = calc_delta(sigma)
    theta1 = inv(X'*z*inv_Phi*z'*X)*X'*z*inv_Phi*z'*delta
    error = delta - X*theta1
    obj = error'*z*inv_Phi*z'*error
    return obj
end

opt = Opt(:LN_COBYLA, 1)
opt.xtol_rel = 1e-4
opt.lower_bounds = [0.00001]
opt.min_objective = GMMObjFunc
@time (minf,minx,ret) = optimize(opt, [1])
@show sigma = minx[1]

delta = calc_delta(sigma[1]);
theta1 = inv(X'*z*inv_Phi*z'*X)*X'*z*inv_Phi*z'*delta
18.817741 seconds (20.18 M allocations: 109.887 GiB, 14.52% gc time, 0.73% compilation time)
sigma = minx[1] = 28.720209709193362
13-element Vector{Float64}:
-71.52085329511755
-1.0852292863458644
0.22206483339059624
1.903952993172239
3.990433860485858
-0.6194995526125144
1.2498418962356002
3.9120756908464607
-2.12500554825884
-1.1737916667530266
0.2462099284926299
-2.7352849052497947
0.011924728064080292
1.3.2 My personal coding (again with Julia)
I will try to write my own code in Julia, following the exercises in the Mixtape session on Demand Estimation. Note that I will just be copying the exercise questions from that site below.
2 Exercise 1
2.1 Introduction
This is the first of three exercises that will give you a solid foundation for doing BLP-style estimation. The running example is the same as in lecture: what if we halved an important product’s price?
2.2 Setup
Most of the computational heavy-lifting in these exercises will be done by the open-source Python package PyBLP. It is easiest to use PyBLP in Python, and the hints/solutions for the exercises will be given in Python. But for those who are more familiar with R, it is straightforward to call PyBLP from R with the reticulate package. It is technically possible to call PyBLP from other languages like Julia and MATLAB, but most users either use Python or R.
You should install PyBLP on top of the Anaconda Distribution. Anaconda comes pre-packaged with all of PyBLP’s dependencies and many more Python packages that are useful for statistical computing. Steps:
- Install Anaconda if you haven’t already. You may wish to create a new environment for just these exercises, but this isn’t strictly necessary.
- Install PyBLP. On the Anaconda command line, you can run the command pip install pyblp.
If you’re using Python, you have two broad options for how to do the coding exercises.
- Use a Jupyter Notebook. The solutions to each exercise will be in a notebook. In general, notebooks are a good way to weave text and code for short exercises, and to distribute quick snippets of code with others.
- Use an integrated development environment (IDE). Once you get beyond a few hundred lines of code, I strongly recommend using an IDE and not notebooks. For Python, I recommend VS Code or PyCharm. The former is free and the latter has a free community edition with all the features you’ll need for standard Python development. Both integrate well with Anaconda.
If using a notebook, you can right click and save the following notebook template: notebook.ipynb. If using an IDE, you can right click and save the following script template: script.py. Both import various packages used throughout the exercise.
#| eval: false
import pyblp
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
The notebook additionally configures these packages to reduce the amount of information printed to the screen.
#| eval: false
pyblp.options.digits = 3
pyblp.options.verbose = False
pd.options.display.precision = 3
pd.options.display.max_columns = 50
import IPython.display
IPython.display.display(IPython.display.HTML('<style>pre { white-space: pre !important; }</style>'))
Finally, both show how to load the data that we’ll be using today.
2.3 Data
Today, you’ll use the products.csv
dataset, which is a simplified version of Nevo’s (2000) fake cereal data with less information and fewer derived columns. The data were motivated by real grocery store scanner data, but due to the proprietary nature of this type of data, the provided data are not entirely real. This dataset has been used as a standard example in much of the literature on BLP estimation.
Compared to typical datasets you might use in your own work, the number of observations in this example dataset is quite small. This helps with making these exercises run very fast, but in practice one would want more data than just a couple thousand data points to estimate a flexible model of demand. Typical datasets will also include many more product characteristics. This one only includes a couple to keep the length of the exercises manageable.
The data contains information about 24 breakfast cereals across 94 markets. Each row is a product-market pair. Each market has the same set of breakfast cereals, although with different prices and quantities. The columns in the data are as follows.
| Column | Data Type | Description |
|---|---|---|
| market | String | The city-quarter pair that defines markets \(t\) used in these exercises. The data were motivated by real cereal purchase data across 47 US cities in the first 2 quarters of 1988. |
| product | String | The firm-brand pair that defines products \(j\) used in these exercises. Each of 5 firms produces between 1 and 9 brands of cereal. |
| mushy | Binary | A dummy product characteristic equal to one if the product gets soggy in milk. |
| servings_sold | Float | Total quantity \(q_{jt}\) of servings of the product sold in a market, which will be used to compute market shares. |
| city_population | Float | Total population of the city, which will be used to define a market size. |
| price_per_serving | Float | The product's price \(p_{jt}\) used in these exercises. |
| price_instrument | Float | An instrument to handle price endogeneity in these exercises. Think of it as a cost-shifter, a Hausman instrument, or any other valid IV that we discussed in class. |
Throughout the exercises, we use these data to estimate an increasingly flexible BLP-style model of demand for cereal. We will use predictions from this model to see how our running example, cutting the price of one cereal, affects demand for that cereal and for its substitutes.
3 Exercise 1
3.1 1. Describe the data
You can download products.csv
from this link. To load it, you can use pd.read_csv
. To look at a random sample of its rows, you can use .sample
. To compute summary statistics for different columns, you can use .describe
. Throughout these exercises, you’ll be given links to functions and methods that can be used to answer the questions. If you’re unsure about how to use them, you should click on the link, where there is typically example code lower down on the page.
using CSV, DataFrames, DataFramesMeta, StatsModels, LinearAlgebra, Random, Statistics, Distributions
df = CSV.read("data/raw/products.csv", DataFrame)
describe(df)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | market | C01Q1 | C65Q2 | 0 | String7 | ||
2 | product | F1B04 | F6B18 | 0 | String7 | ||
3 | mushy | 0.333333 | 0 | 0.0 | 1 | 0 | Int64 |
4 | servings_sold | 1.20296e6 | 4085.41 | 3.62285e5 | 9.85623e7 | 0 | Float64 |
5 | city_population | 641995.0 | 173072 | 332943.0 | 7322564 | 0 | Int64 |
6 | price_per_serving | 0.12574 | 0.0454871 | 0.123829 | 0.225728 | 0 | Float64 |
7 | price_instrument | 0.0908116 | -0.00265638 | 0.0890101 | 0.188043 | 0 | Float64 |
3.3 3. Estimate the pure logit model with OLS
Recall the pure logit estimating equation: \(\log(s_{jt} / s_{0t}) = \delta_{jt} = \alpha p_{jt} + x_{jt}' \beta + \xi_{jt}\). First, create a new column logit_delta
equal to the left-hand side of this expression. You can use np.log
to compute the log.
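The step that constructs the share columns used in the next code cell is not shown above, so here is a hedged sketch in Julia. The displayed data suggest the market-size convention market_size = 90 × city_population (roughly one potential serving per person per day over a quarter); treat that scaling, and the column formulas, as assumptions rather than the exercise's official definition.

```julia
# Sketch (assumptions): market size of 90 servings per person per quarter,
# consistent with the market_size values shown in the tables below.
@rtransform! df :market_size = 90 * :city_population
@rtransform! df :market_share = :servings_sold / :market_size
gdf = groupby(df, :market)
df = @transform gdf :outside_share = 1 .- sum(:market_share)
```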
@transform! df :logit_delta = log.(:market_share) .- log.(:outside_share)
first(df, 5)
Row | market | product | mushy | servings_sold | city_population | price_per_serving | price_instrument | market_size | market_share | outside_share | logit_delta |
---|---|---|---|---|---|---|---|---|---|---|---|
String7 | String7 | Int64 | Float64 | Int64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | |
1 | C01Q1 | F1B04 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 |
2 | C01Q1 | F1B06 | 1 | 362850.0 | 516259 | 0.114178 | 0.0725791 | 46463310 | 0.00780939 | 0.555225 | -4.26405 |
3 | C01Q1 | F1B07 | 1 | 603768.0 | 516259 | 0.132391 | 0.101842 | 46463310 | 0.0129945 | 0.555225 | -3.75485 |
4 | C01Q1 | F1B09 | 0 | 2.68092e5 | 516259 | 0.130344 | 0.104332 | 46463310 | 0.00576996 | 0.555225 | -4.56671 |
5 | C01Q1 | F1B11 | 0 | 8.3328e5 | 516259 | 0.154823 | 0.121111 | 46463310 | 0.0179341 | 0.555225 | -3.43267 |
Then, use the package of your choice to run an OLS regression of logit_delta
on a constant, mushy
, and price_per_serving
. There are many packages for running OLS regressions in Python. One option is to use the formula interface for statsmodels
. To use robust standard errors, you can specify cov_type='HC0'
in OLS.fit
.
Interpret your estimates. Your coefficient on price_per_serving
should be around -7.48
. In particular, can you re-express your estimate on mushy
in terms of how much consumers are willing to pay for mushy
, using your estimated price coefficient?
using FixedEffectModels, RegressionTables
ols = reg(df, @formula(logit_delta ~ mushy + price_per_serving), Vcov.robust())
regtable(ols, order = ["price_per_serving"], drop = ["(Intercept)"], regression_statistics = [Nobs, R2])
-------------------------------
logit_delta
-------------------------------
price_per_serving -7.480***
(0.840)
mushy 0.075
(0.054)
-------------------------------
Controls Yes
-------------------------------
N 2,256
R2 0.034
-------------------------------
3.4 4. Run the same regression with PyBLP
For the rest of the exercises, we'll use PyBLP to do our demand estimation. This isn't necessary for estimating the pure logit model, which can be done with linear regressions, but using PyBLP allows us to easily run our price cut counterfactual and make the model more flexible in subsequent days' exercises.
PyBLP requires that some key columns have specific names. You can use .rename
to rename the following columns so that they can be understood by PyBLP.
- market –> market_ids
- product –> product_ids
- market_share –> shares
- price_per_serving –> prices
By default, PyBLP treats prices
as endogenous, so it won't include them in its matrix of instruments. But the "instruments" for running an OLS regression are the same as the full set of regressors. So when running an OLS regression and not accounting for price endogeneity, we'll "instrument" for prices
with prices
themselves. We can do this by creating a new column demand_instruments0
equal to prices
. PyBLP will recognize all columns that start with demand_instruments
and end with 0
, 1
, 2
, etc., as “excluded” instruments to be stacked with the exogenous characteristics to create the full set of instruments.
With the correct columns in hand, we can initialize our pyblp.Problem
. To specify the same R-style formula for our regressors, use pyblp.Formulation
. The full code should look like the following.
#| eval: false
ols_problem = pyblp.Problem(pyblp.Formulation('1 + mushy + prices'), product_data)
If you print(ols_problem)
, you’ll get information about the configured problem. There should be 94 markets (T
), 2256 observations (N
), 3 product characteristics (K1
), and 3 total instruments (MD
). You can verify that these instruments are simply the regressors by looking at ols_problem.products.X1
and ols_problem.products.ZD
, comparing these with mushy
and prices
in your dataframe. For the full set of notation used by PyBLP, which is very close to the notation used in the lectures, see this page.
To estimate the configured problem, use .solve
. Use method='1s'
to just do 1-step GMM instead of the default 2-step GMM. In this case, this will just run a simple linear OLS regression. The full code should look like the following.
#| eval: false
ols_results = ols_problem.solve(method='1s')
Again, if you print(ols_results)
, you’ll get estimates from the logit model. Make sure that your estimates are the same as those you got from your OLS regression. If you used 'HC0'
standard errors as suggested above, your standard errors should also be the same.
3.5 5. Add market and product fixed effects
Since we expect price \(p_{jt}\) to be correlated with unobserved product quality \(\xi_{jt}\), we should be worried that our estimated \(\hat{\alpha}\) on price is biased. Since we have multiple observations per market and product, and prices vary both across and within markets, it is feasible for us to add both market and product fixed effects. If \(\xi_{jt} = \xi_j + \xi_t + \Delta\xi_{jt}\) and most of the correlation between \(p_{jt}\) and \(\xi_{jt}\) is due to correlation between \(p_{jt}\) and either \(\xi_j\) (product fixed effects) or \(\xi_t\) (market fixed effects), then explicitly accounting for these fixed effects during estimation should help reduce the bias of our \(\hat{\alpha}\).
The simplest way to add fixed effects is as dummy variables. We won’t do this today, but for your own reference, you could do this either by making a separate column for each possible market and product fixed effects and adding these to your formulation, or you could use the shorthand mushy + prices + C(market_ids) + C(product_ids)
. See pyblp.Formulation
for different shorthands you can use. Since there are only 24 products and 94 markets for a total of 118 fixed effects, this approach is actually feasible in this case. But in a more realistic dataset with hundreds or thousands of products and markets, running an OLS regression with this many dummy variables starts to become computationally infeasible.
The alternative, which we’ll do today, is to “absorb” the fixed effects. For a single fixed effect, we could just de-mean our outcome variable and each of our regressors within the fixed effects levels, and then run our regression. For multiple fixed effects, we need to iteratively de-mean. PyBLP does this automatically if you specify absorb='C(market_ids) + C(product_ids)'
in your formulation instead of adding these as dummy variables.
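As a small illustration of the de-meaning idea (a sketch, not part of the exercise), the within-product estimator obtained by de-meaning by hand should match the absorbed fe(product) regression: the price coefficients are identical, while standard errors differ slightly due to degrees-of-freedom corrections. The column names ending in _dm are my own.

```julia
# Sketch: absorbing a single fixed effect by de-meaning within product groups
# should reproduce the fe(product) price coefficient.
gdf = groupby(df, :product)
demeaned = @transform gdf begin
    :logit_delta_dm = :logit_delta .- mean(:logit_delta)
    :price_dm = :price_per_serving .- mean(:price_per_serving)
end
reg(demeaned, @formula(logit_delta_dm ~ price_dm))                 # de-meaned OLS
reg(df, @formula(logit_delta ~ price_per_serving + fe(product)))   # absorbed FE
```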
Since mushy
is always either 1 or 0 for the same product across different markets, it’s collinear with product fixed effects, and you can drop it from your formula. Similarly, you can drop the constant. After dropping these, re-create your problem with absorbed fixed effects and re-solve it. Compare the new \(\hat{\alpha}\) estimate with the last one. You should now be getting a coefficient on price of around -28.6
. Does its change suggest that price was positively or negatively correlated with unobserved product-level/market-level quality?
ols2 = reg(df, @formula(logit_delta ~ price_per_serving + fe(market) + fe(product)), Vcov.robust())
regtable(ols2, order = ["price_per_serving"], drop = ["(Intercept)"], regression_statistics = [Nobs, R2])
-----------------------------------
logit_delta
-----------------------------------
price_per_serving -28.618***
(0.916)
-----------------------------------
market Fixed Effects Yes
product Fixed Effects Yes
-----------------------------------
N 2,256
R2 0.560
-----------------------------------
3.6 6. Add an instrument for price
Adding market and product fixed effects can be helpful, but since unobserved quality typically varies by both product and market, we really want to instrument for prices. The data comes with a column price_instrument
that we should interpret as a valid instrument for price that satisfies the needed exclusion restriction. It could be a cost-shifter, a valid Hausman instrument, or similar.
Before using it, we should first run a first-stage regression to make sure that it's a relevant instrument for price. To do so, use the same package you used above to run a second OLS regression of prices on price_instrument
and your market and product fixed effects. If using the formula interface for statsmodels
, you can use the same fixed effect shorthand as in PyBLP, with your full formula looking like prices ~ price_instrument + C(market_ids) + C(product_ids)
. Does price_instrument
seem like a relevant instrument for prices
?
# first stage
iv0 = reg(df, @formula(price_per_serving ~ price_instrument + fe(market) + fe(product)), Vcov.robust())
regtable(iv0, order = ["price_instrument"], drop = ["(Intercept)"], regression_statistics = [FStatIV, Nobs, R2])
-------------------------------------------
price_per_serving
-------------------------------------------
price_instrument 0.877***
(0.007)
-------------------------------------------
market Fixed Effects Yes
product Fixed Effects Yes
-------------------------------------------
First-stage F statistic
N 2,256
R2 0.964
-------------------------------------------
Now that we’ve checked relevance, we can set our demand_instruments0
column equal to price_instrument
, re-create the problem, and re-solve it. You should get a new coefficient on price of around -30.6
. Does the change in \(\hat{\alpha}\) suggest that price was positively or negatively correlated with \(\Delta\xi_{jt}\) in \(\xi_{jt} = \xi_j + \xi_t + \Delta\xi_{jt}\)?
# IV estimation
iv1 = reg(df, @formula(logit_delta ~ (price_per_serving ~ price_instrument) + fe(market) + fe(product)), Vcov.robust())
regtable(iv1, order = ["price_per_serving"], drop = ["(Intercept)"], regression_statistics = [FStatIV, Nobs, R2])
-------------------------------------
logit_delta
-------------------------------------
price_per_serving -30.600***
(0.994)
-------------------------------------
market Fixed Effects Yes
product Fixed Effects Yes
-------------------------------------
First-stage F statistic 16,814.451
N 2,256
R2 0.559
-------------------------------------
3.7 7. Cut a price in half and see what happens
Now that we have our pure logit model estimated, we can run our counterfactual of interest: what if we halved an important product’s price? We’ll select a single market, the most recent quarter in the first city: C01Q2
. Create a new dataframe called counterfactual_data
by selecting data for just that market and inspect the data. We’ll pretend that we’re firm one, and deciding whether we want to cut the price of our brand four’s product F1B04
. In particular, we might be worried about cannibalization, i.e. how much this price cut will result in consumers of our other 8 brands of cereal in this market just substituting from their old choice to the new, cheaper cereal. Alternatively, we could be a regulator or academic interested in how taxing that product would affect demand in the market.
In your new dataframe with just data from C01Q2
, create a new_prices
column that is the same as prices
but with the price of F1B04
cut in half. To do this, you could use DataFrame.loc
. Then, use .compute_shares
on your results from the last question, passing market_id='C01Q2'
to only compute new market shares for our market of interest, and passing prices=counterfactual_data['new_prices']
to specify that prices should be set to the new prices. This function will re-compute market shares at the changed prices implied by the model’s estimates. Store them in a new_shares
column.
Compute the percent change in shares for each product in the market. From firm one's perspective, do the estimates of cannibalization make sense? That is, do the signs on the percent changes for product F1B04
and for other products make sense? Would you normally expect percent changes for other products to be different depending on how other products compare to the one whose price is being changed?
counterfactual_data = @rsubset df :market == "C01Q2"

@rtransform! df :new_prices = ifelse(:product == "F1B04", :price_per_serving / 2, :price_per_serving)
first(counterfactual_data, 5)
Row | market | product | mushy | servings_sold | city_population | price_per_serving | price_instrument | market_size | market_share | outside_share | logit_delta |
---|---|---|---|---|---|---|---|---|---|---|---|
String7 | String7 | Int64 | Float64 | Int64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | |
1 | C01Q2 | F1B04 | 1 | 2.99352e5 | 516259 | 0.0777177 | 0.0441138 | 46463310 | 0.00644276 | 0.502744 | -4.35712 |
2 | C01Q2 | F1B06 | 1 | 6.56443e6 | 516259 | 0.141041 | 0.101043 | 46463310 | 0.141282 | 0.502744 | -1.26932 |
3 | C01Q2 | F1B07 | 1 | 4.08387e6 | 516259 | 0.0726827 | 0.0164402 | 46463310 | 0.0878946 | 0.502744 | -1.74394 |
4 | C01Q2 | F1B09 | 0 | 3.07654e5 | 516259 | 0.0769744 | 0.0498315 | 46463310 | 0.00662145 | 0.502744 | -4.32977 |
5 | C01Q2 | F1B11 | 0 | 2.52179e6 | 516259 | 0.167009 | 0.132775 | 46463310 | 0.0542749 | 0.502744 | -2.22602 |
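Since the Julia code above stops at constructing new_prices, here is a hedged sketch of the rest of the pure-logit counterfactual for C01Q2, plugging in the IV estimate \(\hat{\alpha} \approx -30.6\) from above. The variable names cf, new_delta, new_shares, and pct_change are my own, not part of the exercise.

```julia
# Sketch: pure-logit counterfactual shares in C01Q2 after halving F1B04's price.
# Under the pure logit, delta_jt = log(s_jt / s_0t), so the price cut shifts
# F1B04's delta by alpha * (new price - old price).
alpha_hat = -30.6   # plugged in from the IV regression above

cf = @rsubset df :market == "C01Q2"
@rtransform! cf :new_prices = ifelse(:product == "F1B04", :price_per_serving / 2, :price_per_serving)
@rtransform! cf :new_delta = :logit_delta + alpha_hat * (:new_prices - :price_per_serving)
@transform! cf :new_shares = exp.(:new_delta) ./ (1 + sum(exp.(:new_delta)))
@rtransform! cf :pct_change = 100 * (:new_shares / :market_share - 1)
cf[:, [:product, :market_share, :new_shares, :pct_change]]
```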
3.8 8. Compute demand elasticities
To better understand what’s going on, use .compute_elasticities
, again specifying market_id='C01Q2'
, to compute price elasticities for our market of interest. These measure what the model predicts will happen to demand in percentage terms when there’s a 1% change in price of a product. The diagonal elements are own-price elasticities and the off-diagonal elements are cross-price elasticities. Does demand seem very elastic? Do the cross-price elasticities seem particularly reasonable?
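In the pure logit these elasticities have closed forms: the own-price elasticity is \(\alpha p_j (1 - s_j)\) and the cross-price elasticity is \(-\alpha p_k s_k\). Below is a sketch of the analogous computation in Julia for C01Q2 with the estimate \(\hat{\alpha} \approx -30.6\) (this mimics what .compute_elasticities reports, but is not PyBLP itself).

```julia
# Sketch: J x J pure-logit elasticity matrix for C01Q2.
# Element [j, k] = elasticity of product j's share with respect to product k's price.
alpha_hat = -30.6
m = @rsubset df :market == "C01Q2"
p, s = m.price_per_serving, m.market_share
J = nrow(m)
elas = [j == k ? alpha_hat * p[j] * (1 - s[j]) : -alpha_hat * p[k] * s[k] for j in 1:J, k in 1:J]
```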
3.9 Supplemental Questions
These questions will not be directly covered in lecture, but will be useful to think about when doing BLP-style estimation in your own work.
3.10 1. Try different standard errors
By default, PyBLP computed standard errors that are robust to heteroskedasticity. But we may be concerned that unobserved quality \(\xi_{jt}\) is systematically correlated across markets for a given product \(j\), or across products for a given market \(t\). Choose which one you think is more likely and try clustering your standard errors by that dimension. You can do this with se_type='clustered'
in .solve
, for which you’ll need a clustering_ids
column in your product data. See how your standard error for \(\alpha\) changes.
3.11 2. Compute confidence intervals for your counterfactual
Your estimate of \(\hat{\alpha}\) comes with a standard error, but your counterfactual demand predictions don’t. Ideally we’d like to not only have a point estimate for a counterfactual prediction, but also a measure (up to model misspecification) of how confident we are in these predictions. The easiest way to do this is with a “parametric bootstrap.” The intuition is we can draw from the estimated asymptotic distribution of our \(\hat{\alpha}\), and for each draw, re-compute demand, and see how demand responds to the same price cut.
You can do a parametric bootstrap with the .bootstrap
method. Start with just a few draws (e.g., draws=100
) and remember to set your seed
so that you get the same draws every time you run the code. When new parameters are drawn, you get new .boostrapped_shares
, which take the place of your old shares
. You can use the same .compute_shares
method on the BootstrapedResults
class, although you’ll have to pass a prices
argument with prices replicated along a new axis by as many draws as you have.
Once you have some bootstrapped shares, compute the same percent changes, and compute the 2.5th and 97.5th percentiles of these changes for each product. Are these 95% confidence intervals for your predictions particularly wide?
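Here is a sketch of the same idea in the pure-logit Julia setting (not PyBLP's .bootstrap): draw \(\alpha\) from its estimated asymptotic distribution, recompute the F1B04 price-cut response for each draw, and take the 2.5th/97.5th percentiles. The point estimate and standard error are taken from the IV regression above, and the helper name pct_change_F1B04 is my own.

```julia
# Sketch: parametric bootstrap of the price-cut prediction under the pure logit.
using Random, Distributions, Statistics
Random.seed!(123)

alpha_hat, alpha_se = -30.6, 0.994        # IV point estimate and standard error
m = @rsubset df :market == "C01Q2"
dp = ifelse.(m.product .== "F1B04", -m.price_per_serving ./ 2, 0.0)  # price change

function pct_change_F1B04(alpha)
    delta_new = m.logit_delta .+ alpha .* dp
    s_new = exp.(delta_new) ./ (1 + sum(exp.(delta_new)))
    idx = findfirst(==("F1B04"), m.product)
    return 100 * (s_new[idx] / m.market_share[idx] - 1)
end

changes = [pct_change_F1B04(a) for a in rand(Normal(alpha_hat, alpha_se), 100)]
quantile(changes, [0.025, 0.975])   # rough 95% interval for the predicted change
```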
3.12 3. Impute marginal costs from pricing optimality
The canonical demand side of the BLP model assumes firms set prices in static Bertrand-Nash equilibrium. See this section for a quick summary using PyBLP notation. Given an estimated demand model and such assumptions about pricing, we can impute marginal costs \(c_{jt}\).
To do so, you first need to tell PyBLP what firms own what products. Create a new firm_ids
column in your data, re-initialize your problem, and re-solve it. Then, you should be able to run the .compute_costs
method to impute firms’ marginal cost of producing each cereal. Do these marginal costs look particularly reasonable? How might limitations of your demand model and supply model bias them? What would they and observed prices imply about firms’ markups and economic profits?
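As a rough benchmark (a sketch under a strong simplification, not what .compute_costs does for multi-product firms): if each product were priced by a single-product firm playing Bertrand with pure-logit demand, the first-order condition gives \(p_j - c_j = -1 / (\alpha (1 - s_j))\), so costs can be backed out directly.

```julia
# Sketch: single-product Bertrand-logit markups and implied marginal costs in C01Q2.
# With multi-product firms one would instead solve the full system with an ownership matrix.
alpha_hat = -30.6
m = @rsubset df :market == "C01Q2"
markup = -1 ./ (alpha_hat .* (1 .- m.market_share))     # p_j - c_j
mc = m.price_per_serving .- markup
DataFrame(product = m.product, price = m.price_per_serving, markup = markup, mc = mc)
```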
3.13 4. Check your code by simulating data
Even experienced software developers make a lot of mistakes when writing code. Writing “unit tests” or “integration tests” that check whether the code you’ve written seems to be working properly is incredibly important when writing complicated code to estimate demand. Perhaps the most useful test you can write when doing demand estimation (or most other types of structural estimation) is the following.
- Simulate fake data under some true parameters.
- Estimate your model on the simulated data and make sure that you can recover the true parameters, up to sampling error.
If you do these steps many times, the resulting Monte Carlo experiment will also give you a good sense for the finite sample statistical properties of your estimator.
PyBLP’s Simulation
class makes simulating data fairly straightforward. Its interface is similar to Problem
, but you also specify your parameter estimates and structural errors. In addition to checking your code, you can also use this class for more complicated counterfactuals. After initializing your simulation, you can use .replace_endogenous
to have PyBLP replace the prices \(p_{jt}\) and market shares \(s_{jt}\) with those that rationalize the chosen true parameters and other parts of the simulation. It does so by solving the pricing equilibrium. You’ll have to pass your imputed marginal costs via the costs
argument.
Initialize a simulation of the pure logit model with the same product_data
and the same estimated xi
but with an \(\alpha\) somewhat different than the one you estimated. Make sure your chosen \(\alpha\) is negative, otherwise demand will be upward sloping and PyBLP will have trouble solving for equilibrium prices. To the estimated .xi
you can add the estimated fixed effects .xi_fe
, since the simulation class does not support fixed effects absorption.
Have PyBLP solve for prices and market shares, and use the resulting data to re-estimate your pure logit regression. See if you can get an estimated \(\hat{\alpha}\) close to the true one.
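Here is a sketch of this check for the pure logit in Julia (made-up parameters, no PyBLP's Simulation class): simulate \(\xi\), build shares from the logit formula, and confirm that OLS on the log share ratios recovers the chosen \(\alpha\).

```julia
# Sketch: simulate pure-logit data with known parameters and re-estimate alpha.
using Random, DataFrames, FixedEffectModels
Random.seed!(42)

T, J = 94, 24
alpha_true, beta0_true = -25.0, -4.0
sim = DataFrame(market = repeat(1:T, inner = J), product = repeat(1:J, outer = T))
sim.price = 0.05 .+ 0.15 .* rand(nrow(sim))     # exogenous prices in this sketch
sim.xi = 0.3 .* randn(nrow(sim))
sim.delta = beta0_true .+ alpha_true .* sim.price .+ sim.xi

sim = transform(groupby(sim, :market), :delta => (d -> exp.(d) ./ (1 + sum(exp.(d)))) => :share)
sim = transform(groupby(sim, :market), :share => (s -> 1 - sum(s)) => :outside)
sim.logit_delta = log.(sim.share) .- log.(sim.outside)

reg(sim, @formula(logit_delta ~ price))   # price coefficient should be close to alpha_true
```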
4 Exercise 2
4.1 Introduction
This is the second of three exercises that will give you a solid foundation for doing BLP-style estimation. We’ll continue with the same running example: what if we halved an important product’s price? Our goal today is to relax some of the unrealistic substitution patterns implied by the pure logit model by incorporating preference heterogeneity. To do so, we will use cross-market variation in our product and some new demographic data to estimate parameters that govern preference heterogeneity.
4.2 Setup
We’ll be continuing where we left off after the first exercise. You should just keep adding to your code, using its solutions if you had trouble with some parts.
4.3 Data
Today, you’ll incorporate demographics.csv
into estimation, which again is a simplified version of Nevo’s (2000) demographic data with less information and fewer derived columns. The data were originally draws from the Current Population Survey.
In your own work, when incorporating demographic data into estimation, you will want to sample from the whole Current Population Survey (or whatever survey/census data you are using), not just from a subset of it. The small size of today’s demographic data helps with distributing the data, but in practice you should ideally be sampling from a much larger dataset of demographic information. In your own work you will also want to incorporate more demographic variables than the one included in this dataset. Like the product data, in these exercises we only consider a few columns to keep the exercises a manageable length.
The demographic dataset contains information about 20 individuals drawn from the Current Population Survey for each of the 94 markets in the product data. Each row is a different individual. The columns in the data are as follows.
| Column | Data Type | Description |
|---|---|---|
| market | String | The city-quarter pair that defines markets \(t\) used in these exercises. The data were motivated by real cereal purchase data across 47 US cities in the first 2 quarters of 1988. |
| quarterly_income | Float | The quarterly income of the individual in dollars. |
In today and tomorrow’s exercises, we will use these demographic data to introduce income-specific preference heterogeneity into our BLP model of demand for cereal and see how our counterfactual changes. By incorporating income, we will also be able to speak to distributional concerns: how will counterfactual changes in the market differentially affect high- and low-income consumers?
4.4 My BLP.jl (This section failed miserably…)
In this section, I will try my best to make a very flexible BLP estimation code.
First, load the two datasets we will use and see what they look like:
using CSV, DataFrames, DataFramesMeta, Random, LinearAlgebra, Statistics, Distributions, StatsBase
using FixedEffectModels, RegressionTables, Optim
# Load the necessary data
data_prod = df;

# rename certain columns to match the key arguments used in later functions
@rename! data_prod begin
:market_id = :market
:product_id = :product
:price_x0 = :price_per_serving
:mushy_x1 = :mushy
:price_instrument_z0 = :price_instrument
:s_obs = :market_share
:s_out = :outside_share
end
first(data_prod, 5)
WARNING: using Optim.optimize in module Notebook conflicts with an existing identifier.
Row | market_id | product_id | mushy_x1 | servings_sold | city_population | price_x0 | price_instrument_z0 | market_size | s_obs | s_out | logit_delta | new_prices |
---|---|---|---|---|---|---|---|---|---|---|---|---|
String7 | String7 | Int64 | Float64 | Int64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Float64 | |
1 | C01Q1 | F1B04 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 |
2 | C01Q1 | F1B06 | 1 | 362850.0 | 516259 | 0.114178 | 0.0725791 | 46463310 | 0.00780939 | 0.555225 | -4.26405 | 0.114178 |
3 | C01Q1 | F1B07 | 1 | 603768.0 | 516259 | 0.132391 | 0.101842 | 46463310 | 0.0129945 | 0.555225 | -3.75485 | 0.132391 |
4 | C01Q1 | F1B09 | 0 | 2.68092e5 | 516259 | 0.130344 | 0.104332 | 46463310 | 0.00576996 | 0.555225 | -4.56671 | 0.130344 |
5 | C01Q1 | F1B11 | 0 | 8.3328e5 | 516259 | 0.154823 | 0.121111 | 46463310 | 0.0179341 | 0.555225 | -3.43267 | 0.154823 |
# Load the necessary data
data_demo = CSV.read("data/raw/demographics.csv", DataFrame)
@rename! data_demo begin
:market_id = :market
:log_income_y0 = :quarterly_income
end
@rtransform! data_demo :log_income_y0 = log(:log_income_y0)
first(data_demo, 5)
Row | market_id | log_income_y0 |
---|---|---|
String7 | Float64 | |
1 | C01Q1 | 8.58651 |
2 | C01Q1 | 9.48285 |
3 | C01Q1 | 8.40662 |
4 | C01Q1 | 7.69358 |
5 | C01Q1 | 6.42269 |
Next we sample the demo data to get 1000 random draws for each market:
function sample_ind(df, n = 1000, seedn = 123)
    Random.seed!(seedn)
    grouped = groupby(df, :market_id)

    sampled_df = combine(grouped) do subdf
        sampled_rows = sample(eachrow(subdf), n; replace = true)
        DataFrame(sampled_rows)
    end

    sampled_df = transform(groupby(sampled_df, :market_id), eachindex => :ind_id)

    return sampled_df
end

# 94 markets x 1000 individuals
data_demo = sample_ind(data_demo);
data_demo
# first(data_demo, 10)
Row | market_id | log_income_y0 | ind_id |
---|---|---|---|
String7 | Float64 | Int64 | |
1 | C01Q1 | 8.39605 | 1 |
2 | C01Q1 | 7.76281 | 2 |
3 | C01Q1 | 7.75411 | 3 |
4 | C01Q1 | 9.09651 | 4 |
5 | C01Q1 | 9.09651 | 5 |
6 | C01Q1 | 7.75411 | 6 |
7 | C01Q1 | 8.29923 | 7 |
8 | C01Q1 | 8.40662 | 8 |
9 | C01Q1 | 8.47015 | 9 |
10 | C01Q1 | 8.1857 | 10 |
11 | C01Q1 | 7.77479 | 11 |
12 | C01Q1 | 7.76281 | 12 |
13 | C01Q1 | 9.09651 | 13 |
⋮ | ⋮ | ⋮ | ⋮ |
93989 | C65Q2 | 8.46581 | 989 |
93990 | C65Q2 | 7.4559 | 990 |
93991 | C65Q2 | 8.68641 | 991 |
93992 | C65Q2 | 7.5741 | 992 |
93993 | C65Q2 | 7.91547 | 993 |
93994 | C65Q2 | 8.98306 | 994 |
93995 | C65Q2 | 8.45945 | 995 |
93996 | C65Q2 | 8.95375 | 996 |
93997 | C65Q2 | 7.80121 | 997 |
93998 | C65Q2 | 7.17447 | 998 |
93999 | C65Q2 | 7.58398 | 999 |
94000 | C65Q2 | 5.13197 | 1000 |
Create the unobserved individual taste draws \(v_{it}\):
function create_v!(df::DataFrame, node::Int)
    for i in 1:node
        colname = Symbol("node_y", i)
        df[!, colname] = randn(nrow(df))
    end
    return df
end

create_v!(data_demo, 1)
Row | market_id | log_income_y0 | ind_id | node_y1 |
---|---|---|---|---|
String7 | Float64 | Int64 | Float64 | |
1 | C01Q1 | 8.39605 | 1 | 0.808288 |
2 | C01Q1 | 7.76281 | 2 | -1.12207 |
3 | C01Q1 | 7.75411 | 3 | -1.10464 |
4 | C01Q1 | 9.09651 | 4 | -0.416993 |
5 | C01Q1 | 9.09651 | 5 | 0.287588 |
6 | C01Q1 | 7.75411 | 6 | 0.229819 |
7 | C01Q1 | 8.29923 | 7 | -0.421769 |
8 | C01Q1 | 8.40662 | 8 | -1.35559 |
9 | C01Q1 | 8.47015 | 9 | 0.0694591 |
10 | C01Q1 | 8.1857 | 10 | -0.117323 |
11 | C01Q1 | 7.77479 | 11 | 1.21928 |
12 | C01Q1 | 7.76281 | 12 | 0.292914 |
13 | C01Q1 | 9.09651 | 13 | -0.0311481 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
93989 | C65Q2 | 8.46581 | 989 | 1.45184 |
93990 | C65Q2 | 7.4559 | 990 | -0.873824 |
93991 | C65Q2 | 8.68641 | 991 | 0.534542 |
93992 | C65Q2 | 7.5741 | 992 | 0.487625 |
93993 | C65Q2 | 7.91547 | 993 | -1.12754 |
93994 | C65Q2 | 8.98306 | 994 | 0.991109 |
93995 | C65Q2 | 8.45945 | 995 | -0.112785 |
93996 | C65Q2 | 8.95375 | 996 | -1.30939 |
93997 | C65Q2 | 7.80121 | 997 | -1.64881 |
93998 | C65Q2 | 7.17447 | 998 | 0.645287 |
93999 | C65Q2 | 7.58398 | 999 | -0.0317194 |
94000 | C65Q2 | 5.13197 | 1000 | 1.23878 |
Join all dataframes into one:
df1 = unique(@select data_prod :market_id :product_id)
df2 = unique(@select data_demo :ind_id)

# Cross join to mimic expand_grid
main_data = crossjoin(df1, df2)

leftjoin!(main_data, data_prod; on=[:market_id, :product_id])
leftjoin!(main_data, data_demo; on=[:market_id, :ind_id])
main_data
Row | market_id | product_id | ind_id | mushy_x1 | servings_sold | city_population | price_x0 | price_instrument_z0 | market_size | s_obs | s_out | logit_delta | new_prices | log_income_y0 | node_y1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String7 | String7 | Int64 | Int64? | Float64? | Int64? | Float64? | Float64? | Int64? | Float64? | Float64? | Float64? | Float64? | Float64? | Float64? | |
1 | C01Q1 | F1B04 | 1 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 8.39605 | 0.808288 |
2 | C01Q1 | F1B04 | 2 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 7.76281 | -1.12207 |
3 | C01Q1 | F1B04 | 3 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 7.75411 | -1.10464 |
4 | C01Q1 | F1B04 | 4 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 9.09651 | -0.416993 |
5 | C01Q1 | F1B04 | 5 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 9.09651 | 0.287588 |
6 | C01Q1 | F1B04 | 6 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 7.75411 | 0.229819 |
7 | C01Q1 | F1B04 | 7 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 8.29923 | -0.421769 |
8 | C01Q1 | F1B04 | 8 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 8.40662 | -1.35559 |
9 | C01Q1 | F1B04 | 9 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 8.47015 | 0.0694591 |
10 | C01Q1 | F1B04 | 10 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 8.1857 | -0.117323 |
11 | C01Q1 | F1B04 | 11 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 7.77479 | 1.21928 |
12 | C01Q1 | F1B04 | 12 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 7.76281 | 0.292914 |
13 | C01Q1 | F1B04 | 13 | 1 | 5.76945e5 | 516259 | 0.0720879 | 0.0354837 | 46463310 | 0.0124172 | 0.555225 | -3.80029 | 0.036044 | 9.09651 | -0.0311481 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
2255989 | C65Q2 | F6B18 | 989 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 8.46581 | 1.45184 |
2255990 | C65Q2 | F6B18 | 990 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 7.4559 | -0.873824 |
2255991 | C65Q2 | F6B18 | 991 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 8.68641 | 0.534542 |
2255992 | C65Q2 | F6B18 | 992 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 7.5741 | 0.487625 |
2255993 | C65Q2 | F6B18 | 993 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 7.91547 | -1.12754 |
2255994 | C65Q2 | F6B18 | 994 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 8.98306 | 0.991109 |
2255995 | C65Q2 | F6B18 | 995 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 8.45945 | -0.112785 |
2255996 | C65Q2 | F6B18 | 996 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 8.95375 | -1.30939 |
2255997 | C65Q2 | F6B18 | 997 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 7.80121 | -1.64881 |
2255998 | C65Q2 | F6B18 | 998 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 7.17447 | 0.645287 |
2255999 | C65Q2 | F6B18 | 999 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 7.58398 | -0.0317194 |
2256000 | C65Q2 | F6B18 | 1000 | 0 | 4.43638e5 | 188082 | 0.127557 | 0.0815838 | 16927380 | 0.0262083 | 0.642477 | -3.19925 | 0.127557 | 5.13197 | 1.23878 |
Now let's start constructing the BLP estimation algorithm! First we will construct functions that calculate the predicted market shares and perform the contraction mapping to get \(\delta\) for given \(\Pi\), \(\Sigma\):
# pi: 2x1
# sigma: 2x1
# delta should be jt variable
function calc_mkt_share(df, delta, pi, sigma)
    df = leftjoin(df, delta; on=[:market_id, :product_id])

    data = @rtransform df :util = exp(:d + :price_x0 * sigma[1] * :node_y1 + :mushy_x1 * sigma[2] * :node_y1 + :price_x0 * pi[1] * :log_income_y0 + :mushy_x1 * pi[2] * :log_income_y0)

    gdp = groupby(data, [:market_id, :ind_id])
    # note: :util is already exponentiated above, so exp.(:util) here exponentiates
    # twice; this looks like the bug behind the NaN shares printed below
    data = @transform gdp :util_it = exp.(:util) ./ (sum(exp.(:util)) + 1)

    gdp = groupby(data, [:market_id, :product_id])
    data = @combine gdp :pred_sh_jt = sum(:util_it .* (1/1000))
end

delta_fp = unique(df, [:market_id, :product_id])
@select! delta_fp :market_id :product_id
@rtransform! delta_fp :d = 0

calc_mkt_share(main_data, delta_fp, [1,1], [1,1])
# #d0 should be jt variable
# function contraction_map(df, d0, pi, sigma, tol = 1e-5, maxiter = 1e5)
# obs_mkt_share = unique(df, [:market_id, :product_id])
# @select! obs_mkt_share :market_id :product_id :s_obs
# d_old = copy(d0)
# normdiff = Inf
# iter = 0
# while normdiff > tol && iter <= maxiter
# model_mkt_share = calc_mkt_share(df, d_old, pi, sigma)
# df_sh = @select df :market_id :product_id :s_obs
# model_mkt_share = leftjoin(model_mkt_share, df_sh, on=[:market_id, :product_id])
# d_new = leftjoin(d_old, model_mkt_share; on= [:market_id, :product_id])
# d_new = @rtransform d_new :d_iter = :d + log(:s_obs) - log(:pred_sh_jt)
# normdiff = maximum(abs.(d_new.d_iter .- d_new.d))
# # normdiff = maximum(norm.(d_new.d_iter .- d_new.d))
# d_old = @select d_new begin
# :market_id
# :product_id
# :d = :d_iter
# end
# iter += 1
# end
# return d_old
# end
# function calc_delta(df, pi, sigma)
# delta_fp = unique(df, [:market_id, :product_id])
# @select! delta_fp :market_id :product_id
# @rtransform! delta_fp :d = 0
# delta_jt = contraction_map(df, delta_fp, pi, sigma)
# return delta_jt
# end
# @time delta_res = calc_delta(main_data, [1, 1], [1, 1])
# mean(delta_res)
Row | market_id | product_id | pred_sh_jt |
---|---|---|---|
String7 | String7 | Float64 | |
1 | C01Q1 | F1B04 | NaN |
2 | C01Q1 | F1B06 | NaN |
3 | C01Q1 | F1B07 | NaN |
4 | C01Q1 | F1B09 | 2.00559e-55 |
5 | C01Q1 | F1B11 | 2.39821e-55 |
6 | C01Q1 | F1B13 | 2.10254e-55 |
7 | C01Q1 | F1B17 | NaN |
8 | C01Q1 | F1B30 | 1.97595e-55 |
9 | C01Q1 | F1B45 | 2.30514e-55 |
10 | C01Q1 | F2B05 | 1.73479e-55 |
11 | C01Q1 | F2B08 | 2.03294e-55 |
12 | C01Q1 | F2B15 | NaN |
13 | C01Q1 | F2B16 | NaN |
⋮ | ⋮ | ⋮ | ⋮ |
2245 | C65Q2 | F2B16 | NaN |
2246 | C65Q2 | F2B19 | 2.79288e-6 |
2247 | C65Q2 | F2B26 | 2.61091e-6 |
2248 | C65Q2 | F2B28 | NaN |
2249 | C65Q2 | F2B40 | 2.79267e-6 |
2250 | C65Q2 | F2B48 | 2.64727e-6 |
2251 | C65Q2 | F3B06 | NaN |
2252 | C65Q2 | F3B14 | 2.64427e-6 |
2253 | C65Q2 | F4B02 | 3.01899e-6 |
2254 | C65Q2 | F4B10 | 2.69486e-6 |
2255 | C65Q2 | F4B12 | 2.53109e-6 |
2256 | C65Q2 | F6B18 | 2.65098e-6 |
4.5 My BLP.jl (standing on the shoulders of giants)
Okay, my previous attempt failed miserably. This is probably because it is hard to do the matrix computations in the dataframe-dplyr style I use in R. But then again, maybe it is just because I did not do a good job. So I will try to write up code following others, while trying to use DataFramesMeta
syntax whenever it is doable for transparency.
But this will take a lot of work. I will periodically come back to update my BLP. In the meantime, enjoy your life!
References
Footnotes
You might have already noticed, but I sometimes use a variable without a subscript to denote the vector of that variable. For example, \(\delta\) is just \((\delta_1, \ldots, \delta_J).\)↩︎