The pricelevels
-package provides index number
methods for price comparisons within or between countries. As price
comparisons over time usually rely on the same index number methods, the
package has been denoted more generally as pricelevels
,
though its primary focus is on spatial price comparisons. Currently, the
following index number methods (or price indices) are implemented.
jevons()
dutot()
carli()
harmonic()
bmw()
cswd()
uvalue()
banerjee()
davies()
lehr()
laspeyres()
and
geolaspeyres()
paasche()
and
geopaasche()
fisher()
toernqvist()
walsh()
and
geowalsh()
theil()
medgeworth()
palgrave()
svartia()
drobisch()
lowe()
young()
cpd()
and
nlcpd()
geks()
gkhamis()
ikle()
rao()
rhajargasht()
gerardi()
Moreover, the package offers functions for sampling and characterizing price data. This vignette highlights the main package features. It contains four sections. The first is on price data, the second on bilateral indices, and the third on multilateral indices. The last section section is on some further package features.
The pricelevels
-functions are designed based on
R
’s data.table
-package.
Data for spatial price comparisons usually have three dimensions:
For transaction and supermarket scanner data, purchased quantities \(q_{bn}^r\) of the individual products are often available.
Having such detailed data can pose some additional challenges. Not every individual product might be available in each region. Hence, there are gaps in the data. In the worst case these gaps can result in non-connected price data (World Bank 2013, 98), where price level comparisons of all regions are no longer possible.
The pricelevels
-package offers various functions to
characterize price data and to deal with the issue of data gaps. As a
first step, it’s good to check the sparsity (share of gaps) of
a dataset and if the dataset is connected.
Suppose we have prices for 5 products (all belonging to the same
basic heading) in 4 regions. These data are stored in the data.table
dt
. Transforming dt
into a price matrix (rows:
products; columns: regions) shows that the data exhibit gaps as products
4
and 5
are not priced in regions
c
and d
, respectively.
# example price data with gaps:
dt <- data.table(
"r"=c(rep(c("a","b"), each=5), rep(c("c","d"), each=3)),
"n"=as.character(c(rep(1:5, 2), rep(1:3, 2))),
"p"=round(runif(16, min=10, max=20), 1))
# price matrix:
dt[, tapply(X=p, INDEX=list(n,r), FUN=mean)]
#> a b c d
#> 1 15.1 17.9 18.3 15.0
#> 2 13.7 19.6 18.1 17.5
#> 3 16.6 15.6 14.2 16.8
#> 4 12.2 18.3 NA NA
#> 5 14.2 17.5 NA NA
Due to the gaps, the sparsity()
of this dataset is
greater 0. As there are only a few gaps, however, function
is.connected()
indicates that the dataset is connected. The
function comparisons()
provides some further details. It
shows that all four regions form one ‘block of connected regions’ since
they are connected via direct or indirect links. Moreover, the function
output highlights that each region can be directly compared with the
other regions using at least one price ratio.
dt[, sparsity(r=r, n=n)] # sparsity
#> [1] 0.2
dt[, is.connected(r=r, n=n)] # is connected
#> [1] TRUE
dt[, comparisons(r=r, n=n)] # one group of regions
#> Key: <group_id>
#> group_id group_members group_size total direct indirect n_obs
#> <int> <char> <int> <int> <int> <int> <int>
#> 1: 1 a;b;c;d 4 6 0 6 16
Let’s assume now that prices for products 1
,
2
, and 3
are no longer available in regions
a
and b
(for example because these products
are not sold there anymore). This situation is captured in data.table
dt2
. In total, the price data still include 4 regions and 5
products. However, the price matrix reveals that the pattern of gaps
changed.
# non-connected example price data:
dt2 <- dt[!(n%in%1:3 & r%in%c("a","b")),]
# price matrix:
dt2[, tapply(X=p, INDEX=list(n,r), FUN=mean)]
#> a b c d
#> 1 NA NA 18.3 15.0
#> 2 NA NA 18.1 17.5
#> 3 NA NA 14.2 16.8
#> 4 12.2 18.3 NA NA
#> 5 14.2 17.5 NA NA
The sparsity increased to 0.5. Even worse, the dataset is no longer
connected. This can be easily seen from the price matrix, where regions
a
and b
form one block of connected regions,
and regions c
and d
another but separate
block. This is also highlighted in the output of
comparisons()
, which now has two rows.
dt2[, sparsity(r=r, n=n)] # sparsity
#> [1] 0.5
dt2[, is.connected(r=r, n=n)] # is not connected
#> [1] FALSE
dt2[, comparisons(r=r, n=n)] # two groups of regions
#> Key: <group_id>
#> group_id group_members group_size total direct indirect n_obs
#> <int> <char> <int> <int> <int> <int> <int>
#> 1: 1 a;b 2 1 0 1 4
#> 2: 2 c;d 2 1 0 1 6
The fact that the dataset dt2
is not connected does not
mean that no price comparison is possible. It only means that a price
comparison of all regions cannot be done. The function output
of comparisons()
shows that separate price comparisons
within each block of regions could be done. However, it is clear that a
comparison of the price levels between the two blocks is still not
possible. To follow this approach, the regions in dt2
can
be classified into blocks of connected regions using the function
neighbors()
. Otherwise, if one wants to carry out the price
comparison only for the biggest block of connected regions,
dt2
can be connected using function connect()
,
as shown below.
# group regions into groups of connected regions:
dt2[, "block" := neighbors(r=r, n=n, simplify=TRUE)]
dt2[, is.connected(r=r, n=n), by="block"]
#> block V1
#> <fctr> <lgcl>
#> 1: 1 TRUE
#> 2: 2 TRUE
# subset to biggest group of connected regions:
dt2[connect(r=r, n=n), is.connected(r=r, n=n)]
#> [1] TRUE
Both approaches allow to perform a price comparison on a subset of the data by excluding some regions.
The quality of price data is of utmost importance for the reliability
of a regional price comparison. To assess the impact of gaps on the
reliability of price index numbers, it is helpful to control the amount
of gaps in the price data. At the same time, however, it must be ensured
that the price data stay connected. A simplistic approach of simulating
price data would not guarantee this. Therefore, the
pricelevels
-package offers some functions for data
sampling.
In the following, we want to simulate random price data for 5 products within the same basic heading in 4 regions. This can be easily achieved if there are no data gaps. First, products and regions are sampled. Second, random prices are added.
R <- 4 # number of regions
N <- 5 # number of products
# set frame:
dt <- data.table("r"=as.character(rep(1:R, each=N)),
"n"=as.character(rep(1:N, times=R)))
# add random prices:
set.seed(1)
dt[, "p":=round(x=rnorm(n=.N, mean=as.integer(n), sd=0.1), digits=1)]
# price matrix:
dt[, tapply(X=p, INDEX=list(n,r), FUN=mean)]
#> 1 2 3 4
#> 1 0.9 0.9 1.2 1.0
#> 2 2.0 2.0 2.0 2.0
#> 3 2.9 3.1 2.9 3.1
#> 4 4.2 4.1 3.8 4.1
#> 5 5.0 5.0 5.1 5.1
It is clear that the price data dt
are connected.
However, to get more realistic data, we have to introduce gaps. This can
be done randomly as well. If there are only few gaps, the likelihood to
end up with non-connected price data is relatively low. However, with
more gaps, it can happen that the resulting price data are no longer
connected. To overcome this issue, the function rgaps()
introduces gaps into an existing (complete) dataset while ensuring that
the data stay connected. In the following, we introduce approx. 20% gaps
into the price data dt
.
# introduce gaps:
set.seed(1)
dt.gaps <- dt[!rgaps(r=r, n=n, amount=0.2), ]
# price matrix:
dt.gaps[, tapply(X=p, INDEX=list(n,r), FUN=mean)]
#> 1 2 3 4
#> 1 NA 0.9 1.2 1.0
#> 2 2.0 2.0 2.0 2.0
#> 3 2.9 3.1 2.9 3.1
#> 4 4.2 NA 3.8 NA
#> 5 5.0 NA 5.1 5.1
From the price matrix, we see that there is no price for product
1
in region 1
. The
rgaps()
-function also allows to exclude specific regions
and/or products from the sampling of gaps. If we want, for example, that
there are generally no gaps in region 1
, we can implement
this as follows:
# introduce gaps but not for region 1:
set.seed(1)
dt.gaps <- dt[!rgaps(r=r, n=n, amount=0.2, exclude=data.frame(r="1", n=NA)), ]
# price matrix:
dt.gaps[, tapply(X=p, INDEX=list(n,r), FUN=mean)]
#> 1 2 3 4
#> 1 0.9 0.9 1.2 NA
#> 2 2.0 2.0 2.0 2.0
#> 3 2.9 3.1 2.9 3.1
#> 4 4.2 NA 3.8 NA
#> 5 5.0 NA 5.1 5.1
We see from the price matrix that no price is missing in region
1
.
If the gaps should occur with a higher likelihood for specific
regions and/or products, this can be defined as well. The
prob
-argument in rgaps()
allows to specify for
each observation the (relative) likelihood that gaps occur at this
position in the data. In the following example, gaps will occur more
often for products 4
and 5
.
The price data in dt
represent the data that statistical
offices usually collect as quantities or weights on individual products
are not available. However, transaction data often include purchased
quantities while data at the basic heading level exhibit expenditure
weights as an indication of the relative importance for household
consumption.
As the purchased quantities can be expected to heavily depend on the
prices of products, there is no stand-alone function for the sampling of
random quantities. However, random expenditure weights for basic
headings can be added to some dataset using the function
rweights()
. These weights can be the same for each basic
heading in each region, differ across basic headings but are the same
for each region, or differ across basic headings and regions. This is
shown below, where we sample the weights for the individual products
(and not basic headings):
# constant expenditure weights:
dt.gaps[, "w1" := rweights(r=r, b=n, type=~1)]
dt.gaps[, tapply(X=w1, INDEX=list(n,r), FUN=mean)]
#> 1 2 3 4
#> 1 0.2 0.2 0.2 0.2
#> 2 0.2 0.2 0.2 0.2
#> 3 0.2 0.2 0.2 0.2
#> 4 NA NA 0.2 0.2
#> 5 NA 0.2 0.2 NA
# weights different for basic headings but same among regions:
dt.gaps[, "w2" := rweights(r=r, b=n, type=~n)]
dt.gaps[, tapply(X=w2, INDEX=list(n,r), FUN=mean)]
#> 1 2 3 4
#> 1 0.19621397 0.19621397 0.19621397 0.19621397
#> 2 0.08129106 0.08129106 0.08129106 0.08129106
#> 3 0.50029883 0.50029883 0.50029883 0.50029883
#> 4 NA NA 0.05927477 0.05927477
#> 5 NA 0.16292136 0.16292136 NA
# weights different for basic headings and regions:
dt.gaps[, "w3" := rweights(r=r, b=n, type=~n+r)]
dt.gaps[, tapply(X=w3, INDEX=list(n,r), FUN=mean)]
#> 1 2 3 4
#> 1 0.01418186 0.01418186 0.01418186 0.01418186
#> 2 0.40005184 0.40005184 0.40005184 0.40005184
#> 3 0.12417969 0.12417969 0.12417969 0.12417969
#> 4 NA NA 0.33378579 0.33378579
#> 5 NA 0.12780082 0.12780082 NA
If there are no gaps in the data, the weights always add up to 1.
With gaps, however, the type
of weights is prioritized,
meaning that constant weights remain constant even though a
renormalization would have led to different weights across basic heading
and regions. Therefore, the weights do not necessarily add up to 1 in
each region. Of course, the weights can easily be renormalized if
necessary.
The previous steps to simulate random price data are put together in
the function rdata()
. This function easily simulates random
price data for a specified number of regions, basic headings, and
individual products including gaps and weights. However, beside this
easier usage, there are some additional benefits:
rdata()
and can be added to the
function output if desired. This is particularly useful for simulations
where some calculated price levels are compared to the ‘true’ price
levels underlying the data generation.The following code snippet shows how easily a connected price dataset with gaps, sales and expenditure weights for \(B=4\) basic headings with \(N_b=(2,2,3,3)\) products in \(R=9\) regions can be simulated.
# simulate random price data:
set.seed(123)
srp <- rdata(
R=9, # number of regions
B=4, # number of product groups
N=c(2,2,3,3), # number of individual products per basic heading
gaps=0.2, # share of gaps
weights=~b, # same varying expenditure weights for regions
sales=0.1, # share of sales
settings=list(par.add=TRUE) # add true parameters to function output
)
# true parameters:
srp$param
#> $lnP
#> 1 2 3 4 5 6
#> 0.01351101 -0.06729194 -0.03758062 -0.02719543 -0.13336083 0.11908720
#> 7 8 9
#> 0.05064581 -0.07850519 0.16068999
#>
#> $pi
#> 01 02 03 04 05 06 07 08
#> 2.151639 2.109964 1.372202 1.754044 5.870784 5.714721 5.906962 1.797949
#> 09 10
#> 1.728033 1.619993
#>
#> $delta
#> 01 02 03 04 05 06 07
#> 1.2247660 1.2530426 1.0527690 0.5926094 2.2100173 1.3200648 -0.3813724
#> 08 09 10
#> 1.4605623 0.6499459 0.2391432
# price data:
head(srp$data)
#> Key: <group, product, region>
#> group weight region product sale price quantity
#> <fctr> <num> <fctr> <fctr> <lgcl> <num> <num>
#> 1: 1 0.07939543 1 01 TRUE 8.13 53705
#> 2: 1 0.07939543 2 01 FALSE 7.90 9765
#> 3: 1 0.07939543 3 01 FALSE 8.29 10992
#> 4: 1 0.07939543 5 01 FALSE 7.36 9829
#> 5: 1 0.07939543 6 01 FALSE 10.02 38338
#> 6: 1 0.07939543 7 01 FALSE 9.20 22230
The graph shows that the individual products’ prices vary across regions and within the basic heading. However, while they are at similar levels within the same basic heading, they can considerably differ between basic headings.
Bilateral price indices compute the overall price level difference between two regions. For that purpose, the prices of individual products available in both regions are aggregated into a price index showing the price level difference of the comparison region relative to the base region. If quantities, expenditure shares, or any other sort of information on economic importance are available, these data can serve as weights in the aggregation. Otherwise, prices must be aggregated without any weights. Both approaches are shown in the following examples, where price data for four products belonging to the same basic heading were collected in five regions.
# simulate random price data:
set.seed(1)
dt <- rdata(R=5, B=1, N=4)
head(dt)
#> Key: <group, product, region>
#> group weight region product sale price quantity
#> <fctr> <num> <fctr> <fctr> <lgcl> <num> <num>
#> 1: 1 1 1 1 FALSE 17.81 14529
#> 2: 1 1 2 1 FALSE 17.48 102595
#> 3: 1 1 3 1 FALSE 17.22 76664
#> 4: 1 1 4 1 FALSE 20.07 127831
#> 5: 1 1 5 1 FALSE 15.53 157119
#> 6: 1 1 1 2 FALSE 20.79 18665
If prices are collected by statistical offices locally (local or
field-based price collection), quantities or any other information
of the economic importance of the individual products are usually
lacking. Therefore, the prices must be aggregated without any weights.
The most prominent elementary price indices are implemented in
the pricelevels
-package:
where the indices express the prices of each region \(r\) relative to the base region \(s\) based on the prices of the individual products \(n \left(n=1,\ldots,N\right)\).
If we want to compare the prices of each region in the data
dt
to those in region 1
using the Jevons
index, we can set region 1
as the base region in the
function jevons()
:
dt[, jevons(p=price, r=region, n=product, base="1")]
#> 1 2 3 4 5
#> 1.0000000 0.9995577 0.9924891 1.0624618 0.9397925
Similarly, we can compute the price index numbers relative the base
region 2
using the Dutot index:
dt[, dutot(p=price, r=region, n=product, base="2")]
#> 1 2 3 4 5
#> 1.0002609 1.0000000 0.9932159 1.0627528 0.9419439
The price comparison of each region relative to region 2
are carried out separately (or bilaterally). Each bilateral price
comparison thus relies solely on the prices of the two regions that are
compared. This is the rationale of bilateral price indices. However, it
also results in some undesirable properties.
A price comparison of region 2
to 1
should
generally give the same result as the reciprocal of the price comparison
of region 1
to 2
. This requirement is denoted
as the country reversal test. However, as shown below, only the
Jevons index and the Dutot index comply with this requirement for the
considered dataset.
# harmonic mean index fails:
all.equal(
dt[, harmonic(p=price, r=region, n=product, base="1")][2],
1/dt[, harmonic(p=price, r=region, n=product, base="2")][1],
check.attributes=FALSE)
#> [1] "Mean relative difference: 0.0001990829"
# carli fails:
all.equal(
dt[, carli(p=price, r=region, n=product, base="1")][2],
1/dt[, carli(p=price, r=region, n=product, base="2")][1],
check.attributes=FALSE)
#> [1] "Mean relative difference: 0.0001990433"
# jevons index ok:
all.equal(
dt[, jevons(p=price, r=region, n=product, base="1")][2],
1/dt[, jevons(p=price, r=region, n=product, base="2")][1],
check.attributes=FALSE)
#> [1] TRUE
# dutot index ok:
all.equal(
dt[, dutot(p=price, r=region, n=product, base="1")][2],
1/dt[, dutot(p=price, r=region, n=product, base="2")][1],
check.attributes=FALSE)
#> [1] TRUE
The resulting price index numbers of the Carli index and the Harmonic mean index depend on the choice of the base region. Consequently, if one changes the base region also the regional price level differences change.
Another important requirement on spatial price index numbers is transitivity, which demands that a direct price comparison of two regions gives the same result as an indirect comparison of the two regions via a third one. Consequently, transitivity ensures the consistency of a set of price index numbers. Again, if there are no data gaps, the Jevons and Dutot indices are transitive, while the Carli and Harmonic mean indices fail this test.
Multilateral price indices simultaneously use the prices of all
regions involved in the price comparison to compute a set of transitive
price index numbers - irrespective of any data gaps. The
pricelevels
-package offers the three most prominent
multilateral price indices, and a newly developed extension:
cpd()
and
nlcpd()
geks()
gkhamis()
,
ikle()
, rao()
, and
rhajargasht()
gerardi()
The CPD method (Summers 1973) is a linear regression model that explains the logarithmic price of product \(n\) in region \(r\), \(\ln p_n^r\), by the general product prices, \(\ln \pi_n \ (n=1,\ldots,N)\), and the overall regional price levels, \(\ln P^r \ (r=1,\ldots,R)\):
\[\ln p_n^r = \ln \pi_n + \ln P^r + \ln \epsilon_n^r \quad \text{with} \ \ln \epsilon_n^r \sim N(0, \sigma^2)\] Auer and Weinand (2022) recently proposed a generalization of the CPD method. This nonlinear CPD method (NLCPD method) inflates the CPD model by product-specific elasticities \(\delta_n \ (n=1,\ldots,N)\):
\[\ln p_n^r = \ln \pi_n + \delta_n \ln P^r + \ln \epsilon_n^r \quad \text{with} \ \ln \epsilon_n^r \sim N(0, \sigma^2)\] The CPD method implicitly assumes that all \(\delta_n=1\). However, this assumption is not very realistic as price elasticities can considerably differ between the products, particularly at higher levels (e.g., rents versus food).
Estimating the CPD and NLCPD regression model produces estimates for
the regional price levels, respectively. However, both methods require a
normalization of the estimated price levels \(\widehat{\ln P^r}\) to avoid
multicollinearity. With normalization \(\sum_{r=1}^{R} \widehat{\ln P^r}=0\), the
price levels are expressed relative to the regional average; otherwise,
the (logarithm of the) price level of one specific region is set to 0.
The NLCPD method additionally imposes the restriction \(\sum_{n=1}^{N} w_n \widehat{\delta}_n=1\).
In this package (function nlcpd()
), it is always the
parameter \(\widehat{\delta}_1\) that
is derived residually from this restriction.
The CPD and NLCPD methods are implemented in the functions
cpd()
and nlcpd()
. The functions can be used
similarly to those for bilateral indices as shown below. However, if
requested by the user, they provide some additional information and they
can express the regional price levels relative to the unweighted
regional average by setting base=NULL
.
# CPD estimation with respect to regional average:
dt[, cpd(p=price, r=region, n=product, q=quantity, base=NULL)]
#> 1 2 3 4 5
#> 0.9992531 0.9959948 0.9952321 1.0750871 0.9390731
# CPD estimation with respect to region 1:
dt[, cpd(p=price, r=region, n=product, q=quantity, base="1")]
#> 1 2 3 4 5
#> 1.0000000 0.9967392 0.9959759 1.0758907 0.9397750
# same price levels with shares as weights:
dt[, cpd(p=price, r=region, n=product, w=share, base="1")]
#> 1 2 3 4 5
#> 1.0000000 0.9967392 0.9959759 1.0758907 0.9397750
# NLCPD estimation with shares as weights:
dt[, nlcpd(p=price, r=region, n=product, w=share, base="1")]
#> 1 2 3 4 5
#> 1.0000000 0.9923211 0.9860177 1.0615430 0.9341702
If not only the (unlogged) price level estimates but the full
regression output is of interest, this can be retrieved by setting
simplify=FALSE
in the functions (shown below for
cpd()
):
# full CPD regression output:
dt[, cpd(p=price, r=region, n=product, w=share, simplify=FALSE)]
#>
#> Call:
#> stats::lm(formula = cpd_mod, data = pdata, weights = w, singular.ok = FALSE)
#>
#> Coefficients:
#> pi.1 pi.2 pi.3 pi.4 lnP.1 lnP.2
#> 2.8741375 3.0253536 2.9090996 2.9991116 -0.0007471 -0.0040133
#> lnP.3 lnP.4
#> -0.0047793 0.0724017
The NLCPD method solves a nonlinear optimization problem. Therefore,
it is computationally much slower than most other price indices.
However, computation speed can be improved by using the Jacobian matrix
for the optimization instead of deriving this matrix analytically during
optimization. This can be achieved by changing the settings
in nlcpd()
.
set.seed(123)
dt.big <- rdata(R=50, B=1, N=30, gaps=0.25)
# don't use jacobian matrix:
system.time(m1 <- dt.big[, nlcpd(p=price, r=region, n=product, q=quantity,
settings=list(use.jac=FALSE), simplify=FALSE,
control=minpack.lm::nls.lm.control("maxiter"=200))])
#> user system elapsed
#> 0.64 0.07 0.77
# use jacobian matrix:
system.time(m2 <- dt.big[, nlcpd(p=price, r=region, n=product, q=quantity,
settings=list(use.jac=TRUE), simplify=FALSE,
control=minpack.lm::nls.lm.control("maxiter"=200))])
#> user system elapsed
#> 0.09 0.00 0.11
# less computation time needed for m2, but same results as m1:
all.equal(m1$par, m2$par, tol=1e-05)
#> [1] TRUE
As stated earlier, the CPD and NLCPD methods produce transitive price
levels. This is also true when there are data gaps. Below, we introduce
random gaps in the data dt
, reestimate the CPD and NLCPD
models, and test for transitivity. More precisely, we check if the
direct price comparison between regions 2
and
1
is the same as the indirect comparison including region
3
, i.e., \(\widehat{P}^{12} =
\widehat{P}^{13} \cdot \widehat{P}^{32}\).
# introduce 20% data gaps:
set.seed(1)
dt.gaps <- dt[!rgaps(r=region, n=product, amount=0.2)]
# estimate CPD model using different base regions and check transitivity:
P1.cpd <- dt.gaps[, cpd(p=price, r=region, n=product, q=quantity, base="1")]
P3.cpd <- dt.gaps[, cpd(p=price, r=region, n=product, q=quantity, base="3")]
all.equal(P1.cpd[2], P1.cpd[3]*P3.cpd[2], check.attributes=FALSE)
#> [1] TRUE
# estimate NLCPD model using different base regions and check transitivity:
P1.nlcpd <- dt.gaps[, nlcpd(p=price, r=region, n=product, q=quantity, base="1")]
P3.nlcpd <- dt.gaps[, nlcpd(p=price, r=region, n=product, q=quantity, base="3")]
all.equal(P1.nlcpd[2], P1.nlcpd[3]*P3.nlcpd[2], check.attributes=FALSE)
#> [1] TRUE
The results indicate that both the CPD method and the NLCPD method produce transitive price level estimates even when data gaps are present.
The GEKS method (Gini 1924, 1931; Eltetö and
Köves 1964; Szulc 1964) is a two-step approach. First, prices are
aggregated into bilateral index numbers for every pair of regions in the
data. Second, the bilateral index numbers are transformed into a set of
transitive index numbers by estimating the linear regression model \[\ln \dot{P}^{sr} = \ln \left( P^r / P^s \right) +
\ln \epsilon^{sr} \quad \text{with} \ \ln \epsilon^{sr} \sim
N(0,\sigma^2) \ ,\] where \(\dot{P}^{sr}\) is the bilateral price index
for regions \(s\) and \(r\) computed in the first stage. For this
computation any bilateral index could theoretically be used (and is
supported in the pricelevels
-package). Rao and Banerjee (1986), however, recommend that
the bilateral price index satisfies the country reversal test. If, for
example, the Jevons index is used in the computations, then it is more
precise to denote the resulting index numbers as the GEKS-Jevons price
index.
# GEKS using Törnqvist:
dt.gaps[, geks(p=price, r=region, n=product, q=quantity, settings=list(type="toernqvist"))]
#> 1 2 3 4 5
#> 0.9968630 0.9939038 0.9905194 1.0940081 0.9314009
# GEKS using Jevons so quantities have no impact:
dt.gaps[, geks(p=price, r=region, n=product, q=quantity, settings=list(type="jevons"))]
#> 1 2 3 4 5
#> 0.9954545 0.9945994 0.9916983 1.0880181 0.9360838
The geks()
-function internally calls the function
index.pairs()
to compute the bilateral indices of all
region pairs. The quantities q
or weights w
are used within this aggregation of prices into bilateral index numbers
(first stage) while the subsequent transformation of these index numbers
(second stage) usually does not rely on any weights (but can if
specified in settings$wmethod
). In this case, the solution
to the regression model above simplifies to \[P_{\text{GEKS-J}}^{1r} = \prod_{s=1}^{R} \left(
\dot{P}^{1s}_{\text{J}} \dot{P}^{sr}_{\text{J}} \right)^{1/R} \
,\] if the Jevons index is used and region 1
serves
as the base region (ILO et al. 2020,
448).
Again, the following checks show that the price index numbers derived with the GEKS-Törnqvist index are transitive:
# estimate GEKS-Törnqvist using different base regions and
# applying weights in second aggregation stage:
P1.geks <- dt.gaps[, geks(p=price, r=region, n=product, q=quantity, base="1",
settings=list(type="toernqvist", wmethod="obs"))]
P3.geks <- dt.gaps[, geks(p=price, r=region, n=product, q=quantity, base="3",
settings=list(type="toernqvist", wmethod="obs"))]
# check transitivity:
all.equal(P1.geks[2], P1.geks[3]*P3.geks[2], check.attributes=FALSE)
#> [1] TRUE
The following price indices belong to this very general class of
multilateral price indices and are implemented in the
pricelevels
-package.
gkhamis()
rao()
ikle()
rhajargasht()
All methods have in common that they set up a system of interrelated equations of international product prices and regional price levels, which must be solved iteratively. It is only the definition of the international product prices and the price levels that differ between the methods.
The Geary-Khamis method defines the international average prices, \(\pi_n \ (n=1,\ldots,N)\), and the regional price levels, \(P^r \ (r=1,\ldots,R)\), as \[\pi_n = \frac{\sum_{r=1}^{R} q_n^r \left( p_n^r/ P^r \right)}{\sum_{r=1}^{R} q_n^r} \quad \text{and} \quad P^r = \frac{\sum_{n=1}^{N} p_n^r q_n^r}{\sum_{n=1}^{N} \pi_n q_n^r} = \left[ \sum_{n=1}^{N} w_n^r \left( p_n^r / \pi_n \right)^{-1} \right]^{-1} \, ,\] where the weights \(w_n^r= p_n^r q_n^r / \sum_{n=1}^{N} p_n^r q_n^r\) represent expenditure shares. This system of interrelated equations can be solved iteratively or analytically using matrix algebra (Diewert 1999). Which approach is computationally faster depends on the data.
# sample data with gaps:
set.seed(123)
dt.big <- rdata(R=99, B=1, N=50, gaps=0.25)
# iterative processing:
system.time(m1 <- dt.big[, gkhamis(p=price, r=region, n=product, q=quantity,
settings=list(solve="iterative"))])
#> user system elapsed
#> 0.00 0.00 0.02
# matrix algebra:
system.time(m2 <- dt.big[, gkhamis(p=price, r=region, n=product, q=quantity,
settings=list(solve="matrix"))])
#> user system elapsed
#> 0.03 0.00 0.05
# compare results:
all.equal(m1, m2, tol=1e-05)
#> [1] TRUE
The Iklé index defines the regional price levels \(P^r\) exactly in the same way as the Geary-Khamis index. However, the calculation of the international product prices \(\pi_n\) uses expenditure shares instead of quantities: \[\pi_n = \left[ \frac{\sum_{r=1}^{R} w_n^r \left( p_n^r / P^r \right)^{-1}}{\sum_{r=1}^{R} w_n^r} \right]^{-1} \quad \text{and} \quad P^r = \left[ \sum_{n=1}^{N} w_n^r \left( p_n^r / \pi_n \right)^{-1} \right]^{-1} \, . \] Therefore, the Iklé index suffers less from the Gerschenkorn effect than the Geary-Khamis index as bigger countries do not receive more influence in the calculations than smaller countries.
In contrast to the Geary-Khamis index, the Iklé index can be computed with quantities \(q_n^r\) or expenditure share weights \(w_n^r\). This also applies to the Rao index, which derives the international product prices, \(\pi_n \ (n=1,\ldots,N)\), and regional price levels, \(P^r \ (r=1,\ldots,R)\), from the following system of interrelated equations: \[\pi_n = \prod_{r=1}^{R} \left( p_n^r / P^r \right)^{w_n^r / \sum_{s=1}^{R} w_n^s} \quad \text{and} \quad P^r = \prod_{n=1}^{N} \left( p_n^r / \pi_n \right)^{w_n^r} \, , \] while the Rao-Hajargasht index sets up the following system of equations:
\[\pi_n = \sum_{r=1}^{R}
\frac{w_n^r}{\sum_{s=1}^{R} w_n^s} \left( p_n^r / P^r \right) \quad
\text{and} \quad P^r = \sum_{n=1}^{N} w_n^r \left( p_n^r / \pi_n \right)
\ . \] All formulas above for \(\pi_n\) and \(P^r\) require quantities (or expenditure
share weights). Following Rao and Hajargasht
(2016, 417), however, unweighted variants of the multilateral
indices exist, which are implemented in the corresponding functions by
setting q=NULL
or w=NULL
.
Also the price indices belonging to this class of multilateral systems of equations produce transitive price levels.
The multilateral Gerardi index used by Eurostat (1978) is implemented in the function
gerardi()
. Similar to the indices from the previous
section, the Gerardi index defines international product prices, \(\pi_n\), and regional price levels, \(P^r\):
\[\pi_n = \left( \prod_{r=1}^{R} p_n^r \right)^{1/R} \quad n=1,\ldots,N\] and \[P^r = \frac{\sum_{n=1}^{N} p_n^r q_n^r}{\sum_{n=1}^{N} \pi_n q_n^r} = \left[ \sum_{n=1}^{N} w_n^r \left( p_n^r / \pi_n \right)^{-1} \right]^{-1} \quad r=1,\ldots,R \, .\]
Consequently, the regional price levels are defined in the same way as for the Geary-Khamis and Ikle methods. It is only the definition of the international product prices that differs. More importantly, however, the \(\pi_n\) rely only on the observed prices. Therefore, the Gerardi index does not require an iterative procedure, but can be solved in one step.
One obvious drawback of the Gerardi index is the definition of \(\pi_n\) as an unweighted geometric mean
even when quantities or expenditure shares are available. Hence, the
function gerardi()
offers another variant where the \(\pi_n\)’s are computed as weighted
geometric means. This can be set by
settings=list(variant="adjusted")
.
dt.gaps[, gerardi(p=price, r=region, n=product, q=quantity, base="1", settings=list(variant="adjusted"))]
#> 1 2 3 4 5
#> 1.0000000 0.9818328 0.9909638 1.0803662 0.9304644
The Gerardi index produces transitive regional price levels.
Bilateral price indices use solely the direct matches or links between two regions. Hence, indirect links between two regions via a third one are not taken into account. This is different for multilateral price indices, which make use of these indirect links. Consequently, the function output between these indices may differ.
# example data:
dt <- data.table(
"region"=rep(letters[1:5], c(3,3,7,4,4)),
"product"=as.character(c(1:3, 1:3, 1:7, 4:7, 4:7)))
set.seed(123)
dt[, "price":=rnorm(n=.N, mean=20, sd=2)]
dt[, "quantity":=rnorm(n=.N, mean=999, sd=100)]
# price matrix:
with(dt, tapply(X=price, list(product, region), mean))
#> a b c d e
#> 1 18.87905 20.14102 20.92183 NA NA
#> 2 19.53965 20.25858 17.46988 NA NA
#> 3 23.11742 23.43013 18.62629 NA NA
#> 4 NA NA 19.10868 20.22137 16.06677
#> 5 NA NA 22.44816 18.88832 21.40271
#> 6 NA NA 20.71963 23.57383 19.05442
#> 7 NA NA 20.80154 20.99570 17.86435
The price matrix shows that prices for products 1
to
3
are available in regions a
, b
,
and c
, while prices for products 4
to
7
were recorded in regions c
, d
,
and e
. Since in region c
the prices are
complete, the data are connected - either through direct or indirect
regional links.
However, the price matrix also shows that there are no product
matches for regions a
and d
, for example.
Therefore, any bilateral price index between these two regions is not
defined, while the multilateral price indices will provide a price level
by taking into account the indirect link of regions a
and
d
via region c
.
# bilateral jevons index:
dt[, jevons(p=price, r=region, n=product, base="a")]
#> a b c d e
#> 1.0000000 1.0388264 0.9276696 NA NA
# multilateral unweighted cpd index:
dt[, cpd(p=price, r=region, n=product, base="a")]
#> a b c d e
#> 1.0000000 1.0388264 0.9276696 0.9328507 0.8274965
If we now assume that there is no region c
, the data
will consist of two separate blocks of regions and, thus, become
non-connected.
# drop region 'c':
dt2 <- dt[!region%in%"c",]
# price matrix:
with(dt2, tapply(X=price, list(product, region), mean))
#> a b d e
#> 1 18.87905 20.14102 NA NA
#> 2 19.53965 20.25858 NA NA
#> 3 23.11742 23.43013 NA NA
#> 4 NA NA 20.22137 16.06677
#> 5 NA NA 18.88832 21.40271
#> 6 NA NA 23.57383 19.05442
#> 7 NA NA 20.99570 17.86435
# check if connected:
dt2[, is.connected(r=region, n=product)]
#> [1] FALSE
In this case, no price comparison involving all regions is possible.
The price indices implemented in the pricelevels
-package
deal with this issue by using either the block of regions that includes
the base region or the biggest block of regions for a price comparison
on a subset of the data.
# bilateral jevons index:
dt2[, jevons(p=price, r=region, n=product, base="a")]
#> Warning: Non-connected regions -> computations with subset of data
#> a b d e
#> 1.000000 1.038826 NA NA
# multilateral unweighted cpd index:
dt2[, cpd(p=price, r=region, n=product, base="a")]
#> Warning: Non-connected regions -> computations with subset of data
#> a b d e
#> 1.000000 1.038826 NA NA
As can be seen, the two functions return the price levels for regions
a
and b
only, while the price levels for
regions c
and d
are set to NA
.
The functions also print a corresponding warning. All
pricelevels
-warnings can be suppressed in the settings if
wanted.
dt2[, cpd(p=price, r=region, n=product, base="a", settings=list(chatty=FALSE))]
#> a b d e
#> 1.000000 1.038826 NA NA
Also the checking of connectedness can be suppressed. This can be useful in simulations or other cases when it is known that the data are connected. Of course, in our example this setting would result in an error.
dt2[, cpd(p=price, r=region, n=product, base="a", settings=list(connect=FALSE))]
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): singular fit encountered
If the data contains duplicated prices or missing values
NA
, these are removed before any calculations start. Prices
and weights are averaged within each basic heading for each region and
product, while quantities are summed up. Again, a corresponding warning
message is returned.
# example data:
dt <- data.table(
"region"=c("a","a","a","b","b","b","b"),
"product"=as.character(c(1,1,2,1,1,2,2)))
set.seed(123)
dt[, "price":=rnorm(n=.N, mean=20, sd=2)]
dt[, "quantity":=rnorm(n=.N, mean=999, sd=100)]
dt[1, "price" := NA_real_]
dt[, cpd(p=price, r=region, n=product)]
#> Warning: 1 incomplete case(s) found and removed
#> Warning: Duplicated observations found and aggregated
#> a b
#> 1.0020896 0.9979148
In some situations it is useful to compute the price levels of the regions using multiple price indices. This could be done by calculating, for example, the CPD index, the GEKS-Jevons index, and the Jevons index, separately using the corresponding package functions.
# example data:
set.seed(123)
dt <- rdata(R=5, B=1, N=7, gaps=0.2)
# cpd:
dt[, cpd(p=price, r=region, n=product, base="1")]
#> 1 2 3 4 5
#> 1.000000 1.304266 1.170156 1.098752 1.197490
# geks-jevons:
dt[, geks(p=price, r=region, n=product, base="1", setting=list(type="jevons"))]
#> 1 2 3 4 5
#> 1.000000 1.294057 1.165078 1.093972 1.191073
# jevons:
dt[, jevons(p=price, r=region, n=product, base="1")]
#> 1 2 3 4 5
#> 1.000000 1.294572 1.145790 1.097794 1.206428
However, this approach is not convenient and also not efficient as
each price index function checks the same user inputs, removes the same
missing values and duplicated observations, and connects the same data
if needed. This can be time consuming for bigger datasets. The
pricelevels
-package therefore offers the
pricelevels()
-function, which can be used to efficiently
calculate multiple price indices at once.
dt[, pricelevels(p=price, r=region, n=product, base="1", settings=list(type=c("jevons","cpd","geks-jevons")))]
#> 1 2 3 4 5
#> cpd 1 1.304266 1.170156 1.098752 1.197490
#> geks-jevons 1 1.294057 1.165078 1.093972 1.191073
#> jevons 1 1.294572 1.145790 1.097794 1.206428
If the user wants to compute all unweighted price indices available
in the pricelevels
-package, this can be achieved by setting
settings$type=NULL
:
dt[, pricelevels(p=price, r=region, n=product, base="1")]
#> 1 2 3 4 5
#> bmw 1 1.294645 1.145793 1.097794 1.206430
#> carli 1 1.303176 1.147201 1.098526 1.209764
#> cpd 1 1.304266 1.170156 1.098752 1.197490
#> cswd 1 1.294864 1.145803 1.097795 1.206437
#> dutot 1 1.303007 1.150488 1.099698 1.208837
#> geks-bmw 1 1.294092 1.165093 1.093983 1.191088
#> geks-carli 1 1.294195 1.165138 1.094014 1.191133
#> geks-cswd 1 1.294195 1.165138 1.094014 1.191133
#> geks-dutot 1 1.302091 1.168106 1.096550 1.194862
#> geks-harmonic 1 1.294195 1.165138 1.094014 1.191133
#> geks-jevons 1 1.294057 1.165078 1.093972 1.191073
#> gkhamis 1 1.311388 1.172914 1.101115 1.199854
#> harmonic 1 1.286606 1.144407 1.097065 1.203118
#> ikle 1 1.304265 1.171281 1.099912 1.198626
#> jevons 1 1.294572 1.145790 1.097794 1.206428
#> nlcpd 1 1.313387 1.164121 1.096029 1.193531
#> rao 1 1.304266 1.170156 1.098752 1.197490
#> rhajargasht 1 1.304206 1.168994 1.097576 1.196306
Similarly, if there are quantities or weights available in the data,
all weighted and unweighted price indices available in the
pricelevels
-package are computed at once:
dt[, pricelevels(p=price, r=region, n=product, q=quantity, base="1")]
#> 1 2 3 4 5
#> banerjee 1 1.253417 1.141503 1.094111 1.198268
#> bmw 1 1.294645 1.145793 1.097794 1.206430
#> carli 1 1.303176 1.147201 1.098526 1.209764
#> cpd 1 1.266089 1.152739 1.082465 1.182087
#> cswd 1 1.294864 1.145803 1.097795 1.206437
#> davies 1 1.252229 1.142528 1.094370 1.199939
#> drobisch 1 1.252491 1.142447 1.094359 1.200005
#> dutot 1 1.303007 1.150488 1.099698 1.208837
#> fisher 1 1.252462 1.142372 1.094347 1.199874
#> geks-banerjee 1 1.271009 1.153711 1.083487 1.180644
#> geks-bmw 1 1.294092 1.165093 1.093983 1.191088
#> geks-carli 1 1.294195 1.165138 1.094014 1.191133
#> geks-cswd 1 1.294195 1.165138 1.094014 1.191133
#> geks-davies 1 1.270677 1.154297 1.084168 1.181476
#> geks-drobisch 1 1.270818 1.154259 1.084083 1.181445
#> geks-dutot 1 1.302091 1.168106 1.096550 1.194862
#> geks-fisher 1 1.270818 1.154259 1.084083 1.181445
#> geks-geolaspeyres 1 1.270597 1.154364 1.084272 1.181536
#> geks-geopaasche 1 1.270597 1.154364 1.084272 1.181536
#> geks-geowalsh 1 1.269451 1.153443 1.084131 1.181396
#> geks-harmonic 1 1.294195 1.165138 1.094014 1.191133
#> geks-jevons 1 1.294057 1.165078 1.093972 1.191073
#> geks-laspeyres 1 1.270818 1.154259 1.084083 1.181445
#> geks-lehr 1 1.253837 1.142062 1.078325 1.171155
#> geks-lowe 1 1.281260 1.159068 1.089011 1.182639
#> geks-medgeworth 1 1.270634 1.155082 1.084910 1.183136
#> geks-paasche 1 1.270818 1.154259 1.084083 1.181445
#> geks-palgrave 1 1.270620 1.154579 1.084537 1.181742
#> geks-svartia 1 1.269841 1.153757 1.084181 1.181444
#> geks-theil 1 1.269829 1.153749 1.084179 1.181445
#> geks-toernqvist 1 1.270597 1.154364 1.084272 1.181536
#> geks-uvalue 1 1.236111 1.163371 1.076930 1.211239
#> geks-walsh 1 1.269482 1.153458 1.084141 1.181411
#> geks-young 1 1.281712 1.159905 1.090079 1.183622
#> geolaspeyres 1 1.255885 1.128502 1.088582 1.179864
#> geopaasche 1 1.248368 1.157057 1.100236 1.220512
#> geowalsh 1 1.252576 1.140507 1.093937 1.200037
#> gerardi 1 1.241508 1.141248 1.081700 1.175969
#> gkhamis 1 1.267113 1.156586 1.084440 1.183675
#> harmonic 1 1.286606 1.144407 1.097065 1.203118
#> ikle 1 1.266336 1.153682 1.083412 1.183243
#> jevons 1 1.294572 1.145790 1.097794 1.206428
#> laspeyres 1 1.261008 1.129372 1.089128 1.182314
#> lehr 1 1.246348 1.126641 1.087801 1.183916
#> lowe 1 1.279239 1.142610 1.093595 1.196531
#> medgeworth 1 1.250211 1.142092 1.096414 1.203379
#> nlcpd 1 1.304658 1.160267 1.094826 1.189399
#> paasche 1 1.243973 1.155522 1.099591 1.217695
#> palgrave 1 1.253266 1.158605 1.100882 1.223354
#> rao 1 1.266089 1.152739 1.082465 1.182087
#> rhajargasht 1 1.265828 1.151783 1.081527 1.180898
#> svartia 1 1.252423 1.141251 1.094091 1.200032
#> theil 1 1.252424 1.141228 1.094087 1.200039
#> toernqvist 1 1.252121 1.142690 1.094394 1.200016
#> uvalue 1 1.222189 1.171691 1.093794 1.197584
#> walsh 1 1.252639 1.140511 1.093938 1.200043
#> young 1 1.287435 1.144315 1.095294 1.200834
It is important to note that only the simplified output including the
regions’ price levels is returned. Moreover, the user has to provide a
specific base region as bilateral indices do not allow a comparison to
the average regional price level. However, any additional settings of
some price index function (e.g., settings$use.jac=TRUE
for
nlcpd()
) can be used in pricelevels()
as
well.