This vignette describes the extensions to the rtrim
package as per version 2.0:
and in addition extensions introduced in version 2.2: * long output * scaled index standard errors
See also the vignettes rtrim confidence intervals and taming overdispersion for additional new features.
One of the major improvements in rtrim
2.0 with respect
to 1.0 is the handling of monthly, or any other intra-annual, data. For
example, where a classic TRIM model 3 looks like \[ \ln \mu_{ij} = \alpha_i + \beta_j \]
where \(\mu_{ij}\) is an expected
count, \(\alpha_i\) is a site parameter
for site \(i\) and \(\beta_j\) is a time point parameter for
year \(j\). the extension towards
months looks like \[ \ln \mu_{ijm} = \alpha_i
+ \beta_j + \delta_m \] where \(\delta_m\) are month parameters for month
\(m\).
Please take a look in Models and statistical methods in rtrim for a full explanation, and application to other model formulations.
The general syntax to specify R-Trim models that use monthly data is as follows:
z <- trim(count ~ site + year, data=...) # simple annual data
z <- trim(count ~ site + year + habitat, data=...) # using covariates
z <- trim(count ~ site + (year+month), data=...) # using monthy data
z <- trim(count ~ site + (year+month) + habitat, data=...) # using both
Note the use of brackets to distinguish months from covariates.
Here is a full example for Oystercatcher data, which now comes with
rtrim
2.0
rm(list=ls()) # always start with a clean slate
library(rtrim)
data(oystercatcher)
oc <- trim(count ~ site + (year+month), data=oystercatcher, model=3, overdisp=TRUE,
constrain_overdisp=0.999)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
While in the past TRIM was used to analyse count data with an annual
resolution (i.e. one observation per site per year), the software
package UIndex [Underhill, 1989; Underhill and Prŷs-Jones, 1995] was and
is used to analyse count data with higher (e.g., monthly) resolution. As
demonstrated above, rtrim
is extended to accept and analyse
monthly data as well. This section demonstrates the application of
rtrim
to monthly data, and compares the output with that of
UIndex.
UIndex was used to analyse the monthly Oystercatcher counts, collected by SOVON Netherlands. Here we show the pre-saved output of UIndex, as the main trend, and the 90% `consistency intervals’. Note also the use of 2004 as base year.
load("UIndex_Oystercatcher_output.RData")
yrange <- range(uidx$index, uidx$lower, uidx$upper)
plot(uidx$year, uidx$index, type='l', xlab="Year", ylab="Index", ylim=yrange)
segments(uidx$year, uidx$lower, uidx$year,uidx$upper)
legend("topright", "UIndex", col="black", lty="solid")
# Add index=1 line for reference
abline(h=1.0, lty="dashed", col=gray(0.5))
# Mark the base year
ibase <- match(2004, uidx$year)
points(uidx$year[ibase], uidx$index[ibase], pch=16)
rtrim
Here we show the comparision with rtrim
, using the
results computed above.
# Compute and plot an index for Oystercatcher counts, using 2004 as base year and
# adding 90% confidence intervals as well.
idx <- index(oc, level=0.9, base=2004)
plot(idx, band="ci")
# Plot UIndex on top
lines(uidx$year, uidx$index)
segments(uidx$year, uidx$lower, uidx$year,uidx$upper, lwd=2)
legend("bottom", c("UIndex","TRIM"), col=c("black","red"), lty="solid")
Note the computation and display of confidence intervals, which is
new for rtrim
2.0, along with the standard errors of both
classic TRIM and rtrim
1.0.
This plot demonstrates that the indices as computed by UIndex and
rtrim
are virtually identical, and that the 90% confidence
intervals of TRIM are well comparable to the 90% consistency intervals
of TRIM, although both are estimated using completely different
approaches. In the case of TRIM, confidence intervals are based on
standard errors which are derived analytically as part of the GEE
estimation process and ultimately are based on the variance within the
orginal data. See the vignettes Models and
statistical methods in rtrim and rtrim confidence intervals
for more information. Consistency intervals in UIndex are estimated by
means of a bootstrap method. See Underhill [1989] and Underhill and
Prŷs-Jones [1995] for more information.
rtrim
Sometimes it can be usefull to combine rtrim
results for
different regions (‘strata’) into a single, larger scale
(‘superstratum’) rtrim
analysis. One particular example is
the case where individual EU countries use TRIM or rtrim
to
compute indices for their own countries, and submit the results to the
European Bird Census Counsil EBCC
for aggregatation on the EU scale, see van Strien et al. [2001] for an
example using the previous stand-alone version of TRIM. In this case,
the output of the lower scale rtrim
runs, i.e., the time
totals, are used as ‘observations’ in the higher scale run. Although
this procedure works out well for the estimates and indices, it doesn’t
work for the associated standard errors, because the time totals are not
Poisson distributed, where the original counts are. To circumvent this
problem, rtrim
has options to export the variances of the
lower scale runs and to import these into the higher scale runs, to use
instead.
The following example shows the associated workflow. Strictly for demonstration purposes, we split the Skylark dataset into two ‘regions’ associated with the habitats (heath and dunes).
# split data
data(skylark2)
heath <- subset(skylark2, habitat=="heath") # 208 records
dunes <- subset(skylark2, habitat=="dunes") # 232 records
heath$site <- factor(heath$site) # get rid of empty levels
dunes$site <- factor(dunes$site)
# run models
m1 <- trim(count ~ site + year, data=heath, model=3)
m2 <- trim(count ~ site + year, data=dunes, model=3)
# collect imputed time-totals (which is the default)
t1 <- totals(m1)
t2 <- totals(m2)
plot(t1,t2, names=c("heath", "dunes"))
Note the use of multiple time-totals in a single plot (new for
rtrim
2.0)
The next step is to use the time totals for the differente habitats
(strata') as input data for an upscaled (
superstratum’)
run. The habitat types now serve as site names, and imputed counts will
be the input counts.
t1$region <- "heath"
t2$region <- "dunes"
t12 <- rbind(t1, t2)
head(t12)
#> time imputed se_imp region
#> 1 1984 376 33 heath
#> 2 1985 255 25 heath
#> 3 1986 339 22 heath
#> 4 1987 336 21 heath
#> 5 1988 389 23 heath
#> 6 1989 425 24 heath
The final preparation step is to extract the variance-covariance information for the different habitats, and combine them into a single list, using habitat/region names as identifier, enabling the correct match between the site identifiers in the data, and the variance-covariance matrices.
# Also collect the variance-covariance matrices for both runs
vcv1 <- vcov(m1)
vcv2 <- vcov(m2)
vcv3 <- list(heath=vcv1, dunes=vcv2)
and off we go with the superstratum run. Note the new argument
covin
to use the variance-covariance data.
Now, just for comparison, we compare index plots for both the baseline run (where dunes and heath are taken together, but do act as covariates) and the upscaled `superstratum’ variant.
m0 <- trim(count ~ site + year + habitat, data=skylark2, model=3) # baseline
t0 <- totals(m0)
t3 <- totals(m3)
plot(t0,t3, names=c("baseline","superstrata"))
Which suggests that for this example the differences are small, if any.
In some cases, especially with clustering bird species, overdispersion can be huge, reaching unrealistic values of more than 500. rtrim now contains an option to constrain the computed value of overdispersion by detecting outliers, and removing them from the computation of overdispersion (but retaining them for all other calculations). The full rationale and methdology is described in Taming overdispersion, but the actual use is rather simple.
Take for example the Oystercatcher data, which results in a huge overdispersion of about 850
data(oystercatcher)
m1 <- trim(count ~ site + (year + month), data=oystercatcher, model=3, overdisp=TRUE)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
The inclusion of the option constrain_overdisp=0.999
triggers the detection of outliers that have a probability of 0.1%.
m2 <- trim(count ~ site + (year + month), data=oystercatcher, model=3, overdisp=TRUE,
constrain_overdisp=0.99)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
And so we get a much more reasonable result, with smaller standard errors.
Once an rtrim
model has been estimated, one of the first
steps of analysis schould be the plotting of time-totals. This is done
by first calling the totals()
function, and then a custom
plot()
function:
rm(list=ls()) # New section; time for a new blank slate
data(skylark2) # reload Skylark data
m1 <- trim(count ~ site + year, data=skylark2, model=3)
t1 <- totals(m1) # By default, the time-totals for the imputed data set
plot(t1)
Alternatively, one may compute the fitted time-totals. the next example shows the plotting of both the imputed and fitted time-totals, and also demonstrates how series can be named, and the plot can be decorated with a main title.
m2 <- trim(count ~ site + year, data=skylark2, model=2, changepoints=c(1,2))
ti <- totals(m2, "imputed")
tf <- totals(m2, "fitted")
plot(ti, tf, names=c("imputed","fitted"), main="Skylark", leg.pos="bottomright")
Since imputed totals are composed of both observed and estimated counts, it might be insightful to plot the observed counts as well.
m3 <- trim(count ~ site + year, data=skylark2, model=3)
t3 <- totals(m3, obs=TRUE) # Extract observations in addition to totals
plot(t3)
As can be seen, the amount of observed Skylarks is considerable smaller than the time totals suggest. Furthermore, it can be seen that while the observed counts decrease from 1989, the imputed counts continue to increase. It is thus suggested to look into more detail what is going on in different sites.
Once a TRIM model has been estimated, and indices are computed, these
latter can be plotted using the generic plot command plot()
(which, behind the screens, calls plot.trim.index()
).
m <- trim(count ~ site + year, data=skylark2, model=3) # Run a fairly basic TRIM model
idx <- index(m) # By default, the indices for the imputed data set
plot(idx)
If required, the x-axis and y-axis labels as well as the tile can be defined, and the index can be expressed as a percentage, instead as a fraction. This example shows all these options:
When covariates are involved, it can be helpful to compute and plot indices for the various covariate categories as well. The next example demonstrates this.
m <- trim(count ~ site + year + habitat, data=skylark2, model=3) # Run a fairly basic TRIM model
idx <- index(m, covars=TRUE)
plot(idx)
As can be seen, indices for the various covariate categories are
automatically plotted as well. This behaviour can be supressed by
setting covar="none"
in the call to plot()
(note the use of plural
covars' in the call to
index()--- because indices for multiple covariates can be computed, and the singular
covarin the call to
plot()`
— because only one of them can be used for a single figure)
Indices for multiple TRIM runs can be combined in a single plot.
data(skylark2)
m0 = trim(count ~ site + year , data=skylark2, model=3)
m1 = trim(count ~ site + year + habitat, data=skylark2, model=3)
idx0 <- index(m0)
idx1 <- index(m1)
plot(idx0, idx1)
As you see, a legend is inserted automatically. You can change the names
of the series by using the names
argument:
New in rtrim 2.0 is the possibility to express uncertainty as a
confidence interval, in addition to the standard errors. Both the
functions totals()
and index()
now accept the
option level
that specifies the confidence level and
triggers the computation.
m <- trim(count ~ site + year, data=skylark2, model=3)
tt <- totals(m, level=0.95) # Compute 95% confidence intervals
head(tt)
#> time imputed se_imp lo hi
#> 1 1984 511 38 510.0584 510.0584
#> 2 1985 362 31 361.1155 361.1155
#> 3 1986 429 26 428.4749 428.4749
#> 4 1987 423 25 422.5076 422.5076
#> 5 1988 469 27 468.4820 468.4820
#> 6 1989 522 27 521.5346 521.5346
So, the lower and upper bounds of the confidence interval is stored
in columns lo
and hi
. These are automatically
picked up by the plot()
function.
If required, the uncertainty band, which is by default plotted using
standard errors, can be plotted using the confidence intervals when the
option band="ci"
is used.
See vignette rtrim confidence intervals for more information on the underlying methodology.
Stating with version 2.2, the totals()
and
index()
functions can also generate so-called long
output, which simplifies plotting using alternative approaches,
e.g. ggplot2
or similar, especially because imputed and
fitted time-totals are in rows, not in columns.
tt <- totals(m, long=TRUE)
head(tt)
#> variable series year value SE
#> 1 time_totals imputed 1984 511 38
#> 2 time_totals imputed 1985 362 31
#> 3 time_totals imputed 1986 429 26
#> 4 time_totals imputed 1987 423 25
#> 5 time_totals imputed 1988 469 27
#> 6 time_totals imputed 1989 522 27
lo <- tt$value - 1.96 * tt$SE # Assume normal distribution
hi <- tt$value + 1.96 * tt$SE
# create an empty plot with sufficient space
xrange = range(tt$year)
yrange = range(lo, hi)
plot(xrange, yrange, type='n', xlab="Year", ylab="Time-totals")
# plot the time series and error bars
segments(tt$year, lo, tt$year, hi, col="red", lwd=2)
lines(tt$year, tt$value, col="red", lwd=2)
A similar mechanism has been included for custom plotting of overall
trends, using a new separate trendlines
function.
tl <- trendlines(overall(m)) # collect overall trend line
print(tl)
#> segment year value lo hi
#> 1 1 1984.000 404.7544 362.9033 451.4320
#> 2 1 1984.101 406.7516 365.4989 452.6604
#> 3 1 1984.203 408.7587 368.1056 453.9015
#> 4 1 1984.304 410.7757 370.7227 455.1559
#> 5 1 1984.406 412.8026 373.3498 456.4244
#> 6 1 1984.507 414.8395 375.9863 457.7077
#> 7 1 1984.609 416.8865 378.6314 459.0067
#> 8 1 1984.710 418.9436 381.2843 460.3223
#> 9 1 1984.812 421.0108 383.9444 461.6556
#> 10 1 1984.913 423.0882 386.6106 463.0075
#> 11 1 1985.014 425.1759 389.2821 464.3792
#> 12 1 1985.116 427.2739 391.9577 465.7720
#> 13 1 1985.217 429.3822 394.6363 467.1872
#> 14 1 1985.319 431.5009 397.3167 468.6263
#> 15 1 1985.420 433.6301 399.9974 470.0907
#> 16 1 1985.522 435.7698 402.6771 471.5821
#> 17 1 1985.623 437.9200 405.3541 473.1023
#> 18 1 1985.725 440.0809 408.0267 474.6532
#> 19 1 1985.826 442.2524 410.6932 476.2368
#> 20 1 1985.928 444.4347 413.3516 477.8551
#> 21 1 1986.029 446.6277 415.9999 479.5104
#> 22 1 1986.130 448.8315 418.6359 481.2050
#> 23 1 1986.232 451.0462 421.2574 482.9415
#> 24 1 1986.333 453.2718 423.8621 484.7222
#> 25 1 1986.435 455.5085 426.4474 486.5499
#> 26 1 1986.536 457.7561 429.0111 488.4271
#> 27 1 1986.638 460.0148 431.5505 490.3566
#> 28 1 1986.739 462.2847 434.0633 492.3410
#> 29 1 1986.841 464.5658 436.5470 494.3829
#> 30 1 1986.942 466.8582 438.9993 496.4850
#> 31 1 1987.043 469.1618 441.4179 498.6495
#> 32 1 1987.145 471.4768 443.8007 500.8788
#> 33 1 1987.246 473.8033 446.1460 503.1750
#> 34 1 1987.348 476.1412 448.4522 505.5398
#> 35 1 1987.449 478.4907 450.7177 507.9749
#> 36 1 1987.551 480.8517 452.9418 510.4814
#> 37 1 1987.652 483.2244 455.1235 513.0604
#> 38 1 1987.754 485.6088 457.2625 515.7124
#> 39 1 1987.855 488.0050 459.3587 518.4377
#> 40 1 1987.957 490.4130 461.4124 521.2364
#> 41 1 1988.058 492.8329 463.4240 524.1080
#> 42 1 1988.159 495.2647 465.3944 527.0522
#> 43 1 1988.261 497.7085 467.3246 530.0679
#> 44 1 1988.362 500.1644 469.2157 533.1544
#> 45 1 1988.464 502.6324 471.0693 536.3103
#> 46 1 1988.565 505.1126 472.8868 539.5344
#> 47 1 1988.667 507.6050 474.6699 542.8253
#> 48 1 1988.768 510.1097 476.4201 546.1816
#> 49 1 1988.870 512.6268 478.1393 549.6017
#> 50 1 1988.971 515.1563 479.8291 553.0843
#> 51 1 1989.072 517.6982 481.4913 556.6279
#> 52 1 1989.174 520.2527 483.1274 560.2309
#> 53 1 1989.275 522.8199 484.7392 563.8921
#> 54 1 1989.377 525.3997 486.3283 567.6100
#> 55 1 1989.478 527.9922 487.8960 571.3835
#> 56 1 1989.580 530.5975 489.4439 575.2113
#> 57 1 1989.681 533.2157 490.9734 579.0923
#> 58 1 1989.783 535.8467 492.4857 583.0255
#> 59 1 1989.884 538.4908 493.9821 587.0099
#> 60 1 1989.986 541.1479 495.4637 591.0445
#> 61 1 1990.087 543.8181 496.9316 595.1286
#> 62 1 1990.188 546.5016 498.3868 599.2614
#> 63 1 1990.290 549.1982 499.8303 603.4422
#> 64 1 1990.391 551.9081 501.2629 607.6703
#> 65 1 1990.493 554.6315 502.6855 611.9453
#> 66 1 1990.594 557.3682 504.0989 616.2666
#> 67 1 1990.696 560.1185 505.5038 620.6338
#> 68 1 1990.797 562.8823 506.9008 625.0464
#> 69 1 1990.899 565.6598 508.2906 629.5041
#> 70 1 1991.000 568.4510 509.6738 634.0065
tt <- totals(m, long=TRUE) # collect time-totals
# define plot limits
xr <- range(tt$year)
yr <- range(tl$lo, tl$hi, tt$value)
plot(xr, yr, type='n', xlab="Year", ylab="Total counts")
# Plot uncertainty band
ubx <- c(tl$year, rev(tl$year))
uby <- c(tl$lo, rev(tl$hi))
polygon(ubx, uby, col=gray(0.9), border=NA)
# Plot trend line
lines(tl$year, tl$value, col="black", lwd=2)
# Plot time-totals
lines(tt$year, tt$value, col="red", lwd=2)
points(tt$year, tt$value, col="red", pch=16, cex=1.5)
## “Scaled” standard errors
In rtrim
version \(<2.2\) standard errors for indices are
always computed using a formal approach that results in \(SE=0\) for the reference year.
Mathematically, this makes completely sense because the indices are
computed by dividing the time-totals time series by the time-total for
the reference year. The index for that reference year is thus by
definition 1, without any uncertainty, hence \(SE=0\). The following plot illustrates
this.
tt <- totals(m)
idx <- index(m)
par(mfrow=c(1,2))
plot(tt, main="Time-totals", ylab=NA)
plot(idx, main="Index", ylab=NA)
However, for many (communication) purposes this formal approach is
rather confusing, because the interpretation is often made that
uncertainties ‘disappeared’ this way, Furthermore, the indices are often
only calculated to compare trends for multiple species. For this
comparison purposes, one often wants to preserve the original
uncertainty pattern as in the time-totals. Starting with
rtrim
version 2.2, this is possible by using the
method="scaled"
option:
fidx <- index(m, method="formal") # same as just index(m)
sidx <- index(m, method="scaled")
par(mfrow=c(1,2))
plot(fidx, main="Formal approach")
plot(sidx, main="Scaled approach")
As can be seen, the uncertainty for the reference year now is \(>0\), while the uncertainties for the other years have been shrunk, which can be explaiend from what loosely can be described as a preservation of total uncertainty.
The detailed spatiotemporal structure of both the observed and the
imputed data can be inspected by means of the function
heatmap()
that operates on the output of
trim()
. The default behaviour of this function is to
display a heat map of the observed counts only:
In this heatmap, site/time combinations are colored by (log) counts: lower counts are colored a pale red, and high counts a dark red. Consistent with the nature of count data, this color scale is proportional to the log counts. Observed counts of 0 cannot be represented this way and are colored white. Missing site/time combinations are marked as gray.
It can be seen that the observational coverage is not constant: most sites have incomplete records, especially in the earlier years. This is a typical patern for an expanding observation program, but may have consequences for the statistical analysis, because the imputation for these years will in fact be an extrapolation back in time.
The next example shows the TRIM estimated counts (using shades of blue, rather than red:
From this plot, it is clear that the variance between sites is much higher than the variance between years. In fact, the trend in time can hardly be seen.
The last example sows the heatmap for the imputed data, where estimates are used to fill up the missing observations. Again, red is for obervations, blue for estimates.
For monthly data, heatmaps work slightly different, but in the same spirit:
data(oystercatcher)
m <- trim(count ~ site + (year + month), data=oystercatcher, model=3, overdisp=TRUE)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
Again, observational coverage is extremely variable in both space and time. There appears to be a few sites that have sporadic, yet high, count observations, causing large amounts of estimated counts for this location for all other time points, which may effect the aggregated time-totals in a significant way.
Also note that in this example, many site/time combinations have registered a count of 0, which are colored white, as explained above.
van Strien, A. J., J. Pannekoek and D.W. Gibbons (2001) Indexing European bird population trends using results of national monitoring schemes: a trial of a new method, Bird Study, 48 (2), 200-213, DOI: 10.1080/00063650109461219
Underhill, LG, Prŷs-Jones, RP (1994) Index numbers for waterbird populations. I. Review and methodology. J Appl Ecol, 31, 463-480. doi: 10.2307/2404443
Underhill, L.G. (1989) Indices for Waterbird Populations. BTO Research Report 52, British Trust for Ornithology, Tring.