bigergm
This vignette provides a brief introduction on how to use the R
package bigergm
, which estimates Hierarchical
Exponential-Family Random Graph Models (HERGMs,
Schweinberger and Handcock 2015). bigergm
is built
upon the R package hergm
(Schweinberger and Luna 2018) and applies
scalable algorithms and computational techniques. See Martínez Dahbura et al. (2021) for further
information.
library(bigergm)
#> Lade nötiges Paket: ergm
#> Lade nötiges Paket: network
#>
#> 'network' 1.18.2 (2023-12-04), part of the Statnet Project
#> * 'news(package="network")' for changes since last version
#> * 'citation("network")' for citation information
#> * 'https://statnet.org' for help, support, and other information
#>
#> 'ergm' 4.6.0 (2023-12-17), part of the Statnet Project
#> * 'news(package="ergm")' for changes since last version
#> * 'citation("ergm")' for citation information
#> * 'https://statnet.org' for help, support, and other information
#> 'ergm' 4 is a major update that introduces some backwards-incompatible
#> changes. Please type 'news(package="ergm")' for a list of major
#> changes.
#> Lade nötiges Paket: Rcpp
library(ergm)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#>
#> filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#>
#> intersect, setdiff, setequal, union
bigergm
has a toy network to test-drive the package with
Let’s load the network data and plot it.
# Load an embedded network object.
data(toyNet)
# Draw the network.
plot(toyNet, vertex.col = rep(c("tomato", "steelblue", "darkgreen", "black"),
each = toyNet$gal$n/4))
As you can see, this network has a clear cluster or community structure. Although this is a fake network, we often observe such community structures in real social networks. Exploiting this stylized fact, we model the way agents in a network get connected differently for connections across and within communities:
To estimate the latent community structure of a network and agents’
preferences for connection, bigergm
implements the
following two-step procedure.
Seeing is believing. Let’s perform an estimation using the toy
network. (If you would like to see the progress, set
verbose = 1
.)
<- toyNet ~ edges + nodematch("x") + nodematch("y") + triangle
model_formula
<-
hergm_res ::hergm(
bigergm# The model you would like to estiamte
object = model_formula,
# The number of blocks
n_clusters = 4,
# The maximum number of MM algorithm steps
n_MM_step_max = 100,
# Perform parameter estimation after the block recovery step
estimate_parameters = TRUE,
# Indicate that clustering must take into account nodematch on characteristics
clustering_with_features = TRUE,
# Keep track of block memberships at each EM iteration
check_block_membership = TRUE
)
To see whether the first step (recovering the latent community structure) has converged, we can plot the estimated lower bound of the objective function over iterations.
plot(1:length(hergm_res$MM_lower_bound),
$MM_lower_bound, type = "l", xlab = "Iterations", ylab = "Lower Bound") hergm_res
This indicates that the clustering step converged at the early stage.
Note that the number of iterations that you need to perform
(n_em_step_max
) varies depending on the size of a network,
whether it has a clear community structure, etc.. You need trial and
error on how many iterations are at least necessary in your case.
Plotting the lower bound may help check the convergence of the
clustering step.
You can check the clustering result
# Number of nodes per recovered block:
table(hergm_res$partition)
#>
#> 1 2 3 4
#> 50 50 50 50
and estimated parameters.
# For the between networks
summary(hergm_res$est_between)
#> Call:
#> ergm(formula = between_formula, estimate = "MPLE")
#>
#> Maximum Likelihood Results:
#>
#> Estimate Std. Error MCMC % z value Pr(>|z|)
#> edges -4.50671 0.07465 0 -60.373 <1e-04 ***
#> nodematch.x 0.79350 0.16086 0 4.933 <1e-04 ***
#> nodematch.y 0.40233 0.18379 0 2.189 0.0286 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> For this model, the pseudolikelihood is the same as the likelihood.
#>
#> Null Deviance: 27587 on 19900 degrees of freedom
#> Residual Deviance: 2695 on 19897 degrees of freedom
#>
#> AIC: 2701 BIC: 2725 (Smaller is better. MC Std. Err. = 0)
# For the within networks
summary(hergm_res$est_within)
#> Call:
#> ergm::ergm(formula = formula, constraints = ~blockdiag("block"),
#> offset.coef = offset_coef, estimate = method_second_step,
#> control = control)
#>
#> Maximum Pseudolikelihood Results:
#>
#> Estimate Std. Error MCMC % z value Pr(>|z|)
#> edges -1.72212 0.06542 0 -26.326 <1e-04 ***
#> nodematch.x 0.54183 0.10602 0 5.111 <1e-04 ***
#> nodematch.y 0.63428 0.10574 0 5.998 <1e-04 ***
#> triangle 0.14674 0.01721 0 8.524 <1e-04 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Warning: The standard errors are based on naive pseudolikelihood and are suspect. Set control.ergm$MPLE.covariance.method='Godambe' for a simulation-based approximation of the standard errors.
#>
#> Null Pseudo-deviance: 6793 on 4900 degrees of freedom
#> Residual Pseudo-deviance: 5242 on 4896 degrees of freedom
#>
#> AIC: 5250 BIC: 5276 (Smaller is better. MC Std. Err. = 0)
Currently, the only supported way to include covariates in the model
is via nodematch()
.
You can also employ caching to avoid repeating heavy operations that
yield the same results for your network. To use it, pass the
cache
parameter to bigergm::hergm
, setting its
value to a cachem object. A disk cache lets you speed
up estimations on the same network data even across R Sessions.
You can evaluate the goodness-of-fit of the model with the
bigergm::gof_bigergm()
function:
# Prepare a data frame that contains nodal id and covariates.
<-
nodes_data ::tibble(
tibblenode_id = network::network.vertex.names(toyNet),
block = hergm_res$partition,
x = network::get.vertex.attribute(toyNet, "x"),
y = network::get.vertex.attribute(toyNet, "y")
)
# The feature adjacency matrices
<- bigergm::get_list_sparse_feature_adjmat(toyNet, model_formula)
list_feature_matrices
# The MCMC settings
<- ergm::control.simulate.formula(
sim_ergm_control MCMC.burnin = 1000000,
MCMC.interval = 100000
)
# The feature adjacency matrices
<- bigergm::get_list_sparse_feature_adjmat(toyNet, model_formula)
list_feature_matrices
<- bigergm::gof_bigergm(
gof_res
toyNet,# The feature adjacency matrices
list_feature_matrices = list_feature_matrices,
# A dataframe containing the nodes data.
data_for_simulation = nodes_data,
# The name of the nodes_data column containing the node IDs
# which are used within the network g
colname_vertex_id = 'node_id',
# The name of the nodes_data column containing the block ID.
colname_block_membership = 'block',
# The object returned by bigergm::hergm()
bigergm_results = hergm_res,
# The MCMC settings
ergm_control = sim_ergm_control,
# The number of simulations to use
n_sim = 100
)
Currently, gof is evaluated on the following metrics:
the network statistics (the counts you obtain when you use summary on an ergm formula, such as the number of edges, triangles, nodematches, etc.),
degree distribution
geodesic distance, and
edgewise shared partners.
bigergm::gof_bigergm()
returns a list of data frames for
these matrices instead of creating plots as ergm::gof()
does. This allows you to flexibly create gof plots that match your
needs.
Below is a example gof plot on degree distribution.
<-
degree_gof $simulated$degree_dist %>%
gof_res::group_by(degree) %>%
dplyr::summarise(log_mean_share = mean(log(share)),
dplyrlog_sd_share = sd(log(share))) %>%
::ungroup()
dplyrplot(degree_gof$degree, degree_gof$log_mean_share,
xlab = "Degree", ylab = "Log Prop. of Nodes",
ylim = c(-5.5,-1.8), xlim = c(6,20), type = "l", lty = 2)
lines(degree_gof$degree, degree_gof$log_mean_share+ 1.96 * degree_gof$log_sd_share, type = "l", lty = 2)
lines(degree_gof$degree, degree_gof$log_mean_share- 1.96 * degree_gof$log_sd_share, type = "l", lty = 2)
<- gof_res$original$degree_dist %>%
tmp_info ::filter(share > 0 & degree < 22)
dplyrlines(tmp_info$degree, log(tmp_info$share), lty = 1)
You can simulate networks with local dependence using the
bigergm::simulate_hergm()
.
# Estimated coefficients for the between-community connections
<- coef(hergm_res$est_between)
coef_between_block
# Estimated coefficients for the within-community connections
<- coef(hergm_res$est_within)
coef_within_block
<- bigergm::simulate_hergm(
sim_net # Formula for between-blocks
formula_for_simulation = model_formula,
# Same as for gof, a dataframe containing nodes attributes
data_for_simulation = nodes_data,
# Name of the column containing node IDs
colname_vertex_id = "node_id",
# Name of the column containing block IDs
colname_block_membership = "block",
# The coefficients for the between connections
coef_between_block = coef_between_block,
# The coefficients for the within connections
coef_within_block = coef_within_block,
# The MCMC settings
ergm_control = sim_ergm_control,
# Number of simulations to return
n_sim = 1,
# If `stats` a list with network statistics
# for the between and within connections is returned
output = "network",
# Simulates between connections by drawing from a logistic distribution.
# If FALSE, draws between connections by MCMC.
use_fast_between_simulation = TRUE,
# The feature adjacency matrices
list_feature_matrices = list_feature_matrices
)
plot(sim_net)
If you would like to estimate an HERGM with a large network (say, when the number of nodes \(\geq\) 50,000):
bigergm::compute_multiplied_feature_matrices()
, and pass it
to bigergm::hergm()
by
list_multiplied_feature_matrices
. Once calculated and
stored, it can be used in models with the same network and the same
features.igraph::cluster_infomap()
. To install it, run
system("pip3 install infomap")
and check if it is effective
by system("infomap --version")
. If
system("infomap --version")
yields an error, consider using
{reticulate}
.use_infomap_python = TRUE
in
bigergm::hergm()
.bigergm
class object to bigergm::hergm()
as
follows (all parameters such as the number of EM iterations will be
inherited from the previous estimation unless specified).<-
hergm_res_second ::hergm(object = hergm_res) bigergm