In here, we’ll walk through the algorithm for preprocessing 2D embedding data to construct a model overlaid with high-dimensional data.
The algorithm consists of two steps. First, construct the model in 2D space. Second, lift the model into high-dimensions. Therefore, to begin the process, first you need to know how the 2D model is constructed.
To construct the model in the 2D space, first you need to decide
hexagonal grid configurations, which means that the number of bins along
the x and y axes. You can begin by calculating default number of bins
along the axes with default settings for hex_size
,
buffer_x
, and buffer_y
.
Next, you need to hexagonally bins the 2D layout and obtain the bin centroids. Discussed in details in 3. Algorithm for binning data.
hb_obj <- hex_binning(data = s_curve_noise_umap_scaled, x = "UMAP1", y = "UMAP2",
num_bins_x = num_bins_x, num_bins_y = num_bins_y,
x_start = NA, y_start = NA, buffer_x = NA,
buffer_y = NA, hex_size = NA, col_start = "UMAP")
all_centroids_df <- as.data.frame(do.call(cbind, hb_obj$centroids))
counts_df <- as.data.frame(do.call(cbind, hb_obj$std_cts))
After obtaining the hexagonal bin centroid coordinates
(all_centroids_df
) and standard number of points within
each hexagon (counts_df
), you can generate a data frame or
tibble which gives hexagonal ID, centroid coordinates and standard
counts where data exists.
df_bin_centroids <- extract_hexbin_centroids(centroids_df = all_centroids_df,
counts_df = counts_df)
glimpse(df_bin_centroids)
#> Rows: 10
#> Columns: 4
#> $ hexID <dbl> 2, 6, 7, 12, 13, 18, 24, 28, 29, 34
#> $ c_x <dbl> 0.1732051, 0.0000000, 0.3464102, 0.1732051, 0.5196152, 0.69…
#> $ c_y <dbl> -0.15, 0.15, 0.15, 0.45, 0.45, 0.75, 1.05, 1.35, 1.35, 1.65
#> $ std_counts <dbl> 0.2352941, 0.5294118, 0.4117647, 0.1764706, 0.3529412, 0.70…
One of the parameters that you need to control is that the benchmark value to remove low density hexagons. The default value is the first quartile of the standardise counts.
benchmark_value_rm_lwd <- stats::quantile(df_bin_centroids$std_counts,
probs = c(0,0.25,0.5,0.75,1), names = FALSE)[2]
benchmark_value_rm_lwd
#> [1] 0.25
There is two ways that you can follow after this. First, you can
remove the low density hexagons from df_bin_centroids
and
proceed. Second, you can check whether is that actually reliable to
remove the identified low density hexagons by looking at their
neighboring bins and if so remove them and proceed. In here, let’s do
with second option.
Here, you need to obtain the low density hexagons.
df_bin_centroids_low <- df_bin_centroids |>
dplyr::filter(std_counts <= benchmark_value_rm_lwd)
glimpse(df_bin_centroids_low)
#> Rows: 3
#> Columns: 4
#> $ hexID <dbl> 2, 12, 29
#> $ c_x <dbl> 0.1732051, 0.1732051, 1.0392305
#> $ c_y <dbl> -0.15, 0.45, 1.35
#> $ std_counts <dbl> 0.2352941, 0.1764706, 0.2352941
Next, check the neighboring bins of low-density hexagons and decide which should actually need to remove.
identify_rm_bins <- find_low_dens_hex(df_bin_centroids_all = df_bin_centroids,
num_bins_x = num_bins_x,
df_bin_centroids_low = df_bin_centroids_low)
identify_rm_bins
#> numeric(0)
As you have seen, even though there are low density hexagons, it’s
not a good decision to remove them. Therefore, let’s use the same
df_bin_centroids
as before.
Then, you need to triangulate the bin centroids.
tr1_object <- tri_bin_centroids(hex_df = df_bin_centroids, x = "c_x", y = "c_y")
str(tr1_object)
#> List of 11
#> $ n : int 10
#> $ x : num [1:10] 0.173 0 0.346 0.173 0.52 ...
#> $ y : num [1:10] -0.15 0.15 0.15 0.45 0.45 0.75 1.05 1.35 1.35 1.65
#> $ nt : int 12
#> $ trlist: int [1:12, 1:9] 1 4 5 5 6 6 7 7 8 6 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:9] "i1" "i2" "i3" "j1" ...
#> $ cclist: num [1:12, 1:5] 1.73e-01 1.73e-01 3.46e-01 2.59e+15 3.46e-01 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:5] "x" "y" "r" "area" ...
#> $ nchull: int 6
#> $ chull : int [1:6] 2 1 7 9 10 8
#> $ narcs : int 21
#> $ arcs : int [1:21, 1:2] 3 2 1 3 4 3 5 1 5 6 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:2] "from" "to"
#> $ call : language interp::tri.mesh(x = hex_df[[rlang::as_string(rlang::sym(x))]], y = hex_df[[rlang::as_string(rlang::sym(y))]])
#> - attr(*, "class")= chr "triSht"
To visualize the results, simply use geom_trimesh()
and
provide the hexagonal bin centroid coordinates. This will display the
triangular mesh for you to examine.
To build the wireframe in 2D, you’ll need to identify which vertices
are connected. You can obtain this by passing the triangular object to
the gen_edges
function, which will provide information on
the existing edges and the connected vertices.
tr_from_to_df <- gen_edges(tri_object = tr1_object)
glimpse(tr_from_to_df)
#> Rows: 21
#> Columns: 6
#> $ from <int> 1, 2, 4, 3, 4, 6, 1, 4, 6, 8, 8, 1, 3, 1, 5, 5, 2, 7, 9, 2, 7
#> $ to <int> 3, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10, 2, 4, 5, 6, 7, 8, 9, 10, 3, 8
#> $ x_from <dbl> 0.1732051, 0.0000000, 0.1732051, 0.3464102, 0.1732051, 0.692820…
#> $ y_from <dbl> -0.15, 0.15, 0.45, 0.15, 0.45, 0.75, -0.15, 0.45, 0.75, 1.35, 1…
#> $ x_to <dbl> 0.3464102, 0.1732051, 0.5196152, 0.5196152, 0.6928203, 0.866025…
#> $ y_to <dbl> 0.15, 0.45, 0.45, 0.45, 0.75, 1.05, 1.05, 1.35, 1.35, 1.35, 1.6…
Another important parameter in this algorithm is the benchmark value for removing long edges. To compute this value, you first need to generate the 2D Euclidean distance dataset for the edges.
distance_df <- cal_2d_dist(tr_coord_df = tr_from_to_df, start_x = "x_from",
start_y = "y_from", end_x = "x_to", end_y = "y_to",
select_vars = c("from", "to", "distance"))
glimpse(distance_df)
#> Rows: 21
#> Columns: 3
#> $ from <int> 1, 2, 4, 3, 4, 6, 1, 4, 6, 8, 8, 1, 3, 1, 5, 5, 2, 7, 9, 2, 7
#> $ to <int> 3, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10, 2, 4, 5, 6, 7, 8, 9, 10, 3,…
#> $ distance <dbl> 0.3464102, 0.3464102, 0.3464102, 0.3464102, 0.6000000, 0.3464…
Then, you can use the find_lg_benchmark()
function to
compute a default benchmark value to remove long edges. However, this
default value may need adjustment for a better representation. In here,
used the benchmark value as 0.75.
benchmark <- find_lg_benchmark(distance_edges = distance_df,
distance_col = "distance")
benchmark
#> [1] 0.6
To visualize the results, you can use vis_lg_mesh()
and
vis_rmlg_mesh()
. These functions enable you to observe the
wireframe in 2D obtained from the algorithm’s computations.
trimesh_coloured <- vis_lg_mesh(distance_edges = distance_df,
benchmark_value = 0.75,
tr_coord_df = tr_from_to_df,
distance_col = "distance") +
xlab(expression(C[x]^{(2)})) + ylab(expression(C[y]^{(2)})) +
theme(axis.text = element_text(size = 5),
axis.title = element_text(size = 7),
legend.position = "bottom",
legend.title = element_blank())
trimesh_coloured
trimesh_removed <- vis_rmlg_mesh(distance_edges = distance_df,
benchmark_value = 0.75,
tr_coord_df = tr_from_to_df,
distance_col = "distance") +
xlab(expression(C[x]^{(2)})) + ylab(expression(C[y]^{(2)})) +
theme(axis.text = element_text(size = 5),
axis.title = element_text(size = 7))
trimesh_removed
To lift the constructed model into high-dimensions, you need to map the 2D hexagonal bin centroids to high-dimensions. To do that, first, you need to obtain the data set which have the 2D embedding with their corresponding hexagonal bin IDs.
umap_data_with_hb_id <- as.data.frame(do.call(cbind, hb_obj$data_hb_id))
glimpse(umap_data_with_hb_id)
#> Rows: 75
#> Columns: 3
#> $ UMAP1 <dbl> 0.08044271, 0.73857864, 0.83992020, 0.16719939, 0.26292782, 0.83…
#> $ UMAP2 <dbl> 0.274498327, 0.859778569, 0.929412446, 0.037066861, 0.341734583,…
#> $ hb_id <dbl> 6, 18, 24, 2, 12, 34, 18, 18, 18, 34, 18, 12, 34, 6, 18, 28, 12,…
Next, you need to create a data set with the high-dimensional data and the 2D embedding with hexagonal bin IDs.
df_all <- dplyr::bind_cols(s_curve_noise_training |> dplyr::select(-ID), umap_data_with_hb_id)
glimpse(df_all)
#> Rows: 75
#> Columns: 10
#> $ x1 <dbl> -0.11970232, -0.04921160, -0.77446658, -0.60566654, 0.81769684, …
#> $ x2 <dbl> 0.113792407, 0.822087939, 0.242750516, 1.955211672, 0.038841104,…
#> $ x3 <dbl> -1.9928098283, 0.0012116250, 0.3673851752, -1.7957185683, -1.575…
#> $ x4 <dbl> -0.002456044, 0.016093371, -0.019782622, 0.013202535, 0.00253174…
#> $ x5 <dbl> -0.0178407958, 0.0096807744, 0.0040810928, -0.0004790821, 0.0016…
#> $ x6 <dbl> -0.018086239, -0.083434181, -0.034917009, -0.004777390, 0.078094…
#> $ x7 <dbl> -0.0031710756, 0.0022970115, -0.0091095194, -0.0084291649, -0.00…
#> $ UMAP1 <dbl> 0.08044271, 0.73857864, 0.83992020, 0.16719939, 0.26292782, 0.83…
#> $ UMAP2 <dbl> 0.274498327, 0.859778569, 0.929412446, 0.037066861, 0.341734583,…
#> $ hb_id <dbl> 6, 18, 24, 2, 12, 34, 18, 18, 18, 34, 18, 12, 34, 6, 18, 28, 12,…
Then, use avg_highd_data()
to obtain the
high-dimensional coordinates of the model.
df_bin <- avg_highd_data(data = df_all, col_start = "x")
glimpse(df_bin)
#> Rows: 10
#> Columns: 8
#> $ hb_id <dbl> 2, 6, 7, 12, 13, 18, 24, 28, 29, 34
#> $ x1 <dbl> -0.636602704, -0.498439940, 0.294457441, 0.309089774, 0.86761456…
#> $ x2 <dbl> 1.74398667, 0.52443695, 1.40452199, 0.04212599, 0.74730061, 1.26…
#> $ x3 <dbl> -1.7641833, -1.7303881, -1.8845346, -1.8325686, -0.7811911, -0.1…
#> $ x4 <dbl> 0.0095304602, -0.0002371844, 0.0088975133, 0.0065624927, -0.0040…
#> $ x5 <dbl> -0.0014349303, 0.0023416710, -0.0080267578, 0.0082266577, 0.0008…
#> $ x6 <dbl> -0.011690348, -0.029689934, -0.012337284, 0.004886255, 0.0248442…
#> $ x7 <dbl> -0.0015190610, -0.0002422371, -0.0012016197, -0.0038868552, 0.00…
Finally, to visualise the model overlaid with the high-dimensional
data, you initially need to pass the data set with the high-dimensional
data and the 2D embedding with hexagonal bin IDs (df_all
),
high-dimensional mapping of hexagonal bin centroids
(df_bin
), 2D hexagonal bin coordinates
(df_bin_centroids
), and wireframe data
(distance_df
).