| Type: | Package |
| Title: | Machine Learning for Integrating Partially Overlapped Genetic Datasets |
| Version: | 1.3.2 |
| Description: | Tools to simulate genetic distance matrices, align and compare them via multidimensional scaling (MDS) and Procrustes, and evaluate imputation with the Bootstrapping Evaluation for Structural Missingness Imputation (BESMI) framework. Methods align with Zhu et al. (2025) <doi:10.3389/fpls.2025.1543956> and the associated software resource Zhu (2025) <doi:10.26188/28602953>. |
| License: | GPL-3 |
| URL: | https://github.com/jiashuaiz/DataFusion-GDM |
| BugReports: | https://github.com/jiashuaiz/DataFusion-GDM/issues |
| Encoding: | UTF-8 |
| Depends: | R (≥ 3.6) |
| Imports: | ggplot2, vegan, mice, stats, utils |
| Suggests: | VIM, knitr, rmarkdown, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | no |
| Config/testthat/edition: | 3 |
| Packaged: | 2025-11-01 05:05:13 UTC; jiashuaiz |
| Author: | Jiashuai Zhu |
| Maintainer: | Jiashuai Zhu <jiashuai.zhu@student.unimelb.edu.au> |
| Repository: | CRAN |
| Date/Publication: | 2025-11-04 19:20:07 UTC |
Distance metrics
Description
Distance metrics
Usage
.besmi_calculate_distance(a, b, method = "mae")
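As a rough illustration of the four metric names listed for distance_metric elsewhere in this manual ('mae', 'ssd', 'rmse', 'correlation'), the following is a minimal sketch, not the package's internal code. The helper name distance_sketch is hypothetical, and treating 'correlation' as 1 minus the Pearson correlation is an assumption.

```r
# Hypothetical helper; NOT the package's internal implementation.
distance_sketch <- function(a, b, method = "mae") {
  # Compare only positions observed in both inputs
  ok <- !is.na(a) & !is.na(b)
  x <- as.numeric(a[ok])
  y <- as.numeric(b[ok])
  switch(method,
    mae = mean(abs(x - y)),            # mean absolute error
    ssd = sum((x - y)^2),              # sum of squared differences
    rmse = sqrt(mean((x - y)^2)),      # root mean squared error
    # Assumption: correlation is converted to a dissimilarity as 1 - r
    correlation = 1 - stats::cor(x, y),
    stop("Unknown method: ", method)
  )
}

a <- matrix(c(0, 1, 2, 1, 0, 3, 2, 3, 0), nrow = 3)
b <- a + 0.5
distance_sketch(a, b, "mae")
distance_sketch(a, b, "rmse")
```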
Determine bootstrap sample count for a given k
Description
Determine bootstrap sample count for a given k
Usage
.besmi_determine_sampling_sizes(k)
Initialize matrix by column means
Description
Initialize matrix by column means
Usage
.besmi_initialize_M(M)
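Column-mean initialization can be sketched in base R as follows. This is an illustrative assumption about what such a step typically looks like; the helper name init_by_col_means is hypothetical and the package's internal function may differ in detail.

```r
# Hypothetical helper; a sketch of column-mean initialization.
init_by_col_means <- function(M) {
  col_means <- colMeans(M, na.rm = TRUE)
  for (j in seq_len(ncol(M))) {
    missing <- is.na(M[, j])
    M[missing, j] <- col_means[j]   # fill NAs with the column's observed mean
  }
  M
}

M <- matrix(c(0, NA, 2, 1, 0, NA, 2, 3, 0), nrow = 3)
M_filled <- init_by_col_means(M)
M_filled
```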
Double-center a distance matrix
Description
Double-center a distance matrix
Usage
.double_center(D)
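Double-centering is the standard first step of classical MDS: with centering matrix J = I - (1/n) 11', the Gram matrix is B = -1/2 J D^2 J. The sketch below assumes that convention (squared distances); whether the package squares D internally is not stated here, and the helper name double_center_sketch is hypothetical.

```r
# Hypothetical helper; assumes the classical MDS convention B = -1/2 * J D^2 J.
double_center_sketch <- function(D) {
  n <- nrow(D)
  J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  -0.5 * J %*% (D^2) %*% J
}

coords <- matrix(c(0, 0, 3, 0, 0, 4), ncol = 2, byrow = TRUE)
D <- as.matrix(dist(coords))
B <- double_center_sketch(D)
round(rowSums(B), 10)   # rows of a double-centered matrix sum to zero
```

For Euclidean distances, B equals the Gram matrix of the column-centered coordinates.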
Procrustes alignment and mapping back to distances
Description
Procrustes alignment and mapping back to distances
Usage
apply_procrustes(X_base, Y_base, Y)
Arguments
X_base |
Base coordinates for target alignment |
Y_base |
Base coordinates for source alignment |
Y |
Full source coordinates to transform |
Value
Transformed coordinates matrix
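The idea of fitting a transformation on the base (shared) coordinates and then applying it to the full source coordinates can be sketched with an SVD-based orthogonal Procrustes fit. This is a base R illustration under stated assumptions (translation, uniform scaling, and rotation/reflection are all estimated); it is not necessarily how apply_procrustes() is implemented, and procrustes_sketch is a hypothetical name.

```r
# Hypothetical helper; SVD-based Procrustes fit on base points, applied to all of Y.
procrustes_sketch <- function(X_base, Y_base, Y) {
  x_mean <- colMeans(X_base)
  y_mean <- colMeans(Y_base)
  Xc <- sweep(X_base, 2, x_mean)
  Yc <- sweep(Y_base, 2, y_mean)
  s <- svd(crossprod(Yc, Xc))          # t(Yc) %*% Xc = U D V'
  R <- s$u %*% t(s$v)                  # optimal rotation/reflection
  s_factor <- sum(s$d) / sum(Yc^2)     # optimal uniform scale
  # Center Y with the base mean, rotate and scale, then shift to the target mean
  sweep(s_factor * (sweep(Y, 2, y_mean) %*% R), 2, x_mean, "+")
}

set.seed(1)
X <- matrix(rnorm(12), ncol = 2)       # 6 points in 2D
theta <- pi / 6
Rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
Y <- 2 * (X %*% Rot) + 5               # rotated, scaled, shifted copy of X
Y_aligned <- procrustes_sketch(X[1:4, ], Y[1:4, ], Y)
max(abs(Y_aligned - X))                # close to zero when Y is an exact similarity transform
```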
Run BESMI imputation for a list of dataset paths
Description
Run BESMI imputation for a list of dataset paths
Usage
besmi_batch_impute(
dataset_paths,
the_method = "lasso.norm",
max_iter = 5,
imputation_convergence_threshold = 1e-06,
propagation_convergence_threshold = 1e-06,
distance_metric = "mae",
output_dir = file.path(tempdir(), "DataFusionGDM_imputation"),
k_filter = NULL,
full_dataset_path = NULL
)
Arguments
dataset_paths |
Character vector of RDS paths to masked matrices |
the_method |
Imputation method (e.g., 'lasso.norm' or 'KNN') |
max_iter |
Maximum iterations for iterative methods |
imputation_convergence_threshold |
Convergence threshold for imputation metric |
propagation_convergence_threshold |
Convergence threshold for propagation metric |
distance_metric |
Distance metric for evaluation ('mae', 'ssd', 'rmse', 'correlation') |
output_dir |
Output directory for imputed matrices (defaults to a temporary location) |
k_filter |
Optional numeric filter for k value |
full_dataset_path |
Optional path to a full matrix RDS used as ground truth |
Value
Data frame of metrics for all datasets
Create masked matrices for BESMI
Description
Create masked matrices for BESMI
Usage
besmi_create_masked_matrices(full_matrix, k, seed = NULL)
Arguments
full_matrix |
Full symmetric matrix |
k |
Number of populations to mask (assigned to group U) |
seed |
Optional seed for reproducibility |
Value
List with masked_matrix, mask_position, group_u, group_s, masked_percentage
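To make the shape of the returned list concrete, here is one simple way such masking could work. It is an assumption, not the package's documented behaviour: k populations are drawn as group U, and the pairwise distances among them (the U x U block, excluding the diagonal) are set to NA. The helper name mask_sketch and the percentage definition (share of off-diagonal cells masked) are hypothetical.

```r
# Hypothetical helper; the choice of which block to mask is an assumption.
mask_sketch <- function(full_matrix, k, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(full_matrix)
  group_u <- sort(sample(n, k))               # populations selected for masking
  group_s <- setdiff(seq_len(n), group_u)     # remaining populations
  mask <- matrix(FALSE, n, n)
  mask[group_u, group_u] <- TRUE              # mask distances within U
  diag(mask) <- FALSE                         # keep the zero diagonal
  masked <- full_matrix
  masked[mask] <- NA
  list(
    masked_matrix = masked,
    mask_position = mask,
    group_u = group_u,
    group_s = group_s,
    masked_percentage = 100 * sum(mask) / (n * n - n)
  )
}

full <- as.matrix(dist(matrix(seq_len(12), ncol = 2)))  # 6 x 6 toy distance matrix
out <- mask_sketch(full, k = 3, seed = 1)
out$masked_percentage
```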
Impute a single dataset from masked matrix path
Description
Impute a single dataset from masked matrix path
Usage
besmi_impute_single_dataset(
input_path,
method = "lasso.norm",
max_iterations = 5,
imputation_convergence_threshold = 0.001,
propagation_convergence_threshold = 0.001,
distance_metric = "mae",
output_dir = file.path(tempdir(), "DataFusionGDM_imputation"),
full_dataset_path = NULL
)
Arguments
input_path |
Path to masked matrix RDS |
method |
Imputation method ('lasso.norm' or 'KNN') |
max_iterations |
Maximum iterations for iterative methods |
imputation_convergence_threshold |
Convergence threshold for imputation metric |
propagation_convergence_threshold |
Convergence threshold for propagation metric |
distance_metric |
Distance metric name |
output_dir |
Output directory for results (defaults to a temporary location) |
full_dataset_path |
Optional path to a full matrix RDS used as ground truth |
Value
Data frame of per-iteration metrics
Iterative imputation with MICE (tails-chain)
Description
Iterative imputation with MICE (tails-chain)
Usage
besmi_iterative_imputation(
M_input,
M_mask,
M_real = NULL,
method = "lasso.norm",
max_iterations = 5,
imputation_convergence_threshold = 0.001,
propagation_convergence_threshold = 0.001,
distance_metric = "mae",
k = NA,
bs_i = NA
)
Arguments
M_input |
Matrix with NAs to impute |
M_mask |
Logical mask matrix (TRUE indicates masked positions) |
M_real |
Optional ground truth matrix |
method |
MICE method (e.g., 'lasso.norm') |
max_iterations |
Max outer iterations |
imputation_convergence_threshold |
Threshold for imputation distance |
propagation_convergence_threshold |
Threshold for propagation distance |
distance_metric |
Distance metric name |
k |
Dataset parameter k (for logging) |
bs_i |
Bootstrap index (for logging) |
Value
List with final_matrix, metrics, tails_chain
KNN imputation sweep (uses VIM::kNN)
Description
KNN imputation sweep (uses VIM::kNN)
Usage
besmi_knn_impute(
M_input,
M_mask,
M_real = NULL,
distance_metric = "mae",
k = NA,
bs_i = NA
)
Arguments
M_input |
Matrix with NAs |
M_mask |
Logical mask matrix |
M_real |
Optional ground truth |
distance_metric |
Distance metric name |
k |
Dataset parameter k |
bs_i |
Bootstrap index |
Value
List with final_matrix, metrics, tails_chain
Prepare full GDM dataset from CSV or RData
Description
Prepare full GDM dataset from CSV or RData
Usage
besmi_prepare_full_dataset(input_path)
Arguments
input_path |
Path to CSV or RData file containing the full distance matrix |
Value
Symmetric numeric matrix
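For the CSV case, loading a square distance table and enforcing symmetry might look like the sketch below. The assumed layout (population names in the first column and in the header row) and the helper name read_gdm_csv_sketch are hypothetical; the package function may accept other layouts.

```r
# Hypothetical helper; assumes row names in the first CSV column.
read_gdm_csv_sketch <- function(path) {
  df <- utils::read.csv(path, row.names = 1, check.names = FALSE)
  M <- as.matrix(df)
  storage.mode(M) <- "double"
  stopifnot(nrow(M) == ncol(M))   # a distance matrix must be square
  (M + t(M)) / 2                  # average with the transpose to enforce symmetry
}

pops <- c("popA", "popB", "popC")
D <- matrix(c(0, 0.2, 0.5, 0.2, 0, 0.4, 0.5, 0.4, 0), nrow = 3,
            dimnames = list(pops, pops))
tmp <- tempfile(fileext = ".csv")
utils::write.csv(D, tmp)
M <- read_gdm_csv_sketch(tmp)
unlink(tmp)
isSymmetric(M)
```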
Convert coordinate matrix to distance matrix
Description
Convert coordinate matrix to distance matrix
Usage
coords_to_distances(coords)
Arguments
coords |
Numeric coordinate matrix |
Value
Symmetric distance matrix
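A minimal base R equivalent, assuming Euclidean distance between rows of the coordinate matrix (the metric is not stated above, so this is an assumption), uses stats::dist(). The helper name coords_to_dist_sketch is hypothetical.

```r
# Hypothetical helper; assumes Euclidean distances between rows.
coords_to_dist_sketch <- function(coords) {
  as.matrix(stats::dist(coords, method = "euclidean"))
}

coords <- matrix(c(0, 0, 3, 0, 0, 4), ncol = 2, byrow = TRUE)
D <- coords_to_dist_sketch(coords)
D
```

With these three points (a 3-4-5 right triangle), the off-diagonal entries are 3, 4, and 5.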
Create a heatmap of genetic distances (ggplot2)
Description
Returns a ggplot heatmap of the distance matrix using ggplot2 only (no Bioconductor dependencies).
Usage
create_distance_heatmap(dist_matrix, pop_info)
Arguments
dist_matrix |
Symmetric numeric distance matrix with row/column names |
pop_info |
Data frame with at least |
Value
A ggplot object
Create MDS plot of genetic distances
Description
Create MDS plot of genetic distances
Usage
create_mds_plot(dist_matrix, pop_info)
Arguments
dist_matrix |
Symmetric numeric distance matrix |
pop_info |
Data frame with metadata columns |
Value
A ggplot object
Export a simulated GDM to CSV
Description
Export a simulated GDM to CSV
Usage
export_simulated_gdm(
output_file = tempfile("gdm_", fileext = ".csv"),
scenario = "default",
n_pops = 30,
verbose = TRUE,
seed = NULL
)
Arguments
output_file |
Output CSV filename (defaults to a session-scoped temporary path) |
scenario |
Scenario name |
n_pops |
Number of populations |
verbose |
Verbose output |
seed |
Optional seed forwarded to run_genetic_scenario() |
Value
Invisibly, the normalized path to the written CSV
Examples
tmp <- export_simulated_gdm(verbose = FALSE)
if (file.exists(tmp)) unlink(tmp)
Perform MDS on a pair of distance matrices
Description
Perform MDS on a pair of distance matrices
Usage
perform_mds(A, B)
Arguments
A |
First distance matrix |
B |
Second distance matrix |
Value
A list with coordinates X, Y, optimal dimension d_opt, and variance info
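The following sketch shows one way to embed two distance matrices with classical MDS and pick a common dimension. The selection rule (smallest dimension whose positive eigenvalues explain at least var_threshold of the variance in both matrices) and the names mds_pair_sketch and var_threshold are assumptions for illustration; the rule used by perform_mds() is not described here.

```r
# Hypothetical helper; the dimension-selection rule is an assumption.
mds_pair_sketch <- function(A, B, var_threshold = 0.9) {
  decompose <- function(D) {
    n <- nrow(D)
    J <- diag(n) - matrix(1 / n, n, n)
    eigen(-0.5 * J %*% (D^2) %*% J, symmetric = TRUE)   # classical MDS
  }
  pick_d <- function(values) {
    pos <- pmax(values, 0)                               # ignore negative eigenvalues
    which(cumsum(pos) / sum(pos) >= var_threshold)[1]
  }
  embed <- function(e, d) {
    sweep(e$vectors[, seq_len(d), drop = FALSE], 2,
          sqrt(pmax(e$values[seq_len(d)], 0)), "*")
  }
  ea <- decompose(A)
  eb <- decompose(B)
  d_opt <- max(pick_d(ea$values), pick_d(eb$values))
  list(X = embed(ea, d_opt), Y = embed(eb, d_opt), d_opt = d_opt)
}

set.seed(42)
P <- matrix(rnorm(12), ncol = 2)        # 6 points in 2D
A <- as.matrix(dist(P))
B <- as.matrix(dist(1.5 * P))
res <- mds_pair_sketch(A, B)
res$d_opt
dim(res$X)
```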
Run simulation with predefined biological scenarios
Description
Run simulation with predefined biological scenarios
Usage
run_genetic_scenario(
scenario = "default",
n_pops = 30,
output_file = NULL,
seed = NULL,
verbose = TRUE
)
Arguments
scenario |
Scenario name: 'default', 'island', 'stepping_stone', 'admixture', 'ancient_divergence', 'simple' |
n_pops |
Number of populations |
output_file |
Optional CSV path to write the distance matrix |
seed |
Optional seed forwarded to run_genetic_simulation() |
verbose |
Print diagnostic information |
Value
Same structure as run_genetic_simulation()
Run a high-level genetic simulation with configurable model
Description
Run a high-level genetic simulation with configurable model
Usage
run_genetic_simulation(
n_pops = 30,
n_major_groups = 4,
n_subgroups = 8,
model = "mixed",
geo_dims = NULL,
isolation_factor = NULL,
genetic_dims = NULL,
group_separation = 15,
subgroup_separation = NULL,
pop_dispersion = 0.5,
admixture_prob = 0.15,
bottleneck_prob = 0.1,
use_subgroups = TRUE,
use_genetic_dims = NULL,
use_admixture = TRUE,
use_bottlenecks = TRUE,
use_isolation_by_distance = NULL,
use_nonlinear = TRUE,
use_noise = TRUE,
seed = NULL,
output_file = NULL,
verbose = TRUE
)
Arguments
n_pops |
Number of populations |
n_major_groups |
Number of major groups |
n_subgroups |
Number of subgroups |
model |
One of "mixed", "geography", "genetics", or "custom" |
geo_dims |
Geographic dimensions (overrides default based on model if set) |
isolation_factor |
Geography-genetics balance (overrides default based on model if set) |
genetic_dims |
Genetic dimensions (overrides default based on model if set) |
group_separation |
Separation between major groups |
subgroup_separation |
Separation between subgroups (default: group_separation/3 when NULL) |
pop_dispersion |
Within-subgroup dispersion |
admixture_prob |
Proportion of admixed populations |
bottleneck_prob |
Proportion of bottlenecked populations |
use_subgroups |
Whether to create subgroups |
use_genetic_dims |
Whether to include genetic dimensions |
use_admixture |
Whether to include admixture |
use_bottlenecks |
Whether to include bottlenecks |
use_isolation_by_distance |
Whether to weight geographic distance |
use_nonlinear |
Whether to apply nonlinear transformation |
use_noise |
Whether to add noise |
seed |
Optional seed forwarded to simulate_genetic_distances() |
output_file |
Optional CSV file path to write the distance matrix |
verbose |
Print diagnostics |
Value
List with results and plots, where plots holds functions that print the plots when called
Simulate genetic distances using realistic population structure
Description
Generates a synthetic genetic distance matrix and metadata using hierarchical population structure, admixture and bottleneck options.
Usage
simulate_genetic_distances(
n_pops = 50,
n_major_groups = 5,
n_subgroups = 12,
geo_dims = 2,
genetic_dims = 2,
group_separation = 15,
subgroup_separation = 5,
pop_dispersion = 0.5,
isolation_factor = 0.8,
admixture_prob = 0.1,
bottleneck_prob = 0.05,
noise_level = 0.1,
nonlinear_factor = 0.7,
use_subgroups = TRUE,
use_genetic_dims = TRUE,
use_admixture = TRUE,
use_bottlenecks = TRUE,
use_isolation_by_distance = TRUE,
use_nonlinear = TRUE,
use_noise = TRUE,
seed = NULL,
verbose = TRUE
)
Arguments
n_pops |
Number of populations |
n_major_groups |
Number of major groups |
n_subgroups |
Number of subgroups |
geo_dims |
Geographic dimensions |
genetic_dims |
Additional genetic drift dimensions |
group_separation |
Separation between major groups |
subgroup_separation |
Separation between subgroups |
pop_dispersion |
Within-subgroup dispersion |
isolation_factor |
Weight for geography in isolation-by-distance model (0-1) |
admixture_prob |
Proportion of admixed populations |
bottleneck_prob |
Proportion of bottlenecked populations |
noise_level |
Noise level in transformation |
nonlinear_factor |
Nonlinearity factor in transformation |
use_subgroups |
Whether to create subgroups |
use_genetic_dims |
Whether to include genetic dimensions |
use_admixture |
Whether to include admixture |
use_bottlenecks |
Whether to include bottlenecks |
use_isolation_by_distance |
Whether to weight geographic distance |
use_nonlinear |
Whether to apply nonlinear transformation |
use_noise |
Whether to add noise |
seed |
Optional seed for reproducibility (NULL leaves the RNG state unchanged) |
verbose |
Print diagnostics |
Value
A list with distance_matrix, population_info, position_matrix, and parameters.
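As a much-simplified illustration of hierarchical population structure, the sketch below places group centres in a 2D space, scatters populations around them, and returns Euclidean distances. It omits subgroups, admixture, bottlenecks, the nonlinear transformation, and noise, and it is not the package's algorithm. The function name simulate_sketch and the population_info column names are hypothetical.

```r
# Hypothetical helper; a toy two-level structure, NOT the package's simulator.
simulate_sketch <- function(n_pops = 12, n_major_groups = 3,
                            group_separation = 15, pop_dispersion = 0.5,
                            seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  group <- rep_len(seq_len(n_major_groups), n_pops)     # assign populations to groups
  centers <- matrix(rnorm(n_major_groups * 2, sd = group_separation), ncol = 2)
  pos <- centers[group, , drop = FALSE] +
    matrix(rnorm(n_pops * 2, sd = pop_dispersion), ncol = 2)
  rownames(pos) <- paste0("pop", seq_len(n_pops))
  list(
    distance_matrix = as.matrix(dist(pos)),
    population_info = data.frame(population = rownames(pos), group = group),
    position_matrix = pos
  )
}

sim <- simulate_sketch(seed = 123)
dim(sim$distance_matrix)
head(sim$population_info)
```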
Create plotting handles for simulation results
Description
Create plotting handles for simulation results
Usage
visualize_results(sim_results)
Arguments
sim_results |
A list returned by simulate_genetic_distances() or run_genetic_simulation() |
Value
A list with heatmap and mds functions that print plots when called