NOTE THIS IS EXPERIMENTAL FOR NOW
library(multilevelcoda)
library(knitr)
library(data.table)
#> data.table 1.18.4 using 12 threads (see ?getDTthreads). Latest news: r-datatable.com
#>
#> Attaching package: 'data.table'
#>
#> The following object is masked from 'package:base':
#>
#> %notin%
options(digits = 3, pillar.sigfig = 3)Data simulation is useful for multiple purposes, such as:
- Power analysis and sample size planning.
- Method development and testing.
- Generating synthetic data to test models, write code before data collection finishes, etc.
Simulating data, especially if multilevel and/or compositional, can
be complex. The simulate_data() function is designed to
support these efforts.
Broadly, simulate_data() builds an indexed design, runs
named generators in order, and returns an mlsim_data object
with four main pieces:
-
data: the generateddata.table. -
metadata: design metadata such as group sizes, time indexing, and generator order. -
generator_specs: the generator settings -
generator_metadata: e.g., parameters, random effects, covariance matrices, helper columns.
Summary Tables
These tables give a high level summary of current features.
Design options
Design options cover the highest level structure.
| Option | Use |
|---|---|
n |
Total rows for a single-level design, or total rows to divide across groups. |
n_groups |
Number of groups in a grouped or multilevel design. |
n_per_group |
Group sizes: scalar, vector, function, or count-distribution list. |
group_id |
Name of the grouping column. |
obs_id |
Name of the within-design observation index. |
time_id |
Optional time column name. |
time_values |
Optional time vector or function, including Date/POSIXct values. |
time_truncate |
Whether shared time values are truncated for shorter groups. |
seed |
Reproducible simulation seed that restores the caller RNG state. |
Generator families
Generators are used to add simulated data columns. They are
independent of each other. Any generated column can be used as well to
later generate an “outcome” column with gen_outcome().
Outcomes differ from the rest of the generators because outcomes can
depend on any previously generated predictor.
Current predictor generators are listed below.
| Generator | Role |
|---|---|
gen_normal() |
Univariate Gaussian predictors or outcomes on the identity scale. |
gen_mvn() |
Correlated Gaussian blocks, including ILR compositional back-transforms. |
gen_categorical() |
Binary, unordered categorical, and ordered categorical variables. |
gen_binomial() |
Binomial counts with logit-scale multilevel support. |
gen_poisson() |
Poisson counts with log-scale multilevel support. |
gen_negbin() |
Negative-binomial counts with optional location-scale multilevel support. |
gen_gamma() |
Positive continuous variables with optional location-scale multilevel support. |
gen_beta() |
Bounded (0, 1) continuous variables with optional location-scale support. |
gen_custom() |
User-supplied generators that receive the active simulation context. |
Advanced simulation options
These are a sampling of more advanced simulation features and apply to predictor generators and outcome generation.
| Feature | Use |
|---|---|
level |
Choose row-level, group-level, or multilevel generation. |
fixed_intercept |
Set the fixed location or link-scale intercept in multilevel generators. |
random_cov |
Add group-specific random intercepts, and sometimes scale random effects. |
residual_var / residual_cov
|
Set row-level residual variation for Gaussian and MVN generators. |
scale_fixed_intercept |
Simulate group-varying log residual SD, size, shape, or precision. |
compositional, parts, sbp,
total, keep_ilr
|
Generate compositions from ILR coordinates and store SBP metadata. |
| Outcome formula | Generate Gaussian outcomes from fixed, random, lagged, and helper terms. |
lag1(), within(), between(),
ar1()
|
Create time-lag, within-group, between-group, and autoregressive terms. |
scale_formula |
Simulate location-scale outcomes through a log residual SD model. |
prepare_outcome_fit() |
Prepare simulated outcomes and helper columns for downstream fitting. |
Study Designs and Indexing
Here is a simple single level example. Here no outcome is generated, so each variable will be generated independently of other variables.
single <- simulate_data(
n = 20,
seed = 2026,
generators = list(
x = gen_normal("x", mean = 0, sd = 1),
edu = gen_categorical("edu", categories = c("< Bachelor", "Bachelor", "> Bachelor"),
prob = c(.3, .3, .4)),
tx = gen_categorical("tx", categories = c("control", "treatment"), prob = .5)
)
)
kable(single$data)| obs_id | x | edu | tx |
|---|---|---|---|
| 1 | 0.521 | Bachelor | control |
| 2 | -1.080 | > Bachelor | treatment |
| 3 | 0.139 | Bachelor | treatment |
| 4 | -0.085 | Bachelor | control |
| 5 | -0.667 | < Bachelor | control |
| 6 | -2.516 | < Bachelor | control |
| 7 | -0.735 | > Bachelor | treatment |
| 8 | -1.020 | < Bachelor | control |
| 9 | 0.114 | Bachelor | treatment |
| 10 | -0.474 | < Bachelor | treatment |
| 11 | -0.408 | > Bachelor | treatment |
| 12 | -0.730 | > Bachelor | control |
| 13 | -0.221 | > Bachelor | treatment |
| 14 | -0.226 | > Bachelor | treatment |
| 15 | -2.547 | Bachelor | control |
| 16 | 1.347 | > Bachelor | treatment |
| 17 | 0.616 | > Bachelor | treatment |
| 18 | 0.218 | < Bachelor | treatment |
| 19 | -0.805 | > Bachelor | control |
| 20 | 0.690 | < Bachelor | treatment |
single
#> <mlsim_data>
#> rows: 20
#> columns: 4 (generated: 3)
#> seed: 2026
#> grouping: none
#> generators: 3
#>
#> Generators:
#> generator distribution level vars
#> x normal single x
#> edu categorical single edu
#> tx categorical single txThe mlsim_data print method gives a compact console
overview. Use summary() when you want design and generator
metadata as data.table objects.
summary(single)$design
#> n n_cols n_generated_cols n_groups group_id group_size_min
#> <int> <int> <int> <int> <char> <int>
#> 1: 20 4 3 NA <NA> NA
#> group_size_median group_size_max obs_id time_id seed n_generators
#> <num> <int> <char> <char> <int> <int>
#> 1: NA NA obs_id <NA> 2026 3
summary(single)$generators
#> generator distribution level vars n_vars parameter_level parameter_count
#> <char> <char> <char> <char> <int> <char> <int>
#> 1: x normal single x 1 row 20
#> 2: edu categorical single edu 1 row 20
#> 3: tx categorical single tx 1 row 20
#> has_row_parameters has_group_parameters has_fixed_parameters has_random_cov
#> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE FALSE FALSE FALSE
#> 2: TRUE FALSE FALSE FALSE
#> 3: TRUE FALSE FALSE FALSE
#> has_random_effects has_residuals has_scale_model has_formula has_ar_terms
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: FALSE FALSE FALSE FALSE FALSE
#> 2: FALSE FALSE FALSE FALSE FALSE
#> 3: FALSE FALSE FALSE FALSE FALSE
#> has_composition has_custom_output
#> <lgcl> <lgcl>
#> 1: FALSE FALSE
#> 2: FALSE FALSE
#> 3: FALSE FALSEWe can generate multilevel structures as well. A single scalar
n_per_group creates a balanced grouped design with the
exact same number of observations per group.
balanced <- simulate_data(
n_groups = 3,
n_per_group = 4,
seed = 2026,
generators = list(
x = gen_normal("x", mean = 0, sd = 1)
)
)
kable(balanced$data)| group_id | obs_id | x |
|---|---|---|
| 1 | 1 | 0.521 |
| 1 | 2 | -1.080 |
| 1 | 3 | 0.139 |
| 1 | 4 | -0.085 |
| 2 | 1 | -0.667 |
| 2 | 2 | -2.516 |
| 2 | 3 | -0.735 |
| 2 | 4 | -1.020 |
| 3 | 1 | 0.114 |
| 3 | 2 | -0.474 |
| 3 | 3 | -0.408 |
| 3 | 4 | -0.730 |
However, this was still effectively single level, normal data. We can
improve that. Here we give essentially the parameters of a random
intercept only multilevel model, which is used to generate matching
data. The fixed_intercept is the overall mean,
random_cov is the group-level variance, and
residual_var is the residual variance. The same approach
broadly applies to other generator families when level = “multilevel”.
We can generator a level 2 variable as well, which is constant within
groups but varies between groups.
balanced2 <- simulate_data(
n_groups = 3,
n_per_group = 4,
seed = 2026,
generators = list(
x = gen_normal("x", level = "multilevel",
fixed_intercept = 0, random_cov = 1, residual_var = .01),
y = gen_normal("y", level = "level2", mean = 0, sd = 1)
)
)
kable(balanced2$data)| group_id | obs_id | x | y |
|---|---|---|---|
| 1 | 1 | 0.512 | 1.347 |
| 1 | 2 | 0.454 | 1.347 |
| 1 | 3 | 0.269 | 1.347 |
| 1 | 4 | 0.447 | 1.347 |
| 2 | 1 | -1.182 | 0.616 |
| 2 | 2 | -1.068 | 0.616 |
| 2 | 3 | -1.127 | 0.616 |
| 2 | 4 | -1.121 | 0.616 |
| 3 | 1 | 0.066 | 0.218 |
| 3 | 2 | 0.117 | 0.218 |
| 3 | 3 | 0.117 | 0.218 |
| 3 | 4 | -0.115 | 0.218 |
Vectors specs create unbalanced designs.
unbalanced <- simulate_data(
n_groups = 4,
n_per_group = c(3, 5, 4, 2),
seed = 2026,
generators = list(
x = gen_normal("x", level = "multilevel",
fixed_intercept = 0, random_cov = 1, residual_var = .01),
y = gen_normal("y", level = "level2", mean = 0, sd = 1)
)
)
kable(unbalanced$data)| group_id | obs_id | x | y |
|---|---|---|---|
| 1 | 1 | 0.454 | -0.805 |
| 1 | 2 | 0.269 | -0.805 |
| 1 | 3 | 0.447 | -0.805 |
| 2 | 1 | -1.182 | 0.690 |
| 2 | 2 | -1.068 | 0.690 |
| 2 | 3 | -1.127 | 0.690 |
| 2 | 4 | -1.121 | 0.690 |
| 2 | 5 | -1.153 | 0.690 |
| 3 | 1 | 0.117 | -0.329 |
| 3 | 2 | 0.117 | -0.329 |
| 3 | 3 | -0.115 | -0.329 |
| 3 | 4 | 0.274 | -0.329 |
| 4 | 1 | -0.023 | -0.165 |
| 4 | 2 | -0.063 | -0.165 |
drawn_sizes <- simulate_data(
n_groups = 5,
n_per_group = list(
distribution = "poisson",
lambda = 4,
minimum = 2,
maximum = 6
),
seed = 2026,
generators = list(
x = gen_normal("x", level = "multilevel",
fixed_intercept = 0, random_cov = 1, residual_var = .01),
y = gen_normal("y", level = "level2", mean = 0, sd = 1)
)
)
kable(drawn_sizes$data)| group_id | obs_id | x | y |
|---|---|---|---|
| 1 | 1 | -1.994 | 0.339 |
| 1 | 2 | -2.261 | 0.339 |
| 1 | 3 | -2.169 | 0.339 |
| 1 | 4 | -1.998 | 0.339 |
| 1 | 5 | -2.112 | 0.339 |
| 2 | 1 | 0.995 | 1.207 |
| 2 | 2 | 0.977 | 1.207 |
| 2 | 3 | 1.111 | 1.207 |
| 2 | 4 | 0.992 | 1.207 |
| 3 | 1 | -0.010 | 0.571 |
| 3 | 2 | 0.185 | 0.571 |
| 4 | 1 | 0.495 | -1.686 |
| 4 | 2 | 0.488 | -1.686 |
| 4 | 3 | 0.616 | -1.686 |
| 5 | 1 | 1.015 | -0.838 |
| 5 | 2 | 0.848 | -0.838 |
| 5 | 3 | 0.976 | -0.838 |
| 5 | 4 | 0.958 | -0.838 |
Longitudinal designs can add a time index. When groups have different
lengths, time_truncate = TRUE uses the first values for
shorter groups. Add a time index may not matter, but it can be used for
creating lags, etc.
dated <- simulate_data(
n_groups = 3,
n_per_group = c(5, 3, 4),
time_id = "date",
time_values = as.Date("2026-01-01") + 0:4,
seed = 2026,
generators = list(
x = gen_normal("x", level = "multilevel",
fixed_intercept = 0, random_cov = 1, residual_var = .01),
y = gen_normal("y", level = "level2", mean = 0, sd = 1)
)
)
kable(dated$data)| group_id | obs_id | date | x | y |
|---|---|---|---|---|
| 1 | 1 | 2026-01-01 | 0.512 | 1.347 |
| 1 | 2 | 2026-01-02 | 0.454 | 1.347 |
| 1 | 3 | 2026-01-03 | 0.269 | 1.347 |
| 1 | 4 | 2026-01-04 | 0.447 | 1.347 |
| 1 | 5 | 2026-01-05 | 0.419 | 1.347 |
| 2 | 1 | 2026-01-01 | -1.068 | 0.616 |
| 2 | 2 | 2026-01-02 | -1.127 | 0.616 |
| 2 | 3 | 2026-01-03 | -1.121 | 0.616 |
| 3 | 1 | 2026-01-01 | 0.066 | 0.218 |
| 3 | 2 | 2026-01-02 | 0.117 | 0.218 |
| 3 | 3 | 2026-01-03 | 0.117 | 0.218 |
| 3 | 4 | 2026-01-04 | -0.115 | 0.218 |
Single-Level, Group-Level, and Multilevel Predictors
The level argument controls whether values vary by row,
by group, or by row with group-specific random effects.
levels_demo <- simulate_data(
n_groups = 4,
n_per_group = 4,
seed = 2026,
generators = list(
row_x = gen_normal("row_x", mean = 0, sd = 1),
group_x = gen_normal("group_x", level = "level2", mean = 10, sd = 1),
mixed_x = gen_normal(
"mixed_x",
level = "multilevel",
fixed_intercept = 5,
random_cov = 0.50,
residual_var = 1
)
)
)
levels_demo
#> <mlsim_data>
#> rows: 16
#> columns: 5 (generated: 3)
#> seed: 2026
#> grouping: group_id (4 groups; size min/median/max: 4/4/4)
#> generators: 3
#>
#> Generators:
#> generator distribution level vars
#> row_x normal single row_x
#> group_x normal level2 group_x
#> mixed_x normal multilevel mixed_xDistribution Families
Single-level generators use direct distribution parameters. Their multilevel forms use link-scale intercepts and random covariance terms. Here is a quick demonstration of various distribution families with single level data.
families <- simulate_data(
n = 10,
seed = 2026,
generators = list(
normal = gen_normal("normal", mean = 5, sd = 1),
successes = gen_binomial("successes", size = 5, prob = 0.40),
visits = gen_poisson("visits", lambda = 2),
events = gen_negbin("events", size = 3, mu = 2),
cost = gen_gamma("cost", shape = 4, mean = 20),
adherence = gen_beta("adherence", mean = 0.70, precision = 12)
)
)
kable(families$data)| obs_id | normal | successes | visits | events | cost | adherence |
|---|---|---|---|---|---|---|
| 1 | 5.52 | 2 | 4 | 1 | 10.63 | 0.875 |
| 2 | 3.92 | 2 | 1 | 0 | 36.47 | 0.822 |
| 3 | 5.14 | 1 | 3 | 5 | 38.06 | 0.657 |
| 4 | 4.92 | 0 | 0 | 4 | 41.29 | 0.664 |
| 5 | 4.33 | 2 | 2 | 2 | 33.78 | 0.500 |
| 6 | 2.48 | 1 | 2 | 4 | 10.91 | 0.766 |
| 7 | 4.26 | 2 | 1 | 0 | 24.79 | 0.897 |
| 8 | 3.98 | 1 | 2 | 2 | 13.94 | 0.841 |
| 9 | 5.11 | 0 | 3 | 1 | 14.41 | 0.568 |
| 10 | 4.53 | 2 | 2 | 2 | 8.27 | 0.846 |
Here is an example of a multilevel count generator.
link_scale_counts <- simulate_data(
n_groups = 3,
n_per_group = 5,
seed = 2026,
generators = list(
count = gen_poisson(
"count",
level = "multilevel",
fixed_intercept = log(2.5),
random_cov = 0.20
)
)
)
link_scale_counts
#> <mlsim_data>
#> rows: 15
#> columns: 3 (generated: 1)
#> seed: 2026
#> grouping: group_id (3 groups; size min/median/max: 5/5/5)
#> generators: 1
#>
#> Generators:
#> generator distribution level vars
#> count poisson multilevel countCategorical Variables
gen_categorical() supports binary variables, labeled
categories, ordered factors, character output, and integer codes.
categorical <- simulate_data(
n = 12,
seed = 2026,
generators = list(
binary_factor = gen_categorical("binary_factor", prob = 0.35),
education = gen_categorical(
"education",
categories = c("< bachelor", "bachelor", "> bachelor"),
prob = c(0.45, 0.35, 0.20),
ordered = TRUE
),
education_code = gen_categorical(
"education_code",
categories = c("arts", "science", "psychology"),
prob = c(0.45, 0.35, 0.20),
output = "integer"
)
)
)
kable(categorical$data)| obs_id | binary_factor | education | education_code |
|---|---|---|---|
| 1 | 1 | < bachelor | 0 |
| 2 | 0 | > bachelor | 0 |
| 3 | 0 | < bachelor | 0 |
| 4 | 0 | < bachelor | 0 |
| 5 | 0 | bachelor | 0 |
| 6 | 0 | < bachelor | 1 |
| 7 | 0 | < bachelor | 2 |
| 8 | 1 | < bachelor | 0 |
| 9 | 0 | < bachelor | 1 |
| 10 | 0 | < bachelor | 0 |
| 11 | 0 | < bachelor | 1 |
| 12 | 1 | < bachelor | 0 |
Multilevel categorical generators use baseline-category logits and optional group random effects. For 2 categories, it is basically a multilevel bernoulli. For more than 2 categoaries, it is a multilevel multinomial with the first category as the baseline. The random intercepts can have their own variances and covariances. A full random effect covariance matrix must be specified. Use 0s on the off diagonal for uncorrelated random intercepts.
categorical_ml <- simulate_data(
n_groups = 4,
n_per_group = 4,
seed = 2026,
generators = list(
education = gen_categorical(
"education",
level = "multilevel",
categories = c("< bachelor", "bachelor", "> bachelor"),
fixed_intercept = c("bachelor" = 0.15, "> bachelor" = -0.55),
random_cov = matrix(c(0.20, 0.04, 0.04, 0.12), nrow = 2),
ordered = TRUE
)
)
)
kable(categorical_ml$data)| group_id | obs_id | education |
|---|---|---|
| 1 | 1 | bachelor |
| 1 | 2 | < bachelor |
| 1 | 3 | < bachelor |
| 1 | 4 | < bachelor |
| 2 | 1 | bachelor |
| 2 | 2 | bachelor |
| 2 | 3 | < bachelor |
| 2 | 4 | < bachelor |
| 3 | 1 | bachelor |
| 3 | 2 | < bachelor |
| 3 | 3 | bachelor |
| 3 | 4 | < bachelor |
| 4 | 1 | < bachelor |
| 4 | 2 | bachelor |
| 4 | 3 | > bachelor |
| 4 | 4 | < bachelor |
Correlated and Compositional Predictors
gen_mvn() simulates correlated blocks. With
compositional = TRUE, the MVN columns are interpreted as
ILR coordinates and back-transformed into parts. Here we are back to
single level data.
correlated_comp <- simulate_data(
n = 8,
seed = 2026,
generators = list(
affect_block = gen_mvn(
c("affect", "energy"),
mean = c(0, 1),
cov = matrix(c(1.0, 0.5, 0.5, 1.2), nrow = 2)
),
time_use = gen_mvn(
c("ilr_1", "ilr_2"),
mean = c(0, 0),
cov = diag(c(0.30, 0.20)),
compositional = TRUE,
parts = c("sleep", "active", "sedentary"),
total = 24,
keep_ilr = TRUE
)
)
)
kable(correlated_comp$data)| obs_id | affect | energy | ilr_1 | ilr_2 | sleep | active | sedentary |
|---|---|---|---|---|---|---|---|
| 1 | 0.351 | 1.566 | -0.338 | -0.022 | 5.96 | 8.88 | 9.16 |
| 2 | -0.587 | -0.290 | -0.119 | -0.853 | 6.40 | 4.05 | 13.54 |
| 3 | 0.355 | 0.938 | 0.441 | -0.774 | 10.24 | 3.45 | 10.31 |
| 4 | 0.366 | 0.561 | -0.378 | -0.026 | 5.75 | 8.96 | 9.29 |
| 5 | -0.405 | 0.238 | 0.180 | -0.289 | 9.10 | 5.95 | 8.95 |
| 6 | -1.890 | -1.579 | 0.090 | -0.772 | 7.83 | 4.06 | 12.10 |
| 7 | 0.922 | -0.962 | 0.762 | 0.237 | 13.36 | 6.21 | 4.44 |
| 8 | -1.621 | 0.655 | -0.803 | -0.074 | 3.78 | 9.58 | 10.64 |
correlated_comp$generator_metadata$time_use$sbp
#> sleep active sedentary
#> balance_1 1 -1 -1
#> balance_2 0 1 -1
correlated_comp$generator_metadata$time_use$ilr_coordinate_map
#> ilr sbp_row positive_parts negative_parts
#> <char> <int> <AsIs> <AsIs>
#> 1: ilr_1 1 sleep active,sedentary
#> 2: ilr_2 2 active sedentaryUse keep_ilr = FALSE when you only want the original
parts in the output.
parts_only <- simulate_data(
n = 5,
seed = 2026,
generators = list(
time_use = gen_mvn(
c("z1", "z2"),
mean = c(0, 0),
cov = diag(2),
compositional = TRUE,
parts = c("sleep", "active", "sedentary"),
total = 24,
keep_ilr = FALSE
)
)
)
kable(parts_only$data)| obs_id | sleep | active | sedentary |
|---|---|---|---|
| 1 | 21.86 | 1.45 | 0.694 |
| 2 | 11.64 | 2.21 | 10.153 |
| 3 | 15.23 | 4.82 | 3.956 |
| 4 | 7.27 | 7.87 | 8.867 |
| 5 | 10.69 | 3.73 | 9.582 |
Location-Scale Predictors
For multilevel predictors, scale_fixed_intercept adds
group-varying scale parameters. For a normal generator this means log
residual SD; for negative binomial, gamma, and beta generators it
controls size, shape, and precision.
Here you can see in the metadata the group specific parameters that were generated.
location_scale <- simulate_data(
n_groups = 4,
n_per_group = 5,
seed = 2026,
generators = list(
symptom = gen_normal(
"symptom",
level = "multilevel",
fixed_intercept = 10,
scale_fixed_intercept = log(1.2),
random_cov = matrix(
c(1.00, 0.15,
0.15, 0.08),
nrow = 2
)
),
event_count = gen_negbin(
"event_count",
level = "multilevel",
fixed_intercept = log(2),
scale_fixed_intercept = log(6),
random_cov = matrix(
c(0.20, 0.03,
0.03, 0.06),
nrow = 2
)
)
)
)
kable(data.table(
group_id = location_scale$data$group_id,
symptom_residual_sd = location_scale$generator_metadata$symptom$residual_sd,
event_count_size = location_scale$generator_metadata$event_count$size
))| group_id | symptom_residual_sd | event_count_size |
|---|---|---|
| 1 | 1.29 | 5.99 |
| 1 | 1.29 | 5.99 |
| 1 | 1.29 | 5.99 |
| 1 | 1.29 | 5.99 |
| 1 | 1.29 | 5.99 |
| 2 | 2.57 | 4.75 |
| 2 | 2.57 | 4.75 |
| 2 | 2.57 | 4.75 |
| 2 | 2.57 | 4.75 |
| 2 | 2.57 | 4.75 |
| 3 | 1.39 | 6.04 |
| 3 | 1.39 | 6.04 |
| 3 | 1.39 | 6.04 |
| 3 | 1.39 | 6.04 |
| 3 | 1.39 | 6.04 |
| 4 | 1.54 | 4.54 |
| 4 | 1.54 | 4.54 |
| 4 | 1.54 | 4.54 |
| 4 | 1.54 | 4.54 |
| 4 | 1.54 | 4.54 |
Custom Generators
Custom generators receive a simulation context and return generated
data, column names, and optional metadata. They can use the active
design columns, previously generated variables, and any extra arguments
supplied to gen_custom(). Their metadata would be captured
in the same was as built in generators.
pilot_sampler <- function(context, vars, level, pilot_values) {
values <- sample(pilot_values, context$n_rows, replace = TRUE)
list(
data = data.frame(values = values),
names = vars,
metadata = list(source = "pilot bootstrap", n_pilot = length(pilot_values))
)
}
clinic_load_sampler <- function(context, vars, level) {
load_values <- c("low", "medium", "high")
group_load <- load_values[context$data[[context$group_id]]]
list(
data = data.frame(load = group_load),
names = vars,
metadata = list(source = "derived from group index")
)
}
custom <- simulate_data(
n_groups = 3,
n_per_group = 4,
seed = 2026,
generators = list(
pilot_x = gen_custom(
"pilot_x",
generator = pilot_sampler,
pilot_values = c(1.2, 1.4, 1.8, 2.5, 3.1)
),
clinic_load = gen_custom(
"clinic_load",
generator = clinic_load_sampler
)
)
)
kable(custom$data)| group_id | obs_id | pilot_x | clinic_load |
|---|---|---|---|
| 1 | 1 | 3.1 | low |
| 1 | 2 | 1.2 | low |
| 1 | 3 | 1.2 | low |
| 1 | 4 | 3.1 | low |
| 2 | 1 | 1.8 | medium |
| 2 | 2 | 2.5 | medium |
| 2 | 3 | 2.5 | medium |
| 2 | 4 | 3.1 | medium |
| 3 | 1 | 2.5 | high |
| 3 | 2 | 1.4 | high |
| 3 | 3 | 1.4 | high |
| 3 | 4 | 1.8 | high |
Simple Outcomes
gen_outcome() appends Gaussian outcomes after earlier
generators have added predictors. This example uses a fixed intercept
and slope. The formula specifies how the outcome depends on predictors.
You can see from fitting a linear regression afterwards that the
simulated coefficients are well recovered.
simple_outcome <- simulate_data(
n = 200,
seed = 2026,
generators = list(
dose = gen_normal("dose", mean = 0, sd = 1),
score = gen_outcome(
score ~ dose,
coefficients = c("(Intercept)" = 10, dose = 2),
residual_cov = 4
)
)
)
summary(lm(score ~ dose, data = simple_outcome$data))
#>
#> Call:
#> lm(formula = score ~ dose, data = simple_outcome$data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.092 -1.299 -0.012 1.233 4.288
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.025 0.138 72.6 <2e-16 ***
#> dose 2.015 0.141 14.3 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.95 on 198 degrees of freedom
#> Multiple R-squared: 0.509, Adjusted R-squared: 0.507
#> F-statistic: 205 on 1 and 198 DF, p-value: <2e-16Outcome Templates and Formula Helpers
outcome_template() runs through the same formula parsing
as gen_outcome() without adding outcome columns. It gives
named coefficient and covariance objects that can be edited safely.
This is important because as outcome models become more complicated, correctly specifying and naming all the necessary parameters becomes more difficult. The template approach allows you to first specify the model in a formula, then edit the parameters in a structured way, and finally generate the outcome with the edited parameters.
Here we will consider a much more complicated data simulation. We will generate two predictors, a level2 only treatment and a multilevel stress variable. We will then use an outcome template to specify a complicated outcome model with fixed effects, random effects, and lagged terms. The template will give us the correctly named coefficient and covariance objects that we can edit before generating the outcome.
A range of helper functions exist:
-
between(): group means of a predictor. -
within(): row values minus group means. -
lag1(): previous row value within groups.
These are used in formulae to specify further data management that
should occur on the predictors before they are used to generate the
outcome. For example, between(stress) creates a helper
column with group means of stress, and within(stress)
creates a helper column with row values minus group means. These helper
columns are then used in the outcome formula and have their own
coefficients.
f <- wellbeing ~
treatment +
between(stress) +
within(stress) +
lag1(stress) +
(1 + within(stress) | p | id)
helper_template_sim <- simulate_data(
n_groups = 4,
n_per_group = 5,
group_id = "id",
time_id = "day",
seed = 2026,
generators = list(
treatment = gen_categorical(
"treatment",
level = "level2",
prob = 0.5,
output = "integer"
),
stress = gen_normal(
"stress",
level = "multilevel",
fixed_intercept = 0,
random_cov = 0.20,
residual_var = 0.80
),
wellbeing_template = outcome_template(f)
)
)
helper_template <- helper_template_sim$generator_metadata$wellbeing_template
helper_template$coefficients
#> wellbeing
#> (Intercept) 0
#> treatment 0
#> between(stress) 0
#> within(stress) 0
#> lag1(stress) 0
rownames(helper_template$random_cov[["id::p"]])
#> [1] "wellbeing:(Intercept)" "wellbeing:within(stress)"
helper_template$fit_helpers
#> type source internal column
#> .mlsim_between_stress between stress .mlsim_between_stress stress_between
#> .mlsim_within_stress within stress .mlsim_within_stress stress_within
#> .mlsim_lag1_stress lag1 stress .mlsim_lag1_stress lag_stress
#> sim_term response_derived
#> .mlsim_between_stress between(stress) FALSE
#> .mlsim_within_stress within(stress) FALSE
#> .mlsim_lag1_stress lag1(stress) FALSENow we can use these objects and enter our desired parameter values with confidence that they are correctly named and structured.
helper_coef <- helper_template$coefficients
helper_coef["(Intercept)", "wellbeing"] <- 50
helper_coef["treatment", "wellbeing"] <- 2.0
helper_coef["between(stress)", "wellbeing"] <- -1.5
helper_coef["within(stress)", "wellbeing"] <- -0.8
helper_coef["lag1(stress)", "wellbeing"] <- -0.4
helper_residual_cov <- helper_template$residual_cov
helper_residual_cov[,] <- 1.5
helper_random_cov <- helper_template$random_cov
helper_block <- helper_random_cov[["id::p"]]
diag(helper_block) <- c(3.0, 0.30)
helper_random_cov[["id::p"]] <- helper_blockThe edited template objects can then be supplied to
gen_outcome(). Note how with a lag specified, there are
missing values for the outcome. This is because the lags are missing for
the first row of each group.
helper_sim <- simulate_data(
n_groups = 20,
n_per_group = 5,
group_id = "id",
time_id = "day",
seed = 2026,
generators = list(
treatment = gen_categorical(
"treatment",
level = "level2",
prob = 0.5,
output = "integer"
),
stress = gen_normal(
"stress",
level = "multilevel",
fixed_intercept = 0,
random_cov = 0.20,
residual_var = 0.80
),
wellbeing = gen_outcome(
f,
coefficients = helper_coef,
residual_cov = helper_residual_cov,
random_cov = helper_random_cov
)
)
)
kable(head(helper_sim$data, 20))| id | obs_id | day | treatment | stress | wellbeing | stress_between | stress_within | lag_stress |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | -0.656 | NA | -0.201 | -0.455 | NA |
| 1 | 2 | 2 | 1 | -0.034 | 51.7 | -0.201 | 0.167 | -0.656 |
| 1 | 3 | 3 | 1 | -0.410 | 53.6 | -0.201 | -0.210 | -0.034 |
| 1 | 4 | 4 | 1 | 0.115 | 51.7 | -0.201 | 0.316 | -0.410 |
| 1 | 5 | 5 | 1 | -0.019 | 52.3 | -0.201 | 0.181 | 0.115 |
| 2 | 1 | 1 | 1 | 0.715 | NA | -0.216 | 0.931 | NA |
| 2 | 2 | 2 | 1 | 0.204 | 44.8 | -0.216 | 0.420 | 0.715 |
| 2 | 3 | 3 | 1 | -1.124 | 49.0 | -0.216 | -0.908 | 0.204 |
| 2 | 4 | 4 | 1 | 0.190 | 49.6 | -0.216 | 0.406 | -1.124 |
| 2 | 5 | 5 | 1 | -1.064 | 48.5 | -0.216 | -0.848 | 0.190 |
| 3 | 1 | 1 | 0 | -1.134 | NA | -0.477 | -0.658 | NA |
| 3 | 2 | 2 | 0 | 0.597 | 50.3 | -0.477 | 1.073 | -1.134 |
| 3 | 3 | 3 | 0 | -1.176 | 50.7 | -0.477 | -0.700 | 0.597 |
| 3 | 4 | 4 | 0 | 0.175 | 50.5 | -0.477 | 0.652 | -1.176 |
| 3 | 5 | 5 | 0 | -0.845 | 50.4 | -0.477 | -0.368 | 0.175 |
| 4 | 1 | 1 | 0 | 1.167 | NA | 0.427 | 0.740 | NA |
| 4 | 2 | 2 | 0 | 0.536 | 50.9 | 0.427 | 0.108 | 1.167 |
| 4 | 3 | 3 | 0 | -0.461 | 51.3 | 0.427 | -0.888 | 0.536 |
| 4 | 4 | 4 | 0 | 0.614 | 50.2 | 0.427 | 0.187 | -0.461 |
| 4 | 5 | 5 | 0 | 0.280 | 48.8 | 0.427 | -0.147 | 0.614 |
Autoregressive Outcomes
In order to support time series and dynamics models, we support autoregressive effects in outcome generation.
Use ar1() inside an outcome formula to refer to the
previous generated value of the outcome, within each group. Using
autoregression, it is possible to generate only an outcome, with no
independent preditors generated. We will use an outcome template again
to help us specify the correct names and structure of parameters.
ar_template <- simulate_data(
n = 5,
time_id = "day",
seed = 2026,
generators = list(
mood_template = outcome_template(mood ~ ar1(), burnin = 5)
)
)$generator_metadata$mood_template
ar_coef <- ar_template$coefficients
ar_coef["(Intercept)", "mood"] <- 0
ar_coef["ar1(mood)", "mood"] <- 0.55
ar_residual_cov <- ar_template$residual_cov
ar_residual_cov[,] <- 0.40
ar_sim <- simulate_data(
n = 200,
time_id = "day",
seed = 2026,
generators = list(
mood = gen_outcome(
mood ~ ar1(),
coefficients = ar_coef,
residual_cov = ar_residual_cov,
burnin = 5
)
)
)
kable(head(ar_sim$data, 10))| obs_id | day | mood | lag_mood |
|---|---|---|---|
| 1 | 1 | -0.138 | NA |
| 2 | 2 | -0.759 | -0.138 |
| 3 | 3 | -0.329 | -0.759 |
| 4 | 4 | -0.235 | -0.329 |
| 5 | 5 | -0.551 | -0.235 |
| 6 | 6 | -1.894 | -0.551 |
| 7 | 7 | -1.507 | -1.894 |
| 8 | 8 | -1.474 | -1.507 |
| 9 | 9 | -0.739 | -1.474 |
| 10 | 10 | -0.706 | -0.739 |
Fitting a model to the simulated data shows that the autoregressive coefficient is well recovered.
summary(lm(mood ~ lag_mood, data = ar_sim$data))
#>
#> Call:
#> lm(formula = mood ~ lag_mood, data = ar_sim$data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.6081 -0.4578 -0.0056 0.3828 1.6566
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.00549 0.04433 0.12 0.9
#> lag_mood 0.56743 0.05880 9.65 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.625 on 197 degrees of freedom
#> (1 observation deleted due to missingness)
#> Multiple R-squared: 0.321, Adjusted R-squared: 0.318
#> F-statistic: 93.1 on 1 and 197 DF, p-value: <2e-16Compositional Outcomes
Outcomes can also be generated on ILR coordinates and returned as composition parts. Here the parts sum to 24 for each row.
comp_formula <- mvbind(comp_ilr_1, comp_ilr_2) ~ stress
comp_parts <- c("sleep", "active", "sedentary")
comp_template <- simulate_data(
n = 10,
seed = 2026,
generators = list(
stress = gen_normal(
"stress",
mean = 0,
sd = 1
),
time_use_template = outcome_template(
comp_formula,
compositional = TRUE,
parts = comp_parts,
total = 24
)
)
)$generator_metadata$time_use_template
comp_coef <- comp_template$coefficients
comp_coef["(Intercept)", ] <- c(-0.30, 0.20)
comp_coef["stress", ] <- c(-0.05, 0.04)
comp_residual_cov <- comp_template$residual_cov
comp_residual_cov[,] <- matrix(
c(0.12, 0.02,
0.02, 0.10),
nrow = 2
)
comp_sim <- simulate_data(
n = 10,
seed = 2026,
generators = list(
stress = gen_normal(
"stress",
mean = 0,
sd = 1
),
time_use = gen_outcome(
comp_formula,
coefficients = comp_coef,
residual_cov = comp_residual_cov,
compositional = TRUE,
parts = comp_parts,
total = 24
)
)
)
kable(comp_sim$data)| obs_id | stress | comp_ilr_1 | comp_ilr_2 | sleep | active | sedentary |
|---|---|---|---|---|---|---|
| 1 | 0.521 | -0.251 | 0.382 | 6.28 | 11.19 | 6.52 |
| 2 | -1.080 | -0.046 | 0.338 | 7.56 | 10.15 | 6.29 |
| 3 | 0.139 | -0.455 | 0.598 | 4.99 | 13.30 | 5.71 |
| 4 | -0.085 | 0.002 | -0.129 | 7.99 | 7.27 | 8.73 |
| 5 | -0.667 | 0.529 | 0.648 | 11.12 | 9.20 | 3.68 |
| 6 | -2.516 | -0.294 | -0.639 | 5.76 | 5.26 | 12.98 |
| 7 | -0.735 | -0.185 | -0.383 | 6.67 | 6.37 | 10.96 |
| 8 | -1.020 | -0.307 | 0.103 | 6.12 | 9.59 | 8.29 |
| 9 | 0.114 | 0.044 | 0.196 | 8.24 | 8.97 | 6.80 |
| 10 | -0.474 | -0.221 | -0.385 | 6.45 | 6.44 | 11.11 |
Location-Scale Outcomes
For outcomes, scale_formula models the log residual SD.
Residual correlations are supplied separately through
residual_cor.
scale_formula <- sigma ~ treatment + (1 | p | participant_id)
scale_outcome_formula <- wellbeing ~ treatment + stress + (1 | p | participant_id)
scale_template <- simulate_data(
n_groups = 4,
n_per_group = 5,
group_id = "participant_id",
seed = 2026,
generators = list(
treatment = gen_categorical(
"treatment",
level = "level2",
prob = 0.5,
output = "integer"
),
stress = gen_normal(
"stress",
level = "multilevel",
fixed_intercept = 0,
random_cov = 0.20,
residual_var = 0.80
),
wellbeing_template = outcome_template(
scale_outcome_formula,
scale_formula = scale_formula
)
)
)$generator_metadata$wellbeing_template
scale_coef <- scale_template$coefficients
scale_coef["(Intercept)", "wellbeing"] <- 50
scale_coef["treatment", "wellbeing"] <- 2
scale_coef["stress", "wellbeing"] <- -1
scale_sd_coef <- scale_template$scale_coefficients
scale_sd_coef["(Intercept)", "wellbeing"] <- log(1.0)
scale_sd_coef["treatment", "wellbeing"] <- log(1.4)
cov_coef <- scale_template$random_cov
cov_coef[["participant_id::p"]][, ] <- diag(c(3.0, 0.30))
scale_sim <- simulate_data(
n_groups = 4,
n_per_group = 5,
group_id = "participant_id",
seed = 2026,
generators = list(
treatment = gen_categorical(
"treatment",
level = "level2",
prob = 0.5,
output = "integer"
),
stress = gen_normal(
"stress",
level = "multilevel",
fixed_intercept = 0,
random_cov = 0.20,
residual_var = 0.80
),
wellbeing = gen_outcome(
scale_outcome_formula,
coefficients = scale_coef,
scale_formula = scale_formula,
scale_coefficients = scale_sd_coef,
random_cov = cov_coef,
residual_cor = matrix(
1,
nrow = 1,
dimnames = list("wellbeing", "wellbeing")
)
)
)
)
scale_sim$data
#> participant_id obs_id treatment stress wellbeing
#> <int> <int> <int> <num> <num>
#> 1: 1 1 1 -0.595 49.9
#> 2: 1 2 1 -0.850 52.0
#> 3: 1 3 1 0.164 49.9
#> 4: 1 4 1 -0.362 47.7
#> 5: 1 5 1 -0.303 50.4
#> 6: 2 1 1 -0.691 51.5
#> 7: 2 2 1 -0.236 50.7
#> 8: 2 3 1 -0.240 53.1
#> 9: 2 4 1 -2.316 52.7
#> 10: 2 5 1 1.167 51.1
#> 11: 3 1 0 0.253 47.7
#> 12: 3 2 0 -0.104 50.6
#> 13: 3 3 0 -1.018 50.7
#> 14: 3 4 0 0.319 48.1
#> 15: 3 5 0 -0.592 50.4
#> 16: 4 1 0 -1.273 48.6
#> 17: 4 2 0 -2.370 48.4
#> 18: 4 3 0 0.186 46.7
#> 19: 4 4 0 -1.082 47.3
#> 20: 4 5 0 0.581 46.8
#> participant_id obs_id treatment stress wellbeing
#> <int> <int> <int> <num> <num>Preparing Simulated Outcomes for Later Fitting
prepare_outcome_fit() does not fit a model. It prepares
the simulated data, formula, helper columns, and term maps for a later
fitting workflow.
fit_prep <- prepare_outcome_fit(helper_sim, outcome = "wellbeing")
str(fit_prep)
#> List of 12
#> $ target : chr "generic"
#> $ outcome : chr "wellbeing"
#> $ data :Classes 'data.table' and 'data.frame': 80 obs. of 9 variables:
#> ..$ id : int [1:80] 1 1 1 1 2 2 2 2 3 3 ...
#> ..$ obs_id : int [1:80] 2 3 4 5 2 3 4 5 2 3 ...
#> ..$ day : int [1:80] 2 3 4 5 2 3 4 5 2 3 ...
#> ..$ treatment : int [1:80] 1 1 1 1 1 1 1 1 0 0 ...
#> ..$ stress : num [1:80] -0.0337 -0.4104 0.1151 -0.0194 0.2042 ...
#> ..$ wellbeing : num [1:80] 51.7 53.6 51.7 52.3 44.8 ...
#> ..$ stress_between: num [1:80] -0.201 -0.201 -0.201 -0.201 -0.216 ...
#> ..$ stress_within : num [1:80] 0.167 -0.21 0.316 0.181 0.42 ...
#> ..$ lag_stress : num [1:80] -0.6557 -0.0337 -0.4104 0.1151 0.715 ...
#> ..- attr(*, ".internal.selfref")=<pointer: 0x55d4fa2c19d0>
#> $ formula :Class 'formula' language wellbeing ~ treatment + stress_between + stress_within + lag_stress + (1 + stress_within | p | id)
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ fixed_formula :Class 'formula' language wellbeing ~ treatment + stress_between + stress_within + lag_stress
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ term_map :'data.frame': 5 obs. of 3 variables:
#> ..$ component: chr [1:5] "fixed" "fixed" "fixed" "fixed" ...
#> ..$ sim_term : chr [1:5] "(Intercept)" "treatment" "between(stress)" "within(stress)" ...
#> ..$ fit_term : chr [1:5] "(Intercept)" "treatment" "stress_between" "stress_within" ...
#> $ helper_columns : chr [1:3] "stress_between" "stress_within" "lag_stress"
#> $ scale_formula : NULL
#> $ scale_term_map : NULL
#> $ scale_helper_columns: chr(0)
#> $ residual_cor : NULL
#> $ complete_rows : logi [1:100] FALSE TRUE TRUE TRUE TRUE FALSE ...
#> - attr(*, "class")= chr "mlsim_fit_prep"For compositional outcomes, prepare_outcome_fit()
automatically targets the multilevelcoda preparation
path.
Multilevel Dynamic Compositional Outcomes
Now we will look at a much more complicated example. We have a 3 part compositional outcome of physical activity: MVPA, light PA, and sedentary time. It is a dynamic, so a time series. Imagine multiple participants measured with accelerometry for multiple days. Participants also are part of an intervention or control group, and stress was measured daily. We want to simulate data that reflects the following model:
pa_formula <- mvbind(pa_ilr_1, pa_ilr_2) ~
intervention + stress + ar1() * intervention +
(1 + stress + ar1() | p | id)Build a Template
We can build a template with the formula to get the correctly named coefficient and covariance objects.
template_sim <- simulate_data(
n_groups = 4,
n_per_group = 4,
group_id = "id",
time_id = "day",
seed = 2026,
generators = list(
intervention = gen_categorical(
"intervention",
level = "level2",
prob = 0.5,
output = "integer"
),
stress = gen_normal(
"stress",
level = "multilevel",
fixed_intercept = 0,
random_cov = 0.20,
residual_var = 0.80
),
pa_template = outcome_template(
pa_formula,
compositional = TRUE,
parts = c("mvpa", "light_pa", "sedentary")
)
)
)
pa_template <- template_sim$generator_metadata$pa_template
kable(pa_template$coefficients)| pa_ilr_1 | pa_ilr_2 | |
|---|---|---|
| (Intercept) | 0 | 0 |
| intervention | 0 | 0 |
| stress | 0 | 0 |
| ar1(pa_ilr_1) | 0 | 0 |
| ar1(pa_ilr_2) | 0 | 0 |
| intervention:ar1(pa_ilr_1) | 0 | 0 |
| intervention:ar1(pa_ilr_2) | 0 | 0 |
There are multiple random effects. The random effects are all correlated, so we need a full covariance matrix for the participant random effects. The row names of the covariance matrix show the correct naming structure for the random effects.
rownames(pa_template$random_cov[["id::p"]])
#> [1] "pa_ilr_1:(Intercept)" "pa_ilr_1:stress" "pa_ilr_1:ar1(pa_ilr_1)"
#> [4] "pa_ilr_1:ar1(pa_ilr_2)" "pa_ilr_2:(Intercept)" "pa_ilr_2:stress"
#> [7] "pa_ilr_2:ar1(pa_ilr_1)" "pa_ilr_2:ar1(pa_ilr_2)"Fill in Plausible Values
The fixed coefficients below encode a modest intervention shift, a small stress effect, and positive autocorrelation. The interaction rows make autocorrelation weaker in the intervention group.
pa_coef <- pa_template$coefficients
pa_coef["(Intercept)", ] <- c(-0.30, 0.20)
pa_coef["intervention", ] <- c(0.15, -0.10)
pa_coef["stress", ] <- c(-0.08, 0.05)
pa_coef["ar1(pa_ilr_1)", ] <- c(0.35, 0.08)
pa_coef["ar1(pa_ilr_2)", ] <- c(0.04, 0.30)
pa_coef["intervention:ar1(pa_ilr_1)", ] <- c(-0.15, -0.02)
pa_coef["intervention:ar1(pa_ilr_2)", ] <- c(-0.01, -0.12)
kable(pa_coef)| pa_ilr_1 | pa_ilr_2 | |
|---|---|---|
| (Intercept) | -0.30 | 0.20 |
| intervention | 0.15 | -0.10 |
| stress | -0.08 | 0.05 |
| ar1(pa_ilr_1) | 0.35 | 0.08 |
| ar1(pa_ilr_2) | 0.04 | 0.30 |
| intervention:ar1(pa_ilr_1) | -0.15 | -0.02 |
| intervention:ar1(pa_ilr_2) | -0.01 | -0.12 |
There are several covariances. The residuals can also covary. We allow a small residual covariance.
pa_residual_cov <- pa_template$residual_cov
pa_residual_cov[,] <- matrix(
c(0.22, 0.04,
0.04, 0.18),
nrow = 2
)
kable(pa_residual_cov)| pa_ilr_1 | pa_ilr_2 | |
|---|---|---|
| pa_ilr_1 | 0.22 | 0.04 |
| pa_ilr_2 | 0.04 | 0.18 |
For simplicity, we start with all the random effects being
uncorrelated, which we accomplish using diag() to set the
covariance matrix. We then edit in a few selected non zero covariances
we want.
pa_random_cov <- pa_template$random_cov
participant_cov <- pa_random_cov[["id::p"]]
diag(participant_cov) <- c(
0.080, # pa_ilr_1 random intercept
0.015, # pa_ilr_1 stress slope
0.001, # pa_ilr_1 ar1(pa_ilr_1) slope
0.001, # pa_ilr_1 ar1(pa_ilr_2) slope
0.060, # pa_ilr_2 random intercept
0.012, # pa_ilr_2 stress slope
0.001, # pa_ilr_2 ar1(pa_ilr_1) slope
0.001 # pa_ilr_2 ar1(pa_ilr_2) slope
)
participant_cov["pa_ilr_1:(Intercept)", "pa_ilr_2:(Intercept)"] <- 0.025
participant_cov["pa_ilr_2:(Intercept)", "pa_ilr_1:(Intercept)"] <- 0.025
participant_cov["pa_ilr_1:stress", "pa_ilr_2:stress"] <- 0.004
participant_cov["pa_ilr_2:stress", "pa_ilr_1:stress"] <- 0.004
pa_random_cov[["id::p"]] <- participant_cov
kable(participant_cov)| pa_ilr_1:(Intercept) | pa_ilr_1:stress | pa_ilr_1:ar1(pa_ilr_1) | pa_ilr_1:ar1(pa_ilr_2) | pa_ilr_2:(Intercept) | pa_ilr_2:stress | pa_ilr_2:ar1(pa_ilr_1) | pa_ilr_2:ar1(pa_ilr_2) | |
|---|---|---|---|---|---|---|---|---|
| pa_ilr_1:(Intercept) | 0.080 | 0.000 | 0.000 | 0.000 | 0.025 | 0.000 | 0.000 | 0.000 |
| pa_ilr_1:stress | 0.000 | 0.015 | 0.000 | 0.000 | 0.000 | 0.004 | 0.000 | 0.000 |
| pa_ilr_1:ar1(pa_ilr_1) | 0.000 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| pa_ilr_1:ar1(pa_ilr_2) | 0.000 | 0.000 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 |
| pa_ilr_2:(Intercept) | 0.025 | 0.000 | 0.000 | 0.000 | 0.060 | 0.000 | 0.000 | 0.000 |
| pa_ilr_2:stress | 0.000 | 0.004 | 0.000 | 0.000 | 0.000 | 0.012 | 0.000 | 0.000 |
| pa_ilr_2:ar1(pa_ilr_1) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.000 |
| pa_ilr_2:ar1(pa_ilr_2) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 |
Simulate the Outcome
The outcome generator uses the same formula as the template and receives the edited coefficient and covariance objects. The ILR coordinates are kept, and the composition parts are back-transformed to sum to 16 hours of waking time. The burnin get’s used for autoregressive outcomes to avoid starting values driving the results.
pa_sim <- simulate_data(
n_groups = 100,
n_per_group = 14,
group_id = "id",
time_id = "day",
seed = 2027,
generators = list(
intervention = gen_categorical(
"intervention",
level = "level2",
prob = 0.5,
output = "integer"
),
stress = gen_normal(
"stress",
level = "multilevel",
fixed_intercept = 0,
random_cov = 0.20,
residual_var = 0.80
),
pa = gen_outcome(
pa_formula,
coefficients = pa_coef,
residual_cov = pa_residual_cov,
random_cov = pa_random_cov,
compositional = TRUE,
total = 16,
parts = c("mvpa", "light_pa", "sedentary"),
burnin = 50
)
)
)
kable(head(pa_sim$data, 10))| id | obs_id | day | intervention | stress | pa_ilr_1 | pa_ilr_2 | mvpa | light_pa | sedentary | lag_pa_ilr_1 | lag_pa_ilr_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | -1.054 | -0.561 | 0.482 | 3.07 | 8.59 | 4.34 | NA | NA |
| 1 | 2 | 2 | 0 | -1.193 | -0.454 | 0.857 | 3.11 | 9.94 | 2.96 | -0.561 | 0.482 |
| 1 | 3 | 3 | 0 | 1.003 | -0.714 | -0.119 | 2.75 | 6.07 | 7.18 | -0.454 | 0.857 |
| 1 | 4 | 4 | 0 | -0.626 | 0.033 | -0.366 | 5.36 | 3.98 | 6.67 | -0.714 | -0.119 |
| 1 | 5 | 5 | 0 | 0.358 | -0.395 | -0.314 | 3.70 | 4.81 | 7.49 | 0.033 | -0.366 |
| 1 | 6 | 6 | 0 | -0.058 | -0.610 | 0.191 | 3.04 | 7.35 | 5.61 | -0.395 | -0.314 |
| 1 | 7 | 7 | 0 | -0.166 | -1.102 | 0.062 | 1.83 | 7.39 | 6.77 | -0.610 | 0.191 |
| 1 | 8 | 8 | 0 | -0.237 | -0.853 | 0.918 | 2.02 | 10.98 | 3.00 | -1.102 | 0.062 |
| 1 | 9 | 9 | 0 | -1.007 | -0.579 | -0.788 | 2.80 | 3.26 | 9.94 | -0.853 | 0.918 |
| 1 | 10 | 10 | 0 | -2.685 | -1.003 | 0.478 | 1.95 | 9.32 | 4.74 | -0.579 | -0.788 |
Prepare a multilevelcoda Object
The simulator already kept the ILR coordinates used to generate the
outcome, but brmcoda() fits the ILR variables created by
multilevelcoda::complr().
prepare_outcome_fit() creates that object, adds the
formula-required lagged ILR columns, and drops the first row per
participant where lagged outcomes are undefined. In this example, the
model ILR variables are named z1_1 and
z2_1.
pa_fit_prep <- prepare_outcome_fit(pa_sim, outcome = "pa")
pa_complr_model <- pa_fit_prep$complr
pa_fit_prep$formula
#> mvbind(z1_1, z2_1) ~ intervention + stress + (lag_z1_1 + lag_z2_1) *
#> intervention + (1 + stress + (lag_z1_1 + lag_z2_1) | p |
#> id)
get_variables(pa_complr_model)
#> $composition_1
#> $composition_1$X
#> [1] "tmvpa" "tlight_pa" "tsedentary"
#>
#> $composition_1$bX
#> [1] "bmvpa" "blight_pa" "bsedentary"
#>
#> $composition_1$wX
#> [1] "wmvpa" "wlight_pa" "wsedentary"
#>
#> $composition_1$Z
#> [1] "z1_1" "z2_1"
#>
#> $composition_1$bZ
#> [1] "bz1_1" "bz2_1"
#>
#> $composition_1$wZ
#> [1] "wz1_1" "wz2_1"
head(pa_fit_prep$data)
#> id obs_id day intervention stress pa_ilr_1 pa_ilr_2 mvpa light_pa
#> <int> <int> <int> <int> <num> <num> <num> <num> <num>
#> 1: 1 2 2 0 -1.1934 -0.4545 0.857 3.11 9.94
#> 2: 1 3 3 0 1.0030 -0.7141 -0.119 2.75 6.07
#> 3: 1 4 4 0 -0.6262 0.0327 -0.366 5.36 3.98
#> 4: 1 5 5 0 0.3581 -0.3951 -0.314 3.70 4.81
#> 5: 1 6 6 0 -0.0581 -0.6102 0.191 3.04 7.35
#> 6: 1 7 7 0 -0.1657 -1.1022 0.062 1.83 7.39
#> sedentary lag_pa_ilr_1 lag_pa_ilr_2 tmvpa tlight_pa tsedentary bmvpa
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 2.96 -0.5611 0.482 3.11 9.94 2.96 3.39
#> 2: 7.18 -0.4545 0.857 2.75 6.07 7.18 3.39
#> 3: 6.67 -0.7141 -0.119 5.36 3.98 6.67 3.39
#> 4: 7.49 0.0327 -0.366 3.70 4.81 7.49 3.39
#> 5: 5.61 -0.3951 -0.314 3.04 7.35 5.61 3.39
#> 6: 6.77 -0.6102 0.191 1.83 7.39 6.77 3.39
#> blight_pa bsedentary wmvpa wlight_pa wsedentary z1_1 z2_1 bz1_1 bz2_1
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 6.96 5.65 0.319 0.498 0.182 -0.4545 0.857 -0.502 0.147
#> 2: 6.96 5.65 0.275 0.295 0.430 -0.7141 -0.119 -0.502 0.147
#> 3: 6.96 5.65 0.474 0.172 0.354 0.0327 -0.366 -0.502 0.147
#> 4: 6.96 5.65 0.351 0.222 0.427 -0.3951 -0.314 -0.502 0.147
#> 5: 6.96 5.65 0.304 0.359 0.337 -0.6102 0.191 -0.502 0.147
#> 6: 6.96 5.65 0.193 0.379 0.428 -1.1022 0.062 -0.502 0.147
#> wz1_1 wz2_1 lag_z1_1 lag_z2_1
#> <num> <num> <num> <num>
#> 1: 0.0474 0.7104 -0.5611 0.482
#> 2: -0.2122 -0.2657 -0.4545 0.857
#> 3: 0.5346 -0.5125 -0.7141 -0.119
#> 4: 0.1068 -0.4604 0.0327 -0.366
#> 5: -0.1083 0.0445 -0.3951 -0.314
#> 6: -0.6003 -0.0848 -0.6102 0.191Fit With brms Through multilevelcoda
The prepared formula mirrors the simulation formula using the
concrete ILR names created by complr(). The shared
random-effect ID p allows the random intercepts, stress
slopes, and all AR slopes to correlate across both ILR outcomes,
matching the simulation structure.
This chunk compiles and samples a Stan model, so it is intentionally
the slow part of the vignette. The model uses the cmdstanr
backend. The meanfield algorithm is used to speed up
sampling, but it is less accurate than full HMC. Adjust the algorithm
and sampling parameters as needed for your use case. We only used
meanfield here to keep the vignette fast and illustrate
analysis.
pa_fit <- brmcoda(
complr = pa_complr_model,
formula = brms::bf(pa_fit_prep$formula) + brms::set_rescor(TRUE),
backend = "cmdstanr",
algorithm = "meanfield",
iter = 2000,
seed = 2028
)
#> Start sampling
#> ------------------------------------------------------------
#> EXPERIMENTAL ALGORITHM:
#> This procedure has not been thoroughly tested and may be unstable
#> or buggy. The interface is subject to change.
#> ------------------------------------------------------------
#> Gradient evaluation took 0.000779 seconds
#> 1000 transitions using 10 leapfrog steps per transition would take 7.79 seconds.
#> Adjust your expectations accordingly!
#> Begin eta adaptation.
#> Iteration: 1 / 250 [ 0%] (Adaptation)
#> Iteration: 50 / 250 [ 20%] (Adaptation)
#> Iteration: 100 / 250 [ 40%] (Adaptation)
#> Iteration: 150 / 250 [ 60%] (Adaptation)
#> Iteration: 200 / 250 [ 80%] (Adaptation)
#> Iteration: 250 / 250 [100%] (Adaptation)
#> Success! Found best value [eta = 0.1].
#> Begin stochastic gradient ascent.
#> iter ELBO delta_ELBO_mean delta_ELBO_med notes
#> 100 -1118856.826 1.000 1.000
#> 200 -562024.992 0.995 1.000
#> 300 -120516.538 2.327 3.663
#> 400 -71520.188 2.174 3.663
#> 500 -15366.487 2.170 3.654
#> 600 -8533.643 2.227 3.654
#> 700 -6585.525 0.548 0.801
#> 800 -5091.785 0.295 0.296
#> 900 -4596.913 0.201 0.293
#> 1000 -3872.725 0.147 0.187
#> 1100 -3508.609 0.145 0.187
#> 1200 -3178.075 0.104 0.104
#> 1300 -2936.816 0.093 0.104
#> 1400 -2731.047 0.079 0.082
#> 1500 -2578.067 0.067 0.075
#> 1600 -2458.585 0.054 0.059
#> 1700 -2362.959 0.045 0.049
#> 1800 -2282.281 0.038 0.040
#> 1900 -2215.249 0.033 0.035
#> 2000 -2166.098 0.026 0.030
#> Informational Message: The maximum number of iterations is reached! The algorithm may not have converged.
#> This variational approximation is not guaranteed to be meaningful.
#> Drawing a sample of size 1000 from the approximate posterior...
#> COMPLETED.
#> Finished in 1.6 seconds.
summary(pa_fit)
#> Family: MV(gaussian, gaussian)
#> Links: mu = identity
#> mu = identity
#> Formula: z1_1 ~ intervention + stress + (lag_z1_1 + lag_z2_1) * intervention + (1 + stress + (lag_z1_1 + lag_z2_1) | p | id)
#> z2_1 ~ intervention + stress + (lag_z1_1 + lag_z2_1) * intervention + (1 + stress + (lag_z1_1 + lag_z2_1) | p | id)
#> Data: complr$dataout (Number of observations: 1300)
#> Draws: 1 chains, each with iter = 1000; warmup = 0; thin = 1;
#> total post-warmup draws = 1000
#>
#> Multilevel Hyperparameters:
#> ~id (Number of levels: 100)
#> Estimate Est.Error l-95% CI u-95% CI Rhat
#> sd(z11_Intercept) 0.09 0.03 0.05 0.16 1.00
#> sd(z11_stress) 0.19 0.04 0.12 0.28 1.00
#> sd(z11_lag_z1_1) 0.11 0.03 0.07 0.18 1.00
#> sd(z11_lag_z2_1) 0.13 0.03 0.08 0.18 1.00
#> sd(z21_Intercept) 0.05 0.01 0.03 0.08 1.00
#> sd(z21_stress) 0.04 0.02 0.02 0.09 1.00
#> sd(z21_lag_z1_1) 0.05 0.02 0.02 0.10 1.00
#> sd(z21_lag_z2_1) 0.06 0.03 0.02 0.14 1.00
#> cor(z11_Intercept,z11_stress) 0.48 0.14 0.17 0.71 1.00
#> cor(z11_Intercept,z11_lag_z1_1) 0.44 0.24 -0.12 0.80 1.00
#> cor(z11_stress,z11_lag_z1_1) 0.36 0.27 -0.24 0.78 1.00
#> cor(z11_Intercept,z11_lag_z2_1) 0.17 0.28 -0.40 0.65 1.00
#> cor(z11_stress,z11_lag_z2_1) 0.68 0.18 0.25 0.92 1.00
#> cor(z11_lag_z1_1,z11_lag_z2_1) -0.03 0.32 -0.63 0.58 1.01
#> cor(z11_Intercept,z21_Intercept) 0.18 0.42 -0.66 0.83 1.00
#> cor(z11_stress,z21_Intercept) 0.11 0.41 -0.68 0.78 1.00
#> cor(z11_lag_z1_1,z21_Intercept) 0.27 0.32 -0.40 0.78 1.00
#> cor(z11_lag_z2_1,z21_Intercept) -0.10 0.37 -0.77 0.60 1.00
#> cor(z11_Intercept,z21_stress) 0.19 0.52 -0.81 0.94 1.00
#> cor(z11_stress,z21_stress) -0.06 0.42 -0.82 0.74 1.00
#> cor(z11_lag_z1_1,z21_stress) 0.07 0.38 -0.68 0.74 1.00
#> cor(z11_lag_z2_1,z21_stress) -0.10 0.37 -0.75 0.60 1.00
#> cor(z21_Intercept,z21_stress) 0.07 0.36 -0.64 0.70 1.00
#> cor(z11_Intercept,z21_lag_z1_1) -0.32 0.36 -0.85 0.48 1.00
#> cor(z11_stress,z21_lag_z1_1) -0.12 0.39 -0.81 0.64 1.00
#> cor(z11_lag_z1_1,z21_lag_z1_1) -0.29 0.31 -0.81 0.34 1.00
#> cor(z11_lag_z2_1,z21_lag_z1_1) 0.10 0.35 -0.58 0.75 1.00
#> cor(z21_Intercept,z21_lag_z1_1) -0.14 0.34 -0.76 0.53 1.00
#> cor(z21_stress,z21_lag_z1_1) -0.09 0.37 -0.75 0.66 1.00
#> cor(z11_Intercept,z21_lag_z2_1) 0.11 0.41 -0.65 0.79 1.00
#> cor(z11_stress,z21_lag_z2_1) -0.19 0.40 -0.84 0.61 1.00
#> cor(z11_lag_z1_1,z21_lag_z2_1) -0.23 0.35 -0.81 0.51 1.00
#> cor(z11_lag_z2_1,z21_lag_z2_1) -0.04 0.38 -0.69 0.70 1.00
#> cor(z21_Intercept,z21_lag_z2_1) -0.14 0.36 -0.77 0.60 1.00
#> cor(z21_stress,z21_lag_z2_1) 0.06 0.37 -0.64 0.74 1.01
#> cor(z21_lag_z1_1,z21_lag_z2_1) 0.03 0.36 -0.65 0.72 1.00
#> Bulk_ESS Tail_ESS
#> sd(z11_Intercept) 976 972
#> sd(z11_stress) 995 978
#> sd(z11_lag_z1_1) 834 904
#> sd(z11_lag_z2_1) 837 908
#> sd(z21_Intercept) 1042 1058
#> sd(z21_stress) 995 1023
#> sd(z21_lag_z1_1) 1060 682
#> sd(z21_lag_z2_1) 886 932
#> cor(z11_Intercept,z11_stress) 933 1061
#> cor(z11_Intercept,z11_lag_z1_1) 1009 939
#> cor(z11_stress,z11_lag_z1_1) 942 952
#> cor(z11_Intercept,z11_lag_z2_1) 1116 878
#> cor(z11_stress,z11_lag_z2_1) 1104 978
#> cor(z11_lag_z1_1,z11_lag_z2_1) 1098 955
#> cor(z11_Intercept,z21_Intercept) 903 949
#> cor(z11_stress,z21_Intercept) 973 906
#> cor(z11_lag_z1_1,z21_Intercept) 1127 1025
#> cor(z11_lag_z2_1,z21_Intercept) 1008 918
#> cor(z11_Intercept,z21_stress) 886 943
#> cor(z11_stress,z21_stress) 1024 1026
#> cor(z11_lag_z1_1,z21_stress) 744 975
#> cor(z11_lag_z2_1,z21_stress) 940 983
#> cor(z21_Intercept,z21_stress) 999 978
#> cor(z11_Intercept,z21_lag_z1_1) 939 905
#> cor(z11_stress,z21_lag_z1_1) 1073 1110
#> cor(z11_lag_z1_1,z21_lag_z1_1) 1117 983
#> cor(z11_lag_z2_1,z21_lag_z1_1) 1068 872
#> cor(z21_Intercept,z21_lag_z1_1) 766 937
#> cor(z21_stress,z21_lag_z1_1) 879 1033
#> cor(z11_Intercept,z21_lag_z2_1) 1072 988
#> cor(z11_stress,z21_lag_z2_1) 1023 963
#> cor(z11_lag_z1_1,z21_lag_z2_1) 973 728
#> cor(z11_lag_z2_1,z21_lag_z2_1) 1132 960
#> cor(z21_Intercept,z21_lag_z2_1) 1098 901
#> cor(z21_stress,z21_lag_z2_1) 965 979
#> cor(z21_lag_z1_1,z21_lag_z2_1) 893 934
#>
#> Regression Coefficients:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
#> z11_Intercept -0.21 0.07 -0.36 -0.06 1.00 884
#> z21_Intercept 0.21 0.11 -0.01 0.43 1.00 983
#> z11_intervention 0.11 0.07 -0.02 0.24 1.00 997
#> z11_stress -0.07 0.06 -0.18 0.04 1.00 895
#> z11_lag_z1_1 0.53 0.08 0.37 0.69 1.00 963
#> z11_lag_z2_1 0.11 0.09 -0.06 0.28 1.00 876
#> z11_intervention:lag_z1_1 -0.14 0.12 -0.38 0.09 1.00 1065
#> z11_intervention:lag_z2_1 -0.04 0.11 -0.26 0.17 1.00 871
#> z21_intervention -0.18 0.08 -0.34 -0.01 1.00 948
#> z21_stress 0.03 0.05 -0.07 0.13 1.00 1089
#> z21_lag_z1_1 0.17 0.10 -0.02 0.36 1.00 890
#> z21_lag_z2_1 0.52 0.09 0.35 0.69 1.00 1045
#> z21_intervention:lag_z1_1 -0.10 0.13 -0.35 0.16 1.00 1094
#> z21_intervention:lag_z2_1 -0.08 0.10 -0.28 0.12 1.00 1146
#> Tail_ESS
#> z11_Intercept 873
#> z21_Intercept 856
#> z11_intervention 1011
#> z11_stress 936
#> z11_lag_z1_1 944
#> z11_lag_z2_1 982
#> z11_intervention:lag_z1_1 921
#> z11_intervention:lag_z2_1 820
#> z21_intervention 883
#> z21_stress 907
#> z21_lag_z1_1 1039
#> z21_lag_z2_1 942
#> z21_intervention:lag_z1_1 962
#> z21_intervention:lag_z2_1 944
#>
#> Further Distributional Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma_z11 0.54 0.05 0.45 0.65 1.00 916 772
#> sigma_z21 0.53 0.04 0.46 0.60 1.00 987 967
#>
#> Residual Correlations:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> rescor(z11,z21) 0.19 0.08 0.04 0.33 1.00 1023 983
#>
#> Draws were sampled using variational(meanfield).