Skip to contents

NOTE THIS IS EXPERIMENTAL FOR NOW

library(multilevelcoda)
#> 
#> Attaching package: 'multilevelcoda'
#> The following object is masked from 'package:base':
#> 
#>     sub
library(knitr)
library(data.table)
#> data.table 1.18.4 using 12 threads (see ?getDTthreads).
#> Latest news: r-datatable.com
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:base':
#> 
#>     %notin%
library(cmdstanr)
#> This is cmdstanr version 0.8.0
#> - CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
#> - CmdStan path: /home/jwile/.cmdstan/cmdstan-2.38.0
#> - CmdStan version: 2.38.0
#> 
#> A newer version of CmdStan is available. See ?install_cmdstan() to install it.
#> To disable this check set option or environment variable cmdstanr_no_ver_check=TRUE.

options(digits = 3, pillar.sigfig = 3)

Data simulation is useful for multiple purposes, such as:

  • Power analysis and sample size planning.
  • Method development and testing.
  • Generating synthetic data to test models, write code before data collection finishes, etc.

Simulating data, especially if multilevel and/or compositional, can be complex. The simulate_data() function is designed to support these efforts.

Broadly, simulate_data() builds an indexed design, runs named generators in order, and returns an mlsim_data object with four main pieces:

  • data: the generated data.table.
  • metadata: design metadata such as group sizes, time indexing, and generator order.
  • generator_specs: the generator settings
  • generator_metadata: e.g., parameters, random effects, covariance matrices, helper columns.

Summary Tables

These tables give a high level summary of current features.

Design options

Design options cover the highest level structure.

Option Use
n Total rows for a single-level design, or total rows to divide across groups.
n_groups Number of groups in a grouped or multilevel design.
n_per_group Group sizes: scalar, vector, function, or count-distribution list.
group_id Name of the grouping column.
obs_id Name of the within-design observation index.
time_id Optional time column name.
time_values Optional time vector or function, including Date/POSIXct values.
time_truncate Whether shared time values are truncated for shorter groups.
seed Reproducible simulation seed that restores the caller RNG state.

Generator families

Generators are used to add simulated data columns. They are independent of each other.

Current predictor generators are listed below.

Generator Role
gen_mvn() Univariate and multivariate Gaussian predictors, including ILR compositional back-transforms.
gen_categorical() Binary, unordered categorical, and ordered categorical variables.
gen_binomial() Binomial counts with logit-scale multilevel support.
gen_poisson() Poisson counts with log-scale multilevel support.
gen_negbin() Negative-binomial counts with optional location-scale multilevel support.
gen_gamma() Positive continuous variables with optional location-scale multilevel support.
gen_beta() Bounded (0, 1) continuous variables with optional location-scale support.
gen_custom() User-supplied generators that receive the active simulation context.
gen_outcome() Gaussian, dynamic VAR(1), and ILR compositional outcomes generated from prior predictors.

Advanced simulation options

These are a sampling of more advanced simulation features and apply to predictor generators.

Feature Use
level Choose row-level, group-level, or multilevel generation.
fixed_intercept Set the fixed location or link-scale intercept in any predictor generator.
random_cov Add group-specific random intercepts, and sometimes scale random effects.
residual_cov Set row-level residual variation for Gaussian and MVN generators.
scale_fixed_intercept Simulate group-varying log residual SD, size, shape, or precision.
compositional, parts, sbp, total, keep_ilr Generate compositions from ILR coordinates and store SBP metadata.
between(), within(), ar1() Build dynamic outcome models from labelled predictor components and residual VAR(1) dynamics.

Study Designs and Indexing

Here is a simple single level example. Each variable in this first example is generated independently of the others.

single <- simulate_data(
  n = 20,
  seed = 2026,
  generators = list(
    x = gen_mvn("x", fixed_intercept = 0, residual_cov = 1),
    edu = gen_categorical("edu", categories = c("< Bachelor", "Bachelor", "> Bachelor"),
      fixed_intercept = c("Bachelor" = 0, "> Bachelor" = log(4 / 3))),
    tx = gen_categorical("tx", categories = c("control", "treatment"), fixed_intercept = stats::qlogis(.5))
  )
)

kable(single$data)
obs_id x edu tx
1 0.521 Bachelor control
2 -1.080 > Bachelor treatment
3 0.139 Bachelor treatment
4 -0.085 Bachelor control
5 -0.667 < Bachelor control
6 -2.516 < Bachelor control
7 -0.735 > Bachelor treatment
8 -1.020 < Bachelor control
9 0.114 Bachelor treatment
10 -0.474 < Bachelor treatment
11 -0.408 > Bachelor treatment
12 -0.730 > Bachelor control
13 -0.221 > Bachelor treatment
14 -0.226 > Bachelor treatment
15 -2.547 Bachelor control
16 1.347 > Bachelor treatment
17 0.616 > Bachelor treatment
18 0.218 < Bachelor treatment
19 -0.805 > Bachelor control
20 0.690 < Bachelor treatment
single
#> <mlsim_data>
#>   rows: 20
#>   columns: 4 (generated: 3)
#>   seed: 2026
#>   grouping: none
#>   generators: 3
#> 
#> Generators:
#>  generator distribution  level vars
#>          x          mvn single    x
#>        edu  categorical single  edu
#>         tx  categorical single   tx

The mlsim_data print method gives a compact console overview. Use summary() when you want design and generator metadata as data.table objects.

summary(single)$design
#>        n n_cols n_generated_cols n_groups group_id group_size_min
#>    <int>  <int>            <int>    <int>   <char>          <int>
#> 1:    20      4                3       NA     <NA>             NA
#>    group_size_median group_size_max obs_id time_id  seed n_generators
#>                <num>          <int> <char>  <char> <int>        <int>
#> 1:                NA             NA obs_id    <NA>  2026            3
summary(single)$generators
#>    generator distribution  level   vars n_vars parameter_level parameter_count
#>       <char>       <char> <char> <char>  <int>          <char>           <int>
#> 1:         x          mvn single      x      1             row              20
#> 2:       edu  categorical single    edu      1             row              20
#> 3:        tx  categorical single     tx      1             row              20
#>    has_row_parameters has_group_parameters has_fixed_parameters has_random_cov
#>                <lgcl>               <lgcl>               <lgcl>         <lgcl>
#> 1:               TRUE                FALSE                 TRUE          FALSE
#> 2:               TRUE                FALSE                 TRUE          FALSE
#> 3:               TRUE                FALSE                 TRUE          FALSE
#>    has_random_effects has_residuals has_scale_model has_composition
#>                <lgcl>        <lgcl>          <lgcl>          <lgcl>
#> 1:              FALSE          TRUE           FALSE           FALSE
#> 2:              FALSE         FALSE           FALSE           FALSE
#> 3:              FALSE         FALSE           FALSE           FALSE
#>    has_custom_output
#>               <lgcl>
#> 1:             FALSE
#> 2:             FALSE
#> 3:             FALSE

We can generate multilevel structures as well. A single scalar n_per_group creates a balanced grouped design with the exact same number of observations per group.

balanced <- simulate_data(
  n_groups = 3,
  n_per_group = 4,
  seed = 2026,
  generators = list(
    x = gen_mvn("x", fixed_intercept = 0, residual_cov = 1)
  )
)

kable(balanced$data)
group_id obs_id x
1 1 0.521
1 2 -1.080
1 3 0.139
1 4 -0.085
2 1 -0.667
2 2 -2.516
2 3 -0.735
2 4 -1.020
3 1 0.114
3 2 -0.474
3 3 -0.408
3 4 -0.730

However, this was still effectively single level, Gaussian data. We can improve that. Here we give essentially the parameters of a random intercept only multilevel model, which is used to generate matching data. The fixed_intercept is the overall mean, random_cov is the group-level variance, and residual_cov is the residual variance. The same approach broadly applies to other generator families when level = “multilevel”. We can generator a level 2 variable as well, which is constant within groups but varies between groups.

balanced2 <- simulate_data(
  n_groups = 3,
  n_per_group = 4,
  seed = 2026,
  generators = list(
    x = gen_mvn("x", level = "multilevel",
      fixed_intercept = 0, random_cov = 1, residual_cov = .01),
    y = gen_mvn("y", level = "level2", fixed_intercept = 0, residual_cov = 1)
  )
)

kable(balanced2$data)
group_id obs_id x x_between x_within .mlsim_x_random_intercept y
1 1 0.512 0.521 -0.008 0.521 1.347
1 2 0.454 0.521 -0.067 0.521 1.347
1 3 0.269 0.521 -0.252 0.521 1.347
1 4 0.447 0.521 -0.074 0.521 1.347
2 1 -1.182 -1.080 -0.102 -1.080 0.616
2 2 -1.068 -1.080 0.011 -1.080 0.616
2 3 -1.127 -1.080 -0.047 -1.080 0.616
2 4 -1.121 -1.080 -0.041 -1.080 0.616
3 1 0.066 0.139 -0.073 0.139 0.218
3 2 0.117 0.139 -0.022 0.139 0.218
3 3 0.117 0.139 -0.023 0.139 0.218
3 4 -0.115 0.139 -0.255 0.139 0.218

Vectors specs create unbalanced designs.

unbalanced <- simulate_data(
  n_groups = 4,
  n_per_group = c(3, 5, 4, 2),
  seed = 2026,
  generators = list(
    x = gen_mvn("x", level = "multilevel",
      fixed_intercept = 0, random_cov = 1, residual_cov = .01),
    y = gen_mvn("y", level = "level2", fixed_intercept = 0, residual_cov = 1)
  )
)

kable(unbalanced$data)
group_id obs_id x x_between x_within .mlsim_x_random_intercept y
1 1 0.454 0.521 -0.067 0.521 -0.805
1 2 0.269 0.521 -0.252 0.521 -0.805
1 3 0.447 0.521 -0.074 0.521 -0.805
2 1 -1.182 -1.080 -0.102 -1.080 0.690
2 2 -1.068 -1.080 0.011 -1.080 0.690
2 3 -1.127 -1.080 -0.047 -1.080 0.690
2 4 -1.121 -1.080 -0.041 -1.080 0.690
2 5 -1.153 -1.080 -0.073 -1.080 0.690
3 1 0.117 0.139 -0.022 0.139 -0.329
3 2 0.117 0.139 -0.023 0.139 -0.329
3 3 -0.115 0.139 -0.255 0.139 -0.329
3 4 0.274 0.139 0.135 0.139 -0.329
4 1 -0.023 -0.085 0.062 -0.085 -0.165
4 2 -0.063 -0.085 0.022 -0.085 -0.165
drawn_sizes <- simulate_data(
  n_groups = 5,
  n_per_group = list(
    distribution = "poisson",
    lambda = 4,
    minimum = 2,
    maximum = 6
  ),
  seed = 2026,
  generators = list(
    x = gen_mvn("x", level = "multilevel",
      fixed_intercept = 0, random_cov = 1, residual_cov = .01),
    y = gen_mvn("y", level = "level2", fixed_intercept = 0, residual_cov = 1)
  )
)

kable(drawn_sizes$data)
group_id obs_id x x_between x_within .mlsim_x_random_intercept y
1 1 -1.994 -1.958 -0.037 -1.958 0.339
1 2 -2.261 -1.958 -0.304 -1.958 0.339
1 3 -2.169 -1.958 -0.211 -1.958 0.339
1 4 -1.998 -1.958 -0.040 -1.958 0.339
1 5 -2.112 -1.958 -0.154 -1.958 0.339
2 1 0.995 1.085 -0.089 1.085 1.207
2 2 0.977 1.085 -0.108 1.085 1.207
2 3 1.111 1.085 0.026 1.085 1.207
2 4 0.992 1.085 -0.093 1.085 1.207
3 1 -0.010 0.204 -0.214 0.204 0.571
3 2 0.185 0.204 -0.019 0.204 0.571
4 1 0.495 0.501 -0.006 0.501 -1.686
4 2 0.488 0.501 -0.013 0.501 -1.686
4 3 0.616 0.501 0.115 0.501 -1.686
5 1 1.015 1.030 -0.015 1.030 -0.838
5 2 0.848 1.030 -0.182 1.030 -0.838
5 3 0.976 1.030 -0.054 1.030 -0.838
5 4 0.958 1.030 -0.072 1.030 -0.838

Longitudinal designs can add a time index. When groups have different lengths, time_truncate = TRUE uses the first values for shorter groups. Add a time index may not matter, but it can be used for creating lags, etc.

dated <- simulate_data(
  n_groups = 3,
  n_per_group = c(5, 3, 4),
  time_id = "date",
  time_values = as.Date("2026-01-01") + 0:4,
  seed = 2026,
  generators = list(
    x = gen_mvn("x", level = "multilevel",
      fixed_intercept = 0, random_cov = 1, residual_cov = .01),
    y = gen_mvn("y", level = "level2", fixed_intercept = 0, residual_cov = 1)
  )
)

kable(dated$data)
group_id obs_id date x x_between x_within .mlsim_x_random_intercept y
1 1 2026-01-01 0.512 0.521 -0.008 0.521 1.347
1 2 2026-01-02 0.454 0.521 -0.067 0.521 1.347
1 3 2026-01-03 0.269 0.521 -0.252 0.521 1.347
1 4 2026-01-04 0.447 0.521 -0.074 0.521 1.347
1 5 2026-01-05 0.419 0.521 -0.102 0.521 1.347
2 1 2026-01-01 -1.068 -1.080 0.011 -1.080 0.616
2 2 2026-01-02 -1.127 -1.080 -0.047 -1.080 0.616
2 3 2026-01-03 -1.121 -1.080 -0.041 -1.080 0.616
3 1 2026-01-01 0.066 0.139 -0.073 0.139 0.218
3 2 2026-01-02 0.117 0.139 -0.022 0.139 0.218
3 3 2026-01-03 0.117 0.139 -0.023 0.139 0.218
3 4 2026-01-04 -0.115 0.139 -0.255 0.139 0.218

Single-Level, Group-Level, and Multilevel Predictors

The level argument controls whether values vary by row, by group, or by row with group-specific random effects.

levels_demo <- simulate_data(
  n_groups = 4,
  n_per_group = 4,
  seed = 2026,
  generators = list(
    row_x = gen_mvn("row_x", fixed_intercept = 0, residual_cov = 1),
    group_x = gen_mvn("group_x", level = "level2", fixed_intercept = 10, residual_cov = 1),
    mixed_x = gen_mvn(
      "mixed_x",
      level = "multilevel",
      fixed_intercept = 5,
      random_cov = 0.50,
      residual_cov = 1
    )
  )
)

levels_demo
#> <mlsim_data>
#>   rows: 16
#>   columns: 8 (generated: 6)
#>   seed: 2026
#>   grouping: group_id (4 groups; size min/median/max: 4/4/4)
#>   generators: 3
#> 
#> Generators:
#>  generator distribution      level    vars
#>      row_x          mvn     single   row_x
#>    group_x          mvn     level2 group_x
#>    mixed_x          mvn multilevel mixed_x

Distribution Families

All predictor generator levels use fixed location or link-scale intercepts. Multilevel forms add random covariance terms. Here is a quick demonstration of various distribution families with single level data.

families <- simulate_data(
  n = 10,
  seed = 2026,
  generators = list(
    normal = gen_mvn("normal", fixed_intercept = 5, residual_cov = 1),
    successes = gen_binomial("successes", size = 5, fixed_intercept = stats::qlogis(0.40)),
    visits = gen_poisson("visits", fixed_intercept = log(2)),
    events = gen_negbin("events", fixed_intercept = log(2), scale_fixed_intercept = log(3)),
    cost = gen_gamma("cost", fixed_intercept = log(20), scale_fixed_intercept = log(4)),
    adherence = gen_beta("adherence", fixed_intercept = stats::qlogis(0.70), scale_fixed_intercept = log(12))
  )
)

kable(families$data)
obs_id normal successes visits events cost adherence
1 5.52 2 4 1 10.63 0.875
2 3.92 2 1 0 36.47 0.822
3 5.14 1 3 5 38.06 0.657
4 4.92 0 0 4 41.29 0.664
5 4.33 2 2 2 33.78 0.500
6 2.48 1 2 4 10.91 0.766
7 4.26 2 1 0 24.79 0.897
8 3.98 1 2 2 13.94 0.841
9 5.11 0 3 1 14.41 0.568
10 4.53 2 2 2 8.27 0.846

Here is an example of a multilevel count generator.

link_scale_counts <- simulate_data(
  n_groups = 3,
  n_per_group = 5,
  seed = 2026,
  generators = list(
    count = gen_poisson(
      "count",
      level = "multilevel",
      fixed_intercept = log(2.5),
      random_cov = 0.20
    )
  )
)

link_scale_counts
#> <mlsim_data>
#>   rows: 15
#>   columns: 4 (generated: 2)
#>   seed: 2026
#>   grouping: group_id (3 groups; size min/median/max: 5/5/5)
#>   generators: 1
#> 
#> Generators:
#>  generator distribution      level  vars
#>      count      poisson multilevel count

Categorical Variables

gen_categorical() supports binary variables, labeled categories, ordered factors, character output, and integer codes.

categorical <- simulate_data(
  n = 12,
  seed = 2026,
  generators = list(
    binary_factor = gen_categorical("binary_factor", fixed_intercept = stats::qlogis(0.35)),
    education = gen_categorical(
      "education",
      categories = c("< bachelor", "bachelor", "> bachelor"),
      fixed_intercept = c("bachelor" = log(0.35 / 0.45), "> bachelor" = log(0.20 / 0.45)),
      ordered = TRUE
    ),
    education_code = gen_categorical(
      "education_code",
      categories = c("arts", "science", "psychology"),
      fixed_intercept = c("science" = log(0.35 / 0.45), "psychology" = log(0.20 / 0.45)),
      output = "integer"
    )
  )
)

kable(categorical$data)
obs_id binary_factor education education_code
1 1 < bachelor 0
2 0 > bachelor 0
3 0 < bachelor 0
4 0 < bachelor 0
5 0 bachelor 0
6 0 < bachelor 1
7 0 < bachelor 2
8 1 < bachelor 0
9 0 < bachelor 1
10 0 < bachelor 0
11 0 < bachelor 1
12 1 < bachelor 0

Multilevel categorical generators use baseline-category logits and optional group random effects. For 2 categories, it is basically a multilevel bernoulli. For more than 2 categoaries, it is a multilevel multinomial with the first category as the baseline. The random intercepts can have their own variances and covariances. A full random effect covariance matrix must be specified. Use 0s on the off diagonal for uncorrelated random intercepts.

categorical_ml <- simulate_data(
  n_groups = 4,
  n_per_group = 4,
  seed = 2026,
  generators = list(
    education = gen_categorical(
      "education",
      level = "multilevel",
      categories = c("< bachelor", "bachelor", "> bachelor"),
      fixed_intercept = c("bachelor" = 0.15, "> bachelor" = -0.55),
      random_cov = matrix(c(0.20, 0.04, 0.04, 0.12), nrow = 2),
      ordered = TRUE
    )
  )
)

kable(categorical_ml$data)
group_id obs_id education .mlsim_education_random_intercept_bachelor .mlsim_education_random_intercept_X..bachelor
1 1 bachelor -0.306 0.105
1 2 < bachelor -0.306 0.105
1 3 < bachelor -0.306 0.105
1 4 < bachelor -0.306 0.105
2 1 bachelor 0.155 0.940
2 2 bachelor 0.155 0.940
2 3 < bachelor 0.155 0.940
2 4 < bachelor 0.155 0.940
3 1 bachelor -0.150 0.194
3 2 < bachelor -0.150 0.194
3 3 bachelor -0.150 0.194
3 4 < bachelor -0.150 0.194
4 1 < bachelor -0.089 0.318
4 2 bachelor -0.089 0.318
4 3 > bachelor -0.089 0.318
4 4 < bachelor -0.089 0.318

Correlated and Compositional Predictors

gen_mvn() simulates correlated blocks. With compositional = TRUE, the MVN columns are interpreted as ILR coordinates and back-transformed into parts. Here we are back to single level data.

correlated_comp <- simulate_data(
  n = 8,
  seed = 2026,
  generators = list(
    affect_block = gen_mvn(
      c("affect", "energy"),
      fixed_intercept = c(0, 1),
      residual_cov = matrix(c(1.0, 0.5, 0.5, 1.2), nrow = 2)
    ),
    time_use = gen_mvn(
      c("ilr_1", "ilr_2"),
      fixed_intercept = c(0, 0),
      residual_cov = diag(c(0.30, 0.20)),
      compositional = TRUE,
      parts = c("sleep", "active", "sedentary"),
      total = 24,
      keep_ilr = TRUE
    )
  )
)

kable(correlated_comp$data)
obs_id affect energy ilr_1 ilr_2 sleep active sedentary
1 0.351 1.566 -0.338 -0.022 5.96 8.88 9.16
2 -0.587 -0.290 -0.119 -0.853 6.40 4.05 13.54
3 0.355 0.938 0.441 -0.774 10.24 3.45 10.31
4 0.366 0.561 -0.378 -0.026 5.75 8.96 9.29
5 -0.405 0.238 0.180 -0.289 9.10 5.95 8.95
6 -1.890 -1.579 0.090 -0.772 7.83 4.06 12.10
7 0.922 -0.962 0.762 0.237 13.36 6.21 4.44
8 -1.621 0.655 -0.803 -0.074 3.78 9.58 10.64
correlated_comp$generator_metadata$time_use$sbp
#>           sleep active sedentary
#> balance_1     1     -1        -1
#> balance_2     0      1        -1
correlated_comp$generator_metadata$time_use$ilr_coordinate_map
#>       ilr sbp_row positive_parts   negative_parts
#>    <char>   <int>         <AsIs>           <AsIs>
#> 1:  ilr_1       1          sleep active,sedentary
#> 2:  ilr_2       2         active        sedentary

Use keep_ilr = FALSE when you only want the original parts in the output.

parts_only <- simulate_data(
  n = 5,
  seed = 2026,
  generators = list(
    time_use = gen_mvn(
      c("z1", "z2"),
      fixed_intercept = c(0, 0),
      residual_cov = diag(2),
      compositional = TRUE,
      parts = c("sleep", "active", "sedentary"),
      total = 24,
      keep_ilr = FALSE
    )
  )
)

kable(parts_only$data)
obs_id sleep active sedentary
1 21.86 1.45 0.694
2 11.64 2.21 10.153
3 15.23 4.82 3.956
4 7.27 7.87 8.867
5 10.69 3.73 9.582

Location-Scale Predictors

For multilevel predictors, scale_fixed_intercept adds group-varying scale parameters. For a univariate gen_mvn() generator this means log residual SD; for negative binomial, gamma, and beta generators it controls size, shape, and precision.

Here you can see in the metadata the group specific parameters that were generated.

location_scale <- simulate_data(
  n_groups = 4,
  n_per_group = 5,
  seed = 2026,
  generators = list(
    symptom = gen_mvn(
      "symptom",
      level = "multilevel",
      fixed_intercept = 10,
      scale_fixed_intercept = log(1.2),
      random_cov = matrix(
        c(1.00, 0.15,
          0.15, 0.08),
        nrow = 2
      )
    ),
    event_count = gen_negbin(
      "event_count",
      level = "multilevel",
      fixed_intercept = log(2),
      scale_fixed_intercept = log(6),
      random_cov = matrix(
        c(0.20, 0.03,
          0.03, 0.06),
        nrow = 2
      )
    )
  )
)

kable(data.table(
  group_id = location_scale$data$group_id,
  symptom_residual_sd = location_scale$generator_metadata$symptom$residual_sd,
  event_count_size = location_scale$generator_metadata$event_count$size
))
group_id symptom_residual_sd.symptom event_count_size
1 1.29 5.99
1 1.29 5.99
1 1.29 5.99
1 1.29 5.99
1 1.29 5.99
2 2.57 4.75
2 2.57 4.75
2 2.57 4.75
2 2.57 4.75
2 2.57 4.75
3 1.39 6.04
3 1.39 6.04
3 1.39 6.04
3 1.39 6.04
3 1.39 6.04
4 1.54 4.54
4 1.54 4.54
4 1.54 4.54
4 1.54 4.54
4 1.54 4.54

Custom Generators

Custom generators receive a simulation context and return generated data, column names, and optional metadata. They can use the active design columns, previously generated variables, and any extra arguments supplied to gen_custom(). Their metadata would be captured in the same was as built in generators.

pilot_sampler <- function(context, vars, level, pilot_values) {
  values <- sample(pilot_values, context$n_rows, replace = TRUE)

  list(
    data = data.frame(values = values),
    names = vars,
    metadata = list(source = "pilot bootstrap", n_pilot = length(pilot_values))
  )
}

clinic_load_sampler <- function(context, vars, level) {
  load_values <- c("low", "medium", "high")
  group_load <- load_values[context$data[[context$group_id]]]

  list(
    data = data.frame(load = group_load),
    names = vars,
    metadata = list(source = "derived from group index")
  )
}

custom <- simulate_data(
  n_groups = 3,
  n_per_group = 4,
  seed = 2026,
  generators = list(
    pilot_x = gen_custom(
      "pilot_x",
      generator = pilot_sampler,
      pilot_values = c(1.2, 1.4, 1.8, 2.5, 3.1)
    ),
    clinic_load = gen_custom(
      "clinic_load",
      generator = clinic_load_sampler
    )
  )
)

kable(custom$data)
group_id obs_id pilot_x clinic_load
1 1 3.1 low
1 2 1.2 low
1 3 1.2 low
1 4 3.1 low
2 1 1.8 medium
2 2 2.5 medium
2 3 2.5 medium
2 4 3.1 medium
3 1 2.5 high
3 2 1.4 high
3 3 1.4 high
3 4 1.8 high

Dynamic Outcome Simulation

gen_outcome() adds a model-based outcome generator to the same simulate_data() workflow. It is designed for simulation studies where outcomes are generated from a known location, scale, and optional residual autoregressive process.

The scale model is required. Use scale = sigma ~ 1 for constant conditional standard deviations. When ar1() is present, the scale model controls innovation variability. When ar1() is absent, it controls ordinary residual variability. The AR process is applied to residual ILR states, not to lagged observed outcomes.

gen_template() can be used first to ask the simulator for the exact parameter matrices and arrays required by a gen_outcome() specification. The template should be run with the same design, formulas, and composition settings that will be used in the final simulation.

dynamic_outcome_formula <- mvbind(ilr1, ilr2) ~ ar1() + (1 + ar1() | ID)
dynamic_scale_formula <- sigma ~ 1 + (1 | ID)
dynamic_composition <- list(parts = c("sleep", "sedentary", "activity"), total = 24)

dynamic_template <- simulate_data(
  n_groups = 4,
  n_per_group = 5,
  group_id = "ID",
  time_id = "day",
  seed = 2026,
  generators = list(
    outcome_template = gen_template(
      dynamic_outcome_formula,
      scale = dynamic_scale_formula,
      composition = dynamic_composition,
      burnin = 20
    )
  )
)

dynamic_params <- dynamic_template$generator_metadata$outcome_template$params
dynamic_params$location$beta["(Intercept)", ] <- c(0, 0)
dynamic_params$scale$beta["(Intercept)", ] <- log(c(0.4, 0.35))
dynamic_params$ar$beta["ar1()", , ] <- matrix(c(0.25, 0.02, -0.01, 0.20), 2, 2, byrow = TRUE)

dynamic_random_names <- rownames(dynamic_params$random$ID$covariance)
dynamic_random_sd_values <- c(
  "location|outcome=ilr1|term=(Intercept)" = 0.20,
  "location|outcome=ilr2|term=(Intercept)" = 0.18,
  "ar|term=ar1()|to=ilr1|from=ilr1" = 0.12,
  "ar|term=ar1()|to=ilr1|from=ilr2" = 0.06,
  "ar|term=ar1()|to=ilr2|from=ilr1" = 0.06,
  "ar|term=ar1()|to=ilr2|from=ilr2" = 0.12,
  "scale|outcome=ilr1|term=(Intercept)" = 0.16,
  "scale|outcome=ilr2|term=(Intercept)" = 0.14
)
dynamic_random_sd <- dynamic_random_sd_values[dynamic_random_names]

dynamic_random_cor <- diag(length(dynamic_random_names))
dimnames(dynamic_random_cor) <- list(dynamic_random_names, dynamic_random_names)
set_dynamic_random_cor <- function(x, y, value) {
  dynamic_random_cor[x, y] <<- value
  dynamic_random_cor[y, x] <<- value
}
set_dynamic_random_cor(
  "location|outcome=ilr1|term=(Intercept)",
  "ar|term=ar1()|to=ilr1|from=ilr1",
  0.20
)
set_dynamic_random_cor(
  "ar|term=ar1()|to=ilr1|from=ilr1",
  "ar|term=ar1()|to=ilr1|from=ilr2",
  0.15
)
set_dynamic_random_cor(
  "location|outcome=ilr2|term=(Intercept)",
  "ar|term=ar1()|to=ilr2|from=ilr1",
  0.20
)
set_dynamic_random_cor(
  "ar|term=ar1()|to=ilr2|from=ilr1",
  "ar|term=ar1()|to=ilr2|from=ilr2",
  0.15
)

if (anyNA(dynamic_random_sd) ||
    min(eigen(dynamic_random_cor, symmetric = TRUE, only.values = TRUE)$values) <= 0) {
  stop("The dynamic outcome random-effect covariance is not valid.")
}

dynamic_params$random$ID$covariance <- diag(dynamic_random_sd) %*%
  dynamic_random_cor %*%
  diag(dynamic_random_sd)
dimnames(dynamic_params$random$ID$covariance) <- list(dynamic_random_names, dynamic_random_names)
dynamic_comp <- simulate_data(
  n_groups = 400,
  n_per_group = 100,
  group_id = "ID",
  time_id = "day",
  seed = 2026,
  generators = list(
    outcome = gen_outcome(
      dynamic_outcome_formula,
      scale = dynamic_scale_formula,
      params = dynamic_params,
      composition = dynamic_composition,
      burnin = 50
    )
  )
)

kable(head(dynamic_comp$data[, c("ID", "day", "ilr1", "ilr2", "sleep", "sedentary", "activity"), with = FALSE]))
ID day ilr1 ilr2 sleep sedentary activity
1 1 -0.507 0.031 5.08 9.67 9.25
1 2 -0.028 0.454 7.55 10.77 5.67
1 3 0.031 0.090 8.20 8.40 7.40
1 4 -0.383 0.247 5.65 10.76 7.59
1 5 -0.458 0.648 4.92 13.63 5.45
1 6 0.221 0.420 9.26 9.50 5.25
dynamic_template$generator_metadata$outcome_template$expected_parameter_names
#> $location
#> [1] "(Intercept)"
#> 
#> $scale
#> [1] "(Intercept)"
#> 
#> $ar
#> [1] "ar1()"
#> 
#> $random
#> [1] "location|outcome=ilr1|term=(Intercept)"
#> [2] "location|outcome=ilr2|term=(Intercept)"
#> [3] "ar|term=ar1()|to=ilr1|from=ilr1"       
#> [4] "ar|term=ar1()|to=ilr1|from=ilr2"       
#> [5] "ar|term=ar1()|to=ilr2|from=ilr1"       
#> [6] "ar|term=ar1()|to=ilr2|from=ilr2"       
#> [7] "scale|outcome=ilr1|term=(Intercept)"   
#> [8] "scale|outcome=ilr2|term=(Intercept)"

dynamic_comp$generator_metadata$outcome$expected_parameter_names
#> $location
#> [1] "(Intercept)"
#> 
#> $scale
#> [1] "(Intercept)"
#> 
#> $ar
#> [1] "ar1()"
#> 
#> $random
#> [1] "location|outcome=ilr1|term=(Intercept)"
#> [2] "location|outcome=ilr2|term=(Intercept)"
#> [3] "ar|term=ar1()|to=ilr1|from=ilr1"       
#> [4] "ar|term=ar1()|to=ilr1|from=ilr2"       
#> [5] "ar|term=ar1()|to=ilr2|from=ilr1"       
#> [6] "ar|term=ar1()|to=ilr2|from=ilr2"       
#> [7] "scale|outcome=ilr1|term=(Intercept)"   
#> [8] "scale|outcome=ilr2|term=(Intercept)"

dynamic_comp$generator_metadata$outcome$ar$stability$max_spectral_radius_overall
#> [1] 0.58

table(rowSums(as.matrix(dynamic_comp$data[, c("sleep", "sedentary", "activity"), with = FALSE])))
#> 
#>    24 
#> 40000

The generated data can be managed into an analysis data set and passed to brmcoda(). prep_sim_analysis() rebuilds a complr object using the same SBP basis as the simulator and translates ar1() into within-person centered lagged ILR predictors. The fitted model includes random level terms, random inertia terms through lagged ILR slopes, and random conditional variability through response-specific sigma models.

dynamic_analysis <- prep_sim_analysis(dynamic_comp)
dynamic_analysis$metadata$lag_columns
#> [1] "lag_z1_1_within" "lag_z2_1_within"
dynamic_analysis$formula
#> z1_1 ~ lag_z1_1_within + lag_z2_1_within + (1 + lag_z1_1_within + lag_z2_1_within | ID) 
#> sigma ~ 1 + (1 | ID)
#> z2_1 ~ lag_z1_1_within + lag_z2_1_within + (1 + lag_z1_1_within + lag_z2_1_within | ID) 
#> sigma ~ 1 + (1 | ID)
dynamic_fit <- brmcoda(
  complr = dynamic_analysis$complr,
  formula = dynamic_analysis$formula,
  backend = "cmdstanr",
  seed = 2026,
  algorithm = "meanfield", 
  iter = 2000
)
summary(dynamic_fit)
#>  Family: MV(gaussian, gaussian) 
#>   Links: mu = identity; sigma = log
#>          mu = identity; sigma = log 
#> Formula: z1_1 ~ lag_z1_1_within + lag_z2_1_within + (1 + lag_z1_1_within + lag_z2_1_within | ID) 
#>          sigma ~ 1 + (1 | ID)
#>          z2_1 ~ lag_z1_1_within + lag_z2_1_within + (1 + lag_z1_1_within + lag_z2_1_within | ID) 
#>          sigma ~ 1 + (1 | ID)
#>    Data: complr$dataout (Number of observations: 39600) 
#>   Draws: 1 chains, each with iter = 1000; warmup = 0; thin = 1;
#>          total post-warmup draws = 1000
#> 
#> Multilevel Hyperparameters:
#> ~ID (Number of levels: 400) 
#>                                              Estimate Est.Error l-95% CI
#> sd(z11_Intercept)                                0.18      0.00     0.18
#> sd(z11_lag_z1_1_within)                          0.11      0.00     0.10
#> sd(z11_lag_z2_1_within)                          0.02      0.00     0.01
#> sd(sigma_z11_Intercept)                          0.15      0.00     0.14
#> sd(z21_Intercept)                                0.18      0.00     0.17
#> sd(z21_lag_z1_1_within)                          0.04      0.00     0.03
#> sd(z21_lag_z2_1_within)                          0.10      0.01     0.09
#> sd(sigma_z21_Intercept)                          0.14      0.00     0.13
#> cor(z11_Intercept,z11_lag_z1_1_within)           0.31      0.05     0.21
#> cor(z11_Intercept,z11_lag_z2_1_within)           0.15      0.27    -0.40
#> cor(z11_lag_z1_1_within,z11_lag_z2_1_within)     0.69      0.23     0.02
#> cor(z21_Intercept,z21_lag_z1_1_within)           0.22      0.12    -0.01
#> cor(z21_Intercept,z21_lag_z2_1_within)          -0.08      0.06    -0.18
#> cor(z21_lag_z1_1_within,z21_lag_z2_1_within)     0.09      0.06    -0.02
#>                                              u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(z11_Intercept)                                0.18 1.00      899      969
#> sd(z11_lag_z1_1_within)                          0.12 1.00      993      972
#> sd(z11_lag_z2_1_within)                          0.03 1.00     1062      951
#> sd(sigma_z11_Intercept)                          0.16 1.00     1033      936
#> sd(z21_Intercept)                                0.18 1.00      856      908
#> sd(z21_lag_z1_1_within)                          0.05 1.00      821      922
#> sd(z21_lag_z2_1_within)                          0.11 1.00     1077      793
#> sd(sigma_z21_Intercept)                          0.15 1.00     1053      818
#> cor(z11_Intercept,z11_lag_z1_1_within)           0.41 1.00     1102      880
#> cor(z11_Intercept,z11_lag_z2_1_within)           0.63 1.00      682      715
#> cor(z11_lag_z1_1_within,z11_lag_z2_1_within)     0.96 1.00      954      981
#> cor(z21_Intercept,z21_lag_z1_1_within)           0.43 1.00      875     1016
#> cor(z21_Intercept,z21_lag_z2_1_within)           0.03 1.00     1029      815
#> cor(z21_lag_z1_1_within,z21_lag_z2_1_within)     0.20 1.00      836      872
#> 
#> Regression Coefficients:
#>                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> z11_Intercept           0.01      0.00     0.01     0.01 1.00     1029      707
#> sigma_z11_Intercept    -0.91      0.00    -0.91    -0.90 1.00      930      943
#> z21_Intercept           0.01      0.00     0.00     0.01 1.00     1029     1068
#> sigma_z21_Intercept    -1.03      0.00    -1.04    -1.02 1.00      909      972
#> z11_lag_z1_1_within     0.21      0.01     0.19     0.22 1.00     1143      790
#> z11_lag_z2_1_within     0.00      0.01    -0.01     0.01 1.00      917      900
#> z21_lag_z1_1_within    -0.02      0.01    -0.03    -0.01 1.00      833      859
#> z21_lag_z2_1_within     0.20      0.01     0.19     0.21 1.00      910      908
#> 
#> Residual Correlations: 
#>                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> rescor(z11,z21)    -0.02      0.00    -0.02    -0.01 1.00      943      941
#> 
#> Draws were sampled using variational(meanfield).
dynamic_response_map <- dynamic_analysis$metadata$response_map
dynamic_response_prefix <- setNames(
  gsub("_", "", unname(dynamic_response_map), fixed = TRUE),
  names(dynamic_response_map)
)
dynamic_lag_columns <- setNames(
  dynamic_analysis$metadata$lag_columns,
  names(dynamic_response_map)
)

dynamic_model_term <- function(term) {
  switch(
    term,
    "(Intercept)" = "Intercept",
    term
  )
}

dynamic_fixed_truth <- numeric()
for (outcome in names(dynamic_response_map)) {
  response_prefix <- dynamic_response_prefix[[outcome]]

  for (term in rownames(dynamic_params$location$beta)) {
    dynamic_fixed_truth[paste(response_prefix, dynamic_model_term(term), sep = "_")] <-
      dynamic_params$location$beta[term, outcome]
  }
  for (term in rownames(dynamic_params$scale$beta)) {
    dynamic_fixed_truth[paste("sigma", response_prefix, dynamic_model_term(term), sep = "_")] <-
      dynamic_params$scale$beta[term, outcome]
  }
}
for (ar_term in dimnames(dynamic_params$ar$beta)[[1]]) {
  ar_interaction <- base::sub(":?ar1\\(\\)", "", ar_term)
  for (to in names(dynamic_response_map)) {
    for (from in names(dynamic_response_map)) {
      lag_term <- dynamic_lag_columns[[from]]
      model_term <- if (nzchar(ar_interaction)) {
        paste(ar_interaction, lag_term, sep = ":")
      } else {
        lag_term
      }
      dynamic_fixed_truth[paste(dynamic_response_prefix[[to]], model_term, sep = "_")] <-
        dynamic_params$ar$beta[ar_term, to, from]
    }
  }
}

dynamic_fixed_summary <- as.data.table(fixef(dynamic_fit), keep.rownames = "parameter")
dynamic_fixed_recovery <- dynamic_fixed_summary[, .(
  type = "fixed effect",
  parameter = parameter,
  estimate = Estimate,
  lower_95 = Q2.5,
  upper_95 = Q97.5,
  true_value = dynamic_fixed_truth[parameter]
)]

dynamic_random_model_name <- function(parameter) {
  if (startsWith(parameter, "location|")) {
    outcome <- base::sub("^location\\|outcome=([^|]+)\\|term=.*$", "\\1", parameter)
    term <- base::sub("^location\\|outcome=[^|]+\\|term=(.*)$", "\\1", parameter)
    return(paste(dynamic_response_prefix[[outcome]], dynamic_model_term(term), sep = "_"))
  }
  if (startsWith(parameter, "scale|")) {
    outcome <- base::sub("^scale\\|outcome=([^|]+)\\|term=.*$", "\\1", parameter)
    term <- base::sub("^scale\\|outcome=[^|]+\\|term=(.*)$", "\\1", parameter)
    return(paste("sigma", dynamic_response_prefix[[outcome]], dynamic_model_term(term), sep = "_"))
  }
  if (startsWith(parameter, "ar|")) {
    to <- base::sub("^ar\\|term=[^|]+\\|to=([^|]+)\\|from=([^|]+)$", "\\1", parameter)
    from <- base::sub("^ar\\|term=[^|]+\\|to=([^|]+)\\|from=([^|]+)$", "\\2", parameter)
    return(paste(dynamic_response_prefix[[to]], dynamic_lag_columns[[from]], sep = "_"))
  }
  stop(sprintf("Unknown random-effect parameter: %s", parameter))
}

dynamic_random_cov_truth <- dynamic_params$random$ID$covariance
dynamic_random_model_names <- vapply(
  rownames(dynamic_random_cov_truth),
  dynamic_random_model_name,
  character(1)
)
dimnames(dynamic_random_cov_truth) <- list(dynamic_random_model_names, dynamic_random_model_names)
dynamic_random_cor_truth <- cov2cor(dynamic_random_cov_truth)

dynamic_varcorr_id <- VarCorr(dynamic_fit)$ID
dynamic_random_sd_summary <- as.data.table(dynamic_varcorr_id$sd, keep.rownames = "parameter")
dynamic_random_sd_summary[, random_effect := parameter]
dynamic_random_variance_recovery <- dynamic_random_sd_summary[, .(
  type = "random variance",
  parameter = paste0("var(", random_effect, ")"),
  estimate = Estimate^2,
  lower_95 = Q2.5^2,
  upper_95 = Q97.5^2,
  true_value = diag(dynamic_random_cov_truth)[random_effect]
)]

dynamic_random_correlation_recovery <- data.table()
if (!is.null(dynamic_varcorr_id$cor)) {
  dynamic_cor_summary <- dynamic_varcorr_id$cor
  dynamic_cor_names <- dimnames(dynamic_cor_summary)[[1]]
  dynamic_cor_pairs <- which(
    lower.tri(matrix(FALSE, length(dynamic_cor_names), length(dynamic_cor_names))),
    arr.ind = TRUE
  )
  dynamic_random_correlation_recovery <- rbindlist(lapply(seq_len(nrow(dynamic_cor_pairs)), function(i) {
    row <- dynamic_cor_pairs[i, "row"]
    col <- dynamic_cor_pairs[i, "col"]
    summary <- dynamic_cor_summary[row, , col]
    data.table(
      type = "random correlation",
      parameter = sprintf("cor(%s, %s)", dynamic_cor_names[row], dynamic_cor_names[col]),
      estimate = summary[["Estimate"]],
      lower_95 = summary[["Q2.5"]],
      upper_95 = summary[["Q97.5"]],
      true_value = dynamic_random_cor_truth[dynamic_cor_names[row], dynamic_cor_names[col]]
    )
  }))
}

dynamic_recovery_table <- rbindlist(list(
  dynamic_fixed_recovery,
  dynamic_random_variance_recovery,
  dynamic_random_correlation_recovery
), use.names = TRUE)
dynamic_recovery_table[, covered_by_95_ci := true_value >= lower_95 & true_value <= upper_95]

if (anyNA(dynamic_recovery_table$true_value)) {
  stop("Some fitted parameters could not be matched to simulation truth.")
}

kable(dynamic_recovery_table, digits = 3)
type parameter estimate lower_95 upper_95 true_value covered_by_95_ci
fixed effect z11_Intercept 0.009 0.005 0.013 0.000 FALSE
fixed effect sigma_z11_Intercept -0.907 -0.913 -0.902 -0.916 FALSE
fixed effect z21_Intercept 0.006 0.001 0.012 0.000 FALSE
fixed effect sigma_z21_Intercept -1.028 -1.036 -1.019 -1.050 FALSE
fixed effect z11_lag_z1_1_within 0.205 0.194 0.217 0.250 FALSE
fixed effect z11_lag_z2_1_within 0.001 -0.011 0.014 0.020 FALSE
fixed effect z21_lag_z1_1_within -0.017 -0.027 -0.007 -0.010 TRUE
fixed effect z21_lag_z2_1_within 0.201 0.190 0.212 0.200 TRUE
random variance var(z11_Intercept) 0.032 0.031 0.034 0.040 FALSE
random variance var(z11_lag_z1_1_within) 0.013 0.011 0.015 0.014 TRUE
random variance var(z11_lag_z2_1_within) 0.000 0.000 0.001 0.004 FALSE
random variance var(sigma_z11_Intercept) 0.023 0.021 0.025 0.026 FALSE
random variance var(z21_Intercept) 0.032 0.031 0.033 0.032 TRUE
random variance var(z21_lag_z1_1_within) 0.002 0.001 0.002 0.004 FALSE
random variance var(z21_lag_z2_1_within) 0.011 0.009 0.013 0.014 FALSE
random variance var(sigma_z21_Intercept) 0.020 0.018 0.022 0.020 TRUE
random correlation cor(z11_lag_z1_1_within, z11_Intercept) 0.314 0.209 0.410 0.200 FALSE
random correlation cor(z11_lag_z2_1_within, z11_Intercept) 0.151 -0.402 0.632 0.000 TRUE
random correlation cor(sigma_z11_Intercept, z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_Intercept, z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z1_1_within, z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z2_1_within, z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(sigma_z21_Intercept, z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z11_lag_z2_1_within, z11_lag_z1_1_within) 0.690 0.016 0.955 0.150 TRUE
random correlation cor(sigma_z11_Intercept, z11_lag_z1_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_Intercept, z11_lag_z1_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z1_1_within, z11_lag_z1_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z2_1_within, z11_lag_z1_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(sigma_z21_Intercept, z11_lag_z1_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(sigma_z11_Intercept, z11_lag_z2_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_Intercept, z11_lag_z2_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z1_1_within, z11_lag_z2_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z2_1_within, z11_lag_z2_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(sigma_z21_Intercept, z11_lag_z2_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_Intercept, sigma_z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z1_1_within, sigma_z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z2_1_within, sigma_z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(sigma_z21_Intercept, sigma_z11_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z1_1_within, z21_Intercept) 0.218 -0.014 0.433 0.200 TRUE
random correlation cor(z21_lag_z2_1_within, z21_Intercept) -0.078 -0.184 0.031 0.000 TRUE
random correlation cor(sigma_z21_Intercept, z21_Intercept) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(z21_lag_z2_1_within, z21_lag_z1_1_within) 0.086 -0.022 0.197 0.150 TRUE
random correlation cor(sigma_z21_Intercept, z21_lag_z1_1_within) 0.000 0.000 0.000 0.000 TRUE
random correlation cor(sigma_z21_Intercept, z21_lag_z2_1_within) 0.000 0.000 0.000 0.000 TRUE