simglm
Introducing a new framework for simulation and power analysis with
the simglm
package, tidy simulation. The package now has
helper functions in which the first argument is always the data and the
second argument is always the simulation arguments. A second vignette is
written that contains a more exhaustive list of simulation arguments
that are allowed. This vignette will give the basics needed to specify
simulation arguments to generate simulated data.
There are four primary functions to be used for basic simulation functionality. These include:
simulate_fixed
: simulate fixed effectssimulate_error
: simulate random errorsimulate_randomeffects
: simulate random effects or
cluster residualsgenerate_response
: based on fixed, error, and random
effects, generate outcome variable \(Y_{j}\) given the following \(Y_{j} = g(X_{j} \beta + Z_{j} b_{j} +
e_{j})\).We will dive into what each of these component represent in a second, but first let’s start with an example that simulates a linear regression model.
Let us assume for this first example that our outcome of interest is continuous and that the data is not clustered. In this example, our model would look like: \(Y_{j} = X_{j} \beta + e_{j}\). In this equation, \(Y_{j}\) represents the continuous outcome, \(X_{j} \beta\) represents the fixed portion of the model comprised of regression coefficients (\(\beta\)) and a design matrix (\(X_{j}\)), and \(e_{j}\) represents random error.
The fixed portion of the model represents variables that are treated
as fixed. This means that the values observed in the data are the only
values we are interested in, will generalize to, or that we consider
values of interest in our population. Let us consider the following
formula: y ~ 1 + x1 + x2
. In this formula there are a total
of two variables that are assumed to be fixed, x1
and
x2
. These variables together with an intercept would make
up the design matrix \(X_{j}\). Let’s
generate some example data using the simglm
package and the
simulate_fixed
function.
library(simglm)
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + x1 + x2,
fixed = list(x1 = list(var_type = 'continuous',
mean = 180, sd = 30),
x2 = list(var_type = 'continuous',
mean = 40, sd = 5)),
sample_size = 10
)
simulate_fixed(data = NULL, sim_arguments)
## X.Intercept. x1 x2 level1_id
## 1 1 231.1471 41.73851 1
## 2 1 158.6388 47.42296 2
## 3 1 171.6605 40.94163 3
## 4 1 176.4105 52.21630 4
## 5 1 176.2812 34.23280 5
## 6 1 188.0455 35.97664 6
## 7 1 201.8052 42.28035 7
## 8 1 186.9941 42.10166 8
## 9 1 190.1734 42.88792 9
## 10 1 163.4426 42.23178 10
The first three columns of the resulting data frame is the design
matrix described above, \(X_{j}\). You
may have noticed that the simulate_fixed
function needs
three elements defined in the simulation arguments (called
sim_arguments
) above. These elements are:
formula
: this argument represents a R formula that is
used to represent the model that is wished to be simulated. For the
simulate_fixed
argument, only the right hand side is
used.fixed
: these represent the specific details for
generating the variables (other than the intercept) in the right hand
side of the formula
simulation argument. Each variable is
specified as its own list by name in the formula
argument
and the var_type
specifies the type of variable to
generate. The vary_type
argument is required for each fixed
variable to simulate. Optional arguments, for example
mean =
and sd =
above, will be discussed in
more detail later.sample_size
: this argument tells the function how many
responses to generate.The columns x1
and x2
would represent
variables that we would gather if these data were real. To reflect a
real life scenario, consider the following fixed simuation.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'continuous', mean = 40, sd = 5)),
sample_size = 10
)
simulate_fixed(data = NULL, sim_arguments)
## X.Intercept. weight age level1_id
## 1 1 231.1471 41.73851 1
## 2 1 158.6388 47.42296 2
## 3 1 171.6605 40.94163 3
## 4 1 176.4105 52.21630 4
## 5 1 176.2812 34.23280 5
## 6 1 188.0455 35.97664 6
## 7 1 201.8052 42.28035 7
## 8 1 186.9941 42.10166 8
## 9 1 190.1734 42.88792 9
## 10 1 163.4426 42.23178 10
Now instead of the variables being called x1
and
x2
, they now reflect variables weight (measured in lbs) and
age (measured continuously, not rounded to whole digits). If we wished
to change the 'age'
variable to be rounded to a whole
integer, we could change the variable type to 'ordinal'
as
such.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60)),
sample_size = 10
)
simulate_fixed(data = NULL, sim_arguments)
## X.Intercept. weight age level1_id
## 1 1 231.1471 44 1
## 2 1 158.6388 38 2
## 3 1 171.6605 31 3
## 4 1 176.4105 40 4
## 5 1 176.2812 60 5
## 6 1 188.0455 47 6
## 7 1 201.8052 33 7
## 8 1 186.9941 43 8
## 9 1 190.1734 52 9
## 10 1 163.4426 31 10
When specifying a var_type
as 'ordinal'
, an
additional levels
argument is needed to determine which
values are possible for the simulation. The last type of variable that
is useful to discuss now would be factor or categorical variables. These
variables can be generated by setting the
var_type = 'factor'
.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
sample_size = 10
)
simulate_fixed(data = NULL, sim_arguments)
## X.Intercept. weight age sex_1 sex level1_id
## 1 1 231.1471 44 1 female 1
## 2 1 158.6388 38 1 female 2
## 3 1 171.6605 31 1 female 3
## 4 1 176.4105 40 1 female 4
## 5 1 176.2812 60 0 male 5
## 6 1 188.0455 47 1 female 6
## 7 1 201.8052 33 1 female 7
## 8 1 186.9941 43 0 male 8
## 9 1 190.1734 52 1 female 9
## 10 1 163.4426 31 1 female 10
The required arguments when a factor variable are identical to an
ordinal variable, however for factor variables the levels
argument can either be an integer or can be a character vector where the
character labels are specified directly. As you can see from the output,
when a character vector is used, two variables are returned, one that
contains the variable represented numerically and another that is the
variable represented as a character vector. More details on this
behavior to follow.
The simulation of random error (\(e_{j}\) from the equation above) is a bit
simpler than generating the fixed effects. Suppose for example, we want
to simply simulate random errors that are normally distributed with a
mean of 0 and a variance of 1. This can be done with the
simulate_error
function.
set.seed(321)
sim_arguments <- list(
sample_size = 10
)
simulate_error(data = NULL, sim_arguments)
## error level1_id
## 1 1.7049032 1
## 2 -0.7120386 2
## 3 -0.2779849 3
## 4 -0.1196490 4
## 5 -0.1239606 5
## 6 0.2681838 6
## 7 0.7268415 7
## 8 0.2331354 8
## 9 0.3391139 9
## 10 -0.5519147 10
The simulate_error
function only needs to know how many
data values to simulate. By default, the rnorm
function is
used to generate random error and this function assumes a mean of 0 and
standard deviation of 1 by default. I personally prefer the slightly
more verbose code however.
set.seed(321)
sim_arguments <- list(
error = list(variance = 1),
sample_size = 10
)
simulate_error(data = NULL, sim_arguments)
## error level1_id
## 1 1.7049032 1
## 2 -0.7120386 2
## 3 -0.2779849 3
## 4 -0.1196490 4
## 5 -0.1239606 5
## 6 0.2681838 6
## 7 0.7268415 7
## 8 0.2331354 8
## 9 0.3391139 9
## 10 -0.5519147 10
This code makes it clearer when the variance of the errors is wished to be specified as some other value. For example:
set.seed(321)
sim_arguments <- list(
error = list(variance = 25),
sample_size = 10
)
simulate_error(data = NULL, sim_arguments)
## error level1_id
## 1 8.5245161 1
## 2 -3.5601928 2
## 3 -1.3899246 3
## 4 -0.5982451 4
## 5 -0.6198031 5
## 6 1.3409189 6
## 7 3.6342074 7
## 8 1.1656771 8
## 9 1.6955694 9
## 10 -2.7595733 10
Now that we have seen the basics of simulating fixed variables
(covariates) and random error, we can now generate the response by
combining the previous two steps and then using the
generate_response
function. What makes this a tidy
simulation is that the pipe from magrittr, |>
can be
used to combine steps into a simulation pipeline.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 10,
reg_weights = c(2, 0.3, -0.1, 0.5)
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments)
## X.Intercept. weight age sex_1 sex level1_id error fixed_outcome
## 1 1 231.1471 44 1 female 1 4.5862776 67.44413
## 2 1 158.6388 38 1 female 2 -0.5353077 46.29165
## 3 1 171.6605 31 1 female 3 4.9416770 50.89814
## 4 1 176.4105 40 1 female 4 -5.3611940 51.42316
## 5 1 176.2812 60 0 male 5 -3.7900764 48.88435
## 6 1 188.0455 47 1 female 6 0.4750036 54.21365
## 7 1 201.8052 33 1 female 7 -11.6546559 59.74157
## 8 1 186.9941 43 0 male 8 2.0875799 53.79822
## 9 1 190.1734 52 1 female 9 -5.6016371 54.35202
## 10 1 163.4426 31 1 female 10 -2.3734235 48.43277
## random_effects y
## 1 0 72.03041
## 2 0 45.75635
## 3 0 55.83981
## 4 0 46.06196
## 5 0 45.09428
## 6 0 54.68866
## 7 0 48.08692
## 8 0 55.88580
## 9 0 48.75039
## 10 0 46.05934
The only additional argument that is needed for the
generate_response
function is the reg_weights
argument. This argument represents the regression coefficients
associated with \(\beta\) in the
equation \(Y_{j} = X_{j} \beta +
e_{j}\). The regression coefficients are multiplied by the design
matrix to generate the column labeled “fixed_outcome” in the output. The
output also contains the column, “random_effects” which are all 0 here
indicating there are no random effects and the response variable,
“y”.
Earlier, a factor or categorical grouping factor with 2 levels was shown. It may be of interest to generate a factor or categorical attribute with more than 2 levels, a feature that is built into simglm.
The creation a categorical attribute with more than two level is similar to that with 2 levels. The following code creates a categorical attribute with 4 levels based on the grade, freshman, sophomore, junior, or senior. To add more levels, the user would just add more categories to the levels argument. Below, the data are generated and a table showing the different categories are shown.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex + grade,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor',
levels = c('male', 'female')),
grade = list(var_type = 'factor',
levels = c('freshman', 'sophomore',
'junior', 'senior'))),
error = list(variance = 25),
sample_size = 100
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
count(grade)
## grade n
## 1 freshman 27
## 2 sophomore 20
## 3 junior 32
## 4 senior 21
The levels of the new grade attribute could also be modified to include unequal probabilities for each class to be sampled. To do this, the prob argument could be passed as follows.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex + grade,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor',
levels = c('male', 'female')),
grade = list(var_type = 'factor',
levels = c('freshman', 'sophomore',
'junior', 'senior'),
prob = c(.05, .3, .45, .2))),
error = list(variance = 25),
sample_size = 100
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
count(grade)
## grade n
## 1 freshman 6
## 2 sophomore 26
## 3 junior 43
## 4 senior 25
When generating more than 2 levels for a categorical attribute, the default behavior within R and with simglm is to create dummy or indicator variables to be used within a linear model. Therefore, in order to generate the outcome for an attribute with more than 2 levels, multiple indicator or dummy attributes need to be created. The number of indicator or dummy attributes is always 1 less than the number of categories, so for the grade attribute example, there needs to be 3 indicator attributes to represent the 4 categories within a linear model. An example of what these indicators attributes would look like is presented below.
data.frame(
grade = sample(c('freshman', 'sophomore', 'junior', 'senior'),
size = 10, replace = TRUE)
) |>
mutate(sophomore = ifelse(grade == 'sophomore', 1, 0),
junior = ifelse(grade == 'junior', 1, 0),
senior = ifelse(grade == 'senior', 1, 0))
## grade sophomore junior senior
## 1 junior 0 1 0
## 2 freshman 0 0 0
## 3 junior 0 1 0
## 4 sophomore 1 0 0
## 5 senior 0 0 1
## 6 freshman 0 0 0
## 7 sophomore 1 0 0
## 8 sophomore 1 0 0
## 9 sophomore 1 0 0
## 10 freshman 0 0 0
Each indicator takes a value of 1 or 0 for each new column created. These indicate which category is represented by each attribute, for example, the 1 for the sophomore column indicates that this term would represent the effect of moving from the reference group to the sophomore category. The reference group is selected by the group or category that does not have an indicator attribute for it, in this example it is the freshman group. The default behavior in R is to use the category that is closest to “A” in alphabetical order.
It is important to understand how the indicator attributes are
generated as this will impact how the response outcome is created. In
this case, since there will be 3 indicator variables to represent the 4
categories of the grade attribute, 3 regression weights will need to be
added to the reg_weights
argument. These terms would
represent the effect (mean level change) from moving from the reference
group to each of the indicator variables. Using the example above, this
would reflect the mean difference between freshman and sophomore,
freshman and junior, and freshman and senior.
The other element that needs to be kept track of, is the order that
these indicator attributes are created by R. The default behavior is for
these to be created in alphabetical order. The simglm package modifies
this default behavior to convert those specified with a character vector
for the levels
argument to be in the same order as
specified. This is important as it has implications for the
interpretation of the terms specified within reg_weights
.
To be more concrete, what this means is that the last three terms
specified within the reg_weights
argument (ie., .75, 1.8,
and 2.5) would respectively represent the difference between the
sophomore, junior, and senior groups compared to the freshman group.
Therefore, the average mean difference between the freshman and
sophomore groups would be .75, the mean difference between the freshman
and junior groups would be 1.8, and the mean difference between the
freshman and senior groups would be 2.5. You could confirm this by
saving the simulated data below and fitting a linear regression using
the lm
function as follows:
lm(y ~ 1 + weight + age + sex + grade, data = &&)
where the saved data object is placed instead of the
&&
.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex + grade,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor',
levels = c('male', 'female')),
grade = list(var_type = 'factor',
levels = c('freshman', 'sophomore',
'junior', 'senior'))),
error = list(variance = 25),
sample_size = 10000,
reg_weights = c(2, .1, .55, 1.5, .75, 1.8, 2.5)
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments) |>
head()
## X.Intercept. weight age sex_1 grade_1 grade_2 grade_3 sex grade
## 1 1 231.1471 49 1 1 0 0 female sophomore
## 2 1 158.6388 58 1 0 1 0 female junior
## 3 1 171.6605 43 1 0 0 0 female freshman
## 4 1 176.4105 37 0 0 0 0 male freshman
## 5 1 176.2812 45 1 0 0 1 female senior
## 6 1 188.0455 58 0 0 0 0 male freshman
## level1_id error fixed_outcome random_effects y
## 1 1 0.0718251 54.31471 0 54.38653
## 2 2 -0.7071167 53.06388 0 52.35677
## 3 3 2.2538313 44.31605 0 46.56988
## 4 4 4.1403091 39.99105 0 44.13136
## 5 5 -0.4405120 48.37812 0 47.93761
## 6 6 2.9550896 52.70455 0 55.65964
Non-normal outcomes are possible with simglm. Two non-normal outcomes
are currently supported with more support coming in the future. Binary
and count outcomes are supported and can be specified with the
outcome_type
simulation argument. If
outcome_type = 'logistic'
or
outcome_type = 'binary'
then a binary outcome is generated
(ie. 0/1 variable) using the rbinom
function. If
outcome_type = 'count'
or
outcome_type = 'poisson'
then the outcome is transformed to
be a count variable (ie. discrete variable; 0, 1, 2, etc.).
Here is an example of generating a binary outcome.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 10,
reg_weights = c(2, 0.3, -0.1, 0.5),
outcome_type = 'binary'
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments)
## X.Intercept. weight age sex_1 sex level1_id error fixed_outcome
## 1 1 231.1471 44 1 female 1 4.5862776 67.44413
## 2 1 158.6388 38 1 female 2 -0.5353077 46.29165
## 3 1 171.6605 31 1 female 3 4.9416770 50.89814
## 4 1 176.4105 40 1 female 4 -5.3611940 51.42316
## 5 1 176.2812 60 0 male 5 -3.7900764 48.88435
## 6 1 188.0455 47 1 female 6 0.4750036 54.21365
## 7 1 201.8052 33 1 female 7 -11.6546559 59.74157
## 8 1 186.9941 43 0 male 8 2.0875799 53.79822
## 9 1 190.1734 52 1 female 9 -5.6016371 54.35202
## 10 1 163.4426 31 1 female 10 -2.3734235 48.43277
## random_effects untransformed_outcome y
## 1 0 72.03041 1
## 2 0 45.75635 1
## 3 0 55.83981 1
## 4 0 46.06196 1
## 5 0 45.09428 1
## 6 0 54.68866 1
## 7 0 48.08692 1
## 8 0 55.88580 1
## 9 0 48.75039 1
## 10 0 46.05934 1
And finally, an example of generating a count outcome. Note, the weight variable here has been grand mean centered in the generation (ie. mean = 0). This helps to ensure that the counts are not too large.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 0, sd = 30),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 10,
reg_weights = c(2, 0.01, 0.5),
outcome_type = 'count'
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments)
## X.Intercept. weight sex_1 sex level1_id error fixed_outcome
## 1 1 51.147097 0 male 1 -4.0233584 2.511471
## 2 1 -21.361157 0 male 2 2.2803457 1.786388
## 3 1 -8.339547 1 female 3 2.1016629 2.416605
## 4 1 -3.589471 0 male 4 2.8879225 1.964105
## 5 1 -3.718819 0 male 5 2.2317803 1.962812
## 6 1 8.045513 1 female 6 4.5862776 2.580455
## 7 1 21.805245 1 female 7 -0.5353077 2.718052
## 8 1 6.994062 1 female 8 4.9416770 2.569941
## 9 1 10.173416 0 male 9 -5.3611940 2.101734
## 10 1 -16.557440 1 female 10 -3.7900764 2.334426
## random_effects untransformed_outcome y
## 1 0 -1.511887 0
## 2 0 4.066734 53
## 3 0 4.518267 95
## 4 0 4.852028 115
## 5 0 4.194592 70
## 6 0 7.166733 1240
## 7 0 2.182745 11
## 8 0 7.511618 1767
## 9 0 -3.259460 0
## 10 0 -1.455651 0
Now that the basics of tidy simulation have been shown in the context
of a linear regression model, let’s explore an example in which the
power for this model is to be evaluated. In particular, suppose we are
interested in estimating empirical power for the three fixed effects
based on the following formula: y ~ 1 + weight + age + sex
.
More specifically, we are interested in estimating power for “weight”,
“age”, and “sex” variables. A few additional functions are needed for
this step including:
model_fit
: this function will fit a model to the
data.extract_coefficients
: this function will extract the
fixed coefficients based on the model fitted.To fit a model and extract coefficients, we could do the following building off the example from the previous section:
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 10,
reg_weights = c(2, 0.3, -0.1, 0.5)
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments) |>
model_fit(sim_arguments) |>
extract_coefficients()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.13 20.3 0.0553 0.958
## 2 weight 0.328 0.103 3.19 0.0189
## 3 age -0.198 0.271 -0.733 0.491
## 4 sexfemale -1.87 5.93 -0.315 0.763
This output contains the model output for a single data simulation,
more specifically we can see the parameter name, parameter estimate, the
standard error for the parameter estimate, the test statistics, and the
p-value. These were estimated using the lm
function based
on the same formula defined in sim_arguments
. It is
possible to specify your own formula, model fitting function, and
regression weights to the model_fit
function. For example,
suppose we knew that weight was an important predictor, but are unable
to collect it in real life. We could then specify an alternative model
when evaluating power, but include the variable in the data generation
step.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 10,
reg_weights = c(2, 0.3, -0.1, 0.5),
model_fit = list(formula = y ~ 1 + age + sex,
model_function = 'lm'),
reg_weights_model = c(2, -0.1, 0.5)
)
simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments) |>
model_fit(sim_arguments) |>
extract_coefficients()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 48.0 21.3 2.25 0.0594
## 2 age 0.0488 0.394 0.124 0.905
## 3 sexfemale 2.25 8.79 0.256 0.805
Notice that we now get very different parameter estimates for this single data generation process which reflects the contribution of the variable “weight” that is not taken into account in the model fitting.
When evaluating empirical power, it is essential to replicate the analysis just like a Monte Carlo study to avoid extreme simulation conditions. The simglm package offers functions to aid in the replication given the simulation conditions and the desired simulation framework. To do this, two additional functions are used:
replicate_simulation
: this function replicates the
simulationcompute_statistics
: this function compute desired
statistics across the replications.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 10,
reg_weights = c(2, 0.3, -0.1, 0.5),
model_fit = list(formula = y ~ 1 + age + sex,
model_function = 'lm',
reg_weights_model = c(2, -0.1, 0.5)),
replications = 10,
extract_coefficients = TRUE
)
replicate_simulation(sim_arguments) |>
compute_statistics(sim_arguments, alternative_power = FALSE,
type_1_error = FALSE)
## # A tibble: 3 × 7
## term avg_estimate power param_estimate_sd avg_standard_error precision_ratio
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Inte… 46.1 0.3 26.1 19.2 1.36
## 2 age 0.0852 0.1 0.583 0.396 1.47
## 3 sexfe… 1.51 0 6.66 7.50 0.888
## # ℹ 1 more variable: replications <dbl>
As can be seen from the output, the default behavior is to return statistics for power, average test statistic, adjusted average test statistic, standard deviation of parameter estimate, average standard error, precision ration (standard deviation of parameter estimate divided by average standard error), and the number of replications for each term.
To generate this output, only the number of replications is needed. Here only 10 replications were done, in practice many more replications would be done to ensure there are accurate results.
The default power parameter values used are: Normal distribution, two-tailed alternative hypotheses, and alpha of 0.05. Additional power parameters can be passed directly to override default values by including a power argument within the simulation arguments. For example, if a t-distribution with one degree of freedom an alpha of 0.02 is desired, these can be added as follows:
Note: A user would likely want to change
plan(sequential)
to something like
plan(mutisession)
or plan(multicore)
to run
these in parallel. plan(sequential)
is used here for
vignette processing.
set.seed(321)
library(future)
plan(sequential)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex,
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
error = list(variance = 25),
sample_size = 50,
reg_weights = c(2, 0.3, -0.1, 0.5),
model_fit = list(formula = y ~ 1 + age + sex,
model_function = 'lm',
reg_weights_model = c(2, -0.1, 0.5)),
replications = 1000,
power = list(
dist = 'qt',
alpha = .02,
opts = list(df = 1)
),
extract_coefficients = TRUE
)
replicate_simulation(sim_arguments) |>
compute_statistics(sim_arguments, alternative_power = FALSE,
type_1_error = FALSE)
## # A tibble: 3 × 7
## term avg_estimate power param_estimate_sd avg_standard_error precision_ratio
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Inte… 56.0 1 7.83 7.83 1.00
## 2 age -0.0994 0.014 0.166 0.168 0.991
## 3 sexfe… 0.439 0.01 3.01 2.98 1.01
## # ℹ 1 more variable: replications <dbl>
Nested designs are ones in which data belong to multiple levels. An example could be individuals nested within neighborhoods. In this example, a specific individual is tied directly to one neighborhood. These types of data often include correlations between individuals within a neighborhood that need to be taken into account when modeling the data. These types of data can be generated with simglm.
To do this, the formula syntax introduced above is modified slightly
to include information on the nesting structure. In the example below,
the nesting structure in the formula is specified in the portion within
the parentheses. How the part within parentheses could be read is, add a
random intercept effect for each neighborhood. This random intercept
effect is similar to random error found in regression models, except the
error is associated with neighborhoods and is the same values for all
individuals within that neighborhood. The simulation arguments,
randomeffect
provides information about this term. One
element that is often of interest to modify is the variance of the
random effect term(s). For example, if the variance of the random effect
should be 8, then the argument, variance = 8
, could be
added to the specific term as shown below.
Finally, the sample_size
argument needs to be modified
to include information about the two levels of nested. You could read
the sample_size = list(level1 = 10, level2 = 20)
argument
below as: create 20 neighborhoods (ie. level2 = 20
) and
within each neighborhood have 10 individuals (ie.
level1 = 10
). Therefore the total sample size (ie. rows in
the data) would be 200.
set.seed(321)
sim_arguments <- list(
formula = y ~ 1 + weight + age + sex + (1 | neighborhood),
reg_weights = c(4, -0.03, 0.2, 0.33),
fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
age = list(var_type = 'ordinal', levels = 30:60),
sex = list(var_type = 'factor', levels = c('male', 'female'))),
randomeffect = list(int_neighborhood = list(variance = 8, var_level = 2)),
sample_size = list(level1 = 10, level2 = 20)
)
nested_data <- simulate_fixed(data = NULL, sim_arguments) |>
simulate_randomeffect(sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments)
head(nested_data, n = 10)
## X.Intercept. weight age sex_1 sex level1_id neighborhood
## 1 1 231.1471 39 0 male 1 1
## 2 1 158.6388 30 0 male 2 1
## 3 1 171.6605 35 0 male 3 1
## 4 1 176.4105 42 0 male 4 1
## 5 1 176.2812 46 0 male 5 1
## 6 1 188.0455 34 0 male 6 1
## 7 1 201.8052 43 1 female 7 1
## 8 1 186.9941 38 1 female 8 1
## 9 1 190.1734 53 0 male 9 1
## 10 1 163.4426 41 1 female 10 1
## int_neighborhood error fixed_outcome random_effects y
## 1 1.398851 -0.4700086 4.865587 1.398851 5.794429
## 2 1.398851 1.0499201 5.240835 1.398851 7.689606
## 3 1.398851 1.3339871 5.850186 1.398851 8.583024
## 4 1.398851 0.7412679 7.107684 1.398851 9.247803
## 5 1.398851 -1.0045606 7.911565 1.398851 8.305855
## 6 1.398851 1.9360082 5.158635 1.398851 8.493493
## 7 1.398851 0.3692732 6.875843 1.398851 8.643967
## 8 1.398851 1.3708928 6.320178 1.398851 9.089922
## 9 1.398851 -0.1077046 8.894798 1.398851 10.185944
## 10 1.398851 1.1553718 7.626723 1.398851 10.180946
nrow(nested_data)
## [1] 200
The vignette simulation_arguments contains more example of specifying random effects and additional nesting designs including three levels of nested and multiple membership models.