Report

Inhale, Exhale, Analyze: BMI’s Imprint on Impulse Oscillometry Outcomes

Author

Joshua J. Cook, M.S., ACRP-PM, CCRC, Syed Ahzaz H. Shah, B.S., Jacob Hernandez, B.S., Sara Basili, M.S.

Published

April 15, 2024

1 Introduction

Linear mixed-effects models (LMMs) are advanced statistical tools designed to analyze data that exhibit complex structures, such as hierarchical organization, repeated measures, and random effects. These models are particularly useful when data violate the assumptions of traditional ANOVA or regression methods, such as the independence of observations, homoscedasticity, and normality of residuals. LMMs accommodate intra-subject differences, allowing for both fixed effects, which are consistent across individuals, and random effects, which vary among subjects or groups.

The implementation of LMMs has been facilitated by various software packages and programming languages. Brown (Brown 2021) provides a comprehensive guide to implementing LMMs in R, a widely used statistical programming language, offering a step-by-step walkthrough of model syntax without delving deeply into complex mathematical foundations. Additionally, the lme4 package, as detailed by Bates et al. (Bates et al. 2015), represents a significant evolution in computational methods for fitting mixed models, offering efficient tools and simplified modeling processes for R users, especially for models with crossed random effects. Pymer4, developed by Jolly (Jolly 2018), bridges R and Python, offering Python users a flexible and integrated tool for linear mixed modeling by leveraging the capabilities of R’s lme4 package. This tool enhances the analytical capabilities within the Python ecosystem, making advanced statistical methods more accessible to a broader audience.

LMMs find applications across various scientific domains, each with its unique data structures and analytical challenges. The paper by Lee and Shang (Lee and Shang n.d.) explores the impact of missing data on estimation and selection in LMMs, highlighting the challenges and proposing a method to record missingness using an indicator-based matrix. This approach is critical for ensuring model accuracy in the presence of missing data, a common issue in real-world datasets. Wang et al. (Wang et al. 2022) illustrate the application of LMMs in cardiothoracic surgery outcomes research, using a case study of homograft pulmonary valve replacement data to demonstrate the model’s ability to handle repeated measurements and provide more nuanced understandings of clinical outcomes. Aarts et al. (Aarts et al. 2015) examine multilevel experimental designs in neuroscience and show how applying conventional linear models to multilevel data can inflate false-positive rates. Magezi (Magezi 2015) highlights the use of LMMs in within-participant psychology experiments, addressing the complexities of repeated measures and nested data structures common in psychological research. Harrison et al. (Harrison et al. 2018) and Bolker et al. (Bolker et al. 2009) discuss the application of LMMs and generalized linear mixed models (GLMMs) in ecology, emphasizing their utility in analyzing ecological data that involve complex relationships and hierarchical data structures. Grueber et al. (Grueber et al. 2011), another ecology paper, focuses on model averaging and information-theoretic approaches with LMMs as an alternative to traditional null-hypothesis testing. In the medical field, LMMs are employed to model pandemic-induced mortality changes, as demonstrated by Verbeeck et al. (Verbeeck et al. 2023), and to analyze longitudinal health-related quality of life data in cancer clinical trials, as discussed by Touraine et al. (Touraine et al. 2023).

The paper “To transform or not to transform: using generalized linear mixed models to analyse reaction time data” by Lo and Andrews (Lo and Andrews 2015) challenges the common practice of transforming reaction time data in cognitive psychology, advocating for GLMMs as a more robust alternative. The “LEVEL” guidelines proposed by Monsalves et al. (Monsalves et al. 2020) aim to standardize the reporting of multilevel data and analyses, enhancing comparability across studies. Piepho’s study (Piepho 1999) on analyzing disease incidence data with GLMMs underscores the inadequacy of traditional methods like ANOVA for such data, highlighting GLMMs’ flexibility. The simulation study by Pusponegoro et al. (Pusponegoro et al. 2017) on children’s growth differences emphasizes the importance of choosing the appropriate covariance structure in LMMs for longitudinal data. Lastly, the framework introduced by Steibel et al. (Steibel et al. 2009) for analyzing RT-PCR data with LMMs showcases the method’s statistical power and flexibility, offering a significant advancement over traditional analysis methods. LMMs are used in a wide array of disciplines, but also in varying study designs, as shown in Table 1.

Table 1. Systematic Review of LMM Use-cases (Casals et al. 2014)

The strengths of LMMs lie in their flexibility to model complex data structures and their ability to handle missing data, making them a powerful tool for a wide range of scientific inquiries. However, their application is not without challenges. Peng and Lu (Peng and Lu 2012) address the difficulty of variable selection and parameter estimation in LMMs, proposing an iterative procedure to improve model accuracy. Barr (Barr 2013) critiques existing guidelines for testing interactions within LMMs, proposing new guidelines to ensure more reliable results. The work by Tu (Tu 2015) on GLMMs for network meta-analyses showcases how mixed models have evolved to tackle complex data, enhancing the accuracy of combining different studies. On the other hand, Fokkema et al. (Marjolein Fokkema and Wolpert 2021) introduce GLMM trees, merging machine learning with mixed models to improve predictions and analysis, particularly useful in mental health research. Despite their robustness, as noted by Schielzeth et al. (Schielzeth et al. 2020), LMMs require careful evaluation of model assumptions and may present computational challenges, especially with high-dimensional datasets.

The literature reviewed here collectively emphasizes the versatility, robustness, and broad applicability of LMMs and GLMMs across various fields of research. Despite their advantages, the importance of careful model selection, acknowledgment of limitations, and the potential need for more complex models such as joint models in certain scenarios are also highlighted. As the use of LMMs continues to grow, the development of standardized processes, such as the LEVEL framework (Monsalves et al. 2020) and the ten-step protocol put forth by Zuur and Ieno (Zuur and Ieno 2016), and user-friendly tools will be crucial in ensuring the accurate and effective application of these models in research.

2 Methods

As mentioned by (Galecki 2014), an LMM is:

a parametric linear model for clustered, longitudinal, or repeated-measures data that quantifies the relationships between a continuous dependent variable and various predictor variables. An LMM may include both fixed-effect parameters associated with one or more continuous or categorical covariates and random effects associated with one or more random factors.

Fixed-effect parameters describe the relationships of the covariates to the dependent variable for the entire population. These effects are typically distinct, clearly defined values used for classification, such as gender or comorbidities, and are commonly utilized in analyses like ANOVA. Random effects are specific to clusters or subjects within a population. It is typically not possible to include all the distinct levels of a random factor, but the researcher should attempt to account for as many random effects as possible to improve the reliability of the LMM.

The selected dataset for this report specifically represents longitudinal data, which is data where the dependent variable is measured at several points in time for each unit of analysis. Participant dropout is often a concern in the analysis of longitudinal data, with early time points often having a higher compliance rate than later time points. Along with clustered and repeated-measures data, longitudinal data is hierarchical because the observations can be placed into hierarchies or levels.

2.1 Mathematical Foundations

LMMs have a mathematical foundation stemming from linear algebra. We will be using notation for a 2-level longitudinal model since that is the structure of the dataset in this report. The index i is used to denote participants and t is used to denote the different time points of the observations. Given this notation t is the first level and i is the second level.

Simple LMMs can be defined as in Equation 1.

\[ y = X\beta + Zu + \epsilon \tag{1}\]

where:

  • y is the response vector.

  • X is the design matrix for fixed effects.

  • β is the vector of fixed effects (parameters associated with the entire population or certain repeatable levels of experimental factors).

  • Z is the design matrix for random effects.

  • u is the vector of random effects (random deviations from the population parameters β for different subjects or experimental units, i.e., the variability not explained by the fixed effects).

  • ϵ is the vector of residual errors.

Matrix and Vector Dimensions (Random Intercepts)

  • y is an N x 1 vector, where N is the total number of repeated measures across all subjects

  • X is an N x p matrix, where p is the number of fixed-effect covariates (including the intercept)

  • β is a p x 1 column vector

  • Z is an N x J matrix, where J is the number of subjects

  • u is a J x 1 vector

  • ϵ is an N x 1 vector

For a model with an intercept, the first column of the X matrix will be all 1s, and the first element in the β vector is the fixed intercept. The Z matrix in a random-intercepts model is a block diagonal matrix, with the blocks defined by the subject-specific Zi matrices.

Adding random effects to the model also changes the dimensions of Z. If one random effect (e.g., a random slope) is added alongside the random intercept, the dimensions change to N x 2J, which doubles the columns of the Z matrix. u will also double in length, to 2J x 1.
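As a concrete illustration of these dimensions, the indicator columns of Z for a random-intercepts model can be built in base R with model.matrix(). The subjects and counts below are invented for the sketch and are not from this report's dataset.

```r
# Toy sketch: the Z matrix of a random-intercepts model is an N x J
# indicator matrix with one column per subject (hypothetical data).
subject <- factor(rep(1:3, each = 2))  # J = 3 subjects, 2 measures each (N = 6)
Z <- model.matrix(~ 0 + subject)       # drop the intercept to get pure indicators

dim(Z)     # N x J, i.e., 6 x 3
Z[1:2, ]   # rows for subject 1: a 1 in column 1, 0s elsewhere
```

Sorting the rows by subject makes the block-diagonal structure described above visible directly in the printed matrix.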

2.1.1 Example

Now let’s go over an example with a 2-level longitudinal structure where we have 100 students with 10 test scores per student and the associated study time for those tests. In this case, the dependent variable is the variable concerning test scores, the fixed effect is the study time and the random effect is the student. For the sake of simplicity, we will only consider a random intercepts model.

Variable Breakdown:

  • N = 1000: the number of observations, i.e., the number of students multiplied by the number of test scores per student

  • J = 100: the number of students

  • p = 2: the fixed intercept and the fixed effect of study time

Matrix Notations and Dimension laid out:

\(Y_{1000\times1} = X_{1000\times 2} \; \beta_{2\times1} + Z_{1000\times100}\;u_{100\times1} + \epsilon_{1000\times1}\)

Example Matrices:

\(y = \begin{bmatrix} 75 \\ 80 \\ \vdots \\ 90 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & 5 \end{bmatrix}\)

where each entry of y is a test score, and the columns of X correspond to the intercept and study time.

\(\beta = \begin{bmatrix} 1.2\\2.3\end{bmatrix}\)

The matrix multiplication can also be broken down into individual equations. In the case of our example we get the following equations:

Level 1 (Time):

\(Y_{tj} = \beta_{0j} + \beta_{1} \cdot \text{StudyTime}_{tj} + e_{tj}\)

Level 2 (Student):

\(\beta_{0j} = \gamma_{00} + u_{0j}\)

Since this is a random intercepts model, only the intercept equation is needed at level 2. γ00 is the grand mean of the intercepts and u0j is the deviation of the jth group, which in our case is the student (Galecki 2014).
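To make the example concrete, the 2-level structure above can be simulated and fit in R. The variable names and true parameter values below are invented for illustration; the formula score ~ study_time + (1 | student) is the lme4 encoding of this random-intercepts model.

```r
# Simulated sketch of the 100-student example (hypothetical parameter values).
set.seed(42)
J <- 100; n_per <- 10                        # 100 students, 10 tests each (N = 1000)
student    <- factor(rep(1:J, each = n_per))
study_time <- runif(J * n_per, min = 1, max = 6)
u          <- rnorm(J, mean = 0, sd = 2)     # random intercept deviations u_0j
score      <- 60 + u[as.integer(student)] +  # gamma_00 = 60
              5 * study_time +               # fixed study-time slope = 5
              rnorm(J * n_per, sd = 3)       # residual errors e_tj
dat <- data.frame(student, study_time, score)

library(lme4)
fit <- lmer(score ~ study_time + (1 | student), data = dat)
fixef(fit)  # estimates of the grand intercept and the study-time slope
```

With N = 1000 observations, the fixed-effect estimates land close to the true values used in the simulation.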

2.1.2 Parameter Estimation

LMMs typically use maximum likelihood (ML) estimation or a variation called restricted maximum likelihood (REML) estimation. Both of these methods obtain the parameters β and θ by optimizing the likelihood function, where β are the fixed-effect parameters and θ are the covariance parameters, whose number depends on the random effects and the covariance matrix structure. Our model uses the REML method because it produces less biased estimates of the covariance parameters and is better suited to modeling random effects (Galecki 2014).
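The practical difference between the two estimators can be seen by fitting the same model both ways. The sketch below uses lme4's bundled sleepstudy data rather than this report's dataset.

```r
# REML (the lmer default) vs. ML fits of the same random-intercepts model.
library(lme4)
m_reml <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)                # REML
m_ml   <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy, REML = FALSE)  # ML

# The fixed effects agree closely, but the REML variance component is larger:
# ML does not account for the degrees of freedom used in estimating beta,
# so it biases the variance estimates downward.
c(REML = as.data.frame(VarCorr(m_reml))$vcov[1],
  ML   = as.data.frame(VarCorr(m_ml))$vcov[1])
```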

2.2 Assumptions

Although more flexible than other methods such as ANOVA, there are several assumptions for LMMs:

  1. The relationship between the predictors and response variable is assumed to be linear, within each level of random effects.

  2. Random effects (u) are assumed to follow a normal distribution with mean zero and variance-covariance matrix G.

    \(u \sim N(0,G)\)

  3. Residual errors (ϵ ) are assumed to follow a normal distribution with mean zero and variance-covariance matrix R.

    \(\epsilon \sim N(0,R)\)

  4. Random effects (u) and residual errors (ϵ ) are assumed to be independent.

  5. Homoscedasticity is assumed for the residuals across all levels of the independent variables.

There are several techniques that can be utilized to overcome violations in the LMM assumptions, including variable transformation (to achieve linearity or normality), using robust variance estimates, modifying the structure of random and fixed effects, and employing non-parametric methods or generalized linear mixed models (GLMMs) (Galecki 2014).
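Several of these assumptions can be checked graphically after fitting. The sketch below again uses lme4's bundled sleepstudy data as a stand-in for this report's dataset.

```r
# Graphical assumption checks for a fitted random-intercepts model.
library(lme4)
fit <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)

plot(fit)                                # residuals vs. fitted: homoscedasticity (5)
qqnorm(resid(fit)); qqline(resid(fit))   # normality of residuals (3)

re <- ranef(fit)$Subject[["(Intercept)"]]
qqnorm(re); qqline(re)                   # normality of random effects (2)
```

Marked curvature in either QQ plot, or a funnel shape in the residuals-vs-fitted plot, would suggest one of the remedies listed above (transformation, a different variance structure, or a GLMM).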

2.3 Sample Data Structure

LMMs require a tidy dataset, where each variable is a column and each observation is a row. Smaller datasets are usually saved as CSV files, while larger ones are often loaded directly from a database. The dataset can contain nulls, but they still need to be handled, whether omitted or imputed, depending on how many there are and which columns are affected. The lmer() function in the lme4 library will automatically drop rows with null values, so it is important that the data are inspected and visualized before constructing any models. Below is an example of tidy data.

Figure 1. Tidy data, as defined by Wickham et al.
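Because lmer() silently drops incomplete rows, it is worth counting nulls per column before fitting. A minimal base-R sketch with invented data:

```r
# Count missing values per column before modelling (hypothetical toy data).
df <- data.frame(id = c(1, 1, 2, 2),
                 y  = c(3.1, NA, 2.5, 2.9),
                 x  = c(10, 12, NA, 15))

colSums(is.na(df))          # NAs per column: id 0, y 1, x 1
df_complete <- na.omit(df)  # what lmer() would effectively model
nrow(df_complete)           # 2 complete rows remain
```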

LMMs also require that the structure of both random and fixed effects be defined before the model is created. The variables that vary randomly across groups and those that are fixed must be identified. There are different hierarchies in LMMs. First, we must distinguish between clustered and longitudinal data. Clustered data, as the name suggests, group the subjects (the units of analysis) into different groups. For a two-level dataset, students can be the unit of analysis with classrooms as the next level up; for a three-level dataset, schools can be added as the third level. Regardless of the number of levels, the first level is always the unit of analysis; in the example above, that is the students (Galecki 2014).

Then there is also longitudinal data, where repeated measures are at the first level and the unit of analysis is at the second level. For example, a dataset of patient cholesterol measurements over time has the measures at different timepoints as the first level and the patients at the second level (Galecki 2014).

2.4 Implementation in R

The implementation begins with importing the dataset into R from a file containing longitudinal retrospective data on the impact of BMI on IOS estimates of airway resistance and reactance in children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma). This dataset spans from 2015 to 2020. Data import is executed using the appropriate function, with consideration for specifying file paths and handling header information. Following data importation, preprocessing steps, such as handling missing values and ensuring data integrity, are performed (Galecki 2014).

2.4.1 Analysis Using lme() Function

After preprocessing the data, we proceed with fitting linear mixed-effects models (LMMs) using the lme() function from the nlme package.

The analysis employs the lme() function from the nlme package to fit linear mixed-effects models (LMMs). Model formulation involves specifying a model formula that includes both fixed effects (e.g., BMI, diagnosis of asthma, relevant covariates) and random effects (e.g., random intercepts for subjects). The random argument specifies the random-effects structure, while the data argument indicates the dataset to be used. The estimation method (method = "REML") is specified to use restricted maximum likelihood estimation. It is advantageous to use nlme because it offers an interface for modeling structure in the residuals (including forms of heteroscedasticity and autocorrelation) and in the random-effects covariance matrices.
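A call of this shape, illustrated here on nlme's bundled Orthodont data rather than this report's dataset, fits a subject-level random intercept under REML and uses the weights argument to model heterogeneous residual variances by group:

```r
# lme() with a subject-level random intercept, REML estimation, and
# group-specific residual variances (varIdent), on nlme's Orthodont data.
library(nlme)
fit <- lme(distance ~ age + Sex,
           random  = ~ 1 | Subject,                # random intercept per subject
           weights = varIdent(form = ~ 1 | Sex),   # separate residual SD per sex
           data    = Orthodont,
           method  = "REML")

summary(fit)$tTable   # fixed-effect estimates, SEs, t-values
```

For the report's dataset the formula would instead use the IOS outcome, BMI and covariates as fixed effects, and Subject_ID as the grouping factor, following the same pattern.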

2.4.2 Hypothesis Testing

Hypotheses are tested to guide model selection and refinement. For instance, Hypothesis 3.1 [1] assesses whether the variance of random effects is greater than zero, while Hypothesis 3.2 [2] investigates the presence of heterogeneous residual variances across treatment groups. These hypotheses are evaluated using likelihood ratio tests or F-tests, depending on the context.
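Such nested hypotheses can be tested by comparing fits with anova(). The sketch below tests heterogeneous residual variances (the analogue of Hypothesis 3.2) on nlme's Orthodont data; REML likelihood ratio tests are valid here because both models share the same fixed effects.

```r
# Likelihood ratio test for heterogeneous residual variances across groups.
library(nlme)
m0 <- lme(distance ~ age, random = ~ 1 | Subject,
          data = Orthodont, method = "REML")            # homogeneous variance
m1 <- update(m0, weights = varIdent(form = ~ 1 | Sex))  # adds one variance ratio

anova(m0, m1)   # LRT: a small p-value favours heterogeneous variances
```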

2.4.3 Model Refinement

Based on the outcomes of hypothesis testing and model diagnostics, the model may be refined by removing non-significant fixed effects or selecting an appropriate covariance structure for the residuals. This iterative process entails fitting alternative models and comparing their fit statistics or testing additional hypotheses.

2.4.4 Analysis Using lmer() Function

An alternative approach involves utilizing the lmer() function from the lme4 package to fit LMMs. This function follows a similar syntax to lme() but differs in how it handles random effects specification. lme4 offers several benefits compared to nlme, including: more efficient linear algebra tools (with associated performance enhancements), simpler syntax and more efficient implementation for fitting models with crossed random effects, implementation of profile likelihood confidence intervals on random-effects parameters, and the ability to fit GLMMs (Bates et al. 2015). Likelihood ratio tests and model diagnostics are employed to assess model fit and inform model selection (Bates et al. 2015).
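For instance, the profile-likelihood confidence intervals mentioned above are available through confint(). The sketch uses lme4's bundled sleepstudy data; note that profiling refits the model repeatedly and can take a few seconds.

```r
# Profile-likelihood confidence intervals from an lmer fit, including the
# random-intercept SD (.sig01) and residual SD (.sigma).
library(lme4)
fit <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)
ci <- confint(fit, method = "profile")
ci   # rows: .sig01, .sigma, (Intercept), Days
```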

2.4.5 Final Model Selection

The final model is selected based on a synthesis of statistical criteria, including model fit indices, significance of fixed effects, and the adequacy of the model’s assumptions. This selected model is then employed for interpretation and inference concerning the relationships between the predictor variables (e.g., BMI) and the response variable (e.g., IOS measures).
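In practice this synthesis often comes down to comparing candidate models on information criteria and likelihood ratio tests. A sketch on lme4's sleepstudy data, comparing a random-intercepts model against a random-slopes alternative:

```r
# Comparing nested candidate models on AIC, BIC, and a likelihood ratio test.
# anova() refits both models under ML so the comparison is valid.
library(lme4)
m_int   <- lmer(Reaction ~ Days + (1 | Subject),    sleepstudy)  # intercepts only
m_slope <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)  # + random slopes

anova(m_int, m_slope)   # lower AIC/BIC and a small p-value favour m_slope
```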

3 Analysis and Results

3.1 Packages

Code
pkgs <- c("tidyverse", "lme4", "nlme", "Matrix", "gt", "RefManageR",
          "DataExplorer", "gtsummary", "car", "reshape2")
# requireNamespace() is not vectorized, so check each package individually
for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
    }
}

library(tidyverse)
library(lme4)
library(nlme)
library(gt)
library(gtsummary)
library(RefManageR)
library(DataExplorer)
library(Matrix)
library(car)
library(reshape2)

#references <- ReadBib("references.bib")
#summary(references)
  • tidyverse: used for data wrangling and visualization.

  • lme4 and nlme: used for LMM within R.

  • Matrix: used for matrix manipulation.

  • gt: used for table generation.

  • gtsummary: used for summary table generation of descriptive statistics.

  • RefManageR: used for BibTex reference management.

  • DataExplorer: used for EDA.

  • car: used for QQ plots.

  • reshape2: used to reshape data.

An error was encountered with the Matrix and lme4 packages during model creation. If this error is encountered, please run:

remove.packages("Matrix")

remove.packages("lme4")

install.packages("lme4", type = "source")

library(lme4)

3.2 Data Ingestion

Code
# Load the dataset
BMI <- read.csv("data/BMI_IOS_SCD_Asthma.csv")

colnames(BMI) <- c("Group", "Subject_ID", "Observation_number", "Hydroxyurea", "Asthma", "ICS", "LABA", "Gender", "Age_months", "Height_cm", "Weight_Kg", "BMI", "R5Hz_PP", "R20Hz_PP", "X5Hz_PP", "Fres_PP")

BMI$Group <- as.factor(BMI$Group)
BMI$Subject_ID <- as.factor(BMI$Subject_ID)
BMI$Observation_number <- as.factor(BMI$Observation_number)

BMI from Kaggle (Impact of BMI on IOS measures on children (kaggle.com))

  • Description: This dataset is from a retrospective study to assess the impact of BMI on impulse oscillometry (IOS) estimates of airway resistance and reactance in children with sickle cell disease (C-SCD).

  • Detailed Description: The dataset comprises various attributes and measurements across its columns. Categorical variables, such as Group, Subject ID, Observation_number, Hydroxyurea, Asthma, ICS, LABA, and Gender, denote different groupings, individual subjects, and attributes like medication usage and gender. Numerical variables like Age (months), Height (cm), Weight (Kg), BMI, R5Hz_PP, R20Hz_PP, X5Hz_PP, and Fres_PP provide quantitative data on subjects’ characteristics and test results. Notably, the summary also identifies missing values, such as the 14 instances in the Fres_PP variable, which warrant consideration in subsequent analysis. These columns provide measurements and estimates related to airway resistance and reactance obtained using impulse oscillometry (IOS), which is a non-invasive method for assessing respiratory function. These parameters are valuable in understanding the impact of BMI on respiratory measures in children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma) participating in the study.

  • Why suitable for LMMs: The dataset has multiple observations, over time, for the same set of participants.

3.3 Exploratory Data Analysis (EDA)

The structure of the dataframe and variable descriptions are shown in Table 2 and Figure 2. Figures 3-10 systematically explore the features of the data and are described below.

Code
x <- BMI

str(x)
'data.frame':   219 obs. of  16 variables:
 $ Group             : Factor w/ 2 levels "C-Asthma","C-SCD": 2 2 2 2 2 2 2 2 2 2 ...
 $ Subject_ID        : Factor w/ 90 levels "1","2","3","4",..: 1 1 1 1 2 3 3 4 4 5 ...
 $ Observation_number: Factor w/ 6 levels "1","2","3","4",..: 1 2 3 4 1 1 2 1 2 1 ...
 $ Hydroxyurea       : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Asthma            : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ ICS               : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ LABA              : chr  "No" "No" "Yes" "Yes" ...
 $ Gender            : chr  "Male" "Male" "Male" "Male" ...
 $ Age_months        : int  239 193 212 224 204 178 186 222 210 196 ...
 $ Height_cm         : num  164 163 164 164 154 ...
 $ Weight_Kg         : num  61.5 62.3 63.1 63.7 66.4 51.9 56.7 66.7 66.9 52.9 ...
 $ BMI               : num  22.8 23.5 23.6 23.7 27.8 ...
 $ R5Hz_PP           : int  145 103 107 87 124 109 117 101 179 136 ...
 $ R20Hz_PP          : int  133 98 98 87 121 86 105 132 153 97 ...
 $ X5Hz_PP           : num  -456 111 174 -303 98 115 107 -216 195 140 ...
 $ Fres_PP           : int  NA 169 159 NA 135 148 159 NA 175 199 ...
Code
head(x)
  Group Subject_ID Observation_number Hydroxyurea Asthma ICS LABA Gender
1 C-SCD          1                  1         Yes    Yes Yes   No   Male
2 C-SCD          1                  2         Yes    Yes Yes   No   Male
3 C-SCD          1                  3         Yes    Yes Yes  Yes   Male
4 C-SCD          1                  4         Yes    Yes Yes  Yes   Male
5 C-SCD          2                  1          No     No  No   No Female
6 C-SCD          3                  1         Yes    Yes  No   No   Male
  Age_months Height_cm Weight_Kg   BMI R5Hz_PP R20Hz_PP X5Hz_PP Fres_PP
1        239     164.1      61.5 22.84     145      133    -456      NA
2        193     162.7      62.3 23.53     103       98     111     169
3        212     163.5      63.1 23.60     107       98     174     159
4        224     163.8      63.7 23.74      87       87    -303      NA
5        204     154.5      66.4 27.82     124      121      98     135
6        178     158.0      51.9 20.79     109       86     115     148
Code
variables <- colnames(x)

variables_table <- data.frame(
  Variable = variables,
  Description = c(
    "This column indicates the group to which the subject belongs. There are two groups in the study: children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).",
    "Each subject in the study is assigned a unique identifier or ID, which is listed in this column. The ID is used to differentiate between individual participants.",
    "This column represents the number assigned to each observation or measurement taken for a particular subject. Since this is a longitudinal study, multiple observations may be recorded for each subject over time.",
    "This column indicates whether the subject with sickle cell disease (C-SCD) received hydroxyurea treatment. Hydroxyurea is a medication commonly used for the treatment of sickle cell disease.",
    "This column indicates whether the subject has a diagnosis of asthma. It distinguishes between children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).",
    "This column indicates whether the subject is using inhaled corticosteroids (ICS). ICS is a type of medication commonly used for the treatment of asthma and certain other respiratory conditions.",
    "This column indicates whether the subject is using a long-acting beta-agonist (LABA). LABA is a type of medication often used in combination with inhaled corticosteroids for the treatment of asthma.",
    "This column represents the gender of the subject, indicating whether they are male or female",
    "This column specifies the age of the subject at the time of the observation or measurement. Age is typically measured in months.",
    "This column represents the height of the subject, typically measured in a standard unit of length, such as centimeters or inches. Height is an important variable to consider in assessing the impact of BMI on respiratory measures.",
    "This column indicates the weight of the subject at the time of the observation or measurement. Weight is typically measured in kilograms (Kg) and is an important variable for calculating the body mass index (BMI).",
    "Body Mass Index (BMI) is a measure that assesses body weight relative to height. It is calculated by dividing the weight of an individual (in kilograms) by the square of their height (in meters). The BMI column provides the calculated BMI value for each subject based on their weight and height measurements. BMI is commonly used as an indicator of overall body fatness and is often used to classify individuals into different weight categories (e.g., underweight, normal weight, overweight, obese).",
    "This column represents the estimate of airway resistance at 5 Hz using impulse oscillometry (IOS). Airway resistance is a measure of the impedance encountered by airflow during respiration. The R5Hz_PP value indicates the airway resistance at the frequency of 5 Hz and is obtained through the IOS testing.",
    "This column represents the estimate of airway resistance at 20 Hz using impulse oscillometry (IOS). Similar to R5Hz_PP, R20Hz_PP provides the measure of airway resistance at the frequency of 20 Hz based on the IOS testing.",
    "This column represents the estimate of airway reactance at 5 Hz using impulse oscillometry (IOS). Airway reactance is a measure of the elasticity and stiffness of the airway walls. The X5Hz_PP value indicates the airway reactance at the frequency of 5 Hz and is obtained through the IOS testing.",
    "This column represents the estimate of resonant frequency using impulse oscillometry (IOS). Resonant frequency is a measure of the point at which the reactance of the airways transitions from positive to negative during respiration. The Fres_PP value indicates the resonant frequency and is obtained through the IOS testing."
    )
)

variables_table %>%
  gt %>%
  tab_header(
    title = "Table 2. Variable Description"
  ) %>%
  tab_footnote(
    footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team."
  )
Table 2. Variable Description
Variable Description
Group This column indicates the group to which the subject belongs. There are two groups in the study: children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).
Subject_ID Each subject in the study is assigned a unique identifier or ID, which is listed in this column. The ID is used to differentiate between individual participants.
Observation_number This column represents the number assigned to each observation or measurement taken for a particular subject. Since this is a longitudinal study, multiple observations may be recorded for each subject over time.
Hydroxyurea This column indicates whether the subject with sickle cell disease (C-SCD) received hydroxyurea treatment. Hydroxyurea is a medication commonly used for the treatment of sickle cell disease.
Asthma This column indicates whether the subject has a diagnosis of asthma. It distinguishes between children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).
ICS This column indicates whether the subject is using inhaled corticosteroids (ICS). ICS is a type of medication commonly used for the treatment of asthma and certain other respiratory conditions.
LABA This column indicates whether the subject is using a long-acting beta-agonist (LABA). LABA is a type of medication often used in combination with inhaled corticosteroids for the treatment of asthma.
Gender This column represents the gender of the subject, indicating whether they are male or female
Age_months This column specifies the age of the subject at the time of the observation or measurement. Age is typically measured in months.
Height_cm This column represents the height of the subject, typically measured in a standard unit of length, such as centimeters or inches. Height is an important variable to consider in assessing the impact of BMI on respiratory measures.
Weight_Kg This column indicates the weight of the subject at the time of the observation or measurement. Weight is typically measured in kilograms (Kg) and is an important variable for calculating the body mass index (BMI).
BMI Body Mass Index (BMI) is a measure that assesses body weight relative to height. It is calculated by dividing the weight of an individual (in kilograms) by the square of their height (in meters). The BMI column provides the calculated BMI value for each subject based on their weight and height measurements. BMI is commonly used as an indicator of overall body fatness and is often used to classify individuals into different weight categories (e.g., underweight, normal weight, overweight, obese).
R5Hz_PP This column represents the estimate of airway resistance at 5 Hz using impulse oscillometry (IOS). Airway resistance is a measure of the impedance encountered by airflow during respiration. The R5Hz_PP value indicates the airway resistance at the frequency of 5 Hz and is obtained through the IOS testing.
R20Hz_PP This column represents the estimate of airway resistance at 20 Hz using impulse oscillometry (IOS). Similar to R5Hz_PP, R20Hz_PP provides the measure of airway resistance at the frequency of 20 Hz based on the IOS testing.
X5Hz_PP This column represents the estimate of airway reactance at 5 Hz using impulse oscillometry (IOS). Airway reactance is a measure of the elasticity and stiffness of the airway walls. The X5Hz_PP value indicates the airway reactance at the frequency of 5 Hz and is obtained through the IOS testing.
Fres_PP This column represents the estimate of resonant frequency using impulse oscillometry (IOS). Resonant frequency is a measure of the point at which the reactance of the airways transitions from positive to negative during respiration. The Fres_PP value indicates the resonant frequency and is obtained through the IOS testing.
Each variable in the dataset, accompanied by a qualitative description from the study team.
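The BMI formula described above can be verified with a few lines of base R. The height and weight values below are hypothetical, not data from this study:

```r
# Hypothetical height/weight values; BMI = weight (kg) / height (m)^2
height_cm <- c(163, 154, 170)
weight_kg <- c(62.3, 66.4, 85.0)

bmi <- weight_kg / (height_cm / 100)^2
round(bmi, 1)  # 23.4 28.0 29.4
```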
Code
plot_str(x)
introduce(x)
  rows columns discrete_columns continuous_columns all_missing_columns
1  219      16                8                  8                   0
  total_missing_values complete_rows total_observations memory_usage
1                   14           205               3504        34032
Code
plot_intro(x, title="Figure 2. Structure of variables and missing observations.")

3.2.1 Missing Values

Code
plot_missing(x, title="Figure 3. Breakdown of missing observations.")

Based on the missing values count in Figure 3, most columns contain no missing values; only Fres_PP has missing data (14 values, 6.39%). Omitting the rows with missing Fres_PP is reasonable here, given the small proportion of missing data relative to the total number of observations.
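The same per-column missingness check can be reproduced in base R. The sketch below uses a hypothetical toy data frame rather than the study data:

```r
# Toy data frame with NAs in one column
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, NA, 30, NA)
)

n_missing   <- colSums(is.na(toy))         # count of NAs per column
pct_missing <- 100 * n_missing / nrow(toy) # percentage missing per column

n_missing    # a: 0, b: 2
pct_missing  # a: 0,  b: 50
```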

3.2.2 Cleaning Data

Code
dim(x)
[1] 219  16
Code
x_clean <- na.omit(x) # drops NAs, further analysis is without NA values
x_clean$Gender <- tolower(x_clean$Gender)
dim(x_clean)
[1] 205  16
Code
str(x_clean)
'data.frame':   205 obs. of  16 variables:
 $ Group             : Factor w/ 2 levels "C-Asthma","C-SCD": 2 2 2 2 2 2 2 2 2 2 ...
 $ Subject_ID        : Factor w/ 90 levels "1","2","3","4",..: 1 1 2 3 3 4 5 6 6 7 ...
 $ Observation_number: Factor w/ 6 levels "1","2","3","4",..: 2 3 1 1 2 2 1 1 2 1 ...
 $ Hydroxyurea       : chr  "Yes" "Yes" "No" "Yes" ...
 $ Asthma            : chr  "Yes" "Yes" "No" "Yes" ...
 $ ICS               : chr  "Yes" "Yes" "No" "No" ...
 $ LABA              : chr  "No" "Yes" "No" "No" ...
 $ Gender            : chr  "male" "male" "female" "male" ...
 $ Age_months        : int  193 212 204 178 186 210 196 175 200 173 ...
 $ Height_cm         : num  163 164 154 158 164 ...
 $ Weight_Kg         : num  62.3 63.1 66.4 51.9 56.7 66.9 52.9 39 50.6 56.3 ...
 $ BMI               : num  23.5 23.6 27.8 20.8 20.9 ...
 $ R5Hz_PP           : int  103 107 124 109 117 179 136 96 123 117 ...
 $ R20Hz_PP          : int  98 98 121 86 105 153 97 109 115 117 ...
 $ X5Hz_PP           : num  111 174 98 115 107 195 140 111 164 92 ...
 $ Fres_PP           : int  169 159 135 148 159 175 199 104 109 152 ...
 - attr(*, "na.action")= 'omit' Named int [1:14] 1 4 8 13 18 21 44 46 50 52 ...
  ..- attr(*, "names")= chr [1:14] "1" "4" "8" "13" ...

Count Plots for Categorical Variables

The bar plots in Figure 4 show the frequency of each category within a variable. This can aid in data cleaning by revealing sparseness or class imbalances. It appears that:

  • Most cases are C-SCD compared to C-Asthma (class imbalance).

  • The number of observations decreases at subsequent measurements.

  • Most cases have Asthma (class imbalance).

  • Most cases do not have LABA (class imbalance).

  • Hydroxyurea, ICS, and Gender are relatively evenly distributed.

Code
plot_bar(x_clean, title = "Figure 4. Frequency plots of categorical variables.")

Histograms

Histograms in Figure 5 show the frequency distributions of the numerical variables, which helps identify the distribution type of each variable. Most of the variables exhibit an approximately normal distribution, with some (e.g., BMI) showing a slight right skew.

Code
plot_histogram(x_clean, title = "Figure 5. Histogram plots of numerical variables.")

Q-Q Plots

The Q-Q plots in Figure 6 serve as a visual aid to assess normality of the covariates. The closer the points lie to the straight diagonal line, the closer the data are to a normal distribution. Most of the variables appear approximately normal. BMI has a substantial number of points in the upper-right corner of the plot that deviate from the diagonal line, potentially indicating a non-normal distribution. Weight_Kg shows a similar skew.

Code
plot_qq(na.omit(x), title = "Figure 6. QQ plots to assess normality of numerical variables.")

Principal Component Analysis (PCA)

The PCA plots in Figure 7 show the numerical variables in our data set split into principal components. More than half (54.8%) of the variance can be explained with just 4 principal components. This can be useful if we want to simplify our model by only keeping the principal components that explain most of the variance.

Code
plot_prcomp(na.omit(x), title = "Figure 7. PCA to assess key principle components that explain the variance.")
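The idea behind Figure 7 can be sketched with base R's prcomp() on hypothetical data (not the study dataset): the cumulative proportion of variance indicates how many components are worth keeping.

```r
set.seed(42)
# Hypothetical data: three correlated variables plus one noise variable
n  <- 100
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),
  x3 = x1 + rnorm(n, sd = 0.3),
  x4 = rnorm(n)
)

pca <- prcomp(toy, scale. = TRUE)               # standardize before PCA
var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
cumsum(var_explained)                           # cumulative proportion of variance
```

Because three of the four variables share a common signal, the first component alone captures most of the variance in this toy example.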

Box Plots

Based on the boxplots in Figure 8, several variables contain outliers (Weight_Kg, BMI, R5Hz_PP, R20Hz_PP, and X5Hz_PP), while Age_months, Height_cm, and Fres_PP do not. Now, let’s pinpoint these outliers and calculate summary statistics (Table 3).

Figure 8. Boxplots of numerical variables.

Code
numeric_vars <- x_clean %>% 
  select_if(is.numeric)

# Boxplot for each numeric variable
par(mfrow=c(2, 2))
for (col in colnames(numeric_vars)) {
  boxplot(numeric_vars[[col]], main=col)
}

Code
# Adding a general title for the entire set of boxplots
#mtext("Figure 8. Box plots of numerical variables.", side=3, line=1, outer=TRUE, cex=1.5)
Code
# Define a function to detect outliers in each column
detect_outliers <- function(column) {
  Q1 <- quantile(column, 0.25)
  Q3 <- quantile(column, 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  outliers <- column[column < lower_bound | column > upper_bound]
  return(outliers)
}

# Iterate over each column and print outliers; not removed
for (col in names(numeric_vars)) {
  outliers <- detect_outliers(numeric_vars[[col]])
  if (length(outliers) > 0) {
    cat("Outliers in", col, ":\n")
    print(outliers)
    cat("\n")
  }
}
Outliers in Weight_Kg :
 [1]  91.4 100.2 104.4 111.2 120.4 108.6 105.1 100.0  95.5 130.1 108.5 124.6
[13]  95.2

Outliers in BMI :
 [1] 27.82 37.85 41.49 42.09 26.65 28.66 41.85 41.18 39.89 40.05 38.58 38.74
[13] 29.92 47.21 41.34 48.07 27.82 26.95

Outliers in R5Hz_PP :
[1] 187 183 196  10 199 198

Outliers in R20Hz_PP :
[1] 153 150  12 162 153

Outliers in X5Hz_PP :
 [1]  439 1117  260  270  297  465  338  285  306  -68  543  381
Code
x_clean %>% 
  select(-2) %>%
  tbl_summary( #gtSummary Table
    by=Group,
    type = list(
      c('Age_months', 'Height_cm', 'Weight_Kg', 'BMI', 'R5Hz_PP', 'R20Hz_PP', 'X5Hz_PP', 'Fres_PP') ~ 'continuous2'),
    statistic = all_continuous2() ~ c(
                       "{mean} ± {sd}",
                       "{median} ({p25}, {p75})",
                       "{min}, {max}"
                       ),
    digits = all_continuous2() ~ 2,
    missing="ifany",
  ) %>%
  bold_labels %>%
  italicize_levels() %>%
  as_gt() %>%
  tab_header(
    title = "Table 3. Summary Statistics"
  ) %>%
  tab_footnote(
    footnote = "Summary statistics for all variables."
  )
Table 3. Summary Statistics
Characteristic C-Asthma, N = 56¹ C-SCD, N = 149¹
Observation_number
    1 34 (61%) 51 (34%)
    2 13 (23%) 44 (30%)
    3 7 (13%) 31 (21%)
    4 2 (3.6%) 17 (11%)
    5 0 (0%) 5 (3.4%)
    6 0 (0%) 1 (0.7%)
Hydroxyurea 0 (0%) 112 (75%)
Asthma 56 (100%) 106 (71%)
ICS 40 (71%) 73 (49%)
LABA 24 (43%) 14 (9.4%)
Gender
    female 22 (39%) 59 (40%)
    male 34 (61%) 90 (60%)
Age_months
    Mean ± SD 132.98 ± 35.59 142.16 ± 44.70
    Median (IQR) 134.00 (100.00, 147.25) 144.00 (105.00, 179.00)
    Range 87.00, 213.00 50.00, 215.00
Height_cm
    Mean ± SD 146.57 ± 17.33 145.00 ± 18.83
    Median (IQR) 144.50 (131.00, 162.00) 146.20 (132.00, 160.30)
    Range 120.00, 185.00 101.80, 178.90
Weight_Kg
    Mean ± SD 52.98 ± 30.88 40.11 ± 16.06
    Median (IQR) 38.30 (30.60, 66.83) 36.70 (28.30, 51.90)
    Range 19.60, 130.10 14.90, 104.40
BMI
    Mean ± SD 23.02 ± 9.17 18.32 ± 4.10
    Median (IQR) 18.38 (17.17, 25.66) 17.59 (15.87, 19.48)
    Range 14.10, 48.07 13.70, 42.09
R5Hz_PP
    Mean ± SD 89.57 ± 25.09 107.67 ± 32.83
    Median (IQR) 85.50 (72.75, 104.50) 105.00 (85.00, 125.00)
    Range 43.00, 168.00 10.00, 199.00
R20Hz_PP
    Mean ± SD 77.16 ± 21.70 88.77 ± 24.61
    Median (IQR) 72.00 (62.00, 88.25) 88.00 (72.00, 102.00)
    Range 38.00, 135.00 12.00, 162.00
X5Hz_PP
    Mean ± SD 101.87 ± 51.52 138.04 ± 115.20
    Median (IQR) 96.00 (74.75, 117.75) 116.00 (83.00, 157.00)
    Range 1.69, 381.00 -68.00, 1,117.00
Fres_PP
    Mean ± SD 120.55 ± 31.25 136.75 ± 33.35
    Median (IQR) 111.50 (99.00, 142.50) 132.00 (110.00, 163.00)
    Range 61.00, 236.00 66.00, 225.00
Summary statistics for all variables.
1 n (%)

Participant Dropout

Figure 9 and Table 4 show how many subjects had data at each subsequent timepoint, and they suggest that this study experienced significant participant dropout over time. This dropout may or may not be attributable to the study itself and should be investigated further. A strength of LMMs is that they can handle unbalanced data (i.e., differing numbers of observations per patient), so we will continue with modeling regardless.

Code
x_clean_timepoints <- x_clean %>%
  group_by(Observation_number) %>%
  summarise(Unique_Subjects = n_distinct(Subject_ID))

x_clean_timepoints$Unique_Subjects <- as.numeric(x_clean_timepoints$Unique_Subjects)

ggplot(x_clean_timepoints, aes(x = Observation_number, y = Unique_Subjects)) +
  geom_point(size = 3, color = "blue") + # Add points for each observation
  geom_line(aes(group = 1), color = "blue") + # Connect the points with a line
  theme_minimal() +
  labs(title = "Figure 9. Participant dropout over time.",
       x = "Timepoint",
       y = "Number of Unique Subjects")

Code
x_clean_timepoints %>% 
  gt() %>%
  tab_header(
    title = "Table 4. Number of participants at each timepoint."
  ) %>%
  tab_footnote(
    footnote = "Counts of unique subjects reveal an increasing amount of missing data at subsequent observation visits."
  )
Table 4. Number of participants at each timepoint.
Observation_number Unique_Subjects
1 85
2 57
3 38
4 19
5 5
6 1
Counts of unique subjects reveal an increasing amount of missing data at subsequent observation visits.

3.2.3 Correlations

Figure 10 highlights correlations between variables that should be assessed before any modeling.

Age (months) and Height (cm): There is a strong positive correlation (0.914) between Age (months) and Height (cm). This implies that as age increases, height tends to increase as well. This correlation is expected, as children tend to grow taller as they get older.

Weight (Kg) and BMI: There is a strong positive correlation (0.927) between Weight (Kg) and BMI. This suggests that as weight increases, BMI (Body Mass Index) tends to increase as well. This correlation is expected because BMI is calculated using weight and height measurements.

R5Hz_PP and Fres_PP: There is a strong positive correlation (0.754) between R5Hz_PP and Fres_PP, indicating that subjects with higher airway resistance at 5 Hz also tend to have higher resonant frequencies.

Code
plot_correlation(na.omit(x), maxcat=5L, title = "Figure 10. Correlation matrix of all variables.")

Code
correlation_matrix <- cor(numeric_vars)
print(correlation_matrix)
           Age_months Height_cm Weight_Kg           BMI      R5Hz_PP   R20Hz_PP
Age_months  1.0000000 0.9136787 0.6972936  4.441633e-01 4.038596e-01 0.33200111
Height_cm   0.9136787 1.0000000 0.7478733  4.581481e-01 3.100369e-01 0.24291582
Weight_Kg   0.6972936 0.7478733 1.0000000  9.265856e-01 1.261621e-01 0.10391395
BMI         0.4441633 0.4581481 0.9265856  1.000000e+00 6.430664e-06 0.01337894
R5Hz_PP     0.4038596 0.3100369 0.1261621  6.430664e-06 1.000000e+00 0.70961067
R20Hz_PP    0.3320011 0.2429158 0.1039139  1.337894e-02 7.096107e-01 1.00000000
X5Hz_PP     0.3885715 0.3745554 0.1346387 -2.990486e-02 4.464806e-01 0.20055148
Fres_PP     0.5876287 0.5214880 0.1975967 -1.523383e-02 7.538067e-01 0.54826299
               X5Hz_PP     Fres_PP
Age_months  0.38857148  0.58762865
Height_cm   0.37455537  0.52148796
Weight_Kg   0.13463870  0.19759669
BMI        -0.02990486 -0.01523383
R5Hz_PP     0.44648059  0.75380665
R20Hz_PP    0.20055148  0.54826299
X5Hz_PP     1.00000000  0.50290470
Fres_PP     0.50290470  1.00000000

3.3 Linear Mixed Modeling

In this dataset, the variable of interest is body mass index (BMI). Controlled variables are also present, such as group, age, weight, height, and other co-morbidities; these are the fixed effects. Random variability may also exist between individual observations, which are nested within each subject; these represent the random effects, as shown in Table 5. In the initial model, Subject_ID was treated as the sole random effect. In the final model, both random effects were incorporated (Subject_ID, Observation_number).
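To make the fixed/random distinction concrete, the sketch below fits a random-intercept model with nlme on simulated data (hypothetical values, not the study dataset): BMI enters as a fixed effect, while Subject_ID contributes a random intercept.

```r
library(nlme)

set.seed(7)
# Hypothetical longitudinal data: 20 subjects, 4 observations each
n_subj <- 20
n_obs  <- 4
d <- data.frame(
  Subject_ID = factor(rep(1:n_subj, each = n_obs)),
  BMI        = rnorm(n_subj * n_obs, mean = 20, sd = 3)
)
subj_effect <- rnorm(n_subj, sd = 5)  # true random intercepts
d$outcome <- 100 - 2 * d$BMI +
  subj_effect[as.integer(d$Subject_ID)] + rnorm(nrow(d), sd = 4)

# Fixed effect: BMI; random effect: intercept varying by Subject_ID
m <- lme(fixed = outcome ~ BMI, random = ~ 1 | Subject_ID, data = d)
fixef(m)["BMI"]  # estimated slope, close to the true value of -2
```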

Code
variables_table2 <- variables_table %>%
  select(1) %>%
  mutate(Type = c(
    "Fixed",
    "Random",
    "Random",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed",
    "Fixed"
  )
  )

variables_table2 %>%
  gt %>%
  tab_header(
    title = "Table 5. Variable Categorization"
  ) %>%
  tab_footnote(
    footnote = "A break down of random and fixed effects based on the purpose of the study. Variable categorization is a crucial step in the LMM process."
  )
Table 5. Variable Categorization
Variable Type
Group Fixed
Subject_ID Random
Observation_number Random
Hydroxyurea Fixed
Asthma Fixed
ICS Fixed
LABA Fixed
Gender Fixed
Age_months Fixed
Height_cm Fixed
Weight_Kg Fixed
BMI Fixed
R5Hz_PP Fixed
R20Hz_PP Fixed
X5Hz_PP Fixed
Fres_PP Fixed
A break down of random and fixed effects based on the purpose of the study. Variable categorization is a crucial step in the LMM process.
Code
#lme()

# Fit models using a tidy and clear approach
model_lme <- lme(
  fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
  random = list(Subject_ID = pdIdent(~1)),
  data = x_clean,
  method = "REML"
)

#lmer() 

model_lmer <- lmer(
  formula = R5Hz_PP + R20Hz_PP + X5Hz_PP + Fres_PP ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + (1 | Subject_ID),
  data = x_clean
)

3.3.1 Initial Model

Equation 1. The initial linear mixed model.
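The rendered equation image is not reproduced here. Given the specification above (a random intercept per subject), the initial model for each IOS outcome takes the following general form, where \(\mathbf{x}_{ij}\) collects the remaining fixed-effect covariates (asthma, ICS, LABA, gender, age, height, weight) — a plausible rendering, since the original image is unavailable:

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{BMI}_{ij} + \boldsymbol{\beta}_2^{\top}\mathbf{x}_{ij} + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}(0,\,\sigma_b^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0,\,\sigma^2)
```

Here \(y_{ij}\) is the outcome for subject \(i\) at observation \(j\), \(b_i\) is the subject-level random intercept, and \(\varepsilon_{ij}\) is the residual error.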

Code
# Compare models based on AIC
aic_lme <- AIC(model_lme)
aic_lmer <- AIC(model_lmer)

cat(sprintf("AIC for lme model: %f\n", aic_lme))
AIC for lme model: 1898.947925
Code
cat(sprintf("AIC for lmer model: %f\n", aic_lmer))
AIC for lmer model: 2517.366883
Code
# Correctly assign final_model based on AIC comparison
if (aic_lme < aic_lmer) {
  final_model <- model_lme
  model_type <- "lme"
} else {
  final_model <- model_lmer
  model_type <- "lmer"
}
cat(sprintf("Final model selected: %s\n", model_type))
Final model selected: lme
Code
# Since final_model is now correctly assigned, we can call summary on it
summary(final_model)
Linear mixed-effects model fit by REML
  Data: x_clean 
       AIC      BIC   logLik
  1898.948 1935.007 -938.474

Random effects:
 Formula: ~1 | Subject_ID
        (Intercept) Residual
StdDev:    20.45439 19.76738

Fixed effects:  cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS +      LABA + Gender + Age_months + Height_cm + Weight_Kg 
                Value Std.Error  DF   t-value p-value
(Intercept) 219.89260  86.92601 111  2.529653  0.0128
BMI          -5.89676   2.67942 111 -2.200761  0.0298
AsthmaYes    16.70033   7.68459  85  2.173222  0.0325
ICSYes       -4.93414   4.87163 111 -1.012830  0.3133
LABAYes      -4.35652   6.08228 111 -0.716265  0.4753
Gendermale   -5.68901   5.71803  85 -0.994924  0.3226
Age_months    0.49796   0.13627 111  3.654088  0.0004
Height_cm    -1.04120   0.60384 111 -1.724307  0.0874
Weight_Kg     1.65587   1.01041 111  1.638804  0.1041
 Correlation: 
           (Intr) BMI    AsthmY ICSYes LABAYs Gndrml Ag_mnt Hght_c
BMI        -0.927                                                 
AsthmaYes  -0.135  0.054                                          
ICSYes      0.115 -0.074 -0.378                                   
LABAYes    -0.011  0.026 -0.055 -0.298                            
Gendermale -0.146  0.146 -0.009 -0.070  0.037                     
Age_months  0.329 -0.101  0.049 -0.002  0.029  0.146              
Height_cm  -0.965  0.830  0.074 -0.110  0.002  0.050 -0.535       
Weight_Kg   0.938 -0.986 -0.082  0.087 -0.030 -0.122  0.087 -0.852

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.8899152 -0.4840087 -0.1223361  0.4272441  2.5859876 

Number of Observations: 205
Number of Groups: 88 

Akaike Information Criterion (AIC)

The AIC for both models was calculated. The AIC is a measure of the relative quality of statistical models for a given set of data. Lower AIC values indicate a model that better fits the data without unnecessary complexity.

Here, the AIC for lme was 1898.95 while lmer was 2517.37.

The model with the lower AIC (lme) was selected as the final model, despite the computational performance improvements offered by the lme4 package. All subsequent models were therefore fit with lme.
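As a sanity check on what AIC measures, it can be recomputed by hand as AIC = 2k − 2·logLik, where k counts all estimated parameters. The sketch below uses a simple hypothetical lm fit rather than the mixed models above:

```r
set.seed(1)
# Hypothetical linear model, purely to illustrate the AIC formula
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

k <- length(coef(fit)) + 1  # intercept + slope + residual variance
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))

all.equal(aic_manual, AIC(fit))  # TRUE: matches R's built-in AIC()
```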

Residuals

Residual plots (Residuals vs. Fitted Values) were created for the lme model to assess the goodness of fit in Figure 11. A horizontal line at y=0 was added as a reference. These plots help in identifying non-linearity, unequal variances, and outliers.

Based on the residual plot, the model has an ideal random pattern of scattered values with a few possible outliers.

Code
# Residuals
residuals_final <- resid(final_model)

# Calculate fitted values and residuals from the final model
fitted_values <- fitted(final_model)
residual_values <- residuals(final_model)

# Create a data frame explicitly for plotting
plot_data <- data.frame(Fitted = fitted_values, Residuals = residual_values)

# Plotting using ggplot2 for a more flexible and powerful approach
# Residuals vs Fitted Values
ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted Values", y = "Residuals", title = "Figure 11. Residuals vs. Fitted Values")

Histogram of Residuals and QQ Plots

A histogram and a Q-Q (Quantile-Quantile) plot of the residuals were used to check the normality assumption of the residuals (Figure 12). Finally, a QQ plot with a QQ line was produced for a graphical normality check (Figure 13).

Based on the histogram, the model visually had an ideal bell-shaped curve that resembles the normal distribution. Based on the QQ plot, the model graphically may have had some residuals that were not normally distributed toward the ends.

Code
# Histogram of Residuals
ggplot(plot_data, aes(x = Residuals)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Figure 12. Histogram of Residuals")

Code
# Q-Q Plot
qqPlot(residuals_final, main = "Figure 13. Q-Q Plot of Residuals")

 35  40 
 95 113 
Code
# Shapiro-Wilk Normality Test
shapiro_test_results <- shapiro.test(residuals_final)
print(shapiro_test_results)

    Shapiro-Wilk normality test

data:  residuals_final
W = 0.95883, p-value = 1.163e-05

The Shapiro-Wilk test was conducted on the residuals to formally test for normality.

\(H_o\): the residuals are normally distributed.

\(H_a\): the residuals are not normally distributed.

\(\alpha\) = 0.05

In this case, P = 0.00001163. Since P < 0.05, the null hypothesis was rejected, suggesting that the residuals were not normally distributed. This model does not satisfy the assumptions of LMMs.
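The behavior of the Shapiro-Wilk test can be illustrated on hypothetical samples: draws from a normal distribution should typically yield a large p-value, while strongly skewed draws yield a very small one.

```r
set.seed(123)
normal_sample <- rnorm(200)  # approximately normal draws
skewed_sample <- rexp(200)   # strongly right-skewed draws

shapiro.test(normal_sample)$p.value  # typically well above 0.05
shapiro.test(skewed_sample)$p.value  # far below 0.05: normality rejected
```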

3.3.2 Imputed Model

Outliers (as mentioned above) were present in most variables, and the residuals of the initial model were not normally distributed. To improve model performance, outliers were imputed by Winsorization: values below the 10th percentile or above the 90th percentile were replaced with those threshold values. The model was then regenerated and assessed using the same metrics as above (Figures 14-17).

Figure 14. Box plots of numerical variables.

Code
# Copy the original dataset
x_clean_imputed <- x_clean

# Define a function for Winsorization
winsorize <- function(x, lower_percentile = 0.10, upper_percentile = 0.90) {
  lower_threshold <- quantile(x, lower_percentile)
  upper_threshold <- quantile(x, upper_percentile)
  x[x < lower_threshold] <- lower_threshold
  x[x > upper_threshold] <- upper_threshold
  
  return(x)
}

# Apply imputation across numeric variables in the copied dataset
numeric_vars <- names(x_clean_imputed %>% select_if(is.numeric))
for (col in numeric_vars) {
  x_clean_imputed[[col]] <- winsorize(x_clean_imputed[[col]])
}

# Visualization with ggplot2
# Plot boxplots for each numeric variable after imputation
for (col in numeric_vars) {
  p <- ggplot(data = x_clean_imputed, aes(x = "", y = !!sym(col))) +
    geom_boxplot(fill = "skyblue", color = "blue") +
    labs(title = paste("Boxplot of", col), x = "", y = col)
  print(p)
}
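The Winsorization logic can be sanity-checked on a small hypothetical vector: values below the 10th percentile are raised to it, and values above the 90th percentile are capped. The function is repeated here so the sketch is self-contained.

```r
# Same Winsorization logic as above, applied to a toy vector
winsorize <- function(x, lower_percentile = 0.10, upper_percentile = 0.90) {
  lower_threshold <- quantile(x, lower_percentile)
  upper_threshold <- quantile(x, upper_percentile)
  x[x < lower_threshold] <- lower_threshold
  x[x > upper_threshold] <- upper_threshold
  x
}

toy <- c(1:10, 100)  # 100 is an obvious outlier
winsorize(toy)       # 2 2 3 4 5 6 7 8 9 10 10
```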

Code
# Adding a general title for the entire set of boxplots
#mtext("Figure 14. Box plots of numerical variables.", side=3, line=1, outer=TRUE, cex=1.5)

# Modeling with Imputed Data
# Refit the model using the lme function with the cleaned data
model_lme_imputed <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
                       random = list(Subject_ID = pdIdent(~1)),
                       data = x_clean_imputed,
                       method = "REML")

aic_lme_imputed <- AIC(model_lme_imputed)

cat(sprintf("AIC for lme model: %f\n", aic_lme_imputed))
AIC for lme model: 1790.908121
Code
# Extract residuals
residuals_imputed <- resid(model_lme_imputed)

# Residuals vs Fitted Values Plot
ggplot(data = data.frame(Fitted = fitted(model_lme_imputed), Residuals = residuals_imputed), aes(x = Fitted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted Values", y = "Residuals", title = "Figure 15. Residuals vs. Fitted Values")

Code
# Histogram of Residuals
ggplot(data = data.frame(Residuals = residuals_imputed), aes(x = Residuals)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Figure 16. Histogram of Residuals")

Code
qqPlot(residuals_imputed, main = "Figure 17. Q-Q Plot of Residuals")

17 29 
39 72 
Code
# Q-Q Plot and Shapiro-Wilk Test
shapiro_test_results <- shapiro.test(residuals_imputed)
print(shapiro_test_results)

    Shapiro-Wilk normality test

data:  residuals_imputed
W = 0.98664, p-value = 0.05066

The AIC was calculated as 1790.91, an improvement on the initial model.

The Shapiro-Wilk test was conducted on the residuals to formally test for normality.

\(H_o\): the residuals are normally distributed.

\(H_a\): the residuals are not normally distributed.

\(\alpha\) = 0.05

In this case, P = 0.05066. Since P > 0.05, we failed to reject the null hypothesis, suggesting that the residuals were normally distributed after threshold imputation. This model now satisfies the assumptions of LMMs.

3.3.3 Final Model

This was a longitudinal study involving multiple observations for each subject over time, with subjects grouped into two categories (children with sickle cell disease and African-American children with asthma). Thus, in this final model, Group was modeled as a fixed effect, since we were interested in the effect of the group itself on the outcome. Subject_ID was kept as a random effect to account for the repeated measures within subjects, and Observation_number was included as an additional random effect within Subject_ID (i.e., nested within Subject_ID). The same visualizations and tests were completed to assess the LMM assumptions (Figures 18-20). The residuals show a random pattern (Figure 18), the histogram is approximately normal (Figure 19), and the Q-Q plot follows a straight line (Figure 20), indicating normality.
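In nlme syntax, the difference between the initial and final random-effects structures can be sketched as follows (a schematic, not runnable on its own; pdIdent constrains all listed random effects to share a single variance):

```r
# Initial model: a single random intercept per subject
#   random = list(Subject_ID = pdIdent(~ 1))

# Final model: random intercept plus per-observation-number effects
# within each subject, all constrained to one common variance by pdIdent
#   random = list(Subject_ID = pdIdent(~ 1 + Observation_number))
```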

Code
model_lme_imputed_final <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + Group,
                         data = x_clean_imputed,
                         random = list(Subject_ID = pdIdent(~1 + Observation_number)),
                         method = "REML")
str(model_lme_imputed_final)
List of 18
 $ modelStruct :List of 1
  ..$ reStruct:List of 1
  .. ..$ Subject_ID:'pdIdent' with matrix num [1:6, 1:6] 1.99 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:6] "(Intercept)" "Observation_number2" "Observation_number3" "Observation_number4" ...
  .. .. .. ..$ : chr [1:6] "(Intercept)" "Observation_number2" "Observation_number3" "Observation_number4" ...
  .. ..- attr(*, "settings")= int [1:4] 1 1 0 2
  .. ..- attr(*, "class")= chr "reStruct"
  .. ..- attr(*, "plen")= Named int 1
  .. .. ..- attr(*, "names")= chr "Subject_ID"
  ..- attr(*, "settings")= num [1:4] 1 0 1 2
  ..- attr(*, "class")= chr [1:3] "lmeStructInt" "lmeStruct" "modelStruct"
  ..- attr(*, "pmap")= logi [1, 1] TRUE
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "reStruct"
  ..- attr(*, "fixedSigma")= logi FALSE
 $ dims        :List of 5
  ..$ N    : int 205
  ..$ Q    : int 1
  ..$ qvec : Named num [1:3] 6 0 0
  .. ..- attr(*, "names")= chr [1:3] "Subject_ID" "" ""
  ..$ ngrps: Named int [1:3] 88 1 1
  .. ..- attr(*, "names")= chr [1:3] "Subject_ID" "X" "y"
  ..$ ncol : Named num [1:3] 6 10 1
  .. ..- attr(*, "names")= chr [1:3] "Subject_ID" "" ""
 $ contrasts   :List of 2
  ..$ Observation_number: num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:5] "2" "3" "4" "5" ...
  ..$ Group             : num [1:2, 1] 0 1
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2] "C-Asthma" "C-SCD"
  .. .. ..$ : chr "C-SCD"
 $ coefficients:List of 2
  ..$ fixed : Named num [1:10] 154.99 -3.77 15.54 -3.26 -3.16 ...
  .. ..- attr(*, "names")= chr [1:10] "(Intercept)" "BMI" "AsthmaYes" "ICSYes" ...
  ..$ random:List of 1
  .. ..$ Subject_ID: num [1:88, 1:6] -3.56 8.76 -1.48 13.86 8.56 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:88] "1" "2" "3" "4" ...
  .. .. .. ..$ : chr [1:6] "(Intercept)" "Observation_number2" "Observation_number3" "Observation_number4" ...
 $ varFix      : num [1:10, 1:10] 4621.9 -125.4 -56.5 29.6 -33.7 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "(Intercept)" "BMI" "AsthmaYes" "ICSYes" ...
  .. ..$ : chr [1:10] "(Intercept)" "BMI" "AsthmaYes" "ICSYes" ...
 $ sigma       : num 11
 $ apVar       : num [1:2, 1:2] 0.0107 -0.0118 -0.0118 0.0277
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "reStruct.Subject_ID" "lSigma"
  .. ..$ : chr [1:2] "reStruct.Subject_ID" "lSigma"
  ..- attr(*, "Pars")= Named num [1:2] 2.74 2.4
  .. ..- attr(*, "names")= chr [1:2] "reStruct.Subject_ID" "lSigma"
  ..- attr(*, "natural")= logi TRUE
 $ logLik      : num -889
 $ numIter     : NULL
 $ groups      :'data.frame':   205 obs. of  1 variable:
  ..$ Subject_ID: Factor w/ 88 levels "1","2","3","4",..: 1 1 2 3 3 4 5 6 6 7 ...
 $ call        : language lme.formula(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI +      Asthma + ICS + LABA + Gender + Age_mo| __truncated__ ...
 $ terms       :Classes 'terms', 'formula'  language cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA +      Gender + Age_months + Height_cm + W| __truncated__
  .. ..- attr(*, "variables")= language list(cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP), BMI, Asthma, ICS, LABA,      Gender, Age_months, Height_cm, Weight_Kg, Group)
  .. ..- attr(*, "factors")= int [1:10, 1:9] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:10] "cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP)" "BMI" "Asthma" "ICS" ...
  .. .. .. ..$ : chr [1:9] "BMI" "Asthma" "ICS" "LABA" ...
  .. ..- attr(*, "term.labels")= chr [1:9] "BMI" "Asthma" "ICS" "LABA" ...
  .. ..- attr(*, "order")= int [1:9] 1 1 1 1 1 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP), BMI, Asthma, ICS, LABA,      Gender, Age_months, Height_cm, Weight_Kg, Group)
  .. ..- attr(*, "dataClasses")= Named chr [1:10] "nmatrix.4" "numeric" "character" "character" ...
  .. .. ..- attr(*, "names")= chr [1:10] "cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP)" "BMI" "Asthma" "ICS" ...
 $ method      : chr "REML"
 $ fitted      : num [1:205, 1:2] 112 110 111 113 113 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:205] "2" "3" "5" "6" ...
  .. ..$ : chr [1:2] "fixed" "Subject_ID"
 $ residuals   : num [1:205, 1:2] -9.09 -3.37 13.18 -4.1 4.14 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:205] "2" "3" "5" "6" ...
  .. ..$ : chr [1:2] "fixed" "Subject_ID"
  ..- attr(*, "std")= num [1:205] 11 11 11 11 11 ...
 $ fixDF       :List of 2
  ..$ X    : Named num [1:10] 111 111 84 111 111 84 111 111 111 84
  .. ..- attr(*, "names")= chr [1:10] "(Intercept)" "BMI" "AsthmaYes" "ICSYes" ...
  ..$ terms: Named num [1:10] 111 111 84 111 111 84 111 111 111 84
  .. ..- attr(*, "names")= chr [1:10] "(Intercept)" "BMI" "Asthma" "ICS" ...
  ..- attr(*, "assign")=List of 10
  .. ..$ (Intercept): int 1
  .. ..$ BMI        : int 2
  .. ..$ Asthma     : int 3
  .. ..$ ICS        : int 4
  .. ..$ LABA       : int 5
  .. ..$ Gender     : int 6
  .. ..$ Age_months : int 7
  .. ..$ Height_cm  : int 8
  .. ..$ Weight_Kg  : int 9
  .. ..$ Group      : int 10
  ..- attr(*, "varFixFact")= num [1:10, 1:10] -1.95 -10.45 -3.1 -1.14 -0.55 ...
 $ na.action   : NULL
 $ data        :'data.frame':   205 obs. of  16 variables:
  ..$ Group             : Factor w/ 2 levels "C-Asthma","C-SCD": 2 2 2 2 2 2 2 2 2 2 ...
  ..$ Subject_ID        : Factor w/ 90 levels "1","2","3","4",..: 1 1 2 3 3 4 5 6 6 7 ...
  ..$ Observation_number: Factor w/ 6 levels "1","2","3","4",..: 2 3 1 1 2 2 1 1 2 1 ...
  ..$ Hydroxyurea       : chr [1:205] "Yes" "Yes" "No" "Yes" ...
  ..$ Asthma            : chr [1:205] "Yes" "Yes" "No" "Yes" ...
  ..$ ICS               : chr [1:205] "Yes" "Yes" "No" "No" ...
  ..$ LABA              : chr [1:205] "No" "Yes" "No" "No" ...
  ..$ Gender            : chr [1:205] "male" "male" "female" "male" ...
  ..$ Age_months        : num [1:205] 193 198 198 178 186 ...
  ..$ Height_cm         : num [1:205] 163 164 154 158 164 ...
  ..$ Weight_Kg         : num [1:205] 62.3 63.1 65.7 51.9 56.7 ...
  ..$ BMI               : num [1:205] 23.5 23.6 24.6 20.8 20.9 ...
  ..$ R5Hz_PP           : num [1:205] 103 107 124 109 117 146 136 96 123 117 ...
  ..$ R20Hz_PP          : num [1:205] 98 98 117 86 105 117 97 109 115 117 ...
  ..$ X5Hz_PP           : num [1:205] 111 174 98 115 107 195 140 111 164 92 ...
  ..$ Fres_PP           : num [1:205] 169 159 135 148 159 175 176 104 109 152 ...
  ..- attr(*, "na.action")= 'omit' Named int [1:14] 1 4 8 13 18 21 44 46 50 52 ...
  .. ..- attr(*, "names")= chr [1:14] "1" "4" "8" "13" ...
 - attr(*, "class")= chr "lme"
Code
aic_lme_imputed_final <- AIC(model_lme_imputed_final)

cat(sprintf("AIC for lme model: %f\n", aic_lme_imputed_final))
AIC for lme model: 1801.597804
Code
# Extract residuals
residuals_imputed <- resid(model_lme_imputed_final)

# Residuals vs Fitted Values Plot
ggplot(data = data.frame(Fitted = fitted(model_lme_imputed_final), Residuals = residuals_imputed), aes(x = Fitted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted Values", y = "Residuals", title = "Figure 18. Residuals vs. Fitted Values")

Code
# Histogram of Residuals
ggplot(data = data.frame(Residuals = residuals_imputed), aes(x = Residuals)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Figure 19. Histogram of Residuals")

Code
# Q-Q Plot of Residuals
qqPlot(residuals_imputed, main = "Figure 20. Q-Q Plot of Residuals")

17 29 
39 72 
Code
# Shapiro-Wilk Test for Normality of Residuals
shapiro_test_results <- shapiro.test(residuals_imputed)
print(shapiro_test_results)

    Shapiro-Wilk normality test

data:  residuals_imputed
W = 0.98676, p-value = 0.0529

Equation 2. The final linear mixed model.
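The equation image is not reproduced here. Given the specification above (Group added as a fixed effect, with observation-level random effects alongside the subject-level intercept, all sharing one variance under pdIdent), the final model takes the following general form — a plausible rendering, since the original image is unavailable — where \(\mathbf{x}_{ij}\) again collects the remaining fixed-effect covariates:

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{BMI}_{ij} + \boldsymbol{\beta}_2^{\top}\mathbf{x}_{ij}
       + \beta_3\,\mathrm{Group}_i + b_{0i} + b_{ij} + \varepsilon_{ij},
\qquad b_{0i},\, b_{ij} \sim \mathcal{N}(0,\,\sigma_b^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0,\,\sigma^2)
```

Here \(b_{0i}\) is the subject-level intercept and \(b_{ij}\) the observation-number effect within subject \(i\); pdIdent constrains both to the common variance \(\sigma_b^2\).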

The AIC was calculated as 1801.60, an improvement on the initial model but not on the less complex imputed model, as shown in Figure 21. The AIC penalizes model complexity to avoid overfitting, suggesting that the added effects of Group and Observation_number may not improve model fit enough to offset the added complexity. However, these effects may still be relevant given the research goals of the project despite the slight increase in AIC, and thus they were retained in the final model.

Code
# Model names
model_names <- c("1. LME Model", "2. LME Imputed Model", "3. LME Imputed Final Model")

# Combining into a dataframe
aic_review <- data.frame(
  Model = model_names,
  AIC = c(aic_lme, aic_lme_imputed, aic_lme_imputed_final)
)

aic_review$Model <- as.factor(aic_review$Model)
aic_review$AIC <- round(aic_review$AIC, 2)

# Check the structure
str(aic_review)
'data.frame':   3 obs. of  2 variables:
 $ Model: Factor w/ 3 levels "1. LME Model",..: 1 2 3
 $ AIC  : num  1899 1791 1802
Code
ggplot(aic_review, aes(x = Model, y = AIC, fill = Model)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Figure 21. AIC Values for Different Models",
       x = "Model",
       y = "AIC Value") +
  geom_text(aes(label = AIC), vjust = -0.3, size = 3.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

The Shapiro-Wilk test was conducted on the residuals to formally test for normality.

\(H_o\): the residuals are normally distributed.

\(H_a\): the residuals are not normally distributed.

\(\alpha\) = 0.05

In this case, P = 0.0529. Since P > 0.05, we fail to reject the null hypothesis, suggesting that the residuals are normally distributed after threshold imputation. The final model therefore satisfies the assumptions of LMMs.
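The same decision rule can be seen on simulated samples (a sketch on made-up data, not the study residuals):

```r
# shapiro.test() on simulated data: a Gaussian sample should usually give
# p > 0.05 (fail to reject normality), while a strongly skewed sample
# gives p < 0.05 (reject normality).
set.seed(7)
normal_sample <- rnorm(200)
skewed_sample <- rexp(200)    # strongly right-skewed

p_normal <- shapiro.test(normal_sample)$p.value   # typically above 0.05
p_skewed <- shapiro.test(skewed_sample)$p.value   # far below 0.05
```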

3.3.4 Predictions

Code
# Compute residuals for each model, then summarize predictive accuracy
# with mean squared error (MSE) and mean absolute error (MAE)

set.seed(43)

lme_resids <- residuals(model_lme)
lme_imputed_resids <- residuals(model_lme_imputed)
lme_imputed_final_resids <- residuals(model_lme_imputed_final)

# MSE: mean of the squared residuals
lme_mse <- mean(lme_resids^2)
lme_imputed_mse <- mean(lme_imputed_resids^2)
lme_imputed_final_mse <- mean(lme_imputed_final_resids^2)

# MAE: mean of the absolute residuals
lme_mae <- mean(abs(lme_resids))
lme_imputed_mae <- mean(abs(lme_imputed_resids))
lme_imputed_final_mae <- mean(abs(lme_imputed_final_resids))

mse_review <- data.frame(
  Model = model_names,
  MSE = c(lme_mse, lme_imputed_mse, lme_imputed_final_mse)
)
mse_review$MSE <- round(mse_review$MSE, digits = 2)

mae_review <- data.frame(
  Model = model_names,
  MAE = c(lme_mae, lme_imputed_mae, lme_imputed_final_mae)
)
mae_review$MAE <- round(mae_review$MAE, digits = 2)

MSE and MAE

Mean Squared Error (MSE) and Mean Absolute Error (MAE) are metrics used to assess the performance of a model: MSE is the mean of the squared residuals, and MAE is the mean of the absolute values of the residuals. As shown in Figures 22 and 23, the imputed final model outperforms the other two models by a significant margin. It is important to note that MSE is affected more strongly by large errors or outliers because it squares the residuals.
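The difference in outlier sensitivity is easy to demonstrate on two hypothetical residual vectors that differ only in a single extreme value:

```r
# Two hypothetical residual vectors: identical except for one outlier.
resids_clean   <- c(-2, -1, 0, 1, 2)
resids_outlier <- c(-2, -1, 0, 1, 20)

mse <- function(r) mean(r^2)
mae <- function(r) mean(abs(r))

mse(resids_clean)    # 2
mae(resids_clean)    # 1.2
mse(resids_outlier)  # 81.2 -- the squared outlier dominates
mae(resids_outlier)  # 4.8  -- grows far less
```

A single large residual multiplies MSE by roughly 40 here while MAE only quadruples, which is why the two metrics can rank models differently.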

Code
ggplot(mse_review, aes(x = Model, y = MSE, fill = Model)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Figure 22. MSE Values for Different Models",
       x = "Model",
       y = "MSE Value") +
  geom_text(aes(label = MSE), vjust = -0.3, size = 3.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

Code
ggplot(mae_review, aes(x = Model, y = MAE, fill = Model)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Figure 23. MAE Values for Different Models",
       x = "Model",
       y = "MAE Value") +
  geom_text(aes(label = MAE), vjust = -0.3, size = 3.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

Sample Predictions vs Actual

The bar graph below compares the actual and predicted R5Hz_PP (as a measure of airway resistance and reactance) for 10 randomly sampled subjects. The difference between the bars for each subject is the residual error. The small residual error for each subject suggests that the model predicts R5Hz_PP accurately.

Code
lme_imputed_final_predictions <- predict(model_lme_imputed_final)
lme_imputed_final_preds_actuals <- data.frame(cbind(lme_imputed_final_predictions, x_clean_imputed$R5Hz_PP))
colnames(lme_imputed_final_preds_actuals) <- c("Predicted_R5Hz_PP", "Actual_R5Hz_PP")

set.seed(42)

# Sample 10 random subjects and reshape to long format for plotting
sample_indices <- sample(nrow(x_clean_imputed), 10)
sample_pred_actuals <- lme_imputed_final_preds_actuals[sample_indices, ]
sample_pred_actuals$row <- 1:10

sample_pred_actuals_melt <- melt(sample_pred_actuals, id.vars = "row")


ggplot(sample_pred_actuals_melt, aes(x = factor(row), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Observation", y = "R5Hz_PP", fill = "") +
  theme_minimal() +
  theme(legend.position = "top") +
  ggtitle("Figure 24. Sample Comparison of Predicted and Actual Values")

4 Conclusion

Linear mixed models are versatile tools for modeling complex relationships with multiple fixed and random effects, as well as missing and non-independent data. For the given capstone dataset, the final linear mixed model reliably predicts measures of airway resistance and reactance from demographic and co-morbidity data. This model can be applied to children with Sickle Cell Disease as well as those with asthma to provide insights into their respiratory function.

5 References

Aarts, E., Dolan, C. V., Verhage, M., and Sluis, S. van der (2015), “Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives,” BMC Neuroscience, 16, 94. https://doi.org/10.1186/s12868-015-0228-5.
Barr, D. J. (2013), “Random effects structure for testing interactions in linear mixed-effects models,” Frontiers in Psychology, 4, 328. https://doi.org/10.3389/fpsyg.2013.00328.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015), “Fitting Linear Mixed-Effects Models Using lme4,” Journal of Statistical Software, 67, 1–48. https://doi.org/10.18637/jss.v067.i01.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., and White, J.-S. S. (2009), “Generalized linear mixed models: A practical guide for ecology and evolution,” Trends in Ecology & Evolution, 24, 127–135. https://doi.org/10.1016/j.tree.2008.10.008.
Brown, V. A. (2021), “An Introduction to Linear Mixed-Effects Modeling in R,” Advances in Methods and Practices in Psychological Science, 4, 2515245920960351. https://doi.org/10.1177/2515245920960351.
Casals, M., Girabent-Farrés, M., and Carrasco, J. L. (2014), “Methodological quality and reporting of generalized linear mixed models in clinical medicine (2000-2012): A systematic review,” PloS One, 9, e112653. https://doi.org/10.1371/journal.pone.0112653.
Galecki, A. T., and Welch, K. B. (2014), Linear Mixed Models: A Practical Guide Using Statistical Software, Second Edition, New York: Chapman & Hall/CRC. https://doi.org/10.1201/b17198.
Grueber, C. E., Nakagawa, S., Laws, R. J., and Jamieson, I. G. (2011), “Multimodel inference in ecology and evolution: Challenges and solutions,” Journal of Evolutionary Biology, 24, 699–711. https://doi.org/10.1111/j.1420-9101.2010.02210.x.
Harrison, X. A., Donaldson, L., Correa-Cano, M. E., Evans, J., Fisher, D. N., Goodwin, C. E. D., Robinson, B. S., Hodgson, D. J., and Inger, R. (2018), “A brief introduction to mixed effects modelling and multi-model inference in ecology,” PeerJ, 6, e4794. https://doi.org/10.7717/peerj.4794.
Jolly, E. (2018), “Pymer4: Connecting R and Python for Linear Mixed Modeling,” Journal of Open Source Software, 3, 862. https://doi.org/10.21105/joss.00862.
Lee, Y.-C., and Shang, J. (n.d.), “Estimation and selection in linear mixed models with missing data under compound symmetric structure,” Journal of Applied Statistics, 49, 4003–4027. https://doi.org/10.1080/02664763.2021.1969342.
Lo, S., and Andrews, S. (2015), “To transform or not to transform: Using generalized linear mixed models to analyse reaction time data,” Frontiers in Psychology, 6.
Magezi, D. A. (2015), “Linear mixed-effects models for within-participant psychology experiments: An introductory tutorial and free, graphical user interface (LMMgui),” Frontiers in Psychology, 6, 2. https://doi.org/10.3389/fpsyg.2015.00002.
Fokkema, M., Edbrooke-Childs, J., and Wolpert, M. (2021), “Generalized linear mixed-model (GLMM) trees: A flexible decision-tree method for multilevel and longitudinal data,” Psychotherapy Research, Routledge, 31, 329–341. https://doi.org/10.1080/10503307.2020.1785037.
Monsalves, M. J., Bangdiwala, A. S., Thabane, A., and Bangdiwala, S. I. (2020), “LEVEL (Logical Explanations & Visualizations of Estimates in Linear mixed models): Recommendations for reporting multilevel data and analyses,” BMC Medical Research Methodology, 20, 3. https://doi.org/10.1186/s12874-019-0876-8.
Peng, H., and Lu, Y. (2012), “Model selection in linear mixed effect models,” Journal of Multivariate Analysis, 109, 109–129. https://doi.org/10.1016/j.jmva.2012.02.005.
Piepho, H.-P. (1999), “Analysing disease incidence data from designed experiments by generalized linear mixed models,” Plant Pathology, 48, 668–674. https://doi.org/10.1046/j.1365-3059.1999.00383.x.
Pusponegoro, N. H., Rachmawati, R. N., Notodiputro, K. A., and Sartono, B. (2017), “Linear Mixed Model for Analyzing Longitudinal Data: A Simulation Study of Children Growth Differences,” Procedia Computer Science, Discovery and innovation of computer science technology in artificial intelligence era: The 2nd International Conference on Computer Science and Computational Intelligence (ICCSCI 2017), 116, 284–291. https://doi.org/10.1016/j.procs.2017.10.071.
Schielzeth, H., Dingemanse, N. J., Nakagawa, S., Westneat, D. F., Allegue, H., Teplitsky, C., Réale, D., Dochtermann, N. A., Garamszegi, L. Z., and Araya-Ajoy, Y. G. (2020), “Robustness of linear mixed-effects models to violations of distributional assumptions,” Methods in Ecology and Evolution, 11, 1141–1152. https://doi.org/10.1111/2041-210X.13434.
Steibel, J. P., Poletto, R., Coussens, P. M., and Rosa, G. J. M. (2009), “A powerful and flexible linear mixed model framework for the analysis of relative quantification RT-PCR data,” Genomics, 94, 146–152. https://doi.org/10.1016/j.ygeno.2009.04.008.
Touraine, C., Cuer, B., Conroy, T., Juzyna, B., Gourgou, S., and Mollevi, C. (2023), “When a joint model should be preferred over a linear mixed model for analysis of longitudinal health-related quality of life data in cancer clinical trials,” BMC medical research methodology, 23, 36. https://doi.org/10.1186/s12874-023-01846-3.
Tu, Y.-K. (2015), “Using generalized linear mixed models to evaluate inconsistency within a network meta-analysis,” Value in Health, 18, 1120–1125. https://doi.org/10.1016/j.jval.2015.10.002.
Verbeeck, J., Faes, C., Neyens, T., Hens, N., Verbeke, G., Deboosere, P., and Molenberghs, G. (2023), “A linear mixed model to estimate COVID-19-induced excess mortality,” Biometrics, 79, 417–425. https://doi.org/10.1111/biom.13578.
Wang, X., Andrinopoulou, E.-R., Veen, K. M., Bogers, A. J. J. C., and Takkenberg, J. J. M. (2022), “Statistical primer: An introduction to the application of linear mixed-effects models in cardiothoracic surgery outcomes research-a case study using homograft pulmonary valve replacement data,” European Journal of Cardio-Thoracic Surgery: Official Journal of the European Association for Cardio-Thoracic Surgery, 62, ezac429. https://doi.org/10.1093/ejcts/ezac429.
Zuur, A. F., and Ieno, E. N. (2016), “A protocol for conducting and presenting results of regression-type analyses,” Methods in Ecology and Evolution, 7, 636–645. https://doi.org/10.1111/2041-210X.12577.