Module 8 Exercise - Fitting Basic Statistical Models: Part 3

Load Cleaned Data and Libraries

# Data Handling and Model Building
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.3     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(gtsummary)


Attaching package: 'gtsummary'

The following objects are masked from 'package:recipes':

    all_double, all_factor, all_integer, all_logical, all_numeric

# Pathing
library(here)

here() starts at C:/Users/Kai/Documents/School/Colleges/UGA/MPH Year/Spring 2023/Modern Data Analysis/kaichen-MADA-portfolio

path <- here("fluanalysis", "data", "cleaned_data.rds")

# Load Data
clean_data <- readRDS(path)

As mentioned in previous files, BodyTemp is the main continuous outcome of interest and Nausea is the main categorical outcome of interest. For model fitting, every variable other than the outcome in question is used as a predictor. This means that the full logistic regression model for Nausea includes BodyTemp as a predictor, and the full linear model for BodyTemp includes Nausea. Additionally, RunnyNose is treated as the main predictor in the single-predictor models.
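As a side note on "all variables other than the outcome": R formulas accept a dot as shorthand for every remaining column, which can replace a long hand-typed predictor list. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the flu data) and base R only. It would match the models in this file only under the assumption that clean_data contains exactly the variables listed in the full formulas.

```r
# Hypothetical stand-in for clean_data: one outcome and three predictors.
set.seed(1)
toy <- data.frame(
  BodyTemp  = rnorm(20, mean = 99, sd = 1),
  RunnyNose = factor(rep(c("No", "Yes"), times = 10)),
  Sneeze    = factor(rep(c("No", "Yes"), each = 10)),
  Nausea    = factor(rep(c("No", "No", "Yes", "Yes"), times = 5))
)

# Predictors written out one by one, as done in this file.
fit_explicit <- lm(BodyTemp ~ RunnyNose + Sneeze + Nausea, data = toy)

# The dot stands for "every column in data except the outcome".
fit_dot <- lm(BodyTemp ~ ., data = toy)

# Both fits should contain the same terms with the same estimates.
same_terms  <- identical(names(coef(fit_explicit)), names(coef(fit_dot)))
same_values <- isTRUE(all.equal(unname(coef(fit_explicit)), unname(coef(fit_dot))))
print(c(same_terms = same_terms, same_values = same_values))
```

The explicit list is safer when the data frame holds columns that should not enter the model, since the dot picks up everything.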

Linear Regression Model

Set Up Linear Regression Engine

lm_model <- linear_reg() %>% set_engine("lm")

Runny Nose vs Body Temperature

# Tidymodels
tidy(lm_model %>% fit(BodyTemp ~ RunnyNose, data = clean_data))

# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    99.1      0.0819   1210.   0      
2 RunnyNoseYes   -0.293    0.0971     -3.01 0.00268

# Base R
summary(lm(BodyTemp ~ RunnyNose, data = clean_data))


Call:
lm(formula = BodyTemp ~ RunnyNose, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9431 -0.7505 -0.3505  0.3495  4.1495 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  99.14313    0.08191 1210.426  < 2e-16 ***
RunnyNoseYes -0.29265    0.09714   -3.013  0.00268 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.19 on 728 degrees of freedom
Multiple R-squared:  0.01231,   Adjusted R-squared:  0.01096 
F-statistic: 9.076 on 1 and 728 DF,  p-value: 0.00268

All Relevant Predictors vs Body Temperature

# Tidymodels
rmarkdown::paged_table(tidy(lm_model %>% fit(BodyTemp ~ SwollenLymphNodes + ChestCongestion + ChillsSweats + NasalCongestion + CoughYN + Sneeze + Fatigue + SubjectiveFever + Headache + Weakness + WeaknessYN + CoughIntensity + CoughYN2 + Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + Nausea + EarPn + Hearing + Pharyngitis + Breathless + ToothPn + Vision + Vomit + Wheeze, data = clean_data)))

# Base R
summary(lm(BodyTemp ~ SwollenLymphNodes + ChestCongestion + ChillsSweats + NasalCongestion + CoughYN + Sneeze + Fatigue + SubjectiveFever + Headache + Weakness + WeaknessYN + CoughIntensity + CoughYN2 + Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + Nausea + EarPn + Hearing + Pharyngitis + Breathless + ToothPn + Vision + Vomit + Wheeze, data = clean_data))


Call:
lm(formula = BodyTemp ~ SwollenLymphNodes + ChestCongestion + 
    ChillsSweats + NasalCongestion + CoughYN + Sneeze + Fatigue + 
    SubjectiveFever + Headache + Weakness + WeaknessYN + CoughIntensity + 
    CoughYN2 + Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + 
    Diarrhea + EyePn + Insomnia + ItchyEye + Nausea + EarPn + 
    Hearing + Pharyngitis + Breathless + ToothPn + Vision + Vomit + 
    Wheeze, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2110 -0.7219 -0.2853  0.4342  4.2095 

Coefficients: (3 not defined because of singularities)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            97.925243   0.303804 322.330  < 2e-16 ***
SwollenLymphNodesYes   -0.165302   0.091959  -1.798 0.072682 .  
ChestCongestionYes      0.087326   0.097546   0.895 0.370973    
ChillsSweatsYes         0.201266   0.127302   1.581 0.114330    
NasalCongestionYes     -0.215771   0.113798  -1.896 0.058362 .  
CoughYNYes              0.313893   0.240738   1.304 0.192707    
SneezeYes              -0.361924   0.098299  -3.682 0.000249 ***
FatigueYes              0.264762   0.160558   1.649 0.099596 .  
SubjectiveFeverYes      0.436837   0.103398   4.225 2.71e-05 ***
HeadacheYes             0.011453   0.125405   0.091 0.927256    
WeaknessMild            0.018229   0.189169   0.096 0.923258    
WeaknessModerate        0.098944   0.197864   0.500 0.617189    
WeaknessSevere          0.373435   0.230766   1.618 0.106065    
WeaknessYNYes                 NA         NA      NA       NA    
CoughIntensityMild      0.084881   0.279878   0.303 0.761768    
CoughIntensityModerate -0.061384   0.301997  -0.203 0.838992    
CoughIntensitySevere   -0.037272   0.314013  -0.119 0.905551    
CoughYN2Yes                   NA         NA      NA       NA    
MyalgiaMild             0.164242   0.160498   1.023 0.306510    
MyalgiaModerate        -0.024064   0.167834  -0.143 0.886031    
MyalgiaSevere          -0.129263   0.207854  -0.622 0.534216    
MyalgiaYNYes                  NA         NA      NA       NA    
RunnyNoseYes           -0.080485   0.108526  -0.742 0.458569    
AbPainYes               0.031574   0.140236   0.225 0.821927    
ChestPainYes            0.105071   0.106980   0.982 0.326365    
DiarrheaYes            -0.156806   0.129545  -1.210 0.226522    
EyePnYes                0.131544   0.129757   1.014 0.311047    
InsomniaYes            -0.006824   0.090797  -0.075 0.940114    
ItchyEyeYes            -0.008016   0.110191  -0.073 0.942028    
NauseaYes              -0.034066   0.102049  -0.334 0.738620    
EarPnYes                0.093790   0.113875   0.824 0.410436    
HearingYes              0.232203   0.222043   1.046 0.296037    
PharyngitisYes          0.317581   0.121342   2.617 0.009057 ** 
BreathlessYes           0.090526   0.099837   0.907 0.364863    
ToothPnYes             -0.022876   0.113750  -0.201 0.840673    
VisionYes              -0.274625   0.277681  -0.989 0.323010    
VomitYes                0.165272   0.151432   1.091 0.275478    
WheezeYes              -0.046665   0.107036  -0.436 0.662990    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.144 on 695 degrees of freedom
Multiple R-squared:  0.1287,    Adjusted R-squared:  0.08605 
F-statistic: 3.019 on 34 and 695 DF,  p-value: 4.197e-08

Conclusions

The tidymodels output is unremarkable: tidy() relays the same coefficient information as summary(lm(…)), just in table form. However, I was unpleasantly surprised by the lack of significance stars next to the p-values. For this reason, I prefer the longer-winded summary(lm(…)) to tidymodels. I will not deny, though, that there may be some instances where the ability to convert such information into a table would be greatly beneficial. The two linear regression models differ only in their predictors: the first uses RunnyNose alone, while the second uses all of the predictors listed.
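If the missing stars are the main drawback, they can be added back to a coefficient table by hand. The following is a minimal sketch in base R, not a built-in tidymodels feature: the data frame copies the rounded values from the tidy() output for BodyTemp ~ RunnyNose above, and the `stars` column name is my own choice. The cutpoints follow the "Signif. codes" legend printed by summary().

```r
# Coefficient table copied from the tidy() output for BodyTemp ~ RunnyNose.
coef_tbl <- data.frame(
  term     = c("(Intercept)", "RunnyNoseYes"),
  estimate = c(99.1, -0.293),
  p.value  = c(0, 0.00268)
)

# Bin each p-value using the legend: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1.
# include.lowest = TRUE keeps a p-value of exactly 0 in the first bin.
coef_tbl$stars <- as.character(cut(
  coef_tbl$p.value,
  breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
  labels = c("***", "**", "*", ".", " "),
  include.lowest = TRUE
))

print(coef_tbl)
```

The same cut() call should work on the p.value column of a real tidy() tibble, since that column is a plain numeric vector.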

Logistic Model

Set Up Logistic Regression Engine

logistic_model <- logistic_reg() %>% set_engine("glm")

Runny Nose vs Nausea

# Tidymodels
tidy(logistic_model %>% fit(Nausea ~ RunnyNose, data = clean_data))

# A tibble: 2 × 5
  term         estimate std.error statistic    p.value
  <chr>           <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   -0.658      0.145    -4.53  0.00000589
2 RunnyNoseYes   0.0502     0.172     0.292 0.770

# Base R
summary(glm(Nausea ~ RunnyNose, family = binomial, data = clean_data))


Call:
glm(formula = Nausea ~ RunnyNose, family = binomial, data = clean_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9325  -0.9325  -0.9137   1.4439   1.4664  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.65781    0.14520  -4.530 5.89e-06 ***
RunnyNoseYes  0.05018    0.17182   0.292     0.77    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 944.65  on 729  degrees of freedom
Residual deviance: 944.57  on 728  degrees of freedom
AIC: 948.57

Number of Fisher Scoring iterations: 4

All Relevant Predictors vs Nausea

# Tidymodels
rmarkdown::paged_table(tidy(logistic_model %>% fit(Nausea ~ SwollenLymphNodes + ChestCongestion + ChillsSweats + NasalCongestion + CoughYN + Sneeze + Fatigue + SubjectiveFever + Headache + Weakness + WeaknessYN + CoughIntensity + CoughYN2 + Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + EarPn + Hearing + Pharyngitis + Breathless + ToothPn + Vision + Vomit + Wheeze + BodyTemp, data = clean_data)))

# Base R
summary(glm(Nausea ~ SwollenLymphNodes + ChestCongestion + ChillsSweats + NasalCongestion + CoughYN + Sneeze + Fatigue + SubjectiveFever + Headache + Weakness + WeaknessYN + CoughIntensity + CoughYN2 + Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + Diarrhea + EyePn + Insomnia + ItchyEye + EarPn + Hearing + Pharyngitis + Breathless + ToothPn + Vision + Vomit + Wheeze + BodyTemp, family = binomial, data = clean_data))


Call:
glm(formula = Nausea ~ SwollenLymphNodes + ChestCongestion + 
    ChillsSweats + NasalCongestion + CoughYN + Sneeze + Fatigue + 
    SubjectiveFever + Headache + Weakness + WeaknessYN + CoughIntensity + 
    CoughYN2 + Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + 
    Diarrhea + EyePn + Insomnia + ItchyEye + EarPn + Hearing + 
    Pharyngitis + Breathless + ToothPn + Vision + Vomit + Wheeze + 
    BodyTemp, family = binomial, data = clean_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9065  -0.8138  -0.5301   0.8581   2.4268  

Coefficients: (3 not defined because of singularities)
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             0.222870   7.827409   0.028 0.977285    
SwollenLymphNodesYes   -0.251083   0.196029  -1.281 0.200248    
ChestCongestionYes      0.275554   0.212662   1.296 0.195066    
ChillsSweatsYes         0.274097   0.287828   0.952 0.340948    
NasalCongestionYes      0.425817   0.254561   1.673 0.094376 .  
CoughYNYes             -0.140423   0.518798  -0.271 0.786644    
SneezeYes               0.176724   0.210349   0.840 0.400828    
FatigueYes              0.229062   0.371882   0.616 0.537925    
SubjectiveFeverYes      0.277741   0.225363   1.232 0.217793    
HeadacheYes             0.331259   0.284896   1.163 0.244937    
WeaknessMild           -0.121606   0.446886  -0.272 0.785531    
WeaknessModerate        0.310849   0.454483   0.684 0.493999    
WeaknessSevere          0.823187   0.510424   1.613 0.106799    
WeaknessYNYes                 NA         NA      NA       NA    
CoughIntensityMild     -0.220794   0.584367  -0.378 0.705554    
CoughIntensityModerate -0.362678   0.631370  -0.574 0.565676    
CoughIntensitySevere   -0.950544   0.658142  -1.444 0.148659    
CoughYN2Yes                   NA         NA      NA       NA    
MyalgiaMild            -0.004146   0.368094  -0.011 0.991013    
MyalgiaModerate         0.204743   0.373231   0.549 0.583301    
MyalgiaSevere           0.120758   0.444927   0.271 0.786075    
MyalgiaYNYes                  NA         NA      NA       NA    
RunnyNoseYes            0.045324   0.232645   0.195 0.845535    
AbPainYes               0.939304   0.281463   3.337 0.000846 ***
ChestPainYes            0.070777   0.227858   0.311 0.756090    
DiarrheaYes             1.063934   0.258705   4.113 3.91e-05 ***
EyePnYes               -0.341991   0.277720  -1.231 0.218164    
InsomniaYes             0.084175   0.192985   0.436 0.662710    
ItchyEyeYes            -0.063364   0.232501  -0.273 0.785212    
EarPnYes               -0.181719   0.239207  -0.760 0.447451    
HearingYes              0.323052   0.452402   0.714 0.475177    
PharyngitisYes          0.275364   0.266059   1.035 0.300680    
BreathlessYes           0.526801   0.208579   2.526 0.011548 *  
ToothPnYes              0.480649   0.229474   2.095 0.036209 *  
VisionYes               0.125498   0.541114   0.232 0.816596    
VomitYes                2.458466   0.348608   7.052 1.76e-12 ***
WheezeYes              -0.304435   0.234084  -1.301 0.193417    
BodyTemp               -0.031246   0.079838  -0.391 0.695526    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 944.65  on 729  degrees of freedom
Residual deviance: 751.47  on 695  degrees of freedom
AIC: 821.47

Number of Fisher Scoring iterations: 4

Conclusions

I reiterate my thoughts from the previous Conclusions section under linear regression: I would still rather manually recode my variables for summary(glm(…)) than use tidymodels, even though both approaches return the same estimates. However, I have also noticed that summary() is not needed for the statistics to be displayed, because tidy() extracts the coefficient table directly from the object returned by fit(). As with the linear regression models, the two logistic models differ only in their predictors: the first uses RunnyNose alone, while the second uses all of the predictors listed.
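To illustrate the point about summary(), here is a minimal sketch showing that the two styles can be combined. It assumes tidymodels is installed and uses the built-in mtcars data as a stand-in, since cleaned_data.rds is not available outside this project. extract_fit_engine() is the parsnip helper that returns the underlying lm object from a parsnip fit.

```r
library(tidymodels)  # attaches parsnip and broom, as at the top of this file

# Stand-in data: mtcars ships with R and replaces clean_data for illustration only.
lm_model <- linear_reg() %>% set_engine("lm")
parsnip_fit <- lm_model %>% fit(mpg ~ wt, data = mtcars)

# tidy() returns the coefficient table as a tibble; no summary() call is written here.
coef_tbl <- tidy(parsnip_fit)

# extract_fit_engine() hands back the underlying lm object, so the familiar
# summary() printout (with significance stars) is still available.
engine_fit <- extract_fit_engine(parsnip_fit)
base_summary <- summary(engine_fit)

print(coef_tbl)
print(base_summary)
```

With this pattern, the tibble can feed tables or further wrangling while the summary() printout remains available for a quick read of significance.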

All in all, while tidymodels would not be my go-to for model fitting, the package has a few functions I would like to keep in mind going forward.