Module 8 Exercise - Fitting Basic Statistical Models: Part 2
Setup
The tidyverse will be used for graphing as well as data processing/manipulation, and the here package assists with file paths.
Load Required R Packages
# Packages named in Setup: tidyverse (dplyr, ggplot2) and here
library(tidyverse)
library(here)
Load Cleaned Data
# Pathing
data_path <- here("fluanalysis", "data", "cleaned_data.rds")
# Data Importation
clean_data <- readRDS(data_path)
Exploratory Data Analysis
To get a greater sense of the dataset’s structure, I’ll run the str()
command and determine which variables of interest are numeric and/or continuous.
Data Structure
str(clean_data)
'data.frame': 730 obs. of 32 variables:
$ SwollenLymphNodes: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 2 1 ...
..- attr(*, "label")= chr "Swollen Lymph Nodes"
$ ChestCongestion : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
..- attr(*, "label")= chr "Chest Congestion"
$ ChillsSweats : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 2 2 1 ...
..- attr(*, "label")= chr "Chills/Sweats"
$ NasalCongestion : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 2 2 ...
..- attr(*, "label")= chr "Nasal Congestion"
$ CoughYN : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 2 2 2 ...
..- attr(*, "label")= chr "Cough"
$ Sneeze : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 1 2 1 1 ...
..- attr(*, "label")= chr "Sneeze"
$ Fatigue : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "label")= chr "Fatigue"
$ SubjectiveFever : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
..- attr(*, "label")= chr "Subjective Fever"
$ Headache : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
..- attr(*, "label")= chr "Headache"
$ Weakness : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 3 3 2 4 3 3 ...
$ WeaknessYN : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "label")= chr "Weakness"
$ CoughIntensity : Factor w/ 4 levels "None","Mild",..: 4 4 2 3 1 3 4 3 3 3 ...
..- attr(*, "label")= chr "Cough Severity"
$ CoughYN2 : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
$ Myalgia : Factor w/ 4 levels "None","Mild",..: 2 4 4 4 2 3 2 4 3 2 ...
$ MyalgiaYN : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "label")= chr "Myalgia"
$ RunnyNose : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 2 2 2 2 ...
..- attr(*, "label")= chr "Runny Nose"
$ AbPain : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Abdominal Pain"
$ ChestPain : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 2 1 1 1 ...
..- attr(*, "label")= chr "Chest Pain"
$ Diarrhea : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
$ EyePn : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
..- attr(*, "label")= chr "Eye Pain"
$ Insomnia : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 1 2 2 2 ...
..- attr(*, "label")= chr "Sleeplessness"
$ ItchyEye : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Itchy Eyes"
$ Nausea : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 1 1 2 2 ...
$ EarPn : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Ear Pain"
$ Hearing : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Loss of Hearing"
$ Pharyngitis : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 1 1 ...
..- attr(*, "label")= chr "Sore Throat"
$ Breathless : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 2 ...
..- attr(*, "label")= chr "Breathlessness"
$ ToothPn : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 2 1 ...
..- attr(*, "label")= chr "Tooth Pain"
$ Vision : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Blurred Vision"
$ Vomit : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
..- attr(*, "label")= chr "Vomiting"
$ Wheeze : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 1 ...
..- attr(*, "label")= chr "Wheezing"
$ BodyTemp : num 98.3 100.4 100.8 98.8 100.5 ...
It appears that there is only one numeric/continuous variable: BodyTemp. BodyTemp will also serve as the main continuous outcome of interest, and Nausea will serve as the main categorical outcome.
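The same check can be done programmatically instead of by reading the str() output. The following is a minimal sketch, assuming a small made-up data frame; toy_data and its values are hypothetical stand-ins for clean_data and are not taken from the real dataset.

```r
# Hypothetical stand-in for clean_data; values are invented for illustration
toy_data <- data.frame(
  Nausea   = factor(c("No", "Yes", "No")),
  Weakness = factor(c("Mild", "Severe", "None")),
  BodyTemp = c(98.3, 100.4, 100.8)
)

# TRUE for numeric columns, FALSE for everything else
is_num <- vapply(toy_data, is.numeric, logical(1))

# Split the column names into numeric and categorical groups
numeric_vars     <- names(toy_data)[is_num]
categorical_vars <- names(toy_data)[!is_num]
```

Applied to clean_data, the same two lines should separate BodyTemp from the factor columns listed above.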
Categorical Variables
Categorical variables include: SwollenLymphNodes, ChestCongestion, ChillsSweats, NasalCongestion, CoughYN, Sneeze, Fatigue, SubjectiveFever, Headache, Weakness, WeaknessYN, CoughIntensity, CoughYN2, Myalgia, MyalgiaYN, RunnyNose, AbPain, ChestPain, Diarrhea, EyePn, Insomnia, ItchyEye, Nausea, EarPn, Hearing, Pharyngitis, Breathless, ToothPn, Vision, Vomit, and Wheeze.
Most of these categorical variables are just Y/N; Weakness, CoughIntensity, and Myalgia instead have four severity levels. The number of observations in each level will be presented in a table for each variable.
Main Categorical Outcome - Nausea
# Table
rmarkdown::paged_table(clean_data %>% group_by(Nausea) %>% count())
Other Categorical Variables (Predictors)
# Swollen Lymph Nodes
rmarkdown::paged_table(clean_data %>% group_by(SwollenLymphNodes) %>% count())
# Chest Congestion and Chest Pain
## Congestion
rmarkdown::paged_table(clean_data %>% group_by(ChestCongestion) %>% count())
## Pain
rmarkdown::paged_table(clean_data %>% group_by(ChestPain) %>% count())
# Chills/Sweats
rmarkdown::paged_table(clean_data %>% group_by(ChillsSweats) %>% count())
# Nasal Congestion
rmarkdown::paged_table(clean_data %>% group_by(NasalCongestion) %>% count())
# Coughing
## Y/N
rmarkdown::paged_table(clean_data %>% group_by(CoughYN) %>% count())
## Intensity
rmarkdown::paged_table(clean_data %>% group_by(CoughIntensity) %>% count())
## 2nd Y/N
rmarkdown::paged_table(clean_data %>% group_by(CoughYN2) %>% count())
# Sneezing
rmarkdown::paged_table(clean_data %>% group_by(Sneeze) %>% count())
# Fatigue
rmarkdown::paged_table(clean_data %>% group_by(Fatigue) %>% count())
# Subjective Fever
rmarkdown::paged_table(clean_data %>% group_by(SubjectiveFever) %>% count())
# Headache
rmarkdown::paged_table(clean_data %>% group_by(Headache) %>% count())
# Weakness
## Severity
rmarkdown::paged_table(clean_data %>% group_by(Weakness) %>% count())
## Y/N
rmarkdown::paged_table(clean_data %>% group_by(WeaknessYN) %>% count())
# Myalgia
## Y/N
rmarkdown::paged_table(clean_data %>% group_by(MyalgiaYN) %>% count())
## Severity
rmarkdown::paged_table(clean_data %>% group_by(Myalgia) %>% count())
# Runny Nose
rmarkdown::paged_table(clean_data %>% group_by(RunnyNose) %>% count())
# Abdominal Pain
rmarkdown::paged_table(clean_data %>% group_by(AbPain) %>% count())
# Diarrhea
rmarkdown::paged_table(clean_data %>% group_by(Diarrhea) %>% count())
# Eye/Vision
## Pain
rmarkdown::paged_table(clean_data %>% group_by(EyePn) %>% count())
## Itchiness
rmarkdown::paged_table(clean_data %>% group_by(ItchyEye) %>% count())
## Vision
rmarkdown::paged_table(clean_data %>% group_by(Vision) %>% count())
# Insomnia
rmarkdown::paged_table(clean_data %>% group_by(Insomnia) %>% count())
# Ear/Hearing
## Pain
rmarkdown::paged_table(clean_data %>% group_by(EarPn) %>% count())
## Hearing
rmarkdown::paged_table(clean_data %>% group_by(Hearing) %>% count())
# Pharyngitis
rmarkdown::paged_table(clean_data %>% group_by(Pharyngitis) %>% count())
# Breathless/Wheezing
## Breathless
rmarkdown::paged_table(clean_data %>% group_by(Breathless) %>% count())
## Wheezing
rmarkdown::paged_table(clean_data %>% group_by(Wheeze) %>% count())
# Tooth Pain
rmarkdown::paged_table(clean_data %>% group_by(ToothPn) %>% count())
# Vomiting
rmarkdown::paged_table(clean_data %>% group_by(Vomit) %>% count())
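Since every table above follows the same group_by/count pattern, the calls could also be generated in a loop over the factor columns. This is only a sketch of that idea, not the approach used in this exercise; toy_data, factor_vars, and count_tables are hypothetical names, and the values are invented.

```r
library(dplyr)

# Hypothetical stand-in for clean_data; values are invented for illustration
toy_data <- data.frame(
  Nausea   = factor(c("No", "Yes", "No", "No")),
  Vomit    = factor(c("No", "No", "Yes", "No")),
  BodyTemp = c(98.3, 100.4, 100.8, 98.8)
)

# Names of all factor columns (BodyTemp is skipped because it is numeric)
factor_vars <- names(toy_data)[vapply(toy_data, is.factor, logical(1))]

# One count table per factor column, using the same group_by/count pattern
count_tables <- lapply(factor_vars, function(v) {
  toy_data %>%
    group_by(across(all_of(v))) %>%
    count() %>%
    ungroup()
})
names(count_tables) <- factor_vars
```

Each element of count_tables could then be passed to rmarkdown::paged_table() in the same way as the individual tables above.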
Continuous Variables
The only continuous variable is body temperature, which also acts as the main continuous outcome of interest.
Body Temperature - Main Continuous Outcome
## Summary Table
Body_Temp_Summary <- do.call(cbind, lapply(
  clean_data %>% select(BodyTemp), summary))
Body_Temp_Summary <- data.frame(Body_Temp_Summary)
rmarkdown::paged_table(Body_Temp_Summary)
## Histogram
ggplot(clean_data, aes(x = BodyTemp)) +
geom_histogram(binwidth = 0.1) +
labs(x = "Body Temperature", y = "Frequency",
title = "Frequency of Body Temperatures")
Data Visualization: Predictors and Outcomes
Selected Predictors of Interest for Outcome of Nausea: Headache, EarPn, Vomit
Selected Predictors of Interest for Outcome of Body Temperature: SwollenLymphNodes, SubjectiveFever, MyalgiaYN, Myalgia, ChillsSweats
Since a categorical outcome is awkward to place on a y-axis directly, I have decided to recode the Nausea variable as 0 for “No” and 1 for “Yes” and plot that numeric version on the y-axis.
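A side benefit of the 0/1 coding is that the mean of the recoded variable within a group equals the proportion of that group reporting nausea. The sketch below illustrates this on made-up data; toy_data, toy_numeric, and nausea_by_earpn are hypothetical names, and the values do not come from the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for clean_data; values are invented for illustration
toy_data <- data.frame(
  EarPn  = factor(c("No", "No", "No", "Yes", "Yes")),
  Nausea = factor(c("No", "No", "Yes", "No", "Yes"))
)

# Recode Nausea as 0/1, the same recoding used for the plots
toy_numeric <- toy_data %>%
  mutate(Nausea = if_else(Nausea == "Yes", 1, 0))

# With a 0/1 outcome, the group mean is the proportion with nausea
nausea_by_earpn <- toy_numeric %>%
  group_by(EarPn) %>%
  summarise(prop_nausea = mean(Nausea), n = n())
```

A table like this can be read alongside the plots to put numbers on what the violin shapes suggest.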
Ear Pain vs Nausea
ggplot(clean_data %>% mutate(Nausea = if_else(Nausea == "Yes", 1, 0)), aes(x = EarPn, y = Nausea)) +
geom_point() +
geom_jitter() +
geom_violin() +
labs(x = "Ear Pain")
Most participants do not have ear pain. Among those without ear pain, the violin plot shows that the absence of nausea is more common than its presence. Among those with ear pain, the two nausea groups look more similar in size, though there are still slightly fewer cases with nausea than without.
Vomit vs Nausea
ggplot(clean_data %>% mutate(Nausea = if_else(Nausea == "Yes", 1, 0)), aes(x = Vomit, y = Nausea)) +
geom_point() +
geom_jitter() +
geom_violin() +
labs(x = "Vomiting")
Most participants do not report vomiting. However, the violin plot above shows that nausea is more frequent among those who report vomiting than among those who do not (which makes sense).
Headache vs Nausea
ggplot(clean_data %>% mutate(Nausea = if_else(Nausea == "Yes", 1, 0)), aes(x = Headache, y = Nausea)) +
geom_point() +
geom_jitter() +
geom_violin()
Most participants have a headache. From the violin plot, it appears that headache may be associated with nausea, though a statistical test or model should be used to verify this.
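One way such a check might look is a logistic regression of nausea on headache. The sketch below uses base R only and a small invented data frame; toy_data, tab, and fit are hypothetical names, and nothing here reflects results from the real dataset.

```r
# Hypothetical stand-in for clean_data; values are invented for illustration
toy_data <- data.frame(
  Headache = factor(c("No", "No", "No", "No",
                      "Yes", "Yes", "Yes", "Yes", "Yes", "Yes")),
  Nausea   = factor(c("No", "No", "No", "Yes",
                      "No", "No", "Yes", "Yes", "Yes", "No"))
)

# Cross-tabulation of the two variables (rows: Headache, columns: Nausea)
tab <- table(toy_data$Headache, toy_data$Nausea)

# Logistic regression: with a factor outcome, glm() treats the first
# level ("No") as failure and the other level ("Yes") as success
fit <- glm(Nausea ~ Headache, data = toy_data, family = binomial)

# Coefficient table with estimates, standard errors, and p-values
coef(summary(fit))
```

The HeadacheYes coefficient is on the log-odds scale, so exp(coef(fit)) converts it to an odds ratio for nausea comparing those with and without a headache.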
Subjective Fever vs Body Temperature
ggplot(clean_data, aes(x = SubjectiveFever, y = BodyTemp)) +
geom_point() + geom_boxplot() +
labs(x = "Subjective Fever", y = "Body Temperature")
Those who say “Yes” to subjective fever tend to have higher body temperatures than those who say “No”.
Swollen Lymph Nodes vs Body Temperature
ggplot(clean_data, aes(x = SwollenLymphNodes, y = BodyTemp)) +
geom_point() + geom_boxplot()
There appears to be little difference between the body temperatures of those who do have swollen lymph nodes and those who do not.
Myalgia vs Body Temperature
# Y/N
ggplot(clean_data, aes(x = MyalgiaYN, y = BodyTemp)) +
geom_point() + geom_boxplot()
# Myalgia Severity
ggplot(clean_data, aes(x = Myalgia, y = BodyTemp)) +
geom_point() + geom_boxplot()
It seems that there is an association between having myalgia and a higher body temperature.
Chills/Sweats vs Body Temperature
ggplot(clean_data, aes(x = ChillsSweats, y = BodyTemp)) +
geom_point() +
geom_boxplot() +
labs(x = "Chills/Sweats")
From the boxplot, it appears that those with chills/sweats tend to have higher body temperatures than those without.
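The boxplot impressions in this section can be backed up with a per-group summary of body temperature. This is a minimal base R sketch on invented values; toy_data and temp_by_chills are hypothetical names, and the numbers are not from the real dataset.

```r
# Hypothetical stand-in for clean_data; values are invented for illustration
toy_data <- data.frame(
  ChillsSweats = factor(c("No", "No", "Yes", "Yes", "Yes")),
  BodyTemp     = c(98.2, 98.6, 99.0, 100.4, 101.0)
)

# Median body temperature within each ChillsSweats group
temp_by_chills <- aggregate(BodyTemp ~ ChillsSweats, data = toy_data, FUN = median)
```

Swapping ChillsSweats for SubjectiveFever, SwollenLymphNodes, MyalgiaYN, or Myalgia would give the matching summary for each of the other boxplots.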