Introduction

Smoking and obesity have been among the most prominent health concerns in the United States over the past several decades. In 1997, an article in the Journal of the American Medical Association identified smoking and sedentary lifestyle as the first and second most important causes of death in the American population (Ferrucci et al. 1999). Research has shown that both smoking and obesity adversely affect health and life expectancy (Stewart et al. 2009). Multiple studies have examined the effects of obesity (Bigaard et al. 2012; Ferrucci et al. 1999; Van Baal et al. 2006) and smoking (Streppel et al. 2007; Sakata et al. 2012; Van Baal et al. 2006) on life expectancy, and they report negative correlations between life expectancy and both risk factors.

While Bigaard et al. (2012) included waist circumference as an added measure of obesity, BMI is generally used as the standard measure. However, some research has shown a negative correlation between BMI and smoking (Albanes et al. 1987). If smokers are not separated from non-smokers, this relationship could confound any observed correlation between BMI and life expectancy. It also remains an open question whether these two variables together can predict life expectancy.

Hypothesis

There is no correlation between smoking status, BMI, and life expectancy. BMI and smoking status cannot be used to predict life expectancy.

Methods

Statistical analyses were performed using the Sanford Data Collaborative Teaching Dataset (SDC), provided by Sanford Research. The SDC contains healthcare data for 155,143 Sanford patients, of whom 151,480 were alive and 3,663 were deceased. Although it covers a variety of health factors, the SDC gives limited demographic information about patients to protect their privacy.

Analyses used four variables from the dataset (Age, Smoking Status, BMI, and Life Status). The age of deceased individuals was assumed to be their age at death, which allowed age to stand in for life expectancy. Smoking Status was a score from 1 to 10, with higher scores indicating higher smoking levels. From these four variables, two additional variables were created: Year of Birth and Average Smoking Status (per year of birth). Year of Birth was generated by subtracting the age of each living patient from the year 2018. Because it is unclear when the dataset was submitted to Augustana and the birth dates of patients are unknown, this value could be off by as much as a year. Average Smoking Status was the mean smoking status of all living patients born in a given year. Correlations between variables were computed using Pearson’s product-moment correlation.
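For reference, Pearson's product-moment correlation is the covariance of two variables divided by the product of their standard deviations. The sketch below is illustrative only: the vectors `age` and `smoking` are made-up values rather than SDC data, and the helper name `pearson_r` is hypothetical. It computes the coefficient from its definition and compares it with base R's `cor()`.

```r
# Toy values for illustration only; these are NOT taken from the SDC.
age     <- c(45, 52, 60, 67, 71, 78, 83)
smoking <- c(3, 5, 4, 6, 5, 7, 6)

# Hypothetical helper: Pearson's r computed from its definition.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(age, smoking)
r_builtin <- cor(age, smoking, method = "pearson")

# The two values should agree up to floating-point error.
isTRUE(all.equal(r_manual, r_builtin))
```

The same coefficient is what `cor.test()` reports as its sample estimate in the chunks that follow.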

RMD Code

Chunk 1: Loaded the tidyverse and read in the SDC dataset.

library(tidyverse)
## -- Attaching packages ---------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts ------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/trevo/OneDrive/Documents/COSC 219/sanford")
Sanford_data=read.csv("Sanford_Data_Collaborative_Teaching_DataSet.csv")

Chunk 2: Modifying data. Set Age and BMI as numeric variables. SmokingStatus was already numeric. Created a data set of all deceased patients, labelled Sanford_dead.

str(Sanford_data)
## 'data.frame':    155143 obs. of  15 variables:
##  $ ID                   : int  2 8 9 11 12 16 20 21 27 39 ...
##  $ Sex                  : Factor w/ 3 levels "Female","Male",..: 2 1 1 2 2 2 1 1 2 2 ...
##  $ Age                  : Factor w/ 73 levels "18","19","20",..: 65 33 43 42 48 35 56 42 22 21 ...
##  $ Status               : Factor w/ 2 levels "Alive","Deceased": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hypertension         : int  0 0 1 1 1 1 1 1 1 0 ...
##  $ VascularDisease      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Payor                : Factor w/ 3 levels "Medicaid","Medicare",..: 2 3 3 3 3 3 2 3 3 3 ...
##  $ Diabetes             : int  0 0 1 1 0 1 0 0 0 0 ...
##  $ A1C                  : Factor w/ 129 levels "10","10.1","10.2",..: 129 129 89 105 129 110 129 129 129 129 ...
##  $ BMI                  : Factor w/ 4812 levels "1.24","10.12",..: 1120 1707 1658 1810 2029 1885 2150 2056 4515 3232 ...
##  $ ScheduledClinicVisits: Factor w/ 96 levels "0","1","10","102",..: 43 32 85 54 76 21 2 76 43 43 ...
##  $ MissedClinicVisits   : Factor w/ 26 levels "0","1","10","11",..: 1 1 1 1 2 1 1 1 1 2 ...
##  $ DiastolicBP          : int  72 91 78 79 82 69 64 82 86 73 ...
##  $ SystolicBP           : int  139 150 122 116 117 149 131 145 141 134 ...
##  $ SmokingStatus        : int  4 5 5 4 4 1 5 5 5 5 ...
Sanford_data$Age=as.numeric(as.character(Sanford_data$Age))
## Warning: NAs introduced by coercion
Sanford_data$BMI=as.numeric(as.character(Sanford_data$BMI))
## Warning: NAs introduced by coercion
Sanford_dead <- Sanford_data[which(Sanford_data$Status == "Deceased"), names(Sanford_data) %in% c("ID", "Sex","Age","BMI","Diabetes","SmokingStatus")]
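The coercion warnings in this chunk mean that some Age and BMI entries could not be parsed as numbers and became NA. The report does not show what those entries were, so the sketch below uses a made-up factor (the labels "90+" and "NULL" are purely hypothetical) to illustrate one way of listing the offending labels before they are discarded.

```r
# Hypothetical raw column; the actual non-numeric labels in the SDC
# are not shown in this report.
age_raw <- factor(c("18", "42", "89", "90+", "57", "NULL", "90+"))

# Convert via character so the factor labels, not the internal level
# codes, are parsed as numbers.
age_num <- suppressWarnings(as.numeric(as.character(age_raw)))

# Tabulate the labels that could not be converted.
failed_labels <- table(droplevels(age_raw[is.na(age_num)]))
print(failed_labels)

# Number of values lost to coercion.
n_lost <- sum(is.na(age_num))
```

Inspecting these labels first would show whether the NA rows are missing at random or, for example, concentrated among the oldest patients.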

Chunk 3: Running correlations between BMI, Age, and Smoking Status in Sanford_dead

cor.test (x = Sanford_dead$Age, y = Sanford_dead$SmokingStatus, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  Sanford_dead$Age and Sanford_dead$SmokingStatus
## t = 11.734, df = 2668, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1851462 0.2572961
## sample estimates:
##       cor 
## 0.2215243
cor.test (x = Sanford_dead$Age, y = Sanford_dead$BMI, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  Sanford_dead$Age and Sanford_dead$BMI
## t = -0.8219, df = 2144, p-value = 0.4112
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06001594  0.02458419
## sample estimates:
##         cor 
## -0.01774764
cor.test (x = Sanford_dead$BMI, y = Sanford_dead$SmokingStatus, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  Sanford_dead$BMI and Sanford_dead$SmokingStatus
## t = -0.14203, df = 2734, p-value = 0.8871
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04018583  0.03476072
## sample estimates:
##          cor 
## -0.002716367

Chunk 4: Testing for a correlation between SmokingStatus and BMI among the entire population.

cor.test (x = Sanford_data$BMI, y = Sanford_data$SmokingStatus, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  Sanford_data$BMI and Sanford_data$SmokingStatus
## t = 2.52, df = 144900, p-value = 0.01174
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.001471188 0.011768433
## sample estimates:
##         cor 
## 0.006619986

Chunk 5: Jitter Plot created to show SmokingStatus versus Age among dead individuals in the SDC.

ggplot(Sanford_dead, aes(SmokingStatus, Age)) + geom_jitter(color = "midnightblue", size = 0.75) + theme_bw() + labs(title = "Figure 2.0: Age at Death and Smoking Status of Dead Individuals in the SDC", x = "Smoking Status", y = "Age at Death")
## Warning: Removed 993 rows containing missing values (geom_point).

Chunk 6: Created a dataset of all living individuals in the Sanford Data Collaborative. Checked if a correlation existed between Age and SmokingStatus among living individuals.

Sanford_live <- Sanford_data[which(Sanford_data$Status == "Alive"), names(Sanford_data) %in% c("ID", "Sex","Age","BMI","Diabetes","SmokingStatus")]
cor.test (x = Sanford_live$Age, y = Sanford_live$SmokingStatus, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  Sanford_live$Age and Sanford_live$SmokingStatus
## t = 45.696, df = 145770, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1137748 0.1238968
## sample estimates:
##       cor 
## 0.1188389

Chunk 7: Created histograms displaying the age distribution among living, deceased, and all members of the SDC.

ggplot(Sanford_data, aes(Age)) + geom_histogram(binwidth = 1, fill = "seashell4") + theme_bw() + labs(title = "Figure 3.0: Frequency of Age of Individuals in SDC", x = "Age", y = "Frequency")
## Warning: Removed 6702 rows containing non-finite values (stat_bin).

ggplot(Sanford_dead, aes(Age)) + geom_histogram(binwidth = 1, fill = "seashell4") + theme_bw() + labs(title = "Figure 3.1: Frequency of Age of Deceased Individuals in SDC", x = "Age", y = "Frequency")
## Warning: Removed 993 rows containing non-finite values (stat_bin).

ggplot(Sanford_live, aes(Age)) + geom_histogram(binwidth = 1, fill = "seashell4") + theme_bw() + labs(title = "Figure 3.2: Frequency of Age of Living Individuals in SDC", x = "Age", y = "Frequency")
## Warning: Removed 5709 rows containing non-finite values (stat_bin).

Chunk 8: Created YearBorn for living individuals. The histogram of YearBorn is a mirror image of the Age histogram for the living population.

Sanford_live$YearBorn = 2018 - Sanford_live$Age
ggplot(Sanford_live, aes(YearBorn)) + geom_histogram(binwidth = 1, fill = "seashell4") + theme_bw() + labs(title = "Figure 3.3: Frequency of Year of Birth of Living Individuals in SDC", x = "Year of Birth", y = "Frequency")
## Warning: Removed 5709 rows containing non-finite values (stat_bin).

Chunk 9: Created dataset SmokingHistory showing year born and average smoking status. Line graph showing year born versus average smoking status among the living population.

SmokingHistory = aggregate(Sanford_live$SmokingStatus, by = list(Sanford_live$YearBorn), FUN = mean)
names(SmokingHistory) <- c("YearBorn", "AverageSmokingStatus")
ggplot(SmokingHistory, aes(YearBorn, AverageSmokingStatus)) + geom_line(color = "goldenrod4", size = 1) + theme_bw() + labs(title = "Figure 4.0: Average Smoking Status of Individuals Born from 1929 to 2000", x = "Year of Birth", y = "Average Smoking Status")

cor.test (x = SmokingHistory$YearBorn, y = SmokingHistory$AverageSmokingStatus, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  SmokingHistory$YearBorn and SmokingHistory$AverageSmokingStatus
## t = -3.3678, df = 70, p-value = 0.001235
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5569019 -0.1551636
## sample estimates:
##        cor 
## -0.3734094

Results

No correlation was found between BMI and either age at death or smoking status among deceased individuals (see chunk 3). A minuscule correlation (0.007) was found between BMI and smoking status among the general population (see chunk 4). Although statistically significant, it was so small relative to the number of data points that it was not investigated further, and BMI was not considered a likely confounding variable. A correlation of 0.222 was found between smoking status and age at death (see chunk 3). This positive correlation was unexpected, as the majority of research suggests a negative correlation between these variables (Streppel et al. 2007; Sakata et al. 2012; Van Baal et al. 2006).
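The gap between statistical significance and practical size can be made concrete. For Pearson's test the statistic is t = r * sqrt(df) / sqrt(1 - r^2), so with a very large sample even a tiny r can clear the significance threshold. The sketch below plugs in the r and df values printed in chunks 3 and 4. The helper name `t_from_r` is hypothetical, and the df values are taken as printed, which may be rounded.

```r
# Hypothetical helper: t statistic for Pearson's r with df = n - 2.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# BMI vs. SmokingStatus, whole population (values printed in chunk 4).
r_all  <- 0.006619986
df_all <- 144900                      # as printed; may be rounded
t_all  <- t_from_r(r_all, df_all)     # close to the reported t = 2.52
r2_all <- r_all^2                     # share of variance explained

# Age vs. SmokingStatus, deceased patients (values printed in chunk 3).
r_dead  <- 0.2215243
df_dead <- 2668
t_dead  <- t_from_r(r_dead, df_dead)  # close to the reported t = 11.734

round(c(t_all = t_all, t_dead = t_dead), 2)
r2_all
```

Here `r2_all` is roughly 0.00004, so BMI and smoking status share well under 0.01% of their variance despite the p-value of 0.01174.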

To investigate these surprising results, a correlation test was performed between age and smoking status among living members of the population. A correlation of 0.119 among living patients showed that smoking status was correlated not only with age at death but also with age among the living (see chunk 6). Frequency distributions of age revealed a strong negative skew in both living and deceased members of the population (see chunk 7). This skew was especially pronounced among deceased individuals, possibly explaining the larger correlation. To compare historical smoking rates, an average smoking status was computed for each year of birth among living patients. Average smoking status declined steadily for individuals born from 1929 to about 1985 (see chunk 9) and then increased, which could be caused by a lack of data for the most recent birth years. Average smoking status and year of birth had a correlation of -0.373 among living individuals in the SDC.
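The skew described here was read off the histograms, but it could also be quantified. A minimal sketch follows, using the moment-based sample skewness g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. The function name `sample_skewness` and the toy vector are assumptions for illustration, not SDC values. A negative result indicates a longer left tail.

```r
# Hypothetical helper: moment-based sample skewness (g1), ignoring NAs.
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

# Toy ages at death with a long left tail (NOT taken from the SDC).
toy_ages <- c(30, 60, 75, 80, 82, 85, 86, 88, 89, 90, NA)
skew_toy <- sample_skewness(toy_ages)          # negative: left-skewed

# A symmetric vector has zero skewness.
skew_sym <- sample_skewness(c(1, 2, 3, 4, 5))
```

Applied to the data in this report, the calls would be `sample_skewness(Sanford_dead$Age)` and `sample_skewness(Sanford_live$Age)`, which would allow the two groups to be compared numerically.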

Discussion

The lack of correlation between BMI and the other factors may indicate that BMI is an imprecise measure of obesity, as suggested by Bigaard et al. (2012). BMI may not accurately portray a person’s obesity or health, and other measurements, such as waist circumference, might form a more complete picture of a person’s obesity status. These other measurements would be essential in further research into a correlation between obesity and life expectancy.

The generational decline in smoking observed in Figure 4.0 aligns with reports of declining smoking rates (Ng et al. 2014). This decline might explain the unexpected positive correlation between smoking status and age at death. Older individuals grew up with less awareness of the negative effects of smoking, and their views on the subject might differ from those of younger individuals. Thus, older patients might be more likely to have smoked for much of their lives.

Further research must account for generational differences in smoking rates as well as the negative skew in the SDC. To account for these effects, a dataset with more than 3,663 deceased individuals is needed, preferably one with a more even distribution of age at death. A perfectly symmetric distribution is likely unattainable, as humans can die 60 years before the median age of death but cannot live 60 years past it. However, data on more individuals who died earlier than expected would be helpful.

References

Albanes, D., Jones, D.Y., Micozzi, M.S., & Mattson, M.E. (1987). Associations between smoking and body weight in the US population: analysis of NHANES II. American Journal of Public Health, 77(4), 439-444.

Bigaard, J., Tjonneland, A., Thomsen, B.L., Overvad, K., Heitmann, B.L., & Sorensen, T.I.A. (2012). Waist Circumference, BMI, Smoking, and Mortality in Middle-Aged Men and Women. Wiley Obesity Research, 11(7).

Ferrucci, L., Izmirlian, G., Leveille, S., Phillips, C.L., Corti, M., Brock, D.B., & Guralnik, J.M. (1999). Smoking, Physical Activity, and Active Life Expectancy. American Journal of Epidemiology, 149(7), 645-653.

Ng, M., Freeman, M.K., Fleming, T.D., et al. (2014). Smoking Prevalence and Cigarette Consumption in 187 countries, 1980-2012. Journal of the American Medical Association, 311(2), 183-192.

Sakata, R., McGale, P., Grant, E.J., Ozasa, K., Peto, R., & Darby, S.C. (2012). Impact of smoking on mortality and life expectancy in Japanese smokers: a prospective cohort study. BMJ.

Stewart, S.T., Cutler, D.M., & Rosen, A.B. (2009). Forecasting the Effects of Obesity and Smoking on U.S. Life Expectancy. The New England Journal of Medicine, 361, 2252-2260.

Streppel, M.T., Boshuizen, H.C., Ocke, M.C., Kok, F.J., & Kromhout, D. (2007). Mortality and life expectancy in relation to long-term cigarette, cigar and pipe smoking: The Zutphen Study. Tobacco Control, 16(2).

Van Baal, P.H.M., Hoogenveen, R.T., De Wit, A.G., & Boshuizen, H.C. (2006). Estimating health-adjusted life expectancy conditional on risk factors: results for smoking and obesity. Population Health Metrics, 4(14).