“A project is something to do that you want to share”
Connections between obesity, lack of exercise & diabetes by United States county
The U.S. Centers for Disease Control & Prevention (CDC) have data on the incidence of:
- adult obesity
- lack of exercise, and
- diabetes
for the years 2004 through 2013.
The focus of this project is to carry out an exploratory analysis of the connections between obesity, lack of exercise, and diabetes from the CDC data.
- What might the data be telling us?
- What sensible conclusions can we draw from the data?
- What hypotheses does the data suggest?
- What hypotheses can we test?
- What models of the connections between the variables can we build?
- How reliable are these models?
- How predictive are these models?
This exploration could be within a particular year or across years, or both.
The data is stored at CDC in downloadable Excel format: one Excel sheet, covering 2004-2013, for each of the variables obesity, lack of exercise and diabetes.
The data is listed by State & then County, and gives, among other things, the total numbers of adults who are obese, lack any exercise, or are diabetic, for each county, for each of the years 2004-2013.
Before you do anything
Who is the target audience for your analysis?
- Hospitals & clinics?
- Health care workers?
- School administrators?
- Community activists?
- Food and drink companies?
- Life insurance companies?
- Health insurance companies?
Try to construct an avatar of your ideal client: for example, a 35-year-old female physician, working in an urban clinic in Alabama with 5 other doctors, who sees a high incidence of diabetes in her patients.
Some things to watch out for
- There is missing data. Sometimes it shows in a spreadsheet as an empty cell; other times the cell contains the words “NO DATA”.
- Sometimes a particular county will vanish: present one year, gone the next. Counties are also created in some years. So the list of counties is not entirely static year to year, although changes are relatively minimal. This is something to bear in mind when making comparisons across years.
- The Excel formatting has spaces in some text: for example, “Los Angeles County” or “New Mexico”. Depending on your choice of software you will probably want to fill these spaces with a character such as an underscore: Los_Angeles_County, New_Mexico. This is especially important in R.
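The three cautions above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `parse_count`, `make_key` and `common_counties` are hypothetical, the rows are made-up values rather than CDC figures, and the sketch assumes the sheets have already been read into (state, county, cell) tuples.

```python
def parse_count(cell):
    """Return the cell as a float, or None for an empty cell or 'NO DATA'."""
    text = str(cell).strip() if cell is not None else ""
    if text == "" or text.upper() == "NO DATA":
        return None
    return float(text.replace(",", ""))


def make_key(state, county):
    """Build a (state, county) key with spaces replaced by underscores."""
    return ("_".join(state.split()), "_".join(county.split()))


def common_counties(rows_by_year):
    """Return the keys that have a usable value in every year."""
    key_sets = []
    for rows in rows_by_year.values():
        keys = {
            make_key(state, county)
            for state, county, cell in rows
            if parse_count(cell) is not None
        }
        key_sets.append(keys)
    return set.intersection(*key_sets) if key_sets else set()


# Made-up rows for illustration only: (state, county, cell as read from the sheet).
rows_by_year = {
    2004: [
        ("New Mexico", "Santa Fe County", "5,200"),
        ("California", "Los Angeles County", "NO DATA"),
        ("Alabama", "Autauga County", "3100"),
    ],
    2005: [
        ("New Mexico", "Santa Fe County", "5400"),
        ("California", "Los Angeles County", "610000"),
        ("Alabama", "Autauga County", ""),
    ],
}

# Counties that can safely be compared across both years.
shared = common_counties(rows_by_year)
print(sorted(shared))
```

Restricting cross-year comparisons to the shared keys is one simple way to handle counties that appear, vanish, or lack data in some years. Whether to drop or impute the rest is a judgment call for your analysis.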
Some things you might do
- Make and compare histograms of all the variables.
- Create summary descriptive statistics for all variables (mean, standard deviation, skewness, kurtosis).
- Draw Q-Q plots to assess the normality of each variable.
- For each year, do scatter plots, 2D & 3D, to show the variation in diabetes with the other two variables.
- Check for outliers in scatterplots and identify these outliers by State & County.
- Look at initial segments of your data, for example counties with obesity numbers below 100,000, and do individual scatterplots on these subsets of data. Observe and analyze any significant differences you see.
- If it seems reasonable from the scatterplots, carry out linear regressions on your data, with one predictor variable (obesity or lack of exercise) or two predictor variables (obesity and lack of exercise). Check for heteroscedasticity in your data, or subsets of your data, and address how this might affect parameter estimation.
- Compare and analyze the changing distributions of each of the variables obesity, lack of exercise, and diabetes over the years 2004-2013, and draw appropriate conclusions.
- Analyze the changing geographic distribution of the variables obesity, lack of exercise, and diabetes over the years 2004-2013.
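As a rough starting point for the summary statistics and one-predictor regression suggested above, here is a minimal Python sketch using only the standard library. The function names are hypothetical and the numbers are made up for illustration, not taken from the CDC sheets. The skewness and kurtosis use the simple moment-based definitions. The residual-spread ratio is only a crude, informal indication of heteroscedasticity, not a formal test such as Breusch-Pagan.

```python
import math


def describe(values):
    """Mean, sample standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": math.sqrt(m2 * n / (n - 1)),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def residual_spread_ratio(x, y, intercept, slope):
    """Mean squared residual in the upper half of x divided by the lower half."""
    pairs = sorted(zip(x, y))
    half = len(pairs) // 2

    def mean_sq(chunk):
        return sum((yi - intercept - slope * xi) ** 2 for xi, yi in chunk) / len(chunk)

    return mean_sq(pairs[-half:]) / mean_sq(pairs[:half])


# Made-up county totals (in thousands) for illustration only.
obesity = [10, 20, 30, 40, 50, 60]
diabetes = [3, 5, 8, 9, 14, 15]

print(describe(obesity))
intercept, slope = fit_line(obesity, diabetes)
ratio = residual_spread_ratio(obesity, diabetes, intercept, slope)
print(intercept, slope, ratio)
```

A ratio well above 1 suggests the residuals fan out as the predictor grows, which is worth following up with residual plots and a formal test in your statistics package of choice.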
As a good data analyst you should write a report for your selected client(s).
Your report should include:
- Title, author(s) and date.
- A short abstract of what you did.
- Statement of the problems and issues you addressed.
- Data sources and nature of the data.
- Method: how you went about doing what you did.
- Results: what you found (keep any long tables, or long lists of graphics, to appendices).
- Discussion: what it all means, in your view.
- Recommendations, if any.
- Bibliography: any sources cited.
Your report should be readable by your proposed audience, and while technically accurate, should not overwhelm a reader with technicalities.
If you are using R you can create your report as you go using R Markdown in RStudio.
Or you may want to use LaTeX via a free ShareLatex account, for example.
With R you can create an interactive Web version of your analyses using Shiny.
Broadcasting your work
This is important so that other people in data analysis and data science get to know you and your work.
Check out Jorge Fernande’s work on LinkedIn.
Important ways of broadcasting are:
- LinkedIn – join a data analysis group and discuss your project. This is super important.
- Your own web page
We want to know about your finished project on what the US county data tells you about connections between obesity, lack of exercise, and diabetes.
When you’ve broadcast your work on a Web page, send us a link and we will add it to this page.
A panel of expert, but very kind, data analysts will give merit awards to quality projects submitted by July 31, 2017, and all such projects will be eligible for publication in Quality Projects in Data Science.