“A project is something to do that you want to share”
New York Harbor water quality
New York City Environmental Protection has several years of data on water quality in New York Harbor.
The data is in Excel format, and, as of June 2017, there are separate Excel sheets for each of the years 2008-2016 and partial results for 2017.
The Excel sheet for 2016 contains 2374 rows and 9 columns.
The rows contain identifiers for 86 different survey stations.
The columns list: the identifier of the survey station at which a sample was taken; the date that sample was taken; the dissolved oxygen in the sample, at both the top and bottom of the harbor; fecal coliform levels, at both the top and bottom of the harbor; enterococcus levels, at both the top and bottom of the harbor; and the Secchi measure of the transparency of the water at that survey station.
There is a link to the definitions of dissolved oxygen, fecal coliform bacteria, enterococci bacteria, and Secchi transparency.
There is also a link to a map of locations of the survey stations.
Some things to watch out for
The data contains numerous “NS” (for “not sampled”) entries, and some numerical entries are suspiciously large (perhaps corresponding to a measurement or data entry error).
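One way to deal with both issues before any analysis is sketched below in Python, using only the standard library. The readings, the helper names parse_measurement and flag_outliers, and the cutoff of 3.5 on a median-based score are all assumptions made for illustration; they are not taken from the harbor data or from any official procedure.

```python
from statistics import median


def parse_measurement(raw):
    """Convert a raw cell to a float; return None for "NS" or blank cells."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "" or text.upper() == "NS":
        return None
    try:
        return float(text)
    except ValueError:
        return None  # any other unexpected text is treated as missing


def flag_outliers(values, cutoff=3.5):
    """Return values whose modified z-score (median and MAD based) exceeds cutoff.

    The cutoff of 3.5 is a common rule of thumb, used here as an assumption.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]


# Hypothetical fecal coliform readings, made up for illustration only.
raw_column = ["12", "NS", "8", "15", "NS", "10", "9", "60000", "11"]

cleaned = [v for v in (parse_measurement(r) for r in raw_column) if v is not None]
suspicious = flag_outliers(cleaned)

print("not sampled:", len(raw_column) - len(cleaned))
print("suspiciously large:", suspicious)
```

Flagged values are best reviewed by hand rather than deleted automatically, since a very large reading may be a genuine pollution event rather than an error.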
Before you do anything
Who is the target audience for your analysis?
- Seafood suppliers?
- Fishing charter operators?
- Fishing (individuals or groups)?
- Clam Diggers?
- Diving schools?
- Tourist boat operators?
- Swimming instructors?
Try to construct an avatar of your ideal client: for example, a chartered fishing boat operator, captain for 25 years, male, aged 57, operates 7 days a week in season, and guarantees fish will be caught on all trips.
Some things you might do
- Make and compare histograms of all the variables.
- Create summary descriptive statistics for all variables (mean, standard deviation, skewness, kurtosis).
- Draw Q-Q plots to check each variable for normality.
- Analyze and visually display the changing geographic distribution of the variables dissolved oxygen, fecal coliform levels, enterococcus levels, and Secchi measure of transparency, over a period of years.
- Examine differences between dissolved oxygen, fecal coliform levels, and enterococcus levels at the top and bottom of the harbor, and how those differences vary with time.
- Are there significant relationships between the variables dissolved oxygen, fecal coliform levels, and enterococcus levels for any given year, or over the years?
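As a starting point for the summary statistics and the top-versus-bottom comparison suggested above, here is a minimal Python sketch using only the standard library. The function name describe and the dissolved oxygen readings are made up for illustration. The sketch uses population (divide by n) moments, which is one convention among several; statistical packages often apply small-sample corrections and will give slightly different numbers.

```python
from statistics import mean, pstdev


def describe(values):
    """Return mean, population standard deviation, skewness, and excess kurtosis."""
    n = len(values)
    mu = mean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    # Third and fourth central moments, dividing by n (population convention).
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return {
        "n": n,
        "mean": mu,
        "sd": sigma,
        "skewness": m3 / sigma ** 3,
        "excess_kurtosis": m4 / sigma ** 4 - 3,  # 0 for a normal distribution
    }


# Hypothetical dissolved oxygen readings (mg/L) taken at the same stations
# on the same dates; the numbers are invented for this example.
top = [7.1, 6.8, 7.4, 6.5, 7.0]
bottom = [6.2, 6.0, 6.9, 5.1, 6.3]

# Paired differences: a positive value means more oxygen at the top.
diffs = [t - b for t, b in zip(top, bottom)]
summary = describe(diffs)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Running describe on the paired differences for each year separately gives a simple way to see whether the gap between top and bottom readings is changing over time.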
As a good data analyst you should write a report for your selected client(s).
Your report should include:
- Title, author(s) and date.
- A short abstract of what you did.
- Statement of the problems and issues you addressed.
- Data sources and nature of the data.
- Method: how you went about doing what you did.
- Results: what you found (keep any long tables, or long lists of graphics, to appendices).
- Discussion: what it all means, in your view.
- Recommendations, if any.
- Bibliography: any sources cited.
Your report should be readable by your proposed audience, and while technically accurate, should not overwhelm a reader with technicalities.
If you are using R you can create your report as you go using R Markdown in RStudio.
Or you may want to use LaTeX via a free ShareLatex account, for example.
With R you can create an interactive Web version of your analyses using Shiny.
Broadcasting your work
This is important so that other people in data analysis and data science get to know you and your work.
Check out Jorge Fernande’s work on LinkedIn.
Important ways of broadcasting are:
- LinkedIn – join a data analysis group and discuss your project. This is super important.
- Your own web page
We want to know about your finished project on what the New York Harbor data tells you about water quality.
When you’ve broadcast your work on a Web page, send us a link and we will add it to this page.
A panel of expert, but very kind, data analysts will give merit awards to quality projects submitted by July 31, 2017, and all such projects will be eligible for publication in Quality Projects in Data Science.