Our visual sense, suitably trained, is capable of picking up both broad and fine features of numerical data from a visual representation.
One common way to visualize the distribution of a numerical data set is via a histogram.
To create a simple histogram, we divide the entire range of values of the data into small (but not too small) non-overlapping intervals, called “bins”. We then count how many data values fall into each bin, and draw a rectangle over each bin with height proportional to the data count for that bin.
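The counting step described above can be sketched by hand in R. This is a minimal illustration, assuming a small made-up vector of weights and hand-picked bin edges of width 10; neither comes from the data set used later in this article.

```r
# Small made-up sample of weights in ounces (illustrative only).
x <- c(96, 101, 108, 113, 115, 118, 120, 120, 123, 128, 132, 136, 143)

# Bin edges: non-overlapping intervals of width 10 that cover the range of x.
edges <- seq(90, 150, by = 10)

# Assign each value to a bin (right-closed intervals, as hist does by default)
# and count how many values land in each bin.
bins <- cut(x, breaks = edges, right = TRUE, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() with the same edges reports the same counts; plot = FALSE skips drawing.
h <- hist(x, breaks = edges, plot = FALSE)
stopifnot(all(as.vector(counts) == h$counts))
```

The heights of the rectangles in the drawn histogram are these counts, one per bin.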
The process of creating and drawing a histogram, particularly for any decent-sized data set, is carried out via software, in which the number of bins is usually set to a default value determined by the software. In our examples here we will use the data analysis program R.
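As a rough sketch of how the default bin count arises in R: hist uses Sturges' rule to suggest a number of bins and then rounds the break points to "pretty" values, so the number of bins actually drawn can differ from the suggestion. The data below is simulated purely for illustration, and the breaks value of 30 is an arbitrary choice.

```r
# Simulated stand-in data (not the birth weight data).
set.seed(1)
x <- rnorm(200, mean = 120, sd = 18)

# Number of bins suggested by Sturges' rule, the default used by hist().
suggested <- nclass.Sturges(x)

# Histogram objects computed without drawing, so the bins can be inspected.
h_default <- hist(x, plot = FALSE)
h_fine <- hist(x, breaks = 30, plot = FALSE)  # a single number is only a suggestion

# Compare the suggested number of bins with the numbers actually used.
cat("Sturges suggestion:", suggested, "\n")
cat("Default bins used: ", length(h_default$counts), "\n")
cat("Bins with breaks = 30:", length(h_fine$counts), "\n")
```

Trying a few values of breaks is a simple way to check whether a feature of the histogram, such as a single peak, survives a change in bin width.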
Let’s import some data from a website into R. The data we choose is from the STAT LABS Data site. On that site we choose the Birth weight I link, under “Maternal Smoking and Infant Health I”. That page shows two columns of numerical data that begin as follows:
bwt smoke
120 0
113 0
128 1
123 0
108 1
The left-hand column, labelled “bwt”, gives a baby’s birth weight in ounces. The right-hand column, labelled “smoke”, contains a “0” if the baby’s mother did not smoke during pregnancy, and a “1” if she did. Some of the right-hand column entries are “9”, which indicates that the data entry team could not determine whether the mother smoked.
To get this data into R we use the read.table function:
> data <- read.table("http://www.stat.berkeley.edu/users/statlabs/data/babiesI.data", header = TRUE, sep = "")
The first argument of the read.table function is the URL of the data, the second argument tells R that the first line of the file holds the column headers (“bwt” and “smoke”), and the third argument, an empty string, tells R that the values are separated by white space (one or more spaces, tabs, or newlines). We arbitrarily named the data set “data”. The data lives in R as a data frame, which we can see by applying the structure function str to the data:
> str(data)
with the result:
'data.frame':	1236 obs. of  2 variables:
 $ bwt  : int  120 113 128 123 108 136 138 132 120 143 ...
 $ smoke: int  0 0 1 0 1 0 0 0 0 1 ...
This tells us the data frame ‘data’ consists of 1236 observations of two variables, ‘bwt’ and ‘smoke’, each of which takes integer (whole number) values, and then lists the first 10 values of each variable, in order.
To get a histogram of the birth weights of the babies of the non-smoking mothers we need to extract those birth weights from the data frame as a vector. First we extract the part of the data frame that contains only the rows for the non-smoking mothers:
> nosmoke <- subset(data, smoke == 0)
and then we extract from that new data frame, nosmoke, just the vector of birth weights:
> nosmokedata <- nosmoke[['bwt']]
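The import and extraction steps can be tried without a network connection. The sketch below assumes a hypothetical six-line miniature of the file, passed to read.table through its text argument; the first five rows follow the values listed by str above, while the last row (with the code 9) is invented to show how unknown smoking status is handled.

```r
# Hypothetical miniature of the file: a header line, then whitespace-separated rows.
# The final row is invented for illustration.
raw <- "bwt smoke
120 0
113 0
128 1
123 0
108 1
136 9"

toy <- read.table(text = raw, header = TRUE, sep = "")

# Two equivalent ways to get the non-smokers' birth weights as a vector.
via_subset <- subset(toy, smoke == 0)[["bwt"]]
via_index <- toy$bwt[toy$smoke == 0]
stopifnot(identical(via_subset, via_index))

# The condition smoke == 0 also leaves out rows coded 9 (status unknown).
print(table(toy$smoke))
print(via_subset)
```

Logical indexing does the same job as subset followed by [[ ]] in a single step; which form to use is a matter of taste.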
Now we can use the histogram function, hist, to plot a histogram of the birth weights of the babies of the non-smoking mothers:
> hist(nosmokedata)
with the result:
We can see from this graphic image that the data is relatively symmetric about the mean of the data, and there appears to be only one mode (local maximum). Of course the latter may be an artifact of the number of bins chosen for us by R. What we notice about the vertical axis of the histogram is that it records the total number of data points in each bin. This means that the histogram, as drawn, is not a representation of the empirical probability density function, because the total area under the bars of the histogram is not 1. We can rectify this by using the following option in the hist function:
> hist(nosmokedata, probability=TRUE)
which gives us the following graphic image:
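Notice the change in the vertical axis: it now shows the density of data points in each bin, that is, the relative frequency of the bin divided by the bin width. The height of a bar is therefore not itself a probability. It is the area of the bar, its height multiplied by the bin width, that estimates the probability that a data point will lie in that bin, and the areas of all the bars add up to 1.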
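The relation between counts, densities, and total area can be checked numerically. This is a small sketch on simulated values (the mean of 120 and standard deviation of 18 are arbitrary choices, not estimates from the birth weight data); the same check should work on any numeric vector passed to hist.

```r
# Simulated stand-in for a vector of birth weights.
set.seed(42)
x <- round(rnorm(500, mean = 120, sd = 18))

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)

widths <- diff(h$breaks)               # width of each bin
rel_freq <- h$counts / sum(h$counts)   # proportion of data points in each bin

# Density is relative frequency divided by bin width ...
stopifnot(isTRUE(all.equal(h$density, rel_freq / widths)))

# ... so the bar areas (density times width) add up to 1.
total_area <- sum(h$density * widths)
stopifnot(isTRUE(all.equal(total_area, 1)))
```

The density values are what hist draws on the vertical axis when probability=TRUE is set.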
Histograms were introduced by the statistician Karl Pearson in 1895, in a paper on the mathematical theory of evolution:
Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London. A, 186, 343-414.