
Histogram: Etymology
The histogram was first introduced by Karl Pearson in 1891. Pearson coined the word “histogram” from the two words “historical diagram”, which describes the function of a histogram: it is a graph that is used to display past data. Another, less likely, explanation is that the word “histogram” is the product of two Greek words: “histos”, which means “web” (literally “anything set upright”, from histasthai, “to stand”), and “-gram”, which means “something written” (Online Etymology Dictionary).

Histogram: Definition
A histogram is a chart that consists of bars, which are called bins. A histogram shows, in a graphical way, “how many” values of a dataset fall in each specified range. For example, in a dataset with values from 1 to 10, the range 1.3 to 2.3 may include 20 values and the range 2.4 to 3.4 may include 40 values, while the width of the ranges remains the same. Each value is represented by exactly one range.

[Figure: histogram showing the bins, their width and the ranges of a distribution]

Histogram: What its bins can represent
There is no space between the bins, which shows that a continuous (quantitative) variable is depicted. Each bar (bin) represents a range. The raw values of the dataset appear on the X axis. The height of each bin shows the Frequency, that is, the number of values that fall in the given range. The Frequency or the Relative Frequency is depicted on the Y axis.
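
The counting that a histogram performs can be sketched in a few lines of Python. This is only an illustration: the function name bin_counts, the small dataset and the choice of three ranges of width 3 are assumptions made for the example, and the sketch assumes that every value lies between low and low + k * width.

```python
def bin_counts(values, low, width, k):
    """Count how many values fall in each of k equal-width ranges starting at low."""
    counts = [0] * k
    for v in values:
        # Index of the range that contains v; each value lands in exactly one range.
        i = int((v - low) // width)
        # The maximum value belongs to the last range instead of opening a new one.
        i = min(i, k - 1)
        counts[i] += 1
    return counts

# Illustrative (made-up) dataset with values from 1 to 10.
data = [1.0, 1.5, 2.2, 2.8, 3.1, 3.3, 4.7, 5.0, 5.2, 5.9, 6.4, 7.5, 8.8, 10.0]

counts = bin_counts(data, low=1.0, width=3.0, k=3)   # heights on the Y axis (Frequency)
relative = [c / len(data) for c in counts]           # heights as Relative Frequency

print(counts)    # [6, 5, 3]
print(relative)
```

The three counts are the heights of the three bins; dividing them by the number of values gives the Relative Frequency, which changes the scale of the Y axis but not the shape of the histogram.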

Histogram: Usages
The histogram is one of the most used graphical tools. It can be used to check the type of a distribution: does the dataset have one mode or multiple modes? It can also be used to check the shape and dispersion of a dataset: what Skewness or Kurtosis does it have? It is common practice to draw the curve of a distribution on top of a histogram. In that way, quick information on the shape and dispersion of a dataset / distribution is provided.
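
The Skewness and Kurtosis mentioned above can also be computed numerically, to complement what the histogram shows. The following is a minimal sketch that uses the simple moment-based definitions; the function name shape_measures is hypothetical, and the article does not state which definitions it uses, so this choice is an assumption. With this definition the Kurtosis of a normal distribution is about 3 (it is not the “excess” kurtosis).

```python
def shape_measures(values):
    """Return (skewness, kurtosis) from the central moments of the values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2                         # about 3 for a normal distribution
    return skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]        # made-up, perfectly symmetric
right_tailed = [1, 1, 1, 2, 10]    # made-up, long tail to the right

print(shape_measures(symmetric))      # skewness is 0 for symmetric data
print(shape_measures(right_tailed))   # skewness is positive for a right tail
```

A skewness close to zero corresponds to a roughly symmetric histogram, while a positive or negative value corresponds to a longer tail on the right or on the left.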

[Figure: histogram of a distribution with the normal curve drawn on top]

Histogram: How to calculate the width of ranges / bins and how many ranges to divide your dataset into
The most common approach is for all the ranges of a histogram to be of equal width. That is, the dataset can be divided into any number of ranges, but all of them have the same width, e.g. 5 points. Note that, for a given dataset, as the width of the ranges increases, the number of ranges that represent the dataset decreases, and as the width of the ranges narrows, the number of ranges that represent the dataset increases.

When a variable is represented by too few bins (ranges) in a histogram, and thus the width of the ranges is very wide relative to the size of the given dataset, the histogram can distort the real shape of the distribution of the dataset. Note also that some ranges can contain zero (0) values, and thus, graphically, empty spaces appear in these positions instead of a bin.
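
The effect of the width can be illustrated with a small sketch. The dataset below is made up and has two clear groups of values (around 2 and around 8); the function name bin_counts and the limits 0 and 10 are assumptions chosen for the example.

```python
import math

def bin_counts(values, low, high, width):
    """Counts per equal-width range; the number of ranges follows from the width."""
    k = math.ceil((high - low) / width)        # wider ranges -> fewer ranges
    counts = [0] * k
    for v in values:
        i = min(int((v - low) // width), k - 1)  # the maximum goes in the last range
        counts[i] += 1
    return counts

data = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]   # made-up dataset with two modes

narrow = bin_counts(data, 0, 10, 1)   # 10 ranges of width 1
medium = bin_counts(data, 0, 10, 2)   # 5 ranges of width 2
wide = bin_counts(data, 0, 10, 5)     # 2 ranges of width 5

print(narrow)  # [0, 1, 3, 1, 0, 0, 0, 1, 3, 1]
print(medium)  # [1, 4, 0, 1, 4]
print(wide)    # [5, 5]
```

With a width of 1 or 2 the two groups and the empty ranges between them are visible. With a width of 5 the histogram has only two bins of equal height, so the two modes can no longer be seen.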

[Figure: histograms of the same distribution with ranges of different width, with the normal curve drawn on top]

Histogram: How to divide your dataset into ranges
Statistically, multiple rules of thumb exist that relate the size of a continuous variable to the ‘optimal’ number of ranges and the ‘optimal’ width of these ranges (bins). Such formulas are given below:

i) k_{r}=\left \lceil \frac{max(x)-min(x)}{W} \right \rceil

ii) Square root option (Tukey & Mosteller, 1977): k_{r}=\sqrt{n}

iii) Sturges’ formula (1926): k_{r}=\left \lceil log_{2}{n}+1   \right \rceil

iv) Rice Rule (Terrell & Scott, 1985): k_{r}=\left \lceil (2n)^{1/3}  \right \rceil
*Note that most websites present the formula without the parentheses, i.e. as 2n^{1/3}, which produces results that are not in agreement with most of the other formulas.

v) Wichard’s rule (2008): k_{r}=1+Ln(n)+Ln(1+Kurt \sqrt{\frac{n}{6}})

vi) Scott’s Rule (1979): W=\frac{3.49\hat{\sigma}}{n^{1/3}}

vii) Freedman–Diaconis rule (1981): W=\frac{2IQR_{x}}{n^{1/3}}

viii) Bendat & Piersol Rule (1966): k_{r}=1.87(n-1)^{0.4}

ix) Doane’s Rule (1976): k_{r}=1+Log_{2}(n)+Log_{2}(1+\frac{\left |\gamma  \right |}{\sigma_{\gamma}}) where: \sigma_{\gamma}=\sqrt{\frac{6(n-2)}{(n+1)(n+3)}} and \left |\gamma  \right | is the absolute value (the sign is dropped) of the skewness.

x) Cochran (1954): k_{r}=\sqrt{\frac{n}{5}}

xi) Rule of Twelve: Any random continuous variable can be represented by 12 ranges.

Comments
Note that formulas denoted by k_{r} give the number “k” of ranges, and thus the number of bins, while formulas denoted by W give the width “W” of the ranges.
n is the sample size, i.e. the number of values of the variable.
IQR is the value of the interquartile range.
\hat{\sigma} denotes the standard deviation of the sample (sd).
\left \lceil \; \right \rceil denotes rounding a decimal number up to the nearest whole number, e.g. 2.10 -> 3.
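
The formulas that give the number of ranges directly can be written as short Python functions. This is a minimal sketch, not a reference implementation: the function names are hypothetical, the formulas are coded exactly as written above (including the parentheses in the Rice Rule and the natural logarithm in Wichard’s rule), and the values n=928, Kurt=2.66 and skewness=-0.088 are the ones reported for the example dataset in the table of this article. Formulas without a ceiling return a decimal number, which is rounded to the nearest whole number at the end.

```python
import math

def square_root(n):                 # (ii)
    return math.sqrt(n)

def sturges(n):                     # (iii)
    return math.ceil(math.log2(n) + 1)

def rice(n):                        # (iv), with the parentheses around 2n
    return math.ceil((2 * n) ** (1 / 3))

def wichard(n, kurt):               # (v), Ln is the natural logarithm
    return 1 + math.log(n) + math.log(1 + kurt * math.sqrt(n / 6))

def bendat_piersol(n):              # (viii)
    return 1.87 * (n - 1) ** 0.4

def doane(n, skew):                 # (ix)
    sigma_g = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return 1 + math.log2(n) + math.log2(1 + abs(skew) / sigma_g)

def cochran(n):                     # (x)
    return math.sqrt(n / 5)

n = 928   # sample size of the example dataset
suggestions = {
    "ii": round(square_root(n)),
    "iii": sturges(n),
    "iv": rice(n),
    "v": round(wichard(n, kurt=2.66)),
    "viii": round(bendat_piersol(n)),
    "ix": round(doane(n, skew=-0.088)),
    "x": round(cochran(n)),
}
print(suggestions)
```

The formulas that give a width W (vi and vii) are not included here, because they need formula (i) and the minimum and maximum of the data to be turned into a number of ranges.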

Histogram: Example
For this example, we use the Galton dataset, which contains the Height of 928 children in inches. Height is a continuous variable. Here, the measurement unit of Height has been converted from inches to cm.

Real limits of a Range
The real limits of, e.g., the first range are not the values 156 and 160 but the values 155.5 and 160.5. These are the real lower and upper limits of this range. This method ensures that, when data are grouped in ranges, each range represents a unique set of numbers. In the following table, the heights of these children have been grouped into seven (7) ranges. The real limits of these ranges are presented too.
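
The real limits can be derived from the stated limits by moving half a measurement unit outwards on each side. The sketch below illustrates this for the seven ranges of the example; the function names real_limits and find_range are hypothetical, and the convention that a value exactly on a real limit (e.g. 160.5) goes to the upper range is an assumption, since the article does not say how such a value is treated.

```python
def real_limits(lower, upper, unit=1):
    """Real limits of a range: half a measurement unit beyond the stated limits."""
    half = unit / 2
    return lower - half, upper + half

def find_range(value, ranges, unit=1):
    """Return the stated range whose real limits contain the value, or None."""
    for lower, upper in ranges:
        real_lower, real_upper = real_limits(lower, upper, unit)
        # Assumption: the real lower limit is included, the real upper limit is not.
        if real_lower <= value < real_upper:
            return (lower, upper)
    return None

ranges = [(156, 160), (161, 165), (166, 170), (171, 175),
          (176, 180), (181, 185), (186, 190)]

print(real_limits(156, 160))      # (155.5, 160.5)
print(find_range(160.4, ranges))  # (156, 160)
print(find_range(160.5, ranges))  # (161, 165)
```

Because the real limits of neighbouring ranges meet (160.5 closes the first range and opens the second), a height such as 160.4 cm, which lies between the stated limits 160 and 161, still belongs to exactly one range.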

| Height Ranges (in cm) | Real Range | Simple Frequency (F) | Cumulative Frequency | Relative Frequency (F%) | Relative Cumulative Frequency |
| 156-160 | 155.5-160.5 | 44 | 44 | 4.7% | 4.7% |
| 161-165 | 160.5-165.5 | 59 | 103 | 6.4% | 11.1% |
| 166-170 | 165.5-170.5 | 165 | 268 | 17.8% | 28.9% |
| 171-175 | 170.5-175.5 | 258 | 526 | 27.8% | 56.7% |
| 176-180 | 175.5-180.5 | 266 | 792 | 28.7% | 85.4% |
| 181-185 | 180.5-185.5 | 105 | 897 | 11.3% | 96.7% |
| 186-190 | 185.5-190.5 | 31 | 928 | 3.3% | 100% |

Histogram and types of Frequency
The table presents (i) the Simple Frequency (F), (ii) the Relative Frequency (F%), (iii) the Cumulative Frequency and (iv) the Relative Cumulative Frequency.

Frequencies
i) The Simple Frequency tells “how many values” fall in a range, e.g. the range 156-160 contains 44 height values.
ii) The Relative Frequency is the ratio of the Simple Frequency of a given range to the total Simple Frequency of the heights.
For example, the Simple Frequency of the third Height Range (166-170) is 165, so its Relative Frequency is: 165/928=0.178=17.8\%.

iii) The Cumulative Frequency is produced by adding (a) the Simple Frequency of a given range and (b) the Simple Frequencies of all the ranges before it.
For example, the Cumulative Frequency of the third Height Range (166-170) is: 44+59+165=268.
iv) Finally, the Relative Cumulative Frequency is produced by adding (a) the Relative Frequency of a given range and (b) the Relative Frequencies of all the ranges before it.
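
The four types of Frequency can be computed from the Simple Frequencies alone. The following minimal sketch uses the seven Simple Frequencies of the height table; the variable names are chosen for the example.

```python
from itertools import accumulate

# Simple Frequencies of the seven height ranges (156-160 ... 186-190).
simple = [44, 59, 165, 258, 266, 105, 31]
total = sum(simple)                                     # 928 children

relative = [f / total for f in simple]                  # Relative Frequency
cumulative = list(accumulate(simple))                   # Cumulative Frequency
relative_cumulative = [c / total for c in cumulative]   # Relative Cumulative Frequency

for f, c, r, rc in zip(simple, cumulative, relative, relative_cumulative):
    print(f"{f:>4} {c:>4} {r:6.1%} {rc:6.1%}")
```

Here the Relative Cumulative Frequency is obtained by dividing each Cumulative Frequency by the total, which is equivalent to adding the unrounded Relative Frequencies. Adding percentages that have already been rounded to one decimal can differ in the last digit: for the fifth range this sketch gives 85.3%, while the sum of the rounded values 56.7% + 28.7% is 85.4%.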

The following table presents the number of ranges into which a dataset can be divided according to the given formulas (the Roman numerals identify the corresponding formula) for various sample sizes (928, 400, 100, 50, 20), assuming that these samples have the same properties as the original sample (n=928). Note that the minimum and maximum Heights were needed in formula (i).

| Formula (max=187.2, min=156.7) | n=928 | n=400 | n=100 | n=50 | n=20 |
| ii | 30 | 20 | 10 | 7 | 4 |
| iii | 11 | 10 | 8 | 7 | 6 |
| iv | 13 | 10 | 6 | 5 | 4 |
| v (Kurt=2.66) | 11 | 10 | 8 | 7 | 6 |
| vi (σ=6.5) | w=2.33 / 14 | w=3.08 / 10 | w=4.89 / 7 | w=6.16 / 5 | w=8.34 / 4 |
| vii (IQR=10.32) | w=2.12 / 15 | w=2.80 / 11 | w=4.45 / 7 | w=5.60 / 6 | w=7.60 / 5 |
| viii | 29 | 21 | 12 | 9 | 6 |
| ix (γ=-0.088) | σγ=0.08 / 12 | σγ=0.12 / 10 | σγ=0.24 / 8 | σγ=0.34 / 7 | σγ=0.47 / 6 |
| x | 14 | 9 | 4 | 3 | 2 |

In the case of vi and vii, the formulas produce the width of the range, e.g. 2.33, and the number of ranges is then calculated with formula (i), e.g. 14; the table therefore presents the result as: w=2.33 / 14. In the case of ix, the table presents the value of σγ followed by the number of ranges. All final results have been rounded up or down to whole numbers. We can see that the formulas suggest dividing the dataset into:

i) 11 to 30 ranges of equal width when the sample size is n=928
ii) 2 to 6 ranges of equal width when the sample size is n=20.
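
The two-step calculation for formulas vi and vii (first the width, then the number of ranges through formula (i)) can be sketched as follows. The function names are hypothetical; the values σ=6.5, IQR=10.32, min=156.7 and max=187.2 are the rounded values reported in the table, and rounding the width to two decimals before applying formula (i) is an assumption that mirrors how the table presents the results.

```python
import math

def scott_width(sd, n):
    """Formula (vi): width of the ranges from the sample standard deviation."""
    return 3.49 * sd / n ** (1 / 3)

def freedman_diaconis_width(iqr, n):
    """Formula (vii): width of the ranges from the interquartile range."""
    return 2 * iqr / n ** (1 / 3)

def number_of_ranges(width, x_min, x_max):
    """Formula (i): ranges of the given width needed to cover min..max."""
    return math.ceil((x_max - x_min) / width)

x_min, x_max = 156.7, 187.2      # minimum and maximum Height (cm)
sd, iqr = 6.5, 10.32             # rounded values from the table
sizes = (928, 400, 100, 50, 20)

scott = [round(scott_width(sd, n), 2) for n in sizes]
fd = [round(freedman_diaconis_width(iqr, n), 2) for n in sizes]
scott_k = [number_of_ranges(w, x_min, x_max) for w in scott]
fd_k = [number_of_ranges(w, x_min, x_max) for w in fd]

for n, w1, k1, w2, k2 in zip(sizes, scott, scott_k, fd, fd_k):
    print(f"n={n}: vi w={w1} / {k1}, vii w={w2} / {k2}")
```

For n=928 this gives w=2.33 / 14 for Scott’s Rule and w=2.12 / 15 for the Freedman–Diaconis rule. Because the sketch starts from the rounded σ=6.5, a width can differ slightly in the second decimal from a calculation based on the unrounded standard deviation; the resulting numbers of ranges are the same as in the table.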

The following picture shows the calculations for all the previous formulas when the sample size is equal to 928.
[Figure: calculations of the number of ranges for each formula, n=928]

Histogram versus Bar chart
When we have a qualitative variable, a bar chart can be used instead of a histogram. Note that the bars of a bar chart usually do not touch each other, which shows that a non-continuous variable is represented.
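
The difference can be seen in how the counts are produced: for a qualitative variable there are no ranges to choose, only categories to count. The following is a small sketch with made-up data; the variable eye_colour and its values are invented for the illustration.

```python
from collections import Counter

# Made-up qualitative variable: each value is a category, not a number.
eye_colour = ["brown", "blue", "brown", "green", "blue", "brown"]

counts = Counter(eye_colour)   # one bar per category, no widths or limits involved

# A simple text bar chart; the blank line between bars is the "space" between them.
for category, count in counts.most_common():
    print(f"{category:<6} {'#' * count}")
    print()
```

The order of the bars in a bar chart is arbitrary (here, most frequent first), whereas the bins of a histogram follow the order of the values on the X axis.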

Sources
Karl Pearson
Francis Galton: Dataset with the Heights of Children and Parents
Table with relevant Range formulas
Doane’s rule and others
Histogram: wiki
Histogram etymology