
Histogram: Etymology
The histogram was first introduced by Karl Pearson in 1891. Pearson coined the word “histogram” from the two words “historical diagram”, which describes the function of a histogram: it is a graph figure used to display past data. Another, less likely, explanation is that the word “histogram” is the product of two Greek words: “histos”, which means “web” (literally “anything set upright”, from histasthai “to stand”), and “-gram”, which means “something written” (Online Etymology Dictionary).

Histogram: Definition
A histogram is a chart that consists of bars, which are called bins. A histogram tells the story of “how many” values fall within a specified range of a dataset, in a graphical way. For example, in a dataset that runs from 1 to 10, the range 1.3 to 2.3 can include 20 values and the range 2.4 to 3.4 can include 40 values, while the width of the ranges remains the same. Each value belongs to only one range.


Histogram: What its bins can represent
There is no space between the bins, which shows that a continuous (quantitative) variable is depicted. Each bar/bin represents a range. The raw values of the dataset appear on the X axis. The height of each bin shows the Frequency, that is, the number of values that fall within the given range. The Frequency or the Relative Frequency is depicted on the Y axis.
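As a minimal sketch of the idea above, the following Python snippet counts how many values fall into each bin and turns the counts into relative frequencies. The data values, the function name `bin_counts` and its parameters are hypothetical and serve only as an illustration.

```python
def bin_counts(values, start, width, k):
    """Count how many values fall in each of k equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width); the last bin
    also includes its upper edge so that the maximum is not dropped.
    """
    counts = [0] * k
    for v in values:
        i = int((v - start) // width)
        if i == k and v == start + k * width:
            i = k - 1  # a value exactly on the top edge goes in the last bin
        if 0 <= i < k:
            counts[i] += 1
    return counts


# Hypothetical values, not taken from any real dataset.
data = [1.0, 1.5, 1.7, 2.2, 3.1, 3.3, 3.5, 3.9, 4.8, 5.0]

counts = bin_counts(data, start=1.0, width=1.0, k=4)  # Frequency (Y axis)
relative = [c / len(data) for c in counts]            # Relative Frequency
print(counts)
print(relative)
```

Each value is counted in exactly one bin, so the counts add up to the size of the dataset and the relative frequencies add up to 1.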

Histogram: Usages
The histogram is one of the most used graphical tools. It can be used to check the type of a distribution: how many modes does a dataset have, one or several? It can also be used to check the shape and dispersion of a dataset: what Skewness or Kurtosis it has. It is common practice to draw the curve of the distribution over the histogram. In that way, quick information on the shape and dispersion of a dataset / distribution is provided.


Histogram: How to calculate the width of ranges / bins and how many ranges to divide your dataset into
The most common approach is to give all the ranges in a histogram an equal width. That is, the dataset is divided into a number of ranges that can vary, but all of them have the same width, e.g. 5 points. Note that as the width of the ranges increases for a given dataset, the number of ranges that represent the dataset decreases, and as the width of the ranges narrows, the number of ranges increases.

When a variable is represented by too few bins (ranges), so that each range is very wide relative to the size of the dataset, the histogram can distort the real shape of the distribution. Note also that some ranges can contain zero (0) values, so that, graphically, an empty space appears in these positions instead of a bin.
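The effect of the number of ranges on the apparent shape can be sketched with a few lines of Python. The dataset and the helper `equal_width_counts` below are hypothetical; the data were chosen to have two separate clusters of values.

```python
def equal_width_counts(values, k):
    """Split the span of `values` into k equal-width ranges and count each one."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value is placed in the last range.
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    return counts


# Hypothetical dataset with two separate clusters of values.
data = [1, 1, 2, 2, 2, 3, 8, 9, 9, 9, 10, 10]

few = equal_width_counts(data, 2)   # two very wide ranges
more = equal_width_counts(data, 6)  # six narrower ranges
print(few)
print(more)
```

With two wide ranges both bins have the same height, so the two clusters are hidden. With six ranges the gap between the clusters becomes visible as ranges with a count of zero.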


Histogram: How to divide your dataset in ranges
Statistically, multiple rules of thumb exist that relate the size of a continuous variable to the ‘optimal’ number of ranges and the ‘optimal’ width of these ranges (bins). Such formulas are given below:

i) k_{r}=\left \lceil \frac{max(x)-min(x)}{W} \right \rceil

ii) Square root option (Tukey & Mosteller, 1977): k_{r}=\sqrt{n}

iii) Sturges’ formula (1926): k_{r}=\left \lceil log_{2}{n}+1 \right \rceil

iv) Rice Rule (Terrell & Scott, 1985): k_{r}=\left \lceil (2n)^{1/3}  \right \rceil
*Note that most websites present the formula without the parentheses, which produces results that are not in agreement with most of the other formulas.

v) Wichard’s rule (2008): k_{r}=1+Ln(n)+Ln(1+Kurt \sqrt{\frac{n}{6}})

vi) Scott’s Rule (1979): W=\frac{3.49\hat{\sigma}}{n^{1/3}}

vii) Freedman–Diaconis rule (1981): W=\frac{2IQR_{x}}{n^{1/3}}

viii) Bendat & Piersol Rule (1966): k_{r}=1.87(n-1)^{0.4}

ix) Doane’s Rule (1976): k_{r}=1+Log_{2}(n)+Log_{2}(1+\frac{\left |\gamma \right |}{\sigma_{\gamma}}) where: \sigma_{\gamma}=\sqrt{\frac{6(n-2)}{(n+1)(n+3)}} and \left |\gamma \right | is the absolute value (the sign is dropped) of the skewness.

x) Cochran (1954): k_{r}=\sqrt{\frac{n}{5}}

xi) Rule of Twelve: Any random continuous variable can be represented by 12 ranges.

Note that the formulas denoted as k_{r} give the number “k” of ranges, and thus the number of bins, while the formulas denoted as W give the width “W” of the ranges.
n is the size of the variable, i.e. the number of observations.
IQR is the value of the interquartile range.
\hat{\sigma} denotes the standard deviation of the sample (sd).
\left \lceil  \right \rceil denotes rounding a decimal number up to the next whole number, e.g. 2.10 –> 3.
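Some of the formulas that give the number of ranges directly can be sketched in Python as follows. The function names are hypothetical, and each function follows the formula as it is written in this article (for example, the Rice rule with the parentheses). Formulas that are written without a ceiling return a decimal number, which is rounded only when it is printed.

```python
import math


def square_root_rule(n):      # formula (ii)
    return math.sqrt(n)


def sturges(n):               # formula (iii)
    return math.ceil(math.log2(n) + 1)


def rice(n):                  # formula (iv), with the parentheses used here
    return math.ceil((2 * n) ** (1 / 3))


def bendat_piersol(n):        # formula (viii)
    return 1.87 * (n - 1) ** 0.4


def doane(n, skewness):       # formula (ix)
    sigma_g = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return 1 + math.log2(n) + math.log2(1 + abs(skewness) / sigma_g)


def cochran(n):               # formula (x)
    return math.sqrt(n / 5)


n = 928  # sample size of the example dataset used later in this article
results = {
    "ii": square_root_rule(n),
    "iii": sturges(n),
    "iv": rice(n),
    "viii": bendat_piersol(n),
    "ix": doane(n, skewness=-0.088),
    "x": cochran(n),
}
for label, k in results.items():
    print(label, round(k))
```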

Histogram: Example
For the given example, we use the Galton dataset which contains the Height of 928 children in inches. Height is a continuous variable. Here, the measurement unit of Height has been converted from inches to cm.

Real limits of a Range
The real limits of e.g. the first range are not the values 156 and 160 but the values 155.5 and 160.5. These are the real lower and upper limits of this range. This method ensures that, when data are grouped in ranges, each range represents a unique set of numbers. In the following table, the heights of these children have been grouped into seven (7) ranges. The real limits of these ranges are presented too.
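The calculation of the real limits can be sketched as below. The helper `real_limits` and its `unit` parameter are hypothetical names; the sketch assumes that the stated limits are whole centimetres, so that half a unit (0.5) is subtracted from the lower limit and added to the upper limit.

```python
def real_limits(lower, upper, unit=1.0):
    """Return the real lower and upper limits of a range.

    `unit` is the assumed measurement precision of the stated limits
    (1 cm here); half a unit is subtracted from the lower limit and
    added to the upper limit.
    """
    half = unit / 2
    return lower - half, upper + half


# Stated limits of the first and third ranges mentioned in the text.
for lower, upper in [(156, 160), (166, 170)]:
    print((lower, upper), "->", real_limits(lower, upper))
```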

Height Ranges (in cm) | Real Range | Simple Frequency (F) | Cumulative Frequency | Relative Frequency (F%) | Relative Cumulative Frequency


Histogram and types of Frequency
The table presents (i) the Simple Frequency (F), (ii) the Relative Frequency (F%), (iii) the Cumulative Frequency and (iv) the Relative Cumulative Frequency.

i) The Simple Frequency tells “how many values” fall within a range, e.g. the range 156-160 contains 44 height values.
ii) The Relative Frequency is the ratio of the Simple Frequency of a given range over the Total Simple Frequency of the heights.
For example, the Simple Frequency of the third Height Range (166-170) is 165, so its Relative Frequency is: 165/928=0.178=17.8\%.

iii) The Cumulative Frequency is produced by adding the Simple Frequency of a given Range to the Simple Frequencies of all the Ranges before it.
For example, the Cumulative Frequency of the third Height Range (166-170) is: 44+59+165=268.
iv) Finally, the Relative Cumulative Frequency is produced by adding the Relative Frequency of a given Range to the Relative Frequencies of all the Ranges before it.
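The four types of Frequency can be reproduced with a short Python sketch. It uses only the Simple Frequencies of the first three ranges that are quoted in the examples above (44, 59 and 165) and the total of 928 children; the variable names are hypothetical.

```python
from itertools import accumulate

total = 928             # number of children in the dataset
simple = [44, 59, 165]  # Simple Frequencies of the first three ranges

relative = [f / total for f in simple]                 # (ii)
cumulative = list(accumulate(simple))                  # (iii) running sum
relative_cumulative = [c / total for c in cumulative]  # (iv)

rows = zip(simple, relative, cumulative, relative_cumulative)
for i, (f, rf, cf, rcf) in enumerate(rows, start=1):
    print(f"range {i}: F={f}, F%={rf:.1%}, cum. F={cf}, cum. F%={rcf:.1%}")
```

Dividing the Cumulative Frequency by the total gives the same result as adding up the Relative Frequencies of the ranges one by one.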

The table presents the Number of Ranges into which a dataset can be divided according to the given formulas (see the Roman numerals to find the corresponding formula) for various sample sizes (928, 400, 100, 20), assuming that these samples have the same properties as the original sample (n=928). Note that the minimum and maximum Heights were needed in formula (i).

Formula | Input | Results by sample size, from n=928 down to n=20
i | max=187.2, min=156.7 |
v | Kurt=2.661110876 |
vi | σ=6.5 | w=2.33 / 14 | w=3.08 / 10 | w=4.89 / 7 | w=6.16 / 5 | w=8.34 / 4
vii | IQR=10.32 | w=2.12 / 15 | w=2.80 / 11 | w=4.45 / 7 | w=5.60 / 6 | w=7.60 / 5
ix | γ=-0.088 | σγ=0.08 / 12 | σγ=0.12 / 10 | σγ=0.24 / 8 | σγ=0.34 / 7 | σγ=0.47 / 6

In the case of vi and vii, the formulas produce the width of the range, e.g. 2.33, and the Number of Ranges is then calculated with formula (i), e.g. 14; the table therefore presents the result as: w=2.33 / 14. All final results have been rounded up or down to whole numbers. We can see that the formulas suggest dividing a dataset into:

i) 11 to 30 ranges of equal width when the sample size is n=928
ii) 2 to 6 ranges of equal width when the sample size is n=20.
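The two-step calculation for formulas vi and vii (first the width, then the Number of Ranges through formula (i)) can be sketched as follows. The function names are hypothetical; the inputs (max=187.2, min=156.7, σ=6.5, IQR=10.32) are the values reported for the example, and the same σ and IQR are assumed for every sample size.

```python
import math


def scott_width(sd, n):               # formula (vi)
    return 3.49 * sd / n ** (1 / 3)


def freedman_diaconis_width(iqr, n):  # formula (vii)
    return 2 * iqr / n ** (1 / 3)


def number_of_ranges(max_x, min_x, width):  # formula (i)
    return math.ceil((max_x - min_x) / width)


max_h, min_h = 187.2, 156.7  # maximum and minimum Height in cm
sd, iqr = 6.5, 10.32         # assumed identical for all sample sizes

scott_k, fd_k = [], []
for n in (928, 400, 100, 20):
    w_scott = scott_width(sd, n)
    w_fd = freedman_diaconis_width(iqr, n)
    scott_k.append(number_of_ranges(max_h, min_h, w_scott))
    fd_k.append(number_of_ranges(max_h, min_h, w_fd))
    print(f"n={n}: Scott w={w_scott:.2f}, Freedman-Diaconis w={w_fd:.2f}")

print(scott_k)
print(fd_k)
```

Because formula (i) rounds up, a width that does not divide the span of the data exactly always produces one extra range that covers the remainder.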

The following picture shows all the calculations for all the previous formulas when the given sample is equal to 928.

Histogram versus Bar chart
When we have a qualitative variable, a bar chart can be used instead of a histogram. Note that, usually, the bars of a bar chart do not touch each other, which shows that a non-continuous variable is represented.

Karl Pearson
Francis Galton: Dataset with the Heights of Children and Parents
Table with relevant Range formulas
Doane’s rule and others
Histogram: wiki
Histogram etymology