Boxplot: A Little History
The “box and whisker plot”, or simply the Boxplot, was first used in 1970 by John W. Tukey, a very important mathematician. He wanted to produce a graph that summarizes the properties of a Continuous Distribution.
Boxplot: What is it?
A classic Boxplot includes:
i) a rectangular Box
ii) the Median, which is drawn as a line inside the Box. If the distribution is symmetrical (for example, a Normal distribution), this line is drawn in the center of the “box”.
iii) two Whiskers, each shaped like the letter “T”, drawn from the upper and lower sides of this rectangular Box.
The Boxplot can be drawn Horizontally or Vertically. Tukey’s Boxplot has been modified in various ways in order to display additional statistics, such as the Arithmetic Mean.
Boxplot: What Information is shown?
A Classic Boxplot, as seen in the following figure, includes statistics about the dispersion of a dataset and, thus, about its shape. Specifically, it shows information about:
i) Median value (2nd Quartile): It is represented by a line inside the Box. This line separates the data of a variable exactly in half: 50% of the values lie before this line and 50% of the values lie after it. The position of this line inside the Box shows whether the central values are concentrated closer to the 1st or to the 3rd Quartile.
ii) Quartiles: the central 50% of the values: The rectangular Box of the Boxplot represents the values that lie between the 1st Quartile (Q1) and the 3rd Quartile (Q3). That is, the Bottom Side of this Box represents the 1st Quartile (Q1) and the Upper Side of this Box represents the 3rd Quartile (Q3). In other words, the height of this Box equals the Interquartile Range (IQR=Q3-Q1). In that way, the central 50% of the values of a Continuous variable is represented by this Box. Note that the length of the other two sides of the Box, those that do not carry whiskers, is chosen arbitrarily.
iii) Quartiles: the non-central 50% of the values: 25% of the values of a variable lie before the Bottom Side of the Box (1st Quartile (Q1)), and another 25% of the values lie after the Upper Side of the Box (3rd Quartile (Q3)).
iv) Whiskers: The Bottom Whisker marks the limit that lies 1.5*IQR below the 1st Quartile (Q1-1.5*IQR) and the Upper Whisker marks the limit that lies 1.5*IQR above the 3rd Quartile (Q3+1.5*IQR). In Tukey’s convention, each Whisker is drawn to the most extreme observed value that still falls within this limit. Some modified Boxplots replace these “Whisker” values with the 2nd and 98th percentiles.
v) Outliers: The values that lie below the Bottom Whisker limit (Q1-1.5*IQR) or above the Upper Whisker limit (Q3+1.5*IQR) can be described as outliers, that is, they are extreme values. These outliers are represented by circles. In some modified Boxplots, the values that lie even further away (below Q1-3*IQR or above Q3+3*IQR) are represented by a Star. These limits are also named “Fences”: the inner fences lie at 1.5*IQR and the outer fences at 3*IQR from the Box.
vi) The maximum and minimum limits or values of a variable can also be inferred from a Boxplot. Some modified Boxplots replace the values of the Whiskers with the minimum and maximum values of the variable.
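The quantities listed above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `boxplot_stats`, the sample values and the choice of the “inclusive” quartile method are assumptions made for the example. Statistical packages use several different quartile conventions, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
from statistics import quantiles


def boxplot_stats(values, k=1.5):
    """Return the quantities a classic boxplot displays (hypothetical helper)."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions give slightly different cut points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations that are still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up sample with one clearly extreme value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
stats = boxplot_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this made-up sample only the value 30 falls outside the fences, so it would be drawn as a circle, while the Upper Whisker would stop at 10.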
Boxplot: Example figure
The figure below presents the “five” statistical points of the Boxplot as well as their relation to the Curve of a Standard Normal Distribution. The “gray star” shows where a possible outlier / extreme value would appear, IF such a value existed. Note that the Standard Normal Distribution produces very few Outliers / extreme values. It is symmetrical and, therefore, its boxplot is also symmetrical: both whiskers lie at the same distance from the box and the median is exactly in the center of the box.
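The symmetry described above can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library to locate the quartiles and the 1.5*IQR limits of a Standard Normal Distribution; the variable names are chosen only for this illustration. It also estimates the share of the distribution that lies beyond those limits, which is small but not exactly zero.

```python
from statistics import NormalDist

z = NormalDist()  # Standard Normal: mean 0, standard deviation 1

q1 = z.inv_cdf(0.25)
median = z.inv_cdf(0.50)
q3 = z.inv_cdf(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Probability mass that falls outside the two whisker limits.
share_beyond = z.cdf(lower_limit) + (1 - z.cdf(upper_limit))

print(f"Q1 = {q1:.4f}, Median = {median:.4f}, Q3 = {q3:.4f}")
print(f"Whisker limits = {lower_limit:.4f} and {upper_limit:.4f}")
print(f"Share beyond the limits = {share_beyond:.4%}")
```

The quartiles and limits come out as mirror images around zero, and well under 1% of the distribution falls beyond the whisker limits.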
Boxplot: Modifications
Tukey’s Boxplot has been modified in a number of ways in order to present additional properties of the dataset:
i) The Traditional Boxplot, as it was described by Tukey
ii) Variable Width Boxplot: The size of the sample determines the width of the Box
iii) Notched Box Plot: Notches mark an approximate confidence interval around the Median
iv) Violin plot: The outline (the yellow one) shows the Probability Density (pdf) of the dataset / group
v) Vase plot: The outline (the yellow one) shows the Modality of the dataset: Unimodal or Bimodal?
vi) Bean Plot: Each black line is an individual observation of the dataset, and the width of a line can indicate duplicate values. The big line shows the mean.
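Two of these modifications reduce to simple arithmetic, sketched below. The formulas are assumptions taken from common practice rather than from this text: notches are often drawn at Median ± 1.57*IQR/sqrt(n), and box widths are often made proportional to the square root of the sample size. The function names and the example numbers are hypothetical.

```python
from math import sqrt


def notch_limits(median, q1, q3, n):
    """Approximate notch around the median (assumed rule: 1.57 * IQR / sqrt(n))."""
    half_width = 1.57 * (q3 - q1) / sqrt(n)
    return median - half_width, median + half_width


def relative_widths(sample_sizes):
    """Box widths proportional to sqrt(n), scaled so the largest group has width 1."""
    roots = [sqrt(n) for n in sample_sizes]
    largest = max(roots)
    return [r / largest for r in roots]


# Made-up group: median 50, Q1 40, Q3 60, 100 observations.
low, high = notch_limits(median=50, q1=40, q3=60, n=100)
print(f"Notch from {low:.2f} to {high:.2f}")

# Made-up sample sizes for two groups.
print(relative_widths([100, 25]))
```

Under this rule, when the notches of two boxes do not overlap, the two medians are commonly read as likely to differ.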
Boxplot: Example I
The following Table presents the Descriptive statistics of the finish times (in minutes) of Male and Female runners in the 10-mile “Cherry Blossom” run of 2009. It also presents the “five” statistics that are needed for the construction of the relevant boxplots, one for each gender.
A comparison of these two Boxplots reveals that:
i) The whole Boxplot of Female runners is positioned higher than the Boxplot of Male runners. That is, both Whiskers, as well as the Box itself, represent higher Finish times for Female Runners than for Male Runners. Indeed, if we read the Table, we can see that the Median Finish time for Females was 98.03 minutes and the Median Finish time for Males was 87.47 minutes.
ii) Moreover, it can be seen that the Finish time of one Female Runner was about 170 minutes. This value is shown as a circle that is far away from the Box as well as from the rest of the circles in the Boxplot of Female Runners. Note that all the other circles shown in both Boxplots can also be characterized as Outliers, based on the definition of the Boxplot.
Finish Time statistics (in minutes) | Females | Males |
---|---|---|
n | 9732 | 7192 |
Mean | 99.02 | 88.43 |
SD | 14.68 | 15.52 |
Min | 54.03 | 45.25 |
Max | 170.97 | 150.98 |
Summary statistics for the construction of the Boxplot | | |
Bottom Whisker: Q1-1.5*IQR | 61.12 | 47.55 |
Box starts at (Q1) | 89.08 | 77.55 |
Median (Q2) | 98.03 | 87.47 |
Box ends at (Q3) | 107.90 | 97.78 |
Upper Whisker: Q3+1.5*IQR | 136.13 | 127.87 |
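
As a rough check, the limits can be recomputed from the quartiles in the Table and compared with the minimum and maximum values. The sketch below copies its numbers from the Table; the names `table`, `limits` and `results` are chosen only for this illustration. The computed limits may differ slightly from the Whisker values printed in the Table. That would be consistent with Whiskers drawn at the most extreme observed values inside the limits, but this is an interpretation and is not stated in the Table itself.

```python
# Quartiles and extremes copied from the Table above (finish times in minutes).
table = {
    "Females": {"min": 54.03, "q1": 89.08, "q3": 107.90, "max": 170.97},
    "Males": {"min": 45.25, "q1": 77.55, "q3": 97.78, "max": 150.98},
}


def limits(q1, q3, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


results = {}
for group, s in table.items():
    low, high = limits(s["q1"], s["q3"])
    _, far_high = limits(s["q1"], s["q3"], k=3)
    results[group] = {
        "lower_limit": low,
        "upper_limit": high,
        "has_low_outliers": s["min"] < low,
        "has_high_outliers": s["max"] > high,
        # True when the slowest runner lies beyond Q3 + 3*IQR.
        "max_beyond_3_iqr": s["max"] > far_high,
    }
    print(f"{group}: limits {low:.2f} to {high:.2f}, "
          f"low outliers: {s['min'] < low}, high outliers: {s['max'] > high}, "
          f"max beyond 3*IQR: {s['max'] > far_high}")
```

With these numbers, both groups have finish times outside the 1.5*IQR limits on both sides, and only the slowest Female finish time lies beyond Q3+3*IQR.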