What is statistics? – Standard Deviation and Variance

Defining some Statistical terms
Population
When we use the word “Population”, we mean every member of the group we are interested in, e.g. all the people who live in a city or all the birds that live in a region.

Sample
A Sample is a part of the whole population, e.g. 100 people chosen from a city that has 10000 inhabitants.

population vs Sample

What is Standard Deviation and Variance – A Historical view
Karl Pearson was one of the most influential and famous statisticians who ever lived. He founded the world's first university statistics department, in London. He was honored multiple times by the UK government. Among other things, he also introduced the Standard Deviation in the mid-1890s.

What is Standard Deviation and Variance – Theoretical Definition
A Simple explanation: Standard Deviation (SD) shows how far, on average, the numbers in a dataset lie from their mean. Suppose you take a walk each day and, for five (5) days, you write down the distance you walked. After the five days, you calculate the average of these distances. The Standard Deviation then tells you how far a typical day's distance is from that average. (Strictly speaking, it is the square root of the average squared distance, not the plain average distance.) The more data you collect, the more reliable the result becomes. Variance measures the same thing with one difference: it is the average squared distance from the mean, so it equals the Standard Deviation squared.

Standard Deviation and Variance – A little Imagination!
It is easier to imagine each number as a small circle. All these circles are inside a big circle (your dataset). Now, you can depict your Mean as another circle, ideally near the center of the big circle. The next step is to join each small circle to the circle of the Mean with a straight line. If you could measure these distances and take their average, you would get an idea of what Standard Deviation and Variance are!

Standard_deviation_Variance

Statistical definition of the Standard Deviation and Variance
The statistical definition of the Standard Deviation and Variance changes slightly depending on whether we have a Population or a Sample, that is, a dataset that includes all the possible data or only a part of them. The change is a slight one in the division part of the formula: we use N as the denominator when we have a population-based dataset, and n-1 as the denominator when we have a sample-based dataset. Note that we get the Variance if we do not apply the square root in the last step of the Standard Deviation calculation. Variance is separated from Standard Deviation by only one step in the calculations!

Statistical formulas of Standard Deviation

A population-based formula: \sigma =\sqrt{\frac{1}{N}*\sum_{i=1}^{N}(x_{i}-\mu)^{2}}

A sample-based formula: s=\sqrt{\frac{1}{n-1}*\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}

Statistical formulas of Variance
Note: Variance can be denoted as s^2, \sigma^2, or Var(X)

A population-based formula: \sigma^2 =\frac{1}{N}*\sum_{i=1}^{N}(x_{i}-\mu)^{2}

A sample-based formula: s^2=\frac{1}{n-1}*\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}
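The four formulas above differ only in the divisor and in the final square root, which a short Python sketch can make visible. The function names `variance` and `standard_deviation` and the `sample` flag are illustrative choices, not part of the original article, and the three input values are hypothetical.

```python
import math

def variance(values, sample=False):
    """Variance of a list of numbers.

    sample=False divides by N (population formula).
    sample=True divides by n - 1 (sample formula); this needs at least 2 values.
    """
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return sum_of_squares / divisor

def standard_deviation(values, sample=False):
    """Standard Deviation is the square root of the Variance."""
    return math.sqrt(variance(values, sample=sample))

# Hypothetical data: mean is 4, squared distances are 4, 0 and 4 (sum 8)
data = [2, 4, 6]
print(variance(data))                         # 8 / 3
print(variance(data, sample=True))            # 8 / 2 = 4.0
print(standard_deviation(data, sample=True))  # sqrt(4.0) = 2.0
```

Keeping a single function with a flag, instead of two separate functions, underlines that the population and sample versions share every step except the division.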

Explaining the Statistical Symbols

\sigma and s = The Standard Deviation of a Population (\sigma) and of a Sample (s)
N and n = The number of values in the Population (N) or in the Sample (n)
\mu and \bar{x} = The mean of the Population (\mu) or of the Sample (\bar{x})
\sum = This symbol indicates that the results of the arithmetic operations written after it must be added together, one result for each value of i.
x_{i} = The value at position i in the dataset. The “i” runs from 1 up to the total number of values in the dataset, so x_{5} is the fifth (5th) number in a dataset and x_{7} the seventh (7th).

Statistical Example I – Steps
Let’s say that you have written down the kilometers that you walked on each of six (6) days: x_{1}=4, x_{2}=3, x_{3}=5, x_{4}=3, x_{5}=4, x_{6}=5
Here, we treat these data in two ways: first as a whole Population, and then as a Sample drawn from a larger population. The steps are:


i) N=n=6
ii) You calculate the mean of your data: \frac{(4+3+5+3+4+5)}{6}=\frac{24}{6}=4
iii) Therefore, \mu=\bar{x}=4
iv) Each x_{i} is known: these are your numbers!

Calculations that take part inside \sum
i) You subtract the mean from each x_{i}, individually:

(4-4)=0
(3-4)=-1
(5-4)=1
(3-4)=-1
(4-4)=0
(5-4)=1

ii) Now, you must square each result (raise it to the power of 2), that is, you multiply it by itself:

0*0=0
(-1)*(-1)=1
1*1=1
(-1)*(-1)=1
0*0=0
1*1=1

iii) and then you add these results:
0+1+1+1+0+1=4
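The three steps inside the summation can also be retraced in a few lines of Python. This is only a sketch of the arithmetic above. The variable names are illustrative.

```python
walked_km = [4, 3, 5, 3, 4, 5]  # the six daily distances from the example

mean = sum(walked_km) / len(walked_km)       # 24 / 6 = 4.0
deviations = [x - mean for x in walked_km]   # step i: each number minus the mean
squared = [d ** 2 for d in deviations]       # step ii: square each result
sum_of_squares = sum(squared)                # step iii: add the squares

print(deviations)      # [0.0, -1.0, 1.0, -1.0, 0.0, 1.0]
print(squared)         # [0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
print(sum_of_squares)  # 4.0
```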

Results for Variance
You must divide the sum (4) either by N=6 for a Population or by n-1=6-1=5 for a Sample.
This step gives the Variance of the Population and of the Sample:
\sigma^2=\frac{4}{6}=0.67 and s^2=\frac{4}{5}=0.80
The results show that the Variance of the population is 0.67 and the Variance of the sample is 0.80

Result for Standard Deviation
It is simple now! You must calculate the Square root of the Variance results!
The Standard Deviation for Population is: \sqrt{0.67}=0.82
The Standard Deviation for Sample is: \sqrt{0.80}=0.89
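If you want to cross-check these four results, Python's standard `statistics` module provides one function for each formula: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n-1. The sketch below assumes the same six distances. The variable names are illustrative.

```python
import statistics

walked_km = [4, 3, 5, 3, 4, 5]

population_variance = statistics.pvariance(walked_km)  # divides by N = 6
sample_variance = statistics.variance(walked_km)       # divides by n - 1 = 5
population_sd = statistics.pstdev(walked_km)           # square root of the population variance
sample_sd = statistics.stdev(walked_km)                # square root of the sample variance

print(round(population_variance, 2), round(sample_variance, 2))  # 0.67 0.8
print(round(population_sd, 2), round(sample_sd, 2))              # 0.82 0.89
```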

The following table presents the steps for getting these results

Position (i) | Value (x_i) | Value minus the Mean (x_i - 4) | Squared result
1 | 4 | 4-4=0 | 0*0=0
2 | 3 | 3-4=-1 | (-1)*(-1)=1
3 | 5 | 5-4=1 | 1*1=1
4 | 3 | 3-4=-1 | (-1)*(-1)=1
5 | 4 | 4-4=0 | 0*0=0
6 | 5 | 5-4=1 | 1*1=1

Interpretation of Standard Deviation
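a) By subtracting the mean from each number, we got the distance of each number from the mean.
b) Then, we squared the results so that negative and positive distances do not cancel each other out, e.g. (-1)*(-1) equals +1.
c) Then we added the results and divided them by the size of the dataset, that is, by 6 for the population formula (or by 5 for the sample formula). This gave the Variance.
d) Finally, we took the square root of the Variance to return to the original unit (kilometers). This gave the Standard Deviation.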

Interpretation of Results
The mean, plus or minus the Standard Deviation, indicates the range around which your walking distances cluster. So: 4\pm0.82 or 4\pm0.89. That is, you walked roughly 3 to 5 kilometers on each of these 6 days! Look at your notes! Therefore, by knowing only these two numbers, the mean and the Standard Deviation, you can estimate where most members of a dataset lie! If you widen the range to two or three Standard Deviations around the Mean, it covers a larger share of the values in the dataset. Is it not a magical thing? Statistics creates Magic!
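To see how the share of covered values grows with the number of Standard Deviations, you can count the values that fall inside each range. The sketch below uses the population SD of the six distances. The helper `share_within` is a hypothetical name introduced only for this illustration. With so few values the one-SD range (about 3.18 to 4.82) is narrow and catches only the two 4s, while the two-SD range already contains every value.

```python
import statistics

walked_km = [4, 3, 5, 3, 4, 5]
mean = statistics.mean(walked_km)   # 4
sd = statistics.pstdev(walked_km)   # about 0.82 (population formula)

def share_within(values, center, spread, k):
    """Fraction of values lying within k spreads of the center (inclusive)."""
    low, high = center - k * spread, center + k * spread
    inside = [x for x in values if low <= x <= high]
    return len(inside) / len(values)

for k in (1, 2):
    print(k, share_within(walked_km, mean, sd, k))
# 1 0.3333333333333333
# 2 1.0
```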

Variance and Standard Deviation: Similarities and Differences
Variance is the step before Standard Deviation in the calculation. Both Variance and Standard Deviation describe the dispersion of a dataset, that is, how far the numbers of a dataset lie from its mean. Therefore:

i) When Variance equals zero (0), all numbers are equal to the mean. That is, if the mean of a dataset is 30, then every number in this dataset is “30”: the values are identical. The same is true of the Standard Deviation.

ii) A low value of Variance shows that the distance between the mean and the numbers of a dataset is also low. Similarly, a high value of Variance shows that this distance is high. The same is true of the Standard Deviation.

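iii) Note that neither Variance nor Standard Deviation can be negative: Variance is a sum of squares divided by a positive number, and Standard Deviation is its non-negative square root. Variance is expressed in squared units (e.g. square kilometers), while Standard Deviation is in the same unit as the data (e.g. kilometers).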

iv) The graph below shows that the Standard Deviation has a higher value than the Variance ONLY when the Variance lies between 0 and 1. Plotted against the Variance, the Variance is a straight line, while the Standard Deviation, being its square root, is a curve that flattens out.
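The crossover at 1 follows directly from the square root, and a few hypothetical Variance values are enough to check it. The list of values below is chosen only for illustration.

```python
import math

# Hypothetical Variance values on both sides of 1
variances = [0.0, 0.25, 0.81, 1.0, 4.0, 100.0]

comparison = []
for var in variances:
    sd = math.sqrt(var)  # Standard Deviation is the square root of the Variance
    if sd > var:
        relation = "SD > Var"
    elif sd < var:
        relation = "SD < Var"
    else:
        relation = "SD = Var"
    comparison.append((var, sd, relation))
    print(f"Var={var:<7} SD={sd:<7.3f} {relation}")
```

Only 0.25 and 0.81 give "SD > Var". At 0 and at 1 the two values are equal, and above 1 the Variance is always the larger one.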

Variancd_Vs_SD

Variance and Standard Deviation: Similarities and Differences: Example
The following Table and Graph show how the Standard Deviation, being the square root of the Variance, grows much more slowly than the Variance. They also show how the values of Variance and Standard Deviation change as the numbers around the mean change. Each dataset includes five (5) values and is treated as a Population (division by N=5):

i) The 1st dataset has identical numbers, therefore Var=0 and SD=0. This happens because every value equals the mean (M=30), so each difference from the mean in the Variance or Standard Deviation formula is zero.

ii) In the 2nd dataset, one number changed slightly while the rest remained identical. The mean also changed slightly (M=30.2). Therefore, the values of Variance and Standard Deviation also changed in relation to the prior case. Note that the value of the Standard Deviation (0.40) is higher than the value of the Variance (0.16).

iii) In the 3rd dataset, one number again changed slightly in relation to the rest. The Mean also changed slightly again (M=30.6, since 153/5=30.6). The value of Variance had a ninefold increase in relation to its previous value, while the value of Standard Deviation only increased 3 times. Note that the value of the Standard Deviation is now NOT higher than the value of the Variance. This will remain true for the rest of the cases.

Graph: Example
The infographic below presents the information of the Table in a fun way. It shows a friendly figure that has four (4) hands and one (1) leg. It also has two (2) antennas. It is repeated six (6) times, once for each example in the Table:

i) Its four hands are symbols for the identical numbers in the corresponding dataset.
ii) Its Leg is a symbol for the number that differs from the rest.
iii) Its T-shirt is a symbol for the Mean value.
iv) Its two Antennas are symbols for the Variance and the Standard Deviation, one each.
v) As the difference of a number from the Mean value increases, the Mean gets “Meaner” towards this number and expands against it.

Variance_Standard_Deviation_infographic

iv) The 4th and 5th datasets each include four identical values and one different value. In both cases, the distance (difference) between the Mean and the four identical values is “4 points”, while the distance between the Mean and the other number is “16 points”. The Mean value is 46 in the 4th dataset and 34 in the 5th. Because the distances between the numbers and their Mean are the same in both cases (in absolute values), the Variance and the Standard Deviation are EXACTLY the same in both cases.

v) Finally, in the 6th dataset, the value of Variance increased enormously (to 15904144), while the value of Standard Deviation (3988) did not skyrocket in the same way.

Dataset (N=5) | Mean | Variance (Var) | Standard Deviation (SD)
30, 30, 30, 30, 30 | 30 | 0 | 0
30, 30, 31, 30, 30 | 30.2 | 0.16 | 0.40
30, 30, 33, 30, 30 | 30.6 | 1.44 | 1.20
50, 50, 30, 50, 50 | 46 | 64 | 8
30, 30, 50, 30, 30 | 34 | 64 | 8
30, 30, 10000, 30, 30 | 2024 | 15904144 | 3988
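The values of the Table can be recomputed with Python's standard `statistics` module, assuming each dataset is treated as a Population (division by N=5). This is only a sketch for checking the numbers. The names `datasets` and `rows` are illustrative.

```python
import statistics

datasets = [
    [30, 30, 30, 30, 30],
    [30, 30, 31, 30, 30],
    [30, 30, 33, 30, 30],
    [50, 50, 30, 50, 50],
    [30, 30, 50, 30, 30],
    [30, 30, 10000, 30, 30],
]

rows = []
for values in datasets:
    mean = statistics.mean(values)
    var = statistics.pvariance(values)  # population Variance (divides by N)
    sd = statistics.pstdev(values)      # population Standard Deviation
    rows.append((mean, var, sd))
    print(values, round(mean, 2), round(var, 2), round(sd, 2))
```

Computed this way, the mean of the 3rd dataset is 30.6, and the 4th and 5th datasets return the same Variance (64) and the same Standard Deviation (8) although their means differ.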

Sources
Karl Pearson and Standard Deviation