- Kalpesh Agrawal

# Statistical Concepts- 1

Before we delve into statistics, we ought to know how important it is in our daily lives. Without statistics, you wouldn't be able to argue whether Ronaldo is better than Messi or vice versa. You also wouldn't be able to work out how many classes you can bunk while still maintaining the required attendance. Statistics is useful in every part of life: everyone from small shopkeepers to billionaire businessmen depends on it. Without statistics, you also wouldn't be able to tell which of the available products suits you best.

That being said, statistics can be described as the manipulation of available data to derive meaningful information for a given objective. In other words, raw data will not reveal the required information by itself, but meaningful information can be extracted by applying different approaches to it.

Statistics is a vast subject and can be classified in various ways, but it is commonly boiled down to two major categories:

• **Descriptive Statistics**

• **Inferential Statistics**

**Descriptive Statistics** gives us the power to extract and highlight the characteristics of the data and helps bring the most complex forms of data into a manageable form. It does so by means of,

• Appropriate classification of the data as per its depiction

• Organizing the data for easy extraction

• Summarising the characteristics of data for a detailed overview

**Inferential Statistics**, on the other hand, helps us determine whether the findings from a study group can be generalized to the population of interest. It is of utmost importance because it is often not feasible to study the characteristics of the population as a whole, mainly due to cost and time constraints.

Here, we make an assumption about the population and test that assumption using the information supplied by a sample drawn from the population.
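As a minimal sketch of this idea, the snippet below draws a sample from a made-up population and checks an assumption about the population mean against it. The population (the integers 1 to 10,000), the sample size of 500, and the seed are all illustrative assumptions, not data from any real study.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: the integers 1..10,000 (true mean 5000.5).
population = list(range(1, 10_001))
population_mean = mean(population)

# Study a sample of 500 instead of the whole population.
sample = rng.sample(population, 500)
sample_mean = mean(sample)

# Standard error: how far the sample mean typically sits from the true mean.
standard_error = stdev(sample) / len(sample) ** 0.5

# Assumption to inspect: "the population mean is 5000.5".
assumed_mean = 5000.5
z_score = (sample_mean - assumed_mean) / standard_error

print(f"population mean = {population_mean}")
print(f"sample mean     = {sample_mean:.1f} (standard error {standard_error:.1f})")
print(f"z-score against the assumed mean = {z_score:.2f}")
```

A z-score within roughly ±2 is conventionally read as the sample being consistent with the assumption; a much larger magnitude would cast doubt on it.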

**Descriptive Statistics**

**Measures of Central Tendency**

**Mean**

It is the average value, obtained by dividing the sum of the observations in a collection by the number of observations in that collection.

So a question arises: is the **MEAN** a good measure of the centre of a distribution? Sadly, the answer is **NO**, at least not always, because the mean can be heavily affected by even a single outlier.

**Let us understand**

Let’s suppose I am observing the salaries of 100 individuals, each of whom earns close to 1 lakh per month, so the mean is expected to be around that value. If a billionaire, say Warren Buffett, a favorite of all investors, joins the study group as the 101st individual, the mean will shoot up by a substantial amount. The mean will then no longer give the right idea of a typical salary in the group.

This situation can be mitigated, however, by excluding the values at the extreme percentiles, which tend to cause the adverse effect described above.
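The salary example can be sketched in a few lines. The figures are hypothetical (everyone earns exactly 1 lakh, and the outlier's income is an arbitrary large number), and `trimmed_mean` is an illustrative helper written for this sketch, not a standard-library function.

```python
from statistics import mean

# Hypothetical data: 100 people earning exactly 1 lakh (100,000) per month.
salaries = [100_000] * 100
# Hypothetical outlier: one extremely high earner (value chosen for illustration).
outlier = 1_000_000_000

mean_without = mean(salaries)
mean_with = mean(salaries + [outlier])

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values to drop per end
    return mean(ordered[k:len(ordered) - k]) if k else mean(ordered)

# Dropping 1% from each end removes the outlier (and one ordinary salary).
trimmed = trimmed_mean(salaries + [outlier], 0.01)

print(mean_without, mean_with, trimmed)
```

With these numbers a single extra person moves the mean from 1 lakh to 1 crore, while trimming the extreme 1% at each end brings it back to 1 lakh.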

**Median**

The median is the value at the midpoint of the distribution when the values are arranged in ascending order. Unlike the mean, this measure of central tendency is barely influenced by outliers.

So, with this approach, it wouldn’t matter even if Bill Gates, Jeff Bezos and a couple of Russian billionaires joined the tally. The median salary would remain close to 1 lakh.
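A quick check of this claim, again with hypothetical figures: the billionaire incomes below are arbitrary placeholders, not real data.

```python
from statistics import median

# Hypothetical data: 100 people earning exactly 1 lakh per month.
salaries = [100_000] * 100
# Hypothetical incomes for four billionaires joining the group.
billionaires = [10**9, 2 * 10**9, 3 * 10**9, 4 * 10**9]

median_before = median(salaries)
median_after = median(salaries + billionaires)

print(median_before, median_after)
```

The middle of the sorted list is still occupied by ordinary salaries, so the median does not move.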

**Mode**

The mode assumes that the value with the maximum occurrence is the one that best describes the centre of the distribution. This may hold true for data that roughly follows a normal distribution, in which values are scattered symmetrically around the most frequent value. Classic examples are physical measurements (height and weight) recorded as integer values.

Generalizing this assumption may be incorrect, as more than one value may share the highest frequency. If there are two such values, the distribution is said to be BIMODAL.

So, the mode may not act as the measure of central tendency we wish it to be.
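Both cases can be illustrated with Python's `statistics` module (`multimode` requires Python 3.8 or later). The two small datasets are made up for the sketch.

```python
from statistics import mode, multimode

# Hypothetical heights in whole centimetres, clustered around one value.
heights_cm = [160, 165, 170, 170, 170, 175, 180]
single_mode = mode(heights_cm)

# Hypothetical data in which two values tie for the highest frequency.
two_peaks = [1, 2, 2, 2, 3, 4, 5, 5, 5, 6]
all_modes = multimode(two_peaks)

print(single_mode)  # 170
print(all_modes)    # [2, 5]
```

When `multimode` returns more than one value, no single mode describes the centre, which is exactly the bimodal situation described above.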

**Measures of Spread**

**Standard Deviation**

It is a measure of spread that quantifies the amount of dispersion of the data values around their mean, or expected value.

A standard deviation that is low in magnitude indicates that the data values are concentrated near the expected value and that the probability of observing extreme values is low. A standard deviation that is high in magnitude, on the other hand, indicates that the data values are spread widely and that the probability of observing extreme values is high.

If we were to claim that a value lies in a particular range, we could do so with a higher degree of confidence if the standard deviation were low.

**But what happens if we have two distributions and need to compare their standard deviations?**
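To make the contrast concrete, here is a small sketch with two made-up datasets that share the same mean but differ in spread. It uses `pstdev`, the population standard deviation; `stdev` would be the choice for a sample.

```python
from statistics import mean, pstdev

# Two hypothetical datasets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

tight_sd = pstdev(tight)
wide_sd = pstdev(wide)

print(mean(tight), mean(wide))  # both 50
print(f"tight: {tight_sd:.2f}, wide: {wide_sd:.2f}")
```

A guess such as "the next value lies between 45 and 55" would be far safer for the first dataset than for the second.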

For that purpose, the Coefficient of Variation (the standard deviation divided by the mean) is of great aid, as it compares the standard deviations of the two distributions while taking their means into account. It is especially useful when the means of the two distributions differ greatly, because comparing the raw standard deviations alone could then lead to misleading interpretations.
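A minimal sketch of the comparison, assuming the usual definition CV = standard deviation / mean. The two datasets are hypothetical and were chosen so that their standard deviations are identical while their means differ.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation expressed as a fraction of the mean."""
    return pstdev(values) / mean(values)

# Hypothetical datasets: same absolute spread, very different means.
small_scale = [10, 20, 30]
large_scale = [1010, 1020, 1030]

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)

print(f"std devs: {pstdev(small_scale):.3f} vs {pstdev(large_scale):.3f}")
print(f"CVs:      {cv_small:.3f} vs {cv_large:.3f}")
```

The standard deviations match, yet relative to its mean the first dataset varies far more, which is what the coefficient of variation brings out.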

**Skewness**

It is the go-to measure if the objective is to quantify how far a probability distribution is distorted from a symmetric one, such as **normally distributed** data.

The **coefficient of skewness** is a numerical value that helps determine whether the distribution is skewed to the right or to the left (discussed later in this section).

The most common way to determine skewness is Pearson's coefficient: the ratio of the difference between the mean and the mode to the standard deviation of the data values.

$$Sk_P = \frac{\text{Mean} - \text{Mode}}{\sigma}$$

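Although Pearson’s approach is highly regarded for its simplicity, there are other approaches for obtaining the coefficient of skewness, such as the decile-based measure:

$$Sk_K = \frac{D_9 + D_1 - 2D_5}{D_9 - D_1}$$

where $D_9$ and $D_1$ are the ninth and first deciles respectively, and $D_5$ is the fifth decile (the median).

And last but not least, the one obtained by the method of moments:

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

where $\mu_2$ and $\mu_3$ are the second and third central moments of the data.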

The value of the coefficient of skewness so obtained may either be **positive, negative** or **zero**.
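The three approaches mentioned above can be sketched as follows. The formulas used here are the standard textbook forms and are an assumption about what the article intends: Pearson's mode-based coefficient (mean − mode) / σ, the decile-based coefficient (D9 + D1 − 2·D5) / (D9 − D1), and the moment coefficient μ3 / σ³. The function names and the data are illustrative only.

```python
from statistics import mean, mode, pstdev, quantiles

def pearson_mode_skewness(values):
    """(mean - mode) / standard deviation."""
    return (mean(values) - mode(values)) / pstdev(values)

def decile_skewness(values):
    """(D9 + D1 - 2*D5) / (D9 - D1), where D5 is the median."""
    deciles = quantiles(values, n=10)  # nine cut points: D1..D9
    d1, d5, d9 = deciles[0], deciles[4], deciles[8]
    return (d9 + d1 - 2 * d5) / (d9 - d1)

def moment_skewness(values):
    """Third central moment divided by the cube of the standard deviation."""
    m = mean(values)
    third_moment = sum((x - m) ** 3 for x in values) / len(values)
    return third_moment / pstdev(values) ** 3

# Hypothetical samples: one with a long right tail, its mirror image,
# and a perfectly symmetric one.
right_tailed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]
left_tailed = [-x for x in right_tailed]
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_tailed), ("left", left_tailed)]:
    print(name, pearson_mode_skewness(data), decile_skewness(data),
          moment_skewness(data))
print("symmetric", moment_skewness(symmetric))
```

All three coefficients agree in sign on these samples even though their magnitudes differ, so they are not interchangeable numerically.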

A positive value denotes that the distribution is positively skewed. Likewise, a negative value denotes that the distribution is negatively skewed, and zero represents a perfectly symmetrical distribution of data values. But a question arises: what is a positively (or negatively) skewed distribution?

Let us suppose that a distribution has two tails, which it generally does. In a positively (or negatively) skewed distribution, the right (or left) tail is much more elongated than the left (or right) one. This is because the majority of the data values are concentrated towards the left (or right) of the distribution. As a result, the values concentrated towards the left (or right) are far more probable than the infrequent ones at the right (or left) end of the distribution. Thus, the probability distribution is asymmetric, as given in the figure.

**Let us understand**

**Positively Skewed**

As discussed, the data points of this probability distribution are concentrated towards the LEFT of the distribution, and the mode (the value with maximum occurrence) is observed at a fairly low value. Hence, **Mode < Mean**.

**Example**

I visit Bangalore often, where short spells of rain are common and only occasionally extend for longer periods, keeping the weather pleasant.

Would there be any skewness in the distribution of weekly rainfall levels if I were to model it?

**Yes**. The distribution of rainfall levels, if monitored on a weekly basis, will be positively skewed: long periods of rainfall are infrequent, so the weekly rainfall stays at a low level most of the time. Hence, with a higher probability of low rainfall within a week and only a small probability of very high rainfall, the distribution is asymmetrical.
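As a rough illustration, the weekly rainfall figures below are invented for this sketch (they are not measurements from Bangalore). Most weeks are light, and a couple of heavy weeks form the long right tail.

```python
from statistics import mean, median, mode

# Hypothetical weekly rainfall in millimetres over twelve weeks.
weekly_rain_mm = [2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 25, 60]

rain_mode = mode(weekly_rain_mm)
rain_median = median(weekly_rain_mm)
rain_mean = mean(weekly_rain_mm)

print(rain_mode, rain_median, rain_mean)  # 3 4.5 11.25
```

The few heavy weeks pull the mean well above the mode and the median, giving the ordering Mode < Median < Mean that is typical of a positively skewed distribution.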

**Negatively Skewed**

As discussed, the data points of this type of probability distribution are concentrated towards the RIGHT of the distribution, and the mode (the value with maximum occurrence) is observed at a fairly high value. Hence, **Mode > Mean**.

**Example**

Let us say that a company conducts a preliminary test prior to its interview process. The test turns out to be an easy one and, as a result, the majority of the applicants score well. But there are a few cases where applicants could not perform well, for reasons unknown.

So, if I were to model the marks obtained by the applicants using a probability distribution, the distribution so obtained would be negatively skewed, because the high probability of scoring in the upper ranges far outweighs the small chance of obtaining low marks, leading to an asymmetrical distribution of values.
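The same check can be run for the test-score example. The marks below are hypothetical, and the coefficient computed is the mode-based form (mean − mode) / standard deviation described earlier in the article.

```python
from statistics import mean, mode, pstdev

# Hypothetical marks out of 100 on an easy preliminary test.
marks = [35, 52, 70, 78, 82, 85, 88, 90, 90, 90, 92, 95]

marks_mean = mean(marks)
marks_mode = mode(marks)

# Mode-based coefficient of skewness: negative when the mean sits below the mode.
skew = (marks_mean - marks_mode) / pstdev(marks)

print(f"mean = {marks_mean:.2f}, mode = {marks_mode}, skewness = {skew:.2f}")
```

The handful of low scores drags the mean below the mode, so the coefficient comes out negative.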

Written by: **Lakshay Guglani**

**Special thanks to Lakshay Guglani for being a guest writer on our website.**

Thanks for reading. I hope you have learned a lot from this article.

To be continued in part 2...
