Descriptive Statistics is one of the main branches of statistics. It is concerned with the data we have in hand and is not used to draw inferences beyond that data. In Descriptive Statistics we perform basic graphical and quantitative analysis so that the behavior and characteristics of the data can be summarized in a manageable form.
When Descriptive Statistics is applied to a single variable at a time, it is called univariate analysis. Univariate Analysis generally summarizes the data under the following three categories.
- Distribution
- Central Tendency
- Dispersion
Distribution
The most common kind of distribution is the Frequency Distribution, which is simply a classification of the data into groups together with the number of observations in each group. It exposes the pattern of variation in the data and helps in determining the probability law governing that variation.
Data can be divided into two types: Discrete and Continuous. For discrete data, a frequency distribution table or plot can be obtained easily by counting each distinct value. For continuous data, the classes have to be chosen first, using either the Class Boundary Method or the "A to under B" Method.
- Class Limits are the smallest and the largest values that can fall in a class; they are called the lower and upper class limits respectively.
- Class Boundaries are the values that separate adjacent classes, usually taken midway between the upper limit of one class and the lower limit of the next.
- Class Width is the difference between the upper and lower boundaries of a class.
- Class Midpoint is the average of the two class boundaries.
- ith Class Frequency is the number of data points in the ith class.
- Relative Frequency of the ith class is its class frequency divided by the total frequency.
- Percent Relative Frequency is Relative Frequency multiplied by 100.
- Cumulative Relative Frequency of the ith class is the sum of the Relative Frequencies of all classes up to and including the ith.
- Percent Relative Cumulative Frequency is Relative Cumulative Frequency multiplied by 100.
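The quantities in this list can be sketched in a few lines of Python. The data values, the class edges and the helper name `frequency_table` are made up for illustration. The classes follow the "A to under B" convention, so each class includes its lower edge and excludes its upper edge.

```python
def frequency_table(data, edges):
    """Build a frequency table for the classes [edges[i], edges[i+1])."""
    n = len(data)  # total frequency; assumes every value falls in some class
    rows = []
    cumulative = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        # "A to under B": lower edge included, upper edge excluded
        freq = sum(1 for x in data if lower <= x < upper)
        rel = freq / n
        cumulative += rel
        rows.append({
            "class": (lower, upper),
            "midpoint": (lower + upper) / 2,
            "frequency": freq,
            "relative": rel,
            "percent": 100 * rel,
            "cumulative": cumulative,
            "cumulative_percent": 100 * cumulative,
        })
    return rows

# Hypothetical sample: 10 measurements grouped into 3 classes of width 10
data = [12, 15, 21, 22, 25, 27, 31, 33, 38, 39]
edges = [10, 20, 30, 40]
table = frequency_table(data, edges)
for row in table:
    print(row)
```

For this sample the class frequencies are 2, 4 and 4, and the cumulative relative frequency of the last class comes out as 1 (up to floating-point rounding), which is a quick sanity check on any frequency table.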
A frequency distribution lets one visualize and compare two samples, and it also facilitates crude inferences about the population.
Graphs make distributions easier to observe and understand. A Frequency Polygon is a graph of frequency against class midpoint. It can be used to compare the frequency distributions of two data sets if the distributions have the same number of classes. The comparison is usually based on relative frequency, so that different sample sizes do not distort it.
Histograms and Ogives are also used to represent data. A Histogram is a bar graph of the frequency distribution, with one bar per class. It is less convenient than a frequency polygon for comparing two distributions, because overlaid bars hide each other. An Ogive is the graph of relative cumulative frequency plotted against the upper class boundaries. Ogives are also called Cumulative Frequency Polygons.
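As a rough sketch of what these graphs plot, the snippet below computes the (x, y) points of a relative frequency polygon and of an ogive for two samples grouped into the same classes. The samples, the class edges and the function names are illustrative assumptions. The resulting points could be handed to any plotting library.

```python
def relative_frequencies(data, edges):
    """Relative frequency of each class [edges[i], edges[i+1])."""
    n = len(data)
    return [
        sum(1 for x in data if lo <= x < hi) / n
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

def polygon_points(data, edges):
    """Frequency polygon: relative frequency against class midpoint."""
    rel = relative_frequencies(data, edges)
    mids = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
    return list(zip(mids, rel))

def ogive_points(data, edges):
    """Ogive: cumulative relative frequency against upper class boundary."""
    points, running = [], 0.0
    for upper, rel in zip(edges[1:], relative_frequencies(data, edges)):
        running += rel
        points.append((upper, running))
    return points

# Two hypothetical samples of different sizes, grouped into the same classes
edges = [0, 10, 20, 30]
sample_a = [2, 5, 11, 14, 17, 19, 23, 28]
sample_b = [1, 3, 4, 8, 9, 12, 15, 16, 21, 25, 26, 29]

print(polygon_points(sample_a, edges))
print(polygon_points(sample_b, edges))
print(ogive_points(sample_a, edges))
```

Because both polygons use relative rather than absolute frequency, the samples can be compared even though one has 8 observations and the other 12.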
Probability Distributions are the theoretical counterpart of relative frequency distributions. They model the behavior of an experiment's outcomes, describing the relative frequency distribution that emerges when a large number of trials is performed.
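One way to see this link is to simulate an experiment and watch the relative frequencies settle near the theoretical probabilities. The sketch below assumes a fair six-sided die as the model, which is an example chosen here, not one from the text above. The function name and the fixed seed are likewise illustrative.

```python
import random
from collections import Counter

def relative_frequency_of_die(trials, seed=0):
    """Roll a simulated fair die `trials` times; return relative frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(trials))
    return {face: counts[face] / trials for face in range(1, 7)}

theoretical = 1 / 6  # probability of each face under the fair-die model
for trials in (60, 6000, 100_000):
    rel = relative_frequency_of_die(trials)
    # Largest gap between an observed relative frequency and 1/6
    worst_gap = max(abs(p - theoretical) for p in rel.values())
    print(trials, round(worst_gap, 4))
```

With few trials the observed relative frequencies can be far from 1/6, while with many trials they typically lie very close to it.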
Central Tendency
Mean, Median & Mode
Arithmetic Mean is the simple average. If Xi denotes the observations and n is the number of observations, then
Formula: Xbar = (∑ Xi)/n
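Weighted Mean attaches a weight wi to each value Xi. It is used, for example, to approximate the arithmetic mean from a continuous-type frequency distribution, with the class midpoints as Xi and the class frequencies as weights.
Formula: Xbar = (∑ wiXi)/(∑ wi), where i takes values from 1 to n. If the weights are relative frequencies, they sum to 1 and the formula reduces to Xbar = ∑ wiXi.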
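A small sketch of both means, using made-up observations and a made-up grouping into three classes. Class midpoints stand in for the Xi and class frequencies for the wi. The function divides by the total weight, which is the same as using ∑ wiXi directly whenever the weights already sum to 1.

```python
def arithmetic_mean(values):
    """Simple average: sum of the observations divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted average; dividing by the total weight normalizes the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical raw observations and the same data grouped into classes
raw = [12, 15, 21, 22, 25, 27, 31, 33, 38, 39]
midpoints = [15, 25, 35]      # midpoints of the classes 10-20, 20-30, 30-40
frequencies = [2, 4, 4]       # number of observations in each class

exact = arithmetic_mean(raw)                     # mean of the raw data
approx = weighted_mean(midpoints, frequencies)   # approximation from the table
print(exact, approx)
```

The grouped approximation is close to the exact mean but not equal to it, because every observation in a class is treated as if it sat at the class midpoint.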
Properties:
- ∑(Xi – Xbar) = 0
- ∑(Xi – A)^2 = minimum when A=Xbar
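* Xbar is not a good measure of the middle for skewed datasets, because a few extreme values pull it toward the tail. In such cases we use the Median, the middle value of the sorted observations (or the average of the two middle values when n is even).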
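The two properties and the note on skewed data can be checked numerically. The numbers below are invented for illustration, and the second property is only spot-checked against a handful of candidate values of A, which is a demonstration and not a proof.

```python
from statistics import mean, median

data = [2, 4, 4, 5, 10]
xbar = mean(data)

# Property 1: deviations from the mean cancel out.
deviation_sum = sum(x - xbar for x in data)

# Property 2: the sum of squared deviations is smallest at A = Xbar.
def squared_deviations(a):
    return sum((x - a) ** 2 for x in data)

candidates = [xbar - 1, xbar - 0.5, xbar, xbar + 0.5, xbar + 1]
best_a = min(candidates, key=squared_deviations)

# A skewed sample: one large value drags the mean but not the median.
skewed = [20, 22, 25, 27, 30, 250]
skewed_mean = mean(skewed)      # 374 / 6, about 62.3
skewed_median = median(skewed)  # average of 25 and 27

print(deviation_sum, best_a, skewed_mean, skewed_median)
```

In the skewed sample the mean is larger than five of the six observations, while the median stays among the typical values.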
Population Mean is given by the expected value of X. For a discrete variable,
E(X) = ∑(XiPi), where Pi is the probability of observing Xi.
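As a minimal sketch, the expected value of a fair six-sided die can be computed directly from this formula. The die is an assumed example, and `expected_value` is a hypothetical helper name. Exact fractions are used so that the probabilities sum to exactly 1.

```python
from fractions import Fraction

def expected_value(values, probabilities):
    """E(X) = sum of x_i * p_i for a discrete random variable."""
    # Exact check; intended for Fractions (floats would need a tolerance).
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probabilities))

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6   # fair die: every face equally likely

die_mean = expected_value(faces, probs)
print(die_mean, float(die_mean))
```

The result, 7/2, is not a value the die can actually show. The population mean describes the long-run average, not a typical single outcome.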
Mode, the value that occurs the greatest number of times in a set of observations, is another measure of central tendency.
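A short sketch of finding the mode by counting. The helper name `modes` and the sample values are assumptions for illustration. It returns a list because a dataset can have more than one most frequent value.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs the greatest number of times."""
    counts = Counter(values)
    top = max(counts.values())
    # Counter keeps first-seen order, so ties come back in that order.
    return [value for value, count in counts.items() if count == top]

print(modes([1, 2, 2, 3, 3, 3]))           # a single mode
print(modes(["a", "b", "a", "b", "c"]))    # two values tie for the mode
```

Unlike the mean and the median, the mode also works for non-numeric data such as category labels.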
Dispersion
To describe the data in hand simply, we can report, apart from the mean, how spread out the data points are around that mean. This spread is the dispersion of the data, and it can be measured using the Variance or the Standard Deviation. Variance is the average squared difference of the observations from the mean, and Standard Deviation is the square root of the Variance.
Var = (∑ (Xi – Xbar)^2)/n, summed over all n observations. When the data is a sample used to estimate the population variance, the divisor n – 1 is used instead of n.
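The sketch below computes the variance and standard deviation by hand and compares them with Python's `statistics` module. It assumes the usual definitions, in which the summed squared deviations are divided by n for a population and by n – 1 for a sample. The data values are made up.

```python
import math
import statistics

def population_variance(values):
    """Average squared deviation from the mean (divisor n)."""
    xbar = sum(values) / len(values)
    return sum((x - xbar) ** 2 for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]

var = population_variance(data)   # squared deviations sum to 32; 32 / 8 = 4.0
sd = math.sqrt(var)               # 2.0

# The standard library offers both divisors:
library_population = statistics.pvariance(data)  # divisor n
library_sample = statistics.variance(data)       # divisor n - 1

print(var, sd, library_population, library_sample)
```

The sample variance is slightly larger than the population variance for the same numbers, since it divides the same sum by 7 instead of 8.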
These concepts are simple, yet they form the basis for talking about your data. Knowing such descriptive statistics lets you compare two univariate datasets, make inferences about the population, or extend the ideas to multivariate datasets. Some of these uses belong to Inferential Statistics.
We shall discuss it in the next blog post.