Have you ever wondered how statisticians are able to make predictions about the future, be it presidential elections or economic forecasts?
In a past post titled When Science Gets Involved in Politics, we discussed the importance of adhering to scientific sampling techniques as a first step.
So now that we have our well-defined sample, the next question is: how do we find answers about the population?
Welcome to inferential statistics. Estimation in statistics refers to the process by which statisticians are able to make relatively accurate inferences about a population based on information obtained from a sample.
In order to understand how we make that move, it is important to differentiate among three distributions: the sample distribution (the data we actually observe), the population distribution (the full set of scores we want to know about), and the sampling distribution (the distribution of a statistic, such as the mean, across all possible samples of a given size).
What is important to understand here is that the sampling distribution is theoretical, meaning that the researcher never obtains it in reality, but it is critical for estimation.
Because of the laws of probability, a great deal is known about this distribution, such as its shape, central tendency, and dispersion. We know that its shape is a normal curve. You have most likely heard of the normal or “Bell Curve,” a theoretical distribution of scores that is symmetrical and bell shaped.
The standard normal curve always has a mean of 0 and a standard deviation of 1. Furthermore, there are known probabilities that can be calculated based on the mean and standard deviation.
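These known probabilities can be computed directly from the standard normal curve. As a minimal sketch, the snippet below uses the error function from Python's standard library to recover the familiar rule that roughly 68%, 95%, and 99.7% of scores fall within 1, 2, and 3 standard deviations of the mean:

```python
import math

def std_normal_cdf(z):
    # Cumulative probability of the standard normal (mean 0, sd 1),
    # expressed through the error function erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# Probability of landing within k standard deviations of the mean
for k in (1, 2, 3):
    p = std_normal_cdf(k) - std_normal_cdf(-k)
    print(f"P(|Z| < {k}) = {p:.4f}")
# → 0.6827, 0.9545, 0.9973
```

Any score from a normal distribution can be converted to this standard scale by subtracting the mean and dividing by the standard deviation, which is why one table (or one function) covers every normal curve.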
Because one can assume that the shape of the sampling distribution is normal, we can calculate the probabilities of various outcomes. We can also assume that the mean of the sampling distribution equals the mean of the population.
Building on this is the Central Limit Theorem, which states that if random samples of size N are drawn from any population with mean μ and standard deviation σ, then as N grows, the sampling distribution of the sample means approaches a normal distribution.
The mean of that sampling distribution equals the population mean μ, and its standard error, σ/√N, shrinks as the sample size grows, so the estimates vary less and less from sample to sample. You can start to see how researchers can have more and more confidence in their results, such as election polling!
But with estimation, there is always a chance of error. The width of a confidence interval is a function of the risk we are willing to take of being wrong and of the sample size: the larger the sample, the narrower the interval for a given level of confidence.
The confidence level, in other words, refers to the probability that a specified interval will contain the population parameter. A 95% confidence level means there is a 0.95 probability that the interval contains the population mean; accordingly, there are 5 chances out of 100 that it does not.
When the purpose of the statistical inference is to draw a conclusion about a population, the significance level measures how frequently the conclusion will be wrong. For example, a 5% significance level means that our conclusion will be wrong 5% of the time. It is always the case that Confidence Level + Significance Level = 1.
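That "wrong 5% of the time" claim can itself be checked by simulation. The sketch below, a toy setup with an assumed population mean of 10, repeatedly builds a 95% interval (mean ± 1.96 standard errors) from fresh samples and counts how often the interval actually covers the true mean:

```python
import random
import statistics

random.seed(0)

def ci_contains_mu(n=50, mu=10.0, sigma=2.0, z=1.96):
    # Draw one sample, build a 95% confidence interval for the mean,
    # and report whether it covers the true population mean mu.
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return m - z * se <= mu <= m + z * se

trials = 2000
coverage = sum(ci_contains_mu() for _ in range(trials)) / trials
print(f"Coverage: {coverage:.3f}")  # close to the nominal 0.95
```

Roughly 95% of the intervals capture the true mean and about 5% miss it, which is exactly the trade-off the significance level describes.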
It is possible to make inferences about a population from a sample that is carefully selected. The sampling distribution, a theoretical one, links the known sample to a larger population through an estimation. Because of the properties of the sampling distribution we are able to identify the probability of any statistic with a certain level of confidence.
Whether you realize it or not, this stuff is under our noses every day in the news!
Keep your eye out, and the next time someone at a cocktail party talks about who is ahead in the polls, you'll be armed with a healthy dose of skepticism.