Statistics Intro

This post briefly introduces the definition of statistics, statistical distributions, sampling methods, hypothesis testing and categorical hypothesis testing.

Introduction

In Quality Assurance, statistics and data are solid evidence for proving or validating the effectiveness of an implementation. That is why understanding what statistics are and how to interpret a statistical result is critical to making the right calls.

Statistics falls into two branches depending on how the collected data is used: describing observations from objects (descriptive statistics), or using collected sample data to estimate the status of the population (inferential statistics).

Variables collected from a sample are commonly classified into the following two types.

Discrete variables: values that can be categorized, such as gender or religion.

Continuous variables: values that can lie anywhere within a range.

After the data are gathered, measures of central tendency and measures of variability are used to describe and evaluate the data.

Among measures of central tendency, mean, median and mode are the most common terms. The following table breaks down the definition of the mean, and the list below it covers the median and the mode.

Mean Definition Breakdown

When the numbers are arranged from smallest to largest, the median is the middle number.

For instance, when you have 5 numbers laid out from smallest to largest, the 3rd number is your median.

Mode is the value that appears most often in the data. If the data consists of 1, 1, 3, 6, 9, 9, 9, then the mode is 9.
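As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The data list is the one from the mode example; the variable names are only illustrative.

```python
import statistics

# Data taken from the mode example in the text.
data = [1, 1, 3, 6, 9, 9, 9]

mean_value = statistics.mean(data)      # sum of the values divided by the count
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # value that appears most often

print(mean_value, median_value, mode_value)
```

With 7 values the median is the 4th sorted value, and the mean is pulled below the mode by the two small values.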

For measures of variability, indicators such as variance, standard deviation, range and mean absolute deviation are the primary tools for evaluating how spread out the data is.

Variance and Standard Deviation Equation and Excel Commands
Range, IQR and Mean Absolute Deviation Calculation
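The variability measures named above can be sketched in Python as follows. The data is reused from the mode example for illustration only, and the quartile convention is an assumption (the default of `statistics.quantiles`); other tools may use a different convention and report a slightly different IQR.

```python
import statistics

# Illustrative data, reused from the mode example.
data = [1, 1, 3, 6, 9, 9, 9]

sample_variance = statistics.variance(data)       # divides by n - 1
sample_std = statistics.stdev(data)               # square root of the sample variance
population_variance = statistics.pvariance(data)  # divides by n

data_range = max(data) - min(data)

# Quartiles with the module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Mean absolute deviation: average distance of each value from the mean.
mean_value = statistics.mean(data)
mean_abs_dev = sum(abs(x - mean_value) for x in data) / len(data)

print(sample_variance, sample_std, data_range, iqr, mean_abs_dev)
```

The sample variance divides by n - 1 and the population variance by n, so the sample figure is always the larger of the two for the same data.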

Statistics Distribution

A statistical distribution describes the probability that certain events will occur. There are many statistical distributions to choose from, depending on the nature of the events and conditions. This section breaks down the types of distribution and when to use each one.

Statistical distributions split into two major groups: an event's probability can be either discrete or continuous. The key quantities that characterize a distribution are its mean and variance. The following table illustrates the differences between discrete and continuous distributions.

Discrete vs Continuous Distribution Comparison

Please refer to the following tables for detailed discrete and continuous distribution types/applications.

Discrete Distribution Justification Table
Continuous Distribution Justification Table
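To make the discrete versus continuous split concrete, here is a small sketch using one distribution of each kind. The binomial and normal distributions are chosen as assumed examples, and the defect rate and measurement values are hypothetical.

```python
import math
from statistics import NormalDist


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Discrete: probability of exactly 2 defects in 10 units,
# assuming a hypothetical 10% defect rate.
n_units, defect_rate = 10, 0.1
p_two_defects = binomial_pmf(2, n_units, defect_rate)
binomial_mean = n_units * defect_rate
binomial_variance = n_units * defect_rate * (1 - defect_rate)

# Continuous: probability that a hypothetical measurement with
# mean 10.0 and standard deviation 0.1 falls between 9.8 and 10.2.
measurement = NormalDist(mu=10.0, sigma=0.1)
p_within = measurement.cdf(10.2) - measurement.cdf(9.8)

print(p_two_defects, binomial_mean, binomial_variance, p_within)
```

A discrete distribution assigns probability to individual values, while a continuous one only assigns probability to intervals, which is why the normal case is computed as a difference of two cumulative probabilities.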

Statistics Sampling Methods

The main purpose of sampling is to use sampled data to represent the desired population. Sample statistics such as the sample mean and sample variance are used to estimate the population's parameters. This section explains sampling methods and tactics such as reducing sampling error.

Before sampling, note that there are 4 primary reasons why people avoid measuring the entire population:

But how does a sample effectively represent the population? It does so when both sampling error and non-sampling error are minimized. Detailed definitions of the sampling methods and errors are listed in the table below:

Sampling Method and Error Types

The following table summarizes each sampling method and its application:

Sampling Method Description Summary
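As an illustration only, three commonly used sampling methods can be sketched with Python's `random` module. The choice of methods (simple random, systematic, stratified), the population of unit IDs and the strata names are all assumptions and may not match the table exactly.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 unit IDs
sample_size = 10

# Simple random sampling: every unit has the same chance of selection.
simple_sample = rng.sample(population, sample_size)

# Systematic sampling: random starting point, then every k-th unit.
step = len(population) // sample_size
start = rng.randrange(step)
systematic_sample = population[start::step]

# Stratified sampling: draw separately from each (hypothetical) stratum
# so that every group is represented.
strata = {"line_a": population[:50], "line_b": population[50:]}
stratified_sample = [
    unit for group in strata.values() for unit in rng.sample(group, 5)
]

print(simple_sample, systematic_sample, stratified_sample)
```

Stratifying guarantees that each group contributes to the sample, which is one way to reduce sampling error when the groups differ from each other.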

Hypothesis Testing Introduction

Hypothesis testing examines whether a hypothesis about the population is supported by the sample data before any conclusion is drawn.

In hypothesis testing, the null hypothesis (H0) and the alternative hypothesis (H1) are the two primary components. The following table summarizes the scenarios in which you reject or fail to reject the null hypothesis.

Hypothesis Testing Description

The probability of a type I error, α (alpha), is also known as the significance level. α sets how strong the evidence in your sample must be before you reject the null hypothesis and conclude that the effect is statistically significant. A common default is α = 0.05, which corresponds to a 95% confidence level.

In general, a type II error is considered more tolerable than a type I error. Common statistics software such as JMP or Minitab also assumes that the null hypothesis (H0) is true for the population. Depending on the hypothesis, there are two types of tests, listed below:

Tailed Test Comparison Table
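Assuming the two types are one-tailed and two-tailed tests, the difference shows up in the critical value. The sketch below uses the standard normal distribution and α = 0.05 as an assumed significance level.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Two-tailed test: alpha is split evenly between both tails.
z_two_tailed = standard_normal.inv_cdf(1 - alpha / 2)

# One-tailed test: all of alpha sits in a single tail.
z_one_tailed = standard_normal.inv_cdf(1 - alpha)

print(round(z_two_tailed, 3), round(z_one_tailed, 3))
```

Because the one-tailed critical value is smaller, the same sample can be significant in a one-tailed test and not in a two-tailed one, so the direction of the test should be fixed before looking at the data.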

Depending on which population parameters are tested, there are different scenarios, listed below.

Hypothesis Testing Types

In general, the t-test, Z-test and F-test, which compare population means and variances for continuous data, are used most often. The following tables illustrate the calculation conditions for hypothesis testing on a single population, multiple populations and variances.

One population mean's hypothesis testing breakdown
Two population mean's hypothesis testing breakdown
Variance Comparison Hypothesis Testing

The calculation compares a test statistic computed from the actual sample data against the critical value for the chosen significance level α. The rule of thumb is given below.

Fail to reject the null hypothesis when the calculated score (Z, t, F or Chi-Square) does NOT exceed the critical value for the chosen significance level.

Reject the null hypothesis when the calculated score (Z, t, F or Chi-Square) exceeds the critical value for the chosen significance level.
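The rule of thumb can be sketched for a two-tailed, one-sample Z-test. The function name `z_test_one_sample` and all of the numbers are hypothetical, and the sketch assumes the population standard deviation is known.

```python
import math
from statistics import NormalDist


def z_test_one_sample(sample_mean, hypothesized_mean, population_sd, n, alpha=0.05):
    """Return (z score, critical value, reject H0?) for a two-tailed Z-test."""
    standard_error = population_sd / math.sqrt(n)
    z_score = (sample_mean - hypothesized_mean) / standard_error
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Reject H0 only when the score exceeds the critical value in magnitude.
    return z_score, z_critical, abs(z_score) > z_critical


# Hypothetical sample: 36 units with mean 10.3, tested against H0: mean = 10.0.
z_score, z_critical, reject_null = z_test_one_sample(10.3, 10.0, 0.8, 36)
print(z_score, z_critical, reject_null)
```

The same comparison applies to t, F and Chi-Square scores; only the distribution used to look up the critical value changes.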

Hypothesis Testing Calculation Examples

One Population Test Example
Two Large Population Test Example
Two Small Populations Test Example (where variance is same)
Two Small Populations Test Example (where variance is different)
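For the two small-sample cases, a hedged sketch of the test statistics is shown below: a pooled-variance t statistic when the variances are assumed equal, and Welch's t statistic when they are not. The function names and the data are hypothetical, and the critical value still has to come from a t table or statistics software.

```python
import math
import statistics


def pooled_t(a, b):
    """t statistic and degrees of freedom assuming equal variances."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, n1 + n2 - 2


def welch_t(a, b):
    """t statistic and approximate degrees of freedom for unequal variances."""
    n1, n2 = len(a), len(b)
    se1 = statistics.variance(a) / n1
    se2 = statistics.variance(b) / n2
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, df


# Hypothetical measurements from two small samples.
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [5.4, 5.6, 5.5, 5.3, 5.7]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(t_pooled, df_pooled, t_welch, df_welch)
```

When both samples have the same size and the same variance, as in this made-up data, the two approaches give the same statistic; they diverge as the variances or sample sizes differ.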

Categorical Hypothesis Testing

Categorical data is tested with the Chi-Square test, based on a contingency table.

In categorical data analysis, there are 4 different types of test, listed below:

Categorical Test Item Setup

For categorical tests, the chi-square statistic is calculated with the generic formula below. It also applies to the majority of categorical hypothesis tests.

The following examples demonstrate each categorical test calculation:
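The generic formula is presumably the standard Pearson chi-square statistic, written here with assumed symbols: O_i is the observed count in category (or cell) i, E_i is the count expected under the null hypothesis, and k is the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Each term measures how far an observed count is from its expected count, scaled by the expected count, so cells with large relative deviations contribute the most.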

Goodness of Fit Calculation Example
Test of Independence Example
Test of Homogeneity Example
Test of Change Illustration & Example
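Two of these calculations, goodness of fit and test of independence, can be sketched in a few lines of Python. The counts are made up for illustration, the helper names are hypothetical, and the 2x2 critical value relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Goodness of fit: hypothetical die rolled 60 times, 10 expected per face.
gof_stat = chi_square([8, 12, 9, 11, 10, 10], [10] * 6)

# Test of independence on a hypothetical 2x2 contingency table.
table = [[30, 20], [20, 30]]
expected = expected_counts(table)
independence_stat = chi_square(
    [o for row in table for o in row],
    [e for row in expected for e in row],
)

# Critical value for 1 degree of freedom at alpha = 0.05.
critical_df1 = NormalDist().inv_cdf(0.975) ** 2
reject_independence = independence_stat > critical_df1
print(gof_stat, independence_stat, critical_df1, reject_independence)
```

For tables with more degrees of freedom, the critical value would come from a chi-square table or statistics software instead.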

In general, the greater the Chi-Square value (χ²), the more likely the null hypothesis is to be rejected. A larger sample size also makes rejection more likely, but only when the null hypothesis is actually false: if the null hypothesis is true, the chance of rejecting it stays at about α regardless of sample size.
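The effect of sample size on the statistic can be seen in a small sketch: when the observed proportions stay the same but the sample grows, the chi-square value grows with it. The counts below are hypothetical.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Same 55% / 45% split against an expected 50% / 50%, at two sample sizes.
small_stat = chi_square([55, 45], [50, 50])      # 100 observations
large_stat = chi_square([550, 450], [500, 500])  # 1000 observations

print(small_stat, large_stat)
```

With 1 degree of freedom and α = 0.05 the critical value is about 3.84, so the same 55/45 split is not significant at 100 observations but is at 1000.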
