# Statistics Intro

This post briefly introduces the definition of statistics, statistical distributions, hypothesis testing and categorical hypothesis testing.

**Introduction**

In Quality Assurance, statistics and data provide solid evidence to prove or validate the effectiveness of an implementation. That is why understanding the definitions used in statistics, and how to interpret a statistical result, is critical to making the right calls.

Statistics falls into two branches depending on how the collected data are used: describing observations of the objects measured (**descriptive statistics**), or using collected **sample** data to estimate the status of the **population** (**inferential statistics**).

Variables collected from a sample are commonly classified as follows.

**Discrete** variables, such as gender or religion, take distinct values that can be **categorized**.

**Continuous** variables can take any value within a given range.

After the data are gathered, **measures of central tendency** and measures of variability are used to describe and evaluate the data.

For measures of central tendency, the mean, median and mode are the most common terms. The mean is the sum of the values divided by the number of values. The table and list below explain the differences between the three.

When the numbers are arranged from smallest to largest, the **median is the middle number**.

For instance, when you have 5 numbers laid out from smallest to largest, the 3rd number is the median.

The mode is the value that **appears most often in the data**. So if the data consist of 1, 1, 3, 6, 9, 9, 9, then the mode is 9.
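As a quick check of these definitions, here is a minimal sketch using Python's standard `statistics` module on the example values above. The variable names are only illustrative.

```python
import statistics

data = [1, 1, 3, 6, 9, 9, 9]  # the example values from the text

mean_value = statistics.mean(data)      # sum of the values / number of values
median_value = statistics.median(data)  # middle value once the data are sorted
mode_value = statistics.mode(data)      # value that appears most often

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

With 7 values, the median is the 4th value after sorting, and the mode matches the 9 worked out above.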

The **measures of variability**, such as the **variance, standard deviation, range and mean absolute deviation**, describe how widely the data are spread around the center.
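The sketch below computes each of these measures with the standard library. The data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

mean_value = statistics.mean(data)

# Range: distance between the largest and smallest value.
data_range = max(data) - min(data)

# Variance and standard deviation. The "sample" versions divide by n - 1,
# the "population" versions divide by n.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)
population_variance = statistics.pvariance(data)

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = sum(abs(x - mean_value) for x in data) / len(data)

print(data_range, sample_variance, sample_std, population_variance, mean_abs_dev)
```

Whether to divide by n or n - 1 depends on whether the data are the whole population or a sample from it.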

**Statistics Distribution**

A statistical distribution describes the probability with which certain events occur. There are many statistical distributions to choose from, depending on the nature of the events and conditions. This section breaks down the types of distribution and when to use each.

Statistical distributions fall into two major groups, depending on whether the event’s outcome is **discrete** or **continuous**. The critical parameters that characterize a distribution are its mean and variance. The following table illustrates the differences between discrete and continuous distributions.

Please refer to the following tables for detailed discrete and continuous distribution types/applications.
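To make the discrete/continuous split concrete, here is a small sketch with one distribution of each kind. The binomial and the standard normal are picked here only as familiar examples, and the parameters are hypothetical.

```python
import math
from statistics import NormalDist

# Discrete example: binomial distribution (successes in n independent trials).
n, p = 10, 0.5  # hypothetical number of trials and success probability


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


prob_five = binomial_pmf(5, n, p)
binomial_mean = n * p                # mean of a binomial distribution
binomial_variance = n * p * (1 - p)  # variance of a binomial distribution
total_probability = sum(binomial_pmf(k, n, p) for k in range(n + 1))

# Continuous example: standard normal distribution (mean 0, variance 1).
# Probabilities come from areas under the curve, not from single points.
standard_normal = NormalDist(mu=0, sigma=1)
prob_within = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(prob_five, binomial_mean, binomial_variance, prob_within)
```

A discrete distribution assigns probability to individual outcomes, while a continuous one only assigns probability to intervals.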

**Statistics Sampling Methods**

The main purpose of sampling is to use the sampled data to represent the desired population. **Sample statistics such as the sample mean and sample variance** are used to **estimate the population’s parameters**. This section explains sampling methods and tactics such as reducing sampling error.
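As a rough illustration, the sketch below draws a simple random sample from a simulated population and compares the sample mean with the population mean. The population (10,000 simulated part lengths), the sample size and the seed are all assumptions made for the example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 part lengths centered around 50.0.
population = [rng.gauss(50.0, 5.0) for _ in range(10_000)]

# Simple random sample without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)  # the parameter (usually unknown)
sample_mean = statistics.mean(sample)          # the statistic used to estimate it
sample_variance = statistics.variance(sample)  # divides by n - 1

# Sampling error: how far the estimate lands from the true parameter.
sampling_error = sample_mean - population_mean

print(population_mean, sample_mean, sample_variance, sampling_error)
```

In practice the population mean is not available, which is exactly why the sample statistic is used as the estimate.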

There are 4 primary reasons why people sample instead of measuring the entire population:

- Significant Population Size
- Uncertain Population Scope
- Destructive Test for Sample
- Good Estimation of Population based on Sample Data

But how does the sample effectively represent the population? This is achieved by minimizing the **sampling error** and **non-sampling error**. Detailed definitions of the sampling methods and errors are listed in the table below:

And the following table summarizes each sampling method and its application:

**Hypothesis Testing Introduction**

Hypothesis testing is used to examine whether a hypothesis about the population is supported by the sample data before conclusions are drawn from them.

In hypothesis testing, the null hypothesis (**H0**) and the alternative hypothesis (**H1**) are the two primary components. The following table summarizes the scenarios in which you reject or fail to reject the null hypothesis.

A Type I error is rejecting a null hypothesis that is actually true. Its probability, **α (alpha)**, is also known as the **significance level**. α sets how strong the evidence in your sample must be before you reject the null hypothesis and conclude that the effect is statistically significant. A common default is α = 0.05, which corresponds to a **95%** confidence level.
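To show how α relates to the 95% figure, here is a minimal sketch that assumes a two-sided Z-test. Other tests use different reference distributions, so the critical value below applies only to that case.

```python
from statistics import NormalDist

alpha = 0.05                  # significance level (probability of a Type I error)
confidence_level = 1 - alpha  # the matching confidence level

# Two-sided test: alpha is split evenly between the two tails of the
# standard normal distribution.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

print(confidence_level, z_critical)
```

A test statistic whose absolute value exceeds `z_critical` falls in the rejection region for this two-sided case.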

In general, a Type II error (failing to reject a false null hypothesis) is treated as more tolerable than a Type I error. Common statistics software such as JMP or Minitab also assumes that the null hypothesis (H0) is true for the population. Based on the hypothesis being tested, there are two types of tests, which are listed below:

Depending on which population parameters are being tested, the different scenarios are listed below.

In general, the **t-test, Z-test and F-test**, which compare population means and variances for continuous data, are the most commonly used. The later sections illustrate the calculation conditions for hypothesis tests on a single population, multiple populations and variances.
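As one example of these tests, the sketch below computes a one-sample t statistic by hand. The sample values and the hypothesized mean are hypothetical. The critical value 2.262 is the usual two-sided table value for α = 0.05 with 9 degrees of freedom, hard-coded here because the standard library has no t-distribution.

```python
import math
import statistics

# Hypothetical sample of 10 measurements and hypothesized population mean.
sample = [5.1, 4.9, 5.0, 5.2, 5.3, 4.8, 5.1, 5.0, 5.2, 5.4]
mu_0 = 5.0  # H0: the population mean equals 5.0

n = len(sample)
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)  # sample standard deviation (n - 1)

# One-sample t statistic: distance of the sample mean from mu_0,
# measured in standard errors.
t_statistic = (sample_mean - mu_0) / (sample_std / math.sqrt(n))

# Assumed table value: two-sided, alpha = 0.05, degrees of freedom = n - 1 = 9.
T_CRITICAL = 2.262
reject_h0 = abs(t_statistic) > T_CRITICAL

print(t_statistic, reject_h0)
```

The t-test is used here instead of a Z-test because the population standard deviation is treated as unknown and is estimated from the sample.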

The calculation compares the probability obtained from the actual sample data (the p-value) with the significance level α. The rule of thumb is given below.
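The comparison can be sketched as follows, assuming a two-sided one-sample Z-test with a known population standard deviation. The function name and all the numbers are made up for illustration. The decision rule used is the conventional one: reject H0 when the p-value is at or below α.

```python
import math
from statistics import NormalDist


def z_test_two_sided(sample_mean, mu_0, sigma, n):
    """Return the z statistic and two-sided p-value (sigma assumed known)."""
    z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


alpha = 0.05
# Hypothetical inputs: sample mean 52 from n = 100, H0 mean 50, sigma 10.
z, p_value = z_test_two_sided(sample_mean=52.0, mu_0=50.0, sigma=10.0, n=100)

# Conventional rule: reject H0 when the p-value is at or below alpha.
reject_h0 = p_value <= alpha

print(z, p_value, reject_h0)
```

Here z works out to 2.0 and the p-value to roughly 0.046, so H0 is rejected at α = 0.05 but would not be at α = 0.01.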

**Hypothesis Testing Calculation Examples**

**Categorical Hypothesis Testing**

For categorical data, the testing method is illustrated with Chi-Square tests based on a contingency table.

In categorical data analysis, there are 4 different types of test, listed below:

For categorical tests, the chi-square statistic is calculated with the generic formula below. The same formula applies to the majority of categorical hypothesis tests.
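The standard Pearson form of the statistic sums (observed - expected)² / expected over every cell of the contingency table, where each expected count is row total × column total / grand total. The sketch below implements that form for a test of independence. The 2x2 counts are hypothetical, no continuity correction is applied, and the p-value shortcut holds only for 1 degree of freedom.

```python
import math


def chi_square_statistic(observed):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical 2x2 contingency table (e.g. two groups x pass/fail).
observed = [[10, 20],
            [20, 10]]

chi2, dof = chi_square_statistic(observed)

# For dof == 1 only: the upper-tail chi-square probability equals
# erfc(sqrt(chi2 / 2)). Larger tables need a chi-square table or library.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, dof, p_value)
```

For this table every expected count is 15, so the statistic is 4 × 25 / 15, which exceeds the 3.841 critical value for α = 0.05 with 1 degree of freedom.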

The following examples will demonstrate each categorical test calculation:

In general, the greater the Chi-Square value (**χ²**), the more likely the null hypothesis is to be rejected. And the **greater the selected sample size**, the more likely the null hypothesis is to be rejected as well. But this **holds only when the null hypothesis is actually false**; when it is true, the rejection rate stays at α regardless of sample size.
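The sample-size effect can be seen in a small sketch with two hypothetical tables. Both have the same 60/40 versus 40/60 split, but the second has ten times the counts. The critical value 3.841 is the usual table value for α = 0.05 with 1 degree of freedom.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    return chi2


CHI2_CRITICAL = 3.841  # table value: alpha = 0.05, 1 degree of freedom

# Same proportions in both tables; only the sample size differs.
small_table = [[12, 8], [8, 12]]
large_table = [[120, 80], [80, 120]]

chi2_small = chi_square_statistic(small_table)
chi2_large = chi_square_statistic(large_table)

reject_small = chi2_small > CHI2_CRITICAL
reject_large = chi2_large > CHI2_CRITICAL

print(chi2_small, reject_small)
print(chi2_large, reject_large)
```

With the proportions held fixed, the statistic grows in step with the sample size, so the same observed difference that is not significant at 40 observations becomes significant at 400.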