Quantitative Data Analysis Techniques
Cluster Analysis - Grouping similar data points together based on specific characteristics, providing a means to describe a situation.
techniques
Hierarchical clustering – This approach represents a clustering algorithm that organises data points into a hierarchy of clusters, e.g. by creating a dendrogram (Figure 9) to illustrate relationships between different product categories. Its aim is to find similarities between groups. This builds on K-Means clustering by using the DISTANCE between each CHARACTERISTIC VALUE (x, y).
2 approaches
Top-down – working through the data, dividing it into clusters, and continuing to sub-divide. This approach is referred to as Divisive Hierarchical Clustering.
Bottom-up – the approach considers each data point individually and then merges clusters at successive levels until one final cluster is established. This approach is referred to as Agglomerative Hierarchical Clustering.
further example
As a simple example, working from the bottom up (Figure 9), the data analyst would look to establish clusters for the animals based on their similarities in multiple steps. The key activity is identifying the characteristics of interest and determining how ‘close’ the data points are to each other, e.g., while birds and mammals are different, creating a cluster based on ‘vertebrates’ represents the next cluster level.
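As an illustration (not part of the original diagram), the sketch below shows how agglomerative (bottom-up) clustering could be run in Python; the 2-D points, the labels, and the use of scipy/matplotlib are assumptions for demonstration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical characteristic values (x, y) for six items
points = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1],
                   [5.2, 5.8], [9.0, 1.0], [9.3, 1.2]])

# 'ward' merges, at each step, the pair of clusters that least increases
# total within-cluster variance; each row of Z records one merge
Z = linkage(points, method="ward")

# The dendrogram shows the hierarchy of merges from single points to one cluster
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("Merge distance")
plt.show()
```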
K-means clustering - A clustering approach that groups similar data points into clusters, e.g. segmenting firms into groups based on time in business and their annual purchasing behaviour
Distance within clustering – once the Centroid is calculated, the ‘distance’ between each value and the Centroid can be established. This is the critical step in clustering. The concept of distance provides a ‘measure of similarity’ or ‘dissimilarity’, i.e., a small value implies similarity between items, while larger values are deemed dissimilar. From here, grouping into appropriate clusters can be formed.
In the example provided, the average distance of the 6 data points from the Centroid (14.67 years, €47.6m) is 1.67 years and €1.31m, respectively. The more similar the items in the cluster, the smaller these values are; the analyst's goal is to define clusters by identifying smaller distances.
Applying this calculation allows the data set to be characterised, e.g. Cluster 1 in Figure 7 can be described as the Industry Leaders, Cluster 2 as Mid-Growth, and Cluster 3 as New Businesses.
steps
Step 1 – Determine, based on the data, the characteristics to cluster on (e.g. Years in Business and Annual Sales) and the number of clusters, ‘K’, which is selected in advance of the analysis.
Step 2 – Create a scatter graph from which you can visually assign clusters (e.g. Figure 8).
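To make the centroid/distance idea concrete, here is a minimal K-means sketch; the firm data (years in business, annual sales in €m) is hypothetical and the implementation is a simplified version of the algorithm (no handling of empty clusters), not the one behind the original figures.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # assign each point to its nearest centroid
        # Recompute each centroid as the mean of its assigned points
        new = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Hypothetical firms: (years in business, annual sales in €m)
firms = np.array([[2, 5], [3, 7], [14, 46], [15, 49], [30, 90], [32, 95]])
labels, centroids = kmeans(firms, k=3)
print(labels, centroids)
```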
Inferential Statistics - Drawing conclusions and making predictions about a population based on a sample
techniques
Confidence intervals (CI) – refer to the probability (or chance) that the result of your analysis will fall within a certain interval, and hence the confidence that what you calculate is right, e.g. a 95% Confidence Interval means that you have a 5% chance of being wrong with your estimate and a 95% chance of being right.
steps to calculate
Step 1 – Calculate the data set's mean (μ) or average.
Step 2 – Find the standard deviation (σ) of the data.
Step 3 – Find the Alpha value (α), also called the ‘level of significance’, by calculating α = 1 − confidence level (e.g. α = 0.05 for a 95% confidence level).
Step 4 – Based on the question being investigated, calculate the probability to look up: 1 − α for a one-tail test or 1 − α/2 for a two-tail test (see the 2 question types below).
Step 5 – Find the corresponding Z-value from the Z-Tables for the probability calculated in Step 4.
Step 6 – Calculate the confidence interval based on the following formula: CI = μ ± z × (σ / √n), with the variables defined as follows:
CI = Confidence Interval
μ = mean value of data
z = corresponding Z-value for the Alpha value (α)
σ = Standard Deviation
n = number of data points
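A minimal Python sketch of Steps 1-6, assuming a hypothetical sample and using scipy in place of a printed Z-table:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2])  # hypothetical sample

mu = data.mean()                    # Step 1: mean
sigma = data.std(ddof=1)            # Step 2: (sample) standard deviation
alpha = 1 - 0.95                    # Step 3: level of significance
z = stats.norm.ppf(1 - alpha / 2)   # Steps 4-5: two-tail Z-value (~1.96)
margin = z * sigma / np.sqrt(len(data))  # Step 6: CI = mu ± z·sigma/√n

print(f"CI = {mu:.2f} ± {margin:.2f}")
```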
2 question types
Question 1 – where the question focuses on the values that fall between 2 values – called a two-tailed test, e.g. calculating the range, with a 90% confidence interval, of the % who would vote for the current government.
Question 2 – when looking for a value that is either greater or less than a specific value – called a one-tailed test, e.g. calculating a 95% confidence interval for the average response time of a call centre.
example
Hypothesis testing - A statistical method to assess a statement (or theory) whose truth has yet to be proven, e.g. testing whether a new drug is effective.
3 steps
Step 1 - Define 2 hypotheses that are effectively opposite, i.e.,
H0 – the Null Hypothesis – the drug is NOT EFFECTIVE
HA – the Alternative Hypothesis – the drug IS EFFECTIVE
Step 2 - A ‘test’ is carried out on the sample data to assess the evidence against the NULL HYPOTHESIS, i.e. analysing a collected sample data set and determining the probability that the Null Hypothesis is true.
The confidence level, once set, provides the Alpha value. It is also called the level of significance because it defines the threshold at which an organisation considers the evidence significant enough to reject the Null Hypothesis.
Step 3 - Based on the test, the Null Hypothesis is either rejected or accepted. The firm doing the investigation would want the new drug to be more effective, so they would want to reject the Null Hypothesis (and accept the Alternative).
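A hedged sketch of the three steps in Python, assuming hypothetical trial data and using a two-sample t-test as the ‘test’ in Step 2:

```python
from scipy import stats

# Step 1 is implicit: H0 = no difference between groups (drug not effective)
control   = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]   # hypothetical outcomes without drug
treatment = [5.9, 6.1, 5.7, 6.0, 5.8, 6.2]   # hypothetical outcomes with drug

alpha = 0.05  # level of significance, from the chosen confidence level

# Step 2: test the evidence against H0
t_stat, p_value = stats.ttest_ind(treatment, control)

# Step 3: reject or accept H0 against the alpha threshold
if p_value < alpha:
    print(f"p={p_value:.4f} < {alpha}: reject H0 - evidence the drug is effective")
else:
    print(f"p={p_value:.4f} >= {alpha}: accept (fail to reject) H0")
```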
example
T-tests (and p-values) – the t-test is used to determine whether there are differences in the mean values between groups and how significant the difference is. It can be seen as a generalisation of Z-scores for smaller sample sizes (typically n < 30) or when the population standard deviation is unknown.
Two-Sample T-Test – used to determine if there is a significant difference between the means of two independent groups and how they are related or different, e.g. understanding the impact of two different drugs. The formula is given as follows: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), where x̄ is each group's mean, s² its variance, and n its sample size.
Paired Sample t-test – This approach allows you to compare the values of paired data, e.g., looking at the impact of a diet on the same group of people by weighing each person before and after. The measured values are available in pairs as you compare the differences for each person to determine if there is a significant difference overall.
t-table
p-value
One-Sample T-Test – similar to calculating Z-Scores, especially where n < 30. This test is usually carried out when comparing the average of a sample of data to a known reference mean value. For example, if a manufacturer of chocolate bars wants to check that the weight of their chocolate bars meets specifications, they would use a T-test to compare the mean of a sample to the known mean value. This mirrors the Z-Score in the previous section; the difference comes from the smaller sample size in t-tests.
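A minimal sketch of the chocolate-bar example, assuming hypothetical weights, a 100 g specification, and scipy's one-sample t-test:

```python
from scipy import stats

weights = [99.1, 100.4, 98.7, 99.8, 100.1, 99.5, 98.9, 99.6]  # sample, n < 30

# Compare the sample mean against the known reference mean of 100 g
t_stat, p_value = stats.ttest_1samp(weights, popmean=100.0)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value would suggest the mean weight differs from the specification
```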
Linear regression – or Line of Best Fit - A statistical method that models the relationship between a dependent variable and one or more independent variables as a linear equation, e.g. modelling the relationship between advertising expenditure and sales revenue
Step 1 – Calculate the average values of X and Y.
Step 2 - Calculate the slope of the line (m), which measures the line's steepness: m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)².
Step 3 – Determine where the line cuts the Y-Axis (the intercept, b): b = ȳ − m·x̄, giving the model y = mx + b.
example
correlation
positive
negative
none
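A minimal sketch of Steps 1-3 plus the correlation coefficient, assuming hypothetical advertising/sales data (r near +1, −1, or 0 indicates positive, negative, or no correlation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # advertising spend (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # sales revenue (hypothetical)

x_bar, y_bar = x.mean(), y.mean()          # Step 1: averages of X and Y
m = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()  # Step 2: slope
b = y_bar - m * x_bar                      # Step 3: Y-axis intercept

r = np.corrcoef(x, y)[0, 1]                # correlation coefficient
print(f"y = {m:.2f}x + {b:.2f}, r = {r:.3f}")
```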
Z-Score - a statistical measurement describing a value's relationship to the mean of a group of values. It is expressed in terms of standard deviations from the mean of a set of data. This allows us to estimate probabilities based on the z-score. The formula is as follows: z = (x − μ) / σ
z = z-score
x = value
μ = mean
σ = Standard deviation
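Applying the formula to a hypothetical exam score:

```python
mu, sigma = 70.0, 8.0    # mean and standard deviation of the class (hypothetical)
x = 86.0                 # one student's score
z = (x - mu) / sigma     # z = 2.0: the score sits 2 standard deviations above the mean
print(z)
```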
example
Exploratory Data Analysis (EDA) - Uncovering patterns, relationships, and trends in the data.
techniques
Scatter plots - A graphical representation of the relationship between two continuous variables, e.g. plotting the relationship between hours of study and exam scores for a group of students (Figure 7).
Box plots - A graphical representation of the distribution of a dataset, showing the median, quartiles, and potential outliers, e.g. creating a box plot to visualise the distribution of employee salaries within different departments.
This allows for another view of the data spread, as the ‘box’ represents 50% of the data values. It also facilitates a comparison between data sets, with a relative comparison between the ‘box’ and the tails for the lowest and highest values in the data set (Figure 5).
Histograms are graphical representations of the distribution of a continuous dataset, divided into bins, e.g., constructing a histogram to display the distribution of ages in a survey.
The raw data is plotted into ‘bins’ where the count of the values within each bin is represented on the Y-Axis, e.g., Bin #1 has 5 values between 44 and 59 (Figure 6).
Quartiles – another approach to describe the spread of data: it involves ordering the data into 4 equal parts. Identifying the Interquartile Range allows for the construction of a box plot (Figure 4).
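A minimal matplotlib sketch producing the three views above from randomly generated (hypothetical) data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 50)                   # hours of study
scores = 40 + 5 * hours + rng.normal(0, 5, 50)   # exam scores
salaries = rng.normal(50_000, 8_000, 200)        # employee salaries

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3.5))
ax1.scatter(hours, scores)                       # scatter plot: two continuous variables
ax1.set_title("Study hours vs exam score")
ax2.boxplot(salaries)                            # box plot: median, quartiles, outliers
ax2.set_title("Salary distribution")
ax3.hist(salaries, bins=10)                      # histogram: counts per bin
ax3.set_title("Salaries (10 bins)")
plt.tight_layout()
plt.show()
```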
Time Series Analysis - Analysing data points collected over a period of time
techniques
Time Series – a sequence of data points that occur in successive order over some time period. A simple example (Figure 25) plots the reported Operating Margin % (Profit as % of Revenue) at the end of each quarter for Hewlett Packard and the divisions within the company. This helps to identify which divisions are seen as performing well and which are not. It is also important to understand which measurement has been selected for trending.
patterns
Seasonality – repeating data patterns at regular intervals, e.g. higher sales at peak seasons
Cycling - a repeating pattern that is not seasonality-driven. Cycles usually occur over a longer time span than seasonality, e.g. booms and busts in an economy
Trending – showing the change through movement upward or downward for part or all of the time series
Variation – unpredictable ups and downs (or irregularity or noise), e.g. share price fluctuations driven by news, market sentiment or unexpected events
moving averages (MA)
Smoothing out fluctuations by calculating the average of neighbouring data points
Helps visualise underlying patterns and make predictions
mean absolute deviation - Quantifies the accuracy of predicted values
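A minimal sketch of a 3-period moving average with the mean absolute deviation (MAD) as the accuracy measure, on hypothetical quarterly data:

```python
import numpy as np

sales = np.array([10, 12, 11, 13, 15, 14, 16, 18])  # hypothetical quarterly data

window = 3
# Each moving-average value is the mean of 3 neighbouring points
ma = np.convolve(sales, np.ones(window) / window, mode="valid")

# Use each moving average as the prediction for the NEXT period
predicted = ma[:-1]
actual = sales[window:]
mad = np.mean(np.abs(actual - predicted))  # mean absolute deviation of the forecasts
print(ma, mad)
```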
Statistical Process Control (SPC) Charts - Identify and understand variations in a process
X-Bar/R Charts
Control limits calculation
Chart 'rules' for deviation detection
Monitor the average and range of data subgroups
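A hedged sketch of the control-limit calculation for subgroups of size 5, assuming the standard SPC constants for n = 5 (A2 = 0.577, D3 = 0, D4 = 2.114) and hypothetical measurements:

```python
import numpy as np

# Hypothetical measurements: 6 subgroups of 5 samples each
data = np.array([[10.1,  9.9, 10.0, 10.2,  9.8],
                 [10.0, 10.3,  9.7, 10.1, 10.0],
                 [ 9.9, 10.1, 10.0,  9.8, 10.2],
                 [10.2, 10.0,  9.9, 10.1,  9.7],
                 [10.0,  9.8, 10.1, 10.0, 10.3],
                 [ 9.9, 10.2, 10.0,  9.9, 10.1]])

A2, D3, D4 = 0.577, 0.0, 2.114                        # constants for subgroup size 5
x_bar_bar = data.mean(axis=1).mean()                  # grand average of subgroup means
r_bar = (data.max(axis=1) - data.min(axis=1)).mean()  # average subgroup range

print("X-bar limits:", x_bar_bar - A2 * r_bar, "to", x_bar_bar + A2 * r_bar)
print("R limits:    ", D3 * r_bar, "to", D4 * r_bar)
```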
Other charts
Np-chart
U-chart
P-chart
C-chart
Process Capability - Analysis of process performance in relation to target specifications
Cpk (Process Capability Index)
Includes Mean or Average
Sigma Level, Yield %, and Defects per Million Opportunities (DPMO)
Interpretation of capability levels
Illustration of sigma levels and spread of data
Cp (Process Capability)
Higher Cp indicates higher process capability
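A minimal sketch of the Cp and Cpk calculations, assuming hypothetical specification limits and normally distributed data; Cp compares the specification width to 6σ, while Cpk also penalises an off-centre mean:

```python
import numpy as np

data = np.random.default_rng(2).normal(loc=10.05, scale=0.1, size=500)
lsl, usl = 9.7, 10.3    # hypothetical lower/upper specification limits

mu, sigma = data.mean(), data.std(ddof=1)
cp = (usl - lsl) / (6 * sigma)                                 # ignores centring
cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))  # accounts for centring
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")  # higher values indicate a more capable process
```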
Descriptive Statistics – Summarising and describing key features of a dataset.
Illustration example - The following data set has been provided in a supporting Excel file to help illustrate the calculations.
techniques
Median - The middle value of a dataset when arranged in ascending or descending order, e.g. finding the median income in a survey of households
Range - The difference between the maximum and minimum values in a dataset, e.g. determining the range of temperatures recorded over a week
Mean - The average of a set of values, calculated by summing all values and dividing by the number of observations, e.g. calculating the mean of exam scores for a class of students
Standard deviation - A measure to quantify the dispersion or spread of a set of values, e.g. calculating the standard deviation of monthly sales to assess variability, the variation in the number of days customers take to pay, the variation in the weight of food dispensed into packaging, etc
The calculation of the variable SIGMA (σ) provides the means to describe the spread in the data – ‘one standard deviation’ – the calculated value represents how much the data deviates from its mean.
Once calculated, the Sigma (σ) value can show the spread (Figure 2). The following formula calculates it: σ = √( Σ(xᵢ − μ)² / N )
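A minimal sketch of the sigma calculation on a small hypothetical data set (the supporting Excel file is not reproduced here):

```python
import math

data = [42, 45, 44, 48, 41, 46]
mu = sum(data) / len(data)                                       # mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))  # population sigma
print(f"mean = {mu:.2f}, one standard deviation = {sigma:.2f}")
```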
Error Bars - are graphical representations of the variability or uncertainty associated with a data point or group of data points (Figure 3). They typically extend vertically or horizontally from the mean or median of the data. They can indicate various measures of variability, such as standard deviation, confidence intervals, or range.
Example – based on data provided in the Excel file, we use the standard deviation as the variation or ‘error’ amount for each data point in the graph. Visually, you can quickly see where there are larger and smaller variation points in the data: shorter bars indicate values with less variability, and longer bars indicate more. In this example, the organisation should investigate the longer bars, as more products may be outside the specifications for this process. Shorter bars indicate times when the process runs better, with less variability; understanding how to replicate this would also be useful.
Errors in Data Analytics
Discrepancies between observed data and true underlying phenomena or expected and actual results.
types of errors
type I
Significant effect detected when no real effect is present.
Example: Declaring a drug effective when it has no therapeutic benefit.
Also known as a 'false positive'.
type II
Significant effect not detected when a real effect is present.
Example: Deeming a drug ineffective due to insufficient statistical data.
Also known as a 'false negative'.
type III
Mistakes in specifying the wrong hypothesis or model structure, which lead to incorrect conclusions.
Example: Selecting the wrong variables in an experiment.
calculating and quantifying errors
Mean Square Error (MSE) - Average of the squares of the errors between predicted and actual values. The Root Mean Square Error (RMSE) is derived by taking the square root of the MSE.
Mean Absolute Percent Error (MAPE) - Average percentage difference between predicted and actual values, expressed as a percentage of the actual values. Provides a percentage-based measure of accuracy.
Mean Absolute Deviation (MAD) - Average absolute difference between each data point and the mean
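A minimal sketch of the three measures, assuming hypothetical predicted and actual values:

```python
import numpy as np

actual    = np.array([100.0, 110.0, 95.0, 120.0, 105.0])
predicted = np.array([ 98.0, 113.0, 97.0, 115.0, 107.0])

errors = actual - predicted
mse  = np.mean(errors ** 2)                     # Mean Square Error
rmse = np.sqrt(mse)                             # Root Mean Square Error
mape = np.mean(np.abs(errors / actual)) * 100   # Mean Absolute Percent Error
mad  = np.mean(np.abs(actual - actual.mean()))  # Mean Absolute Deviation of the data
print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAPE={mape:.1f}% MAD={mad:.2f}")
```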
reasons
Modelling assumptions
Algorithmic biases
Measurement inaccuracies
Human mistakes
Data collection issues