Quantitative Data Analysis Techniques
Cluster Analysis - Grouping similar data points together based on specific characteristics, providing a means to describe a situation.
techniques
Hierarchical clustering – This approach represents a clustering algorithm that organises data points into a hierarchy of clusters, e.g. by creating a dendrogram (Figure 9) to illustrate relationships between different product categories. Its aim is to find similarities between groups. This builds on K-Means clustering by using the DISTANCE between each CHARACTERISTIC VALUE (x, y).
2 approaches
Top-down – working through the data, dividing it into clusters, and continuing to sub-divide. This approach is referred to as Divisive Hierarchical Clustering.
Bottom-up – the approach considers each data point individually and then merges clusters at successive levels until one final cluster is established. This approach is referred to as Agglomerative Hierarchical Clustering.
further example
As a simple example, working from the bottom up (Figure 9), the data analyst would look to establish clusters for the animals based on their similarities in multiple steps. The key activity is identifying the characteristics of interest and determining how ‘close’ the data points are to each other, e.g., while birds and mammals are different, creating a cluster based on ‘vertebrates’ represents the next cluster level.
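As an illustration (not part of the original diagram), the sketch below shows how agglomerative (bottom-up) clustering could be run in Python; the 2-D points, the labels, and the use of scipy/matplotlib are assumptions for demonstration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical characteristic values (x, y) for six items
points = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1],
                   [5.2, 5.8], [9.0, 1.0], [9.3, 1.2]])

# 'ward' merges, at each step, the pair of clusters that least increases
# total within-cluster variance; each row of Z records one merge
Z = linkage(points, method="ward")

# The dendrogram shows the hierarchy of merges from single points to one cluster
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("Merge distance")
plt.show()
```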
K-means clustering - A clustering approach that groups similar data points into clusters, e.g. segmenting firms into groups based on time in business and their annual purchasing behaviour
Distance within clustering – once the Centroid is calculated, the ‘distance’ between each value and the Centroid can be established. This is the critical step in clustering. The concept of distance provides a ‘measure of similarity’ or ‘dissimilarity’, i.e., a small value implies similarity between items, while larger values are deemed dissimilar. From here, grouping into appropriate clusters can be formed.
In the example provided, the average distance of the 6 data points from the Centroid (14.67 years, €47.6m) is 1.67 years and €1.31m, respectively. The more similar the items in the cluster, the smaller these values are; the analyst's goal is to define clusters by identifying smaller distances.
Applying this calculation allows the data set to be characterised, e.g. Cluster 1 in Figure 7 can be described as the Industry Leaders, Cluster 2 as Mid-Growth, and Cluster 3 as New Businesses.
steps
Step 1 – Determine, based on the data, the characteristics to cluster on (e.g. Years in Business and Annual Sales) and the number of clusters, ‘K’, which is selected in advance of the analysis.
Step 2 – Create a scatter graph from which you can visually assign clusters (e.g. Figure 8).
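To make the centroid/distance idea concrete, here is a minimal K-means sketch; the firm data (years in business, annual sales in €m) is hypothetical and the implementation is a simplified version of the algorithm (no handling of empty clusters), not the one behind the original figures.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # assign each point to its nearest centroid
        # Recompute each centroid as the mean of its assigned points
        new = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Hypothetical firms: (years in business, annual sales in €m)
firms = np.array([[2, 5], [3, 7], [14, 46], [15, 49], [30, 90], [32, 95]])
labels, centroids = kmeans(firms, k=3)
print(labels, centroids)
```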
Inferential Statistics - Drawing conclusions and making predictions about a population based on a sample
techniques
Confidence intervals (CI) – refer to the probability (or chance) that the result of your analysis will fall within a certain interval, and hence the confidence that what you calculate is right, e.g. a 95% Confidence Interval means that you have a 5% chance of being wrong with your estimate and a 95% chance of being right.
steps to calculate
Step 1 – Calculate the data set's mean (μ) or average.
Step 2 – Find the standard deviation (σ) of the data.
Step 3 – Find the Alpha value (α), also called the ‘level of significance’, by calculating α = 1 − confidence level (e.g. α = 0.05 for a 95% confidence level).
Step 4 – Based on the question being investigated, calculate the probability to look up: 1 − α for a one-tail test or 1 − α/2 for a two-tail test (see the 2 question types below).
Step 5 – Find the corresponding Z-value from the Z-Tables for the probability calculated in Step 4.
Step 6 – Calculate the confidence interval based on the following formula: CI = μ ± z × (σ / √n), with the variables defined as follows:
CI = Confidence Interval
μ = mean value of data
z = corresponding Z-value for the Alpha value (α)
σ = Standard Deviation
n = number of data points
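A minimal Python sketch of Steps 1-6, assuming a hypothetical sample and using scipy in place of a printed Z-table:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2])  # hypothetical sample

mu = data.mean()                    # Step 1: mean
sigma = data.std(ddof=1)            # Step 2: (sample) standard deviation
alpha = 1 - 0.95                    # Step 3: level of significance
z = stats.norm.ppf(1 - alpha / 2)   # Steps 4-5: two-tail Z-value (~1.96)
margin = z * sigma / np.sqrt(len(data))  # Step 6: CI = mu ± z·sigma/√n

print(f"CI = {mu:.2f} ± {margin:.2f}")
```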
2 question types
Question 1 – where the question focuses on the values that fall between 2 values – called a two-tailed test, e.g. calculating the range, with a 90% confidence interval, of the % who would vote for the current government.
Question 2 – when looking for a value that is either greater or less than a specific value – called a one-tailed test, e.g. calculating a 95% confidence interval for the average response time of a call centre.
example
Hypothesis testing - A statistical method to assess a statement (or theory) whose truth has yet to be proven, e.g. testing whether a new drug is effective.
3 steps
Step 1 - Define 2 hypotheses that are effectively opposite, i.e.,
H0 – the Null Hypothesis – the drug is NOT EFFECTIVE
HA – the Alternative Hypothesis – the drug IS EFFECTIVE
Step 2 - A ‘test’ is carried out on the sample data to assess the evidence against the NULL HYPOTHESIS, i.e. analysing a collected sample data set and determining the probability that the Null Hypothesis is true.
The confidence level, once set, provides the Alpha value. It is also called the level of significance because it defines the threshold at which an organisation considers the evidence significant enough to reject the Null Hypothesis.
Step 3 - Based on the test, the Null Hypothesis is either rejected or accepted. The firm doing the investigation would want the new drug to be more effective, so they would want to reject the Null Hypothesis (and accept the Alternative).
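A hedged sketch of the three steps in Python, assuming hypothetical trial data and using a two-sample t-test as the ‘test’ in Step 2:

```python
from scipy import stats

# Step 1 is implicit: H0 = no difference between groups (drug not effective)
control   = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]   # hypothetical outcomes without drug
treatment = [5.9, 6.1, 5.7, 6.0, 5.8, 6.2]   # hypothetical outcomes with drug

alpha = 0.05  # level of significance, from the chosen confidence level

# Step 2: test the evidence against H0
t_stat, p_value = stats.ttest_ind(treatment, control)

# Step 3: reject or accept H0 against the alpha threshold
if p_value < alpha:
    print(f"p={p_value:.4f} < {alpha}: reject H0 - evidence the drug is effective")
else:
    print(f"p={p_value:.4f} >= {alpha}: accept (fail to reject) H0")
```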
example
T-tests (and p-values) – the t-test is used to determine whether there are differences in the mean values between groups and how significant the difference is. It can be seen as a generalisation of Z-scores for smaller sample sizes (typically n < 30) or when the population standard deviation is unknown.
Two-Sample T-Test – used to determine if there is a significant difference between the means of two independent groups and how they are related or different, e.g. understanding the impact of two different drugs. The formula is given as follows: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), where x̄ is each group's mean, s² its variance, and n its sample size.
Paired Sample t-test – This approach allows you to compare the values of paired data, e.g., looking at the impact of a diet on the same group of people by weighing each person before and after. The measured values are available in pairs as you compare the differences for each person to determine if there is a significant difference overall.
t-table
p-value
One-Sample T-Test – similar to calculating Z-Scores, especially where n < 30. This test is usually carried out when comparing the average of a sample of data to a known reference mean value. For example, if a manufacturer of chocolate bars wants to check that the weight of their chocolate bars meets specifications, they would use a T-test to compare the mean of a sample to the known mean value. This mirrors the Z-Score in the previous section; the difference comes from the smaller sample size in t-tests.
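A minimal sketch of the chocolate-bar example, assuming hypothetical weights, a 100 g specification, and scipy's one-sample t-test:

```python
from scipy import stats

weights = [99.1, 100.4, 98.7, 99.8, 100.1, 99.5, 98.9, 99.6]  # sample, n < 30

# Compare the sample mean against the known reference mean of 100 g
t_stat, p_value = stats.ttest_1samp(weights, popmean=100.0)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value would suggest the mean weight differs from the specification
```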
Linear regression – or Line of Best Fit - A statistical method that models the relationship between a dependent variable and one or more independent variables as a linear equation, e.g. modelling the relationship between advertising expenditure and sales revenue
Step 1 – Calculate the average values of X and Y.
Step 2 - Calculate the slope of the line (m), which measures the line's steepness: m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)².
Step 3 – Determine where the line cuts the Y-Axis (the intercept, b): b = ȳ − m·x̄, giving the model y = mx + b.
example
correlation
positive
negative
none
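A minimal sketch of Steps 1-3 plus the correlation coefficient, assuming hypothetical advertising/sales data (r near +1, −1, or 0 indicates positive, negative, or no correlation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # advertising spend (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # sales revenue (hypothetical)

x_bar, y_bar = x.mean(), y.mean()          # Step 1: averages of X and Y
m = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()  # Step 2: slope
b = y_bar - m * x_bar                      # Step 3: Y-axis intercept

r = np.corrcoef(x, y)[0, 1]                # correlation coefficient
print(f"y = {m:.2f}x + {b:.2f}, r = {r:.3f}")
```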
Z-Score - a statistical measurement describing a value's relationship to the mean of a group of values. It is expressed in terms of standard deviations from the mean of a set of data. This allows us to estimate probabilities based on the z-score. The formula is as follows: z = (x − μ) / σ
z = z-score
x = value
μ = mean
σ = Standard deviation
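Applying the formula to a hypothetical exam score:

```python
mu, sigma = 70.0, 8.0    # mean and standard deviation of the class (hypothetical)
x = 86.0                 # one student's score
z = (x - mu) / sigma     # z = 2.0: the score sits 2 standard deviations above the mean
print(z)
```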
example
Exploratory Data Analysis (EDA) - Uncovering patterns, relationships, and trends in the data.
techniques
Scatter plots - A graphical representation of the relationship between two continuous variables, e.g. plotting the relationship between hours of study and exam scores for a group of students (Figure 7).
Box plots - A graphical representation of the distribution of a dataset, showing the median, quartiles, and potential outliers, e.g. creating a box plot to visualise the distribution of employee salaries within different departments.
This allows for another view of the data spread, as the ‘box’ represents 50% of the data values. It also facilitates a comparison between data sets, with a relative comparison between the ‘box’ and the tails for the lowest and highest values in the data set (Figure 5).
Histograms are graphical representations of the distribution of a continuous dataset, divided into bins, e.g., constructing a histogram to display the distribution of ages in a survey.
The raw data is plotted into ‘bins’ where the count of the values within each bin is represented on the Y-Axis, e.g., Bin #1 has 5 values between 44 and 59 (Figure 6).
Quartiles – another approach to describe the spread of data: it involves ordering the data into 4 equal parts. Identifying the Interquartile Range allows for the construction of a box plot (Figure 4).
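A minimal matplotlib sketch producing the three views above from randomly generated (hypothetical) data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 50)                   # hours of study
scores = 40 + 5 * hours + rng.normal(0, 5, 50)   # exam scores
salaries = rng.normal(50_000, 8_000, 200)        # employee salaries

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3.5))
ax1.scatter(hours, scores)                       # scatter plot: two continuous variables
ax1.set_title("Study hours vs exam score")
ax2.boxplot(salaries)                            # box plot: median, quartiles, outliers
ax2.set_title("Salary distribution")
ax3.hist(salaries, bins=10)                      # histogram: counts per bin
ax3.set_title("Salaries (10 bins)")
plt.tight_layout()
plt.show()
```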
Time Series Analysis - Analysing data points collected over a period of time
techniques
Time Series – a sequence of data points that occur in successive order over some time period. A simple example (Figure 25) plots the reported Operating Margin % (Profit as % of Revenue) at the end of each quarter for Hewlett Packard and the divisions within the company. This helps to identify which divisions are seen as performing well and which are not. It is also important to understand which measurement has been selected for trending.
patterns
Seasonality – repeating data patterns at regular intervals, e.g. higher sales at peak seasons
Cycling - a repeating pattern that is not seasonality-driven. Cycles usually occur over a longer time span than seasonality, e.g. booms and busts in an economy
Trending – showing the change through movement upward or downward for part or all of the time series
Variation – unpredictable ups and downs (or irregularity or noise), e.g. share price fluctuations driven by news, market sentiment or unexpected events
moving averages (MA)
Smoothing out fluctuations by calculating the average of neighbouring data points
Helps visualise underlying patterns and make predictions
mean absolute deviation - Quantifies the accuracy of predicted values
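A minimal sketch of a 3-period moving average with the mean absolute deviation (MAD) as the accuracy measure, on hypothetical quarterly data:

```python
import numpy as np

sales = np.array([10, 12, 11, 13, 15, 14, 16, 18])  # hypothetical quarterly data

window = 3
# Each moving-average value is the mean of 3 neighbouring points
ma = np.convolve(sales, np.ones(window) / window, mode="valid")

# Use each moving average as the prediction for the NEXT period
predicted = ma[:-1]
actual = sales[window:]
mad = np.mean(np.abs(actual - predicted))  # mean absolute deviation of the forecasts
print(ma, mad)
```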
Statistical Process Control (SPC) Charts - Identify and understand variations in a process
X-Bar/R Charts
Control limits calculation
Chart 'rules' for deviation detection
Monitor the average and range of data subgroups
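A hedged sketch of the control-limit calculation for subgroups of size 5, assuming the standard SPC constants for n = 5 (A2 = 0.577, D3 = 0, D4 = 2.114) and hypothetical measurements:

```python
import numpy as np

# Hypothetical measurements: 6 subgroups of 5 samples each
data = np.array([[10.1,  9.9, 10.0, 10.2,  9.8],
                 [10.0, 10.3,  9.7, 10.1, 10.0],
                 [ 9.9, 10.1, 10.0,  9.8, 10.2],
                 [10.2, 10.0,  9.9, 10.1,  9.7],
                 [10.0,  9.8, 10.1, 10.0, 10.3],
                 [ 9.9, 10.2, 10.0,  9.9, 10.1]])

A2, D3, D4 = 0.577, 0.0, 2.114                        # constants for subgroup size 5
x_bar_bar = data.mean(axis=1).mean()                  # grand average of subgroup means
r_bar = (data.max(axis=1) - data.min(axis=1)).mean()  # average subgroup range

print("X-bar limits:", x_bar_bar - A2 * r_bar, "to", x_bar_bar + A2 * r_bar)
print("R limits:    ", D3 * r_bar, "to", D4 * r_bar)
```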
Other charts
Np-chart
U-chart
P-chart
C-chart
Process Capability - Analysis of process performance in relation to target specifications
Cpk (Process Capability Index)
Includes Mean or Average
Sigma Level, Yield %, and Defects per Million Opportunities (DPMO)
Interpretation of capability levels
Illustration of sigma levels and spread of data
Cp (Process Capability)
Higher Cp indicates higher process capability
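A minimal sketch of the Cp and Cpk calculations, assuming hypothetical specification limits and normally distributed data; Cp compares the specification width to 6σ, while Cpk also penalises an off-centre mean:

```python
import numpy as np

data = np.random.default_rng(2).normal(loc=10.05, scale=0.1, size=500)
lsl, usl = 9.7, 10.3    # hypothetical lower/upper specification limits

mu, sigma = data.mean(), data.std(ddof=1)
cp = (usl - lsl) / (6 * sigma)                                 # ignores centring
cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))  # accounts for centring
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")  # higher values indicate a more capable process
```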
Descriptive Statistics – Summarising and describing key features of a dataset.
Illustration example - The following data set has been provided in a supporting Excel file to help illustrate the calculations.
techniques
Median - The middle value of a dataset when arranged in ascending or descending order, e.g. finding the median income in a survey of households
Range - The difference between the maximum and minimum values in a dataset, e.g. determining the range of temperatures recorded over a week
Mean - The average of a set of values, calculated by summing all values and dividing by the number of observations, e.g. calculating the mean of exam scores for a class of students
Standard deviation - A measure to quantify the dispersion or spread of a set of values, e.g. calculating the standard deviation of monthly sales to assess variability, the variation in the number of days customers take to pay, the variation in the weight of food dispensed into packaging, etc
The calculation of the variable SIGMA (σ) provides the means to describe the spread in the data – ‘one standard deviation’ – the calculated value represents how much the data deviates from its mean.
Once calculated, the Sigma (σ) value can show the spread (Figure 2). The following formula calculates it: σ = √( Σ(xᵢ − μ)² / N )
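A minimal sketch of the sigma calculation on a small hypothetical data set (the supporting Excel file is not reproduced here):

```python
import math

data = [42, 45, 44, 48, 41, 46]
mu = sum(data) / len(data)                                       # mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))  # population sigma
print(f"mean = {mu:.2f}, one standard deviation = {sigma:.2f}")
```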
Error Bars - are graphical representations of the variability or uncertainty associated with a data point or group of data points (Figure 3). They typically extend vertically or horizontally from the mean or median of the data. They can indicate various measures of variability, such as standard deviation, confidence intervals, or range.
Example – based on data provided in the Excel file, we use the standard deviation as the variation or ‘error’ amount for each data point in the graph. Visually, you can quickly see where there are larger and smaller variation points in the data: shorter bars indicate values with less variability, and longer bars indicate more. In this example, the organisation should investigate the longer bars, as more products may be outside the specifications for this process. Shorter bars indicate times when the process runs better, with less variability; understanding how to replicate this would also be useful.
Errors in Data Analytics
Discrepancies between observed data and true underlying phenomena or expected and actual results.
types of errors
type I
Significant effect detected when no real effect is present.
Example: Declaring a drug effective when it has no therapeutic benefit.
Also known as a 'false positive'.
type II
Significant effect not detected when a real effect is present.
Example: Deeming a drug ineffective due to insufficient statistical data.
Also known as a 'false negative'.
type III
Mistakes in specifying the wrong hypothesis or model structure, which lead to incorrect conclusions.
Example: Selecting the wrong variables in an experiment.
calculating and quantifying errors
Mean Square Error (MSE) - Average of the squares of the errors between predicted and actual values. The Root Mean Square Error (RMSE) is derived by taking the square root of the MSE.
Mean Absolute Percent Error (MAPE) - Average percentage difference between predicted and actual values, expressed as a percentage of the actual values. Provides a percentage-based measure of accuracy.
Mean Absolute Deviation (MAD) - Average absolute difference between each data point and the mean
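A minimal sketch of the three measures, assuming hypothetical predicted and actual values:

```python
import numpy as np

actual    = np.array([100.0, 110.0, 95.0, 120.0, 105.0])
predicted = np.array([ 98.0, 113.0, 97.0, 115.0, 107.0])

errors = actual - predicted
mse  = np.mean(errors ** 2)                     # Mean Square Error
rmse = np.sqrt(mse)                             # Root Mean Square Error
mape = np.mean(np.abs(errors / actual)) * 100   # Mean Absolute Percent Error
mad  = np.mean(np.abs(actual - actual.mean()))  # Mean Absolute Deviation of the data
print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAPE={mape:.1f}% MAD={mad:.2f}")
```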
reasons
Modelling assumptions
Algorithmic biases
Measurement inaccuracies
Human mistakes
Data collection issues