Statistics for Business
Descriptive Statistics
(1.1 p8-10)
Provide methods of describing a set of data in a convenient and informative way
Inferential Statistics
(1.1 p15-20)
Provide methods for drawing conclusions about a larger group (the population) based on data from a small sample of that group
Sources of data
(1.1 p8-10)
Primary data
Direct methods of data collection involve collecting new data for a specific study
Examples: surveys, interviews, experiments
Secondary data
Indirect methods of data collection involve sourcing and accessing existing data that were not originally collected for the purpose of the study
Examples: customer records, online transactions
2 forms of inferential statistics
Estimation
Hypothesis testing
Data types
(1.1 p29-31)
Categorical
Nominal: no specific ordering
Examples: True, false; Chinese, Malay, Tamil
Ordinal: strict ordering
Examples: Agree, neutral, disagree; Good, better, best
Numerical
Interval: no true zero
Examples: Date, temperature
Ratio: with a true zero
Examples: Distance, weights
Measures of
Central Tendency
Variability or Spread
Shape
Mean
(1.2 p8-10)
Average of all values
Median
(1.2 p11-13)
Middle value of the sorted data
Mode
(1.2 p14-15)
Most frequently occurring value
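The three measures of central tendency can be compared on one small data set. The sketch below uses Python's standard `statistics` module; the `sales` figures are made up for illustration and include one deliberate outlier to show how the mean and median react differently.

```python
from statistics import mean, median, multimode

# Hypothetical monthly sales figures (illustrative data, not from the course)
sales = [12, 15, 15, 18, 20, 22, 95]

avg = mean(sales)         # arithmetic average, pulled upward by the outlier 95
mid = median(sales)       # middle value once the data are sorted
modes = multimode(sales)  # list of the most frequently occurring value(s)

print("mean:", avg)
print("median:", mid)
print("mode(s):", modes)
```

Here the mean lands well above most of the observations, while the median stays close to the typical value.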
Range
(1.2 p21)
Total spread of values (maximum minus minimum)
Variance
(1.2 p22)
How much the data deviates from the mean value (average squared deviation)
Standard Deviation
(1.2 p23)
Square root of the variance
Coefficient of Variation
(1.2 p25-26)
Ratio of the standard deviation to the mean
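The four measures of spread can be sketched as follows. This example assumes the population formulas (dividing by n); the sample versions `statistics.variance` and `statistics.stdev` divide by n - 1 instead. The `values` list is hypothetical.

```python
from statistics import mean, pvariance, pstdev

# Hypothetical observations (illustrative only)
values = [4, 8, 6, 5, 7]

value_range = max(values) - min(values)  # total spread: maximum minus minimum
variance = pvariance(values)             # average squared deviation from the mean
std_dev = pstdev(values)                 # square root of the variance
coeff_var = std_dev / mean(values)       # standard deviation relative to the mean

print("range:", value_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 4))
print("coefficient of variation:", round(coeff_var, 4))
```

Because the coefficient of variation is unit-free, it can be used to compare the spread of data sets measured on different scales.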
Skewness
Degree of distortion from the symmetrical bell curve
Types
(1.2 p30)
Positive skew
Negative skew
No skew
Interpretation
(1.2 p31)
Highly skewed
skewness < -1 or skewness > +1
Moderately skewed
-1 < skewness < -0.5
+0.5 < skewness < +1
Approximately symmetric
-0.5 < skewness < +0.5
Kurtosis
(1.2 p33-35)
Measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
High Kurtosis
Data has heavy tails or outliers
Low Kurtosis
Data has light tails or a lack of outliers
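A minimal sketch of how skewness and kurtosis could be computed and interpreted. It assumes the simple moment-based (population) formulas, and it reports excess kurtosis (kurtosis minus 3) so that a normal distribution scores 0; spreadsheet and statistics packages often apply small-sample corrections, so their figures may differ slightly. The function names and data are illustrative.

```python
from statistics import mean

def central_moment(data, k):
    """k-th central moment: average of (x - mean)^k."""
    m = mean(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    # Assumes the data are not all identical (variance must be non-zero)
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Subtracting 3 makes the normal distribution the zero reference point
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

def describe_skew(s):
    """Apply the rule-of-thumb bands: beyond 1, between 0.5 and 1, within 0.5."""
    if abs(s) > 1:
        return "highly skewed"
    if abs(s) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

symmetric = [1, 2, 3, 4, 5]      # hypothetical, evenly spread
right_skewed = [1, 1, 1, 2, 10]  # hypothetical, long right tail

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    s = skewness(data)
    print(name, round(s, 3), describe_skew(s), round(excess_kurtosis(data), 3))
```

A positive skewness value indicates a longer right tail and a negative value a longer left tail.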
Normal distribution
(1.2 p24)
Given the mean and SD, we can find the probability that x falls within a given range.
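As an illustration, Python's standard `statistics.NormalDist` turns a mean and standard deviation into probabilities. The mean of 100 and SD of 15 below are assumed values chosen only for the example.

```python
from statistics import NormalDist

# Hypothetical test scores: mean 100, standard deviation 15
scores = NormalDist(mu=100, sigma=15)

# Probability that x is at most 115 (one SD above the mean)
p_below_115 = scores.cdf(115)

# Probability that x lies within one SD of the mean (85 to 115)
p_within_one_sd = scores.cdf(115) - scores.cdf(85)

print(round(p_below_115, 4))
print(round(p_within_one_sd, 4))
```

The second figure is the familiar result that roughly 68% of a normal distribution lies within one standard deviation of the mean.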
Data Preparation
(1.3 p4-10)
Data cleaning
(1.3 p7)
Data transformation
(1.3 p8)
Data construction
Data integration
Data reduction
Data standardization (z-score)
Data normalization (min-max scaling)
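The two rescaling methods can be sketched as below. This assumes the population standard deviation for the z-score and a target range of 0 to 1 for min-max scaling; the helper names `standardize` and `normalize` and the `incomes` data are illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    # Assumes the values are not all identical (s must be non-zero)
    return [(x - m) / s for x in values]

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # Assumes at least two distinct values (hi - lo must be non-zero)
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical monthly incomes (illustrative only)
incomes = [2000, 3000, 4000, 5000, 6000]

z_scores = standardize(incomes)
scaled = normalize(incomes)

print([round(z, 3) for z in z_scores])
print(scaled)
```

After standardization the values have mean 0 and standard deviation 1; after normalization they all lie between 0 and 1.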
Data Visualization
(2.1 p32)
Summary reports
Histograms
(2.1 p14-16)
Bar Charts
(2.1 p17-20)
Pie Charts
(2.1 p21-23)
Line Plots
(2.1 p24-26)
Scatter Plots
(2.1 p27-30)
Combo Charts
(2.1 p31)
Analysis Methods
Simple Regression Analysis
(2.3 p3-12)
Estimating the relationship between a dependent variable and one or more independent variables (simple regression uses exactly one independent variable)
Goodness of Fit Measure (R-squared)
The closer R-squared is to 1, the better the model fit
As a rule of thumb, an R-squared greater than 0.7 indicates a good model fit
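A minimal sketch of simple regression and R-squared, assuming an ordinary least squares fit of a straight line y = intercept + slope * x. The function names and the `ad_spend` and `revenue` figures are hypothetical.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Ordinary least squares for one independent variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Share of the variation in y that the fitted line explains."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: advertising spend (x) and revenue (y)
ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]

slope, intercept = fit_simple_regression(ad_spend, revenue)
r2 = r_squared(ad_spend, revenue, slope, intercept)

print("slope:", round(slope, 3))
print("intercept:", round(intercept, 3))
print("R-squared:", round(r2, 3))
```

With this data R-squared comes out at 0.6, which falls short of the 0.7 rule of thumb.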
Correlation Analysis
(2.2 p5-17)
Measures the association between two sets of interval scaled or ratio scaled variables
Correlation does not imply causation
Correlation coefficient
Between -1 and 1
Positive sign - direct or positive correlation
Negative sign - inverse or negative correlation
Zero - no correlation
Usually, for the correlation to be considered significant, the coefficient must be 0.5 or above in absolute value (that is, in either direction)
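The correlation coefficient can be computed directly from its definition. The sketch below assumes the Pearson coefficient, which suits interval or ratio scaled variables; the function name and both data sets are illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: covariation of x and y scaled into [-1, 1]."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a positive but imperfect relationship
ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]
r_positive = pearson_r(ad_spend, revenue)

# Hypothetical data: demand falls by a fixed amount as price rises
price = [1, 2, 3, 4, 5]
demand = [10, 8, 6, 4, 2]
r_negative = pearson_r(price, demand)

print(round(r_positive, 4))
print(round(r_negative, 4))
```

The second pair lies exactly on a downward-sloping line, so its coefficient is -1, a perfect inverse correlation.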
Decision Trees
(3.1 p6-31)
Decision support tool that uses a tree-like model of decisions and their possible consequences
(3.1 p10)
Involve a model-building process
Splitting data: the best split is the one that produces the highest node purity
(3.1 p16-29)
Tree pruning: cutting back the tree to reduce overfitting
(3.1 p30)
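To make node purity concrete, the sketch below scores two candidate splits. It assumes Gini impurity as the purity measure (0 means a perfectly pure node); the notes do not say which measure the course uses, and entropy is a common alternative. The labels and splits are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 0 when every label in the node is the same."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical parent node: 5 customers who buy and 5 who do not
parent = ["buy"] * 5 + ["no"] * 5

# Candidate split A separates the classes fairly well
split_a = (["buy"] * 4 + ["no"], ["buy"] + ["no"] * 4)
# Candidate split B leaves both children nearly as mixed as the parent
split_b = (["buy"] * 3 + ["no"] * 2, ["buy"] * 2 + ["no"] * 3)

impurity_a = weighted_gini(*split_a)
impurity_b = weighted_gini(*split_b)

print("parent:", gini(parent))
print("split A:", round(impurity_a, 2))
print("split B:", round(impurity_b, 2))
```

Split A has the lower weighted impurity, so it yields the purer child nodes and would be preferred.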
Cluster Analysis
(3.2 p4-28)
Multivariate exploratory technique that uncovers natural patterns in data
Grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups
Types
Hierarchical Clustering: agglomerative, divisive
Partitional Clustering
Hard Clustering: K-means
(3.2 p22-26)
Soft Clustering: Fuzzy C-means
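A minimal sketch of the K-means idea: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. It assumes Euclidean distance, a fixed number of iterations, and starting centroids supplied by the caller (real tools usually pick them randomly); the function name and the customer data are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by two scaled features
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

centroids, clusters = k_means(customers, centroids=[(1, 1), (8, 8)])

print(centroids)
print(clusters)
```

Each customer ends up in exactly one cluster, which is what makes K-means a hard clustering method.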
Advantages of decision trees
(3.1 p31)
Can be used to identify critical factors
Classification
Grouping observations into known categories
Supervised learning
Clustering
Grouping observations into unknown categories
Unsupervised learning
Interpreting Output
(3.2 p15-20)
Explain in practical terms
Look at distinguishing characteristics
Look at cluster quality
(3.2 p16-20)
Silhouette Score
Measures the goodness of a clustering result
Ranges from -1 to 1
More than 0.4 is considered acceptable
Look for clusters of anomalies or outliers
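The silhouette score can be sketched from its usual definition: for each point, compare the average distance to its own cluster (a) with the average distance to the nearest other cluster (b), then average (b - a) / max(a, b) over all points. The example assumes Euclidean distance, at least two non-empty clusters, and a score of 0 for single-point clusters; the function name and data are illustrative.

```python
from math import dist
from statistics import mean

def silhouette_score(clusters):
    """Mean silhouette over all points; needs at least two non-empty clusters."""
    scores = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for single-point clusters
                continue
            # a: average distance to the other points in the same cluster
            a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: average distance to the points of the nearest other cluster
            b = min(
                mean(dist(p, q) for q in other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical clusterings of the same six points
well_separated = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
badly_mixed = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (9, 8), (8, 9)]]

good_score = silhouette_score(well_separated)
bad_score = silhouette_score(badly_mixed)

print(round(good_score, 3))
print(round(bad_score, 3))
```

The well-separated grouping scores far above the 0.4 threshold, while the mixed grouping scores much lower.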
Mean: best used when distribution is symmetric
Median: best used when distribution is skewed
Mode: best used for categorical data