Statistics for Business

Descriptive Statistics
(1.1 p8-10)

Inferential Statistics
(1.1 p15-20)

Descriptive statistics provide methods for describing a set of data in a convenient and informative way

Inferential statistics provide methods for drawing conclusions about a larger group (the population) based on data from a smaller sample

Sources of data
(1.1 p8-10)

Primary data

Direct methods of data collection involve collecting new data for a specific study

Secondary data

Indirect methods of data collection involve sourcing and accessing existing data that were not originally collected for the purpose of the study

Examples (primary data)

Surveys

Interviews

Experiments

Examples (secondary data)

Customer records

Online transactions

Two forms of inferential statistics

Estimation

Hypothesis testing

Data types
(1.1 p29-31)

Categorical

Numerical

Nominal

Ordinal

Interval

Ratio

No true zero (interval)

With a true zero (ratio)

Strict ordering (ordinal)

No specific ordering (nominal)

Examples (ratio)

Distance

Weights

Examples (interval)

Date

Temperature

Examples (ordinal)

Agree, neutral, disagree

Good, better, best

Examples (nominal)

True, false

Chinese, Malay, Tamil

Measure of

Central Tendency

Variability or Spread

Shape

Mean
(1.2 p8-10)

Median
(1.2 p11-13)

Mode
(1.2 p14-15)

Average

Middle value

Most frequently occurring
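As a minimal sketch of the three measures above, the snippet below uses Python's standard `statistics` module. The `sales` figures are hypothetical and chosen only for illustration; note how the single large value pulls the mean above the median.

```python
import statistics

# Hypothetical monthly sales figures (in thousands), for illustration only.
sales = [12, 15, 15, 18, 20, 22, 45]

mean_value = statistics.mean(sales)      # arithmetic average
median_value = statistics.median(sales)  # middle value of the sorted data
mode_value = statistics.mode(sales)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```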

Range
(1.2 p21)

Variance
(1.2 p22)

Standard Deviation
(1.2 p23)

Coefficient of Variation
(1.2 p25-26)

Total spread of values

Average squared deviation of the data from the mean

Square root of the variance

Ratio of the standard deviation to the mean
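The four spread measures can be sketched as follows. This assumes the data are a sample, so the variance divides by n - 1 (`statistics.variance`); for a full population, `statistics.pvariance` and `statistics.pstdev` would be used instead. The `data` values are hypothetical.

```python
import statistics

# Hypothetical sample of five observations.
data = [4, 8, 6, 5, 7]

data_range = max(data) - min(data)    # total spread of values
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance
cv = std_dev / statistics.mean(data)  # standard deviation relative to the mean

print(data_range, variance, round(std_dev, 4), round(cv, 4))
```

Because it is a ratio, the last measure has no unit, which makes it handy for comparing variability across data sets measured on different scales.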

Skewness

Kurtosis
(1.2 p33-35)

Degree of distortion from the symmetrical bell curve

Types
(1.2 p30)

Positive skew

Negative skew

No skew

Interpretation
(1.2 p31)

Highly Skewed

skewness < -1 or skewness > +1

Moderately skewed

-1 < skewness < -0.5
+0.5 < skewness < +1

Approximately symmetric

-0.5 < skewness < +0.5
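A rough sketch of how the interpretation bands could be applied in code. It assumes the moment-based (population) skewness formula, which is one of several definitions in use, and the function names `skewness` and `describe_skew` are hypothetical. How values exactly equal to 0.5 or 1 are classified is also an assumption.

```python
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

def describe_skew(value):
    """Map a skewness value to the rule-of-thumb bands."""
    if abs(value) > 1:
        return "highly skewed"
    if abs(value) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

symmetric = [1, 2, 3, 4, 5]      # hypothetical data with no skew
right_tailed = [1, 1, 1, 2, 10]  # hypothetical data with a long right tail

print(describe_skew(skewness(symmetric)))
print(describe_skew(skewness(right_tailed)))
```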

Measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution

High Kurtosis

Data has heavy tails or outliers

Low Kurtosis

Data has light tails or a lack of outliers
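To make the comparison with the normal distribution concrete, the sketch below computes excess kurtosis (the fourth standardized moment minus 3, so a normal distribution scores about 0). This is an assumed, moment-based definition; statistical packages may apply small-sample corrections. The function name and both data sets are hypothetical.

```python
import statistics

def excess_kurtosis(values):
    """Mean of the fourth-power z-scores, minus 3 (the normal baseline)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return sum(((x - mean) / sd) ** 4 for x in values) / len(values) - 3

with_outlier = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]  # one extreme value: heavy tail
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]          # evenly spread: light tails

print(excess_kurtosis(with_outlier) > 0)  # high kurtosis
print(excess_kurtosis(flat) < 0)          # low kurtosis
```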

Normal distribution
(1.2 p24)

Given the mean and SD, we can find the probability that x falls within a given range.
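A small illustration using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The delivery-time scenario, with a mean of 30 minutes and an SD of 5 minutes, is a made-up example.

```python
from statistics import NormalDist

# Hypothetical: delivery times are normal with mean 30 min and SD 5 min.
delivery = NormalDist(mu=30, sigma=5)

p_under_35 = delivery.cdf(35)                   # P(X <= 35)
p_between = delivery.cdf(35) - delivery.cdf(25)  # P(25 <= X <= 35)

print(round(p_under_35, 4), round(p_between, 4))
```

The second probability covers one SD either side of the mean, which is why it comes out close to the familiar 68%.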

Data Preparation
(1.3 p4-10)

Data cleaning
(1.3 p7)

Data transformation
(1.3 p8)

Data construction

Data integration

Data reduction

Data standardization
(z-score)

Data normalization
(min-max scaling)
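The two rescaling methods can be sketched in a few lines. The function names and the `incomes` values are hypothetical, and the z-score here uses the population SD, which is an assumption (some tools use the sample SD). Neither function handles a column where every value is identical, since that would divide by zero.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]

def min_max_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

incomes = [2000, 3000, 4000, 5000, 6000]  # hypothetical monthly incomes

z_scores = standardize(incomes)
scaled = min_max_scale(incomes)

print([round(z, 3) for z in z_scores])
print(scaled)
```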

Data Visualization
(2.1 p32)

Summary reports

Histograms
(2.1 p14-16)

Bar Charts
(2.1 p17-20)

Pie Charts
(2.1 p21-23)

Line Plots
(2.1 p24-26)

Scatter Plots
(2.1 p27-30)

Combo Charts
(2.1 p31)

Analysis Methods

Simple Regression Analysis
(2.3 p3-12)

Goodness of Fit Measure
(R-squared)

As a rule of thumb, an R-squared greater than 0.7 indicates a good model fit.

Estimating the relationships between a dependent variable and one or more independent variables

The closer R-squared is to 1, the better the model fit
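As an illustration, the sketch below fits a straight line by ordinary least squares with one independent variable and then computes R-squared. The function names and the advertising data are hypothetical; in practice a statistics package would normally do this.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """Share of the variation in y that the fitted line explains."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

ad_spend = [1, 2, 3, 4, 5]             # hypothetical independent variable
sales = [2.1, 3.9, 6.2, 7.8, 10.1]     # hypothetical dependent variable

slope, intercept = fit_line(ad_spend, sales)
r2 = r_squared(ad_spend, sales, slope, intercept)

print(round(slope, 2), round(intercept, 2), round(r2, 3))
```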

Correlation Analysis
(2.2 p5-17)

Measures the association between two sets of interval-scaled or ratio-scaled variables

Correlation does not imply causation

Correlation coefficient

Between -1 and 1

Zero - no correlation

Negative sign - inverse or negative correlation

As a rule of thumb, a correlation is usually considered meaningful when the coefficient is 0.5 or above in either direction

Positive sign - direct or positive correlation
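A minimal sketch of the coefficient, assuming the Pearson formula (covariance divided by the product of the two standard deviations). The function name and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    spread_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]             # hypothetical training hours
scores = [52, 55, 61, 70, 72]       # hypothetical test scores
prices = [10, 8, 6, 4, 2]           # hypothetical prices, falling as hours rise

r_direct = pearson_r(hours, scores)   # positive sign: direct correlation
r_inverse = pearson_r(hours, prices)  # negative sign: inverse correlation

print(round(r_direct, 3), round(r_inverse, 3))
```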

Decision Trees
(3.1 p6-31)

Decision support tool that uses a tree-like model of decisions and their possible consequences

(3.1 p10)

Involve a model-building process

Splitting data: the best split is the one that gives the highest node purity.
(3.1 p16-29)

Tree pruning: cutting back the tree to reduce overfitting
(3.1 p30)
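Node purity can be made concrete with an impurity measure. The sketch below assumes Gini impurity, which is one common choice (entropy is another); the notes do not say which measure the course uses. Function names and labels are hypothetical. A lower weighted impurity means purer child nodes, so the split with the lowest value is preferred.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 0 means every label is the same."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split, weighting each child node by its size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Two hypothetical candidate splits of the same six customers.
pure_split = (["buy", "buy", "buy"], ["no", "no", "no"])
mixed_split = (["buy", "no", "buy"], ["no", "buy", "no"])

pure_score = weighted_gini(*pure_split)
mixed_score = weighted_gini(*mixed_split)

print(pure_score, round(mixed_score, 3))
```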

Cluster Analysis
(3.2 p4-28)

Exploratory multivariate technique that uncovers natural patterns in data

Grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups

Types

Hierarchical Clustering

Partitional Clustering

Agglomerative

Divisive

Hard Clustering (K-means)
(3.2 p22-26)

Soft Clustering (Fuzzy C-means)
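To show how K-means assigns each point to exactly one cluster, here is a simplified sketch on one-dimensional data. It assumes fixed starting centroids and a fixed number of iterations; real implementations usually pick starting points at random and stop on convergence. The function name and the data are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Repeat two steps: assign points to the nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

spend = [1, 2, 3, 10, 11, 12]  # hypothetical customer spend values
final_centroids, groups = k_means_1d(spend, centroids=[1, 12])

print(final_centroids, groups)
```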

Can be used to identify critical factors

Advantages
(3.1 p31)

Classification

Clustering

Grouping observations into known categories.

Grouping observations into unknown categories.

Supervised learning

Unsupervised learning

Interpreting Output
(3.2 p15-20)

Explain in practical terms

Look at distinguishing characteristics

Look at cluster quality
(3.2 p16-20)

Silhouette Score

Look for clusters of anomalies or outliers

Measures the goodness of a clustering result

Ranges from -1 to 1

More than 0.4 is considered acceptable
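A sketch of how the silhouette score could be computed by hand for one-dimensional clusters. For each point, a is its average distance to the other points in its own cluster and b is its average distance to the nearest other cluster; the point's score is (b - a) / max(a, b). The function names and data are hypothetical, and the code assumes at least two non-empty clusters whose points are not all identical.

```python
def mean_distance(point, others):
    """Average absolute distance from one point to a list of points."""
    return sum(abs(point - q) for q in others) / len(others)

def silhouette_score(clusters):
    """Mean silhouette value over every point in every cluster."""
    scores = []
    for i, cluster in enumerate(clusters):
        for j, point in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for a one-point cluster
                continue
            same = cluster[:j] + cluster[j + 1:]
            a = mean_distance(point, same)  # cohesion within own cluster
            b = min(                        # separation from nearest other cluster
                mean_distance(point, other)
                for k, other in enumerate(clusters) if k != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

groups = [[1, 2, 3], [10, 11, 12]]  # hypothetical, well-separated clusters
score = silhouette_score(groups)

print(round(score, 3))
```

For these two tight, well-separated groups the score lands well above the 0.4 rule of thumb.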

Best used when the distribution is symmetric (mean)

Best used when the distribution is skewed (median)

Best used for categorical data (mode)