Probability and Statistics - Coggle Diagram
Probability and Statistics
HC: estimation
Estimation strategies
Estimation by bounding:
setting a range of plausible quantities and using the average value in calculations.
Calculate
geometric mean
when the lower and upper bounds have different orders of magnitude
Calculate
arithmetic mean
when the lower and upper bounds have the same order of magnitude
To find the approximate geometric mean (AGM): average of the coefficients times 10^(average of the exponents).
If the average of the exponents is not an integer (ends in .5), round the exponent down and multiply the whole number by 3 (since 10^0.5 ≈ 3.16 ≈ 3).
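As a sketch, the AGM recipe can be written out in Python (the function name and the sample bounds are my own, chosen for illustration):

```python
import math

def approx_geometric_mean(c1, e1, c2, e2):
    """AGM of the bounds c1*10^e1 and c2*10^e2: average the coefficients,
    average the exponents; if the exponent average is not an integer,
    round it down and multiply the coefficient by 3 (since 10^0.5 ~= 3.16)."""
    coeff = (c1 + c2) / 2
    exp = (e1 + e2) / 2
    if exp != int(exp):
        coeff *= 3
        exp = math.floor(exp)
    return coeff * 10 ** int(exp)

# bounds 3x10^2 and 2x10^5 differ in order of magnitude, so use the AGM
agm = approx_geometric_mean(3, 2, 2, 5)   # 2.5 * 3 * 10^3
true_gm = math.sqrt(3e2 * 2e5)            # exact geometric mean for comparison
```

The approximation lands within a few percent of the exact geometric mean here.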
Estimation by analogy:
estimating based on similar experiences
Bottom-up estimation:
deconstructing an estimation into smaller, easier estimations and adding them together.
Getting outside opinion:
a single person tends to over/underestimate. To avoid this, ask a disinterested person for an estimate
Expert opinion:
if the topic is new, ask an expert or a panel of experts for an opinion.
Fermi estimate: making justified guesses about the lower and upper bounds of unknown figures
Estimation can go wrong if:
Over/under estimation: systematic error
Nonlinear problem: simple addition of estimates is not accurate
Incorrect model: incorrect assumptions about which estimation steps are needed
HC: variables
Types of variables
Quantitative:
Handle quantities, mathematical operations can be performed
Discrete: Two consecutive values can be identified (e.g. number of apples)
Continuous: can take any real value within its bounds (e.g., mass, temperature, distance). It is impossible to define a pair of consecutive values.
Qualitative:
Handle categories
Ordinal: categories that can be ranked (morning, noon, afternoon, evening)
Nominal: pure labels with no inherent order, used to group data
Roles of variables
Independent: variable hypothesized to cause a change in dependent variable. Something that is modified to observe its effect.
Dependent: variable that is changed due to an influence of independent variable.
Controlled: purposely fixed variables
Extraneous: variables (external factors) that affect the dependent variable. They cannot be controlled, but can be mitigated. Usually hidden.
Confounding: a subset of extraneous variables; factors that affect both the IV and the DV
HC: descriptivestats
Measure of location:
Mean:
The average of dataset.
Mean = (sum of all data) / (number of data points)
Median:
The middle number of data set when they are arranged in increasing order.
If the number of data points is odd, the median is the middle number of the dataset. If even, the median is the mean of the two middle numbers.
Mode:
The most frequently occurring number in the dataset
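The three measures of location above can be sketched with Python's standard library (the sample dataset is my own):

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 8, 9]   # already sorted, N = 7 (odd)
m  = mean(data)                # sum / N = 37 / 7
md = median(data)              # middle value of the sorted data
mo = mode(data)                # most frequent value

# N even: the median is the mean of the two middle values
even_median = median([1, 2, 3, 4])
```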
Measure of spread:
Range:
The difference between the maximum and minimum data points
Population Standard deviation:
The typical distance between the data points and the mean: SD = (sum((x - mean)^2) / N)^0.5
Sample standard deviation:
The same as the population standard deviation, but to make it unbiased the sum of squared deviations is divided by N - 1 instead of N (Bessel's correction)
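A minimal sketch of the two formulas above (function names and the sample data are my own):

```python
import math

def population_sd(xs):
    # sqrt of the mean squared deviation from the mean (divide by N)
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    # same sum of squared deviations, divided by N - 1 (Bessel's correction)
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, sum of squared deviations = 32
```

Dividing by N - 1 always gives a slightly larger value than dividing by N.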
dataviz: Histograms
Left-skewed: longer tail (foot) on the left
Right-skewed: longer tail (foot) on the right
HC: correlation
Possible explanations of correlation
A causes B
B causes A
A and B are both caused by third factor
Combination of 1,2,3
Correlation is a coincidence
Types of data:
Univariate: one variable of an individual is measured (represented by histogram)
Bivariate: two variables of individual (scatter plot)
Multivariate: more than two variables of individual (multidimensional graph)
Indicators of association:
When a vertical slice of the scatterplot is taken, the scatter (SD within the slice) is smaller than the overall SD
Knowing the value of one variable gives us an idea about the other
Homoscedasticity:
if the scatterplot is sliced vertically, the spread (SD) of points within a slice does not strongly depend on where the slice is taken
Heteroscedasticity:
if the scatterplot is sliced vertically, the spread (SD) of points within a slice depends on where the slice is taken
Correlation: linear association.
Positive correlation: individuals with smaller-than-average values of one variable tend to have smaller-than-average values of the other, and likewise for larger-than-average values.
Negative correlation: The opposite
Features of scatter plot
Point of Averages:
is a measure of the "center" of the scatter plot. To find it, calculate the mean of the X values and the mean of the Y values
Outlier:
a point that does not follow the overall pattern and lies many SDs away from the mean
Measure of spread in scatterplots:
SDx (standard deviation of variable plotted on x-axis), SDy (SD of variable plotted on y-axis)
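The point of averages and the per-axis spreads can be sketched as follows (the sample data is my own):

```python
from statistics import pstdev

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 10]

# point of averages: (mean of the X values, mean of the Y values)
point_of_averages = (sum(xs) / len(xs), sum(ys) / len(ys))

# measures of spread along each axis
sd_x = pstdev(xs)   # SDx
sd_y = pstdev(ys)   # SDy
```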
HC: probability
Probability distribution:
list of all disjoint outcomes and their probabilities.
Rules:
(i) all outcomes should be disjoint
(ii) probability of all outcomes should be between 0 and 1
(iii) sum of all probabilities should be 1
Rules
Disjoint ("mutually exclusive"): P(A or B) = P(A) + P(B); A and B cannot happen at the same time.
General addition rule: P(A or B) = P(A) + P(B) - P(A and B); applies when A and B are not mutually exclusive (can happen at the same time).
Multiplication rule (for independent events): P(A and B)=P(A) x P(B)
Complement: P(A)=1-P(A')
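The rules above, worked through on a small example of my own (one fair six-sided die):

```python
# A = "roll is even", B = "roll is greater than 3"
p_a = 3 / 6                        # {2, 4, 6}
p_b = 3 / 6                        # {4, 5, 6}
p_a_and_b = 2 / 6                  # {4, 6}

# general addition rule (A and B are NOT disjoint)
p_a_or_b = p_a + p_b - p_a_and_b   # {2, 4, 5, 6}

# multiplication rule for independent events: two dice both show six
p_two_sixes = (1 / 6) * (1 / 6)

# complement rule
p_not_a = 1 - p_a
```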
Interpretations
Bayesian interpretation: subjective probability, based on our current belief about the chances of the event occurring. Can be updated based on evidence. “Given all the data that I have observed, what should I believe the world is like?” #induction
Frequentist interpretation:
the relative frequency of a particular event over an infinite number of trials. “If the world was like X, what type of data would we expect to collect from repeated observations?” #deduction
Note:
Independence:
events are independent when occurrence of one event does not say anything about the other.
Law of Large Numbers:
increasing the number of trials brings the proportion of a particular outcome closer to its true probability.
Marginal:
the probability of one event without regard to the other: A/total (e.g., probability of having no pet: 336/613)
Joint:
the probability of the outcome of two or more events together: (A and B)/total (e.g., probability of having a pet and being healthy: 243/613)
Conditional:
the probability of A given B: P(A|B) = P(A and B)/P(B) (e.g., probability of being healthy given that you have a pet: P(A|B) = 243/530)
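A sketch of the three kinds of probability on a small pet/health count table (the counts here are invented for illustration, not the figures cited above):

```python
# Hypothetical counts (illustrative only):
#                healthy   not healthy   total
# has pet            40            10       50
# no pet             30            20       50
total = 100

p_pet = 50 / total                  # marginal: one event alone
p_pet_and_healthy = 40 / total      # joint: both events together
p_healthy_given_pet = 40 / 50       # conditional: restrict to pet owners
```

Note the conditional probability equals the joint divided by the marginal, matching the formula above.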
Bayes theorem
P(A|B) = P(B|A) × P(A) / P(B) = P(A and B) / P(B)
P(A and B) = P(A|B) × P(B) = P(B|A) × P(A)
If A and B are independent then P(A|B)=P(A)
Fallacies
Confusion of the inverse:
P(A|B)!=P(B|A)
Gambler's fallacy:
The probability of an independent event does not change based on the frequency of its past occurrences
Base-rate fallacy:
Instead of integrating the probabilities of two events, only one is used (false-positive/false-negative problems)
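Bayes' theorem and the base-rate fallacy can both be seen in a hypothetical screening-test example (all numbers below are invented for illustration):

```python
p_d = 0.01                # base rate: P(disease)
p_pos_given_d = 0.95      # sensitivity: P(positive | disease)
p_pos_given_not_d = 0.05  # false-positive rate: P(positive | no disease)

# law of total probability: P(positive)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(disease | positive)
p_d_given_pos = p_pos_given_d * p_d / p_pos
```

Despite the 95% sensitivity, P(disease | positive) comes out well under 20%, because the low base rate dominates; ignoring it is exactly the base-rate fallacy.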
HC: distribution
Binomial
Definition:
The probability of having k successes in n Bernoulli trials, each with probability of success p.
Bernoulli variable
A random variable with only two possible outcomes. Outcomes are typically labeled as success (1) or failure (0).
P(1)=#success/#trials
Mean
=P(1)
SD
=(P(1) × (1 - P(1)))^0.5
Formulas
Probability:
P(X=k) = nCk × p^k × (1-p)^(n-k), where n is the number of trials, k the number of successes, and p the probability of success in a single trial
Mean:
μ = np
Variance:
σ ^2 = np(1-p)
Expected values:
E(X)=sum of x*P(X=x)
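The binomial formulas above can be checked against each other in a short sketch (the function name and parameters are my own):

```python
from math import comb

def binom_pmf(n, k, p):
    # P(X = k) = nCk * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
mu = n * p                    # mean = np
var = n * p * (1 - p)         # variance = np(1-p)

# expected value computed directly: E(X) = sum of x * P(X = x)
ev = sum(k * binom_pmf(n, k, p) for k in range(n + 1))
```

Summing x·P(X=x) over all outcomes reproduces the shortcut mean np.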
Normal
Discrete vs Normal
Discrete distributions give the probability of a particular outcome, while the normal distribution gives the probability of a range of outcomes
Definition:
A distribution of a continuous variable whose frequencies are symmetric about the mean. Expressed by a bell-shaped curve called the
probability density
P(x1 to x2) = n(x1 to x2)/N = area under the curve in that region
Standard normal distribution:
A standard normal distribution is a normal distribution with mean 0 and standard deviation 1
The Z-score shows how many standard deviations x lies from the mean
Z=(x-m)/sigma
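A one-line sketch of the Z-score formula (the example scale, mean 100 and SD 15, is my own choice):

```python
def z_score(x, mu, sigma):
    # how many standard deviations x lies from the mean
    return (x - mu) / sigma

# e.g. a score of 130 on a scale with mean 100 and SD 15
z = z_score(130, 100, 15)
```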
Binomial to Normal
When the sample size is large enough that np and n(1-p) are both at least 10, the binomial distribution can be approximated by a normal distribution.
In this case, the parameters of the normal distribution are mean = np, std = (np(1-p))^0.5
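A sketch of checking the condition and computing the approximating normal's parameters (n and p below are arbitrary example values):

```python
import math

n, p = 100, 0.25
successes = n * p           # np
failures = n * (1 - p)      # n(1-p)
normal_ok = successes >= 10 and failures >= 10   # approximation condition

mu = n * p                           # mean of the approximating normal
sigma = math.sqrt(n * p * (1 - p))   # std = (np(1-p))^0.5
```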
Sampling
Definition:
The sampling distribution is the distribution of a sample statistic (e.g., a sample proportion) over many repeated samples
Formula
Sampling distribution
mean
is equal to the true population proportion p
Sampling distribution standard deviation aka
standard error
is equal to (p(1-p)/n)^0.5
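The standard error formula as a sketch (function name and example values are my own); note it shrinks as the sample size n grows:

```python
import math

def standard_error(p, n):
    # SD of the sampling distribution of a sample proportion
    return math.sqrt(p * (1 - p) / n)

se = standard_error(0.5, 100)   # sqrt(0.25 / 100)
```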
Central limit theorem
when sample size is sufficiently large and observations are independent, the sampling distribution will tend to follow a normal distribution
Rules
Success-failure condition:
The sampling distribution is approximately normal when np and n(1-p) are both at least 10 and the sample size is greater than 30. Otherwise it can be left- or right-skewed.
10% rule:
If the sample size is less than 10% of the population size, then the observations can be treated as independent.