Statistics - Intro to Data Mining
empirical means
Are the means equal?
t-distribution
from it one can compute how probable a difference in means at least as large as the observed one would be if the true means were equal
accordingly, the hypothesis is either rejected or retained for the time being
Student's t-distribution
Standard error of the difference of means
indicates how much of the observed difference between the two samples may be due to sampling error, and therefore how uncertain any conclusion drawn from it is
given
two 'unpaired' samples $\{x_i^A\}$, $\{x_i^B\}$ of sizes $N_A$ and $N_B$, drawn from unknown probability densities $P_A$ and $P_B$
true mean
Null hypothesis: the population means of A and B are equal
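A minimal sketch of the test statistic for two unpaired samples, using only the Python standard library. The notes do not say which variance estimate is used, so this sketch assumes a pooled variance (equal variances in both populations); the function name `pooled_t_statistic` is hypothetical.

```python
import math

def pooled_t_statistic(sample_a, sample_b):
    """Return (t, nu, std_err) for two unpaired samples.

    Assumption: both populations share the same variance, so a pooled
    estimate is used. nu = N_A + N_B - 2 is the number of degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a          # empirical mean of A
    mean_b = sum(sample_b) / n_b          # empirical mean of B
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    nu = n_a + n_b - 2
    # standard error of the difference of means
    std_err = math.sqrt((ss_a + ss_b) / nu * (1.0 / n_a + 1.0 / n_b))
    t = (mean_a - mean_b) / std_err
    return t, nu, std_err

t, nu, std_err = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t, nu, std_err)
```

A large |t| means the difference of the empirical means is large compared with its standard error, which speaks against the null hypothesis of equal means.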
Tests to discriminate distributions
distinguish between continuous and discrete values
discrete distributions: χ² test
given
discrete distribution with i = 1,...,M
$A_i$: observed number of events in bin i
$a_i$: expected number of events in bin i (real-valued, coming from the model/hypothesis)
Case 1: Comparison of a dataset with a given distribution
tools
variance: as for n repeated Bernoulli trials, $n\,p_i\,(1 - p_i)$
bin i is realized with probability $p_i = a_i / \sum_j a_j$
analyzing results
large χ² -> large deviations between the distributions
Pearson: for large M, the denominator will be normally distributed
the errors $a_i - A_i$ are normally distributed
probability that an observed value $\hat{\chi}^2$ is larger than $\chi^2_c$ ($\hat{\chi}^2 > \chi^2_c$) just by accident: $Q(\chi^2_c, \nu)$, expressed via the gamma function $\Gamma(x)$
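The comparison of a dataset with a given distribution can be sketched as follows. The notes do not spell out the Case 1 statistic, so the standard Pearson form $\chi^2 = \sum_i (A_i - a_i)^2 / a_i$ is assumed here, and $Q(\chi^2, \nu)$ is taken to be the regularized upper incomplete gamma function $Q(\nu/2, \chi^2/2)$, evaluated with a simple series. Function names are hypothetical, and the choice $\nu = M - 1$ in the usage lines assumes the expected counts were normalized to the observed total.

```python
import math

def chi_square_vs_model(observed, expected):
    """Pearson chi-square between observed counts A_i and expected counts a_i."""
    return sum((A - a) ** 2 / a for A, a in zip(observed, expected))

def chi_square_q(chi2, nu):
    """Probability that chi-square with nu degrees of freedom exceeds chi2 by chance.

    Computes Q = 1 - P(nu/2, chi2/2) using the series expansion of the
    regularized lower incomplete gamma function P(a, x).
    """
    a, x = nu / 2.0, chi2 / 2.0
    if x <= 0.0:
        return 1.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    p = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return 1.0 - p

observed = [18, 22, 20]          # A_i
expected = [20.0, 20.0, 20.0]    # a_i from the hypothesis
chi2 = chi_square_vs_model(observed, expected)
print(chi2, chi_square_q(chi2, len(observed) - 1))
```

A small Q means a deviation this large would rarely occur by accident, so the hypothesis that the data follow the given distribution becomes doubtful.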
Case 2: Comparison of two datasets
tools
$\chi^2 = \sum_{i=1}^{M} \frac{(A_i - B_i)^2}{A_i + B_i}$
Variance: $\sigma^2(A_i - B_i) = \sigma^2(A_i) + \sigma^2(B_i) \approx A_i + B_i$
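The two-dataset statistic above can be sketched in a few lines. Skipping bins that are empty in both datasets is an assumption of this sketch (they carry no information and would otherwise cause a division by zero); the function name is hypothetical.

```python
def chi_square_two_datasets(a_counts, b_counts):
    """Return (chi2, bins_used) for two binned datasets A_i and B_i.

    Each term is (A_i - B_i)^2 / (A_i + B_i); the denominator is the
    approximate variance of the difference A_i - B_i.
    """
    if len(a_counts) != len(b_counts):
        raise ValueError("both datasets must use the same M bins")
    chi2 = 0.0
    bins_used = 0
    for a, b in zip(a_counts, b_counts):
        if a + b == 0:
            continue  # assumption: bins empty in both datasets are ignored
        chi2 += (a - b) ** 2 / (a + b)
        bins_used += 1
    return chi2, bins_used

chi2, bins_used = chi_square_two_datasets([10, 20, 30], [12, 18, 30])
print(chi2, bins_used)
```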
continuous distributions: Kolmogorov-Smirnov test
given:
two samples $\{x_i^A\}$, $i = 1, \dots, N_A$, and $\{x_i^B\}$, $i = 1, \dots, N_B$
Optional: comparison of $\{x_i^A\}$ with a hypothetical distribution $P(x)$
goal: calculation of the empirical cumulative distribution $F_A(x) = \frac{1}{N_A} \sum_{i=1}^{N_A} \Theta(x - x_i^A)$, where $\Theta(x) = 1$ if $x \ge 0$ and $0$ otherwise
alternative statistics: $P(D \ge D_{\mathrm{crit}})$
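A minimal sketch of the empirical cumulative distribution and of the distance D between two samples. The notes only name D, so the usual Kolmogorov-Smirnov definition $D = \max_x |F_A(x) - F_B(x)|$ is assumed; function names are hypothetical.

```python
def ecdf(sample, x):
    """Empirical cumulative distribution F(x): fraction of points with x - x_i >= 0."""
    return sum(1 for xi in sample if x - xi >= 0) / len(sample)

def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the two empirical distributions.

    Both functions are step functions that only change at data points,
    so it is enough to evaluate the difference at every data point.
    """
    points = list(sample_a) + list(sample_b)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

d = ks_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(d)
```

This direct version costs O(N²) evaluations, which is fine for illustration; sorting both samples first would be the usual choice for large datasets.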
different formulas and their meaning
density function
$P(t, \nu)$
cumulative distribution function (integral of the density)
$A(t, \nu)$
which kind of tests and their H0 hypothesis
depends on the desired significance level, e.g. 5 %, 1 % or 0.1 %
one-sided test
used when: we assume that $\mu_A > \mu_B$ (so a positive value of t)
the area below the Student-t density function from $t_c$ to infinity is the probability that a value $t > t_c$ occurs randomly
two-sided test
used when: $|\mu_A - \mu_B| \ne 0$
the area below the Student-t density function from minus infinity to $-t_c$ plus the area from $t_c$ to infinity is the probability that a value $|t| > t_c$ occurs randomly
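The tail areas for both kinds of test can be sketched by integrating the Student-t density numerically. This is an illustration under assumptions: the density is the standard Student-t form with $\nu$ degrees of freedom, the integral uses Simpson's rule, and all function names are hypothetical.

```python
import math

def t_density(t, nu):
    """Student-t probability density with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2.0 * math.log1p(t * t / nu))

def central_area(t_c, nu, steps=2000):
    """Area under the density between -|t_c| and +|t_c| (Simpson's rule, even steps)."""
    b = abs(t_c)
    if b == 0.0:
        return 0.0
    h = b / steps
    total = t_density(0.0, nu) + t_density(b, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, nu)
    return 2.0 * total * h / 3.0   # the density is symmetric around 0

def one_sided_p(t_c, nu):
    """Probability that a value t > t_c occurs randomly."""
    half = central_area(t_c, nu) / 2.0
    return 0.5 - half if t_c >= 0 else 0.5 + half

def two_sided_p(t_c, nu):
    """Probability that a value |t| > |t_c| occurs randomly."""
    return 1.0 - central_area(t_c, nu)

print(one_sided_p(1.0, 1), two_sided_p(1.0, 1))
```

The null hypothesis is rejected when the relevant probability falls below the chosen significance level; for the same t, the two-sided probability is twice the one-sided one.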
Detection of Dependencies between variables
question: if the null hypothesis can be rejected, how strong is the dependency? - reparameterization, so that the value is independent of I and J
tools
strength is essentially given by the value of χ²; the two conventional normalizations are:
Cramer's V
needed: χ², N, J, I
results
V = 1 <-- perfect association (i.e. one variable determines the other)
V = 0 <-- no association at all
If I = J = 2, V is called the 'φ statistic'
Contingency coefficient
results
value C = 1 will never be reached
is only useful to compare the strengths of association of tables with equal (I, J)
both with V and C there is no direct statistical interpretation of values in between
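Both normalizations can be sketched directly from χ² and the table size. The notes list the inputs but not the formulas, so the standard definitions are assumed here: $V = \sqrt{\chi^2 / (N \min(I-1, J-1))}$ and $C = \sqrt{\chi^2 / (\chi^2 + N)}$. Function names are hypothetical.

```python
import math

def cramers_v(chi2, n, i, j):
    """Cramer's V for an I x J table with N data points (assumed standard form)."""
    return math.sqrt(chi2 / (n * min(i - 1, j - 1)))

def contingency_coefficient(chi2, n):
    """Contingency coefficient C; stays below 1 for any finite chi2."""
    return math.sqrt(chi2 / (chi2 + n))

# Perfectly associated 2 x 2 table with N = 20 gives chi2 = 20.
print(cramers_v(20.0, 20, 2, 2), contingency_coefficient(20.0, 20))
```

In this example V reaches 1 while C does not, which illustrates why C is only useful for comparing tables of equal size (I, J).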
notation
$N_{i,j}$ = number of cases with A = i-th value and B = j-th value
$N_{i,*} = \sum_j N_{i,j}$: row marginal, sums over columns
$N_{*,j} = \sum_i N_{i,j}$: column marginal, sums over rows
$N_{*,*} = \sum_{i,j} N_{i,j} = N$: total number of data points
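The notation above can be turned into a small sketch that computes the marginals and the χ² of a contingency table. It assumes the usual expected count under independence, $n_{i,j} = N_{i,*} N_{*,j} / N$, which the notes do not state explicitly; function names are hypothetical.

```python
def marginals(table):
    """Return (row marginals, column marginals, total N) of an I x J table."""
    rows = [sum(row) for row in table]           # sums over columns
    cols = [sum(col) for col in zip(*table)]     # sums over rows
    return rows, cols, sum(rows)

def chi_square_table(table):
    """Chi-square of the table against the independence hypothesis."""
    rows, cols, total = marginals(table)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = rows[i] * cols[j] / total   # assumed expected count
            if expected > 0:
                chi2 += (n_ij - expected) ** 2 / expected
    return chi2

table = [[10, 0],
         [0, 10]]
print(marginals(table), chi_square_table(table))
```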