Statistics - Intro to Data Mining
empirical means
Are the means equal?
from this, one can compute the probability of being right or wrong when assuming that the means are equal
accordingly, the hypothesis can either be rejected or retained for the time being
Standard error of the difference of means
indicates how far off the observed variation between the two samples may be, and with it everything that is interpreted from it
two 'unpaired' samples {xiA}, {xiB} of size NA and NB, drawn from unknown probability densities PA and PB
true mean
Null hypothesis: the means of A and B are equal in the population
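A minimal Python sketch of the quantities named in this branch: the two empirical means, the standard error of their difference, and the resulting t value. It assumes the pooled-variance form of Student's t (equal true variances in A and B), which the notes do not state explicitly. The function name `unpaired_t` and the sample values are illustrative only.

```python
import math


def unpaired_t(sample_a, sample_b):
    """Return (t, s_d, dof) for two unpaired samples.

    Assumption: both populations share the same true variance,
    so a pooled variance estimate is used.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a  # empirical mean of A
    mean_b = sum(sample_b) / n_b  # empirical mean of B

    # Sums of squared deviations from each empirical mean.
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)

    dof = n_a + n_b - 2  # degrees of freedom
    # Standard error of the difference of means (pooled variance).
    s_d = math.sqrt((ss_a + ss_b) / dof * (1.0 / n_a + 1.0 / n_b))
    t = (mean_a - mean_b) / s_d
    return t, s_d, dof


# Made-up example data.
t, s_d, dof = unpaired_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t, s_d, dof)
```

A value of t far from zero, relative to the Student's t distribution with `dof` degrees of freedom, speaks against the null hypothesis of equal means.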
Tests to discriminate distributions
Distinction between continuous and discrete values
discrete distributions: χ² test
discrete distribution with bins i = 1, ..., M
Ai: observed number of events in bin i
ai: expected number (continuous, coming from model/hypothesis)
Case 1: Comparison of a dataset with a given distribution
variance: like Bernoulli experiments np(1-p)
bin i will be realized with probability p = (ai)/sum(ai)
analyzing results
a large χ² indicates large deviations between the distributions
Pearson: for large M, the denominator will be normally distributed
errors ai - Ai are normally distributed
probability that an observed value χ² is larger than a critical value χ²c (χ² > χ²c) just by accident: Q(χ²c, ν), expressed via the gamma function Γ(x)
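Case 1 can be sketched in a few lines of Python. The statistic is the usual sum of (Ai - ai)² / ai over all bins. The probability Q is computed here as the regularized upper incomplete gamma function Q(ν/2, χ²/2), using a plain series expansion. The helper names, the die-roll counts, and the choice ν = M - 1 (expected counts normalized to the observed total) are assumptions made for illustration.

```python
import math


def chi_square_vs_model(observed, expected):
    """chi^2 = sum_i (A_i - a_i)^2 / a_i for observed A_i, expected a_i > 0."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    return sum((A - a) ** 2 / a for A, a in zip(observed, expected))


def gamma_q(a, x, max_terms=1000, eps=1e-12):
    """Regularized upper incomplete gamma function Q(a, x) = 1 - P(a, x).

    P(a, x) is summed as a power series, which is simple but becomes
    slow for x much larger than a.
    """
    if x <= 0.0:
        return 1.0
    term = 1.0 / a
    total = term
    ap = a
    for _ in range(max_terms):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * eps:
            break
    p = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return 1.0 - p


def chi_square_probability(chi2, dof):
    """Probability of exceeding chi2 by chance for dof degrees of freedom."""
    return gamma_q(dof / 2.0, chi2 / 2.0)


# Made-up example: 60 die rolls compared with a fair die (10 per face).
observed = [8, 12, 9, 11, 10, 10]
expected = [10.0] * 6
chi2 = chi_square_vs_model(observed, expected)
q = chi_square_probability(chi2, dof=len(observed) - 1)
print(chi2, q)
```

A value of Q close to 1 means deviations this large arise easily by chance, so the data give no reason to reject the model; a very small Q speaks against it.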
Case 2: Comparison of two datasets
$\chi^2 = \sum_{i=1}^{M} \frac{(A_i - B_i)^2}{A_i + B_i}$
Variance: $\sigma^2(A_i - B_i) = \sigma^2(A_i) + \sigma^2(B_i) \approx A_i + B_i$
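For Case 2, the denominator Ai + Bi plays the role of the variance of the difference Ai - Bi. A short Python sketch under the assumption that both datasets use the same bins and have roughly equal total counts (the function name and the counts are made up):

```python
def chi_square_two_datasets(counts_a, counts_b):
    """chi^2 = sum_i (A_i - B_i)^2 / (A_i + B_i) over bins i = 1..M."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both datasets need the same number of bins")
    total = 0.0
    for a, b in zip(counts_a, counts_b):
        if a == 0 and b == 0:
            # A bin that is empty in both datasets carries no information
            # and would otherwise cause a division by zero.
            continue
        total += (a - b) ** 2 / (a + b)
    return total


# Made-up binned counts for two datasets.
chi2_ab = chi_square_two_datasets([10, 20, 30], [12, 18, 30])
print(chi2_ab)
```

Bins skipped because both counts are zero should also be left out when counting the degrees of freedom.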
continuous distributions: Kolmogorov-Smirnov test
two samples {xiA} i = 1,..., NA {xiB} i = 1,..., NB
Optional: comparison of {xiA} with a hypothetical distribution P(x)
goal: calculation of the empirical cumulative distribution $F_A(x) = \frac{1}{N_A} \sum_{i=1}^{N_A} \Theta(x - x_i^A)$, where $\Theta(y) = 1$ if $y \ge 0$ and $0$ otherwise
alternative statistics: P(D>=Dcrit)
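The Kolmogorov-Smirnov statistic D is the largest absolute distance between the two empirical cumulative distributions FA(x) and FB(x). The Python sketch below computes D and then estimates P(D >= observed) with the standard asymptotic series Q_KS(λ) = 2 Σ (-1)^(j-1) exp(-2 j² λ²), using the common finite-sample approximation λ = (√Ne + 0.12 + 0.11/√Ne) · D with Ne = NA·NB / (NA + NB). That approximation is not stated in the notes and is added here as an assumption; all function names are illustrative.

```python
import bisect
import math


def ecdf(sample):
    """Return F(x): the fraction of data points x_i with x - x_i >= 0."""
    data = sorted(sample)
    n = len(data)

    def F(x):
        return bisect.bisect_right(data, x) / n

    return F


def ks_two_sample(sample_a, sample_b):
    """D = max_x |F_A(x) - F_B(x)|."""
    F_a, F_b = ecdf(sample_a), ecdf(sample_b)
    # Both step functions only change at data points, so the maximum
    # distance is attained at one of them.
    points = list(sample_a) + list(sample_b)
    return max(abs(F_a(x) - F_b(x)) for x in points)


def ks_probability(d, n_a, n_b, terms=100):
    """Approximate probability of a distance >= d arising by chance."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:
        # The series converges slowly here; Q_KS is within about 1e-5 of 1.
        return 1.0
    q = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(max(q, 0.0), 1.0)


# Made-up samples.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [3.0, 4.0, 5.0, 6.0]
d = ks_two_sample(sample_a, sample_b)
print(d, ks_probability(d, len(sample_a), len(sample_b)))
```

For samples this small the asymptotic formula is only a rough guide; it becomes more reliable as NA and NB grow.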
different formulas and their meaning
density function
cumulative distribution function (integral of the density function)
which kind of tests and their H0 hypothesis
depends on the desired significance level, e.g. 5 %, 1 % or 0.1 %
one-sided test
used when: we assume that μA > μB (so a positive value of t)
the area below the Student's t density function from tc to infinity is the probability that a value t > tc occurs randomly
two-sided test
used when: we only assume that |μA - μB| ≠ 0
the area below the Student's t density function in both tails, from -infinity to -tc and from tc to infinity, is the probability that a value |t| > tc occurs randomly
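The tail areas described for the one-sided and two-sided tests can be approximated numerically. The Python sketch below integrates the Student's t density with Simpson's rule on [0, |tc|] and uses the symmetry of the density (the area from 0 to infinity is 1/2). The function names and the step count are illustrative choices, not part of the notes.

```python
import math


def t_density(t, dof):
    """Student's t probability density with dof degrees of freedom."""
    log_norm = (
        math.lgamma((dof + 1) / 2.0)
        - math.lgamma(dof / 2.0)
        - 0.5 * math.log(dof * math.pi)
    )
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(t * t / dof))


def upper_tail(t_c, dof, steps=2000):
    """Area under the density from t_c to infinity (steps must be even)."""
    b = abs(t_c)
    h = b / steps
    s = t_density(0.0, dof) + t_density(b, dof)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * t_density(k * h, dof)
    central = s * h / 3.0  # area between 0 and |t_c|
    return 0.5 - central if t_c >= 0 else 0.5 + central


def p_one_sided(t, dof):
    """Probability that a value larger than t occurs randomly."""
    return upper_tail(t, dof)


def p_two_sided(t, dof):
    """Probability that a value with magnitude larger than |t| occurs randomly."""
    return 2.0 * upper_tail(abs(t), dof)


# Made-up example: t = 2.228 with 10 degrees of freedom.
print(p_one_sided(2.228, 10), p_two_sided(2.228, 10))
```

The null hypothesis is rejected when the resulting probability falls below the chosen significance level; for the same t, the two-sided probability is twice the one-sided one.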
Detection of Dependencies between variables
question: if the null hypothesis can be rejected, how strong is the dependency? Reparameterization, so that the value is independent of I and J
the strength is essentially given by the value of χ²; the two conventional normalizations are Cramér's V and the contingency coefficient C
Cramér's V
needed: χ², N, J, I
V = 1: perfect association (i.e. one variable determines the other)
V = 0: no association at all
If I = J = 2, V is called the 'φ statistic'
Contingency coefficient
value C = 1 will never be reached
is only useful to compare the strengths of association of tables with equal (I, J)
both with V and C there is no direct statistical interpretation of values in between
$N_{i,j}$ = number of cases with A = i-th value and B = j-th value
$N_{i,*} = \sum_j N_{i,j}$: row marginal, sums over columns
$N_{*,j} = \sum_i N_{i,j}$: column marginal, sums over rows
$N_{*,*} = \sum_{i,j} N_{i,j} = N$: total number of data points
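The marginals above are all that is needed to compute χ² for a contingency table and, from it, the two association measures. The Python sketch below assumes the standard definitions: expected counts n_ij = N_i,* · N_*,j / N, V = sqrt(χ² / (N · min(I-1, J-1))) and C = sqrt(χ² / (χ² + N)). These formulas are not written out in the notes, and the function name and tables are made up.

```python
import math


def association(table):
    """Return (chi2, cramers_v, contingency_c) for an I x J table of counts.

    Assumption: no row or column is completely empty, so every
    expected count is positive.
    """
    row_marginals = [sum(row) for row in table]        # N_i,*
    col_marginals = [sum(col) for col in zip(*table)]  # N_*,j
    n = sum(row_marginals)                             # N

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # Expected count if A and B were independent.
            expected = row_marginals[i] * col_marginals[j] / n
            chi2 += (n_ij - expected) ** 2 / expected

    num_rows, num_cols = len(row_marginals), len(col_marginals)
    cramers_v = math.sqrt(chi2 / (n * min(num_rows - 1, num_cols - 1)))
    contingency_c = math.sqrt(chi2 / (chi2 + n))
    return chi2, cramers_v, contingency_c


# One variable fully determines the other: V reaches 1, C stays below 1.
print(association([[10, 0], [0, 10]]))
# No association at all: chi2, V and C are all 0.
print(association([[5, 5], [5, 5]]))
```

The first table illustrates why C is only comparable between tables of equal size: even for perfect association it stays below 1.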