8 Estimation of parameters
The Method of Moments
The Method of Maximum Likelihood
The \(k\)th moment of a probability law is defined as \(\mu_{k}=E(X^{k})\).
The \(k\)th sample moment is defined as\[\hat{\mu}_{k}=\frac{1}{n}\sum_{i=1}^{n}X_{i}^{k}\]
- Find expressions for the low-order moments in terms of the parameters.
- Invert the expressions.
- Replace the moments with the sample moments (see the worked example below).
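A standard illustration of the recipe (assuming, for concreteness, a gamma density with shape \(\alpha\) and rate \(\lambda\)): here \(\mu_{1}=\alpha/\lambda\) and \(\mu_{2}=\alpha(\alpha+1)/\lambda^{2}\), so \(\sigma^{2}=\mu_{2}-\mu_{1}^{2}=\alpha/\lambda^{2}\). Inverting gives \(\lambda=\mu_{1}/\sigma^{2}\) and \(\alpha=\mu_{1}^{2}/\sigma^{2}\), and substituting the sample moments yields
\[\hat{\lambda}=\frac{\bar{X}}{\hat{\sigma}^{2}},\qquad\hat{\alpha}=\frac{\bar{X}^{2}}{\hat{\sigma}^{2}},\qquad\textrm{where }\hat{\sigma}^{2}=\hat{\mu}_{2}-\bar{X}^{2}.\]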
Notation: \(\theta\) denotes the parameter, \(\hat{\theta}\) an estimate of it.
consistent: Let \(\hat{\theta}_n\) be an estimate of a parameter \(\theta\) based on a sample of size \(n\). Then \(\hat{\theta}_n\) is consistent in probability if \(\hat{\theta}_n\) converges in probability to \(\theta\) as \(n\) tends to infinity; that is, for every \(\epsilon>0\),\[P(\left|\hat{\theta}_{n}-\theta\right|>\epsilon)\rightarrow0\] as \(n\rightarrow\infty\).
\(\textrm{lik}(\theta)=f(x_{1},x_{2},\ldots,x_{n}|\theta)\)
gives the probability of observing the given data as a function of the parameter \(\theta\)
The maximum likelihood estimate (mle) of \(\theta\) is that value of \(\theta\) that maximizes the likelihood - that is, makes the observed data "most probable" or "most likely"
If the observations are i.i.d.,
\[\textrm{lik}(\theta)=\prod_{i=1}^{n}f(X_{i}|\theta)\]
log likelihood
\[l(\theta)=\sum_{i=1}^{n}\log[f(X_{i}|\theta)]\]
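A minimal worked example (a Poisson model is assumed purely for illustration): for \(X_{i}\sim\textrm{Poisson}(\lambda)\),
\[l(\lambda)=\sum_{i=1}^{n}\left(X_{i}\log\lambda-\lambda-\log X_{i}!\right),\qquad l'(\lambda)=\frac{1}{\lambda}\sum_{i=1}^{n}X_{i}-n,\]
and setting \(l'(\lambda)=0\) gives the mle \(\hat{\lambda}=\bar{X}\).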
Large Sample Theory for Maximum Likelihood Estimates
\[I(\theta)=E\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^{2}\right]=-E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X|\theta)\right]\]
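Continuing the Poisson illustration: \(\log f(x|\lambda)=x\log\lambda-\lambda-\log x!\), so \(\frac{\partial^{2}}{\partial\lambda^{2}}\log f(x|\lambda)=-x/\lambda^{2}\) and
\[I(\lambda)=-E\left[-\frac{X}{\lambda^{2}}\right]=\frac{\lambda}{\lambda^{2}}=\frac{1}{\lambda}.\]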
Theorem A: Under appropriate smoothness conditions on \(f\), the mle from an i.i.d. sample is consistent.
Lemma A: Under appropriate smoothness conditions on \(f\),
\[E\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^{2}\right]=-E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X|\theta)\right]\]
(so the two expressions for \(I(\theta)\) agree).
Theorem B: Under appropriate smoothness conditions on \(f\), the distribution of
\[\sqrt{nI(\theta_{0})}\,(\hat{\theta}-\theta_{0})\]
tends to a standard normal distribution as \(n\rightarrow\infty\).
Confidence Intervals from Maximum Likelihood Estimates
Exact methods
Approximations based on the large-sample theory above
Bootstrap confidence intervals
\[\hat{\theta}\pm z_{\alpha/2}/\sqrt{nI(\hat{\theta})}\]
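With the Poisson illustration from above (\(\hat{\lambda}=\bar{X}\), \(I(\lambda)=1/\lambda\)), this becomes the interval
\[\bar{X}\pm z_{\alpha/2}\sqrt{\frac{\bar{X}}{n}}.\]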
Let \(\theta_0\) denote the true parameter value, and suppose for the moment that the distribution of \(\Delta=\hat{\theta}-\theta_{0}\) is known.
Let \(\underline{\delta}\) be the \(\alpha/2\) quantile and \(\overline{\delta}\) the \(1-\alpha/2\) quantile of \(\Delta\). Then
\(P(\underline{\delta}\leq\hat{\theta}-\theta_{0}\leq\overline{\delta})=1-\alpha\)
\(\Downarrow\)
\(P(\hat{\theta}-\overline{\delta}\leq\theta_{0}\leq\hat{\theta}-\underline{\delta})=1-\alpha\)
This presupposed that the distribution of \(\Delta=\hat{\theta}-\theta_{0}\) was known, which is typically not the case.
If \(\theta_0\) were known, this distribution could be approximated by simulation: many, many samples of observations could be randomly generated using the true value \(\theta_0\); for each sample, \(\Delta=\hat{\theta}-\theta_{0}\) could be computed.
Since \(\theta_0\) is not known, the bootstrap principle suggests using \(\hat{\theta}\) in its place:
Generate many, many samples (say, B in all) from a distribution with value \(\hat{\theta}\); and for each sample construct an estimate of \(\theta\), say \(\theta_j^*\), j = 1, 2, ..., B. The distribution of \(\Delta=\hat{\theta}-\theta_{0}\) is then approximated by that of \(\theta^*-\hat{\theta}\), the quantiles of which are used to form an approximate confidence interval.
e.g., if B = 1000 and \(\alpha=0.05\), take the 25th largest and the 975th largest of the values.
Note that it is the quantiles of \(\theta^*-\hat{\theta}\) that are used.
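A minimal sketch of this parametric bootstrap recipe, again assuming a Poisson model for concreteness; the function name `bootstrap_ci`, B = 1000, and the simulated data are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(x, B=1000, alpha=0.05):
    """Parametric bootstrap CI for a Poisson mean (illustrative sketch)."""
    n = len(x)
    theta_hat = x.mean()  # mle of lambda for Poisson data
    # Generate B samples from the fitted model and re-estimate each time
    theta_star = np.array([rng.poisson(theta_hat, size=n).mean() for _ in range(B)])
    # Quantiles of theta* - theta_hat approximate those of Delta = theta_hat - theta_0
    delta_lo, delta_hi = np.quantile(theta_star - theta_hat, [alpha / 2, 1 - alpha / 2])
    # Interval (theta_hat - delta_hi, theta_hat - delta_lo)
    return theta_hat - delta_hi, theta_hat - delta_lo

x = rng.poisson(3.0, size=50)  # fake data, for demonstration only
print(bootstrap_ci(x))
```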
The Bayesian Approach to Parameter Estimation
posterior distribution:
\[f_{\Theta|X}(\theta|x)=\frac{f_{X,\Theta}(x,\theta)}{f_{X}(x)}=\frac{f_{X|\Theta}(x|\theta)f_{\Theta}(\theta)}{\int f_{X|\Theta}(x|\theta)f_{\Theta}(\theta)d\theta}\]
prior distribution \(f_{\Theta}(\theta)\): represents what we know about the parameter before we have observed the data X
\(f_{\Theta|X}(\theta|x)\quad\propto\quad f_{X|\Theta}(x|\theta)\quad\times\quad f_{\Theta}(\theta)\)
\(\textrm{Posterior density}\quad\propto\quad\textrm{Likelihood}\quad\times\quad\textrm{Prior density}\)
Further remarks on priors
conjugate prior: if the prior distribution belongs to a family G and, conditional on the parameters of G, the data have a distribution H, then G is said to be conjugate to H if the posterior is in the family G
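The standard example, stated here for illustration: if \(X|\theta\sim\textrm{Bin}(n,\theta)\) and the prior is \(\theta\sim\textrm{Beta}(a,b)\), then
\[f_{\Theta|X}(\theta|x)\propto\theta^{x}(1-\theta)^{n-x}\cdot\theta^{a-1}(1-\theta)^{b-1}=\theta^{x+a-1}(1-\theta)^{n-x+b-1},\]
i.e., the posterior is \(\textrm{Beta}(a+x,\,b+n-x)\); the beta family is conjugate to the binomial.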
Gibbs sampling alternates between holding one parameter fixed and sampling the other from its full conditional distribution, which (in the normal example) is a normal or a gamma distribution, respectively; see the sketch below.
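A compact sketch of such a Gibbs sampler, assuming normal data with unknown mean \(\mu\) and precision \(\tau\), a \(N(m_0,s_0^2)\) prior on \(\mu\) and a Gamma\((a_0,b_0)\) prior on \(\tau\); all variable names, prior values, and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=100)        # fake data
n, xbar = len(x), x.mean()
m0, s0_sq, a0, b0 = 0.0, 100.0, 1.0, 1.0  # illustrative prior values

mu, tau = xbar, 1.0                       # starting values
draws = []
for _ in range(5000):
    # mu | tau, x ~ Normal (conjugate update of a normal mean)
    prec = 1.0 / s0_sq + n * tau
    mean = (m0 / s0_sq + tau * n * xbar) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # tau | mu, x ~ Gamma (conjugate update of a normal precision)
    shape = a0 + n / 2.0
    rate = b0 + 0.5 * np.sum((x - mu) ** 2)
    tau = rng.gamma(shape, 1.0 / rate)    # numpy gamma takes scale = 1/rate
    draws.append((mu, tau))
```

Each full conditional is a standard distribution, which is exactly what makes Gibbs sampling convenient in this setting.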
How to choose which method to use?
It would make sense to choose the method whose estimate is most concentrated around the true parameter value.
One way to measure this is the mean squared error:
\[\textrm{MSE}(\hat{\theta})=E[(\hat{\theta}-\theta_{0})^{2}]=\textrm{Var}(\hat{\theta})+(E(\hat{\theta})-\theta_{0})^{2},\]
i.e., variance plus squared bias.
Given two estimates, \(\hat{\theta}\) and \(\tilde{\theta}\), the efficiency of \(\hat{\theta}\) relative to \(\tilde{\theta}\) is \[\textrm{eff}(\hat{\theta},\tilde{\theta})=\frac{\textrm{Var}(\tilde{\theta})}{\textrm{Var}(\hat{\theta})}\]
In searching for an optimal estimate, we might ask whether there is a lower bound for the MSE of any estimate; such a bound would serve as a benchmark against which to compare estimates.
Cramér-Rao Inequality
Let \(X_1, ..., X_n\) be i.i.d. with density function \(f(x|\theta)\).
Let \(T=t(X_1, ..., X_n)\) be an unbiased estimate of \(\theta\). Then, under smoothness assumptions on \(f(x|\theta)\), \[\textrm{Var}(T)\geq\frac{1}{nI(\theta)}\]
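Continuing the Poisson illustration: \(\bar{X}\) is unbiased for \(\lambda\) with \(\textrm{Var}(\bar{X})=\lambda/n\), and since \(I(\lambda)=1/\lambda\) the bound is \(1/(nI(\lambda))=\lambda/n\); thus \(\bar{X}\) attains the Cramér-Rao lower bound.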
Recall that the mle is asymptotically normal with mean \(\theta_0\) and variance \(\frac{1}{nI(\theta_0)}\), so it asymptotically attains the Cramér-Rao bound.
An Example: The Negative Binomial Distribution
A generalization of the negative binomial distribution:
\[f(x|m,k)=\left(1+\frac{m}{k}\right)^{-k}\frac{\Gamma(k+x)}{x!\Gamma(k)}\left(\frac{m}{m+k}\right)^{x}\]
When applicable?
Suppose that \(\Lambda\) is a random variable following a gamma distribution and that for \(\lambda\), a given value of \(\Lambda\), X follows a Poisson distribution with mean \(\lambda\). It can be shown that the unconditional distribution of X is negative binomial. Thus, for situations in which the rate varies randomly over time or space, the negative binomial distribution might tentatively be considered as a model.
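A sketch of why this works, assuming \(\Lambda\) has a gamma distribution with shape \(k\) and rate \(k/m\) (so that \(E[\Lambda]=m\)):
\[P(X=x)=\int_{0}^{\infty}\frac{e^{-\lambda}\lambda^{x}}{x!}\cdot\frac{(k/m)^{k}}{\Gamma(k)}\lambda^{k-1}e^{-(k/m)\lambda}\,d\lambda=\frac{(k/m)^{k}}{x!\,\Gamma(k)}\cdot\frac{\Gamma(k+x)}{(1+k/m)^{k+x}},\]
which rearranges to the form of \(f(x|m,k)\) above, with mean \(m\).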
...
Sufficiency
Is there a statistic, a function \(T(X_1, ..., X_n)\), that contains all the information in the sample about \(\theta\)?
If so, a reduction of the original data to this statistic without loss of information is possible.
Definition: A statistic \(T(X_1,...,X_n)\) is said to be sufficient for \(\theta\) if the conditional distribution of \(X_1,...,X_n\), given \(T=t\), does not depend on \(\theta\) for any value of \(t\).
A Factorization Theorem
Suppose that \(X_1, \ldots, X_n\) is a sample from a probability distribution with the density or frequency function \(f(x|\theta)\). A necessary and sufficient condition for \(T(X_1,\ldots,X_n)\) to be sufficient for a parameter \(\theta\) is that the joint probability function (density function or frequency function) factors in the form
\[f(x_{1},\ldots,x_{n}|\theta)={\color{OrangeRed}{g[T(x_{1},\ldots,x_{n}),\theta]}}\,{\color{DodgerBlue}{h(x_{1},\ldots,x_{n})}}\]
In other words, given the value of \(T\), which is called a sufficient statistic, we can gain no more knowledge about \(\theta\) from knowing more about the probability distribution of \(X_1, ..., X_n\)
Remember here that the distribution of X actually depends on \(\theta\).
Frequentist: "A 95% confidence interval for \(\theta\) is [1.8, 6.3]"
Bayesian: "After seeing the data, the probability is 95% that \(1.8\leq\theta\leq6.3\)"
vs.
... followed by a long, convoluted explication of the meaning of a confidence interval; \(\theta\) is not a random variable, so it is either inside the interval or not.
i.e., iff we can show that \(f(\mathbf{x}|\theta)\) factors in the form
\[f(\mathbf{x}|\theta)={\color{OrangeRed}{g(T,\theta)}}\,{\color{DodgerBlue}{h(\mathbf{x})}}\]
then we know that \(P(\mathbf{X}|T)\) does not depend on \(\theta\). Note that these are NOT THE SAME statement, even though the theorem makes them equivalent.
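A quick illustration (a Bernoulli model is assumed): for i.i.d. Bernoulli(\(p\)) observations,
\[f(\mathbf{x}|p)=p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}={\color{OrangeRed}{g\!\left(\textstyle\sum x_{i},\,p\right)}}\cdot{\color{DodgerBlue}{1}},\]
so \(T=\sum_{i=1}^{n}X_{i}\) is sufficient for \(p\) (here \(h(\mathbf{x})\equiv1\)).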
Exponential family of probability distributions
normal, binomial, Poisson, gamma, ...
\[f(x|\theta)=\begin{cases}\exp\left[c(\theta)T(x)+d(\theta)+S(x)\right], & x\in A\\ 0, & x\notin A\end{cases}\]
where the set \(A\) does not depend on \(\theta\)
A study of the properties of probability distributions that have sufficient statistics of the same dimension as the parameter space regardless of sample size led to the development of the exponential family of probability distributions.
For an i.i.d. sample, \(\sum_{i=1}^{n}T(X_{i})\) is a sufficient statistic.
The above is the one-parameter form; the k-parameter form is
\[f(x|\theta)=\exp\left[\sum_{i=1}^{k}c_{i}(\theta)T_{i}(x)+d(\theta)+S(x)\right],\qquad x\in A\]
and 0 otherwise, where the set \(A\) does not depend on \(\theta\).
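For example (the Poisson case, for illustration):
\[f(x|\lambda)=\frac{\lambda^{x}e^{-\lambda}}{x!}=\exp\left[x\log\lambda-\lambda-\log x!\right],\qquad x\in\{0,1,2,\ldots\},\]
so \(c(\lambda)=\log\lambda\), \(T(x)=x\), \(d(\lambda)=-\lambda\), \(S(x)=-\log x!\), and \(\sum_{i=1}^{n}X_{i}\) is sufficient.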
Proof (that the mle depends on the data only through a sufficient statistic \(T\)): From the Factorization Theorem, the likelihood is \(g(T, \theta)h(\mathbf{x})\), which depends on \(\theta\) only through \(T\). To maximize this quantity, we need only maximize \(g(T,\theta)\).
(since the factorization condition is an if-and-only-if)
The Rao-Blackwell Theorem
Let \(\hat{\theta}\) be an estimator of \(\theta\) with \(E(\hat{\theta}^2)<\infty\) for all \(\theta\). Suppose that \(T\) is sufficient for \(\theta\), and let \(\tilde{\theta}=E(\hat{\theta}|T)\). Then, for all \(\theta\),\[E(\tilde{\theta}-\theta)^{2}\leq E(\hat{\theta}-\theta)^{2}\]The inequality is strict unless \(\hat{\theta}=\tilde{\theta}\)
gives a quantitative rationale for basing an estimator of a parameter \(\theta\) on a sufficient statistic if one exists
If an estimator is not a function of a sufficient statistic, it can be improved.
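A standard illustration: with i.i.d. Bernoulli(\(p\)) data, \(\hat{\theta}=X_{1}\) is an unbiased but crude estimate of \(p\); conditioning on the sufficient statistic \(T=\sum_{i=1}^{n}X_{i}\) gives, by symmetry,
\[\tilde{\theta}=E(X_{1}|T)=\frac{T}{n}=\bar{X},\]
which has smaller variance (\(p(1-p)/n\) versus \(p(1-p)\)).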
The confidence interval\[(\hat{\theta}-\overline{\delta},\hat{\theta}-\underline{\delta})\]may look a bit odd, but it is correct.
(With \(\underline{\theta}\) and \(\overline{\theta}\) denoting the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the bootstrap values \(\theta^*\), it becomes \[(2\hat{\theta}-\overline{\theta},2\hat{\theta}-\underline{\theta})\])