
Statistical Inference (5): EM Algorithm

1. EM-ML Algorithm

  • Formulation

    • complete data: $z = [y, w]$
    • observation: $y$
    • hidden variable: $w$
    • parameter to be estimated: $x$

  • Derivation

We would like to obtain the ML estimate, but in practice $p(y;x)$ may be difficult to work with directly (for example, in a Gaussian mixture model the likelihood is a product of sums, so the logarithm does not decouple):

$$\hat{x}_{\mathrm{ML}}(y) = \arg\max_x \ln p(y;x)$$
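
As a concrete illustration (an added example, assuming a $K$-component Gaussian mixture with parameters $x = \{\pi_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$): with observations $y = (y_1,\dots,y_n)$ and hidden component labels $w = (w_1,\dots,w_n)$, $w_i \in \{1,\dots,K\}$, the incomplete-data log-likelihood has a sum inside the logarithm and does not decouple across components:

$$\ln p(y;x) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(y_i;\, \mu_k, \sigma_k^2)$$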

Introduce the complete data $z = [y, w]$ and let $y = g(z)$. Then

$$p(z;x) = \sum_y p(z \mid y;x)\, p(y;x) = p(z \mid g(z);x)\, p(g(z);x)$$
$$\hat{x}_{\mathrm{ML}}(y) = \arg\max_x \big[ \ln p(z;x) - \ln p(z \mid y;x) \big]$$
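
In the assumed mixture example, this is why introducing $z = [y,w]$ helps: the complete-data log-likelihood has no sum inside the logarithm,

$$\ln p(z;x) = \sum_{i=1}^{n} \big( \ln \pi_{w_i} + \ln \mathcal{N}(y_i;\, \mu_{w_i}, \sigma_{w_i}^2) \big),$$

and is easy to maximize over $x$ once the labels $w_i$ (or their posterior probabilities) are available.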

Because $\hat{x}_{\mathrm{ML}}(y)$ does not depend on $z$, we may take the expectation of the right-hand side with respect to $p(z \mid y; x')$ for any reference parameter $x'$:

$$\begin{aligned}
\ln p(y;x) &= \mathbb{E}_{p(z \mid y;x')}\!\left[ \ln p(z;x) \right] - \mathbb{E}_{p(z \mid y;x')}\!\left[ \ln p(z \mid y;x) \right] \\
&\triangleq U(x,x') + V(x,x')
\end{aligned}$$

where $U(x,x') = \mathbb{E}_{p(z \mid y;x')}[\ln p(z;x)]$ and $V(x,x') = -\mathbb{E}_{p(z \mid y;x')}[\ln p(z \mid y;x)]$. By Gibbs' inequality, $V(x,x') \ge V(x',x')$, with equality iff $p(z \mid y;x) = p(z \mid y;x')$; hence any $x$ with $U(x,x') \ge U(x',x')$ also satisfies $\ln p(y;x) \ge \ln p(y;x')$. (The earlier tip about expressing $t_k(y)$ with a matrix of 0/1 entries applies here as well, and likewise to the constraints on $\hat{p}_z$ introduced later.)
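
The decomposition gives the EM iteration $x^{(k+1)} = \arg\max_x U(x, x^{(k)})$: the E-step evaluates the expectation under $p(z \mid y; x^{(k)})$ that defines $U(\cdot, x^{(k)})$, and the M-step maximizes it over $x$. Below is a minimal numerical sketch of this iteration for a 1-D Gaussian mixture; it is an added illustration, and the function name em_gmm and all initialization choices are assumptions, not part of the original notes.

```python
import numpy as np

def em_gmm(y, K=2, n_iter=100, seed=0):
    """Minimal EM sketch for a 1-D K-component Gaussian mixture (illustrative)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Initialize x = (weights pi, means mu, variances var); choices are arbitrary.
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(y, size=K, replace=False)
    var = np.full(K, y.var() + 1e-6)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(w_i = k | y_i; x'),
        # i.e. the expectation under p(z | y; x') that defines U(x, x').
        log_comp = -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[:, None] - mu) ** 2 / var
        log_r = np.log(pi) + log_comp
        log_r -= log_r.max(axis=1, keepdims=True)   # for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form maximizer of U(x, x') for this model.
        nk = r.sum(axis=0)                           # effective counts per component
        pi = nk / n
        mu = (r * y[:, None]).sum(axis=0) / nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6

    return pi, mu, var

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 700)])
    print(em_gmm(data, K=2))
```

Each iteration increases $U(x, x^{(k)})$ and therefore, by the inequality above, cannot decrease $\ln p(y;x)$.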

  • Finding the ML estimate is finding a reverse I-projection (reverse I-proj), which is useful later for understanding the alternating projections view of the EM algorithm.
  • 4. EM-ML Alternating projections

    According to property 2 in #3, an expression for the ML estimate can be obtained, but it is too complex. Consider the

    DPI (data processing inequality): for $y = g(z)$,

    $$D\big(p(z) \,\|\, q(z)\big) \ge D\big(p(y) \,\|\, q(y)\big)$$
    $$\text{"="} \iff \frac{p_z(z)}{q_z(z)} = \frac{p_y(g(z))}{q_y(g(z))} \quad \forall z$$

    Therefore, by equation (12), in order to minimize $D(\hat{p}_y(\cdot;\boldsymbol{y}) \,\|\, p(y;x))$ we can consider minimizing $D(\hat{p}_z(\cdot;\boldsymbol{z}) \,\|\, p(z;x))$ instead.

    The expression for $p(y;x)$ is likely to be complex, whereas $p(z;x)$ can often be simplified a great deal.
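
    (For readability, the identity presumably referenced as equation (12) in the omitted part of these notes is the standard fact that, for i.i.d. observations, ML estimation is divergence minimization from the empirical distribution:)

    $$\hat{x}_{\mathrm{ML}}(y) = \arg\max_x \sum_b \hat{p}_y(b;\boldsymbol{y}) \ln p(b;x) = \arg\min_x D\big(\hat{p}_y(\cdot;\boldsymbol{y}) \,\|\, p(\cdot;x)\big)$$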

    That is, maximum likelihood is transformed into the minimization

    $$\min_{\hat{p}_z \in \mathcal{P}^Z(y),\; x} D\big(\hat{p}_z(\cdot;z) \,\|\, p(\cdot;x)\big)$$

    over the linear family

    $$\mathcal{P}^Z(y) \triangleq \left\{ \hat{p}_Z(\cdot) : \sum_{c:\, g(c)=b} \hat{p}_z(c) = \hat{p}_y(b;\boldsymbol{y}) \quad \forall b \in \mathcal{Y} \right\}$$

    Remarks: both distributions must be considered in this minimization:

    1. Apart from the linear-family constraint above (refer to the reverse I-projection in #3), $\hat{p}_z$ can be chosen freely, so we optimize over $\hat{p}_z$ to minimize the divergence;
    2. $p(\cdot;x)$ is what we actually want (we need an $x$ that maximizes the likelihood of the observation $y$), so we must also optimize over $p(\cdot;x)$ to minimize the divergence.
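
    Putting the two together (a sketch under the notation above; the explicit E-step formula is not spelled out in this excerpt), EM alternates between:

    1. E-step: project $p(\cdot;x^{(k)})$ onto the linear family $\mathcal{P}^Z(y)$ (an I-projection). The choice

    $$\hat{p}_z^{(k)}(c) = \hat{p}_y\big(g(c);\boldsymbol{y}\big)\, p\big(c \mid y = g(c); x^{(k)}\big)$$

    satisfies the constraint defining $\mathcal{P}^Z(y)$ and meets the DPI equality condition, so $D(\hat{p}_z^{(k)} \,\|\, p(\cdot;x^{(k)})) = D(\hat{p}_y(\cdot;\boldsymbol{y}) \,\|\, p(y;x^{(k)}))$;

    2. M-step: a reverse I-projection onto the model family,

    $$x^{(k+1)} = \arg\min_x D\big(\hat{p}_z^{(k)} \,\|\, p(\cdot;x)\big) = \arg\max_x \sum_c \hat{p}_z^{(k)}(c) \ln p(c;x).$$

    Neither step can increase the divergence, which is the alternating projections picture of EM-ML.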


