Text Classification and Naïve Bayes - Coggle Diagram

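- - - - At test time, P(d|c) = P(<tf_{1,d}, …, tf_{M,d}> | c)
        
        Expanding this with the chain rule is too complex.
        
        Assumption 1, conditional independence: terms are independent of one another given the class.
        
        That is still not enough: there are too many parameters, because document length is unbounded.
        -> Assumption 2, positional independence
        
        The same term uses the same probability regardless of its position.
        
        Result: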
      - M-dimensional vector
      - Definition:
        the probability that, in one experiment, outcome Y1 occurs y1 times
    - - P(d|c) = P( <e1, …, eM> | c)
      - Binary: 1 or 0
        
        Makes the independence assumption.
        Positional assumption: already taken care of by using the BIM (binary independence model)
      - Training
        
        Counts are taken over documents.
        
        Numerator: the number of training documents in c containing term i
        
        Denominator: the number of training documents in class c
        
        Terms that do not occur in the document also affect the result.
      - Ignores how many times a token occurs
  - - - Numerator: T_{c,t_k}, the number of occurrences of term t_k in class c
        Denominator: the total number of tokens in that class
    - - Problem 1: floating point underflow
        
        Fix: take logs
      - Problem 2: zero probability
        
        A term may never appear in some class during training, yet it may appear at test time -> probability 0
        
        Fix: add-1 smoothing
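
The counting, log, and add-1 smoothing notes above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names `train_multinomial_nb` and `classify`, the toy documents, and the choice to skip test tokens outside the training vocabulary are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs):
    """docs: list of (tokens, label) pairs. Returns (log_prior, log_cond, vocab)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    doc_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)  # term_counts[c][t] plays the role of T_ct
    for tokens, label in docs:
        term_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_cond = {}
    for c in doc_counts:
        # Denominator: total tokens in class c, plus |V| for add-1 smoothing.
        denom = sum(term_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / denom) for t in vocab}
    return log_prior, log_cond, vocab


def classify(tokens, log_prior, log_cond, vocab):
    """Pick the class with the highest summed log score (avoids underflow)."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokens:
            if t in vocab:  # assumption: unseen vocabulary is simply skipped
                score += log_cond[c][t]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
training = [
    (["Chinese", "Beijing", "Chinese"], "china"),
    (["Chinese", "Chinese", "Shanghai"], "china"),
    (["Chinese", "Macao"], "china"),
    (["Tokyo", "Japan", "Chinese"], "japan"),
]
log_prior, log_cond, vocab = train_multinomial_nb(training)
predicted = classify(["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"],
                     log_prior, log_cond, vocab)
```

Summing logs instead of multiplying probabilities keeps long documents from underflowing, and the `+ 1` in the numerator keeps a term unseen in one class from forcing that class's score to zero.
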
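- - - - Tests the independence of two random variables
      - The two random variables are:
        
        whether the document uses the term
        
        which class the document belongs to
        
        Closer to 0: the independence assumption is reasonable
        <->
        Larger: the independence assumption is not reasonable
        
        A large value may mark either a positive indicator or a negative indicator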
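    - - Compares the difference between the two cases; easier to interpret
      - It is a single number: the likelihood ratio of two hypotheses
      - Hypotheses
        
        Hypothesis 1. P(t|c) = P(t|not c) = p_t (independent)
        
        Hypothesis 2. P(t|c) = p_1 ≠ p_2 = P(t|not c) (not independent)
      - Treat the counts as binomial experiments
        
        -2 log λ = -2 log(L(H1) / L(H2))
        
        A large value may mark either a positive indicator or a negative indicator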
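    - - Finds collocations (special usages of, and relations between, terms)
        
        Finds the relation between a term and a class
      - If t and c are independent
        -> 0
      - Often picks out rare words (undesirable)
        
        Mostly finds terms that are independent
        
        Improvement: factor in term frequency
        
        count_{t,c} × I(t,c)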
    - - Besides the 11 case, the 10, 01, and 00 cases are also counted
      - Never negative
        
        But negative indicators still occur
  - - - High or low frequency is not necessarily representative
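
As a rough sketch of the measures noted above, the following computes the chi-square statistic, the -2 log λ likelihood ratio, and the expected mutual information from a 2×2 term/class table. The cell naming (n11 = documents that contain the term and are in the class, n10 = contain the term but are not in the class, n01 = lack the term but are in the class, n00 = neither) is an assumption that follows the "11, 10, 01, 00" wording in the notes; the function names are hypothetical, and every row and column total is assumed to be non-zero.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table; near 0 when term and class look independent."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den


def _log_lik(k, n, p):
    """log of p^k * (1-p)^(n-k); the binomial coefficient cancels in the ratio."""
    out = 0.0
    if k > 0:
        out += k * math.log(p)
    if n - k > 0:
        out += (n - k) * math.log(1 - p)
    return out


def log_likelihood_ratio(n11, n10, n01, n00):
    """-2 log(L(H1) / L(H2)), H1: one shared p_t, H2: separate p_1 and p_2."""
    n_c, n_not_c = n11 + n01, n10 + n00
    p_t = (n11 + n10) / (n_c + n_not_c)
    p_1, p_2 = n11 / n_c, n10 / n_not_c
    h1 = _log_lik(n11, n_c, p_t) + _log_lik(n10, n_not_c, p_t)
    h2 = _log_lik(n11, n_c, p_1) + _log_lik(n10, n_not_c, p_2)
    return -2 * (h1 - h2)


def expected_mutual_information(n11, n10, n01, n00):
    """Sums over all four cells (11, 10, 01, 00), so the result is never negative."""
    n = n11 + n10 + n01 + n00
    cells = [  # (cell count, term marginal, class marginal)
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    return sum((k / n) * math.log2(n * k / (row * col))
               for k, row, col in cells if k > 0)
```

All three scores are symmetric in direction: a term that is unusually absent from a class scores as high as one that is unusually present, which is why a large value can mark a negative indicator as well as a positive one.
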
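- - - - Considers how classification performs across large and small classes (data scale)
    - - Pool the matrix data from all classes together, then compute
      - Large classes swamp small classes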
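
The pooling effect can be illustrated with a small sketch. It assumes the notes refer to what is usually called micro-averaging (pool the per-class counts, then compute) as opposed to macro-averaging (compute per class, then average); the function names and the counts below are hypothetical.

```python
def macro_precision(per_class):
    """Average of per-class precision; every class counts equally."""
    precisions = [tp / (tp + fp) for tp, fp in per_class.values()]
    return sum(precisions) / len(precisions)


def micro_precision(per_class):
    """Precision over pooled counts; large classes dominate the result."""
    total_tp = sum(tp for tp, _ in per_class.values())
    total_fp = sum(fp for _, fp in per_class.values())
    return total_tp / (total_tp + total_fp)


# Hypothetical (true positives, false positives) per class.
counts = {
    "small_class": (1, 9),      # precision 0.1
    "large_class": (900, 100),  # precision 0.9
}
macro = macro_precision(counts)
micro = micro_precision(counts)
```

With these counts the macro average sits halfway between the two classes, while the pooled figure stays close to the large class's precision, so the small class's poor result is almost invisible.
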