NLP (:straight_ruler:metrics (:recycle:Multiclass (:two:average=micro (…

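- - - - :explode:Each class is computed on its own
    - - :explode:Accounts for the number of samples in each class; all classes are computed together
      - :icecream:For example, micro precision is computed over all predicted labels: it is not the fraction of correct guesses for one class, but the fraction of correct guesses overall
    - - :explode:Every class matters equally: compute precision, recall, and F1 per class, then take the arithmetic mean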
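The difference between the averaging modes noted above can be sketched in plain Python. This is a minimal illustration, assuming single-label multiclass data and assuming the three notes correspond to per-class, micro, and macro averaging. The helper name `precision_scores`, the toy labels, and the choice to score a never-predicted class as 0.0 are all assumptions made for the example.

```python
from collections import Counter


def precision_scores(y_true, y_pred):
    """Return (per_class, micro, macro) precision for single-label multiclass data."""
    classes = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)  # how many times each class was predicted
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)  # true positives per class

    # Per class: each class is scored on its own.
    # Assumption: a class that is never predicted gets precision 0.0.
    per_class = {
        c: (correct[c] / predicted[c] if predicted[c] else 0.0) for c in classes
    }
    # Micro: pool all predictions, so large classes dominate the score.
    micro = sum(correct.values()) / len(y_pred)
    # Macro: arithmetic mean of the per-class scores, so every class counts equally.
    macro = sum(per_class.values()) / len(classes)
    return per_class, micro, macro


# Hypothetical toy data: class "a" is large, "b" and "c" are rare.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "c", "c"]
per_class, micro, macro = precision_scores(y_true, y_pred)
print(per_class, micro, macro)
```

On this toy data the micro score (5 correct out of 6 predictions) is noticeably higher than the macro score, because the never-predicted class "b" pulls the macro mean down while barely affecting the pooled count.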
- - - - :dagger_knife:methods
        
        :three:TF-IDF of a word
        
        :heavy_dollar_sign:extract TF-IDF features from text
        
        :one:TF matrix
        
        :link:CountVectorizer
        
        :icecream:vocabulary: {"red", "apple", "banana"}, and the document is "The apple and that apple are red.", so the TF matrix is [1, 2, 0]
        
        :explode:Given the vocabulary of the whole corpus, the TF matrix for one document is a vector of length vocab_size in which each element is the number of times that word occurs in the document.
        
        :pencil2:implementation
        
        :link:tf_idf.py
        
        :maple_leaf:if there are many documents: importance of a word = tf-idf(one word, one document)
        
        :pencil2:formula
        
        :three: \( tf\_idf(t, d, D) = tf(t, d) * idf(t, D)\)
        
        :star:Both TF and IDF have many definitions; see the tf-idf Wikipedia article for details
        
        :two:idf(t, D)
        
        :two:log(total number of documents / (1 + number of documents containing word t)) :star:
        
        :one:log(total number of documents / number of documents containing word t)
        
        :one:tf(t, d)
        
        :three:1 if word t occurs in the document, otherwise 0
        
        :three:log(1 + number of times word t occurs in the document)
        
        :one:number of times word t occurs in the document
        
        :two:number of times word t occurs in the document / total number of words in the document :star:
        
        :explode:Raises the weight of rare words, so the score better reflects how important each word is
        
        :two:Frequency with which a word occurs
        
        :one:Whether a word occurs
    - - :warning:A feature in the training set is a {word: scalar} dictionary
    - - :key:[<feature,label>,<feature,label>,...]
        
        :tornado:In NLP sentiment analysis, a sentence / list of tokens is the feature, and positive/negative is the label
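The notes above (the TF count vector, the starred tf and idf variants, the three ways of scoring a word, and the [<feature,label>, ...] layout) can be sketched together in plain Python. This is only an illustration under stated assumptions, not the contents of tf_idf.py: the function names, the regex tokenizer, the extra two sentences, and the sentiment labels are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def count_vector(tokens, vocabulary):
    # TF count vector: one entry per vocabulary word, in vocabulary order.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def tf(word, tokens):
    # Starred variant: occurrences of the word / total words in the document.
    return tokens.count(word) / len(tokens)


def idf(word, corpus):
    # Starred variant: log(N / (1 + number of documents containing the word)).
    doc_freq = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / (1 + doc_freq))


def features(tokens, corpus, method):
    # Build a {word: scalar} feature dictionary with one of the three methods.
    words = set(tokens)
    if method == "presence":
        return {w: 1 for w in words}
    if method == "frequency":
        return {w: tf(w, tokens) for w in words}
    if method == "tfidf":
        return {w: tf(w, tokens) * idf(w, corpus) for w in words}
    raise ValueError(f"unknown method: {method}")


# Hypothetical labeled sentences; only the first one comes from the notes.
labeled_texts = [
    ("The apple and that apple are red.", "positive"),
    ("That banana is bad.", "negative"),
    ("The banana is red.", "positive"),
]
corpus = [tokenize(text) for text, _ in labeled_texts]

# Count vector for the example document against a fixed vocabulary order.
vector = count_vector(corpus[0], ["red", "apple", "banana"])

# Training set in the [(feature, label), ...] layout.
train_set = [
    (features(tokens, corpus, "tfidf"), label)
    for tokens, (_, label) in zip(corpus, labeled_texts)
]
print(vector)
print(train_set[0])
```

With the 1 + document-count smoothing, a word that occurs in two of the three documents (such as "the" here) gets an idf of log(3/3) = 0, and a word that occurs in every document would get a negative score, which is worth keeping in mind for very small corpora.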
- - - - :dagger_knife:nltk.download()
        
        :tornado:brown
        
        :moneybag:categories
        
        :moneybag:words
        
        Installing NLTK Data
- - - - :one:Peking University (PKU) part-of-speech tag set
      - :two:Penn Treebank part-of-speech tag set