Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization
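Abstract and Introduction: what this paper does
Our own goal is to link events to the Gene Ontology (event normalization).
The following questions need to be clarified first:
What is an event? How is it extracted and stored (which event extraction methods exist, and what do the extracted events look like)?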
What is gene ontology?
How to connect those 2 things? Are there any methods of connection whether in Bio or general for inspiration?
combine event extraction with gene normalization (originally independent research areas), and discover how event extraction can be augmented by gene normalization
combine 2 state-of-the-art systems from the BioNLP task and the BioCreative challenge
broaden the normalization scope (multi-layer: gene, protein, and even gene family)
integrating a released gene family assignment algorithm
present a novel normalization algorithm
run analysis on PubMed abstracts and PubMed Central open access full texts
Materials and Method
Text pre-processing:
Unicode-to-ASCII mapping;
annotate all information extracted from full-text articles with the specific section it was retrieved from, such as "introduction", "abstract", "our approach"...
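The Unicode-to-ASCII step can be sketched as below. This is only a minimal illustration: the notes do not say which mapping table the authors used, so the `EXTRA_MAP` table and the function name `to_ascii` are assumptions made for this example.

```python
import unicodedata

# Hypothetical hand-written table for characters that NFKD cannot decompose
# into ASCII (e.g. Greek letters that are common in gene names).
EXTRA_MAP = {
    "α": "alpha",
    "β": "beta",
    "γ": "gamma",
    "κ": "kappa",
    "ß": "ss",
}


def to_ascii(text: str) -> str:
    """Map Unicode text to plain ASCII, dropping anything left unmapped."""
    # Replace characters from the table first.
    replaced = "".join(EXTRA_MAP.get(ch, ch) for ch in text)
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # errors="ignore" drops combining marks and any remaining non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For instance, `to_ascii("NF-κB")` gives `"NF-kappaB"`, which keeps the mention readable for downstream tools that expect ASCII input.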
Entity recognition
BANNER is used. Its characteristic is that it identifies proteins and genes, but does not differentiate between these entity types nor involve any form of normalization.
Event extraction
The tool used is TEES.
The dataset is brought up to date with the latest results from the PubMed abstracts of 2011 and 2012, doubling its size, so TEES is a good fit.
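Gene normalization (this paper is mainly an application paper about the EVEX database, and apparently the paper for the second version of EVEX; it therefore explains that the old version of EVEX only used canonical forms and family assignment for normalization, while the new version adds gene normalization)
Canonical forms (these appear to be surface-level features of gene and protein names)
They remove spelling variation in gene symbols as well as prefix and suffix issues; see the ESR example in the paper.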
does not resolve synonymy or the species of gene mentions, so that it does not alone allow for a reliable mapping to database identifiers.
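A rough sketch of what such a canonicalization could look like is given below. The affix lists and the function name `canonical_form` are hypothetical; the actual EVEX rules are described in the paper, not in these notes.

```python
import re

# Hypothetical affix lists; the real lists used by EVEX are not given here.
PREFIXES = ("human", "mouse", "rat")
SUFFIXES = ("gene", "protein", "mrna")


def canonical_form(mention: str) -> str:
    """Reduce a gene/protein mention to a lowercase alphanumeric symbol."""
    # Lowercase and split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", mention.lower())
    # Strip affix tokens from both ends, always keeping at least one token.
    while len(tokens) > 1 and tokens[0] in PREFIXES:
        tokens.pop(0)
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return "".join(tokens)
```

With these assumed rules, illustrative inputs such as "human Esr-1 gene", "ESR1", and "Esr 1 protein" all collapse to "esr1". As noted above, this says nothing about species or synonyms, so it cannot map to a database identifier on its own.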
Family assignment
The reasoning behind the family-based assignment is that homologous genes, which evolved from a common ancestor, often still exhibit similar functional behavior and are consequently given similar names, which are hard to distinguish in the biological literature.
The old version of EVEX assigned gene families based on the canonical form alone (see the paper for the disambiguation), but this worked poorly, because genes belonging to different gene families may well share the same gene symbol (i.e. canonical form) in text.
Gene normalization
The GenNorm normalization algorithm is used to assign organism-specific Entrez Gene identifiers. ("Organism" seems to refer to the species the gene or protein belongs to: for the gene MEC1, which also encodes a protein, the organism is Saccharomyces cerevisiae, brewer's yeast. Entrez is also a structured biological database; the entry for MEC1 is https://www.ncbi.nlm.nih.gov/gene/852433/)
GenNorm is a method for cross-species gene normalization.
Combination of these normalization methods: this combination is in fact the novel normalization algorithm proposed by the authors. What is its goal? To assign unique identifiers.
Results and Discussion
Extraction Results
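The extraction results, including how many events and genes were extracted and how many were normalized, are shown in Table 3.
This part describes the event normalization method, which appears to be the combined normalization procedure plus a definition of event equality. Event equality is defined in two ways: (1) same event structure + same gene IDs of the arguments; (2) same event structure + same gene family IDs of the arguments. These are only two criteria for deciding whether events are identical, however; how events are then normalized to a specific database is not clearly explained in this part.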
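The two equality criteria above can be illustrated with a small sketch. It assumes flat events (no nested events as arguments) and uses made-up class and field names; it is not the EVEX data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Argument:
    role: str                 # e.g. "Theme" or "Cause"
    gene_id: Optional[str]    # gene identifier, if one was assigned
    family_id: Optional[str]  # gene family identifier, if one was assigned


@dataclass(frozen=True)
class Event:
    event_type: str           # e.g. "Regulation"
    args: Tuple[Argument, ...]


def event_key(event: Event, level: str):
    """Return a hashable key; two events are equal at `level` if keys match.

    level is "gene" (compare gene IDs) or "family" (compare family IDs).
    """
    ids = []
    for arg in event.args:
        ident = arg.gene_id if level == "gene" else arg.family_id
        if ident is None:
            return None  # an unnormalized argument: no key can be built
        ids.append((arg.role, ident))
    # Sorting makes the key independent of argument order.
    return (event.event_type, tuple(sorted(ids)))
```

Grouping events by `event_key(e, "gene")` merges mentions of the same event on the same genes, while `event_key(e, "family")` gives the coarser, homology-based grouping.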
Abstract vs full texts
This section only demonstrates that a wealth of information exists in full texts, not only in abstracts.
Event extraction performance
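TEES had previously been shown to perform well on the BioNLP data; this paper runs further experiments showing that TEES also works on the more general data used here.
Event extraction in this paper is limited to single sentences. Reportedly, the amount of inter-sentence events is between 6-9% of all data, so future work will use coreference resolution to address this limitation.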
Performance of assigning gene identifiers
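First, GenNorm performed very well in the BioCreative III challenge, so it is a sound method.
However, to evaluate GenNorm combined with the canonical form and family methods (i.e. Combination 1 and 2 in Figure 2 of the paper), the authors ran further experiments on BioCreative III and found that the results were worse than using GenNorm alone. Why?
The scope of the entity recognition tool BANNER is too broad, which makes gene normalization harder (the scope of normalization does not fully cover what BANNER recognizes).
The main reason: the event pipeline used in this paper cannot process tables and figures, which contain many genes that can be annotated when GenNorm is used alone. This causes a large number of false negatives, so future work is to see whether the inability to process tables and figures can be resolved.
Next, the effect of the combination methods on gene normalization is analyzed. Note that the paper appears to evaluate two combination structures. In Table 4, "Canonical" refers to the updated canonicalization algorithm using the taxonomic assignments of GenNorm (Combination 1 in Figure 2), whereas "Canonical + GenNorm" uses the fallback mechanism (something about priority and fallback, which I did not fully understand). Gene families have three different definitions (HomoloGene, Ensembl, Ensembl Genomes), which gives three variants of Combination 2; the paper runs experiments on each.
The conclusion is that Canonical performs well, and each family definition has its own characteristics.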
Performance of assigning gene families
The original EVEX method is also used here, combined with GenNorm (one combination, the adapted version, is the EVEX method using the taxonomic identifiers; the other uses GenNorm to assign families and EVEX as fallback).
The experimental results are described and analyzed. The paper also notes that although GenNorm is good, it sometimes fails, and in those cases EVEX works well as a fallback, which indirectly shows the benefit of the combination approach.
Which data are used? All methods are run on all PubMed abstracts and PubMed Central open access full texts, resulting in a unique dataset for text mining.
The canonical forms provide a powerful way to query textual representations of events through symbol search, dealing with lexical variation of gene symbols. So is the canonical form meant for querying EVEX?
Combination 3: for unique family ID
If the GenNorm algorithm produces a unique gene identifier:
query the EVEX database to determine whether the gene ID assigned by GenNorm is linked to a known gene family, and assign that family accordingly.
If GenNorm fails to produce a unique gene identifier (multiple gene identifiers):
apply an adapted version of the original symbol-based family assignment as fallback: taking into account both the canonical form of the gene mention and the organism assigned by GenNorm, assign a family that contains at least one gene of the specified organism (which indirectly confirms the multiple-identifier case). When there are multiple candidate families, the family is picked that contains the most genes with this specific canonical form as synonym.
If there is no candidate family matching the organism assignment (and there is no ID assigned by GenNorm):
the original symbol-based algorithm is applied.
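The cases above can be sketched as a single fallback chain. The data structures, parameter names, and the choice to fall through when a unique gene ID has no linked family are assumptions made for illustration. Requiring a candidate family to contain the canonical form as a synonym is also my reading of the notes, not a confirmed detail.

```python
from typing import Callable, Dict, List, Optional


def assign_family(
    canonical: str,
    gennorm_ids: List[str],
    organism: Optional[str],
    family_of_gene: Dict[str, str],
    family_members: Dict[str, List[dict]],
    symbol_fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Assign a gene family to a mention using three successive strategies.

    family_members maps a family ID to its genes; each gene is a dict with
    an "organism" string and a "synonyms" set of canonical forms.
    """
    # Case 1: GenNorm gave a unique gene ID that is linked to a family.
    if len(gennorm_ids) == 1:
        family = family_of_gene.get(gennorm_ids[0])
        if family is not None:
            return family
    # Case 2: pick the family that has a gene of the assigned organism and
    # the most genes carrying this canonical form as a synonym.
    if organism is not None:
        best, best_count = None, 0
        for family, genes in family_members.items():
            has_organism = any(g["organism"] == organism for g in genes)
            count = sum(1 for g in genes if canonical in g["synonyms"])
            if has_organism and count > best_count:
                best, best_count = family, count  # ties: first family wins
        if best is not None:
            return best
    # Case 3: fall back to the original symbol-only assignment.
    return symbol_fallback(canonical)
```

Passing the original symbol-based algorithm in as `symbol_fallback` keeps the sketch independent of how that older method works internally.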
Does the EVEX database have family data for normalized gene IDs? Yes: Ensembl, HomoloGene, Ensembl Genomes.
Combination 2: for unique species and gene ID
due to the inter-species ambiguity of a specific gene symbol, GenNorm may assign multiple gene ids to a mention
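The fallback procedure of the family assignment (the second case of Combination 3) will assign a family (this family takes the canonical form into account). When this family contains a gene of the organism determined by GenNorm, the final ID can be selected from the multiple IDs.
This disambiguation step seems far-fetched. First question: has GenNorm already used the canonical form when assigning gene IDs? If so, the procedure is circular. Second question: what if the family still contains several of these multiple IDs? Then the method fails.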
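The disambiguation step, including the failure case raised in the second question, can be sketched as follows. The lookup tables and the function name `resolve_gene_id` are hypothetical; the sketch only mirrors the description in these notes.

```python
from typing import Dict, List, Optional


def resolve_gene_id(
    candidate_ids: List[str],
    organism: str,
    family: str,
    organism_of_gene: Dict[str, str],
    family_of_gene: Dict[str, str],
) -> Optional[str]:
    """Pick one ID among GenNorm's candidates using family and organism."""
    # Keep only candidates that belong to the assigned family and to the
    # organism determined by GenNorm.
    matches = [
        gid
        for gid in candidate_ids
        if family_of_gene.get(gid) == family
        and organism_of_gene.get(gid) == organism
    ]
    # If zero or several candidates survive, the mention stays ambiguous.
    return matches[0] if len(matches) == 1 else None
```

When the family contains two candidate genes of the same organism, the function returns `None`, which is exactly the situation in which this step cannot help.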
Inter-species ambiguity: one name, abbreviation or code may refer to genes in multiple species, each with its own unique ID, or even to multiple genes in the same species or across different species.
Combination 1: for unique gene and species ID
a significant proportion of gene symbols are unique within one species and can thus be assigned based solely on the canonical symbol and species.
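A lookup keyed on (canonical symbol, species) makes this concrete. The record layout and function names below are assumptions for illustration, not the actual EVEX or Entrez Gene schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

Record = Tuple[str, str, List[str]]  # (gene_id, organism, canonical symbols)


def build_index(records: Iterable[Record]) -> Dict[Tuple[str, str], Set[str]]:
    """Index gene IDs by (canonical symbol, organism)."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for gene_id, organism, symbols in records:
        for symbol in symbols:
            index[(symbol, organism)].add(gene_id)
    return index


def lookup_unique(index, canonical: str, organism: str) -> Optional[str]:
    """Return the gene ID only if the symbol is unique within the species."""
    ids = index.get((canonical, organism), set())
    return next(iter(ids)) if len(ids) == 1 else None
```

Symbols shared by several genes of one species return `None` here and would have to be handled by one of the other combinations.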
The same question arises here: does GenNorm actually use the canonical form or not?