Natural Language Processing 3
Multiple sentences: the meaning and tracking of nouns (entities) throughout an entire text.
Anaphor - Refers back to an entity mentioned earlier. Eg. "The professor was drunk. So Bill hid his car keys." "His" refers back to the professor.
Cataphor - Refers forward to something mentioned later. Eg. "Listen to this: You'll never believe who Mary is dating." "This" points forward.
Deixis - Refers to information not stated in the text. Eg. "I'll see you tomorrow": what "tomorrow" means depends on when the person says it.
Indefinite NP - Introduces a new entity. Eg "A dog"
Definite NP - Describes an entity already introduced. May use a different name or describe new features of it.
Definite Pronouns - He/she/it/they etc.
Demonstrative Pronoun - This/that
One-Anaphor - "I saw four cats yesterday. Now I want one."
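Inferrables - Entities not mentioned directly but that can be inferred from ones that were. Eg. "I have a car. The engine is bad though." The engine is inferred from the car.
Discontinuous Sets - "John is a boy. Mary is a girl. They are both 19." "They" refers to a set whose members were introduced separately.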
Generics - Talking about a generic group of entities
Syntactic Restrictions - A pronoun must agree with the entity it refers to. Eg. He/She/They can only reference an entity of matching gender and number.
Semantic Restrictions - Eg. "John parked his car in the garage. He had been driving it around all day." The "it" refers to the car, not John or the garage, because only the car can be driven.
Preferences - When several entities are still possible, choose between them based on recency, grammatical role and repetition.
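The restrictions and preferences above can be sketched as a filter followed by a tie-break. This is only a minimal illustration: the `Entity` fields, the `PRONOUNS` table and the use of recency alone as the preference are assumptions made for the example, not part of the notes.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    gender: str    # "m", "f" or "n" (hypothetical encoding)
    number: str    # "sg" or "pl"
    position: int  # order of mention; larger means more recent

# Hypothetical agreement table: pronoun -> (gender or None, number)
PRONOUNS = {
    "he": ("m", "sg"),
    "she": ("f", "sg"),
    "it": ("n", "sg"),
    "they": (None, "pl"),
}

def resolve(pronoun, entities):
    """Return the most recent entity that agrees with the pronoun, or None."""
    gender, number = PRONOUNS[pronoun.lower()]
    candidates = [
        e for e in entities
        if e.number == number and (gender is None or e.gender == gender)
    ]
    if not candidates:
        return None
    # Preference step: recency only (grammatical role and repetition are ignored here)
    return max(candidates, key=lambda e: e.position)

entities = [
    Entity("John", "m", "sg", 0),
    Entity("car", "n", "sg", 1),
    Entity("garage", "n", "sg", 2),
]
print(resolve("He", entities).name)  # John
print(resolve("it", entities).name)  # garage
```

Agreement alone picks "garage" for "it" because it is the most recent neuter entity. That is the wrong answer for the parking example, which is why semantic restrictions are needed on top of the syntactic ones.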
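Salience Algorithm - Calculates which entity a definite pronoun most likely refers to. The notation below comes from centering theory.
U - An utterance (sentence). So U1 is sentence 1 etc.
Cb(Un) - Backward-looking centre: the highest-ranked entity from Cf(Un-1) that is mentioned again in Un. Usually the previous Cp.
Cf(Un) - Forward-looking centres: a list of all entities in Un, ranked on preference.
Cp(Un) - Preferred centre: the highest-ranked entity in Cf(Un).
If a pronoun can refer to more than one entity, the algorithm forks and each reading is followed separately. After each U the most likely fork is chosen, based on its transition.
Transitions - Ranked from most to least preferred: Continue > Retain > Smooth (Shift) > Rough (Shift)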
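A minimal sketch of how forks could be compared by transition. The notes do not define the four transitions, so the rules below follow the usual centering-theory definitions and should be read as an assumption; `classify_transition` and `best_fork` are hypothetical names.

```python
# Most preferred first, as in the ranking above
PREFERENCE = ["Continue", "Retain", "Smooth", "Rough"]

def classify_transition(cb_prev, cb_cur, cp_cur):
    """Classify the transition into Un from Cb(Un-1), Cb(Un) and Cp(Un)."""
    # An undefined previous Cb (start of the text) is treated as "same Cb"
    same_cb = cb_prev is None or cb_cur == cb_prev
    if same_cb:
        return "Continue" if cb_cur == cp_cur else "Retain"
    return "Smooth" if cb_cur == cp_cur else "Rough"

def best_fork(forks):
    """forks: list of (label, cb_prev, cb_cur, cp_cur). Returns the preferred label."""
    ranked = sorted(
        forks,
        key=lambda f: PREFERENCE.index(classify_transition(f[1], f[2], f[3])),
    )
    return ranked[0][0]

# Two readings of an ambiguous "he" when the previous Cb was John
forks = [
    ("he = John", "John", "John", "John"),  # Cb unchanged and equal to Cp
    ("he = Bill", "John", "Bill", "Bill"),  # Cb changed, equal to Cp
]
print(classify_transition("John", "John", "John"))  # Continue
print(classify_transition("John", "Bill", "Bill"))  # Smooth
print(best_fork(forks))                             # he = John
```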
Information Retrieval - Sorting, searching and indexing of documents in a collection
Document - A unit of language/text
Collection - Set of documents
Query - Set of terms
Ad Hoc Retrieval - Get a subset of documents using a query
Vector Space Model
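Documents/Queries are vectors where each term is a feature. Assume features are binary (1 if the term is present, 0 if not)
Similarity - The dot product of two vectors: multiply the matching terms and add them up. With binary features this counts the shared terms
Term by Document Matrix - Each term is a dimension, so each document is just a point in that space
Normalisation - Longer documents get larger dot products, so normalise each vector to unit length first. The similarity of two documents is then the cosine of the angle between them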
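A small sketch of the dot product and the normalised (cosine) similarity on binary vectors. The four-word vocabulary and the example texts are made up for illustration.

```python
import math

vocab = ["cat", "dog", "fish", "bird"]  # hypothetical vocabulary

def to_vector(text):
    """Binary feature vector: 1 if the term occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

query = to_vector("cat fish")
doc_a = to_vector("cat dog fish")
doc_b = to_vector("bird")

print(dot(query, doc_a))               # 2
print(round(cosine(query, doc_a), 3))  # 0.816
print(cosine(query, doc_b))            # 0.0
```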
Term Weighting - Weighting terms based on how frequent they are in the document (term frequency) and how rare they are across the other documents (inverse document frequency).
Stop words - 100 most common words that get ignored
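A sketch of term weighting with stop words removed. It assumes one common tf-idf variant, weight = tf * log(N / df); the notes do not say which formula was used, and the stop-word list here is a tiny made-up stand-in for the real one.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is"}  # hypothetical, far shorter than a real list

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the cat ran", "a dog ran fast"]
weights = tf_idf(docs)
print(sorted(weights[0]))                    # ['cat', 'sat']
print(weights[0]["sat"] > weights[0]["cat"])  # True
```

"sat" occurs in one document and "cat" in two, so "sat" gets the higher weight in the first document even though both appear there once.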
Question Answering - Gives a short answer, perhaps with evidence
TREC - IR competition (Text REtrieval Conference). Uses newspaper text to get answers to factual questions
Query Rewriting - Classify the question by what it asks for: "Who" is looking for a person, "When" is looking for a time. For a Where-Q, move "is" to each possible position in the sentence to generate strings to search for
Mining N-Grams - Send the rewritten queries to a search engine and generate n-grams from what comes back. Only looks at the search result summaries, not the actual pages
Filtering N-Grams - Regular expressions filter the n-grams by what we're looking for, eg. times/dates/people/locations etc. Matching n-grams get their scores boosted
Tiling - Putting overlapping answers together into one longer answer to boost scores
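The rewriting, mining and tiling steps can be sketched as below. This is a simplified illustration rather than the original system: the snippets are made up in place of real search results, n-grams are scored by raw counts only, and the filtering step is left out. All function names are hypothetical.

```python
from collections import Counter

def rewrite_where_is(question):
    """Drop 'Where is' and put 'is' at every position in the rest of the question."""
    words = question.rstrip("?").split()
    if len(words) < 3 or words[0].lower() != "where" or words[1].lower() != "is":
        return []
    rest = words[2:]
    return [" ".join(rest[:i] + ["is"] + rest[i:]) for i in range(len(rest) + 1)]

def mine_ngrams(snippets, max_n=3):
    """Count every 1..max_n word n-gram across the result snippets."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

def tile(a, b):
    """Merge a and b if the end of a overlaps the start of b, else return None."""
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), 0, -1):
        if wa[-k:] == wb[:k]:
            return " ".join(wa + wb[k:])
    return None

rewrites = rewrite_where_is("Where is the Louvre located?")
print(rewrites[2])  # the Louvre is located

# Made-up snippets standing in for search engine result summaries
snippets = ["the louvre is located in paris", "louvre museum in paris france"]
counts = mine_ngrams(snippets)
print(counts["in paris"])               # 2
print(tile("in paris", "paris france"))  # in paris france
```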
Deep Learning/Word Embeddings
Linear Regression - Given a bunch of points, find the line that represents them as well as possible. Error is measured using least squares. The squared error is a quadratic curve in each parameter, so the best fit is the point where its derivative is 0.
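For a single input variable, setting the derivatives of the squared error to 0 gives a closed-form slope and intercept. A minimal sketch with made-up points:

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```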
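Gradient Descent - Finds a local minimum of a function with any number of dimensions by repeatedly stepping downhill. To look for the global minimum, run it a couple of times from different starting points and keep the best result.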
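A sketch of gradient descent with restarts on a one-dimensional function that has two minima. The function, learning rate, step count and starting points are all chosen for illustration only.

```python
def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=1000):
    """Step against the gradient from a starting point x."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

starts = [-2.0, 0.5, 2.0]
minima = [descend(s) for s in starts]
best = min(minima, key=f)  # keep the lowest of the minima found

print(round(descend(2.0), 2))  # 1.13
print(round(best, 2))          # -1.3
```

Starting from 2.0 the search gets stuck in the shallower minimum near 1.13. Only the run that starts from -2.0 reaches the lower minimum near -1.3, and restarts still give no guarantee that the global minimum has been found.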
Logistic Regression - Categorising positives and negatives using a curve/shape (the decision boundary). Can be solved using gradient descent
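A minimal sketch of logistic regression on one feature, trained with gradient descent on the log loss. The data, learning rate and step count are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.3, steps=5000):
    """Fit weight w and bias b by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data: small values are negatives (0), large values are positives (1)
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)

def predict(x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print([predict(x) for x in xs])  # [0, 0, 0, 1, 1, 1]
```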
Deep Learning - Neural Networks with many hidden layers
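One-Hot Vectors - Each word gets an index (first word is 1, second is 2 etc.) and is represented by a vector of zeros with a single 1 at that index. Doesn't take into account related forms such as stemmed words: every pair of different words is equally dissimilar.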
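A short sketch of building one-hot vectors from a made-up sentence. Note that the code indexes from 0, where the notes count from 1.

```python
def build_index(words):
    """Give each distinct word the next free index, in order of first appearance."""
    index = {}
    for w in words:
        if w not in index:
            index[w] = len(index)
    return index

def one_hot(word, index):
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec

index = build_index("the cat sat on the mat".split())
print(one_hot("cat", index))  # [0, 1, 0, 0, 0]
print(one_hot("mat", index))  # [0, 0, 0, 0, 1]

# Different words never share a 1, so their dot product is always 0
print(sum(a * b for a, b in zip(one_hot("cat", index), one_hot("mat", index))))  # 0
```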
Term Document Matrix - Counts how many times each word appears in each document
A way to represent words as numbers
Word-word matrix - Counts how often each word is used in the same context as each other word. Context is taken at a document or collection level. Words can then be mapped by how close they are to each other
Words that are less frequent across documents are given a higher weighting
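A sketch of a word-word matrix built from co-occurrence counts. It assumes a context of two words either side within a sentence, which is narrower than the document or collection level mentioned above; the sentences are made up.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """counts[w][c] = number of times c appears within `window` words of w."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][words[j]] += 1
    return counts

sentences = ["i drink coffee", "i drink tea"]
counts = cooccurrence(sentences)
print(counts["drink"]["i"])     # 2
print(dict(counts["coffee"]))   # {'i': 1, 'drink': 1}
print(dict(counts["tea"]))      # {'i': 1, 'drink': 1}
```

"coffee" and "tea" never appear together, but their rows are identical because they share the same contexts. That is what lets the matrix place them close to each other.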
Phrase and Entity Identification
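Identifying red herrings, and phrases whose meaning is completely different from that of their individual words
POS tagging doesn't work for these, so hand-crafted rules are used instead
Alternatively, use an encyclopedia to look phrases up. Wikification links phrases in the text to their Wikipedia entries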
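A sketch of the encyclopedia lookup as a longest-match search over a set of entry titles. The `TITLES` set is a tiny made-up stand-in for real encyclopedia titles, and real wikification also has to pick between entries that share a name, which is not attempted here.

```python
# Hypothetical stand-in for a set of encyclopedia entry titles
TITLES = {"new york", "new york times", "red herring", "apple"}

def find_phrases(text, titles, max_len=3):
    """Scan left to right, taking the longest title that matches at each position."""
    words = text.lower().split()
    found = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in titles:
                found.append(candidate)
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no title starts at this word
    return found

print(find_phrases("The New York Times called it a red herring", TITLES))
# ['new york times', 'red herring']
```

Preferring the longest match is what keeps "New York Times" from being split into "New York" plus a leftover word.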