01. Information Retrieval, Machine Learning, Pre-trained Models, Semi…
Machine Learning
Transformers
- Deep learning model used primarily for language modelling
- Improves on RNNs
- Pros: stronger performance, highly parallelisable
Self-attention
- Learns how to pay attention to relevant words during training
Transformers
- Made up of multiple identical encoder and/or decoder blocks
GPT-2
- Primarily known for text generation capabilities
- Trained on massive corpus of text gathered from pages around the internet.
- Decoder-only: the encoder blocks of the transformer were removed.
- Fine-tuning: training a pretrained model further on a custom dataset.
Data collection
- 20000 articles from the categories.
Link network generation - GRAN
- Deep learning model for graph generation
Week 3 Machine Learning
Linear Regression
- The line with the least distance from the observed points gives the best fit.
- The loss function tells how good the line is.
Optimisation
- Loss function L measures how good a particular line (w, b) is for our training dataset.
Gradient Descent
- Start by picking completely random values for w and b, e.g. by sampling N(0,1).
- Then compute the gradients dL/dw and dL/db
- Then update:
w := w - a*dL/dw
b := b - a*dL/db
- In the L vs w graph, the smaller the L value, the better the model.
- Keep doing these updates over and over until w and b stop changing (a small sketch of this loop follows below).
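- A minimal sketch of this loop in Python with numpy (the data and the learning rate are made-up illustrative values):

    import numpy as np

    # toy 1-D training data (made-up, purely illustrative)
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 3.0, 5.0, 7.0])

    w, b = np.random.randn(), np.random.randn()  # random start, e.g. sampled from N(0,1)
    a = 0.01                                     # learning rate (hyper-parameter)

    for step in range(1000):
        y_hat = w * x + b                        # predictions of the current line
        dL_dw = np.mean(2 * (y_hat - y) * x)     # gradient of the squared-error loss w.r.t. w
        dL_db = np.mean(2 * (y_hat - y))         # gradient w.r.t. b
        w = w - a * dL_dw                        # w := w - a*dL/dw
        b = b - a * dL_db                        # b := b - a*dL/db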
The role of learning rate
- The learning rate a is a hyper-parameter you set.
- Too small a, need to take many steps.
- Too large a, may not converge.
- It is up to you to choose a good value of a.
Higher Dimensions
- A linear function from I inputs to O outputs (a shape sketch follows below)
y = x*W + b
- W is a matrix with shape [I,O]
- b is a vector with shape [O]
- x is a vector with shape [I]
- y is a vector with shape [O]
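- A small numpy sketch of these shapes (I = 3 and O = 2 are made-up sizes):

    import numpy as np

    I, O = 3, 2                   # made-up input/output sizes
    x = np.random.randn(I)        # input vector, shape [I]
    W = np.random.randn(I, O)     # weight matrix, shape [I, O]
    b = np.random.randn(O)        # bias vector, shape [O]
    y = x @ W + b                 # output vector, shape [O]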
Classification
- To deal with discrete (categorical) labels.
- The model's outputs can be anything, so we need to turn them into probabilities:
- need all the output values to be non-negative.
- need output values to sum up to 1.
Classification: loss
- Compute the loss between our predicted probabilities and the one-hot encoding of the class label y, i.e. the probability of the correct class should be 1 and all others should be 0. Each output is between 0 and 1.
- Step 1: apply the exponential function to all values. This makes them positive.
- Step 2: divide each value by the sum of all values. This makes them sum to 1.
- We can pretend that our values are probabilities.
Classification: cross-entropy
- In practice, using cross-entropy (on the softmax outputs) as the loss function for classification gives better results (see the sketch after this list).
- Cross entropy example.
- y is either 0 or 1.
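- A minimal numpy sketch of the two softmax steps and the cross-entropy loss (the raw outputs and class label are made-up):

    import numpy as np

    z = np.array([2.0, -1.0, 0.5])   # raw model outputs (made-up)
    exp_z = np.exp(z)                # step 1: exponential makes all values positive
    p = exp_z / exp_z.sum()          # step 2: divide by the sum so they sum to 1

    y = 1                            # index of the correct class (made-up)
    loss = -np.log(p[y])             # cross-entropy against the one-hot label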
Classification: prediction
- To make a prediction for a new data point, apply your model to compute probabilities for each class. Predict the class with the largest probability.
- When predicting a class in this way, the decision boundaries are straight lines.
Applying linear regression to text
- Our model needs numeric (continuous) input values.
What is Machine Learning
- Machine learning is about prediction
- Examples/features
- Labels/annotations
- Predictor
- Estimating the best predictor = training, i.e. finding a function that does the prediction job.
- The predictor function can make predictions very well, but that does not mean it understands the data.
Week 4
Representation in NLP
Vector Model in IR
- Finding the cosine similarity between a query and a document, e.g. score = cos(V_query, V_document) (a small sketch follows below).
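- A small numpy sketch of this score (the query and document vectors are made-up):

    import numpy as np

    v_query = np.array([1.0, 0.0, 2.0])   # made-up query vector
    v_doc = np.array([0.5, 1.0, 1.5])     # made-up document vector

    # score = cos(V_query, V_document)
    score = v_query @ v_doc / (np.linalg.norm(v_query) * np.linalg.norm(v_doc))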
Traditional NLP
- Regards words as discrete symbols
- e.g. hotel = [000000000010000]
- Vector dimension = number of words in vocabulary (e.g. 500,000)
- 'Hotel' and 'motel' have (almost) the same semantic meaning, but have different representations.
Deep neural network
Linear models to neural networks
- Fit some other type of function to our data if it does not follow a linear relationship.
- 𝜙 is some function that is non linear and (mostly) differentiable.
- 𝜙 is called the ‘activation function’.
- NN(x) is a hidden vector if it is in the middle.
- Input layer is the first layer.
- Output layer is the last layer.
- All other layers are 'hidden layers'.
- More hidden layers means more compositions of activation functions.
- More neurons can approximate more complicated functions, and therefore model the data better.
- If too many neurons are added, the model can fit noise in the training dataset, causing overfitting.
Key idea of neural networks
- Stack multiple linear regression models.
- Output is used for the input of next model.
Rectified Linear Unit (ReLU)
- ReLU(x) = max(x,0)
- Fast to compute
- Works well in practice
- Each ReLU neuron represents one ReLU graph (one piecewise-linear piece of the overall function); a small sketch of stacking them follows below.
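- A minimal numpy sketch of stacking two linear models with a ReLU in between (all sizes and weights are made-up):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)            # ReLU(x) = max(x, 0)

    x = np.random.randn(4)                 # input vector (made-up size)
    W1, b1 = np.random.randn(4, 8), np.zeros(8)
    W2, b2 = np.random.randn(8, 3), np.zeros(3)

    h1 = relu(x @ W1 + b1)                 # hidden layer: its output feeds the next model
    y = h1 @ W2 + b2                       # output layer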
Deep neural network
- Feed-forward neural network
Feature Learning
- One way of looking at a deep neural network is as feature learning.
- i.e. the job of the first layer (h1) is to transform its input into a linearly separable representation, so the final layer can do its job properly.
- So a straight line (as decision boundary) can separate the classes in the final layer. If the model is good, the straight line should separate the classes.
Stochastic Gradient Descent
- In standard GD, the gradients are computed on the loss of the entire training dataset, which is a problem if the dataset is very large.
- For SGD, at each update we randomly sample a batch of B data points and compute gradients only on the loss of these points.
- SGD is much faster to compute.
- SGD also acts as a regulariser: models trained with SGD usually generalise better, without overfitting the training dataset (a sketch of the sampling step follows below).
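- A sketch of the SGD sampling step in numpy, assuming a dataset X, Y and a batch size B (all made-up here):

    import numpy as np

    X = np.random.randn(10000, 5)    # made-up dataset
    Y = np.random.randn(10000)
    B = 32                           # batch size

    idx = np.random.choice(len(X), size=B, replace=False)  # randomly sample B data points
    x_batch, y_batch = X[idx], Y[idx]
    # gradients are then computed only on the loss of this batch,
    # and w, b are updated exactly as in standard gradient descent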
Backpropagation on Computation Graph
- Forward pass:
for i = 1, ..., N: compute v_i as a function of Pa(v_i).
- Backward pass:
for i = N-1, ..., 1: compute dL/dv_i from the derivatives of the nodes that use v_i (chain rule).
- In short: the forward pass computes the loss, the backward pass computes the derivatives.
Gradient Descent
- Simple Regression Model
- Chain Rule:
dL/dw = (dL/dŷ) * (dŷ/dw)
dL/db = (dL/dŷ) * (dŷ/db)
Attention
Translation
- Both the input and the output are sequences.
Problem with encoder-decoders
- All the information is compressed into one vector.
- It is hard for the decoder (a linear model) to extract the information it needs from that single vector.
Attention
- Instead of forcing the decoder to learn how to extract the information it needs from h, we just give it a vector computed specifically for the current step (a small sketch follows below).
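- A minimal numpy sketch of dot-product attention: the decoder's current state scores the encoder states, and the scores become weights for a context vector built specifically for this step (all values are made-up):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    H = np.random.randn(6, 8)     # encoder states, one per input word (made-up)
    q = np.random.randn(8)        # decoder state at the current step (made-up)

    scores = H @ q                # relevance of each input word right now
    weights = softmax(scores)     # attention weights, sum to 1
    context = weights @ H         # vector given to the decoder for this step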
Semi-supervised learning
- It is expensive to do data labelling manually.
- It would be good to have a model that trains on some labelled data and some unlabelled data.
Self-supervised learning
- The labels are generated by the computer itself, not by humans.
- Semi-supervised means you use self-supervised and supervised.
- So pre-training a language model and then fine-tuning it on a labelled dataset is semi-supervised.
- The labelled dataset is only used during fine-tuning.
Transformer Models
- GPT, BERT, GPT-2, GPT-3.
- GPT-3 is the largest trained model so far.
- The test accuracy has been increasing as the model gets larger.
- Use raw text to train the model to predict what comes next.
Pre-training
- BERT is different to GPT: rather than just looking at the left context to predict the right, it looks at all the words in the text.
- 80% of the target words are replaced by [MASK]
- 10% are replaced with another random word
- 10% are left the same (a rough sketch of this masking rule follows below)
- Google Docs uses a transformer to do word correction.
- BERT does sentence prediction as well. It does well on question answering and comprehension (it adds each word's vectors together).
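- A rough Python sketch of the masking rule above (the sentence, vocabulary, and the 15% target rate are illustrative; real BERT works on subword token IDs):

    import random

    tokens = ["the", "hotel", "is", "near", "the", "station"]    # made-up sentence
    vocab = ["the", "hotel", "motel", "is", "near", "station"]   # made-up vocabulary

    masked = []
    for tok in tokens:
        if random.random() < 0.15:                   # pick ~15% of tokens as targets
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                masked.append(tok)                   # 10%: leave the same
        else:
            masked.append(tok)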
BERT model fine-tuning
- Using BERT on multiple tasks
XLNet
- Randomly picking the ordering of the text
Enterprise Analytics
- Data scientists sit upstream of the analytics engineers in the data pipeline.
- SMRs (Suspicious Matter Reports) are the main transaction report type collected by AUSTRAC.
- The data includes semi-structured and unstructured data.
- There are 50,000 SMRs each year.
Crimes check requirement
- The SMR reports are checked against scams, fraud, cash deposits, money laundering, tax evasion, and terrorism financing.
- This task can be seen as binary classification (one classifier per label).
- Input: an SMR, which then goes through machine learning models (label classifiers).
- Output: a predicted label (or score) for each crime type.
- The pre-defined labels need to be assigned manually.
Machine learning process
- We have the labels. Then the task is to use the labels and to do predictions.
- The iteration goes on for a number of times, until it gives good validation accuracy.
NLP Work
- The SMR is turned into bag of words representation.
- It is turned into a TF-IDF representation to reduce the influence of high-frequency words (e.g. 'money').
Process:
- Remove space, remove non-alphanumeric, etc.
- No stemming (not real words sometimes) or lemmatisation (real words)
- Filter out terms with <0.1 or >0.9 document frequency (a sketch of this step follows below).
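- A sketch of this step with scikit-learn's TfidfVectorizer (the example documents are made-up; min_df/max_df mirror the filtering above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["suspicious cash deposit reported",
            "possible scam money transfer"]          # made-up stand-ins for SMR text

    vectorizer = TfidfVectorizer(
        lowercase=True,
        token_pattern=r"[a-z0-9]+",   # keep alphanumeric tokens only
        min_df=0.1,                   # drop terms in fewer than 10% of documents
        max_df=0.9,                   # drop terms in more than 90% of documents
    )
    X = vectorizer.fit_transform(docs)  # TF-IDF matrix fed to the classifiers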
Classifier
- A binary classifier for each label, using boosted decision trees (boosting passes the learning from one tree to the next tree).
Output
- Classifier threshold = 0.29 (depending on the type of crime)
- Scores below 0.29 are considered OK; the report is flagged as SCAM if above 0.29.
Advanced models
- No significant accuracy improvement from advanced techniques like Word2Vec or other learning-based models.
- Deep learning takes longer to train.
- Not as flexible and maintainable as simple models (TF-IDF).
Patterns
- There are different ways of representing the patterns, e.g. account numbers.
- This allows searching with regular expressions (a small sketch follows below).
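- A small Python sketch of such a regular-expression search (the account-number format and text are made-up):

    import re

    text = "Funds moved from account 123-456-789 to an offshore account."  # made-up
    pattern = r"\b\d{3}-\d{3}-\d{3}\b"   # hypothetical account-number pattern

    matches = re.findall(pattern, text)  # ['123-456-789']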
Hard vs Soft Clustering
- Hard clustering: each doc is in exactly one cluster.
- Soft clustering: a doc can be in more than one cluster.
Flat algorithms
- Flat algorithms compute a partition of N documents into a set of K clusters.
- Effective heuristic method: K-means algorithm
Centroids in K-means
- Each cluster in K-means is defined by a centroid.
- w is a cluster.
- Centroids act as centers of gravity of points in cluster w.
K-means
- Objective: minimise the average squared difference from the centroid.
- Iterating the K-means algorithm (a small sketch follows after this list):
- reassignment: assign each point to its closest centroid
- recomputation: recompute each centroid as the mean of its cluster
- K-means is guaranteed to converge.
- RSS (residual sum of squares): the sum of all squared distances between each document vector and its closest centroid. RSS decreases during each reassignment step.
- However, K-means is only guaranteed to converge to a local minimum, not the global one.
- Random seed selection is one way of K-means initialisation, but not robust.
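- A minimal numpy sketch of the two iterated K-means steps (made-up 2-D points, K = 2):

    import numpy as np

    X = np.random.randn(100, 2)    # made-up document vectors
    K = 2
    centroids = X[np.random.choice(len(X), K, replace=False)]  # random seed selection

    for _ in range(10):
        # reassignment: each point goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recomputation: each centroid becomes the mean of its cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    rss = ((X - centroids[labels]) ** 2).sum()  # residual sum of squares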
Recomputation decreases average distance
Evaluation
- RSS in K-means
- Ω is the set of clusters
- C is the set of classes
How to choose the value of K
- External constraint on K.
- Define an optimisation criterion.
- Find the K for which the optimum is reached.
- K = Nclusters