RE, Text Normalization
& Edit Distance
Regular
Expression
Embedded libraries in Python, Java...
Features
Test for a pattern within a string
- Check credit card number pattern (Validation)
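As a rough illustration of testing for a pattern within a string, the sketch below uses Python's built-in `re` module. The pattern itself is an assumption made for this example (16 digits in four groups with an optional, consistent space or hyphen separator); real card formats vary, and this checks only the shape, not whether the number is valid.

```python
import re

# Assumed shape for illustration only: four groups of four digits,
# optionally separated by the same space or hyphen each time.
CARD_PATTERN = re.compile(r"\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}")


def looks_like_card_number(text: str) -> bool:
    """Return True if the whole string matches the assumed card-number shape."""
    return CARD_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["1234 5678 9012 3456", "1234-5678-9012-3456", "1234 5678 9012"]:
        print(candidate, looks_like_card_number(candidate))
```

The backreference `\1` forces every separator to equal the first one, so mixed separators are rejected.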
Error
Type I error
- False Positive
- Minimizing false positives increases accuracy/precision
Type II error
- False Negative
- Minimizing false negatives increases recall/coverage
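The link between the two error types and precision/recall can be made concrete with a small sketch. The function names and the counts in the usage lines are hypothetical; `tp`, `fp` and `fn` stand for true positives, false positives (Type I) and false negatives (Type II).

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported matches that are correct; falls as false positives grow."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of true items that were found; falls as false negatives grow."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Made-up counts: 8 correct matches, 2 false positives, 8 missed items.
    print(precision(tp=8, fp=2))  # 0.8
    print(recall(tp=8, fn=8))     # 0.5
```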
Corpus
Utterance
- A unit of speech bounded by silence
Word Count
Word Types, V
- Number of distinct words in a corpus
- Vocabulary
Tokens, N
- Total number of running words
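A minimal sketch of counting word types (V) and tokens (N), assuming the simplest possible tokenization: lowercase the text and split on whitespace. Real corpora usually need a proper tokenizer, and whether case variants count as the same type is a design decision.

```python
def count_types_and_tokens(text: str) -> tuple[int, int]:
    """Return (V, N): number of distinct words and number of running words."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokens = text.lower().split()
    return len(set(tokens)), len(tokens)


if __name__ == "__main__":
    v, n = count_types_and_tokens("the cat sat on the mat")
    print(v, n)  # 5 types, 6 tokens ("the" occurs twice)
```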
Normalization
Process
Normalize word formats
- Put words into a standard format
MaxMatch Algo
Algorithm:
- Start a pointer at the beginning of the string
- Find the longest word in the dictionary that matches the string starting at the pointer
- Move the pointer past that word in the string
- Repeat steps 2 and 3 until the end of the string is reached
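The steps above can be sketched as a greedy loop. The function name, the dictionaries, and the fallback of emitting a single character when no dictionary word matches are assumptions made for this example.

```python
def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation of an unspaced string."""
    words = []
    i = 0  # pointer into the string
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # Assumed fallback: no word matches, so emit one character.
            j = i + 1
        words.append(text[i:j])
        i = j  # move the pointer past the matched word
    return words


if __name__ == "__main__":
    print(max_match("thecatinthehat", {"the", "cat", "in", "hat"}))
    # Greedy matching can go wrong when a longer word shadows the intended one:
    vocab = {"the", "theta", "table", "down", "there", "bled", "own"}
    print(max_match("thetabledownthere", vocab))  # ['theta', 'bled', 'own', 'there']
```

The second call shows the known weakness of the greedy strategy: it commits to "theta" and never recovers "the table down there".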
Collapsing
Words
Stemming
- Process of reducing a word to its word stem
- Cutting off the prefixes & suffixes
Lemmatization
- Process of reducing words into their lemma
- Takes into consideration the morphological analysis of the words
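To contrast the two approaches, here is a toy sketch. Everything in it is hypothetical: the suffix list, the minimum stem length of three characters, and the lemma lookup table are made up for illustration, and the stemmer strips suffixes only. Production systems would use something like the Porter stemmer or a dictionary-backed lemmatizer instead.

```python
# Assumed, deliberately tiny rule set and lookup table.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"was": "be", "were": "be", "better": "good", "mice": "mouse"}


def crude_stem(word: str) -> str:
    """Chop the first matching suffix, with no knowledge of the word's meaning."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Map a word to its lemma via lookup, standing in for morphological analysis."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ["jumping", "boxes", "mice", "was"]:
        print(w, crude_stem(w), toy_lemmatize(w))
```

The stemmer handles regular forms such as "jumping" but leaves "mice" untouched, whereas the lookup resolves irregular forms only if they are listed.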
Edit Distance
Minimum Edit Distance
- Minimum number of edit operations needed to transform one string into another
- Operations: insertion, deletion & substitution
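One common way to compute this is dynamic programming over a table of prefix distances. The sketch below assumes insertion and deletion cost 1 each; the `sub_cost` parameter is an assumption added here because some formulations charge 2 for a substitution (treating it as a deletion plus an insertion).

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum total cost of insertions, deletions and substitutions."""
    n, m = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                             # deletion
                d[i][j - 1] + 1,                             # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost)  # substitution or match
            )
    return d[n][m]


if __name__ == "__main__":
    print(min_edit_distance("intention", "execution"))              # 5
    print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```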