Typical text-preprocessing pipeline:
- Character encoding
- Language identification
- Tokenization: split a string into a sequence of tokens/words, e.g. "Hello World!" -> [Hello | World | !] (in Python: spaCy)
- Stopword removal
- Word normalization:
  - Stemming: map different forms of the same word to the same normalized form by stripping affixes ("walking" -> "walk"); in English typically not done anymore
  - Better: lemmatization: map tokens to lexicon entries (requires a complex lexicon and mapping rules!)
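A minimal sketch of the tokenization, stopword-removal, and lemmatization steps using spaCy (assumes the en_core_web_sm model is installed; the example sentence is illustrative):

```python
import spacy

# Load the small English pipeline
# (assumes: pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Hello World! The cats were walking home.")

# Tokenization: the Doc is already a sequence of Token objects
print([t.text for t in doc])
# ['Hello', 'World', '!', 'The', 'cats', 'were', 'walking', 'home', '.']

# Stopword removal: keep only non-stopword, non-punctuation tokens
content = [t for t in doc if not t.is_stop and not t.is_punct]

# Lemmatization: map each remaining token to its lexicon entry (lemma)
print([(t.text, t.lemma_) for t in content])
# e.g. ('cats', 'cat'), ('walking', 'walk')
```

Note that spaCy lemmatizes rather than stems: "walking" maps to the lexicon entry "walk" via the model's lexicon and rules, instead of by blindly stripping the "-ing" affix.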