ELEMENTS OF UNSTRUCTURED DATA ANALYSIS
TEXT
IS A PRODUCT PRESENTED IN THE FORM OF A WRITTEN DOCUMENT
HAS A BEGINNING
DEVELOPMENT
END
OWN AUTONOMY AND COMMUNICATIVE PURPOSE
CORPUS
IS A COLLECTION OF
COHERENT, HOMOGENEOUS, COMPREHENSIVE
TEXTS
SUITABLE FOR RESEARCH PURPOSES
THE STRUCTURE
A corpus is a COLLECTION OF TEXTS which can be divided into
SUB-CORPORA OR SUB-DIVIDED INTO SMALLER SECTIONS
(chapters, paragraphs, sentences...)
TEXT
Consists of letters, spaces and other symbols (punctuation, numbers, random characters...)
WORD
Is a sequence of letters separated by spaces and punctuation marks.
STOP WORDS
List of words that can be
eliminated
because they are considered
"useless"
(articles, prepositions, pronouns, etc.)
after eliminating them:
WORD CLOUD
A graph in which words are displayed at a size proportional to how frequently they appear in the text
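A minimal sketch of the step behind a word cloud: remove the stop words, then count how often each remaining word appears. The stop-word list and the sample sentence are invented for illustration; drawing the cloud itself is not shown.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "it"}

def word_frequencies(text):
    """Count how often each non-stop word occurs in the text."""
    # Lowercase, split on spaces, and strip punctuation stuck to the words.
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    kept = [w for w in words if w and w not in STOP_WORDS]
    return Counter(kept)

freqs = word_frequencies("The cat and the dog. The cat is in the garden.")
print(freqs.most_common())
```

The counts returned here are what would determine the size of each word in the cloud.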
HOW TO COUNT WORDS IN A CORPUS?
(two concepts)
WORD TOKENS
(N)
Represent the size of the corpus in terms of the total number of occurrences (how many words there are in total)
WORD TYPES
(V)
Represent the size of the corpus vocabulary (how many of the words present are different from one another)
The RATIO V/N
known as
THE TYPE-TOKEN RATIO (TTR)
is the measure of lexical variety.
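As a small worked example, N, V and the TTR can be computed as follows. The sample sentence and the function name are assumptions made for illustration.

```python
def type_token_ratio(tokens):
    """Return (V, N, TTR) for a list of word tokens."""
    n = len(tokens)        # N: total number of occurrences (word tokens)
    v = len(set(tokens))   # V: number of distinct words (word types)
    ttr = v / n if n else 0.0
    return v, n, ttr

tokens = "the cat sat on the mat".split()
v, n, ttr = type_token_ratio(tokens)
print(v, n, round(ttr, 2))
```

Here "the" occurs twice, so N = 6 and V = 5, giving a TTR of 5/6. A value close to 1 means few repetitions, that is, high lexical variety.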
KWIC
(KEY WORD IN CONTEXT)
Shows in which context a word is used (since a word can have several meanings)
The chosen word sits at the centre of each line in a list of text portions, with its context to the left and to the right
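A minimal sketch of a KWIC listing, assuming the text is already split into words. The function name, the window size and the sample sentence are illustrative choices.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

tokens = "I sat by the river bank and then went to the bank for money".split()
rows = kwic(tokens, "bank", window=3)
for left, kw, right in rows:
    # Right-align the left context so the keyword lines up in the centre.
    print(f"{left:>20} | {kw} | {right}")
```

The two rows show "bank" in two different contexts (a river and money), which is the kind of ambiguity a KWIC listing helps to spot.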
SENTIMENT ANALYSIS
(POLARITY ANALYSIS)
POSITIVE WORDS
NEGATIVE WORDS
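One simple way to sketch a polarity score is to count positive words and negative words and take the difference. This is an assumption about how the two lists might be used; the word lists below are invented and far too small for real use.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "poor", "sad", "hate"}

def polarity(tokens):
    """Positive-word count minus negative-word count."""
    pos = sum(1 for t in tokens if t.lower() in POSITIVE)
    neg = sum(1 for t in tokens if t.lower() in NEGATIVE)
    return pos - neg

score = polarity("great food but bad service and poor value".split())
print(score)
```

A score above zero suggests a positive text, below zero a negative one. This counting approach ignores negation ("not good") and intensity.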
TOKENIZATION
Procedure capable of correctly isolating words (eliminating commas, full stops, apostrophes, etc.)
determining where each word begins and ends
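A minimal tokenizer sketch using a regular expression that keeps only runs of letters. Treating apostrophes and hyphens as word boundaries is a simplifying assumption; real tokenizers handle these cases with more care.

```python
import re

def tokenize(text):
    """Split text into lowercase words, dropping punctuation and digits."""
    # [^\W\d_]+ matches one or more letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())

words = tokenize("Hello, world! It's text-mining.")
print(words)
```

Each match marks where a word begins and ends; everything between matches (commas, full stops, apostrophes, spaces) is discarded.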
TF-IDF
(TERM FREQUENCY INVERSE DOCUMENT FREQUENCY)
MEASURE FOR IDENTIFYING WORDS THAT CAN BE CONSIDERED DISCRIMINATING FOR A TEXT
TWO PARTS (FORMULAS)
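The notes do not give the two formulas, so the sketch below assumes one common variant: TF is the word count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. Other variants exist (smoothing, different log bases).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (each doc is a token list)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF (assumed: relative frequency) times IDF (assumed: ln(N / df)).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = ["the cat sat".split(), "the dog barked".split()]
scores = tf_idf(docs)
print(scores[0])
```

"the" appears in every document, so its IDF is ln(1) = 0 and its score is 0, while "cat" appears in only one document and gets a positive score. This is what makes a word discriminating for a text.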