Big Data
2 Types
large storage / processing capacity (not really BD)
needs to be formalized
Identifiers: tell you where a datum belongs --> alphanumeric string associated with a particular object --> needed for completeness, otherwise no meaning
Immutability = freeze it, the identifier must never change
Introspection = attached to the object itself, in a permanent place
Indexing --> find it (summary, faster search, cross-indexing to find relationships, merging data, easily updatable after the database is created)
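The four properties above can be sketched in a few lines; this is a minimal illustration, not any specific Big Data system, and the record fields and `make_identifier` helper are invented for the example.

```python
import uuid

# Identifier: a meaningless alphanumeric string assigned to one object.
# Immutability: once assigned, it never changes.
def make_identifier() -> str:
    return str(uuid.uuid4())

# Introspection: the identifier travels with the object itself,
# stored in a permanent place inside the record.
record = {"id": make_identifier(), "species": "Homo sapiens", "year": 2013}

# Indexing: a summary structure built after the database exists,
# mapping a searchable attribute back to identifiers; easy to update
# and to cross-index against other attributes.
index: dict[str, list[str]] = {}

def add_to_index(rec: dict, key: str) -> None:
    index.setdefault(rec[key], []).append(rec["id"])

add_to_index(record, "species")
```

Looking up `index["Homo sapiens"]` then returns the identifiers of all matching records without scanning the data itself.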
Forms of data
lots of / massive data --> simple format data
Big Data
Small Data
answers a particular question
one / several PCs
highly structured
usually one discipline
one source / uniform format
one data preparation method for the entire database
longevity: ~7 years
measurement: one standard protocol
repeatable
introspection: rows / columns
can be analyzed all at once
Big Data
goals flexible / protean, not fixed in advance
powerful servers
absorbs unstructured data
used by multiple disciplines
different sources / methods
longevity: kept eternally, storage media / formats change over time
replication not feasible in most cases
introspection needed to identify data points and methods
parallel processing required for analysis
Sample size calculation
Confidence Level --> how sure you can be, e.g. 95 %
Confidence Interval --> margin of error, the plus-minus figure
population size: above ~400,000 effectively infinite (sample size barely changes)
Data sampling methodologies
Non-Probability Sampling
Reliance on available subjects (convenience) --> whoever is at hand (shopping malls, movie theatres)
Purposive / judgmental --> specific classes (e.g. restaurants)
Snowball Sample --> for members who are difficult to locate
Quota Sample --> selection of individuals to achieve an equal distribution of characteristics
Probability Sampling
Simple random sample --> most common method, equal chance for each entity
Systematic sample --> entities placed in a list, every kth one selected
Stratified sample --> to reach sub-groups / minorities that are not well represented --> weighting
Cluster sample --> population not well indexed, or not indexed at all
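Three of the probability-sampling methods above can be sketched directly; the population of 1,000 numbered entities and the stratum split are invented for illustration.

```python
import random

population = list(range(1, 1001))  # hypothetical list of 1000 entities

# Simple random sample: equal chance for each entity.
simple = random.sample(population, 50)

# Systematic sample: entities placed in a list, every kth one selected.
k = len(population) // 50
systematic = population[::k][:50]

# Stratified sample: sample each sub-group separately so minorities
# are represented; weighting can later correct for over-sampling them.
strata = {"minority": population[:100], "majority": population[100:]}
stratified = {name: random.sample(group, 10) for name, group in strata.items()}
```

Note that the stratified sample deliberately over-represents the minority stratum (10 of 100 vs. 10 of 900), which is exactly why weighting is needed at analysis time.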
Autocoding
understanding unstructured data
tagging with an identifier code
that corresponds to all synonymous terms in a standard nomenclature
nomenclature:
specialized vocabulary
synonyms and near-synonyms (= plesionyms)
canonical term = collects all synonyms
corrupted by polysemous (multiple-meaning) terms
--> with human involvement: 85 to 90 % accuracy
NLP --> Natural Language Processing
Rules --> Autocoding
without human involvement --> root stemming, stop words, syntax variation
drawbacks: terms lack etymologic commonality; slow; longevity (nomenclatures change over time and have to be curated)
stop words = most scalable method
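A minimal rule-based autocoder along the lines described above might look as follows; the nomenclature entries, code `C9385`, and stop-word list are invented for illustration, and real autocoders must also handle polysemous terms, stemming, and syntax variation.

```python
# Map every synonym and near-synonym (plesionym) to one canonical code
# from a standard nomenclature (entries are hypothetical).
NOMENCLATURE = {
    "renal cell carcinoma": "C9385",
    "hypernephroma": "C9385",   # synonym -> same canonical code
    "kidney cancer": "C9385",   # plesionym -> same canonical code
}
STOP_WORDS = {"the", "a", "of", "with"}

def autocode(text: str) -> set[str]:
    """Tag free text with the nomenclature codes of all terms it
    contains, after dropping stop words (the most scalable step)."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    cleaned = " ".join(words)
    return {code for term, code in NOMENCLATURE.items() if term in cleaned}

print(autocode("a patient with the hypernephroma"))  # {'C9385'}
```

Because all three surface forms map to the same canonical code, a search on `C9385` retrieves records regardless of which synonym the original text used.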