Big Data

2 Types

large storage / processing capacity alone (not really Big Data)

needs to be formalized

Identifiers: where it belongs --> an alphanumeric string associated with a particular object --> needed for completeness; without it the data has no meaning

Immutability = freeze it; the identifier must never change

Introspection = attach it to the object, in a permanent place

Indexing --> find it (summary, faster search, cross-indexing to find relationships, merging data, easily updatable after the database is created)
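The identifier and indexing ideas above can be sketched in a few lines. The records and the indexed `species` attribute are invented for illustration: each object gets a permanent UUID identifier, and a simple index maps an attribute back to identifiers so relationships can be found without scanning every record.

```python
import uuid

records = {}            # identifier -> object
index_by_species = {}   # indexed attribute -> list of identifiers

def register(obj: dict) -> str:
    """Attach a permanent identifier to the object and index it."""
    obj_id = str(uuid.uuid4())   # immutable: never changes after assignment
    records[obj_id] = obj
    index_by_species.setdefault(obj["species"], []).append(obj_id)
    return obj_id

a = register({"species": "finch", "weight_g": 21})
b = register({"species": "finch", "weight_g": 24})
print(len(index_by_species["finch"]))  # -> 2, both found via the index
```

The index can be rebuilt or extended after the database exists, which is what makes cross-indexing and merging practical.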

forms of data

lots of / massive data --> simple-format data

Big Data vs. Small Data

Small Data

answer particular question

one / several PCs

highly structured

one or a few disciplines

one source / uniform format

data preparation: one method for the entire database

longevity: ~7 years

measurement: one standard protocol

repeatable

introspection: rows / columns

can be analyzed at once

Big Data

goal in mind, but flexible / protean

powerful servers

absorbing unstructured data

used by multiple disciplines

different sources / methods

longevity: kept in perpetuity; storage media / formats change over time

replication not feasible in most cases

data introspection = the method used to identify data points

parallel processing required for analysis

Sample size calculation

Confidence Level --> how sure you can be (typically 95 %)

Confidence Interval --> margin of error, the plus-or-minus figure

population size: above ~400,000 it behaves as effectively infinite
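The calculation can be sketched with Cochran's sample-size formula plus an optional finite-population correction (the 1.96 z-score corresponds to the 95 % confidence level; the parameter names are my own):

```python
import math

def sample_size(z=1.96, margin=0.05, p=0.5, population=None):
    """Cochran's formula: n0 = z^2 * p(1-p) / e^2,
    with a finite-population correction when a population size is given."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size())                     # -> 385 (effectively infinite population)
print(sample_size(population=400_000))   # -> 384 (barely different: "infinite")
```

The two results differ by one respondent, which is why a population of 400,000 is treated as infinite for practical purposes.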

Data sampling methodologies

Non-Probability Sampling

Reliance on available subjects --> take whoever is available (e.g. shopping malls, movie theatres)

Purposive / judgmental --> specific classes of subjects chosen deliberately

Snowball sample --> for subjects who are difficult to locate

Quota sample --> selection of individuals to match a set distribution of characteristics

Probability Sampling

Simple random sample --> most common method; each entity has an equal chance

Systematic sample --> subjects placed in a list, every kth one selected

Stratified sample --> to reach sub-groups / minorities that are otherwise not well represented --> weighting

Cluster sample --> used when the population is not well indexed or not indexed at all
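Two of the probability methods above can be sketched as follows. The population records and their `group` attribute are invented for illustration: systematic sampling takes every kth item after a random start, and stratified sampling draws a fixed number from each sub-group so minorities are represented.

```python
import random

def systematic_sample(items, k):
    """Take every k-th item after a random starting offset."""
    start = random.randrange(k)
    return items[start::k]

def stratified_sample(items, key, n_per_stratum):
    """Group items into strata by a key, then draw n from each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, min(n_per_stratum, len(members))))
    return sample

# toy population: every 10th record belongs to a small sub-group
population = [{"id": i, "group": "minority" if i % 10 == 0 else "majority"}
              for i in range(100)]
print(len(systematic_sample(population, 5)))                        # -> 20
print(len(stratified_sample(population, lambda r: r["group"], 5)))  # -> 10
```

Note how the stratified draw guarantees five minority records even though the minority is only 10 % of the population; a simple random sample of ten would often miss them.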

Autocoding

goal: understand unstructured data

Tagging with an identifier code

that corresponds to all synonymous terms in a standard nomenclature

nomenclature:

specialized vocabulary

synonyms and near-synonyms (= plesionyms)

canonical term = collects all the synonymous terms

corrupted by polysemous terms (terms with multiple meanings)

--> with human involvement: 85 to 90 % accuracy

NLP --> Natural Language Processing

Rules for autocoding

without human involvement --> root stemming, stop words, syntax variation
problems: terms lack etymologic commonality, it is slow, and longevity (nomenclatures change over time and have to be curated)

stop-word removal = most scalable method
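The tagging idea can be sketched as follows. The nomenclature here is a toy example (the concept codes are loosely modeled on medical concept identifiers, not taken from a real standard): every synonymous term maps to one canonical code, and stop words are dropped before matching.

```python
# words ignored during matching (a tiny illustrative stop-word list)
STOP_WORDS = {"the", "of", "a", "in"}

# hypothetical nomenclature: canonical code -> its synonymous terms
NOMENCLATURE = {
    "C0027051": ["heart attack", "myocardial infarction", "cardiac infarction"],
    "C0011849": ["diabetes", "diabetes mellitus"],
}

# invert to a lookup table: term -> canonical code
TERM_TO_CODE = {term: code
                for code, terms in NOMENCLATURE.items()
                for term in terms}

def autocode(text: str) -> list:
    """Tag every nomenclature term found in the text with its canonical code."""
    lowered = " ".join(w for w in text.lower().split() if w not in STOP_WORDS)
    return [code for term, code in TERM_TO_CODE.items() if term in lowered]

print(autocode("History of a myocardial infarction and diabetes"))
# -> ['C0027051', 'C0011849']
```

A real autocoder also has to handle stemming, word-order variation, and polysemy; this substring match shows only the core synonym-to-code mapping, which is why rules-based autocoding without human review tops out below human-assisted accuracy.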