data science - Coggle Diagram
data science
numerical data type
discrete
ex: person's IQ measurement
coin toss
don't have to be whole, but do have to be distinct
continuous
values can take on any number within a range
a list like 1.1, 1.2, ... is not continuous, because the values move in fixed steps
every number from start to finish can be taken on
bottle of water: amount of water takes on every value between 0 and 1
every number can be applied
ex: speed of car - car has to take on every speed to get from 0 - 60
not step-sizing
takes on every single one of speed values
range and domain
domain: values that the data points lie in
ex: starts at 15K and goes up to 200K
defines starting and ending points. A section in the data.
ordinal data type
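mix of numerical and categorical data: categories that have a meaningful order
ex: star ratings
values can be compared (ranked), but not as straightforwardly as numbers, because the gaps between levels aren't necessarily equal
taking averages from large datasets has more validity, more meaning
ex: survey question on how you feel: bad, neutral, good, excellent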
next topic: inferential statistics, key statistical terms
averages:
- mean - sum all values and divide by total number of values you have
- median: middle value in your dataset
with an even number of data points there is no single middle value - take the two middle values, add them, and divide by 2
can be more useful than the mean when there are outliers - the median is barely affected by them
STILL isn't really representative of the rest of the data - you know what is at the center, but you don't know about the other data points
- mode: most common value in your data
not only applicable to numerical data - can be used in categorical data, too
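A minimal sketch of the three averages, using Python's standard `statistics` module. The salary figures and survey answers are made up for illustration; the last salary is a deliberate outlier to show how the mean moves while the median does not.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands; 200 is an outlier.
salaries = [30, 32, 35, 35, 40, 200]

avg = mean(salaries)       # sum of all values / number of values; pulled up by the outlier
middle = median(salaries)  # even count: average of the two middle values (35 and 35)
common = mode(salaries)    # most common value

# mode also works on categorical data (hypothetical survey answers)
common_answer = mode(["good", "neutral", "good", "bad"])

print(avg, middle, common, common_answer)
```

Here the mean lands well above most of the salaries, while the median stays at the center of the typical values.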
covariance:
how much one variable changes when the other variable changes (how the two vary together)
coffee + tiredness. change one, how much does it affect the other variable?
correlation: covariance divided by the product of the two variables' standard deviations
ranges between -1 and 1
- correlation of 1 = perfect positive correlation
- positive correlation = the variables go up together (closer to 1)
- slightly positive correlation - not very strong = between 0 and 1, nearer to 0
- closer to zero = little or no correlation
- negative correlation = one variable goes up as the other goes down; -1 = perfect negative correlation
- correlation does not imply causation - just because they're correlated, that doesn't mean they cause one another
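A rough sketch of covariance and correlation computed by hand, assuming the sample (n - 1) versions of both. The helper names and the coffee/tiredness numbers are hypothetical, chosen so that tiredness falls in a straight line as coffee goes up.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance: summed products of deviations from each mean, over n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical data: more coffee, less tiredness.
cups_of_coffee = [0, 1, 2, 3, 4]
tiredness = [10, 8, 6, 4, 2]

cov = covariance(cups_of_coffee, tiredness)
r = correlation(cups_of_coffee, tiredness)
print(cov, r)
```

Covariance comes out negative (the variables move in opposite directions), and dividing by the standard deviations rescales it to a correlation at the -1 end of the range. Even so, the number alone says nothing about whether coffee causes the change.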
quantiles:
- when you split data into number ranges that have the same probability
- when you split data into distinct datasets containing the same amount of data
ex:
- quartiles (split data into four equal regions) - lowest 0-25%, second 25-50%, third 50-75%, highest 75-100%
- percentiles (splitting data into 100 equal segments)
used for normalization - takes into consideration, for example, whether the test was easy or hard - you aren't judged on YOUR performance, but in comparison with everyone else who took the same test
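A small sketch of quartiles and a percentile-style comparison, assuming Python 3.8+ for `statistics.quantiles`. The test scores are invented, and `percentile_rank` is a hypothetical helper that uses one simple definition (share of scores at or below yours); other definitions exist.

```python
from statistics import median, quantiles

# Hypothetical test scores for everyone who took the same test.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]

# Three cut points split the data into four equal-sized groups.
quartile_cuts = quantiles(scores, n=4)

def percentile_rank(all_scores, score):
    """Percentage of scores that are at or below the given score."""
    return 100 * sum(s <= score for s in all_scores) / len(all_scores)

print(quartile_cuts)
print(percentile_rank(scores, 75))
```

The middle cut point is the median, and the percentile rank describes a score relative to everyone else rather than on its own.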
one-variable graphs:
- histogram: shows each value (or range of values) + how often it has occurred
- bar plot: allows comparison across different groups
- pie chart: the whole pie = 100%
shows you how your data is sliced - what key categories make up your data
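The numbers behind a histogram and a pie chart can be sketched without any plotting library: count how often each value occurs, then turn the counts into shares of 100%. The star ratings below are made up for illustration.

```python
from collections import Counter

# Hypothetical star ratings (1-5).
star_ratings = [5, 4, 4, 3, 5, 5, 1, 4]

# Histogram data: each value and how often it occurred.
counts = Counter(star_ratings)

# Pie chart data: each value's slice as a percentage of the whole.
shares = {stars: 100 * n / len(star_ratings) for stars, n in counts.items()}

for stars in sorted(counts):
    print(stars, "#" * counts[stars], f"{shares[stars]}%")
```

The same counts could feed a bar plot; only the presentation differs.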
two-variable graphs:
- scatter plot: each data point is a dot on the graph
allows you to see the spread of data across two variables
shows groupings
- examples:
income versus years in education
- line plots: same basis of x and y axis, but points are connected. Advantages - easy to see trends. Also advantageous if data points are related.
examples:
distance vs time
three or more variable graphs:
- heat maps: map of movement that shows time spent in a particular area (or on a page) to indicate interest, engagement
- multi-variable bar plot: take three variables, for instance, for each group and show them in groups on the x axis together
- adding more variables to lower dimensional graphs: add a Z axis to the x and y to plot in 3D - you can rotate the graph to view it from all angles, but having to rotate it to read it is a "con" of using a multi-variable scatter plot