Data analysis

1 introduction to data analysis
Break it down

how to translate raw numbers into intelligence that drives real-world action
how to break down and structure complex problems

When you analyze data, what are you doing?

What

Basic process

Using empirical evidence to think carefully about problems.

  1. Define problems
  1. Disassemble
    breaking problems and data into smaller pieces
  1. Evaluate
    Here’s the meat of the analysis, where you draw your conclusions about what you’ve learned in the first two steps.
  1. Decide
    Finally, you put it all back together and make (or recommend) a decision.

all data analysis is designed to lead to better decisions

You need to get as much information as you can from the business to define your problem.

Data analysis is all about identifying problems and then solving them.

“exploratory data analysis,” where you explore the data for ideas you might want to evaluate further.

questions to ask

Always ask “how much.” Make your goals and beliefs quantitative.

competitors

How can you use data to define problems and answer these questions?

If something seems curious -> ask.

customer target

Understand business

You need to divide your problem into manageable, solvable chunks

If the data you receive is a summary, you’ll want to know which elements are most important to you. -> Break down your summary data by searching for interesting comparisons.


If your data comes in a raw form, you’ll want to summarize the elements to make that data more useful.
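The raw-to-summary step above can be sketched in a few lines of Python. The regions and sales numbers below are hypothetical stand-ins, not data from the text:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records: (region, units_sold) pairs.
raw = [("east", 120), ("west", 95), ("east", 130), ("west", 88), ("north", 110)]

# Summarize the raw elements: group by region, then compare averages.
by_region = defaultdict(list)
for region, units in raw:
    by_region[region].append(units)

summary = {region: mean(units) for region, units in by_region.items()}
```

Once the raw rows are collapsed into per-region averages, the interesting comparisons (east vs. west, say) are easy to spot.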

Summarize what your client believes, along with your own thoughts on the data you've received, before doing the analysis.

the key to evaluating the pieces you have isolated is comparison.

Observations about the problem
Observations about the data

Analysis begins when you insert yourself

Inserting yourself into your analysis means making your own assumptions explicit and betting your credibility on your conclusions.


Whether you’re building complex models or making simple decisions, data analysis is all about you: your beliefs, your judgment, your credibility.

Insert yourself

Don’t insert yourself

Good for you

Good for your clients

You'll know what to look for in the data.
You'll avoid overreaching in your conclusions.
You'll be responsible for the success of your work.

Your client will respect your judgments more.
Your client will understand the limitations of your conclusions.

Bad for you

Bad for your clients

You’ll lose track of how your baseline assumptions affect your conclusions.
You’ll be a wimp who avoids responsibility!

Your client won’t trust your analysis, because he won’t know your motives and incentives.
Your client might get a false sense of “objectivity” or detached rationality.

The report you present to your client needs to be focused on making yourself understood and encouraging intelligent, data-based decision making.

  1. Context
  1. Interpretation of data
  1. Recommendation

Your assumptions and beliefs about the world are your mental model


You looked at your areas of uncertainty.

If you’re aware of your mental model, you’re more likely to see what’s important and develop the most relevant and useful statistical models.

Mental models should always include what you don’t know

Ex: List some assumptions that would be true if MoisturePlus is actually the preferred lotion for tweens.
List some assumptions that would be true if MoisturePlus was in serious danger of losing customers to their competition.

Being clear about your knowledge gaps is essential.

2 experiments
Test your theories

Can you show what you believe?
In a real empirical test? There’s nothing like a good experiment to solve your problems and show you the way the world really works. Instead of having to rely exclusively on your observational data, a well-executed experiment can often help you make causal connections. Strong empirical data will make your analytical judgments all the more powerful.


How

the survey data

options to figure out how to increase sales

Interview the CEO to figure out how Starbuzz works as a business.

Do a survey of customers to find out what they’re thinking.

Find out how the projected sales figures were calculated.

They take a random, representative sample
of consumers and ask them a bunch of pertinent questions about how they feel

What people say in surveys does not always fit with how they behave in reality, but it never hurts to ask people how they feel.
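Drawing the random, representative sample itself is straightforward; here is a minimal sketch using Python's `random.sample`, with a made-up sampling frame of 1,000 customer IDs:

```python
import random

# Hypothetical sampling frame: 1,000 customer IDs.
customers = list(range(1000))

random.seed(42)  # seeded only so the illustration is reproducible
# Draw a simple random sample of 50 customers to survey.
sample = random.sample(customers, 50)
```

`random.sample` draws without replacement, so no customer is surveyed twice.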

Observational study: A study where the people being described decide on their own which groups they belong to.

Comparisons are key for observational data

The more comparative the analysis is, the better.

taking an inventory of observational data is often the first step to getting better data through experiments.

Look at

Do you notice any patterns?

Is there anything that might explain to you why sales are down?

observational studies aren’t that powerful when it comes to drawing causal conclusions.


=> need other tools to get those sorts of conclusions.

You don't know which direction the cause runs.
Ex: when you’re starting to suspect that causes are going in one direction (like value perception decline causing sales decline), flip the theory around and see how it looks (like sales decline causes value perception decline).

Observational studies are full of confounders

A confounder is a difference among the people in your study other than the factor you’re trying to compare that ends up making your results less sensible.

Your job as the analyst is always to think about how confounding might be affecting your results

it’s always really important that your conclusions make sense

Manage confounders by breaking the data into chunks

These smaller chunks are more homogenous

they don’t have the internal variation that might skew your results and give you the wrong ideas.
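Chunking by a suspected confounder can be sketched like this. The store types, coupon flags, and sales figures are hypothetical, invented just to show the mechanics:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (store_type, used_coupon, sales).
# Store type is the suspected confounder, so we chunk by it.
records = [
    ("urban", True, 210), ("urban", False, 200),
    ("urban", True, 215), ("rural", True, 110),
    ("rural", False, 100), ("rural", False, 95),
]

chunks = defaultdict(lambda: {True: [], False: []})
for store, coupon, sales in records:
    chunks[store][coupon].append(sales)

# Within each homogeneous chunk, the coupon/no-coupon comparison
# is no longer distorted by the urban/rural mix.
effect = {store: mean(groups[True]) - mean(groups[False])
          for store, groups in chunks.items()}
```

Comparing coupon users to non-users inside each chunk removes the urban/rural variation that would otherwise skew a pooled comparison.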

You need an experiment to say
which strategy will work best

If you want to draw conclusions about things
that overlap with your data but aren’t completely described in the data, you need theory to make the connection.

In order to get more clarity about which strategy is better, you’re going to need to run an experiment.

Control group: a group of treatment subjects that represents the status quo, not receiving any new treatment.

No control group means no comparison.
No comparison means no idea what happened.

Confounders also plague experiments

In order for your comparison to be valid, your groups need to be the same.

Avoid confounders by
selecting groups carefully

randomization method.

Randomization selects
similar groups

a great way to avoid confounders.

the factors that might otherwise become confounders end up getting equal representation among your control and experimental groups.

In your spreadsheet program, create a column called “Random” and type this formula into the first cell: =RAND().
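The same sort-on-a-random-column trick can be sketched in Python with `random.shuffle`. The subject names below are hypothetical:

```python
import random

# Hypothetical list of subjects (e.g., microregions) to assign.
subjects = [f"region-{i}" for i in range(12)]

random.seed(1)            # seeded only so the illustration is reproducible
random.shuffle(subjects)  # the Python analogue of sorting on a RAND() column

# Split the shuffled list into equal control and experimental groups.
half = len(subjects) // 2
control, experimental = subjects[:half], subjects[half:]
```

Because the split happens after a random shuffle, each subject has the same chance of landing in either group, which is what keeps potential confounders evenly represented.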

design your experiment

The purpose of the experiment

What are your control and experimental groups going to be?

How will you avoid confounders?

What will your results look like?

Assign the microregions randomly to control and experimental groups

Ex:
Control: maintain the status quo for a month
Experimental group #1: drop prices for a month
Experimental group #2: persuade customers that Starbuzz is a value for a month

We all want more of something.
And we’re always trying to figure out how to get it. If the things we want more of—profit, money, efficiency, speed—can be represented numerically, then chances are, there’s a tool of data analysis to help us tweak our decision variables, which will help us find the solution or optimal point where we get the most of what we want. In this chapter, you’ll be using one of those tools and the powerful spreadsheet Solver package that implements it.

3 optimization
Take it to the max


Thinking

  1. Read the problem statement -> define the problem
  1. What data do you need to solve this problem?
  1. You can divide those data needs into two
    categories: things you can’t control, and things you can.
  1. define your actual constraints for this problem

Constraints don’t tell you how to maximize profit; they only tell you what you can’t do to maximize profit.

Decision variables, on the other hand, are the things you can control

  1. When you want to get as much (or as little) of something as possible, and the way you’ll get it is by changing the values of other quantities, you have an optimization problem.

To solve an optimization problem, you need to combine your decision variables, constraints, and the thing you want to maximize together into an objective function.

  1. The objective is the thing you want to maximize or minimize, and you use the objective function to find the optimum result.


Each “c” refers to a constraint.

Each “x” refers to a decision variable.

“P” is your objective: the thing you want to maximize.

the space where product mixes are within the constraint lines is called the feasible region.

Microsoft Excel and OpenOffice both have a handy little utility called Solver that can make short work of your optimization problems.
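Outside a spreadsheet, the same kind of product-mix problem can be handed to a linear-programming routine. Here is a minimal sketch using SciPy's `linprog` (assuming SciPy is available); the two products, the 100/125 profits, and the capacity/materials constraints are all hypothetical numbers, not figures from the text:

```python
from scipy.optimize import linprog

# Objective: maximize P = 100*x1 + 125*x2.
# linprog minimizes, so we negate the coefficients.
c = [-100, -125]

# Constraints: x1 + x2 <= 500 (capacity), 2*x1 + x2 <= 800 (materials).
A_ub = [[1, 1], [2, 1]]
b_ub = [500, 800]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
x1, x2 = result.x          # optimal decision variables
profit = -result.fun       # undo the negation to recover max profit
```

The solver searches the feasible region defined by the constraint lines and returns the corner where the objective is largest, exactly what Solver does behind the scenes.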

Your models approximate reality and are never perfect,

There’s a lot more to reality than this model.

If your assumptions are accurate and your data's good, the tools can be pretty reliable.

Your goal should be to create the most useful models you can

Calibrate your assumptions
to your analytical objectives

Write down everything you think you know and everything you
think you don’t know

You can’t specify all your assumptions, but if you miss an important one it could ruin your analysis.
You will always be asking yourself how far you need to go specifying assumptions.

Don’t assume that two variables are independent of each other. Any time you create a model, make sure you specify your assumptions about how the variables relate to each other.

All your data is observational, and you don’t know what will happen in the future.
Your model is working now, but it might break suddenly. You need to be ready and able to reframe your analysis as necessary. This perpetual, iterative framework is what analysts do.

You need more than a table of numbers.
Your data is brilliantly complex, with more variables than you can shake a stick at. Mulling over mounds and mounds of spreadsheets isn’t just boring; it can actually be a waste of your time. A clear, highly multivariate visualization can in a small space show you the forest that you’d miss for the trees if you were just looking at spreadsheets all the time.

4 data visualization
Pictures make you smarter


good data analysis begins and ends with thinking with data.

Thinking

If you’ve got a lot of data and aren’t sure what
to do with it, just remember your analytical objectives.

When looking at a new visualization, the first question to ask is: “What is the data behind the visualization?”

To build good visualizations, first identify the fundamental comparisons that will address your client’s objectives.

Great visualizations

Shows the data

Makes a smart comparison

Shows multiple variables

Scatterplots are great tools for exploratory data analysis, which is the term statisticians use to describe looking around in a set of data for hypotheses to test.


Analysts like to use scatterplots when searching for causal relationships, where one variable is affecting the other.


As a general rule, the horizontal x-axis of the scatterplot represents the independent variable (the variable
we imagine to be a cause), and the vertical y-axis of a scatterplot represents the dependent variable (which we imagine to be the effect).

You don’t have to prove that the value of the independent variable causes the value of the dependent variable, because after all we’re exploring the data. But causes are what you’re looking for.
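A minimal exploratory scatterplot can be sketched with matplotlib (assuming it is available); the ad-spend and sales numbers below are made up purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

ad_spend = [10, 20, 30, 40, 50, 60]   # suspected cause -> x-axis
sales    = [15, 22, 31, 35, 44, 52]   # suspected effect -> y-axis

fig, ax = plt.subplots()
pts = ax.scatter(ad_spend, sales)
ax.set_xlabel("Ad spend (independent variable)")
ax.set_ylabel("Sales (dependent variable)")
```

Putting the suspected cause on the x-axis and the suspected effect on the y-axis follows the convention described above; the plot itself proves nothing about causation, but it shows where a causal hypothesis might be worth testing.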

A visualization is multivariate if it compares three or more variables.

making your visualizations as multivariate as possible makes it most likely that you’ll make the best comparisons.

One way of making your visualization more multivariate is just to show a bunch of similar scatterplots right next to each other.

This graphic was created with an open source software program called R.

use illustration programs like Adobe Illustrator and just draw visualizations, if you have visual ideas that other software tools don’t implement.

If you want inspiration on designs, you should probably pick up some books by Edward Tufte

5 hypothesis testing
Say it ain’t so

The world can be tricky to explain.
And it can be fiendishly difficult when you have to deal with complex, heterogeneous data to anticipate future events. This is why analysts don’t just take the obvious explanations and assume them to be true: the careful reasoning of data analysis enables you to meticulously evaluate a bunch of options so that you can incorporate all the information you have into your models. You’re about to learn about falsification, an unintuitive but powerful way to do just that.

How

If we have a problem whose answer can't be found or observed publicly, what do you need to know in order to get started?

We’ll need some sort of insight into how they think about their releases, and we’ll need to know what kind of information they use in their decision.

You need to figure out how to compare the data you do have with your hypotheses about when PodPhone will release their new phone.

When you are looking at data variables, it’s a good idea to ask whether they are positively linked, where more of one means more of the other (and vice versa), or negatively linked, where more of one means less of the other.

  1. Collect your evidence
  1. Define your hypotheses related to problems
    Different answers to that question are your hypotheses for this analysis.
  1. Put all this intelligence together and form a solid prediction => connect the variables' positive (+) and negative (-) links to the problem

Using the relationships specified on the facing page, draw a network that incorporates all of them.

  1. Run a hypothesis test: falsification is the heart of hypothesis testing

Don’t try to pick the right hypothesis; just eliminate the disconfirmed hypotheses. This is the method of falsification, which is fundamental to hypothesis testing.


Falsification enables you to have a more nimble perspective on your hypotheses and avoid a huge cognitive trap.


Use falsification in hypothesis testing and avoid the danger
of satisficing. The big problem with satisficing is that when people pick a hypothesis without thoroughly analyzing the alternatives, they often stick with it even as evidence piles up against it

  1. Diagnosticity: Use the evidence to rank hypotheses in the order of which has the fewest evidence-based knocks against it.

Diagnosticity is the ability of evidence to help you assess the relative likelihood of the hypotheses you’re considering. If evidence is diagnostic, it helps you rank your hypotheses.


The “+” symbol indicates that the evidence supports that hypothesis, while the “–” symbol indicates that the evidence counts against the hypothesis.


Nondiagnostic evidence doesn’t get you anywhere.

Some of the information is publicly available, some of it is secret, and some of it is rumor.

Public: the economy, the industry, competitors, research, insights, Google Trends, ...

Secret or rumor: the CEO's thinking, brand strategy, ...

Causes in the real world are networked, not linear.

As an analyst, you need to see beyond simple linear models and expect to see causal networks.

subjective probabilities

Sometimes, it’s a good idea to make up numbers.
Seriously. But only if those numbers describe your own mental states, expressing your beliefs. Subjective probability is a straightforward way of injecting some real rigor into your hunches, and you’re about to see how. Along the way, you are going to learn how to evaluate the spread of data using standard deviation and enjoy a special guest appearance from one of the more powerful analytic tools you’ve learned.

what is it?

Subjective probability is a type of probability derived from an individual's personal judgment or own experience about whether a specific outcome is likely to occur.

It contains no formal calculations and only reflects the subject's opinions and past experience.

Subjective probabilities are something that everyone understands but that don’t get nearly enough use.

You want to use the standard deviation. The standard deviation measures how far typical points are from the average (or mean) of the data set.

Standard deviation can be used here to measure disagreement. The larger the standard deviation of subjective probabilities from the mean, the more disagreement there will be among analysts as to the likelihood that each hypothesis is true.

Use the STDEV formula in Excel to calculate the standard deviation.

=STDEV(data range)
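The same calculation in Python uses `statistics.stdev`, which, like Excel's STDEV, computes the sample standard deviation. The five analysts' subjective probabilities below are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical subjective probabilities (in %) from five analysts
# for a single hypothesis.
probs = [20, 40, 40, 60, 40]

avg = mean(probs)
spread = stdev(probs)  # sample standard deviation, like Excel's STDEV
```

A larger `spread` means the analysts disagree more about how likely the hypothesis is.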

Using each analyst’s first subjective probability as a base rate, maybe we can use Bayes’ rule to process this new information.

Bayes' theorem allows you to update predicted probabilities of an event by incorporating new information.


It is often employed in finance in updating risk evaluation
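A single Bayes' rule update can be sketched directly from its definition. All the probabilities below are hypothetical; the prior stands in for an analyst's earlier subjective probability used as a base rate:

```python
# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E).
prior = 0.40                 # base rate: P(hypothesis)
p_evidence_if_true = 0.80    # P(evidence | hypothesis)
p_evidence_if_false = 0.30   # P(evidence | not hypothesis)

# Total probability of seeing the evidence at all.
p_evidence = (p_evidence_if_true * prior
              + p_evidence_if_false * (1 - prior))

# Updated belief after seeing the evidence.
posterior = p_evidence_if_true * prior / p_evidence
```

Here the new evidence moves the analyst's belief from 40% up to 64%; the posterior then becomes the base rate for the next round of evidence.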

bayesian statistics

What is it?

It provides us with mathematical tools to update our beliefs about random events in light of seeing new data or evidence about those events.

You’ll always be collecting new data.
And you need to make sure that every analysis you do incorporates the data you have that’s relevant to your problem. You’ve learned how falsification can be used to deal
with heterogeneous data sources, but what about straight-up probabilities? The answer involves an extremely handy analytic tool called Bayes’ rule, which will help you incorporate your base rates to uncover not-so-obvious insights with ever-changing data.

Bayesian statistics is a system for describing epistemological uncertainty using the mathematical language of probability.