DS101: Data Science Interview Overview
Interview Soft-Skill Tips
Product Sense Interview
Use the specific context of the product; don't just follow a framework
Always clarify questions, understand context before offering suggestions
Three ideas are good enough
Ask for some time to structure my answer
:red_flag: Too many ideas, using a framework blindly, not being able to defend your ideas
Consider potential issues and pitfalls
Ask why before addressing the problem
Strategic thinking
What the company is looking for
Structure
Comprehensiveness
Feasibility
Why the STAR method does not work
Goal
Improve user experience?
Help promote businesses?
Generate advertisement revenue?
user growth, user engagement, increase revenue, acquisition, retention, activation
Impact
Challenges
Finding
Business
Business process
Business requirements
Data sources
Business model
E-Commerce
SaaS
Journey / Conversion Funnel
Subscription
Stickiness, engagement, retention
Metrics
Acquisition, churn, DAU, DAU/MAU, average time spent per user per day, revenue (conversion rate, revenue per customer, LTV)
Mobile application
Two-sided marketplace
Social media
Search engine
Behavioral Interview
Conflict and time management
Leadership
Responsibility for the team
Thrive in ambiguity
Questions
About the company
Why do you want to join? What do you value most
About you
Tell me about yourself
Impactful project
Emphasize the impact
Make it more conversational
The situational question
Tell a good story
Ask questions
Presentation
Focus on your impact.
Use your best stuff.
List the limitations.
Think through the technical details.
Remember that behavior does matter.
Projects
Goal
Impact
Challenges
Share one interesting finding
Define success metrics, obtain data, handle large-scale data, data processing, data quality, model training and deployment
3 tips
Zoom out and think about the big picture: why the project is important
Engage the interviewer
Remove useless details
Avoid talking non-stop; tailor the level of detail to the interviewer's level of interest
Interview Preparation Guidelines
Focus on one type of job
Learning deeply
Doing things, learning from mistakes: Practice, Practice, Practice
Learn two items every day, roll and review
SQL Interview Basics and Tips
Easy (5 min), Medium (10 min), Hard (15 min)
Catch edge cases
OEC(Overall Evaluation Criterion)
Random unit without bias
Correlation does not imply causality
A/B Testing: Running and Analyzing Experiments - a practical end-to-end example
Setting up example
Funnel
User journey
Hypothesis Testing: Establishing statistical significance
Design Experiment
Randomization unit, population, sample size, duration, primacy and novelty effects
User?, Target population in the experiment, determine the sample size
n = 16*sigma^2/delta^2 per group (rule of thumb for alpha = 0.05 and power = 1 - beta = 0.80); see the sketch below
For alpha = 0.01 (two-sided, same power) the constant grows to 2*(z_0.995 + z_0.80)^2 ≈ 23, so n ≈ 23*sigma^2/delta^2
sigma^2 is the variance estimated from historical data; delta is the minimum detectable effect
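A minimal sketch of this sample-size rule, assuming a two-sample, two-sided test; the variance (sigma2 = 0.25) and MDE (delta = 0.02) are hypothetical values:

```python
# Per-group sample size for a two-sample, two-sided test.
# sigma2: variance estimated from historical data; delta: minimum detectable effect.
from scipy.stats import norm

def sample_size_per_group(sigma2, delta, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # e.g. 0.84 for power = 0.80
    return 2 * (z_alpha + z_beta) ** 2 * sigma2 / delta ** 2

# Hypothetical inputs: variance 0.25 (a ~50% conversion rate), MDE of 2 points.
print(round(sample_size_per_group(sigma2=0.25, delta=0.02)))  # about 9,800 per group
```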
Choosing Randomization Units
User ID, Cookie, Event(page view and session), Device ID
Ensure Consistent user experience
User visible change(User ID)
Non user visible change(Event/Session)
The randomization unit should be the same as or coarser than the unit of analysis
Segment difference
Simpson's paradox
Domain knowledge is needed to decide which view is more appropriate (aggregated or disaggregated)
Why does Simpson's paradox happen? An unbalanced dataset, or another variable affecting the result within the subgroups (see the sketch below)
(a trend appears in different groups of data but disappears or reverses when these groups are combined). The reasons for Simpson’s paradox happening could be: (1) The setup of your experiment is incorrect; (2) The change affects the new user and experienced users differently.
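A small numeric sketch of the paradox using hypothetical conversion counts; the reversal comes from the unbalanced allocation across segments:

```python
import pandas as pd

# Hypothetical conversion counts: treatment wins within each segment,
# but loses on the aggregate because the allocation is unbalanced.
df = pd.DataFrame({
    "segment":     ["new", "new", "old", "old"],
    "variant":     ["control", "treatment", "control", "treatment"],
    "conversions": [200, 50, 10, 80],
    "users":       [1000, 200, 200, 1000],
})
df["rate"] = df["conversions"] / df["users"]
print(df)                                        # treatment is better in both segments
agg = df.groupby("variant")[["conversions", "users"]].sum()
print(agg["conversions"] / agg["users"])         # control looks better overall
```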
Running experiment and getting data
Set up instruments and data pipelines to collect data
Interpreting the results
From results to decisions
Launch decision
Consider metric trade-offs and the cost of launching
Test a new ranking algorithm
The business goal of the problem
User journey (funnel)
Success metrics: measurable, attributable, sensitive, timely
Hypothesis Testing
Alpha, power, minimum detectable effect(MDE)
Delta is the minimum difference between control and treatment that we want to detect (minimum detectable effect)
Validity checks
Bias - Instrumentation effect, Checks: guardrail metrics (latency)
Bias - External Factors, Checks: Holidays, competition, economic disruptions(e.g. covid)
Bias - Selection bias, Check: A/A test
Run A/A test if there is no historical data
Bias - Sample Ratio Mismatch, Checks: chi-square goodness-of-fit test (see the sketch after this list)
Bias - Novelty effect, Check segment by new and old visitors
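A minimal sketch of the sample-ratio-mismatch check, assuming a planned 50/50 split; the observed counts are hypothetical:

```python
from scipy.stats import chisquare

# Hypothetical observed assignment counts vs. an expected 50/50 split.
observed = [50_881, 49_119]
total = sum(observed)
expected = [total / 2, total / 2]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
# A very small p-value (e.g. < 0.001) suggests a sample ratio mismatch:
# investigate the assignment/logging pipeline before trusting the results.
```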
Three types of questions
Diagnose problem
Clarify the scenarios and metrics
How to define ETA/...? By hour or by day, aggregated by day?
Check for outliers: are there extreme values? Investigate them
Was the data collected correctly?
Technical issues?
Analyze the temporal factor
Sudden change? Check the prediction algorithm and the data collection process
Gradual change? Look at the historical trend and weekly pattern; compare with ground truth
Segment users along different dimensions, e.g., region and platform
Ask: shall I summarize my approach, or go ahead with further analysis?
Measure success
Launch or not - Make decisions (experiment)
Organizational Metrics
Goal metrics
Success metrics / true north metrics: a single metric or a small set, capture ultimate success, may take a long time to materialize
Driver metrics
Shorter-term, fast-moving, more sensitive, HEART framework, AARRR, user funnels
Aligned with goal, actionable
Sensitive
Resistant to gaming
Guardrail Metrics
Protect the business and assess trustworthiness; most sensitive
Examples: Ad revenue and app performance.
Two types
Business
Internal Validity
App's performance, bugs, loading time, # of errors
Sample ratio mismatch
Evaluating metrics
Lifetime value(LTV)
Select metrics
Qualitative(survey)
Quantitative (Log)
ML questions
What is machine learning? Explain it to a 7-year-old
What is overfitting, how do you deal with it
2-3 sentences, give an example, provide solutions
Imbalance data set
random forest classifier
3 evaluation metrics
L1 v.s. L2 regularization
hyperparameters
What is XGBoost? Why use it? How do you evaluate it?
Model validation
Cross-validation
Estimate how accurately a model will perform in practice
Limit problems like overfitting (see the sketch below)
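A minimal cross-validation sketch with scikit-learn; the synthetic dataset, logistic-regression model, and 5 folds are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1_000)
# 5-fold cross-validation estimates out-of-sample accuracy and helps detect
# overfitting before touching a held-out test set.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```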
Applied ML problem
Entire workflow, getting and cleaning the data, building and evaluating models
Metric Interview
Two success metrics and one guardrail metric
True north metric
Statistics Interview
Probability
Distribution
Normal (search time): mean, median, mode...
Exponential (time spent): mean, median, mode...
Normal distribution: 68-95-99.7 rule
CLT
rule of thumb is 30 samples
Average follows CLT
Sum also follows CLT
Test whether a distribution is normal
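A quick simulation sketch of the CLT, assuming a right-skewed exponential population and the rule-of-thumb sample size of 30:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential with mean 2 (right-skewed, clearly non-normal).
population_mean, n, n_experiments = 2.0, 30, 10_000

# Draw many samples of size 30 and keep each sample's mean.
sample_means = rng.exponential(scale=population_mean, size=(n_experiments, n)).mean(axis=1)

# The distribution of sample means is approximately normal around the true mean.
# For an exponential, sigma equals the mean, so SE = 2 / sqrt(30) ≈ 0.37.
print(sample_means.mean(), sample_means.std(), population_mean / np.sqrt(n))
```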
Binomial
The number of clicks follows a binomial distribution
Click-through rate is the binomial count normalized by the number of impressions (each impression is a Bernoulli trial)
Long tail distribution
Geometric distribution
Average customer lifetime
Hypothesis Testing
Explain p value and CI to non-technical audience
Z-test and t-test
Bernoulli distribution (Click through probability, click through rate and conversion rate)
Is it a proportion or not?
Binomial test
The relationship between CI and H0:H1/Ha
Inferential techniques use a sample to either estimate a population parameter or test the strength and validity of a hypothesis.
In other words, if the null hypothesized value falls within the confidence interval, then the p-value is always going to be larger than 5%. Conversely, if the null hypothesized value falls outside of our confidence interval then the p-value is going to be less than 5%.
When 0 is included in the confidence interval (for a difference), we likely have no evidence of a difference between our sample estimate and the null-hypothesized population parameter.
Significance level
Choosing z vs. t depends on whether the population standard deviation is known; p-values can be computed with a web application
Regression
Assumptions
LINE
Linear
Use residual plots
Independent
Use residual plots
Seasonality, time series
residuals are Normal
Nice to have
Central limit theorem
Use Q-Q plots to check
A bow or S shape in the Q-Q plot indicates non-normality (see the sketch after this block)
Equal variance(Homoscedasticity)
Residual plots
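A minimal sketch of the residual-plot and Q-Q checks, using statsmodels on hypothetical synthetic data that satisfies the assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Synthetic data: y is linear in x plus normal noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=200)

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Residuals vs. fitted: look for curvature (non-linearity) or a funnel shape
# (non-constant variance / heteroscedasticity).
plt.scatter(model.fittedvalues, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()

# Q-Q plot: points far from the 45-degree line (bow or S shape) suggest
# the residuals are not normally distributed.
sm.qqplot(residuals, line="45", fit=True)
plt.show()
```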
Explain terminology
Where and when to use
Definition
What it means when the value changes: what a larger or smaller value indicates
Application
For a non-technical audience
Use examples
Avoid introducing more technical terms
Sampling and Simulation problems
Simulate a biased coin from a fair coin
Simulate a fair coin from a biased coin (see the sketch below)
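A sketch of one standard answer for the fair-from-biased direction (the von Neumann trick); the bias of 0.7 is a hypothetical example:

```python
import random

def biased_flip(p_heads=0.7):
    """A hypothetical biased coin: returns 'H' with probability p_heads."""
    return "H" if random.random() < p_heads else "T"

def fair_flip():
    """Von Neumann trick: flip the biased coin twice;
    HT -> heads, TH -> tails, HH/TT -> discard and retry.
    Works for any unknown bias because P(HT) == P(TH)."""
    while True:
        a, b = biased_flip(), biased_flip()
        if a != b:
            return a  # 'H' or 'T', each with probability 1/2

flips = [fair_flip() for _ in range(10_000)]
print(flips.count("H") / len(flips))  # close to 0.5
```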
Why sampling, what is sampling, how to sample
Resampling
Why we need resampling: when normality does not hold and we cannot collect more data
Bootstrap (with replacement)
1,000 to 10,000 times (rule of thumb); see the sketch below
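A minimal bootstrap sketch for a 95% confidence interval on the mean; the observed sample values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed sample (e.g., time spent per session, in minutes).
data = np.array([4.2, 7.9, 3.1, 10.5, 6.3, 5.8, 12.4, 2.7, 8.8, 6.1])

# Resample WITH replacement 10,000 times and record each resample's mean.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# Percentile bootstrap 95% confidence interval for the mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```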
Permutation test(without replacement)
Repeat ~1,000 times on the shuffled combined set (see the sketch below)
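A permutation-test sketch that shuffles the combined sample (without replacement); the two groups are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical metric values for control and treatment.
control = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.2, 5.7, 4.6])
treatment = np.array([5.9, 6.3, 5.8, 6.5, 5.4, 6.1, 6.7, 5.6])

observed_diff = treatment.mean() - control.mean()
combined = np.concatenate([control, treatment])

# Shuffle the combined set (sampling without replacement) and recompute
# the difference under the null hypothesis of no difference between groups.
n_perm = 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(combined)
    perm_diffs[i] = shuffled[len(control):].mean() - shuffled[:len(control)].mean()

# Two-sided p-value: how often a random relabelling is as extreme as observed.
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"observed diff = {observed_diff:.2f}, p = {p_value:.4f}")
```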
Bayes Theorem
Total probability
Conditional probability
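A small worked sketch combining Bayes' theorem with the law of total probability; the prior and test accuracies are hypothetical numbers:

```python
# Bayes' theorem with the law of total probability, on hypothetical numbers:
# P(D) = 0.01 (prior), sensitivity P(+|D) = 0.95, specificity P(-|not D) = 0.90.
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 1 - 0.90

# Law of total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P(D | positive) = {p_d_given_pos:.3f}")  # ~0.088 despite the 95% sensitive test
```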
Central tendency: Mean, Median, Mode
Variability(Dispersion)
Variance: how spread out the data in a distribution is
The sum of squared deviations from the mean divided by n - 1, where n is the number of data values
MSE
Standard deviation (SD): square root of the variance
L2 norm
MAE (mean absolute error) / mean absolute deviation
L1 norm
Percentile(Quantile)
Interquartile range (IQR)
The difference between the 75th percentile and the 25th percentile
correlation
The strength of the linear relationship between two variables
Pearson correlation = covariance / (product of the two standard deviations)
How to measure correlation in the presence of outliers? IQR method, Scaling
Correlation coefficient
Correlation matrix
Scatterplot
P-value: a conditional probability, computed given that the null hypothesis is true
P value is a measure of the strength of the evidence against the null hypothesis
Outliers
Visualize
Drop or cap
Standardization
This reduces the effect of the outliers
https://www.rapidinsight.com/blog/handle-outliers/
Sample
Standard error
Mean
The larger the sample, the smaller the standard error (standard error of the mean = s / sqrt(n)) and the closer the sample mean tends to be to the population mean. The smaller the standard error, the more representative the sample is of the overall population.
Tests
t-Tests
Multiple testing
ANOVA
F-tests
Chi-Square Test
Multi-Arm Bandit Algorithm
Pairwise tests
Variance Tests
Hypothesis Testing
Two Sample Test of proportions
Bernoulli population
Use the pooled SE (see the sketch below)
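A minimal two-sample proportion test sketch using statsmodels (which uses a pooled standard error under the null); the conversion counts are hypothetical:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical conversions / sample sizes for control vs. treatment.
conversions = [420, 470]
samples = [5_000, 5_000]

# Two-sided z-test for equal proportions.
stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# Also report the effect size: 8.4% vs. 9.4% conversion; judge practical
# significance against the minimum detectable effect, not just the p-value.
```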
Two Sample test of means
Is the result practically significant?
If the variances are similar, use the pooled variance
If not, use the unpooled variance (Welch's t-test)
Pooled standard error (see the sketch below)
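A two-sample t-test sketch with SciPy showing the pooled vs. unpooled (Welch) choice; the data are hypothetical:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# Hypothetical per-user metric values for control and treatment.
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)

# Similar variances -> Student's t-test with pooled variance.
stat_pooled, p_pooled = ttest_ind(control, treatment, equal_var=True)

# Dissimilar (or unknown) variances -> Welch's t-test with unpooled variance.
stat_welch, p_welch = ttest_ind(control, treatment, equal_var=False)

print(f"pooled: p = {p_pooled:.4f}; Welch: p = {p_welch:.4f}")
# Practical significance: compare the observed difference in means
# to the minimum detectable effect, not only the p-value.
```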
Resume
Impact: be concise and show it with measures and metrics
Projects from Google
Geo retention insights
ML project in resume
Formulating a metric
Identify an action to measure
Count
Time
Value
Choose the unit of analysis
Per session
Per user
Per page
Per time
Choose a statistical function
Average
Total
Count
Median
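A tiny sketch of combining the three pieces (action, unit of analysis, statistical function) into a metric such as average time spent per user per day; the event log is hypothetical:

```python
import pandas as pd

# Hypothetical event log: one row per session.
events = pd.DataFrame({
    "user_id":      [1, 1, 2, 2, 2, 3],
    "date":         ["2024-05-01"] * 6,
    "session_secs": [120, 300, 45, 600, 90, 210],
})

# Action = time spent; unit of analysis = per user per day; function = average.
per_user_day = events.groupby(["user_id", "date"])["session_secs"].sum()
metric = per_user_day.mean()
print(f"average time spent per user per day: {metric:.0f} seconds")
```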
Framework
AARRRg
HEART