Data Science :checkered_flag: - Coggle Diagram
Data Science :checkered_flag:
Skillset
Skillsets Intersection
(Coding+Stats)-Domain => Machine Learning
(Domain+Stats)-Coding => Traditional Research
(Coding+Domain)-Stats => Dangerous Zone
Core Skillsets
Coding (get data)
R / Python (Stats)
SQL (database)
Bash (Command Line)
Regex (Search)
Math/Stats
Probability, Algebra, Regression, etc.
Choosing Procedure
Diagnose Problems
Domain
Expertise in field
Goals, Methods, Constraints
Can implement well
Takeaways
Several fields make DS
Diverse Skills Needed
Many roles involved
Pathways
1.Planning
1.Define Goals
2.Organize resources
3.Coordinate people
4.Schedule Project
2.Data Prep
5.Get Data
6.Clean Data
7.Explore Data
8.Refine Data
3.Modelling (Statistical Model)
9.Create Model
10.Validate Model
11.Evaluate Model
12.Refine Model
4.Follow up
13.Present Model
14.Deploy Model
15.Revisit Model
16.Archive Model
Takeaways
DS isn't just technical
Contextual Skill Matters
One step at a time
Roles
Engineer
Focus on backend hardware & software
Makes DS Possible
Developer, DBA
Big Data Expert
Focus on computer science & math
Do Machine Learning
Data Products
Researcher
Focus on domain-specific research such as business, physics, etc.
Has strong statistics skills
Analyst
Day-to-day tasks
Web Analytics, SQL
Good for Business
Businessman
Frame business relevant questions
Manages projects
Must "speak data"
Entrepreneur
Data startups
Needs data & business skills
Creative throughout
Full-Stack Unicorn
Takeaways
DS is diverse
Different Goals and Skills
Different Context
Teams
Code
Statistics
Design
Business
Contrast
Code Vs Data
Tools
Skillsets
Data Science Vs Big Data
Data Science
Code
Stats
Domain
Big Data
Volume
Velocity
Variety
Big Data Science
DS vs BI
NO Coding in BI
Simple Stats
Focus on domain expertise & utility
BI is very goal-oriented
DS prepares data & form
DS can learn from BI
Ethical Issues
Privacy
Confidentiality
Shouldn't share
Sources not intended for sharing
Anonymity
easy to identify
HIPAA
Proprietary data may have identifiers
Copyright
Scraping data is common & useful
Webpages, images, PDFs, audio, etc
Check copyright
Data Security
Potential Bias
Algorithms are only as neutral as the rules & data that they get.
Overconfidence
Analyses are limited simplifications; still need humans in the loop.
Takeaways
DS has potential & risks
Analyses can't be neutral
Good judgement is Vital
Method
1.Sourcing
Existing Data
In-house
Open
Third-Party
data APIs
Scrape web data
for web data without API
such as HTML, PDF, etc
Tools: Apps & Code
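A minimal sketch of scraping web data without an API, using only Python's standard `html.parser`. The `LinkCollector` class and the inline `page` string are hypothetical, made up for illustration; a real project would fetch the page first and should check copyright and site terms before collecting anything.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content, inlined so the example runs offline
page = (
    '<html><body>'
    '<a href="/data.csv">CSV</a> '
    '<a href="/report.pdf">PDF</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(page)
links = collector.links
print(links)
```

The same pattern extends to other tags (tables, images) by checking a different `tag` name in `handle_starttag`.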
Make data
Interview
Surveys
Experiments
Takeaways
Get the raw materials
Many possible methods
Check quality of data
Data Sourcing (Data Opus)
:checkered_flag:
Methods
for Accessing existing data
for Creating new, custom data
Data Measurement & Evaluation
Kinds
Metrics (Know your target/define success)
Takeaways
Many methods available
Metrics help awareness
Balance multiple goals
Types
Multiple Goals
SMART Goals
KPI
Business Metrics
Reasons
Analyst
Client
Explicit
Action
Accuracy
Sensitivity
Specificity
Positive Predictive Value
Negative Predictive Value
Social context
Business Model
Restriction
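The accuracy measures listed above (sensitivity, specificity, positive and negative predictive value) can all be derived from the four counts of a binary confusion matrix. The sketch below assumes those counts are already available; the function name and the example numbers are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy measures from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives,
    false negatives, true negatives.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of real positives that were caught
        "specificity": tn / (tn + fp),     # share of real negatives that were cleared
        "ppv": tp / (tp + fp),             # share of positive calls that were right
        "npv": tn / (tn + fn),             # share of negative calls that were right
    }


# Made-up counts for illustration only
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which measure matters most depends on the goal: a screening test usually favors sensitivity, while a costly follow-up action favors positive predictive value.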
2.Coding
Apps
Spreadsheets
Tableau
SPSS
JASP
Web Data
HTML
XML
JSON
Code
R
Python
SQL
Bash
Regex
Takeaway
Use tools wisely
a few is usually enough
Focus on your goal
3.Math
Reasons
Know which procedures to use & why
Know what to do when things don't work right
Some math is easier & quicker by hand than computer
Make informed choices
Lessons to learn
Algebra
Calculus
Big O
Probability
Bayes' theorem
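As a small worked example of Bayes' theorem, the sketch below computes the probability of having a condition given a positive test. The function name and the input values (1% prevalence, 90% sensitivity, 95% specificity) are assumptions chosen for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                # P(positive and condition)
    false_pos = (1 - specificity) * (1 - prior)   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)


# Illustrative inputs: rare condition, fairly good test
p = posterior(prior=0.01, sensitivity=0.9, specificity=0.95)
print(f"P(condition | positive) = {p:.3f}")
```

With these inputs the posterior is only about 15%, because false positives from the large healthy group outnumber true positives from the small affected group.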
4.Statistics
(Find order in chaos)
Explore
Exploratory graphics
Bar Chart
for categorical variable
Histogram
for quantitative variable
Scatterplot
For visualizing the association between two quantitative variables
Linear
Spread
Outliers
Correlation
Overlay Plot
Increased info density
Exploratory Statistics
Descriptive Statistics
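A minimal sketch of exploratory and descriptive statistics for two quantitative variables, using only the standard library. The `hours` and `scores` data are made up, and `pearson_r` is a hand-written helper, not a library function.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Made-up data for illustration only
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}
r = pearson_r(hours, scores)

print(summary)
print(f"correlation: {r:.3f}")
```

A scatterplot of the same two lists would show the linear trend, spread, and any outliers that a single correlation number can hide.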
Inference
From samples to populations
Hypothesis testing
Estimation
Details
Feature selection
Problems
Validations
Estimators
Fit
Variables
Categorical
Nominal
Male, Female. Red, Green, Black
Ordinal
Small, Medium, Large. A, B, C (grades)
Numerical
Discrete
1, 2, 3 cars. 568 people
Continuous
Age, Height
Regression
Linear
Simple Linear Regression
Multiple Linear Regression
Logistic
Simple Logistic Regression
Multiple Logistic Regression
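Simple linear regression, the first item in the regression list above, can be sketched with ordinary least squares in a few lines. The function name and data are hypothetical; in practice R or a Python statistics library would handle this along with diagnostics.

```python
def fit_simple_linear(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)                       # variation in x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))     # co-variation of x and y
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Made-up data that lies exactly on y = 1 + 2x
slope, intercept = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y = {intercept:.1f} + {slope:.1f} * x")
```

Multiple linear regression follows the same idea with several predictors, and logistic regression swaps the straight line for a curve that predicts the probability of a category.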
5.Machine Learning
Data Space
Dimension reduction
Clustering
k-Means
Anomalies
Categories
Logistic regression
kNN
Naive Bayes
Decision Trees
SVM
Neural Nets
Predictions
Linear regression
Poisson regression
Ensemble models
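To make one of the category methods above concrete, here is a minimal k-nearest-neighbours (kNN) sketch in plain Python. The `knn_predict` helper and the toy training points are assumptions for illustration, not a production implementation.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (point, label) pairs, where point is a tuple of floats.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.0, 4.8))
print(label_near_a, label_near_b)
```

kNN has no training step, which makes it easy to explain but slow on large data, since every prediction compares the query against all stored points.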
Interpretability
Solve for value
Data-driven stories
Analysis X Story = Value
Goals
Analysis is goal-driven
Story should match goals
answer questions clearly
Client is not You!
Egocentrism
False consensus
Anchoring
Clarity at each step
Answers
State the questions
Give your answer
Qualify as needed
Go in order
Discuss process sparingly
Analysis = simplification
Presentation tips
More charts, less text
Simplify charts
Avoid tables
Takeaways
Stories give value
Address client's goals
Be minimally sufficient
Presentation Graphics
Presenting is not exploring
Be clear, be focused
Create a strong narrative
Reproducible Resources
Revising
Borrowing
Handing Off
Accountability
Archives
All data sets, both raw & processed
All code to process & analyze data
Actionable Insights
Why was the project conducted?
Goal is usually to direct action
Analysis should guide action
Next steps
Give next steps
Justify with data
Be specific
Doable by client
Build on each step
Correlation vs Causation
Your data give you correlation, but your client wants causation