KE Data Engineer
DW/BI
RDBMS
RDBMS structure 
Document-oriented databases: NoSQL databases for storing semi-structured data; a subclass of key-value stores
Time-series DB: uses time as the key index, unlike a general-purpose RDBMS (.., PostgreSQL, TimescaleDB)
DB Design/Data modeling (OLAP, OLTP)
Purpose
OLAP: Planning, Problem solving, and Decision support
Data Quality Systems: Ref-tables matching and validation
fix zip/phone/.. formats on the fly (usually)
could be a part of ETL
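A minimal sketch of on-the-fly format fixing plus reference-table matching, as it might run inside an ETL step. The function names, the US-style formats, and the tiny `ZIP_PREFIX_TO_STATE` table are assumptions for illustration, not part of any particular DQ product.

```python
import re

# Hypothetical reference table: ZIP prefix -> state (illustrative subset only).
ZIP_PREFIX_TO_STATE = {"100": "NY", "941": "CA"}


def normalize_zip(raw):
    """Return a 5-digit ZIP string, or None if the input cannot be repaired."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 9:  # ZIP+4: keep only the 5-digit part
        digits = digits[:5]
    return digits if len(digits) == 5 else None


def normalize_phone(raw):
    """Return a phone number as '(AAA) BBB-CCCC', or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):  # drop leading country code
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def clean_record(record):
    """Fix formats, then validate the ZIP against the reference table."""
    zip_code = normalize_zip(record.get("zip", ""))
    state = ZIP_PREFIX_TO_STATE.get(zip_code[:3]) if zip_code else None
    return {
        **record,
        "zip": zip_code,
        "phone": normalize_phone(record.get("phone", "")),
        "zip_matches_state": state is not None and state == record.get("state"),
    }
```

Values that cannot be repaired come back as None, so a later ETL step can route them to a reject table instead of loading them.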
MDM Systems:
consistency and synchronization problem
to ensure the same data stays consistent across the various targets
Data Governance is a collection of processes, roles, policies, standards, and metrics for the effective and efficient use of information in an organization
REST (easy) / SOAP (standards!)
REST API Query Language
POST (new), PUT (update), GET, and DELETE
XML or JSON
stateless
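A small sketch of how the four verbs map onto create/read/update/delete with JSON bodies. It is an in-memory stand-in, not a real HTTP server: the `handle` function, the `/items` path, and the status codes chosen are assumptions for illustration. Statelessness shows up in the signature: each call carries everything needed (method, path, body), and no session is kept between calls.

```python
import json


def handle(store, method, path, body=None):
    """Handle one request against `store`; returns (status_code, json_text)."""
    parts = [p for p in path.split("/") if p]  # "/items/1" -> ["items", "1"]
    item_id = parts[1] if len(parts) > 1 else None

    if method == "POST" and item_id is None:  # create a new resource
        new_id = str(max(map(int, store), default=0) + 1)
        store[new_id] = json.loads(body)
        return 201, json.dumps({"id": new_id})
    if item_id not in store:
        return 404, json.dumps({"error": "not found"})
    if method == "GET":  # read
        return 200, json.dumps(store[item_id])
    if method == "PUT":  # replace the existing resource
        store[item_id] = json.loads(body)
        return 200, json.dumps(store[item_id])
    if method == "DELETE":  # remove
        del store[item_id]
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})
```

Usage: `handle(store, "POST", "/items", '{"name": "a"}')` creates an item, after which `handle(store, "GET", "/items/1")` reads it back.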
Job Orchestration
cloud: triggers, job event, file event
SQL-based languages
Hive SQL:
no transactions, subqueries in SELECT, IN/EXISTS
good scaling
joins: inner, outer, semi, map, cross
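Two of the less familiar join types, sketched in plain Python rather than HiveQL. A map join loads the small table into memory as a hash table and streams the big table past it, which avoids a shuffle; a left semi join keeps left rows that have a match on the right and returns left columns only. The function names and sample tables are assumptions for illustration.

```python
def map_join(big_rows, small_rows, key):
    """Inner join that streams big_rows past an in-memory hash table of small_rows."""
    lookup = {}
    for row in small_rows:  # build side: must fit in memory
        lookup.setdefault(row[key], []).append(row)
    for row in big_rows:  # probe side: read once, never held in memory
        for match in lookup.get(row[key], []):
            yield {**match, **row}


def left_semi_join(left_rows, right_rows, key):
    """Keep each left row once if its key appears on the right; left columns only."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]


# Hypothetical sample tables.
orders = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 5},
    {"cust_id": 3, "amount": 7},
]
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]

joined = list(map_join(orders, customers, "cust_id"))
orders_with_customer = left_semi_join(orders, customers, "cust_id")
```

The order for `cust_id` 3 has no matching customer, so it drops out of both results.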
DWH
approaches
Common
requirements:
- the most effective exploration of the source data
- data is accurate and up to date (applies to both BUS and CIF)
Need to:
- store both atomic and aggregated data at the same time
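A tiny illustration of keeping both grains side by side: the atomic rows stay available for drill-down, while a pre-computed aggregate answers summary queries quickly. The `sales` rows and the `aggregate_by_day` helper are hypothetical.

```python
from collections import defaultdict

# Atomic grain: one row per sale (hypothetical sample data).
sales = [
    {"day": "2024-01-01", "product": "A", "amount": 10.0},
    {"day": "2024-01-01", "product": "B", "amount": 5.0},
    {"day": "2024-01-02", "product": "A", "amount": 7.5},
]


def aggregate_by_day(rows):
    """Build the aggregated grain: total amount per day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["day"]] += row["amount"]
    return dict(totals)


# Aggregated grain, stored alongside the atomic rows rather than replacing them.
daily_totals = aggregate_by_day(sales)
```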
Big Data
PL
Python
Pandas
DataFrame
2D labeled, size-mutable tabular structure
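A short sketch of what "2D labeled, size-mutable" means in practice: rows and columns are addressed by label, and both can be added after construction. The column names and values are made up for illustration.

```python
import pandas as pd

# Two labeled axes: row labels ("a", "b") and column labels ("city", "temp_c").
df = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "temp_c": [4.0, 22.5]},
    index=["a", "b"],
)

# Size-mutable: add a column derived from an existing one...
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# ...and add a row by assigning to a new row label.
df.loc["c"] = ["Cairo", 30.0, 86.0]

# Label-based access to a single cell.
oslo_f = df.loc["a", "temp_f"]
```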
pymongo
collection (json)
document
update(query, update, options)
MongoClient('localhost', 27017)
Java, Scala as PLs for Big Data
Hadoop ecosystem
Spark: faster than MapReduce (batch/iterative/graph workloads)
performs data-parallel computing in memory, so it uses a lot of RAM
uses RDD: Resilient Distributed Dataset
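The data-parallel model behind both engines can be illustrated with a single-process word count in plain Python. This is only a sketch of the map / shuffle / reduce stages, not the Hadoop or Spark API, and the function names are assumptions. Classic MapReduce writes intermediate results to disk between stages, whereas Spark can keep them in memory, which is where much of the speedup on iterative jobs comes from.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster each phase runs in parallel over partitions of the data; here the three functions just make the stage boundaries visible.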
Pig: Java-based platform with a high-level language (Pig Latin) for MapReduce programming
ZooKeeper is a registry for distributed systems.
A service for maintaining configuration information, naming, distributed synchronization, and group services, used by distributed applications.
Flume: used for collecting, aggregating, and moving large amounts of log data
Cloud Computing
Services
Similar to SaaS: an API that returns data, e.g. currency exchange rates, weather
Strava, Yahoo! Weather, Wikipedia, Spotify, UPS, Google Cloud Natural
IaaS
You manage: OS, runtime, APP & Data
DigitalOcean, AWS, Microsoft Azure, Google Compute Engine (GCE)
SaaS
They manage everything
Google Apps, Dropbox, WebEx
PaaS
You manage: APP & Data
Windows Azure, AWS Elastic Beanstalk, Google App Engine