Data Platform Modernization
Background
Existing data platform consists of
Data lake
EDW
Machine Learning Models
Ad-hoc analysis
Dashboards, reports, etc.
Project Objective
Modernize the data platform to support becoming a data-driven company
AS-IS
Architecture Components
• Azure Data Lake Gen2 for Data Lake Storage
• Azure Data Factory and Talend for Data Pipelines
• Azure Databricks for data processing
• Azure Event Hub for event-based Integration
• Snowflake for Enterprise Data Warehouse
• Tableau for Reporting
• Custom implementation of Single Customer View in Azure SQL Server
• Custom Customer 360 with data residing in ADLS and accessible via Databricks
• API Gateway for API Integration
• API implementations for Single Customer View hosted in App Services
Data Layers
• Raw zone is the landing zone in ADLS and contains the data in the raw format (CSVs).
• Standard zone takes the CSV data from raw zone and converts it into Parquet format.
• Consumption zone takes the data from standard zone and converts it into Dimensional Model.
• In the current architecture Snowflake Data Warehouse is the consumption zone.
• Views are created in the consumption zone to be consumed for Reporting
• All data ingestion and transformation pipelines rely on full-data copies stored as daily snapshots. Only the latest data is made available for consumption purposes.
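The "latest snapshot only" behaviour described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the folder layout `<zone>/<dataset>/<YYYY-MM-DD>/<file>`, the dataset names, and the function name `latest_snapshots` are all assumptions made for the example.

```python
from datetime import date


def latest_snapshots(paths):
    """Return {dataset: snapshot_folder} for the newest daily snapshot of each dataset.

    Assumes a hypothetical layout '<zone>/<dataset>/<YYYY-MM-DD>/<file>'.
    """
    latest = {}
    for path in paths:
        zone, dataset, day, _file = path.split("/", 3)
        snapshot_date = date.fromisoformat(day)
        # Keep only the most recent snapshot folder seen so far per dataset.
        if dataset not in latest or snapshot_date > latest[dataset][0]:
            latest[dataset] = (snapshot_date, f"{zone}/{dataset}/{day}")
    return {dataset: folder for dataset, (_, folder) in latest.items()}


# Hypothetical file listing from the standard zone.
paths = [
    "standard/tickets/2024-03-01/part-0.parquet",
    "standard/tickets/2024-03-02/part-0.parquet",
    "standard/orders/2024-03-02/part-0.parquet",
]
print(latest_snapshots(paths))
```

Older snapshot folders stay in the lake untouched; only the newest folder per dataset would be exposed to consumers.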
ETL Pipelines
• Data transformations are implemented in Azure Databricks. Most transformation pipelines are written in SQL; PySpark is the next most common.
• Data pipelines and workflows are orchestrated by Azure Data Factory.
Data Quality
• In the current setup, data quality is checked while data is moved from ADLS into Snowflake (the consumption layer).
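The source does not say which checks are run at this step. As a hedged sketch of what a pre-load quality gate could look like, the example below tests for missing required values and duplicate keys. The function name `check_quality`, the column names, and the sample rows are hypothetical.

```python
def check_quality(rows, required, key):
    """Return a list of human-readable issues found in rows (a list of dicts)."""
    issues = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Completeness: every required column must hold a non-empty value.
        for col in required:
            if row.get(col) in (None, ""):
                issues.append(f"row {i}: missing {col}")
        # Uniqueness: the business key must not repeat.
        value = row.get(key)
        if value in seen_keys:
            issues.append(f"row {i}: duplicate {key}={value}")
        seen_keys.add(value)
    return issues


# Hypothetical batch about to be loaded into the warehouse.
batch = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": None},
]
print(check_quality(batch, required=["id", "amount"], key="id"))
```

A pipeline could refuse to load a batch whenever the returned list is non-empty, or log the issues and continue, depending on how strict the gate needs to be.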
Data Security
• Data is always encrypted using Microsoft-managed keys. SSL is enforced for data in transit.
Environment Details
Azure Environment: The Azure environment is provisioned in the West Europe region.
Azure Subscriptions: There are two Azure subscriptions in use, one for production (prod) and another for non-production (non-prod) purposes.
Resource Tagging: All resources are tagged according to the infrastructure team's requirements, and these tags are used for chargebacks.
Resource Consumption: Most resources are consumed on a pay-as-you-go basis. However, VMs used for Talend and ADF integration runtimes have reserved capacities.
Centralized Management: Resources are deployed and managed by a central infrastructure team. Resource provisioning requires the creation of a ticket, review, approval from the head of IT, and subsequent resource creation.
Access Control: The data platform is currently accessible from the public internet, but access is controlled through Azure Active Directory (AAD).
VPN Connection: The Azure datacenter and the on-premises systems of Miral Experiences are connected by a site-to-site VPN, which is centrally governed in an Azure VNet peered to the data platform integration runtimes.
Data Source Systems (15)
Tunn3l
InfoGenesis
EATEC
SharePoint
Genesys
IPERA
FacePass
OMNI DB
Oracle Fusion
Emarsys
CRM (MS Dynamics)
Sprinklr
Google Analytics
SCV
B2C Apps
Volumetrics
Number of Data Sources: 10+
Types of Sources: Databases, Files, APIs
Number of Data Pipelines: 60+
Number of Databases: 10
Number of Fact Tables: approx. 40
Number of Dimension Tables: approx. 50
Number of Views: approx. 200
Data Lake Volume (in TB): approx. 17
EDW (Snowflake) Volume (in TB): approx. 6
Databricks Notebooks: approx. 125
Use Cases
BI & Analytics
Daily Revenue Recognition (DRR)
Etihad Arena (EA) Reporting
Experience Hub
Single Customer View (SCV)
Customer 360
Google Analytics
Survey Integration (Guest Services & Survey Integration)
Maintenance and Technical Services
Health Safety and Environment
FacePass
Sprinklr
Emarsys
Oracle Fusion (HR, Finance, Procurement)
Compliance
Machine Learning
F&B Pricing in FWAD
F&B Pricing in WBW
F&B Pricing in YWW
Super Sales WBW
Next Best Action
Media Mix Modelling
Retail Pricing Optimization
Workforce Optimization
Annual Pass Smart Retention
Annual Pass Segmentation
Data User Roles
Data scientists
Data engineers
BI (self-service)
BI (Tableau)
DataLab
Target State
Objectives
Microsoft Azure is selected as the hosting environment for data platform services and infrastructure, in line with Miral Experiences' cloud platform strategy.
The new data platform on Azure aims to reduce complexity and improve scalability, security, reliability, and flexibility.
Key considerations are integration, optimization, external tool compatibility, and data model structure.
Integration: changes in operational data sources should be reflected in the platform and in the reporting tools.
Optimization: the platform should be optimized for cost-effectiveness and for the execution time of data pipelines.
External tools: the platform should support integration with external tools and data sources.
Data model: the data model follows a medallion architecture with Bronze, Silver, and Gold layers for raw data, validation, and reporting, respectively.
Different data consumption archetypes are supported, including reporting tools, custom software, and SaaS applications.
Data is primarily consumed via SQL from the Gold layer, with replication to operational data stores for specific needs.
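The Bronze, Silver, and Gold flow named in the objectives can be illustrated with a small in-memory sketch. It only shows the role of each layer (raw, validated, reporting-ready). The sample records, the field names, and the functions `to_silver` and `to_gold` are assumptions for illustration, and a real implementation would run on the platform's own engine, not on Python lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as landed, all values still strings (hypothetical sample).
bronze = [
    {"park": "WBW", "revenue": "120.50"},
    {"park": "FWAD", "revenue": "n/a"},  # invalid value, rejected in Silver
    {"park": "WBW", "revenue": "79.50"},
]


def to_silver(records):
    """Validate and type-cast Bronze records; skip rows that fail validation."""
    silver = []
    for record in records:
        try:
            silver.append({"park": record["park"], "revenue": float(record["revenue"])})
        except (KeyError, ValueError):
            continue  # a real pipeline would quarantine or log the bad row
    return silver


def to_gold(records):
    """Aggregate Silver records into a reporting-ready total per park."""
    totals = defaultdict(float)
    for record in records:
        totals[record["park"]] += record["revenue"]
    return dict(totals)


gold = to_gold(to_silver(bronze))
print(gold)
```

In this sketch the Gold result is what a reporting tool would query, while Bronze keeps the original values so that Silver and Gold can be rebuilt if validation rules change.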
Features
Easy data integration
High-performance ingestion and transformation
Integration with data science technologies
Issue resolution
Dashboard compatibility
External platform connectivity
Data privacy and security compliance
Secure data sharing
Alerting mechanism
User access management
Data archiving for cost optimization