AI Infrastructure - Coggle Diagram

- - - - Networking & Interconnect
        
        Compute hardware
        (Accelerator, memory, packaging, semiconductor/foundry)
        
        Dominance by NVIDIA
        
        CUDA
        (SW moat)
        
        Chips
        
        Chips depreciate very fast
        effective economic life of chips ~1 year
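To see what a ~1 year economic life implies, here is a minimal sketch using straight-line depreciation. The purchase price is a hypothetical figure chosen only for illustration, and straight-line is an assumed accounting method, not something the notes specify.

```python
def annual_depreciation(purchase_price: float,
                        economic_life_years: float,
                        salvage_value: float = 0.0) -> float:
    """Straight-line depreciation charge per year."""
    if economic_life_years <= 0:
        raise ValueError("economic_life_years must be positive")
    return (purchase_price - salvage_value) / economic_life_years


if __name__ == "__main__":
    price = 30_000.0  # hypothetical accelerator price, illustration only
    for life in (1, 3, 5):
        charge = annual_depreciation(price, life)
        print(f"life={life}y -> annual charge {charge:,.0f}")
```

With a 1-year life the whole purchase price is expensed in a single year, which is five times the annual charge under a 5-year schedule.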
        
        Data centres
        (Physical shell + cooling)
        
        Power
        
        Electricity generation, grid connection, transmission and increasingly on-site generation
        
        Constraints: power is the defining constraint of 2026.
        Average grid-connection wait times in primary markets exceed 4 years, pushing operators towards direct energy investment ("speed to power").
        
        Dominance: Constellation Energy (nuclear) is now treated as a core AI stock
        Hyperscalers are signing nuclear PPAs and small modular reactor (SMR) deals, and restarting plants
        Trend: AI demand is reshaping national energy policy
        (SG: land and power are scarce. These are the binding national constraints)
        
        Buildings, racks, and thermal management: at modern rack densities, air cooling fails
        Direct-to-chip liquid cooling has become foundational for high-density deployments
        AI factories require an end-to-end approach from grid to chip and chip to chiller
        
        Constraints: cooling, water usage, physical build time and skilled labour
        
        Dominance: Equinix, Vertiv (cooling/thermal), Schneider Electric
        Trend: liquid cooling everywhere, data centres becoming purpose-built AI factories
        
        Accelerator (processor)
        
        GPUs + custom ASICs: Google's TPU, Amazon's Trainium, Meta's MTIA
        NVIDIA and AMD lead the compute layer
        Trend: hyperscalers designing their own ASICs to escape NVIDIA pricing
        
        Memory (HBM)
        
        High-bandwidth memory (HBM) stacks DRAM layers vertically using through-silicon vias (TSVs) to improve speed and energy efficiency
        A single system can need an enormous amount of HBM
        Memory wall: memory is now the binding semiconductor constraint
        Dominance: Micron, SK Hynix and Samsung
        
        Advanced Packaging
        
        How the GPU and HBM are fused into one device
        NVIDIA's chips use CoWoS (chip-on-wafer-on-substrate): the GPU and HBM stacks are manufactured separately, joined on a silicon interposer, then cut into packaged chips.
        Constraint: TSMC's CoWoS capacity is scaling to ~130K wafers/month by 2026 (from ~28K in 2024)
        Hidden chokepoint: packaging is the constraint.
        Trend: active interposers with transistors will replace passive ones, chiplet integration is moving towards stacking memory on ASICs and photonics
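The CoWoS capacity figures above (~28K wafers/month in 2024 to ~130K by 2026) can be turned into a growth multiple and an implied annual growth rate. This is a rough sketch: the two-year span is an assumption read off the stated years, and the inputs are the approximate figures from the notes.

```python
def growth_multiple(start: float, end: float) -> float:
    """How many times larger `end` is than `start`."""
    return end / start


def compound_annual_growth_rate(start: float, end: float, years: float) -> float:
    """Constant yearly growth rate that takes `start` to `end` in `years`."""
    if start <= 0 or years <= 0:
        raise ValueError("start and years must be positive")
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    start_wpm = 28_000   # ~wafers/month in 2024 (from the notes)
    end_wpm = 130_000    # ~wafers/month by 2026 (from the notes)
    years = 2            # assumption: 2024 -> 2026
    print(f"multiple: {growth_multiple(start_wpm, end_wpm):.2f}x")
    print(f"implied CAGR: {compound_annual_growth_rate(start_wpm, end_wpm, years):.0%}")
```

Under these assumptions capacity grows roughly 4.6x, which implies more than doubling each year.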
        
        Semiconductor / Foundry
        (the substrate)
        
        Dominance: TSMC manufactures essentially all leading-edge AI silicon (the industry's single point of failure), concentrated in Taiwan
        Upstream sits:
        
        1. ASML (only maker of EUV lithography machines)
        2. design-tool/IP layer (Synopsys, Cadence, Arm)
        **Constraints: a new fab costs tens of billions of dollars and takes years to build, so this layer cannot flex quickly.
        It is the deepest geopolitical pressure point in the entire stack**
        
        Scale-up: within a server/rack (NVLink)
        Scale-out: across racks (InfiniBand or Ethernet)
        Training requires constant "all-reduce" sync across thousands of chips
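To make the "all-reduce" step concrete, here is a minimal single-process simulation of a ring all-reduce (reduce-scatter followed by all-gather). It is a teaching sketch only: real systems run this over NVLink/InfiniBand/Ethernet via communication libraries, and the function name and the requirement that the vector length divides evenly by the worker count are simplifying assumptions.

```python
from typing import List


def ring_all_reduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over workers arranged in a ring."""
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(v) != length for v in worker_data):
        raise ValueError("all workers must hold vectors of equal length")
    if length % n != 0:
        raise ValueError("sketch assumes length divisible by worker count")
    chunk = length // n
    bufs = [list(v) for v in worker_data]  # copy so inputs stay untouched

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps worker i holds the full
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        sends = []
        for i in range(n):
            c = (i - step) % n
            sends.append((i, c, [bufs[i][k] for k in span(c)]))
        for i, c, payload in sends:
            dst = (i + 1) % n
            for offset, k in enumerate(span(c)):
                bufs[dst][k] += payload[offset]

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            sends.append((i, c, [bufs[i][k] for k in span(c)]))
        for i, c, payload in sends:
            dst = (i + 1) % n
            for offset, k in enumerate(span(c)):
                bufs[dst][k] = payload[offset]

    return bufs


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
    for rank, result in enumerate(ring_all_reduce(grads)):
        print(rank, result)
```

Every worker ends up with the same summed vector after 2 x (n - 1) communication steps, which is why the interconnect, not just the processor, sets the pace at scale.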
        
        Constraints: an emerging bottleneck. As systems scale, the constraint is shifting from processing to connectivity and data access.
        Copper hits physical limits, forcing a move to optical/photonic interconnects
        
        Dominance: NVIDIA (NVLink + InfiniBand via Mellanox), Broadcom, Marvell
        Live debate: InfiniBand's lossless low latency vs Ethernet's openness
        NVIDIA hedges by offering both, with its Spectrum-X platform covering Ethernet
      - Rents compute and schedules it
        hyperscalers (AWS, Azure, Google Cloud)
        Neoclouds (CoreWeave, Nebius, Nscale) - purpose-built GPUaaS
        Orchestration software (Kubernetes, Slurm, Ray) turns thousands of GPUs into one usable supercomputer
      - Constraints: requires advanced orchestration, system monitoring and traffic monitoring
        Small configuration issues create performance bottlenecks
        Idle GPUs burn money, so utilisation is key
      - Dominance: the big-3 hyperscalers (AWS, Azure, Google Cloud), but neoclouds (NVIDIA-backed) are fast-growing insurgents
        Trend: capacity is supply-constrained
        (Microsoft disclosed a US$80B backlog of Azure orders it can't fulfil due to power constraints)
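The point that idle GPUs burn money can be illustrated with simple arithmetic. The fleet size, hourly cost and utilisation below are hypothetical values, not figures from the notes, and the helper names are made up for this sketch.

```python
def idle_cost(num_gpus: int, hourly_cost_per_gpu: float,
              utilisation: float, hours: float) -> float:
    """Money spent on GPU-hours that did no useful work."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return num_gpus * hourly_cost_per_gpu * hours * (1.0 - utilisation)


def cost_per_useful_gpu_hour(hourly_cost_per_gpu: float,
                             utilisation: float) -> float:
    """Effective price of one productive GPU-hour at a given utilisation."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_cost_per_gpu / utilisation


if __name__ == "__main__":
    # Hypothetical cluster: 1,000 GPUs at 2.00 per GPU-hour, 50% utilised.
    wasted = idle_cost(num_gpus=1000, hourly_cost_per_gpu=2.0,
                       utilisation=0.5, hours=24)
    print(f"idle spend per day: {wasted:,.0f}")
    print(f"effective cost per useful GPU-hour: "
          f"{cost_per_useful_gpu_hour(2.0, 0.5):.2f}")
```

At 50% utilisation the effective price of a productive GPU-hour doubles, which is why scheduling quality shows up directly in unit cost.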
  - - - Demand layer. Consumers, enterprises, developers and now autonomous agents
      - Constraints: Monetisation and Trust
        
        Despite $30-40B invested in enterprise gen AI, a 2025 MIT report found that 95% of organisations reported zero measurable return
        
        The learning gap: tools did not adapt to enterprise workflows
        now: solved by agentic AI
        
        Build vs buy: building everything in-house has a success rate of only 33%, vs 66% when partnering with specialised external vendors
        now: enterprise AI is more mature
        
        Misallocated budgets: 50-70% of enterprise AI budgets are funnelled into front-office functions like sales and marketing, but back-office automation often delivers more immediate ROI
        
        Adoption is climbing.
        
        Almost one in six people in the world had used a gen AI tool by the second half of 2025
        
        Many are stuck in the pilot or experiment stage and find it hard to scale
        Inaccuracy and hallucination are reported
      - Trend: Scale multiplication: the shift from humans-in-the-loop to agentic consumption, where one task spawns thousands of model calls, multiplies the demand on every layer below it
    - - Thin Moats: many apps are wrappers over the same few models
      - Unit-economics problems: agentic models require far more tokens per task, so even a 90% drop in inference cost may not produce cheaper enterprise AI. Token costs can exceed the cost of the human labour being replaced
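The unit-economics point can be checked with a small calculation: if price per token falls by 90% but an agentic workflow uses more than 10x the tokens, cost per task still rises. All token counts and prices below are hypothetical, chosen only to show the arithmetic.

```python
def cost_per_task(tokens_per_task: float,
                  price_per_million_tokens: float) -> float:
    """Inference cost of one task at a given token price."""
    return tokens_per_task * price_per_million_tokens / 1_000_000


def break_even_token_multiplier(price_drop_fraction: float) -> float:
    """Token growth factor at which a price drop is fully cancelled out."""
    if not 0.0 <= price_drop_fraction < 1.0:
        raise ValueError("price_drop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - price_drop_fraction)


if __name__ == "__main__":
    # Hypothetical baseline: single-call chat task.
    old_cost = cost_per_task(tokens_per_task=2_000,
                             price_per_million_tokens=10.0)
    # Hypothetical agentic task: price is 90% lower, tokens are 50x higher.
    new_cost = cost_per_task(tokens_per_task=100_000,
                             price_per_million_tokens=1.0)
    print(f"old cost per task: {old_cost:.2f}")
    print(f"new cost per task: {new_cost:.2f}")
    print(f"break-even token multiplier for a 90% price drop: "
          f"{break_even_token_multiplier(0.9):.1f}x")
```

In this made-up scenario the task becomes five times more expensive despite the 90% price cut, because token usage grew past the ~10x break-even point.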
- - - - Safety/verification
      - AI Governance
      - Cybersecurity (e.g. mythos)