Perception | Physical AI | Multimodal
🧠 Multimodal AI
Key Concepts
Modalities – text, image, video, audio, sensor, 3D point cloud
Cross-modal attention – attending across modality tokens
Contrastive learning – CLIP: align image & text embeddings
Multimodal embeddings – shared latent space across modalities
Vision-Language Models (VLMs) – visual question answering, captioning
Vision-Language-Action (VLA) – extends VLM with action outputs
Any-to-any generation – input/output any combination of modalities
Grounding – tying language tokens to visual/spatial objects
Tokenisation – converting images/audio into discrete tokens
Instruction tuning – align multimodal model to follow human commands
Retrieval-Augmented Multimodal – fetch external knowledge at inference
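The contrastive-learning entry above can be illustrated with a small sketch. The code below is a minimal pure-Python version of a CLIP-style symmetric contrastive loss, assuming a batch of paired image and text embeddings. The function names, the toy 2-D embeddings and the temperature of 0.07 are illustrative assumptions, not taken from the diagram or from any particular library.

```python
import math


def l2_normalise(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    Pair i in image_embs is assumed to match pair i in text_embs.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalise(v) for v in image_embs]
    texts = [l2_normalise(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two images and two captions in a 2-D embedding space.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(toy_images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_style_loss(toy_images, [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

On this toy data the correctly paired batch scores a much lower loss than the shuffled one, which is the signal that pulls matching image and text embeddings together in a shared latent space.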
Global State & Where Heading
Frontier Models (2025–2026)
GPT-4o – native multimodal (text+vision+audio in one pass)
Gemini 2.5 Pro – top Chatbot Arena score (1446), leads video understanding
Claude Opus 4 – 200K context, strong vision + reasoning
Grok 3 – multimodal with real-time X/web grounding
Video & World Understanding
Sora (OpenAI) – text-to-video world model
Veo 2 (Google) – photorealistic video generation
Video understanding at minute+ scale for complex reasoning
Omni-Modal Systems
Any-to-any: input text, output image + audio simultaneously
Real-time voice + vision (GPT-4o voice mode)
Sensor + language fusion for robotics (VLA)
Market & Scale
Multimodal AI market: USD 1.6B (2024) → USD 27B (2034)
32.7% CAGR – fastest-growing AI subsegment
Enterprise adoption: healthcare imaging, legal doc review, manufacturing QA
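As a quick arithmetic check on the market figures above, the sketch below computes the compound annual growth rate implied by USD 1.6B in 2024 and USD 27B in 2034, and the 2034 value implied by a 32.7% rate. The helper names are illustrative; the only inputs are the numbers quoted in the diagram.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value after compounding start_value at a fixed annual rate."""
    return start_value * (1.0 + rate) ** years


# Figures quoted in the diagram: USD 1.6B (2024) and USD 27B (2034).
implied_rate = cagr(1.6, 27.0, 2034 - 2024)
projected_2034 = project(1.6, 0.327, 2034 - 2024)

print(f"implied CAGR: {implied_rate:.1%}")
print(f"2034 value at 32.7% CAGR: USD {projected_2034:.1f}B")
```

The two quoted figures are consistent with each other: ten years of growth at roughly 32.7% per year takes USD 1.6B to about USD 27B.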
Why the world is heading here
Real world is multimodal – humans process all senses together
Unimodal AI hits ceiling; multimodal unlocks emergent reasoning
Autonomous agents need to see, hear, speak & act simultaneously
Healthcare, defence, robotics all demand multi-sensor AI
Token-based architecture makes adding modalities modular
⚠️ Key Challenges
Hallucination across modalities – confabulating visual details
Modality alignment – different embedding spaces hard to bridge
Evaluation – no unified benchmarks for multimodal tasks
Cultural & linguistic bias – most training data is English/Western
Compute cost – multimodal training can be orders of magnitude more expensive than text-only
Data licensing – images & audio have copyright issues
Temporal reasoning – understanding change over video sequences
🇸🇬 Singapore Scene
SEA-LION V4 (AISG + Google DeepMind, late 2025)
Built on Gemma 3 (27B) – open-source, commercially licensed
Trained on 1 trillion+ tokens with heavy SEA dataset curation
Native multimodal: image understanding + text (128K context)
Supports 11 SEA languages + English with cultural nuance
Agentic workflows: function calling, JSON output, tool use
SEA-LION-VL – Vision-Language Model for SEA
First open vision-language model for Southeast Asian context
Understands culturally specific imagery, signage, context
Released for community to build SEA-specific applications
MERaLiON-AudioLLM (A*STAR + AISG)
Audio large language model for Singapore/SEA accents & languages
Bridges audio and language understanding
Targets local speech diversity – Singlish, Malay, Tamil
SEA-Guard – Safety Layer for SEA
Safety-focused LLM fine-tuned for SEA cultural norms
Moderates content according to regional values & standards
NAIS 2026 Link – Local Language AI Priority
SEA-LION embodies NAIS goal: AI that works for Singaporeans
Multilingual AI as sovereign capability for SEA leadership
NAIS 2026 Link – Healthcare Mission
Multimodal diagnostic AI – medical imaging + clinical notes + labs
AI triage combining visual scan + patient history
NAIS 2026 Link – Talent & Research
AISG funds SEA-LION as 100E (100 Experiments) programme output
NUS/NTU multimodal AI research groups
SEA-LION Summit 2025 – regional multimodal AI community
Why Singapore is heading here
SEA is linguistically diverse – needs localised multimodal AI
Singapore as SEA AI hub – export multimodal models regionally
IMDA AI Verify extended to cover multimodal system governance
NAIS: AI must serve Singapore's social & cultural context
Perception Technology
Key Concepts
Sensor modalities – camera, LiDAR, radar, IMU, sonar
Sensor fusion – combining modalities for robust perception
3D point clouds – spatial representation from LiDAR
Object detection – YOLO, DETR, Faster R-CNN
Semantic segmentation – pixel-level scene understanding
Instance segmentation – per-object masks
SLAM – simultaneous localisation & mapping
Depth estimation – monocular vs stereo vs ToF
Optical flow – motion estimation between frames
Adversarial robustness – resistance to perturbations
Foundation models for vision – SAM 2, Grounding DINO, Florence-2
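To make the sensor-fusion entry above concrete, here is a minimal sketch of inverse-variance weighting, one simple way to combine two independent measurements of the same quantity. It assumes both sensors report a range with a known Gaussian noise variance; the LiDAR and radar numbers are made up for illustration and the function name is hypothetical.

```python
def fuse_estimates(mean_a, var_a, mean_b, var_b):
    """Fuse two independent noisy estimates by inverse-variance weighting.

    Each estimate is weighted by 1 / variance, so the more certain sensor
    dominates. Returns the fused mean and the fused variance.
    """
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    fused_mean = (weight_a * mean_a + weight_b * mean_b) / (weight_a + weight_b)
    fused_var = 1.0 / (weight_a + weight_b)
    return fused_mean, fused_var


# Illustrative (made-up) readings of the same obstacle distance in metres.
lidar_range, lidar_var = 10.0, 0.01   # precise sensor
radar_range, radar_var = 10.4, 0.09   # noisier sensor

fused_range, fused_var = fuse_estimates(lidar_range, lidar_var, radar_range, radar_var)
print(f"fused range: {fused_range:.3f} m, variance: {fused_var:.4f}")
```

The fused estimate sits between the two readings, closer to the more precise one, and its variance is lower than either input. This is the same update a one-dimensional Kalman filter performs; real perception stacks fuse far richer data such as point clouds and image features.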
Global Trends & Where Heading
Universal perception – any-sensor shared representations
Vision foundation models replacing task-specific models
SAM 2 – segment anything, zero-shot
Grounding DINO – open-vocabulary detection
Florence-2 – unified vision backbone
4D perception – 3D space + temporal tracking
Radar-first systems – all-weather, GPS-denied ops
Real-time edge inference – on-device under SWaP (size, weight and power) constraints
Why: robotics + AVs + defence demand pervasive sensing
Why: VLMs need visual grounding to act in world
Why: LiDAR cost dropped 90% since 2020 → ubiquitous now
⚠️ Key Challenges
Distribution shift – model fails on out-of-domain data
Adversarial attacks – physical patches fool detectors
Labelled data scarcity – 3D annotation is expensive
Edge compute limits – inference budget on UAVs/robots
Sensor degradation – rain, fog, dust on optics
Certification – no clear standard for safety-critical CV
Privacy – camera/LiDAR capture of individuals
🇸🇬 Singapore Scene
DSTA + Thales AI Co-Lab (2025) – counter-UAS drone detection
ML modules reduce false alarms in drone detection
Physics + knowledge + data-based AI fusion
Faster, more accurate drone classification
DSTA + Shield AI (March 2025) – autonomy development
RSAF partnership on manned-unmanned teaming
Autonomous mission execution in contested airspace
DSTA + Anduril – mission autonomy concepts
Manned-unmanned collaborative operations
AI-enhanced situational awareness
DSTA GAIA – generative AI for defence sensemaking
Internal tool for mission planning & decision support
DSO National Labs – sensor fusion research
Multi-modal EW & ISR sensor fusion
NUS / NTU – computer vision research
Singapore Vision Day 2026 – leading CV conference
Research in medical imaging, AV perception, 3D vision
NAIS Link – Advanced Manufacturing Mission
Machine vision for defect detection in fabs
Predictive maintenance via sensor streams
NAIS Link – Connectivity Mission
Smart cameras & radar for port + logistics AI
🤖 Physical AI & Robotics
Key Concepts
VLA (Vision-Language-Action) models – perception + reasoning + action
World models – learned simulator of physical environment
Sim-to-real transfer – train in simulation, deploy in reality
Foundation models for robotics – generalist robot policies
Teleoperation & imitation learning – learning from human demos
Dexterous manipulation – fine motor control (fingers, grippers)
Loco-manipulation – locomotion + manipulation combined
Digital twin – virtual replica for robot training & planning
EgoScale – training on human egocentric video for robot skills
Reward learning – inferring goals from demonstrations
Kinematic chains – how robot joints connect and move
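The kinematic-chain entry above can be illustrated with forward kinematics for a planar arm: given link lengths and joint angles, compute where the tip ends up. This is a minimal sketch that assumes revolute joints in a 2-D plane, with each joint angle measured relative to the previous link. The function name and the two-link example are illustrative.

```python
import math


def forward_kinematics(link_lengths, joint_angles):
    """Tip position (x, y) of a planar serial arm with revolute joints.

    joint_angles are in radians, each relative to the previous link,
    so the absolute heading of a link is the running sum of angles.
    """
    x, y = 0.0, 0.0
    heading = 0.0
    for length, angle in zip(link_lengths, joint_angles):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y


# Two-link arm: upper link 1.0 m, forearm 0.5 m.
tip_straight = forward_kinematics([1.0, 0.5], [0.0, 0.0])
tip_bent = forward_kinematics([1.0, 0.5], [0.0, math.pi / 2])
print("arm straight:", tip_straight)
print("elbow at 90 degrees:", tip_bent)
```

With both joints at zero the arm lies along the x-axis; bending the second joint by 90 degrees swings the forearm upwards while the first link stays put. Real robot arms use the same chaining idea in 3-D, with a homogeneous transform per joint.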
Global State & Where Heading
VLA Model Evolution
RT-1 (2022) – Google, first large-scale transformer robot policy
RT-2 (2023) – VLM backbone controlling robot directly
OpenVLA (2024) – open-source VLA for community
π0 (2024) – Physical Intelligence, dexterous flow matching
GR00T N1.7 (Apr 2026) – NVIDIA, 3B params, EgoScale on 20,854h video
GR00T N2 (preview) – world action model, 2x success on new tasks
World Models & Simulation
NVIDIA Cosmos 3 (COMPUTEX 2026) – world foundation model for autonomous systems
Synthetic data generation – train on sim, close the gap with domain randomisation
Isaac Sim / Isaac Lab – physics-accurate robot simulation
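Domain randomisation, mentioned above, can be sketched in a few lines: before each simulated episode, physical and visual parameters are resampled so that a policy does not overfit to one exact simulator configuration. The parameter names and ranges below are hypothetical and do not correspond to the Isaac Sim or Isaac Lab API; a real setup would pass the sampled values into the simulator's own configuration.

```python
import random

# Hypothetical parameter ranges (low, high), chosen only for illustration.
RANDOMISATION_RANGES = {
    "friction": (0.5, 1.2),
    "payload_mass_kg": (0.8, 1.5),
    "light_intensity": (0.3, 1.0),
    "camera_noise_std": (0.0, 0.05),
}


def sample_sim_params(rng, ranges=RANDOMISATION_RANGES):
    """Draw one simulator configuration uniformly from the given ranges."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)  # fixed seed so runs are repeatable
episodes = [sample_sim_params(rng) for _ in range(5)]

for index, params in enumerate(episodes):
    summary = ", ".join(f"{name}={value:.2f}" for name, value in params.items())
    print(f"episode {index}: {summary}")
```

Training across many such sampled configurations encourages the policy to treat the real world as one more variation. Uniform sampling is the simplest choice; other distributions or adaptive schemes are also possible.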
Humanoid & Industrial Robots
Figure, Agility Robotics, 1X entering factory floors
Boston Dynamics Atlas – fully electric humanoid (2024)
Industrial robot density accelerating globally
Why the world is heading here
LLMs cracked language → VLAs applying same recipe to physical actions
Labour shortage: ageing populations need robot augmentation
Cost of robot hardware dropping β sensors, actuators, compute
World models reduce the need for millions of real-world trials
Foundation model paradigm shift: one model, many tasks
⚠️ Key Challenges
Sim-to-real gap – physics fidelity never perfectly matches reality
Dexterous data scarcity – hard to collect at scale
Safety & reliability in unstructured environments
Generalisation – model trained on task A fails on task B
Edge latency – inference must be <10 ms for real-time control
Regulatory – no clear framework for autonomous robot operations
Human-robot trust – acceptance in workplace & public space
🇸🇬 Singapore Scene
Punggol Digital District (PDD) Testbed – launched 2026
Singapore's first multi-operator Physical AI testbed
IMDA + JTC + Singapore Institute of Technology (SIT)
Partners: Certis (security), DHL (logistics), Grab (delivery), QuikBot
Real mixed-use public space – not a lab environment
Test food delivery, parcel, cleaning, security patrol robots
NVIDIA Singapore AI Research Lab (announced May 2026)
Co-announced with NAIS Update 2026
Focus on Physical AI research in Singapore context
Collaboration with local universities & industry
National Robotics Programme (NRP) – S$635M (EDB)
One of highest per-capita robotics investments globally
Covers industrial, service, healthcare & defence robotics
Singapore robot density ~5x global average
NAIS 2026 Link – Advanced Manufacturing Mission
Robots for semiconductor, aerospace, precision engineering
Physical AI for process redesign via digital twins
Predictive maintenance reducing production downtime
NAIS 2026 Link – Healthcare Mission
Surgical robots – AI-guided precision surgery
Care robots for ageing population (eldercare)
NAIS 2026 Link – Connectivity Mission
Autonomous port logistics (PSA Singapore)
Last-mile delivery robots in HDB estates
Why Singapore is leading here
Dense urban environment = ideal testbed for robots
Labour constraints drive urgency – ageing + tight labour market
Strong manufacturing base needs robot augmentation
Government mandate: AI delivery phase, not just planning