VLM agent - Coggle Diagram
VLM agent
Grounding
Set-of-Mark (SoM) prompting
HTML + Visuals
oracle grounding
image annotation
Benchmarks
online vs offline eval
Open-ended computer-use agents
predicting the future
trial and error?
Judging and re-iterating
Screen understanding
ScreenVLM
Trained on ScreenParse
ScreenVLM vs OmniParser
dataset mining
video understanding task
architecture
why use a basic VLM as planner?
Web vs domain generalist
ICL vs SFT
RL and imitation learning?
VL
Safety and guardrailing
formalising the action space
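One way to read "formalising the action space" is as a closed, typed set of actions that the planner's text output is parsed into before anything is executed. The sketch below is only an illustration under stated assumptions: the action kinds (click, type, scroll, done), the call syntax such as click(3), and the names Click, TypeText, Scroll, Done and parse_action are all hypothetical and are not taken from the diagram. Element ids are assumed to be the numeric marks drawn on UI elements, as in Set-of-Mark prompting.

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical action types: the diagram does not specify the real action space.
@dataclass(frozen=True)
class Click:
    element_id: int  # assumed: the numeric mark drawn on a UI element

@dataclass(frozen=True)
class TypeText:
    element_id: int
    text: str

@dataclass(frozen=True)
class Scroll:
    direction: str  # "up" or "down"

@dataclass(frozen=True)
class Done:
    pass

Action = Union[Click, TypeText, Scroll, Done]

# One regex per action kind; the call syntax is an assumed convention for model output.
_PATTERNS = [
    (re.compile(r'^click\((\d+)\)$'), lambda m: Click(int(m.group(1)))),
    (re.compile(r'^type\((\d+),\s*"(.*)"\)$'), lambda m: TypeText(int(m.group(1)), m.group(2))),
    (re.compile(r'^scroll\((up|down)\)$'), lambda m: Scroll(m.group(1))),
    (re.compile(r'^done\(\)$'), lambda m: Done()),
]

def parse_action(raw: str, valid_ids: set) -> Action:
    """Parse one model-emitted action string; reject anything outside the action space."""
    text = raw.strip()
    for pattern, build in _PATTERNS:
        match = pattern.match(text)
        if match:
            action = build(match)
            # Element-targeting actions must point at an element that is actually on screen.
            if isinstance(action, (Click, TypeText)) and action.element_id not in valid_ids:
                raise ValueError(f"unknown element id {action.element_id}")
            return action
    raise ValueError(f"not a valid action: {raw!r}")
```

For example, parse_action("click(3)", {1, 2, 3}) returns Click(element_id=3), while parse_action("click(9)", {1, 2}) raises ValueError. Parsing into a fixed set of types means malformed or out-of-range outputs are caught before execution, which is also a natural place to hang guardrail checks.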