OS Control
OmniParser for GUI Agent
OmniParser
OmniParser for Pure Vision Based GUI Agent
OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent
Microsoft OmniParser - Screen Parsing Model - Install Locally
Microsoft AI Releases OmniParser Model on HuggingFace
LLM OS
Andrej Karpathy's LLM OS
- From the talk [1hr Talk] Intro to Large Language Models by Andrej Karpathy (YT)
LLM OS. Bear with me I'm still cooking.
Specs:
- LLM: OpenAI GPT-4 Turbo 256 core (batch size) processor @ 20Hz (tok/s)
- RAM: 128Ktok
- Filesystem: Ada002

With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:
- Input & Output across modalities (text, audio, vision)
- Code interpreter, ability to write & run programs
- Browser / internet access
- Embeddings database for files and internal memory storage & retrieval
A lot of computing concepts carry over. Currently we have single-threaded execution running at ~10Hz (tok/s) and enjoy looking at the assembly-level execution traces stream by. Concepts from computer security carry over, with attacks, defenses and emerging vulnerabilities.
I also like the nearest neighbor analogy of "Operating System" because the industry is starting to shape up similar: Windows, OS X, and Linux <-> GPT, PaLM, Claude, and Llama/Mistral(?:)). An OS comes with default apps but has an app store. Most apps can be adapted to multiple platforms.
TLDR looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.
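The kernel analogy in the quote above can be made concrete with a toy dispatcher. This is only a minimal sketch under stated assumptions: `ToyKernel`, `syscall`, `add_numbers`, and `read_file` are hypothetical names invented for illustration, the "model" is left out entirely, and the tools are trivial stand-ins for a code interpreter and a file store. It shows the shape of the idea (a central process that routes requests to tools and keeps a bounded context as its "RAM"), not how any real LLM OS project is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyKernel:
    """Hypothetical 'kernel': routes requests to tools, keeps a bounded context."""
    context_limit: int = 8  # "RAM": maximum number of entries kept in context
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    context: List[str] = field(default_factory=list)

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        """Install a tool, loosely like installing an app or driver."""
        self.tools[name] = fn

    def remember(self, item: str) -> None:
        """Append to context and evict the oldest entries when over the limit."""
        self.context.append(item)
        overflow = len(self.context) - self.context_limit
        if overflow > 0:
            del self.context[:overflow]

    def syscall(self, name: str, arg: str) -> str:
        """Dispatch a request to a registered tool and log the result."""
        if name not in self.tools:
            result = f"error: unknown tool {name!r}"
        else:
            result = self.tools[name](arg)
        self.remember(f"{name}({arg}) -> {result}")
        return result


def add_numbers(src: str) -> str:
    """Stand-in for a code interpreter: only sums integers like '2+3'."""
    return str(sum(int(part) for part in src.split("+")))


FILES = {"notes.txt": "LLM OS sketch"}  # stand-in for a file store


def read_file(name: str) -> str:
    return FILES.get(name, "not found")


kernel = ToyKernel(context_limit=2)
kernel.register("code", add_numbers)
kernel.register("fs", read_file)

print(kernel.syscall("code", "2+3"))
print(kernel.syscall("fs", "notes.txt"))
print(kernel.syscall("browser", "example.com"))  # not registered, returns an error string
print(kernel.context)  # only the two most recent entries remain
```

In a fuller version, the model itself would choose which `syscall` to issue at each step; here the calls are hard-coded so the dispatch and eviction logic stay easy to follow.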
AIOS
- agiresearch/AIOS
- embed large language model into the operating system as the brain of the OS. AIOS is designed to address proble

Resources
- bilalonur/awesome-llm-os
- Illustrated LLM OS: An Implementational Perspective, hf.co, December 3, 2023
- Medium articles (Protégé IGDTUW)
Other examples
- letta
- open source framework for building stateful LLM applications
- MemGPT
Open Interpreter
open-interpreter
Introducing Local III
changes.openinterpreter.com
OSWorld
- os-world.github.io
- OSWorld github
- paper OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments / pdf
Agent-S
- simular.ai/agent-s
- github.com/simular-ai/Agent-S
- paper Agent S: An Open Agentic Framework that Uses Computers Like a Human
Windows Agent Arena
We built a scalable open-sourced framework to test and develop AI agents that can reason, plan and act on a PC using language models
- Windows Agent Arena
- paper Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- github microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
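As a rough illustration of what "testing and benchmarking" an OS agent involves, the sketch below runs an agent against a list of tasks and reports a success rate. Everything here is an assumption made for illustration: the environment is a plain dictionary rather than a real desktop, and `run_task`, `success_rate`, and `keyword_agent` are hypothetical names that do not come from Windows Agent Arena or any other benchmark listed on this page.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]                       # toy "desktop" state
Action = Tuple[str, str]                     # (setting name, new value)
Agent = Callable[[str, State], Optional[Action]]
Task = Tuple[str, Callable[[State], bool]]   # (instruction, success check)


def run_task(agent: Agent, task: Task, max_steps: int = 5) -> bool:
    """Run one task in a fresh environment and check the final state."""
    instruction, check = task
    state: State = {}
    for _ in range(max_steps):
        action = agent(instruction, state)   # observe the state and decide
        if action is None:                   # agent signals it is done
            break
        key, value = action
        state[key] = value                   # apply the action
    return check(state)


def success_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks whose success check passes."""
    results = [run_task(agent, task) for task in tasks]
    return sum(results) / len(results)


def keyword_agent(instruction: str, state: State) -> Optional[Action]:
    """Stand-in for a model: only understands requests about dark mode."""
    if "dark mode" in instruction and state.get("theme") != "dark":
        return ("theme", "dark")
    return None


tasks: List[Task] = [
    ("enable dark mode", lambda s: s.get("theme") == "dark"),
    ("set volume to 30", lambda s: s.get("volume") == "30"),
]

print(success_rate(keyword_agent, tasks))  # 0.5: one of the two tasks succeeds
```

Real benchmarks replace the dictionary with a virtual machine or live application and the keyword rule with a multimodal model, but the loop of observing, acting, and checking the end state keeps the same overall shape.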
Mind2Web
- osu-nlp-group.github.io/Mind2Web
- paper Towards Learning a Generalist Model for Embodied Navigation
- NaviLLM - [CVPR 2024] The code for paper 'Towards Learning a Generalist Model for Embodied Navigation'
Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Mind2Web contains 2,350 tasks from 137 websites spanning 31 domains that:
- Reflect diverse and practical use cases on the web.
- Provide challenging yet realistic environments with real-world websites.
- Test generalization ability across tasks and environments.
Tools
iOS
scriptable.app - Automate iOS using JavaScript
a-shell - A terminal for iOS, with multiple windows
Android
shizuku - Let your app use system APIs directly