OS Control

OmniParser for GUI Agent

OmniParser

OmniParser for Pure Vision Based GUI Agent

OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Microsoft OmniParser - Screen Parsing Model - Install Locally

Microsoft AI Releases OmniParser Model on HuggingFace

LLM OS

Andrej Karpathy's LLM OS

xeet 1:

LLM OS. Bear with me I'm still cooking.

Specs:

  • LLM: OpenAI GPT-4 Turbo 256 core (batch size) processor @ 20Hz (tok/s)
  • RAM: 128Ktok
  • Filesystem: Ada002

xeet 2:

With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:

  • Input & Output across modalities (text, audio, vision)
  • Code interpreter, ability to write & run programs
  • Browser / internet access
  • Embeddings database for files and internal memory storage & retrieval

A lot of computing concepts carry over. Currently we have single-threaded execution running at ~10Hz (tok/s) and enjoy looking at the assembly-level execution traces stream by. Concepts from computer security carry over, with attacks, defenses and emerging vulnerabilities.

I also like the nearest neighbor analogy of "Operating System" because the industry is starting to shape up similar: Windows, OS X, and Linux <-> GPT, PaLM, Claude, and Llama/Mistral(?:)). An OS comes with default apps but has an app store. Most apps can be adapted to multiple platforms.

TLDR looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.
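A minimal sketch of the "LLM as kernel" picture from the xeet above: a loop that asks a model for the next action and dispatches it to a registered tool, with an embedding-backed memory as one of the tools. Everything here is an assumption for illustration: `Kernel`, `MemoryStore`, `embed`, and `scripted_model` are hypothetical names, the "model" is a fixed script rather than a real LLM, and the embedding is a toy letter-count vector standing in for a real embedding model.

```python
import math
from typing import Callable, Dict, List, Tuple

def embed(text: str) -> List[float]:
    """Toy embedding (assumption): counts of the letters a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    """Embeddings 'filesystem': store text, retrieve the nearest neighbor."""
    def __init__(self) -> None:
        self.items: List[Tuple[List[float], str]] = []

    def write(self, text: str) -> str:
        self.items.append((embed(text), text))
        return "stored"

    def read(self, query: str) -> str:
        if not self.items:
            return ""
        q = embed(query)
        return max(self.items, key=lambda item: cosine(q, item[0]))[1]

# A model maps the running context to (tool_name, argument).
Model = Callable[[List[str]], Tuple[str, str]]

class Kernel:
    """Single-threaded loop: the model picks a tool, the kernel runs it."""
    def __init__(self, model: Model) -> None:
        self.model = model
        self.tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self.tools[name] = fn

    def run(self, user_input: str, max_steps: int = 8) -> str:
        context = [user_input]
        for _ in range(max_steps):
            name, arg = self.model(context)
            if name == "final":
                return arg
            result = self.tools[name](arg)
            context.append(f"{name} -> {result}")
        raise RuntimeError("step budget exhausted")

def scripted_model(context: List[str]) -> Tuple[str, str]:
    """Stand-in for an LLM (assumption): replays a fixed plan."""
    script = [
        ("memory_write", "the report lives in reports/q3.txt"),
        ("memory_read", "where is the report"),
    ]
    step = len(context) - 1  # one context entry is added per tool call
    if step < len(script):
        return script[step]
    return ("final", context[-1])

memory = MemoryStore()
kernel = Kernel(scripted_model)
kernel.register("memory_write", memory.write)
kernel.register("memory_read", memory.read)
answer = kernel.run("find the report")
```

In this sketch the other "peripherals" from the list (code interpreter, browser, audio or vision I/O) would be further entries in `tools`; swapping `scripted_model` for a real model call is the part left open.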

AIOS

Resources

Other examples

Open Interpreter

open-interpreter

Introducing Local III

changes.openinterpreter.com

FREE: AI Agent Controls Your Mouse & Computer! Open Interpreter OS Mode (Screenshots)🤖 Open Source

OSWorld

Agent-S

Windows Agent Arena

We built a scalable open-sourced framework to test and develop AI agents that can reason, plan and act on a PC using language models.

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.

Mind2Web

Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Mind2Web contains 2,350 tasks from 137 websites spanning 31 domains that:

  • Reflect diverse and practical use cases on the web.
  • Provide challenging yet realistic environments with real-world websites.
  • Test generalization ability across tasks and environments.
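One way to picture the generalization point, as a sketch only: hold out entire websites so that evaluation tasks come from sites never seen during development. The `Task` record and its field names below are hypothetical and are not the dataset's actual schema; the example tasks are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass(frozen=True)
class Task:
    """Hypothetical task record (not the real Mind2Web schema)."""
    website: str
    domain: str
    instruction: str

def hold_out_websites(
    tasks: List[Task], held_out: Set[str]
) -> Tuple[List[Task], List[Task]]:
    """Split tasks so that held-out websites appear only in the test set."""
    train = [t for t in tasks if t.website not in held_out]
    test = [t for t in tasks if t.website in held_out]
    return train, test

# Made-up examples, for illustration only.
tasks = [
    Task("airline-a.example", "travel", "Book a one-way flight"),
    Task("airline-b.example", "travel", "Check in for a flight"),
    Task("shop-a.example", "shopping", "Add a blue shirt to the cart"),
]
train, test = hold_out_websites(tasks, {"airline-b.example"})
```

The same function shape would work for holding out whole domains instead of websites by filtering on `domain`.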

Tools

iOS

scriptable.app - Automate iOS using JavaScript

a-shell - A terminal for iOS, with multiple windows

Android

shizuku - Let your app use system APIs directly