
qwen3-vl-2b-instruct performing step-by-step tasks, confirming normalized coordinate usage and tool execution


wgabrys88/windows-ai-agent-toolset-v1


This GitHub repository (windows-ai-agent-toolset-v1) is a fresh proof-of-concept project demonstrating a lightweight, vision-based AI agent for automating Windows 11 desktop tasks. Overall, this is a neat, minimal take on "computer use" agents (like recent features from Anthropic/OpenAI), but fully local and dependency-free.

Solid foundation - exciting to see whether it evolves toward more general tasks or a broader toolset. Feedback, improvement ideas, and help with testing are welcome.

Strengths

  • Extremely lightweight and portable (no pip installs needed).
  • Precise coordinate handling for cross-resolution potential.
  • Fully local/no cloud—great for privacy/experimentation.
  • Impressive that a 2B-parameter model handles visual reasoning and tool use reliably enough for this demo.

Limitations

  • Tested only on 1080p resolution with specific DPI—no guarantees elsewhere.
  • Task is hardcoded and simple (targeted at Notepad++).
  • Small model can misinterpret visuals occasionally.
  • Early-stage/experimental (only a single main script + logs/screenshots).

Core Idea

The script uses the small Qwen3-VL-2B-Instruct vision-language model (run locally via LM Studio's OpenAI-compatible server) to "see" the screen via screenshots, reason step-by-step, and execute actions through tool calls. It requires no external Python packages—everything relies on built-in modules and direct Windows API calls via ctypes for screen capture, mouse control, clicking, typing, and scrolling.
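To make the flow concrete, here is a minimal sketch of such a request built with only the standard library. The endpoint path, model name, and message shape are assumptions based on LM Studio's standard OpenAI-compatible chat-completions API, not code copied from code.py.

```python
import base64
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's default server address

def ask_model(prompt: str, png_bytes: bytes, tools: list) -> dict:
    """Send the task prompt plus a base64-encoded screenshot to the local
    OpenAI-compatible endpoint and return the parsed JSON response."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    payload = {
        "model": "qwen3-vl-2b-instruct",   # model identifier as loaded in LM Studio
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "tools": tools,   # JSON-schema descriptions of take_screenshot, move_mouse(x, y), ...
        "temperature": 0.0,
    }
    req = urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```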

Key Features

  • Normalized coordinates: All mouse movements use a 0–1000 scale (independent of resolution, mapped precisely to pixel positions - see the sketch after this list).
  • Screenshot processing: Captures the full desktop via Windows GDI APIs, resizes it with high-quality Lanczos filtering (defaulting to 1344x756 for model input), encodes it to base64, and embeds it in prompts.
  • Tools available to the model:
    • take_screenshot
    • move_mouse(x, y) (normalized)
    • click_mouse
    • type_text(text)
    • scroll_down
  • Minimal dependencies: Pure Python + LM Studio running the model locally.
  • Low latency: The 2B model processes quickly (~60 tokens/sec in logs).
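A minimal sketch of how the 0–1000 mapping and the two mouse tools can be implemented with ctypes alone (the constants follow the Win32 documentation; the repository's own rounding and API choices may differ):

```python
import ctypes

user32 = ctypes.windll.user32
SM_CXSCREEN, SM_CYSCREEN = 0, 1
MOUSEEVENTF_LEFTDOWN, MOUSEEVENTF_LEFTUP = 0x0002, 0x0004

def norm_to_pixels(nx: int, ny: int) -> tuple[int, int]:
    """Map model-supplied 0..1000 coordinates onto the physical screen."""
    width = user32.GetSystemMetrics(SM_CXSCREEN)
    height = user32.GetSystemMetrics(SM_CYSCREEN)
    return round(nx / 1000 * (width - 1)), round(ny / 1000 * (height - 1))

def move_mouse(nx: int, ny: int) -> None:
    """move_mouse(x, y) tool: position the cursor from normalized coordinates."""
    x, y = norm_to_pixels(nx, ny)
    user32.SetCursorPos(x, y)

def click_mouse() -> None:
    """click_mouse tool: left-click at the current cursor position."""
    user32.mouse_event(MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0)
    user32.mouse_event(MOUSEEVENTF_LEFTUP, 0, 0, 0, 0)
```

With this mapping, for example, (500, 500) lands at the center of the screen regardless of whether the desktop is 1920x1080 or 3840x2160.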

Demo Task

The hardcoded example prompt instructs the agent to:

  1. Take a screenshot.
  2. Move mouse to the center of a Notepad++ window.
  3. Click.
  4. Type "hello".
  5. Take another screenshot.
  6. Verify if "hello" is visible, and if so, scroll down once.

Repository screenshots show this working:

  • Before: Notepad++ open in dark theme with an empty new file (multiple untitled tabs visible), plus overlapping windows like Command Prompt (running the script) and LM Studio logs.
  • After: The word "hello" successfully typed into the editor.
  • Annotated version: User-added red lines and an arrow highlighting the typed text area (possibly for emphasis or debugging).

Setup & Running

  1. Start LM Studio with the Qwen3-VL-2B-Instruct model in server mode (default localhost:1234).
  2. Run code.py.
  3. The script handles the loop: send prompt + current screenshot → model responds with reasoning + tool calls → execute tools → repeat up to max steps.

It logs progress, optionally dumps screenshots, and reports cursor positions (pixel + normalized).
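Conceptually, that loop can be outlined as below. This is a hypothetical sketch, not the contents of code.py: TOOLS, take_screenshot, type_text, and scroll_down stand in for the repository's real helpers, while ask_model, move_mouse, and click_mouse refer to the earlier sketches.

```python
import json

MAX_STEPS = 10  # upper bound on model/tool round-trips

def run_agent(task_prompt: str) -> None:
    """Observe -> reason -> act loop: screenshot in, tool calls out, repeat."""
    screenshot = take_screenshot()                         # initial observation (PNG bytes)
    for step in range(MAX_STEPS):
        reply = ask_model(task_prompt, screenshot, TOOLS)  # see the request sketch above
        message = reply["choices"][0]["message"]
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:                                 # model is done reasoning/acting
            print("Model finished:", message.get("content"))
            break
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"].get("arguments") or "{}")
            print(f"step {step}: {name}({args})")
            if name == "take_screenshot":
                screenshot = take_screenshot()
            elif name == "move_mouse":
                move_mouse(int(args["x"]), int(args["y"]))
            elif name == "click_mouse":
                click_mouse()
            elif name == "type_text":
                type_text(args["text"])
            elif name == "scroll_down":
                scroll_down()
```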

No license file is included, so default copyright applies; GitHub's terms only cover viewing and forking, so reuse cautiously.

### (my original description)
Code and logs, with screenshots from a real run.

This is a base for further improvements. The concept works on a super-small model; with the current hardcoded prompt schema it may sometimes target the center of the screen instead of the Notepad++ window, but that is OK - it shows that THIS is a good start. #notetoself: whenever I get stuck, use this as a reference for a workflow that works.

Anyone reading this - just run the code (launch LM Studio first in "server" mode; it is meant to run on the same machine, localhost) and you will see a real AI performing step-by-step actions on your real laptop/PC. There is no advanced "reasoning" implemented here; it is more a proper setup of how to:
-   take a screenshot using the Windows 11 API (no pip installs) - see the capture sketch after this list
-   use the OpenAI-compatible API to communicate between LM Studio and the script (the agent)
-   use normalized 0..1000 coordinates everywhere (crucial for simple, precise moves and clicks regardless of screen size)
-   write a good description of the tools, the system prompt, and the user task
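For reference, a minimal sketch of the GDI screenshot path using only ctypes (an illustration written against the Win32 documentation, not code copied from code.py); in the repository the captured buffer is then resized with Lanczos filtering and base64-encoded before being embedded in the prompt:

```python
import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32
gdi32 = ctypes.windll.gdi32

# Treat GDI handles as pointer-sized values so the sketch also behaves on 64-bit Python.
for fn in (user32.GetDC, gdi32.CreateCompatibleDC, gdi32.CreateCompatibleBitmap):
    fn.restype = ctypes.c_void_p

SRCCOPY = 0x00CC0020       # BitBlt raster operation: plain copy
DIB_RGB_COLORS = 0

class BITMAPINFOHEADER(ctypes.Structure):
    _fields_ = [("biSize", wintypes.DWORD), ("biWidth", wintypes.LONG),
                ("biHeight", wintypes.LONG), ("biPlanes", wintypes.WORD),
                ("biBitCount", wintypes.WORD), ("biCompression", wintypes.DWORD),
                ("biSizeImage", wintypes.DWORD), ("biXPelsPerMeter", wintypes.LONG),
                ("biYPelsPerMeter", wintypes.LONG), ("biClrUsed", wintypes.DWORD),
                ("biClrImportant", wintypes.DWORD)]

def capture_screen_bgra() -> tuple[bytes, int, int]:
    """Grab the full desktop as raw top-down BGRA bytes plus its width and height."""
    width = user32.GetSystemMetrics(0)    # SM_CXSCREEN
    height = user32.GetSystemMetrics(1)   # SM_CYSCREEN
    screen_dc = ctypes.c_void_p(user32.GetDC(None))
    mem_dc = ctypes.c_void_p(gdi32.CreateCompatibleDC(screen_dc))
    bitmap = ctypes.c_void_p(gdi32.CreateCompatibleBitmap(screen_dc, width, height))
    gdi32.SelectObject(mem_dc, bitmap)
    gdi32.BitBlt(mem_dc, 0, 0, width, height, screen_dc, 0, 0, SRCCOPY)

    info = BITMAPINFOHEADER()
    info.biSize = ctypes.sizeof(BITMAPINFOHEADER)
    info.biWidth, info.biHeight = width, -height   # negative height = top-down row order
    info.biPlanes, info.biBitCount = 1, 32         # 4 bytes per pixel, uncompressed
    pixels = ctypes.create_string_buffer(width * height * 4)
    gdi32.GetDIBits(mem_dc, bitmap, 0, height, pixels, ctypes.byref(info), DIB_RGB_COLORS)

    gdi32.DeleteObject(bitmap)                     # release GDI resources
    gdi32.DeleteDC(mem_dc)
    user32.ReleaseDC(None, screen_dc)
    return pixels.raw, width, height
```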

**VALIDATION:**
Tested on Windows 11 at 1080p with 100% scaling (it resizes screenshots for speed anyway, but I am not yet sure whether it works on other monitor sizes and scaling factors).

I am going to improve this version on my specific monitor setup for now, so as not to waste time on edge cases. It may work at other resolutions thanks to Per-Monitor V2 DPI awareness, but I won't test that for now - checking it is not the main goal at this moment.

**WHY THIS REPOSITORY IS IMPORTANT:**
I have finally managed to connect all the basic mechanisms and calculations needed for precision (it is not even the model's precision that matters here, but the precision of the screenshots, coordinates, and tool usage - MAKING A CLICK LAND ON A SPECIFIC PLACE ON THE SCREEN). That was the tough part, and the part I spent most of my time on.

**NEXT STEP:**
Debug the prompts and get the most precision out of the model itself. Add logging to be extra sure about the coordinate calculations - I can see it working now, but I don't have a general "mathematical" proof yet, and that's OK; for this version of the repository, everything is fine.

**IT WORKS!**
