Driving WhatsApp via accessibility APIs is mostly string matching, not tree walking
Every guide on this topic teaches the plumbing: AXUIElementCreateApplication, recurse kAXChildrenAttribute, post a CGEvent. Then they stop. What none of them tell you is the part that actually does the work: the five VoiceOver description strings the WhatsApp Catalyst app emits, which you have to match against, and the three coordinate thresholds that separate real results from sidebar noise. Here is that dictionary, pulled straight from the Swift source of whatsapp-mcp-macos.
Direct answer (verified 2026-05-20)
Attach to the WhatsApp process with AXUIElementCreateApplication(pid), where pid comes from NSRunningApplication filtered to bundle id net.whatsapp.WhatsApp. Walk kAXChildrenAttribute once with a depth cap of 15 to get a flat list of (role, description, x, y, w, h) tuples. Then match by role plus a substring of the description: outgoing bubbles start with "Your message, ", incoming ones with "message, " or "Message from "; the compose field is an AXTextArea whose description contains "compose", "message", or "type"; search result buttons live at x >= 1350 with w >= 200. Send by clicking the compose box, pasting via Cmd+V, and pressing Return. Verify by re-reading the tree and looking for a fresh "Your message, " bubble.
Source: Sources/WhatsAppMCP/main.swift
The shape of the problem
A web-automation engineer reading about driving WhatsApp through accessibility APIs imports a mental model from CSS selectors: find the right node, click it, type. That model is wrong here, in a way that matters.
The accessibility tree of a Catalyst app exposes almost no typed structure. There is no data-message-id, no role="listitem" with a clean child structure. What you get is a flat list of (role, description, x, y, w, h) tuples where description is a localized human-readable VoiceOver string. That string is the whole API surface.
WhatsApp Catalyst's description strings happen to be templated, so they parse with about thirty lines of regex. The work of writing an MCP server for WhatsApp on macOS is mostly figuring out those templates and writing the parsers; the AX plumbing is identical to what you would do for any other Catalyst app and takes a single afternoon.
The four moves, end to end
1. Attach by bundle id
Look up the WhatsApp PID via NSRunningApplication on bundle id net.whatsapp.WhatsApp. Pass it to AXUIElementCreateApplication. Window titles change with the active chat; bundle ids do not.
2. Walk once
Recurse kAXChildrenAttribute with a depth cap of 15. Read role, description, value, title, position, size. Keep nodes that have text or a role you care about. One traversal serves every operation.
3. Match the vocabulary
Filter by role plus a substring match on the description. The substrings are not yours to invent — they are "Your message, ", "message, ", "Message from ", "Sent to ", "Received from ", "compose", "message", "type", and so on.
4. Act, then re-read
Post a CGEvent click at (x + w/2, y + h/2). Paste via Cmd+V instead of typing. Press Return. Then traverse again, look for "Your message, " to verify the bubble actually rendered.
Step 1: attach by bundle id
The PID lookup uses NSRunningApplication.runningApplications(withBundleIdentifier:). If WhatsApp is not running, launch it via /usr/bin/open -a WhatsApp.app and wait two seconds for the AX tree to settle. The bundle id is stable; window titles change with the active chat and should not be used as an attach key.
Set an explicit messaging timeout right after attach. AX calls are synchronous cross-process messages under the hood; without an explicit timeout, a stutter in the target app stalls your caller along with it. The MCP server uses 2.0 seconds for the permission probe (where you want to fail fast) and 5.0 seconds for the real traversal.
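A minimal sketch of the attach step, assuming the host process already has Accessibility permission. The function name `attachToWhatsApp` is hypothetical and not taken from main.swift; the bundle id and the 5.0-second timeout are the values quoted above. Launching WhatsApp when it is not running is left out.

```swift
import AppKit
import ApplicationServices

// Bundle id quoted in the text; stable across chats, unlike window titles.
let whatsAppBundleID = "net.whatsapp.WhatsApp"

/// Hypothetical helper: returns an AX application element for a running
/// WhatsApp, or nil if permission is missing or the app is not running.
func attachToWhatsApp(timeout: Float = 5.0) -> AXUIElement? {
    // Without Accessibility permission every later AX call fails.
    guard AXIsProcessTrusted() else { return nil }

    let matches = NSRunningApplication.runningApplications(
        withBundleIdentifier: whatsAppBundleID)
    guard let app = matches.first else { return nil }

    let element = AXUIElementCreateApplication(app.processIdentifier)
    // Bound how long AX calls made through this element may block.
    _ = AXUIElementSetMessagingTimeout(element, timeout)
    return element
}
```

A caller would typically treat a nil result as "prompt for permission or launch the app", then retry.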
Step 2: traverse once, keep everything you might need
Walk kAXChildrenAttribute recursively to a depth of 15. On every node, read role, description, value, title, position, and size in a single pass. Keep the node if it has any text or if its role is in the meaningful set: AXButton, AXTextField, AXTextArea, AXStaticText, AXHeading, AXGenericElement, AXLink.
Doing it in one pass matters. Going back to the AX system per-node per-attribute roughly doubles the time spent talking to the target process. One walk, every attribute, flat array of structs out the other end. Every operation downstream filters that array; the array itself is regenerated on every tool call because WhatsApp's tree mutates on every keystroke.
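The traversal can be sketched roughly as follows. The `AXNode` struct and the helper names are hypothetical; the depth cap and the role set come from the text. For readability this sketch issues one AXUIElementCopyAttributeValue call per attribute inside the single walk; batching the reads with AXUIElementCopyMultipleAttributeValues is an option it does not show.

```swift
import ApplicationServices

// Hypothetical flat record; one per kept node.
struct AXNode {
    let role: String
    let description: String
    let value: String
    let title: String
    let frame: CGRect
}

// Roles the text lists as worth keeping even without text.
let meaningfulRoles: Set<String> = [
    "AXButton", "AXTextField", "AXTextArea", "AXStaticText",
    "AXHeading", "AXGenericElement", "AXLink",
]

func copyAttribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    let err = AXUIElementCopyAttributeValue(element, name as CFString, &value)
    return err == .success ? value : nil
}

func stringAttribute(_ element: AXUIElement, _ name: String) -> String {
    // Non-string values (numbers, for example) are treated as empty here.
    (copyAttribute(element, name) as? String) ?? ""
}

func frame(of element: AXUIElement) -> CGRect {
    var origin = CGPoint.zero
    var size = CGSize.zero
    if let raw = copyAttribute(element, kAXPositionAttribute),
       CFGetTypeID(raw) == AXValueGetTypeID() {
        _ = AXValueGetValue(raw as! AXValue, .cgPoint, &origin)
    }
    if let raw = copyAttribute(element, kAXSizeAttribute),
       CFGetTypeID(raw) == AXValueGetTypeID() {
        _ = AXValueGetValue(raw as! AXValue, .cgSize, &size)
    }
    return CGRect(origin: origin, size: size)
}

/// Depth-first walk that appends every node with text or a meaningful role.
func collectNodes(from element: AXUIElement, depth: Int = 0,
                  maxDepth: Int = 15, into nodes: inout [AXNode]) {
    guard depth <= maxDepth else { return }
    let role = stringAttribute(element, kAXRoleAttribute)
    let description = stringAttribute(element, kAXDescriptionAttribute)
    let value = stringAttribute(element, kAXValueAttribute)
    let title = stringAttribute(element, kAXTitleAttribute)
    let hasText = !(description.isEmpty && value.isEmpty && title.isEmpty)
    if hasText || meaningfulRoles.contains(role) {
        nodes.append(AXNode(role: role, description: description, value: value,
                            title: title, frame: frame(of: element)))
    }
    if let children = copyAttribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        for child in children {
            collectNodes(from: child, depth: depth + 1,
                         maxDepth: maxDepth, into: &nodes)
        }
    }
}
```

Every later step then filters the resulting `[AXNode]` array rather than touching the AX system again.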
Step 3: the actual vocabulary
This is the table no other accessibility-API guide will give you, because writing it requires reading WhatsApp Catalyst's AX tree on a real Mac. Each row is one of the description-string shapes the app emits, the role you find it under, and what to do with it. The patterns are taken verbatim from main.swift.
| Description pattern | Role and where it appears in the tree | What to do with it |
|---|---|---|
| "Your message, {text}, {time}, Sent to {name} {status}" | AXGenericElement role. The leading 'Your message, ' is the only reliable signal that this is an outgoing bubble. | Detect outgoing messages, verify a send actually rendered, extract the body and the recipient. |
| "message, {text}, {time}, Received from {name}" | AXGenericElement role. 1:1 incoming message. Sender is in the suffix after 'Received from '. | Detect incoming 1:1 messages and pull sender. |
| "Message from {sender}, {text}, {time}, Received in {name}, {count} unread" | AXGenericElement role. Group message variant. Sender lives between 'Message from ' and the first comma. | Parse group chat messages without confusing the group name with the sender. |
| "Added by non-contact {name}, {count} unread" | Description string for an unknown sender in the sidebar. Useful for triaging incoming chats from strangers. | Flag inbound chats that did not originate from a saved contact. |
| Compose AXTextArea, description contains compose/message/type | AXTextArea role. The visible compose box at the bottom of the active chat. Description varies slightly across macOS locales. | Click target for sending. Take the textarea whose description matches; fall back to the last AXTextArea in the tree. |
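The compose-field rule in the last row can be sketched as a small filter over the flat node list. The `Node` struct and `findComposeField` are hypothetical names. Matching case-insensitively and taking the first matching textarea are assumptions; the text only says to take the textarea whose description matches and to fall back to the last AXTextArea.

```swift
import Foundation

// Hypothetical minimal node: only the fields this rule needs.
struct Node: Equatable {
    let role: String
    let description: String
}

/// Picks the compose box: an AXTextArea whose description contains one of
/// the keywords, else the last AXTextArea in traversal order, else nil.
func findComposeField(in nodes: [Node]) -> Node? {
    let keywords = ["compose", "message", "type"]
    let textAreas = nodes.filter { $0.role == "AXTextArea" }
    let match = textAreas.first { node in
        let lowered = node.description.lowercased()  // assumption: ignore case
        return keywords.contains { lowered.contains($0) }
    }
    return match ?? textAreas.last
}
```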
Step 3a: the regex that turns description strings into MessageInfo
With the vocabulary known, the parser is straightforward. Take the AXGenericElement description, peel off the prefix ("Your message, " or "message, "), then peel off the routing suffix (", Sent to {name}" or ", Received from {name}"), then peel off the timestamp tail. Whatever remains is the message body.
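The peeling order described above can be sketched for the two 1:1 shapes. This version uses plain string searches rather than the regexes in main.swift, and the `MessageInfo` fields shown here are illustrative, not the project's actual struct. It assumes the timestamp itself contains no ", " and leaves the delivery status attached to the recipient name for outgoing bubbles.

```swift
import Foundation

// Illustrative shape; the real MessageInfo in main.swift may differ.
struct MessageInfo: Equatable {
    let isOutgoing: Bool
    let body: String
    let time: String
    let routing: String  // "{name} {status}" when outgoing, "{name}" when incoming
}

func parseDirectMessage(_ description: String) -> MessageInfo? {
    let shapes: [(prefix: String, marker: String, outgoing: Bool)] = [
        ("Your message, ", ", Sent to ", true),
        ("message, ", ", Received from ", false),
    ]
    for shape in shapes where description.hasPrefix(shape.prefix) {
        // 1. Peel off the prefix.
        let rest = description.dropFirst(shape.prefix.count)
        // 2. Peel off the routing suffix, searching from the end so a body
        //    that happens to contain the marker text does not confuse it.
        guard let marker = rest.range(of: shape.marker, options: .backwards) else {
            return nil
        }
        let head = rest[..<marker.lowerBound]  // "{text}, {time}"
        let routing = String(rest[marker.upperBound...])
        // 3. Peel off the timestamp tail: everything after the last ", ".
        guard let timeSep = head.range(of: ", ", options: .backwards) else {
            return nil
        }
        return MessageInfo(isOutgoing: shape.outgoing,
                           body: String(head[..<timeSep.lowerBound]),
                           time: String(head[timeSep.upperBound...]),
                           routing: routing)
    }
    return nil
}
```

Searching backwards for both separators is what lets a body with its own commas survive intact.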
Step 3b: spatial heuristics for search results
Description matching gets you about 80% of the way there. The other 20% is coordinate filtering, because WhatsApp's sidebar contains a lot of buttons that match description filters but are not what you want: tab pills (All, Unread, Favorites), media filter chips, navigation buttons, and so on.
Three thresholds, all encoded as constants in main.swift, do the filtering:
Spatial filter rules
- Search result buttons must have x >= 1350 (sidebar area on a 16-inch MacBook; tune for other window sizes).
- Search result buttons must have width >= 200 (excludes the narrow tab pills).
- The active chat heading sits at x > 1750 (right of the sidebar; distinguishes the chat title from section headings like Chats, Other contacts, Media).
- Section headings (Chats, Other contacts, Media) partition results vertically. The y-coordinate of each heading is the boundary between sections.
- Skip buttons whose description matches the static UI keyword set (chats, calls, settings, search, send, video message, etc.) regardless of position.
The thresholds are encoded as named constants, which means a WhatsApp layout change is a one-line edit, not a rewrite. The point is to keep the policy explicit and editable, not to derive it from first principles every time.
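A sketch of how those rules might look as named constants plus two small filters. The names (`Layout`, `isSearchResult`, `section`) are hypothetical; only the numeric thresholds come from the text. The keyword set below contains just the entries the text names, and exact matching on the lowercased description is an assumption.

```swift
import Foundation

// Hypothetical button record with the fields the spatial rules need.
struct SidebarButton {
    let description: String
    let x: Double
    let y: Double
    let width: Double
}

// Thresholds quoted in the text; tune for other window sizes.
enum Layout {
    static let searchResultMinX = 1350.0
    static let searchResultMinWidth = 200.0
    static let activeChatHeadingMinX = 1750.0  // heading must be strictly right of this
}

// Partial list: only the keywords the text names explicitly.
let staticUIKeywords: Set<String> = [
    "chats", "calls", "settings", "search", "send", "video message",
]

func isSearchResult(_ button: SidebarButton) -> Bool {
    guard button.x >= Layout.searchResultMinX,
          button.width >= Layout.searchResultMinWidth else { return false }
    return !staticUIKeywords.contains(button.description.lowercased())
}

func isActiveChatHeading(x: Double) -> Bool {
    x > Layout.activeChatHeadingMinX
}

/// Returns the name of the closest section heading at or above `y`, if any.
func section(forY y: Double, headings: [(name: String, y: Double)]) -> String? {
    headings.filter { $0.y <= y }.max { $0.y < $1.y }?.name
}
```

With this layout, a WhatsApp UI change means editing one constant in `Layout` and nothing else.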
Step 4: act, then verify
Find the compose AXTextArea by description. Click it via a CGEvent mouse-down/mouse-up pair at the element center (after saving the user's cursor position so the cursor does not visibly jump). Paste through the system clipboard rather than typing each character, because per-character CGEvent typing fails on emoji and fights with WhatsApp's IME. Press Return.
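The click and the Return keypress can be sketched as follows. The function names are hypothetical, and warping the cursor back with CGWarpMouseCursorPosition is one plausible way to restore the saved position, not necessarily what main.swift does. The virtual key code 36 is the Return key on Apple keyboards.

```swift
import CoreGraphics

/// Center of an element's frame, the click target used throughout the text.
func center(x: CGFloat, y: CGFloat, w: CGFloat, h: CGFloat) -> CGPoint {
    CGPoint(x: x + w / 2, y: y + h / 2)
}

/// Posts a left mouse-down/mouse-up pair at `point`, then puts the cursor back.
func click(at point: CGPoint) {
    // Save the user's cursor position before moving it.
    let original = CGEvent(source: nil)?.location
    let down = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                       mouseCursorPosition: point, mouseButton: .left)
    let up = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                     mouseCursorPosition: point, mouseButton: .left)
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
    if let original = original {
        _ = CGWarpMouseCursorPosition(original)
    }
}

/// Posts key-down then key-up for a virtual key code with optional modifiers.
func pressKey(_ keyCode: CGKeyCode, flags: CGEventFlags = []) {
    for keyDown in [true, false] {
        let event = CGEvent(keyboardEventSource: nil, virtualKey: keyCode,
                            keyDown: keyDown)
        event?.flags = flags
        event?.post(tap: .cghidEventTap)
    }
}

let returnKeyCode: CGKeyCode = 36  // kVK_Return
```

After the paste has landed, the send itself is `pressKey(returnKeyCode)`.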
Verification is the part where most accessibility-driven automations leave the LLM hanging. After hitting Return, re-walk the tree and look for an AXGenericElement whose description starts with "Your message, " and contains the body you just pasted. If you find it, return verified: true. If you do not, return verified: false plus a snippet of whatever did appear, so the calling agent can self-correct instead of pretending success.
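The verification check reduces to a filter over the descriptions from the second walk. In this sketch `SendResult` and `verifySend` are hypothetical names. Which text goes into the snippet on failure is an assumption: here it is the most recent outgoing bubble, or else the last description seen, truncated to 120 characters.

```swift
import Foundation

// Hypothetical result type mirroring the verified flag described in the text.
struct SendResult: Equatable {
    let verified: Bool
    let snippet: String?
}

/// `descriptions` are the AXGenericElement descriptions from a fresh tree walk.
func verifySend(body: String, descriptions: [String]) -> SendResult {
    let outgoing = descriptions.filter { $0.hasPrefix("Your message, ") }
    if outgoing.contains(where: { $0.contains(body) }) {
        return SendResult(verified: true, snippet: nil)
    }
    // Report what did appear so the calling agent can self-correct.
    let seen = outgoing.last ?? descriptions.last ?? ""
    return SendResult(verified: false, snippet: String(seen.prefix(120)))
}
```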
What this approach looks like next to alternatives
The competitor here is not another macOS accessibility library, but the other transports people pick to reach WhatsApp from an agent: headless Chromium driving web.whatsapp.com, the Signal Protocol fork (whatsmeow), or the Meta WhatsApp Business Cloud API. The table below covers the AX side of that comparison: for each design choice, what it is and what it actually buys you.
| Aspect | What it is in practice | What it buys you in whatsapp-mcp-macos (AX) |
|---|---|---|
| Bundle id | net.whatsapp.WhatsApp | Stable since the Catalyst port. NSRunningApplication.runningApplications(withBundleIdentifier:) returns the PID in microseconds. |
| AX tree depth | Up to ~15 levels (Catalyst leaks UIKit-style nesting). | maxDepth=15 reaches the chat list and the message panel without blowing the stack. Lower numbers miss content; higher numbers waste time. |
| Element addressing | AXButton for chat list rows, AXGenericElement for message bubbles, AXTextArea for compose, AXStaticText for the search bar, AXHeading for the active chat title. | Roles are stable across builds. Meta has not been observed obfuscating VoiceOver-targeted attributes the way they obfuscate web DOM class names. |
| Spatial heuristics | Search result buttons live at x >= 1350 with width >= 200. The active chat heading sits at x > 1750. Section headings (Chats, Other contacts, Media) partition results vertically. | Coordinate thresholds drop fragile filter buttons (All, Unread, Favorites) and route-button noise. They are encoded as constants, easy to retune if Meta ships a layout change. |
| Input dispatch | CGEvent mouse events posted into kCGHIDEventTap at element-center coordinates. Text dispatched via clipboard paste (Cmd+V), not per-character typing. | Paste handles emoji and the IME without races. Per-character CGEvent typing breaks on non-ASCII and is 30-200ms per character. |
Honest limitations
This approach is macOS-only. It depends on a real WhatsApp Desktop install staying logged in, on the host process having Accessibility permission, and on the user's screen being unlocked when the agent is acting (the AX tree is empty on a locked screen). It is single-user; the agent shares whatever pairing the user has, with all the consent implications that follow.
The description strings are localized. The matchers in main.swift assume an English locale. A French or Japanese WhatsApp Desktop ships translated VoiceOver labels and the parsers will need locale-aware variants. The cleanest fix is to keep WhatsApp Desktop in English regardless of system locale; the more invasive fix is to extend each regex with localized alternates.
Latency is real: one full tree walk on a chat with a few hundred rendered rows takes ~120ms on an M-series Mac. For send + verify that is two walks (~240ms) plus the 350ms paste sleep and the event dispatch around them, around 700ms end to end. That is fine for an interactive agent and bad for a batch sender. If you need volume, the WhatsApp Business Cloud API is the right tool and this is the wrong one.
Frequently asked questions
Why do I have to match strings like "Your message, " instead of using a more structured attribute?
Catalyst apps do not expose typed semantic data through accessibility. The richest signal is the VoiceOver description, which is a localized human-readable label. WhatsApp's labels happen to be templated, so they parse cleanly with five regex patterns. There is no AXValue you can pull that contains just the message text or just the sender. The description string is the API surface.
Will the description strings change in a future WhatsApp build?
The five patterns shipped in whatsapp-mcp-macos have been stable across every Catalyst build between 2024 and mid-2026. They map to VoiceOver labels Meta needs to keep working for visually impaired users, so the strings are under more pressure to stay stable than the obfuscated CSS class names on web.whatsapp.com (which rotate weekly). The localized prefixes are the most likely thing to drift; if WhatsApp ships a non-English UI you will see locale-translated equivalents and need to either lock the app to English or extend the matchers.
Why traverse the whole tree on every operation? Cannot you cache it?
WhatsApp's AX tree mutates on every chat switch, every message arrival, every search keystroke. The tree returned by AXUIElementCopyAttributeValue is a snapshot; caching it gives the agent stale coordinates that no longer click anything. Walking 15 levels on a Catalyst process takes ~120ms on an M-series Mac, which is the cost of doing business. The compromise that does pay off is reading every interesting attribute on each node in one pass instead of going back to the AX system for each one.
Why post CGEvent clicks instead of calling AXUIElementPerformAction(AXPress)?
AXPress works on AXButton elements and refuses to act on AXGenericElement, AXTextArea, and the chat-list rows that WhatsApp uses for everything. The compose box specifically does not respond to AXPress. CGEvent posts a real mouse-down/mouse-up at the element center via kCGHIDEventTap, which the Catalyst app processes through the same code path a human cursor would hit. It works on every element and side-steps the question of which AXAction the element claims to support.
Does this approach work for groups too?
Yes, and the group variant is the reason the three description patterns above exist instead of one. Group messages arrive as "Message from {sender}, {text}, {time}, Received in {group_name}, {N} unread". The 1:1 variant is "message, {text}, {time}, Received from {sender}". whatsapp-mcp-macos parses both shapes and unifies them into a MessageInfo struct so the LLM does not see two different schemas for one logical concept.
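The group shape can be parsed with the same peeling idea, sketched here with plain string searches rather than the project's regexes. `GroupMessage` and `parseGroupMessage` are hypothetical names. The sketch assumes the sender name contains no ", " and treats the trailing ", {N} unread" part as optional.

```swift
import Foundation

// Illustrative result type for the group variant.
struct GroupMessage: Equatable {
    let sender: String
    let body: String
    let time: String
    let group: String
    let unread: Int?
}

func parseGroupMessage(_ description: String) -> GroupMessage? {
    let prefix = "Message from "
    guard description.hasPrefix(prefix) else { return nil }
    var rest = description.dropFirst(prefix.count)

    // Sender lives between the prefix and the first ", ".
    guard let senderEnd = rest.range(of: ", ") else { return nil }
    let sender = String(rest[..<senderEnd.lowerBound])
    rest = rest[senderEnd.upperBound...]

    // Routing suffix: ", Received in {group}[, {N} unread]".
    guard let marker = rest.range(of: ", Received in ", options: .backwards) else {
        return nil
    }
    let head = rest[..<marker.lowerBound]  // "{text}, {time}"
    var group = String(rest[marker.upperBound...])
    var unread: Int? = nil
    if group.hasSuffix(" unread"),
       let sep = group.range(of: ", ", options: .backwards) {
        let countText = group[sep.upperBound...].dropLast(" unread".count)
        if let count = Int(countText) {
            unread = count
            group = String(group[..<sep.lowerBound])
        }
    }

    // Timestamp tail: everything after the last ", " in the head.
    guard let timeSep = head.range(of: ", ", options: .backwards) else { return nil }
    return GroupMessage(sender: sender,
                        body: String(head[..<timeSep.lowerBound]),
                        time: String(head[timeSep.upperBound...]),
                        group: group,
                        unread: unread)
}
```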
Why paste through the system clipboard? Can I keep the clipboard untouched?
The system clipboard is the only path that handles emoji, non-ASCII, IME composition, and long bodies without a per-character race against WhatsApp's input handler. whatsapp-mcp-macos backs up the existing clipboard contents before the paste, sets the new text, dispatches Cmd+V, sleeps 350ms, and restores the original contents. From the user's perspective the clipboard is unchanged after the send. The tradeoff is a 350ms window in which a parallel Cmd+V from the user would paste the message text instead of their own content.
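The backup, paste, and restore sequence could look roughly like this. The function name `pasteText` is hypothetical. As a simplification, this sketch backs up only the plain-text representation of the clipboard; preserving images or rich text would require copying every pasteboard item, which is not shown. Virtual key code 9 is the V key on an ANSI layout.

```swift
import AppKit

let pasteSettleSeconds: TimeInterval = 0.35  // the 350ms sleep quoted in the text

/// Puts `text` on the general pasteboard, sends Cmd+V, waits, then restores
/// the previous plain-text clipboard contents (if there were any).
func pasteText(_ text: String, settle: TimeInterval = pasteSettleSeconds) {
    let pasteboard = NSPasteboard.general
    let saved = pasteboard.string(forType: .string)  // plain text only

    pasteboard.clearContents()
    _ = pasteboard.setString(text, forType: .string)

    // Cmd+V as a key-down/key-up pair.
    for keyDown in [true, false] {
        let event = CGEvent(keyboardEventSource: nil, virtualKey: 9,
                            keyDown: keyDown)
        event?.flags = .maskCommand
        event?.post(tap: .cghidEventTap)
    }

    // Give WhatsApp time to consume the paste before restoring.
    Thread.sleep(forTimeInterval: settle)

    pasteboard.clearContents()
    if let saved = saved {
        _ = pasteboard.setString(saved, forType: .string)
    }
}
```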
Is there a non-macOS version?
No. AXUIElement is Apple-platform-only (ApplicationServices on macOS). Windows ships UI Automation, which has a similar shape but a completely different API surface. Linux has AT-SPI but no native WhatsApp app to drive. The approach is fundamentally tied to the existence of a real desktop WhatsApp client plus an accessibility framework the app participates in.