WhatsApp Desktop accessibility automation, the part most guides skip

The minimal recipe for driving WhatsApp Desktop on macOS via the accessibility APIs is small: attach to the PID with AXUIElementCreateApplication, recurse through kAXChildrenAttribute, and post CGEvent mouse and keyboard events at the resolved coordinates. Every blog post and every example repo gets that part right. The part they get wrong is permission verification: they call AXIsProcessTrustedWithOptions and stop. In production that flag can return true while every AX call silently fails, and the calling agent has no idea why.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-07)

To automate WhatsApp Desktop with macOS accessibility, run a native binary that calls AXUIElementCreateApplication(pid) against the WhatsApp PID, sets a messaging timeout via AXUIElementSetMessagingTimeout, recurses through kAXChildrenAttribute to build a flat list of AXButton, AXTextField, AXTextArea, and AXStaticText nodes with their absolute screen coordinates, then posts CGEvent mouse and keyboard events at those coordinates. The host process must be enabled in System Settings > Privacy & Security > Accessibility, AND the TCC cache for that host must be fresh. Verify the second condition with a functional probe (a real read of kAXChildrenAttribute under a 2.0 second timeout), not just AXIsProcessTrustedWithOptions. The framework reference is at developer.apple.com/documentation/applicationservices/axuielement_h.

The pattern every other guide ships

If you read three or four accessibility-automation guides aimed at driving a Catalyst or AppKit app on macOS, the permission-check section is always the same. Some variant of this:

naive-permission-check.swift
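A minimal sketch of the kind of check being described, assuming nothing beyond the ApplicationServices call named above. The function name accessibilityLooksGranted is hypothetical, and the printed messages are illustrative only.

```swift
import ApplicationServices

/// Naive check: asks whether this process is listed as trusted for
/// Accessibility, and nothing else. No AX call is ever exercised.
func accessibilityLooksGranted() -> Bool {
    // Passing false for the prompt key suppresses the system permission dialog.
    let promptKey = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    let options = [promptKey: false] as CFDictionary
    return AXIsProcessTrustedWithOptions(options)
}

if accessibilityLooksGranted() {
    print("Accessibility granted, continuing")
} else {
    print("Accessibility not granted, enable the host app in System Settings")
}
```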

That code compiles, runs, and looks correct. It will pass any unit test you can think to write. It will work flawlessly on a fresh macOS install, on a freshly granted Accessibility permission, the first time anyone runs the script. It is also wrong, and the way it is wrong is the worst kind: a silent false positive whose only symptom is that downstream tools return empty results and your agent loops on a hallucinated diagnosis.

Why the trust flag lies

macOS records privacy grants in a database called TCC (Transparency, Consent, and Control). When you toggle an app on in Privacy & Security > Accessibility, the TCC database records a row keyed to the app's code-signing identity and bundle id. When a process calls AXIsProcessTrustedWithOptions, macOS reads that row and returns true if it matches.

The catch: the row is bound to the code-signing identity that was present when you first granted permission. If anything changes the on-disk binary's signature without TCC invalidating the cache, the lookup still hits the old row and reports true. But the live AX subsystem now sees a different signature on the calling process and rejects every actual AXUIElementCopyAttributeValue call. From the script's perspective, you are trusted; from the AX subsystem's perspective, you are not. The two never agree.

The triggers are mundane. A macOS minor update that bumps a system-level signature. A Homebrew reinstall of a host app that re-signs the bundle. An over-the-top app update that ships a new provisioning profile. A switch between code-signing identities during development. None of these are exotic. All of them produce the same surface symptom: status checks pass, every other tool returns nothing.

The only honest way to detect the broken state is to actually try to use the API. Read kAXChildrenAttribute on the root element. If the call returns an error or a nil value while AXIsProcessTrusted reports true, the flag is lying to you and you have a remediation message to surface.
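A functional probe along those lines might look like the sketch below. It is not the repository's code: the function name accessibilityActuallyWorks is hypothetical, and the 2.0 second default is taken from the timeout this article quotes for the probe.

```swift
import ApplicationServices

/// Returns true only when a real AX read against `pid` succeeds.
/// A stale grant is expected to surface here as an error or a nil value.
func accessibilityActuallyWorks(pid: pid_t, timeout: Float = 2.0) -> Bool {
    let root = AXUIElementCreateApplication(pid)
    // Cap the wait so a broken or hung target fails fast instead of blocking.
    _ = AXUIElementSetMessagingTimeout(root, timeout)

    var children: CFTypeRef?
    let error = AXUIElementCopyAttributeValue(
        root, kAXChildrenAttribute as CFString, &children)
    return error == .success && children != nil
}
```

A caller would report both results side by side, for example AXIsProcessTrusted() next to accessibilityActuallyWorks(pid:), so that "not granted" and "granted but broken" stay distinguishable.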

The pattern whatsapp-mcp-macos uses instead

Two functions, lifted from Sources/WhatsAppMCP/main.swift. Note the two distinct timeouts: 2.0 seconds for the probe so a stale cache fails fast, and 5.0 seconds for the real traversal (further down) so a brief layout pass in WhatsApp does not abort a tool call.

Sources/WhatsAppMCP/main.swift

The exposed shape of whatsapp_status is what a calling LLM agent reads, and the difference between the three states is what tells the agent (or the human running it) exactly what to do next. Three real responses, copied from the binary at three different machine states:

whatsapp_status, three machine states
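The three states can be modeled as below. This is a sketch, not the binary's literal output: the field names accessibilityTrusted and accessibilityWorking appear in the FAQ further down, while the StatusReport type, the state labels, and the exact remediation wording are assumptions made for illustration.

```swift
import Foundation

// Hypothetical status payload; only the two Bool field names come from the article.
struct StatusReport: Codable, Equatable {
    let accessibilityTrusted: Bool
    let accessibilityWorking: Bool
    let state: String          // illustrative label, not the real field
    let remediation: String?   // nil when nothing needs fixing
}

func makeStatus(trusted: Bool, working: Bool) -> StatusReport {
    switch (trusted, working) {
    case (true, true):
        return StatusReport(accessibilityTrusted: true, accessibilityWorking: true,
                            state: "ok", remediation: nil)
    case (true, false):
        // The stale-cache case: the flag says yes, the probe says no.
        return StatusReport(accessibilityTrusted: true, accessibilityWorking: false,
                            state: "granted_but_broken",
                            remediation: "Remove the host app from Privacy & Security > Accessibility, add it back, then relaunch it.")
    case (false, _):
        return StatusReport(accessibilityTrusted: false, accessibilityWorking: false,
                            state: "not_granted",
                            remediation: "Enable the host app in Privacy & Security > Accessibility.")
    }
}
```

Keeping the two booleans separate, rather than collapsing them into one flag, is what lets the calling agent pick the right remediation.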

Walking the AX tree once you trust it

Once both flags are true, every other tool call ends up here: traverseAXTree(pid:). It returns a flat list of AXElementInfo values, each carrying enough role and screen-coordinate data for the click and read tools to operate on. The depth cap of 15 is empirical: any deeper and you hit pure skeleton-container leaves inside Catalyst's rendering pipeline; any shallower and you miss the chat list rows.
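A depth-capped walk of that shape can be sketched as follows. The names ElementInfo, walk, and walkTree are hypothetical stand-ins, and the real AXElementInfo may carry different fields; the role set and the 5.0 second timeout are the ones this article lists.

```swift
import ApplicationServices

// Hypothetical record type; the source's AXElementInfo may differ.
struct ElementInfo {
    let role: String
    let text: String
    let frame: CGRect   // top-left-origin screen coordinates
}

let meaningfulRoles: Set<String> = [
    "AXButton", "AXTextField", "AXTextArea", "AXStaticText",
    "AXHeading", "AXGenericElement", "AXLink",
]

/// Keep nodes that have text or whose role is in the meaningful set.
func shouldKeep(role: String, text: String) -> Bool {
    return !text.isEmpty || meaningfulRoles.contains(role)
}

func stringAttribute(_ element: AXUIElement, _ name: String) -> String {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, name as CFString, &value) == .success
    else { return "" }
    return (value as? String) ?? ""   // non-string values are treated as empty
}

func frame(of element: AXUIElement) -> CGRect? {
    var posRef: CFTypeRef?
    var sizeRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, kAXPositionAttribute as CFString, &posRef) == .success,
          AXUIElementCopyAttributeValue(element, kAXSizeAttribute as CFString, &sizeRef) == .success,
          let pos = posRef, CFGetTypeID(pos) == AXValueGetTypeID(),
          let size = sizeRef, CFGetTypeID(size) == AXValueGetTypeID()
    else { return nil }
    var origin = CGPoint.zero
    var extent = CGSize.zero
    guard AXValueGetValue(pos as! AXValue, .cgPoint, &origin),
          AXValueGetValue(size as! AXValue, .cgSize, &extent)
    else { return nil }
    return CGRect(origin: origin, size: extent)
}

func walk(_ element: AXUIElement, depth: Int, maxDepth: Int, into out: inout [ElementInfo]) {
    guard depth <= maxDepth else { return }   // the depth cap
    let role = stringAttribute(element, kAXRoleAttribute)
    // First non-empty of description, value, title.
    let text = [kAXDescriptionAttribute, kAXValueAttribute, kAXTitleAttribute]
        .map { stringAttribute(element, $0) }
        .first(where: { !$0.isEmpty }) ?? ""
    if shouldKeep(role: role, text: text), let rect = frame(of: element) {
        out.append(ElementInfo(role: role, text: text, frame: rect))
    }
    var childrenRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &childrenRef) == .success,
          let children = childrenRef as? [AXUIElement]
    else { return }
    for child in children {
        walk(child, depth: depth + 1, maxDepth: maxDepth, into: &out)
    }
}

func walkTree(pid: pid_t, maxDepth: Int = 15) -> [ElementInfo] {
    let root = AXUIElementCreateApplication(pid)
    _ = AXUIElementSetMessagingTimeout(root, 5.0)   // traversal timeout from the article
    var out: [ElementInfo] = []
    walk(root, depth: 0, maxDepth: maxDepth, into: &out)
    return out
}
```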

Sources/WhatsAppMCP/main.swift

Clicks and paste, two surprisingly load-bearing details

The AX tree gives you positions. Clicking and typing happen outside AX, through the HID event tap. Two things are easy to get wrong here. The first is the cursor: if you do not save and restore the user's mouse position around every click, the cursor jumps around the screen while the agent is working, which users hate immediately. The second is text input: per-character CGEvent typing is slow, races with the IME, and corrupts emoji. Paste does not.
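The cursor handling can be sketched as below, under a few stated assumptions: the helper names flipY, center, and click are hypothetical, the first entry of NSScreen.screens is taken to be the primary display, and the 50 ms pause before restoring the cursor is a guess rather than a value from the source.

```swift
import AppKit

/// NSEvent.mouseLocation uses a bottom-left origin; CGEvent uses a top-left
/// origin. Flipping against the primary display height converts between them.
func flipY(_ point: CGPoint, primaryScreenHeight: CGFloat) -> CGPoint {
    return CGPoint(x: point.x, y: primaryScreenHeight - point.y)
}

/// Center of an element frame: (x + width/2, y + height/2).
func center(of frame: CGRect) -> CGPoint {
    return CGPoint(x: frame.midX, y: frame.midY)
}

/// Clicks at `target` (top-left-origin coordinates), then puts the cursor back.
func click(at target: CGPoint) {
    let height = NSScreen.screens.first?.frame.height ?? 0
    let saved = flipY(NSEvent.mouseLocation, primaryScreenHeight: height)

    let source = CGEventSource(stateID: .hidSystemState)
    let types: [CGEventType] = [.leftMouseDown, .leftMouseUp]
    for type in types {
        let event = CGEvent(mouseEventSource: source, mouseType: type,
                            mouseCursorPosition: target, mouseButton: .left)
        event?.post(tap: .cghidEventTap)
    }

    // Assumed settle time so the click lands before the cursor is moved back.
    Thread.sleep(forTimeInterval: 0.05)
    _ = CGWarpMouseCursorPosition(saved)
}
```

For an element frame returned by the tree walk, the call would be click(at: center(of: frame)).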

Sources/WhatsAppMCP/main.swift

The full loop, end to end

1. Resolve the WhatsApp PID by bundle id, not by window title

Window titles change with the active chat. Bundle ids do not. whatsapp-mcp-macos hardcodes net.whatsapp.WhatsApp and looks up the PID via NSRunningApplication. If the app is not running, launch it with /usr/bin/open -a WhatsApp.app and wait two seconds for the AX tree to settle.

import AppKit

// Resolve the PID by bundle id; nil means WhatsApp is not running.
let bundleID = "net.whatsapp.WhatsApp"
let apps = NSRunningApplication
  .runningApplications(withBundleIdentifier: bundleID)
let pid = apps.first?.processIdentifier

2. Verify accessibility with TWO checks, not one

Call AXIsProcessTrustedWithOptions with the prompt suppressed. Then attach to the WhatsApp PID, set a 2.0 second AX messaging timeout, and try to read kAXChildrenAttribute on the root element. Only treat the host as functionally working when the read succeeds AND returns a non-nil children ref. This is the part of the pattern that catches a stale TCC cache.

3. Attach to the target with AXUIElementCreateApplication and a 5.0s timeout

AXUIElementCreateApplication(pid) returns the root AX element for the target process. AX calls are synchronous IPC into that process; without a messaging timeout, a busy or hung target can stall the caller for a long time. whatsapp-mcp-macos uses 2.0s for the probe (where you want to fail fast) and 5.0s for the real traversal (where the app might briefly stutter while a chat loads).

4. Recurse through kAXChildrenAttribute with a depth cap of 15

Catalyst apps have absurdly deep AX hierarchies. A maxDepth of 15 catches the leaves of the WhatsApp chat list and message panel without blowing the stack on a runaway tree. On every node read kAXRoleAttribute, kAXDescriptionAttribute, kAXValueAttribute, kAXTitleAttribute, kAXPositionAttribute, kAXSizeAttribute. Keep nodes with non-empty text or with a role in the meaningful set: AXButton, AXTextField, AXTextArea, AXStaticText, AXHeading, AXGenericElement, AXLink.

5. Find target elements by role plus a substring of description, value, or title

WhatsApp Desktop's AX tree exposes contact names and message text in kAXDescriptionAttribute, search input in an AXTextField, and the compose box as an AXTextArea whose description usually contains 'compose', 'message', or 'type'. Lowercase both sides of the comparison. Treat element coordinates as absolute screen coordinates in points, measured from the top-left of the primary display.

6. Click by posting a CGEvent at the element center, then restore the cursor

Compute (x + width/2, y + height/2), create a leftMouseDown/leftMouseUp pair of CGEvents from a CGEventSource(stateID: .hidSystemState), and post each one with post(tap: .cghidEventTap). Save NSEvent.mouseLocation beforehand, flip Y from NSEvent's bottom-origin to CGEvent's top-origin, and restore the cursor at the end, so the user does not see their cursor jump.

7. Type via Cmd+V paste, not character-by-character CGEvent typing

Per-character CGEvent typing fails on emoji, fights with WhatsApp's IME, and is slow enough to race the message render. Back up the user's clipboard, set the new value, post Cmd+V, sleep 350 ms, then restore the original clipboard. One keystroke, full Unicode, predictable.

8. Re-read the AX tree after the action and verify the change

After hitting Return, traverse the tree again and look for an AXGenericElement whose description starts with 'Your message, '. That is how WhatsApp Catalyst announces an outgoing bubble to VoiceOver. If you see it, return verified: true. If not, return verified: false plus a snippet of whatever did appear, so the calling agent can self-correct instead of pretending success.
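Step 5's lookup reduces to a pure function over the flattened tree. A minimal sketch, assuming a hypothetical Node record that holds the four attributes as plain strings; the three compose hints are the ones named in step 5.

```swift
import Foundation

// Hypothetical flattened node; `desc` stands for kAXDescriptionAttribute.
struct Node: Equatable {
    let role: String
    let desc: String
    let value: String
    let title: String
}

/// True when the role matches exactly and `needle` occurs, case-insensitively,
/// in the description, value, or title.
func matches(_ node: Node, role: String, needle: String) -> Bool {
    guard node.role == role else { return false }
    let lowered = needle.lowercased()
    return [node.desc, node.value, node.title]
        .contains { $0.lowercased().contains(lowered) }
}

/// Finds the first AXTextArea whose text mentions one of the compose hints.
func findComposeBox(in nodes: [Node]) -> Node? {
    let hints = ["compose", "message", "type"]
    return nodes.first { node in
        hints.contains { matches(node, role: "AXTextArea", needle: $0) }
    }
}
```

Matching on role first keeps a contact whose name happens to contain "message" from being mistaken for the compose box.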

Accessibility automation vs. WhatsApp Web automation

The other widespread approach to automating WhatsApp without the Business API is driving web.whatsapp.com in headless Chromium. It works, but it has a different shape of fragility. Honest tradeoffs, in the dimensions that actually decide which one breaks first in production:

Feature by feature: web.whatsapp.com automation (browser) vs. whatsapp-mcp-macos (accessibility)

What gets driven
web.whatsapp.com automation (browser): A Chromium tab pointed at web.whatsapp.com. The DOM, the IndexedDB session, and the ServiceWorker are the surface area you act against.
whatsapp-mcp-macos (accessibility): The native net.whatsapp.WhatsApp Catalyst process. The macOS accessibility tree of that process is the surface area; the underlying app is unaware anything beyond a real user is interacting.

How elements are addressed
web.whatsapp.com automation (browser): CSS selectors against a heavily-obfuscated DOM that rotates class names on every Meta build (e.g. _akbu, _akbv). Selectors break weekly.
whatsapp-mcp-macos (accessibility): kAXRoleAttribute filtered to AXButton/AXTextField/AXTextArea/AXStaticText/AXGenericElement/AXLink, then a substring match on kAXDescriptionAttribute, kAXValueAttribute, or kAXTitleAttribute. AX descriptions track VoiceOver labels, which Meta has not been observed to obfuscate.

Pairing model
web.whatsapp.com automation (browser): QR code each session in a clean profile, plus the multi-device Signal Protocol fork (whatsmeow et al.) if you want to bypass the browser. Both are fragile under Meta's anti-automation heuristics.
whatsapp-mcp-macos (accessibility): Whatever pairing the human already did inside the WhatsApp Desktop app. The MCP server never touches the network protocol; if the desktop app is logged in, you are logged in.

Permission surface
web.whatsapp.com automation (browser): None at the OS level, but Meta-side: account flags, rate-limits, and the risk of being banned for scripted browser activity at scale.
whatsapp-mcp-macos (accessibility): macOS Accessibility, granted to the host process that forks the MCP child (e.g. Claude.app, Cursor.app, Terminal). One toggle, one user consent. No Meta-side keys.

Failure mode that costs the most time
web.whatsapp.com automation (browser): A class name rotates and every selector in the codebase 404s at once. You ship a patch within hours or the bot is dark.
whatsapp-mcp-macos (accessibility): Stale TCC cache after a macOS update or a host app re-sign. AXIsProcessTrusted reports true, AX child reads return nothing. Without a functional probe in your status output, the LLM gets confused for hours; with the probe, the user is told to remove and re-add the host app in Privacy & Security in one breath.

Cross-app reuse
web.whatsapp.com automation (browser): The whole driver is bound to web.whatsapp.com. Driving Slack or Linear is a fresh project.
whatsapp-mcp-macos (accessibility): The exact same Swift loop that drives net.whatsapp.WhatsApp can target /System/Applications/Calculator.app or any Catalyst window the user has Accessibility access for. The walker, the click, and the probe pattern are app-agnostic.

What runs where
web.whatsapp.com automation (browser): A headless Chromium plus a session store, plus a process supervisor. Resident memory in the hundreds of megabytes.
whatsapp-mcp-macos (accessibility): A single Swift binary started by the MCP client over stdio. Roughly twenty megabytes resident; nothing runs when no tool is invoked.

Frequently asked questions

Why is AXIsProcessTrusted not enough on its own?

AXIsProcessTrustedWithOptions reads a flag in the per-process trust cache. After a macOS update, an app re-sign, a code-signing identity rotation, or simply a TCC database getting out of sync with the on-disk app bundle, that flag can return true while every AXUIElementCopyAttributeValue call against a target PID returns errors or nil. The host process thinks it has accessibility; the AX subsystem disagrees. The fix is to actually exercise the API once at startup with a tight timeout, and surface both states so a caller can distinguish 'not granted' from 'granted but broken'.

What is the smallest end-to-end loop to drive WhatsApp Desktop from a Swift script?

Look up the PID with NSRunningApplication.runningApplications(withBundleIdentifier: "net.whatsapp.WhatsApp"). Call AXUIElementCreateApplication(pid) and AXUIElementSetMessagingTimeout(root, 5.0). Recurse through kAXChildrenAttribute up to depth 15, capturing role, description, value, title, position, and size on each node. Filter to AXButton, AXTextField, AXTextArea, AXStaticText, AXGenericElement, AXLink, plus anything with non-empty text. Find the compose AXTextArea by description substring, click its center via CGEvent, paste with Cmd+V, then post a Return key event. About 250 lines of Swift.

Why use clipboard paste instead of typing characters one at a time?

Posting a CGEvent for each character takes 30 to 200 ms per character, depending on the source rate and how the target's IME schedules input. Emoji, ZWJ sequences, RTL marks, and many non-ASCII characters either fail or arrive out of order. Pasting via Cmd+V is a single keystroke regardless of payload size, handles all Unicode, and lands as one atomic edit in the WhatsApp compose field. The only cost is briefly clobbering the user's clipboard, which is mitigated by saving the prior value, sleeping 350 ms, and restoring it.
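The save, paste, restore sequence can be sketched as below. The helper names withTemporaryClipboard, postCommandV, and pasteText are hypothetical, and this version only preserves plain-string clipboard contents; images or rich text on the pasteboard would need extra handling that is not shown. Virtual key code 9 is the V key on the ANSI layout.

```swift
import AppKit

/// Puts `text` on the general pasteboard, runs `body`, waits, then restores
/// whatever plain string was there before (or leaves the pasteboard empty).
func withTemporaryClipboard(_ text: String,
                            restoreDelay: TimeInterval = 0.35,
                            _ body: () -> Void) {
    let pasteboard = NSPasteboard.general
    let saved = pasteboard.string(forType: .string)

    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)
    body()

    // Give the target app time to consume the paste before restoring.
    Thread.sleep(forTimeInterval: restoreDelay)
    pasteboard.clearContents()
    if let saved = saved {
        pasteboard.setString(saved, forType: .string)
    }
}

/// Posts a single Cmd+V keystroke through the HID event tap.
func postCommandV() {
    let source = CGEventSource(stateID: .hidSystemState)
    let vKey: CGKeyCode = 9   // kVK_ANSI_V
    let down = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let up = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    down?.flags = .maskCommand
    up?.flags = .maskCommand
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
}

func pasteText(_ text: String) {
    withTemporaryClipboard(text) { postCommandV() }
}
```

Splitting the clipboard handling from the keystroke keeps the restore logic checkable without sending any input to the frontmost app.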

How does the automation know it is talking to the right chat?

The send tool refuses to fire unless getActiveChatName(pid:), which walks the AX tree of the chat panel, returns a non-empty string. After the send, the tool walks the tree again and looks for an AXGenericElement whose description starts with 'Your message, '. That string is what the WhatsApp Catalyst app announces to VoiceOver when an outgoing bubble is rendered. If it is present and contains a prefix match against what was sent, the tool returns verified: true; otherwise it returns verified: false with a snippet of what did appear, so the calling LLM can correct.
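The post-send check is a pure function over the descriptions read back from the tree. A sketch under stated assumptions: the SendResult type, the 20-character prefix length, and the 80-character snippet length are hypothetical, as is the assumption that the sent text follows the 'Your message, ' marker directly.

```swift
import Foundation

// Hypothetical result type mirroring the verified / snippet shape described above.
struct SendResult: Equatable {
    let verified: Bool
    let snippet: String?
}

/// Looks for an outgoing-bubble description whose body starts with the
/// first `prefixLength` characters of what was sent.
func verifySend(sent: String, descriptions: [String], prefixLength: Int = 20) -> SendResult {
    let marker = "Your message, "
    let expected = String(sent.prefix(prefixLength))

    for description in descriptions where description.hasPrefix(marker) {
        let body = description.dropFirst(marker.count)
        if body.hasPrefix(expected) {
            return SendResult(verified: true, snippet: nil)
        }
    }
    // Nothing matched: hand back the last thing seen so the caller can react.
    let snippet = descriptions.last.map { String($0.prefix(80)) }
    return SendResult(verified: false, snippet: snippet)
}
```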

What does the AX messaging timeout actually protect against?

AXUIElementCopyAttributeValue is a synchronous IPC call into the target process. If the target is busy, hung, beachballing, or in a state where the AX server is blocked behind a layout pass, the call can block the caller for a long time. AXUIElementSetMessagingTimeout caps that wait. whatsapp-mcp-macos uses 2.0 seconds for the boot-time probe (you want fast failure if the cache is stale) and 5.0 seconds for the real traversal during a tool call (long enough that a brief stutter while a chat loads does not abort).

What if accessibilityTrusted is true but accessibilityWorking is false?

That is the stale TCC cache case the probe is designed to catch. The remediation that ships in the status JSON is: open System Settings > Privacy & Security > Accessibility, find the host app that runs the MCP server (Claude.app, Cursor.app, Fazm, or Terminal, depending on how you launched it), toggle it off, click the minus button to remove it entirely, then add it back with the plus button and toggle on. Quit and relaunch the host app afterward. That sequence rebuilds the TCC entry from the current code-signing identity.

Does this approach work for WhatsApp on Windows or Linux?

No. The whole approach is built on AXUIElement, which is part of ApplicationServices and exists only on macOS. iOS has UIAccessibility, but it is not exposed for cross-app driving. Windows has UI Automation (UIA) with a similar shape but a totally different API; on Linux there is AT-SPI but no native WhatsApp app to drive. whatsapp-mcp-macos is macOS only, by design and by the realities of the underlying frameworks.

Will Meta detect or block this?

There is no network-level signal for an automation built this way. The MCP server never touches WhatsApp's servers; it touches the running app the user already has logged in. The clicks, paste, and Return key are indistinguishable to the app from a human at the keyboard, because at the OS level they are CGEvents posted into the same HID event tap a real user's input goes through. Risk surface is OS permissions and the user's own behavior, not Meta-side anti-automation.