I'm not familiar with the space, but wouldn't something that streams the whole screen like a video (WebRTC, Moonlight, and VNC work like this) work here as well, and be universal? Wayland already supports screen capture (into a texture, at interactive framerates) fairly well.
I'd say the problematic part is not capturing the desktop but injecting controls into it. Proper universal support for simulated input is still missing.