ARKit 3 Ushers in RealityKit, Priming Us for Augmented Reality Glasses

At WWDC 2019, Apple introduced us to RealityKit, augmented reality (AR) tooling that is undoubtedly the foundation for the holy grail of AR: an eyeglasses-like heads-up display.

Prior to ARKit 3 and RealityKit, a lot of augmented reality rendering was handled by SceneKit, Apple’s toolkit for 3D games and graphics. It allowed us to place virtual objects in real-world scenes, which was good, but not great. SceneKit hasn’t been abandoned, per se, but it’s no longer as advanced as we need for augmented reality.

RealityKit is a brand-new toolkit focused on 3D augmented reality, and it specializes in two key areas: People Occlusion and Motion Capture.

People Occlusion is tooling within RealityKit that performs a critical task and removes a ton of complexity for developers. It allows you to place a virtual 3D object in a real space and have your app understand where people are in the scene at all times.

In theory, this eliminates those annoying AR moments where a vase on a table is suddenly in someone’s back pocket because they walked into your scene. People Occlusion knows a person is in the scene, and how far they are from the plane you placed an object on.

It does this by taking what amounts to depth shots at 60 frames per second, rendering them in real time, and judging who is where. Prior to RealityKit, ARKit and SceneKit would place an object on a table, for example, but it was simply overlaid onto the scene; the plane was essentially virtual and had no awareness of what else was going on around it.

RealityKit constantly scans its depth-sensing images, feeds them through a machine learning algorithm to identify people, reconciles that against what it already knows about the scene, and returns a scene you can interact with. It also allows virtual objects to be occluded from a scene. In a game like poker, where you don’t want to show your hand, this is handy: An AR poker game would show your cards only to you, and anyone who walked behind you would see nothing, or maybe a snarky message about keeping their eyes to themselves.
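Here’s a rough sketch of what opting in looks like in code. The helper function below is ours, not Apple’s sample code, but the frame-semantics flag is the ARKit 3 switch that drives the feature:

```swift
import ARKit
import RealityKit

// Hypothetical helper: turn on People Occlusion for an existing ARView.
func enablePeopleOcclusion(on arView: ARView) {
    // Person segmentation with depth needs recent hardware, so check support first.
    guard ARWorldTrackingConfiguration.supportsFrameSemantics(.personSegmentationWithDepth) else {
        print("People Occlusion isn't supported on this device.")
        return
    }

    let configuration = ARWorldTrackingConfiguration()
    // Ask ARKit to segment people and estimate their depth on every frame, so
    // RealityKit can render virtual content behind them when appropriate.
    configuration.frameSemantics.insert(.personSegmentationWithDepth)
    arView.session.run(configuration)
}
```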

3D objects still need “collision shapes” when they’re meant to be interactive, and those assets now load asynchronously in the background to make interactions smoother.
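As a sketch of how those two pieces fit together (the asset name “toy_robot” is a placeholder for whatever your app ships), loading a model in the background and then making it interactive might look like this:

```swift
import Combine
import RealityKit

var cancellables = Set<AnyCancellable>()

// Load a bundled .usdz asynchronously, then make it interactive once it arrives.
// "toy_robot" is a stand-in asset name.
func loadInteractiveModel(into anchor: AnchorEntity) {
    Entity.loadModelAsync(named: "toy_robot")
        .sink(receiveCompletion: { completion in
            if case .failure(let error) = completion {
                print("Model failed to load: \(error)")
            }
        }, receiveValue: { model in
            // Collision shapes are what make the entity hit-testable, so
            // gestures and physics have something to work against.
            model.generateCollisionShapes(recursive: true)
            anchor.addChild(model)
        })
        .store(in: &cancellables)
}
```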

Tracking and ARView

The platform’s motion tracking is equally impressive. It uses a device’s camera to scan a person and return a 3D or 2D skeletal representation of their movements. You can play that back with the BodyTrackedEntity class and map it to your own character mesh. That character can be anything, but as Apple notes, its skeleton must adhere to the same 96 body joints Apple’s own engine tracks (you can’t turn a human into a spider, for instance).
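A rough sketch of that Motion Capture flow, loosely following the pattern in Apple’s body-tracking sample (the “character/robot” asset name and the class structure here are placeholders):

```swift
import ARKit
import Combine
import RealityKit

// Hypothetical controller that runs body tracking and attaches a rigged character.
final class BodyTrackingController {
    let arView: ARView
    let characterAnchor = AnchorEntity()
    private var loadCancellable: AnyCancellable?

    init(arView: ARView) {
        self.arView = arView
    }

    func start() {
        // Motion Capture is driven by ARKit's body-tracking configuration.
        guard ARBodyTrackingConfiguration.isSupported else { return }
        arView.session.run(ARBodyTrackingConfiguration())
        arView.scene.addAnchor(characterAnchor)

        // "character/robot" stands in for a rigged .usdz whose skeleton matches
        // the joints ARKit tracks. Once loaded, the BodyTrackedEntity mirrors the
        // tracked person's motion; in practice you would also update the anchor's
        // position from the ARBodyAnchor in the session delegate so the character
        // stands where the person does.
        loadCancellable = Entity.loadBodyTrackedAsync(named: "character/robot")
            .sink(receiveCompletion: { _ in }, receiveValue: { [weak self] character in
                self?.characterAnchor.addChild(character)
            })
    }
}
```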

There’s also facial tracking, which performs tasks such as attaching a mask to a person’s face. If they go out of frame, it recognizes them when they return (and restores their virtual mask).
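In RealityKit terms, that kind of face-attached content is just an entity parented to a face anchor. A toy sketch, with a plain box standing in for a real mask asset:

```swift
import ARKit
import RealityKit
import UIKit

// Attach a placeholder "mask" to the tracked face. The anchor re-acquires the
// face if the person leaves the frame and comes back.
func addFaceMask(to arView: ARView) {
    guard ARFaceTrackingConfiguration.isSupported else { return }
    arView.session.run(ARFaceTrackingConfiguration())

    let faceAnchor = AnchorEntity(.face)
    let mask = ModelEntity(mesh: .generateBox(size: 0.15),
                           materials: [SimpleMaterial(color: .white, isMetallic: false)])
    faceAnchor.addChild(mask)
    arView.scene.addAnchor(faceAnchor)
}
```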

These capabilities all fall under the ARView class, RealityKit’s counterpart to ARSCNView, the SceneKit-backed AR view. If we’re being honest, RealityKit does a lot of what SceneKit did, just a lot smoother and in a way that makes a lot more sense for 3D environments.

Even the documentation is deeper. While SceneKit’s ARSCNView has seven key sections, ARView for RealityKit has 14, and each is a deep dive. There’s even a new gesture recognizer, with tooling that lets users interact with virtual items via (you guessed it) gestures.
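Adopting those entity gestures is nearly a one-liner. A minimal sketch, assuming the entity is already placed in the scene:

```swift
import RealityKit

// Let users drag, rotate, and pinch-to-scale a virtual object with standard
// touch gestures. The entity needs collision shapes for hit-testing to work.
func makeInteractive(_ entity: ModelEntity, in arView: ARView) {
    entity.generateCollisionShapes(recursive: true)
    arView.installGestures([.translation, .rotation, .scale], for: entity)
}
```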

Multiplayer is also possible within RealityKit. For example, one user could start a game and place a virtual board on the floor, then invite a second player. Both map the scene as they play, and RealityKit uses information from both of their devices to build out the ‘shared’ map. It also allows items on that board to interact with both players; developers can code ‘turns’ so one player is prohibited from interacting with certain items at certain times.
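A sketch of the plumbing behind such a shared session, assuming an already-connected MCSession from MultipeerConnectivity; turn-taking would be layered on top using RealityKit’s entity-ownership APIs:

```swift
import ARKit
import MultipeerConnectivity
import RealityKit

// Wire up a shared AR experience: ARKit collaboration merges the players'
// world maps, and RealityKit keeps synchronized entities in step on both devices.
func startSharedSession(arView: ARView, over mcSession: MCSession) throws {
    let configuration = ARWorldTrackingConfiguration()
    configuration.isCollaborationEnabled = true
    arView.session.run(configuration)

    // Entities carry a SynchronizationComponent by default, so anything added
    // to the scene is mirrored to every connected peer.
    arView.scene.synchronizationService = try MultipeerConnectivityService(session: mcSession)
}
```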

The Future

For now, this is all dependent on a phone or tablet. Gestures, for example, are on-screen gestures (such as drawing a half-circle with your finger to turn an item). Motion tracking handles skeletal frames of humans, but not fine-grained hand motions or gestures.

But it’s not hard to see where this is all headed. Think of an AR frisbee game: motion tracking would be able to recognize your hand was in a position to hold the virtual frisbee, while facial tracking and on-device mapping would know who you’re playing with. If you “hit” someone with your frisbee, the game would know that you screwed up. If someone besides the other player made a gesture (accidentally) to catch the frisbee, the game would know to ignore them.

RealityKit can’t handle such a scenario today, but it’s far more scalable than SceneKit, and is still tied directly to ARKit. Keep an eye on this space; RealityKit updates will undoubtedly serve as a precursor for the AR glasses that Apple likely wants to sell us.