Beam Me Up Scotty
The system is built around a 65” TV display, which allows for life-sized video, along with four speakers that produce spatial audio: the position of each sound is encoded in the audio signal, so the system can reproduce the precise three-dimensional location of its origin. This is accomplished with 12 microphones that let the system catch every word or sound and its exact location, all of which is part of HP’s Poly Studio A2 Audio Solution bridge for video conferencing.

The heart of the system is Google’s 3D imaging and light field rendering, along with a volumetric video AI model, which needs some explanation. A typical video image is two-dimensional, with height and width but no depth information. 360° video covers every direction, but only from a single fixed point, and it still carries no true depth. The volumetric model captures what is known as 6DoF (six degrees of freedom): not only height, width, and depth as the scene moves through time, but also pitch, yaw, and roll. This allows the model to look at any ‘slice’ of video (a moment in time) from any perspective, no longer tied to a single point, which, in theory, means you could look behind, around, above, or below the image without any additional hardware. Of course, you need a number of high-speed cameras and depth sensors to feed this information to the model, all of which are included in the system. The AI then reconstructs and renders the data.
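To make the 6DoF idea concrete, here is a minimal Python sketch. This is not Google’s actual pipeline; the `Pose6DoF` class and its functions are invented for illustration. It simply shows that a viewpoint with six degrees of freedom is three translations (x, y, z) plus three rotations (pitch, yaw, roll), and that the same captured point looks different from each viewpoint:

```python
import math

def rotation_matrix(pitch, yaw, roll):
    """Build a 3x3 rotation matrix from pitch (about x), yaw (about y),
    and roll (about z), all in radians, applied as Rz @ Ry @ Rx."""
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3))
                 for j in range(3)] for i in range(3)]

    return matmul(rz, matmul(ry, rx))

class Pose6DoF:
    """A full 6DoF viewpoint: 3 translations plus 3 rotations."""
    def __init__(self, x, y, z, pitch, yaw, roll):
        self.position = (x, y, z)
        self.rotation = rotation_matrix(pitch, yaw, roll)

    def world_to_view(self, point):
        """Express a world-space point relative to this viewpoint.
        The inverse of a rotation matrix is its transpose."""
        px, py, pz = (p - o for p, o in zip(point, self.position))
        R = self.rotation
        return tuple(R[0][i] * px + R[1][i] * py + R[2][i] * pz
                     for i in range(3))
```

A point one meter in front of a forward-facing viewpoint sits at (0, 0, 1) in that view; yaw the viewpoint 180° and the same point is now behind it. A fixed 2D or 360° camera has no way to synthesize that second view, which is exactly what the volumetric model adds.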
To play back this complex information, the system creates a number of ‘views’ of each video slice, each from a slightly different perspective. Each group of pixels in the display is covered by a microlens array that directs light in different directions. The system controls which sub-pixel under each lens receives each piece of image information, and therefore the direction in which that light travels. Each eye sees a slightly different image, creating the 3D effect just as it occurs in nature. It is a complex process, but with the AI running the show, the system should produce realistic live video with significant depth, as if the person were sitting across from the viewer.
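The sub-pixel steering can be sketched in a few lines of Python. The numbers here are illustrative assumptions (a 20° angular fan per lens, 8 rendered views); a real light field display is calibrated to its actual lens geometry. The point is only the mapping: each sub-pixel under a lens is steered toward one angle, so a viewer’s left and right eyes, sitting at different angles, see different rendered views:

```python
def view_for_subpixel(subpixel_index, num_views, lens_fov_degrees=20.0):
    """Angle (in degrees) toward which a given sub-pixel's light is
    steered, assuming sub-pixels are spread evenly across the lens fan.
    Illustrative geometry, not a real display calibration."""
    step = lens_fov_degrees / num_views
    return -lens_fov_degrees / 2 + step * (subpixel_index + 0.5)

def subpixel_for_eye(eye_angle_degrees, num_views, lens_fov_degrees=20.0):
    """Inverse mapping: which sub-pixel (and hence which rendered view)
    a viewer at a given angle actually sees under each lens."""
    step = lens_fov_degrees / num_views
    idx = int((eye_angle_degrees + lens_fov_degrees / 2) / step)
    return max(0, min(num_views - 1, idx))
```

At a typical viewing distance of around 60 cm, the roughly 6.5 cm between a viewer’s eyes works out to about 6° of angular separation, so with these example numbers the two eyes land on different sub-pixels, and therefore on two different rendered views of the same slice, which is the stereo pair the brain fuses into depth.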
Here's the problem. While the idea of 3D communication might be attractive, it is expensive, with a single system costing about $25,000. The benefits are a more realistic conversation and the ability to notice nuance that might not be visible in a 2D image. But is this necessary? Perhaps during tense geopolitical negotiations, where every movement of a hand or tap of a finger has meaning, but it adds relatively little value to most video conversations, which commonly use inexpensive or free technology. That technology comes nowhere near the quality the Google Beam system can generate, but at $25,000 per unit (and you need at least two), Beam seems out of the realm of most consumers and a stretch for most businesses. This leaves us looking at the system more as a ‘proof of concept’, showing it can be done. Refining it down to commercial pricing could be difficult, given the number of microphones and cameras needed for the 3D imaging, along with the AI model and associated equipment. It works; now what? After 7 or 8 years of research, it seems a bit of a waste.