Chapter 6

Computing

SENDING ENOUGH DATA, AND IN A TIMELY FASHION, is just one part of the process of operating a synchronized virtual world. The data must also be understood, code must be run, inputs assessed, logic performed, environments rendered, and so on. This is the job of central processing units (CPUs) and graphics processing units (GPUs), broadly described as “compute.”

Compute is the resource that performs all digital “work.” For decades, we’ve seen increases in the number of computing resources available and manufactured per year, and we’ve witnessed how powerful they can be. Despite this, computing resources have always been and will likely remain scarce—because when more computing capability is available, we tend to try to perform more complicated calculations. Observe the size of the average video game console over the past 40 years. The first PlayStation, released in 1994, weighed 3.2 pounds and measured 10.75 inches by 7.5 inches by 2.5 inches. The fifth, released in 2020, weighs 9.9 pounds and is 15.4 inches by 10.2 inches by 4.1 inches. Most of the growth relates to the decision to place more computing power in the device—and larger fans to cool it as it performs its work. Today, the original PlayStation (save for its optical drive) could fit in a wallet and cost less than $25, but there’s little demand for such a device compared to modern alternatives.

Earlier in the book, I wrote about the supercomputer Pixar built to produce 2013’s Monsters University: some 2,000 conjoined industrial-grade computers with a combined 24,000 cores. The cost of this data center would have been in the tens of millions of dollars, far more than a PlayStation 3, of course, but also capable of far larger, more detailed, and more beautiful images. Altogether, each of the film’s 120,000 frames took 30 core hours to render.* In the following years, Pixar replaced many of these computers and cores with newer and more capable processors that could render these same shots more quickly. But rather than optimize for speed, Pixar uses this power to create more sophisticated renders. For example, one shot in the studio’s 2017 film Coco had nearly eight million individually rendered lights. At first, it took over 1,000 hours, then 450, to render each frame in the shot. Pixar was able to reduce the time to 55 hours in part by “baking” a number of lights in 20-degree longitudinal and latitudinal increments—that is, reducing their responsiveness to the camera.1

This might seem to be an unfair anchor. After all, not every render needs eight million lights, or real-time specifications, nor will it be scrutinized on a 350-square-meter IMAX screen. However, the renders and calculations required for the Metaverse are far more complicated still. They must also be created every ~0.016 or, better yet, ~0.0083 seconds! Not every company—and certainly few individuals—can afford a supercomputer data center. It’s actually remarkable how computationally limited even the most impressive virtual worlds are today.
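To make the time budget concrete, here is the simple arithmetic behind those figures, alongside the Coco render time cited above (a rough comparison, since core hours measure processor time rather than wall-clock time):

```python
# Frame budgets for real-time rendering versus an offline film render.
# The 55 core-hour figure comes from the Coco example above; core hours
# measure total processor time, not wall-clock time, so this is only a
# rough sense of scale.

for fps in (60, 120):
    print(f"{fps} fps -> {1000 / fps:.1f} milliseconds per frame")

coco_core_hours_per_frame = 55
print(f"Coco shot: {coco_core_hours_per_frame * 3600 * 1000:,.0f} "
      f"milliseconds of core time per frame")
```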

Let’s return to Fortnite and Roblox. While these titles are incredibly creative achievements, their underlying ideas are far from new. For decades, developers have imagined experiences with dozens of live players (if not hundreds or thousands) in a single, shared simulation, as well as virtual environments limited only by the imagination of the individual user. The problem was they were not technically possible.

While virtual worlds with hundreds or even thousands of “concurrent users” (or CCUs) have been possible since the late 1990s, both the virtual worlds and the users in them were severely constrained. EVE Online does not allow individual players to congregate via avatars. Instead, users direct large and mostly static ships to relocate in space or exchange artillery fire. Dozens of World of Warcraft avatars can be rendered in the same place, but model detail is limited, the perspective is relatively zoomed out, and players have limited control over what each avatar can do. If too many players converged on a single area, the game’s server would temporarily “shard” it into concurrently operating but independent copies of that space. Some games even chose to limit real-time rendering to individual players and select in-game AI, with the entire background pre-rendered and thus impossible for players to affect. Engaging in any of these experiences also required a player to buy a dedicated gaming PC, which could run into the thousands of dollars. Even if such a device wasn’t strictly necessary, a user likely had to “turn off” or “turn down” many of the game’s rendering features or halve the frame rate.

It was only by the mid-2010s that millions of consumer-grade devices could manage a game like Fortnite—one with dozens of richly animated avatars in a single match, each one capable of a wide range of actions and interacting in a vivid and tangible world, rather than the cold vastness of space. It was around this same time that enough affordable servers became available to manage and synchronize the inputs coming from so many devices.

These computational advancements led to extraordinary change in the video gaming industry. Within years, the most popular (and most revenue-generating) games in the world were those focused on rich UGC and high numbers of concurrent users (Free Fire, PUBG, Fortnite, Call of Duty: Warzone, Roblox, Minecraft). In addition, these games quickly expanded into the sorts of media experiences that were previously “IRL Only” (the Travis Scott concert in Fortnite, or Lil Nas X’s in Roblox). The collective result of these new genres and events was enormous growth in the gaming industry. Over the course of an average day in 2021, over 350 million people participated in a battle royale game—just one genre of high-CCU game—and billions were able to do so. In 2016, only 350 million people in the world owned the equipment needed to render a rich 3D virtual world. At its peak in 2021, Roblox had 225 million monthly users—a figure over a third higher than the lifetime sales of the best-selling console in history, the PlayStation 2, and two-thirds the size of social networks such as Snapchat and Twitter.

As you might be able to guess by now, these games feel so ahead of their time in part because of specific design decisions that allow them to work around current computational constraints. Most battle royales support 100 players, but they also use enormous maps with numerous “points of interest” to scatter players far from one another. This means that while the server needs to track what every player is doing, each player’s device doesn’t need to render all of them, nor track or process their actions. And while players must ultimately converge on a small space—sometimes the size of a dorm room—the very premise of a battle royale means that almost all players have been defeated by that point. And as the map shrinks, it becomes harder to survive. A battle royale player might need to worry about 99 competitors, but their device faces far fewer.
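A minimal sketch of this kind of relevancy filtering appears below. The function names and the 500-meter radius are illustrative assumptions rather than how any particular battle royale actually works; the point is simply that the server knows about all 100 players while each device only simulates a handful.

```python
import math

RELEVANCE_RADIUS_M = 500  # assumed threshold, purely for illustration

def relevant_players(viewer, all_players, radius=RELEVANCE_RADIUS_M):
    """Players the viewer's device must actually render and simulate."""
    return [p for p in all_players
            if p is not viewer
            and math.dist(viewer["pos"], p["pos"]) <= radius]

# 100 players scattered across an 8 km-wide map.
players = [{"id": i, "pos": (i * 80.0, 0.0)} for i in range(100)]
nearby = relevant_players(players[0], players)
print(f"The server tracks {len(players)} players; "
      f"player 0's device only handles {len(nearby)} of them.")
```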

Still, these tricks only go so far. The mobile-only battle royale Free Fire, for example, is one of the most popular games in the world. However, most of its players are in Southeast Asia and South America, where most devices are low-to-mid-range Android phones rather than more powerful iPhones and high-end Androids. As such, Free Fire’s battle royale is limited to 50 players, not 100. Meanwhile, when titles such as Fortnite or Roblox operate social events in a more confined space, such as a virtual concert venue, they reduce CCUs to 50 or fewer. They also limit what users can do compared to the standard game modes. The ability to build might be turned off, or the number of dance moves reduced from the normal dozen or two to only a single preset option.

If you have a processor that’s less powerful than the average player’s, you’ll observe that even more compromises have to be made. Devices that are a few years old will not load the custom outfits of other players (as these have no gameplay consequence) and will instead represent them as stock characters. For all the marvels of Microsoft Flight Simulator, fewer than 1% of desktop or laptop Macs and PCs can even run the title on its lowest-fidelity settings. Part of the reason MSFS is possible on those devices at all is that so little of its world is real beyond its map, weather, and flight paths.

Of course, computing capabilities improve every year. Roblox now supports up to 200 players in its relatively lower-fidelity worlds, with up to 700 players possible in beta testing. However, we remain far from the point at which the only constraint is creative. The Metaverse will involve hundreds of thousands of users participating in a shared simulation, each with as many custom virtual items as they like; full motion capture; the ability to richly modify a virtual world (rather than pick from a dozen or so options) with full persistence; and rendering of that world not just in 1080p (typically considered “high definition”), but 4K or even 8K. Even the most powerful devices on earth struggle to do this in real time because every single asset, texture, and resolution increase, or added frame and player, means an additional draw on scarce computing resources.

Nvidia’s founder and CEO, Jensen Huang, imagines the next step for immersive simulations as taking us far beyond more realistic-looking explosions or a more animated avatar. Instead, he envisions the application of the “laws of particle physics, of gravity, of electromagnetism, of electromagnetic waves, [including] light and radio waves . . . of pressure and sound.”2

Whether the Metaverse will require such fidelity to physics is debatable. The important point here is simply that computing power is always scarce specifically because additional computing capabilities lead to important advances. Huang’s desire to bring the laws of physics into a virtual world might seem excessive and impractical, but assuming that it is requires predicting and dismissing the innovations that could come from it. Who would have thought that enabling 100-player battle royales would change the world? What is guaranteed is that the availability of and limitations to compute will shape which Metaverse experiences are possible, for whom, when, and where.

Two Sides of the Same Problem

We know the Metaverse requires more compute, but exactly how much is needed remains unclear. In Chapter 3, I quoted Oculus’s former and now consulting CTO, John Carmack, who believes “building the Metaverse is a moral imperative.” In October 2021, Carmack said that if he’d been asked 20 years earlier whether “one hundred times the processing power” would be sufficient to meet this duty, he would have said yes. Yet even though billions of devices now hold such capability, according to Carmack the Metaverse remains at least five to ten years away and would still face “serious optimization tradeoffs” even at the far edge of that prediction. Two months later, Raja Koduri, Intel’s SVP and general manager of its Accelerated Computing and Graphics Group, published similar thoughts on Intel’s investor relations site. Koduri said that “indeed, the metaverse may be the next major platform in computing after the world wide web and mobile . . . [but] truly persistent and immersive computing, at scale and accessible by billions of humans in real time, will require even more: a 1,000-times increase in computational efficiency from today’s state of the art.”3

There are varying perspectives on how best to achieve this.

One argument is that as much “work” as possible should be performed in remote, industrial-grade data centers rather than in consumer devices. That most of the work involved in a virtual world happens on each user’s device strikes many as wasteful because it means many devices are performing the same work at the same time in support of the same experience. In contrast, the super-powerful server operated by the virtual world’s “owner” is just tracking user inputs, relaying them when necessary, then refereeing process conflicts when they occur. It doesn’t even need to render anything!

An example helps bring this to (virtual) life. When a player shoots a rocket launcher at a tree in Fortnite, this information (the item used, its attributes, and the trajectory of the projectile) is sent from that player’s device to Fortnite’s multiplayer server, which then relays it to all of the players who require it. Their local machines then process and act on that information: they show the explosion, determine whether their own players are harmed, remove the tree from the map, allow players to move through the space where it once stood, and so on.
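A stripped-down sketch of that relay pattern is below. This is a hypothetical illustration in Python rather than Epic’s actual netcode, and the class and field names are invented. Note that the server never renders anything; it simply forwards the event to the clients that need it, and each client simulates the consequences itself.

```python
from dataclasses import dataclass

@dataclass
class RocketEvent:
    shooter_id: int
    item: str          # e.g., "rocket_launcher"
    origin: tuple      # where the projectile started
    direction: tuple   # its trajectory
    fired_at: float    # the shooter's local timestamp

class Client:
    def __init__(self, player_id):
        self.player_id = player_id

    def needs(self, event):
        return True  # in practice, a relevancy check like the one sketched earlier

    def receive(self, event):
        # Each device acts on the event locally: show the explosion, apply
        # damage, remove the tree, open up the space it occupied, and so on.
        print(f"Player {self.player_id} simulates {event.item} fired by "
              f"player {event.shooter_id}")

class Server:
    """Tracks inputs and relays them; it does no rendering of its own."""
    def __init__(self, clients):
        self.clients = clients

    def relay(self, event):
        for client in self.clients:
            if client.needs(event):
                client.receive(event)

Server([Client(1), Client(2)]).relay(
    RocketEvent(shooter_id=1, item="rocket_launcher",
                origin=(0.0, 0.0, 0.0), direction=(1.0, 0.0, 0.0), fired_at=12.5))
```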

In practice, players might not even see the same visual explosion, even though the “same” explosive hit the exact “same” tree at the exact “same” angle at the exact “same” time, and the exact same logic was applied to process the cause and effect. This reflects the fact that (due to variable latency) a given device might think the rocket was sent slightly earlier or later, and from a slightly different position. Usually this doesn’t matter, but sometimes it is enormously consequential. For example, Player 1’s console might determine that Player 2 was killed by the explosion that destroyed the tree, while Player 2’s console would say Player 2 took significant, but not fatal, damage. Neither console is “wrong,” but the game obviously can’t proceed with both versions of the “truth.” And so the server must “pick.”
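Continuing the hypothetical sketch above, the final arbitration step might look like this; the rule shown (trusting the server’s own simulation) is an assumption for illustration, not any specific game’s reconciliation logic.

```python
# Two consoles reach different conclusions about the same explosion.
client_reports = {
    "player_1_console": {"player_2_health": 0},   # believes Player 2 was killed
    "player_2_console": {"player_2_health": 12},  # believes Player 2 survived
}

def resolve(reports, server_simulation):
    # The authoritative server "picks" one truth, here its own simulation,
    # and every client is corrected to match it.
    return {"player_2_health": server_simulation}

truth = resolve(client_reports, server_simulation=0)
print("Server ruling:", truth)
```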

The current reliance on personal devices creates other limitations, too. Consumers can experience only what their own device can manage. A 2019 iPad, a 2013-era PlayStation 4, and a 2020 PlayStation 5 will all present Fortnite differently. The iPad will be limited to 30 frames per second, while the PlayStation 4 will offer 60 FPS and the PlayStation 5 120 FPS. The iPad will likely load only selective map textures and may even skip avatar outfits, while the PlayStation 5 will show refracting light and shadows, something the PlayStation 4 cannot. This, in turn, means that the overall complexity of a virtual world ends up partly limited by the lowest-end device that can access it. Epic Games has decided that the avatars and outfits in Fortnite shouldn’t have an impact on its gameplay, but changing its mind might entail cutting off many players.

Shifting as much processing and rendering as possible to industrial-grade data centers seems both more efficient and essential to building the Metaverse. There are already companies and services pointing in this direction. Google Stadia and Amazon Luna, for example, process all gameplay in remote data centers, then push the entire rendered experience to a user’s device as a video stream. The only thing a client device needs to do is play this video and send inputs (move left, press X, and so on)—similar to watching Netflix.

Proponents of this approach often highlight the logic of powering our homes via power grids and industrial power plants, not private generators. The cloud-based model allows consumers to stop buying consumer-grade, infrequently upgraded, and retailer-marked-up computers and instead rent access to enterprise-grade equipment that is more cost-efficient per unit of processing power and more easily updated. Whether a user has a $1,500 iPhone or an old WiFi-enabled fridge with a video screen, they could play a computationally intensive title such as Cyberpunk 2077 in all its fully rendered glory. Why should a virtual world depend on a small piece of consumer hardware wrapped in plastic dye covers, rather than on a multi-million-dollar (if not billion-dollar) server stack owned by the company that operates the virtual world?

For all the ostensible logic of this approach, and the success of server-side content services such as Netflix and Spotify, remote rendering is not the consensus solution among game publishers today. Tim Sweeney has argued that “initiatives to place real-time processing on the wrong side of the latency wall have always been doomed to failure because, even though bandwidth and latency are improving, local computing performance is improving faster.”4 Put differently, the debate is not whether remote data centers can offer better experiences than consumer-owned ones. They obviously can. Rather, it’s that networks get in the way and will likely continue to do so.

Here the power generator analogy begins to break down. In most of the developed world, consumers don’t struggle to receive the power they need on a daily basis, nor as quickly as needed. The same is not true of data, even though comparatively little of it is sent today. For remote-rendered experiences to be delivered, many gigabytes per hour will need to be sent in real time. But as you know, we’re still struggling to send a few megabytes per hour on a timely basis.
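Some rough, back-of-the-envelope numbers illustrate the gap. The bitrates below are assumptions chosen only for illustration (roughly the scale of multiplayer state synchronization versus a compressed 1080p, 60-frames-per-second video stream), not measurements of any particular service.

```python
game_state_kbps = 30     # assumed: input/state synchronization traffic
remote_render_mbps = 15  # assumed: compressed 1080p video at 60 fps

game_gb_per_hour = game_state_kbps * 3600 / 8 / 1e6
stream_gb_per_hour = remote_render_mbps * 3600 / 8 / 1e3

print(f"State synchronization: ~{game_gb_per_hour * 1000:.0f} MB per hour")
print(f"Remote rendering:      ~{stream_gb_per_hour:.1f} GB per hour")
```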

Furthermore, remote compute has yet to prove itself to be more efficient for rendering. This is a consequence of several interconnected issues.

First, a GPU does not render an entire virtual world, nor even much of it, at any given point. Instead, it renders just what’s necessary for a given user when that user needs it. When a player turns around in a game like The Legend of Zelda: Breath of the Wild, the Nintendo Switch’s Nvidia GPU effectively unloads everything that was previously rendered in order to support the player’s new field of view. This process is called “viewing-frustum culling.” Other techniques include “occlusion,” in which objects that are in a player’s field of view are not loaded or rendered if they are obstructed by another object, and “level of detail” (LOD), in which details, such as the nuanced texture of a birch tree’s bark, are only rendered when the player should be able to see them.
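A toy illustration of these three techniques follows. It is not engine code (the thresholds and object list are invented), but it shows how a device decides what to render, and at what fidelity, each frame.

```python
def in_view_frustum(angle_from_camera_deg, fov_deg=90):
    """Frustum culling: ignore anything outside the camera's field of view."""
    return abs(angle_from_camera_deg) <= fov_deg / 2

def pick_lod(distance_m):
    """Level of detail: spend polygons and texture only on nearby objects."""
    if distance_m < 20:
        return "full detail (individual bark texture)"
    if distance_m < 200:
        return "simplified mesh"
    return "flat billboard"

scene = [
    {"name": "birch tree",       "angle": 10,  "distance": 15,  "occluded": False},
    {"name": "distant cabin",    "angle": 30,  "distance": 400, "occluded": False},
    {"name": "rock behind wall", "angle": 5,   "distance": 50,  "occluded": True},
    {"name": "tree behind you",  "angle": 170, "distance": 8,   "occluded": False},
]

for obj in scene:
    if not in_view_frustum(obj["angle"]) or obj["occluded"]:
        continue  # culled or occluded: no GPU time spent this frame
    print(f'{obj["name"]}: rendered as {pick_lod(obj["distance"])}')
```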

Culling, occlusion, and LOD solutions are essential to real-time rendered experiences because they enable a user’s device to concentrate its processing power on what the user can see. But as a result, other users cannot “piggyback” off the work of one player’s GPU. Some readers might think this is a lie, recalling many hours spent playing Mario Kart on the Nintendo 64, which allowed players to “split” a TV screen into four, one for each driver. Even today, Fortnite allows a single PlayStation or Xbox to cleave a screen in half so that two players might play at once. But in these cases, the relevant GPU is supporting simultaneous renders of a single, shared session, not of independent users. The distinction here is critical. Every player must enter the same match and level—and cannot leave it early, either. This is because the device’s processors can only load and manage a finite amount of information, and its random-access memory will temporarily store various renders (e.g., a tree or building) so that they can be continuously reused by each player, rather than rendered from scratch each time. Furthermore, each player’s resolution and/or frame rate drops by an amount proportional to the number of users. This means that even if two TVs are used to operate two-player Mario Kart, rather than one TV split in two, each player receives half as many rendered pixels per second.

It is technically possible for a GPU to render two entirely different games. A top-of-the-line Nvidia GPU can certainly support two distinct emulations of a 2D side-scrolling Super Mario Bros., or one version of Super Mario Bros. and another similarly low-powered title. However, this is not done in a compute-efficient manner. An Nvidia GPU that can run high-end Game A at its fullest rendering specifications cannot run two copies of that title at half of those specifications each—or three at a third. Nor can it trade off its power between each game based on what they need and when, like a parent helping two kids study or get to bed. Even if Game A can never use all of a given Nvidia GPU’s power, that spare capacity cannot be assigned elsewhere.†

GPUs do not generate generic rendering “power” that can be divided across users in the way a power plant splits electricity across multiple homes, or in the way a CPU server can support input, location, and synchronization data for a hundred players in a battle royale. Instead, GPUs typically operate as a “locked instance” supporting a single player’s rendering. Many companies are working on this problem, but until it’s possible, there’s no inherent efficiency in designing “mega GPUs” akin to large industrial power generators, turbines, or other infrastructure. While power generators are typically more cost-efficient per unit of power as their capacity increases, the reverse is true with GPUs. A GPU that’s twice as powerful as another, in a simplified sense, costs more than twice as much to produce.

The difficulties of “splitting” or “sharing” GPUs are why Microsoft Xbox’s cloud game streaming server farms are, in fact, made up of racks and racks of de-shelled Xboxes, each one serving a single player. Put another way, Microsoft’s electrical power plant is really just a network of single-household power generators, rather than a single, neighborhood-sized one. Microsoft could use bespoke GPU and CPU hardware to support cloud instances, rather than the GPU and CPU hardware in its consumer-centric Xboxes. However, this would require every Xbox game to be developed to support an additional “type” of Xbox.

Cloud-rendering servers also face utilization issues. A cloud gaming service might require 75,000 dedicated servers for the Cleveland area at 8 p.m. on a Sunday night, but only 20,000 on average, and just 4,000 at 4 a.m. on a Monday. When consumers own these servers, in the form of consoles or gaming PCs, it doesn’t matter that they’re unused or offline. However, data-center economics depend on maximizing utilization. As a result, it will always be expensive to rent high-end GPUs with low utilization rates.
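The arithmetic, using those hypothetical Cleveland figures, shows why: idle hours still carry the full cost of the hardware, power, and cooling.

```python
peak_servers = 75_000      # 8 p.m. Sunday
average_demand = 20_000    # typical hour
overnight_demand = 4_000   # 4 a.m. Monday

print(f"Average utilization:   {average_demand / peak_servers:.0%}")
print(f"Overnight utilization: {overnight_demand / peak_servers:.0%}")
```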

This is why Amazon Web Services gives customers a reduced rate if they rent servers from Amazon in advance (“reserved instances”). Customers are guaranteed access for the next year because they’ve paid for the server, and Amazon is pocketing the difference between the cost and the price the customer is charged (AWS’s cheapest Linux GPU reserved instance, equivalent to a PS4, costs over $2,000 for one year). If a customer wants to access servers when they need them (“spot instances”), they might find they’re not available, or that only lower-end GPUs are. This last point is key: we’re not solving the computing shortage if the only way to make remote servers affordable is to use rather than replace older ones.
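A quick, simplified calculation shows how utilization drives the effective price. The roughly $2,000-per-year reserved-instance figure is the one cited above; the utilization rates are illustrative assumptions.

```python
annual_cost_usd = 2_000
hours_per_year = 365 * 24

for utilization in (1.00, 0.27, 0.05):
    per_used_hour = annual_cost_usd / (hours_per_year * utilization)
    print(f"{utilization:.0%} utilization -> ${per_used_hour:.2f} per hour of actual use")
```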

There is another way to improve cost models: consolidate servers into fewer locations. Rather than operate a cloud game streaming center in Ohio, Washington State, Illinois, and New York, a company could just build one or two. As the number and diversity of customers increases, demand tends to stabilize, resulting in greater average utilization rates. Of course, this also means increasing the distance between remote GPUs and the end user, thereby increasing latency. And this doesn’t solve for the distance between users, either.

Shifting computing resources into the cloud creates many new costs. For example, side-by-side, always-on devices running at data centers create considerable heat—far more than the aggregate heat of those same servers spread across family living room credenzas. Servicing, securing, and managing this equipment is costly. The shift from streaming limited bits of data to high-resolution, high-frame-rate content means substantially higher bandwidth costs, too. Yes, Netflix and others make the costs work, but they’re typically sending fewer than 30 frames of video per second (not 60 to 120) at a lower resolution (e.g., 1K or 2K, not the 4K or 8K that Google Stadia promised), on a non-real-time basis, and from nearby servers that are storing files rather than performing intensive computing operations.

For the foreseeable future, what I call “Sweeney’s Law”—improvements in local compute will continue to outpace improvements in network bandwidth, latency, and reliability—seems likely to hold. Although many believe that Moore’s Law, the 1965 observation that the number of transistors in a dense integrated circuit doubles about every two years, is now slowing down, CPU and GPU processing power continues to grow at a rapid pace. In addition, consumers today frequently replace their primary computing device, resulting in enormous improvements for end-user compute every two to three years.

Dreams of Decentralized Computing

The insatiable need for more processing power—ideally, located as close as possible to the user but, at the very least, in nearby industrial server farms—invariably leads to a third option: decentralized computing. With so many powerful and often inactive devices in the homes and hands of consumers, near other homes and hands, it feels inevitable that we’d develop systems to share in their mostly idle processing power.

Culturally, at least, the idea of collectively shared but privately owned infrastructure is already well understood. Anyone who installs solar panels at their home can sell excess power to their local grid (and, indirectly, to their neighbor). Elon Musk touts a future in which your Tesla earns you rent as a self-driving car when you’re not using it yourself—better than just being parked in your garage for 99% of its life.

As early as the 1990s, programs emerged for distributed computing using everyday consumer hardware. One of the most famous examples is the University of California, Berkeley’s SETI@home, wherein consumers would volunteer use of their home computers to power the search for alien life. Sweeney has highlighted that one of the items on his “to-do list” for the first-person shooter Unreal 1, which shipped in 1998, was “to enable game servers to talk to each other so we can just have an unbounded number of players in a single game session.” Nearly 20 years later, however, Sweeney admitted that goal “seems to still be on our wish list.”5

Although the technology to split GPUs and share non–data-center CPUs is nascent, some believe that blockchains provide both the technological mechanism for decentralized computing and its economic model. The idea is that owners of underutilized CPUs and GPUs would be “paid” in some cryptocurrency for the use of their processing capabilities. There might even be a live auction for access to these resources: either those with “jobs” bid for access, or those with spare capacity bid on jobs.
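What might such an auction look like mechanically? The sketch below is purely speculative; it reflects no existing protocol, and a real system would need verification, payment, and strict latency guarantees. Still, it captures the basic idea of a rendering job being matched to the cheapest nearby device with spare capacity.

```python
import heapq

# Devices post asks: (price in tokens, device name, spare GPU milliseconds per frame)
offers = [
    (3, "phone in a nearby pocket", 4),
    (5, "idle desktop next door",  12),
    (9, "parked car's console",     8),
]
heapq.heapify(offers)  # cheapest ask first

def assign(job_gpu_ms, offers):
    """Match a rendering job to the cheapest device with enough spare capacity."""
    skipped, winner = [], None
    while offers:
        price, device, capacity = heapq.heappop(offers)
        if capacity >= job_gpu_ms:
            winner = (device, price)
            break
        skipped.append((price, device, capacity))
    for offer in skipped:          # return unused asks to the order book
        heapq.heappush(offers, offer)
    return winner

print(assign(job_gpu_ms=6, offers=offers))  # ('idle desktop next door', 5)
```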

Could such a marketplace provide some of the massive amounts of processing capacity that will be required by the Metaverse?‡ Imagine, as you navigate immersive spaces, your account continuously bidding out the necessary computing tasks to mobile devices held but unused by people near you, perhaps people walking down the street next to you, to render or animate the experiences you encounter. Later, when you’re not using your own devices, you would earn tokens as they return the favor (more on this in Chapter 11). Proponents of this crypto-exchange concept see it as an inevitable feature of all future microchips. Every computer, no matter how small, would be designed to auction off any spare cycles at all times. Billions of dynamically arrayed processors would power the deep compute cycles of even the largest industrial customers and provide the ultimate and infinite computing mesh that enables the Metaverse. Perhaps the only way for everyone to hear a tree fall is for all of us to water it.

* As a reminder, this is not a literal 30 hours. Instead, it is 30 core hours. One core could spend 30 hours rendering, or 30 cores could spend one hour rendering, etc.

† The exception here is when a game is running well below the capacity of the GPU that’s supporting it—as would be the case if one played the Nintendo 64 version of Mario Kart on the Nintendo Switch, which was released 21 years after the Nintendo 64.

‡ Neal Stephenson described this sort of technology and experience at length in Cryptonomicon, which was published in 1999, seven years after Snow Crash.
