Patrick O'Shaughnessy

Real-time video upscaling using CNNs and WebGPU

WebGPU delivers a more modern GPU API over the familiar OpenGL ES 2.0 based WebGL that was introduced in 2011. In this talk, James Hurley, Principal Software Engineer at Twitch, discusses how his team leverages WebGPU to build a neural network to upscale video to 1080p and 1440p at 60 frames per second on end user devices.

Published Nov 21, 2025
Uploaded Jun 13, 2026
File type: YouTube

Full transcript

AI-generated transcript with timestamped sections.

0:07-1:37

[00:07] Hey everybody, the talks so far have been really great and I am here to change that. [00:12] My name is James Hurley. I'm a principal engineer at Amazon IVS, aka Twitch's video platform. [00:19] I've been building video applications, in particular for user devices, in one way or another, for about 15 years now. [00:26] I'm going to talk a little bit about how we've been using WebGPU to build real-time super-resolution capabilities into our video players. [00:35] So, obviously, at Twitch, we stream a lot of video on the Internet, and a lot of the time the user is seeing what the broadcaster did just a couple of seconds [00:45] earlier, sometimes even less than two seconds earlier. And streaming video over the internet comes with some challenges. [00:53] This applies to both VOD and live streaming, but I'm [00:57] more familiar with live streaming. The user can be on a mobile network. They could be in a developing country. [01:05] They could have a metered internet connection, with a daily or a monthly allotment of bandwidth, and they want to [01:12] be cognizant of how much data they're downloading and have more control over that. [01:18] This is also a very data-intensive application. [01:22] If you're a customer and you're streaming [01:25] video at a very high scale, you might want to be able to control your costs a little bit. [01:30] And so [01:32] there have been some interesting developments in user devices over the past few years.

1:37-3:14

[01:37] Mainly, [01:38] GPUs have gotten a lot more powerful. [01:41] Some devices are coming with NPUs. [01:43] And this raises a couple of interesting questions for us. Can we trade compute for bandwidth? [01:52] Can we [01:53] generate a rendition that didn't exist before, on the fly, on the user's device? [02:00] And so this is where super resolution comes in. Super resolution, for those of you who may not be familiar with it, is sort of an umbrella of technologies [02:11] that [02:13] upscale images or video or games in real time. They can be interpretive, as ours is, or generative. There are a number of GAN-based super-resolution models out there. [02:26] You may actually be most familiar with super resolution in gaming. [02:31] Modern games tend to come with NVIDIA DLSS or AMD FSR technology. These are super-resolution models that are built into the game engine or are integrated tightly with the game engine. And the game can be cheaper to render at a lower resolution and then upscale using some ML model. [02:54] We are going to try something similar, [02:57] but we're dealing with pixels, and we're not integrated directly into a game engine or anything like that, so it's a little different. [03:04] So we have some requirements here. [03:06] The fun thing about Twitch, and YouTube as well, is that even though it's user-generated content,

3:14-4:46

[03:14] and the production quality is relatively low, [03:17] the frame rates and the frame sizes are pretty high. [03:21] So generally speaking, users will stream at 1080p or higher. [03:27] And they're streaming generally at 60 frames per second. So that's something like 125 million pixels per second. [03:34] And we don't want to melt the user's GPU because they may not appreciate that. [03:41] So this is kind of a high-level overview of what we want to do. [03:46] First, we want to take the frame that we're decoding. [03:49] We want to grab it, as close to every frame as possible, and pass that into our [03:56] frame transformer, which is going to run the neural network and do the frame processing. It will give us a GPU texture, and we're going to blit that texture to the canvas on the other side. [04:09] Meanwhile, we want to make sure we monitor the performance of this, because if we're missing our deadlines, we want to be able to fall back to just a pass-through mode. We don't want the user to have a bad experience here. [04:20] Or if we encounter errors or anything like that, we want to just be able to fail [04:24] and enter a pass-through mode. [04:30] Before I get on to WebGPU, this is actually a very important piece of the puzzle. [04:37] We need to get the video frame [04:40] from the decoder [04:42] into our GPU context.
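
The throughput figure and the fall-back-to-pass-through behaviour described above can be sketched in TypeScript. This is only an illustration: the class name, the "consecutive misses" policy, and the decision never to leave pass-through once entered are assumptions, not details given in the talk.

```typescript
// Hypothetical sketch; none of these names come from the talk.
type Mode = "upscale" | "passthrough";

// Pixels that must be produced each second at a given output size and rate.
function pixelsPerSecond(width: number, height: number, fps: number): number {
  return width * height * fps;
}

// Time available to process one frame, in milliseconds.
function frameBudgetMs(fps: number): number {
  return 1000 / fps;
}

class DeadlineMonitor {
  private mode: Mode = "upscale";
  private misses = 0;
  private readonly budgetMs: number;
  private readonly maxMisses: number;

  // maxMisses is an assumed policy: how many consecutive late frames
  // are tolerated before giving up on upscaling.
  constructor(budgetMs: number, maxMisses: number) {
    this.budgetMs = budgetMs;
    this.maxMisses = maxMisses;
  }

  // Record how long one frame took; returns the mode to use for the next frame.
  record(elapsedMs: number): Mode {
    if (this.mode === "passthrough") return this.mode;
    this.misses = elapsedMs > this.budgetMs ? this.misses + 1 : 0;
    if (this.misses >= this.maxMisses) this.mode = "passthrough";
    return this.mode;
  }

  // Any error in the GPU path also drops to pass-through.
  fail(): Mode {
    this.mode = "passthrough";
    return this.mode;
  }
}
```

For 1080p at 60 frames per second, `pixelsPerSecond(1920, 1080, 60)` gives 124,416,000, which is the "something like 125 million" quoted above, and `frameBudgetMs(60)` leaves roughly 16.7 ms per frame.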

4:46-6:21

[04:46] There are two ways of doing this right now. WebCodecs [04:51] is sort of the best way of doing it. This is the newest way of doing it. This would ensure that we get every single frame. [04:58] But the problem is our [05:00] player is a little bit old. [05:02] It's older than WebCodecs anyway. So we're using MSE, and it would be kind of a tremendous amount of work for us to switch that. [05:10] As of Chrome 116, there is a [05:14] newish API called requestVideoFrameCallback. [05:17] On the screen, you can see an example of how you would actually use that. This allows us to grab a video frame, a WebCodecs VideoFrame, from the video element [05:29] and pass that on to WebGPU. [05:33] It unfortunately does not guarantee that you get every frame, but in our experience, it's been close enough to every frame that it's acceptable. [05:41] So now let's talk about how we are going to make a neural network [05:46] using WebGPU [05:48] and compute shaders. [05:51] So for those of you who are not familiar with [05:54] actually programming a GPU, this is sort of how you do it. [05:58] You create very small programs, generally called shaders, or sometimes called kernels, depending on the context. These shaders are run [06:09] many, many times in parallel. [06:12] There are a number of different kinds of shaders, and generally they map to a different part of a graphics pipeline. For example, there are vertex shaders,
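
The slide with the callback example is not part of the transcript, so the following is a stand-in sketch rather than the speaker's code. It shows the re-registering pattern that requestVideoFrameCallback needs and uses the `presentedFrames` counter from the callback metadata to estimate how many frames were missed. The `VideoSource` interface and `watchFrames` name are hypothetical; in a browser, an `HTMLVideoElement` satisfies the same shape.

```typescript
// Minimal structural types so the sketch stands alone.
interface FrameMetadata {
  presentedFrames: number; // running count of frames submitted for composition
  mediaTime: number;       // presentation timestamp of the frame, in seconds
}

interface VideoSource {
  requestVideoFrameCallback(
    cb: (now: number, meta: FrameMetadata) => void,
  ): number;
}

interface FrameStats {
  delivered: number;
  dropped: number;
}

function watchFrames(
  video: VideoSource,
  onFrame: (meta: FrameMetadata) => void,
): FrameStats {
  const stats: FrameStats = { delivered: 0, dropped: 0 };
  let lastPresented = -1;

  const tick = (_now: number, meta: FrameMetadata): void => {
    // A jump of more than 1 in presentedFrames means the callback
    // did not fire for every presented frame.
    if (lastPresented >= 0) {
      stats.dropped += Math.max(0, meta.presentedFrames - lastPresented - 1);
    }
    lastPresented = meta.presentedFrames;
    stats.delivered += 1;

    // In a browser this is where a frame would be captured, for example
    // with `new VideoFrame(videoElement)`, and handed to the GPU path.
    onFrame(meta);

    // The callback fires once per registration, so register again.
    video.requestVideoFrameCallback(tick);
  };

  video.requestVideoFrameCallback(tick);
  return stats;
}
```

Tracking the dropped count is one way to check the "close enough to every frame" observation on a given device.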

6:21-7:55

[06:21] pixel or fragment shaders, [06:23] and geometry and tessellation shaders. But there is also a [06:28] class of shaders called compute shaders that allow you to do general-purpose computing on GPUs. [06:33] So, here we have a very simple example of a compute shader in action. At the top there, you have the code that is run on the GPU. It is going to copy a pixel from an input texture [06:46] and then put that into an output texture at a given coordinate. For a 1080p image, this is going to run about 2 million times. [06:55] This is invoked from the CPU, and you can see the CPU code [06:59] below. It is invoked in a couple of grids. On WebGPU, you start with your workgroup, and this has 256 threads, and each invocation is on a single thread. [07:12] And then you have your dispatch group, which is any number of workgroups. This is a 3D volume, and you don't have to use all three dimensions, but you can. [07:22] It's just a way of organizing your workgroups. [07:27] Some number of workgroups will be run concurrently, and eventually they'll [07:32] finish by exiting their threads, and we can move on with our lives. [07:39] So this is the neural network that we want to build. This is a very simple CNN. [07:44] It has six convolutional layers separated by leaky ReLUs. And none of the layers are particularly large, so you can see that it's about 3,000 multiply-accumulates per pixel.
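
The slide's code is not in the transcript, so here is a hedged reconstruction of the idea: a WGSL copy shader held in a TypeScript string, plus the arithmetic for the dispatch grid. The 16 x 16 workgroup layout is an assumption (the talk only says 256 threads), and the binding layout and the names `src`, `dst` and `dispatchSize` are made up for illustration.

```typescript
// 16 x 16 = 256 invocations per workgroup (assumed layout).
const WORKGROUP_X = 16;
const WORKGROUP_Y = 16;

// GPU side: each invocation copies one texel from src to dst.
const copyShaderWGSL = `
@group(0) @binding(0) var src: texture_2d<f32>;
@group(0) @binding(1) var dst: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(${WORKGROUP_X}, ${WORKGROUP_Y})
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  let size = textureDimensions(src);
  // The grid is rounded up, so edge workgroups contain out-of-range threads.
  if (id.x >= size.x || id.y >= size.y) { return; }
  textureStore(dst, id.xy, textureLoad(src, id.xy, 0));
}`;

// CPU side: number of workgroups needed to cover a width x height image.
// The result is what would be passed to GPUComputePassEncoder.dispatchWorkgroups.
function dispatchSize(width: number, height: number): [number, number, number] {
  return [
    Math.ceil(width / WORKGROUP_X),
    Math.ceil(height / WORKGROUP_Y),
    1, // the third dimension is unused for a 2D image
  ];
}
```

For a 1920 x 1080 image this yields a 120 x 68 x 1 grid, or 2,088,960 invocations, slightly more than the 2,073,600 pixels because 1080 is not a multiple of 16. That matches the "about 2 million" figure and is why the bounds check is needed.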

7:56-9:27

[07:56] And [07:59] we have an RGB to YUV conversion at the front, because we're only dealing with the luminance plane. [08:06] I'm not going to explain that right now because that's [08:09] getting into the weeds a little bit, as if I'm not getting into the weeds. And then at the end, we have a YUV to RGB conversion. [08:17] And in the middle, we have some hidden layers that are just bog-standard depthwise 2D convolutions. [08:26] And so when we were building this, we broke this up into three shaders. You could do it differently, but we decided that this was kind of the approach we would take. We have our first compute shader. This does our bilinear upscale, it does our YUV conversion, [08:39] and it does the first [08:41] layer [08:43] of our convolutions. We have a second shader that is run multiple times for the multiple layers. [08:51] And then we have a final exit shader that does our [08:55] final convolution followed by the YUV to RGB conversion. [09:04] So, stepping back and looking at how this actually interacts with the browser. [09:10] We start by using GPUDevice.importExternalTexture. [09:15] This is how you convert a video frame to a GPU texture. [09:19] On most devices, the video frame is already going to be on the GPU, and this is, [09:26] at the time that this
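
To make the "only the luminance plane" idea concrete, here is a small CPU-side sketch of converting to YUV, processing Y alone, and converting back. The talk does not say which colour matrix is used; full-range BT.709 coefficients are assumed here, and all function names are hypothetical.

```typescript
type Vec3 = [number, number, number];

// BT.709 luma coefficients (assumed; the talk does not name a matrix).
const KR = 0.2126;
const KB = 0.0722;
const KG = 1 - KR - KB;

// RGB in [0, 1] to Y in [0, 1] and U, V in [-0.5, 0.5].
function rgbToYuv([r, g, b]: Vec3): Vec3 {
  const y = KR * r + KG * g + KB * b;
  const u = (b - y) / (2 * (1 - KB));
  const v = (r - y) / (2 * (1 - KR));
  return [y, u, v];
}

// Exact inverse of rgbToYuv.
function yuvToRgb([y, u, v]: Vec3): Vec3 {
  const r = y + 2 * (1 - KR) * v;
  const b = y + 2 * (1 - KB) * u;
  const g = (y - KR * r - KB * b) / KG;
  return [r, g, b];
}

// Apply some processing to luminance only; chroma passes through untouched.
// `processLuma` stands in for the convolutional layers.
function processPixel(rgb: Vec3, processLuma: (y: number) => number): Vec3 {
  const [y, u, v] = rgbToYuv(rgb);
  return yuvToRgb([processLuma(y), u, v]);
}
```

Running the network on one channel instead of three is what keeps the per-pixel cost low; the chroma channels only need the plain bilinear upscale.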

9:27-10:57

[09:27] function is called, [09:28] zero copy. [09:30] However, [09:32] video tends to get decoded into YUV frames. And so what Chrome will do is it will [09:41] put a YUV to RGB [09:45] shader in front, so when you access the texture, [09:51] you'll get an automatic YUV to RGB conversion. [09:56] We pass our weights in as just binary data. We have a base64 blob that we include in our web application [10:07] and a JSON sidecar that gives us the indices into the base64 blob for where we can find our weights for any given layer. [10:19] At the end of it, [10:20] we have a GPU texture out the other side. This is our upscaled texture. So for example, we might upscale from 720p to 1080p. We have a 1080p texture out the other side. [10:30] And we blit that to the canvas. [10:34] We instantiate the canvas to be the same size as the texture, and we blit it rather than doing some sort of scaling so that the browser can handle the final scale to the viewport size, because the browser could potentially use different scaling algorithms, and we want to match what the browser is doing, and we don't want to have [10:54] multiple scaling [10:56] operations happen.
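
The base64 blob plus JSON sidecar arrangement could look roughly like the sketch below. The sidecar shape (an offset and length per layer, counted in float32 elements) is an assumption, as are the function names; the talk only says the sidecar holds indices into the blob. The sketch also assumes the weights were written in the platform's byte order, which is little-endian on practically all current devices.

```typescript
// Hypothetical sidecar shape: where each layer's weights live in the blob,
// counted in float32 elements rather than bytes.
interface WeightIndex {
  [layer: string]: { offset: number; length: number };
}

// Decode a base64 string into raw bytes.
function decodeBase64(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Extract one layer's weights as float32 values.
function layerWeights(
  bytes: Uint8Array,
  index: WeightIndex,
  layer: string,
): Float32Array {
  const entry = index[layer];
  if (!entry) throw new Error(`unknown layer: ${layer}`);
  const start = entry.offset * 4;
  const end = start + entry.length * 4;
  if (end > bytes.length) throw new Error(`layer ${layer} exceeds blob size`);
  // slice() copies into a fresh buffer, so the view is always 4-byte aligned.
  return new Float32Array(bytes.slice(start, end).buffer);
}
```

The resulting array can then be uploaded to a GPU buffer, for example with `GPUQueue.writeBuffer`, once per layer at startup.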

11:00-12:31

[11:00] So let's take a look at the performance that we can get out of it. On the left here, I have a console [11:09] output of the network running. You can see it takes about eight to nine milliseconds to run. On the top here, you have the same network [11:19] running. [11:21] This was compiled from PyTorch to Core ML and then just run [11:25] through the Xcode profiler. It runs in about [11:30] 7.2 milliseconds. And interestingly, the NPU version of it runs in about 5 milliseconds. [11:37] So I think that WebNN is actually an interesting [11:42] option for us in the future [11:45] for when we [11:48] work on Macs. [11:54] There are a number of lessons we learned here. This is kind of a short list. This was a very involved project, but [12:02] it was a lot of fun. [12:06] The first thing and the most impactful thing is to quantize, [12:11] at least to float16, if anybody considers float16 quantization anymore. [12:18] At least for us, it was quite easy. We were able to simply truncate, and it worked. [12:23] And if you're memory bound on your network, it [12:27] just halves your memory requirements.
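
"Simply truncate" is read here as dropping the low mantissa bits of each float32 weight when packing it into 16 bits. That reading is an assumption, and the sketch below is one way to do it on the CPU, not the speaker's code. The function names are hypothetical.

```typescript
// Scratch buffers for reinterpreting a float32 as its raw 32-bit pattern.
const scratchF32 = new Float32Array(1);
const scratchU32 = new Uint32Array(scratchF32.buffer);

// Convert one number to IEEE 754 half-precision bits by truncation
// (round toward zero), with no rounding of the discarded mantissa bits.
function toHalfBits(value: number): number {
  scratchF32[0] = value;
  const x = scratchU32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  const mant = x & 0x7fffff;

  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15; // re-bias the exponent for float16
  if (e >= 0x1f) return sign | 0x7c00; // too large: becomes infinity
  if (e <= 0) {
    if (e < -10) return sign; // too small: becomes signed zero
    // Subnormal float16: restore the implicit leading 1, then shift down.
    return sign | ((mant | 0x800000) >>> (14 - e));
  }
  return sign | (e << 10) | (mant >>> 13); // keep the top 10 mantissa bits
}

// Quantize a whole weight array; the result uses half the bytes.
function quantizeWeights(weights: Float32Array): Uint16Array {
  const out = new Uint16Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    out[i] = toHalfBits(weights[i]);
  }
  return out;
}
```

The output array occupies two bytes per weight instead of four, which is where the halved memory requirement comes from.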

12:32-14:10

[12:32] Different GPUs may have instructions to do two FP16 operations for the cost of one FP32 operation. On NVIDIA GPUs, for example, you have HFMA2. [12:46] So there's a lot of upside here. We didn't find any accuracy loss for F16 layers. [12:53] And so there's little downside. This is something you should just do. [12:56] Int8 is also an option. [12:59] Some systems support DP4A, which means you can do four [13:04] int8 operations for the cost of one F32, but that's kind of limited, and [13:11] I think once we get [13:12] wider support for cooperative matrices, this will be more interesting. [13:18] Loop unrolling is super important also. [13:24] WebGPU has no way of hinting that you want a particular loop unrolled. So if you have a very hot inner loop, [13:31] you're better off unrolling that because otherwise, you're leaving a lot of performance on the table. [13:38] This is in particular a problem for Metal. The Metal compiler doesn't seem to [13:43] do as good a job of unrolling loops as the DirectX compiler does. [13:49] Either way, it's better to be explicit. So in practice, what your code may end up looking like is some metaprogramming. At least that's what ours looks like now. [14:00] And one kind of final and interesting thing is Chrome can dump the WGSL and HLSL and MSL that it produces
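
The metaprogramming approach to unrolling can be illustrated with a small generator that writes out every tap of a convolution as its own WGSL statement instead of emitting a loop. This is a guess at the general shape, not the team's code. The WGSL identifiers `weights`, `loadLuma` and `pos` are hypothetical and would have to be defined elsewhere in the shader.

```typescript
// Emit the body of a K x K convolution with every iteration written out,
// so no loop is left for the shader compiler to unroll (or fail to unroll).
function unrolledConvWGSL(kernelSize: number): string {
  const half = Math.floor(kernelSize / 2);
  const lines: string[] = ["var acc = 0.0;"];
  for (let ky = 0; ky < kernelSize; ky++) {
    for (let kx = 0; kx < kernelSize; kx++) {
      const w = ky * kernelSize + kx; // flat index into the weight array
      const dx = kx - half;
      const dy = ky - half;
      // fma(a, b, c) computes a * b + c, one multiply-accumulate per tap.
      lines.push(
        `acc = fma(weights[${w}], loadLuma(pos + vec2<i32>(${dx}, ${dy})), acc);`,
      );
    }
  }
  return lines.join("\n");
}
```

Calling `unrolledConvWGSL(3)` produces one declaration followed by nine multiply-accumulate statements. The returned string would be spliced into the shader source before it is passed to `GPUDevice.createShaderModule`.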

14:10-14:40

[14:10] via a command line switch. So when you include the command line switch that's on the slide here, when you actually compile a WebGPU shader, it will [14:22] dump out the WGSL that was given to it and also the MSL or HLSL that was produced when it does the transpilation to those languages. [14:36] All right, and that is it for me. [14:38] Thank you, everybody, for listening to me ramble.
