
Snap’s GPU-Accelerated Secret to Processing 10 Petabytes a Day | NVIDIA AI Podcast Ep. 298

NVIDIA • May 13, 2026 at 04:00 PM • 23:33

TL;DR

Snap cut its data pipeline costs by 76%, using 62% fewer CPU cores and 80% less memory, by shifting large-scale processing to GPU-accelerated infrastructure.

KEY POINTS

Massive data scale and strict deadlines

Snap operates at enormous scale, with its experimentation platform alone processing more than 10 petabytes of data daily. Results must be ready each morning to inform product decisions across engineering, data science, and product teams. This requires both high performance and predictable delivery under tight service-level agreements.

Experimentation as a core product driver

The company relies heavily on A/B testing and advanced statistical methods to guide feature development. Techniques such as heterogeneous treatment effect detection, variance reduction, and sample size mismatch correction help ensure that new features deliver value across diverse global user segments without degrading user experience.
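
Variance reduction in A/B testing is commonly implemented with a technique like CUPED, which uses each user's pre-experiment metric to cancel out noise. The episode does not specify Snap's exact method, so the helper below is an illustrative pure-Python sketch, not Snap's implementation:

```python
from statistics import mean, pvariance

def cuped_adjust(post, pre):
    """Return CUPED-adjusted metric values (illustrative helper).

    post: per-user metric measured during the experiment
    pre:  the same users' metric from before the experiment
    The adjustment removes the variance explained by the pre-period,
    leaving the mean intact but making the estimate less noisy.
    """
    mu_post, mu_pre = mean(post), mean(pre)
    # theta is the OLS slope of post on pre: Cov(post, pre) / Var(pre)
    cov = mean((y - mu_post) * (x - mu_pre) for y, x in zip(post, pre))
    theta = cov / pvariance(pre)
    return [y - theta * (x - mu_pre) for y, x in zip(post, pre)]
```

When pre- and post-period metrics are strongly correlated, the adjusted values have the same mean but markedly lower variance, which tightens confidence intervals for the same sample size.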

Shift from CPUs to GPUs

To handle growing complexity without escalating costs, Snap transitioned from CPU-heavy processing to GPU-accelerated workloads using NVIDIA Spark RAPIDS. The approach enables faster execution and more efficient scaling, particularly for data-intensive operations like joins and repartitioning.
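
As the episode emphasizes, the RAPIDS Accelerator is enabled through Spark configuration rather than code changes. The sketch below renders the documented plugin settings as spark-submit flags; the resource amounts are placeholders, not Snap's actual configuration:

```python
# Illustrative Spark settings for the RAPIDS Accelerator plugin.
# Key names follow the public Spark RAPIDS documentation; the values
# (GPU amounts) are placeholders, not Snap's actual configuration.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",  # documented entry point
    "spark.rapids.sql.enabled": "true",
    "spark.executor.resource.gpu.amount": "1",      # one GPU per executor
    "spark.task.resource.gpu.amount": "0.125",      # 8 concurrent tasks per GPU
}

def to_spark_submit_args(conf):
    """Render the settings as spark-submit --conf flags."""
    return [f"--conf {key}={value}" for key, value in sorted(conf.items())]
```

Because the plugin intercepts Spark's physical plan, existing PySpark jobs can run unmodified; operators the plugin cannot accelerate transparently fall back to the CPU.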

Performance gains across workloads

Benchmarking showed substantial improvements: over 3× speedups for join-heavy jobs, around 2× for union operations, and roughly 1.5× for aggregations. These gains stem from GPUs’ parallel processing capabilities and high-bandwidth memory.
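
Because a real pipeline mixes these job types, the end-to-end gain is a time-weighted (Amdahl-style) combination of the per-stage speedups. A small illustration: the per-stage speedups are from the episode, but the 50/30/20 time split is made up:

```python
def pipeline_speedup(stages):
    """Overall speedup from (time_share, stage_speedup) pairs.

    time_share values are fractions of the original runtime and must
    sum to 1. Amdahl-style weighting: the new runtime is the sum of
    each stage's share divided by its speedup.
    """
    total_share = sum(share for share, _ in stages)
    assert abs(total_share - 1.0) < 1e-9, "time shares must sum to 1"
    new_runtime = sum(share / speedup for share, speedup in stages)
    return 1.0 / new_runtime

# Hypothetical mix: 50% of time in joins (3x), 30% in unions (2x),
# 20% in aggregations (1.5x) -> about 2.22x end to end.
overall = pipeline_speedup([(0.5, 3.0), (0.3, 2.0), (0.2, 1.5)])
```

The weighting shows why join-heavy pipelines benefit most: the stages that dominate runtime also see the largest GPU speedups.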

Zero code migration advantage

A key factor in adoption was the ability to deploy GPU acceleration with no changes to existing codebases. This minimized engineering overhead and accelerated rollout, aligning with internal priorities around developer productivity.

Innovative use of idle GPU capacity

Snap identified that large amounts of GPU capacity used for real-time inference sat idle during off-peak hours, particularly between 1 a.m. and 5 a.m. Pacific Time. The company repurposed this unused capacity for batch data processing, significantly improving resource utilization.

New platform built on Kubernetes

To enable this shift, Snap developed a new data platform on Google Kubernetes Engine (GKE). This allowed batch workloads to run on infrastructure originally designed for online services, while introducing mechanisms like preemption to prioritize user-facing workloads when needed.
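
The core scheduling idea, that user-facing inference preempts batch work on a shared GPU pool, can be sketched with a toy model. The classes and priority numbers here are illustrative, not Snap's scheduler:

```python
from dataclasses import dataclass, field

# Toy preemption model for a shared GPU pool. The policy (user-facing
# inference outranks batch experimentation) is from the episode; the
# priority values and API are illustrative.
PRIORITY = {"online-inference": 100, "batch-experimentation": 10}

@dataclass
class GpuPool:
    capacity: int  # total GPUs in the pool
    running: list = field(default_factory=list)  # (name, workload_class, gpus)

    def used(self) -> int:
        return sum(gpus for _, _, gpus in self.running)

    def submit(self, name: str, workload_class: str, gpus: int) -> bool:
        """Admit a job, preempting lower-priority jobs if needed."""
        while self.used() + gpus > self.capacity:
            victims = [job for job in self.running
                       if PRIORITY[job[1]] < PRIORITY[workload_class]]
            if not victims:
                return False  # nothing preemptible: reject the request
            # Evict the lowest-priority job first.
            self.running.remove(min(victims, key=lambda job: PRIORITY[job[1]]))
        self.running.append((name, workload_class, gpus))
        return True
```

On Kubernetes the same policy would typically be expressed with PriorityClass objects rather than custom code; the toy model just makes the admission logic concrete.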

Resilient multi-layer fallback system

The system includes dynamic fallback paths: from GPUs to CPUs, and from Kubernetes-based execution to traditional Dataproc clusters when necessary. This ensures reliability even during peak usage or resource shortages.
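
The fallback chain (GPU on GKE, then CPU, then Dataproc) amounts to trying backends in priority order until one can schedule the job. A minimal sketch with stand-in runner functions; a real version would wrap actual job-submission APIs:

```python
# Layered fallback as described in the episode: GPU-on-GKE first, then
# CPU, then a Dataproc cluster. CapacityError and the runners are
# stand-ins for real scheduling APIs.
class CapacityError(Exception):
    """Raised by a backend that cannot schedule the job right now."""

def run_with_fallback(job, backends):
    """Try each (name, runner) pair in order; return (backend, result)."""
    failures = []
    for name, runner in backends:
        try:
            return name, runner(job)
        except CapacityError as exc:
            failures.append((name, str(exc)))  # fall through to next layer
    raise RuntimeError(f"all backends exhausted: {failures}")
```

Keeping the fallback decision in one place makes it easy to log which layer actually ran each job, which matters when the layers have different cost and performance profiles.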

Dramatic efficiency improvements

The migration delivered major operational gains, including a 76% reduction in job costs, 62% fewer CPU cores required, and an 80% drop in memory usage. It also eliminated roughly 120 terabytes of disk and memory spill, a common bottleneck in large-scale data pipelines.

Rapid production rollout

Despite the scale, Snap moved from experimentation to full production in approximately 8 to 9 months. The effort relied on close collaboration across infrastructure and technology partners, as well as extensive benchmarking and staged deployment.

Broader impact on internal AI and data strategy

The new platform now enables multiple teams to leverage shared GPU resources for a variety of workloads. This has influenced Snap’s broader roadmap, encouraging more widespread adoption of GPU acceleration and more flexible scheduling of compute-intensive tasks.

CONCLUSION

Snap’s transition to GPU-accelerated data processing demonstrates how large-scale platforms can dramatically cut costs and improve performance by rethinking infrastructure and maximizing existing resources.

Full transcript

We were able to cut almost about 76% of our job costs as a result of this migration. 76? 76. It's phenomenal, right? I mean, for the engineers out there, we were able to cut down the number of cores required by like 62%. Amazing. The memory footprint, we could drop it by like 80%. So, phenomenal results. The results speak for themselves. Welcome to the NVIDIA AI Podcast. I'm Noah Kravitz. I'm here with Prudhvi Vatala. Pru is the head of engineering platforms at Snap, and we're here to talk about data processing, and in particular, how a social platform with more than 940 million active users accelerated their data pipeline. Pru, welcome to the NVIDIA AI Podcast. Thanks so much for taking the time to join us. Yeah, thanks for having me here, Noah. So maybe we can start with the basics. Tell us a little bit about, well, about what Snap is now. I'm old, but I still think of it, you know, the Snap glasses and everything, but Snapchat, obviously, is a huge social platform. So maybe tell us a little bit about Snap and then your role there. Absolutely. Yeah. I mean, Snapchat at this point is pretty much a household name, you know, and Snap is the company. It's interesting that you bring up the Spectacles, because Snap as a company believes that the camera is at the center of improving how people communicate and improve their lives, you know, in the digital world, so to speak. We've been steadfast in that belief. And Snap right now is at the intersection of augmented reality, AI, and visual communication. Like I said, serving close to a billion monthly active users. I've been at Snap for a while now, and I lead a multifaceted organization. A little bit of it has to do with big data infrastructure, a little bit with developer productivity, and a little bit with enterprise AI, and whatnot. So, yeah. And so when we talk about accelerating data processing, what does that mean to you? What does that mean for Snap?
And thinking about the scale that you operate on, can you talk a little bit about what it means to accelerate data at that level? Absolutely. That's a great question. As you can imagine, with as many users as we have, and Snapchat in particular being a very complex application, you can imagine the scale at which we operate. Especially on the data processing side, my team's experimentation platform alone is dealing with ten-plus petabytes each day. It's a massive scale. Huge scale. And then we have a strict SLA in the morning, because the experimentation results need to be ready for developers, product managers, and data scientists to act on as early as possible, so that, you know, they can take appropriate action. So for us, accelerating data processing basically means, instead of throwing more and more CPUs at the problem, figuring out a way to flatten that scale. In this particular scenario, it was about figuring out how to leverage GPUs for improving our workloads, making sure they run faster and cheaper and scale, you know, linearly or sub-linearly, unlike before, where it was definitely super-linear as feature areas grew. So that's where accelerating comes in. So you mentioned experimentation. What does that mean? What does that look like when you're conducting experiments at Snap? And then maybe, how does that fit in? Is that where the ten petabytes of data each morning comes from? Or we can talk about that. Yeah, absolutely. So this ten petabytes of data is only about the experimentation platform. The big data footprint across Snap is far wider. Sure. You know, experimentation is a little bit about Snap's product philosophy. We believe that experimentation, safety, and privacy are core pillars of our product development and iteration.
Like, when we are thinking about new product areas, when we are shipping new product features to our, you know, half a billion daily active users across the globe, we need to think about how the users are receiving it, how they're responding to it, how they're using it, whether or not this is adding value to their, you know, daily lives, and also guardrailing things like, is it regressing their performance, is it causing their devices to slow down, you know? We need to be very particular about protecting their experiences as well. And so, Pru, along those lines with the experimentation, can you talk a little bit about the importance of A/B testing? So, A/B testing, you know, the concept of randomized controlled trials has been around for a long time, especially in the clinical fields and whatnot. But with the digital revolution, it has become the mode of bringing statistical rigor to decision-making at scale. Right? So that's what A/B testing adds for us. Like, you know, when we are dealing with this massive user base that is diverse by nature, you know, from all walks of life across the globe, and we are trying to delight them, we are trying to bring experiences to them, we need to make sure, you know, that what we are delivering is buttoned down, like it's actually really adding value the way we think it is. Right? And at this scale, a lot of things, you know, can happen. And that's where having statistical rigor grounded in, you know, holdouts and well-defined controls and statistical methods comes in. Like, over the years, my team has added a bunch of statistical methods to our platform, you know, heterogeneous treatment effect detection, for example. You may think that a feature is performing well for the global audience, but it may not perform so well for, like, a subset. Right. So figuring out those heterogeneous effects is one thing that we focus on.
And, you know, at this scale, no matter how you slice your experiments, you are still allowing some bias to seep in, as in, you know, some power users may end up on one side of the experiment rather than the other. So how do we make sure the distributions are evened out when the experiment results are read? That's the variance reduction aspect. So that's something my team built over time. And then, you know, sometimes when we ship a feature, if people don't like it, they might even just stop showing up, you know? Right. That's the sample size mismatch problem. So we also do a bunch of that rigorously. So that's what A/B testing brings to the table. So with all of the data processing every day, what made you think that maybe some NVIDIA tech put into the stack might help things out? How did that process start? And maybe you can talk about, you know, what you've integrated and what you're using. Absolutely. So I'm really proud of this. I'm really proud of my team, because over the years, our platform has seen the number of users grow as Snap, you know, ballooned. Sure. Right. In terms of footprint, the number of features we shipped, like, you know, Spotlight, AR features, AR lenses, and all of the AI features we shipped in the recent past, they've also been adding a lot of additional dimensions to the platform. And my team was hard at work making sure we are scaling appropriately, even as all of that, of course, grows. Yeah. And they've done a very good job of it, historically, for years now, maintaining costs flat and, you know, performance predictable, meeting the SLAs and whatnot. And one thing we came across was NVIDIA Spark RAPIDS, on one of the blog posts, and we saw NVIDIA is shipping this, you know, solution to speed up our PySpark workloads, anywhere from 3.6x performance versus 50%, you know, runtime. Okay. So, it was phenomenal. Yeah.
So that's what drew us in. You know, I'm waiting to hear the numbers sound good. I'm waiting to hear the rest. Yeah, yeah. So we read those and we got super excited. And then, our stack was, and still is, entirely Google Cloud for the experimentation platform. We loved working with them. Google Cloud Dataproc was phenomenal. They've been a fantastic partner to us throughout the scaling journey. Yeah. And then when this news came out, with Spark RAPIDS, we wanted to try it out. We did a bunch of benchmarking. Obviously, like I said, we do a lot of things, so there is a lot of complexity to the nature of the jobs we run, and we had to benchmark each kind of job as well. Like, you know, taking jobs that are heavy with joins and repartitions and, you know, shuffling of data that moves it around; versus, you know, jobs that are purely unioning data from various places; versus, you know, jobs that are purely aggregating, like running sums and whatnot. So we had to benchmark across all of them. And we noticed that even on Google Dataproc with Spark RAPIDS, we got, I want to say, 3x-plus improvement for the join jobs, close to 2x for the union jobs, and a little over 1.5x for aggregations. That's largely because CPUs are already good at aggregation, right? Right, right. And then the other thing is, GPUs by nature support parallelism and high-bandwidth memory on the hardware itself. So that made it a very good candidate for us to pursue. And so you're running your GPU-accelerated pipelines on Google Kubernetes Engine, is that right? Yes, yes. That has been a very interesting journey, from, you know, testing out our pipelines with Dataproc GPUs to today. And one other thing I want to mention: with Spark RAPIDS, we didn't have to change a single thing about how we ran the jobs. That was the beauty of it.
No changes at all. Oh, it's amazing. Zero code changes. So, I'm into developer productivity and developer enablement, so for me, that was music to my ears. Sure, of course. So that was very impressive. With Dataproc, which abstracts out the Spark runtime for us, and Spark RAPIDS, which didn't require us to change the jobs, it was phenomenal. Yeah. I mean, it went very well, so we wanted to productionize this. At our scale, pipelines aren't just monolithic, right? We do a bunch of sharding and then, you know, batching of work. So we were able to migrate one shard to production on Google Dataproc using 300 GPUs. The results were phenomenal. Yeah. And then in the next phase, we wanted to migrate ten shards of our total, you know, 50-plus shard architecture. And that needed about 3,000 GPUs, which was still doable with Dataproc on-demand GPUs, because, you know, GPU capacity is on everybody's mind these days, right? Yeah. So that was well and good. But then, we didn't have a path forward after that. Okay. Right. So we kind of hit a roadblock with, you know, on-demand GPU capacity. So we had to get creative. We started looking around, and we were like, where at Snap do we have GPU capacity that we can borrow? Right. And, you know, that's where the real insight came from. Snap has a global audience, and the Snapchatters' behavior follows the day. People wake up, they use Snapchat, and they go to bed. At night, they don't. Right. So what that meant was, when some of our biggest markets went to bed, a lot of our online inference GPU capacity was sitting idle. Yeah, yeah. Somewhere between 1 a.m. and 5 a.m. Pacific, you know. Yeah. So that was our opening, our opportunity. Yeah, yeah. Go tackle it. And that brought about its own set of complexity, right? Because the online serving stack is not built for batch data processing. OK, yeah. They were considered fundamentally different worlds. Right.
So all the online GPUs were tied to Kubernetes and GKE, and we were already on Google Cloud, so GKE wasn't an issue for us at all. It was actually very welcome. So we had to migrate our workloads to a Kubernetes-based Spark runtime and host it on GKE so that we could leverage, you know, what the underlying GPUs had to offer. And for that, we had to actually build a data platform from the ground up. OK. You know, because it's one thing for my team to just use this idle capacity, but at Snap, we wanted to make sure that even as the online need for GPUs increased, as our AI footprint increased, we should still have any team at Snap be able to leverage that capacity. And sure, yeah. As available. And then we had to also acknowledge that if a user wanted to see fresh Spotlight content, it supersedes the GPU need for experimentation. Yeah. So, you know, preemption had to be built in. Yes. Yeah. So if we had a sudden spike in traffic, we had to give up GPU capacity. So with all of that in mind, we built out a platform from the ground up. Okay. And then we started migrating, and we had a lot of blockers along the way. And the team got really creative, right? Yeah. It was a phenomenal journey. Amazing, yeah, yeah. And so you're also running an accelerated Apache Spark pipeline? Yes, yes. So, at a high level, our pipelines are split into daily and hourly cadences. OK. Hourly is mostly for guardrailing. Like I said, you know, we don't want to break users' experience no matter what, and having that hourly feedback cycle goes a long way in doing that. And then we also have daily pipelines, which serve as the statistical authority for decision-making. So our first migration to GKE plus NVIDIA Spark RAPIDS was the hourly pipeline, because, you know, speed mattered far more there. Right? So we migrated and operationalized it. And during that process, we ran into a few corner cases.
You know, if the GPU capacity wasn't available at, like, 11 a.m., when everybody was active on Snap, right, what do we do? So we had to figure out how to gracefully fall back from GPUs to CPUs. Right. And then, if the shared GKE resource itself was the constraint, we had to gracefully fall back from CPUs to Dataproc clusters. So building all of that with operational reliability in mind was also great. Yeah. Looking back on it, what learnings would you share? You know, if there's a listener out there who's embarking on a similar project, or trying to figure out, maybe there's, like you said, kind of a daily cycle of when the GPUs are in use for inference and when they're not, and they're thinking about, you know, borrowing GPUs from the rest of the company, what learnings would you share from this whole process? Is there a big takeaway, something that surprised you? Yeah. Right. So, the direction that NVIDIA is headed in is phenomenal for these kinds of needs. You know, NVIDIA Spark RAPIDS, like I said, zero code rewritten. Yeah. Zero code changed to enable it. We had to figure out the image building and environment differences and whatnot, and the testing cycles, obviously; any production workload needs to go through that rigorous rollout process, so everybody needs to pay attention to it. But this is a real possibility, you know, the NVIDIA direction. The other thing that NVIDIA offered that really helped us a lot was NVIDIA Aether. It's another solution that gives us Spark tuning out of the box, because especially when we had this fallback mechanism in place, where we had to go from GPUs to CPUs to Dataproc, the environments are different, and the Spark parameters had to be different. So something like NVIDIA Aether giving us a starting point, and making sure the tuning stayed consistent across all of these versions, was also very helpful. So you've mentioned, obviously, the work with NVIDIA and with Google Cloud as well.
Kind of taking a step back, sort of bigger picture: what are these partnerships, working, you know, hand-in-hand so closely with Google Cloud and with NVIDIA, doing to the way that you and Snap see your roadmaps for both data and AI going forward? Yeah. I mean, huge props to the NVIDIA team and the Google Cloud team. Honestly, it's been a phenomenal three-way partnership like I have never seen in my career before. Amazing. Yeah, it was phenomenal. And the impact speaks for itself. Like, we were able to cut almost about 76% of our job costs as a result of this migration. 76? 76. It's phenomenal, right? I mean, for the engineers out there, we were able to cut down the number of cores required by like 62%. I mean, the memory footprint, we could drop it by like 80%. And for the Spark nerds out there, we were able to cut out almost 120 terabytes of disk and memory spill from our pipelines. Wow. It just vanished once we started doing, you know, all of this. Yeah. That is one of the biggest headaches any, you know, data pipeline at scale runs into. So, phenomenal results. The results speak for themselves. Without the partnership, this would not have been possible in the timescale that it was possible. Right? Like, migrating a production pipeline with ten-plus petabytes from, you know, prototyping and exploration to full production in a matter of about 8 to 9 months is phenomenal. Right, and without the continuous, you know, back and forth and knowledge sharing and partnership across these three companies, this wouldn't have been possible. Oh, that's great. Yeah, and in terms of the roadmap, it definitely had an impact. Like I said, my team built this data platform from the ground up to enable any team at Snap to leverage the GPU capacity and, you know, what the NVIDIA libraries have to offer. And we're already seeing movement with it. Right.
Even my own teams are migrating other things that we haven't even tried out so far, experimenting with them, you know, trying things out. Because even if we don't have idle capacity to fit all of our workloads all the time, if we can schedule things creatively, if we can move things around, we can maximize the capacity as much as we can. And a lot of other teams are also picking this up. Yeah. It's fantastic. So you've been at Snap for eight years, is that right? Seven? Close to, yes, close to eight. OK. And Snap's been around for about 15 years? Yes. Give or take. Working at a huge social media platform over this span of time, where social media has just, you know, become such a core part of the fabric of so many people's lives, what's it been like to be at Snap and to see the changes? You know, I said at the beginning, right, I remember the Spectacles. That's my first thought of Snap. And obviously now Snapchat, you know, that same lineage, same philosophy, different product, obviously. Right. But what's it like to just have seen the evolution of social media, and then also so many technological changes that impact, you know, what you're able to do and how you do it, as you were just describing? What's it been like from the inside? Yeah, it's been an unbelievable experience, Noah. Like, that's what gets me up in the morning every day, you know. In the visual communication and AI landscape, Snap has had a massive impact on the planet. Yes. Honestly. And having a direct role to play in it is a great feeling, right? I've seen the company grow from, you know, camera messaging, picture messaging, to what it is today. AR. Stories, which is something we invented, and the whole world, including some newspapers, ended up adopting it, you know, the Stories format. And then, to your point about Spectacles, we did it before anybody else was even thinking about it, you know?
So the company is innovative. We come up with so many new things, and running platforms inside means that I have to, you know, figure out a way to enable all of this, even as the company evolves. Having a front-row seat to that evolution, and playing a big part in it, has been very fulfilling. Fantastic. Pru, for the listeners and viewers, there are some out there who haven't used Snapchat before. For anyone who wants to get the experience, but also to learn more about Snap and maybe about some of the technical work that you're doing, there's obviously the website, there's social media, but is there a research blog? Where can people go? Absolutely. So we have an engineering blog that's pretty active. We share a lot of the phenomenal work that engineers in the company are working on. And, you know, we are also participating in events like this and sharing our knowledge with the world. And Snapchat, if you haven't used it, you should definitely give it a try. It's different from social media. This is a true story: I got a Snap from my younger son maybe 45 minutes before we sat down to do this, and it made my day. So, absolutely, if you haven't, yeah. Pru Vatala, thank you so much. This has been a great conversation, and I'm sure the developers and engineers in the audience, hopefully, have taken a lot from it. But thank you so much for taking the time to join us, and all the best to you and everybody at Snap; keep changing the world for the better. Thank you so much, Noah. Thanks for having me. Appreciate it.
