Note: This article is written in collaboration with the Shopify engineering team. Special thanks to the Shopify engineering team for sharing details with us about their Black Friday Cyber Monday preparation work and also for reviewing the final article before publication. All credit for the technical details shared in this article goes to the Shopify Engineering Team.

Black Friday Cyber Monday (BFCM) 2024 was massive for Shopify. The platform processed 57.3 petabytes of data, handled 10.5 trillion database queries, and peaked at 284 million requests per minute on its edge network. On app servers alone, they handled 80 million requests per minute while pushing 12 terabytes of data every minute on Black Friday.

Here’s the interesting part: this level of traffic is now the baseline for Shopify. And BFCM 2025 was even bigger, serving 90 petabytes of data, handling 1.75 trillion database writes, and peaking at 489 million requests per minute.

This is why Shopify rebuilt its entire BFCM readiness program from scratch. The preparation involved thousands of engineers working for nine months, running five major scale tests.

In this article, we will look at how Shopify prepared for success during the Super Bowl of commerce.

The Three-Track Framework

Shopify’s BFCM preparation started in March with a multi-region strategy on Google Cloud. The engineering team organized the work into three parallel tracks that run simultaneously and influence each other:
These three tracks constantly feed into each other. For example, risk findings might reveal capacity gaps the team didn’t account for. Infrastructure changes might introduce new risks that need assessment. In other words, it’s a continuous feedback loop.

Game Days

To assess risks properly, the Shopify engineering team runs Game Days. These are chaos engineering exercises that intentionally simulate production failures at the BFCM scale.

The team started hosting Game Days in early spring. This involves deliberately injecting faults into the systems to test how they respond under failure conditions. Think of it like a fire drill, but for software.

During these Game Days, the engineering team focuses extra attention on what they call “critical journeys”. These are the most business-critical paths through their platform: checkout, payment processing, order creation, and fulfillment. If these break during BFCM, merchants lose sales immediately.

Critical Journey Game Days run cross-system disaster simulations. Here are some common aspects that are tested by the team:
These exercises build muscle memory for incident response by exposing gaps in operational playbooks and monitoring tools. Most importantly, Shopify closes those gaps well ahead of BFCM instead of discovering them when merchants need the platform most.

All findings from Game Days feed into what Shopify calls the Resiliency Matrix. This is centralized documentation that tracks vulnerabilities, incident response procedures, and fixes across the entire platform. The Resiliency Matrix includes five key components.
The Matrix becomes the roadmap for system hardening before BFCM. Teams update it continuously throughout the year, documenting resilience improvements as they go.

Load Testing with Genghis and Toxiproxy

Game Days test components in isolation, but Shopify also needs to know if the entire platform can handle BFCM volumes. That’s where load testing comes in.

The engineering team built a tool called Genghis that runs scripted workflows mimicking real user behavior. It simulates browsing, adding items to the cart, and going through checkout flows. The tool gradually ramps up traffic until something breaks, which helps the team find their actual capacity limits.

Tests run on production infrastructure simultaneously from three Google Cloud regions: us-central, us-east, and europe-west4. This simulates global traffic patterns accurately. Genghis also injects flash sale bursts on top of baseline load to test peak capacity scenarios.

Shopify pairs Genghis with Toxiproxy, an open-source framework they built for simulating network conditions. Toxiproxy injects network failures and partitions that prevent services from reaching each other. For reference, a network partition is when two parts of your system lose the ability to communicate, even though both are still running.

During tests, teams monitor dashboards in real time and are ready to abort if systems begin to degrade. Multiple teams coordinate to find and fix bottlenecks as they emerge. When load testing reveals limits, teams have three options:
These decisions set the final BFCM capacity and drive optimization work across Shopify’s entire stack. The key insight is that the team cannot wait until BFCM to discover the capacity limits. It takes months of preparation to scale infrastructure and optimize code.

The Analytics Platform Challenge

BFCM tests every system at Shopify, but 2025 presented a unique challenge. Part of their infrastructure had never experienced holiday traffic, which creates a problem: how do you prepare for peak load when you have no historical data to model from?

In 2024, Shopify’s engineering team rebuilt its entire analytics platform. They created new ETL pipelines. ETL stands for Extract, Transform, Load, which is the process of pulling data from various sources, processing it, and storing it somewhere useful. They also switched the persistence layer and replaced their legacy system with completely new APIs.

This created an asymmetry. The ETL pipelines ran through BFCM 2024, so the team had one full season of production data showing how those pipelines perform under holiday load. But their API layer launched after peak season ended. They were preparing for BFCM on APIs that had never seen holiday traffic.

This matters a lot because during BFCM, merchants obsessively check their analytics. They want real-time sales numbers, conversion rates, traffic patterns, and data about popular products. Every single one of these queries hits the API layer. If those APIs can’t handle the load, merchants lose visibility during their most critical sales period.

Shopify ran Game Days specifically for the analytics infrastructure. These were controlled experiments designed to reveal failure modes and bottlenecks. The team simulated increased traffic loads, introduced database latency, and tested cache failures to systematically map how the system behaves under stress.

The results showed four critical issues that needed fixes:
Beyond the performance fixes, the team validated alerting and documented response procedures. Their teams were trained and prepared to handle failures during the actual event.

The Scale Tests

Game Days and load testing prepare individual components, but scale testing is different. It validates the entire platform working together at BFCM volumes, revealing issues that only surface when everything runs at capacity simultaneously.

From April through October, Shopify ran five major scale tests at their forecasted traffic levels, specifically their peak p90 traffic assumptions. In statistics, p90 means the 90th percentile: the value that 90% of observations fall at or below. A p90 traffic assumption is therefore a level that traffic is expected to stay at or below 90% of the time.

Here are the details of those scale tests:
These tests are extraordinarily large, so Shopify runs them at night and coordinates with YouTube, because the tests impact shared cloud infrastructure.

The team tested resilience, not just raw load capacity. They executed regional failovers, evacuating traffic from core US and EU regions to validate that their disaster recovery procedures actually work.

Shopify ran four types of tests:
The team simulated real user behavior, such as storefront browsing and checkout, admin API traffic from apps and integrations, analytics and reporting loads, and backend webhook processing. They also tested critical scenarios like sustained peak load, regional failover, and cascading failures where multiple systems fail simultaneously.

Each test cycle identified issues that would never appear under steady-state load, and the team fixed each issue as it emerged. Some of the key issues were as follows:
Mid-program, Shopify made an important shift. They added authenticated checkout flows to their test scenarios. Modeling real logged-in buyers exposed rate-limiting code paths that anonymous browsing never touches. Even though authenticated flows were a small percentage of traffic, they revealed bottlenecks that would have caused problems during the real event.

BFCM Weekend Operations

BFCM preparation gets Shopify ready, but operational excellence keeps them steady when traffic actually spikes. The operational plan coordinates engineering teams, incident response, and live system tuning. Here are the key components of this plan:
The philosophy is simple: preparation gets you ready, but operational excellence keeps you steady.

Conclusion

Shopify’s 2025 BFCM readiness program shows what systematic preparation looks like at scale. Thousands of engineers worked for nine months, running five major scale tests that pushed their infrastructure to 150% of expected load. They executed regional failovers, ran chaos engineering exercises, documented system vulnerabilities, and hardened systems with updated runbooks before merchants needed them.

What makes this different from typical pre-launch preparation is the systematic approach. Most companies load test once, maybe twice, fix critical bugs, and hope for the best. Shopify spent nine months continuously testing, finding breaking points, fixing issues, and validating that the fixes actually work.

Also, the tools Shopify built aren’t temporary BFCM scaffolding. The Resiliency Matrix, Critical Journey Game Days, and real-time adaptive forecasting became permanent infrastructure improvements. They make Shopify more resilient every day, not just during peak season.

To provide a visualization of BFCM, Shopify also launched an interesting pinball game to showcase the Shopify Live Globe. The game itself runs at 120fps in a browser with a full 3D environment, physics engine, and VR support. Behind the scenes, the game is a three.js app built with “react-three-fiber”. Every merchant sale shows up a few seconds later on this globe. Everyone can check out the game and the visualization on the homepage for Shopify Live Globe.
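As a companion to the load testing section, the "ramp up traffic until something breaks" idea can be sketched in a few lines of Python. This is a hypothetical illustration and not how Genghis is implemented: the names `ramp_until_failure`, `fake_system`, and `with_burst`, the step sizes, and the 1% error threshold are all assumptions made up for this example.

```python
from typing import Callable, Optional

Probe = Callable[[int], float]  # takes requests per minute, returns error rate


def ramp_until_failure(
    probe: Probe,
    start_rpm: int,
    step_rpm: int,
    max_rpm: int,
    max_error_rate: float = 0.01,  # assumed abort threshold, not from the article
) -> Optional[int]:
    """Return the highest load level that stayed healthy during the ramp.

    `probe(rpm)` is a hypothetical hook that applies `rpm` requests per
    minute and reports the observed error rate (0.0 to 1.0).
    Returns None if even the starting load is unhealthy.
    """
    last_healthy = None
    rpm = start_rpm
    while rpm <= max_rpm:
        if probe(rpm) > max_error_rate:
            break  # abort the ramp as soon as the system degrades
        last_healthy = rpm
        rpm += step_rpm
    return last_healthy


def fake_system(capacity_rpm: int) -> Probe:
    """Stand-in for a real system: error-free up to its capacity, then
    failing in proportion to the overload. Purely illustrative."""
    def probe(rpm: int) -> float:
        if rpm <= capacity_rpm:
            return 0.0
        return (rpm - capacity_rpm) / rpm
    return probe


def with_burst(probe: Probe, burst_rpm: int) -> Probe:
    """Layer a fixed flash-sale burst on top of the baseline load."""
    return lambda rpm: probe(rpm + burst_rpm)


if __name__ == "__main__":
    system = fake_system(capacity_rpm=1000)
    print(ramp_until_failure(system, start_rpm=100, step_rpm=100, max_rpm=5000))
    print(ramp_until_failure(with_burst(system, 250), 100, 100, 5000))
```

In this toy model, adding a burst lowers the baseline level the system can sustain, which is the kind of headroom question a flash-sale scenario is meant to answer. A real test would replace `fake_system` with live measurements and would watch latency and saturation signals as well as error rate.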
How Shopify Prepares for Black Friday
Tuesday, 23 December 2025