The Situation
In online retail, the shopping season stretching from just before Black Friday and Cyber Monday through the holidays can be very productive. It can also be highly stressful if your eCommerce site isn’t functioning properly. For years, Roger’s Sporting Goods had productive shopping seasons that improved each year; however, despite efforts to improve things, their eCommerce site went down every year during this peak period.
Donald MacCormick at Roger’s Sporting Goods wanted more than improvements; he wanted to prevent the site from going down. To improve the site’s infrastructure, and with plenty of lead time before the shopping season, Classy Llama helped Roger’s move hosting providers and deploy new, more capable hardware. However, even with the new hardware in place, Classy Llama still could not project how much load the infrastructure could withstand or whether it could handle the influx of shoppers in the upcoming season. Load testing was needed to identify the infrastructure’s actual capacity. In addition, the testing would have to be performed on the Roger’s Sporting Goods production site, which limited what could be simulated in any test.
Goals & Strategies
- Reduced product page load time (TTFB) from 4 seconds to 0.7 seconds
- Built and load tested infrastructure to handle 15 page views per second (50% more than expected peak traffic volume)
- Seamlessly load tested a production site without taking the site down during testing
Our Suggested Solution
Donald reached out to Classy Llama for help. The Classy Llama team suggested a load testing plan that included analysis of prior years’ peak shopping traffic for Roger’s Sporting Goods. The analysis provided a baseline regarding the level and type of traffic that had been causing the site crashes over the past several years. This data was used to create a load testing plan to hit the new infrastructure appropriately to ensure it was ready for the shopping season.
The historical analysis was performed on prior year’s Cyber Monday data. The analysis looked at what pages were being visited, and hourly peaks. Data was bucketed into major categories of pages visited to understand what needed to be recreated and tested in the new infrastructure. The categories included:
- Home Pageviews
- Category Pageviews
- Product Pageviews
- Site Search
- Cart Pageviews
- Starting Checkout
- Completing Checkout
Simulating adding items to the cart and placing orders in a production environment is difficult, and it would create a large number of test orders that would have to be kept out of other systems. We therefore decided to start by focusing on the simpler browse traffic, which looked likely to provide the majority of the insights.
The previous year during Cyber Monday the site hit a peak hour with an average of 5.1 page views per second. This data point provided a baseline regarding volume the infrastructure needed to handle. Through discussions with Roger’s on their plans for the upcoming shopping season, it was determined the expected bump in traffic from planned marketing efforts would mean doubling the expected page views per second that needed to be handled. However, final plans were to ensure the infrastructure would go beyond this expected traffic bump, so we set a goal to have the capacity of 15 page views per second.
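The target arithmetic above can be sketched in a few lines. This is only an illustration: the function name and the `headroom` parameter are hypothetical, and the 50% headroom comes from the goal stated earlier. The computed figure lands slightly above the round number of 15 that was chosen as the goal.

```python
def capacity_target(baseline_pps, growth_factor, headroom):
    """Return (expected peak, target capacity) in page views per second.

    baseline_pps:  last year's peak-hour average (5.1 in this case study)
    growth_factor: expected traffic multiplier (2.0 = doubling)
    headroom:      extra margin above the expected peak (0.5 = 50%)
    """
    expected_peak = baseline_pps * growth_factor
    return expected_peak, expected_peak * (1 + headroom)


expected_peak, target = capacity_target(5.1, growth_factor=2.0, headroom=0.5)
print(round(expected_peak, 1))  # expected peak after doubling
print(round(target, 1))         # expected peak plus 50% headroom
```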
Solution Components
- Used the pages per second that a server could deliver per percentage of CPU usage as a key performance metric
- Addressed full page cache issues
- Doubled web servers with hosting provider for 3 month period
- Used New Relic APM (Application Performance Monitoring) to track down application performance issues and help diagnose full page cache issues
- Used Datadog to monitor infrastructure server and service performance
Setting an Appropriate Metric
To provide a single metric to roughly gauge the capacity of a server, Classy Llama used the pages per second that a server could deliver per percentage of CPU usage. Page requests were easy to gather both currently and historically (via Google Analytics), and CPU usage was the strongest bottleneck on the web servers. This proved a decent indicator of resource usage for the whole infrastructure. Based on recent Google Analytics pageviews for one hour and server CPU usage from the same period, the new infrastructure appeared able to withstand roughly 10 pageviews per second.
With this load estimate in place, a full 15-minute load test was performed, pushing one of the two web servers to 85% CPU usage. The load test showed that one server could handle 3.82 page views per second. With the Roger’s infrastructure running two similar servers, this equated to a total fully tested load capacity of 7.65 page views per second. This was adequate to support the prior year’s needs, but not enough to reach the goal of 15 pageviews per second.
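A minimal sketch of how this metric can be turned into a capacity projection follows. The function names are hypothetical, and the sketch assumes throughput scales linearly with CPU usage and with the number of servers, which is only a rough approximation. Using the rounded per-server figure of 3.82, it yields about 7.64 for two servers, close to the 7.65 reported above.

```python
def pps_per_cpu_percent(pageviews_per_second, cpu_percent):
    """Page views per second delivered per percentage point of CPU usage."""
    return pageviews_per_second / cpu_percent


def projected_capacity(rate, target_cpu_percent, servers):
    """Project total page views per second at a target CPU level.

    Assumes linear scaling with CPU usage and with server count.
    """
    return rate * target_cpu_percent * servers


# Load test result from the text: 3.82 page views/s on one server at 85% CPU.
rate = pps_per_cpu_percent(3.82, 85.0)
total = projected_capacity(rate, target_cpu_percent=85.0, servers=2)
print(round(total, 2))
```

The same two functions can be reused with production measurements (observed page views per second and average CPU usage) to project capacity from real traffic instead of a synthetic test.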
Addressing the Full Page Cache Problem
Classy Llama expected to see some initial load from uncached pages, followed by a significant improvement once repeated requests were served from the cache. In reality, the team saw a consistent load time on every request for the same page, which indicated a problem with the full page cache. The team measured a TTFB (Time to First Byte) of 2 to 4 seconds on most pages, when TTFB was expected to be under 1 second once a page was cached. Classy Llama recommended investigating and fixing the full page cache issue, given its importance to server performance under load and to the customer experience. With the full page cache fixed, Classy Llama believed the existing two web servers could deliver the performance needed to handle 15 pageviews per second.
The investigation showed that the cache appeared to operate as expected, yet the performance improvement from a page being cached was never realized. Classy Llama discovered that when the Magento system saved a page to the full page cache after building it, empty values were being written to the cache storage. This resulted in what looked like a cache hit; however, because the lookup returned an empty value, the request was processed as a cache miss and the page was rebuilt from scratch. The problem was found in how the core code performed hole punching in the cached page: the implementation was hitting an obscure limit when parsing the page contents.
A simple adjustment to the server configuration increased the limit and allowed caching to operate correctly. This adjustment dropped product page TTFB (Time to First Byte) from 4 seconds to 0.7 seconds. The same load test that previously indicated the two servers could handle 7.6 page views per second now projected a capacity of 89.9 page views per second.
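The failure mode described above can be illustrated with a toy cache. This is not Magento's code: the class, the `parse_limit` setting, and the way the limit is checked are all hypothetical stand-ins chosen only to show how a silently failed parsing step can store an empty value that later reads back as a miss.

```python
class PageCache:
    """Toy full page cache (hypothetical; not Magento's implementation)."""

    def __init__(self, parse_limit):
        self.parse_limit = parse_limit  # stand-in for the parsing limit
        self.storage = {}
        self.rebuilds = 0

    def _punch_holes(self, html):
        # Real hole punching would swap dynamic blocks for placeholders.
        # Here we only model the failure: None when the page is too large.
        if len(html) > self.parse_limit:
            return None
        return html

    def _build_page(self, url):
        self.rebuilds += 1  # count expensive full page builds
        return "<html>" + "x" * 500 + "</html>"

    def get(self, url):
        cached = self.storage.get(url)
        if cached:  # an empty string is falsy, so it is treated as a miss
            return cached
        html = self._build_page(url)
        # A failed parse is saved as an empty value: the key exists,
        # but every later lookup still falls through to a rebuild.
        self.storage[url] = self._punch_holes(html) or ""
        return html


def rebuilds_for(parse_limit, requests=5):
    cache = PageCache(parse_limit)
    for _ in range(requests):
        cache.get("/product/example")
    return cache.rebuilds


print(rebuilds_for(parse_limit=100))     # limit too low: rebuilt on every request
print(rebuilds_for(parse_limit=10_000))  # limit raised: built once, then cached
```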
Doubling Web Servers
Not wanting to take any chances, Roger’s decided not only to fix the full page cache issues but also to double the number of web servers for the coming months. This was accomplished by arranging for two additional identical web servers from the hosting provider for 3 months, bringing the total to four. Better monitoring was also established using Datadog to help diagnose infrastructure performance issues and identify irregularities as adjustments were being made. With all four web servers under the same load test, the two additional servers delivered a projected capacity of 194 page views per second, doubling the previous capacity.
Trust our Experts in eCommerce
- 218% increase in Cyber Monday orders year-over-year
- 290% increase in Cyber Monday revenue year-over-year
- 80% higher average order value year-over-year
Delivered Results
Shortly after the full page cache issues were addressed and the servers were doubled, Roger’s ran a large promotion that tested the improvements well ahead of Cyber Monday. The promotion was a huge success and significantly broke Roger’s sales records for orders, revenue, and average order value per hour and per day. The promotion produced 6.9 page views per second during the peak hour, stressing the infrastructure only to the point where CPU utilization hit 14%. In addition, there were no issues throughout the entire promotion period. Roger’s was ready for Cyber Monday.
When Cyber Monday traffic arrived, Roger’s set new sales records for the second time in the same month. The peak hour for Cyber Monday delivered 10.11 page views per second and only hit an average of 27% CPU usage across all web servers. A more granular analysis of the peak hour, across 20 seconds, saw spikes hitting 14 page views per second, bringing CPU usage to 56% across all web servers. Another system reported that more than 20% of the traffic hitting the site in that period was bot traffic, which would not have been reported to Google Analytics or to the New Relic Browser metrics used to determine pageviews per second. Classy Llama found the application (PHP-FPM) was processing 28 requests per second at the same traffic peak.
Based on the measurements from Cyber Monday, the new projection for the infrastructure at 85% capacity was around 32.43 page views per second. While this was not the 194 page views per second projected by the synthetic load test, it was still a significant improvement.
Outside of a few brief outages due to a cache flush being performed during high loads on the site, the Roger’s Sporting Goods site never slowed down during the largest shopping season the company had ever experienced. Goals were reached, pages were delivered, money was exchanged, and new records were set.