Last year, we published Validating AWS EC2 SQL Server Deployments Using Benchmark Tools. Since then, AWS released enhancements like the Optimize CPUs option that let you choose a larger EC2 instance for your SQL Server 2017 instances for greater memory and IOPS per vCPU while keeping your licensed vCPU count down.
In this blog post, I’ll demonstrate how benchmarking your SQL Server workloads on AWS in a pre-production environment can help you pick the best candidate EC2 instance with an optimized storage configuration based on real data.
It turns out that the process is simple. It’s just a matter of knowing what to look for. I’m going to use the original TPC-C small, medium, and large configurations with 1,000 warehouses to demonstrate this process. At DB Best, we have our DBMSys platform that does this for our customers as Software-as-a-Service or deployed within our customer’s network. What I’ve done is distill this down to the essential steps to optimize your EC2 deployments.
It’s all about CPU, Memory, Disk, and Network at the server level
It starts with these assumptions within the current environment:
- The database has a known size and generally consistent set of transactions per interval over time.
- The database server has a limited amount of memory available.
- The database server has limited speed for the processor with a limited number of cores / virtual cores (vCores).
- The database server storage has limited IOPS and megabytes per second (MBps) of throughput for the volumes running your data, log, and TempDB files. You may also need to consider the IOPS and MBps for your backup files. However, I left this out of the optimization process to keep things simple.
- Network bandwidth should only be a consideration for high network traffic workloads, as a way to limit your choices of EC2 instances.
- You are maintaining your performance service-level agreements.
Each one of these assumptions can be measured for your current environment and then again on AWS for your initial deployment. Then, using the newly collected data from your EC2 instance, you can use this process to optimize your EC2 instances for production.
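Each of these measurable limits can be rolled into a simple fit check against candidate instances. The sketch below is a minimal Python illustration of that idea; the instance specs and workload numbers are hypothetical placeholders, not published AWS figures.

```python
# Sketch of a candidate-instance fit check driven by measured peak demands.
# All specs below are illustrative placeholders; confirm against current
# AWS published instance limits before using numbers like these.

def fits(workload, instance, headroom=0.8):
    """Return True if the instance covers the workload's peaks while
    keeping utilization headroom (default: plan for 80% utilization)."""
    return (workload["vcpus"] <= instance["vcpus"] * headroom and
            workload["memory_gib"] <= instance["memory_gib"] * headroom and
            workload["iops"] <= instance["max_iops"] * headroom and
            workload["mbps"] <= instance["max_mbps"] * headroom)

# Measured peaks from the current environment (hypothetical values)
workload = {"vcpus": 3.2, "memory_gib": 24, "iops": 9410, "mbps": 82}

# Candidate instances (placeholder specs for illustration only)
candidates = {
    "m5d.xlarge":  {"vcpus": 4, "memory_gib": 16, "max_iops": 18750, "max_mbps": 593},
    "r5d.xlarge":  {"vcpus": 4, "memory_gib": 32, "max_iops": 18750, "max_mbps": 593},
    "r5d.2xlarge": {"vcpus": 8, "memory_gib": 64, "max_iops": 18750, "max_mbps": 593},
}

viable = [name for name, spec in candidates.items() if fits(workload, spec)]
print(viable)  # the memory-constrained m5d.xlarge drops out
```

From there, the cheapest viable candidate becomes the starting point for the pre-production benchmark runs described below.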
Tips for limiting your choices for viable EC2 instances for SQL Server workloads
At DB Best, we’ve developed some general rules for limiting the huge number of EC2 instances you’ll want to run SQL Server 2017 on. It goes something like this.
• Start with AWS Nitro based instances that have local SSD. AWS now exposes their local SSD using NVMe for greater performance than the usual SCSI.
• If you can’t find a standard R5 series instance that meets your IOPS requirements, consider using a larger vCPU system like an r5d.8xlarge, which gets you into the 30,000 IOPS range with the Optimize CPUs option. This way you save on SQL Server vCore licenses with Microsoft Software Assurance for SQL Server with license mobility.
• The technique above can also be used to add more local SSD as an option for hosting your data files. If you can fit your data files into locally attached SSD, you only need to provision your log and backup files on gp2 storage. You just need to make sure to provide enough storage to meet the IOPS needed for log writes.
• Using the current vCPUs as a starting point, pick an instance that fits within your memory and IOPS requirements. Look for the least expensive option between the m5d, c5d, r5d, and i3 instance types and try it out first.
• Benchmark the result and look to reduce vCPUs and IOPS to match the behavior of your pre-production instance. For vCPU reduction, we recommend keeping it simple and looking for 25%, 50%, and 75% reductions, especially if you can reduce the vCPUs to fit within the vCPU limit of SQL Server Standard Edition. The assumption is that you aren’t depending on Enterprise features. Even then, reducing Enterprise vCPUs can make a huge difference to your bottom line.
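The vCPU-reduction rule above can be sketched as a quick calculation. SQL Server Standard Edition is limited to the lesser of 4 sockets or 24 cores, so on EC2 the practical ceiling is 24 vCPUs; the helper below is an illustrative sketch, not part of our tooling.

```python
# Generate the 25%, 50%, and 75% vCPU-reduction candidates and flag which
# fit under the SQL Server Standard Edition compute limit (the lesser of
# 4 sockets or 24 cores; on EC2, cores are the binding constraint).

STANDARD_EDITION_VCPU_LIMIT = 24

def reduction_candidates(current_vcpus):
    """Return (vcpus, fits_standard_edition) for 25/50/75% reductions."""
    results = []
    for pct in (0.25, 0.50, 0.75):
        vcpus = int(current_vcpus * (1 - pct))
        results.append((vcpus, vcpus <= STANDARD_EDITION_VCPU_LIMIT))
    return results

# Starting from a 64 vCPU Enterprise-sized instance:
print(reduction_candidates(64))  # → [(48, False), (32, False), (16, True)]
```

Here only the 75% reduction would qualify for Standard Edition licensing, which is exactly the kind of licensing cliff worth benchmarking toward.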
Sneak preview on the performance gains with great cost per transaction value
The following is a summary of the comparison of the results between last year’s instance types versus a fully optimized instance type that takes advantage of using local SSD for data and TempDB files.
Comparison of 2018 to 2019 runs of the TPC-C benchmark with 1,000 warehouses using HammerDB
Figure 1 – Comparison of 2018 to 2019 runs of the TPC-C benchmark with 1,000 warehouses using HammerDB
We essentially relied on the R5d instance family with local SSD that could fit the entire database and tempdb files to cut gp2 costs. But if you can optimize your gp2 storage for faster disaster recovery, the process that I’m going to show you can make a significant difference in performance per cost.
Setting up your benchmark environment
By using HammerDB, we are able to set up a steady-state environment for running an online transaction processing system with the TPC-C benchmark. You can read last year’s blog post to understand our test environment using HammerDB.
Collecting performance counters for benchmarking SQL Server
As much as I would like everyone to use our DBMSys platform, there are other great tools you can license or get from open-source projects. In our case, we used Windows typeperf that comes with Windows Server. We automated the execution as a PowerShell job like this:
In this case, the T: drive is the local SSD on the SQL Server’s EC2 instance. This command collects the performance counters defined in the Windows_PerfMonCounters.txt file every 60 seconds. Here is a link to the file we used – https://1drv.ms/t/s!ArrjT1GQ1oiajqxY4_KAkmlpQV7Vmw?e=cA3Iw0.
After the test run, our automation script uploaded the data to an AWS S3 bucket. I then loaded the data into Power BI and used the magic of the Table.UnpivotOtherColumns function to make the data usable. This is a topic for another blog.
Measuring the performance of the EC2 CPU
While AWS publishes an ECU value that represents a level of CPU power between instances, you can’t really tell the power of the actual cores. Steve Shaw is the primary contributor to HammerDB. Last year, after we had completed our round of testing, Steve wrote the following blog post – https://www.hammerdb.com/blog/uncategorized/hammerdb-best-practice-for-sql-server-performance-and-scalability/. He proposed using a Single Threaded Performance test using SQL Server T-SQL code as shown below.
For each EC2 instance that we tested, we would run this code 10 times and then average the results. Rather than relying on Intel/AMD CPU specs or running abstract CPU benchmarks, this test was an easy way to measure the CPU performance of the EC2 instance.
Measuring the raw performance of the disks
We set out to use Windows Diskspd to measure the performance of the data, log, and tempdb drives used on the EC2 instance. We used the guidance from long-time SQL MVP Glenn Berry in the blog he posted at https://sqlperformance.com/2015/08/io-subsystem/diskspd-test-storage to run Diskspd. Here is what the command line we used with PowerShell looks like, right after configuring the disks with Windows Storage Spaces.
The problem is with the -b64k option. This command uses 64 KB block sizes to simulate the way SQL Server reads and writes data for the buffer pool. AWS publishes gp2 IOPS performance using 16 KB block sizes! Because of the -w25 option, which weights 25% of the operations as writes and the other 75% as reads, the total IOPS reported for the drive comes out at around one-third of the published specification. However, the LogicalDisk(_Total)\Disk Transfers/sec performance counter ends up aligning with the actual IOPS value for the gp2 drive performance.
Instead, I actually relied on the total latency values reported to make sure there were no issues in the way we striped the disks using Windows Storage Spaces. Here is an example of what we saw on an r5d.xlarge with 6 striped 1,042 GB drives to get to the max IOPS of 18,750 = 6 disks * 1,042 GB * 3 IOPS/GB. OK, the actual number comes out to 18,756. The extra 6 IOPS were a rounding error.
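The striping math above can be sketched in a few lines, assuming gp2’s published 3 IOPS per GB rate and 16,000 IOPS per-volume cap:

```python
# gp2 provisions 3 IOPS per GB, capped at 16,000 IOPS per volume, and a
# Windows Storage Spaces stripe sums the per-disk figures (up to the EC2
# instance's own EBS IOPS limit, which is checked separately).

GP2_IOPS_PER_GB = 3
GP2_MAX_IOPS_PER_VOLUME = 16000

def striped_gp2_iops(disk_count, disk_size_gb):
    per_disk = min(disk_size_gb * GP2_IOPS_PER_GB, GP2_MAX_IOPS_PER_VOLUME)
    return disk_count * per_disk

# Six 1,042 GB gp2 drives striped with Storage Spaces:
print(striped_gp2_iops(6, 1042))  # → 18756
```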
I focus my attention on the 75th through 99th percentile latency values as an indication of the quality of the AWS Nitro instance fabric compared to non-Nitro instances.
With these tools in place, you can now measure your existing production environment and the target environment with a pre-production test run to evaluate your current and future state on EC2.
Review of last year’s results
Using the data from our runs last year, we decided to use a slightly different method of computing the price per TPC-C result. We added a new metric for the TPC-C results called New Orders Per Minute (NOPM). As per the documentation for HammerDB, NOPM is a better metric to compare different database engines. This is because HammerDB implements the TPC-C benchmark differently for each database engine. However, NOPM is based on a metric captured from the test schema for the New Orders generated.
We are looking at just SQL Server as part of our benchmarking for this blog. However, as part of our Cloud Migration practice, we use NOPM to help customers determine how to right-size their non-SQL Server database workloads for the cloud. Besides, a metric based on New Orders is easier to understand, since that is the true test of how fast the platform processes revenue-generating transactions.
In addition, we are using $/hr for the cost of “Pay As You Go” (PAYG) Windows virtual machine, cost of storage, and PAYG SQL Server license. In the new world of DevOps and microservices architectures, there is a trend in the industry to run EC2 instances and EC2 containers for short durations with pooled storage. So, using a $/hr metric is more relevant than an inflated $/month value like we used last year.
Using a realistic $ per TPC-C result with New Orders Per Hour (M)
The new metric we use to help customers for capacity planning purposes is PAYG $ / New Orders (M) / Hour. We take the PAYG $/hr for the test system and divide it by NOPM * 60 / 1,000,000 (New Orders (M)). We include two sets of values:
• PAYG $ / New Orders (M) / Hour includes Windows OS + EC2 Instance + Storage. This is used for SQL Server instances that are covered under Microsoft’s license mobility program. So, there is no extra cost for running SQL Server on EC2 instances.
• PAYG SQL $ / New Orders (M) / Hour includes the SQL Server license costs for every four vCPUs: SQL Server Standard Edition for the small and medium VM sizes, and SQL Server Enterprise Edition for the large VM size with 64 vCPUs.
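As a minimal sketch of the metric itself, with hypothetical cost and NOPM numbers:

```python
# PAYG $ / New Orders (M) / Hour: divide the hourly pay-as-you-go cost
# by the New Orders generated per hour, expressed in millions.

def payg_per_new_orders_m_hour(payg_usd_per_hour, nopm):
    new_orders_m_per_hour = nopm * 60 / 1_000_000
    return payg_usd_per_hour / new_orders_m_per_hour

# Hypothetical example: a $2.50/hr configuration sustaining 150,000 NOPM
# generates 9 million New Orders per hour.
cost = payg_per_new_orders_m_hour(2.50, 150_000)
print(round(cost, 4))  # → 0.2778
```

A lower value means more revenue-generating transactions per dollar, which is what the comparisons below optimize for.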
Here are the current license costs for SQL Server for on-demand instances (PAYG) as of the date shown in the table below.
Let’s now compare the 2018 results where storage was not optimized for the maximum IOPS available for the VM using SQL Server 2016 versus the 2019 results with SQL Server 2017 with cloud storage optimized for performance. These results are for runs of 3, 5, 8, 13, 21, 34, 55, 89, 144, and 233 virtual users.
Updated 2018 test results with the PAYG $ / New Orders (M) / Hour
In the first chart, you can see the original throughput for the 2018 run using the New Orders Per Hour (millions) – NOPM (M) – results for small, medium, and large VM sizes.
Using NOPM (M) for the 2018 AWS results for TPC-C with 1,000 warehouses
Hindsight is always 20-20 – New baseline test needed for the original systems
For last year’s benchmark testing, we did not use the temporary SSD storage for TempDB, since we limited our testing to systems without local SSD. We’ve since learned that this can also help performance when reducing SQL Server vCPUs. In our cloud practice, we now recommend always using virtual machines that support local temporary SSD for TempDB.
Introducing Maximum Disk Optimization Benchmarking (MaxDOB)
If you don’t have any existing performance data to work from, I recommend the MaxDOB process: provision storage for the maximum IOPS possible for the VM under test. This way, I can see if the instance hits the storage limit, or other limits like CPU performance and available memory, for the given types of transactions and size of the database.
Again, I’m trying to reduce the potential variables for the EC2 instance under evaluation. For example, for every VM under test, we install a SQL Server startup procedure for the SQL Server instance as follows:
IF NOT EXISTS (SELECT 1
               FROM [INFORMATION_SCHEMA].[ROUTINES]
               WHERE [ROUTINE_NAME] = 'sp_sql_startup_parameters'
                 AND [ROUTINE_SCHEMA] = 'dbo'
                 AND [ROUTINE_TYPE] = 'PROCEDURE')
    EXEC ('CREATE PROCEDURE [dbo].[sp_sql_startup_parameters] AS RETURN(0)')
GO
ALTER PROCEDURE [dbo].[sp_sql_startup_parameters]
AS
/*
** This stored procedure will run at startup time and set min/max memory
** settings, MAXDOP, and other base configuration values.
*/
-- Declare variables
DECLARE @maxdop SMALLINT
DECLARE @total_memory_mb BIGINT
DECLARE @max_memory_mb BIGINT
DECLARE @min_memory_mb BIGINT
-- Identify [hyperthread_ratio] and [physical_memory_kb] from [sys].[dm_os_sys_info]
SELECT @maxdop = [hyperthread_ratio]
     , @total_memory_mb = [physical_memory_kb] / 1024
FROM [sys].[dm_os_sys_info]
-- Reserve 2,000 MB for the OS
SELECT @max_memory_mb = @total_memory_mb - 2000
-- If we are running on Linux, use the maximum memory presented to SQL Server from Linux.
-- On Windows we are presented with all memory, but on Linux that memory limit is 80%
-- or defined by memory.memorylimitmb in the configuration file.
IF (RIGHT(@@VERSION, LEN(@@VERSION) - 3 - CHARINDEX(' ON ', @@VERSION)) LIKE 'Linux%')
    SELECT @max_memory_mb = @total_memory_mb
-- Set the min memory to 1/4 of the max memory.
SELECT @min_memory_mb = (@max_memory_mb / 4)
-- Set sp_configure values.
EXEC sp_configure 'show advanced', 1
RECONFIGURE WITH OVERRIDE
EXEC sp_configure 'backup compression default', 1
EXEC sp_configure 'contained database authentication', 1
EXEC sp_configure 'cost threshold for parallelism', 50
EXEC sp_configure 'max degree of parallelism', 0
EXEC sp_configure 'max server memory (MB)', @max_memory_mb
EXEC sp_configure 'min server memory (MB)', @min_memory_mb
EXEC sp_configure 'remote admin connections', 1
-- Reconfigure to read the new memory values.
RECONFIGURE WITH OVERRIDE
GO
-- Register the procedure to run at SQL Server startup.
EXEC sp_procoption @ProcName = 'sp_sql_startup_parameters'
                 , @OptionName = 'STARTUP'
                 , @OptionValue = 'on'
This way, max server memory (MB) leaves 2,000 MB available for the Windows OS. This is plenty of memory for the OS, since I’ve never observed any paging of OS memory in the performance counters.
Likewise, we set max degree of parallelism (MAXDOP) to 0. Many people who test HammerDB force MAXDOP = 1 to avoid deadlocking. This is generally a bad idea in the real world, so we go with the default value for SQL Server.
Microsoft recently updated KB article 2806535 – Recommendations and guidelines for the “max degree of parallelism” configuration option in SQL Server with recommendations for SQL Server 2016. It recommends values that are the same as running with the default of using all logical processors for our 4, 16, and 64 vCPU configurations.
Likewise, for our AMI instances used in testing, we make sure that the Power Options in the Control Panel are set to High performance.
Now for the standard answer from good DBAs – It Depends
There are lots of other factors that could come into play that could improve performance of the system like: manual soft-NUMA, SQL Server edition, trace flags galore, read/write disk caching with Storage Spaces (great for gaming scenarios, but it sounds dangerous for business critical databases), Memory-Optimized Tables, columnstore indexes, In-memory clustered columnstore indexes, and this list goes on.
Each change changes the overall memory, storage, and CPU performance and requires another round of benchmarking. At some point, disk performance in terms of IOPS and throughput can end up as a bottleneck.
The MaxDOB process focuses on starting out by maximizing storage for the IOPS possible for the instance.
The following table shows how we should have configured additional storage to match the maximum IOPS for the VM – VM Max IOPS.
Additional gp2 storage needed to match VM maximum IOPS
Figure 4 – By comparing the 2018 benchmark gp2 disk IOPS to the maximum IOPS for the VM, here is how much additional storage is needed to match the VM
As you can tell, we ran the entire set of benchmarks without enough storage to match the IOPS limits of the instances, especially for the r4.16xlarge instance.
So, let’s see what happened with the m5.xlarge when we provisioned the instance with 6 gp2 drives of 1042 GB and a log drive volume with 512 GB.
Figure 5 – Using MaxDOB to provision storage for a new performance baseline.
The good news is that the new configuration runs 1.2 times faster. The bad news is that the system costs 83% more on the $ / New Orders (M) / Hour metric. Yes, there is such a thing as too much of a good thing.
Here was the profile for the updated run by user grouping.
Figure 6 – 2019 results for NOPH (M) by number of virtual users running at a time
Compare this to 2018’s results.
Figure 7 – 2018 results for NOPH (M) by number of virtual users running at a time
In 2018, the best result topped out with 21 virtual users running at once. In 2019 with more gp2 storage for IOPS, it topped out at 34 users.
So, where is the bottleneck? Here are the performance counters for the Windows Server and SQL Server CPU utilization.
Figure 8 – CPU activity during the m5.xlarge benchmark run
Clearly, the CPU capacity of the 4 vCPU m5.xlarge instance was a limiting factor in the ability to create more NOPH (M). However, there may be other bottlenecks at play.
So, what do the IOPS look like overall and for the different drives? The E: drive included the data and tempdb files. The L: drive included the log files. For the test scenario with small VMs, we used a 512 GB log drive, which should deliver a baseline of 1,536 IOPS.
Figure 9 – Look at disk operations for the logical disk with disk reads/writes for the E: data and L: log volumes
Notice that the 8th group in the chart above shows a noticeable drop in Disk Writes/sec for the log drive. What happened is that since the log drive size is under 1 TB, we initially got the advantage of bursting to 3,000 IOPS on the drive.
The following table shows the maximum values for the reads and writes across the entire run.
Figure 10 – Disk operations for the Logical Disk versus the disk operations for the E: data and L: log drives
If I apply a filter to the results on or after 10:30 PM, you can see the throttling of the IOPS with the maximum value.
Figure 11 – Disk operations after exhausting the initial burst bucket
Clearly the Burst Bucket impacted the IOPS for the log drive.
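The burst-bucket behavior can be modeled from AWS’s published gp2 numbers: a 3 IOPS per GB baseline, a 3,000 IOPS burst ceiling for volumes under 1 TB, and an initial bucket of 5.4 million I/O credits. This is a sketch of that model, not a measurement:

```python
# Model how long a gp2 volume under 1 TB can sustain burst IOPS before
# the burst bucket drains. Credits refill at the baseline rate, so the
# bucket drains at (sustained rate - baseline rate) credits per second.

GP2_BURST_IOPS = 3000
GP2_INITIAL_CREDITS = 5_400_000  # initial I/O credit balance

def burst_minutes(volume_gb, sustained_iops=GP2_BURST_IOPS):
    baseline = max(100, volume_gb * 3)  # gp2 baseline: 3 IOPS/GB, 100 minimum
    drain_rate = sustained_iops - baseline
    if drain_rate <= 0:
        return float("inf")  # the workload fits within baseline; never drains
    return GP2_INITIAL_CREDITS / drain_rate / 60

# The 512 GB log drive (baseline 1,536 IOPS) driven at the 3,000 IOPS burst:
print(round(burst_minutes(512)))  # → 61 minutes
```

Roughly an hour of burst before falling back to baseline lines up with the throttling visible after the initial groups of virtual users.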
If you look at the spike in the last group of virtual users, the maximum value for IOPS was only 9,410. The following gauge charts show the maximum values.
Here is the disk usage report for the 1000 warehouse database.
Disk usage report for original testing for TPC-C 1,000 warehouses used for small, medium, and large VM sizes
Figure 12 – Disk space used for TPCC 1,000 warehouse database
Given that the data space actually used was only 78.48 GB, the m5.xlarge instance simply ran out of CPU power.
To optimize, we need to scale back on the gp2 storage. In addition, we moved the log files to the data drive, so that both sets of files can take advantage of the IOPS for the single volume.
There are two factors in setting up the storage.
• What throughput is needed? In this case, we need about 81.5 MBps. Any gp2 drive over 334 GB delivers 250 MBps.
• What IOPS are needed? The maximum value during the test was 9,410. With gp2 drives, 5,334 GB delivers the maximum IOPS per drive of 16,000 IOPS. Since that provides more IOPS than needed, I use my target of 9,500 IOPS / 3 IOPS per GB to get a gp2 disk size of 3,167 GB.
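The two sizing checks can be combined into a small helper. This is a simplified sketch: it treats any throughput requirement over 128 MBps as needing the 334 GB size that unlocks the full 250 MBps, and it ignores gp2 burst behavior.

```python
import math

# Size a gp2 volume for the larger of the IOPS requirement (3 IOPS/GB,
# 16,000 IOPS per-volume maximum) and the throughput requirement
# (simplified: <= 128 MBps needs no extra size; more needs >= 334 GB).

def gp2_size_gb(required_iops, required_mbps):
    if required_iops > 16000:
        raise ValueError("exceeds gp2 per-volume maximum; stripe volumes")
    size_for_iops = math.ceil(required_iops / 3)
    size_for_mbps = 334 if required_mbps > 128 else 1
    return max(size_for_iops, size_for_mbps)

# 9,500 IOPS target with ~81.5 MBps of throughput:
print(gp2_size_gb(9500, 81.5))  # → 3167
```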
Here are the results with the new configuration. The bottom row shows the comparison between the revised, smaller storage configuration and the fully loaded one. The revised configuration delivers better performance than the fully loaded VM, but is still a small fraction more expensive: the -14% translates to $1.68 per month. With that small difference and the better performance, I’d call it a win.
What are the results for the medium run?
What are the results for the large run?
More to follow