When people talk about the cloud, chances are they are referring to AWS (Amazon Web Services), currently the largest hosting company, with an estimated revenue of $33B in 2019.1
The “cloud” is a marketer’s term for shared hosting. Cloud providers make it relatively easy to set up and connect to virtual servers. This makes it possible to quickly scale up or down the amount of server resources needed; however, this abstraction causes some bottlenecks that aren’t present on bare metal servers.
EC2 is a service that allows people to provision virtual servers of various sizes, ranging from a few dollars a month to several thousand depending on the resources available, such as CPU cores and memory. Each of these virtual servers is considered ephemeral; meaning, if it is stopped and then restarted, it may boot up on a different physical host server. When this happens, its IP address changes and any data on a local drive is gone. (Note: “stopping” an EC2 instance is more destructive than a simple reboot, where the virtual server generally remains on the same underlying host.) Amazon provides a utility called an Elastic IP, which is a public IP address that can be moved to the new instance.
The list of instance types2 that can be set up is ever growing and changing; however, there are a few classes of instance types that are worth discussing.
Since most queries simply scan over tables or indices and return the results, memory is the first bottleneck most small to medium sized databases hit. Keeping most, if not all, data in memory can give a database a dramatic speed boost. Hard drives, especially EBS drives on the network, are far slower than RAM; therefore, the Memory Optimized class of instances is an excellent choice, with the “R” class (R5, R5a, etc.) being the most common. These instances provide an increased amount of available RAM for the price.
Once the amount of data is too large to fit in memory and EBS bottlenecks start to be reached, the Storage Optimized classes become valuable. These provide ephemeral Local Storage drives, which are currently super fast NVMe drives. Because data will be lost on a local storage drive if the VM crashes or is stopped and restarted, care must be taken to ensure that there are proper safeguards.
Databases that can cluster, such as Elasticsearch or Dgraph, can take advantage of these with relative safety, since the databases themselves handle sharding and load balancing. For relational databases like Postgres or MySQL, a synchronous follower should be set up to minimize risk.
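For Postgres, a synchronous follower comes down to a couple of settings on the primary. A minimal sketch, assuming a standby that connects with the hypothetical `application_name` of `replica1`:

```
# postgresql.conf on the primary (hypothetical standby name 'replica1')
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (replica1)'
```

With this in place, a commit is not acknowledged until `replica1` confirms it has received the WAL, so losing the primary's local NVMe drive does not lose acknowledged transactions.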
Another great use case is databases that do a lot of transforming rather than storing Source of Truth data. Such a database is ephemeral in nature and therefore benefits greatly from this type of server.
Using these storage optimized classes isn’t for everyone, but for those who need every last IOPS or GB/s of throughput available, they are an excellent option. Currently, I3en instances are cheaper than R5 instances with a large EBS drive.
Some databases have been developed to use GPUs such as OmniSci3. These enable incredibly fast table scans making it possible to have databases that don’t require indices to speed up queries. While OmniSci has a community edition, the actual instance type can be relatively expensive.4
Because EC2 instances are ephemeral, people need a different way to persist data to disks that will survive a virtual host crash. Amazon’s solution to this is EBS, which is essentially a Storage Area Network (SAN): a large array of disks that live on the network. This makes it possible to create a virtual drive with however many GBs you want to pay for. The current upper limit is 16TB per drive.5
EBS currently has 4 types of drives available, at different costs6. The cheapest and slowest is the Cold HDD volume; the fastest, and far more expensive, is the Provisioned IOPS SSD volume.
For databases, I almost always use General Purpose SSD (GP2) drives. I’ve switched to Provisioned IOPS in dire situations where the system was going to crash without it, but only for a limited time until the underlying problem was identified and fixed. 9 times out of 10, that means finding a bad query and optimizing it.
EBS drives have 2 bottlenecks that everyone needs to be aware of: IOPS and throughput.
IOPS stands for Input/Output Operations Per Second. This limit becomes a bottleneck more often on transactional databases that are running many small queries.
With Provisioned, you specify how many IOPS you need and pay for the privilege. Currently the maximum is 64,000.
With GP2, it depends on how big your drive is: for every GB, you get a baseline of 3 IOPS, with a current minimum of 100 and a maximum of 16,000. This means that if your drive is 500 GB, your limit is 1,500 IOPS, and so on. There is also the concept of “Burst Credits” for drives smaller than 1TB: whenever you use less than your baseline, credits accrue to your drive, and when you need more IOPS than your baseline, up to a max of 3,000, you spend those credits until they run out. This is great for short bursts.
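The baseline rule can be expressed as a one-line calculation. This is a sketch using the figures quoted above (3 IOPS/GB, a floor of 100, a cap of 16,000), which AWS may change over time:

```python
def gp2_baseline_iops(volume_gib: int) -> int:
    """GP2 baseline IOPS: 3 per GiB, minimum 100, maximum 16,000."""
    return max(100, min(3 * volume_gib, 16_000))

print(gp2_baseline_iops(500))     # 1500
print(gp2_baseline_iops(20))      # 100  (minimum applies)
print(gp2_baseline_iops(10_000))  # 16000 (capped at the maximum)
```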
The throughput limit becomes a bottleneck when running queries that have to scan a lot of data on disk. This is common when queries do large or recursive table scans, use large CTEs (Common Table Expressions), or run large subqueries.
With GP2, the max speed is currently 250 MB/s7 with a drive that is 334 GB or larger and an I/O size of 256KB. Smaller drives have a decreased max throughput, following the formula: Throughput = (Volume size in GiB) × (IOPS per GiB) × (I/O size in KiB). Since I/O is often smaller than 256KB, the actual maximum will be somewhat less than this number, or will require a drive that is quite a bit bigger.
With Provisioned IOPS, the top throughput limit increases to 500 MB/s on drives with less than 32,000 IOPS, and up to 1000 MB/s as provisioned IOPS approach 64,000. Based on the formula, in theory you can see 500 MB/s of throughput at around 2,000 Provisioned IOPS if each I/O is 256KB. I/Os are often smaller than this, so the actual IOPS would need to be higher to get close to the 500 MB/s limit.
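The throughput math above can be sketched as a small helper: achieved throughput is IOPS times I/O size, capped at the volume type's published limit. The constants are the figures quoted in this section:

```python
def ebs_throughput_mib_s(iops: int, io_size_kib: int, cap_mib_s: float) -> float:
    """Effective throughput = IOPS x I/O size, capped at the volume's limit."""
    return min(iops * io_size_kib / 1024, cap_mib_s)

# GP2 at 334 GiB: 334 * 3 = 1002 baseline IOPS; 256 KiB I/O hits the 250 MB/s cap
print(ebs_throughput_mib_s(1002, 256, 250))   # 250
# Provisioned IOPS: ~2,000 IOPS at 256 KiB reaches the 500 MB/s tier limit
print(ebs_throughput_mib_s(2000, 256, 500))   # 500.0
```

Dropping the I/O size to, say, 64 KiB in the second call shows why smaller I/Os need far more IOPS to approach the same throughput.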
S3 is an object storage service. Pricing is based on how much storage is used vs. how much is provisioned. It’s excellent for backup files and storing blobs like images and videos. Amazon provides a service called Athena8, which makes it possible to run SQL over files in an S3 bucket. In this model, pricing is based on the amount of data scanned, and there is no server cost associated with it. I haven’t had a need to try it yet; however, it may be an effective way to run a small number of queries on archived data such as log or backup files.
RDS9 is a good option for people who need a database but don’t want to manage it themselves. Amazon provides built-in functionality to upgrade the database software, change instance types, fail over to a cold standby, and set up an asynchronous follower with a few clicks.
Backups, which take a snapshot of the disk, are easy to set up; however, when the data is large (e.g. a few TB), I’ve seen backups take more than 24 hours to complete and lock you out of editing any config, because the next backup starts immediately after the first one completes.
Instances are more expensive than their EC2 counterparts, and a failover server cannot be queried. This doubles the price for high availability without providing any load balancing. For load balancing, an additional follower needs to be set up.
RDS does not allow SSH access or custom plugins. CloudWatch metrics are available as with other EC2 instances, plus some unique database-related metrics. Query logs are dumped into many fragmented files on S3; it’s up to you to compile and parse these for query log analysis.
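Compiling those fragments is mostly plumbing. A minimal sketch, assuming the log fragments have already been downloaded locally (e.g. via the AWS CLI) and carry a hypothetical `.log` suffix whose lexical order matches their time order:

```python
from pathlib import Path

def compile_logs(log_dir: str, out_file: str) -> int:
    """Concatenate fragmented log files (sorted by name) into one file.

    Assumes fragments were already downloaded into log_dir.
    Returns the number of fragments merged.
    """
    fragments = sorted(Path(log_dir).glob("*.log"))
    with open(out_file, "w") as out:
        for frag in fragments:
            out.write(frag.read_text())
    return len(fragments)
```

From there, the combined file can be fed to whatever query-log analyzer you prefer.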
RDS provides the ability to change some of the database config, but not everything is editable.
The bottom line is that RDS is a good place to start to get up and running quickly, but when Milliseconds Matter, managing the server yourself is worth the effort.