Amazon Web Services (AWS) offers several options for databases—so many, that choosing one can be overwhelming, even for developers skilled in all things AWS. Where do you begin? Today let’s look at the different relational (SQL) types of databases available on AWS, along with the pros and cons of each.
Note that you need to understand the concept of availability zones on the platform. AWS has data centers all over the planet, known as regions. Within regions, there are multiple data centers; each individual data center within a region is called an availability zone. The idea is that if one data center goes down in another region, another will be there up and running. When you provision servers within a single region, they can span multiple availability zones for easy replication.
You also need to understand the concept of “serverless” as it relates to AWS. At its most basic level, this means that you don’t have to provision your own EC2 servers; you don’t have to create a Linux or Windows system that’s running continuously where your database software runs. Instead, the software runs on AWS’s own servers, and for the most part you’re only charged for the time the servers are running.
With databases, you will typically have a main one through which write requests are sent, corresponding to your inserts and updates. Read requests (corresponding to select statements) can take place either in the main one or in any replicas called “read replicas”. In terms of AWS, the read replicas are typically distributed across availability zones within a single region.
Finally, when using MySQL-based databases, you need to understand the concept of database engines. These are ways, for example, that MySQL stores its data. The default and most common is InnoDB. The second-most-common is MyISAM. You can find more information here.
And one final tip for all the databases: Set up the security appropriately so that bots can’t randomly connect to your database and start trying to gain access. Lock down the IP address access by setting up the security groups. Not sure about security groups? Start here.
Relational databases have been around for decades. The most common names are Oracle and SQL Server (both proprietary and expensive), and MySQL, MariaDB, and PostgreSQL (all free and open source). Here are the managed SQL database options:
Amazon RDS: Launched in 2009, this was the first of the managed databases on AWS. You get to choose your database software, specifically MySQL, PostgreSQL, Oracle, SQL Server, or MariaDB. The general idea with RDS is that you can launch a database in one region with multiple physical servers separated between availability zones. One will be the main server; the others will be read replicas. In the event the main one goes down, another will become the main. Additionally, high volumes of reads can use the read replicas while writing takes place in the main. Note that although AWS has evolved with more modern approaches to database distributions (see Aurora, below), the “older” style such as RDS is still perfectly valid and very cost-effective.
When you launch a database under RDS, you can control how many servers will be running (with one being the primary writable database, and several read replicas). Although you’ll generally want more than one server, the cold, hard reality is that you’re paying per hour for each server. This can add up very quickly and end up costing you far more than you might have planned for. Most database experts recommend three servers total; however a lot of smaller shops do just fine with just two servers.
Also note that to get the most out of RDS (including productivity and cost savings), you need to understand how Virtual Private Clouds (VPCs) work.
- Incredibly easy to set up and use.
- Data storage is very cheap, on the scale of dimes per gigabyte per month.
- Although storage is cheap, the servers themselves can get expensive very quickly! You are charged a flat amount per hour based on the size of the server.
- You need to monitor the systems manually, especially for cost analysis.
You can find RDS pricing details here.
Reality check from the author of this article: In the past, I found that launching my own EC2 instances and installing MySQL manually was more cost effective than using RDS. However, others have disagreed as long as they monitor the RDS costs carefully. Either way, monitor your usage so you don’t get a shockingly huge bill at the end of the month.
Amazon Aurora: Launched in 2014, this is a newer generation of managed database systems featuring either MySQL and PostgreSQL compatible (meaning these are Amazon’s own builds of the source code, as opposed to official MySQL and PostgreSQL releases).
- Fully managed, meaning you don’t need to install the software or the servers.
- Backs up data continuously to S3.
- Scales automatically (both up and down) with a minimum of 10GB up to a max of 64TB based on how much data you store. Can scale among multiple availability zones within a region.
- Up to 15 read-replicas available, with replication happening usually under 100 milliseconds.
- Automatic fail-over.
- Encryption available both for the data as it sits on the SSD drives as well as while the data is being transmitted to another server.
- Uses only the InnoDB database engine.
- Although Amazon claims it’s cost-efficient, watch for hidden costs that can drive up the cost significantly. For example, if you are only doing a few database operations per hour, each operation will get a minimum of five minutes of usage charged.
In terms of SQL databases, it generally comes down to RDS vs Aurora. However, there’s another big offering.
This database offering is primarily for data warehousing, and it’s meant for much larger databases compared to RDS and Aurora. It includes advanced data analytics. Small queries can take time on Redshift; however, if you’re querying massive databases with millions of rows, queries will ultimately be faster on Redshift than RDS and Aurora, and this is where Redshift shines.
However, Redshift is much more expensive per hour compared to RDS and Aurora. It’s really intended for extremely large corporations with petabytes of data. If you’re involved in a small startup, this choice is definitely not for you.
- Can store petabytes worth of data.
- Fast when doing massive queries across millions of rows of data.
- Slow for smaller data sizes.
- More expensive than RDS and Aurora.
Among the offerings, the vast majority of people will choose either RDS or Aurora. Both work great; RDS is more cost-effective, but requires a bit more work setting up compared to Aurora. Aurora, on the other hand, is much easier to run, but can result in larger-than-expected bills at the end of the month. In either case, RDS or Aurora, you will want to continuously monitor your cost and usage throughout the month.
Redshift is in a class by itself and is built for extremely large amounts of data (think petabytes). It’s very expensive, but also includes sophisticated data analytics. Redshift is mostly used by large corporations that have massive amounts of data alongside massive amounts of money.
In any case, if you’re prepping for a certification exam or hoping to land a job, you’ll want to become familiar with all three types. You can certainly practice with Redshift; you’ll just want to only run it for short periods of time.