UUIDs, or universally unique identifiers, are 16 byte numbers that are represented by 32 hexadecimal digits separated by hyphens in the form of 8-4-4-4-12, with the number being the number of characters before the hyphen. For example:
In the number, there are a few significant bits. The M is a 4 bit number that represents the UUID version, and the 1-3 most significant bits in the digit N is the UUID variant. In our example, it’s a version 4, variant 1 UUID, which means it’s a randomly generated UUID.
Microsoft calls these GUIDs or globally unique identifiers. They’re exactly the same, but nearly anyone outside of Microsoft calls it a UUID, so that’s how I’ll refer to them.
It is possible for UUIDs to have collisions, i.e. 2 are generated with the same number randomly; however, this is an extremely small chance as there are a total of 2¹²² possible UUIDs.
A sample of 3.26*10¹⁶ UUIDs have a 99.99% chance of not having any duplicates. Generating that many UUIDs, at a rate of one per second, would take a billion years. - Ludi Rehak 1
Typically, a database will use integers for its primary and foreign keys. Integers are smaller numbers than UUIDs, which makes them use less memory and scan faster. We’ll get into the specifics a little later on this page; however, there are several valid reasons to choose a UUID over an integer.
UUIDs are used when different systems have to talk to each other using a common ID among all of them. If auto increment integers are used, these would easily be repeated in different systems and the meaning of the number couldn’t be guaranteed to be the same.
As an example, let’s assume there are 2 different systems that both have to maintain a copy of the
users table. If a user is ever inserted out of order in either system, the ID will be different. Also the auto incrementing sequence is often advanced when doing updates so new inserts aren’t guaranteed to get the same ID even if they are inserted in the same order.
A variant on the exact copy of a table detailed above is having a foreign reference to an object in another database. Perhaps the system doesn’t need an exact copy of the
users table, but it stores additional data about the user. Using a universally unique ID will ensure that it won’t collide with other systems.
Another common problem is when running tests using “magic numbers”. For example, if the test ran this query:
SELECT * FROM users WHERE id = 42;
The test is assuming the ID will always be 42. If an additional test is created that writes to the users table or tests are run in a random order, the ID will likely end up different than 42. This is a bad example, because nobody writes tests with magic numbers. Right?
There are 2 ways to deal with this. The first is to not query off of ID at all. Instead, do something like
SELECT * FROM users WHERE name = 'Brian'; where the value can be guaranteed to be the same every time.
The other is to use a UID like this:
SELECT * FROM users WHERE uid = '49373f75-65ef-40d8-a700-38ae8cf5d032';
Notice I named the column UID for universal ID. This is the naming convention 2 I use for any universal ID since UIDs aren’t always of the type UUID.
Sharding a database is a strategy used to create many database clusters with the same table structure in order to spread the write load across them. This happens often in large SaaS environments where one or more customers are assigned to specific shards. Often, small customers are grouped together on a shard, while larger customers might be isolated to their own shard.
When a customer outgrows its available server resources, their data may be moved to an isolated shard. It’s much easier to move when all of the foreign key references don’t have to change. If everything were an integer, all of the integers for that customer might need to be regenerated because something else might already be using them. This is brittle, error prone, and takes a lot longer to do a data migration.
When using integers as foreign keys, you will need to know what the primary key value is before you can insert a referenced value in another table.
For example, if our database has a
users table and an
addresses table with a foreign key to the
users table, you will need to insert or select an existing row in the
users table before knowing what value to put in the
user_id column in the
However, if you already know what the UUID value will be, there’s no need to do an insert into the users table first and get the returned sequence value. You can simply generate a UUID and use that in the appropriate columns in both tables.
Note: we’ll learn on the next page why we shouldn’t use UUIDs as primary keys. Instead we should have a secondary column labeled
uid and reference that with our foreign keys.
By their nature, UUIDs are generated out of sequence. Usually they’re random, but even when they’re generated from a seed, the number isn’t incremented.
This poses a problem for B-Tree indexes. In databases, like Postgres, the data file and index is written to during periodic checkpoints. It’s done this way to do as many sequential writes as possible, as opposed to a bunch of random writes. As we learned in Chapter 1, sequential writes are faster, even on modern SSDs.
In Chapter 4, we learned that B-Trees write values into 8KB blocks, or pages, where each page represents a node in the tree. If all of our new values are sequential, they will belong to the same set of nodes; one per 8KB of data being written. Since the value is ever increasing, these nodes will always be at the end of the tree. In other words, most of the data will be appended to the end of the file which is very efficient.
However, with UUIDs, this isn’t the case. These numbers are all over the spectrum of possibilities. This causes a lot of writes to happen in the middle of the tree, which will cause lots of page splits and other random writes to occur. This is quite a bit more intensive than the near append only write with sequential integers.
In MySQL, there isn’t a native UUID type. Instead, people use the type binary(16). While this is essentially the same thing, it creates a headache. In MySQL 8.0, they added 2 functions that address dealing with UUIDs in a human readable way,
BIN_TO_UUID. In Postgres, you can run this query:
SELECT * FROM users WHERE uid = '49373f75-65ef-40d8-a700-38ae8cf5d032';. It will automatically do the conversion, but in MySQL, you have to run it like this:
SELECT * FROM users WHERE uid = BIN_TO_UUID('49373f75-65ef-40d8-a700-38ae8cf5d032'); This adds cognitive load for both the database analyst and any ORM being used.
In Postgres 11 and older, they do not support compound GIN trigram indices where one of the columns is a UUID. This was fixed in PG12, but there are probably other areas that aren’t fully supported.
Small integers: 2 bytes
Integers: 4 bytes
Big integers: 8 bytes
UUIDs: 16 bytes
This might not sound like a lot, but it adds up quickly. If an index full of big integers is 4GB in size, it will now be at least 8GB.
When this is the case, any table using a UUID as its primary key will bloat all of the secondary indices.
The ctid is the physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows. - Postgres System Columns 3
On a 16TB data warehouse using Postgres 9.6, I was using UUIDs for all of the primary keys. I instead created
bigserial (i.e. auto incrementing
bigint) primary keys named
id and added an additional UUID column called
uid, which was also indexed. Each table had many other indices as most columns are indexed in a data warehouse.
By switching to int and bigint for the primary keys, the database used ~30% less memory even though there was an entirely new indexed UUID column.