Databases are essentially fancy file scanners with optimizations to minimize how much of the file needs to be looked at to get the requested data. As we learned previously, data and index files are organized in fixed size blocks. In Postgres, these blocks are 8KB and are referred to as pages. In MySQL, they are 16KB.
What happens if a tuple contains more than a page's worth of data? The tuple ends up spanning two or more pages.
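If you want to confirm the page size on a Postgres server, block_size is exposed as a read-only setting (8192 bytes on a stock build):

```sql
-- block_size is fixed at compile time; stock Postgres builds use 8KB pages
SHOW block_size;
-- equivalently:
SELECT current_setting('block_size');
```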
To put this in context, let's assume we have a table called users that contains many character fields. For our example, the data per row often exceeds 8KB, and the table consists of 100k rows. The data file would contain up to 200k 8KB pages.
If we ran the query SELECT * FROM users;, the database would have to scan all 200k pages. Instead, if we ran SELECT id FROM users;, the id field would be found in the first page of each tuple. Once the database has that field, it doesn't need to read the second page, so the scan completes after 100k pages, one per tuple, which speeds up the query.
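One way to see the page counts for yourself is EXPLAIN with the BUFFERS option, which reports how many shared buffers (pages) each query touched. A sketch, assuming the users table above:

```sql
-- Compare the "Buffers: shared hit/read" lines in the two plans
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users;
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM users;
```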
To minimize the number of pages scanned, there are three things to consider:
Use the smallest type that’s reasonable
Put smaller types first
Put frequently accessed columns before infrequently accessed ones
Regardless of column order, you should use the smallest type that you'll reasonably need. For example, integers in Postgres can be a smallint, integer, or bigint, which are 2, 4, and 8 bytes respectively. If you're never going to exceed the maximum value of a smaller type, use it. This reduces the overall data stored, which reduces memory usage and speeds up table scans.
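Postgres can report the stored size of any value with pg_column_size, which makes the width difference concrete:

```sql
-- Same value, three different storage footprints
SELECT pg_column_size(1::smallint) AS smallint_bytes, -- 2
       pg_column_size(1::integer)  AS integer_bytes,  -- 4
       pg_column_size(1::bigint)   AS bigint_bytes;   -- 8
```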
Fixed-size columns such as UUIDs and fixed-width character fields should go before variable-width columns such as JSON fields. Depending on the character set, such as UTF-8, each character uses between 1 and 4 bytes, so a single character may take as much space as an entire integer.
By packing in the small fields first, more overall fields can fit in 8KB of storage. This increases the probability that a query will find all of the fields it needs in the first page.
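To see how much of a page each row actually occupies, you can apply pg_column_size to a whole row. A sketch, assuming the users table above contains data:

```sql
-- Reports the stored size in bytes of each row, padding included
SELECT pg_column_size(u.*) AS row_bytes
FROM users u
LIMIT 5;
```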
Put columns that are more likely to be selected before more obscure columns. The database stops reading a page when it has found all the data it needs in it; therefore, putting less accessed columns at the end will reduce the amount of data required to be read.
In my tables, it's quite common to have inserted_at and updated_at columns because the system I use creates these as standard. For consistency, I generally put these two right after the primary key, which is always labeled id. If there's another date that is nearly always sorted on, such as published_at, it comes next. After that, I put any foreign key columns, because these are commonly used to join tables. Next come booleans, numbers, and dates; then fixed-width character columns; and finally varchars, text fields, and other variable-width fields. These are ordered by the probability that they will be needed in a query.
CREATE TABLE IF NOT EXISTS users (
  id SERIAL NOT NULL -- serial is an auto increment in Postgres
    CONSTRAINT users_pkey PRIMARY KEY,
  inserted_at TIMESTAMP,
  updated_at TIMESTAMP,
  account_id INTEGER NOT NULL, -- reference to the accounts table
  is_active BOOLEAN DEFAULT false NOT NULL,
  uid UUID NOT NULL, -- universal id
  password_digest CHAR(40) DEFAULT '' NOT NULL, -- 40 character SHA
  email VARCHAR DEFAULT '' NOT NULL,
  name VARCHAR DEFAULT '' NOT NULL,
  phone VARCHAR DEFAULT '' NOT NULL,
  avatar_url TEXT, -- URLs can be more than 255 characters
  bio TEXT
);
As columns are added and removed over time, it’s easy to get the column order out of whack. This is especially true when using ORMs with migration capabilities built in because each migration is treated as a separate task, rather than part of the whole.
With MySQL, it's easy to keep things organized because the ALTER TABLE command supports the AFTER keyword. For example:
ALTER TABLE users MODIFY email VARCHAR(255) DEFAULT '' NOT NULL AFTER password_digest;
Note that MODIFY requires the full column definition, including the length for VARCHAR.
Postgres does not have the AFTER keyword. Reordering columns can still be done, but it's a bit more involved. The steps are:
Create a new table with the columns in the right order.
INSERT from a SELECT into the new table. Ex: INSERT INTO new_table_name SELECT columns FROM old_table_name;
Rename the original table. I add the suffix _old.
Rename the new table to the original table's name.
Create any indices.
Verify everything is good.
Drop the original table.
You can also use CREATE TABLE AS, but it won't create the primary key or the auto-incrementing sequence, so you'll need to ALTER TABLE to add these.
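A sketch of that cleanup, using the accounts table from this chapter's example (the sequence name here is illustrative):

```sql
-- CREATE TABLE AS copies the data and column order, but not
-- constraints, defaults, or sequences
CREATE TABLE accounts_new AS
  SELECT id, inserted_at, updated_at, is_active, name, website
  FROM accounts;

-- Re-add the primary key and an auto-incrementing default
ALTER TABLE accounts_new ADD PRIMARY KEY (id);
CREATE SEQUENCE accounts_new_id_seq OWNED BY accounts_new.id;
ALTER TABLE accounts_new
  ALTER COLUMN id SET DEFAULT nextval('accounts_new_id_seq');
SELECT setval('accounts_new_id_seq', COALESCE(max(id), 1)) FROM accounts_new;
```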
Here is the original accounts table:
CREATE TABLE IF NOT EXISTS accounts (
  id SERIAL NOT NULL
    CONSTRAINT accounts_pkey PRIMARY KEY,
  inserted_at TIMESTAMP,
  updated_at TIMESTAMP,
  website TEXT,
  is_active BOOLEAN DEFAULT false NOT NULL,
  name VARCHAR
);
Create new table with columns in the right order:
CREATE TABLE IF NOT EXISTS accounts_new (
  id SERIAL NOT NULL
    CONSTRAINT accounts_new_pkey PRIMARY KEY,
  inserted_at TIMESTAMP,
  updated_at TIMESTAMP,
  is_active BOOLEAN DEFAULT false NOT NULL,
  name VARCHAR,
  website TEXT
);

INSERT INTO accounts_new
  SELECT id, inserted_at, updated_at, is_active, name, website
  FROM accounts;

ALTER TABLE accounts RENAME TO accounts_old;
ALTER TABLE accounts_new RENAME TO accounts;
Verify everything is good and do a backup.
DROP TABLE accounts_old;