On the previous page, we detailed how to set up a “has_many” relationship using a content management system as an example. On this page we’ll go into what a “has_one” relationship is.
The “has_one” is the reason I wanted to use a CMS as the example. Most people that have worked with databases understand the “has_many” relationship very quickly. It’s one of the main reasons people don’t just keep all of their data in spreadsheets and CSV files. The “has_one” relationship isn’t used nearly as often, though it should be used more.
A “has_one” relationship is where 2 tables have a foreign key reference between them, but when they are joined together, each row only references one other row. Conceptually, the schema looks the same as a “has_many”; however, I recommend also creating a unique index on the foreign key column to guarantee that there will only be one row to join to. We’ll get into indexes in chapter 6.
The reason these types of relationships aren’t used very often is because they get confused with attributes. If a thing has exactly one thing, that thing could exist inside a column, especially if that thing is a boolean or text string.
CREATE TABLE templates ( id SERIAL NOT NULL CONSTRAINT templates_pkey PRIMARY KEY, header TEXT, footer TEXT ); CREATE TABLE sites ( id SERIAL NOT NULL CONSTRAINT sites_pkey PRIMARY KEY, url VARCHAR(255) ); CREATE TABLE web_pages ( id SERIAL NOT NULL CONSTRAINT web_pages_pkey PRIMARY KEY, template_id INT CONSTRAINT fk_template_web_pages REFERENCES templates(id), site_id INT CONSTRAINT fk_site_web_pages REFERENCES sites(id), slug VARCHAR(255) NOT NULL, content TEXT );
In the “has_many” example, we started to talk about how every page has a slug that we could put in our
WHERE clause. For example, if the URL were
www.example.com/pages/my_puppy_pictures.html, the query might look like:
SELECT content, header, footer FROM web_pages w INNER JOIN sites s ON w.site_id = s.id LEFT JOIN templates t ON t.id = w.template_id WHERE s.url = 'www.example.com' AND w.slug = 'my_puppy_pictures' LIMIT 1;
It’s selecting the content and template info, and it’s filtering it by the URL and the slug. The home page wouldn’t have a slug, so this query isn’t going to work when somebody just wants to visit
www.example.com. There are at least 4 ways we can solve this. One will create data issues, one will create performance issues, one meets our needs beautifully, and one works but isn’t recommended.
We’ll talk about the benefits of using blank values instead of nulls in the query optimization chapter. Since we’re going for a different point here, we’ll assume they’re both equal.
Therefore, we’ll add the business rule that if a slug is an empty string, it is the homepage.
To query the homepage, we’ll run this SQL:
SELECT content, header, footer FROM web_pages w INNER JOIN sites s ON w.site_id = s.id LEFT JOIN templates t ON t.id = w.template_id WHERE s.url = 'www.example.com' AND w.slug = '' LIMIT 1;
This works great until we add another page that doesn’t have a slug. Ideally, our web pages table would also have a page title so that it can show up in the browser, bookmarks, etc. We would want code to automatically convert the title to a slug so people don’t have to think about it. If we decided to then change which page is the home page, we’d have to add the slug to our old homepage and erase the slug on the new one.
The web pages table now looks like this:
CREATE TABLE web_pages ( id SERIAL NOT NULL CONSTRAINT web_pages_pkey PRIMARY KEY, template_id INT CONSTRAINT fk_template_web_pages REFERENCES templates(id), site_id INT CONSTRAINT fk_site_web_pages REFERENCES sites(id), title VARCHAR(255) NOT NULL, slug VARCHAR(255) NOT NULL, content TEXT );
But how can we guarantee that an empty slug means that it’s the homepage? If somebody updates the homepage record, there’s nothing explicit saying that it is, in fact, the homepage. What might happen instead is that the slug will be updated to the page title and the query looking for the home page won’t return anything.
Ok, so let’s be explicit and add an attribute called
is_homepage as a boolean.
CREATE TABLE web_pages ( id SERIAL NOT NULL CONSTRAINT web_pages_pkey PRIMARY KEY, template_id INT CONSTRAINT fk_template_web_pages REFERENCES templates(id), site_id INT CONSTRAINT fk_site_web_pages REFERENCES sites(id), is_homepage BOOLEAN NOT NULL DEFAULT FALSE, title VARCHAR(255) NOT NULL, slug VARCHAR(255) NOT NULL, content TEXT );
Our code would now have an
if statement. If there is a slug, run the query above, otherwise run this query:
SELECT content, header, footer FROM web_pages w INNER JOIN sites s ON w.site_id = s.id LEFT JOIN templates t ON t.id = w.template_id WHERE s.url = 'www.example.com' AND w.is_homepage = 'true' LIMIT 1;
Now that we’re explicitly calling out which page is the homepage, we won’t run into weird issues if there happens to be 2 pages without slugs. Also, anyone looking at the data will know exactly which page is the homepage. To change which page is the homepage, we would update the
is_homepage value for the old homepage to false and the new homepage to true in the same transaction.
There’s a dragon lurking in this design though.
Only one row per website will have a true value. All of the rest will be set to false. Since the column can only have 2 values, true or false, a B-Tree index doesn’t really help. We’ll get into the nitty gritty of how B-Trees work in the chapter on indexes. For now, it’s important to understand that the database will almost certainly choose to do a table scan instead of look at an index.
When a site has only a few pages, this isn’t much of an issue. On a large site with thousands of pages, the database will look at each of them until it finds the “true” value, which will make the query slower than it needs to be.
Another way to say this is that the
is_homepage column is full of sparse data. There’s only 1 row that matters.
The concept of sparse data comes up a lot. There are countless examples where most of a column’s values contain default values, empty strings, or nulls. Keeping all of this data in the same table bloats the table, which creates the need to use more available RAM and slows down table scans.
We want to avoid columns with sparse data when we can. Indexes aren’t as efficient or used at all, and they bloat our tables.
Instead of using the
is_homepage column, lets create a new table called
CREATE TABLE homepages ( id SERIAL NOT NULL CONSTRAINT homepages_pkey PRIMARY KEY, web_page_id INT CONSTRAINT fk_homepage_web_pages REFERENCES web_pages(id) );
Now, to find the homepage for
www.example.com, we’ll run:
SELECT content, header, footer FROM web_pages w INNER JOIN sites s ON w.site_id = s.id INNER JOIN homepages h ON h.web_page_id = w.id LEFT JOIN templates t ON t.id = w.template_id WHERE s.url = 'www.example.com' LIMIT 1;
There’s only one row in the
homepages table per website, and the query is only going to join 1 of them. Because of this, the query will be much faster than scanning over every row in the
web_pages table looking for the single true value in the
To set another page as the homepage, we would simply update the
web_page_id to the new value. This might get a little messy because we would have to rely on the web page it references to know which site to select and it becomes a bit of a circular logic thing. To avoid confusion, we would simply add another foreign key on the homepages table for the site.
CREATE TABLE homepages ( id SERIAL NOT NULL CONSTRAINT homepages_pkey PRIMARY KEY, site_id INT CONSTRAINT fk_homepage_site REFERENCES sites(id), web_page_id INT CONSTRAINT fk_homepage_web_pages REFERENCES web_pages(id) );
This will allow us to make an ever so slight optimization in the query. In our example, it won’t help a lot because there aren’t that many rows; however, the concept is used a ton in data warehouses and is referred to as a “star schema”. We’ll get into fine details later in the book, but the gist is that the joins go through a central table to limit the number of different tables being joined as possible. This speeds up the query by reducing the number of different indexes the query has to scan.
SELECT content, header, footer FROM homepages h INNER JOIN sites s ON h.site_id = s.id INNER JOIN web_pages w ON h.web_page_id = w.id LEFT JOIN templates t ON t.id = w.template_id WHERE s.url = 'www.example.com' LIMIT 1;
As you can see, we’re now joining through the homepages table twice to both the sites and web pages tables.
For completeness, instead of a
homepages table, we could add a column called
homepage_id on the
sites table, since a site always has a single homepage. This would technically work and have adequate query performance. It’s largely a style issue, but I don’t prefer this approach for a few reasons.
A homepage isn’t an attribute of a website. A website has a homepage, but a homepage does not describe a website.
A column named
homepage_id isn’t an obvious join to a table called
web_pages. If we called the column
web_page_id, it’s not obvious that it’s indicating that it’s the homepage. There would be a foreign reference on both tables to each other, which would get confusing. Yes, the code could describe what’s going on, but queries should be able to be understood on their own.
This goes back to the original thought in this section. “has_one” relationships are often confused with “attributes”, which can cause discoverability and maintainability problems.
We touched on using a “has_one” when we have sparse data such as a single homepage among all of the pages of a website; however, there are several other good reasons to use them.
It’s common sense that different processes need data for different reasons. Each process probably only cares about a subset of columns in a table. In cases where there is only one process that needs a particular set of columns, it benefits the whole to break those columns out into their own table.
I see it all the time where a worker process keeps a status field on a common table shared across many use-cases.
Going back to the CMS example, let’s suppose that we have a process that does text analysis on each page to determine the most appropriate hashtags it should belong to. This process runs asynchronously and knows which page to process based on a status field called
CREATE TABLE web_pages ( id SERIAL NOT NULL CONSTRAINT web_pages_pkey PRIMARY KEY, template_id INT CONSTRAINT fk_template_web_pages REFERENCES templates(id), site_id INT CONSTRAINT fk_site_web_pages REFERENCES sites(id), is_hashtag_processed BOOLEAN NOT NULL DEFAULT FALSE, slug VARCHAR(255) NOT NULL, content TEXT );
Every time the process runs, it queries the table and finds any rows (i.e. tuples) where
is_hashtag_processed is false, processes them, and updates the column to true.
Each time an update occurs on an ACID compliant database, like Postgres, the tuple is rewritten. This is so the table can contain the old and new versions of the data in case any
SELECT statements are actively running before it is updated.
Let’s suppose that there was an enhancement made to the hashtag process. In order to take advantage of these improvements, all tuples’
is_hashtag_processed column would be set to false. The process is rerun and this column is set back to true as each record is completed.
When all of the values are set to false, the table’s data file is doubled in size. This is because every tuple is written again and the original tuple is marked as “dirty”.
Generally in Postgres, when 20% or more tuples are marked “dirty”, a
VACUUM process is started. 1 This is a normal process that flags dirty tuples to indicate that they can be overwritten. Postgres is most effective at cleaning up after itself when few tuples are updated over a long period of time. This keeps the table’s size manageable because new tuples are generally overwriting existing dirty ones.
When every row is updated at once, it can’t keep up and the size of the table is doubled in size. Once the
VACUUM completes, new updates will replace some of the dirty rows, but that bloat doesn’t go away. This causes data fragmentation resulting in slower table scans.
To compound the problem, let’s assume there are a few other processes with their own status columns. When updates happen, the rows being updated are locked to prevent other processes from corrupting the data. If many processes are updating the same rows, there will be write contention causing everyone to slow down because they’ll have to wait for each other to release the locks before they can proceed.
Also, updating data causes caches of that data in memory to become stale. The memory is discarded and the next query must fetch it from the hard drive and return it to memory. This is much slower than simply getting data from the cache.
To fix this, keep data that doesn’t change often and is shared across all of the processes in the main table. Data for specific processes, or data that is frequently updated, is then moved to isolated tables grouped logically.
In our example, we would create a new table called
worker_hashtags that would join to
web_pages as a “has_one” relationship. We could go 2 different routes with the structure. First, we could simply have a column called
is_hashtag_processed and use this exactly like the example above. We would have a matching row for every row in the
web_pages table and update the boolean to true as it’s processed.
This solves our problem of table bloat on the
web_pages table and helps avoid the stale cache problems detailed above. The
worker_hashtags table would be bloated, but since it’s isolated and contains few columns, the problem is minimized.
There’s a better way though.
Our web pages don’t change all that often, and it’s rare to create more than a couple new pages in a day. Therefore, our worker is going to be caught up most of the time. Occasionally, we might want to reprocess all of the hashtags when we enhance our process, but this will be relatively rare too. All of this indicates that the
is_hashtag_processed data is sparse. Most of the time, the value is going to be true with a couple of falses peppered in as changes occur.
Instead of having a one for one relationship, we can have a one for zero relationship.
What that means is that the presence of the row in the
worker_hashtags table indicates that it needs to be worked. Once it’s processed, the row is deleted. Occasionally, when there are no rows in the table, a
TRUNCATE command is run.
TRUNCATE TABLE deletes the data file and recreates it, which removes any bloat. It’s like dropping the table and recreating it without losing any of the existing indexes, functions, triggers, etc that might be on that table.
The schema would then look like this:
CREATE TABLE web_pages ( id SERIAL NOT NULL CONSTRAINT web_pages_pkey PRIMARY KEY, template_id INT CONSTRAINT fk_template_web_pages REFERENCES templates(id), site_id INT CONSTRAINT fk_site_web_pages REFERENCES sites(id), is_hashtag_processed BOOLEAN NOT NULL DEFAULT FALSE, slug VARCHAR(255) NOT NULL, content TEXT ); CREATE TABLE worker_hashtags ( id SERIAL NOT NULL CONSTRAINT worker_hashtags_pkey PRIMARY KEY, web_page_id INT CONSTRAINT fk_worker_hashtag_web_page REFERENCES web_pages(id) );
Any time a web page is created or modified, a row is upserted into the
worker_hashtags table to be processed.
web_page_id column would have a unique index to prevent a web page from getting inserted multiple times since the worker processes the entire document and catches all changes.
This keeps our data small, minimizes data fragmentation, and keeps everything easy to reason about.
We touched on this in the last example, but it is worth repeating. Often, we have a core set of columns on a table that doesn’t change very often and a small set of columns that do.
Years ago, I built a photo sharing website called The Lens Flare. The site is still accessible but is currently defunct. This was my first attempt at creating a SaaS based application. It allowed people to create an account and upload their favorite photos.
I wanted to keep a counter for the number of times a photo was viewed. My first attempt was to add an integer column called
counter on the
images table. Each time a page was viewed, the counter was incremented. Therefore, the
images file ran 2 queries. I updated the column and then selected the row.
UPDATE images SET counter = counter + 1 WHERE slug = 'my_photo'; SELECT * FROM images WHERE slug = 'my_photo';
At first, this was fine. The web pages weren’t loaded very often and there weren’t many of them. It started to become a problem once the website had 50,000 pictures and pages were constantly getting viewed by humans and bots.
The problem was that the
SELECT happened serially and both had to complete before the server could send HTML to the browser. If another visitor was looking at the same image, it too would be running an
UPDATE. This locks the row causing any other updates to have to wait.
Since the counter was on the same table as the rest of the image data, including the title, text description, etc, all of that data was rewritten every time the image was viewed. This quickly overwhelmed the write capacity of the database server causing everything to slow down even further. This also invalidated the in memory cache for every
SELECT on this table, making things even worse.
The first thing I did was move the counter to its own table, called
image_views using the same technique I described in the “Segmenting use-cases” example. This reduced the total amount of data being written and improved cache hits. However, it still suffered from write contention because every page load was writing to this new table.
So I made a hard choice. I wanted the counter to be accurate all the time, but I compromised and made the counter accurate as of the previous hour. I created a new table called
view_logs, and when a web page was viewed, it would create an asynchronous process that inserted a row in the logs table. Once per hour, a process would count how many new views there were and added those to the counter in the
Updating rows once per hour, instead of constantly, solved my problem. On the web page, I added a tooltip that showed the last time the counter was updated so visitors weren’t confused by a number that didn’t often change.