As previously noted, a typical Postgres server is I/O bound rather than CPU bound. If a server with 4 or more cores is bottlenecked by the CPU, its queries are doing too much work. Later in this book, we’ll talk about how to remove unnecessary functions from queries and preprocess data to reduce the amount of work each query has to do at runtime.
Knowing how a database server will be used is key to optimizing its configuration. For example, a data warehouse will typically run fewer queries, but each query will be more intense; a transactional database will run many more small queries from a greater number of connections.
Memory is allocated separately for different tasks such as connections, indices, and maintenance operations. Also, some amount of memory must be left to the underlying operating system to run properly.
Most of these settings are based on the recommendations of the Postgres guide1.
Effective Cache Size
effective_cache_size is one of the most important memory settings. It’s an estimate of how much memory is available for disk caching, counting both the OS page cache and the database’s own buffers, because Postgres relies on the OS cache in addition to its own.
This number is only used by PG’s query planner; it does not allocate any memory. The planner uses it to estimate whether the data a query needs is likely to be in cache, which helps determine whether to use an index or fall back to a sequential scan.
The guide says a conservative setting is 50% of total memory, while 75% is a more aggressive but still reasonable amount.
I take a slightly different approach. Assuming a dedicated Linux server running little besides PG, the OS is going to need 2-4GB of RAM for itself and any non-PG processes. The 50-75% recommendation therefore works until the server has about 16GB of RAM. Beyond that, I simply take the total RAM and subtract 4GB; for example, if the server has 64GB, I’ll set it to 60GB.
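Following that approach, a postgresql.conf entry for a hypothetical dedicated 64GB Linux server might look like this (the 4GB OS reservation is my rule of thumb, not a hard requirement):

```
# 64GB total RAM minus ~4GB reserved for the OS and non-PG processes
effective_cache_size = 60GB
```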
Another approach is to look at top on *NIX servers: add the free and cached numbers together to get an estimate. These numbers change as different processes are spun up and down, so I’ve found a 2-4GB estimate for the OS works well to give it a little cushion.
Shared Buffers

shared_buffers sets how much memory is dedicated to Postgres for caching data. Keep in mind that PG also uses the operating system cache, so this number becomes somewhat of a guessing game.
First off, make sure your OS is 64-bit and an up-to-date version. A 32-bit OS can’t give a process much more than about 2GB of memory. Also, if you’re running anything older than PG 10, now is a good time to upgrade.
If you have less than 1GB of memory, setting shared_buffers to about 15% of RAM is suggested, since the OS is going to need a lot of it; however, if you’re running into performance issues with only 1GB of memory, it’s time to upgrade the server.
The guide suggests setting the value to 25%-40% of RAM for anyone using more than 1GB of memory. I usually set it to 25%, let it run a typical workload for a few days, and see how much RAM is actually being used. It’s rare that I’ve adjusted it above that, since PG uses the OS cache as well.
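As a sketch, starting at 25% on the same hypothetical 64GB server:

```
# 25% of 64GB; watch actual usage for a few days before raising it
shared_buffers = 16GB
```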
Work Memory

work_mem determines how much memory is available per connection for operations such as sorting results or temporary tables created by a query. Take the maximum number of connections into consideration when adjusting work_mem to avoid Out of Memory (OOM) errors. A single query can also use several multiples of work_mem: if it merge sorts multiple tables, each sort can use up to work_mem. For example, if it’s set to 20MB and the query has to merge sort 4 tables, the connection may use up to 80MB of RAM.
Once a connection is holding onto memory, it is not released until the connection is closed. This can be a problem for applications that do connection pooling because these connections are long lived. If the amount of memory used by PG drops significantly when deploying a change or restarting your application, it’s due to memory being held by these connections.
To get an idea of what this value should be set to, calculate the maximum memory available per connection using this formula: effective_cache_size / max_connections. This number is far too high in practice, since PG is going to need memory for things like indices, but it gives the ultimate max. From here, determine how much of the overall memory you’re willing to give to connections. Usually an end result of about 20MB per connection on a transactional system, or 60MB for a data warehouse, is a good place to start.
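A back-of-the-envelope version of that calculation, using the hypothetical 64GB server from earlier with 400 connections allowed:

```
# Ceiling: effective_cache_size / max_connections
#   60GB / 400 connections ≈ 153MB per connection (far too high in practice)
# Settle on a modest fraction of that ceiling:
work_mem = 20MB    # transactional workload; try 60MB for a data warehouse
```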
log_temp_files can be turned on to see how often work_mem is being exceeded. Running EXPLAIN ANALYZE can also shed light on queries that are spilling to disk by listing a value for external merge Disk. If this value is 9012kB, a work_mem somewhat above 9MB is a good place to start; data on disk is more compact than data in RAM, so give it some headroom to ensure the sort completely fits in memory.
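For example, setting log_temp_files = 0 logs every temp file PG creates, and EXPLAIN ANALYZE on a suspect query will show the spill size (the table and query here are made up for illustration):

```sql
-- Log all temporary files, regardless of size
SET log_temp_files = 0;

-- Look for "Sort Method: external merge  Disk: 9012kB" in the output
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY created_at;
```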
Max Connections

Data warehouses can get away with a lower max_connections setting. Transactional systems tend to need many more connections, especially if the application uses a connection pool, as is the case with Rails’ ActiveRecord or Elixir’s Ecto. This setting also factors into other settings, especially the per-connection work_mem calculation.
PG can handle a few hundred connections pretty easily. If you need more than 400 or so, consider using some sort of connection pooling software such as PgBouncer.2
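As a hypothetical starting point for a transactional system sitting behind an application-side pool:

```
# A few hundred direct connections is comfortable; beyond ~400,
# put PgBouncer in front instead of raising this further
max_connections = 200
```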
Maintenance Work Memory
maintenance_work_mem is how much memory is allocated to vacuums and other maintenance tasks. Allocating enough memory will dramatically speed up these operations. On systems without a lot of memory, changing the vacuum_cost_* settings so that vacuums run much more frequently will keep the amount of memory required to a minimum.
I use the formula effective_cache_size / 12 for transactional systems and effective_cache_size / 6 for data warehouses as a starting point. This may need to be bumped up if vacuums commonly run on massive tables with a lot of updates or deletes.
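Applying those formulas to the 60GB effective_cache_size example from earlier:

```
# Transactional: effective_cache_size / 12 → 60GB / 12 = 5GB
maintenance_work_mem = 5GB
# Data warehouse: effective_cache_size / 6 → 60GB / 6 = 10GB
# maintenance_work_mem = 10GB
```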
Checkpoint Timeout

Persisting data to disk is done through a process called a checkpoint. In PG, a checkpoint is one of the mechanisms used to ensure data is not lost or corrupted after a crash.
checkpoint_timeout is set to 5 minutes by default. A checkpoint takes new data written to the WAL and commits it to the data files. A short timeout like 5 minutes means recovery after a crash will be faster; however, it can cause write amplification, especially on tables with a lot of updates and deletes. It’s usually better to change this value to run a checkpoint every 15-30 minutes.3
Pair this value with checkpoint_completion_target for optimal performance.
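A sketch of the two settings together, using the longer timeout suggested above:

```
# Checkpoint every 30 minutes instead of the 5-minute default
checkpoint_timeout = 30min
# Spread the checkpoint writes over ~90% of that interval
checkpoint_completion_target = 0.9
```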
Max WAL Size
Prior to Postgres 9.5, this setting was called checkpoint_segments. max_wal_size determines how much Write Ahead Log (WAL) data is kept on the server. The Write Ahead Log contains any changes made to the server and is used for things like replication and recovery. Data is written to the WAL first; then, when a checkpoint occurs, those changes are written to the data files. Each WAL file is 16MB, so set this value to a multiple of 16MB.
The best way to know what this value should be is to look at how much is getting written to the WAL. To do this:

1. Get the current WAL insert location. Prior to PG10, the command was SELECT pg_current_xlog_insert_location();. This changed to SELECT pg_current_wal_insert_lsn(); in PG10+.
2. Run it again in 5 minutes.
3. Take a diff of the 2 results from the steps above to find out how much was written. Prior to PG10, the command was SELECT pg_xlog_location_diff('newest', 'oldest');. In PG10+, it is SELECT pg_wal_lsn_diff('newest', 'oldest');.
4. Multiply this value by 6 to get a 30 minute average (5 minutes * 6 = 30).
5. Multiply this value by 3 to account for “spread checkpoints”, which are controlled by the checkpoint_completion_target setting.
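The measurement steps above can be sketched in SQL on PG10+ (the LSN values shown are placeholders; substitute the ones your server returns):

```sql
-- Step 1: note the current WAL insert location
SELECT pg_current_wal_insert_lsn();   -- e.g. 0/3000148

-- Step 2: wait 5 minutes, then run it again
SELECT pg_current_wal_insert_lsn();   -- e.g. 0/17000028

-- Step 3: diff the two locations to get bytes written in 5 minutes
SELECT pg_size_pretty(pg_wal_lsn_diff('0/17000028', '0/3000148'));

-- Steps 4-5: if ~320MB was written in 5 minutes:
--   320MB * 6 = 1920MB per 30 minutes, * 3 for spread checkpoints = 5760MB
-- max_wal_size = 5760MB   (a multiple of 16MB)
```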
max_wal_size can be exceeded when using Replication Slots, which are a way for another process to subscribe to the WAL. The subscriber ACKs when it’s finished with each WAL file, letting PG know it’s ok to delete. This ensures data isn’t lost if a subscriber crashes or falls behind.
This has the unfortunate side effect that the hard drive on the PG server can fill up if a subscriber takes too long to come back online. The fix is to either bring the subscriber back online or delete the Replication Slot.
Checkpoint Completion Target
checkpoint_completion_target is typically set to 0.9. This means the system tries to finish writing the current checkpoint by the time the next checkpoint is 90% due. For example, checkpoints happen every 5 minutes by default; with a target of 0.9, the system will try to finish writing the current checkpoint within 4 1/2 minutes.
A good rule of thumb is to use this formula: (checkpoint_timeout - 2min) / checkpoint_timeout. If the result is more than 0.9, use 0.9 instead.
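Worked through for the 30-minute checkpoint_timeout suggested earlier:

```
# (30min - 2min) / 30min ≈ 0.93, which exceeds 0.9, so cap it:
checkpoint_completion_target = 0.9
```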
Default Statistics Target
Postgres collects stats about each table, such as the number of updates, vacuums, and how dirty it is. Dirty, in this case, refers to tuples that have been updated or deleted since the last vacuum (Postgres calls these dead tuples). Until a vacuum happens, these tuples take up unnecessary room in the data files, which can slow down disk seeks and table scans.
These stats are used by the query planner to determine how to run the query, such as whether or not to use an index. If the query plan isn’t behaving, it may be because the stats aren’t detailed enough. Increasing default_statistics_target and then running ANALYZE on the table will produce more accurate stats, though the ANALYZE will take longer to run. Typically the default is good enough.
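If one table does need better stats, the target can also be raised per column rather than globally; the table and column here are hypothetical:

```sql
-- Collect more detailed stats for a skewed column (default target is 100)
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
-- Rebuild the stats; this will take longer at the higher target
ANALYZE orders;
```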
Random Page Cost
random_page_cost tells the query planner how expensive it is to fetch a random page from disk, expressed as a multiple of a sequential page read, which is defined as 1.0.
random_page_cost is generally a number between 1 and 4. I usually set it to 3.0, though it can be set lower on modern drives with fast random access.
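As a hedged example for an SSD-backed server, where random reads cost nearly the same as sequential ones:

```
# Default is 4.0; 1.1 is a common choice for SSD/NVMe storage
random_page_cost = 1.1
```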
If query plans aren’t working like you’d expect, this is one of the last things to try. Make sure the other settings are dialed in, specifically the memory settings, before changing this number. Also, make sure default_statistics_target is set high enough. If the table has a lot of updates or deletes, it may be fragmented, so you can try running VACUUM FULL to compact it. This will lock the table while it’s rewritten, so only run a VACUUM FULL during scheduled maintenance.
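To check whether fragmentation is actually the problem before reaching for VACUUM FULL, the dead-tuple counts in pg_stat_user_tables are a good first look (the table name is illustrative):

```sql
-- Tables with the most dead (updated/deleted, not yet vacuumed) tuples
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Compact a badly fragmented table; takes an exclusive lock
VACUUM FULL orders;
```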