Get to [2021-training-notes-day-4].
So we can learn about partitions.
CREATE DATABASE cat;
\c cat
CREATE TABLE person (id bigint, name text, bday date);
INSERT INTO person VALUES(1, 'oscar', '1996-11-23');
INSERT INTO person VALUES(2, 'brie', '2000-11-23');
SELECT id, name FROM person WHERE bday > '2000-01-01';
id | name
----+------
2 | brie
(1 row)
Say we use that query frequently… We create a view. A view is a SELECT with a name. It looks and acts like a table but isn’t one.
cat=# CREATE VIEW youngsters AS SELECT id, name FROM person WHERE bday > '2000-01-01';
CREATE VIEW
cat=# \d youngsters
View "public.youngsters"
Column | Type | Collation | Nullable | Default
--------+--------+-----------+----------+---------
id | bigint | | |
name | text | | |
cat=# SELECT * FROM youngsters;
id | name
----+------
2 | brie
(1 row)
Explore EXPLAIN using the view and the underlying query:
cat=# EXPLAIN SELECT * FROM youngsters;
QUERY PLAN
----------------------------------------------------------
Seq Scan on person (cost=0.00..24.12 rows=377 width=40)
Filter: (bday > '2000-01-01'::date)
(2 rows)
cat=# EXPLAIN SELECT id, name FROM person WHERE bday > '2000-01-01';
QUERY PLAN
----------------------------------------------------------
Seq Scan on person (cost=0.00..24.12 rows=377 width=40)
Filter: (bday > '2000-01-01'::date)
(2 rows)
Useful for: decoupling the application from the underlying tables. This makes application updates much less annoying: even if you change the tables around, just update the view and the application gets the same results. As with everything, use in moderation. Views on views on views on views. Please don’t. :turtle: :turtle: :turtle:
Partitions as kinds of views.
The partitioning key determines which partition a row belongs in.
On INSERT, each row goes into exactly one partition. SELECT collects rows from all partitions as if you had done UNION ALL.
:arrow_right: INSERT
:arrow_left: SELECT

There are three partitioning strategies:
You can do range partitioning – everything between x and y goes here. No overlaps; partition boundaries must be unique. List partitioning is also possible (a list of values is specified instead of a range). Hash partitioning – the partitioning key is hashed (the appropriate hash function depends on the data type), the result is divided by the number of partitions, and the remainder determines the resulting partition. Mostly useless: good for randomly splitting a big table into many, with an I/O-balancing advantage, but beyond that – eh, why?
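A minimal sketch of the other two strategies, with made-up table names (pets, events) just for illustration:
CREATE TABLE pets (name text NOT NULL, species text NOT NULL) PARTITION BY LIST (species);
CREATE TABLE pets_cats PARTITION OF pets FOR VALUES IN ('cat');
CREATE TABLE pets_dogs PARTITION OF pets FOR VALUES IN ('dog');
CREATE TABLE events (id bigint NOT NULL, payload text) PARTITION BY HASH (id);
CREATE TABLE events_0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Back to range partitioning for the running example: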
CREATE TABLE data (id bigint GENERATED ALWAYS AS IDENTITY NOT NULL, created timestamp NOT NULL, catload text) PARTITION BY RANGE (created);
The (created) value specifies the partitioning key.
cat=# INSERT INTO data (created, catload) VALUES (current_timestamp, 'Cats are so extremely cute, OMG.');
ERROR: no partition of relation "data" found for row
DETAIL: Partition key of the failing row contains (created) = (2021-09-03 15:23:32.195025).
cat=#
Automatic partition creation == NO. Not currently available in PostgreSQL. Your application or cron or something has to create the partitions. We create them with CREATE TABLE:
CREATE TABLE data_2021 PARTITION OF data FOR VALUES FROM ('2021-01-01 00:00:00') TO ('2022-01-01 00:00:00');
CREATE TABLE data_2020 PARTITION OF data FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2021-01-01 00:00:00');
CREATE TABLE data_2019 PARTITION OF data FOR VALUES FROM ('2019-01-01 00:00:00') TO ('2020-01-01 00:00:00');
CREATE TABLE data_2018 PARTITION OF data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');
The upper boundary is always excluded, which avoids overlaps at midnight on New Year’s Day with the statements above. If you try to create an overlapping partition, you’ll get an error:
cat=# CREATE TABLE data_dogs PARTITION OF data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');
ERROR: partition "data_dogs" would overlap partition "data_2018"
cat=# INSERT INTO data (created, catload) VALUES (current_timestamp, 'something');
INSERT 0 1
cat=# TABLE data_2021;
id | created | catload
----+----------------------------+-----------
2 | 2021-09-03 15:28:12.284289 | something
(1 row)
cat=# INSERT INTO data_2018 (created, catload) VALUES (current_timestamp, 'something');
ERROR: null value in column "id" of relation "data_2018" violates not-null constraint
DETAIL: Failing row contains (null, 2021-09-03 15:28:28.176442, something).
In use:
cat=# EXPLAIN SELECT * FROM data;
QUERY PLAN
---------------------------------------------------------------------------
Append (cost=0.00..104.20 rows=4280 width=48)
-> Seq Scan on data_2018 data_1 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2019 data_2 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2020 data_3 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2021 data_4 (cost=0.00..20.70 rows=1070 width=48)
(5 rows)
Updates that move data between partitions regularly probably mean you should have chosen a better partitioning strategy.
cat=# UPDATE data SET created = created - INTERVAL '1 year' WHERE id = '2' ;
UPDATE 1
cat=# TABLE data_2021;
id | created | catload
----+---------+---------
(0 rows)
cat=# TABLE data_2020;
id | created | catload
----+----------------------------+-----------
2 | 2020-09-03 15:28:12.284289 | something
(1 row)
Let’s add 1,000,000+ rows that will be in different partitions:
INSERT INTO data (created, catload) SELECT t, 'random value' FROM generate_series(TIMESTAMP '2018-12-29 00:00:00', TIMESTAMP '2021-01-06 00:00:00', INTERVAL '1 minute') AS gs(t);
INSERT 0 1064161
Partition pruning in action:
cat=# EXPLAIN SELECT * FROM data WHERE id = 42;
QUERY PLAN
------------------------------------------------------------------------------------------
Gather (cost=1000.00..27785.85 rows=4 width=29)
Workers Planned: 2
-> Parallel Append (cost=0.00..26785.45 rows=4 width=29)
-> Parallel Seq Scan on data_2020 data_3 (cost=0.00..13241.01 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2019 data_2 (cost=0.00..13205.00 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2021 data_4 (cost=0.00..211.90 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2018 data_1 (cost=0.00..127.53 rows=1 width=29)
Filter: (id = 42)
(11 rows)
cat=# EXPLAIN SELECT * FROM data WHERE created = '2020-07-12 12:00:00';
QUERY PLAN
----------------------------------------------------------------------------------
Gather (cost=1000.00..14241.21 rows=2 width=29)
Workers Planned: 2
-> Parallel Seq Scan on data_2020 data (cost=0.00..13241.01 rows=1 width=29)
Filter: (created = '2020-07-12 12:00:00'::timestamp without time zone)
(4 rows)
Filtering on created (the partitioning key) lets partition pruning work.
BEGIN;
DROP TABLE data_2018;
COMMIT;
You can detach a partition with ALTER TABLE data DETACH PARTITION data_2018;. The table persists but is no longer a part of the partitioned table. The reverse operation is possible by attaching a partition:
ALTER TABLE data ATTACH PARTITION data_2018 FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
Or:
CREATE TABLE data_2022 (LIKE data_2021);
-- populate data_2022 here
ALTER TABLE data ATTACH PARTITION data_2022 FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE data_def PARTITION OF data DEFAULT;
If it doesn’t belong elsewhere, it goes there.
INSERT INTO data (created, catload) VALUES ('2039-02-14 04:19:00', 'SO very cute');
Yikes! Exercise caution when using a default partition. In the example above, you won’t be able to create the data_2039 partition afterwards. This might be fine, depending on how you are defining your partitions:
cat=# CREATE TABLE data_2023 PARTITION OF data FOR VALUES FROM ('2039-01-01 00:00:00') TO ('2040-01-01 00:00:00');
ERROR: updated partition constraint for default partition "data_def" would be violated by some row
That’s because of the row we INSERTed above. With a fixed set of partitions, a default partition is OK, like users by alphabetical order or similar.
Let’s create an index…
Creating an index on data creates a partitioned index, observe: :eyeglasses:
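The CREATE INDEX statement itself isn’t in the transcript; presumably something like:
CREATE INDEX ON data (created);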
cat=# \d data
Partitioned table "public.data"
Column | Type | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+------------------------------
id | bigint | | not null | generated always as identity
created | timestamp without time zone | | not null |
catload | text | | |
Partition key: RANGE (created)
Indexes:
"data_created_idx" btree (created)
Number of partitions: 5 (Use \d+ to list them.)
cat=# \d data_2018
Table "public.data_2018"
Column | Type | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+---------
id | bigint | | not null |
created | timestamp without time zone | | not null |
catload | text | | |
Partition of: data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00')
Indexes:
"data_2018_created_idx" btree (created)
cat=# DROP INDEX data_2018_created_idx;
ERROR: cannot drop index data_2018_created_idx because index data_created_idx requires it
HINT: You can drop index data_created_idx instead.
PostgreSQL prevents you from doing silly things:
cat=# ALTER TABLE data_2019 ADD newcol inet;
ERROR: cannot add column to a partition
The optimizer creates an execution plan individually for each partition. Partitioning creates more work for the query optimizer.
Let’s try to create a primary key:
cat=# ALTER TABLE data ADD PRIMARY KEY (id);
ERROR: unique constraint on partitioned table must include all partitioning columns
DETAIL: PRIMARY KEY constraint on table "data" lacks column "created" which is part of the partition key.
:game_die: No dice, because there is no guarantee of uniqueness across partitions. Uniqueness has to be enforceable within each partition, which is why the partitioning key must be part of the constraint.
cat=# ALTER TABLE data ADD PRIMARY KEY (id, created);
ALTER TABLE
The above works but is unfortunate.
A global index that does the desired thing could happen but would deprive us of the performance benefits of partitioning. Not possible today with PostgreSQL.
Subpartitioning is possible and could theoretically be arbitrarily deep; a syntax sketch follows below. Beyond two layers: WHY :question:
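For reference, a subpartitioning sketch (the table names and the second-level key here are invented, not from the course):
CREATE TABLE measurements (created timestamp NOT NULL, region text NOT NULL, value numeric)
    PARTITION BY RANGE (created);
CREATE TABLE measurements_2021 PARTITION OF measurements
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01')
    PARTITION BY LIST (region);
CREATE TABLE measurements_2021_eu PARTITION OF measurements_2021 FOR VALUES IN ('eu');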
TRIGGER.
11:55 Break! 12:04 Back!
With a query like EXPLAIN SELECT catload, count(*) FROM data GROUP BY catload;:
SET enable_partitionwise_aggregate = on;
For two tables partitioned with the same partitioning key and along the same boundaries, use enable_partitionwise_join.
The enable_partitionwise_aggregate and enable_partitionwise_join options are off by default because they increase planning cost. Turning them on only makes sense if your setup would actually benefit from it.
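A sketch for the join case, assuming a second table data_b partitioned by created with the same yearly boundaries as data (data_b is hypothetical, not created in these notes):
SET enable_partitionwise_join = on;
EXPLAIN SELECT * FROM data JOIN data_b USING (created);
-- With matching partitions on both sides, the planner can join partition pairs
-- individually instead of joining the two full Appends.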
Situations when you might want to use partitioning:
Dropping old data: DROP TABLE on a partition (see the notes above) is nice and easy, instead of a long-running, I/O-intensive DELETE that requires a long VACUUM and introduces bloat.
Bulk loading: COPY FREEZE into a separate table, then create indexes, then attach it as a partition. Recall DETACH and ATTACH from above.
Partition pruning: if queries filter on the partitioning key in WHERE, the sequential scan can run on just those partitions rather than everywhere. :warning: Narrow use case. Could be more expensive. The moral of many of these stories is: IDK, depends on your data, your use case, your tolerance level for different behaviors. The average query may become slower but some queries will be much faster.
VACUUM performance: only one VACUUM can run per table, which could cause performance problems. Partitioning permits simultaneous VACUUMs on the partitions (multiple allowed).
An example:
cat=# EXPLAIN SELECT * FROM data WHERE id = 42;
QUERY PLAN
----------------------------------------------------------------------------------------------
Append (cost=0.29..41.66 rows=5 width=33)
-> Index Scan using data_2018_pkey on data_2018 data_1 (cost=0.29..8.30 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2019_pkey on data_2019 data_2 (cost=0.43..8.45 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2020_pkey on data_2020 data_3 (cost=0.43..8.45 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2021_pkey on data_2021 data_4 (cost=0.29..8.30 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_def_pkey on data_def data_5 (cost=0.12..8.14 rows=1 width=48)
Index Cond: (id = 42)
(11 rows)
With partitioning, you have to do way more scans than you otherwise would so what even is the point.
Big tables and slow queries –> the answer is not necessarily partition! Partitioning is not considered to be a performance improvement. Makes the most sense in a data warehouse or similar. :factory:
Partitioning mostly helps with autovacuum and with specific queries that require a sequential scan. Partitioning was introduced in major version 10. Use at least 12 to take advantage of the benefits: it was essentially feature-complete with 11, and performance improvements began in 12.
Sharding: split data into parts that go into different databases on different hosts, to permit scaling and load distribution. Sensible with data that lends itself to being split in this way; not every data set can be sharded successfully. Retain some data on all shards to make sharding less painful.
GUC - Grand Unified Configuration
If you add trash to postgresql.conf and tell PostgreSQL to reload, the reload fails and the server keeps running with the old config. If you then do systemctl restart or a hard reboot – you’ll really be sad, because the broken config is what’s on disk and the server will try to use that.
There is no --validate-config flag: you just reload and look for errors. :sign_of_the_horns:
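One thing you can do from SQL after a reload is check pg_file_settings, which shows how the config file was parsed and whether any line failed to apply (a sketch):
SELECT pg_reload_conf();
SELECT sourceline, name, setting, applied, error
FROM pg_file_settings
WHERE error IS NOT NULL;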
The GUC (Grand Unified Configuration):
graph TD
defaults --> postgresql.conf;
postgresql.conf --> C[ALTER SYSTEM];
C-->D[start server, override params, see Patroni]
D-->E[ALTER DATABASE, per-database settings]
E-->F[ALTER ROLE]
F-->G[ALTER ROLE IN DATABASE]
G-->H[SET]
H-->I[SET LOCAL]
There are so many options: you could have fun. Say one DBA sets a change in postgresql.conf and another with ALTER SYSTEM. The person changing postgresql.conf will be sad. :crying_cat_face: Talk to your fellow DBAs.
SELECT * FROM pg_settings ;
Explore the output, a few notes:
Settings with context = postmaster must be changed at startup (they require a restart).
Use the source field to figure out where a specific setting has been set!
pending_restart is true for settings that have been changed but require a restart to take effect. So cool!
postgres=# SELECT name,setting,category,context FROM pg_settings WHERE name = 'search_path' ;
name | setting | category | context
-------------+-----------------+-------------------------------------------------+---------
search_path | "$user", public | Client Connection Defaults / Statement Behavior | user
(1 row)
You might want a different schema search path for different databases:
ALTER DATABASE course SET search_path = myapp, public;
That change applies to new connections; existing sessions keep their old setting until they reconnect.
The things lower down will override the things that are higher up.
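A sketch of that precedence using work_mem; the values are purely illustrative, and the role name app_user is made up (course is the database from the example above):
ALTER SYSTEM SET work_mem = '8MB';              -- cluster-wide, written to postgresql.auto.conf
ALTER DATABASE course SET work_mem = '16MB';    -- per database
ALTER ROLE app_user SET work_mem = '32MB';      -- per role
ALTER ROLE app_user IN DATABASE course SET work_mem = '48MB';  -- per role, per database
SET work_mem = '64MB';                          -- this session wins over all of the above
SHOW work_mem;                                  -- shows the effective value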
Design principle: do configuration changes from the client, as you may not have shell access. That’s why there are so many of these little things. Super cool.
postgresql.auto.conf is modified automatically when using ALTER SYSTEM:
ALTER SYSTEM SET enable_partitionwise_aggregate = on;
You still need to reload for changes set via SQL. Thankfully, you can reload via SQL:
SELECT pg_reload_conf();
Not all settings can be changed via SQL – for example, SET port = 5555; won’t work.
After changing postgresql.conf and reloading, you can verify a setting with SHOW, e.g. SHOW server_version;
Use \drds to list the per-database role settings (stored in pg_db_role_setting).
https://www.postgresql.org/message-id/flat/20091008163401.23D8C753FB7%40cvs.postgresql.org
Use SET LOCAL to make a change for a single transaction. Cool! (Increase memory, timeouts, whatever.)
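A minimal sketch (the value is illustrative):
BEGIN;
SET LOCAL work_mem = '256MB';   -- applies only inside this transaction
-- ... run the one big query here ...
COMMIT;                         -- work_mem reverts to its previous value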
shared_buffers beyond the default. The guideline for determining a reasonable value: 1/4 available RAM, no more than 8GB. Why so little? More than that could be beneficial but try it for your workload. (Bigger isn’t always better, consider a representative load test.) PostgreSQL uses buffered I/O: reads may not go to disk so you may still benefit from stuff being cached into memory even if it’s not in PG’s shared buffers (thanks, kernel! :popcorn: ).
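pg_buffercache is a contrib extension, so it needs to be installed in the database before the following commands work:
CREATE EXTENSION pg_buffercache;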
\d pg_buffercache
SELECT usagecount, count(*) FROM pg_buffercache GROUP BY usagecount ORDER BY usagecount;
So, usagecount is a number between 0 and 5. Each use of a buffer increases the count by 1, up to 5. A buffer can only be evicted when its usagecount is 0. That happens when someone passes by looking for an empty buffer: there’s a background process, and any backend looking for a free buffer, sweeping through and counting usagecounts down. Usage increases the usagecount; being skipped over decreases it – the clock sweep algorithm.
Without this, it takes a while for things to get warmed up. If you know what should be cached for good performance, you can prewarm to load it into the cache:
SELECT pg_prewarm('test_upper_idx');
SELECT pg_prewarm('test_id_idx');
Data that has been prewarmed is still subject to the usagecount stuff. You can do some pre-warming at start-up with a parameter called shared_preload_libraries:
shared_preload_libraries = 'pg_prewarm'
Restart PostgreSQL and you’ll see a custom background worker in ps:
postgres: 13/main: autoprewarm master
It regularly checks what’s in shared buffers, persists this in a file on disk in the data directory called….something.
It’s not as bad as Apache configs, which are science.
:joy:
13:01 Break! 13:17 Back!
work_mem: private memory for backend processes, which calculate queries on behalf of the client.
postgres=# EXPLAIN SELECT * FROM pg_stats;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=126.24..153.50 rows=5 width=401)
-> Hash Join (cost=126.11..152.20 rows=5 width=475)
Hash Cond: ((s.starelid = c.oid) AND (s.staattnum = a.attnum))
-> Seq Scan on pg_statistic s (cost=0.00..22.02 rows=402 width=349)
-> Hash (cost=111.80..111.80 rows=954 width=142)
-> Hash Join (cost=22.65..111.80 rows=954 width=142)
Hash Cond: (a.attrelid = c.oid)
Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
-> Seq Scan on pg_attribute a (cost=0.00..81.61 rows=2861 width=70)
Filter: (NOT attisdropped)
-> Hash (cost=17.82..17.82 rows=386 width=72)
-> Seq Scan on pg_class c (cost=0.00..17.82 rows=386 width=72)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
-> Index Scan using pg_namespace_oid_index on pg_namespace n (cost=0.13..0.18 rows=1 width=68)
Index Cond: (oid = c.relnamespace)
(15 rows)
There are a great many nodes or steps in the execution plan, and each node or step could use up to work_mem of memory. Most nodes won’t use a lot of memory (sequential and index scans, for example). The node types that do consume memory are things like sorts and hashes (hash joins, hash aggregates).
The above makes it tricky to calculate what work_mem should be. LA recommends this formula as a guideline:
shared_buffers + work_mem * max_connections <= RAM
Consider max_connections an upper limit; you might have far fewer connections in reality.
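For example (illustrative numbers, not from the course): with 16GB of RAM, shared_buffers = 4GB, and max_connections = 100, work_mem should stay somewhere below (16GB - 4GB) / 100, roughly 120MB – and in practice lower, because a single query can use work_mem several times over.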
Example:
postgres=# CREATE DATABASE cats;
CREATE DATABASE
postgres=# \c cats
You are now connected to database "cats" as user "postgres".
cats=# CREATE TABLE work (id integer GENERATED ALWAYS AS IDENTITY, name text);
CREATE TABLE
cats=# INSERT INTO work (name) SELECT 'brie and plop' FROM generate_series(1, 1000000);
INSERT 0 1000000
cats=# INSERT INTO work (name) SELECT 'plop so cute' FROM generate_series(1, 1000000);
INSERT 0 1000000
cats=# ANALYZE work;
ANALYZE
cats=# EXPLAIN SELECT name, count(*) FROM work GROUP BY name;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=26239.05..26239.56 rows=2 width=21)
Group Key: name
-> Gather Merge (cost=26239.05..26239.52 rows=4 width=21)
Workers Planned: 2
-> Sort (cost=25239.03..25239.03 rows=2 width=21)
Sort Key: name
-> Partial HashAggregate (cost=25239.00..25239.02 rows=2 width=21)
Group Key: name
-> Parallel Seq Scan on work (cost=0.00..21072.33 rows=833333 width=13)
(9 rows)
cats=# SET max_parallel_workers_per_gather = 0;
SET
cats=# EXPLAIN SELECT name, count(*) FROM work GROUP BY name;
QUERY PLAN
---------------------------------------------------------------------
HashAggregate (cost=42739.00..42739.02 rows=2 width=21)
Group Key: name
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=13)
(3 rows)
Just the statements:
CREATE DATABASE cats;
\c cats
CREATE TABLE work (id integer GENERATED ALWAYS AS IDENTITY, name text);
INSERT INTO work (name) SELECT 'brie and plop' FROM generate_series(1, 1000000);
INSERT INTO work (name) SELECT 'plop so cute' FROM generate_series(1, 1000000);
ANALYZE work;
The HashAggregate takes care of the GROUP BY.
If we SELECT id instead:
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
---------------------------------------------------------------------------------
HashAggregate (cost=145239.00..180864.00 rows=2000000 width=12)
Group Key: id
Planned Partitions: 64
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
JIT:
Functions: 3
Options: Inlining false, Optimization false, Expressions true, Deforming true
(7 rows)
This will spill to temporary files on disk (note Planned Partitions: 64) because work_mem is too small. :crying_cat_face: Let’s fix that:
cats=# SET work_mem = '1GB';
SET
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
--------------------------------------------------------------------
HashAggregate (cost=42739.00..62739.00 rows=2000000 width=12)
Group Key: id
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
(3 rows)
The minimum possible work_mem is 64kB. Let’s set work_mem to that minimum and see how the query plan changes:
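The SET isn’t shown in the transcript; presumably:
SET work_mem = '64kB';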
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
---------------------------------------------------------------------------------
GroupAggregate (cost=351436.69..386436.69 rows=2000000 width=12)
Group Key: id
-> Sort (cost=351436.69..356436.69 rows=2000000 width=4)
Sort Key: id
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
(8 rows)
We see that GroupAggregate is used instead. The cost values start to be useful: we can see the impact of a lower work_mem. :ram:
Watch how many temp files are being written for a database called cats:
cats=# SELECT temp_files, temp_bytes FROM pg_stat_database WHERE datname = 'cats' \watch 4
Fri Sep 3 17:31:55 2021 (every 4s)
temp_files | temp_bytes
------------+------------
2 | 28000000
(1 row)
From the docs:
work_mem (integer)
Sets the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. The default value is four megabytes (4MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will generally be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
Hash-based operations are generally more sensitive to memory availability than equivalent sort-based operations. The memory available for hash tables is computed by multiplying work_mem by hash_mem_multiplier. This makes it possible for hash-based operations to use an amount of memory that exceeds the usual work_mem base amount.
We care about the connections that are actually consuming system resources. A sufficiently high max_connections value can (and will) degrade performance. Also, consider work_mem when setting max_connections – see the guideline above.
Use connection pools to manage the number of active connections.
The friends of work_mem:
maintenance_work_mem – private memory for index creation, CLUSTER, and VACUUM FULL. You probably don’t do this too often, so feel free to be generous.
autovacuum_work_mem – defaults to the value of maintenance_work_mem; it limits the number of rows that can be processed at a time. Be generous; raise it to speed up vacuum.
temp_file_limit – let’s look at temp files.
SELECT * FROM generate_series(1, 10000000000000000000000);
while true ; do date ; ls -ahls ; sleep 1 ; done
Cancel the SELECT from above unless you want to live on the wild side: runaway queries could exhaust your disk space.
Planner Cost Constants
Tune to match hardware.
random_page_cost – the most important one; tuned for spinning disks by default.
seq_page_cost – the cost to read one 8K page during a sequential scan.
With SSDs, set random_page_cost to (or close to) seq_page_cost.
effective_cache_size – moderate impact, easy to set. It tells the optimizer how much memory is available for caching data; PG doesn’t otherwise know how much RAM your machine has, lol. Index scans become cheaper with an increased effective_cache_size.
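A sketch for SSD-backed storage; the values are illustrative starting points, not recommendations from the course:
ALTER SYSTEM SET random_page_cost = 1.1;        -- default is 4.0, tuned for spinning disks
ALTER SYSTEM SET effective_cache_size = '8GB';  -- roughly shared_buffers plus the OS page cache
SELECT pg_reload_conf();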
:warning: Avoid the temptation to tune a bunch of parameters all at once just because you have a theory or read somewhere on the Internet that a parameter is super awesome.
:hospital: Essentially med school for database doctors. You can learn so much and then you see a bunch of queries (patients) and can apply suggestions from your experience.
The Gather in an execution plan denotes parallelization.
Screenshot around 13:52 local time

Everything under the Gather node is parallelized.
max_parallel_workers = limit of parallel workers across entire cluster. Take # cores into consideration. More parallel workers than cores –> No. Use some number less than the number of cores.
#max_parallel_workers = 8 # maximum number of max_worker_processes that
# can be used in parallel operations
Tune for conflicting ends: throughput vs response time.
For better response times on big queries, set max_parallel_workers and max_parallel_workers_per_gather a bit higher. You could find a difference between Workers Planned and Workers Launched depending on what’s going on when the query actually runs: the query won’t wait for all of the planned workers to become available before running.
Hmm: observe that parallelization is not used here; look at the rows estimate to understand why:

screenshot around 13:59 local time
parallel_setup_cost –> cost for setting up parallelization
parallel_tuple_cost – estimate for transferring rows between parallel processes.
min_parallel_table_scan_size – if a table exceeds this size, a parallel worker is considered; as the table keeps growing, further workers are considered, increasing logarithmically with table size.
EXPLAIN SELECT count(*) FROM test;
SET max_parallel_workers_per_gather = 8 ;
EXPLAIN SELECT count(*) FROM test;
:moneybag: Never mind the cost, make it fast:
ALTER TABLE test SET (parallel_workers = 6);
14:04 Break! 14:12 Back!
log_min_duration_statement - log queries that take longer than a certain amount of time.
We don’t just want log_min_duration_statement, though. Instead, we want pg_stat_statements. New shared_preload_libraries:
shared_preload_libraries = 'pg_prewarm,pg_stat_statements'
LA recommends always having this in production. It creates an extra shared memory segment with statistics about executed queries. It’s not a record of every individual query; it lumps them together: queries that only differ by constants are treated as identical.
There’s a built in benchmarking tool: pgbench.
$ pgbench -c 2 -T 30 cats
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 2
number of threads: 1
duration: 30 s
number of transactions actually processed: 29922
latency average = 2.005 ms
tps = 997.304049 (including connections establishing)
tps = 997.414817 (excluding connections establishing)
A high standard deviation: query is sometimes slow and sometimes fast.
CREATE EXTENSION pg_stat_statements;
\d pg_stat_statements
SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
Here’s how that looks after running pgbench:
postgres=# SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
calls | total_exec_time | query
-------+--------------------+------------------------------------------------------------------------------------------------------
29019 | 1472.7642000000155 | UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2
29019 | 1397.9302000000264 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2
29019 | 1044.9512000000009 | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2
1 | 918.3737 | INSERT INTO large SELECT i, i / $1 + $2 FROM generate_series($3, $4) AS i
29019 | 648.4597000000005 | SELECT abalance FROM pgbench_accounts WHERE aid = $1
29019 | 557.028700000003 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP)
29019 | 181.55990000000943 | BEGIN
29019 | 176.7948000000034 | END
1 | 57.3338 | CREATE EXTENSION pg_stat_statements
1 | 10.5965 | vacuum pgbench_branches
(10 rows)
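Related to the standard-deviation remark above: pg_stat_statements also tracks per-query variance. A sketch (column names as in PostgreSQL 13+, matching total_exec_time above):
SELECT calls, mean_exec_time, stddev_exec_time, query
FROM pg_stat_statements
ORDER BY stddev_exec_time DESC
LIMIT 10;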
cats=# CREATE TABLE large (id integer, x integer);
CREATE TABLE
cats=# INSERT INTO large SELECT i, i / 10000 + 1 FROM generate_series(1, 1000000) AS i;
INSERT 0 1000000
This will make blocks of rows with the same value for x.
CREATE TABLE small (x integer);
INSERT INTO small VALUES (1);
\timing
Observe that SELECT * FROM large JOIN small USING (x); takes a while (Time: 233.530 ms).
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
So much info:
# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=30832.00..198533.19 rows=12750000 width=8) (actual time=10265.965..10330.206 rows=9999 loops=1)
Hash Cond: (small.x = large.x)
Buffers: shared hit=4426, temp read=2845 written=4218
-> Seq Scan on small (cost=0.00..35.50 rows=2550 width=4) (actual time=0.009..0.020 rows=1 loops=1)
Buffers: shared hit=1
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8) (actual time=10217.431..10217.450 rows=1000000 loops=1)
Buckets: 131072 (originally 131072) Batches: 32 (originally 16) Memory Usage: 3073kB
Buffers: shared hit=4425, temp written=3276
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8) (actual time=4.940..5004.856 rows=1000000 loops=1)
Buffers: shared hit=4425
Planning Time: 0.189 ms
JIT:
Functions: 10
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.847 ms, Inlining 0.000 ms, Optimization 0.298 ms, Emission 4.485 ms, Total 5.630 ms
Execution Time: 10380.233 ms
(16 rows)
Time: 10380.995 ms (00:10.381)
If the differences between estimated and actual row counts are small, you are probably good.
JOINs in PostgreSQL – to understand misestimates, it helps to know how a JOIN can be executed.
There are three ways that a JOIN could be calculated:

| JOIN Strategies | Nested Loop JOIN | Hash JOIN | Merge JOIN |
|---|---|---|---|
| How it works | For each row in the outer table, look for a match in the inner table. The obvious and somewhat naive approach. | Sequential scan of the inner table builds a hash, keyed on the JOIN condition; a sequential scan of the outer table then probes the hash instead of the entire inner table. The startup cost of building the hash is repaid by later probe performance. | Sort both tables on the JOIN condition and then merge them. |
| Good? | Efficient in some setups: a small “outer table” used in the outer loop. | Bigger tables, if the hash fits into work_mem. If the hash of the inner table doesn’t fit in work_mem, less good. | Large tables, when a hash JOIN is no fun (too big for work_mem). |
| Index? | On the JOIN condition, on the inner table. | Using an index is not helpful – probably. | On the JOIN condition, on both tables. |
A poor estimate could lead to PostgreSQL choosing the wrong kind of JOIN.
FUN!
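One way to play with this (for exploration, not production): temporarily disable a strategy and compare plans, reusing the large/small tables from above.
SET enable_hashjoin = off;      -- the planner falls back to another strategy if possible
EXPLAIN SELECT * FROM large JOIN small USING (x);
RESET enable_hashjoin;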
The results from explain.dalibo.com are pretty:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM large
JOIN small USING (x);


Problem: The estimate is WAY higher than the actual – why? Statistics are off.
Fix: ANALYZE small;
It’s true!
# ANALYZE small;
ANALYZE
Time: 1.875 ms
cats=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
-------
Gather (cost=1001.02..12196.86 rows=10000 width=8) (actual time=0.379..3302.646 rows=9999 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=4623 read=1
-> Hash Join (cost=1.02..10196.86 rows=4167 width=8) (actual time=0.327..3271.277 rows=3333 loops=3)
Hash Cond: (large.x = small.x)
Buffers: shared hit=4623 read=1
-> Parallel Seq Scan on large (cost=0.00..8591.67 rows=416667 width=8) (actual time=0.046..1624.942 rows=333333 lo
ops=3)
Buffers: shared hit=4425
-> Hash (cost=1.01..1.01 rows=1 width=4) (actual time=0.055..0.069 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on small (cost=0.00..1.01 rows=1 width=4) (actual time=0.017..0.028 rows=1 loops=3)
Buffers: shared hit=3
Planning:
Buffers: shared hit=5
Planning Time: 0.208 ms
Execution Time: 3358.110 ms
(18 rows)
Time: 3358.938 ms (00:03.359)
The small table is tiny and autoanalyze never ran because we didn’t reach the threshold. That totally makes sense.
cats=# CREATE INDEX ON large(x);
CREATE INDEX
Time: 327.184 ms
cats=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
------
Nested Loop (cost=0.42..360.44 rows=10000 width=8) (actual time=0.065..167.526 rows=9999 loops=1)
Buffers: shared hit=46 read=11
-> Seq Scan on small (cost=0.00..1.01 rows=1 width=4) (actual time=0.006..0.018 rows=1 loops=1)
Buffers: shared hit=1
-> Index Scan using large_x_idx on large (cost=0.42..259.43 rows=10000 width=8) (actual time=0.037..58.590 rows=9999 loo
ps=1)
Index Cond: (x = small.x)
Buffers: shared hit=45 read=11
Planning:
Buffers: shared hit=15 read=1
Planning Time: 0.306 ms
Execution Time: 220.724 ms
(11 rows)
Time: 221.788 ms
cats=#
Scroll :up: to see how we reduced the time of the query with a few steps.
If you have a sequential scan with a filter and the majority of the rows are filtered out: use an index.
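A sketch with hypothetical table and column names (orders/status):
EXPLAIN (ANALYZE) SELECT * FROM orders WHERE status = 'open';
-- If "Rows Removed by Filter" is most of the table, an index usually pays off:
CREATE INDEX ON orders (status);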
