Get to [2021-training-notes-day-4].
So we can learn about partitions.
CREATE DATABASE cat;
\c cat
CREATE TABLE person (id bigint, name text, bday date);
INSERT INTO person VALUES(1, 'oscar', '1996-11-23');
INSERT INTO person VALUES(2, 'brie', '2000-11-23');
SELECT id, name FROM person WHERE bday > '2000-01-01';
id | name
----+------
2 | brie
(1 row)
Say we use that query frequently… We create a view. A view is a SELECT with a name. It looks and acts like a table but isn’t one.
cat=# CREATE VIEW youngsters AS SELECT id, name FROM person WHERE bday > '2000-01-01';
CREATE VIEW
cat=# \d youngsters
View "public.youngsters"
Column | Type | Collation | Nullable | Default
--------+--------+-----------+----------+---------
id | bigint | | |
name | text | | |
cat=# SELECT * FROM youngsters;
id | name
----+------
2 | brie
(1 row)
Explore EXPLAIN using the view and the underlying query:
cat=# EXPLAIN SELECT * FROM youngsters;
QUERY PLAN
----------------------------------------------------------
Seq Scan on person (cost=0.00..24.12 rows=377 width=40)
Filter: (bday > '2000-01-01'::date)
(2 rows)
cat=# EXPLAIN SELECT id, name FROM person WHERE bday > '2000-01-01';
QUERY PLAN
----------------------------------------------------------
Seq Scan on person (cost=0.00..24.12 rows=377 width=40)
Filter: (bday > '2000-01-01'::date)
(2 rows)
Useful for: decoupling the application from the underlying tables. This makes application updates much less annoying: even if you change the tables around, just update the view and the application gets the same results. As with everything, use in moderation. Views on views on views on views. Please don’t. :turtle: :turtle: :turtle:
Partitions as kinds of views.
The partitioning key determines which partition a row belongs in.
On INSERT, each row goes into exactly one partition. SELECT collects rows from all partitions as if you had done UNION ALL.
:arrow_right: INSERT
:arrow_left: SELECT

There are three partitioning strategies:
You can do range partitioning – everything between x and y goes here. No overlaps; partition boundaries must be unique. List partitioning is also possible (a list of values is specified instead of a range). Hash partitioning – the partitioning key is hashed (the appropriate hash function depends on the data type), the result is divided by the number of partitions, and the remainder determines the resulting partition. Mostly useless: good for randomly splitting a big table into many, with an I/O-balancing advantage, but beyond that – eh, why?
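A minimal sketch of the other two strategies, with made-up table names (pets, events) just for illustration:
CREATE TABLE pets (name text NOT NULL, species text NOT NULL) PARTITION BY LIST (species);
CREATE TABLE pets_cats PARTITION OF pets FOR VALUES IN ('cat');
CREATE TABLE pets_dogs PARTITION OF pets FOR VALUES IN ('dog');
CREATE TABLE events (id bigint NOT NULL, payload text) PARTITION BY HASH (id);
CREATE TABLE events_0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Back to range partitioning for the running example: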
CREATE TABLE data (id bigint GENERATED ALWAYS AS IDENTITY NOT NULL, created timestamp NOT NULL, catload text) PARTITION BY RANGE (created);
The (created) value specifies the partitioning key.
cat=# INSERT INTO data (created, catload) VALUES (current_timestamp, 'Cats are so extremely cute, OMG.');
ERROR: no partition of relation "data" found for row
DETAIL: Partition key of the failing row contains (created) = (2021-09-03 15:23:32.195025).
cat=#
Automatic partition creation == NO. Not currently available in PostgreSQL. Your application or cron or something has to create the partitions. We create them with CREATE TABLE:
CREATE TABLE data_2021 PARTITION OF data FOR VALUES FROM ('2021-01-01 00:00:00') TO ('2022-01-01 00:00:00');
CREATE TABLE data_2020 PARTITION OF data FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2021-01-01 00:00:00');
CREATE TABLE data_2019 PARTITION OF data FOR VALUES FROM ('2019-01-01 00:00:00') TO ('2020-01-01 00:00:00');
CREATE TABLE data_2018 PARTITION OF data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');
The upper boundary is always excluded, which avoids overlaps at midnight on New Year’s Day with the statements above. If you try to create an overlapping partition, you’ll get an error:
cat=# CREATE TABLE data_dogs PARTITION OF data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00');
ERROR: partition "data_dogs" would overlap partition "data_2018"
cat=# INSERT INTO data (created, catload) VALUES (current_timestamp, 'something');
INSERT 0 1
cat=# TABLE data_2021;
id | created | catload
----+----------------------------+-----------
2 | 2021-09-03 15:28:12.284289 | something
(1 row)
cat=# INSERT INTO data_2018 (created, catload) VALUES (current_timestamp, 'something');
ERROR: null value in column "id" of relation "data_2018" violates not-null constraint
DETAIL: Failing row contains (null, 2021-09-03 15:28:28.176442, something).
In use:
cat=# EXPLAIN SELECT * FROM data;
QUERY PLAN
---------------------------------------------------------------------------
Append (cost=0.00..104.20 rows=4280 width=48)
-> Seq Scan on data_2018 data_1 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2019 data_2 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2020 data_3 (cost=0.00..20.70 rows=1070 width=48)
-> Seq Scan on data_2021 data_4 (cost=0.00..20.70 rows=1070 width=48)
(5 rows)
Updates that move data between partitions regularly probably mean you should have chosen a better partitioning strategy.
cat=# UPDATE data SET created = created - INTERVAL '1 year' WHERE id = '2' ;
UPDATE 1
cat=# TABLE data_2021;
id | created | catload
----+---------+---------
(0 rows)
cat=# TABLE data_2020;
id | created | catload
----+----------------------------+-----------
2 | 2020-09-03 15:28:12.284289 | something
(1 row)
Let’s add 1,000,000+ rows that will be in different partitions:
INSERT INTO data (created, catload) SELECT t, 'random value' FROM generate_series(TIMESTAMP '2018-12-29 00:00:00', TIMESTAMP '2021-01-06 00:00:00', INTERVAL '1 minute') AS gs(t);
INSERT 0 1064161
Partition pruning in action:
cat=# EXPLAIN SELECT * FROM data WHERE id = 42;
QUERY PLAN
------------------------------------------------------------------------------------------
Gather (cost=1000.00..27785.85 rows=4 width=29)
Workers Planned: 2
-> Parallel Append (cost=0.00..26785.45 rows=4 width=29)
-> Parallel Seq Scan on data_2020 data_3 (cost=0.00..13241.01 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2019 data_2 (cost=0.00..13205.00 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2021 data_4 (cost=0.00..211.90 rows=1 width=29)
Filter: (id = 42)
-> Parallel Seq Scan on data_2018 data_1 (cost=0.00..127.53 rows=1 width=29)
Filter: (id = 42)
(11 rows)
cat=# EXPLAIN SELECT * FROM data WHERE created = '2020-07-12 12:00:00';
QUERY PLAN
----------------------------------------------------------------------------------
Gather (cost=1000.00..14241.21 rows=2 width=29)
Workers Planned: 2
-> Parallel Seq Scan on data_2020 data (cost=0.00..13241.01 rows=1 width=29)
Filter: (created = '2020-07-12 12:00:00'::timestamp without time zone)
(4 rows)
Filtering on created (the partitioning key) lets partition pruning work.
BEGIN;
DROP TABLE data_2018;
COMMIT;
You can detach a partition with ALTER TABLE data DETACH PARTITION data_2018;. The table persists but is no longer a part of the partitioned table. The reverse operation is possible by attaching a partition:
ALTER TABLE data ATTACH PARTITION data_2018 FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
Or:
CREATE TABLE data_2022 (LIKE data_2021);
-- populate data_2022 here
ALTER TABLE data ATTACH PARTITION data_2022 FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE data_def PARTITION OF data DEFAULT;
If it doesn’t belong elsewhere, it goes there.
INSERT INTO data (created, catload) VALUES ('2039-02-14 04:19:00', 'SO very cute');
Yikes! Exercise caution when using a default partition. In the example above, you won’t be able to create the data_2039 partition afterwards. This might be fine, depending on how you are defining your partitions:
cat=# CREATE TABLE data_2023 PARTITION OF data FOR VALUES FROM ('2039-01-01 00:00:00') TO ('2040-01-01 00:00:00');
ERROR: updated partition constraint for default partition "data_def" would be violated by some row
That’s because of the row we INSERTed above. With a fixed set of partitions, a default partition is OK, like users by alphabetical order or similar.
Let’s create an index…
Creating an index on data creates a partitioned index, observe: :eyeglasses:
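The CREATE INDEX statement itself isn’t in the transcript; presumably something like:
CREATE INDEX ON data (created);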
cat=# \d data
Partitioned table "public.data"
Column | Type | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+------------------------------
id | bigint | | not null | generated always as identity
created | timestamp without time zone | | not null |
catload | text | | |
Partition key: RANGE (created)
Indexes:
"data_created_idx" btree (created)
Number of partitions: 5 (Use \d+ to list them.)
cat=# \d data_2018
Table "public.data_2018"
Column | Type | Collation | Nullable | Default
---------+-----------------------------+-----------+----------+---------
id | bigint | | not null |
created | timestamp without time zone | | not null |
catload | text | | |
Partition of: data FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2019-01-01 00:00:00')
Indexes:
"data_2018_created_idx" btree (created)
cat=# DROP INDEX data_2018_created_idx;
ERROR: cannot drop index data_2018_created_idx because index data_created_idx requires it
HINT: You can drop index data_created_idx instead.
PostgreSQL prevents you from doing silly things:
cat=# ALTER TABLE data_2019 ADD newcol inet;
ERROR: cannot add column to a partition
The optimizer creates an execution plan individually for each partition. Partitioning creates more work for the query optimizer.
Let’s try to create a primary key:
cat=# ALTER TABLE data ADD PRIMARY KEY (id);
ERROR: unique constraint on partitioned table must include all partitioning columns
DETAIL: PRIMARY KEY constraint on table "data" lacks column "created" which is part of the partition key.
:game_die: No dice, because there is no guarantee of uniqueness across partitions. Uniqueness has to be enforceable within each partition, which is why the partitioning key must be part of the constraint.
cat=# ALTER TABLE data ADD PRIMARY KEY (id, created);
ALTER TABLE
The above works but is unfortunate.
A global index that does the desired thing could happen but would deprive us of the performance benefits of partitioning. Not possible today with PostgreSQL.
Subpartitioning is possible and could theoretically be arbitrarily deep; a syntax sketch follows below. Beyond two layers: WHY :question:
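For reference, a subpartitioning sketch (the table names and the second-level key here are invented, not from the course):
CREATE TABLE measurements (created timestamp NOT NULL, region text NOT NULL, value numeric)
    PARTITION BY RANGE (created);
CREATE TABLE measurements_2021 PARTITION OF measurements
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01')
    PARTITION BY LIST (region);
CREATE TABLE measurements_2021_eu PARTITION OF measurements_2021 FOR VALUES IN ('eu');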
TRIGGER.
11:55 Break! 12:04 Back!
With a query like EXPLAIN SELECT catload, count(*) FROM data GROUP BY catload;:
SET enable_partitionwise_aggregate = on;
For two tables partitioned with the same partitioning key and along the same boundaries, use enable_partitionwise_join.
The enable_partitionwise_aggregate and enable_partitionwise_join options are off by default because they increase planning cost. Turning them on only makes sense if your setup would actually benefit from it.
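A sketch for the join case, assuming a second table data_b partitioned by created with the same yearly boundaries as data (data_b is hypothetical, not created in these notes):
SET enable_partitionwise_join = on;
EXPLAIN SELECT * FROM data JOIN data_b USING (created);
-- With matching partitions on both sides, the planner can join partition pairs
-- individually instead of joining the two full Appends.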
Situations when you might want to use partitioning:
Dropping old data: DROP TABLE on a partition (see the notes above) is nice and easy, instead of a long-running, I/O-intensive DELETE that requires a long VACUUM and introduces bloat.
Bulk loading: COPY FREEZE into a separate table, then create indexes, then attach it as a partition. Recall DETACH and ATTACH from above.
Partition pruning: if queries filter on the partitioning key in WHERE, the sequential scan can run on just those partitions rather than everywhere. :warning: Narrow use case. Could be more expensive. The moral of many of these stories is: IDK, depends on your data, your use case, your tolerance level for different behaviors. The average query may become slower but some queries will be much faster.
VACUUM performance: only one VACUUM can run per table, which could cause performance problems. Partitioning permits simultaneous VACUUMs on the partitions (multiple allowed).
An example:
cat=# EXPLAIN SELECT * FROM data WHERE id = 42;
QUERY PLAN
----------------------------------------------------------------------------------------------
Append (cost=0.29..41.66 rows=5 width=33)
-> Index Scan using data_2018_pkey on data_2018 data_1 (cost=0.29..8.30 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2019_pkey on data_2019 data_2 (cost=0.43..8.45 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2020_pkey on data_2020 data_3 (cost=0.43..8.45 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_2021_pkey on data_2021 data_4 (cost=0.29..8.30 rows=1 width=29)
Index Cond: (id = 42)
-> Index Scan using data_def_pkey on data_def data_5 (cost=0.12..8.14 rows=1 width=48)
Index Cond: (id = 42)
(11 rows)
With partitioning, you have to do way more scans than you otherwise would so what even is the point.
Big tables and slow queries –> the answer is not necessarily partition! Partitioning is not considered to be a performance improvement. Makes the most sense in a data warehouse or similar. :factory:
Partitioning mostly helps with autovacuum and with specific queries that require a sequential scan. Partitioning was introduced in major version 10. Use at least 12 to take advantage of the benefits: it was essentially feature-complete with 11, and performance improvements began in 12.
Sharding: split data into parts that go into different databases on different hosts, to permit scaling and load distribution. Sensible with data that lends itself to being split in this way; not every data set can be sharded successfully. Retain some data on all shards to make sharding less painful.
GUC - Grand Unified Configuration
If you add trash to postgresql.conf and tell PostgreSQL to reload, the reload fails and the server keeps running with the old config. If you then do systemctl restart or a hard reboot – you’ll really be sad, because the broken config is what’s on disk and the server will try to use that.
There is no --validate-config flag: you just reload and look for errors. :sign_of_the_horns:
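One thing you can do from SQL after a reload is check pg_file_settings, which shows how the config file was parsed and whether any line failed to apply (a sketch):
SELECT pg_reload_conf();
SELECT sourceline, name, setting, applied, error
FROM pg_file_settings
WHERE error IS NOT NULL;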
The GUC (Grand Unified Configuration):
graph TD
defaults --> postgresql.conf;
postgresql.conf --> C[ALTER SYSTEM];
C-->D[start server, override params, see Patroni]
D-->E[ALTER DATABASE, per-database settings]
E-->F[ALTER ROLE]
F-->G[ALTER ROLE IN DATABASE]
G-->H[SET]
H-->I[SET LOCAL]
There are so many options: you could have fun. Say one DBA sets a change in postgresql.conf and another with ALTER SYSTEM. The person changing postgresql.conf will be sad. :crying_cat_face: Talk to your fellow DBAs.
SELECT * FROM pg_settings ;
Explore the output, a few notes:
Settings with context = postmaster must be changed at startup (they require a restart).
Use the source field to figure out where a specific setting has been set!
pending_restart is true for settings that have been changed but require a restart to take effect. So cool!
postgres=# SELECT name,setting,category,context FROM pg_settings WHERE name = 'search_path' ;
name | setting | category | context
-------------+-----------------+-------------------------------------------------+---------
search_path | "$user", public | Client Connection Defaults / Statement Behavior | user
(1 row)
You might want a different schema search path for different databases:
ALTER DATABASE course SET search_path = myapp, public;
That change applies to new connections; existing sessions keep their old setting until they reconnect.
The things lower down will override the things that are higher up.
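A sketch of that precedence using work_mem; the values are purely illustrative, and the role name app_user is made up (course is the database from the example above):
ALTER SYSTEM SET work_mem = '8MB';              -- cluster-wide, written to postgresql.auto.conf
ALTER DATABASE course SET work_mem = '16MB';    -- per database
ALTER ROLE app_user SET work_mem = '32MB';      -- per role
ALTER ROLE app_user IN DATABASE course SET work_mem = '48MB';  -- per role, per database
SET work_mem = '64MB';                          -- this session wins over all of the above
SHOW work_mem;                                  -- shows the effective value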
Design principle: do configuration changes from the client, as you may not have shell access. That’s why there are so many of these little things. Super cool.
postgresql.auto.conf is modified automatically when using ALTER SYSTEM:
ALTER SYSTEM SET enable_partitionwise_aggregate = on;
You still need to reload for changes set via SQL. Thankfully, you can reload via SQL:
SELECT pg_reload_conf();
Not all settings can be changed via SQL – for example, SET port = 5555; won’t work.
After changing postgresql.conf and reloading, you can verify a setting with SHOW, e.g. SHOW server_version;
Use \drds to list the per-database role settings (stored in pg_db_role_setting).
https://www.postgresql.org/message-id/flat/20091008163401.23D8C753FB7%40cvs.postgresql.org
Use SET LOCAL to make a change for a single transaction. Cool! (Increase memory, timeouts, whatever.)
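A minimal sketch (the value is illustrative):
BEGIN;
SET LOCAL work_mem = '256MB';   -- applies only inside this transaction
-- ... run the one big query here ...
COMMIT;                         -- work_mem reverts to its previous value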
shared_buffers beyond the default. The guideline for determining a reasonable value: 1/4 available RAM, no more than 8GB. Why so little? More than that could be beneficial but try it for your workload. (Bigger isn’t always better, consider a representative load test.) PostgreSQL uses buffered I/O: reads may not go to disk so you may still benefit from stuff being cached into memory even if it’s not in PG’s shared buffers (thanks, kernel! :popcorn: ).
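pg_buffercache is a contrib extension, so it needs to be installed in the database before the following commands work:
CREATE EXTENSION pg_buffercache;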
\d pg_buffercache
SELECT usagecount, count(*) FROM pg_buffercache GROUP BY usagecount ORDER BY usagecount;
So, usagecount is a number between 0 and 5. Each use of a buffer increases the count by 1, up to 5. A buffer can only be evicted when its usagecount is 0. That happens when someone passes by looking for an empty buffer: there’s a background process, and any backend looking for a free buffer, sweeping through and counting usagecounts down. Usage increases the usagecount; being skipped over decreases it – the clock sweep algorithm.
Without this, it takes a while for things to get warmed up. If you know what should be cached for good performance, you can prewarm to load it into the cache:
SELECT pg_prewarm('test_upper_idx');
SELECT pg_prewarm('test_id_idx');
Data that has been prewarmed is still subject to the usagecount stuff. You can do some pre-warming at start-up with a parameter called shared_preload_libraries:
shared_preload_libraries = 'pg_prewarm'
Restart PostgreSQL and you’ll see a custom background worker in ps:
postgres: 13/main: autoprewarm master
It regularly checks what’s in shared buffers, persists this in a file on disk in the data directory called….something.
It’s not as bad as Apache configs, which are science.
:joy:
13:01 Break! 13:17 Back!
work_mem: private memory for backend processes, which calculate queries on behalf of the client.
postgres=# EXPLAIN SELECT * FROM pg_stats;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=126.24..153.50 rows=5 width=401)
-> Hash Join (cost=126.11..152.20 rows=5 width=475)
Hash Cond: ((s.starelid = c.oid) AND (s.staattnum = a.attnum))
-> Seq Scan on pg_statistic s (cost=0.00..22.02 rows=402 width=349)
-> Hash (cost=111.80..111.80 rows=954 width=142)
-> Hash Join (cost=22.65..111.80 rows=954 width=142)
Hash Cond: (a.attrelid = c.oid)
Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
-> Seq Scan on pg_attribute a (cost=0.00..81.61 rows=2861 width=70)
Filter: (NOT attisdropped)
-> Hash (cost=17.82..17.82 rows=386 width=72)
-> Seq Scan on pg_class c (cost=0.00..17.82 rows=386 width=72)
Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
-> Index Scan using pg_namespace_oid_index on pg_namespace n (cost=0.13..0.18 rows=1 width=68)
Index Cond: (oid = c.relnamespace)
(15 rows)
There are a great many nodes or steps in the execution plan, and each node or step could use up to work_mem of memory. Most nodes won’t use a lot of memory (sequential and index scans, for example). The node types that do consume memory are things like sorts and hashes (hash joins, hash aggregates).
The above makes it tricky to calculate what work_mem should be. LA recommends this formula as a guideline:
shared_buffers + work_mem * max_connections <= RAM
Consider max_connections an upper limit; you might have far fewer connections in reality.
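For example (illustrative numbers, not from the course): with 16GB of RAM, shared_buffers = 4GB, and max_connections = 100, work_mem should stay somewhere below (16GB - 4GB) / 100, roughly 120MB – and in practice lower, because a single query can use work_mem several times over.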
Example:
postgres=# CREATE DATABASE cats;
CREATE DATABASE
postgres=# \c cats
You are now connected to database "cats" as user "postgres".
cats=# CREATE TABLE work (id integer GENERATED ALWAYS AS IDENTITY, name text);
CREATE TABLE
cats=# INSERT INTO work (name) SELECT 'brie and plop' FROM generate_series(1, 1000000);
INSERT 0 1000000
cats=# INSERT INTO work (name) SELECT 'plop so cute' FROM generate_series(1, 1000000);
INSERT 0 1000000
cats=# ANALYZE work;
ANALYZE
cats=# EXPLAIN SELECT name, count(*) FROM work GROUP BY name;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=26239.05..26239.56 rows=2 width=21)
Group Key: name
-> Gather Merge (cost=26239.05..26239.52 rows=4 width=21)
Workers Planned: 2
-> Sort (cost=25239.03..25239.03 rows=2 width=21)
Sort Key: name
-> Partial HashAggregate (cost=25239.00..25239.02 rows=2 width=21)
Group Key: name
-> Parallel Seq Scan on work (cost=0.00..21072.33 rows=833333 width=13)
(9 rows)
cats=# SET max_parallel_workers_per_gather = 0;
SET
cats=# EXPLAIN SELECT name, count(*) FROM work GROUP BY name;
QUERY PLAN
---------------------------------------------------------------------
HashAggregate (cost=42739.00..42739.02 rows=2 width=21)
Group Key: name
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=13)
(3 rows)
Just the statements:
CREATE DATABASE cats;
\c cats
CREATE TABLE work (id integer GENERATED ALWAYS AS IDENTITY, name text);
INSERT INTO work (name) SELECT 'brie and plop' FROM generate_series(1, 1000000);
INSERT INTO work (name) SELECT 'plop so cute' FROM generate_series(1, 1000000);
ANALYZE work;
The HashAggregate takes care of the GROUP BY.
If we SELECT id instead:
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
---------------------------------------------------------------------------------
HashAggregate (cost=145239.00..180864.00 rows=2000000 width=12)
Group Key: id
Planned Partitions: 64
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
JIT:
Functions: 3
Options: Inlining false, Optimization false, Expressions true, Deforming true
(7 rows)
This will spill to temporary files on disk (note Planned Partitions: 64) because work_mem is too small. :crying_cat_face: Let’s fix that:
cats=# SET work_mem = '1GB';
SET
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
--------------------------------------------------------------------
HashAggregate (cost=42739.00..62739.00 rows=2000000 width=12)
Group Key: id
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
(3 rows)
The minimum possible work_mem is 64kB. Let’s set work_mem to that minimum and see how the query plan changes:
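The SET isn’t shown in the transcript; presumably:
SET work_mem = '64kB';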
cats=# EXPLAIN SELECT id, count(*) FROM work GROUP BY id;
QUERY PLAN
---------------------------------------------------------------------------------
GroupAggregate (cost=351436.69..386436.69 rows=2000000 width=12)
Group Key: id
-> Sort (cost=351436.69..356436.69 rows=2000000 width=4)
Sort Key: id
-> Seq Scan on work (cost=0.00..32739.00 rows=2000000 width=4)
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
(8 rows)
We see that GroupAggregate is used instead. The cost values start to be useful: we can see the impact of a lower work_mem. :ram:
Watch how many temp files are being written for a database called cats:
cats=# SELECT temp_files, temp_bytes FROM pg_stat_database WHERE datname = 'cats' \watch 4
Fri Sep 3 17:31:55 2021 (every 4s)
temp_files | temp_bytes
------------+------------
2 | 28000000
(1 row)
From the docs:
work_mem (integer)
Sets the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. If this value is specified without units, it is taken as kilobytes. The default value is four megabytes (4MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will generally be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
Hash-based operations are generally more sensitive to memory availability than equivalent sort-based operations. The memory available for hash tables is computed by multiplying work_mem by hash_mem_multiplier. This makes it possible for hash-based operations to use an amount of memory that exceeds the usual work_mem base amount.
We care about the connections that are actually consuming system resources. A sufficiently high max_connections value can (and will) degrade performance. Also, consider work_mem when setting max_connections – see the guideline above.
Use connection pools to manage the number of active connections.
The friends of work_mem:
maintenance_work_mem – private memory for index creation, CLUSTER, and VACUUM FULL. You probably don’t do this too often, so feel free to be generous.
autovacuum_work_mem – defaults to the value of maintenance_work_mem; it limits the number of rows that can be processed at a time. Be generous; raise it to speed up vacuum.
temp_file_limit – let’s look at temp files.
SELECT * FROM generate_series(1, 10000000000000000000000);
while true ; do date ; ls -ahls ; sleep 1 ; done
Cancel the SELECT from above unless you want to live on the wild side: runaway queries could exhaust your disk space.
Planner Cost Constants
Tune to match hardware.
random_page_cost – the most important one; tuned for spinning disks by default.
seq_page_cost – the cost to read one 8K page during a sequential scan.
With SSDs, set random_page_cost to (or close to) seq_page_cost.
effective_cache_size – moderate impact, easy to set. It tells the optimizer how much memory is available for caching data; PG doesn’t otherwise know how much RAM your machine has, lol. Index scans become cheaper with an increased effective_cache_size.
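A sketch for SSD-backed storage; the values are illustrative starting points, not recommendations from the course:
ALTER SYSTEM SET random_page_cost = 1.1;        -- default is 4.0, tuned for spinning disks
ALTER SYSTEM SET effective_cache_size = '8GB';  -- roughly shared_buffers plus the OS page cache
SELECT pg_reload_conf();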
:warning: Avoid the temptation to tune a bunch of parameters all at once just because you have a theory or read somewhere on the Internet that a parameter is super awesome.
:hospital: Essentially med school for database doctors. You can learn so much and then you see a bunch of queries (patients) and can apply suggestions from your experience.
The Gather in an execution plan denotes parallelization.
Screenshot around 13:52 local time

Everything under the Gather node is parallelized.
max_parallel_workers = limit of parallel workers across entire cluster. Take # cores into consideration. More parallel workers than cores –> No. Use some number less than the number of cores.
#max_parallel_workers = 8 # maximum number of max_worker_processes that
# can be used in parallel operations
Tune for conflicting ends: throughput vs response time.
For better response times on big queries, set max_parallel_workers and max_parallel_workers_per_gather a bit higher. You could find a difference between Workers Planned and Workers Launched depending on what’s going on when the query actually runs: the query won’t wait for all of the planned workers to become available before running.
Hmm: observe that parallelization is not used here; look at the rows estimate to understand why:

screenshot around 13:59 local time
parallel_setup_cost –> cost for setting up parallelization
parallel_tuple_cost – estimate for transferring rows between parallel processes.
min_parallel_table_scan_size – if a table exceeds this size, a parallel worker is considered; as the table keeps growing, further workers are considered, increasing logarithmically with table size.
EXPLAIN SELECT count(*) FROM test;
SET max_parallel_workers_per_gather = 8 ;
EXPLAIN SELECT count(*) FROM test;
:moneybag: Never mind the cost, make it fast:
ALTER TABLE test SET (parallel_workers = 6);
14:04 Break! 14:12 Back!
log_min_duration_statement - log queries that take longer than a certain amount of time.
We don’t just want log_min_duration_statement, though. Instead, we want pg_stat_statements. New shared_preload_libraries:
shared_preload_libraries = 'pg_prewarm,pg_stat_statements'
LA recommends always having this in production. It creates an extra shared memory segment with statistics about executed queries. It’s not a record of every individual query; it lumps them together: queries that only differ by constants are treated as identical.
There’s a built in benchmarking tool: pgbench.
$ pgbench -c 2 -T 30 cats
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 2
number of threads: 1
duration: 30 s
number of transactions actually processed: 29922
latency average = 2.005 ms
tps = 997.304049 (including connections establishing)
tps = 997.414817 (excluding connections establishing)
A high standard deviation: query is sometimes slow and sometimes fast.
CREATE EXTENSION pg_stat_statements;
\d pg_stat_statements
SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
Here’s how that looks after running pgbench:
postgres=# SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
calls | total_exec_time | query
-------+--------------------+------------------------------------------------------------------------------------------------------
29019 | 1472.7642000000155 | UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2
29019 | 1397.9302000000264 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2
29019 | 1044.9512000000009 | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2
1 | 918.3737 | INSERT INTO large SELECT i, i / $1 + $2 FROM generate_series($3, $4) AS i
29019 | 648.4597000000005 | SELECT abalance FROM pgbench_accounts WHERE aid = $1
29019 | 557.028700000003 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP)
29019 | 181.55990000000943 | BEGIN
29019 | 176.7948000000034 | END
1 | 57.3338 | CREATE EXTENSION pg_stat_statements
1 | 10.5965 | vacuum pgbench_branches
(10 rows)
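Related to the standard-deviation remark above: pg_stat_statements also tracks per-query variance. A sketch (column names as in PostgreSQL 13+, matching total_exec_time above):
SELECT calls, mean_exec_time, stddev_exec_time, query
FROM pg_stat_statements
ORDER BY stddev_exec_time DESC
LIMIT 10;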
cats=# CREATE TABLE large (id integer, x integer);
CREATE TABLE
cats=# INSERT INTO large SELECT i, i / 10000 + 1 FROM generate_series(1, 1000000) AS i;
INSERT 0 1000000
This will make blocks of rows with the same value for x.
CREATE TABLE small (x integer);
INSERT INTO small VALUES (1);
\timing
Observe that SELECT * FROM large JOIN small USING (x); takes a while (Time: 233.530 ms).
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
So much info:
# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=30832.00..198533.19 rows=12750000 width=8) (actual time=10265.965..10330.206 rows=9999 loops=1)
Hash Cond: (small.x = large.x)
Buffers: shared hit=4426, temp read=2845 written=4218
-> Seq Scan on small (cost=0.00..35.50 rows=2550 width=4) (actual time=0.009..0.020 rows=1 loops=1)
Buffers: shared hit=1
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8) (actual time=10217.431..10217.450 rows=1000000 loops=1)
Buckets: 131072 (originally 131072) Batches: 32 (originally 16) Memory Usage: 3073kB
Buffers: shared hit=4425, temp written=3276
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8) (actual time=4.940..5004.856 rows=1000000 loops=1)
Buffers: shared hit=4425
Planning Time: 0.189 ms
JIT:
Functions: 10
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.847 ms, Inlining 0.000 ms, Optimization 0.298 ms, Emission 4.485 ms, Total 5.630 ms
Execution Time: 10380.233 ms
(16 rows)
Time: 10380.995 ms (00:10.381)
If the differences between estimated and actual row counts are small, you are probably good.
JOINs in PostgreSQL – to understand misestimates, it helps to know how a JOIN can be executed.
There are three ways that a JOIN could be calculated:

| JOIN Strategies | Nested Loop JOIN | Hash JOIN | Merge JOIN |
|---|---|---|---|
| How it works | For each row in the outer table, look for a match in the inner table. The obvious and somewhat naive approach. | Sequential scan of the inner table builds a hash, keyed on the JOIN condition; a sequential scan of the outer table then probes the hash instead of the entire inner table. The startup cost of building the hash is repaid by later probe performance. | Sort both tables on the JOIN condition and then merge them. |
| Good? | Efficient in some setups: a small “outer table” used in the outer loop. | Bigger tables, if the hash fits into work_mem. If the hash of the inner table doesn’t fit in work_mem, less good. | Large tables, when a hash JOIN is no fun (too big for work_mem). |
| Index? | On the JOIN condition, on the inner table. | Using an index is not helpful – probably. | On the JOIN condition, on both tables. |
A poor estimate could lead to PostgreSQL choosing the wrong kind of JOIN.
FUN!
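One way to play with this (for exploration, not production): temporarily disable a strategy and compare plans, reusing the large/small tables from above.
SET enable_hashjoin = off;      -- the planner falls back to another strategy if possible
EXPLAIN SELECT * FROM large JOIN small USING (x);
RESET enable_hashjoin;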
The results from explain.dalibo.com are pretty:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM large
JOIN small USING (x);


Problem: The estimate is WAY higher than the actual – why? Statistics are off.
Fix: ANALYZE small;
It’s true!
# ANALYZE small;
ANALYZE
Time: 1.875 ms
cats=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
-------
Gather (cost=1001.02..12196.86 rows=10000 width=8) (actual time=0.379..3302.646 rows=9999 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=4623 read=1
-> Hash Join (cost=1.02..10196.86 rows=4167 width=8) (actual time=0.327..3271.277 rows=3333 loops=3)
Hash Cond: (large.x = small.x)
Buffers: shared hit=4623 read=1
-> Parallel Seq Scan on large (cost=0.00..8591.67 rows=416667 width=8) (actual time=0.046..1624.942 rows=333333 lo
ops=3)
Buffers: shared hit=4425
-> Hash (cost=1.01..1.01 rows=1 width=4) (actual time=0.055..0.069 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on small (cost=0.00..1.01 rows=1 width=4) (actual time=0.017..0.028 rows=1 loops=3)
Buffers: shared hit=3
Planning:
Buffers: shared hit=5
Planning Time: 0.208 ms
Execution Time: 3358.110 ms
(18 rows)
Time: 3358.938 ms (00:03.359)
The small table is tiny and autoanalyze never ran because we didn’t reach the threshold. That totally makes sense.
cats=# CREATE INDEX ON large(x);
CREATE INDEX
Time: 327.184 ms
cats=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large JOIN small USING (x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
------
Nested Loop (cost=0.42..360.44 rows=10000 width=8) (actual time=0.065..167.526 rows=9999 loops=1)
Buffers: shared hit=46 read=11
-> Seq Scan on small (cost=0.00..1.01 rows=1 width=4) (actual time=0.006..0.018 rows=1 loops=1)
Buffers: shared hit=1
-> Index Scan using large_x_idx on large (cost=0.42..259.43 rows=10000 width=8) (actual time=0.037..58.590 rows=9999 loo
ps=1)
Index Cond: (x = small.x)
Buffers: shared hit=45 read=11
Planning:
Buffers: shared hit=15 read=1
Planning Time: 0.306 ms
Execution Time: 220.724 ms
(11 rows)
Time: 221.788 ms
cats=#
Scroll :up: to see how we reduced the time of the query with a few steps.
If you have a sequential scan with a filter and the majority of the rows are filtered out: use an index.
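A sketch with hypothetical table and column names (orders/status):
EXPLAIN (ANALYZE) SELECT * FROM orders WHERE status = 'open';
-- If "Rows Removed by Filter" is most of the table, an index usually pays off:
CREATE INDEX ON orders (status);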
