Previous: [2021-training-notes-day-3]. Next: [2021-training-notes-day-5].
When recovering, WAL is retrieved from the archive. You could do replication that way but it would be super annoying. Instead, enter streaming replication.
Replication is recovery (of a type) that never ends.
Streaming replication introduces two processes: the WAL sender (on the primary) and the WAL receiver (on the standby).
When something hits the WAL buffers, it goes to the WAL sender, is received by the WAL receiver, and is then written to the WAL on the standby. First to the WAL, and then to the data – remember.
The startup process that helps with recovery replays the data into the standby. This is a fundamentally async process.
The standby acts as a client to the primary. PG has two protocols, both on 5432: the normal query protocol and the replication protocol.
Set up the standby with pg_basebackup. The config on the primary and the secondary should be as similar as possible.
pg_createcluster 13 standby
rm -rf /var/lib/postgresql/13/standby/
pg_basebackup -D 13/standby # replace the removed data directory with a base backup taken from the primary
Modify postgresql.conf.
max_wal_senders –> more than 0.
wal_level cannot be minimal; it must be replica or logical.
Configure primary_conninfo = '' – this is a connection string.
With the streaming replication protocol, you connect to the cluster, no database needed.
touch 13/standby/standby.signal
The standby.signal file means: replicate and stay in recovery mode until told to promote.
systemctl start postgresql@13-standby
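Pulling the steps above together, a minimal sketch of the settings involved (hostname and user here are assumptions, not from the training):
# primary, postgresql.conf
max_wal_senders = 10          # anything more than 0
wal_level = replica           # replica or logical, not minimal
# standby, postgresql.conf (connection details are made up)
primary_conninfo = 'host=primary.example.com port=5432 user=postgres'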
Observe walsender and walreceiver processes:


These are columns of pg_stat_replication on the primary (see the query below):
sent_lsn –> last position sent to the standby
write_lsn –> position written to disk by the standby
flush_lsn –> position flushed to disk – persisted and definitely safe
replay_lsn –> position in the WAL that has already been replayed and is visible on the standby server
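A quick way to watch these from the primary (a sketch):
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;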

Anatomy of a WAL segment file name (24 hex digits) vs. an LSN:
first 8 hex digits of the file name –> the timeline
second 8 hex digits –> a number that corresponds to the part before the slash in the _lsn
first two hex digits after the slash in the _lsn are the last two digits of the WAL segment name
remaining six digits in the _lsn are the offset within the segment.
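A worked example, assuming timeline 1 and the default 16 MB segment size (the LSN value is made up):
SELECT pg_walfile_name('0/3000148');
-- 000000010000000000000003  (timeline 00000001, high part 00000000, segment 03)
SELECT pg_walfile_name_offset('0/3000148');
-- same file name, plus offset 328 = 0x000148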
SELECT pg_current_wal_lsn(); to get the current LSN.
:watch: psql has watch!
SELECT * FROM whatever;
\watch 1
(\watch re-runs the query in the buffer every N seconds.)
Measure replication lag in bytes, not time. Time is unreliable for this purpose.
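One way to get the lag in bytes, using pg_wal_lsn_diff() on the primary (a sketch):
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;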
docker run --name postgres_training \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -d postgres:13
Stop the standby.
There are checkpoints (actually called restart points) on a recovering database. This makes an interrupted recovery less annoying: resume recovery from that point rather than from the beginning. This is clearly essential for replication (it needs to recover from where it left off, not from the beginning, where possible).
CREATE TABLE accounts (id bigint PRIMARY KEY, name text NOT NULL, amount numeric(15,2) NOT NULL);
If you delete a WAL segment the secondary still needs…please don’t. Rebuilding the secondary may become your only option.
The restore_command cannot be changed without restarting the server.
Use wal_keep_size to retain otherwise-unneeded WAL so it’s still available for the standby; this makes things easier if the standby falls too far behind. This is usually unnecessary. If standby downtime is longer than what wal_keep_size accounts for, you’ll still have trouble. Enter the replication slot. It marks a position in the WAL stream – otherwise the primary doesn’t have any memory of the standby (it’s essentially just a client).
The primary will not discard any WAL newer than the position marked by the replication slot. The replication slot is an indicator of how much WAL the secondary has received.
SELECT pg_create_physical_replication_slot('slot');
Consider max_slot_wal_keep_size –> caps how much WAL a replication slot may retain, to avoid disk space piling up. This will tell PostgreSQL to nuke that WAL even if the standby has not caught up.
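To make use of the slot, the standby has to point at it by name, and on the primary you can watch slot state (a sketch; the slot name comes from the command above):
# standby, postgresql.conf
primary_slot_name = 'slot'
-- primary: how far back is each slot holding WAL?
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;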
With synchronous replication, COMMIT does not return until the synchronous standby has received the WAL. Doing this across continents: everything will take forever. It only works when the primary and the standby are close together in the network. The other downside of synchronous replication: the availability of the entire system goes down, because a failure on the standby means a failure on the primary.
Use synchronous replication if you cannot afford to lose a single transaction (i.e., a failover happening before the standby has received the commit). Most people use async.
The application_name field can be used to tell the PG server who I (the client) am. This is handy for replicas – well-behaved applications set this name, and by convention it is used to name the standby server (in synchronous_standby_names). You can set it in the connection string.
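A sketch of that convention (host and names are assumptions): the standby announces itself via application_name and the primary lists it as its synchronous standby:
# standby, in primary_conninfo
primary_conninfo = 'host=primary.example.com user=postgres application_name=standby1'
# primary, postgresql.conf
synchronous_standby_names = 'standby1'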
There are a range of options for synchronous_commit when it’s on.
on – remote flush, waits for standby to write to WAL file and flush to disk
remote_apply –> wait until standby has replayed information and that info is now visible.
If you query the standby immediately after making a change, the change may not be visible yet with the default setting of synchronous_commit. You want remote_apply for that.
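synchronous_commit can also be set per session or per transaction, so only the work that needs read-your-writes on the standby pays the price (a sketch):
SET synchronous_commit = remote_apply;   -- commits in this session wait for replay on the standby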
12:01 Break! 12:12 Back!
Good use cases for streaming replication: high availability, and offloading read-only queries (reports) to the standby.
Could you run reads against the standby and get read-your-writes with remote_apply levels? Yeah, it would work, but…enter the replication conflict. Example: VACUUM on the primary during a long-running query on the secondary. max_standby_streaming_delay governs how long the standby waits before canceling such a query and resuming replay of WAL. The default value is 30s. Consider your report that runs on this replicated database – please please please don’t cancel it 30 seconds in. You can make the delay indefinite with -1.
Primary goal: replication doesn’t lag too much.
hot_standby –> setting this to off disallows queries during recovery.
The name of the game here is preventing conflicts.
hot_standby_feedback – the standby reports its oldest running query back to the primary, so VACUUM won’t remove rows that query still needs. No more conflicts due to VACUUM. It’s not on by default because a very long-running query on the replica can cause bloat on the primary server.
hot_standby_feedback + remote_apply –> Please don’t, says LA. You’ll have lots of waiting around all the time.
You can have an intentionally delayed standby server. This is helpful to recover from logical problems, assuming you notice the problem within the intentional delay period. Possible approach for quick recovery.
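The delay is configured on the standby with recovery_min_apply_delay; the value here is just an example:
recovery_min_apply_delay = '1h'   # replay WAL only once it is an hour old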
Every second, tell Postgres to sleep for 10 seconds.
SELECT pg_sleep(10);
\watch 1
Temporarily suspend replication: SELECT pg_wal_replay_pause();
resume with SELECT pg_wal_replay_resume();
There is no cluster-ware in PostgreSQL but the tools to make it happen are available. Here’s a manual failover:
#promote_trigger_file = ''   # file name whose presence ends recovery
Or promote with a psql command: SELECT pg_promote();
Point clients at the new primary with HAProxy or similar. Alternately: this can be done with connection strings – if I can’t get to the first, talk to the second.
Reattach the old primary afterwards with pg_rewind.
psql 'host=/var/run/postgresql,/var/run/postgresql port=5432,5433 user=postgres'
Connect with a string like the above and observe the output of \conninfo to see which server you are connected to. That approach is cool, but it makes it possible to wind up connected to the standby – and why would you want that? It’s read-only. Enter target_session_attrs, an addition to the connection string:
psql 'host=/var/run/postgresql,/var/run/postgresql port=5432,5433 user=postgres target_session_attrs=read-write'
With the above, you’ll only connect to a database that is read-write. The clients do the handover. Any client that links with libpq can do this. Possible with different syntax in JDBC.
A few ways to stop the server:
pg_ctl stop – there are three shutdown modes: smart, fast, immediate. The default is fast.
fast – all transactions are cancelled, client sessions are interrupted, a checkpoint is made, then it goes down. Default.
smart – no new client connections, existing connections are not interrupted. Once the last client connection closes, the DB shuts down. Not used too often.
immediate – evil mode. Go down as quickly as you can, no cleanup. Crash the server. Crash recovery will be required on startup. Bad for production, fun for testing. No fsync.
pg_ctl stop -m immediate -D /etc/postgresql
Run pg_rewind locally on the failed primary.
Running pg_rewind:
pg_rewind --target-pgdata=13/main --source-server='port=5433 user=postgres'

Replication slots exist only on the primary; they are not replicated. You’ll need to create them again when failing over to a new primary!
12:55 Break! 13:10 Back!
Patroni: what we use and what they recommend for HA and cluster-ware.
Written by Zalando for their own purposes; it took off and is now highly recommended.
The result is a shared-nothing architecture, which means there are no shared components. Good for reducing single points of failure.
Patroni runs on both the primary and the secondary. Patroni handles starting and stopping PostgreSQL for you. It does the failover and the configs. No more postgresql.conf editing. You tell Patroni what to do and Patroni does it for you.
Use etcd [or Consul (if you’re GitLab)]. There is a third box: three instances of etcd, implementing the Raft protocol, which requires an odd number of nodes.
Patroni uses the key-value store (provided by etcd or whatever). See diagram. If it can’t talk to the etcd running on itself, that’s OK, it’ll talk to another.
One Patroni is the Leader (primary) and the others are Followers (secondaries). They vote regularly. Config is distributed across all instances in the key-value store.
etcd doesn’t need to run on the same machine.
The solo etcd host can be a little EC2 instance or whatever. It just needs a fast disk so it isn’t seen as nonresponsive.

View users:
postgres=# \du
List of roles
Role name | Attributes | Member of
-----------+------------------------------------------------------------+-----------
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
Think of it as root. There’s nothing preventing access to the underlying filesystem. Applications should not ever connect as postgres.
There is no big difference between users and user groups: users can log in to the system and groups do not. Regardless, users and groups are called roles. Call CREATE ROLE to create a user.
postgres=# \h CREATE ROLE
Command: CREATE ROLE
Description: define a new database role
Syntax:
CREATE ROLE name [ [ WITH ] option [ ... ] ]
where option can be:
SUPERUSER | NOSUPERUSER
| CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE
| INHERIT | NOINHERIT
| LOGIN | NOLOGIN
| REPLICATION | NOREPLICATION
| BYPASSRLS | NOBYPASSRLS
| CONNECTION LIMIT connlimit
| [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
| VALID UNTIL 'timestamp'
| IN ROLE role_name [, ...]
| IN GROUP role_name [, ...]
| ROLE role_name [, ...]
| ADMIN role_name [, ...]
| USER role_name [, ...]
| SYSID uid
URL: https://www.postgresql.org/docs/13/sql-createrole.html
CREATE ROLE plop LOGIN PASSWORD 'cutecats';
Don’t do this – the password ends up in plain text in the command history and possibly the server log.
Instead, use \password plop after CREATE ROLE plop LOGIN;
Get all roles:
postgres=# \duS
List of roles
Role name | Attributes | Member of
---------------------------+------------------------------------------------------------+--------------------------------------------------------------
pg_execute_server_program | Cannot login | {}
pg_monitor | Cannot login | {pg_read_all_settings,pg_read_all_stats,pg_stat_scan_tables}
pg_read_all_settings | Cannot login | {}
pg_read_all_stats | Cannot login | {}
pg_read_server_files | Cannot login | {}
pg_signal_backend | Cannot login | {}
pg_stat_scan_tables | Cannot login | {}
pg_write_server_files | Cannot login | {}
plop | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
The Cannot login entries are predefined system roles (groups). Grant membership in these instead of granting superuser.
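For example, giving plop monitoring rights without superuser (a sketch):
GRANT pg_monitor TO plop;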
Experiment with pg_authid to get more info:
postgres=# SELECT rolname, rolpassword FROM pg_authid WHERE rolname LIKE 'plop';
rolname | rolpassword
---------+-------------------------------------
plop | md5edd3bc15c6025ae3a4570fdf0061e20b
(1 row)
MD5. Sad. :cry: :sad:
With 14, scram-sha-256 will be the default. :tada:
Every user can see the pg_ catalog tables. People don’t like it. There are exceptions like pg_authid, which can only be read by a superuser.
listen_addresses = 'localhost'
That’s the server interface. It accepts a comma-separated list of server interfaces. With the above, it’s only listening on the loopback interface (TCP connections are only possible from localhost –> no remote access).
Update to listen_addresses = '*' to listen on all available interfaces, requires a restart.
When a client connects, the entries in pg_hba.conf are read top-to-bottom. The first matching entry is chosen and determines how the client must authenticate.
local –> socket connections; host –> TCP. There are dedicated entries for replication.
The HBA file has these columns: connection type, database, user, client address, and authentication method.
This shows us why psql doesn’t need a password on localhost over a socket but does over TCP:
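A sketch of the kind of entries behind that behaviour (the exact defaults depend on the packaging; these mimic the Debian layout):
# TYPE  DATABASE  USER      ADDRESS       METHOD
local   all       postgres                peer
host    all       all       127.0.0.1/32  md5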

This is the bit of PostgreSQL I never fully grokked.
You can have an authentication method of reject to specifically reject someone. Also, no matching entries means that a rejection will occur.
trust –> come on in! No checking.
peer –> come on in! Only for local connections and only when the database user name matches the local OS user name. (postgres can run psql no problem but root can not.)
md5 and scram-sha-256 –> challenge/response. The client hashes the password with a known salt to prevent the password from being transmitted.
Users are shared by all databases, but not every user can connect to every database; pg_hba.conf permits control of this. Password authentication is second-rate among the available options: GSSAPI, PAM, RADIUS, LDAP, all the good stuff!
Every object in PostgreSQL has an owner (the role that created it). The owner can do anything with the object; other users must be granted access. However, only the owner can ALTER or DROP it, regardless of GRANTs. Permissions are stored on the object.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+------------+------------+-----------------------
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
PostgreSQL’s permission system is inspired by Unix file system permissions.
To that end, a schema is like a directory (a place where a file, i.e. a table, could go). No nested schemas. Each database object is always in some schema.
cutecats=# \dn+ myapp
List of schemas
Name | Owner | Access privileges | Description
-------+----------+-------------------+-------------
myapp | postgres | |
(1 row)
Nothing shown in Access privileges means it has the defaults. Sad.
Schema privs:
GRANT USAGE ON SCHEMA myapp TO plop;
cutecats=# \dn+ myapp
List of schemas
Name | Owner | Access privileges | Description
-------+----------+----------------------+-------------
myapp | postgres | postgres=UC/postgres+|
| | plop=U/postgres |
(1 row)
After the slash is who granted the privilege, the bit before the = is who got the privilege, and the bit between = and / is the privilege itself:
plop=U/postgres
The user postgres granted the user plop the U (USAGE) privilege.
Now, let’s look at the public schema:
cutecats=# \dn+ public
List of schemas
Name | Owner | Access privileges | Description
--------+----------+----------------------+------------------------
public | postgres | postgres=UC/postgres+| standard public schema
| | =UC/postgres |
(1 row)
The entry with nothing before the = refers to PUBLIC: by default every user has USAGE and CREATE on the public schema.
One of the first things you do after creating a database should be to change this:
cutecats=# REVOKE CREATE ON SCHEMA public FROM PUBLIC;
REVOKE
cutecats=# \dn+ public
List of schemas
Name | Owner | Access privileges | Description
--------+----------+----------------------+------------------------
public | postgres | postgres=UC/postgres+| standard public schema
| | =U/postgres |
(1 row)
Do the same thing on template1 so new DBs won’t have the bad lame public schema.
\c template1
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
Database Cluster –> Databases –> Schema
Each database has its own schemas.
What MySQL calls a database is a schema elsewhere. WHAT?
13:58 Break! 14:11 Back!
There are also system schemas. View with \dnS.
pg_catalog –> preexisting system types, operators, system functions, and catalog tables all live here.
pg_toast –> TOAST tables for overlength attributes.
pg_temp_2 and pg_toast_temp_2 –> implementation details.
information_schema –> interesting for the developer: a database-independent, standard-conforming way to access database metadata.
Use this, not that:
THAT: template1=# SELECT * FROM pg_attribute \gx
THIS: template1=# SELECT * FROM information_schema.columns;
How do unqualified names end up in the public schema? The search_path parameter.
template1=# SHOW search_path;
-[ RECORD 1 ]----------------
search_path | "$user", public
Time: 0.283 ms
Relative path: SELECT * FROM accounts; Absolute path: SELECT * FROM public.accounts;
If you do SELECT * FROM accounts;, it will use the first matching table in a schema on the search_path.
Available options when querying (sketch of the second below):
Always schema-qualify names, so the search_path never gets in your way.
Use unqualified names and make sure the search_path is right.
This is answering a lot of questions! The most useful day so far.
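A sketch of the second option, pinning the path for a role (names reuse earlier examples; the exact value is an assumption):
ALTER ROLE plop SET search_path = myapp, public;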
GRANT { { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES | TRIGGER }
REFERENCES –> allows someone to create a foreign key that points to your table.
TRIGGER –> don’t use; it lets someone else create a trigger on your table.
We have a two-user architecture.
A three-user architecture:
This three-user architecture would make migrations more difficult.
To achieve row-level security, a policy must be defined.
ALTER TABLE rls ENABLE ROW LEVEL SECURITY;
CREATE POLICY coolstuff ON rls FOR SELECT TO joe USING (id % 2 = 0);
The user joe will only see rows for which the USING () expression above is true; the filter is applied to every query joe runs against the table.

Superusers and the table owner are not subject to row-level security.
There is a BYPASSRLS option when creating a role.
You can also turn row security off:
SHOW row_security;
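A quick sketch of the policy in action (assumes the rls table and the joe role from above, and that joe has SELECT rights):
GRANT SELECT ON rls TO joe;
SET ROLE joe;
SELECT * FROM rls;   -- only rows with an even id come back
RESET ROLE;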
Minor upgrades are easy: always do them. Install new software, restart the database. Happiness! pg_dump is not required (do it anyway, probably).
Due diligence: read the release notes.
The simplest method:
pg_dumpall –> initdb –> psql. This leaves out configuration, remember! (Sketch below.)
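A sketch of that path with made-up ports (old cluster on 5432, new on 5433):
pg_dumpall -p 5432 > all.sql       # dump roles, databases, and data from the old cluster
psql -p 5433 -f all.sql postgres   # restore into the freshly initdb'ed new cluster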
You can just copy pg_hba.conf around between versions. This is not advised with postgresql.conf:
LA recommends comparing old and new, looking the settings up, and configuring the new one. It’s an opportunity to review your config.
Use the newer pg_dump where possible.
Another way of upgrading…
pg_upgrade –> introduced in 8.x
pg_upgrade extracts metadata from the old database (only the catalog tables), then copies the data files as-is. Faster.
It only works if the storage format is the same.
Things that influence how data is stored must also be the same (e.g. data checksums, -k during initdb) –> you would have to use pg_dump and restore in that case.
This approach can still be painfully slow. Enter pg_upgrade -?.
With -k, --link passed to pg_upgrade, it hard links instead of copying files to the new cluster. Up to you whether you want to do this, but with this approach you don’t have a known-good old DB to go back to – the old server cannot be started. Falling back from a pg_upgrade --link upgrade requires recovering from a restore or failing over to a replica or something.
-k, --link link instead of copying files to new cluster
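A sketch of a full invocation, with Debian-style paths that are assumptions here:
pg_upgrade \
  -b /usr/lib/postgresql/13/bin  -B /usr/lib/postgresql/14/bin \
  -d /var/lib/postgresql/13/main -D /var/lib/postgresql/14/main \
  --link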

The statistics are not moved over. Instead, you get an analyze_new_cluster.sh script to help guide the process for you.
So, pg_upgrade is super fast (never more than 2-3 minutes, he says, but huh???).
Anyway, say you don’t even want those 2-3 minutes:
Logical replication –> ships logical information to the standby, reverse engineering the WAL to figure out the right thing to do. Usually for replicating part of a database. Useful if physical differences are OK. Unlike streaming replication, it works across PostgreSQL versions.
This option reduces downtime as much as possible, but at the cost of simplicity.
smart shutdown mode –> rarely used?