Get to [2021-training-notes-day-2].
Sharding will be out of scope for the training. We will talk a bit about partitioning.
Docs, the greatest resource.
The PostgreSQL Wiki → Good for things in the PostgreSQL ecosystem, broader scope than the docs.
PostgreSQL is free, open source software. License: PostgreSQL is released under the PostgreSQL License, a liberal Open Source license similar to the BSD or MIT licenses. There are a great many forks of PostgreSQL.
There are somewhere between 40 and 500 contributors to PostgreSQL, depending on how you define a contributor.
See the versioning policy.
Major versions are released yearly, usually in the fall. We expect 14 in fall 2021. Releases are supported for 5 years.
Supported means you’ll get new minor releases (only bug fixes, never new behavior).
For minor releases, the community considers not upgrading to be riskier than upgrading.
The goal is for people to be able to upgrade to a new minor release without needing to test their code. The community focuses on not introducing bugs in minor releases to facilitate this.
For major upgrades: pg_dump and pg_restore, or an in-place upgrade (pg_upgrade). Both are fine options. The former is super simple (“unparalleled simplicity”); the latter is quicker. In-place upgrades are a bit stricter (you can’t make some modifications) and may leave you without the benefit of index improvements that require changes to the physical, on-disk representation. There is no requirement to follow a certain upgrade path (as there is with GitLab). This could change one day. Crystal ball required.

There are a few ways to install PostgreSQL (compile from source, repos from your distro, straight from PG, Docker, etc.). The docs have more information on installing it on Ubuntu. Installing it via apt creates a PostgreSQL cluster. Nice, but what if I want to control how it’s created?
dpkg -l | grep postgres
Install debugging symbols to make analysis with perf less annoying:
apt install -y postgresql-13 postgresql-13-dbgsym
Let’s become the user postgres to proceed from here. (su - postgres)
Let’s take a look:
# pwd && ls
/usr/lib/postgresql/13
bin lib
# bin/postgres --version
postgres (PostgreSQL) 13.4 (Ubuntu 13.4-1.pgdg20.04+1)
There are more goodies in /usr/share/postgresql/13/ on Ubuntu.
Omnibus: /opt/gitlab/embedded/postgresql/[version]/lib
11:52 Break! 12:03 Back!
Run initdb to create a database cluster.
Try /usr/lib/postgresql/13/bin/initdb -? | less
Consider adding /usr/lib/postgresql/13/bin to your $PATH.
It creates a DATADIR. There is no default location, but the standard is /var/lib/postgresql/13/main. That’s familiar! The install put stuff there, but we are nuking it. Only the DATADIR must be passed to initdb.
Specific options to consider:
-E, --encoding=ENCODING set default encoding for new databases
--locale=LOCALE set default locale for new databases
--lc-messages=LOCALE controls the language for error messages, use **English** for better search results
--lc-monetary=LOCALE set default currency formatting locale for new databases
PostgreSQL encoding options: consider UTF-8. The default is taken from the environment (the shell running initdb, or whatever), so don’t leave the encoding to fate. ;-)
An option for calling initdb:
initdb -E UTF-8
Set --locale and leave things like --lc-collate or --lc-monetary alone unless you want to override them. A few of these options cannot be changed later, like --lc-collate and --lc-ctype. This controls how databases are created in your cluster moving forward: once a database has been created, its character type and collation cannot be changed.
--lc-ctype controls character type, determines what class a character belongs to (punctuation, number, letter, space, etc).
--lc-collate determines the sort order of strings.
NOTE: PostgreSQL takes its collations from the C library when left to its own devices. Upgrading the C library could change the collation order in PostgreSQL. Not great: indexes built under the old order are effectively corrupt under the new one. Consider rebuilding all indexes after such an upgrade.
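If you do need to rebuild, REINDEX can do it wholesale. A minimal sketch (the database name mydb is hypothetical):
-- Rebuild every index in the named database, e.g. after a libc
-- upgrade changed the collation order:
REINDEX DATABASE mydb;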
Try locale -a to see what collation options are available.
# locale -a
C
C.UTF-8
POSIX
Using C collations means that upgrading C libraries is not a concern: the code points won’t change.
-k, --data-checksums use data page checksums
In databases, everything is stored in a block. Blocks are the same size. PostgreSQL block size: 8K. With -k, checksums are written/read to disk to alert of possible storage problems. (Something changed between when the block was written and when it was later read.)
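Putting the options discussed so far together, a minimal sketch of an initdb invocation (run as the postgres user; paths follow the Ubuntu layout from these notes):
/usr/lib/postgresql/13/bin/initdb -D /var/lib/postgresql/13/main \
    -E UTF-8 --locale=C.UTF-8 -k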
What? Why?
The term cluster is somewhat unfortunate. A cluster contains several databases. A better word would be instance.
A cluster contains multiple databases and a single set of users (roles) shared across all of them. A user’s permissions can differ from one database to another.
Default Port: 5432
When connecting to PostgreSQL, you must connect to a specific database, and once you are connected, that’s the one: you can’t query a different database from that connection.
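For illustration, a qualified reference to another database fails (the names otherdb and things are made up):
SELECT * FROM otherdb.public.things;
ERROR:  cross-database references are not implemented: "otherdb.public.things"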
Forget about calling initdb directly and use Ubuntu’s wrapper scripts.
/usr/bin/pg_dropcluster 13 main
Then pg_createcluster, which accepts initdb options after a -- separator:
pg_createcluster 13 main -- -E UTF-8 --locale=C.UTF-8 -k
Docker: pg_ctlcluster 13 main start
Ubuntu: systemctl start postgresql && systemctl enable postgresql
Now that it’s up and running:
$ ps x -u postgres
PID TTY STAT TIME COMMAND
6214 ? Ss 0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file
6216 ? Ss 0:00 postgres: 13/main: checkpointer
6217 ? Ss 0:00 postgres: 13/main: background writer
6218 ? Ss 0:00 postgres: 13/main: walwriter
6219 ? Ss 0:00 postgres: 13/main: autovacuum launcher
6220 ? Ss 0:00 postgres: 13/main: stats collector
6221 ? Ss 0:00 postgres: 13/main: logical replication launcher
Postgres uses a multi-process architecture and does not use threading. It’s designed to be portable and run on a range of operating systems, which all implement multithreading differently. That leaves us with a dedicated process on the PostgreSQL server for each database connection.
There are three methods of connecting that we will cover:
The only client shipped is psql, our friend. See the list of PostgreSQL clients on the wiki.
It’s like vi. You might not like it but you might need to use it one day so try to get along. :-)
There are four things we must specify with each connection (host, port, user, and database):
psql -h db.brie.dev -p 5432 -U postgres -d nameofdatabase
You’ll be prompted for a password with the above so we’ll have to try another way. Thankfully, PostgreSQL listens on TCP and Unix sockets. Yay! :tada:
psql -h /var/run/postgresql -p 5432 -U postgres -d postgres
The psql client is smart enough to know what to do with the directory with the command above. Now, we have a prompt:
$ psql -h /var/run/postgresql -p 5432 -U postgres -d postgres
psql (13.4 (Ubuntu 13.4-1.pgdg20.04+1))
Type "help" for help.
postgres=#
How do I bail? (It’s not :wq!.) Bail with \q or Ctrl + d.
Commands starting with \ are client commands; all other commands are server commands.
Defaults with psql:
-h → local Unix socket
-p → 5432
-U → user calling psql
-d → same as the user name
So, we can just say psql. :laughing:
We can use connection strings:
psql 'host=/var/run/postgresql port=5432 user=postgres dbname=postgres'
Support for this comes from the shared library libpq (libpq.so.5).
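libpq also understands URI-style connection strings, so this sketch should be equivalent (the Unix socket directory is passed as a query parameter):
psql 'postgresql:///postgres?host=/var/run/postgresql&user=postgres'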
HELP: \?
export PAGER='less -RS'
Consider less -RS as the pager.
Get a list of databases with \l.
Maybe you don’t need an interactive session:
psql -d template1 -l
The above just says connect to the template1 database and list all databases.
You do have to be connected to a database (like postgres) in order to create a database.
Get help for SQL statements with \h! That’s kind of awesome… :sparkles:
\h CREATE DATABASE
postgres=# \h CREATE DATABASE
Command: CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
[ [ WITH ] [ OWNER [=] user_name ]
[ TEMPLATE [=] template ]
[ ENCODING [=] encoding ]
[ LOCALE [=] locale ]
[ LC_COLLATE [=] lc_collate ]
[ LC_CTYPE [=] lc_ctype ]
[ TABLESPACE [=] tablespace_name ]
[ ALLOW_CONNECTIONS [=] allowconn ]
[ CONNECTION LIMIT [=] connlimit ]
[ IS_TEMPLATE [=] istemplate ] ]
URL: https://www.postgresql.org/docs/13/sql-createdatabase.html
Creating a database in PostgreSQL is essentially copying an existing database (and then doing a little bit of magic). You can use any database as a template, which permits cloning or duplicating a database.
You’ll get the default locales and collation information when you create a database unless you specify them differently when creating the database.
A database can only be used as a template if no one else is connected to it. (To assist with this, you can toggle ALLOW_CONNECTIONS to prevent anyone from connecting to a database.)
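A sketch of cloning this way (database names are made up; the command fails if anyone is connected to sourcedb):
CREATE DATABASE clonedb TEMPLATE sourcedb;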
You get tab completion inside the psql prompt. Double tab for all options. Terminate statements with ;. The prompt will change to indicate that the server is still waiting on you. Observe:
postgres=# CREATE DATABASE cutecat
postgres-#
postgres-# ;
CREATE DATABASE
postgres=#
We need a new connection in order to connect to our new database. We can do this from within the same psql session! :smile_cat:
Use \c cutecat for this:
postgres=# \c cutecat
You are now connected to database "cutecat" as user "postgres".
cutecat=#
Change the default database info with environment variables like PGHOST, PGPORT, PGDATABASE and PGUSER.
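A sketch using those documented libpq variables so that a bare psql lands in cutecat:
export PGHOST=/var/run/postgresql
export PGUSER=postgres
export PGDATABASE=cutecat
psql    # no flags needed now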
13:02: Break! 13:19: Back!
We wrote CREATE DATABASE course; to create a database called course. The SQL standard demands that identifiers be case folded to upper case, but PostgreSQL case folds to lower case. We could do cReAtE dAtAbAsE course; and the same thing would happen. Prevent case folding with double quotes: CREATE DATABASE "COURSE"; really creates a database called COURSE, whereas CREATE DATABASE COURSE; creates a database called course. :warning: Exercise caution with "". (You could use "" to create a database called USER, but maybe don’t, because it’s a reserved keyword and DROP DATABASE USER; won’t do what you would expect. It’ll nuke user, not USER.)
See Capitalization/case-folding from Stanford’s NLP group or what the W3 has to say about case mapping and case folding.
cutecat=# \c template0
FATAL: database "template0" is not currently accepting connections
Previous connection kept
cutecat=# \c template1
You are now connected to database "template1" as user "postgres".
Why can’t we connect to the template0 database?
template1=# SELECT datname, datallowconn FROM pg_database;
  datname  | datallowconn
-----------+--------------
 postgres  | t
 cutecat   | t
 template1 | t
 template0 | f
 course    | t
(5 rows)
We see that template0 is not allowing connections. The docs have more info about the template databases.
Having multiple databases in a single cluster comes with some pros and cons.
## Indexing
We need a bunch of rows to make the database sweat. :cold_sweat:
CREATE TABLE test (id integer GENERATED ALWAYS AS IDENTITY, name text);
Describe a table with \d table_name.
cutecat=# \d test;
Table "public.test"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+------------------------------
id | integer | | not null | generated always as identity
name | text | | |
That query would be akin to the following in MySQL:
CREATE TABLE test (
id INT NOT NULL AUTO_INCREMENT,
name CHAR(30) NOT NULL,
PRIMARY KEY (id)
);
GENERATED ALWAYS AS IDENTITY has been part of PostgreSQL since version 10. Text columns are limited to a size of 1 GB in PostgreSQL. (What are you doing?!?) Use text if you don’t have a limit that needs to be enforced. (Don’t add a limit on the database side when there is no such limit in the application.)
Let’s add a few rows to our table:
cutecat=# INSERT INTO test (name) VALUES ('plop'), ('boo radley');
INSERT 0 2
Duplicate the existing rows (note: a third row, fluffy, was added along the way, hence INSERT 0 3):
cutecat=# INSERT INTO test (name) SELECT name FROM test;
INSERT 0 3
cutecat=# SELECT * FROM test;
id | name
----+------------
1 | plop
2 | boo radley
3 | fluffy
4 | plop
5 | boo radley
6 | fluffy
(6 rows)
Run the INSERT above until you are happy with how large and slow your table will be. :smile_cat:
Measure time from client:
cutecat=# \timing
Timing is on.
cutecat=#
OK, now we have a super slow query:
cutecat=# SELECT * FROM test WHERE id = 42;
id | name
----+-------
42 | tiger
(1 row)
Time: 679.481 ms
What happens when PostgreSQL runs a query you provide? Use EXPLAIN to get the execution plan for your query.
# EXPLAIN SELECT * FROM test WHERE id = 42;
QUERY PLAN
------------------------------------------------------------------------
Gather (cost=1000.00..82407.93 rows=1 width=11)
Workers Planned: 2
-> Parallel Seq Scan on test (cost=0.00..81407.83 rows=1 width=11)
Filter: (id = 42)
(4 rows)
Time: 0.734 ms
Let’s turn parallelism off to make the plan simpler: SET max_parallel_workers_per_gather = 0;. Here’s the updated EXPLAIN result with this change to the parallel workers:
# EXPLAIN SELECT * FROM test WHERE id = 42;
QUERY PLAN
---------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..137334.80 rows=1 width=11)
Filter: (id = 42)
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)
Time: 34.111 ms
The value of the cost is not in any unit; it’s primarily useful for comparison. You get two numbers. From above:
- 0.00 – start-up cost, the cost to deliver the first row
- 137334.80 – total cost, the cost for retrieving all results
- rows=1 – a guess at how many rows will be returned
We don’t care much about the width value.
Analogy: A table is a library. A book is a table row. If you go to the library with no information about how the books are organized, you’ll have to do a sequential scan, looking at each book, spine-by-spine. Thankfully, that’s not reality! Today, you consult a card catalog. This is akin to a database index.
Tables are also referred to as heaps in PostgreSQL. Tables have no particular order (not ordered by id). An index is ordered. Each entry in an index points to the corresponding row in the table. (OK, think about it that way. It’s actually implemented as a tree structure.)
# CREATE INDEX test_id_idx ON test (id);
CREATE INDEX
Time: 7084.656 ms (00:07.085)
Let’s look at that SELECT again:
cutecat=# EXPLAIN SELECT * FROM test WHERE id = 42;
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using test_id_idx on test (cost=0.43..8.45 rows=1 width=11)
Index Cond: (id = 42)
(2 rows)
Time: 1.027 ms
The start up cost is higher but wow is the total cost lower. We did it. :thumbsup:
What happens when you run out of values in an integer column? You have to move to bigint, bringing you from a four-byte integer to an eight-byte one. Doing this would rewrite the entire table (lots of downtime, resource usage). There are tricks you can use to avoid that, but it’s a pain. Don’t ever use integer. See the Convert ci_builds.id to bigint - Step 1: Add new columns and sync data issue for the pain GitLab is experiencing with this very topic.
No matter what: stay safe, use bigint.
– Laurenz Albe
If you didn’t stay safe and you have this problem, here is (from memory) the guidance on converting from int to bigint:
- Add a new bigint column.
- Add a TRIGGER that copies the existing primary key into the new column (a sketch follows below).
:pray: Please just use bigint.
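A minimal sketch of those steps, assuming our test table with primary key id (the function and trigger names are made up; a real migration also needs a backfill and a final swap):
-- 1. Add the new column.
ALTER TABLE test ADD COLUMN id_new bigint;

-- 2. Keep it in sync for new and updated rows.
CREATE FUNCTION copy_id() RETURNS trigger AS $$
BEGIN
  NEW.id_new := NEW.id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_copy_id
  BEFORE INSERT OR UPDATE ON test
  FOR EACH ROW EXECUTE FUNCTION copy_id();

-- 3. Backfill old rows (in batches, in real life), then swap the
-- columns and rebuild the primary key in a short maintenance window.
UPDATE test SET id_new = id WHERE id_new IS NULL;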
14:00 Break! 14:09 Back!
Using EXPLAIN to understand what PostgreSQL does when we give it queries.
SELECT reltuples, relpages FROM pg_class WHERE relname = 'test';
This tells us how many rows (reltuples) and how many 8K pages (relpages) the table has. These are estimates (they’ll be in the right order of magnitude, “close to the truth”).
seq_page_cost and cpu_tuple_cost give us the costs the planner works with:
seq_page_cost - Sets the planner’s estimate of the cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
cpu_tuple_cost - Sets the planner’s estimate of the cost of processing each row during a query. The default is 0.01.
Picking the lowest-cost plan is easy; estimating those costs accurately is… not so easy.
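As a sanity check, here’s a hedged sketch reconstructing the sequential scan estimate from the plan above out of pg_class statistics and the default cost parameters (it also needs cpu_operator_cost, default 0.0025, charged per row for the WHERE filter, which the notes above don’t list):
-- pages read sequentially + per-row processing + per-row filter check
SELECT relpages * 1.0               -- seq_page_cost per page
     + reltuples * (0.01 + 0.0025)  -- cpu_tuple_cost + cpu_operator_cost per row
       AS estimated_seq_scan_cost
FROM pg_class
WHERE relname = 'test';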
Let’s look at a more interesting query:
EXPLAIN SELECT * FROM test WHERE id < 42;
When scanning more than one row with an index, the real cost is not in reading the index. The expensive part is reading the table rows that the index points to. The index can be read from the beginning to the end or from the end to the beginning.
With a query like SELECT * FROM test WHERE id > 42;, the query optimizer could choose to do an index scan but we see it did a sequential scan instead:
cutecat=# EXPLAIN SELECT * FROM test WHERE id > 42;
QUERY PLAN
---------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..137335.20 rows=7669977 width=11)
Filter: (id > 42)
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)
Time: 1.158 ms
With a query like this, it’s back to the index scan:
cutecat=# EXPLAIN SELECT * FROM test WHERE id > 4000000;
QUERY PLAN
------------------------------------------------------------------------------------
Index Scan using test_id_idx on test (cost=0.43..121539.97 rows=3587802 width=11)
Index Cond: (id > 4000000)
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)
Time: 1.070 ms
We can be reasonably sure that the optimizer is doing the fastest thing when choosing between index scan and a sequential scan.
The index is useful for doing things like EXPLAIN SELECT max(id) FROM test;:
Result (cost=0.46..0.47 rows=1 width=4)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.46 rows=1 width=4)
-> Index Only Scan Backward using test_id_idx on test (cost=0.43..218357.71 rows=7670016 width=4)
Index Cond: (id IS NOT NULL)
(5 rows)
The index is useful also for sorting:
cutecat=# EXPLAIN SELECT * FROM test ORDER BY id;
QUERY PLAN
------------------------------------------------------------------------------------
Index Scan using test_id_idx on test (cost=0.43..240645.67 rows=7670016 width=11)
(1 row)
Time: 0.598 ms
I can forbid PostgreSQL from using an index scan with SET enable_indexscan = off;. Use RESET enable_indexscan; to revert that change.
:smile: The index is clearly super helpful but…there are associated downsides to be aware of. :frowning:
Inserting a row into the heap is cheap; updating every index on the table with each write is not. As a result, it is not always a good idea to have an index. You need to balance speed when querying against speed when updating. (If you are querying monthly but inserting daily… maybe don’t bother with the index.)
An index you don’t use is a bad thing: it degrades performance and provides no benefit.
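A sketch for spotting such candidates (pg_stat_user_indexes counts index scans since statistics were last reset, so don’t judge a freshly started cluster too quickly):
-- Indexes that have never been used for an index scan:
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;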
What if I create an index on the text column?
CREATE INDEX test_name_idx ON test (name);
Will the index even be used?
# EXPLAIN SELECT * FROM test WHERE name = 'plop';
QUERY PLAN
--------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=24286.17..92968.34 rows=2177773 width=11)
Recheck Cond: (name = 'plop'::text)
-> Bitmap Index Scan on test_name_idx (cost=0.00..23741.73 rows=2177773 width=0)
Index Cond: (name = 'plop'::text)
(4 rows)
Time: 1.197 ms
Yes, but it’s also doing a new thing called a bitmap heap scan.
A bitmap of matching rows is built first, so each table block is only fetched once. The entire bitmap might not fit into memory; PostgreSQL will then degrade to one bit per block instead of one bit per row. Less efficient, but there’s only so much memory, so…
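If you’re curious whether a bitmap went lossy, EXPLAIN ANALYZE reports it in a “Heap Blocks: exact=… lossy=…” line under the bitmap heap scan:
EXPLAIN ANALYZE SELECT * FROM test WHERE name = 'plop';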
You can drop an index like you’d drop a database.
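For example:
DROP INDEX test_name_idx;
(Handy here, since the next step recreates test_name_idx as a partial index.)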
Sometimes an index over only part of a table is what you want.
Let’s use a WHERE to create a partial index.
CREATE INDEX test_name_idx ON test (name) WHERE name NOT IN ('plop');
The index will only be created for rows that satisfy the WHERE condition. A partial index is also referred to as a conditional index.
In addition to having a partial index, you can have a unique index. You can also have a unique partial index.
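For instance, a made-up sketch (the users table and its columns are hypothetical): enforce unique names among live accounts while allowing duplicates among deleted ones:
CREATE UNIQUE INDEX users_active_name_key ON users (name) WHERE NOT deleted;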
You might wish to create a partial index that only includes conditions you are frequently querying for. Like:
CREATE INDEX on test (name) WHERE account_status NOT IN ('active');
Completely and quickly made up to demonstrate: querying only for users that are not active.
:thought_balloon: It’s really important to understand your particular application and dataset to know how to decide whether to make an index, whether a partial index would make sense and how to scope the partial index.
Great segue…
Who should create the indexes? Ideally, the developer. Don’t cheat by only considering datasets that are too small to be relevant to production usage. Consider index creation when designing the query. Have realistic amounts of test data at some point between development and production. :pray: Please!
Get to [2021-training-notes-day-2].