
Day 1: Indexing / Transactions and Locking

Get to [2021-training-notes-day-2].

Sharding will be out of scope for the training. We will talk a bit about partitioning.

Docs, the greatest resource.

The PostgreSQL Wiki -> good for things in the PostgreSQL ecosystem; broader in scope than the docs.

About the Project

Free, open source software. License: PostgreSQL is released under the PostgreSQL License, a liberal Open Source license similar to the BSD or MIT licenses. There are a great many forks of PostgreSQL.

There are somewhere between 40 and 500 contributors to PostgreSQL, depending on how you define a contributor.

Versioning

See the versioning policy.

Major versions are released yearly, usually in the fall. We expect 14 in fall 2021. Releases are supported for 5 years.

Supported means you’ll get new minor releases (always bug fixes, never new behavior).

For minor releases, the community considers not upgrading to be riskier than upgrading.

The goal is for people to be able to upgrade to a new minor release without needing to re-test their code. The community focuses on not introducing bugs in minor releases to facilitate this.

Let’s do something!

Install PostgreSQL

There are a few ways to do it (compile from source, repos from distro, straight from PG, Docker, etc). The docs have more information on installing it on Ubuntu. Installing it via apt creates a PostgreSQL cluster. Nice but what if I want to control how it’s created?

See what the apt install put on the system:

dpkg -l | grep postgres

Install debugging symbols to make analysis with perf less annoying:

apt install -y postgresql-13 postgresql-13-dbgsym

Let’s become the user postgres to proceed from here. (su - postgres)

Let’s take a look:

# pwd && ls
/usr/lib/postgresql/13
bin  lib
# bin/postgres  --version
postgres (PostgreSQL) 13.4 (Ubuntu 13.4-1.pgdg20.04+1)

There are more goodies in /usr/share/postgresql/13/ on Ubuntu.

Omnibus: /opt/gitlab/embedded/postgresql/[version]/lib

11:52 Break! 12:03 Back!

Postgres Cluster and Creating a Database

Run initdb to create a database cluster.

Try /usr/lib/postgresql/13/bin/initdb -? | less

Consider adding /usr/lib/postgresql/13/bin to your $PATH.
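For example (a shell sketch; adjust the path for your PostgreSQL version):

export PATH=/usr/lib/postgresql/13/bin:$PATH
initdb --version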

initdb

It creates a DATADIR. There is no default location, but the Debian/Ubuntu convention is /var/lib/postgresql/13/main. That’s familiar! The install put stuff there, but we are nuking it. The DATADIR is the only thing that must be passed to initdb.

Specific options to consider:

-E, --encoding=ENCODING set default encoding for new databases
--locale=LOCALE set default locale for new databases
--lc-messages=LOCALE controls the language for error messages, use **English** for better search results
--lc-monetary=LOCALE set default locale for monetary formatting

PostgreSQL encoding options: consider UTF-8. The default is taken from the environment (shell running initdb or whatever), don’t leave the default encoding mechanism to fate. ;-)

An option for calling initdb:

initdb -E UTF-8

Set --locale and leave things like --lc-collate or --lc-monetary alone unless you want to override them. A few of these options cannot be changed later, like --lc-collate and --lc-ctype. This controls how databases are created in your cluster moving forward: once a database has been created, its character type and collation cannot be changed.
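Putting that together, a minimal sketch of an initdb call (the data directory and message locale here are assumptions; adjust for your system):

initdb -D /var/lib/postgresql/13/main -E UTF-8 --locale=C.UTF-8 --lc-messages=en_US.UTF-8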

--lc-ctype controls character type, determines what class a character belongs to (punctuation, number, letter, space, etc). --lc-collate determines the sort order of strings.

NOTE: PostgreSQL takes its collations from the C library. Upgrading the C library could therefore change collation order in PostgreSQL. Not great: indices built under the old order are effectively corrupted by the change. Consider rebuilding all indices after such an upgrade.
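A sketch of the rebuild (mydb is a placeholder; REINDEX DATABASE must name the database you are currently connected to):

REINDEX DATABASE mydb;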

Try locale -a to see what collation options are available.

# locale -a
C
C.UTF-8
POSIX

Using C collations means that upgrading C libraries is not a concern: the code points won’t change.

-k, --data-checksums use data page checksums

In databases, everything is stored in a block. Blocks are all the same size; the PostgreSQL block size is 8K. With -k, a checksum is written with each block and verified on read, to alert you to possible storage problems. (Something changed between when the block was written and when it was later read.)
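You can check whether a running cluster was created with checksums (a quick sketch):

SHOW data_checksums;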

PostgreSQL Cluster

What? Why?

The term cluster is somewhat unfortunate. A cluster contains several databases. A better word would be instance.

A cluster contains multiple databases and a single set of users: we have the same set of users for all databases in a cluster. The permissions can be different for one user across different databases.

Default Port: 5432

When connecting to PostgreSQL, you must connect to a specific database. Once you are connected to that database, that’s the one: you can’t query a different database from that connection.

Let’s create a cluster!

Forget about initdb and use Ubuntu’s thing.

  1. Remove cluster created during install with /usr/bin/pg_dropcluster 13 main
  2. Create cluster with pg_createcluster which accepts initdb options:
pg_createcluster 13 main -- -E UTF-8 --locale=C.UTF-8 -k

Docker: pg_ctlcluster 13 main start
Ubuntu: systemctl start postgresql && systemctl enable postgresql

Now that it’s up and running:

$ ps  x  -u postgres
  PID TTY      STAT   TIME COMMAND
 6214 ?        Ss     0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file
 6216 ?        Ss     0:00 postgres: 13/main: checkpointer
 6217 ?        Ss     0:00 postgres: 13/main: background writer
 6218 ?        Ss     0:00 postgres: 13/main: walwriter
 6219 ?        Ss     0:00 postgres: 13/main: autovacuum launcher
 6220 ?        Ss     0:00 postgres: 13/main: stats collector
 6221 ?        Ss     0:00 postgres: 13/main: logical replication launcher

Postgres uses a multi-process architecture and does not use threading. It’s designed to be portable and run on a range of operating systems, which implement multithreading differently. That leaves us with a dedicated process on the PostgreSQL server for each database connection.

Connecting to a database in a cluster

There are three methods we will cover: connecting over TCP/IP, connecting over a Unix socket, and using a connection string.

The only client shipped is psql, our friend. See the list of PostgreSQL clients on the wiki.

It’s like vi. You might not like it but you might need to use it one day so try to get along. :-)

There are four things we must specify with each connection: host, port, user, and database name.

psql -h db.brie.dev -p 5432 -U postgres -d nameofdatabase 

You’ll be prompted for a password with the above so we’ll have to try another way. Thankfully, PostgreSQL listens on TCP and Unix sockets. Yay! :tada:

psql -h /var/run/postgresql -p 5432 -U postgres -d postgres

The psql client is smart enough to know what to do with the directory with the command above. Now, we have a prompt:

$ psql -h /var/run/postgresql -p 5432 -U postgres -d postgres
psql (13.4 (Ubuntu 13.4-1.pgdg20.04+1))
Type "help" for help.

postgres=#

How do I bail? (It’s not :wq!.) Bail with \q or Ctrl + d.

Commands starting with \ are client commands; all other commands are server commands.

Defaults with psql: the host defaults to the local Unix socket, the port to 5432, and both the user and the database default to your operating system user name.

So, we can just say psql. :laughing:

We can use connection strings:

psql 'host=/var/run/postgresql port=5432 user=postgres dbname=postgres'

Support for this comes from the shared client library libpq (specifically libpq.so.5).

HELP: \?

Consider less -RS as the pager:

export PAGER='less -RS'

Get a list of databases with \l.

Maybe you don’t need an interactive session:

psql -d template1 -l

The above just says connect to the template1 database and list all databases.
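Similarly, -c runs a single statement without an interactive session (a quick sketch):

psql -d postgres -c 'SELECT version();'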

Creating a database

You do have to be connected to a database (like postgres) in order to create a database.

Get help for SQL statements with \h! That’s kind of awesome… :sparkles:

\h CREATE DATABASE
postgres=# \h CREATE DATABASE
Command:     CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
    [ [ WITH ] [ OWNER [=] user_name ]
           [ TEMPLATE [=] template ]
           [ ENCODING [=] encoding ]
           [ LOCALE [=] locale ]
           [ LC_COLLATE [=] lc_collate ]
           [ LC_CTYPE [=] lc_ctype ]
           [ TABLESPACE [=] tablespace_name ]
           [ ALLOW_CONNECTIONS [=] allowconn ]
           [ CONNECTION LIMIT [=] connlimit ]
           [ IS_TEMPLATE [=] istemplate ] ]

URL: https://www.postgresql.org/docs/13/sql-createdatabase.html

Creating a database in PostgreSQL is essentially copying an existing database (and then doing a little bit of magic). You can use any database as a template. (Permits cloning or duplicating a database.)

You’ll get the default locales and collation information when you create a database unless you specify them differently when creating the database.

A database can only be used as a template if no one else is connected to it. (To assist with this, you can toggle ALLOW_CONNECTIONS to prevent anyone from connecting to a database.)
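A sketch of both ideas (cutecat_copy is a made-up name):

ALTER DATABASE cutecat WITH ALLOW_CONNECTIONS false;  -- keep everyone out
CREATE DATABASE cutecat_copy TEMPLATE cutecat;        -- clone it
ALTER DATABASE cutecat WITH ALLOW_CONNECTIONS true;   -- open it back up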

You get tab completion inside the psql prompt. Double tab for all options. Terminate statements with ;. The prompt will change to indicate that the server is still waiting on you. Observe:

postgres=# CREATE DATABASE cutecat
postgres-#
postgres-# ;
CREATE DATABASE
postgres=#

We need a new connection in order to connect to our new database. We can do this from within the same psql session! :smile_cat:

Use \c cutecat for this:

postgres=# \c cutecat
You are now connected to database "cutecat" as user "postgres".
cutecat=#

Change the default connection info with environment variables like PGHOST, PGPORT, PGDATABASE, and PGUSER.
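A shell sketch (the values are just examples):

export PGHOST=/var/run/postgresql
export PGPORT=5432
export PGUSER=postgres
export PGDATABASE=cutecat
psql    # connects straight to cutecat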

13:02: Break! 13:19: Back!

Case in Commands

We wrote CREATE DATABASE course; to create a database called course. The SQL standard demands that unquoted identifiers be case folded to upper case, but PostgreSQL case folds to lower case. We could do cReAtE dAtAbAsE course; and the same thing would happen.

Prevent case folding with double quotes: do CREATE DATABASE "COURSE"; to really create a database called COURSE; otherwise, CREATE DATABASE COURSE; would create a database called course.

:warning: Exercise caution with double quotes. (You could use them to create a database called USER, but maybe don’t do that, because USER is a reserved keyword and DROP DATABASE USER; won’t do what you would expect.)
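A quick sketch of the folding behavior:

CREATE DATABASE "COURSE";   -- quoted: really creates COURSE
-- CREATE DATABASE COURSE;  -- unquoted: would fold and create course instead
DROP DATABASE "COURSE";     -- quotes needed again to refer to it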

See Capitalization/case-folding from Stanford’s NLP group or what the W3C has to say about case mapping and case folding.

Template Databases

cutecat=# \c template0
FATAL:  database "template0" is not currently accepting connections
Previous connection kept
cutecat=# \c template1
You are now connected to database "template1" as user "postgres".
template1=# SELECT datname, datallowconn FROM pg_database;
  datname  | datallowconn
-----------+--------------
 postgres  | t
 cutecat   | t
 template1 | t
 template0 | f
 course    | t

We see that template0 is not allowing connections. The docs have more info about the template databases.

Having multiple databases in a single cluster comes with some pros and cons.

Indexing

We need a bunch of rows to make the database sweat. :cold_sweat:

 CREATE TABLE test (id integer GENERATED ALWAYS AS IDENTITY, name text);

Describe a table with \d table_name.

 cutecat=# \d test;
                          Table "public.test"
 Column |  Type   | Collation | Nullable |           Default
--------+---------+-----------+----------+------------------------------
 id     | integer |           | not null | generated always as identity
 name   | text    |           |          |

That query would be akin to the following in MySQL:

 CREATE TABLE test (
     id INT NOT NULL AUTO_INCREMENT,
     name CHAR(30) NOT NULL,
     PRIMARY KEY (id)
);

GENERATED ALWAYS AS IDENTITY has been part of PostgreSQL since version 10. Text columns are limited to a size of 1 GB in PostgreSQL. (What are you doing?!?) Use text if you don’t have a limit that needs to be enforced. (Don’t add a limit from the database side when there is no such limit in the application.)

Let’s add a few rows to our table:

cutecat=# INSERT INTO test (name) VALUES ('plop'), ('boo radley');
INSERT 0 2
cutecat=# INSERT INTO test (name) VALUES ('fluffy');
INSERT 0 1

Duplicate the existing rows:

cutecat=# INSERT INTO test (name) SELECT name FROM test;
INSERT 0 3
cutecat=# SELECT * FROM test;
 id |    name
----+------------
  1 | plop
  2 | boo radley
  3 | fluffy
  4 | plop
  5 | boo radley
  6 | fluffy
(6 rows)

Run the query above until you are happy with how large and slow your database will be. :smile_cat:
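If doubling by hand gets old, a shell sketch (20 doublings turn a handful of rows into a few million):

for i in $(seq 1 20); do
    psql -d cutecat -c 'INSERT INTO test (name) SELECT name FROM test;'
done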

Measure time from client:

cutecat=# \timing
Timing is on.
cutecat=#

OK, now we have a super slow query:

cutecat=# SELECT * FROM test WHERE id = 42;
 id | name
----+-------
 42 | tiger
(1 row)

Time: 679.481 ms

What is PostgreSQL actually doing with a query you provide?

Use EXPLAIN to get the execution plan for your query.

# EXPLAIN SELECT * FROM test WHERE id = 42;
                               QUERY PLAN
------------------------------------------------------------------------
 Gather  (cost=1000.00..82407.93 rows=1 width=11)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..81407.83 rows=1 width=11)
         Filter: (id = 42)
(4 rows)

Time: 0.734 ms

Let’s turn parallelism off to make the plan simpler: SET max_parallel_workers_per_gather = 0;. Here’s the updated EXPLAIN result with this change to the parallel workers:

# EXPLAIN SELECT * FROM test WHERE id = 42;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..137334.80 rows=1 width=11)
   Filter: (id = 42)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 34.111 ms

The value of the cost is not in any unit; it’s primarily useful for comparison. You get two numbers. From above: 0.00 is the startup cost (the cost before the first row can be returned) and 137334.80 is the total cost (the cost to return all rows).

We don’t care much about the width value.

Analogy: A table is a library. A book is a table row. If you go to the library with no information about how the books are organized, you’ll have to do a sequential scan, looking at each book, spine-by-spine. Thankfully, that’s not reality! Today, you consult a card catalog. This is akin to a database index.

Tables are also referred to as heaps in PostgreSQL. Tables have no particular order (not ordered by id). An index is ordered. Each entry in an index points to the corresponding row in the table. (OK, think about it that way. It’s actually implemented as a tree structure, a B-tree by default.)

Create an Index

# CREATE INDEX test_id_idx ON test (id);
CREATE INDEX
Time: 7084.656 ms (00:07.085)

Let’s look at that SELECT again:

cutecat=# EXPLAIN SELECT * FROM test WHERE id = 42;
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=11)
   Index Cond: (id = 42)
(2 rows)

Time: 1.027 ms

The startup cost is higher but wow is the total cost lower. We did it. :thumbsup:

No matter what: stay safe, use bigint.

– Laurenz Albe

If you didn’t stay safe and you have this problem, there is (from memory) guidance on converting from int to bigint.

:pray: Please just use bigint.

14:00 Break! 14:09 Back!

Using an index, looking at query plans

Using EXPLAIN to understand what PostgreSQL does when we give it queries.

SELECT reltuples, relpages FROM pg_class WHERE relname = 'test';

This tells us how many rows (reltuples) and how many 8K pages (relpages) the table has. These are estimates (they’ll be in the right order of magnitude, “close to the truth”).

seq_page_cost and cpu_tuple_cost give us the cost associated with sequentially reading one page and processing one row, respectively.

Picking the lowest-cost plan is easy; estimating those costs accurately is... not so easy.
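A rough sketch of the sequential scan arithmetic (simplified: the real model also charges per-row operator costs for the filter):

SELECT relpages * current_setting('seq_page_cost')::float8
     + reltuples * current_setting('cpu_tuple_cost')::float8 AS approx_seqscan_cost
FROM pg_class WHERE relname = 'test';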

Let’s look at a more interesting query:

EXPLAIN SELECT * FROM test WHERE id < 42;

What happens when PostgreSQL scans more than one row with an index?

The real cost in an index scan is not in reading the index. The expensive part is in reading the table row that the index points to. The index can be read from the beginning to the end or from the end to the beginning.

With a query like SELECT * FROM test WHERE id > 42;, the query optimizer could choose to do an index scan but we see it did a sequential scan instead:

cutecat=# EXPLAIN SELECT * FROM test WHERE id > 42;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..137335.20 rows=7669977 width=11)
   Filter: (id > 42)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 1.158 ms

With a query like this, it’s back to the index scan:

cutecat=# EXPLAIN SELECT * FROM test WHERE id > 4000000;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..121539.97 rows=3587802 width=11)
   Index Cond: (id > 4000000)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)

Time: 1.070 ms

We can be reasonably sure that the optimizer is doing the fastest thing when choosing between index scan and a sequential scan.

The index is useful for doing things like EXPLAIN SELECT max(id) FROM test;:

 Result  (cost=0.46..0.47 rows=1 width=4)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.46 rows=1 width=4)
           ->  Index Only Scan Backward using test_id_idx on test  (cost=0.43..218357.71 rows=7670016 width=4)
                 Index Cond: (id IS NOT NULL)
(5 rows)

The index is useful also for sorting:

cutecat=# EXPLAIN SELECT * FROM test ORDER BY id;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using test_id_idx on test  (cost=0.43..240645.67 rows=7670016 width=11)
(1 row)

Time: 0.598 ms

I can forbid PostgreSQL from using an index scan with SET enable_indexscan = off;. Use RESET enable_indexscan; to revert that change.

:smile: The index is clearly super helpful but…there are associated downsides to be aware of. :frowning:

Inserting a row into the heap is cheap. But every index on the table must also be updated for that insert, and that is not cheap.

As a result: it is not always a good idea to have an index. You need to balance speed when querying vs. speed when updating. (If you are querying monthly but inserting daily... maybe don’t bother with the index.)

An index you don’t use is a bad thing: it degrades performance and provides no benefit.
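One way to spot a suspect index (a sketch, not from the training): check the scan counter in pg_stat_user_indexes.

SELECT indexrelname, idx_scan FROM pg_stat_user_indexes WHERE relname = 'test';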

What if I create an index on the text column?

CREATE INDEX test_name_idx ON test (name);

Will the index even be used?

# EXPLAIN SELECT * FROM test WHERE name = 'plop';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=24286.17..92968.34 rows=2177773 width=11)
   Recheck Cond: (name = 'plop'::text)
   ->  Bitmap Index Scan on test_name_idx  (cost=0.00..23741.73 rows=2177773 width=0)
         Index Cond: (name = 'plop'::text)
(4 rows)

Time: 1.197 ms

Yes, but it’s also doing a new thing called a bitmap heap scan.

Each table block is only fetched once. The entire bitmap might not fit into memory; in that case, PostgreSQL will degrade to one bit per block. Less efficient, but there’s only so much memory so...

You can drop an index like you’d drop a database.
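For example, assuming the index name from above:

DROP INDEX test_name_idx;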

For some purposes, having an index is super useful.

Partial Index

Let’s use a WHERE to create a partial index.

CREATE INDEX test_name_idx ON test (name) WHERE name NOT IN ('plop');

The index will only be created for rows that satisfy the WHERE condition. A partial index is also referred to as a conditional index.

In addition to having a partial index, you can have a unique index. You can also have a unique partial index.
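A sketch of a unique partial index (the accounts table and its columns are hypothetical): enforce that an email is unique among active accounts only.

CREATE UNIQUE INDEX accounts_active_email_idx ON accounts (email) WHERE account_status = 'active';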

You might wish to create a partial index that only includes conditions you are frequently querying for. Like:

CREATE INDEX on test (name) WHERE account_status NOT IN ('active');

Completely and quickly made up to demonstrate: querying only for users that are not active.

:thought_balloon: It’s really important to understand your particular application and dataset to know how to decide whether to make an index, whether a partial index would make sense and how to scope the partial index.

Great segue…

Who creates indexes?

Ideally, the developer. Don’t cheat by only considering datasets that are too small to be relevant to production usage. Consider index creation when designing the query. Have realistic amounts of test data at some point between development and production. :pray: Please!

  1. Analyze query performance
  2. Observe results
  3. Improve query (within reason)
  4. Build/design index

Get to [2021-training-notes-day-2].