ruk·si

🐘 PostgreSQL
Basics

Updated at 2015-09-23 23:54

This note is about PostgreSQL, briefly going through all the important features. Related to PostgreSQL tuning notes and PostgreSQL extensions notes. PostgreSQL is object-relational database management system.

Formatting

SQL formatting in the official PostgreSQL documentation is a mess. Identifiers with or without underscores, functions in uppercase or lowercase. Decide which style you'd like to use with other team members and stick to it. This can avoid subtle bugs as PostgreSQL will automatically lowercase anything not double quoted.

This is the formatting I prefer:
    Keywords and constants are in UPPER_SNAKECASE.
        SELECT      FROM        AS      CASE
        WHERE       IN          NOT     LEFT JOIN
        AND         OR          ENUM    NOT
        DEFAULT     TRUE        FALSE   NULL
        integer ARRAY[4] => ARRAY is a keyword, integer is not.

    Everything else is in lower_snakecase.
        database_name
        schema_name
        table_name
        column_name
        data_type
            my_enum_type
            integer
            integer[4]
        If there are parentheses `()` after a word, it's a function.
            function_name()
            array_upper()
            count()
            max()
            clock_timestamp()
            extract(month FROM timestamp '2013-09-22') AS month
SELECT t.* FROM my_schema.my_table t;

-- Specifying columns is __always__ better than *.
SELECT t.name, t.some_cost FROM my_schema.my_table t;

-- Prefer filtering results in the query.
SELECT t.name, t.some_cost FROM my_schema.my_table t WHERE some_cost > 0;

Use consistent naming. Use singular or plural names whole way through each project. Don't mix table naming styles e.g. users and order_row, it should be order_rows. I personally prefer singular names as they are easier to work with but switch to the plural if it's the framework standard e.g. in Ruby on Rails.

1. Many words have multiple plural forms e.g. person vs persons, people.
2. Many words sound silly in plural e.g. user_status vs user_statuses.
3. Many words are irregular in plural e.g. repository vs repositories.
4. Many words don't have a plural e.g. samurai, but there are ninjas.
5. Plural forms will mix up alphabetical sorting of tables
   e.g. order, order_detail vs order_details, orders.

Use descriptive names. Avoid using abbreviations if they are not commonly known. You can use aliases to reduce written SQL.

-- bad
SELECT rep.rep_id FROM rep;
-- good
SELECT report.report_id FROM report;
-- even better
SELECT r.daily_report_id FROM daily_usage_report r

Use whitespace to make SQL readable. Don't worry about consistent indention, align related statements for easier reading, especially important as many developers you don't use proper syntax highlight with SQL. Joins should be clearly intended from the surrounding markup. Same approach as in Haskell and other functional languages as SQL is more functional than object oriented.

SELECT
  s.name AS supplier_name,
  p.name AS product_name,
  p.some_cost,
  p.other_cost
 FROM my_schema.product p
      LEFT JOIN my_schema.supplier s
          USING (supplier_id)
WHERE p.some_cost > 0
  AND p.some_cost < (p.other_cost / 50);

Use comments in SQL. Comments are started with -- and continue to the end of line.

-- This is a comment
SELECT * FROM table; -- This is also comment

SELECT
  name,
  age   -- This is a comment about the age column
FROM user;

Maximum name length is 30. Otherwise some environments might give you troubles.

SELECT * FROM really_long_table_name_but_ok

Names must start with a letter and not end in an underscore. Some adapters and drivers have problems with these.

Never give a table the same name as one of the columns. And vice versa.

-- bad
SELECT k.key FROM key k

Avoid using just id as the primary key column. Use more descriptive name instead, like user_id. Allows using USING statement in joins. A table can have multiple columns that end in _id e.g. foreign keys.

Always use aliasing when joining tables. The SQL becomes easier to read and extend later. Prefer using without the optional AS keyword.

SELECT
  u.email,
  pr.created_at
FROM user u
  LEFT JOIN password_reset pr
    USING (user_id);

Functions must contain a verb.

-- bad
tax_price()

-- good
get_tax_price()

Always include newline ...

  • ... before AND or OR.
  • ... after each keyword definition.
  • ... after a comma when separating multiple columns into logical groups.
  • ... after semicolons to separate queries for easier reading.
  • ... to separate code into related sections.

General

Basic examples:

Use UUIDs as your primary keys. Seriously, this is not 1990, use a UUID as the primary key. The size increase is marginal compared to the benefits of horizontal scaling.

  • integer sequences and serial are useful but not as primary keys
  • If you just want collision avoidance, UUIDv4 is enough.
  • If you want lexicographical sorting...
    • xid or ulid both work great
    • UUIDv6 or UUIDv7 if you want to follow the standards
Here are some comparison when using separate primary keys when
inserting 10 000 000 rows to a PostgreSQL database table.

UUID is generated with plain uuid_generate_v4() of the uuid-ossp module.

Single Operations:

column type     time (s)    size w index (MB)
BIGSERIAL       4262        636.7
UUID            4367        890.0
BIGINT          4624        636.7

Bulk:

column type     time (s)    size w index (MB)
BIGSERIAL       898         636.7
UUID            991         890.0
BIGINT          1147        636.7

CASE clause is like an if-statement in programming languages.

SELECT
  p.name,
  CASE WHEN (p.cost > 100) THEN
    'expensive'
  ELSE
    'cheap'
  END AS cost_class
FROM my_schema.product p;
-- name,    cost,  cost_class
-- Android  99     cheap
-- iPhone   299    expensive

DISTINCT is used to remove returned duplicates. You should limit DISTINCT usage only to situations that require it, it should not be used to fix incorrect joins.

SELECT DISTINCT m.surname
FROM my_schema.member m
ORDER BY m.surname;

Prefer IN over OR.

-- bad
SELECT *
 FROM my_schema.company c
WHERE c.company_id = 1
   OR c.company_id = 5;

-- good
SELECT *
FROM my_schema.company c
WHERE c.company_id IN (1, 5);

Extensions

Extensions usually create functions, data types, operators and index support methods. Can be removed with DROP EXTENSION.

Use extensions. They are really worth it. More about useful extension in PostgreSQL extension notes.

Schemas

A database contains one or more named schemas, which in turn contain tables, data types, functions and operations.

If no schema name is given, system determines which table is meant by following search path, which is a list of schemas.

-- First is a schema named same as the user.
SHOW search_path;
--    search_path
-----------------
-- "$user",public

 -- Changing the search path.
 SET search_path TO my_schema, public;

PostgreSQL has pg_catalog schema that is in every database. It is always searched, even if it is not in the search path. Catalogs are the place where PostgreSQL stores metadata, e.g. information about tables and columns. Both they are just regular tables.

When you create tables and such without specifying an schema name, they are added to schema named 'public'. Public schema is not special, just a default that is always created. Overall, it's a good practice to use schemas.

-- The same thing:
CREATE TABLE product ( ... );
CREATE TABLE public.product ( ... );

Why to use schemas:

  • Allowing users to use one database without interfering with each other.
  • Organize database objects into logical groups to make them more manageable.
  • Third-party applications can be placed on own schemas so they do not collide.
CREATE SCHEMA schema_name;
CREATE SCHEMA schema_name AUTHORIZATION username; -- who owns created objects?
DROP SCHEMA schema_name CASCADE;

By default, users cannot access any objects in schemas they do not own. To allow that, the owner of the schema needs to grant the USAGE privilege on the schema.

GRANT CREATE ON SCHEMA schema_name TO public;  -- allow creating stuff
GRANT USAGE ON SCHEMA schema_name TO public;   -- allow viewing stuff

By default everyone has CREATE and USAGE privileges on public.

-- How to remove creating rights from public.
-- `FROM public` means "every user".
REVOKE CREATE ON SCHEMA public FROM public;

Locale

Database locale defines how rows are sorted, language of messages and formats of returned numbers, currency, dates and times.

Sorting using C collation.
    A, B, C, a, b, c
Sorting using utf8.en_US collation:
    A, a, B, b, C, c

PostgreSQL offers standard ISO C and POSIX locales, which are the same thing. Other available locales depend on your operating system.

If you don't have a special reason, use en_US.UTF-8. It supports multibyte characters. You can change locale per query or per table basis if it is required.

initdb --locale=en_US.UTF-8

You can also specify collation by column, but you rarely need to.

CREATE TABLE test1 (
    a text COLLATE "de_DE",
    b text COLLATE "es_ES"
);

Joins

INNER JOIN is the most common way of joining tables. Inner join returns only the rows that exist in the both tables.

SELECT b.start_time
  FROM my_schema.booking b
    INNER JOIN my_schema.member m ON (m.member_id = b.member_id)
 WHERE m.first_name = 'John'
   AND m.surname = 'Doe';

JOIN is an alias for INNER JOIN.

SELECT
      b.start_time,
      f.name AS facility_name
 FROM my_schema.booking b
    JOIN my_schema.facility f ON (f.facility_id = b.facility_id)
WHERE b.start_time > '2012-09-21' AND b.start_time < '2012-09-22'
  AND f.name LIKE '%Tennis Court%';

You may join tables to themselves. This usually is too confusing concept and should be avoided.

SELECT DISTINCT
      recommender.first_name,
      recommender.surname
 FROM my_schema.member m
    INNER JOIN my_schema.member recommender
        ON (recommender.member_id = m.recommended_by)
ORDER BY surname, first_name;

SELECT
      first_name,
      surname
 FROM my_schema.member
WHERE member_id IN (SELECT recommended_by FROM my_schema.member)
ORDER BY surname, first_name;

LEFT JOINs are used to get optional data from the joined table. LEFT JOIN is the same as LEFT OUTER JOIN.

-- All members which have used Tennis Court
SELECT DISTINCT
      m.first_name,
      m.surname,
      f.name
 FROM my_schema.booking b
    LEFT JOIN my_schema.member m ON (m.member_id = b.member_id)
    LEFT JOIN my_schema.facility f ON (f.facility_id = b.facility_id)
WHERE f.name LIKE '%Tennis Court%'
ORDER BY first_name, surname;

Unions

UNION ALL is used to combine multiple queries.

    SELECT DISTINCT surname
    FROM my_schema.member
UNION ALL
    SELECT DISTINCT name
    FROM my_schema.facility;

Subqueries

Complex SQL statements will use subqueries, a query inside a query.

Subqueries can be in a SELECT-clause. This is usually a bad practice because the subquery is ran on every result.

SELECT DISTINCT
    mem.first_name || ' ' ||  mem.surname as member,
    (
        SELECT rec.first_name || ' ' || rec.surname as recommender
          FROM my_schema.member rec
         WHERE rec.member_id = mem.recommended_by
    )
FROM my_schema.member mem
ORDER BY member;

Subqueries can be in a FROM clause, which are called inline views.

SELECT member, facility, totalcost
FROM (
    SELECT
          mem.first_name || ' ' || mem.surname AS member,
          fac.name AS facility,
          CASE WHEN mem.member_id = 0 THEN
              bk.slots * fac.guest_cost
          ELSE
              bk.slots * fac.member_cost
          END AS totalcost
     FROM my_schema.member mem
        INNER JOIN my_schema.booking bk ON memmem.member_id = bk.member_id
        INNER JOIN my_schema.facility fac ON bk.facility_id = fac.facility_id
    WHERE bk.start_time BETWEEN '2012-09-14' AND '2012-09-15'
) AS booking_with_cost
WHERE totalcost > 30
ORDER BY totalcost DESC;

Subqueries can be in a IN clause. These subquery must return 1 column, where it acts much like an array.

SELECT f1.*
FROM my_schema.facility f1
WHERE f1.facility_id IN (SELECT f2.facility_id FROM my_schema.facility f2);

Aggregate Functions

Aggregate functions take a column of data and output a single value. For the full list of aggregate functions, check PostgreSQL aggregate function documentation.

-- This will not work because count() wants to collapse the results
-- while facility_id wants to keep them separate.
SELECT
     facility_id,
     count(*) AS my_count
FROM my_schema.facility

-- This will work, and PostgreSQL will cache the result as
-- it knows it is a repeated aggregate function.
SELECT
     facility_id,
     (SELECT count(*) FROM my_schema.facility) AS my_count
FROM my_schema.facility

-- This also does not work as the aggregate function result is only usable
-- after the `WHERE` has been applied.
SELECT
      facility_id,
      (SELECT count(*) FROM my_schema.facility) AS my_count
 FROM my_schema.facility
WHERE my_count > 0

-- You can use `HAVING` which filters the result set after selection.
SELECT
     facility_id,
     (SELECT count(*) FROM my_schema.facility) AS my_count
FROM my_schema.facility
GROUP BY facility_id
HAVING (SELECT count(*) FROM my_schema.facility) > 2
ORDER BY facility_id
-- Or create a subquery:
SELECT *
FROM (
    SELECT
        facility_id,
        (SELECT count(*) FROM my_schema.facility) AS my_count
    FROM my_schema.facility
    GROUP BY facility_id
    ORDER BY facility_id
) AS counted_totals
WHERE my_count > 0

Using aggregate function, IN-clause and a subquery to filter rows.

SELECT first_name, surname, join_date
  FROM my_schema.member
WHERE join_date = (
    SELECT max(join_date)
    FROM my_schema.member
);

count() function returns the number of results.

-- Number of members.
SELECT count(*) FROM my_schema.member

-- Number of members with address.
SELECT count(address) FROM my_schema.member

-- Number of members with different address.
SELECT count(DISTINCT address) FROM my_schema.member

-- Number of members with 'nom' in their name.
SELECT count(*) FROM my_schema.member WHERE name LIKE '%nom%'

sum() function returns the sum of the results.

SELECT
     facility_id,
     sum(slots) as slots_total
FROM my_schema.booking
GROUP BY facility_id
ORDER BY facility_id

min() function returns the smallest result. Works with dates too.

-- Return first time each product has been sold after the start of September.
SELECT
      p.product_id,
      p.name
      min(s.sell_time) AS first_sell_time
FROM my_schema.sale_item s
    INNER JOIN my_schema.product p on (p.product_id = s.product_id)
WHERE s.sell_time >= '2012-09-01'
GROUP BY p.product_id, p.name
ORDER BY p.product_id;

Common Table Expression

Create CTEs using WITH. Common table expressions (CTE) are used to create temporary views to data. They can greatly simplify queries with a lot of or complex subqueries.

WITH sum AS (
    SELECT
         facility_id,
         sum(slots) AS slots_total
    FROM my_schema.booking
    GROUP BY facility_id
)
SELECT
      facility_id,
      slots_total
 FROM sum
WHERE slots_total = (SELECT max(slots_total) FROM sum);

You can make recursive CTEs with RECURSIVE keyword.

WITH RECURSIVE increment(number) AS (
        -- Initial statement.
        SELECT 1
    UNION ALL
        -- Recursive statement
        SELECT increment.number + 1
          FROM increment
         WHERE increment.number < 5
)
SELECT *
  FROM increment;
-- => Creates 5 rows from 1 to 5.

User Management

Every connection to the database server is made in the name of some particular role, and this role determines the initial access privileges for commands issued on that connection. Roles may have a password.

The right to modify or destroy an object is always the privilege of the owner only and owner is always the user that created the object.

If user has SUPERUSER attribute, user can always access all object.

Privileges:

SELECT, INSERT, UPDATE, DELETE, TRUNCATE,
REFERENCES, TRIGGER, CREATE,
CONNECT, TEMPORARY, EXECUTE, USAGE

-- ALL means all privileges
GRANT UPDATE ON accounts TO joe;

-- PUBLIC means all users.
REVOKE ALL ON accounts FROM PUBLIC;

Roles and groups are helpful to maintain privileges to multiple users. Both are managed with keyword ROLE and difference comes how you use the created role.

CREATE ROLE joe WITH LOGIN INHERIT; -- Allow logging in.

CREATE ROLE admin NOINHERIT; -- Create admin group.
CREATE ROLE god NOINHERIT; -- Create god group.
GRANT admin TO joe; -- Tell that joe is an admin.
GRANT god TO admin; -- Tell that admins are gods.
-- Note that joe has privileges of 'joe' and 'admin' but not 'god'
-- because of NOINHERIT.

REVOKE admin FROM joe;
-- Now joe has only privileges of 'joe'.

-- Original privilege state
SET ROLE joe; -- Can be direct or indirect non-inherited
SET ROLE NONE;
RESET ROLE;

-- Removing a group.
DROP ROLE admin;

-- Change is login allowed.
ALTER ROLE john LOGIN; -- Add login.
ALTER ROLE john NOLOGIN; -- Remove login.

The role attributes LOGIN, SUPERUSER, CREATEDB, and CREATEROLE can be thought of as special privileges, but they are never inherited as ordinary privileges on database objects are. You must actually SET ROLE.

Functions and triggers allow users to insert code into the backend server that other users might execute unintentionally. Hence, both mechanisms permit users to "Trojan horse" others with relative ease. The only real protection is tight control over who can define functions.

Tablespaces

Tablespaces allow database administrators to define locations in the file system where the files representing database objects are stored.

CREATE TABLESPACE fastspace LOCATION '/mnt/sda1/postgresql/data';

There is usually not much point in making more than one tablespace per logical file system, since you cannot control the location of individual files within a logical file system.

CREATE TABLE foo(i int) TABLESPACE space1;
-- Or
SET default_tablespace = space1;
CREATE TABLE foo(i int);

Two tablespaces are automatically created by initdb. The pg_global tablespace is used for shared system catalogs. The pg_default tablespace is the default tablespace of the template1 and template0 databases and therefore will be the default tablespace for other databases as well, unless overridden by a TABLESPACE clause in CREATE DATABASE.

Why to use tablespaces:

  • If the partition or volume on which the cluster was initialized runs out of space and cannot be extended, a tablespace can be created on a different partition and used until the system can be reconfigured.
  • Tablespaces allow an administrator to use knowledge of the usage pattern of database objects to optimize performance. For example, an index which is very heavily used can be placed on a very fast, highly available disk, such as an expensive solid state device. At the same time a table storing archived data which is rarely used or not performance critical could be stored on a less expensive, slower disk system.

Moving a table from a tablespace to another. This should be automated.

ALTER DATABASE "[database_name]" SET default_tablespace = [new_tablespace];
ALTER TABLE "[table_name]" SET TABLESPACE "[new_tablespace]";
ALTER INDEX "[index_name]" SET TABLESPACE "[new_tablespace]";

Sources