🐘 PostgreSQL - Basics
This note is about PostgreSQL, briefly going through all the important features. Related to PostgreSQL tuning notes and PostgreSQL extensions notes. PostgreSQL is object-relational database management system.
Formatting
SQL formatting in the official PostgreSQL documentation is a mess. Identifiers with or without underscores, functions in uppercase or lowercase. Decide which style you'd like to use with other team members and stick to it. This can avoid subtle bugs as PostgreSQL will automatically lowercase anything not double quoted.
This is the formatting I prefer:
Keywords and constants are in UPPER_SNAKECASE.
SELECT FROM AS CASE
WHERE IN NOT LEFT JOIN
AND OR ENUM NOT
DEFAULT TRUE FALSE NULL
integer ARRAY[4] => ARRAY is a keyword, integer is not.
Everything else is in lower_snakecase.
database_name
schema_name
table_name
column_name
data_type
my_enum_type
integer
integer[4]
If there are parentheses `()` after a word, it's a function.
function_name()
array_upper()
count()
max()
clock_timestamp()
extract(month FROM timestamp '2013-09-22') AS month
SELECT t.* FROM my_schema.my_table t;
-- Specifying columns is __always__ better than *.
SELECT t.name, t.some_cost FROM my_schema.my_table t;
-- Prefer filtering results in the query.
SELECT t.name, t.some_cost FROM my_schema.my_table t WHERE some_cost > 0;
Use consistent naming. Use singular or plural names whole way through each project. Don't mix table naming styles e.g. users
and order_row
, it should be order_rows
. I personally prefer singular names as they are easier to work with but switch to the plural if it's the framework standard e.g. in Ruby on Rails.
1. Many words have multiple plural forms e.g. person vs persons, people.
2. Many words sound silly in plural e.g. user_status vs user_statuses.
3. Many words are irregular in plural e.g. repository vs repositories.
4. Many words don't have a plural e.g. samurai, but there are ninjas.
5. Plural forms will mix up alphabetical sorting of tables
e.g. order, order_detail vs order_details, orders.
Use descriptive names. Avoid using abbreviations if they are not commonly known. You can use aliases to reduce written SQL.
-- bad
SELECT rep.rep_id FROM rep;
-- good
SELECT report.report_id FROM report;
-- even better
SELECT r.daily_report_id FROM daily_usage_report r
Use whitespace to make SQL readable. Don't worry about consistent indention, align related statements for easier reading, especially important as many developers you don't use proper syntax highlight with SQL. Joins should be clearly intended from the surrounding markup. Same approach as in Haskell and other functional languages as SQL is more functional than object oriented.
SELECT
s.name AS supplier_name,
p.name AS product_name,
p.some_cost,
p.other_cost
FROM my_schema.product p
LEFT JOIN my_schema.supplier s
USING (supplier_id)
WHERE p.some_cost > 0
AND p.some_cost < (p.other_cost / 50);
Use comments in SQL. Comments are started with --
and continue to the end of line.
-- This is a comment
SELECT * FROM table; -- This is also comment
SELECT
name,
age -- This is a comment about the age column
FROM user;
Maximum name length is 30. Otherwise some environments might give you troubles.
SELECT * FROM really_long_table_name_but_ok
Names must start with a letter and not end in an underscore. Some adapters and drivers have problems with these.
Never give a table the same name as one of the columns. And vice versa.
-- bad
SELECT k.key FROM key k
Avoid using just id
as the primary key column. Use more descriptive name instead, like user_id
. Allows using USING
statement in joins. A table can have multiple columns that end in _id
e.g. foreign keys.
Always use aliasing when joining tables. The SQL becomes easier to read and extend later. Prefer using without the optional AS
keyword.
SELECT
u.email,
pr.created_at
FROM user u
LEFT JOIN password_reset pr
USING (user_id);
Functions must contain a verb.
-- bad
tax_price()
-- good
get_tax_price()
Always include newline ...
- ... before
AND
orOR
. - ... after each keyword definition.
- ... after a comma when separating multiple columns into logical groups.
- ... after semicolons to separate queries for easier reading.
- ... to separate code into related sections.
General
Basic examples:
Use UUIDs as your primary keys. Seriously, this is not 1990, use a UUID as the primary key. The size increase is marginal compared to the benefits of horizontal scaling.
integer
sequences andserial
are useful but not as primary keys- If you just want collision avoidance,
UUIDv4
is enough. - If you want lexicographical sorting...
xid
orulid
both work greatUUIDv6
orUUIDv7
if you want to follow the standards
Here are some comparison when using separate primary keys when
inserting 10 000 000 rows to a PostgreSQL database table.
UUID is generated with plain uuid_generate_v4() of the uuid-ossp module.
Single Operations:
column type time (s) size w index (MB)
BIGSERIAL 4262 636.7
UUID 4367 890.0
BIGINT 4624 636.7
Bulk:
column type time (s) size w index (MB)
BIGSERIAL 898 636.7
UUID 991 890.0
BIGINT 1147 636.7
CASE
clause is like an if-statement in programming languages.
SELECT
p.name,
CASE WHEN (p.cost > 100) THEN
'expensive'
ELSE
'cheap'
END AS cost_class
FROM my_schema.product p;
-- name, cost, cost_class
-- Android 99 cheap
-- iPhone 299 expensive
DISTINCT
is used to remove returned duplicates. You should limit DISTINCT
usage only to situations that require it, it should not be used to fix incorrect joins.
SELECT DISTINCT m.surname
FROM my_schema.member m
ORDER BY m.surname;
Prefer IN
over OR
.
-- bad
SELECT *
FROM my_schema.company c
WHERE c.company_id = 1
OR c.company_id = 5;
-- good
SELECT *
FROM my_schema.company c
WHERE c.company_id IN (1, 5);
Extensions
Extensions usually create functions, data types, operators and index support methods. Can be removed with DROP EXTENSION
.
Use extensions. They are really worth it. More about useful extension in PostgreSQL extension notes.
Schemas
A database contains one or more named schemas, which in turn contain tables, data types, functions and operations.
If no schema name is given, system determines which table is meant by following search path, which is a list of schemas.
-- First is a schema named same as the user.
SHOW search_path;
-- search_path
-----------------
-- "$user",public
-- Changing the search path.
SET search_path TO my_schema, public;
PostgreSQL has pg_catalog
schema that is in every database. It is always searched, even if it is not in the search path. Catalogs are the place where PostgreSQL stores metadata, e.g. information about tables and columns. Both they are just regular tables.
When you create tables and such without specifying an schema name, they are added to schema named 'public'. Public schema is not special, just a default that is always created. Overall, it's a good practice to use schemas.
-- The same thing:
CREATE TABLE product ( ... );
CREATE TABLE public.product ( ... );
Why to use schemas:
- Allowing users to use one database without interfering with each other.
- Organize database objects into logical groups to make them more manageable.
- Third-party applications can be placed on own schemas so they do not collide.
CREATE SCHEMA schema_name;
CREATE SCHEMA schema_name AUTHORIZATION username; -- who owns created objects?
DROP SCHEMA schema_name CASCADE;
By default, users cannot access any objects in schemas they do not own. To allow that, the owner of the schema needs to grant the USAGE privilege on the schema.
GRANT CREATE ON SCHEMA schema_name TO public; -- allow creating stuff
GRANT USAGE ON SCHEMA schema_name TO public; -- allow viewing stuff
By default everyone has CREATE
and USAGE
privileges on public
.
-- How to remove creating rights from public.
-- `FROM public` means "every user".
REVOKE CREATE ON SCHEMA public FROM public;
Locale
Database locale defines how rows are sorted, language of messages and formats of returned numbers, currency, dates and times.
Sorting using C collation.
A, B, C, a, b, c
Sorting using utf8.en_US collation:
A, a, B, b, C, c
PostgreSQL offers standard ISO C and POSIX locales, which are the same thing. Other available locales depend on your operating system.
If you don't have a special reason, use en_US.UTF-8
. It supports multibyte characters. You can change locale per query or per table basis if it is required.
initdb --locale=en_US.UTF-8
You can also specify collation by column, but you rarely need to.
CREATE TABLE test1 (
a text COLLATE "de_DE",
b text COLLATE "es_ES"
);
Joins
INNER JOIN
is the most common way of joining tables. Inner join returns only the rows that exist in the both tables.
SELECT b.start_time
FROM my_schema.booking b
INNER JOIN my_schema.member m ON (m.member_id = b.member_id)
WHERE m.first_name = 'John'
AND m.surname = 'Doe';
JOIN
is an alias for INNER JOIN
.
SELECT
b.start_time,
f.name AS facility_name
FROM my_schema.booking b
JOIN my_schema.facility f ON (f.facility_id = b.facility_id)
WHERE b.start_time > '2012-09-21' AND b.start_time < '2012-09-22'
AND f.name LIKE '%Tennis Court%';
You may join tables to themselves. This usually is too confusing concept and should be avoided.
SELECT DISTINCT
recommender.first_name,
recommender.surname
FROM my_schema.member m
INNER JOIN my_schema.member recommender
ON (recommender.member_id = m.recommended_by)
ORDER BY surname, first_name;
SELECT
first_name,
surname
FROM my_schema.member
WHERE member_id IN (SELECT recommended_by FROM my_schema.member)
ORDER BY surname, first_name;
LEFT JOIN
s are used to get optional data from the joined table. LEFT JOIN
is the same as LEFT OUTER JOIN
.
-- All members which have used Tennis Court
SELECT DISTINCT
m.first_name,
m.surname,
f.name
FROM my_schema.booking b
LEFT JOIN my_schema.member m ON (m.member_id = b.member_id)
LEFT JOIN my_schema.facility f ON (f.facility_id = b.facility_id)
WHERE f.name LIKE '%Tennis Court%'
ORDER BY first_name, surname;
Unions
UNION ALL
is used to combine multiple queries.
SELECT DISTINCT surname
FROM my_schema.member
UNION ALL
SELECT DISTINCT name
FROM my_schema.facility;
Subqueries
Complex SQL statements will use subqueries, a query inside a query.
Subqueries can be in a SELECT
-clause. This is usually a bad practice because the subquery is ran on every result.
SELECT DISTINCT
mem.first_name || ' ' || mem.surname as member,
(
SELECT rec.first_name || ' ' || rec.surname as recommender
FROM my_schema.member rec
WHERE rec.member_id = mem.recommended_by
)
FROM my_schema.member mem
ORDER BY member;
Subqueries can be in a FROM
clause, which are called inline views.
SELECT member, facility, totalcost
FROM (
SELECT
mem.first_name || ' ' || mem.surname AS member,
fac.name AS facility,
CASE WHEN mem.member_id = 0 THEN
bk.slots * fac.guest_cost
ELSE
bk.slots * fac.member_cost
END AS totalcost
FROM my_schema.member mem
INNER JOIN my_schema.booking bk ON memmem.member_id = bk.member_id
INNER JOIN my_schema.facility fac ON bk.facility_id = fac.facility_id
WHERE bk.start_time BETWEEN '2012-09-14' AND '2012-09-15'
) AS booking_with_cost
WHERE totalcost > 30
ORDER BY totalcost DESC;
Subqueries can be in a IN
clause. These subquery must return 1 column, where it acts much like an array.
SELECT f1.*
FROM my_schema.facility f1
WHERE f1.facility_id IN (SELECT f2.facility_id FROM my_schema.facility f2);
Aggregate Functions
Aggregate functions take a column of data and output a single value. For the full list of aggregate functions, check PostgreSQL aggregate function documentation.
-- This will not work because count() wants to collapse the results
-- while facility_id wants to keep them separate.
SELECT
facility_id,
count(*) AS my_count
FROM my_schema.facility
-- This will work, and PostgreSQL will cache the result as
-- it knows it is a repeated aggregate function.
SELECT
facility_id,
(SELECT count(*) FROM my_schema.facility) AS my_count
FROM my_schema.facility
-- This also does not work as the aggregate function result is only usable
-- after the `WHERE` has been applied.
SELECT
facility_id,
(SELECT count(*) FROM my_schema.facility) AS my_count
FROM my_schema.facility
WHERE my_count > 0
-- You can use `HAVING` which filters the result set after selection.
SELECT
facility_id,
(SELECT count(*) FROM my_schema.facility) AS my_count
FROM my_schema.facility
GROUP BY facility_id
HAVING (SELECT count(*) FROM my_schema.facility) > 2
ORDER BY facility_id
-- Or create a subquery:
SELECT *
FROM (
SELECT
facility_id,
(SELECT count(*) FROM my_schema.facility) AS my_count
FROM my_schema.facility
GROUP BY facility_id
ORDER BY facility_id
) AS counted_totals
WHERE my_count > 0
Using aggregate function, IN
-clause and a subquery to filter rows.
SELECT first_name, surname, join_date
FROM my_schema.member
WHERE join_date = (
SELECT max(join_date)
FROM my_schema.member
);
count()
function returns the number of results.
-- Number of members.
SELECT count(*) FROM my_schema.member
-- Number of members with address.
SELECT count(address) FROM my_schema.member
-- Number of members with different address.
SELECT count(DISTINCT address) FROM my_schema.member
-- Number of members with 'nom' in their name.
SELECT count(*) FROM my_schema.member WHERE name LIKE '%nom%'
sum()
function returns the sum of the results.
SELECT
facility_id,
sum(slots) as slots_total
FROM my_schema.booking
GROUP BY facility_id
ORDER BY facility_id
min()
function returns the smallest result. Works with dates too.
-- Return first time each product has been sold after the start of September.
SELECT
p.product_id,
p.name
min(s.sell_time) AS first_sell_time
FROM my_schema.sale_item s
INNER JOIN my_schema.product p on (p.product_id = s.product_id)
WHERE s.sell_time >= '2012-09-01'
GROUP BY p.product_id, p.name
ORDER BY p.product_id;
Common Table Expression
Create CTEs using WITH
. Common table expressions (CTE) are used to create temporary views to data. They can greatly simplify queries with a lot of or complex subqueries.
WITH sum AS (
SELECT
facility_id,
sum(slots) AS slots_total
FROM my_schema.booking
GROUP BY facility_id
)
SELECT
facility_id,
slots_total
FROM sum
WHERE slots_total = (SELECT max(slots_total) FROM sum);
You can make recursive CTEs with RECURSIVE
keyword.
WITH RECURSIVE increment(number) AS (
-- Initial statement.
SELECT 1
UNION ALL
-- Recursive statement
SELECT increment.number + 1
FROM increment
WHERE increment.number < 5
)
SELECT *
FROM increment;
-- => Creates 5 rows from 1 to 5.
User Management
Every connection to the database server is made in the name of some particular role, and this role determines the initial access privileges for commands issued on that connection. Roles may have a password.
The right to modify or destroy an object is always the privilege of the owner only and owner is always the user that created the object.
If user has SUPERUSER
attribute, user can always access all object.
Privileges:
SELECT, INSERT, UPDATE, DELETE, TRUNCATE,
REFERENCES, TRIGGER, CREATE,
CONNECT, TEMPORARY, EXECUTE, USAGE
-- ALL means all privileges
GRANT UPDATE ON accounts TO joe;
-- PUBLIC means all users.
REVOKE ALL ON accounts FROM PUBLIC;
Roles and groups are helpful to maintain privileges to multiple users. Both are managed with keyword ROLE
and difference comes how you use the created role.
CREATE ROLE joe WITH LOGIN INHERIT; -- Allow logging in.
CREATE ROLE admin NOINHERIT; -- Create admin group.
CREATE ROLE god NOINHERIT; -- Create god group.
GRANT admin TO joe; -- Tell that joe is an admin.
GRANT god TO admin; -- Tell that admins are gods.
-- Note that joe has privileges of 'joe' and 'admin' but not 'god'
-- because of NOINHERIT.
REVOKE admin FROM joe;
-- Now joe has only privileges of 'joe'.
-- Original privilege state
SET ROLE joe; -- Can be direct or indirect non-inherited
SET ROLE NONE;
RESET ROLE;
-- Removing a group.
DROP ROLE admin;
-- Change is login allowed.
ALTER ROLE john LOGIN; -- Add login.
ALTER ROLE john NOLOGIN; -- Remove login.
The role attributes LOGIN
, SUPERUSER
, CREATEDB
, and CREATEROLE
can be thought of as special privileges, but they are never inherited as ordinary privileges on database objects are. You must actually SET ROLE
.
Functions and triggers allow users to insert code into the backend server that other users might execute unintentionally. Hence, both mechanisms permit users to "Trojan horse" others with relative ease. The only real protection is tight control over who can define functions.
Tablespaces
Tablespaces allow database administrators to define locations in the file system where the files representing database objects are stored.
CREATE TABLESPACE fastspace LOCATION '/mnt/sda1/postgresql/data';
There is usually not much point in making more than one tablespace per logical file system, since you cannot control the location of individual files within a logical file system.
CREATE TABLE foo(i int) TABLESPACE space1;
-- Or
SET default_tablespace = space1;
CREATE TABLE foo(i int);
Two tablespaces are automatically created by initdb. The pg_global
tablespace is used for shared system catalogs. The pg_default
tablespace is the default tablespace of the template1 and template0 databases and therefore will be the default tablespace for other databases as well, unless overridden by a TABLESPACE
clause in CREATE DATABASE
.
Why to use tablespaces:
- If the partition or volume on which the cluster was initialized runs out of space and cannot be extended, a tablespace can be created on a different partition and used until the system can be reconfigured.
- Tablespaces allow an administrator to use knowledge of the usage pattern of database objects to optimize performance. For example, an index which is very heavily used can be placed on a very fast, highly available disk, such as an expensive solid state device. At the same time a table storing archived data which is rarely used or not performance critical could be stored on a less expensive, slower disk system.
Moving a table from a tablespace to another. This should be automated.
ALTER DATABASE "[database_name]" SET default_tablespace = [new_tablespace];
ALTER TABLE "[table_name]" SET TABLESPACE "[new_tablespace]";
ALTER INDEX "[index_name]" SET TABLESPACE "[new_tablespace]";