🐘 PostgreSQL - Basics
This note is about PostgreSQL, briefly going through all the important features. Related to PostgreSQL tuning notes and PostgreSQL extensions notes. PostgreSQL is object-relational database management system.
SQL formatting in the official PostgreSQL documentation is a mess. Identifiers with or without underscores, functions in uppercase or lowercase. Decide which style you'd like to use with other team members and stick to it. This can avoid subtle bugs as PostgreSQL will automatically lowercase anything not double quoted.
This is the formatting I prefer: Keywords and constants are in UPPER_SNAKECASE. SELECT FROM AS CASE WHERE IN NOT LEFT JOIN AND OR ENUM NOT DEFAULT TRUE FALSE NULL integer ARRAY => ARRAY is a keyword, integer is not. Everything else is in lower_snakecase. database_name schema_name table_name column_name data_type my_enum_type integer integer If there are parentheses `()` after a word, it's a function. function_name() array_upper() count() max() clock_timestamp() extract(month FROM timestamp '2013-09-22') AS month
SELECT t.* FROM my_schema.my_table t; -- Specifying columns is __always__ better than *. SELECT t.name, t.some_cost FROM my_schema.my_table t; -- Prefer filtering results in the query. SELECT t.name, t.some_cost FROM my_schema.my_table t WHERE some_cost > 0;
Use consistent naming. Use singular or plural names whole way through each project. Don't mix table naming styles e.g.
order_row, it should be
order_rows. I personally prefer singular names as they are easier to work with but switch to the plural if it's the framework standard e.g. in Ruby on Rails.
1. Many words have multiple plural forms e.g. person vs persons, people. 2. Many words sound silly in plural e.g. user_status vs user_statuses. 3. Many words are irregular in plural e.g. repository vs repositories. 4. Many words don't have a plural e.g. samurai, but there are ninjas. 5. Plural forms will mix up alphabetical sorting of tables e.g. order, order_detail vs order_details, orders.
Use descriptive names. Avoid using abbreviations if they are not commonly known. You can use aliases to reduce written SQL.
-- bad SELECT rep.rep_id FROM rep; -- good SELECT report.report_id FROM report; -- even better SELECT r.daily_report_id FROM daily_usage_report r
Use whitespace to make SQL readable. Don't worry about consistent indention, align related statements for easier reading, especially important as many developers you don't use proper syntax highlight with SQL. Joins should be clearly intended from the surrounding markup. Same approach as in Haskell and other functional languages as SQL is more functional than object oriented.
SELECT s.name AS supplier_name, p.name AS product_name, p.some_cost, p.other_cost FROM my_schema.product p LEFT JOIN my_schema.supplier s USING (supplier_id) WHERE p.some_cost > 0 AND p.some_cost < (p.other_cost / 50);
Use comments in SQL. Comments are started with
-- and continue to the end of line.
-- This is a comment SELECT * FROM table; -- This is also comment SELECT name, age -- This is a comment about the age column FROM user;
Maximum name length is 30. Otherwise some environments might give you troubles.
SELECT * FROM really_long_table_name_but_ok
Names must start with a letter and not end in an underscore. Some adapters and drivers have problems with these.
Never give a table the same name as one of the columns. And vice versa.
-- bad SELECT k.key FROM key k
Avoid using just
id as the primary key column. Use more descriptive name instead, like
user_id. Allows using
USING statement in joins. A table can have multiple columns that end in
_id e.g. foreign keys.
Always use aliasing when joining tables. The SQL becomes easier to read and extend later. Prefer using without the optional
SELECT u.email, pr.created_at FROM user u LEFT JOIN password_reset pr USING (user_id);
Functions must contain a verb.
-- bad tax_price() -- good get_tax_price()
Always include newline ...
- ... before
- ... after each keyword definition.
- ... after a comma when separating multiple columns into logical groups.
- ... after semicolons to separate queries for easier reading.
- ... to separate code into related sections.
Use UUIDs as your primary keys. Seriously, this is not 1990, use a UUID as the primary key. The size increase is marginal compared to the benefits of horizontal scaling.
serialare useful but not as primary keys
- If you just want collision avoidance,
- If you want lexicographical sorting...
ulidboth work great
UUIDv7if you want to follow the standards
Here are some comparison when using separate primary keys when inserting 10 000 000 rows to a PostgreSQL database table. UUID is generated with plain uuid_generate_v4() of the uuid-ossp module. Single Operations: column type time (s) size w index (MB) BIGSERIAL 4262 636.7 UUID 4367 890.0 BIGINT 4624 636.7 Bulk: column type time (s) size w index (MB) BIGSERIAL 898 636.7 UUID 991 890.0 BIGINT 1147 636.7
CASE clause is like an if-statement in programming languages.
SELECT p.name, CASE WHEN (p.cost > 100) THEN 'expensive' ELSE 'cheap' END AS cost_class FROM my_schema.product p; -- name, cost, cost_class -- Android 99 cheap -- iPhone 299 expensive
DISTINCT is used to remove returned duplicates. You should limit
DISTINCT usage only to situations that require it, it should not be used to fix incorrect joins.
SELECT DISTINCT m.surname FROM my_schema.member m ORDER BY m.surname;
-- bad SELECT * FROM my_schema.company c WHERE c.company_id = 1 OR c.company_id = 5; -- good SELECT * FROM my_schema.company c WHERE c.company_id IN (1, 5);
Extensions usually create functions, data types, operators and index support methods. Can be removed with
Use extensions. They are really worth it. More about useful extension in PostgreSQL extension notes.
A database contains one or more named schemas, which in turn contain tables, data types, functions and operations.
If no schema name is given, system determines which table is meant by following search path, which is a list of schemas.
-- First is a schema named same as the user. SHOW search_path; -- search_path ----------------- -- "$user",public -- Changing the search path. SET search_path TO my_schema, public;
pg_catalog schema that is in every database. It is always searched, even if it is not in the search path. Catalogs are the place where PostgreSQL stores metadata, e.g. information about tables and columns. Both they are just regular tables.
When you create tables and such without specifying an schema name, they are added to schema named 'public'. Public schema is not special, just a default that is always created. Overall, it's a good practice to use schemas.
-- The same thing: CREATE TABLE product ( ... ); CREATE TABLE public.product ( ... );
Why to use schemas:
- Allowing users to use one database without interfering with each other.
- Organize database objects into logical groups to make them more manageable.
- Third-party applications can be placed on own schemas so they do not collide.
CREATE SCHEMA schema_name; CREATE SCHEMA schema_name AUTHORIZATION username; -- who owns created objects? DROP SCHEMA schema_name CASCADE;
By default, users cannot access any objects in schemas they do not own. To allow that, the owner of the schema needs to grant the USAGE privilege on the schema.
GRANT CREATE ON SCHEMA schema_name TO public; -- allow creating stuff GRANT USAGE ON SCHEMA schema_name TO public; -- allow viewing stuff
By default everyone has
USAGE privileges on
-- How to remove creating rights from public. -- `FROM public` means "every user". REVOKE CREATE ON SCHEMA public FROM public;
Database locale defines how rows are sorted, language of messages and formats of returned numbers, currency, dates and times.
Sorting using C collation. A, B, C, a, b, c Sorting using utf8.en_US collation: A, a, B, b, C, c
PostgreSQL offers standard ISO C and POSIX locales, which are the same thing. Other available locales depend on your operating system.
If you don't have a special reason, use
en_US.UTF-8. It supports multibyte characters. You can change locale per query or per table basis if it is required.
You can also specify collation by column, but you rarely need to.
CREATE TABLE test1 ( a text COLLATE "de_DE", b text COLLATE "es_ES" );
INNER JOIN is the most common way of joining tables. Inner join returns only the rows that exist in the both tables.
SELECT b.start_time FROM my_schema.booking b INNER JOIN my_schema.member m ON (m.member_id = b.member_id) WHERE m.first_name = 'John' AND m.surname = 'Doe';
JOIN is an alias for
SELECT b.start_time, f.name AS facility_name FROM my_schema.booking b JOIN my_schema.facility f ON (f.facility_id = b.facility_id) WHERE b.start_time > '2012-09-21' AND b.start_time < '2012-09-22' AND f.name LIKE '%Tennis Court%';
You may join tables to themselves. This usually is too confusing concept and should be avoided.
SELECT DISTINCT recommender.first_name, recommender.surname FROM my_schema.member m INNER JOIN my_schema.member recommender ON (recommender.member_id = m.recommended_by) ORDER BY surname, first_name; SELECT first_name, surname FROM my_schema.member WHERE member_id IN (SELECT recommended_by FROM my_schema.member) ORDER BY surname, first_name;
LEFT JOINs are used to get optional data from the joined table.
LEFT JOIN is the same as
LEFT OUTER JOIN.
-- All members which have used Tennis Court SELECT DISTINCT m.first_name, m.surname, f.name FROM my_schema.booking b LEFT JOIN my_schema.member m ON (m.member_id = b.member_id) LEFT JOIN my_schema.facility f ON (f.facility_id = b.facility_id) WHERE f.name LIKE '%Tennis Court%' ORDER BY first_name, surname;
UNION ALL is used to combine multiple queries.
SELECT DISTINCT surname FROM my_schema.member UNION ALL SELECT DISTINCT name FROM my_schema.facility;
Complex SQL statements will use subqueries, a query inside a query.
Subqueries can be in a
SELECT-clause. This is usually a bad practice because the subquery is ran on every result.
SELECT DISTINCT mem.first_name || ' ' || mem.surname as member, ( SELECT rec.first_name || ' ' || rec.surname as recommender FROM my_schema.member rec WHERE rec.member_id = mem.recommended_by ) FROM my_schema.member mem ORDER BY member;
Subqueries can be in a
FROM clause, which are called inline views.
SELECT member, facility, totalcost FROM ( SELECT mem.first_name || ' ' || mem.surname AS member, fac.name AS facility, CASE WHEN mem.member_id = 0 THEN bk.slots * fac.guest_cost ELSE bk.slots * fac.member_cost END AS totalcost FROM my_schema.member mem INNER JOIN my_schema.booking bk ON memmem.member_id = bk.member_id INNER JOIN my_schema.facility fac ON bk.facility_id = fac.facility_id WHERE bk.start_time BETWEEN '2012-09-14' AND '2012-09-15' ) AS booking_with_cost WHERE totalcost > 30 ORDER BY totalcost DESC;
Subqueries can be in a
IN clause. These subquery must return 1 column, where it acts much like an array.
SELECT f1.* FROM my_schema.facility f1 WHERE f1.facility_id IN (SELECT f2.facility_id FROM my_schema.facility f2);
Aggregate functions take a column of data and output a single value. For the full list of aggregate functions, check PostgreSQL aggregate function documentation.
-- This will not work because count() wants to collapse the results -- while facility_id wants to keep them separate. SELECT facility_id, count(*) AS my_count FROM my_schema.facility -- This will work, and PostgreSQL will cache the result as -- it knows it is a repeated aggregate function. SELECT facility_id, (SELECT count(*) FROM my_schema.facility) AS my_count FROM my_schema.facility -- This also does not work as the aggregate function result is only usable -- after the `WHERE` has been applied. SELECT facility_id, (SELECT count(*) FROM my_schema.facility) AS my_count FROM my_schema.facility WHERE my_count > 0 -- You can use `HAVING` which filters the result set after selection. SELECT facility_id, (SELECT count(*) FROM my_schema.facility) AS my_count FROM my_schema.facility GROUP BY facility_id HAVING (SELECT count(*) FROM my_schema.facility) > 2 ORDER BY facility_id -- Or create a subquery: SELECT * FROM ( SELECT facility_id, (SELECT count(*) FROM my_schema.facility) AS my_count FROM my_schema.facility GROUP BY facility_id ORDER BY facility_id ) AS counted_totals WHERE my_count > 0
Using aggregate function,
IN-clause and a subquery to filter rows.
SELECT first_name, surname, join_date FROM my_schema.member WHERE join_date = ( SELECT max(join_date) FROM my_schema.member );
count() function returns the number of results.
-- Number of members. SELECT count(*) FROM my_schema.member -- Number of members with address. SELECT count(address) FROM my_schema.member -- Number of members with different address. SELECT count(DISTINCT address) FROM my_schema.member -- Number of members with 'nom' in their name. SELECT count(*) FROM my_schema.member WHERE name LIKE '%nom%'
sum() function returns the sum of the results.
SELECT facility_id, sum(slots) as slots_total FROM my_schema.booking GROUP BY facility_id ORDER BY facility_id
min() function returns the smallest result. Works with dates too.
-- Return first time each product has been sold after the start of September. SELECT p.product_id, p.name min(s.sell_time) AS first_sell_time FROM my_schema.sale_item s INNER JOIN my_schema.product p on (p.product_id = s.product_id) WHERE s.sell_time >= '2012-09-01' GROUP BY p.product_id, p.name ORDER BY p.product_id;
Common Table Expression
Create CTEs using
WITH. Common table expressions (CTE) are used to create temporary views to data. They can greatly simplify queries with a lot of or complex subqueries.
WITH sum AS ( SELECT facility_id, sum(slots) AS slots_total FROM my_schema.booking GROUP BY facility_id ) SELECT facility_id, slots_total FROM sum WHERE slots_total = (SELECT max(slots_total) FROM sum);
You can make recursive CTEs with
WITH RECURSIVE increment(number) AS ( -- Initial statement. SELECT 1 UNION ALL -- Recursive statement SELECT increment.number + 1 FROM increment WHERE increment.number < 5 ) SELECT * FROM increment; -- => Creates 5 rows from 1 to 5.
Every connection to the database server is made in the name of some particular role, and this role determines the initial access privileges for commands issued on that connection. Roles may have a password.
The right to modify or destroy an object is always the privilege of the owner only and owner is always the user that created the object.
If user has
SUPERUSER attribute, user can always access all object.
SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, EXECUTE, USAGE -- ALL means all privileges GRANT UPDATE ON accounts TO joe; -- PUBLIC means all users. REVOKE ALL ON accounts FROM PUBLIC;
Roles and groups are helpful to maintain privileges to multiple users. Both are managed with keyword
ROLE and difference comes how you use the created role.
CREATE ROLE joe WITH LOGIN INHERIT; -- Allow logging in. CREATE ROLE admin NOINHERIT; -- Create admin group. CREATE ROLE god NOINHERIT; -- Create god group. GRANT admin TO joe; -- Tell that joe is an admin. GRANT god TO admin; -- Tell that admins are gods. -- Note that joe has privileges of 'joe' and 'admin' but not 'god' -- because of NOINHERIT. REVOKE admin FROM joe; -- Now joe has only privileges of 'joe'. -- Original privilege state SET ROLE joe; -- Can be direct or indirect non-inherited SET ROLE NONE; RESET ROLE; -- Removing a group. DROP ROLE admin; -- Change is login allowed. ALTER ROLE john LOGIN; -- Add login. ALTER ROLE john NOLOGIN; -- Remove login.
The role attributes
CREATEROLE can be thought of as special privileges, but they are never inherited as ordinary privileges on database objects are. You must actually
Functions and triggers allow users to insert code into the backend server that other users might execute unintentionally. Hence, both mechanisms permit users to "Trojan horse" others with relative ease. The only real protection is tight control over who can define functions.
Tablespaces allow database administrators to define locations in the file system where the files representing database objects are stored.
CREATE TABLESPACE fastspace LOCATION '/mnt/sda1/postgresql/data';
There is usually not much point in making more than one tablespace per logical file system, since you cannot control the location of individual files within a logical file system.
CREATE TABLE foo(i int) TABLESPACE space1; -- Or SET default_tablespace = space1; CREATE TABLE foo(i int);
Two tablespaces are automatically created by initdb. The
pg_global tablespace is used for shared system catalogs. The
pg_default tablespace is the default tablespace of the template1 and template0 databases and therefore will be the default tablespace for other databases as well, unless overridden by a
TABLESPACE clause in
Why to use tablespaces:
- If the partition or volume on which the cluster was initialized runs out of space and cannot be extended, a tablespace can be created on a different partition and used until the system can be reconfigured.
- Tablespaces allow an administrator to use knowledge of the usage pattern of database objects to optimize performance. For example, an index which is very heavily used can be placed on a very fast, highly available disk, such as an expensive solid state device. At the same time a table storing archived data which is rarely used or not performance critical could be stored on a less expensive, slower disk system.
Moving a table from a tablespace to another. This should be automated.
ALTER DATABASE "[database_name]" SET default_tablespace = [new_tablespace]; ALTER TABLE "[table_name]" SET TABLESPACE "[new_tablespace]"; ALTER INDEX "[index_name]" SET TABLESPACE "[new_tablespace]";