Skip to content

SQL conventions

Preface

Database guidelines at eurodata

Be sure to read the database guidelines in the eurodata engineering handbook. Whenever a Mizool Playbook rule reiterates, clarifies, extends, tweaks or changes eurodata guidelines, it references the respective rule like this: ed-DB-321.

For non-eurodata developers

Unfortunately, the eurodata engineering handbook is not available to third parties. Therefore, when starting to work on a PR for an open source project that uses this Mizool Playbook, please ask the eurodata developers if there are any rules that you should know about.

Applicability to legacy code

Legacy code often violates many of the rules given below. Unless working on a modernization effort with an explicit strategy for dealing with (or intentionally breaking) backwards compatibility, the rules only apply to newly created elements.

In situations where old and new code meet (e.g. adding a new column to an old table), applying current conventions could lead to inconsistencies with neighboring elements. If you encounter such a case, discuss it with other developers, then choose an approach to use and make a corresponding note on the surrounding element.

Naming conventions

General naming

  • Use lower case for all identifiers in databases, and use snake_case for multi-word identifiers (see Capitalization and format, ed-DB-115 and ed-DB-116).
  • Avoid abbreviations and specifically making up your own abbreviations. For details, see General rules and ed-DB-118.
  • In general, names use English. However, German terms should be used when they refer to any of the following:

    • business domain terms with company-specific meaning
    • terms that are hard to translate as they are specific to German laws and institutions
    • objects in other databases that the application interacts with
    • arguments/fields in internal protocols (e.g. REST calls, asynchronous messages) or official standards

    Examples

    • The company-specific terms "Beraternummer" and "Mandantennummer" are not translated.
    • "bank_statement" (or "bankstatement") is okay as the domain concept ("Kontoauszug") is not at all company specific and translating it does not cause loss of meaning.
      • However, it is not acceptable if it refers to a specific table in another database that uses the German term.

    Info

    This rule is similar to, but less strict than ed-DB-112.

Tables and views

  • Table names use singular (book, not books). See ed-DB-117.
  • Do not add a general prefix or suffix to the name of views to distinguish them from tables (a practice suggested by ed-DB-119).
    • A side effect of this rule is that you cannot have a table foo_status accompanied by a v_foo_status view, which we think would be confusing. If a view has some specific purpose, include it in the name instead, e.g. foo_status_printable. This way, it (mostly) becomes an implementation detail whether a name refers to a view or a table.

Columns

  • Primary key column names do not repeat the name of the table (see ed-DB-119).

    Example

    A book table could have a primary key column called id or key, but not book_id or book_key.

  • Do not add prefixes and suffixes to the name of a column based on its type or usage.

    • Specifically, do not add fk_ to the name of a column that is (part of) a foreign key constraint (a practice suggested by ed-DB-119).
  • Names of columns that hold ID/key values generated by another system should always unambiguously identify that system.
    • You may feel that it's obvious to everybody that external_id or source_id refers to the company's "SplineReticulator Directory" product. But even if that were true at the time of writing, that does not mean that the same is true for future you or somebody else in charge of deciphering the code in 5 years.
    • Also, following this rule makes it easier to connect additional systems (each with their own column), or migrate from one to the other.
    • As an alternative, a pair of columns can be used if the source system varies between rows. For example, source_id holds the ID itself and source_system specifies the system that ID is from.

Keys, indexes and constraints

The following rules are conscious departures from ed-DB-119.

Names of keys, indexes and constraints follow the structure given below, with segments separated by double underscores (__).

Element Name segments Examples
Primary key table name
pk
my_cool_table__pk
Foreign key constraint owning table
fk
each included column of the owning table
book__fk__author_id
complaint__fk__order_id__order_line_pos
Unique index
(full or partial)
table name
u
title / each included column
document__u__id__account_id
subscription__u__one_active_per_account_id
Other index
(full or partial)
table name
i
title / each included column
group__i__parent_id
orders__i__unbilled_order_numbers
Check constraint table name
c
title
account__c__name_length
contact__c__email_or_phone_set

Name length

PostgreSQL's maximum name length is 63 characters, which should be sufficient for most elements, even foreign key constraints with multiple columns.

In case a name constructed according to the rules above ends up being too long, abbreviate or truncate some segments. However, double check that it is still obvious which column each name segment refers to, and that there is no ambiguity.

Sequences and triggers

  • Sequence names consist of the corresponding table name and the suffix __seq (same as ed-DB-119 but with __ instead of _).

    Identity columns

    Before creating a sequence, double check whether an identity column could be used instead. See the respective rule in the Table and column declarations section below.

  • Trigger names consist of the table name, two underscores and a short but meaningful title. We intentionally do not add prefixes (like trg_ from ed-DB-119) or suffixes (e.g. bu for "before update" from ed-DB-119).

    Examples

    • book__insert, book__update and book__delete
    • my_cool_table__track_history
    • payment__trigger_processing

Table and column declarations

  • Define primary keys, foreign key constraints, check constraints and indexes (unique or otherwise) explicitly, not as part of a column definition.
  • Always specify the name of keys, constraints and indexes. Letting the RDBMS generate names would prevent developers from documenting their intention and often makes code hard to comprehend.
  • Foreign key constraints
    • Each relation between tables must be covered by a foreign key constraint (see ed-DB-107).
    • Only specify the clauses that deviate from the default, e.g. include on delete cascade but don't add on update no action.
    • If the referenced column is the primary key, specify only the table name without a column name:
      constraint book__fk__author_id
          foreign key (author_id)
              references author   -- leave out `(id)` here
              on delete cascade
      
  • Primary keys should always be a surrogate key, also known as a synthetic key (see ed-DB-108).
    • Using natural/business keys (like email addresses, or an employee ID generated by another system) may feel more intuitive at first, but it often causes problems later on.
    • Instead, include the natural/business key as a data column, possibly with an index. However, if the data originates from another system, be careful with unique indexes as you may not be able to trust that system to enforce uniqueness at all times.
    • This rule applies even if (or especially if) you cannot think of a scenario where the "obvious" natural/business key value of a row would change.
  • In PostgreSQL, columns holding database-generated integer IDs should be declared bigint not null generated by default as identity. Compared to the explicit creation of a sequence object, such identity columns are less prone to errors, and better document the author's intention.

    Note

    We considered using the generated always clause instead of generated by default. However, we decided to use the latter as the alternative would make it very cumbersome to keep existing ID values during special operations like database moves and migrations.

    Still, during regular operation, our applications and scripts should never set their own values for identity columns.

  • As columns are nullable by default, omit the null keyword in column definitions.

Datatypes

Textual data

  • In PostgreSQL, whenever we use the varchar datatype, we do not specify its maximum length. Such columns can store several hundreds of thousands of characters. Not having to define the length in advance makes our SQL code more maintainable.

    Implementing length restrictions

    If you do need to enforce a length restriction on the database level, use a check constraint instead. This way, changing the length restriction later on will not cause the RDBMS to physically rewrite the entire table.

    Non-standard behavior

    In ANSI SQL, the maximum length of a varchar must be specified. Still, the PostgreSQL behavior is so useful that we accept this tradeoff in portability.

  • Avoid using PostgreSQL's text datatype. While it offers the same advantage as an unbounded varchar (see above), it would prevent us from ever using that column as part of an index.

Numeric data

  • In PostgreSQL, our default choice for storing numbers of any kind is to use decimal without specifying a precision or scale. Such columns can store several thousands of digits both before and after the decimal point.1 Not having to define precision or scale in advance makes our SQL code more maintainable.

    Non-standard behavior

    ANSI SQL specifies that leaving out the precision or scale is equivalent to (0), i.e. integer values. Still, the PostgreSQL behavior is so useful that we accept this tradeoff in portability.

  • Only when there is a good reason to have the database coerce or round input values to a certain precision or scale, use decimal(p,s) with appropriate values. One such reason, obviously, is that the respective project does not use PostgreSQL.

  • ANSI SQL specifies that numeric and decimal are synonyms. To avoid any potential confusion, our code exclusively uses the decimal name for this datatype.

Boolean values

  • In PostgreSQL and other RDBMS that support it or an equivalent datatype, we use boolean for all boolean-valued columns.

    • Always specify boolean columns as not null unless null is forced upon us by the source of the data (e.g. an upstream system). Allowing null makes it harder to use the column later on, as it ends up having three possible states instead of two, leaving it up to debate whether null means false or true.
    • If you do need to allow null, be sure to add a comment explaining the meaning of null values.

      Examples

      • "null means there was no value present when importing."
      • "If the user has not overridden the default value inherited from their group config, this column is null."
  • If boolean is not supported, it can be emulated with the smallint datatype (or equivalent). To maintain integrity and document your intention, add a check constraint that limits the column values to 0 and 1.

Syntax

  • Do not quote identifiers unless it is required, for example because a column name is a reserved word (e.g. timestamp).
  • Write SQL keywords, names of built-in objects and other predefined words in lower case.
  • Type casts
    • Prefer ANSI SQL syntax for casts over RDBMs specific syntax:
      cast(my_value as text)  -- good!
      
      my_value::text          -- bad, avoid PostgreSQL syntax where possible.
      
    • Avoid scattering type casts all over the place, it makes the code harder to read (and may even lead to back-and-forth casting which is confusing). Instead, try to centralize them using either of the following strategies:
      • near the original columns (e.g. the lowest levels of a multi-level Join) when working with raw data not usable otherwise (e.g. when dates or numbers are imported and stored as strings)
      • at the latest possible step (e.g. when required by a join)
  • Avoid RDBMS specific syntax whenever possible and use ANSI SQL instead.

    Examples

    • Oracle: use coalesce() instead of nvl().
    • PostgreSQL: to return a single row table, use SELECT VALUES('A Test String') instead of SELECT 'A Test String'.

  1. See the paragraph on "unconstrained numeric" columns in the numeric datatypes chapter of the PostgreSQL docs.