Acyclic and directed graph in Postgres with the LTREE extension
Acyclic directed graphs are a very useful concept that has many real world applications: hierarchical groups, organisation charts and product categories are just a few of all the situations of where they can be essential.
This article contains a complete demonstration of an acyclic and directed graph implemented in Postgres mainly by using the LTREE extension and table triggers. It also contains some useful queries related to the graph such as getting all parents of a node and more.
An organisation chart
We are going to implement an acyclic, directed graph based around an organisation storing a table of employees, where some of the employees are managers with responsibility for a number of other employees. Managers can themselves have a manager. In essence, these properties are required:
- An employee can have one manager to report to, or none at all
- A manager can have many “subordinates” i.e. people reporting to him/her
- Cycles are forbidden, i.e. two managers cannot report to each other, even indirectly
We then need to be able to query this data structure efficiently. These queries are important:
- Is employee A managed by manager B? (directly or indirectly)
- Which is the direct manager of employee C and which are his/hers indirect managers?
- Which employees are the responsibility of manager D?
With this in mind we can create the following basic table of employees in Postgres:
|
|
Postgres might complain when this statement is executing saying it can’t find the LTREE type. Luckily this is easy to fix.
Introduction to ltree
The Postgres ltree
extension creates a type named ltree
that was made for “representing labels of data stored in a hierarchical tree-like structure” and we are going to make use of it. This is not a default extension in Postgres, so we need to enable it for use in the database we are using with the following command:
|
|
An ltree
represents a path from the root of a hierchichal tree to a particular node, and it might look like this: 1.2.3.4
indicating simply that 4 belongs to 3 which belongs to 2 which belongs to 1, which is the top node in this small tree. The extension contains many useful operators that can be used to query these labels as we are about to find out.
Foreign keys
While ltree
affords us a lot of functionality related to hierchichal data it does not enforce any kind of consistency of the data entered, which is important. We are going to create a foreign key constraint on the column manager_id
that refers back to the employee
table, and this column will be the primary way that we manipulate the relationship between employees and managers. We will not touch the ltree manager_path
column directly, but will have it automatically generated for us depending on the relationships created with manager_id
.
|
|
The above constraint basically says that manager_id
must contain an existing employee
row id
, and that if the manager is deleted all “subordinates” will have manager_id
set to null. This constraint could also have been created directly in the CREATE TABLE statement:
|
|
Indexes
Indexes are important for efficiency when querying large tables. We are going to index both manager_id
and manager_path
. Since manager_path
is of type ltree
we need to use a GiST index. This is required for the special ltree
related operators to be able to use the index at all.
|
|
Triggers
Unfortunately the foreign key constraint won’t get us very far by itself. We need to use some triggers in order to make sure manager_path
is set correctly.
|
|
The function employee_update_manager_path
will make sure that manager_path
always is set to the correct value when a new employee is inserted or updated. manager_path
will be set even if there is no manager: then it will just contain the employee id
. In fact, the manager_path
of every employee will always end with the id
of the employee.
The function employee_update_manager_path_of_managed
will on the other hand only be executed when the manager_path
changes. This handles the scenario when a team of employees is reassigned to a different manager. The function makes sure that even “subordinates” have the correct manager_path
set. Consider this test data of a simple organisation:
|
|
The state of the table now looks like this:
id | name | manager_id | manager_path
----+-------+------------+--------------
1 | Alice | | 1
2 | Bob | 1 | 1.2
3 | Carol | 1 | 1.3
4 | David | 2 | 1.2.4
5 | Eve | 4 | 1.2.4.5
From the output above we can see that Alice manages Bob (and Carol), which manages David, which manages Eve. So what happens when Bob is set to report to Carol, instead of Alice?
|
|
Now the table looks like below. As we can see, Carols id
3 was added to all direct and indirect reports/subordinates at the right place in manager_path
. This would not have happened unless employee_update_manager_path_of_managed
had been called, and we would have been left with invalid manager paths.
id | name | manager_id | manager_path
----+-------+------------+--------------
1 | Alice | | 1
3 | Carol | 1 | 1.3
2 | Bob | 3 | 1.3.2
4 | David | 2 | 1.3.2.4
5 | Eve | 4 | 1.3.2.4.5
Cycle detection
Nonsensical cycles are really easy to create. In fact, if we attempted to make Eve the manager of Alice with UPDATE employee SET manager_id = 5 WHERE id = 1;
above we would send Postgres into an infinite trigger-loop. We really need to avoid that. In comes the third and last trigger to detect cycles:
|
|
Important! The trigger tgr_employee_update_manager_path_of_managed
must be created after tgr_employee_detect_cycle
due to their similar conditions for execution. We must ensure that cycles are checked before updating children! A case could be made that these two triggers and handlers could be combined into one due to this, however I have left it as above for clarity. The handler employee_detect_cycle
uses a technique for recursive queries called “Common Table Expressions” that proves very useful here. The actual handler was adapted for this purpose from this article.
Note that Postgres 14 (the next major version at the time of writing) will contain SEARCH
and CYCLE
statements that will make cycle detection easier.
If we now attempted to make Eve the manager of Alice then this would happen:
|
|
Useful queries
Now that we are set up here are some useful queries for these relationships. Reinserting some data:
|
|
Querying for everyone under Bob using an ltree specific operator:
SELECT * FROM employee WHERE manager_path ~ '*.2.*{1,}';
-- id | name | manager_id | manager_path
------+-------+------------+--------------
-- 4 | David | 2 | 1.2.4
-- 5 | Eve | 4 | 1.2.4.5
-- 6 | Frank | 2 | 1.2.6
Querying for everyone above Eve by extracting ids from manager_path
:
SELECT * FROM employee WHERE id IN (
SELECT unnest(string_to_array(manager_path::text, '.')::integer[])
FROM employee WHERE id = 5
) AND id <> 5;
-- id | name | manager_id | manager_path
------+-------+------------+--------------
-- 1 | Alice | | 1
-- 2 | Bob | 1 | 1.2
-- 4 | David | 2 | 1.2.4
I hope you found this article useful. Let me know via email if you have any suggestions or feedback. Some things that I believe could be improved:
- Time complexity in the interplay between
employee_detect_cycle
andemployee_update_manager_path_of_managed
for large trees - Not requiring the employee
id
to always be the last label inmanager_path