May 2021

1403 words

Acyclic and directed graph in Postgres with the LTREE extension

Acyclic directed graphs are a very useful concept that has many real world applications: hierarchical groups, organisation charts and product categories are just a few of all the situations of where they can be essential.

This article contains a complete demonstration of an acyclic and directed graph implemented in Postgres mainly by using the LTREE extension and table triggers. It also contains some useful queries related to the graph such as getting all parents of a node and more.

An organisation chart

We are going to implement an acyclic, directed graph based around an organisation storing a table of employees, where some of the employees are managers with responsibility for a number of other employees. Managers can themselves have a manager. In essence, these properties are required:

An employee can have one manager to report to, or none at all
A manager can have many “subordinates” i.e. people reporting to him/her
Cycles are forbidden, i.e. two managers cannot report to each other, even indirectly

We then need to be able to query this data structure efficiently. These queries are important:

Is employee A managed by manager B? (directly or indirectly)
Which is the direct manager of employee C and which are his/hers indirect managers?
Which employees are the responsibility of manager D?

With this in mind we can create the following basic table of employees in Postgres:

1
2
3
4
5
6


CREATE TABLE employee (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL DEFAULT '',
  manager_id INTEGER,
  manager_path LTREE NOT NULL
);

Postgres might complain when this statement is executing saying it can’t find the LTREE type. Luckily this is easy to fix.

Introduction to `ltree`

The Postgres ltree extension creates a type named ltree that was made for “representing labels of data stored in a hierarchical tree-like structure” and we are going to make use of it. This is not a default extension in Postgres, so we need to enable it for use in the database we are using with the following command:

1

CREATE EXTENSION ltree;

An ltree represents a path from the root of a hierchichal tree to a particular node, and it might look like this: 1.2.3.4 indicating simply that 4 belongs to 3 which belongs to 2 which belongs to 1, which is the top node in this small tree. The extension contains many useful operators that can be used to query these labels as we are about to find out.

Foreign keys

While ltree affords us a lot of functionality related to hierchichal data it does not enforce any kind of consistency of the data entered, which is important. We are going to create a foreign key constraint on the column manager_id that refers back to the employee table, and this column will be the primary way that we manipulate the relationship between employees and managers. We will not touch the ltree manager_path column directly, but will have it automatically generated for us depending on the relationships created with manager_id.

1
2
3
4


ALTER TABLE employee
ADD CONSTRAINT employee_manager_id_fkey
FOREIGN KEY (manager_id) REFERENCES employee (id)
ON UPDATE CASCADE ON DELETE SET NULL;

The above constraint basically says that manager_id must contain an existing employee row id, and that if the manager is deleted all “subordinates” will have manager_id set to null. This constraint could also have been created directly in the CREATE TABLE statement:

1
2
3
4
5
6


CREATE TABLE employee (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL DEFAULT '',
  manager_id INTEGER REFERENCES employee ON DELETE CASCADE ON UPDATE SET NULL,
  manager_path LTREE NOT NULL
);

Indexes

Indexes are important for efficiency when querying large tables. We are going to index both manager_id and manager_path. Since manager_path is of type ltree we need to use a GiST index. This is required for the special ltree related operators to be able to use the index at all.

1
2


CREATE INDEX employee_manager_id_idx ON employee (manager_id);
CREATE INDEX employee_manager_path_idx ON employee USING GIST (manager_path);

Triggers

Unfortunately the foreign key constraint won’t get us very far by itself. We need to use some triggers in order to make sure manager_path is set correctly.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


CREATE OR REPLACE FUNCTION employee_update_manager_path() RETURNS TRIGGER AS $$
DECLARE
  new_manager_path ltree;
BEGIN
  IF NEW.manager_id IS NULL THEN
    NEW.manager_path = NEW.id::text::ltree;
  ELSIF TG_OP = 'INSERT' OR OLD.manager_id IS NULL OR OLD.manager_id != NEW.manager_id THEN
    SELECT manager_path FROM employee WHERE id = NEW.manager_id INTO new_manager_path;
    IF new_manager_path IS NULL THEN
      RAISE EXCEPTION 'Invalid manager_id %', NEW.manager_id;
    END IF;
    NEW.manager_path = new_manager_path || NEW.id::text::ltree;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION employee_update_manager_path_of_managed() RETURNS TRIGGER AS $$
BEGIN
  UPDATE employee SET manager_path = NEW.manager_path || id::text::ltree WHERE manager_id = NEW.id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tgr_employee_update_manager_path
BEFORE INSERT OR UPDATE ON employee
FOR EACH ROW EXECUTE PROCEDURE employee_update_manager_path();

CREATE TRIGGER tgr_employee_update_manager_path_of_managed
AFTER UPDATE ON employee
FOR EACH ROW WHEN (NEW.manager_path IS DISTINCT FROM OLD.manager_path)
EXECUTE PROCEDURE employee_update_manager_path_of_managed();

The function employee_update_manager_path will make sure that manager_path always is set to the correct value when a new employee is inserted or updated. manager_path will be set even if there is no manager: then it will just contain the employee id. In fact, the manager_path of every employee will always end with the id of the employee.

The function employee_update_manager_path_of_managed will on the other hand only be executed when the manager_path changes. This handles the scenario when a team of employees is reassigned to a different manager. The function makes sure that even “subordinates” have the correct manager_path set. Consider this test data of a simple organisation:

1
2
3
4
5


INSERT INTO employee (id, name, manager_id) VALUES (1, 'Alice', NULL);
INSERT INTO employee (id, name, manager_id) VALUES (2, 'Bob', 1);
INSERT INTO employee (id, name, manager_id) VALUES (3, 'Carol', 1);
INSERT INTO employee (id, name, manager_id) VALUES (4, 'David', 2);
INSERT INTO employee (id, name, manager_id) VALUES (5, 'Eve', 4);

The state of the table now looks like this:

 id | name  | manager_id | manager_path
----+-------+------------+--------------
  1 | Alice |            | 1
  2 | Bob   |          1 | 1.2
  3 | Carol |          1 | 1.3
  4 | David |          2 | 1.2.4
  5 | Eve   |          4 | 1.2.4.5

From the output above we can see that Alice manages Bob (and Carol), which manages David, which manages Eve. So what happens when Bob is set to report to Carol, instead of Alice?

1

UPDATE employee SET manager_id = 3 WHERE id = 2;

Now the table looks like below. As we can see, Carols id 3 was added to all direct and indirect reports/subordinates at the right place in manager_path. This would not have happened unless employee_update_manager_path_of_managed had been called, and we would have been left with invalid manager paths.

 id | name  | manager_id | manager_path
----+-------+------------+--------------
  1 | Alice |            | 1
  3 | Carol |          1 | 1.3
  2 | Bob   |          3 | 1.3.2
  4 | David |          2 | 1.3.2.4
  5 | Eve   |          4 | 1.3.2.4.5

Cycle detection

Nonsensical cycles are really easy to create. In fact, if we attempted to make Eve the manager of Alice with UPDATE employee SET manager_id = 5 WHERE id = 1; above we would send Postgres into an infinite trigger-loop. We really need to avoid that. In comes the third and last trigger to detect cycles:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


CREATE OR REPLACE FUNCTION employee_detect_cycle() RETURNS TRIGGER AS $$
BEGIN
  IF NEW.manager_id IS NOT NULL AND EXISTS (
    WITH RECURSIVE list_managers (parent) AS (
     SELECT s.manager_id
     FROM employee AS s
     WHERE s.id = NEW.manager_id
    UNION
     SELECT s.manager_id
     FROM employee AS s
     INNER JOIN list_managers lr ON lr.parent = s.id
    )
    SELECT * FROM list_managers WHERE list_managers.parent = NEW.id LIMIT 1
  ) THEN
    RAISE EXCEPTION 'Invalid cycle detected in manager/managed relationship in employees';
  ELSE
    RETURN NEW;
  END IF;
END;
$$ LANGUAGE plpgsql;

-- important: must be before the trigger tgr_employee_update_manager_path_of_managed
CREATE TRIGGER tgr_employee_detect_cycle
AFTER INSERT OR UPDATE ON employee
FOR EACH ROW EXECUTE PROCEDURE employee_detect_cycle();

-- dropping and recreating to ensure order
DROP TRIGGER IF EXISTS tgr_employee_update_manager_path_of_managed ON employee;
CREATE TRIGGER tgr_employee_update_manager_path_of_managed
AFTER UPDATE ON employee
FOR EACH ROW WHEN (NEW.manager_path IS DISTINCT FROM OLD.manager_path)
EXECUTE PROCEDURE employee_update_manager_path_of_managed();

Important! The trigger tgr_employee_update_manager_path_of_managed must be created after tgr_employee_detect_cycle due to their similar conditions for execution. We must ensure that cycles are checked before updating children! A case could be made that these two triggers and handlers could be combined into one due to this, however I have left it as above for clarity. The handler employee_detect_cycle uses a technique for recursive queries called “Common Table Expressions” that proves very useful here. The actual handler was adapted for this purpose from this article.

Note that Postgres 14 (the next major version at the time of writing) will contain SEARCH and CYCLE statements that will make cycle detection easier.

If we now attempted to make Eve the manager of Alice then this would happen:

1
2
3


UPDATE employee SET manager_id = 5 WHERE id = 1; 
ERROR:  Invalid cycle detected in manager/managed relationship in employees
CONTEXT:  PL/pgSQL function employee_detect_cycle() line 15 at RAISE

Useful queries

Now that we are set up here are some useful queries for these relationships. Reinserting some data:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


TRUNCATE employee;
INSERT INTO employee (id, name, manager_id) VALUES (1, 'Alice', NULL);
INSERT INTO employee (id, name, manager_id) VALUES (2, 'Bob', 1);
INSERT INTO employee (id, name, manager_id) VALUES (3, 'Carol', 1);
INSERT INTO employee (id, name, manager_id) VALUES (4, 'David', 2);
INSERT INTO employee (id, name, manager_id) VALUES (5, 'Eve', 4);
INSERT INTO employee (id, name, manager_id) VALUES (6, 'Frank', 2);

SELECT * FROM employee;

-- id | name  | manager_id | manager_path
------+-------+------------+--------------
--  1 | Alice |            | 1
--  2 | Bob   |          1 | 1.2
--  3 | Carol |          1 | 1.3
--  4 | David |          2 | 1.2.4
--  5 | Eve   |          4 | 1.2.4.5
--  6 | Frank |          2 | 1.2.6

Querying for everyone under Bob using an ltree specific operator:

SELECT * FROM employee WHERE manager_path ~ '*.2.*{1,}';

-- id | name  | manager_id | manager_path
------+-------+------------+--------------
--  4 | David |          2 | 1.2.4
--  5 | Eve   |          4 | 1.2.4.5
--  6 | Frank |          2 | 1.2.6

Querying for everyone above Eve by extracting ids from manager_path:

SELECT * FROM employee WHERE id IN (
  SELECT unnest(string_to_array(manager_path::text, '.')::integer[]) 
  FROM employee WHERE id = 5
) AND id <> 5;

-- id | name  | manager_id | manager_path
------+-------+------------+--------------
--  1 | Alice |            | 1
--  2 | Bob   |          1 | 1.2
--  4 | David |          2 | 1.2.4

I hope you found this article useful. Let me know via email if you have any suggestions or feedback. Some things that I believe could be improved:

Time complexity in the interplay between employee_detect_cycle and employee_update_manager_path_of_managed for large trees
Not requiring the employee id to always be the last label in manager_path