A pitfall in copying tables in Postgres

We encountered a very weird runtime error today, after migrating some data from a legacy database.

Because there were no changes to the models, we simply created the tables and copied the data over from the legacy database directly.
To ensure the migration didn't break anything, we also wrote some migration tests to verify data integrity, and all of them passed.

Everything looked perfect until the app went live. We found that the app crashed occasionally when trying to create a new data record in the system.
Sometimes it worked fine, but sometimes we got an error saying "duplicate key value violates unique constraint 'xxxxx_pkey'".

It was weird, because we were really confident in our unit tests and migration tests; the problem couldn't be in the migration or the application logic.

After some manual tests, we found that we also got the error when creating an entry with a raw SQL INSERT query, so it seemed to be a Postgres-level issue.
And the problem turned out to be caused by the primary key, which is an auto-generated id.

Postgres uses sequences to generate auto-incrementing ids. A sequence remembers the last number it generated and calculates the next one by adding 1.
During the data migration, we copied the data rows from another table into a new table. To keep the relationships between records, we also copied the primary key in each row. As a result, although we had inserted a number of records into the table, the sequence bound to the primary key was never advanced.
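You can see this behaviour with a standalone sequence (demo_seq is just an illustrative name):

Sequence behaviour
CREATE SEQUENCE demo_seq;

SELECT nextval('demo_seq');    -- 1
SELECT nextval('demo_seq');    -- 2: the sequence remembers the last value and adds 1
SELECT setval('demo_seq', 10); -- move the sequence manually; nextval now returns 11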

For example, we have inserted the following 3 entries:

  • {id: 1, name: 'Jack'}
  • {id: 2, name: 'Rose'}
  • {id: 4, name: 'Hook'}

But because the ids were also inserted explicitly, the sequence was never advanced, and its next value is still 1. So when we execute the following SQL:

Insert entry
INSERT INTO users (name)
VALUES ('Robinhood');

The sequence generates 1 as the id, which conflicts with the entry {id: 1, name: 'Jack'}, and the database raises a "duplicate key" exception.
But usually the ids are not contiguous, because records get deleted, leaving "holes" between them. So our app could still insert an entry successfully whenever the newly generated id fell into one of those holes.
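This also explains why the crash was only occasional: a sequence value, once handed out, is never rolled back even if the INSERT fails, so repeated attempts walk forward until they land in a hole. With the 3 entries above:

Retrying inserts
INSERT INTO users (name) VALUES ('Robinhood');  -- nextval = 1: duplicate key error
INSERT INTO users (name) VALUES ('Robinhood');  -- nextval = 2: duplicate key error
INSERT INTO users (name) VALUES ('Robinhood');  -- nextval = 3: falls into the hole, succeeds
INSERT INTO users (name) VALUES ('Robinhood');  -- nextval = 4: duplicate key error again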

To solve this problem, we need to update the sequences of the table as well, including the primary key sequence. Postgres allows a sequence to be updated with the ALTER SEQUENCE command, so we can restart the sequence at a big enough integer:

Update Sequence
ALTER SEQUENCE users_id_seq RESTART WITH 10000;

A smarter way is to query the table for the maximum id and set the sequence to that number, so that the next generated id is the maximum + 1.
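A minimal sketch of that approach, using setval (pg_get_serial_sequence looks up the sequence behind users.id; note that MAX(id) is NULL on an empty table, in which case this call would fail):

Update sequence with setval
-- Set the sequence to the current maximum id; nextval will then return MAX(id) + 1
SELECT setval(pg_get_serial_sequence('users', 'id'), (SELECT MAX(id) FROM users));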

The "Cursed" NULL in postgres

Comparison to NULL

In Postgres, NULL is treated as a special value that is not equal to any other value, which means the expression NULL = NULL does not yield true. (Strictly speaking, it yields NULL rather than false, and NULL is treated as false in a WHERE clause.)
This can be verified with the following query:

SELECT NULL
SELECT n
FROM unnest(ARRAY[NULL,1,2,3,4,5]) n
WHERE n = NULL

The query returns an empty set, because no element equals NULL, not even NULL itself.
If you think this experiment is not convincing enough, then you can try this:

CASE NULL
SELECT
  n,
  CASE WHEN n = NULL THEN 'NULL' ELSE 'NOT NULL' END
FROM unnest(ARRAY[NULL,1,2,3,4,5]) n

This query yields ", NOT NULL" for the NULL element, which means even NULL = NULL does not evaluate to true.

To test whether or not a value is NULL, you should use IS NULL or IS NOT NULL.

So if you replace n = NULL with n IS NULL in the previous 2 statements, you will get the expected result:

SELECT NULL
SELECT
  n,
  CASE WHEN n IS NULL THEN 'NULL' ELSE 'NOT NULL' END
FROM unnest(ARRAY[NULL,1,2,3,4,5]) n
WHERE n IS NULL

NULL in Crosstab

In most cases the special value NULL doesn't hurt much, since we can always rewrite our expressions to work around it. But if you're using table functions to create a pivot table, and there are NULLs in your columns, you will find NULL is a cursed value that brings you a lot of trouble.

Postgres provides the tablefunc extension, which offers a series of functions called "crosstab#". With these functions, you can convert a set of rows into a pivot table.

The crosstab function accepts 2 SQL queries. The first query should yield 3 columns: the row in the pivot table, the column in the pivot table, and the value in the pivot table.
The second query should yield a series of values that define the columns of the pivot table.
The crosstab function groups the rows yielded by the 1st query by their 1st column. It then maps each row in a group to an output column by comparing the row's 2nd column against the values generated by the 2nd query; when they are equal, the row's 3rd column is placed in the pivot column defined by that value.
It is a very convenient feature provided by Postgres, and it works perfectly in most cases.
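For example, here is a minimal sketch (the order_stats table and its rating values 1 to 3 are made up for illustration):

Crosstab example
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
  -- 1st query: row name, category (pivot column), value
  'SELECT period, rating, volume FROM order_stats ORDER BY 1, 2',
  -- 2nd query: the values that define the pivot columns
  'VALUES (1), (2), (3)'
) AS pivot(period text, rating_1 numeric, rating_2 numeric, rating_3 numeric);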

But in our case, we hit a problem: we had NULL values in the 2nd column of the 1st query, which means we had NULL values in the pivot table columns!

And crosstab decides which column a value should be placed in by comparing values for equality!

And NULL never equals NULL!

BAM!!!

As a result, the NULL column in the pivot table is always empty!

CASE WHEN

To solve the problem, we play a little trick in the first query: we translate all the NULLs into a "normal" value.
Here is our 1st query; we want a pivot table with period as the row axis, rating as the column axis, and order volume as the content.

Original 1st query
SELECT
  Periods.period,
  Profiles.rating,
  SUM(Orders.volume)
FROM
  Orders
  LEFT OUTER JOIN Periods ON (Periods.id = Orders.period_id)
  LEFT OUTER JOIN Profiles ON (Profiles.id = Orders.profile_id)
GROUP BY
  Periods.period,
  Profiles.rating
ORDER BY
  Periods.period,
  Profiles.rating

We get NULLs in Profiles.rating, so we can translate NULL to 0 in the rating column. To achieve this we can use a CASE WHEN expression.

Original 1st query with CASE WHEN
SELECT
  Periods.period,
  CASE WHEN Profiles.rating IS NULL THEN 0 ELSE Profiles.rating END,
  SUM(Orders.volume)
FROM
  Orders
  LEFT OUTER JOIN Periods ON (Periods.id = Orders.period_id)
  LEFT OUTER JOIN Profiles ON (Profiles.id = Orders.profile_id)
GROUP BY
  Periods.period,
  Profiles.rating
ORDER BY
  Periods.period,
  Profiles.rating

COALESCE

The solution works fine. But personally I don't like it, because it repeats the expression and is not concise. Luckily, the value we need to deal with is the special value NULL, and Postgres provides a group of functions specifically for dealing with NULL.

What we want is the function COALESCE, which accepts a list of values as arguments and returns the first non-null one.
So we can simplify our statement with this super function:

Original 1st query with COALESCE
SELECT
  Periods.period,
  COALESCE(Profiles.rating, 0),
  SUM(Orders.volume)
FROM
  Orders
  LEFT OUTER JOIN Periods ON (Periods.id = Orders.period_id)
  LEFT OUTER JOIN Profiles ON (Profiles.id = Orders.profile_id)
GROUP BY
  Periods.period,
  Profiles.rating
ORDER BY
  Periods.period,
  Profiles.rating

In this statement, if the rating is not null, COALESCE returns the actual value; if the rating is NULL, COALESCE moves on to the next non-null argument, which here must be 0.

Besides COALESCE, there is another function called NULLIF, which might mislead you in a totally wrong direction, just as it did me.
According to the Postgres documentation, the function behaves in the opposite way to what its name might suggest:

The NULLIF function returns a null value if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE example given above.
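In other words, NULLIF produces NULLs rather than removing them:

NULLIF examples
SELECT NULLIF(0, 0);  -- NULL: the two values are equal
SELECT NULLIF(5, 0);  -- 5: the values differ, so the first value comes back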