Sorting in Lakes

Published on Jul 2, 2021

In case we sort rows using SQL we need to distinguish between duplicate values of unique rows and duplicate values due to duplicate rows.

The nice thing about the row_number SQL function is that it provides you with a unique number (per partition) which is often used as a mean to de-duplicate a set. The rank function does not provide this property because duplicate values share the same rank.

    dense_rank() OVER (ORDER BY a),
    rank() OVER (ORDER BY a),
    row_number() OVER (ORDER BY a)
FROM (VALUES (1),(1),(2)) AS t(a);

 a | dense_rank | rank | row_number
 1 |          1 |    1 |          1
 1 |          1 |    1 |          2
 2 |          2 |    3 |          3

The problem occurs if you use it the wrong way. Recently I spent a some time debugging a weird behavior in a report. The numbers changed during different executions on an immutable data set — which is not what you expect.

SELECT a, b,
    row_number() OVER (ORDER BY a) AS rn
    (2, 'c'),
    (1, 'a'),
    (1, 'b')) AS t(a, b)

 a | b | rn
 1 | a |  1
 1 | b |  2
 2 | c |  3

If you execute this query on PostgreSQL a couple of times it gives you most likely always the same ordered set. But this is not given and a unintended side effect of providing the data using VALUES. Logically ordering of duplicate values is not defined and we can not rely on this result. We have to expect that sometimes ‘a’ and sometimes ‘b’ will get row number 1 assigned. It should be more probable to happen in distributed environments — like Presto or Spark (since PostgreSQL these days also supports parallel execution we should it expect to happen as well).

If in such a situation the expression rn=1 is used to filter and count the values in column b='a' we will get a different result every time we execute the query. As a result it will look to the observer as if the query engine works non-deterministic on stable input.

Maybe be it should help in such a case to ask whether or not we could safely exchange row_number with rank. If not, we should change the ordering criteria.

Should we always prefer rank over row_number? It depends I think.

It becomes more important when working on big data for two reasons. A data lake of plain files does not provide guarantees like a good old database with well defined constraints could. Furthermore data lakes are solutions in environments where we rely on “at least once” message delivery. This means for technical reason we have to expect duplicates. In such a situation we need to use row_number.

The notional exchange of both functions should help to find out whether or not the duplicate value can happen only in case of a duplicate row.