On our way from monolithic applications to a microservice architecture, a property provided by databases got lost: the unique constraint. We need an approach to mitigate the consequences of this missing property.
id is not sufficient

When designing a database schema it is considered good practice to use a natural key as the primary key. If our database table contains information about things we can point at (e.g. items in a warehouse, houses in a street), we are often able to use a natural key: those things are labeled with a name, and we can use that name as a natural primary key in our table. In this case uniqueness ties the identity of the rows in our table to the real world.
But more often we deal with relations we cannot point at, e.g. a mother or a purchase. If we need to handle those relations in our information system, we need to understand their identity. A purchase has an identity because any item on stock can be sold only once by the seller to the buyer. Our purchase has a natural unique key consisting of customer, merchant and item¹.
In such a case, having only a synthetic key (also known as a surrogate key, a purchase_number in our example) is not sufficient; we should additionally have a unique key (the tuple [customer, merchant, item] in our example). But why? Because we want to be able to derive predications about the real world. Uniqueness as provided by a synthetic key does not help here. When we generate a UUID or use a sequence object we achieve uniqueness, but it lacks the property of identity for the rows in the table.
The presence of a single-column primary key simplifies the definition of references in our database. But more important for the purpose of the information system is the presence of the unique key². With this constraint in place, no row can enter the system that does not represent the entity (the relation in our case) in the real world. Therefore a PostgreSQL database can act as an oracle that always tells the truth. We would lose this valuable property if we merely had an automatically generated primary key.
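To make the two kinds of keys concrete, here is a minimal sketch of such a purchase table. The column names are assumptions made for the sake of the example, and sqlite3 is used only because it ships with Python; a UNIQUE constraint in PostgreSQL rejects the second row in the same way.

```python
# A purchase table with a synthetic primary key (purchase_number) plus a
# unique business key (customer, merchant, item). Column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE purchase (
        purchase_number INTEGER PRIMARY KEY,            -- synthetic (surrogate) key
        customer_id     TEXT NOT NULL,
        merchant_id     TEXT NOT NULL,
        item_id         TEXT NOT NULL,
        UNIQUE (customer_id, merchant_id, item_id)      -- identity in the real world
    )
""")

con.execute(
    "INSERT INTO purchase (customer_id, merchant_id, item_id) VALUES ('alice', 'acme', 'item-42')"
)

try:
    # A second row for the same real-world purchase would get a fresh surrogate
    # key, but the unique key keeps it out of the table.
    con.execute(
        "INSERT INTO purchase (customer_id, merchant_id, item_id) VALUES ('alice', 'acme', 'item-42')"
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```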
In our new world of distributed microservices, our database schema got distributed across many microservices. Along the way we got events with a schema, persisted in a data lake. Schema-on-write made way for schema-on-read³. Different encodings like JSON, Parquet or Avro focus on the definition of the record. But…
The unique constraints got lost.
Some systems declare unique identifiers and leave the responsibility to the producer. Since services emit events, we tend to treat all the different event types alike, and so it seems obvious to declare an event id as a general identifier. That unique identifier is like a surrogate key: it is prone to exactly the error an additional unique key would prevent in our relational database.
Since there was a need to create unique values without locks in a distributed environment, UUIDs have been widely adopted. Now we use them as identifiers. They are unique, but it is unclear what they identify; that depends on how they are generated and associated with the event. If a developer generates a new value for the event id on every retry of a failed send, we get a unique id, but duplicate rows can end up in the data lake. Those duplicates cannot be eliminated using the event id.
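A small sketch of that retry scenario, using hypothetical field names: each send attempt mints a fresh UUID, so the same business fact lands twice in the lake, and only the business key reveals the duplicate.

```python
# If each send attempt mints a fresh event id, a retried event shows up twice
# in the data lake and the event id cannot detect it.
import uuid

def build_event(payload: dict) -> dict:
    # event id generated per attempt: the problematic pattern
    return {"event_id": str(uuid.uuid4()), **payload}

purchase = {"customer": "alice", "merchant": "acme", "item": "item-42"}

first_attempt = build_event(purchase)    # send times out after the broker stored it
second_attempt = build_event(purchase)   # retry: same business fact, new event id
lake = [first_attempt, second_attempt]

unique_by_event_id = {e["event_id"] for e in lake}
unique_by_business_key = {(e["customer"], e["merchant"], e["item"]) for e in lake}

print(len(unique_by_event_id))      # 2: the duplicate is invisible to the event id
print(len(unique_by_business_key))  # 1: the business key still identifies the purchase
```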
By generalizing the event id (as a technical means) we lose the connection to the business. Adding only a generalized event id creates the possibility that the technical identifier and the business data diverge. Not having an event id at all would force a different approach to dealing with duplicates.
With the decision for a distributed system we also accepted that messaging is not reliable. If we go for at-least-once delivery, we have to expect duplicates which end up in our data lake. Since there is no longer an instance enforcing a unique constraint (as the database did), the rules need to be defined on read. The unique constraint on write made way for a unique constraint on read.
Now that business data and event id can diverge, the event id cannot be used for that purpose. Having the unique fields declared in the event type metadata would be useful for many consumers.
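As a sketch of what such metadata could enable: the registry below is hypothetical and nothing like it is standardized, but once the unique fields are declared per event type, every consumer can deduplicate on read against the same key.

```python
# "Unique constraint on read": deduplicate events against the unique fields
# declared for their event type. The UNIQUE_FIELDS registry is a hypothetical
# stand-in for event type metadata.
from typing import Iterable, Iterator

UNIQUE_FIELDS = {
    "purchase": ("customer", "merchant", "item"),
}

def dedup_on_read(events: Iterable[dict], event_type: str) -> Iterator[dict]:
    key_fields = UNIQUE_FIELDS[event_type]
    seen = set()
    for event in events:
        key = tuple(event[f] for f in key_fields)
        if key not in seen:  # keep the first occurrence, drop later duplicates
            seen.add(key)
            yield event

events = [
    {"event_id": "a1", "customer": "alice", "merchant": "acme", "item": "item-42"},
    {"event_id": "b2", "customer": "alice", "merchant": "acme", "item": "item-42"},  # retry duplicate
]
print(list(dedup_on_read(events, "purchase")))  # only one purchase survives
```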
If there is no (machine-readable) definition of unique constraints, different readers will eventually implement deduplication differently, which means different analysts will end up analyzing the data differently.
It is already a known problem in science that the same data set, analyzed with respect to the same question, can yield contradictory results. An important remark from the research is
“that analysts’ prior beliefs about the effect did not explain the variation in outcomes, nor did a team’s level of statistical expertise or the peer ratings of analytical quality.”
Making sense out of data is a difficult business. I think a missing common view of what a row identifies makes it even more complicated.
How could we mitigate this issue? I think there are a couple of options:
¹ We assume there are no returned items.
² It is also referred to as a candidate key.
³ An extensive discussion of schemaless data models and schema-on-read vs. schema-on-write can be found in Kleppmann, M.: Designing Data-Intensive Applications. O'Reilly, 2017.