Coming up with a way to identify a case both internally and externally isn’t easy. This post shows how we solved this problem for ourselves.
Why we need an identifier
When we’re helping our patients, a request we’re handling for a single patient is called a case. Within a case we help a patient find a doctor who is uniquely qualified to help with that patient’s specific problem.
The whole operation of collecting information from the patient, searching for the best doctors and communicating the results back to the patient happens in the context of that case.
At several points both internally at BetterDoc and when communicating with the patient we need to identify a specific case.
A typical identification might be something like “The case is for Mr. Müller, the guy with pain in his left knee”.
There are a variety of reasons why this is not the way we identify our cases.
(1) It doesn’t scale.
The most common last name in Germany is “Müller” so it’s highly likely that there is more than one Mr. Müller. The chances are also very high that there is more than one Mr. Müller who has pain in his left knee.
(2) It doesn’t respect the privacy of the patient.
Every patient has the right to privacy and as we’re dealing with medical data we go a long way to ensure that we keep their data private. A conversation with a colleague in the office or out in the hallway about Mr. Müller is easy for other people to overhear, and we don’t want anyone who doesn’t really have to know to learn that Mr. Müller has pain in his left knee.
(3) It’s not a good way to store data in a technical system.
For technical systems, especially a database, something like “Mr. Müller with pain in the left knee” is a very bad identifier. It isn’t unique and it also leaks personal information.
Technical systems, especially databases, have solved references by auto generating identifiers for entries in a database table.
The simplest form of such an identifier is an auto incrementing integer where the first entry within a database table gets the identifier 1, the second entry gets the identifier 2 and so on.
A case could be an entry in a database table, so referencing a case by its auto generated identifier would work. However this is not what we do, again for a variety of reasons:
(1) It’s insecure.
It’s pretty easy to detect that an identifier is auto generated.
When I communicate with BetterDoc and my personal case is numbered 42, it’s pretty straightforward to assume that there is a case numbered 41 and potentially a case numbered 43.
We don’t want to leak that much information about our internal systems.
(2) It’s cumbersome.
When communicating with our patients we need to make sure that we can identify them and their case clearly. An identifier like 42 is easy to communicate over the phone, but what about 6380419330? Was that a one or a seven? Could you repeat the fifth character? Are there really multiple threes?
Reading a long number from a piece of paper and understanding it over the phone is cumbersome. It will work ultimately but it may take a lot of time and frustration on both sides to clearly communicate the identifier.
An alternative solution (which solves the issue of pure incrementation) would be to use a UUID.
However it would still be cumbersome to dictate a sequence like 123e4567-e89b-12d3-a456-426614174000 over the phone.
It’s similar to dictating an IBAN.
Nobody wants to do that.
Taking all these things into account we came up with a set of rules that our case identifiers should comply with:
- The identifier must be randomly generated (e.g. not based on any incrementing value)
- The identifier must have a checksum to tell if it is valid without having to look it up, in order to prevent communication errors
- The identifier must be free from accidental profanity (i.e. random value generation must not produce an identifier that reads like an offensive word)
- The identifier must be easy to dictate via the phone and must be easy to read from a document
To create a unique identifier we’re generating a random sequence of 45 bits.
Let’s do that. Our example value, written as a base 10 integer, is 17142624302842.
A dedicated service stores the identifiers that we have already generated, so in the (unlikely but not impossible) case that we generate the same random identifier twice we can detect that this identifier is already in use. In that case we simply continue generating new identifiers until we have generated one that hasn’t been used already.
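The generate-and-retry loop described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual BetterDoc service: the in-memory set `already_used` stands in for the dedicated service, and the function name `generate_unique_id` is made up for this example.

```python
import secrets

ID_BITS = 45  # size of the random part, as described above


def generate_unique_id(already_used: set[int]) -> int:
    """Draw random 45-bit values until one turns up that is not in use yet."""
    while True:
        candidate = secrets.randbits(ID_BITS)
        if candidate not in already_used:
            # Remember the value so it can never be handed out twice.
            already_used.add(candidate)
            return candidate


# Hypothetical stand-in for the service that stores issued identifiers.
used: set[int] = set()
new_id = generate_unique_id(used)
print(new_id)
```

The sketch uses the `secrets` module rather than `random` because the values should not be predictable from the outside.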
To compute the checksum we’re using a simple modulo function:
17142624302842 % 37 = 3
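As a small Python sketch, the same calculation looks like this. The function name `checksum` is an assumption for illustration; the modulus 37 and the example value are taken from the calculation above.

```python
CHECKSUM_MODULUS = 37  # taken from the example calculation above


def checksum(value: int) -> int:
    """Return the remainder that is stored alongside the random value."""
    return value % CHECKSUM_MODULUS


print(checksum(17142624302842))  # 3

# If the last digit is misheard as 7 instead of 2, the remainder no longer matches.
print(checksum(17142624302847))  # 8
```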
We have fulfilled our rules (1) and (2) but definitely not (4). This is not readable at all.
To encode this identifier into a readable format we use the Crockford Base 32 encoding.
The concepts laid out within Crockford Base 32 encoding match perfectly with our requirements:
- Be human readable and machine readable.
- Be compact. Humans have difficulty in manipulating long strings of arbitrary symbols.
- Be error resistant. Entering the symbols must not require keyboarding gymnastics.
- Be pronounceable. Humans should be able to accurately transmit the symbols to other humans using a telephone.
The Crockford 32 encoding also has some nice built-in “error corrections”:
The digit 1 can be easily confused with both the letters L and I. The same happens with the digit 0 and the letter O. The Crockford 32 algorithm uses only 0 and 1 while encoding a value, never I, L or O. When decoding a value it automatically converts an I or an L into a 1 and an O into a 0. So even if a patient misreads one of these characters our system can automatically compensate for that.
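To make the encoding and the automatic correction concrete, here is a minimal Python sketch of Crockford Base 32. It is an illustration, not the library we actually use; the function names `encode` and `decode` and the fixed-width `length` parameter are assumptions made for this example.

```python
# The Crockford Base 32 alphabet: digits plus letters, without I, L, O and U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Characters that are easily misread are mapped back to the digit they resemble.
NORMALIZE = str.maketrans("OIL", "011")


def encode(value: int, length: int) -> str:
    """Encode a non-negative integer as a fixed-width Crockford Base 32 string."""
    symbols = []
    for _ in range(length):
        value, remainder = divmod(value, 32)
        symbols.append(ALPHABET[remainder])
    if value:
        raise ValueError("value does not fit into the requested length")
    return "".join(reversed(symbols))


def decode(text: str) -> int:
    """Decode a Crockford Base 32 string, accepting lower case and O, I, L."""
    value = 0
    for symbol in text.upper().translate(NORMALIZE):
        # ALPHABET.index raises ValueError for symbols outside the alphabet.
        value = value * 32 + ALPHABET.index(symbol)
    return value


print(encode(17142624302842, 9))  # FJXA0GDQT
print(decode("FJXAOGDQT"))        # 17142624302842 (the letter O is read as 0)
```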
Crockford 32 is also explicitly designed to fulfill our rule (3) (“the identifier must be free from accidental profanity”).
Using the Crockford 32 encoding, our random value 17142624302842 is encoded as FJXA0GDQT and the checksum is encoded as 03. We combine these two into a single string, ending up with:
FJXA0GDQT03
To be safe against future changes we prefix this identifier string with the name of the algorithm that we used to create it. At the moment the solution described above is the only algorithm we’re using and we’re calling it A.
So we end up with:
AFJXA0GDQT03
If at some point in the future we should decide to not use Crockford 32 any longer we can still decode our old identifiers (“If the prefix is A then decode using Crockford 32, if the prefix is B, C, etc. then use whatever algorithm we defined for them”).
As the final step, to comply with our rule number (4) (“The identifier must be easy to dictate via the phone and must be easy to read from a document”), we split the string into blocks of four characters separated with a dash, ending up with:
AFJX-A0GD-QT03
This is easy to dictate for a patient and easy to understand for our team members.
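Putting the steps together, a Python sketch of the whole formatting pipeline could look like the following. It is a sketch under assumptions rather than our production code: the names `format_case_id` and `encode` are hypothetical, and the widths (9 symbols for the 45 random bits, 2 symbols for the checksum) are derived from the example values in this post.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford Base 32 symbols
ALGORITHM_PREFIX = "A"  # name of the only algorithm in use so far
BLOCK_SIZE = 4


def encode(value: int, length: int) -> str:
    """Encode a non-negative integer as a fixed-width Crockford Base 32 string."""
    symbols = []
    for _ in range(length):
        value, remainder = divmod(value, 32)
        symbols.append(ALPHABET[remainder])
    if value:
        raise ValueError("value does not fit into the requested length")
    return "".join(reversed(symbols))


def format_case_id(random_value: int) -> str:
    """Build prefix + encoded value + encoded checksum, split into blocks of four."""
    payload = encode(random_value, 9)      # 45 bits are 9 symbols of 5 bits each
    check = encode(random_value % 37, 2)   # checksum is at most 36, so 2 symbols
    raw = ALGORITHM_PREFIX + payload + check
    blocks = [raw[i:i + BLOCK_SIZE] for i in range(0, len(raw), BLOCK_SIZE)]
    return "-".join(blocks)


print(format_case_id(17142624302842))  # AFJX-A0GD-QT03
```

With these widths the prefix, value and checksum add up to exactly twelve characters, which is why the result falls into three even blocks.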
The dashes are completely ignored when decoding the value so it doesn’t really matter if the patient dictates “A, F, J, DASH, X, A, ...” or “A, F, J, X, DASH, A, ...”. It also doesn’t matter if the patient dictates “A, ZERO, G, D, ...” or “A, O, G, D, ...” as the Crockford 32 decoder knows the O is actually a 0.
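The forgiving decoding described above might be sketched in Python like this. Again this is an illustration under assumptions, not our actual implementation: `parse_case_id` is a hypothetical name, and the layout (one prefix character, nine value symbols, two checksum symbols) is inferred from the example in this post.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford Base 32 symbols
NORMALIZE = str.maketrans("OIL", "011")        # misread letters become digits


def decode(text: str) -> int:
    """Decode an already normalized Crockford Base 32 string."""
    value = 0
    for symbol in text:
        value = value * 32 + ALPHABET.index(symbol)  # ValueError on bad symbols
    return value


def parse_case_id(spoken: str) -> int:
    """Turn a dictated identifier back into the random value, verifying the checksum."""
    # Dashes and spaces carry no information, so their position does not matter.
    cleaned = spoken.upper().replace("-", "").replace(" ", "")
    prefix, body = cleaned[:1], cleaned[1:].translate(NORMALIZE)
    if prefix != "A":
        raise ValueError(f"unknown algorithm prefix: {prefix!r}")
    if len(body) != 11:
        raise ValueError("unexpected identifier length")
    value, check = decode(body[:9]), decode(body[9:])
    if value % 37 != check:
        raise ValueError("checksum mismatch")
    return value


print(parse_case_id("AFJX-A0GD-QT03"))  # 17142624302842
print(parse_case_id("AFJ-XAOG-DQT03"))  # 17142624302842 (misplaced dash, O for 0)
```

A checksum mismatch tells the team member right away that something was misheard, without having to look the identifier up first.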
So this is how we use identifiers.