Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Unnatural Keys – Nature doesn’t come with identifiers (matt-schellhas.medium.com)
183 points by thunderbong on May 28, 2023 | hide | past | favorite | 112 comments


The really key thing with this is to identify what your database/Id is solving and what it is not. What guarantees are you giving and what problems should you raise to people. You can rarely solve all of them.

I helped make GRID which was a database of research institutes in the world (morphed into ROR). But what is one institute? Is it the legal entity? Is a lab an institute? What about a research boat? A hospital? Which is the parent, the hospital or the university?

The answer to lots of that is "maybe, it depends". We had to decide what use cases we were solving, and often it came down to some human fuzzy definition, how do people use this in reality?

We built it because others had made very strict sets of rules or were very open and then were just not useful. A curated list of a 1000 institutes set up as people generally used them solved most issues. Scaling it up another 10x solved most others. Not all, not for everyone.

A recording of background noise or silence isn't a song. Unless of course it's 4'33".

> They’re all arbitrary, unnatural, human inventions

A really great summary.


Which goes to show that we often conflate identification and classification because it's convenient. It's not always the best decision longer term but it may be good enough.

For your particular task, you could have chosen to identify things of interest and assign them an ID. Separately, you could have made (and revised) a decision on whether a particular thing is a lab or a research institute and/or some other kind of entity.

In my experience, organisations are some of the most difficult things to model. The ways in which they can merge, split, be part of one another, be the same and not the same in different jurisdictions or according to different sets of rules are basically infinite.

So when you say "The really key thing with this is to identify what your database/Id is solving and what it is not" I totally agree. But I would add that identification and classification solve overlapping problems that are often deceptively similar but not quite the same.


> For your particular task, you could have chosen to identify things of interest and assign them an ID. Separately, you could have made (and revised) a decision on whether a particular thing is a lab or a research institute and/or some other kind of entity.

I get the point but we couldn't. The very concept of what one thing is, is a fuzzy decision in itself. There's also not a single hierarchy that makes sense for different uses so you can't just group and then classify - legal ownership, financing and operational control can all be distinct structures. Then how people actually work can be another.

Even then, fundamentally what we needed was to get away from trying to model the world then work out how to use it. Because then you have one Id that maps one way for one user and another for another user and now their work is incompatible - but in a subtle and hidden way.

Edit - this doesn't take away from your point, classification is a distinct problem.


The really key thing with this is to identify what your database/Id is solving and what it is not.

That's the money quote. If the indirection of long, composite, maybe changing foreign keys could be managed by the system, natural keys would be the way to go. But that doesn't happen.

> They’re all arbitrary, unnatural, human inventions

That part from the article is the weaker point. Semantics. "Natural" only means that it comes from the problem domain, it doesn't really matter.


> The really key thing with this is to identify what your database/Id is solving and what it is not.

On this context, it is very simple and straight-forward. It also generalizes almost completely.

Your database id is solving the problem of creating/managing the relationships within your data. This generalizes to basically everybody because "database id" is a concept created within a mathematical theory exactly and exclusively for that purpose.

If you try to add more purposes for it, you will get conflicting requirements.


4'33" isn't a song either. If you call a tail a leg, a dog still has 4 legs.


Since we're going into this meta-philosophical thing, your statement is false. If I call a tail a leg, dogs definitely have five legs. Unless they were of course born without a leg, or have otherwise lost one. There is no "reality" from a human perspective, it's all sensory input that goes through a lot of very subjective processing.

You can't _really_ categorize things. You can mostly sort of do it, but it never actually works out, because categories are human stuff, and human stuff is subjective and thus different depending on who you ask. Sometimes unclear even to the same person.

I can say dogs have five legs and the only reason most people will think I'm wrong is that society has come to categorize limbs based on function instead of length.


Getting a little pedantic 2. a large branch of a tree. "the bare limbs of a high tree" Similar: branch

limb (n.2) late 14c., "edge of a quadrant or other instrument," from Latin limbus "ornamental border, hem, fringe, edge," a word of uncertain origin.

Just thought of the limbic system and how that might be related, doesn't seem to be about function but about positioning against a core system


My point was that categorizing limbs (differentiating them from one another) is done by function. Now that I think about it limb isn't what I had in mind anyway, what I meant to say was that we categorize parts of the body in terms of function.

But you're right, limbs are not defined by function.


So in OP’s example of tracking every song, if I want/need 4’33” in my database of all songs, but I need to keep track of whether cwillu think’s it’s a song for some reason (they seem pretty authoritative, better make note of it), I could create a foreign key for that and set it to values like “cwillu said it’s a song”, “cwillu said it’s NOT a song”, and “cwillu has not specified if it’s a song yet”. That’s why natural keys don’t exist.


For a whole bunch of useful definitions and use cases, it's a song. Arguing about the precise delineations of song/not song is missing my point.

> If you call a tail a leg, a dog still has 4 legs.

I can't think of a useful definition of legs that includes the tail. I can for calling a recording of a musical piece a song.


In real life dogs can exist with more or less than 4 legs.


Yes! The existence of a five-legged dog makes “all dogs have four legs” false but it doesn’t make “dogs have four legs” false. Similarly, the existence of a two-legged dog makes “some dogs have only two legs” true while “dogs have two legs” is still not (unless you just mean the former).

Dog is a categorical abstraction over individuals. The properties of a category tend to have a complex relationship to the members of the category. This is true even if it is indisputable who the members are and what properties the members have.


In Sweden, almost everyone is assigned a number at birth, a personnummer, with the format YYMMDD-NNNN (or YYYYMMDDNNNN in computing, to avoid having a Y2K issue every time someone turns 100).

A lot of computer systems here seem to treat it as a natural key, which is a terrible idea, because:

• Not everyone has one (notably, asylum seekers and temporary residents don't), so if you rely on them, you're excluding those people.

• They can change. They won't for 99.9% of people, but not everyone is so lucky. If you use these as your primary key, you're going to create a headache for you and your users for that 0.1% of cases.


Yes, but what would be the alternative?

Of course you can use artificial keys like a random UUID for each of your db objects. Which will work quite well if you only interact with your own system.

But as soon as one of your db objects needs to be linked to some other system, you will need some common ground for a correlation, and something like a personnummer will help immensely and solve 99.9% of your problems. The 0.1% of problematic cases is far less than the headaches that the absence of some common ID scheme will cause you.

Imagine e.g. having to correlate on the usual "Name, First Name, Birth Name, Place of Birth, Date of Birth" dance: My place of birth used to be called Lower Unxton at the time of my birth, now it has been reformed with Upper Unxton and Exampleville into Greater Unxton. Which one do you want? Did I always give the same answer to that question? My parents also divorced and my birth name legally changed. Do you want the old or the new birth name? Btw., my legal date of birth is actually in a non-gregorian calendar. How do I input that into your form field, it doesn't seem to like Showa 50 as a year, not even to speak about the proper characters?

I'd claim any imperfect personnummer is still far superior to all the problems one would have without it.


They’re not suggesting you don’t store it… I don’t think you’re addressing the real concerns here.

How I read the concerns: External connections to your data belong in a field or a separate table to ensure that your data is not so tightly coupled to the key you’ve chosen that you limit your data model’s expressive power. Don’t make an external identifier the primary key in your database.


Ah. Thanks for the explanation. With that I agree completely.


You deal with it the same way as you deal with any value that may change and may not be unique: you don't make it your primary key.

Just index it separately, and resolve your links at insertion time to your internal actually-stable-and-unique keys. Then when your system runs into one of these duplicates you can tell what is happening, and you have a chance to fix it, rather than blindly plowing ahead and ruining even more data.

None of this has anything to do with how you represent your API to other systems. Do whatever you need there. And if system X is confused and thinks public-ID-Y is actually Z, well now you can handle it - that's just another discriminator, but you select this one based on the querier.


> Yes, but what would be the alternative?

Not have a nation-wide ID scheme. The problem is that people will rely too much on it being always correct and a lot of infrastructure will grow on top of it so that it is really hard to fix things.

A better approach is to keep the human in the loop. If a government or a company wants to create a connection between databases, let the human whose data is requested, provide the external keys/IDs upon request. That's less automation, but it allows for more flexibility and better personal data protection.


Yeah the US system of hijacking numbers that were never meant for this purpose and treating them as sufficient to take out credit in your name, and then having fifty redundant DMVs handling identification cards makes much more sense.


We tried that in the US, and we ended up with an ad-hoc one anyway. It seems to be far too useful of a concept in modern society to avoid this happening whether you want it to or not.


Not so long ago I had to deal with the Austrian version of a Social Security Number and it turns out that using birthdates inside such a number isn't a good idea, since

* Some people doesn't have a birthdate with which the Austrian state agrees

* For those people the Austrian state at least assumes a birthyear with a birthday at January 1st or July 1st

* This leads to overflowing the available numbers for those two dates

* If a date overflows, they get SSN with month "13“. The 13th month can have as many days as needed (so > 31 is possible).

* Now you have SSNs with a month part of e.g 2023-13-32.

Why even bother with birthdates then?


I think that's true for Social Security numbers in France/US, and National Insurance numbers in UK.

About 15 years ago, after working in Germany I came back to France for a couple of years, the country where I was born / assigned a number at birth / following a similar pattern with the year in the code.

My health insurance from Germany did not transfer according to EU rules or something, at least that's what the regional French National Security told me. The way they got me out of the Kafkaesque situation was to give me a "temporary" social security number, like they would give a foreign immigrant, and have it revert back to my "normal" number after 3 months. At some point I had "two" national security numbers. It was a real mess to fix the whole thing afterwards as you can imagine, and considered I left the country only after a couple of years, I wonder why I even bothered.


French bureaucracy will kill your soul slowly but surely


I'm curious, was there any pushback to its introduction ?

In France, when the government tried to introduce something like this in 1973, it (eventually) caused a scandal (after a scathing Le Monde article), which resulted in the whole project being scrapped, and the French data protection authority CNIL being created :

https://www.lemonde.fr/blog/bugbrother/2010/12/23/safari-et-... (fr)

(CNIL was later defanged in 2004.)


Personally I think this is very bad idea especially because it would be be very tempting to use the id to - for example - trivially "decide" if you can, for example, smoke, drink alcohol, watch porn.

But this would also make some tricks to extend the scheme to corner cases (e.g.: refugees will get YYMMDD-NNNN where the date part is when they were registered at entry) fail miserably.

At the very least, a single letter code in the mix would mitigate the problems a bit.


Should’t there be discussion about changing this number scheme?

According to Wikipedia the NNNN part is really NNGC. C is a checksum, giving NNG as a serial number with G even/odd for gender. Sweden seems to have ca. 115000 births per year, making 315 births per day. But those may not be evenly distributed. The available space or serial numbers seems awfully close.


Yes, there's only about 500 numbers per legal gender per day. There are overflows on certain dates and so some people get assigned a number with a slightly wrong birthdate.

As inconvenient as this is for those individuals, it'd be a lot of work for society at large to move to a new system, so it's probably never going to happen.


Any scheme that covers 99.9% of cases is a success, something quite hard to achieve. Calling it a terrible idea is a bit melodramatic, no?


In what cases can this number change?


The personnummer encodes gender in the last digit (which, IMO, is very silly) - so when that changes, the number has to change too.


That explains all the fuss of recent years. Transgenderism is an affront against data schematization.


Poland has a very similar numbering system, the PESEL. It can change if the date of birth was recorded inaccurately and you need to correct it, or, because the PESEL encodes gender, when you get your gender marker legally changed.


PESEL system was bought from Sweden.


Gender change, incorrectly recorded time of birth, duplicate number (though that one hasn't happened in the last 30 years or so), replacement of a temporary number with a permanent number.

The extreme case is witness protection, but in that case only the police would even know about the connection.


This. I'm actually surprised the linked Wikipedia article on "Natural key" only mentions one potential disadvantage, that somewhen in the future, maybe, at some point, somehow, the other system may change, and then your keys are bust. They even give US SSNs as an example of something that may stop working as unique key in the future.

There are no "natural keys". They don't exist in nature. They don't exist as a thing in our physical reality. And when it comes to human-assigned identifiers, it should be emphasized much more strongly that whatever external thing you think is a good unique ID, it isn't.

Names? Addresses? Dates? Nope - have you heard of the "Falsehoods programmers believe about..."?

Government IDs? Nope. US SSN was not designed to be used as UID from the start, but even when you have an identifier that was - like e.g. PESEL number in Poland - it's still not an unique ID, because people and computers make mistakes. Multiple people having the same PESEL is something that's been known to happen.

(Side note: "identification number" really means "opaque string". IDs are not numbers. If adding or multiplying them together doesn't make sense, they're not numbers.)

Anything business adjacent? Don't make me laugh. The article covered a lot of the reasons why any kind of tracking IDs are broken for the purpose - most of them dealing with the messy nature of reality - but businesses are businesses. They cut corners. They lie and cheat. That's why you see the same EAN code covering multiple products a store might even try and price differently. That's why SKUs get reused. That's why these days you often have multiple distinct products under the same SKU - first, quality batch, released to get early positive reviews, and then the main batch made with much cheaper and lower-quality parts. Product names? That's even worse - you'll have variations such as "extra batch of super-low-quality, reduced functionality, made specifically for Black Friday, so it can be sold under the same name as the OG product, seemingly at big discount".

This is not to say "never use other people's IDs". There's a trade-off here, but making a good choice starts with assuming that the external "unique ID" is neither unique, nor a good ID, and you'll have to deal with it at some point.


If I could go back 25 years and give myself advice, it would be this. A Lot of pain over my career would have been avoided.

I would also tell myself not to use sequential numbers for surrogate keys. They're massively convenient, but a real pain later on (data security, and distributed data being just two cases.)

I know there are folks who like natural keys, but frankly IMO they're a straight-up bug. YMMV.


Whatever you do, always let some manager decide about the actual key, then you can always blame them for their mistake :)


I find myself coming back to Data and Reality on a periodic basis to remind me that the model is just the model.

https://www.bkent.net/Doc/darxrp.htm


> There are no "natural keys". They don't exist in nature. They don't exist as a thing in our physical reality. And when it comes to human-assigned identifiers, it should be emphasized much more strongly that whatever external thing you think is a good unique ID, it isn't.

It seems to me that nature has plenty of natural keys, it’s human created constructs that don’t.

For example, the periodic table has natural keys.


Yeah, but you don't want to use that as a primary key for your stock database.

It'll work fine until you find you want to keep track of U-235 and U-238 separately.

What constitutes identity is context-dependent. Yeah, on one level they're the same thing. On another they aren't.


Make it a composite key of atomic and mass numbers, then.


But then you'd miss on crystal structure to distinguish your diamonds from your coal.


Which single element are they then?

Surely such a system would have a products table with something like nominal element fk, purity, mass columns anyway. (And stock table with fk to products.)

Then nothing wrong with 'natural' key on the elements, but sure, products are the thing you've (personally) made up, so of course there's no 'someone else's' key to use.


Diamonds and coal are both pure carbon (mostly carbon-12). So is graphite, and buckminsterfullerene, and a number of other things. Crystal structure matters a lot.


They're not pure carbon was my point. I wasn't saying crystal structure not important, I missed that but it'd be in my 'product derived from element x' table.


But then you need to have a stored procedure to maintain integrity every time an atom undergoes radioactive decay, that's got to be bad for performance!


Actually, in that case that would not be a properly normalized key because the mass number (counts both protons and neutrons) depends on the atomic number(protons).

Also it gets fun if you want to store anti-particles too.


Most of the time you don't interact directly with elements, but with molecules. Oh, and ions. And radicals. And and....


And isotopes. And transient, unstable configurations that may exist for a brief moment before decaying into something that slots into the periodic table better.

And also elements aren't a thing, quantum fields are. This is why you sometimes see things that just don't fit the idea of distinct atoms. But then quantum fields also aren't a thing. We don't know what exactly is the thing. We only have increasingly sophisticated working models.


No need to go to such philosophical depths. We're only poor engineers trying to come up with workable model. The periodical table of elements was a true boon and it is unlikely these "foreign keys" will change, but still - it alone isn't fully suitable as data model for chemistry.


The periodic table is a human construct. So is the concept of an identifier.


I’d wager that nonhuman intelligences out there in the universe have analogous tables to the periodic table.

And I suspect it would be possible for us to JOIN our periodic table onto theirs by matching on the atomic number field to create a mapping between, say, our names for the elements and theirs.

That seems unlikely to be something we would be able to do with literally any other dataset that an alien civilization has. Their star catalog will contain many of the same objects ours does - but there won’t be an obvious key that lets us join them up.

I guess there’s a few mathematical objects that would work - the list of Platonic solids maybe, or the symmetry groups… but in terms of real world physical things? Atomic numbers might be the only truly natural key there is.


The periodic table groups atoms by the number of protons they have and ignores other differences. Uranium 235 and Uranium 238 have the same "natural key" according to the periodic table. The periodic table is useful for solving a lot of problems, and for other problems where it doesn't work to treat U-235 and U-238 as the same, we (and other intelligences) use other ways of categorizing atoms.


Right, but there’s a reason we call them ‘isotopes of Uranium’ rather than saying U-238, Np-238 and Pu-238 are isotopes in the ‘family of atoms with atomic mass 238’.

Things with 92 protons in them behave chemically in similar ways. They form the same crystals and molecules.

If you were building a table of isotopes you’d put a column for ‘proton count’, and you’d find lots of the properties of the isotopes depend on the proton count not the atomic mass. So you could normalize that table structure out and create another table with the atomic number as the key, and put all the properties that are common to all isotopes of a given element in that table.

You would have made a table of elements and it would have the same primary key as the periodic table.

That’s what a natural key is. One that emerges naturally when you normalize data you have collected.

I suppose similarly ‘electrons in outer shell’ might emerge as a natural key by that process.


I don't think that's what people mean when they say "natural key." People say "natural key" when they're talking about a unique identifier (that's the "key" part) that is part of the domain model of an object (the "natural" part.)

If we drop the "unique identifier" aspect of a key, and we can distinguish between different objects that share the same value according to one key by using a different key, then anything can be a key. The number of pages in a book could be a "natural key." This conversation wouldn't exist if that's what people meant by "natural key," because the answer would be trivial.


> If we drop the "unique identifier" aspect of a key, and we can distinguish between different objects that share the same value according to one key by using a different key, then anything can be a key.

And the point the article is making, and that I was trying to emphasize further in my top-level comment, is that there are no perfectly unique identifiers. There are no UIDs in nature. UIDs do exist in human-designed systems, but they are only as good as the actual implementation of the system is, including its evolution over time.

Your own surrogate keys are the most reliable UIDs, because while you can still screw those up, it's in your power to avoid the mistake, and it's also in your power to fix any problem. Once you start using as keys "UIDs" that aren't owned by you, you lose both the ability to prevent bugs and to fix them.

For example, if you're making a library system and use a surrogate ID as a key, duplicates may happen - but they're always indicative of someone screwing up data entry, and can be trivially corrected by assigning a new ID for the entry. If you're using ISBN, or EAN-13, or some unholy hash of author/title/publication year, you're going to get a lot of corner cases, and you won't be able to fix them (you can't exactly change the barcode on everyone's copy of a printed book, just to undo the problems caused by the publisher playing funny games with SKUs).

> People say "natural key" when they're talking about a unique identifier (that's the "key" part) that is part of the domain model of an object (the "natural" part.)

Yes; under this phrasing, the point of TFA/my top comment is, your "natural key" is going to be only as good as your domain model of an object is - which, in practice, means it's going to be bad, because if your model defines some property correlated with real world to be unique, it's already broken.

Ultimately, the point is: there are no truly unique keys in the real world. Best you can do is to chose a key that's least likely to be duplicated in data. Perhaps so unlikely that, in non-critical systems, you can get away pretending it's unique.


You seem to be missing the rather important point that chemical elements are a thing. The atomic number uniquely identifies an element. There are not two distinct elements with atomic number 92 - there is one element, which we call Uranium. It has a number of isotopes (as do all elements - you can keep stuffing as many neutrons into a nucleus as your particle accelerator will let you). But Uranium is identifiably Uranium.

Sure, if I’m trying to build a table of atoms I can’t use their atomic number as a key. I would need some sort of UUID scheme. But in among all those atoms I’d find a bunch of them had stuff in common, and that that stuff is 100% correlated to each of those atoms’ atomic number.

See, the challenge here is to build a table of something where everyone can agree there is a single unambiguous natural key we can use for that thing.

And my (and OP’s) proposal is that you can do that with a table of ‘elements’ with ‘atomic number’ as the key.

You can’t do the same normalization trick with books.

Say you build a table of books with ‘title’, ‘author’, ‘publisher’, ‘year’, ‘ISBN’, ‘edition’, ‘page count’.

Are any of these - or combinations of them - actually a natural key of a common entity in disguise?

Maybe author/title/year? No. For one, I’m sure you can find distinct books that share these values. And ‘year’ is, it turns out, a synthetic key for identifying arbitrary Earth orbital cycles - not a ‘natural’ label for a time period at all. And ‘author’ names are a synthetic key too - many authors publish under multiple names (JK Rowling = Robert Galbraith) and names are just arbitrary keys assigned to specific consciousnesses embodied in particular bundles of cells for a period of years. And of course they aren’t unique, so… no good as a key anyway.

Sure you could say ‘well, I want to make “n page books” into an entity and extract that to a table but you wouldn’t find any properties you could extract out to that table - there isn’t anything that all 92 page books have in common that you could tag onto that relation.

Whereas your chemical elements table has LOTS of properties that are common to everything with each atomic number. Electron structure. What molecules it can form. What crystalline structures it forms. Sure, not its half life or its melting point, but still - useful facts that are common to all atoms with the same atomic number.


> You seem to be missing the rather important point that chemical elements are a thing.

I fundamentally disagree with this. Elements are a human-created categorization of atoms (granting that atoms are real) by atomic number.

> Maybe author/title/year? No. For one, I’m sure you can find distinct books that share these values.

Author/title/year is very much like atomic number: extremely useful and sufficient for very many practical purposes. But when dealing with books, sometimes we need more than author/title/year, and when dealing with atoms, sometimes we need more than atomic number.

Atomic number is a natural key to the elements, because it's part of the definition of element.

This is why natural keys don't exist in nature, but can exist for things we invent: because a key uniquely identifies any member of a set. A key makes every possible distinction. When we categorize things, we select some distinctions (such as the number of protons in an atom's nucleus) and throw away others (such as the number of neutrons.) The enumeration and definition of relevant distinctions is necessary for the existence of a key. We defined the elements, and as a result of how we defined them, they have a key. When we define a book table in a database, it has a key as a consequence of how it is defined.

If we don't define the set ourselves, as our invention, it is impossible to anticipate all of the possible distinctions between members, so it is impossible to construct a key.


Most elements are stable, especially most of the ones we deal with, and they fall in clear periods and rows based on chemical valence. This is true for main group elements and mostly true for transition metals.

I think the periodic table was discovered, not invented. My guess is that whenever we communicate with some advanced alien species, the periodic table is going to be one of those things we share.


Periodic table is recording multiple different patterns in atoms. Some better than others. By nature, its a dimensional reduction. There are many alternative renderings of the periodic table (or spiral, or cube, or ...) emphasizing different regularities. The one you refer to is just the most popular rendering.


The concept of an identifier comes with the concept of a bijection (or at least a injection). That is a truly basic concept in Mathematics.

You could argue Mathematics is a human construction. But at that point, isn't every concept a human construction? In that case it is hardly helpful, or meaningful to say a concept is a human construct, since all concepts are.


The problem isn't the concept of a bijective function, it is picking the right function and that is a social problem.


> (Side note: "identification number" really means "opaque string". IDs are not numbers. If adding or multiplying them together doesn't make sense, they're not numbers.)

Eeehhh I'm not sure if this necessarily true. A telephone number is definitely a number but addition doesn't make all too much sense on them.

If you want to specifically talk about objects that require addition and multiplication to be properly defined on them then you can always use Groups and Rings.


A phone number is definitely not a number - for example, it can often be written as starting with a +. Not only that, but one number can have multiple valid expressions (eg. a number with and without a country code). That's ignoring the different ways people like to format them (spaces, dashes, letters, etc.)


Yes, and leading zeroes are significant in phone numbers.


You could construct a telephone numbering system where the sum of two 'real' numbers could be dialed to start a three-way conference call. That would be.. amusing, to some subset of users.


Give everyone prime numbers and a N-way conference can be done with multiplication.


With some very long numbers and a hot market for short ones!


But think of the ease of conferencing!!!!

You just have to whip out your calculator (likely an app on your mobile device), type all the participants' numbers in, and...

Wait a minute!


A phone number is a number in a colloquial sense (a string of digits) but I would hate to use any computer system where phone numbers are stored or reasoned about as actual mathematical numbers.


I was about to suggest just hashing our DNA, but then, remembered identical twins+ exist.


Oh it's worse than that. There are lots of edge cases with DNA.

Chimera exist [1], plus there are issues with transplant recipients and so on. I expect there are more edge cases I don't know about.

[1] https://www.scientificamerican.com/article/3-human-chimeras-....


and gene therapy


I ran into the same problem while building https://learnawesome.org/ . Forget the broad class of "learning resources", even the "books" category doesn't have a usable unique ID. Not everything gets an ISBN for example. There's also the ambiguity between a "work" and an "edition" of a work.

This is probably why OpenLibrary supports mapping of books with 40+ identifiers: https://github.com/internetarchive/openlibrary/blob/master/o...


ISBN is an interesting case as the publisher has leeway on how it's handled accross the lifetime of a book. In particular some books will get multiple ISBN for the same version, and some other books can keep the same ISBN through multiple minor revisions.


Always use surrogate primary keys. Use alternate keys (AK) for your "natural keys". Modify the AK business rules as and when needed. There are several advantages to using meaningless random integers, sequential integers or UUIDs as your table's PK. The most important one is that if the PK participates (or is expected to participate) as a foreign key in other tables, it's crucial to us a design where you never ever have to change the PK!


All this, but I'd steer away from sequential numbers. They suffer from security problems, data-merging problems, data-distribution and replication problems, and more.

Over a long career I've come to favor UUID, the cost (space and performance) are well worth the advantages.


UUIDv7 has all the advantages of UUID, plus sortability, plus an embedded timestamp which can replace your "created_at" thus reducing the cost aspect. It's also soon becoming an official part of the UUID spec.

In fact, if your created_at is indexed and you can get rid of it via uuid7, there is practically no difference in terms of table size. The size cost is solely on foreign keys.


I really hope that I can find someone on the MS SQL Server team to champion implementing native UUIDV7 support, from what I've heard they are really hesitant to add more data types but a person's gotta try :)


Nature doesn’t come with identifiers; that is why Apollo gave us UUIDs. And always remember, the goal of a database is not to model the domain - it is to model your knowledge of the domain.


Ironically DNA is an identifier and nature does use identifiers like facial recognition, etc--it only happens not to be text-based nor 100% unique/accurate.



If you are mapping to a series of bits, then a checksum of some sort seems like a fine identifier to a specific file.

But something like "song", minting ARKs[0] for FRBR[1]-esq Work/Expression/Manifestation records would be my first instinct, not a natural key. The Manifestation record could have the checksums of the files.

[0] https://arks.org

[1] https://en.wikipedia.org/wiki/Functional_Requirements_for_Bi... [1]


FRBR makes a lot of sense. There's some really good insights from library sciences that folks in computer science aught to spend some time studying.

[edit] removed a statement about an ARK URL not loading, it loads fine now. Temp issue on my end?


I agree with the substance of your comment re; FRBR, but ... typo maybe? [1]

[1] https://grammarist.com/homophones/ought-vs-aught/


https://en.m.wikipedia.org/wiki/Identity_(philosophy)

For when you want to think real deep about what the identifier identifies.



I deal with inventory systems and there is a difference between your system's key and a 'natural' one--the natural one existed first. The good answer is to support both and use either as appropriate. There are many ways of ID'ing an item, usually a SKU (literally stock keeping unit) is used. The owner of the items can choose to use an existing one (manufacturer, UPC, etc) but more commonly assigns their own so that interchangeable items with different manufacturer codes can share a SKU. Ideally a system should support as many methods of ID'ing as the user deems useful.

If you do use surrogate keys everywhere, avoid exposing a one if a pair of existing ones will do. e.g. rows of an association table, unless of course that association has attributes and is a well-known thing to itself.

Making every user of your systems always use system generated identifiers (where external ones already in use commonly exist) is kind of like a file-system that only uses tmp-generated names and the user has to do external book-keeping to remember which is which.



Slightly OT but related: one of the problems with identifiers that I have come to accept is that the things underlying the identifiers change over time. Things come and go, businesses launch and fail or are acquired/absorbed into other businesses, humans are born and die, humans change their names or genders, people buy and sell and subdivide or combine real property, people paint things different colors- the only constant is change. And it generally wreaks havoc because people who thought up identifiers for things usually assumed immutability of a mutable thing, and often don't have workable procedures for keeping up with change.

I've come to the conclusion that identifiers should be structured in such a way that they have no relationship to what they identify - for instance, never ever ever use a name as an identifier, no matter how you normalize it. All properties of things should be mutable descriptors. Identifiers should be numeric-ish. Identifiers are not user interface, either - don't try to format them for readability - the audience for identifiers should be automation, not humans. You should give humans a way to serialize them, such as a URL or UUID or whatever, but the point is just to port the thing unambiguously between systems, not to be human readable. You can build UX that looks up the ID and shows you what the descriptive properties are.


I also tend toward surrogate keys, and would also point out that the overhead concerns frequently weigh in favor of them as well (though not always).

If your natural key is bigger than the appropriately sized surrogate key, and it will be referenced across the database in other tables, the overhead of referencing it everywhere (and indexing it in those places -- potentially in much larger tables) can easily outweigh the overhead of the extra bytes on the original table.

Additionally, surrogate keys can realize much better performance WRT cache locality.


> DALL-E — “A photograph of meat in the shape of a key on a solid background”

I feel like I'm now reading Matrix code where we can do away with the image itself and just keep the prompt.


I hate to throw OT shade, but as a commercial artist, I'm gonna– the hero photo in this article shows the importance of my profession. Creating images is a lot easier than deciding how best to visually communicate an idea and conceptually manipulate the most relevant components. I probably could have made a vector drawing in 10 minutes, or a photorealistic image in photoshop in about 15 that would have done the job a million times better, and considering how much cheaper artists and designers are than developers, it probably would have been about the same labor cost if money was involved. I get it... It's a blog post... But it's seriously like the clipart revolution all over again, just without the guard rails provided by pre-made imagery.


a "natural key" is frequently just a really foreign key in a database you and your org don't manage. — 'wwweston, https://news.ycombinator.com/item?id=27349246


Consistency in one database is important and "natural keys" can help, it's just a bad idea to apply them to messy nature. They're great for derived or linked data, though.


"There are no real world natural keys — only someone else’s surrogate key... SSN, VIN, email address, UPC, tax numbers, country codes, language codes, URIs… they’re all just someone else’s made up identifier."

Every time this topic comes up, there are major definitional problems, and it seems like either surrogate keys or natural keys get defined out of existence.

The author is drawing the line such that natural keys are surrogate keys that are out of your control, and surrogate keys are in your control. That doesn't seem to match the definition from wikipedia: https://en.wikipedia.org/wiki/Surrogate_key

I agree with the author that control over the ID is an important distinction, but if you're going to make up a new ID, why not just say so? "Surrogate key" just introduces confusion -- it makes it sound like something only programmers and database admins care about, which is definitely not true. If customers are seeing it, everyone in the organization should care that it's in a recognizable pattern (so you don't mix up different IDs from different places) and memorable and easy to understand when spoken.


VIN is a good example of how this stuff gets pretty crazy. The standards define it so as to be unique, but only unique within defined n year ranges of model years (VIN identifies vehicles), where n depends, however, on the vehicle type, which can be determined from some specific sub-fields of the VIN. So, multiple vehicles may be on the road with the same VIN if at least one has survived more than n model years (which map flexibly and only approximately to ordinary calendar time). In addition, sometimes a manufacturing error produces non-identical vehicles with the same VIN because of blunder by human, machine, or both in concert. These scenarios are resolved by IT means similar to the method used by Mark Twain and his parents to avoid confusion between him and his mythical twin brother.


As someone who has been thinking about a system to organize my improvisation band's rehearsal recordings I have made many of the same realizations.

Only with us some of the points are even wilder, because for us "a song" is the abstract concept or idea for an improvisation rather than a fixed set of notes played in a fixed way in a fixed order. Sometimes that idea can be a word, sometimes it can be very precise sound settings, high-level structural ideas, a set of lyrics, rhythmic patterns, melodic themes etc.

So any given recording can contain within itself any number of songs (including zero). Those songs can exist or not. And my conclusion for our case was that there is no automated way of deciding which is which, as at times it could be hard even for us ourselves.


Nature + Identifiers is an issue (e.g. see https://github.com/tdwg/tag/issues/36). I've routinely mentioned to others in my field to look at other industries for ways forward, particularly ones like the music industry, so its interesting, if unsurprising to see all the same problems arise there. For those who know, in our field, when Identifiers comes up in conversation at conferences etc., we tiptoe away, somehow people can't learn from the past issues.

There are two issues that keep coming up in my mind: 1) People want Identifiers to do something (like resolve), rather than just be identifiers and 2) People think that there are such things as "unique" identifiers (one identifier per "thing"). Neither, in my mind, are the purpose of identifiers. Identifiers should do one thing, localize you to some concept. By localize I mean that if you can find the digital space (or in physical collections where identifiers are used the physical "printed" identifier) that "contains" the identifier then you should have a reasonable probability of finding the thing/concept that identifier is for. That's all. No certainty, no uniqueness. It's very akin to what we do when we cite something in a publication, we are giving the researcher who reads it a reasonable chance of finding the origin. This isn't to say that we shouldn't try to keep identifiers unique though, it's to say that when it comes down to crunch time we should never assume 1) that they are unique, and 2) that their special properties (e.g. that they resolve) actually work.

I've seen numerous identifier schemes come and go. We have specifically designed a 1-many for things-to-identifiers in our systems (sitting on top our internal IDs, yay, another ID). DOIs? They must be unique, right? Nope. Institutional CODENs? Nope (though the botanists have done it pretty well through community peer-pressure). People IDs? Do you mean wikidata or ORCiD or?

As others have noted, identifiers really are just labels, though things like UUIDs have the game-changing property of reducing the probability that you're looking at a homonymous label.


Slap a synthetic key on everything, you probably overestimate how bad it is to have multiple records for "the same thing".

Even if it is indeed bad, it's probably unrealistic to expect to never let it slip on entry (however good the natural keys might be). Maybe it makes more sense to incorporate deduplication into the design right from the start.


When you use a surrogate key, you should still put a unique constraint on an alternate key. Multiple records for the same thing is a fundamental problem in a database.


I think in the future the idea of labelling every passing moment and every sounds that anything ever makes will lose prominence. Wondering what will replace them. I think we will either drop this need altogether or the "natural key" will come from "the state of reality".


This is not true. Given two songs A, B, an informed human can tell whether A and B are the same song or not. Now you’ve created a UNIQUE constraint on your data, so you have a unique identifier.


This is not true. The article gives the great examples of a song through its lifecycle: covered, remixed, remastered, compiled on a greatest hits record, included on a soundtrack, etc. Same song, sometimes. Same recording, sometimes. Different database entries, always.

To the extent that "an informed human" and "the same song" have meaningful definitions of "informed" and "same", you've just recreated the problem.


All this song data has been standardized and revamped multiple times and these standards have been implemented in mp3 players to decode and display info about that to which you are listening. Waitaminute -- mp3 players can play lots of sounds, including non-song sounds, e.g. speeches, air-checks, phone messages, commentaries, advertisements, radio dramas or comedies, bird-calls of birds, bird-calls of hunters, podcasts, phone-taps, recipes, soundtracks for walking or driving tours, audiobooks of diverse genres, etc, etc, which are quite likely to be acceptably identified and described by various attributes other than those applicable to songs. However, humans find it easier to put square pegs in round holes than to rebilk Romeo in a day.


But MP3 tags don't attempt to store a universally unique song identifier.

They live in situ with the file, and multiple encodings of the same canonical "song" may not be identical.


"To keep reading this story, get the free app or log in."


one might even say that all form is interpretation. Or that all facts are assertions.


Something that anyone who's brushed shoulders with the S1000D document specification knows in their guts.

S1000D is a CCS (component content system), which "chunks" the content into units called DMs (Data Modules). Each DM has a DMC (Data Module Code), which is also its filename, its natural key, and its addressable unit. The Data Module Code has tons of little IDs indicating all sorts of things[1], but see, they're all tied in to different business processes. I don't think I need to explain that this whole scheme goes batshit haywire at the smallest provocation, leading to many S1000D systems to be kept in the MIL-STD equivalent of a locked sacristy inside of the Logistics office. Locked sacristies don't get updated a whole lot, so the documents are comically - dangerously - out of sync with basically everything.

CCSs in general, and S1000D in particular, were architected by people that were outsiders to the broader trends in computing and programming. DITA came from Horn's Information Mapping via IBM, and Rob Horn was a PoliSci guy. IBM did what it could, but the basic notion that "document" could be replaced by "map" was a conceptual leap not taken by equivalent designers with LaTeX and the others in the more programmer-focused world. And compared to DITA, S1000D's far, far, far worse, being shaped by a steering group inside of a steering committee that's controlled by this . . this . . sort of Lovecraftian block diagram, if I am looking at the right issue of the specification and lighting the right candles. Both of the big CCS specs run into big thick walls of complexity as a fundamental consequence of their explicitly not studying computational complexity at design.

Alright, yeah, I have a few chips on my shoulder about this. But I hate waste, I hate it with the brilliance of a million suns, and I have seen so many decades of bodies thrown at this problem. It's beyond time at least a few of us raised our hands and asked, "So why does this never work?". The first step, as indicated, is admitting that your mental abstractions aren't actually real, not in the touchy-space-verse, and then to design systems accordingly.

[1] Namespaces? What's a namespace? S1000D avoids all things namespaces. Instead, when you have a new MODELIC + SDC combination, you mail a letter to NISPA (NATO Support and Procurement Agency) to register your new MODELIC, and then your DMCs will be - very hopefull y- uniquely addressable. How often do people actually register their MODELICs? Ah, I see you found the flaw in our cunning plan! MIL-STD-3031 (the USAA BREX / S1000D thingy) explicitly tells the publications administrator to NOT register their MODELICs with NISPA, because EUROS ARE GAYYYYYYY. Also, MIL-STD-3031 has some interesting ideas about what different parts of the DMC actually mean, which makes all this really pertinent (I think) to the article.


[flagged]


Medium has had paywalls for a long time.


> we want a database of all of the songs in the world so that we can properly identify unknown songs and provide attribution so that folks can get paid appropriately. It is a noble goal

That it is a noble goal is this internet rando's own opinion, and there certainly is no consensus on that.

Inflicting it as a fait-accompli to his readership does not bode well for the rest of the article




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: