Upsert anti-pattern in SQL Server

mumblemumble · on Sept 4, 2020

This is specific to Microsoft SQL Server, which has a rather fraught history in this department.

I believe the best option in PostgreSQL is still INSERT ... ON CONFLICT ...

In general, this is one of those spots where every single RDBMS behaves differently, so knowledge from one platform doesn't necessarily transfer well.

(Which, in turn, is one of the reasons why it's not necessarily a good idea to try to be database-agnostic.)

liability · on Sept 4, 2020

Sqlite has the same as Postgres; INSERT ... ON CONFLICT ... DO ...

It's very ergonomic and intuitive. I understand why SQL Server needs to retain older forms, but why can't they also adopt these new forms that are more pleasant for the end user? I suppose the people buying licenses to SQL Server usually aren't the ones actually using it.

mumblemumble · on Sept 4, 2020

Officially, MERGE is the method defined in the SQL standard, and is by far the most widely supported syntax.

Everyone else uses different, DBMS-specific, non-portable syntaxes. A lot of them (e.g., PostgreSQL and Sqlite) have some similarities, but are still mutually incompatible.

I think that Microsoft arguably made the right decision by choosing to conform to the standard. It's just kind of unfortunate that the ISO standard syntax is (in my personal opinion) not very good.

petergeoghegan · on Sept 4, 2020

I am the primary author of INSERT ... ON CONFLICT in Postgres. I extensively studied similar features from other database systems as part of that project.

MERGE simply doesn't provide the guarantees that most users want -- not in theory and not in practice. For example, the Oracle docs introduce MERGE by saying: "This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

It's likely impossible to make the concurrency guarantees work sensibly in READ COMMITTED mode while supporting MERGE's very general semantics. There are subtle but very real problems with making it behave like ON CONFLICT in one context but not in others. In my view MERGE is a perfectly reasonable feature, but it just isn't what people want when they say they want to upsert.

The SQL standard deliberately underspecifies anything related to transaction isolation or concurrency. I cannot see the standard providing any guidance on upsert for this reason.

andorov · on Sept 4, 2020

Thank you for your work! Postgres is great to use and the regular release of useful features has been a very nice benefit.

One unobvious (to me) but possibly necessary side effect of the INSERT...ON CONFLICT pattern is that it reserves a primary id for every row ahead of time, presumably because it does not know what will be updated and what will be inserted. We have a heavily updated table that started failing inserts one day because the primary id sequence hit the max int (2.1b), despite the table only reaching 100mm rows. The immediate fix was to reset the sequence to -1 and have it start going down...

petergeoghegan · on Sept 4, 2020

I hear that one from time to time. It's a consequence of the fact that sequences are generally not transactional, and the fact that the underlying "speculative insertion" infrastructure really does have to do most of the work of the insertion before it can decide whether or not the insertion really should go ahead.

I'm sympathetic, but unfortunately I cannot think of any tractable way of avoiding the problem at the implementation level. Maybe it would be possible to do something with an identity column (as opposed to a SERIAL column or a raw sequence) -- we have access to sufficient context there. Maybe we could skip consuming a sequence in the common case where there is a clear conflict at the start, and the sequence isn't accessed through the EXCLUDED pseudo table. We wouldn't actually promise anything more, but in practice we wouldn't burn through sequences at the same rate in cases where updates are common.

This is quite a significant project, but maybe it'll happen some day.

edoceo · on Sept 4, 2020

insert isn't really reserving so much as just getting nextval from the seq, and may have to nextval a second time. I've found for tables like this to NOT use int-pk - I use ULIDs for these kinds (and actually, I've been using ULID for like everything lately)

castorp · on Sept 4, 2020

The Postgres team decided against implementing MERGE when they designed the ON CONFLICT feature (it actually started as a MERGE).

It's been a while, but as far as I remember, the main reason was that the behaviour of MERGE mandated by the SQL standard, wasn't clear enough and doesn't deal with race conditions when multiple transactions try the same insert (or update) (not so much the syntax itself).

cogman10 · on Sept 4, 2020

Unfortunately, SQL Server's merge statement has a LOT of issues. One of the worst being that it is isn't atomic (without non-SQL-standard hints)[0]!

It really feels like it was a half-assed feature.

Great that they followed the standard, not so great that they made it nearly unusable with such a poor implementation.

I was super excited when we got SQL server 2008 at work (since upserts are a relatively common issue) and was devastated to learn all the issues with MERGE. I didn't care that the syntax was funky, I just wanted to avoid needing to manage transactions/locks/etc.

[0] https://weblogs.sqlteam.com/dang/2009/01/31/upsert-race-cond...

gfody · on Sept 4, 2020

> it is isn't atomic

merge is a single, logically-atomic statement in sql server (with or without hints)

it doesn't have the same physical locking characteristics as a single insert or update statement, how could it?

liability · on Sept 4, 2020

The way I see it, the standard is should be taken as a bare minimum requirement. Ideally standards compliant SQL would work on any RDBMS, though I don't think that ideal has ever been realized. Supplementing the standard with extensions when the standard happens to have ergonomic problems seems to do more good than harm. IMHO writing SQL with the extensions of postgres or sqlite is a lot more enjoyable, which I think counts for a lot.

mumblemumble · on Sept 4, 2020

There's room for argument there. From my perspective, as someone who's had to spend a lot of time maintaining databases, I would rather have just one way of doing it, even if the one way of doing it that I get stuck with is not the way I would have picked if I had had my druthers.

The problem with TMTOWTDI is that it means there's also more than one way to have to understand and reason about. Which ultimately makes my job harder.

bosswipe · on Sept 4, 2020

It's annoying that databases don't implement UPSERT natively. As far as I can tell the solution provided by every db requires the repetition of all the damn fields. To pick on the supposedly more ergonomic postgres method, here's a typical example,

  INSERT INTO students(id, firstname, lastname, gender, d_o_b, email)
  VALUES (1516, 'Gerard', 'Woodka', 'M', 'January 27 1995', 'gerardo_woodka@gmail.com')
  ON CONFLICT (id) DO UPDATE SET
    firstname = EXCLUDED.firstname, 
    lastname = EXCLUDED.lastname, 
    gender = EXCLUDED.gender, 
    d_o_b = EXCLUDED.d_o_b, 
    email = EXCLUDED.email;

At least it doesn't force you to repeat the values like many other servers. But still the repetition is bug prone and hard to maintain as the table changes. And it's just annoying to type.

nkozyra · on Sept 4, 2020

I think you really want to repeat the fields because inserting versus updating could require dramatically different behavior.

masklinn · on Sept 4, 2020

There are many cases where you only want to update a subset of the fields, or even not update anything, but it's also true that a shortcut to update everything but the conflicting fields would be nice.

nkozyra · on Sept 4, 2020

Oh I get the convenience, it just seems like an easy way to shoot yourself in the foot, like an update without a limit clause.

ufmace · on Sept 4, 2020

I appreciate that it's a little annoying and repetitive, but I think in this case, it really is better, because the business logic for exactly what should be done with each value in the case of a conflict can vary so widely. It's good to have the full flexibility of SQL and all of the associated values to work with for the purpose of constructing the new value for the column.

gfody · on Sept 4, 2020

> I think everyone already knows my opinions about MERGE

not everyone is stuck on a 15-year-old version of SQL Server. odds are you can go ahead and use modern language features. I can understand why a DBA consultant that frequently encounters production issues with old bugs would feel this way though.

masklinn · on Sept 4, 2020

It's certainly not jus sql server. The implementor of INSERT…ON CONFLICT UPDATE in Postgres considered and rejected MERGE[0] on grounds that it's simply not specified to do what people generally want

> MERGE, as implemented within other major systems and as described by the SQL standard does not require an atomic insert-or-update for even the simplest possible case of OLTP style insert-or-update.

> SQL MERGE as implemented within other systems doesn't require that a unique index be defined on the column or columns of the MERGE statement's outer join. I am not enthusiastic about the idea of falling back to table-level locking when an appropriate index isn't available. That could be a foot-gun for the large majority of users that just want to solve their basic insert-or-update problem.

A MERGE patch was submitted for pg11, it got bumped multiple time and returned with comments early 2019 with no visible activity since.

[0] https://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=YYDG...

gfody · on Sept 4, 2020

this has nothing to do with SQL Server. on SQL Server MERGE is atomic, if the source has no indexes it's probably a temp table, of course evaluating "not matched" requires joining the source to the target, on SQL Server you should be thinking and working in sets and avoiding single row (RBAR) whenever possible. if you're doing highly concurrent single-row upserts you're already SQL Servering wrong and I'll concede that's a case where you probably need a bunch of lock hints (regardless if it's a merge statement or an update followed by an insert)

teh_klev · on Sept 4, 2020

SQL Server didn't introduce MERGE until 2008, so not quite 15 years. Also SQL Server bugs can take a long time to be fixed, if at all:

https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

In my own experience of working with MS SQL Server since 1998 (6.5), Microsoft are fairly conservative about changes to the core database engine, despite seeing new releases every year or so. New features tend to be stuff built on top of the engine.

In real life you're far more likely to encounter older versions of SQL running mission critical apps because upgrades are that one time where you're likely to screw up the crown jewels (your data), and so therefore many organisations tend to be conservative about that too. Now it's nice to get to work on a greenfield application where you can kick off on a shiny new version of SQL, but there's a good chance those bugs and behaviours are still likely to be present because the core engine is pretty much the same.

gfody · on Sept 4, 2020

with all due respect to Aaron, I've had this link shown to me by developers not wanting to write a simple merge statement instead of some block of semantically weird shit full of lock hints enough times for it to be annoying. be careful, of course, but don't be superstitious - start with straight forward code and test.

cogman10 · on Sept 4, 2020

The problem with merge is that it fails under contention, not under simple testing.

We've had multiple issues with SQL Server caused by the usage of MERGE statements. It's been bad enough that we've ended up banning them as an organization in favor of the block of "semantically weird shit". Why? Because that block has become common, clear, and most importantly, doesn't tend to cause weird deadlocks, race conditions, etc.

gfody · on Sept 4, 2020

> The problem with merge is that it fails under contention, not under simple testing.

that depends on so much else going on with your database - fine policy but when I said "test" I didn't mean whatever you mean by "simple testing".

> caused by the usage of MERGE statements

"exacerbated by" is probably more accurate - just because replacing a merge statement with something else alleviated an issue doesn't mean that the root cause was some bug in SQL Server.

TekMol · on Sept 4, 2020

It is such a common operation that I think MariaDB is doing the right thing by having a dedicated command for it, "INSERT ... ON DUPLICATE KEY UPDATE":

https://mariadb.com/kb/en/insert-on-duplicate-key-update/

jakobdabo · on Sept 4, 2020

Also, REPLACE INTO...

gweinberg · on Sept 4, 2020

I've been burned with replace. The problem is that if you have fields with default values they will be set/reset to the default even if nothing changed. What I'd like to happen is that for any values that I don't supply, the table values remain unchanged.

WesolyKubeczek · on Sept 4, 2020

You can kiss your ON DELETE CASCADE foreign-keyed data goodbye each time you use it.

masklinn · on Sept 4, 2020

postgres has insert … on conflict update since 9.5, and sqlite copied that.

adontz · on Sept 5, 2020

Actually for many rows I delete and insert.

Much easier to write.

DELETE FROM t WHERE id IN (SELECT id FROM new_data); INSERT INTO t SELECT id, other_field FROM new_data;

It seems so natural to me that I do not have to repeat field list.

Queries were mostly in context of ETL, so never considered performance issues. Never wrote such code for OLTP.

jimktrains2 · on Sept 7, 2020

That'll update all ties though, including fields like created_at that you may not want updated.

shmoogy · on Sept 5, 2020

That's an interesting idea, I'll probably borrow this for ETL purposes.

grugagag · on Sept 4, 2020

I agree with this but this is a small nitpick compared to what kind of SQL queries Ive seen in production in different places I worked at. Not to mention the amount of redundant queries from ORMS performance hit which a lot of developers aren't even aware of.

mulmen · on Sept 4, 2020

Shouldn't the IF EXISTS be done after starting the transaction? What happens if the key is deleted before the transaction can start?

mjevans · on Sept 4, 2020

Title possibly: "Stop writing upserts the wrong way (MS-SQL Server)" since it's less clear which flavor of SQL this is relevant?

hobs · on Sept 4, 2020

SQL Server is the name of the product, hence the inclusion in the title.

bragh · on Sept 4, 2020

In a common forum that is not purely dedicated to Microsoft technology, MS product names really ought to be prefixed — context is important. For example, it applies to Entity Framework also: it has happened a few times to me that a non-MS developer assumed that ORMs in general were discussed, while it was related only to the MS implementation.

ksenzee · on Sept 4, 2020

The concept is applicable to relational databases in general, and it's a good idea. ETA: The point here is not the syntax, it's "don't leave race conditions sitting around in your upserts." Whatever RDBMS you're using, you need to make sure you're at the very least inside a transaction, and at best avoiding an unnecessary transaction.

mulmen · on Sept 4, 2020

It’s not though because basically every RDBMS does this differently.

dathinab · on Sept 4, 2020

Yes the article isn't really applicable to e.g. postgres where you would simply use on conflict do update which is atomic so no need to bother with things like a serializable transaction.

Semaphor · on Sept 4, 2020

when I need to do some kind of upsert on SQL server, I tend to refer to this post: https://michaeljswart.com/2017/07/sql-server-upsert-patterns...

betimsl · on Sept 4, 2020

You also gain by executing one less transaction at the times of successful update.

Quite a good pattern actually.

kardianos · on Sept 4, 2020

True. Don't use MERGE in MS SQL Server. Use two separate statements.

teh_klev · on Sept 4, 2020

It's better to back up assertions, even if you're correct:

By the same author (who I highly respect):

https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

dathinab · on Sept 4, 2020

And that is (part of) why RDB "failed" to keep people even for use-cases where they should be better then NoSQL.

- To much hidden complexity which can leaf to bugs if you don't know about the hidden complexity.

- To messy and/or misunderstood concurrency.

- Many People having fundamental misunderstandings about how to do things and how far transactions are useful for given (often implicit default) options.

One thing I had to learn the hard way is that many programmers which supposedly "know" about RDMS/SQL don't even understand isolation levels properly.

Also is it just me or is this way easier in postgres (with on conflict ...).

noobar · on Sept 4, 2020

How are these issues any less for NoSQL databases? They don't have non-obvious idiosyncracies? You don't need to understand their concurrency models?

dathinab · on Sept 4, 2020

The problem is if you don't know about a lot of SQL complexity and you are not aware of you not knowing it really easy to end up with a major problem with "corrupted" (here incorrect) data while the main reason which is often cited for RDB is to prevent exactly that.

This can easily lead to an situation where people (wrongly) come to the conclusion that RDBs don't work anyway and they should just go to NoSQL DBs as it's "just easier" to not have to handle a scheme.

Which yes is completely flawed as a argument as doing NoSQL without any implicitly schema can bit you very hard and you still want to do transactions but now your DBMS might not support the transactions you need (instead of you just doing them incorrectly) so you "somehow" bolt them on top of it and if it fails now it's "clearly your fault". Except there is a good chance it isn't as some NoSQL DBMS are much more tricky to handle then they seem (like Mongo not necessary keeping to the Durability part in ACID |EDIT: In the past I don't know about now).

I have seen this fallacy or parts of it more then one time both in "real live" and comments on Hacker News.

teh_klev · on Sept 4, 2020

It makes me weep when interviewing and I ask developers about fundamental stuff such as "what is ACID?"[0]

None of this stuff is hard to learn, and I'm not exactly Mr Ace Developer. But time and again I've been asked to debug/fix data inconsistencies that were caused simply because the developer didn't know what a transaction was, whilst running multiple inserts and updates, and one of them failed along the way.

[0]: https://en.wikipedia.org/wiki/ACID

teh_klev · on Sept 5, 2020

This is a general response to a bunch of comments below saying they'd no awareness of ACID, that genuinely surprised me.

ACID wasn't new to me, I've been around databases since the 80's in their various forms. But I learned the solids about ACID through, believe it or not, being sent on a jolly to learn me some Oracle back in '98 and then the concept being hammered home whilst taking the SQL Server module when acquiring my MCSD (that was a jolly-free self study to earn more bucks - which paid off).

Now I know a bunch of people poo-poo corporate certifications e.g. the MCP and MCSD etc, but I've yet to encounter in any modern code-school/bootcamp type educational material that covers these important fundamentals. And that's a shame because as developers it makes you think more carefully about decisions you make when interacting with a database product such as MS SQL, MySQL, Postgres etc, especially via ORM's. i.e. proper training material that sets up the knowledge fundamentals before being let loose.

I'm not having a go at the competency of folks that aren't aware of ACID, but it's a fundamental concept you need to be aware of when dealing with databases such as Oracle, SQL Server, MySQL, Postgres etc. Knowing this stuff allows you to build reliable and consistent scalable databases by balancing trade-offs around the ACID concept.

ragnese · on Sept 4, 2020

I don't know if it's worth a damn, but I have no idea what the acronym stands for. I'm sure I knew it at some point and I know it's a database compliance standard kinda thing.

But I use the shit out of transactions. I rather have contention issues than non-atomic operations done to corrupt our data. When I get to Facebook scale I'll let you know what I think about eventually-consistent and NoSQL, etc. ;)

dathinab · on Sept 4, 2020

The problem is people not even knowing that if you use the default isolation level (on pg, read-commited) that you can't do something like "read cell X, modify X in code (e.g. X+=1), write X back" without running into race conditions which you can either avoid by using a lock (e.g. a row lock) or a different isolation level.

No need to know what ACID is an acronym for (I forget it all the time) or how the isolation levels are called, but you need to know what to look out for (mainly data dependencies between read an writes in the same transaction having race conditions you need to "team" and multiple read inconsistencies)

teh_klev · on Sept 5, 2020

> No need to know what ACID is an acronym for

Sure, but if you're mentoring a junior developer then I'd like to think you'd pass this knowledge on and make them understand its importance in a formal way, rather than some partial hand-wavey "this kinda means this". Get them to go and learn these fundamentals and test them on it, which is good for you because it keeps it fresh in your mind to just roll off the tongue at parties :)

ragnese · on Sept 5, 2020

Right. Good point. I use row locks in my transactions as well. I always have to refresh myself on the isolation levels.

lanyard-textile · on Sept 4, 2020

Anecdata: I don’t know what that acronym is off the top of my head, and I’ve worked with databases in some form consistently for over ten years.

dathinab · on Sept 4, 2020

Funny think, I had the same issue just now.

I did know all the important thinks it stands for but not the exact definition. But I think the he/she meant that people didn't understand ACID and not that they didn't know the exact definition.

EDIT:

I would never expect someone to know the exact definition of ACID or SOLID or similar. But if they have a lot of experience with db they should always know the ideas behind it and even if they are not an expert (but will have to write SQL) I would expect them to know what problems rough to look out for and when they should make sure they are currently not running into such an problem by e.g. checking the postgres documentation. It's good.

tuatoru · on Sept 4, 2020

I don't understand why INSERT ... ON CONFLICT ... DO is even a thing.

Yesterday there was a item on a piece by rachelbythebay about checking for and dealing with unexpected conditions in your code as a practice that separated professionals from ... er, idiots.

This would appear to be a prime example of that antipattern -- in general, not just in SQL Server. If a tuple snuck into your relation while you weren't looking, don't you need to rollback and find out the current state of things?

edit: Rachel's piece: https://rachelbythebay.com/w/2020/09/02/check/

NegativeLatency · on Sept 4, 2020

You can sometimes get better performance or be easier on the db itself if you have some records that you want to update in bulk. `ON CONFLICT` gives you a way to tell the db how to handle conflicts (duplicates/collisions/errors) without having to write potentially slower/buggier application code to handle the situation.

magicalhippo · on Sept 4, 2020

Our application has an article catalog. There's an article number (primary key), along with description, classification codes, weights and other things, in addition to specific stuff for our application.

Our users have integrations which can update the catalogs "common" data. Descriptions can change, classification codes can change etc. A few users have several million articles, and they simply hand us a dump of them each night.

I tried initially handling this client side, load the articles, check if each article already existed, if so update else insert. If no changes don't do anything. I quickly discovered it used 4-5 hours each night when one of these big catalog changes started rolling in.

Then I rewrote it to issuing "insert ... on conflict" and just overwrote everything. Now it uses 15-20 minutes tops.

samatman · on Sept 4, 2020

A better way to look at it, imho: you can check for the unexpected condition in the client, or in the database.

With ON CONFLICT .. DO, it's in the database, which has advantages sometimes: it's typical to have multiple clients, and one database.

Even better, you can treat conflict as an expected condition, and specify what to DO when it arises.

An example in SQLite which I happened to be working on this week: I have a table with a (foreign key, ordinal) tuple as the primary key of the table. Bad things happen to the client code if there are duplicate ordinals, but the row itself is mutable, so: on conflict, update.

There's nothing left to check, this is simply a compact way for me to specify the correct behavior. The client can simply write new rows, without having to delete the old ones.

manigandham · on Sept 4, 2020

What do you mean by "you weren't looking"? The SQL statement which is doing what you tell it to do, when there's a conflict.

tuatoru · on Sept 4, 2020

I'm asking, why is there a conflict? Why don't you know whether or not that tuple is in the relation before you start?

In OLTP, this practice (UPSERT) would cause all sorts of mayhem.

icedchai · on Sept 4, 2020

In practice, these conflicts can happen quite often during data synchronization. Say someone uploads a list of inventory, keyed by ITEM_ID. You want all your items to be either added (new) or modified (previously existing.) You could loop through them one-by-one, or you can just let the database handle it with a bulk upsert.

masklinn · on Sept 4, 2020

> Why don't you know whether or not that tuple is in the relation before you start?

Because you've got a unique constraint on some sort of natural key and the service this data is fed from doesn't have any guarantee it'll only deliver the information once. Or the feed can be replayed in the normal course of operation (e.g. system imports / synchronisations).

Handrolling your upsert creates extra failure modes aplenty and is very tricky to get right, which is why built-in support was ultimately introduced.

> In OLTP, this practice (UPSERT) would cause all sorts of mayhem.

So?

manigandham · on Sept 4, 2020

Does every application (and process) using your database automatically know what's in every table at every moment? That's highly unlikely, so you still have to do queries to find out. And then what would you do if the tuple was or wasn't in there already?

Whatever your answer is, UPSERT just combines that into a single more efficient statement. There's no new logic, just a better expression of it.

If you want examples: 1) analytics with `key(id,timestamp)` to avoid manually creating new rows as time goes on. 2) OLTP updates where we only want the latest data. 3) new data from multiple sources concurrently and we want the last write. 4) duplicate data and we want a single copy for each key.

Ancapistani · on Sept 4, 2020

> Why don't you know whether or not that tuple is in the relation before you start?

... because that requires a redundant query, its associate lookup cost, and in order to be sure that tuple wasn't added or removed between the time your check and the insert/update executes, either a write lock on the table or on the row itself (if it exists).

"Blind" inserts are the right way to go IMO.