Hi, Michael. Yes, VoltDB is very fast, but they self-admittedly do not perform well if transactions span records across multiple nodes. That is the key feature difference between InfiniSQL and VoltDB (along with the fact, of course, that their project is functionally much further along).
If you want more details about how things work when performing transactions, I think that the overview I created would be a good starting point. It probably doesn't have everything you'd ask for, but I hope that it answers some of your questions: http://www.infinisql.org/docs/overview/
And to answer your question about performance degradation as a function of the number of nodes each transaction touches: I have not done comprehensive benchmarks of InfiniSQL measuring that particular item. However, I do believe that as multi-node communication increases, throughput will tend to decrease--I expect the degradation to be graceful, but further testing is required. The benchmark I've performed and referenced has 3 updates and a select in each transaction, all very likely to be on 3 different nodes.
I'd like to invite you to benchmark InfiniSQL in your own environment. I've included the scripts in the source distribution, along with a guide on how I benchmarked and details on the specific benchmarking I've done so far. All at http://www.infinisql.org
I'd be glad to assist in any way, give pointers, and so on, if there are tests that you'd like to do. I also plan to do further benchmarking over time, and I'll update the site's blog (and Twitter, etc.) as I do so.
Please communicate further with me if you're curious.
So basically, VoltDB acknowledges that cross-partition transactions are Hard and has put a lot of effort into minimizing them. (This is basically the entire point of the original HStore paper.)
InfiniSQL says don't worry, we'll just use 2PC. But not just yet, we're still working on the lock manager.
I look forward to your exegesis of how you plan to overcome the well-documented scaling problems with 2PC. Preferably after you have working code. :)
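For readers following along, here is a minimal textbook sketch of the two-phase commit protocol under discussion (a generic illustration in Python, not InfiniSQL's or VoltDB's actual code; the timeouts and coordinator-failure handling that cause the well-documented scaling problems are deliberately not modeled):

```python
# Minimal two-phase commit coordinator sketch (illustrative only).
# Phase 1 (prepare): every participant votes; phase 2 (commit/abort):
# the coordinator commits only on a unanimous "yes".

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self):
        # Phase 1: the participant durably records its vote, then replies.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect votes. A single "no" (or, in a real system,
    # a timeout) aborts the whole transaction.
    if all(p.prepare() for p in participants):
        # Phase 2: everyone voted yes -> commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

The scaling objection above lives in what this sketch omits: every participant holds its locks between prepare and the coordinator's decision, so a slow or failed coordinator blocks all of them.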
Yes, InfiniSQL uses two-phase locking. The Achilles' heel is deadlock management. By necessity, deadlock management will need to be single-threaded (at least as far as I can figure). No arguments there, so deadlock-prone usage patterns may definitely be problematic.
I don't think there's a perfect concurrency control protocol. MVCC is limited by transaction ID generation. Predictive locking can't be rolled back once it starts, or it limits practical throughput to single partitions.
2PL works. It's not ground-breaking (though incorporating it in the context of inter-thread messaging might be unique). And it will scale fine other than for a type of workload that tends to bring about a lot of deadlocks.
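One common way to detect the deadlocks mentioned above is a waits-for graph: an edge A → B means transaction A is waiting on a lock held by B, and a deadlock is exactly a cycle in that graph. A sketch (illustrative only; the thread doesn't say this is InfiniSQL's mechanism, and the fact that this check wants a consistent global view of the graph is one reason it's typically centralized and single-threaded):

```python
# Deadlock detection via a waits-for graph (a common approach under
# two-phase locking; illustrative, not InfiniSQL code).
# waits_for maps each transaction to the transactions it waits on.

def find_deadlock(waits_for):
    """Return a list of transactions forming a cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {t: WHITE for t in waits_for}
    stack = []

    def dfs(t):
        color[t] = GRAY
        stack.append(t)
        for u in waits_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:        # back edge: cycle found
                return stack[stack.index(u):]
            if color.get(u, WHITE) == WHITE:
                cycle = dfs(u)
                if cycle:
                    return cycle
        stack.pop()
        color[t] = BLACK
        return None

    for t in list(waits_for):
        if color[t] == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle
    return None
```

The detector only finds a victim; the lock manager still has to choose which transaction in the cycle to abort and roll back.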
HAT (http://www.bailis.org/papers/hat-vldb2014.pdf) looks pretty promising in terms of multi-item transactions. It turns out that you can push the problem off to garbage collection in order to make transaction ID generation easy, and garbage collection is easier to be sloppy and heuristic about. The only problems are that it isn't clear yet whether HATs are as rich as 2PL-based approaches, and that nobody's built an industrial-strength implementation yet.
Transaction ID generation, as you mentioned, is probably limited by the incredible expense of cross-core/socket/etc. sync.
Go single-threaded and divide a single hardware node (server) into one node per core, and your performance should go way up. You'd want to do something like this anyway, just to avoid NUMA penalties. And treating every core as a separate node is easy and clean, conceptually. I/O might go into a shared pool - you'd need to experiment.
I've seen this improvement on general purpose software. Running n instances where n=number of cores greatly outperformed running 1 instance across all cores.
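As one concrete illustration of why the per-core split removes the transaction-ID bottleneck: if each core owns its IDs, there is no contended global counter at all. A sketch (a snowflake-style scheme of my own invention here, not something the thread specifies; the bit widths are arbitrary):

```python
# Sketch of per-shard transaction ID generation (illustrative).
# Instead of one global atomic counter (which serializes every core
# on a contended cache line), each core/shard owns a purely local
# counter and stamps its shard number into the high bits. IDs stay
# unique machine-wide with zero cross-core synchronization.

SHARD_BITS = 10           # up to 1024 shards (cores)
SEQ_BITS = 54             # local sequence number in the low bits

class ShardedTxIdGenerator:
    def __init__(self, shard_id):
        assert 0 <= shard_id < (1 << SHARD_BITS)
        self.shard_id = shard_id
        self.seq = 0      # touched only by this shard's single thread

    def next_id(self):
        self.seq += 1
        return (self.shard_id << SEQ_BITS) | self.seq

    @staticmethod
    def shard_of(tx_id):
        # Recover which shard minted a given ID.
        return tx_id >> SEQ_BITS
```

The trade-off is that IDs are no longer globally ordered across shards, which matters for MVCC schemes that use IDs as timestamps - one reason the thread treats ID generation as a real design constraint rather than a detail.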
The only major design change from one node per process is that your replicas need to be aware of node placement, so they end up on separate hardware. You may even consider taking this a level further, so you can make sure replicas are on separate racks. Some sort of "availability group" concept might be an easy way to wrap it up.
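The "availability group" idea can be sketched as a placement constraint: given nodes tagged with a failure domain (rack, in this case), never put two replicas of the same partition in the same domain. A minimal sketch under those assumptions (names and layout are hypothetical, not anything InfiniSQL defines):

```python
# Sketch of rack-aware replica placement (the "availability group"
# idea; illustrative only). Each node is tagged with its rack, and
# replicas of a partition must land on distinct racks.

def place_replicas(nodes, n_replicas):
    """nodes: list of (node_id, rack_id) pairs.
    Returns n_replicas node ids, all on distinct racks;
    raises ValueError if there aren't enough racks."""
    by_rack = {}
    for node_id, rack_id in nodes:
        by_rack.setdefault(rack_id, []).append(node_id)
    if len(by_rack) < n_replicas:
        raise ValueError("not enough distinct racks for replica count")
    # Take one node from each of the first n_replicas racks
    # (sorted for a deterministic sketch; a real placer would
    # balance load across racks too).
    racks = sorted(by_rack)
    return [by_rack[r][0] for r in racks[:n_replicas]]
```

With one node per core, "node_id" here would identify a core-sized node, while "rack_id" could generalize to any failure domain: server, rack, or power zone.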
Also: your docs page clearly says 2PC was chosen (it's in the footnote). Maybe I'm misreading what "basis" means.
Hey Mark. The overview seems to be much of the same. There's a lot of excited talk, which is fine, but it should be limited to a leading paragraph. The fundamental issue is performance in the face of transactions that need to do 2PC among multiple nodes (which also need to sync with their replicas).
I'm not much of an expert at all, but I like reading papers on databases. It seems to me that if you really did discover a breakthrough like this, you should be able to distill it to some basic algorithms and math. And a breakthrough of this scale would be quite notable.
If I'm reading correctly, there's no replica code even involved ATM. So 500Ktx/s really boils down to ~83Ktx/sec per node, on an in-memory database. Is it possible on modern hardware that this is just what to expect?
I am curious, and I'm not trying to be dismissive, but the copy sounds overly promising without explaining how, even in theory, this will actually work. I'd suggest explaining that part first, and letting the engineering come second.
Your advice that I create a formal academic-style paper is reasonable, and I agree that it's something I should pursue. Will you follow me somehow (via the links at http://www.infinisql.org) so that when such a paper is produced, you'll see it? I can't guarantee getting the front page here again, and I don't want to be missed in the future, especially as you (and others) have been asking for this type of information.
And, yes, many thousands of transactions per node in memory is what should be expected. But the scalability of ACID (minus durability, as discussed) transactions across multiple nodes--that's the unique part. I'll try to distill that into a paper.
It doesn't have to be formal and academic enough to be published. Just something that explains how performance is going to be achieved - any sort of analysis.