TransactionID generation, as you mentioned, is probably being limited by the considerable expense of cross-core/cross-socket synchronization.
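Here's a minimal sketch of what I mean, assuming you partition the ID space per node (per core): each generator packs a fixed node ID into the high bits and increments a plain local counter, so IDs stay globally unique with no shared atomic counter to bounce between caches. The struct name, bit widths, and layout are all illustrative assumptions, not anything from your codebase.

```go
// Hypothetical sketch: per-node (per-core) transaction ID generation with
// no cross-core synchronization. Each generator owns a disjoint ID space
// by packing its node ID into the high bits.
package main

import "fmt"

const nodeBits = 10 // assumption: up to 1024 node IDs per cluster

type TxnIDGen struct {
	nodeID  uint64 // fixed per core/node
	counter uint64 // only ever touched by this node's single thread
}

func NewTxnIDGen(nodeID uint64) *TxnIDGen {
	return &TxnIDGen{nodeID: nodeID}
}

// Next returns the next transaction ID. Because the generator is confined
// to one single-threaded node, a plain increment is enough -- no atomics,
// no cache-line ping-pong across cores or sockets.
func (g *TxnIDGen) Next() uint64 {
	g.counter++
	return (g.nodeID << (64 - nodeBits)) | g.counter
}

func main() {
	gen := NewTxnIDGen(7) // e.g. the node running on core 7
	fmt.Println(gen.Next(), gen.Next())
}
```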
Go single-threaded and divide a single hardware node (server) into one node per core, and your performance should go way up. You'd want to do something like this anyway, just to avoid NUMA penalties. But treating every core as a separate node is also easy and clean, conceptually. I/O might go into a shared pool - you'd need to experiment.
I've seen this improvement on general-purpose software. Running n instances, where n = number of cores, greatly outperformed running one instance across all cores.
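As a rough sketch of the deployment model, assuming a per-node binary: the launcher below starts one single-threaded process per core on the box. The "./dbnode" binary name and its --node-id flag are made-up placeholders for illustration, and CPU pinning (taskset / sched_setaffinity) is left out.

```go
// Hypothetical launcher sketch: run one single-threaded node process per
// core instead of one multi-threaded process across all cores.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"runtime"
)

func main() {
	n := runtime.NumCPU()
	procs := make([]*exec.Cmd, 0, n)
	for i := 0; i < n; i++ {
		// --node-id feeds the per-core ID space sketched above.
		cmd := exec.Command("./dbnode", fmt.Sprintf("--node-id=%d", i))
		// Pinning each process to its own core (e.g. via taskset) would
		// also keep memory local and avoid NUMA penalties.
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		procs = append(procs, cmd)
	}
	for _, cmd := range procs {
		_ = cmd.Wait()
	}
}
```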
The only major design change from one node per server/process is that your replicas need to be aware of node placement, so they end up on separate hardware. You might even take this a level further, so you can make sure replicas are on separate racks. Some sort of "availability group" concept might be an easy way to wrap it up.
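Something like the following is what I have in mind for the placement rule, assuming each node advertises its host and rack; the Node struct and field names are invented for the example, not your actual API. The point is just that the replica picker treats the rack (the "availability group") as the unit of failure, not the process.

```go
// Hypothetical placement sketch: pick replica targets so that no two
// replicas share a rack (availability group).
package main

import "fmt"

type Node struct {
	ID   int
	Host string // physical server
	Rack string // availability group
}

// pickReplicas returns up to want nodes, each from a distinct rack.
func pickReplicas(nodes []Node, want int) []Node {
	usedRacks := map[string]bool{}
	var out []Node
	for _, n := range nodes {
		if usedRacks[n.Rack] {
			continue // already have a replica in this failure domain
		}
		usedRacks[n.Rack] = true
		out = append(out, n)
		if len(out) == want {
			break
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{1, "srv-a", "rack1"}, {2, "srv-a", "rack1"},
		{3, "srv-b", "rack2"}, {4, "srv-c", "rack3"},
	}
	fmt.Println(pickReplicas(nodes, 3)) // one replica per rack
}
```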
Also: your docs page clearly says 2PC was chosen (it's in the footnote). Maybe I'm misreading what "basis" means.