
Hi, Leif. It's not hard to hit the throughput limit of a single log device, even on a fast array. I've done it on Sybase, WebSphere MQ, Oracle, and MySQL--enough platforms that I assume it's the general case. The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays. Now imagine getting rid of the transaction log entirely--the entire code path. That will be faster than even a transaction log write to a memory-backed filesystem.
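To make the code-path point concrete, here's a minimal sketch in Python (purely illustrative; apply_to_tables just stands in for whatever the engine does to its in-memory structures):

    import os

    # Durable commit: append the serialized change to the log, then fsync.
    # Every commit pays the syscall plus a round trip to the device/controller.
    def commit_with_log(log_fd, record, apply_to_tables):
        os.write(log_fd, record)   # append to the tail of the log file
        os.fsync(log_fd)           # wait for the storage to acknowledge it
        apply_to_tables(record)    # then update the in-memory structures

    # No-log commit: the log code path is gone entirely; the commit is just
    # the in-memory update (durability handled elsewhere, e.g. replication).
    def commit_without_log(record, apply_to_tables):
        apply_to_tables(record)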

But I agree that other write (and read) activity going on in the background and foreground also limits performance--and in fact, I've seen the index write bottleneck you describe in real life more often than a simple transaction log bottleneck. So you're correct.

I've read about Toku, but I really doubt that it writes to disk faster than writing to memory. Are you really trying to say that?

I think it would be great for InfiniSQL to be adapted to disk-backed storage in addition to memory. The horizontal scalability would carry over, making for a very large group of fast disk-backed nodes.

I think your input is good.



I'm not talking about the speed of the array, but rather about a battery-backed controller. With one of those, as long as you stay under the sequential write bandwidth of the underlying array, it pretty much doesn't matter how often you fsync the log, so that bottleneck (commit frequency) goes away.
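If you want to check that on your own hardware, a crude probe is just to time small synced appends to a file on the device backing the log. Rough Python sketch (the file name and count are arbitrary):

    import os, time

    # Append 4 KB and fsync, N times, and report the average latency.
    # With a battery-backed write cache the fsync returns once the block is
    # in the controller's RAM, not once it reaches the spindles.
    N = 10_000
    buf = b"\0" * 4096
    fd = os.open("logtest.bin", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    start = time.perf_counter()
    for _ in range(N):
        os.write(fd, buf)
        os.fsync(fd)
    print(f"{(time.perf_counter() - start) / N * 1e6:.0f} us per synced 4KB append")
    os.close(fd)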

If you're planning to write data faster than disk bandwidth, then you have no hope of being durable, and we're talking about problems too different to be worth comparing; in that case I retract my comment.

I don't understand what distinction you're trying to make between the "array itself" and the "log file has a limit to how many blocks can be appended". Can you clarify what limit you're talking about?


Well, array = a battery-backed controller with a bunch of disks hanging off of it. Actually, there is latency associated with every sync. What I've seen on HDS arrays with Linux boxes and 4Gb Fibre Channel adapters is about 200us per 4KB block write. That is very, very good for disk. It's also slower than memory access by many orders of magnitude. This was about 3 years ago. Things are bound to be faster by now, but still not as fast as RAM.
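Back-of-the-envelope, that latency puts a hard ceiling on a single serial committer (rough numbers, just arithmetic):

    # 200 us per synced 4 KB append, commits issued one at a time:
    sync_latency = 200e-6                            # seconds per fsync
    commits_per_sec = 1 / sync_latency               # ~5,000 commits/s
    log_mb_per_sec = commits_per_sec * 4096 / 2**20  # ~20 MB/s of log
    # A cached DRAM access is on the order of 100 ns, so each synced append
    # is roughly three orders of magnitude slower than touching memory.
    print(commits_per_sec, log_mb_per_sec)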

I don't think it's unreasonable to want to write faster than an I/O subsystem can handle. Maybe it's not for every workload, but that doesn't mean it's for no workload.

The distinction I wasn't making clearly is that the storage array (the thing with the RAID controller, cache, and disks hanging off of it) is not saturated by a single transaction log file being continuously appended to, but that headroom in the array doesn't translate into more throughput for the transaction log. I don't know whether it's an important distinction.


>The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays

Yes, the issue usually isn't transaction log append speed. Instead, all too often the log is configured too small. A log file switch causes a flush of the accumulated modified data blocks of tables and indexes [a buffer cache flush, in Oracle parlance] from RAM to disk. With a small log file, the flush happens too frequently and for too little modified data; this is where the random I/O the GP mentioned bites you in the neck.
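A toy model of the effect (Python; the numbers are made up, the point is only how log file size sets the flush frequency):

    # Redo is generated at a roughly constant rate; each time the current
    # log file fills, the switch forces a checkpoint that flushes the dirty
    # table/index blocks accumulated since the previous one.
    def checkpoints_per_hour(redo_mb_per_sec, log_file_mb):
        seconds_between_switches = log_file_mb / redo_mb_per_sec
        return 3600 / seconds_between_switches

    print(checkpoints_per_hour(5, 64))    # ~281 flushes/hour with 64 MB logs
    print(checkpoints_per_hour(5, 2048))  # ~9 flushes/hour with 2 GB logs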


I think you're talking about an insert buffer, not a transaction log, and in that case, no matter how big your insert buffer is, it will eventually saturate and you'll end up hitting the performance cliff of the B-tree. You really need better data structures (like fractal trees or LSM trees) to get past it.
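For illustration, here's a crude LSM-style write path in Python (not how TokuDB's fractal trees actually work, just the general buffer-then-write-sequentially idea; the names are made up):

    import bisect

    # Random-key inserts land in a sorted in-memory buffer; when it fills,
    # the whole batch goes out as one sequential write (a "run") instead of
    # dirtying one B-tree leaf per insert. Runs get merged later.
    class BufferedIndex:
        def __init__(self, flush_threshold=100_000):
            self.buffer = []          # sorted (key, value) pairs in memory
            self.flush_threshold = flush_threshold
            self.runs = []            # sorted runs already written out

        def insert(self, key, value):
            bisect.insort(self.buffer, (key, value))
            if len(self.buffer) >= self.flush_threshold:
                self.runs.append(self.buffer)   # one big sequential write
                self.buffer = []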


No, I'm talking about the transaction log ("redo log" in Oracle parlance). Switching log files causes a checkpoint (i.e., a flush; that's when the index data blocks changed by the inserts you mention finally hit the disk).

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUES...

MS SQL Server and DB2 have similar behavior.


Oh, OK, now I see what you're saying; it's still similar to an insert buffer in that case. B-tree behavior is still to blame: making the log file bigger lets you soak up more writes before you need to checkpoint, but eventually you'll either have even longer checkpoints or you'll run out of memory before you get to checkpoint.

We also checkpoint before trimming the log, but our checkpoints are a lot smaller because of write optimization.


>even longer checkpoints

Yes, that is the point: one big flush, instead of many small ones, takes the same or (usually) less time than the cumulative time of the small flushes, because of I/O ordering and the probability of several writes hitting the same data block.
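A tiny sketch of why, counting block writes (Python; the block IDs are made up):

    # Each small flush writes its own dirty set, so a block that is dirtied
    # in every interval gets written once per flush.
    def many_small_flushes(batches):
        return sum(len(set(batch)) for batch in batches)

    # A deferred flush coalesces across intervals and can issue the
    # surviving blocks in on-disk order (mostly sequential I/O).
    def one_big_flush(batches):
        return len({block for batch in batches for block in batch})

    # One hot block (5000) plus a sliding range of cold blocks per interval.
    batches = [[5000, 5000, i, i + 1] for i in range(1000)]
    print(many_small_flushes(batches))   # 3000 block writes
    print(one_big_flush(batches))        # 1002 block writes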



