
Hi, Leif. It's not hard to hit the throughput limit of a single log device, even on a fast array. I've done it on Sybase, WebSphere MQ, Oracle, and MySQL--enough platforms that I assume it's the general case. The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays. Now imagine getting rid of the transaction log entirely--the entire code path. That will be faster than even a transaction log write to a memory-backed filesystem.
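To make the code-path point concrete, here's a minimal sketch in Python (purely illustrative; apply_to_tables just stands in for whatever the engine does to its in-memory structures):

    import os

    # Durable commit: append the serialized change to the log, then fsync.
    # Every commit pays the syscall plus a round trip to the device/controller.
    def commit_with_log(log_fd, record, apply_to_tables):
        os.write(log_fd, record)   # append to the tail of the log file
        os.fsync(log_fd)           # wait for the storage to acknowledge it
        apply_to_tables(record)    # then update the in-memory structures

    # No-log commit: the log code path is gone entirely; the commit is just
    # the in-memory update (durability handled elsewhere, e.g. replication).
    def commit_without_log(record, apply_to_tables):
        apply_to_tables(record)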

But I agree that other write (and read) activity going on in the background and foreground also limits performance--and in fact, I've seen the index write bottleneck you describe in real life more often than a simple transaction log bottleneck. So you're correct.

I've read about Toku, but I really doubt that it writes to disk faster than writing to memory. Are you really trying to say that?

I think it would be great for InfiniSQL to be adapted to disk-backed storage in addition to memory. The horizontal scalability would carry over, making for a very large group of fast disk-backed nodes.

I think your input is good.



I'm not talking about the speed of the array, but rather about a battery-backed controller. With one of those, as long as you stay under the sequential write bandwidth of the underlying array, it pretty much doesn't matter how often you fsync the log, so that bottleneck (commit frequency) goes away.
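If you want to check that on your own hardware, a crude probe is just to time small synced appends to a file on the device backing the log. Rough Python sketch (the file name and count are arbitrary):

    import os, time

    # Append 4 KB and fsync, N times, and report the average latency.
    # With a battery-backed write cache the fsync returns once the block is
    # in the controller's RAM, not once it reaches the spindles.
    N = 10_000
    buf = b"\0" * 4096
    fd = os.open("logtest.bin", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    start = time.perf_counter()
    for _ in range(N):
        os.write(fd, buf)
        os.fsync(fd)
    print(f"{(time.perf_counter() - start) / N * 1e6:.0f} us per synced 4KB append")
    os.close(fd)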

If you're planning to write data faster than disk bandwidth, then you have no hope of being durable, and we're talking about problems too different to be worth comparing; in that case I retract my comment.

I don't understand what distinction you're trying to make between the "array itself" and the "log file has a limit to how many blocks can be appended". Can you clarify what limit you're talking about?


Well, array = a battery-backed controller with a bunch of disks hanging off of it. Actually, there is latency associated with every sync. What I've seen on HDS arrays with Linux boxes and 4Gb Fibre Channel adapters is about 200us per 4KB block write. That is very, very good for disk. It's also slower than memory access by many orders of magnitude. This was about 3 years ago. Things are bound to be faster by now, but still not as fast as RAM.
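Back-of-the-envelope, that latency puts a hard ceiling on a single serial committer (rough numbers, just arithmetic):

    # 200 us per synced 4 KB append, commits issued one at a time:
    sync_latency = 200e-6                            # seconds per fsync
    commits_per_sec = 1 / sync_latency               # ~5,000 commits/s
    log_mb_per_sec = commits_per_sec * 4096 / 2**20  # ~20 MB/s of log
    # A cached DRAM access is on the order of 100 ns, so each synced append
    # is roughly three orders of magnitude slower than touching memory.
    print(commits_per_sec, log_mb_per_sec)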

I don't think it's unreasonable to want to write faster than an I/O subsystem can handle. Maybe it's not for every workload, but that doesn't mean it's for no workload.

The distinction I wasn't making clearly is that the storage array (the thing with the RAID controller, cache, and disks hanging off of it) is not saturated by a single transaction log file being continuously appended to, but that headroom in the array doesn't translate into more throughput for the transaction log. I don't know whether it's an important distinction.


>The log writes don't saturate the array itself--but the log file has a limit to how many blocks can be appended--even on fast arrays

Yes, the issue usually isn't transaction log append speed. Instead, all too often the log is configured too small. A log file switch causes a flush of the accumulated modified data blocks of tables and indexes [a buffer cache flush, in Oracle parlance] from RAM to disk. With a small log file, the flush happens too frequently and for too little modified data; this is where the random I/O the GP mentioned bites you in the neck.
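A toy model of the effect (Python; the numbers are made up, the point is only how log file size sets the flush frequency):

    # Redo is generated at a roughly constant rate; each time the current
    # log file fills, the switch forces a checkpoint that flushes the dirty
    # table/index blocks accumulated since the previous one.
    def checkpoints_per_hour(redo_mb_per_sec, log_file_mb):
        seconds_between_switches = log_file_mb / redo_mb_per_sec
        return 3600 / seconds_between_switches

    print(checkpoints_per_hour(5, 64))    # ~281 flushes/hour with 64 MB logs
    print(checkpoints_per_hour(5, 2048))  # ~9 flushes/hour with 2 GB logs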


I think you're talking about an insert buffer, not a transaction log, and in that case, no matter how big your insert buffer is, it will eventually saturate and you'll end up hitting the performance cliff of the B-tree. You really need better data structures (like fractal trees or LSM trees) to get past it.
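For illustration, here's a crude LSM-style write path in Python (not how TokuDB's fractal trees actually work, just the general buffer-then-write-sequentially idea; the names are made up):

    import bisect

    # Random-key inserts land in a sorted in-memory buffer; when it fills,
    # the whole batch goes out as one sequential write (a "run") instead of
    # dirtying one B-tree leaf per insert. Runs get merged later.
    class BufferedIndex:
        def __init__(self, flush_threshold=100_000):
            self.buffer = []          # sorted (key, value) pairs in memory
            self.flush_threshold = flush_threshold
            self.runs = []            # sorted runs already written out

        def insert(self, key, value):
            bisect.insort(self.buffer, (key, value))
            if len(self.buffer) >= self.flush_threshold:
                self.runs.append(self.buffer)   # one big sequential write
                self.buffer = []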


No, I'm talking about the transaction log ("redo log" in Oracle parlance). Switching log files causes a checkpoint (i.e., a flush; that's when the index data blocks changed by the inserts you mention finally hit the disk).

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUES...

MS SQL Server and DB2 have similar behavior.


Oh, OK, now I see what you're saying; it's still similar to an insert buffer in that case. B-tree behavior is still to blame: making the log file bigger lets you soak up more writes before you need to checkpoint, but eventually you'll either have even longer checkpoints or you'll run out of memory before you get to checkpoint.

We also checkpoint before trimming the log, but our checkpoints are a lot smaller because of write optimization.


>even longer checkpoints

Yes, that is the point: one big flush, instead of many small ones, takes the same or (usually) less time than the cumulative time of the small flushes, because of I/O ordering and the probability of several writes hitting the same data block.
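A tiny sketch of why, counting block writes (Python; the block IDs are made up):

    # Each small flush writes its own dirty set, so a block that is dirtied
    # in every interval gets written once per flush.
    def many_small_flushes(batches):
        return sum(len(set(batch)) for batch in batches)

    # A deferred flush coalesces across intervals and can issue the
    # surviving blocks in on-disk order (mostly sequential I/O).
    def one_big_flush(batches):
        return len({block for batch in batches for block in batch})

    # One hot block (5000) plus a sliding range of cold blocks per interval.
    batches = [[5000, 5000, i, i + 1] for i in range(1000)]
    print(many_small_flushes(batches))   # 3000 block writes
    print(one_big_flush(batches))        # 1002 block writes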



