A Ceph war story (michael-prokop.at)
134 points by pabs3 on April 10, 2021 | hide | past | favorite | 52 comments


Lesson 1: Never ever reboot multiple Ceph nodes without checking that Ceph is happy between reboots. This failure happened early during boot, and it could have been handled with no downtime if they had checked each rebooted node before rebooting the next one.
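The gate Lesson 1 describes can be scripted: refuse to touch the next node until `ceph status` reports HEALTH_OK again. A minimal sketch, assuming the JSON layout of `ceph status --format json` on recent releases (overall health under `health.status`); verify the field names against your version:

```python
import json
import subprocess
import time

def cluster_healthy(status_json: str) -> bool:
    """Return True only if Ceph reports HEALTH_OK (not HEALTH_WARN/ERR)."""
    status = json.loads(status_json)
    return status.get("health", {}).get("status") == "HEALTH_OK"

def wait_for_health_ok(timeout=900, interval=10):
    """Block until the cluster is healthy again; raise if it never recovers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Assumes `ceph` CLI is present and the keyring allows status queries.
        out = subprocess.run(["ceph", "status", "--format", "json"],
                             capture_output=True, text=True).stdout
        if cluster_healthy(out):
            return
        time.sleep(interval)
    raise TimeoutError("Ceph never reached HEALTH_OK; do NOT reboot the next node")
```

Calling `wait_for_health_ok()` between node reboots is the whole trick: the loop only returns once the cluster has fully recovered.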

Lesson 2: Avoid using RAID controllers except for the most simple "pass through" mode.

Lesson 3: XFS+Ceph never really worked out. BlueStore solved so many problems by just removing the XFS dependency for the actual data. Recommended reading: https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf

ceph-volume finally fully removed the dependency on file systems. Yeah, the LVM mess is sometimes annoying and early versions of ceph-volume had many problems, but nowadays I wouldn't want ceph-disk back.


>Lesson 3: XFS+Ceph never really worked out. BlueStore solved so many problems by just removing the XFS dependency for the actual data. Recommended reading: https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf

This gave me pause. My kube nodes do use XFS in some cases, but Ceph uses raw block devices. So XFS is only used for system files, not for Ceph data, except of course to store the Ceph config on each node.

So I assume I'm safe. I'm not entirely sure how you'd use XFS with Ceph anyway, since Ceph takes a raw device file and formats it for its own storage.


Ceph OSD has two different storage backends:

- Filestore is the legacy backend that uses files on a filesystem (strongly recommended to be XFS)

- Bluestore is the modern backend that uses raw device files directly
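If you're unsure which backend a cluster's OSDs are on, `ceph osd metadata` reports it per OSD in the `osd_objectstore` field. A small sketch of summarizing that JSON; the field name matches recent releases, but treat the exact output shape as an assumption:

```python
import json
from collections import Counter

def backend_summary(metadata_json: str) -> dict:
    """Count OSDs per objectstore backend from `ceph osd metadata` JSON output."""
    osds = json.loads(metadata_json)
    return dict(Counter(osd.get("osd_objectstore", "unknown") for osd in osds))

# Usage (hypothetical output):
#   summary = backend_summary(subprocess.check_output(
#       ["ceph", "osd", "metadata", "--format", "json"], text=True))
#   -> e.g. {"bluestore": 10, "filestore": 2}
```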


From the linked PDF: For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code.


> Lesson 2: Avoid using RAID controllers except for the most simple "pass through" mode.

Well, RAID 1 works as well, if you can take the performance hit. Most RAID 1 drives behind a RAID controller can be read outside of the RAID. We actually had to do that once: one of our customers thought it was a good idea to keep a running server with production data and no backup temporarily near an open window (we exclude our liability in such cases and monitor whether backups are created), so we needed to recover the data. That worked by using another controller in pass-through mode and running a single disk (the other disk was destroyed, as was the RAID controller). By the way, rainwater damages a server, especially if you don't notice it for 30 minutes and a full bucket of water ends up inside it. Kudos to Dell: the server kept running for 25 minutes while full of water until it died. (We transported the server and it still had water in it...)


Interesting read and helpful lesson list.

I've only used Ceph as provided to me by others, and have considered setting it up in some instances. I didn't know about the development of BlueStore; it does seem much simpler. The choice between XFS, btrfs, and ext4 always seemed a bit unclear (except that I had experienced non-Ceph troubles with btrfs).

Note to self: use ceph-volume/BlueStore.


ceph-volume still relies on LVM, which brings unnecessary complexity.

We'd like to stick with ceph-disk (already unavailable in the P release) using raw block devices only.


ceph-disk relies on partitions (sometimes with magic type IDs) and a stub XFS filesystem, which is more complexity than ceph-volume.

Really, ceph-volume is better. You create an LVM PV/VG/LV (which is completely standard, well supported Linux stuff) on your OSD drive and then pass it to ceph-volume. It puts the OSD metadata in LVM metadata (no stub partition! No XFS!), and the actual OSD directory just gets mounted as a tmpfs and populated from that data. Only one LV for the BlueStore block device. It all just works, and is much easier to reason about than the partitioning stuff with ceph-disk.

Plus you can play around with multiple OSDs on the same device, or OSDs plus system volumes, or RAID members, or anything. I used to have to do some horrible stuff to get somewhat "interesting" Ceph setups with e.g. a system volume on a small RAID next to the OSDs on the same disks, with ceph-disk. All that just works without any confusion with ceph-volume, just make more LVs. Bog standard stuff.


This is also a story about how complexity is at odds with availability on many levels. Ceph, the fancy RAID controller, and XFS are stacked building blocks to get more 9's of availability in the model where the enemy is hardware failures, but make it harder to understand the whole system well enough so you know you can operate, troubleshoot & recover it.


> fancy RAID controller

I simply don't understand why people use hardware RAID controllers. Anything above JBOD is asking for disaster.

Hardware RAID controllers always cause problems--whether due to being a throughput bottleneck or being a pile of bad firmware/hardware bugs.

The whole point of Ceph is to use commodity hardware and be reliable. Either Ceph works and things are reliable or it doesn't and you need to put it in the trash and get something that does. If Ceph doesn't really work, RAID hardware is just adding an extra failure point that adds nothing.


They state "The disks were attached as JBOD devices to a ServeRAID M5210 controller (with a stripe size of 512)". I interpret the stripe size to mean a CEPH stripe size.

So the fancy RAID controller can bite you in the ass even if you try to lock its risky functionality away in a closet.

Maybe it's hard to buy name-brand server hardware with lots of disk bays and a safely dumb controller?


Ah, good catch, I somehow missed that they had it in "JBOD" mode.

As you point out, though, it doesn't seem like it's "really" JBOD mode. It seems like that card is bonding the disks together somehow into a larger logical "stripe". Weird.


JBOD can also mean "raid 0 with different sized disks". It's a great way to lose all your data!


They weren't attached as JBOD: JBOD disks have no stripe size. The firmware changelog talks about VDs, which implies this was a RAID volume.

So I'm pretty sure what they did is the usual trick of a 1-disk "RAID0" to get "fake JBOD" mode, which is the only way to do it on crappy MegaRAID cards with old firmware without JBOD passthrough. Except, as they found out, this involves all the mess of RAID code in the controller, leading to bugs and other problems.

That controller does supposedly support proper JBOD passthrough mode though, so this was a configuration mistake.

Either way, the real lesson here is that RAID cards largely suck, and you want to stay as far away from any RAID features as possible if you're trying to run disks for object storage. For these MegaRAID cards, the best option is to flash IT (HBA) firmware if you can, or IR (HBA with "simple" RAID support) firmware. If you have to use MegaRAID firmware (the full fat RAID thing which uses a completely different driver), then get a version that supports true JBOD passthrough. Only use the RAID0 trick if you really, really have no other option.


RAID controllers are great if you're using them to do the thing that they do. If you are going to host a software defined storage solution on top, then the thing RAID controllers do for you (abstract the hardware storage devices into a block volume device) becomes counter productive.

Most modern array controllers can be switched to HBA mode, which disables the RAID bits entirely and passes SAS/SATA commands through directly to the device. The article doesn't say if this was done, but based on the description of events, it wasn't.

Running software defined storage on top of hardware RAID functionality is begging for trouble.


> I simply don't understand why people use hardware RAID controllers. Anything above JBOD is asking for disaster.

1) Inertia. 2) It's a simpler way to get up and running for people who are not skilled with the CLI or other software RAID tools (HW RAID controllers have a nice BIOS setup). 3) The HW RAID BIOS setup always boots and is accessible even if the OS can't boot. 4) HW RAID stays the same for decades, so software updates are less likely to break something.


I got a hardware RAID controller at home many years ago, and I had a failure. I couldn't get to anything without buying an identical controller. I swore I'd never do that again.


Linux md-raid supports LSI MegaRAID drives just fine.

https://askubuntu.com/questions/1310586/megaraid-lsi-raid-to...


The fancy RAID controller just shouldn't be there - there's no reason to use it unless you're stuck with it. Sometimes you can flash these LSI controllers with HBA/simple-RAID firmware ("IT" or "IR") to fix this.

But really, this had little to do with Ceph (other than the ops mistake of rebooting several machines at once without waiting for an all clear, which is just a bad idea all around when you're upgrading a cluster of any kind). It was an interaction between XFS (not used on modern ceph-volume systems), the kernel, and the RAID controller.

I find Ceph to actually be a lot more introspectable than "typical" multi disk storage management filesystems like btrfs and zfs. On those, if something goes wrong, you're often left with a corrupted or unmountable filesystem, or worse, kernel panics or errors. It's all a monolith in the kernel and hard to fix anything. On Ceph, you can dig through the layers, and things are split into separate daemons that can be debugged separately.

For example, want to see where the data of a CephFS file is stored? Easy: the inode number is the first part of the object name in the data pool. Append the block number after that and you have your object name. You can just fetch that using the rados tool, bypassing CephFS entirely. Want to go deeper? Ask it to hash the object name to a PG ID. Use `ceph pg dump` to figure out what OSDs that pg lives in. Now you know what disks have your data. OSDs broken? Use the objectstore tool to read the data directly without booting an OSD.
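The mapping described here is mechanical: CephFS data-pool objects are named with the file's inode number in hex, a dot, then the block number as eight hex digits. A tiny illustrative sketch; the zero-padded suffix is what `rados ls` shows in my experience, but verify the formatting on your own cluster:

```python
def cephfs_object_name(inode: int, block: int) -> str:
    """Build the RADOS object name for one block of a CephFS file:
    inode number in lowercase hex, a dot, block number as 8 hex digits."""
    return f"{inode:x}.{block:08x}"

# Then inspect it directly, bypassing CephFS (pool name is an example):
#   rados -p cephfs_data stat <object_name>
#   ceph osd map cephfs_data <object_name>   # shows the PG and the OSDs holding it
```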

Obviously you need to learn all this stuff, but the tooling is there and this is way more introspectable than a bunch of magic structures in a filesystem.


Completely tangential but I had never heard of Ceph. Went to the website [1], literally no info as to what it does. However I can download, read tweets, or do some training. It's a shame because the documentation page [2] has an amazing, concise and precise description of what it's doing :

Ceph uniquely delivers object, block, and file storage in one unified system.

It takes one line, why is it not there?

[1] https://ceph.io/ [2] https://docs.ceph.com/en/latest/


To be fair, it does say:

> Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.

although it's annoyingly hidden away in a carousel. Carousels - not even once.


Oh wow indeed. In third place out of 4


Apart from the NTP tangent, this sounds like a Linux XFS / ServeRAID M5210 firmware issue. Your XFS filesystems created using the incorrect block/io sizes reported by the RAID controller would have been unmountable on the newer Linux kernel regardless of Ceph.

Lesson learned: your configuration management also needs to control for firmware versions such that the same issue would have shown up in a dev/test environment before turning into a prod nightmare :/


(Author here) Yes, it was an XFS/controller issue, but Ceph reported the failure. :) (IMO it wasn't really a good decision from Ceph to use 100MB XFS partitions as a kind of database, but nowadays ceph-disk (which uses those XFS partitions) is gone, and instead ceph-volume uses a different approach via LVM.)

Regarding configuration management/firmware version: yes - especially, as you'd need to also rebuild disks in the dev/test environment with the identical configuration (firmware, disks,...), to ensure it's actually identical. And even if we neglect load/capacity/usage issues (problems might show up only under specific work loads), there are also further "invisible" layers/components like cables, NICs, switches,… and their firmware versions which are also relevant. Not exactly trivial. :)


> IMO it wasn't really a good decision from Ceph to use 100MB XFS partitions as a kind of database

It has been shown that you are right, but not because of bugs like the one you encountered. The problem could just as well have happened with a regular XFS filesystem holding a maildir.


The kernel is somewhat at fault here. Why does the new kernel refuse to mount an XFS partition just because some metadata is not right? If stricter checks are introduced, there should be a way to bypass them if needed. Reverting to a previous kernel just to be able to move forward with repairs is not OK.


Excellent writeup and impressive analysis! Are many people using Ceph and what are your experiences like?


I (with 2-3 other people) run over 100 PB on Ceph clusters in production, supporting some critical functions of a Fortune 50 enterprise.

We use Ceph for block and object workloads (no CephFS). Most of our clusters are still on Luminous (v12) and Filestore (XFS), and only our newer clusters that are being built now are on Nautilus (v14) Bluestore. We plan on migrating to Bluestore this year (and likely next year).

I'm on-call basically all the time, but we'll have maybe 1 issue per year where we have to act immediately. Most failures that would happen on a Saturday can wait until Monday to be acted on.

The smallest cluster that we'll build is 6 nodes. Our largest cluster now is 120 nodes with 1920 OSDs. We might build a larger cluster this year.

Back in Ceph Hammer days, I had a 6-node cluster lose 3 nodes, one node at a time, over a course of a couple of days with zero downtime.

We deploy with Ansible, and have our own, very paranoid and opinionated playbook for doing a rolling cluster reboot or restart, that checks multiple things before moving on to the next node.


Ceph user here. Ceph works fine 99.9% of the time, until it doesn't. At that point you panic a bit, start to google, ask for help on IRC, and take a look at the bug tracker. In all critical cases in the past we were able to recover, and we gained Ceph experience along the way. I guess this is how you become a Ceph expert.


Curious about Ceph: what features are you using it for? Would it make sense for serving a big number of websites stored across 20 servers, using a single Ceph filesystem with some SSD cache? Or am I better off just having 20 independent servers?


> Curious about Ceph, what feature are you using it for?

Mainly CephFS. Similar to NFS, it allows you to have a big shared filesystem over the network. If you have clients that require access to a triple-digit-terabyte or petabyte-sized filesystem, then you might want to consider Ceph.

> Or am I better off having just 20 independent servers?

If you have 20 independent websites, then I would recommend giving each its own storage, because if your Ceph cluster has downtime, all 20 websites would be down simultaneously.


Can confirm this on two 24-node / 100 TB / 132-OSD clusters over ~2 years.


I run a home ceph cluster as a hobby. It works really well. I have hit a few snags, but the mailing list is super friendly and helpful and the devs are quick to respond to bug reports.


Ceph and XFS actually had an issue for the longest time that could lock up the kernel. Ceph works fine, until it doesn't. I used Gluster and Ceph in production for a while. I faintly remember Gluster supposedly being more performant for small writes, but it was an absolute pain to deal with. Ceph is a bit better and ran stable, until it didn't.

One of the issues is that the officially recommended way to install it seems to be rook-ceph with Kubernetes. But Rook has had so many issues; many of them are fixed, but somehow all Kubernetes "cloud-native" software seems to have quirks which you just have to accept as part of the solution (and this is from someone who often pushes for Kubernetes-based solutions). I also tend to recommend that clients just spend 15k to buy a TrueNAS or something similar with iSCSI and NFS.

EDIT: the XFS issue was a hung-task issue whose fix was only added to the kernel "recently", in 5.6. Meaning it's not in Ubuntu 20.04 either.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


Reading the other comments here, I guess I'm lucky that Ceph still works fine for me; however, I am still baffled by how far it falls short of its potential performance. The last cluster I deployed is 3 nodes, each with 4x P4610 NVMe as OSDs; a 25Gbps network pushes only 20k IOPS during 4k reads at queue depth 128. A single disk is supposed to push around 600k...

> One of the issues is that the officially recommended way to install it seems to be rook-ceph with kubernetes.

I believe the recommended way is using cephadm.

I have been using ceph-ansible since Luminous and had to deal with nasty issues and quirks as well, probably caused by complexity and human error.


You are correct about cephadm being the recommended install method.

For NVMe OSDs, how many daemons are using the same drive? Normally, to get full performance you have to colocate two or four OSD daemons on a single NVMe device.

Also, Ceph benefits the most from massive parallelism: how many clients do you have? 4 OSDs is not a lot to spread the load over, even if the drives are lightning fast.

There is a new I/O pathway in the works called Crimson which should make NVMe drives more performant. I was hoping a preview would land in Pacific, but I guess it isn't close enough to being ready.


Cephadm is very new, so yes, it may be the recommended solution NOW, precisely because all previous solutions were very fragile. I don't have any experience with it for those same reasons.

Somehow new Linux filesystems take a very long time to go from initial release to reliable use, as also seen with btrfs.

Maybe filesystems are just hard, I don't know, but somehow the solutions seem very hacky.


XFS has a history of reclaim/memory management issues. It used to have a habit of blocking on I/O when cleaning memory even with plenty of clean page cache available.

For the longest time, I had issues on my home server (XFS on top of RAID at the time) with large latency hiccups, to the tune of several seconds. This was disrupting some real-time data ingestion and causing data loss. I thought it was about committing to disk, so I added big memory buffers, but no dice. I spent years with this annoying issue. I even had a kernel patch in to increase kernel-side buffers (which were not subject to this problem) to work around it. It wasn't even consistent.

Eventually one day I got sick enough of it, and sat down trying to reproduce it. I figured out that it only happened when true memory usage (including buffer cache) was ~100%; if there was truly unused memory around, things were fine, and I could evict the buffer cache and it would fix the problem until it grew to consume all free RAM again. Eventually I managed to get a stack trace of a process that was stalling even though it wasn't anywhere near writing to disk, and I found out it was stalling during a write(). To a pipe. Because the kernel had to allocate data for the pipe buffers. And it was asking XFS. And XFS decided to evict some dirty inodes, and block on that. Even with gigabytes of clean page cache available to evict. What.

Swearing ensued - here I thought I had some weird kernel/hardware issue causing latency spikes, and it was XFS all along. I eventually ripped XFS out and replaced it with ext4 and that solved the issue.

This eventually got fixed in 2019: https://lwn.net/Articles/795098/


I do not doubt that you had those problems with XFS, but I believe that they must have required certain specific conditions to manifest themselves.

I have been using XFS intensively for almost 20 years and I have never encountered any similar problems, so there must have been some differences in my setup, so that this behavior was not triggered.

While I do not know what was different in my case, possible differences might have been caused by the fact that I have always used generous quantities of DRAM in all my computers and I have never used swap in any of them.


Gluster is also much easier to set up, IMO; the only reason I wouldn't use it is that it seems to have some consistency issues, such that e.g. you could run Postgres on Ceph, but not (safely) on Gluster. But I don't understand it well enough to know why.


Anybody got experience comparing Ceph vs Gluster vs Lustre, etc.? I am interested in simplicity and resilience in case of a node outage. Are any of these close to set-and-forget?


I was a Gluster developer for about ten years, including time as a project maintainer and as part of the team responsible for the largest deployment in the world. I also worked professionally with Lustre for a couple of years, and have some familiarity with Ceph.

Ceph and Lustre are both largely "set and forget" for the object-storage nodes which are the most numerous type. On the other hand, both have separate metadata servers which can be much more troublesome if/when one fails. Gluster only has one kind of node. One of the main issues here is: when do you actually give up on a node and start regenerating its data somewhere else? Starting that process and then having the node actually come back can be anywhere from a waste of time to a total disaster, so you don't want to be too "twitchy" about it, but you also don't want to run in a degraded state forever. In Gluster at least, the approach was to assume that a node's coming back unless/until explicitly told otherwise (possibly by external automation).

The larger point IMO is that no distributed storage system is "set and forget" at any significant scale. At the very least you'll want robust monitoring (which to some extent has to be built into the code) and alerting, and somebody to respond to the more serious alerts. I'd give a nod to Ceph in this respect. Even better is to have your own site-specific automation for common tasks like capacity addition and upgrades. Even if the storage system itself is doing everything "right" it can get pretty messy if that's not happening in sync with other systems such as provisioning and service discovery - not to mention the systems actually using the storage.

Also, even though the "POSIX is dead" folks are my sworn enemies, it's still true that an ever-decreasing number of workloads actually require an actual filesystem and its associated complexity. Not zero, probably never zero, but smaller every day. If you reasonably can get away with deploying a simpler kind of storage, I recommend it.


Lustre is not aligned at all with your requirements, so forget that one.

Ceph is much more complex than Gluster, but also more capable.

Honestly unless you are dealing with hundreds of TB of storage (and therefore need multiple servers anyway), I expect the complexity any distributed file-system adds is going to be detrimental to uptime and stability more often than it provides extra resilience. Use a single box with ZFS if you can, and add Gluster on top only if it can't be avoided.


And even if you need hundreds of TB of storage, in many cases it might be better to just shard over normal storage and use your backups in case of failures. Current disks are big, so you don't need too many for e.g. 200 TB.


I have mildly bad experience with Gluster. I never lost data, but the maintenance load was very noticeable! Ceph will rebalance by itself when adding (or removing) disks; that alone reduces maintenance by a huge amount. For storage clusters I don't care about as much, it's also nice that when a disk fails, Ceph notices and reacts automatically. Replacing the disk is then something you ought to do at some point, but you don't have to do it right away.

I am now running some Rook-managed Ceph clusters that make even the Ceph infrastructure nodes automatically recoverable, but I haven't tested that enough yet.


I only have experience with Ceph, and would argue that if you set up a cluster with version 15.x right now, it is pretty close to set-and-forget.


I feel the author's pain. I've been using Rook/Ceph for the past year. We had some miserable times because of various Rook issues, especially the garbage-collection bug that deleted more than it should have. For the past 3 months, though, I almost forgot we are using it: no issues at all, touch wood.


If I have 100 clients (laptops) with 100GB available on each laptop's disk, is Ceph a suitable method of making this space available as a large network volume?

Edit: I realize that it's only 10TB and probably not worth the hassle.


No! Not at all. You have to be able to control the uptime of the individual nodes holistically, and that just does not work with Ceph. I can't imagine that there are storage systems that work with that setup.


Why didn't they just roll back to the old kernel before doing anything else?


(Author here) Because at that time it was still absolutely unclear that an older kernel version could mount the XFS partition while the newer kernel versions could not (this only came up later, during the post-mortem/RCA). Furthermore, the clock skew and mon_host issues gave us a wrong picture of the situation. (Also, as the hosts run as hypervisor systems, the kernel version should roughly match the environment (Proxmox/KVM), so there might be other, unclear risks in running such a setup.)


A few things jump out to me after skimming the article:

  1. It appears that Ceph/XFS is pretty complex, and that it doesn't have the best discoverability.
  2. Distributed storage in general feels like it could cause a lot of problems unless you have someone with deep knowledge of it.
The dashboard seems pretty handy for getting an overview of the overall state of the cluster, but the error messages in the logs feel really low-level and not that actionable on their own (unless you are really familiar with XFS and Ceph). Without such familiarity, I'd probably be stuck for a long time.

On an unrelated note, the command at the end of the article would cause me problems because I couldn't tell what it does at a glance (though maybe that's just because I'd prefer longer scripts to long one-liners). Is there a reason why people don't store comments with every command, so that they remember what it does after a few months?

For example:

  # Check whether XFS mount points are affected by a problem in the 4.18-rc1 kernel version:
  # https://bugzilla.kernel.org/show_bug.cgi?id=202127
  # Feeds each current XFS mount into xfs_info to figure out whether the sunit and
  # swidth values are inconsistent (sunit > swidth means the mount is impacted).
  awk '$3 == "xfs"{print $2}' /proc/self/mounts | while read mount ; do
    echo -n "$mount "
    xfs_info $mount \
      | awk '$0 ~ "swidth"{gsub(/.*=/,"",$2); gsub(/.*=/,"",$3); print $2,$3}' \
      | awk '{ if ($1 > $2) print "impacted"; else print "OK" }'
  done
Or do you eventually just get used to "decoding" what the awk and gsub invocations do?
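For what it's worth, the check reads more easily when the sunit/swidth comparison is pulled out of awk entirely. A rough Python equivalent of the one-liner's core logic, assuming `xfs_info` output containing a `sunit=N  swidth=M` pair as in the article (a mount is impacted when sunit > swidth):

```python
import re
import subprocess

def is_impacted(xfs_info_output: str) -> bool:
    """Extract sunit/swidth from `xfs_info` output; sunit > swidth
    indicates the bogus geometry reported by the RAID controller."""
    m = re.search(r"sunit=(\d+)\s+swidth=(\d+)", xfs_info_output)
    if not m:
        raise ValueError("no sunit/swidth values found")
    sunit, swidth = int(m.group(1)), int(m.group(2))
    return sunit > swidth

def check_mount(mountpoint: str) -> str:
    """Run xfs_info on one mountpoint and classify it, like the one-liner does."""
    out = subprocess.run(["xfs_info", mountpoint],
                         capture_output=True, text=True).stdout
    return "impacted" if is_impacted(out) else "OK"
```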

Also, I bet that Ceph is good in an enterprise setting where you really need to store a lot of data, but what about smaller and simpler distributed file systems? I feel like something is missing between just having some servers with local storage (or maybe NFS) and full-blown solutions like Ceph. I have heard about GlusterFS, LizardFS, MooseFS, SeaweedFS and others, yet all of them seem noticeably more complicated than setting up a Docker Swarm cluster would be (essentially just telling a bunch of nodes to communicate amongst themselves with one CLI command per node, and letting them sort the rest out themselves).

Plus, I've heard that even some file systems can have a noticeable impact on the resource consumption of the server (though I just searched and can't find anything concrete on DuckDuckGo; even the Wikipedia page I used doesn't seem to have any good recommendations for CPU/RAM resources: https://en.wikipedia.org/wiki/Comparison_of_distributed_file... ), so setting up clusters like that doesn't feel like something a person would do in their small homelab with a few old Athlon processors. :(

Of course, it may be that distributed file storage is just inherently more complex and demanding than making a few servers talk amongst themselves and achieve consensus (the Docker Swarm example above). Any suggestions/opinions?




