As an aside, I'm always impressed by Discord's engineering articles. They are incredibly pragmatic, typically using commonly available OSS to solve big problems. If this were another unicorn company, they would have instead written a custom disk controller in Rust, given it a Greek name, and done several major conference talks on their unique innovation.
> Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.
> The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script.
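For concreteness, the Taco Bell answer might look like this (a sketch; the `crawl` wrapper name is mine, and it assumes a urls.txt seed file with one URL per line and no error handling yet):

```shell
# Minimal "Taco Bell" crawler: fan the URL list out to 8 parallel
# wget workers, saving each fetched page under pages/.
crawl() {
  mkdir -p pages
  xargs -a urls.txt -n 1 -P 8 wget -q -P pages/
}
```

`split` on urls.txt plus `rsync` of pages/ would then be the "distributed" part alluded to above.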
I generally agree, but it's probably only 10 lines if you assume you never have to deal with any errors.
It's not flaky at all; it's just that most people don't write their bash (or other shell) to catch errors, retry on failure, and so on.
I will 100% agree that it has disadvantages, but it's unfair to level the above at shell scripts; most of your complaint is about poorly coded shell scripts.
An example? sysvinit is a few C programs, with everything else wrapped in bash or sh. It's far more reliable than systemd has ever been, with far better error checking.
Part of this is simplicity. 100 lines of code is better than 10k lines. The value of having the whole scope on one page can't be overstated for debugging and comprehension, and it makes error checking easier too.
Can I, with off-the-shelf OSS tooling, easily trace that code that’s “just wget and xargs”, emit metrics and traces to collectors, differentiate between all the possible network and HTTP failures, retry individual requests with backoff and jitter, allow individual threads to fail and retry them without borking the root program, write the results to a datastore in an idempotent way, and allow a junior developer to contribute to it with little ramp-up?
It’s not about “can bash do it”; it’s about “is there a huge ecosystem of tools, which we are probably already using in our organization, that thoroughly covers all these issues”.
The Unix way is that wget does the backoff. And wget is very, very good at retry, backoff, jitter handling, and so on. Frankly, you'll not find anything better.
If wget fails, you don't retry... at least not until the next run.
And wget (and curl, and others) returns exit codes that indicate what kind of error happened. You can also parse stderr.
Of course you could handle backoff programmatically in bash too, but... why? Wget is very good at that. Very good.
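Concretely, letting wget own the policy looks like this (a sketch; the flags are real wget options, the `fetch` wrapper name is mine): `--tries` caps total attempts, `--waitretry` waits 1s, 2s, up to N seconds between retries, and `--retry-connrefused` treats a refused connection as transient.

```shell
# Let wget own the retry/backoff policy instead of scripting it.
fetch() {
  wget --tries=5 \
       --waitretry=10 \
       --retry-connrefused \
       --timeout=30 \
       -q -O "$2" "$1"
}
```

On failure, wget's documented exit codes (e.g. 4 for a network failure, 8 for a server-issued error response) let the caller tell the failure classes apart.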
===
In terms of 'junior dev': a junior dev can't contribute much to anything without ramp-up first. I think you mean 'ramp up on bash' here, and that's fair... but the same can be said for any language you use. I've seen Python code with no error checking, and a gross misunderstanding of what to code for, just as with bash.
Yet like I said, I 100% agree there are issues in some cases. What you're saying is not entirely wrong. However, what you're looking for, I think, is not required much of the time, as wget + bash is "good enough" more often than you'd think.
So I think our disagreement here is, how often your route is required.
That’s fair. If you’re a grey-haired old-school Unix wiz who’s one of a handful of devs on the team, then by all means. But at a certain point, technology choice is an organizational problem as well.
And while it sounds Unixy to let wget do its thing, a fully baked program like that is much less “do one thing and do it well” than the HTTP utilities in general-purpose programming languages.
That can be solved at the design level: write your get step as an idempotent “only do it if it isn’t already done” creation operation for a given output file, like a make target, but with no need to actually use Make (just a `test -f || …`).
Then run your little pipeline in a loop until it stops making progress (`find | wc` doesn’t increase.) Either it finished, or everything that’s left as input represents one or more classes of errors. Debug them, and then start it looping again :)
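A sketch of that driver loop, under assumptions of mine (urls.txt in, one output file per URL under pages/; the hashed filenames and function names are made up):

```shell
# Idempotent step: skip any URL whose output file already exists;
# delete the output on failure so the next pass retries it.
fetch_step() {
  while read -r url; do
    out="pages/$(printf '%s' "$url" | md5sum | cut -d' ' -f1)"
    [ -f "$out" ] || wget -q -O "$out" "$url" || rm -f "$out"
  done < urls.txt
}

# Loop until the completed-file count stops growing: either it
# finished, or everything left is some class of error to debug.
run_until_stuck() {
  mkdir -p pages
  prev=-1
  while :; do
    fetch_step
    now=$(find pages -type f | wc -l)
    if [ "$now" -eq "$prev" ]; then break; fi
    prev=$now
  done
}
```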
Not redoing steps that appear to be already done has its own challenges: for example, a transfer that broke halfway through might leave a destination file but not represent a completion (typically dealt with by writing to a temp file and renaming).
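The temp-file-and-rename trick is only a few lines (a sketch; `fetch_atomic` is a made-up name). Only the `mv`, which is atomic within a filesystem, makes the output visible, so a half-finished transfer never masquerades as a completed one:

```shell
fetch_atomic() {
  url=$1 out=$2
  if [ -f "$out" ]; then return 0; fi    # already completed earlier
  if wget -q -O "$out.tmp" "$url"; then
    mv "$out.tmp" "$out"                 # atomic "commit"
  else
    rm -f "$out.tmp"                     # discard the partial download
    return 1
  fi
}
```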
The issue here is that your code has no real-time adaptability. Many backends will scale with load up to a point, then start returning "make fewer requests". Normally you implement some internal logic such as randomized exponential backoff on retries (amazingly, this is a remarkably effective way to automatically find the saturation point of a cluster), although I have also seen some large clients coordinate their fetches centrally using tokens.
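That internal logic is small enough to sketch in shell (a hypothetical wrapper; the jitter spreads clients out so their retries don't re-synchronize into a thundering herd):

```shell
# Randomized exponential backoff around a single fetch.
fetch_backoff() {
  url=$1
  delay=1
  for attempt in 1 2 3 4 5; do
    if curl -fsS "$url"; then return 0; fi
    sleep $(( delay + RANDOM % delay ))   # base wait plus random jitter
    delay=$(( delay * 2 ))                # double the base each round
  done
  return 1
}
```

Modern curl can do much of this itself with `--retry` (which also backs off exponentially); the point is that the policy fits in half a dozen lines either way.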
Having that logic in the same place as the work of actually driving the fetch/crawl, though, is a violation of Unix “small components, each doing one thing” thinking.
You know how you can rate-limit your requests? A forward proxy daemon that rate-limits upstream connections by holding them open but not serving them until the timeout has elapsed. (I.e. Nginx with five lines of config.) As long as your fetcher has a concurrency limit, stalling some of those connections will lead to decreased attempted throughput.
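Roughly like this, under stated assumptions (the rate, burst, port, and resolver are all made-up values; an untested sketch, not a drop-in config). `limit_req` without `nodelay` queues excess requests rather than rejecting them, which is exactly the "hold the connection open" behavior described:

```nginx
# Rough sketch: a local proxy that delays (rather than rejects)
# requests above 5 r/s per upstream host.
events {}
http {
    limit_req_zone $host zone=perhost:10m rate=5r/s;
    server {
        listen 8080;
        resolver 9.9.9.9;                     # needed for variable proxy_pass
        location / {
            limit_req zone=perhost burst=100; # excess is delayed, not 503'd
            proxy_pass http://$host$request_uri;
        }
    }
}
```

Point the fetcher at it with `http_proxy=http://127.0.0.1:8080`; note this plain-HTTP forward-proxy trick doesn't handle CONNECT, so HTTPS would need a different setup.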
(This isn’t just for scripting, either; it’s also a near-optimal way to implement global per-domain upstream-API rate-limiting in a production system that has multiple shared-nothing backends. It’s Istio/Envoy “in the small.”)
Setting up nginx means one more server to manage (and nginx isn't particularly a small component doing one thing).
Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server. That means responding to errors in a way that doesn't lead to thundering herds. The nice thing about this is that, like modern TCP, it auto-tunes to the capacity of the system, while also handling outages well.
Not really; I’m talking about running non-daemonized Nginx as part of the same pipeline. You could even fit the config into the pipeline, with sed+tee+etc, to make the whole thing stateless. Short-lived daemons are the network-packet equivalent to shell pipelines. :)
> Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server.
I disagree. The goal should be to make the server behave in such a way that a client using entirely-default semantics for the protocol it’s speaking is nudged and/or coerced and/or tricked into doing the right thing. (E.g., as I said, not returning a 429 right away, but instead making the client block when the server must block.) This localizes the responsibility for “knowing how the semantics of default {HTTP, gRPC, MQTT, RTP, …} map onto the pragmatics of your particular finicky upstream” into one reusable black-box abstraction layer.
Isn't that also only an incredibly simplified crawler? I can't see how it works with the modern web. Try crawling many websites and they'll present difficulties such that, when you go to view what you've downloaded, you realise it's useless.
The truth is, though, that AWS and the other cloud providers that offer more than hosted storage and compute are building their own “operating system” for building these systems.
We Unix graybeards may be used to xargs, grep and wget. The next generation of developers are learning to construct pipelines from Step Functions, SQS, Lambda and S3 instead. And speaking as someone who really enjoys Unix tooling: the systems designed with these new paradigms will be more scalable, observable and maintainable than the shell scripts of yore.
I think the cloud gets much maligned — but all the serious discussions with, e.g., AWS employees work from this paradigm:
- AWS is a “global computer” which you lease slices of
- there is access to the raw computer (EC2, networking tools, etc)
- there are basic constructs on top of that (SQS, Lambda, CloudWatch, etc)
- there are language wrappers to allocate those for your services (CDK, Pulumi, etc)
…and you end up with something that looks surprisingly like a “program” which runs on that “global computer”.
I know that it wasn’t always like that — plenty of sharp edges when I first used it in 2014. But we came to that paradigm precisely because people asked “how can we apply what we already know?” About mainframes. About Erlang. About operating systems.
I think it’s important to know the Unix tools, but I also think that early cloud phase has blinded a lot of people to what the cloud is now.
All the crawler needs is a quick crawler script and a TypeScript definition of resources, and you get all the AWS benefits in two files.
Maybe not “ten lines of Bash” easy, but we’re talking “thirty lines total, with logging, retries, persistence, etc”.
This. This 100%. I’m exhausted by how this has become the norm; it’s such an endemic issue in the tech industry that even rationally minded people will disagree when pragmatic solutions are proposed, instead suggesting something harder or “more complete” to make the solution seem better or more valuable. Complexity kills, but people really enjoy building complex things.
There are some times when writing a custom solution does make sense though.
In their case, I'm wondering why host failure isn't already handled at a higher level. A node failure causing all data on that host to be lost should be handled gracefully through replication, with another replica brought up transparently.
In any case, their use of local storage as a write-through cache via md is pretty interesting. I wonder if it would work the other way around for reading.
Scylla (and Cassandra) provides cluster-level replication. Even with only local NVMes, a single node failure with loss of data would be tolerated. But relying on "ephemeral local SSDs" that nodes can lose if any VM is power-cycled adds additional risk that some incident could cause multiple replicas to lose their data.
It seems that the biggest issue then is that the storage primitives that are available (ephemeral local storage and persistent remote storage) make it hard to have high performance and highly resilient stateful systems.
Huh, maybe the Greek-named Rust disk controller would be better. Since it wasn't written, we don't know one way or the other. Besides, all these messaging/chat apps have the same pattern: optimize on the cloud/server side, peddle some Electron crap on the client side, and virtue-signal all the way about how serious they are about engineering (on the server side).
Agreed, I thought this was a great writeup about how they solved an interesting problem with standard, battle-tested Unix tools.
Relatedly, I hope GCP is listening and builds an "out-of-the-box" offering that combines this write-through caching setup automatically. Just as I shouldn't have to worry (much) about how the different levels of RAM caching work on a server, I shouldn't have to worry much about the different caching layers of disk in the cloud.
This is only pragmatic if you accept the first order assumption that they must use GCP. Which, maybe it's the case for a dev at Discord, but it's a somewhat idiosyncratic choice to call pragmatic. Seems like a lot of developer time and GCP charges to overcome limitations of GCP and wind up with something that's far more brittle than just running your databases on real hardware.
They already run on GCP and presumably have a negotiated agreement on pricing. Their core competencies don’t include managing hardware. Migrating off is deliberately expensive and hard. Their own hardware would still need a story for backup and restore in lieu of attached-volume snapshots, or they’d have to figure out whether their database can do something for them there. Any of the above are good reasons not to migrate willy-nilly; in fact, the only reason to migrate is that either you can’t do what you need to in the cloud, or the margins can’t support it.