As an aside, I'm always impressed by Discord's engineering articles. They are incredibly pragmatic, typically using commonly available OSS to solve big problems. If this were another unicorn company, they would have instead written a custom disk controller in Rust, given it a Greek name, and done several major conference talks on their unique innovation.
> Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.
> The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script.
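For concreteness, the Taco Bell answer might look like this (a sketch; the `crawl` wrapper name is mine, and it assumes a urls.txt seed file with one URL per line and no error handling yet):

```shell
# Minimal "Taco Bell" crawler: fan the URL list out to 8 parallel
# wget workers, saving each fetched page under pages/.
crawl() {
  mkdir -p pages
  xargs -a urls.txt -n 1 -P 8 wget -q -P pages/
}
```

`split` on urls.txt plus `rsync` of pages/ would then be the "distributed" part alluded to above.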
I generally agree, but it's probably only 10 lines if you assume you never have to deal with any errors.
It's not flaky at all; it's just that most people don't write their bash (or other shell) to catch errors, retry on failure, and so on.
I will 100% agree that it has disadvantages, but it's unfair to level the above at shell scripts; most of your complaint is about poorly coded shell scripts.
An example? sysvinit is a few C programs, with everything else wrapped in bash or sh. It's far more reliable than systemd has ever been, with far better error checking.
Part of this is simplicity. 100 lines of code is better than 10k lines. The value of having the whole scope on one page can't be overstated for debugging and comprehension, and it makes error checking easier too.
Can I, with off-the-shelf OSS tooling, easily trace that code that’s “just wget and xargs”, emit metrics and traces to collectors, differentiate between all the possible network and HTTP failures, retry individual requests with backoff and jitter, allow individual threads to fail and retry them without borking the root program, write the results to a datastore in an idempotent way, and allow a junior developer to contribute to it with little ramp-up?
It’s not about “can bash do it”; it’s about “is there a huge ecosystem of tools, which we are probably already using in our organization, that thoroughly covers all these issues”.
The Unix way is that wget does the backoff. And wget is very, very good at retry, backoff, jitter handling, and so on. Frankly, you'll not find anything better.
If wget fails, you don't retry... at least not until the next run.
And wget (and curl, and others) returns exit codes that indicate what kind of error happened. You can also parse stderr.
Of course you could handle backoff programmatically in bash too, but... why? Wget is very good at that. Very good.
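Concretely, letting wget own the policy looks like this (a sketch; the flags are real wget options, the `fetch` wrapper name is mine): `--tries` caps total attempts, `--waitretry` waits 1s, 2s, up to N seconds between retries, and `--retry-connrefused` treats a refused connection as transient.

```shell
# Let wget own the retry/backoff policy instead of scripting it.
fetch() {
  wget --tries=5 \
       --waitretry=10 \
       --retry-connrefused \
       --timeout=30 \
       -q -O "$2" "$1"
}
```

On failure, wget's documented exit codes (e.g. 4 for a network failure, 8 for a server-issued error response) let the caller tell the failure classes apart.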
===
In terms of 'junior dev': a junior dev can't contribute much to anything without ramp-up first. I think you mean 'ramp up on bash' here, and that's fair... but the same can be said for any language you use. I've seen Python code with no error checking, and a gross misunderstanding of what to code for, just as with bash.
Yet like I said, I 100% agree there are issues in some cases. What you're saying is not entirely wrong. However, what you're looking for, I think, is not required much of the time, as wget + bash is "good enough" more often than you'd think.
So I think our disagreement here is, how often your route is required.
That’s fair. If you’re a grey-haired old-school Unix wiz who’s one of a handful of devs on the team, then by all means. But at a certain point, technology choice is an organizational problem as well.
And while it sounds Unixy to let wget do its thing, a fully baked program like that is much less “do one thing and do it well” than the HTTP utilities in general-purpose programming languages.
That can be solved at the design level: write your get step as an idempotent “only do it if it isn’t already done” creation operation for a given output file, like a make target, but with no need to actually use Make (just a `test -f || …`).
Then run your little pipeline in a loop until it stops making progress (`find | wc` doesn’t increase.) Either it finished, or everything that’s left as input represents one or more classes of errors. Debug them, and then start it looping again :)
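A sketch of that driver loop, under assumptions of mine (urls.txt in, one output file per URL under pages/; the hashed filenames and function names are made up):

```shell
# Idempotent step: skip any URL whose output file already exists;
# delete the output on failure so the next pass retries it.
fetch_step() {
  while read -r url; do
    out="pages/$(printf '%s' "$url" | md5sum | cut -d' ' -f1)"
    [ -f "$out" ] || wget -q -O "$out" "$url" || rm -f "$out"
  done < urls.txt
}

# Loop until the completed-file count stops growing: either it
# finished, or everything left is some class of error to debug.
run_until_stuck() {
  mkdir -p pages
  prev=-1
  while :; do
    fetch_step
    now=$(find pages -type f | wc -l)
    if [ "$now" -eq "$prev" ]; then break; fi
    prev=$now
  done
}
```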
Not redoing steps that appear to be already done has its own challenges: for example, a transfer that broke halfway through might leave a destination file but not represent a completion (typically dealt with by writing to a temp file and renaming).
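The temp-file-and-rename trick is only a few lines (a sketch; `fetch_atomic` is a made-up name). Only the `mv`, which is atomic within a filesystem, makes the output visible, so a half-finished transfer never masquerades as a completed one:

```shell
fetch_atomic() {
  url=$1 out=$2
  if [ -f "$out" ]; then return 0; fi    # already completed earlier
  if wget -q -O "$out.tmp" "$url"; then
    mv "$out.tmp" "$out"                 # atomic "commit"
  else
    rm -f "$out.tmp"                     # discard the partial download
    return 1
  fi
}
```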
The issue here is that your code has no real-time adaptability. Many backends will scale with load up to a point, then start returning "make fewer requests". Normally you implement some internal logic such as randomized exponential backoff on retries (amazingly, this is a remarkably effective way to automatically find the saturation point of a cluster), although I have also seen some large clients coordinate their fetches centrally using tokens.
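That internal logic is small enough to sketch in shell (a hypothetical wrapper; the jitter spreads clients out so their retries don't re-synchronize into a thundering herd):

```shell
# Randomized exponential backoff around a single fetch.
fetch_backoff() {
  url=$1
  delay=1
  for attempt in 1 2 3 4 5; do
    if curl -fsS "$url"; then return 0; fi
    sleep $(( delay + RANDOM % delay ))   # base wait plus random jitter
    delay=$(( delay * 2 ))                # double the base each round
  done
  return 1
}
```

Modern curl can do much of this itself with `--retry` (which also backs off exponentially); the point is that the policy fits in half a dozen lines either way.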
Having that logic in the same place as the work of actually driving the fetch/crawl, though, is a violation of Unix “small components, each doing one thing” thinking.
You know how you can rate-limit your requests? A forward proxy daemon that rate-limits upstream connections by holding them open but not serving them until the timeout has elapsed. (I.e. Nginx with five lines of config.) As long as your fetcher has a concurrency limit, stalling some of those connections will lead to decreased attempted throughput.
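Roughly like this, under stated assumptions (the rate, burst, port, and resolver are all made-up values; an untested sketch, not a drop-in config). `limit_req` without `nodelay` queues excess requests rather than rejecting them, which is exactly the "hold the connection open" behavior described:

```nginx
# Rough sketch: a local proxy that delays (rather than rejects)
# requests above 5 r/s per upstream host.
events {}
http {
    limit_req_zone $host zone=perhost:10m rate=5r/s;
    server {
        listen 8080;
        resolver 9.9.9.9;                     # needed for variable proxy_pass
        location / {
            limit_req zone=perhost burst=100; # excess is delayed, not 503'd
            proxy_pass http://$host$request_uri;
        }
    }
}
```

Point the fetcher at it with `http_proxy=http://127.0.0.1:8080`; note this plain-HTTP forward-proxy trick doesn't handle CONNECT, so HTTPS would need a different setup.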
(This isn’t just for scripting, either; it’s also a near-optimal way to implement global per-domain upstream-API rate-limiting in a production system that has multiple shared-nothing backends. It’s Istio/Envoy “in the small.”)
Setting up nginx means one more server to manage (and nginx isn't particularly a small component doing one thing).
Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server. That means responding to errors in a way that doesn't lead to thundering herds. The nice thing about this is that, like modern TCP, it auto-tunes to the capacity of the system, while also handling outages well.
Not really; I’m talking about running non-daemonized Nginx as part of the same pipeline. You could even fit the config into the pipeline, with sed+tee+etc, to make the whole thing stateless. Short-lived daemons are the network-packet equivalent to shell pipelines. :)
> Having built several large distributed computing systems, I've found that the inner client always needs to have a fair amount of intelligence when talking to the server.
I disagree. The goal should be to make the server behave in such a way that a client using entirely-default semantics for the protocol it’s speaking is nudged and/or coerced and/or tricked into doing the right thing. (E.g., as I said, not returning a 429 right away, but instead making the client block when the server must block.) This localizes the responsibility for “knowing how the semantics of default {HTTP, gRPC, MQTT, RTP, …} map onto the pragmatics of your particular finicky upstream” into one reusable black-box abstraction layer.
Isn't that also only an incredibly simplified crawler? I can't see how it works with the modern web. Try crawling many websites and they'll present difficulties such that, when you go to view what you've downloaded, you realise it's useless.
The truth is, though, that AWS and the other cloud providers that offer more than hosted storage and compute are building their own “operating system” for building these systems.
We Unix graybeards may be used to xargs, grep and wget. The next generation of developers are learning to construct pipelines from Step Functions, SQS, Lambda and S3 instead. And speaking as someone who really enjoys Unix tooling: the systems designed with these new paradigms will be more scalable, observable and maintainable than the shell scripts of yore.
I think the cloud gets much maligned — but all the serious discussions with, e.g., AWS employees work from this paradigm:
- AWS is a “global computer” which you lease slices of
- there is access to the raw computer (EC2, networking tools, etc)
- there are basic constructs on top of that (SQS, Lambda, CloudWatch, etc)
- there are language wrappers to allocate those for your services (CDK, Pulumi, etc)
…and you end up with something that looks surprisingly like a “program” which runs on that “global computer”.
I know that it wasn’t always like that — plenty of sharp edges when I first used it in 2014. But we came to that paradigm precisely because people asked “how can we apply what we already know?” About mainframes. About Erlang. About operating systems.
I think it’s important to know the Unix tools, but I also think that early cloud phase has blinded a lot of people to what the cloud is now.
All the crawler needs is a quick crawler script and a TypeScript definition of resources, and you get all the AWS benefits in two files.
Maybe not “ten lines of Bash” easy, but we’re talking “thirty lines total, with logging, retries, persistence, etc”.
This. This 100%. I’m exhausted by how this has become the norm; it’s such an endemic issue in the tech industry that even rationally minded people will disagree when pragmatic solutions are proposed, instead suggesting something harder or “more complete” to make the solution seem better or more valuable. Complexity kills, but people really enjoy building complex things.
There are some times when writing a custom solution does make sense though.
In their case, I'm wondering why host failure isn't already handled at a higher level. A node failure causing all data on that host to be lost should be handled gracefully through replication, with another replica brought up transparently.
In any case, their use of local storage as a write-through cache via md is pretty interesting. I wonder if it would work the other way around for reading.
Scylla (and Cassandra) provides cluster-level replication. Even with only local NVMes, a single node failure with loss of data would be tolerated. But relying on "ephemeral local SSDs" that nodes can lose if any VM is power-cycled adds additional risk that some incident could cause multiple replicas to lose their data.
It seems that the biggest issue then is that the storage primitives that are available (ephemeral local storage and persistent remote storage) make it hard to have high performance and highly resilient stateful systems.
Huh, maybe the Greek-named Rust disk controller would be better. Since it wasn't written, we don't know one way or the other. Besides, all these messaging/chat apps have the same pattern: optimize on the cloud/server side, peddle some Electron crap on the client side, and virtue-signal all the way about how serious they are about engineering (on the server side).
Agreed, I thought this was a great writeup about how they solved an interesting problem with standard, battle-tested Unix tools.
Relatedly, I hope GCP is listening and builds an "out-of-the-box" offering that combines this write-through caching setup automatically. Just as I shouldn't have to worry (much) about how the different levels of RAM caching work on a server, I shouldn't have to worry much about the different caching layers of disk in the cloud.
This is only pragmatic if you accept the first order assumption that they must use GCP. Which, maybe it's the case for a dev at Discord, but it's a somewhat idiosyncratic choice to call pragmatic. Seems like a lot of developer time and GCP charges to overcome limitations of GCP and wind up with something that's far more brittle than just running your databases on real hardware.
They already run on GCP and presumably have a negotiated agreement on pricing. Their core competencies don’t include managing hardware. Migrating off is deliberately expensive and hard. Their own hardware would still need a story for backup and restore in lieu of attached-volume snapshots, or they’d have to figure out whether their database can do something for them there. Any of the above are good reasons not to migrate willy-nilly; in fact, the only reason to migrate is that either you can’t do what you need to in the cloud, or the margins can’t support it.