Introduction to High-Performance Scientific Computing

acidburnNSA · on Jan 24, 2017

Looks like a good collection of important topics. The tutorials feel a bit 2005 to me. GNUplot and svn? In my scientific universe it's all matplotlib and git these days. Maybe I'm unique.

jarvist · on Jan 24, 2017

Gnuplot is still used quite a lot in my area. You need to work more to make something aesthetically pleasing, but it's generally a better fit to plotting large amounts of data coming from another code. I've also been quite badly burnt by Matplotlib scripts no longer producing the same plot with changes in the library. Gnuplot version 5+ is nicer to use and has some pretty powerful features.

This blog has been a good cookbook reference for how to use it in a modern way: http://www.gnuplotting.org/

leephillips · on Jan 24, 2017

Gnuplot can handle data files with millions of points, that some users have reported to cause Matplotlib to crash. It's actually easy to make the plots look good. It can be controlled through a socket interface from any programming language, and seemlessly integrates with LaTeX.

We're about to start rolling out chapters from my book, which covers the latest version:

https://alogus.com/publishing/gnuplot5/

EDITed to reflect comment below.

metaobject · on Jan 24, 2017

Hmmm, I've never had mpl crash and we routinely plot data sets with 10s of millions of points. We do a mixture of 2D scatter plots. 1D/2D histograms, and a few 3D plots. What type of plots are you generating? What type of data?

leephillips · on Jan 24, 2017

I never had the problem myself, but have seen a handful of user reports in mailing lists and SO. Perhaps recent releases are better performing. I used to use mpl regularly and found it very capable, but never liked the API.

I've edited my comment to make it more accurate.

n00b101 · on Jan 24, 2017

> we routinely plot data sets with 10s of millions of points

What is the performance like on that?

notalaser · on Jan 24, 2017

The nice thing about Ggnuplot is that it's extremely easy to connect it with anything, since it's basically its own language. Python has made a lot of inroads into the scientifc computing field but it's by no means the only one. This is a field where twenty year-old Fortran libraries are still relevant.

SVN also has the important advantage of being relatively easy to pick up by someone who's not a programmer. It would not be my first choice for a version control system now that we have Mercurial and git (frankly, it wasn't not even back when our only affordable alternatives were CVS and RCS), but it's very easy to teach it.

williamstein · on Jan 24, 2017

Another big surprise to me -- Gnuplot is NOT free/open source software in the GNU sense: "Gnuplot is freeware in the sense that you don’t have to pay for it. However it is not freeware in the sense that you would be allowed to distribute a modified version of your gnuplot freely. Please read and accept the modification and redistribution terms in the Copyright file." (see http://www.gnuplot.info/faq/faq.html) I had to remove Gnuplot from SageMath because of their GPL-incompatible license.

leephillips · on Jan 24, 2017

Gnuplot is free enough to be included in Debian. But no, not GPL licensed.

robmccoll · on Jan 24, 2017

GLE can produce some decent plots and is BSD-licensed, but the DSL used to create the plots can be cumbersome. http://glx.sourceforge.net/examples/3dplots/index.html

andrepd · on Jan 24, 2017

Gnuplot is more customizable than matplotlib, more powerful, fastar and handles larger datasets, exports seamlessly to images, LaTeX, PDF, HTML5, animated gifs, among others, etc. It's a better tool than matplotlib, for me at least.

That it takes more effort to produce a "beautiful" output is myth, in my experience. People used to say that in large part because up to gnuplot 4 the default colors were ugly primary colors. On gnuplot 5 and above, it's no longer the case. Plus, you can customize the default colors/styles by editing the .gnuplot file. I've done so: https://ghostbin.com/paste/pvj5m

victotronics · on Jan 24, 2017

1. Do you know how much time it takes to keep a 600 page book up to the minute? 2. But yeah. I'm going to roll the tutorials into a volume of their own.

neutronicus · on Jan 24, 2017

It really depends on the size and the age of the code you're working with. We, for instance, use a code package that was born in 2005 (and continuously developed since), so it lives in SVN. For all the heavy-duty scientific visualization, we use VisIt [1], because the size and dimensionality of the data require a special-purpose visualization package.

To me, "gnuplot or matplotlib" is a little beside the point - if we're using one of those it's for something quick 'n dirty, or for a relatively simple plot of summary data.

[1] https://wci.llnl.gov/simulation/computer-codes/visit

acidburnNSA · on Jan 24, 2017

I've tried using VisIt (and Paraview) a few times but have never really dedicated myself to it. It looks so nice and seems so powerful. I just haven't been able to get it working in a way that allows me to quickly and quantitatively (with data labels) explore multiple axial levels of a nuclear reactor state and step through time to see how the shuffling algorithms did and whatnot so I keep sticking to this specialized .NET thing an intern wrote 7 years ago. I'm sure VisIt can do it, I think I just need an expert to teach me or something.

espeed · on Jan 24, 2017

Link to course page, including the source and latest revision of the book (2nd ed, revision 2016):

http://pages.tacc.utexas.edu/~eijkhout/istc/istc.html

agumonkey · on Jan 26, 2017

Then https://zenodo.org/record/49897/files/EijkhoutIntroToHPC.pdf

victotronics · on Jan 26, 2017

No. Please don't. The book still gets updated regularly, so any copy that is not straight from the repository will get quickly out of date. (I first published this book 6 years ago. You can find pdf copies out on the intertubes that are 200 pages shorter than the most recent version.)

Also, if you link straight to the pdf you don't get to see links to my other books.

Or links to places where you can get a paper copy. Which actually earns me a couple of pennies.

So please: don't make your own link to the pdf file. Don't.

agumonkey · on Jan 26, 2017

Apologies ..

dang if you see this, could you delete or edit my above comment ? Thanks

advanderveer · on Jan 24, 2017

At Nerdalize we're building a cloud that's specifally build for high performance scientific compute: http://www.nerdalize.com/cloud/

gumoro · on Jan 24, 2017

You might want to run a spell check on that page, I just saw "comprimise" in the most prominent sentence.

Also, white text on sky background with white patches (sun, clouds) has unreadable portions.

maaikest · on Jan 24, 2017

Thanks for the comment, we fixed the typo. Any ideas on the content of the website are also very welcome.

axiom92 · on Jan 24, 2017

"Widget not in any sidebars". Neat idea. Reminds me of something I did a while back: http://www.journalrepository.org/media/journals/BJMCS_6/2014.... We used idle systems in the computer lab to build large code repositories.

inlineint · on Jan 24, 2017

This MBP model is a little bit dated and the photo is blurry: http://www.nerdalize.com/wp-content/uploads/2016/06/mockup-e...

metaobject · on Jan 24, 2017

I don't understand what this means:

"Great chance that it is cost efficient to run your job on our servers. Our servers are distributed over homes, so you don’t have to pay for the overhead of a datacenter. This means that your cost-per-job is up to 55% lower and you compute sustainably, as we use the produced heat to heat homes."

Distributed over homes? As in "houses"? Your customer's data is stored at someone's (an employee's?) house?

shoyer · on Jan 24, 2017

It looks like they install (sell?) racks of computers as household "heaters". Scroll down to the "Win, win, win!" section on their homepage with a video.

This is a cute idea but I am skeptical that it makes sense from either an economic or environmental perspective. There are far more efficient ways to produce heat than electric heaters that run 24/7, and likewise cooling in data centers can be extremely efficient by making use of water, e.g., see https://www.google.com/about/datacenters/efficiency/internal...

Also, maintaining servers in people's homes must be quite expensive and there is limited capacity. It's hard to see that scaling.

advanderveer -- do you have some sort of white-paper that compares the alternatives?

Disclaimer: I work for Google, but not on Google Cloud.

j-pb · on Jan 24, 2017

> There are far more efficient ways to produce heat than electric heaters that run 24/7.

Do you mean cheaper? Because generating heat always has 100% efficiency. The only difference is that if you go from burnable materials to heat directly you don't get the nice side effect of getting computation done, so burning stuff is actually less efficient.

pkolaczk · on Jan 25, 2017

Technically you're right, but what you really want at home is not generating heat, but having more heat inside. These are not the same things. You can actually move some heat from outside to inside by using a heat pump (powered by electricity), commonly known as "air conditioner". Heat pumps can typically move 2x-6x more heat then they consume energy. So practically their heating or cooling efficiency is 2x-6x better than a resistance-based heater.

As for burning stuff - burning stuff is typically much cheaper, although it is actually the least efficient way of heating, in terms of a ratio between the usable heat you get and the total chemical energy converted to heat.

vmarsy · on Jan 24, 2017

If this is similar to Qarnot[1], the servers are also heaters in people home. I'm not sure how the Internet connection data transfers are handled by the ISP

[1] https://www.qarnot.com/qrad/

danpalmer · on Jan 24, 2017

I never got into high performance scientific computing, but I believe the stuff that was done in my department at university was all MPI based and required very high interconnect speeds (like with Infiniband). It looks like your offering is much more standard, what's the thinking there, or am I just wrong/out of date?

j-pb · on Jan 24, 2017

It depends heavily on the kind of work. If you have a large scale simulation that needs to be partitioned like a weather system you are IO bound and need as thick interconnects as possible. However there are some problems which are very hard computationally but not very large. Basically everything in NP and exp is a good candidate. There you can distribute the same problem to a bazillion systems with a different starting configuration and let them run until one of them obtains a solution.

If you look at the BOINC project those are basically all problems of this kind. Folding proteins like folding@home does for example. The description of a protein is fairly small, a couple megabyte max. However it takes a long time to simulate the behaviour, since chemistry is a messy probabilistic process with lots of back and forth. Nature does this on trillions of proteins at the same time within nanoseconds, and while we cannot reasonably increase the simulation speed of an individual protein, we can at least simulate as many proteins at once as possible.

dekhn · on Jan 24, 2017

An important secret in HPC is that MPI is rarely required to achieve your objectives. In many ways, vendors just use MPI as a way to sell expensive systems. If you can find any way to make your system scale using threads on a single machine, or use non-latency-sensitive networking, do so.

stonogo · on Jan 24, 2017

If you don't need a high-speed interconnect, you don't need HPC. That's not to say that MPI per se must always be involved, but if for instance the 10gbit connection on Amazon's half-baked "HPC" offering is sufficient, then you definitely don't need a supercomputer.

There is a ton of important scientific work waiting for core hours that really shouldn't be. A loosely-connected grid of laptops would serve a lot of projects very well. On the other hand, there is a large body of work that does require a classical supercomputer, so it doesn't really do anyone any good to accuse MPI of being a sales gimmick.

dekhn · on Jan 24, 2017

There is plenty of HPC that does not need interconnect. It's false, categorically, to say that HPC requires interconnect of any kind.

An isolated, off-net computer - even a desktop PC- stuffed to the gills with GPUs can do HPC. On the other hand, machines connected with 10gbit might do HPC, but you'll have trouble getting codes to scale in a way that is "high performance", relative to what you can get out of threading on a single machine, or a small number of GPUs.

Very little work truly requires classic supercomputers or MPI- there are very few codes where an important engineering problem must be run on a system with low latency, high bandwidth.

apathy · on Jan 24, 2017

Or rent a bigger AWS/EC2 instance to prepare for the eventual demise of old school HPC

j-pb · on Jan 24, 2017

I love that you work on computational heating. Have you thought about open sourcing your heater design like back-blaze does with their storage pods? I tried really hard to get fiber but Ziggo 300 is the best one can do here sadly, so I build my own heating rigs that run folding@home 24/7. But I'd guess I'm not the only one interested in heating their home with science, so maybe you could gather a crowd of computational heating enthusiast and diy around you and learn from one another!

BeetleB · on Jan 24, 2017

Has some fairly good topics, but I'm surprised it is lacking some of the basics you normally find: Numerical integration and optimization algorithms.

Root-finding/optimizing is something many people do/need.

dagss · on Jan 24, 2017

With "HPC" you usually mean "cluster computing". The problems you mention rarely go beyond what a laptop can do in seconds. So I am not sure if they qualify as HPC? Just "numerics", which is more of a prerequisite you might learn before HPC.

Also, compared to that curriculum, the topics you mention have robust methods and stable libraries? So you can use what somebody else did. It is more likely that you need to know the gritty details if you solve PDEs, vs solving ODEs.

BeetleB · on Jan 24, 2017

I guess I misunderstood the audience for the book - I took it more as an introductory level book that goes deep into topics (some of the items listed in the TOC are very "basic", which was why I was surprised it had some advanced topics but not all the basics).

As for "laptop can do in seconds" - well, not if your objective function takes a few minutes to run. The last time I needed it, the objective function took about 2 minutes, and there were 8 parameters we were optimizing over. Standard derivative based optimizing algorithms will require 9 invocations of the function per iteration. So one iteration of the algorithm took about 18 minutes. Certainly not "seconds". Of course, those evaluations could be done in parallel, so I just had them run on multiple cores, bringing it down to only a few minutes per iteration.

Without knowing something about the algorithms my library was using, I would have been totally lost (not to mention I would have likely picked the wrong algorithm for the job at hand).

But yes - I did not write my own algorithm - just used an off the shelf one. However, if you expect that someone who hasn't studied the topic can just use a random optimization algorithm and get things to work, you are mistaken.

victotronics · on Jan 24, 2017

Basics of what? High performance computing? I'd say those are numerical analysis topics and there are tons of books for that. Unless you can make a case that there are high performance aspects to root finding, I'm not going to include it. (You should have said FFT. That has very funky interaction with caches and TBL that absolutely necessitate its inclusion.)

playing_colours · on Jan 24, 2017

Is this a good book to get into the topic? Are there any other titles that can be recommended?

jabl · on Jan 24, 2017

The author is one of the heavyweights in numerical linear algebra (which a lot of hpc boils down to in the end), and certainly knows his stuff. And based on some skimming of the book I did a while ago, yes, I'd say it's good.

As for whether it's a good book "to get into the topic", I guess it depends on what you mean exactly. If you're a scientist who needs to write simulation code that can run on current HPC resources, congratulations , you're smack in the middle of the target demographic of this book (I guess). If not, well, maybe some other book is more appropriate.

arcanus · on Jan 24, 2017

> The author is one of the heavyweights in numerical linear algebra (which a lot of hpc boils down to in the end), and certainly knows his stuff

Came here to say this. I know Victor, and he is top notch.

RhysU · on Jan 25, 2017

Agreed.

jilele · on Jan 24, 2017

I feel the topics should also cover topics on GPGPU, OpenGL/CUDA.

victotronics · on Jan 24, 2017

Section 2.9.3

raattgift · on Jan 24, 2017

4.2.1 (p. 177):

> Hyperbolic PDEs ... will not be discussed in this book.

Aw. :(

victotronics · on Jan 24, 2017

Contact me if you want to to discuss the outline of a short section with me. My reason for not adding the hyperbolic case was that it didn't seem to add much computationally to the discussion.

Karnickel · on Jan 24, 2017

[flagged]

sctb · on Jan 24, 2017

Please don't do that.

abrookewood · on Jan 24, 2017

With an attitude like that, are you surprised? How about trying to bring it back up or offer some constructive criticism? Alternatively, you can just leave ...

apathy · on Jan 24, 2017

And take his/her valuable contributions elsewhere? Y combinator will surely collapse like a house of cards /s