
View Full Version : Clustering with unix


real
Sep 23, 2002, 03:55 AM
Is there Unix software that you can run in X Windows or the like and have the power of 2 or even more CPUs for other software on the OS X workstation? Or does each piece of software have to be written to take advantage of the Unix app running the cluster? Thanks for the info. Is this possible?

Detrius
Sep 23, 2002, 09:41 AM
Originally posted by real:
Is there Unix software that you can run in X Windows or the like and have the power of 2 or even more CPUs for other software on the OS X workstation? Or does each piece of software have to be written to take advantage of the Unix app running the cluster? Thanks for the info. Is this possible?

It is possible. Each piece of software does have to be written to take advantage of this. The reason is that the optimizations vary. If you are coding assuming that everything is running on one machine, then you will tend to share certain objects. This can be done in cluster computing, but it is not optimal, as different machines must transfer these objects between each other over ethernet, as opposed to the blazing speeds between the RAM and the processor.

Pooch is a Mac-based, user-friendly way to do this.

Free ways include MPI and PVM. These are both open standards (Pooch uses MPI).

I prefer the lam-mpi distribution as it makes setting up the cluster easier, and it is still free.
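
To give a feel for what MPI code actually looks like, here is a rough sketch of a minimal Fortran program (assuming a Fortran 90 compiler and an MPI library such as lam-mpi are installed - this is an illustration, not anyone's production code). Each copy of the program simply reports its own rank and the total number of processes:

program mpi_hello
    implicit none
    include 'mpif.h'                                  ! MPI constants (MPI_COMM_WORLD etc.)
    integer :: ierr, rank, nprocs

    call MPI_INIT(ierr)                               ! start the MPI runtime
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)    ! which process am I?
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)  ! how many processes in total?
    print *, 'Hello from process', rank, 'of', nprocs
    call MPI_FINALIZE(ierr)                           ! shut the runtime down
end program mpi_hello

With lam-mpi you would compile it with mpif77 (or mpif90) and launch it with something like "mpirun -np 4 mpi_hello"; the exact commands vary between MPI distributions.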

Kristoff
Sep 23, 2002, 12:08 PM
Another possibility is to use Java and the Jini API to write your programs....

Much clustering has been done to create massively parallel systems for crunching radio telescope data using Jini.

real
Sep 24, 2002, 02:58 AM
Thanks for all the info, guys. So it seems that you have to have an app that is written to take advantage of the cluster or it doesn't work. With Pooch, for example, I take EI, put it in the Pooch jobs window, and select the other machines, but nothing seems to speed up (most likely the network can't move the data fast enough to make it seem faster) - or does Pooch not even work like that? It seems the only app that is written for clustering is that fractal app. There are no real-world apps that take advantage of the cluster unless you're big on mapping your DNA and the such. You would think every app - and even more so the graphics apps - would have developers taking advantage of it. Or is it more trouble than it's worth to code for? Anyone else out there have any ideas for clustering in OS X for real-world apps (Photoshop, After Effects, etc.)? I think I'm dreaming, but I have to do something with old machines to have them earn their keep.
Sorry about the long post.

drmbb2
Sep 24, 2002, 08:15 AM
Well, in terms of Pooch and true parallel computing on a multi-node computer, there are, simply put, some compute problems/applications that are inherently non-parallelizable (is that a word?). In other words, some compute algorithms are strictly linear - each step in completing the algorithm is intimately linked to the previous step, and all steps must be processed in a literal, linear progression. Such algorithms cannot benefit from a parallel compute architecture, as there is no way to parcel out the problem into multiple subsets which can be computed simultaneously.
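
As a concrete (made-up) illustration of such a strictly linear algorithm, consider a recurrence like the one below - each iteration needs the value produced by the previous one, so there is simply nothing for a second processor to work on:

program serial_recurrence
    implicit none
    integer, parameter :: n = 1000
    real :: x(n)
    integer :: i

    x(1) = 1.0
    do i = 2, n
        x(i) = 0.5 * x(i-1) + 0.25    ! needs x(i-1), so the steps must run one after another
    end do
    print *, x(n)
end program serial_recurrence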

Other compute problems can be recoded to run as multiple, simultaneous routines - for example, large database searches, where the database can be split into multiple subsets, and each compute node can work only on its piece of the search - each node then returns its results to a master node, which puts the separate searches together and compiles the final results. Also, many mathematical computations (such as weather forecasting models) can be run as many parallel computations.
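
A very rough sketch of that master/worker search pattern in MPI Fortran might look like the following (the "database" here is just a dummy array of integers and the search is a trivial match count - a real application would obviously be far more involved):

program parallel_search
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000            ! records held by each node (made-up size)
    integer :: ierr, rank, nprocs, i, local_hits, total_hits
    integer :: records(n)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    ! each node would load its own slice of the database here;
    ! we just fill the slice with dummy values derived from the rank
    do i = 1, n
        records(i) = mod(i + rank, 7)
    end do

    ! each node searches only its own piece
    local_hits = count(records == 3)

    ! the master node (rank 0) compiles the final result
    call MPI_REDUCE(local_hits, total_hits, 1, MPI_INTEGER, MPI_SUM, &
                    0, MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'total matches across all nodes:', total_hits

    call MPI_FINALIZE(ierr)
end program parallel_search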

The bottom line is that there are many problems/applications that just can't be coded in a way to use parallel processing. I'm no image software guru, but I would imagine that many things people do in Photoshop would be like this - you can't complete the second phase of the computation/manipulation until the previous step is complete, hence there is nothing for a second compute node to work on.

P.S. I'm talking about true parallel processing here (aka Pooch and Beowulf architectures) - NOT using multiple CPU's in a shared memory environment, which, of course OS X is quite capable of (as long as an application is coded/compiled for multi-threaded use, which Photoshop, for example, is).

Detrius
Sep 25, 2002, 10:31 PM
If I remember correctly, the only real world app to include cluster computing support is Bryce. Even then, it's still a rather specialized application.

Currently, the possibility of an efficient cluster is so rare that it's pointless to include it in a shipping product. This is something that tends to be programmed on a case-by-case basis. As such, it's not very practical. However, if you have a huge problem you need solved, this is a good way to do it. You just have to know how to program it.

Richard Edgar
Sep 27, 2002, 10:29 AM
P.S. I'm talking about true parallel processing here (aka Pooch and Beowulf architectures) - NOT using multiple CPU's in a shared memory environment

Please excuse my ignorance, but in what way does an O3k fail to qualify as a 'true parallel computer?'

drmbb2
Sep 27, 2002, 02:19 PM
In a multi-processor system, like Apple's PowerMac line, both processors share a common memory bus, and only one kernel image is loaded at boot. The kernel then distributes the computational workload across the available CPUs. Thus, in an SMP model the user of the system sees essentially a "regular" desktop system.

In a parallel processing cluster, there are multiple nodes, each of which may have multiple processors (so each node is itself an SMP machine). Specialized software, like PVM, MPICH, or Pooch, then parcels up the computational tasks and assigns a node to each sub-task. From the user's point of view, each node is a separate machine (with Pooch, they may not even be in the same country, as machines may be clustered over a conventional LAN or even the internet). Any given node on the cluster can be doing its own thing, with its own protected memory environment and its own kernel image running the local show.

Oh, and there are two kinds of parallel machines: those based on the Beowulf concept, where the cluster nodes are physically integrated as a single massive machine, and those based on distributed computing concepts, like Pooch or the Legion project ( http://legion.virginia.edu/centurion/Centurion.html ). Some people here at UVa have actually secured fixed IP addresses for their lab/student/postdoc Macs so they can cluster them with Pooch.

Richard Edgar
Sep 27, 2002, 03:05 PM
Thus, in an SMP model the user of the system sees essentially a "regular" desktop system

Now, that's a definition of an O3k that I've never heard before. I still don't see how one of those beasts can be regarded as doing anything other than "true parallel computing." Particularly since there's none of that ridiculous messing about with "send these bytes to that processor."

drmbb2
Sep 27, 2002, 04:29 PM
Sorry, I was trying to give a general answer to what I saw as a question about SMP versus parallel architecture. In terms of SGI's 3000-series machines, they can be configured either way (up to 512-CPU SMP, or multiple parallel processing nodes). The distinction still holds, though, and depends on what software will be run on the machine - if configured as an SMP machine, YES, the user logs into a SINGLE machine, and the kernel is the ultimate determinant of load allocation (of course, overlaid with additional support modules which may mimic some, but not all, aspects of a true parallel installation - there is no way to give a specific CPU its own protected memory in an SMP setup). Alternatively, if configured as multiple, independent compute nodes, then the machine is more akin to a Beowulf cluster than to any SMP machine. Direct from SGI's support pages for the 3800 series: "it is flexible enough to be configured with software as a single 512-processor shared-memory server or be divided into several partitions, each running a separate OS". In terms of developing and applying applications, the distinction is highly significant (see my previous post). That's why NOAA runs its weather modelling SGI as a multi-node cluster, NOT as an SMP machine.

P.S. Maybe this concept helps - "parallel computing" generally refers to running multiple, simultaneous and independent computations in an independent compute environment, while SMP refers to running multiple computations in a shared and inter-dependent compute environment (i.e. shared memory, kernel image, etc.).

Richard Edgar
Sep 27, 2002, 05:02 PM
So, let me get this straight.... if I have a shared memory multiprocessor machine, and I get all the processors working on the same problem, then I am not doing parallel processing? However, if I arbitrarily partition the machine into single processor chunks (and hence negate most of the advantages of the highly expensive interconnect hardware) and have them grunt at each other using MPI, I am parallel processing? That's one messed up definition.

drmbb2
Sep 27, 2002, 06:12 PM
Originally posted by Richard Edgar:
So, let me get this straight.... if I have a shared memory multiprocessor machine, and I get all the processors working on the same problem, then I am not doing parallel processing? However, if I arbitrarily partition the machine into single processor chunks (and hence negate most of the advantages of the highly expensive interconnect hardware) and have them grunt at each other using MPI, I am parallel processing? That's one messed up definition.

Yes - if you get all the CPUs on an SMP machine working on the exact same problem, then you have gained nothing. If, however, you had a parallel setup, and could partition the problem into many separate sub-problems, and compute them simultaneously, then you would gain. Also note, the "highly expensive interconnect hardware" issue is kind of moot - Apple's Xserve machines (and desktops, and laptops) come with gigabit connections standard, and Intel machines can be set up similarly for little expense, so the interconnectedness of separate compute nodes is not a huge issue these days (unless your wider network is really slow).

Also note that getting anything out of multi-CPU SMP or multi-CPU parallel processing is highly software dependent - if your code isn't built to take advantage of either hardware architecture, then you gain nothing (e.g. running MS Word on an SMP machine with 2 or more processors has NO advantage - running Photoshop on a dual-processor SMP machine is great - running Photoshop on a many-machine, dual-processor-per-machine Pooch cluster gains you nothing).

So, no, the "definition" isn't FUBAR'd, just your understanding/interpretation of it is not quite in line with what's practically do-able in terms of computer hardware and software.

Richard Edgar
Sep 27, 2002, 06:38 PM
If, however, you had a parallel setup, and could partition the problem into many separate sub-problems, and compute them simultaneously, then you would gain

I had presumed that it should be obvious to all that that was what I was talking about. Now, I think that you might find that that is what one does with shared memory machines.

Also note, the "highly expensive interconnect hardware" issue is kind of moot - Apple's Xserve machines (and desktops, and laptops) come with gigabit connections standard

Sorry? Gigabit ethernet is fast? Gigabit ethernet automatically fetches data from processor caches with a 250ns latency?

So, no, the "definition" isn't FUBAR'd, just your understanding/interpretation of it is not quite in line with what's practically do-able in terms of computer hardware and software.

It's a misconception that I seem to share with SGI (http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/MPro7F90CD_RM/sgi_html/ch05.html), IBM (http://www.research.ibm.com/journal/sj/342/agerwala.html), Sun (http://soldc.sun.com/articles/openmp.html) and the OpenMP Architecture Review Board (http://www.openmp.org/). But then, what would any of them know about it?

drmbb2
Sep 27, 2002, 06:59 PM
Originally posted by Richard Edgar:
I had presumed that it should be obvious to all that that was what I was talking about. Now, I think that you might find that that is what one does with shared memory machines.
Sorry? Gigabit ethernet is fast? Gigabit ethernet automatically fetches data from processor caches with a 250ns latency?
It's a misconception that I seem to share with SGI (http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/MPro7F90CD_RM/sgi_html/ch05.html), IBM (http://www.research.ibm.com/journal/sj/342/agerwala.html), Sun (http://soldc.sun.com/articles/openmp.html) and the OpenMP Architecture Review Board (http://www.openmp.org/). But then, what would any of them know about it?



First off, no, it wasn't obvious to me at least, given the original post that started this thread (call me stupid, go ahead, I don't care). Second, NO, it is NOT what we do in our software development, in terms of the distinction between SMP and parallel processing. Third, YES, Gigabit ethernet does offer a performance gain, IN A PARALLEL ENVIRONMENT, in terms of inter-node data/file passing.

Finally, I do not see what your links do in terms of countering my points - we have an IBM SP2 machine here, plus SGIs and Suns - they all fully understand the distinction between the SMP and parallel compute environments, so what is so difficult about this???

Detrius
Sep 28, 2002, 12:14 AM
I think a few things need to be cleared up here, as I think there are some things that you guys aren't aware of.

Pooch is MPI. The only distinction is that MPI is a standard; Pooch is an implementation of that standard. Comparison: Red Hat is a distribution of Linux.

That said, I prefer the lam-mpi distribution. It's free and slightly easier to use than other Unix-based MPI distributions (such as mpich). Pooch is not free. PVM is not programmer friendly.

Also, SMP IS parallel computing. Just because there is a shared memory architecture does not make it NOT parallel. There are multiple threads running at the exact same time. It's parallel computing. Also, there is such a thing as distributed shared memory... where multiple physical machines are sharing the same "logical" memory space. The professor I learned this stuff from wrote the book on this. http://www.coe.uncc.edu/~abw/textbooks/ Check it out.

Richard Edgar
Sep 28, 2002, 03:06 AM
Third, YES, Gigabit ethernet does offer a performance gain, IN A PARALLEL ENVIRONMENT, in terms of inter-node data/file passing

A performance gain over NUMAflex? I somehow doubt it.

Finally, I do not see what your links do in terms of countering my points

Because, if you'd bothered to read them, every one of them referred to parallel computing in a shared memory environment.

There are multiple threads running at the exact same time. It's parallel computing.

Well, exactly. And there are the nice advantages that
a) It's extremely easy to program
b) The serial version of the code remains in the source
c) You don't have to worry about moving information about - the hardware ensures that it's there when you need it
Also, there is such a thing as distributed shared memory... where multiple physical machines are sharing the same "logical" memory space

Has anyone got one of those things working at a sensible speed yet? When I last did some research, I got the impression that everyone had given up. It would be nice, since ccNUMA systems are rather expensive.

Detrius
Sep 29, 2002, 02:27 AM
Originally posted by Richard Edgar:

Has anyone got one of those things working at a sensible speed yet? When I last did some research, I got the impression that everyone had given up. It would be nice, since ccNUMA systems are rather expensive.

That I don't know. He keeps talking about it like it's his baby, but he also keeps offering it as a senior project... as if some student somewhere along the line is going to get it working the way he wants. However, I think he's more interested in visualizing what's going on... maybe so he can work out bugs.

I started parallel programming with MPI... I'm working my way back down to SMP and threads... it's easier to come up with the algorithms when you DON'T have shared memory.

Richard Edgar
Sep 29, 2002, 02:56 AM
it's easier to come up with the algorithms when you DON'T have shared memory.

Surely, on a shared memory system, you just go through the serial program, find which sections have no data dependencies, and instruct the compiler to run them in parallel? My personal record for this was five seconds to parallelise a program - it was a simple one, and all it needed was converting

barray = elementalfunc( anarray )

to

!$OMP PARALLEL WORKSHARE
barray = elementalfunc( anarray )
!$OMP END PARALLEL WORKSHARE

which instantly gave me scaling to four processors (I don't have easy access to a bigger machine). There are some complications with avoiding false shares (the dreaded 'Store on shared cache line'), but essentially, it's extremely straightforward - and you can even keep the serial version in the parallel code, since the parallelisation directives are just comments (of course, if you rejig your algorithm slightly, then there might be complications).
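
For what it's worth, when the loop accumulates into a single result, a reduction clause is the usual way to sidestep that shared-cache-line problem - a quick sketch in the same spirit as the snippet above (purely an illustration, reusing anarray):

total = 0.0
!$OMP PARALLEL DO REDUCTION(+:total)
do i = 1, n
    total = total + anarray(i) * anarray(i)    ! each thread keeps its own private partial sum
end do
!$OMP END PARALLEL DO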

OreoCookie
Sep 29, 2002, 03:23 AM
You guys talk a lot about communication as a limiting factor for scaling.

We do some numerical surface simulations on a Linux cluster (20 machines). All are diskless systems with 100/10 MBit ethernet. For our applications, there is little communication between the nodes, and thus the slower network is sufficient.

You have to find that out for your application.

It scales best on four nodes per calculation.

Detrius
Sep 29, 2002, 09:17 PM
Originally posted by OreoCookie:
You guys talk a lot about communication as a limiting factor for scaling.

We do some numerical surface simulations on a Linux cluster (20 machines). All are diskless systems with 100/10 MBit ethernet. For our applications, there is little communication between the nodes, and thus the slower network is sufficient.

You have to find that out for your application.

It scales best on four nodes per calculation.

This is dependent on whether the problem being solved is easily parallelizable. For example: with the Mandelbrot fractal, each pixel on the screen is independent of the other pixels. Thus, this is a problem that can easily be split among multiple processors. Some problems are not so easy to make parallel. If the parallelization requires large amounts of data transfer, then the networking medium is very important. With the Mandelbrot set, the network medium would not be very important.
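
As a rough sketch of why (a toy escape-time loop, not anyone's actual renderer): each process can simply take every nprocs-th row of the image, so the nodes never exchange a byte while they compute, and only the finished rows would need collecting at the end:

program mandel_mpi
    implicit none
    include 'mpif.h'
    integer, parameter :: width = 400, height = 400, maxiter = 256
    integer :: ierr, rank, nprocs, row, col, iter
    integer :: counts(width)
    complex :: z, c

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    ! each process takes every nprocs-th row: no pixel depends on any other pixel
    do row = rank + 1, height, nprocs
        do col = 1, width
            c = cmplx(-2.0 + 3.0 * real(col - 1) / width, &
                      -1.5 + 3.0 * real(row - 1) / height)
            z = (0.0, 0.0)
            iter = 0
            do while (abs(z) < 2.0 .and. iter < maxiter)
                z = z * z + c
                iter = iter + 1
            end do
            counts(col) = iter                  ! iteration count = pixel colour
        end do
        ! a real renderer would send the finished row to a master node here
        ! (e.g. with MPI_SEND); one row of integers is tiny next to the time
        ! it took to compute it, so even slow ethernet is not the bottleneck
    end do

    call MPI_FINALIZE(ierr)
end program mandel_mpi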

Richard Edgar
Sep 30, 2002, 01:48 PM
If the parallelization requires large amounts of data transfer, then the networking medium is very important

I'll second that. Supercomputers aren't built so much from superCPUs as supernetworks. The same raw CPU power will often be available for a much lower price, but they'll be connected together with ethernet.