Welcome to the MacNN Forums.

darcybaston · Oct 30, 2005, 07:15 PM

Alex, I still have your client sitting in boinc data. I updated my prefs to use 100% and 1 cpu, and it still wants to crunch 2 units at a time. I did an 'update' to the project after setting those prefs of course and ran a benchmark.

If there's another way to make it scale back so that only one satellite icon shows up in my menubar, let me know. Thanks!

dtsang · Oct 30, 2005, 09:07 PM

Upgraded from alpha-4 on Panther to alpha-5 on Tiger. My work unit times have shortened from ~10,100 sec to ~8,400 sec.

Running a Power Mac G4 466 MHz.

Good work guys!

halimedia · Oct 31, 2005, 10:50 AM

Darcy, you may want to create an additional location profile, such as 'School' or 'Home'. I had problems getting settings to stick as well when using the 'General' and 'Work' profiles. There must be something wonky with these location profiles in either the current version of BOINC Menubar or the underlying CLI client.

HTH,

Ron

darcybaston · Oct 31, 2005, 03:13 PM

Thanks Ron. I'm taking a break from all this SETI excitement for a while. I'll let you know when the interest to tinker rises again.

rick · Nov 2, 2005, 11:43 AM

Thought I'd post an update because I haven't been on the board for ages. I was writing up the alpha 6 code but started getting a few accuracy problems so I decided to get this sorted out once and for all.

It's taking a while because this is an area I'm not particularly familiar with and I've had to read up on a lot of stuff. This should also solve those (few) earlier work units that were producing invalid results.

However, it might mean that the work units take slightly longer but if this means greater accuracy then so be it (science first, speed second). I have no idea what the trade offs might be but I should have some more info by the weekend.

Also, for those of you with dual core systems, we might be able to squeeze some more speed out. I think Alex mentioned this briefly.

The dual core architecture means that each physical CPU (containing two cores) is fed by a single memory bus (i.e. a single dual core has one memory bus, a quad has two memory buses). In this case it might be advantageous to make the client use multiple threads so we can make better use of memory bandwidth and caches.

This might require some changes in your BOINC settings, i.e. to only have one process per two cores (i.e. one SETI@home process for a single dual core, two processes for a quad).

I doubt we'll implement this before alpha 7 (which isn't even on the radar yet) and we'd need the help of people with dual core systems to see if this would be an improvement. It's possible that it won't help that much, particularly given Darcy's benchmark when running two processes.

Kadin2048 · Nov 10, 2005, 08:46 PM

Hello, all -

Just thought I'd say hello and ask a quick question. I'm a new SETI'er, after getting involved indirectly (we have a team at work that runs the World Community Grid but sadly it's x86 only).

I think I've got the enhanced worker ready to go, I copied it and the xml file into my Library/Application Support/BOINC Data/projects/setiathome.. folder, but haven't restarted anything so right now it's still using the unenhanced client. I just didn't want to dump all the work it's done on the work unit that it's running right now, and it sounds as though it won't switch to the new worker until it downloads the next unit? (If anyone can clarify that...)

The other question I have is, how do I set things, using the GUI client, so that one project acts as a backup of the other? Right now I have Einstein and SETI switching off in one-hour increments, but I think what I'd like to do is switch to SETI nonstop, with Einstein as a backup just in case the SETI servers stop responding like they did yesterday. Can anyone give me instructions on how to do that? Is it just an issue of turning one's resources share way down?

I also am not running the superbench client, because I figured I'd change one thing at a time. Should I be? And how dramatic of a performance improvement is it going to cause?

I have two computers, a G4/400 Sawtooth and a iBook G3 800. Both are idle during the day and do nothing but run BOINC. If anyone has any general suggestions for what I could do to speed things up as much as possible, I'm open to them.

beadman · Nov 10, 2005, 11:48 PM

My $0.02 worth: I'd suggest using the SuperBench version, as the enhanced clients work so much faster, you may not download enough work to keep the machines busy. Installing is not complicated; just quit BOINC, trash the old app, replace it in the exact same location with the enhanced version, restart BOINC.

For the SETI client, I think you need to remove the old version of the SETI client and leave the new version, and its xml file, in the folder. It should finish the work it's presently doing, then switch to the newer version for new work. About the only way you can tell is that your WUs finish in about half the "normal" time. BTW, the enhanced SETI only really works better on G4 and G5 machines - the G3 apparently can't make use of the Altivec stuff. My old 333MHz iMac G3 takes about 80,000 seconds for a SETI work unit. My 1.33GHz G4 iBook does one in about 7,800 seconds using the enhanced client; 4 times faster clock, but more than 10 times faster WU calculation.
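The G3 versus G4 comparison above can be worked through as a small calculation. This is only a sketch using the two figures quoted in the post; the function names are made up for illustration, and the two machines differ in more than clock speed and AltiVec, so the result is a rough indication at best.

```c
/* Ratio of two positive quantities, e.g. run times or clock speeds. */
double ratio(double a, double b)
{
    return a / b;
}

/* Speedup that remains once the clock-speed difference is divided out.
   old_secs/new_secs are work unit times, old_mhz/new_mhz are clocks. */
double per_clock_gain(double old_secs, double new_secs,
                      double old_mhz, double new_mhz)
{
    double time_speedup = ratio(old_secs, new_secs);  /* about 10.3 here */
    double clock_speedup = ratio(new_mhz, old_mhz);   /* about 4.0 here  */
    return time_speedup / clock_speedup;
}
```

Calling per_clock_gain(80000.0, 7800.0, 333.0, 1330.0) gives a little over 2.5, i.e. the G4 with the enhanced client gets through roughly two and a half times more work per clock cycle than the G3 in this one comparison.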

The way to run one client "all the time" is as you surmised: set its share at something like 1,000 and set the "backup" client at 1. That way, the backup will only run 1 hour out of 1001 total hours running time.
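The "1 hour out of 1001" figure follows from treating each project's slice as its share divided by the sum of all shares. A minimal sketch of that arithmetic, assuming a simple proportional split (the real BOINC scheduler may behave differently over short periods, and share_fraction is a made-up helper, not a BOINC function):

```c
#include <stddef.h>

/* Fraction of run time project `which` gets when time is split in
   proportion to resource shares. Returns 0 if all shares are zero. */
double share_fraction(const double *shares, size_t n, size_t which)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += shares[i];
    return total > 0.0 ? shares[which] / total : 0.0;
}
```

With shares of {1000, 1}, the second project comes out at 1/1001 of the time, which is just under 0.1%.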

beadman

Kadin2048 · Nov 11, 2005, 12:39 AM

Beadman,

Thanks for the tips. I'll do as you suggest about prioritizing them. Right now I've just suspended Einstein until I get SETI tested out and working.

I just quit the BoincManager application, removed the old worker from the Application Support folder, leaving the new worker and the xml file, and restarted BoincManager. On startup I got these messages:

Thu Nov 10 23:30:39 2005||Starting BOINC client version 5.2.5 for powerpc-apple-darwin
Thu Nov 10 23:30:40 2005||libcurl/7.14.0 OpenSSL/0.9.7g zlib/1.2.3
Thu Nov 10 23:30:40 2005||Data directory: /Library/Application Support/BOINC Data
Thu Nov 10 23:30:40 2005|SETI@home|Found app_info.xml; using anonymous platform
...

So that seems pretty encouraging. Is there any way to tell whether I'm using the new worker or the old one?

halimedia · Nov 11, 2005, 08:44 AM

Originally Posted by Kadin2048

So that seems pretty encouraging. Is there any way to tell whether I'm using the new worker or the old one?

If you're using alpha-5 (Tiger), you should see 'setiathome-G4-a5' as an active process in Activity Monitor. That's alpha-5 of the optimized worker at work. The standard worker identifies itself just as 'setiathome_4.18_', IIRC.

HTH,

Ron

Kadin2048 · Nov 12, 2005, 02:53 PM

Originally Posted by halimedia

If you're using alpha-5 (Tiger), you should see 'setiathome-G4-a5' as an active process in Activity Monitor. That's alpha-5 of the optimized worker at work. The standard worker identifies itself just as 'setiathome_4.18_', IIRC.

Humm. I had thought I was definitely running alpha-5 because I've noticed that the screensaver graphics no longer appear, but just for the heck of it I went into Activity Monitor and saw that the active process is not "setiathome-G4-a4," it's "setiathome-4.18".

Very odd, because I had thought that I'd noticed an increase in speed also. Can anyone confirm whether this definitely means I'm still running the old worker? Or is the optimized worker just reporting itself wrongly/differently?

halimedia · Nov 12, 2005, 05:29 PM

Strange... What OS are you running? Did you dl alpha-5 via the link posted by Alex as an edit to the first post of this thread? The alpha-4 processes identified themselves the same as the stock worker, IIRC.

A sure way of finding circumstantial evidence for running the optimized worker is by looking at the CPU times of your workunits on the SETI/BOINC results pages for your machine. You should see a significant drop in CPU times after installation of the optimized worker (on the order of three to four times less than before).

HTH,

Ron

jaspervicenti · 04:59 AM

Rick and/or Alex,

Great job on these SETI clients! I was getting bored with OGR and I had tried out seti "many" years back but was disappointed that they were not altivec enhanced. Now, I'm seeing my ancient Dual G4 500 performance times comparable to some 2.5+ GHz machines. Pretty cool.

Anyways, I've been doing some AltiVec programming (mostly 8-bit integer/pixel processing) and have been using Shark a bit to help with performance tuning of my own projects. I profiled your alpha 5 client and have a suggestion that should eliminate a few cycles from the chirp loop.

At the start of the loop there are the following high visibility loads. Shark claims they are taking 20% of the time in the chirp procedure. I don't believe this is accurate but moving these load instructions outside of the loop would save at least 8 cycles:

1.0% 0x60a4 lwz r9,10086(r21) 2:1 ! Invariant load
14.8% 0x60ac lwz r10,4(r28) 2:1 ! Invariant load
1.9% 0x60b0 lwz r8,10078(r30) 2:1 ! Invariant load
2.1% 0x60b8 lwz r7,4(r27) 2:1 ! Invariant load
Current code:
for (i = 0; i < end; i += 8) {
float *re = data.realp + i;
float *im = data.imagp + i;
float *cre = chirped.realp + i;
float *cim = chirped.imagp + i;

Suggested change:
float *re = data.realp;
float *im = data.imagp;
float *cre = chirped.realp;
float *cim = chirped.imagp;
for (i = 0; i < end; i += 8, re += 8, im += 8, cre += 8, cim += 8) {

I haven't had a chance to test this out yet to see if there is any improvement.

Hope this helps!
Jasper Vicenti

alexkan · Nov 13, 2005, 04:44 AM

Oh, don't worry...we're doing more than our fair share of Shark profiling. As far as I know, the reason those loads are taking up a lot of time in the Shark traces is because most of those loads are cache misses. Prefetching helps somewhat, but I think in the end, we're still bound by load stalls when we chirp.

Also, those of you who pointed out the issues with certain clients' memory usage bloating over the course of a work unit are right.

I missed a subtle memory leak in the code. It shouldn't be hard to roll out a change to fix this, but since we're looking at accuracy and validation-related things, it might have to wait until we're sure the code we're looking at now is right.

Hopefully all these changes will be finished before SETI starts rolling out Enhanced and we have to start porting our changes over to the other codebase.

rick · 12:26 PM

Originally Posted by jaspervicenti

At the start of the loop there are the following high visibility loads. Shark claims they are taking 20% of the time in the chirp procedure. I don't believe this is accurate but moving these load instructions outside of the loop would save at least 8 cycles:

Hey, jasper, nice to see another coder.

The main reason that Shark points to those parts is because we're accessing global variables (data and chirped are global structures with array pointers in them). The compiler generally has to reload global variables from memory each time you access them, because it can't be sure that nothing else has changed them in between (as opposed to keeping the array pointers in registers which is faster). This means that we first have to load the array pointer from memory, and then load the memory that it points to.

The newer alpha 6 code has this sort of thing cleaned up (hmm... I thought I might have done this for alpha 5... never mind). The start of the new chirp code looks like this:

Code:
<pre>static void
chirp(int offset, int numPoints, double chirpRate)
{
        double rate = chirpRate / (swi.subband_sample_rate * swi.subband_sample_rate);
        double roundVal = rate >= 0.0 ? TWO_TO_52 : -TWO_TO_52;
        const float *rep = data.realp + offset;
        const float *imp = data.imagp + offset;
        float *crep = chirped.realp + offset;
        float *cimp = chirped.imagp + offset;
        int end;
        int i;
        
        // just copy if no chirping is required
        if (chirpRate == 0.0) {
                memcpy(crep, rep, numPoints * sizeof(float));
                memcpy(cimp, imp, numPoints * sizeof(float));
                return;
        }
        
        // main vectorised loop
        end = numPoints - (numPoints & 7);
        for (i = 0; i < end; i += 8) {
                const float *vrep = rep + i;
                const float *vimp = imp + i;
                float *vcrep = crep + i;
                float *vcimp = cimp + i;</pre>

so that the array pointers are always in registers. There are also a few other places in other parts of the code where I've cleaned this sort of thing up.
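For anyone following along, the idea can be shown in isolation. This is only a sketch: the struct below is a hypothetical stand-in for the global buffers discussed above, and the multiply is a placeholder for the real chirp maths, which isn't shown in the thread.

```c
#include <stddef.h>

/* Hypothetical stand-in for the global split-complex buffers. */
struct split_complex {
    float *realp;
    float *imagp;
};

struct split_complex data;     /* input, global as in the thread */
struct split_complex chirped;  /* output */

/* Scale every point by k (placeholder for the actual chirp). */
void scale_hoisted(size_t n, float k)
{
    /* Read each global pointer once, before the loop. The local copies
       can stay in registers instead of being reloaded every iteration. */
    const float *rep = data.realp;
    const float *imp = data.imagp;
    float *crep = chirped.realp;
    float *cimp = chirped.imagp;

    for (size_t i = 0; i < n; i++) {
        crep[i] = rep[i] * k;
        cimp[i] = imp[i] * k;
    }
}
```

Whether this helps in practice depends on the compiler and on what else the loop does; as noted earlier in the thread, cache misses on the data itself may still dominate.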

Note that the above chirp code is quite a bit different from the alpha 5 code cos we chirp in chunks that will fit into L2 cache as opposed to doing the whole data set at once (hence the new offset variable).
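The chunking idea can be sketched roughly like this. Everything here is an assumption for illustration: the callback type, the function names, and the rule of using half the L2 size for four float arrays are guesses, not what the alpha 6 code actually does.

```c
#include <stddef.h>

/* Points per chunk so that four float arrays (real/imag in, real/imag
   out) fit in about half of L2. The one-half headroom is a guess. */
size_t chunk_points(size_t l2_bytes)
{
    size_t n = (l2_bytes / 2) / (4 * sizeof(float));
    return n - (n & 7);   /* keep a multiple of 8, like the vector loop */
}

/* Hypothetical per-chunk callback: (offset, number of points, state). */
typedef void (*chunk_fn)(size_t offset, size_t num_points, void *state);

/* Walk the data set chunk by chunk; returns the number of chunks. */
size_t for_each_chunk(size_t total, size_t chunk, chunk_fn fn, void *state)
{
    size_t count = 0;
    if (chunk == 0)
        return 0;
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t n = total - offset < chunk ? total - offset : chunk;
        fn(offset, n, state);   /* the last chunk may be shorter */
        count++;
    }
    return count;
}

/* Example callback: add up the points seen, to check none are skipped. */
void count_points(size_t offset, size_t num_points, void *state)
{
    (void)offset;
    *(size_t *)state += num_points;
}
```

In a real client the callback would do the chirp and the following processing steps on one chunk while it is still warm in cache, which is where the benefit would come from.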

If you'd like to have a dig around the newer alpha 6 code then give me a shout. We don't have the facilities to host an online version control system but I can supply it as a zip archive or as a dumped Subversion repository.

We can always use an extra pair of eyes to go over the code. Alex usually has to point out at least one retarded thing I do.

dgoldsmith · Nov 15, 2005, 01:29 AM

Hi,

Have you thought any about eventually getting this up on Intel? Using Accelerate.framework should help a lot already. The transition documentation has some material on writing code that can compile for both Altivec and SSE.

Right now I'm running stock boinc and SETI clients that I've compiled on my DTS Mac system (I'm running your alpha 5 client on my PPC Macs). It would be interesting to see how much acceleration the vector version of the SETI client can give.

rick · Nov 15, 2005, 01:05 PM

Originally Posted by dgoldsmith

Have you thought any about eventually getting this up on Intel? Using Accelerate.framework should help a lot already. The transition documentation has some material on writing code that can compile for both Altivec and SSE.

There aren't any plans (or volunteers) for this at the moment, but hopefully we would be able to combine our efforts with the current x86 optimisation efforts.

Even though we use the Accelerate framework in a few places, we're moving more towards custom Altivec code. Apart from a few places (like the initial smoothing step) I think that we'll only end up using the FFT routines from the vDSP library.

It's all kind of academic until someone with our code base and SSE x86 experience gets hold of it. There was someone on one of the BOINC development mailing lists a while ago who was trying to get the code to compile on a DTS.

Plus we won't know the relative performance figures until the real hardware starts shipping. Recent x86 systems seem to have a lot more L2 cache than Macs so this will make quite a big difference.

I reckon by that time the code base should be clear enough to make an Altivec to SSE translation relatively painless. At the moment we aren't using any Altivec instructions that would be difficult to translate to SSE. Someone would then have to tweak extra stuff like cache streaming to make maximum use of the architecture.

Originally Posted by dgoldsmith

Right now I'm running stock boinc and SETI clients that I've compiled on my DTS Mac system (I'm running your alpha 5 client on my PPC Macs). It would be interesting to see how much acceleration the vector version of the SETI client can give.

If people are interested then I could make a preliminary x86 branch. This would basically be the latest code base but with all the Altivec specific stuff ripped out and standard ANSI C in its place (or with equivalent Accelerate functions). It won't be representative of full optimisation, but it's a start.

Karl Schimanek · Nov 15, 2005, 02:56 PM

I hope you wouldn't support x86!!!

Karl

rick · Nov 16, 2005, 08:35 AM

Originally Posted by Karl Schimanek

I hope you wouldn't support x86!!!

Why not?

Karl Schimanek · Nov 16, 2005, 04:24 PM

Intel keeps killing off processor architectures one way or another. When nothing but 'x86 is left, evolution has ended.

Linky

Full ACK!
I don't support Intel.

amigoivo · Nov 19, 2005, 11:25 AM

Originally Posted by Karl Schimanek

Linky

Full ACK!
I don't support Intel.

Hello,

I would like to post my first PowerMac QUAD G5 results.

2x DualCore G5 @ 2,5GHz
GeForce 6600 256MB Video Ram
3GB DDR2 RAM
250GB Maxtor HD with 16MB Cache

I crunched the reference units with the alpha5 client with the following results:

real 33m53.230s
user 33m11.539s
sys 0m25.842s

Unfortunately BOINC uses only 2 cores, the other 2 are "useless". Is there a possibility to recompile it?

Greetz, Ivo

Shaktai · Nov 19, 2005, 12:45 PM

Originally Posted by amigoivo

Hello,

Unfortunately BOINC uses only 2 cores, the other 2 are "useless". Is there a possibility to recompile it?

Greetz, Ivo

amigoivo, go into your SETI Account preferences under General settings and change the number of processors to be used. The default is a maximum of 2. Change that to 4 and it should then use all four processors. This is managed by the BOINC client through the preferences, not by the SETI app.

Congratulations on your Quad PowerMac. I wish I could afford one.

Knightrider · Nov 19, 2005, 12:46 PM

Originally Posted by amigoivo

Hello,

Unfortunately BOINC uses only 2 cores, the other 2 are "useless". Is there a possibility to recompile it?
Greetz, Ivo

Hi Ivo,

Try this - in the library/application support/BOINC data/slots folder check to see how many folders there are. If there are only two folders, one named "0" and another folder named "1", then add two more folders calling them "2" and "3" (without quotes). Then see what happens. You should also check the Activity Monitor found in Utilities, to see what work the CPUs are doing.

K.

Shaktai, I had this problem also, although I had set my prefs to 4 CPUs. This seems to resolve it, I will see how it goes. It may have sorted itself out anyway?

K.

Shaktai · Nov 19, 2005, 12:56 PM

Originally Posted by Knightrider

Hi Ivo,

Shaktai, I had this problem also, although I had set my prefs to 4 CPUs. This seems to resolve it, I will see how it goes. It may have sorted itself out anyway?

K.

Good to know. Keep in mind that when you change the account preferences, they don't take effect immediately. You may have to force a manual update or two from the BOINC client.

amigoivo · Nov 19, 2005, 02:05 PM

Originally Posted by Shaktai

Good to know. Keep in mind that when you change the account preferences, they don't take effect immediately. You may have to force a manual update or two from the BOINC client.

Hello,

thanks a lot for the fast help!

Unfortunately I couldn't download any SETI WUs, but the prefs in Einstein@home work fine.
For lack of Seti WUs the quad is now crunching for E@H, but i will download seti WUs as fast as possible to make some speed tests.

Greetz, Ivo

rick · Nov 21, 2005, 01:36 PM

Those dual / quad benchmarks are sweet. Hopefully we can get them under 30 minutes with alpha 6. Out of curiosity, could someone with a dual or quad system run the following command in terminal:

Code:
sysctl -a | grep cpu

You should get something like this:

Code:
arrakis 17:34:25 rick % sysctl -a | grep cpu
hw.ncpu = 1
hw.cpufrequency = 999999996
hw.availcpu = 1
hw.ncpu: 1
hw.activecpu: 1
hw.physicalcpu: 1
hw.physicalcpu_max: 1
hw.logicalcpu: 1
hw.logicalcpu_max: 1
hw.cputype: 18
hw.cpusubtype: 11
hw.cpufrequency: 999999996
hw.cpufrequency_min: 999999996
hw.cpufrequency_max: 999999996
arrakis 17:34:31 rick %

If we decide to make alpha 7 multithreaded, this would help distinguish between dual CPU and dual core systems (this might be important for better utilising the memory bandwidth).
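If a client wanted to pick these numbers out of that output itself, a small parser would do. The sketch below handles the two line styles visible in the listing ("key = value" and "key: value"); parse_sysctl_line is a made-up helper name. On Mac OS X the same values can also be read directly with sysctlbyname(3), which is not used here so that the example stays portable.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of `sysctl -a` output in either "key = value" or
   "key: value" form. Returns 1 and stores the number on a match,
   0 if the line is for a different key or has no numeric value. */
int parse_sysctl_line(const char *line, const char *key, long long *value)
{
    size_t klen = strlen(key);
    const char *p;

    if (strncmp(line, key, klen) != 0)
        return 0;
    p = line + klen;
    if (strncmp(p, " = ", 3) == 0)
        p += 3;
    else if (strncmp(p, ": ", 2) == 0)
        p += 2;
    else
        return 0;   /* a longer key such as hw.physicalcpu_max, or junk */
    return sscanf(p, "%lld", value) == 1;
}
```

Comparing hw.physicalcpu with hw.logicalcpu is then a matter of calling this once per key on each line of output.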

Cheers.

amigoivo · Nov 21, 2005, 01:50 PM

Hello rick,

here are my results:

PowerMac G5 QUAD:

PMQuadG5:~ imaynick$ sysctl -a | grep cpu
hw.ncpu = 4
hw.cpufrequency = 2500000000
hw.availcpu = 4
hw.ncpu: 4
hw.activecpu: 4
hw.physicalcpu: 4
hw.physicalcpu_max: 4
hw.logicalcpu: 4
hw.logicalcpu_max: 4
hw.cputype: 18
hw.cpusubtype: 100
hw.cpufrequency: 2500000000
hw.cpufrequency_min: 2500000000
hw.cpufrequency_max: 2500000000

PowerMac G5 DUAL 2,5GHz

IvosPowerMacG5:~ imaynick$ sysctl -a | grep cpu
hw.ncpu = 2
hw.cpufrequency = 2500000000
hw.availcpu = 2
hw.ncpu: 2
hw.activecpu: 2
hw.physicalcpu: 2
hw.physicalcpu_max: 2
hw.logicalcpu: 2
hw.logicalcpu_max: 2
hw.cputype: 18
hw.cpusubtype: 100
hw.cpufrequency: 2500000000
hw.cpufrequency_min: 2500000000
hw.cpufrequency_max: 2500000000

MacOS X 10.4.3 on both machines.

Have a nice day,
Ivo

rick · Nov 21, 2005, 02:20 PM

Originally Posted by amigoivo

Hello rick,

here are my results:

Thanks. Unfortunately, this doesn't look as though it's going to be helpful. I was hoping it would be something like:

Code:
hw.logicalcpu: 2
hw.physicalcpu: 1

or something so we could distinguish dual core systems. Perhaps there's another way to do it. I'll ask on one of the Apple mailing lists.

I guess "logicalcpu" will probably be used by the forthcoming x86 Macs if they have Hyper Threading.

halimedia · Nov 21, 2005, 06:42 PM

Rick, couldn't you use the Machine Model (i.e. PowerMac7,3 for a Dual 2.5 GHz) for identification? True, this wouldn't be bullet-proof, as future CPU upgrade cards could change the CPU configuration. But that would be a while off, I would guess.

Incidentally, what do the dual-core G5s report when issued the 'machine' command? Maybe that could be used in conjunction with the Machine Model to give you the info you need...

HTH,

Ron

halimedia · Nov 21, 2005, 07:06 PM

Rick, I just issued 'sysctl -a | grep hw' and got the following output:

Code:
hw.machine = Power Macintosh
hw.model = PowerMac7,3
hw.ncpu = 2
hw.byteorder = 4321
hw.physmem = 2147483648
hw.usermem = 1828810752
hw.pagesize = 4096
hw.epoch = 1
hw.vectorunit = 1
hw.busfrequency = 1250000000
hw.cpufrequency = 2500000000
hw.cachelinesize = 128
hw.l1icachesize = 65536
hw.l1dcachesize = 32768
hw.l2settings = 2147483648
hw.l2cachesize = 524288
hw.tbfrequency = 33329541
hw.memsize = 5368709120
hw.availcpu = 2
net.link.ether.inet.apple_hwcksum_tx: 1
net.link.ether.inet.apple_hwcksum_rx: 1
hw.ncpu: 2
hw.byteorder: 4321
hw.memsize: 5368709120
hw.activecpu: 2
hw.physicalcpu: 2
hw.physicalcpu_max: 2
hw.logicalcpu: 2
hw.logicalcpu_max: 2
hw.cputype: 18
hw.cpusubtype: 100
hw.pagesize: 4096
hw.busfrequency: 1250000000
hw.busfrequency_min: 1250000000
hw.busfrequency_max: 1250000000
hw.cpufrequency: 2500000000
hw.cpufrequency_min: 2500000000
hw.cpufrequency_max: 2500000000
hw.cachelinesize: 128
hw.l1icachesize: 65536
hw.l1dcachesize: 32768
hw.l2cachesize: 524288
hw.tbfrequency: 33329541
hw.optional.floatingpoint: 1
hw.optional.altivec: 1
hw.optional.graphicsops: 1
hw.optional.64bitops: 1
hw.optional.fsqrt: 1
hw.optional.stfiwx: 1
hw.optional.datastreams: 0
hw.optional.dcbtstreams: 1

That should give you all the info you need, including CPU type, cache sizes, etc. Wouldn't that be sufficient?

Cheers,

Ron

Knightrider · Nov 21, 2005, 08:48 PM

Hi,
Here are my terminal results for the command 'sysctl -a | grep hw':

hw.machine = Power Macintosh
hw.model = PowerMac11,2
hw.ncpu = 4
hw.byteorder = 4321
hw.physmem = 536870912
hw.usermem = 467030016
hw.pagesize = 4096
hw.epoch = 1
hw.vectorunit = 1
hw.busfrequency = 1250000000
hw.cpufrequency = 2500000000
hw.cachelinesize = 128
hw.l1icachesize = 65536
hw.l1dcachesize = 32768
hw.l2settings = 2147483648
hw.l2cachesize = 1048576
hw.tbfrequency = 33330173
hw.memsize = 536870912
hw.availcpu = 4
net.link.ether.inet.apple_hwcksum_tx: 1
net.link.ether.inet.apple_hwcksum_rx: 1
hw.ncpu: 4
hw.byteorder: 4321
hw.memsize: 536870912
hw.activecpu: 4
hw.physicalcpu: 4
hw.physicalcpu_max: 4
hw.logicalcpu: 4
hw.logicalcpu_max: 4
hw.cputype: 18
hw.cpusubtype: 100
hw.pagesize: 4096
hw.busfrequency: 1250000000
hw.busfrequency_min: 1250000000
hw.busfrequency_max: 1250000000
hw.cpufrequency: 2500000000
hw.cpufrequency_min: 2500000000
hw.cpufrequency_max: 2500000000
hw.cachelinesize: 128
hw.l1icachesize: 65536
hw.l1dcachesize: 32768
hw.l2cachesize: 1048576
hw.tbfrequency: 33330173
hw.optional.floatingpoint: 1
hw.optional.altivec: 1
hw.optional.graphicsops: 1
hw.optional.64bitops: 1
hw.optional.fsqrt: 1
hw.optional.stfiwx: 1
hw.optional.datastreams: 0
hw.optional.dcbtstreams: 1

K.

Knightrider · Nov 21, 2005, 08:52 PM

These are my results for the command 'sysctl -a | grep cpu':

hw.ncpu = 4
hw.cpufrequency = 2500000000
hw.availcpu = 4
hw.ncpu: 4
hw.activecpu: 4
hw.physicalcpu: 4
hw.physicalcpu_max: 4
hw.logicalcpu: 4
hw.logicalcpu_max: 4
hw.cputype: 18
hw.cpusubtype: 100
hw.cpufrequency: 2500000000
hw.cpufrequency_min: 2500000000
hw.cpufrequency_max: 2500000000

K.

reader50 · Nov 21, 2005, 09:19 PM

Just a note to all our visitors. The forum backend database has been acting oddly for awhile, and we did a recent vBB version upgrade. A side effect is that there have been a lot of double- and triple-posts lately. And one Sextuple-post that I've seen.

Don't worry about them, I delete the duplicates when I see them. rick, I got yours.

If anyone wants their duplicate posts to remain for artistic reasons, just say so.

dgoldsmith · Nov 22, 2005, 03:42 AM

Originally Posted by rick

If people are interested then I could make a preliminary x86 branch. This would basically be the latest code base but with all the Altivec specific stuff ripped out and standard ANSI C in its place (or with equivalent Accelerate functions). It won't be representative of full optimisation, but it's a start.

That would be an interesting start. I'd at least try to build it and see what it did.

rick · Nov 22, 2005, 01:26 PM

Originally Posted by halimedia

Incidentally, what do the dual-core G5s report when issued the 'machine' command? Maybe that could be used in conjunction with the Machine Model to give you the info you need...

It looks as though 'machine' just reports the CPU type. For all G5s this should be 'ppc970'.

Originally Posted by halimedia

That should give you all the info you need, including CPU type, cache sizes, etc. Wouldn't that be sufficient?

Perhaps I'll just use your idea of the Power Mac model number.

I think it might be trickier than I originally thought anyway because it will depend which CPUs the kernel scheduler puts the threads on. I guess I'll have to ask some of the Smart People on the Apple lists.
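A lookup on the model string could be sketched like this. The table only holds the two hw.model values posted in this thread, and the cores-per-chip numbers are inferred from those posts (PowerMac7,3 as a dual with two single-core chips, PowerMac11,2 as the dual-core G5 generation), so treat them as assumptions. The function names are made up, and none of this addresses where the scheduler actually places the threads.

```c
#include <stddef.h>
#include <string.h>

struct model_info {
    const char *model;      /* value of hw.model */
    int cores_per_chip;     /* assumed, see note above */
};

/* Only the two models that appear in this thread. */
static const struct model_info known_models[] = {
    { "PowerMac7,3",  1 },  /* dual 2.5 GHz: two single-core chips */
    { "PowerMac11,2", 2 },  /* dual-core G5 chips (the Quad has two) */
};

/* Returns cores per physical chip, or 0 for an unknown model. */
int cores_per_chip(const char *hw_model)
{
    size_t n = sizeof known_models / sizeof known_models[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(hw_model, known_models[i].model) == 0)
            return known_models[i].cores_per_chip;
    return 0;
}

/* One process per physical chip, as floated earlier in the thread;
   falls back to one process per CPU when the model is unknown. */
int suggested_processes(const char *hw_model, int ncpu)
{
    int per_chip = cores_per_chip(hw_model);
    return per_chip > 0 ? ncpu / per_chip : ncpu;
}
```

For the Quad (hw.model PowerMac11,2, hw.ncpu 4) this suggests two processes; for the older dual (PowerMac7,3, hw.ncpu 2) it also suggests two.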

Originally Posted by reader50

If anyone wants their duplicate posts to remain for artistic reasons, just say so.

Well I consider my profanity an art form so...

Originally Posted by dgoldsmith

That would be an interesting start. I'd at least try to build it and see what it did.

Cool, I'll give it a go. I'll probably just splice it into the fat binary with the alpha 6 release.

halimedia · Nov 24, 2005, 10:16 AM

Alex, Rick: Just out of curiosity - it seems to me that alpha-5 on G4s at equivalent clock speed might actually outperform G5s. I don't have any side-by-side comparisons, but when I extrapolate results I'm seeing from my machines, it certainly seems that way. Just a couple questions (not at all intended as criticisms) that others may be curious about as well:

- Why do G4s seem to benefit more from your optimizations than G5s?
- Are you making use of the G5's additional FPU in the G5-optimized code?

I'm looking forward to any insights you may be able to share...

Cheers, and thanks for all the work you're putting into this!

Ron

Karl Schimanek · Nov 24, 2005, 04:46 PM

http://forums.macnn.com/showthread.p...50#post2707650
http://forums.macnn.com/showthread.p...92#post2717192

Regards
Karl

halimedia · Nov 24, 2005, 05:32 PM

Karl, thanx for the references. Time to re-read the whole thread, I guess...

So it seems that the vDSP lib takes care of this 'automagically' (utilizing the FPUs). However, Alex also mentioned that their code mostly takes advantage of Altivec, because it provides more throughput. Any chance to use both (i.e. Altivec and _both_ FPUs on G5s in parallel)? Sorry if this sounds silly to you hard-core coders, but I'm curious...

Cheers,

Ron

halimedia · 05:44 PM

Originally Posted by halimedia

Any chance to use both (i.e. Altivec and _both_ FPUs on G5s in parallel)?

And while we're at it: are there any reasonably easily implementable ways to take advantage of the GPU without having to write custom code for every model out there? - got to beat those Windoze boxen, after all, don't we?

Happy Thanksgiving to all our friends across the pond...

halimedia · Nov 24, 2005, 05:50 PM

Has anyone used BOINC Menubar in any of the 5.x incarnations for any length of time? Anything new that's worth having? And is anyone in the know whether Superbench-optimized 5.x clients are coming our way?

TIA for any feedback!

Cheers,

Ron

Shaktai · Nov 24, 2005, 11:01 PM

Originally Posted by halimedia

And while we're at it: are there any reasonably easily implementable ways to take advantage of the GPU without having to write custom code for every model out there? - got to beat those Windoze boxen, after all, don't we?

Happy Thanksgiving to all our friends across the pond...

Haven't heard of any ways to currently use GPU for BOINC. Best way to beat the Windoze users is to add more macs. After all we already have the best optimized SETI app available. My iMac 1.6ghz running the SETI optimized app is only about 10-12% slower than an AMD 3700+ with 1meg L2 cache running the Windows optimized SETI app. A PowerMac 2.0 or greater would run rings around the AMD.

Apparently there were some significant changes to the source code, and mikkyo has had a chance to compile it. I've looked at compiling myself, but really can't make sense out of most of it. Haven't heard of anyone else compiling BOINC 5.2x for Mac.

halimedia · Nov 25, 2005, 04:49 AM

Originally Posted by Shaktai

My iMac 1.6ghz running the SETI optimized app is only about 10-12% slower than an AMD 3700+ with 1meg L2 cache running the Windows optimized SETI app. A PowerMac 2.0 or greater would run rings around the AMD.

Hmmm, odd... A $600 single-proc P4 3.2 GHz box with Hyperthreading on XP is running rings around my Dual 2.5 GHz G5 (both using optimized workers). There's got to be more than the privilege to own a Mac to the difference in price, IMO. Maybe that move to Intel isn't all that wrong, after all, Karl.

Greetings from snowy Switzerland!

Ron

rick · Nov 25, 2005, 07:22 AM

These kind of got answered by the references that Karl provided but I'll try and be more specific:

Originally Posted by halimedia

- Why do G4s seem to benefit more from your optimizations than G5s?

Currently, me and Alex are the only ones developing this and we've only got access to G4 PowerBooks. When we optimise a routine and benchmark it we can only see how fast it runs on a G4.

These routines will still probably be fast on a G5 but it may be possible to get more speed out of them by feeding them more data. However, without a G5 to benchmark several different routines on, it's mainly guess work. It's helpful when people supply Shark (program that shows code performance) traces from G5s, but to get a decent amount of speed out of it you really have to sit down at one and try a bunch of different things.

Originally Posted by halimedia

- Are you making use of the G5's additional FPU in the G5-optimized code?

Originally Posted by halimedia

So it seems that the vDSP lib takes care of this 'automagically' (utilizing the FPUs). However, Alex also mentioned that their code mostly takes advantage of Altivec, because it provides more throughput. Any chance to use both (i.e. Altivec and _both_ FPUs on G5s in parallel)? Sorry if this sounds silly to you hard-core coders, but I'm curious...

We're actually starting to use vDSP less and use more of our own routines because we can do better than vDSP.

The sort of data that we work on is single-precision floating point. To vastly simplify, Altivec can work on 4 of these at once (a vector), compared to only 1 for the (scalar) FPU.
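To picture the "4 at once" pattern without any AltiVec specifics, here is a plain C sketch. It is not vector code: the four statements in the loop body just stand in for what a single vector instruction would do across its four lanes, and scale_by_four is a made-up example function.

```c
#include <stddef.h>

/* Multiply n floats by k, four per iteration, with a scalar tail. */
void scale_by_four(float *dst, const float *src, size_t n, float k)
{
    size_t end = n - (n & 3);   /* largest multiple of 4 not above n */
    size_t i;

    for (i = 0; i < end; i += 4) {
        /* One "vector" worth of work: four independent lanes. */
        dst[i + 0] = src[i + 0] * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    for (; i < n; i++)          /* leftover 0 to 3 elements */
        dst[i] = src[i] * k;
}
```

The same split into a main loop over whole blocks plus a short remainder shows up in the chirp listing earlier in the thread, there with blocks of 8.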

In most of the algorithms we just use the Altivec unit and using the FPU wouldn't provide any extra benefit (Altivec has enough resources to handle all the data that you throw at it).

However, in (the forthcoming) alpha 6, the new chirp code uses double-precision (which you can't do in Altivec) so we use the scalar FPU for a bit before loading the data into the Altivec unit. Having 2 FPUs on the G5 makes this a bit faster, but I wouldn't think that this would make a massive difference.
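The post doesn't spell out why the chirp step wants double precision. One plausible reason (an assumption, not something stated above) is that the chirp phase grows with the square of time, so a single-precision float runs out of mantissa bits before the phase is wrapped back into one turn. The sketch below emulates single precision in Python with `struct`. The formula `pi * chirp_rate * t * t` is the textbook phase of a linear chirp, and the chirp rate and time are hypothetical values, not taken from the SETI@home code:

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def phase_via_double(chirp_rate, t):
    """Compute and wrap the phase in double precision, then narrow to single."""
    phase = math.pi * chirp_rate * t * t
    return to_f32(math.fmod(phase, 2.0 * math.pi))


def phase_all_single(chirp_rate, t):
    """Same formula, but every intermediate is rounded to single precision."""
    pi32 = to_f32(math.pi)
    two_pi32 = to_f32(2.0 * math.pi)
    t32 = to_f32(t)
    tt = to_f32(t32 * t32)
    phase = to_f32(to_f32(pi32 * to_f32(chirp_rate)) * tt)
    return to_f32(math.fmod(phase, two_pi32))


def angular_error(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = a - b
    return abs(math.atan2(math.sin(d), math.cos(d)))


CHIRP_RATE = 12.5  # Hz/s, hypothetical value for illustration
T = 100.5          # seconds, hypothetical value for illustration

reference = math.fmod(math.pi * CHIRP_RATE * T * T, 2.0 * math.pi)
err_double = angular_error(phase_via_double(CHIRP_RATE, T), reference)
err_single = angular_error(phase_all_single(CHIRP_RATE, T), reference)
print(err_double, err_single)
```

With these inputs the all-single path ends up thousands of times further from the reference than the path that does the wrap in double first, which is consistent with doing a short scalar step before handing single-precision data to the vector unit.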

Originally Posted by halimedia

And while we're at it: are there any reasonably easily implementable ways to take advantage of the GPU without having to write custom code for every model out there? - got to beat those Windoze boxen, after all, don't we?

To use the GPUs we'd have to either write some new code in OpenGL (programming interface for 2D/3D graphics) or use one of the GPGPU toolkits out there (which convert the code to OpenGL for us).

In theory, it might be possible to get quite a large boost out of the code. However, there are a couple of things that would make it trickier:

- Different graphics cards and different versions of the operating system will not necessarily support all the features that you would need for this (e.g. asynchronous transfer, render to texture, enough RAM, shader programs).
- For certain parts of the analysis, you would only be able to do the algorithm on the CPU (GPUs can't do sufficiently general computations yet). This means you have to read the data back from the graphics card, which can be quite slow. The graphics card's AGP interface is better at writing data to the card than at reading data from it. Newer systems with PCI Express (PCI-E) should be better for this, as the read bandwidth is much better.
- GPUs don't necessarily have sufficient precision for the calculations, and this varies between different GPUs and vendors.
- All of the above mean that you would have to test each different card to check that it will work correctly, whereas if you just write code for the CPU it will work almost identically (and at predictable speeds) across G4s, G5s, x86s, etc.

It would be an interesting area to explore, but I doubt we'd get round to this for at least several months.

Originally Posted by halimedia

Hmmm, odd... A $600 single-proc P4 3.2 GHz box with Hyperthreading on XP is running rings around my Dual 2.5 GHz G5 (both using optimized workers). There's got to be more than the privilege to own a Mac to the difference in price, IMO. Maybe that move to Intel isn't all that wrong, after all, Karl

I think the main (only?) advantage that the x86 machines have is the larger L2 cache which significantly reduces memory access times. Differences in memory bandwidth and RAM speed will also make a smaller difference.

The SETI@home worker works on 8 MiB of data (and is usually processing 1-2 MiB at once). With a 512 KiB cache, this means that you have to keep reading from RAM over the memory bus as opposed to the L2 cache which is on the CPU. At the moment, CPU speeds are much faster than memory speeds (and this gap keeps growing) so the more you can keep data in the CPU the better.

This is one of the reasons the new dual-core machines are stupid fast: they've got 1 MiB of L2 cache per core. I think they also have faster memory, but I haven't looked closely at their specs.
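To put rough numbers on the cache argument, here is a small sketch using only the sizes mentioned above (a 1-2 MiB active block, 512 KiB versus 1 MiB of L2). The helper name is made up, and the result is an upper bound that ignores code, other data and cache associativity:

```python
KIB = 1024
MIB = 1024 * KIB


def cached_fraction(block_bytes, l2_bytes):
    """Upper bound on the share of the active block that can sit in L2."""
    return min(1.0, l2_bytes / block_bytes)


for l2 in (512 * KIB, 1 * MIB):
    for block in (1 * MIB, 2 * MIB):
        print(f"L2 {l2 // KIB} KiB, block {block // MIB} MiB: "
              f"{cached_fraction(block, l2):.0%} fits")
```

With 512 KiB of L2, at best half of a 1 MiB block stays on the CPU; with 1 MiB per core, the whole block can fit in principle.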

Shaktai · Nov 25, 2005, 01:15 PM

Originally Posted by halimedia

Hmmm, odd... A $600 single-proc P4 3.2 GHz box with Hyperthreading on XP is running rings around my Dual 2.5 GHz G5 (both using optimized workers). There's got to be more than the privilege to own a Mac to the difference in price, IMO. Maybe that move to Intel isn't all that wrong, after all, Karl.

Greetings from snowy Switzerland!

Ron

Karl, can you provide examples of the actual machines, or even their machine IDs for work unit comparisons? Are these machines that you own? I am curious about actual numerical comparisons. And of course the Intels classically do better than the AMDs which I was comparing against. Are you comparing points awarded or seconds per average work unit?

E.T from tellus · Nov 25, 2005, 06:17 PM

http://setiathome.berkeley.edu/resul...hostid=1781916
Powermac G5 quad 2.5ghz - boinc 2.72beta - seti@home-G5-a5
Average WU time per processor: 2,500 sec, which means almost six units/hour
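The "almost six" figure can be checked with a couple of lines of Python. This is just the arithmetic from the post (4 processors, about 2,500 seconds per work unit each); the function name is made up for illustration:

```python
def units_per_hour(seconds_per_wu, processors):
    """Work units finished per hour when each processor crunches its own WU."""
    return processors * 3600.0 / seconds_per_wu


print(units_per_hour(2500, 4))  # about 5.76
```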

halimedia · Nov 25, 2005, 07:26 PM

Originally Posted by Shaktai

Karl, can you provide examples of the actual machines, or even their machine IDs for work unit comparisons? Are these machines that you own? I am curious about actual numerical comparisons. And of course the Intels classically do better than the AMDs which I was comparing against. Are you comparing points awarded or seconds per average work unit?

I'm not Karl, but I assume you meant me. I am comparing seconds per average WU.

Here's the DP 2.5 GHz G5:
http://setiweb.ssl.berkeley.edu/show...hostid=1116838

And here's the P4 3.2 GHz :
http://setiweb.ssl.berkeley.edu/show...hostid=1489145

An important note about the latter box: it is running with only one of its 'virtual' hyperthreading processors devoted to SETI (for noise reasons). When both 'virtual' processors are crunching, the crunch time rises to about 4100 seconds. Not exactly running circles around the G5, but still faster - and did I mention much cheaper (P4 x 4.5 = G5)?

Now imagine the score if I had bought four of these Wintel boxen and they were all crunching SETI. OK, OK, I admit - SETI performance is not the main criterion when I buy a machine. OS and app usability, spyware and worm susceptibility play a certain role, too.

Must be the caches, anyhow. Look at that Quad fly!

FWIW,

Ron

amigoivo · Nov 25, 2005, 07:49 PM

Originally Posted by halimedia

I'm not Karl, but I assume you meant me. I am comparing seconds per average WU.

Here's the DP 2.5 GHz G5:
http://setiweb.ssl.berkeley.edu/show...hostid=1116838

And here's the P4 3.2 GHz :
http://setiweb.ssl.berkeley.edu/show...hostid=1489145

An important note about the latter box: it is running with only one of its 'virtual' hyperthreading processors devoted to SETI (for noise reasons). When both 'virtual' processors are crunching, the crunch time rises to about 4100 seconds. Not exactly running circles around the G5, but still faster - and did I mention much cheaper (P4 x 4.5 = G5)?

Now imagine the score if I had bought four of these Wintel boxen and they were all crunching SETI. OK, OK, I admit - SETI performance is not the main criterion when I buy a machine. OS and app usability, spyware and worm susceptibility play a certain role, too.

Must be the caches, anyhow. Look at that Quad fly!

FWIW,

Ron

Hello Ron,

The measurements of my Power Macs are different from yours.
The integer speed of your Dual 2.5 GHz G5 is 3x higher than mine??
DUAL 2,5:
http://setiathome.berkeley.edu/show_...?hostid=169352

And my QUAD G5 isn't much faster:
QUAD G5
http://setiathome.berkeley.edu/show_...hostid=1713083

If I run the BOINC 5.2.5 benchmark, I get the same results.

Greetz, Ivo

Karl Schimanek · Nov 25, 2005, 08:06 PM

Originally Posted by Shaktai

Karl, can you provide examples of the actual machines, or even their machine IDs for work unit comparisons? Are these machines that you own? I am curious about actual numerical comparisons. And of course the Intels classically do better than the AMDs which I was comparing against. Are you comparing points awarded or seconds per average work unit?

Intel do better than AMD

To be talking a lot of hot air? That's for sure

For comparison, crunching the reference WU:

PowerMac G4 733MHz using alpha5 (mine): ca. 10,800 seconds
PowerMac Quad G5 2.5GHz using alpha5 (mine): ca. 2,000 seconds
http://setiathome.berkeley.edu/show_...userid=8169359

AMD Athlon64 2.4GHz using "Harold Naparst client": ca. 1600 seconds
Intel? Nothing to speak of.

Here are other results:
http://www.marisan.nl/seti/reference.htm

Regards
Karl

Knightrider · Nov 26, 2005, 04:52 AM

These routines will still probably be fast on a G5, but it may be possible to get more speed out of them by feeding them more data. However, without a G5 to benchmark several different routines on, it's mainly guesswork. It's helpful when people supply Shark (a program that shows code performance) traces from G5s, but to get a decent amount of speed out of it you really have to sit down at one and try a bunch of different things.

Shark? Would someone post a link for me please.

TIA

K.

rick · Nov 26, 2005, 05:24 AM

Originally Posted by Knightrider

Shark? Would someone post a link for me please.

Shark is part of the developer tools that come on your Mac OS X installation disc. It's under the CHUD tools (Computer Hardware Understanding Development).

However, you'll probably want the latest version (there are a couple of bug fixes in it) from Apple's FTP server.

It comes with documentation but you'll probably also want to look at the Apple developer site.

If you need to download the latest developer tools (Xcode 2.2) then you need to sign up for a free account at http://connect.apple.com and go to the downloads section in your account.

When you actually make a Shark trace it's really useful if you have the source code attached (so it basically tells you exactly which lines of source code are running slowly and exactly how they translate into machine code). I'm going to try and make this easier to do with the alpha 6 source code (i.e. a special build environment).

If you need any help, just post here or send me an email.

halimedia · Nov 26, 2005, 05:59 AM

Originally Posted by amigoivo

The measurements of my Power Macs are different from yours.
The integer speed of your Dual 2.5 GHz G5 is 3x higher than mine??

I'm using the MacNN Superbench BOINC Menubar client (4.44). That's the difference, I'd say. Odd - your Quad seems a fair bit slower than E.T from tellus'. I wonder why...

Just out of curiosity: why don't you let your Quad return results immediately after they're done?

Originally Posted by Karl Schimanek

AMD Athlon64 2.4GHz using "Harold Naparst client": ca. 1600 seconds
Intel? Nothing to speak of.

Karl, could you post a dl URL for the Naparst client? TIA!

Cheers,

Ron