 |
 |
G5 supercomputer comes up at disappointing 7.4 Gigaflops
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 2000
Location: Caught in a web of deceit.
Status:
Offline
|
|
|
|
|
|
| |
|
|
|
 |
|
 |
|
Addicted to MacNN
Join Date: Mar 2000
Location: London, UK
Status:
Offline
|
|
I think you mean 7.4Tflops.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Nov 2000
Location: in front of my Mac
Status:
Offline
|
|
Yes, he was supposed to say TFlops, but that's not the worst thing he said.
Calling this "disappointing" is stupid. Or did somebody really happen to believe all the kiddies claiming this would be the number-one cluster...?
2k 2GHz G5s made 7.4
but
1.9k 2.4GHz Xeons made 4
1.2k 1.3GHz Power4 made 3.2
1k Itanium2 made 4.1
4k Alphas made 7.7
This puts the G5 cluster in the top league. This is great news. Never before have Macs been so high up in terms of pure supercomputing power.
Calling this disappointing just because some kid claimed in advance it should be 20TF/s because he multiplied the single max burst performance by the number of procs is simply ridiculous.
This is a great result, especially if you weight it with cost. The Apple solution will prove to be one of the most efficient for the Dollar.
|
|
•
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Oops, yes Teraflops.
No, I agree it's very respectable, but it's still disappointing because:
1) Dongarra was claiming 80% efficiency for the G5 with 128 processors, which would put it at 13.5 Teraflops for a 2112-processor setup, and over 14 Teraflops for a 2200-processor setup. It was not just "some kid". It is the man who manages the Top 500 list. Perhaps he simply misunderstood 128 G5s to mean 128 processors, when in fact it was 256 processors. (The people over at Ars were very surprised at the 80% figure actually, since POWER4 gets something like 50%, but it was Dongarra who said it.)
2) VT/Apple's main competitor was Linux NetworX, and if VT had managed to use all 2200 G5 CPUs, VT would have edged them out. But VT used 2112 processors, so they didn't.
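For anyone who wants to check the projection, here is a rough sketch of the arithmetic. It assumes the 8 Gflops per-CPU theoretical peak for the 2.0 GHz G5 that is quoted elsewhere in this thread; the function name is just for illustration.

```python
# Back-of-the-envelope projection: Rmax = CPUs * peak per CPU * efficiency.
# Assumption: 8 Gflops theoretical peak per 2.0 GHz G5 CPU.
PEAK_GFLOPS_PER_G5 = 8.0


def projected_rmax_tflops(cpus, efficiency, peak_gflops_per_cpu=PEAK_GFLOPS_PER_G5):
    """Return the projected LINPACK Rmax in Tflops."""
    return cpus * peak_gflops_per_cpu * efficiency / 1000.0


# What 80% efficiency would have meant for 2112 and 2200 CPUs.
for cpus in (2112, 2200):
    print(cpus, "CPUs at 80%:", round(projected_rmax_tflops(cpus, 0.80), 2), "Tflops")

# Efficiency implied by the reported 7.4 Tflops on 2112 CPUs.
implied_efficiency = 7.4 * 1000.0 / (2112 * PEAK_GFLOPS_PER_G5)
print("implied efficiency:", round(implied_efficiency, 3))
```

The implied figure comes out well under half of peak, which is why the 80% claim looked suspicious.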
But yeah, it's still a very nice score. And also note:
Xeon 2.4 GHz - 1.38 Gflops/GHz
G5 2.0 GHz - 1.76 Gflops/GHz
Extrapolating, that means the G5 2.0 is as fast as a Xeon 2.54.
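A minimal sketch of how those per-GHz figures can be reproduced, assuming the Xeon number refers to the Linux NetworX cluster (2304 CPUs, 7.634 Tflops) and the G5 number to the first reported 7.41 Tflops on 2112 CPUs:

```python
def gflops_per_ghz(rmax_tflops, cpus, clock_ghz):
    """Sustained LINPACK Gflops per CPU, normalised by clock speed."""
    per_cpu_gflops = rmax_tflops * 1000.0 / cpus
    return per_cpu_gflops / clock_ghz


# Assumed inputs: Linux NetworX Xeon cluster and the first VT G5 result.
xeon = gflops_per_ghz(7.634, 2304, 2.4)
g5 = gflops_per_ghz(7.41, 2112, 2.0)

# Clock speed a Xeon would need to match one 2.0 GHz G5 on this benchmark.
xeon_equivalent_ghz = g5 * 2.0 / xeon

print(round(xeon, 2), round(g5, 2), round(xeon_equivalent_ghz, 2))
```

This only says something about LINPACK on these two clusters, interconnect and all, not about the CPUs in general.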
Edited because numbers wrong.
(Last edited by Eug Wanker; Oct 19, 2003 at 01:53 PM.
)
|
|
|
| |
|
|
|
 |
|
 |
|
Fresh-Faced Recruit
Join Date: Mar 2003
Status:
Offline
|
|
Originally posted by Eug Wanker:
Dongarra was claiming 80% efficiency for the G5 with 128 processors, which would put it at 13.5 Teraflops for a 2112-processor setup
Hmm...I thought he said that he expected that 80% value to decrease as more processors were added. I don't remember where I read that but it would make sense that the overhead is going to increase as more processors are added.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Originally posted by fhammond:
Hmm...I thought he said that he expected that 80% value to decrease as more processors were added. I don't remember where I read that but it would make sense that the overhead is going to increase as more processors are added.
He did say that, but the problem is that I think he just misunderstood the submission.
There is a prelim score in there somewhere for 256 G5 processors, and it gets 40% efficiency.
ie. He probably misunderstood 128 Power Macs as 128 processors.
Oops.
|
|
|
| |
|
|
|
 |
|
 |
|
Fresh-Faced Recruit
Join Date: Aug 2003
Status:
Offline
|
|
Originally posted by Eug Wanker:
Oops, yes Teraflops.
But yeah, it's still a very nice score. And also note:
Xeon 2.4 GHz - 1.38 Gflops/GHz
G5 2.0 GHz - 1.76 Gflops/GHz
Extrapolating, that means the G5 2.0 is as fast as a Xeon 2.54.
Isn't that just for this code though? I'm sure other code would run just as fast as a 3.06 GHz Xeon if not a little faster. Also consider you can use Altivec with this machine (but not on LINPACK), and that should really fly past any Pentiums.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Originally posted by jrod7350:
Isn't that just for this code though? I'm sure other code would run just as fast as a 3.06 GHz Xeon if not a little faster. Also consider you can use Altivec with this machine (but not on LINPACK), and that should really fly past any Pentiums.
Yes, just with Linpack, and who knows if the scores will improve a bit once they've worked the kinks out.
|
|
|
| |
|
|
|
 |
|
 |
|
Junior Member
Join Date: Oct 2002
Status:
Offline
|
|
Originally posted by jrod7350:
Isn't that just for this code though? I'm sure other code would run just as fast as a 3.06 GHz Xeon if not a little faster. Also consider you can use Altivec with this machine (but not on LINPACK), and that should really fly past any Pentiums.
We have already seen other code where a 2-GHz G5 is indeed faster than a 3.06-GHz Xeon. Still, it would have been good if that were the case for the Terascale Cluster.
As for Altivec, it is not useful for double-precision LINPACK.
Altivec will not necessarily guarantee a G5 to fly past any Pentiums because the current Pentium 4's and Xeons have Hyperthreading, and they have SSE2 anyway. Depending on the situation, either of them could fly past the other.
|
|
|
| |
|
|
|
 |
|
 |
|
Dedicated MacNNer
Join Date: Apr 2003
Status:
Offline
|
|
I never get tired of saying this
"HyperThreading" does not generally improve performance when you are running computationally intensive tasks on your machine because you still really have a single processor and all HT really does is to compensate for bad branch predictions that could cause a nasty stall in the P4/Xeon's unfeasibly long pipeline.
In recent tests, the Xeon chips with more cache than the P4 have performed marginally better, but there's really nothing that even mighty Intel can design to protect against a catastrophic branch misprediction!
Where the G5 may shine, if we step away from GCC, is in general floating point computation since it features *2* floating point units. I can't guess at how powerful these units are in comparison with, say, a P4/Xeon because I really don't know, but the Power4+ generally doesn't do too badly. The other feature that the G5 should do well in is where there is a large amount of memory utilization and other processor I/O .... again, this has yet to be proven definitively (or I've missed it.....)
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Sep 2001
Location: Folding customer returned size 52 underwear.
Status:
Offline
|
|
They wanted to be in the top 10, they are #5.
Since it has only been running a week they should be able to tweak things.
|

{ v2.3 Now Jesus free}
Religions are like farts: yours is good, the others always stink.
|
| |
|
|
|
 |
|
 |
|
Professional Poster
Join Date: Oct 2001
Location: Pacific Northwest
Status:
Offline
|
|
Does this mean surfing the web won't be as fast as I thought it would be with a Dual 2.0? 
|
|
|
| |
|
|
|
 |
|
 |
|
Professional Poster
Join Date: Jun 2003
Location: Hyrule
Status:
Offline
|
|
Not bad, but not good either. The fact that an INTEL powered cluster is whooping IBM's powaaa is a BAD thing.
Especially because now I have to put up with the 'h4x0r' admins on my network who will suddenly feel better about their antec boxed celery machines
|
|
Aloha
|
| |
|
|
|
 |
|
 |
|
Grizzled Veteran
Join Date: Apr 2001
Status:
Offline
|
|
Too bad no one has put an EV7 Alpha cluster together yet. From what I know, HP refuses to publish benchmarks on these things, because they don't want Intel to get mad. Apparently the EV7 creams the Itanium something fierce.
And of course the EV8 is dead. As are the PA-RISC chips.
I'm just glad IBM is smart enough not to kill off their own processor line in favor of the Itanium. After all, now Apple is benefitting from this.
|
|
<This space under renovation>
|
| |
|
|
|
 |
|
 |
|
Junior Member
Join Date: Dec 2002
Status:
Offline
|
|
Originally posted by Drakino:
And of course the EV8 is dead. As are the PA-RISC chips.
You're right, Intel hired all but one member of the EV8 team to work on Itanium. Oh, and the PA-RISC chips are pretty much dead; HP is basically running out the clock on these chips.
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Enthusiast
Join Date: Mar 2003
Location: Globetrotting
Status:
Offline
|
|
A nice thing that should be on the chart is how much it cost to build each cluster.
|
|
If a group of mimes are miming a forest and one falls down, does he make a sound?
|
| |
|
|
|
 |
|
 |
|
Junior Member
Join Date: Dec 2002
Status:
Offline
|
|
Originally posted by power142:
I never get tired of saying this 
"HyperThreading" does not generally improve performance when you are running computationally intensive tasks on your machine because you still really have a single processor and all HT really does is to compensate for bad branch predictions that could cause a nasty stall in the P4/Xeon's unfeasibly long pipeline.
In recent tests, the Xeon chips with more cache than the P4 have performed marginally better, but there's really nothing that even mighty Intel can design to protect against a catastrophic branch misprediction!
Where the G5 may shine, if we step away from GCC, is in general floating point computation since it features *2* floating point units. I can't guess at how powerful these units are in comparison with, say, a P4/Xeon because I really don't know, but the Power4+ generally doesn't do too badly. The other feature that the G5 should do well in is where there is a large amount of memory utilization and other processor I/O .... again, this has yet to be proven definitively (or I've missed it.....)
Just where did you get the idea the HT was supposed to alleviate branch mispredict penalties?
The entire concept behind hyperthreading is that it essentially allows two virtual cores to run on a single CPU. This is a godsend for a processor like the Pentium 4, which usually can't complete (3 uops or roughly 2 x86 instructions/cycle) nearly as many instructions as its resources (2 double ALUs; 4 effective, SSE2, 1 FPU etc) allow, and the improvement obviously shows in optimised benchmarks like Cinebench and Lightwave (and these are CPU intensive).
Exactly how is having a long pipeline bad? Higher clocked CPUs with excellent branch prediction and OOOE (like the G5 and P4) will more than make up for the higher branch mispredict penalties. It's stupid to assume that longer pipelined CPUs like the P4 will automatically be inferior to shorter pipelined CPUs when they both have their advantages.
(Last edited by CubeBoy; Oct 20, 2003 at 12:30 PM.
)
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally posted by Drakino:
Too bad no one has put an EV7 Alpha cluster together yet. From what I know, HP refuses to publish benchmarks on these things, because they don't want Intel to get mad. Apparently the EV7 creams the Itanium something fierce.
And of course the EV8 is dead. As are the PA-RISC chips.
I'm just glad IBM is smart enough not to kill off their own processor line in favor of the Itanium. After all, now Apple is benefitting from this.
EV7 beat Itanium when it was released, but it doesn't anymore (~1500SPECfp vs. ~2000 SPECfp for the I2). It's a shadow of what it could have been, had DEC/Compaq/HP not decided to kill it (it's basically just a die shrink of the EV6).
|
|
|
| |
|
|
|
 |
|
 |
|
Addicted to MacNN
Join Date: Jan 2003
Location: ~/
Status:
Offline
|
|
$5.2 million for the 4th fastest computer in the world doesn't sound half bad.
|
|

|
| |
|
|
|
 |
|
 |
|
Fresh-Faced Recruit
Join Date: Sep 2003
Status:
Offline
|
|
Originally posted by CubeBoy:
Exactly how is having a long pipeline bad?
it is 'bad' when, as i'm sure you know, there is a missed branch. think of it this way; just as having a short pipeline does not make your CPU superior, neither would an endlessly long one. you wouldn't be defending a 200 stage pipeline, would you? you'd run out of operations to do at all those nice pipeline stages!
discovering who has the best combination of pipeline stages and branch prediction units is not likely to be resolved here..
It's stupid to assume that longer pipelined CPUs like the P4 will automatically be inferior to shorter pipelined CPUs when they both have their advantages.
and really, they both also have disadvantages, and you can't make many direct comparisons.. in particular, what is the worst-case branch mispredict penalty for the 970 and the p4? i can't find numbers for these, so maybe you might know them.
i'm jumping in because i recently had a discussion with a friend whose hardware runs on a very nice dual-core chip (and, as it happens, shorter pipelines) and it handily outperforms what they are trying to do with the 970. your mileage will indeed vary...
cheers..
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2001
Location: South of the Mason-Dixon line
Status:
Offline
|
|
Off the top of my head...
P4 has 19 stage pipeline and 90% branch prediction.
Current Athlon has 20 stage pipeline and 97% branch prediction.
Penalty for misprediction is severe (I've heard 50 clockcycles mentioned) in the P4 and half as much in the Athlon.
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally posted by Spliffdaddy:
Off the top of my head...
P4 has 19 stage pipeline and 90% branch prediction.
Current Athlon has 20 stage pipeline and 97% branch prediction.
Penalty for misprediction is severe (I've heard 50 clockcycles mentioned) in the P4 and half as much in the Athlon.
AthlonXP has a 10-14 stage pipeline depending on the type of operation. The P4's is 20. The Athlon64's is 12-16 (I think). There's a whitepaper by the guys working on the P5 about long pipelines and performance. They come to the conclusion that the ideal pipeline length for a P4-like processor with enough cache is 45 stages.
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Dec 2001
Location: Atlanta, GA, USA
Status:
Offline
|
|
Originally posted by Spliffdaddy:
Off the top of my head...
P4 has 19 stage pipeline and 90% branch prediction.
Current Athlon has 20 stage pipeline and 97% branch prediction.
Penalty for misprediction is severe (I've heard 50 clockcycles mentioned) in the P4 and half as much in the Athlon.
If the pipeline for a chip is 19 stages, then the largest penalty it should incur would be if the error was discovered at the last stage (19), and then on the next cycle it had to reload the instruction, putting the penalty at 39 cycles. This is the worst-case scenario.
|
|
Mac Pro 2x 2.66 GHz Dual core, Apple TV 160GB, two Windows XP PCs
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Confirmed by the New York Times. (Remember: macnn macnn for the login)
" Officials at the school said that they were still finalizing their results and that the final speed number might be significantly higher."
|
|
|
| |
|
|
|
 |
|
 |
|
Senior User
Join Date: Jan 2001
Location: San Francisco, CA
Status:
Offline
|
|
|
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Nov 2000
Location: in front of my Mac
Status:
Offline
|
|
Originally posted by slipjack:
Agree, Simon.
Just to add to my statement, I've asked how much the IBM Power3 cluster cost that we run our 3D multi-particle tracing codes on in Berkeley.
Apparently, the cluster was about 20 million $, but it's hard to get real numbers because IBM and UCB/LBL had some special deals.
Anyway, put these 20M in relationship to VT's 5M and then think about 8k Power3 getting just barely as much GF/s as 2k 970s.
I think this is a great advance. By making the 970 IBM has basically enabled Apple to bring Power4-style processing power down to a decent price point as an easy to install/maintain cluster. For Apple this is great news. It brings them back into the universities and back into clusters.
(Last edited by Simon; Oct 22, 2003 at 04:12 AM.
)
|
|
•
|
| |
|
|
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 2000
Location: Caught in a web of deceit.
Status:
Offline
|
|
New numbers (as of Oct. 22):
Code:
Place  Computer                                           CPUs    Rmax   Rpeak   Eff.
-------------------------------------------------------------------------------------
1      Earth Simulator                                    5120   35860   40960   .875
2      ASCI Q AlphaServer EV-68 (1.25 GHz w/Quadrics)     8160   13880   20480   .678
3      HP RX2600 Itanium 2 1.5 GHz w/Quadrics             1936    8633   11616   .743
4      Apple G5 dual 2.0 GHz Infiniband 4X/Cisco Gigabit  2112    8164   16896   .483
5      Linux NetworX/Quadrics (2.4 GHz Xeon w/Quadrics)   2304    7634   11059   .690
6      IBM SP Power 3 416 nodes 375 MHz                   6656    7304    9984   .732
(Rmax and Rpeak in Gflops.)
Apple/VT is now ahead of Linux NetworX, but the new HP Itanium 2 cluster has taken over 3rd place.
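The Eff. column is just Rmax divided by Rpeak. A quick sketch that recomputes it from the numbers in the table above, and adds sustained Gflops per CPU for comparison (the shortened system names are mine):

```python
# (name, CPUs, Rmax in Gflops, Rpeak in Gflops), copied from the table.
systems = [
    ("Earth Simulator", 5120, 35860, 40960),
    ("ASCI Q", 8160, 13880, 20480),
    ("HP Itanium 2", 1936, 8633, 11616),
    ("Apple G5 (VT)", 2112, 8164, 16896),
    ("Linux NetworX Xeon", 2304, 7634, 11059),
    ("IBM SP Power 3", 6656, 7304, 9984),
]


def efficiency(rmax, rpeak):
    """Fraction of theoretical peak actually sustained on LINPACK."""
    return rmax / rpeak


for name, cpus, rmax, rpeak in systems:
    print(f"{name:<20} eff={efficiency(rmax, rpeak):.3f}  {rmax / cpus:.2f} Gflops/CPU")
```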
|
|
|
| |
|
|
|
 |
|
 |
|
Registered User
Join Date: Apr 2003
Location: The Internets
Status:
Offline
|
|
regarding the NYT article. basically apple is getting the same speed (if they used 2300 cpus instead of 2100) for 1/3 to 1/2 the cost...
-Livermore system consisting of 2304 Intel Xeon processors, is capable of 7.63 trillion operations a second, at a price estimated at $10 million to $15 million. The Virginia Tech computer makes the cost-to-performance equation even starker.-
Very Very good news....
I'm happy again...
(just take the 10 million you save and buy 4200 more g5 and take on the japanese weather machine)
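A rough sketch of the cost-per-performance comparison, assuming the $5.2 million figure mentioned earlier in the thread for VT with its first reported 7.41 Tflops, and the NYT estimate of $10 to $15 million for the 7.63 Tflops Livermore system:

```python
def dollars_per_gflops(cost_usd, rmax_tflops):
    """Purchase cost divided by sustained LINPACK Gflops."""
    return cost_usd / (rmax_tflops * 1000.0)


# Assumed inputs: thread's $5.2M for VT, NYT's $10M-$15M range for Livermore.
vt = dollars_per_gflops(5.2e6, 7.41)
livermore_low = dollars_per_gflops(10e6, 7.63)
livermore_high = dollars_per_gflops(15e6, 7.63)

print(f"VT:        ${vt:,.0f} per Gflops")
print(f"Livermore: ${livermore_low:,.0f} to ${livermore_high:,.0f} per Gflops")
print(f"VT cost ratio: {vt / livermore_high:.2f} to {vt / livermore_low:.2f}")
```

The ratio lands roughly between a third and a half, which matches the estimate above. These are hardware price estimates only and say nothing about power, cooling, or staff.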
|
|
|
| |
|
|
|
 |
|
 |
|
Forum Regular
Join Date: Feb 2000
Location: Naples, ID
Status:
Offline
|
|
I really don't know much about clusters (they seem like an entirely different beast to me), but, in a way, they're quite fascinating!
Anyway, in an effort to understand things a little better, I have a question:
Eug: I'm assuming the last column (Eff.) refers to the efficiency of the cluster as compared to its Rpeak score (Rmax / Rpeak = Eff.). If that's correct, what is holding the VT cluster at .483? I can only assume it has something to do with bandwidth limitations. Anyone care to speculate?
BTW I read that the VT cluster is running custom software that makes up for the G5s lack of support for ECC memory...could this account for any part of the efficiency hit?
|
|
- Design: QS G4 933 / GF4MX / R7k / 1GB / 160GB RAID / 60GB boot / Jaguar
- Games: Abit KD7-RAID / XP 2200+ / Ti4200 / 512MB GeIL PC3200 / 40GB / XP pro
|
| |
|
|
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 2000
Location: Caught in a web of deceit.
Status:
Offline
|
|
Originally posted by jaisun:
I really don't know much about clusters (they seem like an entirely different beast to me), but, in a way, they're quite fascinating!
Anyway, in an effort to understand things a little better, I have a question:
Eug: I'm assuming the last column (Eff.) refers to the efficiency of the cluster as compared to its Rpeak score (Rmax / Rpeak = Eff.). If that's correct, what is holding the VT cluster at .483? I can only assume it has something to do with bandwidth limitations. Anyone care to speculate?
BTW I read that the VT cluster is running custom software that makes up for the G5s lack of support for ECC memory...could this account for any part of the efficiency hit?
Eff. is efficiency yes.
I have no idea why it's so low. However, I don't know anything about this stuff.
What I can tell you is that the POWER series machines aren't characteristically efficient at this sort of stuff, and aren't expected to be. ie. Their theoretical speeds are very high, but nobody expects them to perform like that in real life. eg. The Itanium 2 1.5 has a theoretical max of 6 Gflops but the G5 2.0 has a theoretical max of 8 Gflops. However, nobody expects a G5 2.0 to outperform an Itanium 2 1.5.
BTW, somebody posted a screengrab of the numbers from the PDF:

|
|
|
| |
|
|
|
 |
|
 |
|
Senior User
Join Date: Feb 2003
Location: Atlanta
Status:
Offline
|
|
|
(Last edited by coolmacdude; Oct 22, 2003 at 03:50 PM.
)
|
|
2.16 Ghz Core 2 Macbook, 3GB Ram, 120 GB
|
| |
|
|
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 2000
Location: Caught in a web of deceit.
Status:
Offline
|
|
Whoops. They didn't post the new numbers. 
|
|
|
| |
|
|
|
 |
|
 |
|
Senior User
Join Date: Feb 2003
Location: Atlanta
Status:
Offline
|
|
Originally posted by Eug:
Whoops. They didn't post the new numbers.
Someone did in the comments, but I submitted that on Monday, sure took them long enough.
|
|
2.16 Ghz Core 2 Macbook, 3GB Ram, 120 GB
|
| |
|
|
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 1999
Status:
Offline
|
|
That list is already "old." Plus, it's only running at 40% efficiency for the cluster. On a smaller test group it ran at over 70%. Already VT has added a teraflop to that number and are dedicating another 4 months just for optimization. It's also not running all 2200 processors, some of the Macs still aren't hooked up yet.
Plus, they only spent $5 million.
|
|
"…I contend that we are both atheists. I just believe in one fewer god than
you do. When you understand why you dismiss all the other possible gods,
you will understand why I dismiss yours." - Stephen F. Roberts
|
| |
|
|
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 2000
Location: Caught in a web of deceit.
Status:
Offline
|
|
Originally posted by olePigeon:
That list is already "old."
Yes. See above posts.
On a smaller test group it ran at over 70%.
I don't think that's correct. It sounds like Dongarra made a mistake in saying 80% efficiency, when in fact it was exactly half that. The 128 node (256 CPU) numbers have already been recorded, and are in the database. It should be noted that the theoretical max numbers were incorrectly documented in his database at first, and then they changed later.
Already VT has added a teraflop to that number
Well, not quite. 7.41 -> 8.16.
and are dedicating another 4 months just for optimization. It's also not running all 2200 processors, some of the Macs still aren't hooked up yet.
Yeah, it will be interesting to see if they can add another 32 nodes (64 CPUs). If they can do that AND reach 50% efficiency, they will be able to beat the Itanium 2 cluster's current numbers.
Plus, they only spent $5 million.
Yes, very cheap. One wonders how much the Itanium 2 cluster costs.
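To put numbers on that what-if, a small sketch assuming the 8 Gflops per-CPU peak implied by the Rpeak column and the Itanium 2 cluster's current Rmax of 8633 Gflops:

```python
PEAK_GFLOPS_PER_CPU = 8.0   # assumption: 16896 Gflops Rpeak / 2112 CPUs
ITANIUM2_RMAX = 8633.0      # Gflops, from the Oct. 22 numbers


def rmax_gflops(cpus, efficiency):
    """Sustained Gflops for a given CPU count and efficiency."""
    return cpus * PEAK_GFLOPS_PER_CPU * efficiency


def efficiency_needed(cpus, target_gflops):
    """Efficiency required to reach a target Rmax with this many CPUs."""
    return target_gflops / (cpus * PEAK_GFLOPS_PER_CPU)


full_cluster = 2112 + 64  # adding the remaining 32 dual-CPU nodes
print(rmax_gflops(full_cluster, 0.50))                             # 8704.0
print(round(efficiency_needed(full_cluster, ITANIUM2_RMAX), 3))    # with 2176 CPUs
print(round(efficiency_needed(2112, ITANIUM2_RMAX), 3))            # with 2112 CPUs
```

So with all 2176 CPUs they need just under 50%, but with the current 2112 they would need slightly more than 51%.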
|
|
|
| |
|
|
|
 |
|
 |
|
Addicted to MacNN
Join Date: Sep 2001
Location: NYC*Crooklyn
Status:
Offline
|
|
Originally posted by osxisfun:
regarding the NYT article. basically apple is getting the same speed (if they used 2300 cpus instead of 2100) for 1/3 to 1/2 the cost...
-Livermore system consisting of 2304 Intel Xeon processors, is capable of 7.63 trillion operations a second, at a price estimated at $10 million to $15 million. The Virginia Tech computer makes the cost-to-performance equation even starker.-
Very Very good news....
I'm happy again...
(just take the 10 million you save and buy 4200 more g5 and take on the japanese weather machine)
it's ridiculous how cheap this cluster/supercomputer is compared to the other ones on that list.
the #1 is estimated at 250mil and the others are at least double or triple that of the VT one.
7.4T disappointing? you can count on 1 hand the places that can reach that level. eug...you gotta edit that outta your thread title or everybody is gonna lose respect for you.
|
|
|
| |
|
|
|
 |
|
 |
|
Clinically Insane
Join Date: Dec 2000
Location: Caught in a web of deceit.
Status:
Offline
|
|
Originally posted by Apple Pro Underwear:
it's ridiculous how cheap this cluster/supercomputer is compared to the other ones on that list.
the #1 is estimated at 250mil and the others are at least double or triple that of the VT one.
7.4T disappointing? you can count on 1 hand the places that can reach that level. eug...you gotta edit that outta your thread title or everybody is gonna lose respect for you.
Does anyone ever read thread messages before posting anymore?
Anyways, it was my understanding that after a certain period of time, the thread headings cannot be changed. Haven't tried it though.
|
|
|
| |
|
|
|
 |
|
 |
|
Senior User
Join Date: Feb 2003
Location: Atlanta
Status:
Offline
|
|
Originally posted by Apple Pro Underwear:
you gotta edit that outta your thread title or everybody is gonna lose respect for you
Why? If that's his opinion, so what.
|
|
2.16 Ghz Core 2 Macbook, 3GB Ram, 120 GB
|
| |
|
|
|
 |
|
 |
|
Addicted to MacNN
Join Date: Sep 2001
Location: NYC*Crooklyn
Status:
Offline
|
|
Originally posted by coolmacdude:
Why? If that's his opinion, so what.
it's like me going to a jesus board and saying jesus is disappointing (when obviously jesus is not!)
i read this whole thread!!! were you making some sort of self-deprecating g5 joke? you needed a sarcastic line if that's the case...like:
"only 7.4 terabytes? why wonder it's a third of the cost"
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Originally posted by Apple Pro Underwear:
it's like me going to a jesus board and saying jesus is disappointing (when obviously jesus is not!)
The lounge is this way -->
But yeah, I'm guessing 7.4 Tflops would be a bit disappointing even to the VT people, too. I betcha they're happier with the new numbers though.
|
|
|
| |
|
|
|
 |
|
 |
|
Addicted to MacNN
Join Date: Sep 2001
Location: NYC*Crooklyn
Status:
Offline
|
|
Originally posted by Eug Wanker:
But yeah, I'm guessing 7.4 Tflops would be a bit disappointing even to the VT people, too. I betcha they're happier with the new numbers though.
actually the numbers are preliminary right? i read the actual paper on the train yesterday...but it did say that the scores could be improved dramatically after more diagnosis...
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Originally posted by Apple Pro Underwear:
actually the numbers are preliminary right? i read the actual paper on the train yesterday...but it did say that the scores could be improved dramatically after more diagnosis...
Yes, the numbers are preliminary.
They're already up to 8.16, which is a nice boost. They are likely much happier with that, but are still working on improving it.
|
|
|
| |
|
|
|
 |
|
 |
|
Junior Member
Join Date: Dec 2002
Status:
Offline
|
|
Originally posted by docman:
it is 'bad' when, as i'm sure you know, there is a missed branch. think of it this way; just as having a short pipeline does not make your CPU superior, neither would an endlessly long one. you wouldn't be defending a 200 stage pipeline, would you? you'd run out of operations to do at all those nice pipeline stages!
discovering who has the best combination of pipeline stages and branch prediction units is not likely to be resolved here..
We're not talking about the perceived superiority of endlessly long pipelined cpus or even the advantages/disadvantages of longer pipelined architectures. If you want to go to the extremes, we might as well discuss the viability of processors who forego pipelining(and superscalar processing) altogether! That however, would not be the point.
No, we're restricting our discussion solely to how any longer pipelined cpu will automatically be branded "bad" as implied by the original poster because of greater branch mispredict penalties.
and really, they both also have disadvantages, and you can't make many direct comparisons.. in particular, what is the worst-case branch mispredict penalty for the 970 and the p4? i can't find numbers for these, so maybe you might know them.
Yes and no, while endlessly long pipelines are not indefinitely superior, the industry (desktops, workstations, servers etc) seems to be gradually moving towards longer pipelined processors which should tell us something regarding the advantages of either.
and yes, I do know the branch mispredict penalties of both. I actually completed an analysis comparing the branch mispredict rates of several processors in another forum some time ago.
Pentium 4:
Branch History Table: 4096
Hit Rate: 95%
Min-Max Branch Prediction Penalty: 19-30
(1 - Hit Rate) * Min: .95
(1 - Hit Rate) * Max: 1.5
Motorola 7455 (G4e):
Branch History Table: 2048
Hit Rate: 92%
Min-Max Branch Prediction Penalty: 4-6
(1 - Hit Rate) * Min: .32
(1 - Hit Rate) * Max: .48
Athlon:
Branch History Table: 2048
Hit Rate: 92%
Min-Max Branch Prediction Penalty: 10-15
(1 - Hit Rate) * Min: .8
(1 - Hit Rate) * Max: 1.2
IBM PowerPC 970:
Branch History Table: 16000 (plus 32000 more entries in two other tables)
Hit Rate: 98% (estimate)
Min-Max Branch Prediction Penalty: 14-18 (?)
(1 - Hit Rate) * Min: .28
(1 - Hit Rate) * Max: .36
As you can see, even though the PPC970 has a longer pipeline (and thus a greater branch mispredict penalty) than the G4, its far superior branch prediction more than makes up for it by itself. Let's not forget that all the benefits of a longer pipeline (higher clock speeds, more instructions in flight, etc) still remain (compare the G5's 192 OoO window to the G4's measly 16 entry OoO window). The Pentium 4 doesn't have quite the branch prediction capabilities of the 970, but has several special instruction prefixes (HWNT, HST) that lessen the impact of branch mispredicts. These can be implemented by compilers as well as programmers, usually after feedback directed optimisations.
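The last two rows for each CPU are the expected mispredict cost per branch, i.e. (1 - hit rate) times the penalty in cycles. A minimal sketch that recomputes them from the hit rates and penalties listed above (the 970 figures are the estimates given there, not measured values):

```python
# name -> (hit rate, min penalty in cycles, max penalty in cycles)
cpus = {
    "Pentium 4": (0.95, 19, 30),
    "Motorola 7455 (G4e)": (0.92, 4, 6),
    "Athlon": (0.92, 10, 15),
    "IBM PowerPC 970": (0.98, 14, 18),  # hit rate and penalty are estimates
}


def expected_penalty(hit_rate, penalty_cycles):
    """Average cycles lost to misprediction per branch."""
    return (1.0 - hit_rate) * penalty_cycles


for name, (hit_rate, lo, hi) in cpus.items():
    low = round(expected_penalty(hit_rate, lo), 2)
    high = round(expected_penalty(hit_rate, hi), 2)
    print(f"{name:<22} {low} to {high} cycles per branch")
```

This ignores how often branches occur and what else the core can do while recovering, so it is only a first-order comparison.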
i'm jumping in because i recently had a discussion with a friend whose hardware runs on a very nice dual-core chip (and, as it happens, shorter pipelines) and it handily outperforms what they are trying to do with the 970. your mileage will indeed vary...
Let's not forget that most programs will be able to fit completely within the Power4's huge cache, having dual cores would probably help as well. 
(Last edited by CubeBoy; Oct 23, 2003 at 09:05 AM.
)
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
|
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally posted by CubeBoy:
We're not talking about the perceived superiority of endlessly long pipelined cpus or even the advantages/disadvantages of longer pipelined architectures. If you want to go to the extremes, we might as well discuss the viability of processors who forego pipelining(and superscalar processing) altogether! That however, would not be the point.
No, we're restricting our discussion solely to how any longer pipelined cpu will automatically be branded "bad" as implied by the original poster because of greater branch mispredict penalties.
Yes and no, while endlessly long pipelines are not indefinitely superior, the industry (desktops, workstations, servers etc) seems to be gradually moving towards longer pipelined processors which should tell us something regarding the advantages of either.
and yes, I do know the branch mispredict penalties of both. I actually completed an analysis comparing the branch mispredict rates of several processors in another forum some time ago.
Pentium 4:
Branch History Table: 4096
Hit Rate: 95%
Min-Max Branch Prediction Penalty: 19-30
(1 - Hit Rate) * Min: .95
(1 - Hit Rate) * Max: 1.5
Motorola 7455 (G4e):
Branch History Table: 2048
Hit Rate: 92%
Min-Max Branch Prediction Penalty: 4-6
(1 - Hit Rate) * Min: .32
(1 - Hit Rate) * Max: .48
Athlon:
Branch History Table: 2048
Hit Rate: 92%
Min-Max Branch Prediction Penalty: 10-15
(1 - Hit Rate) * Min: .8
(1 - Hit Rate) * Max: 1.2
IBM PowerPC 970:
Branch History Table: 16000 (plus 32000 more entries in two other tables)
Hit Rate: 98% (estimate)
Min-Max Branch Prediction Penalty: 14-18 (?)
(1 - Hit Rate) * Min: .28
(1 - Hit Rate) * Max: .36
As you can see, even though the PPC970 has a longer pipeline (and thus a greater branch mispredict penalty) than the G4, its far superior branch prediction more than makes up for it by itself. Let's not forget that all the benefits of a longer pipeline (higher clock speeds, more instructions in flight, etc) still remain (compare the G5's 192 OoO window to the G4's measly 16 entry OoO window). The Pentium 4 doesn't have quite the branch prediction capabilities of the 970, but has several special instruction prefixes (HWNT, HST) that lessen the impact of branch mispredicts. These can be implemented by compilers as well as programmers, usually after feedback directed optimisations.
Let's not forget that most programs will be able to fit completely within the Power4's huge cache, having dual cores would probably help as well.
Very interesting. Thanks for posting that. I hadn't realized quite how good the G5's branch prediction really is. That's pretty impressive.
|
|
|
| |
|
|
|
 |
|
 |
|
Fresh-Faced Recruit
Join Date: Sep 2003
Status:
Offline
|
|
look at the charts again. It appears the bigger number is simply a function of letting the benchmark run longer.
Nmax = the size of the largest problem run on the machine
Rmax = the performance in GFlops for the largest problem run on the machine.
Nmax changed from 450000 to 500000 when the Rmax value went up. The Itanium system has an Nmax of 835000. Maybe the next list will show what the Mac server would do with an Nmax of 835000.
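To give a feel for what those problem sizes mean, here is a rough sketch. It assumes a dense N x N matrix of 8-byte doubles and uses only the leading-order LINPACK operation count of (2/3) N^3, so the runtimes are ballpark estimates, not anything VT reported.

```python
def matrix_bytes(n):
    """Memory for a dense n x n matrix of 8-byte doubles."""
    return 8 * n * n


def approx_flops(n):
    """Leading-order operation count for solving a dense n x n system."""
    return (2.0 / 3.0) * n ** 3


def approx_runtime_hours(n, rmax_gflops):
    """Estimated wall-clock time if the whole run averages rmax_gflops."""
    return approx_flops(n) / (rmax_gflops * 1e9) / 3600.0


for n in (450000, 500000, 835000):
    terabytes = matrix_bytes(n) / 1e12
    hours = approx_runtime_hours(n, 8164)  # assumed: VT's Oct. 22 Rmax
    print(f"N={n}: {terabytes:.2f} TB matrix, roughly {hours:.1f} h at 8164 Gflops")
```

Whether the cluster has enough total RAM to hold the larger matrix is a separate question that this sketch does not answer.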
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
Originally posted by alandail:
look at the charts again. It appears the bigger number is simply a function of letting the benchmark run longer.
Nmax = the size of the largest problem run on the machine
Rmax = the performance in GFlops for the largest problem run on the machine.
Nmax changed from 450000 to 500000 when the Rmax value went up. The Itanium system has an Nmax of 835000. Maybe the next list will show what the Mac server would do with an Nmax of 835000.
Welcome to MacNN.
Interesting post, but I can't say I understand it. Are you saying it's just the test parameters which change the efficiency of the system, and that VT is currently searching for the best problem size to maximize benchmark numbers?

(Last edited by Eug Wanker; Oct 24, 2003 at 12:51 AM.
)
|
|
|
| |
|
|
|
 |
|
 |
|
Professional Poster
Join Date: Jun 2003
Location: Hyrule
Status:
Offline
|
|
from what I see it still shows the theoretical output can put the g5 as high as #3 (look at the rmax)
|
|
Aloha
|
| |
|
|
|
 |
|
 |
|
Fresh-Faced Recruit
Join Date: Sep 2003
Status:
Offline
|
|
that's what I was thinking - that the larger the problem, the higher percentage of the time it spends doing floating point calculations, thus the more efficient it becomes.
The other thing I considered is that the algorithm or the compiler in the benchmark may not take advantage of the dual floating point units in the G5. That may be the explanation for why the efficiency is below 50%.
If that were the case, this cluster could actually be #2 in speed when the second floating point unit is utilized. Wouldn't it be nice if this were just a compiler issue that could be quickly solved or even an optimization setting they didn't turn on yet.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Jun 2003
Location: Dangling something in the water… of the Arabian Sea
Status:
Offline
|
|
POWER4 usually doesn't get much past 50% for efficiency. There have been explanations at Ars describing why one should not expect the G5 to get anywhere close to the theoretical peak.
|
|
|
| |
|
|
|
 |
 |
|
 |
|
|
|
|
|

|
|
 |