Conroe early. Intel Power Macs forthcoming?
Senior User
Join Date: Jan 2001
Location: Seattle
Status:
Offline
|
|
http://digitimes.com/mobos/a20051221A1001.html
Nice! Intel seems to be cranking along with Merom/Conroe
Looks like if Apple is keeping up on the software front we may see a late 2006 announcement of Intel-based Power Macs.
This dovetails nicely with rumors that Rosetta is coming along quicker than planned.
How does a 1066FSB sound to you?
How does 4MB of L2 cache sound?
DDR2 800 memory support!
I'm ready. Bring it on.
|
|
|
Professional Poster
Join Date: Jul 2005
Location: Winnipeg, MB
Status:
Offline
|
|
1066Mhz FSB? Well that's 66Mhz faster than the lowest end Power Mac right now... I mean... not a massive deal. The 4MB of L2 Cache and memory aspects are far more appealing.
That said I don't think Apple will intro any new Power Macs until they can be certain the new ones will beat the current line up in benchmarks.
|
|
|
Mac Elite
Join Date: Sep 2005
Location: Los Angeles, California
Status:
Offline
|
|
I agree with Salty. Apple wouldn't shoot themselves in the foot by announcing Intel PowerMacs unless they outperformed the PPC ones by a significant margin.
|
|
Linkinus is king.
|
Professional Poster
Join Date: Jan 2001
Location: Salt Lake City, UT USA
Status:
Offline
|
|
Still, how often is that ghz bus's bandwidth being used to the max? Are you really cranking data through that sucker so much as to utilize it? I glanced at a couple of PC makers, and they're selling their top machines with 800mhz buses. I don't expect that we'd need more than what we've got.
That's a lot of cache.
I think between the higher clocked chips, and greater memory, we'll see a reasonable performance gain.
|
|
2008 iMac 3.06 Ghz, 2GB Memory, GeForce 8800, 500GB HD, SuperDrive
8gb iPhone on Tmobile
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
I think PowerMacs are going to be one of the last products to go... iMacs, on the other hand, could switch around Q3 next year.
|
|
|
Forum Regular
Join Date: Dec 2002
Status:
Offline
|
|
Originally Posted by hmurchison2001
How does a 1066FSB sound to you?
How does 4MB of L2 cache sound?
DDR2 800 memory support!
I'm ready. Bring it on.
Well, at least two of the PowerMacs today have higher FSB (1.15Ghz and 1.25Ghz respectively). All PowerMacs have 1MB of L2 cache per core. That means the Quad has 4MB of L2 Cache.
Really, the only advancement seems to be the faster memory.
Hardly anything worth getting excited about.
|
|
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Higher frequency doesn't mean higher speed. Although FSB1066 is clocked lower, it can transmit 8 GB/s in total; a G5 at 2.5 GHz has a total bandwidth of 10 GB/s per cpu, which is faster, although it can only transmit 5 GB/s one-way.
Anyway, since in most Pentium motherboards, two cpus have to share an FSB (which is a front-side bus and not a point-to-point connection), the situation gets actually worse. However for two reasons, this may not be such an issue. One, the first Macs with Intel cpu will be consumer Macs, which also implies they will be single-cpu Macs (dual core, but single cpu). Two, there are a few motherboards which have a dedicated fsb for each cpu.
Keep in mind that this is the very reason for the rather big L2 cache! G5s and Opterons/Athlons have less, because they have a larger bandwidth.
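For anyone who wants to check the arithmetic behind these figures, here is a rough sketch in Python. The bus widths are assumptions on my part and are not stated in the post: a 64-bit (8-byte) Intel FSB, and a G5 link that is 32 bits (4 bytes) wide in each direction and runs at half the cpu clock.

```python
def peak_gb_per_s(transfers_per_second, bytes_per_transfer):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return transfers_per_second * bytes_per_transfer / 1e9

# Assumption: FSB1066 moves 8 bytes per transfer, shared by both directions.
fsb1066_total = peak_gb_per_s(1066e6, 8)

# Assumption: a 2.5 GHz G5 has a 1.25 GHz link, 4 bytes wide per direction.
g5_one_way = peak_gb_per_s(1.25e9, 4)
g5_total = 2 * g5_one_way

print(f"FSB1066 total: {fsb1066_total:.1f} GB/s")
print(f"G5 one-way:    {g5_one_way:.1f} GB/s, total: {g5_total:.1f} GB/s")
```

These are paper peaks only; they say nothing about latency or about how much of the link a real workload keeps busy.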
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by sray
Well, at least two of the PowerMacs today have higher FSB (1.15Ghz and 1.25Ghz respectively). All PowerMacs have 1MB of L2 cache per core. That means the Quad has 4MB of L2 Cache.
FSB: Does it matter? Can you name a program or benchmark where the FSB speed (same CPU clockspeed and system architecture) has a nontrivial effect on performance? I can't. The FSB for the G5 is incredibly fast, but as far as I know it's well beyond the start of diminishing returns and could be cut in half (which was effectively done with the new dual core chips) without any noticeable penalty.
Cache: The quad has 1MB per core for each of 4 cores, Conroe has 4MB shared by two cores; the shared (and larger per core) cache should offer a performance improvement, although I have yet to see anyone who has implemented it to benchmark it (as far as I know all of the consumer level dual core chips have individual CPU caches).
OreoCookie is right about the low-end multisocket Xeon boards, where the two or four sockets have to share the FSB. I thought Intel was going point-to-point with the Intel Core based multiprocessor capable chips (Whitfield?), but that may have changed.
|
|
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by mduell
FSB: Does it matter? Can you name a program or benchmark where the FSB speed (same CPU clockspeed and system architecture) has a nontrivial effect on performance? I can't. The FSB for the G5 is incredibly fast, but as far as I know it's well beyond the start of diminishing returns and could be cut in half (which was effectively done with the new dual core chips) without any noticeable penalty.
The comparatively small throughput of the FSB is the reason for these huge caches in Intel cpus. Remember, a cache makes a cpu very expensive (considerably larger die, etc.). Especially the Xeons take quite a beating from the Opterons, coz their bandwidth is much larger (depending on the model, the Opterons have up to three 10 GB HT links, one is dedicated to cpu-cpu transfers).
Originally Posted by mduell
Cache: The quad has 1MB per core for each of 4 cores, Conroe has 4MB shared by two cores; the shared (and larger per core) cache should offer a performance improvement, although I have yet to see anyone who has implemented it to benchmark it (as far as I know all of the consumer level dual core chips have individual CPU caches).
Well, as long as the benchmark fits into the cache, it benefits tremendously. The large cache is in place to counteract the slow FSB.
Originally Posted by mduell
OreoCookie is right about the low-end multisocket Xeon boards, where the two or four sockets have to share the FSB. I thought Intel was going point-to-point with the Intel Core based multiprocessor capable chips (Whitfield?), but that may have changed.
Well, their roadmaps are more and more of a mess. Their numbering scheme contributed to that confusion considerably. Originally this point-to-point interconnect was planned to be out already, the first chipset with this new interconnect was supposed to be able to use either Itanic cpus or Xeons, but the Itanic has been delayed time and again.
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
Senior User
Join Date: Jan 2001
Location: Seattle
Status:
Offline
|
|
Originally Posted by Salty
1066Mhz FSB? Well that's 66Mhz faster than the lowest end Power Mac right now... I mean... not a massive deal. The 4MB of L2 Cache and memory aspects are far more appealing.
That said I don't think Apple will intro any new Power Macs until they can be certain the new ones will beat the current line up in benchmarks.
Guess you haven't seen benchmarks from Anandtech. Today's 3.6Ghz Xeon beat the G5 in memory bandwidth. Conroe will do even better.
|
|
|
Moderator 
Join Date: Apr 2000
Location: Gothenburg, Sweden
Status:
Offline
|
|
Originally Posted by hmurchison2001
Guess you haven't seen benchmarks from Anandtech. Today's 3.6Ghz Xeon beat the G5 in memory bandwidth. Conroe will do even better.
No, it doesn't. Page 2 of that report states:
On the flipside of the coin is the excellent FSB bandwidth. The G5/Power PC 970FX 2.7 GHz has a 1.35 GHz FSB (Full Duplex), capable of sending 10.8 GB/s in each direction. Of course, the (half duplex) dual channel DDR400 bus can only use 6.4 GB/s at most. Still, all this bandwidth can be put to good use with up to 8 data prefetch streams.
The memory read and write tests come out ahead due to the latency issue, but the G5 still has better bandwidth.
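The point of the quoted passage is that the slower link sets the ceiling. A small sketch in Python, taking the bus figure exactly as quoted (10.8 GB/s per direction) and assuming 8 bytes per transfer per channel for DDR400, which is my assumption and not part of the quote:

```python
# Assumption: each DDR400 channel does 400 million transfers/s, 8 bytes each.
ddr400_dual = 2 * 400e6 * 8 / 1e9   # GB/s, half duplex
bus_per_direction = 10.8            # GB/s, figure as quoted above

# Sustained memory traffic cannot exceed the slowest link in the path.
ceiling = min(bus_per_direction, ddr400_dual)

print(f"dual channel DDR400: {ddr400_dual:.1f} GB/s")
print(f"memory traffic ceiling: {ceiling:.1f} GB/s")
```

So on paper the memory, not the bus, is the limit here; measured read and write numbers can still land well below that ceiling because of latency.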
|
|
|
Mac Enthusiast
Join Date: Oct 2000
Location: Between heaven and hell
Status:
Offline
|
|
I can't see Apple selling an Intel pro Mac until FCP and Photoshop are universal apps.
|
|
Yes, I know I could buy a PC, but why?
|
Senior User
Join Date: Jan 2001
Location: Seattle
Status:
Offline
|
|
Originally Posted by Anand
I can't see Apple selling an Intel pro Mac until FCP and Photoshop are universal apps.
I think FCP will be UB by NAB 2006 in April.
PS will be longer but with Rosetta the PPC binary should work well enough. I think this nextgen chip is going to be very good for Apple. Intel scrapped the successors to the Netburst line for the promise of Merom so I'm thinking that Apple will want to move with alacrity towards getting computers out and into our hands.
|
|
|
Posting Junkie
Join Date: Dec 2000
Status:
Offline
|
|
Digitimes? They're the same guys who told us we were supposed to get a G5 iBook last summer. 
|
|
|
Professional Poster
Join Date: Jan 2001
Location: Salt Lake City, UT USA
Status:
Offline
|
|
To be frank about it, why would FCP not be UB right now? Don't you think Apple would have been the first people to have it ready for rollout? I imagine that FCP is close, if not ready to go now. Photoshop can't be hard to get there either. The trick, though, is what Adobe is doing to get it into the hands of users...
Do you think that they'll offer the new Binary for download with use of your S#, or will the next version simply be for intel, and they'll try to sell it to you with an upgrade? Maybe a CD exchange?
|
|
2008 iMac 3.06 Ghz, 2GB Memory, GeForce 8800, 500GB HD, SuperDrive
8gb iPhone on Tmobile
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by OreoCookie
The comparatively small throughput of the FSB is the reason for these huge caches in Intel cpus. Remember, a cache makes a cpu very expensive (considerably larger die, etc.). Especially the Xeons take quite a beating from the Opterons, coz their bandwidth is much larger (depending on the model, the Opterons have up to three 10 GB HT links, one is dedicated to cpu-cpu transfers).
Well, as long as the benchmark fits into the cache, it benefits tremendously. The large cache is in place to counteract the slow FSB.
The cache is consistent with the 2MB per core Dothans, except now they can make slightly more efficient use of the cache by sharing 4MB. Intel has done a lot of work on getting good yields with large chips; I think the larger cache is a relatively low cost way to improve performance. The onboard memory controller on the Opteron really helps their performance and I'm disappointed that Intel hasn't picked it up (although I understand why they haven't).
What are these benchmarks where FSB has a nontrivial impact on performance?
Originally Posted by OreoCookie
Well, their roadmaps are more and more of a mess. Their numbering scheme contributed to that confusion considerably. Originally this point-to-point interconnect was planned to be out already, the first chipset with this new interconnect was supposed to be able to use either Itanic cpus or Xeons, but the Itanic has been delayed time and again.
Indeed, and now I understand they're going to some 5 digit thing (1 letter plus 4 numbers) to indicate power consumption and performance. AMD's not doing much better since you need to look at the extra letters and/or know about cores to determine power consumption.
Originally Posted by P
The memory read and write tests come out ahead due to the latency issue, but the G5 still has better bandwidth.
hmurchison2001 was talking about memory throughput, not FSB. The G5 FSB is wildly fast, but does it really matter if the latency is so bad that you can't get memory performance out of it?
I think the G5 has pushed the FSB to the point where it's overkill. If anyone can link to a benchmark where the FSB speed (same clockrate/architecture) creates a nontrivial performance difference I'd like to see it.
|
|
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by mduell
The cache is consistent with the 2MB per core Dothans, except now they can make slightly more efficient use of the cache by sharing 4MB. Intel has done a lot of work on getting good yields with large chips; I think the larger cache is a relatively low cost way to improve performance. The onboard memory controller on the Opteron really helps their performance and I'm disappointed that Intel hasn't picked it up (although I understand why they haven't).
The cost of increasing the cache is very big. It makes Intel chips a lot more expensive, because they fit fewer (working) dies on a wafer. That's why IBM and AMD have smaller caches, they don't need to be so large. Especially the first few Pentium Ms had a very slow bus (FSB533), so a larger cache did allow them to compensate for that somewhat.
Just some random figures: If you compare a Prescott P4 Extreme Edition (0.13 micron) to early G5s (PPC970, also in 0.13 micron), the P4EE's die is roughly twice the size (240 mm2 compared to 121 mm2).
The PPC970FX's die size is a mere 66 mm2 -- contrast that to a Dothan Pentium M (84 mm2) or a P4 (Prescott) (112 mm2). The PPC970MP is more than twice as big, due to the increase in cache size to 1 MB, 154 mm2, which is still small compared to a Pentium D (I think around 230 mm2). This means, Intel can put a lot less Pentium Ds on a wafer than IBM.
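To put a rough number on the wafer argument, here is a sketch in Python using a common textbook estimate for gross dies per wafer. The 300 mm wafer diameter is an assumption of mine, the formula ignores defects entirely, and the die areas are the ones given above (including the guessed 230 mm2 for the Pentium D).

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Textbook estimate of gross dies per wafer (no defects, no scribe lines)."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    # Correction term for partial dies lost around the wafer edge.
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

WAFER_MM = 300  # assumption, not stated in the thread

ppc970mp = dies_per_wafer(WAFER_MM, 154)
pentium_d = dies_per_wafer(WAFER_MM, 230)
print(f"PPC970MP: about {ppc970mp} dies, Pentium D: about {pentium_d} dies")
```

The estimate only covers how many candidate dies fit; how many of them work is a separate question.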
Actually either TheRegister or TheInquirer had a very funny piece about this. It looks like Intel will use such a nice point-to-point interconnect in 2009 ... when it was first available in 2001 (yet another reason why Alpha was simply ahead of its time).
Originally Posted by mduell
What are these benchmarks where FSB has a nontrivial impact on performance?
Any multi-cpu system from Intel scales not that well compared to -- say -- Opteron systems, that's a pretty good benchmark, I would say. That's also one reason why Itanium isn't really taking off. And (I bet) never will.
Also, G5s scale better with frequency than Intel CPUs.
Originally Posted by mduell
Indeed, and now I understand they're going to some 5 digit thing (1 letter plus 4 numbers) to indicate power consumption and performance. AMD's not doing much better since you need to look at the extra letters and/or know about cores to determine power consumption.
It's totally confusing to the degree that I stopped caring about those numbers altogether. I still look for the frequency (and the core) if necessary. Although that's not really necessary. I built my last Intel system something like two years ago (for my parents, I have an AMD PC).
Originally Posted by mduell
I think the G5 has pushed the FSB to the point where it's overkill. If anyone can link to a benchmark where the FSB speed (same clockrate/architecture) creates a nontrivial performance difference I'd like to see it.
To be technically correct, the G5, the Opteron and most other cpus don't use FSBs (front side buses), they use point-to-point interconnects.
It does help. G5 Macs whose bus is clocked at only 1/3 of the cpu speed are marginally slower. Granted, it's not much, but people in the PC world would buy memory twice the price for these kinds of performance gains
The other huge advantage is that G5s don't have to share their bandwidth, as mentioned before. Obviously, this doesn't have to do with the actual throughput, but the topology of the interconnect. I'd rather have the G5's problem of too much bandwidth than Intel's of too little.
(Last edited by OreoCookie; Dec 24, 2005 at 07:19 AM.)
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
I'm a bit sleepy to really participate in the discussion right now, but I did want to note that the effect of cache on yields is a bit nonintuitive, because it can be protected by including a bit extra for redundancy. Any failing cache blocks can be disabled and replaced by the redundant ones. The end result is that the only major effect cache size has on yields is in number of dies per wafer, whereas increased core size hits both dies/wafer and the defect rate. One could argue that Intel's large caches relative to its competition are simply a way to turn its extremely large manufacturing capacity into a technical advantage (or a way to avoid having to implement other technical advantages like on-die memory controllers or SOI).
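A toy illustration of that yield argument in Python. It uses a simple Poisson defect model, which is my choice and not something from the post, and every number in it is hypothetical. Redundancy is idealized: any defect that lands in the cache is assumed to be repairable.

```python
import math

def poisson_yield(defects_per_cm2, critical_area_mm2):
    """Fraction of good dies under a simple Poisson defect model."""
    return math.exp(-defects_per_cm2 * critical_area_mm2 / 100.0)

# Hypothetical numbers, for illustration only.
DEFECTS_PER_CM2 = 0.5
core_mm2, cache_mm2 = 100.0, 60.0

# Without redundancy a defect anywhere on the die kills it.
no_redundancy = poisson_yield(DEFECTS_PER_CM2, core_mm2 + cache_mm2)

# Idealized redundancy: cache defects are repaired, only core area counts.
with_redundancy = poisson_yield(DEFECTS_PER_CM2, core_mm2)

print(f"yield without cache redundancy: {no_redundancy:.2f}")
print(f"yield with cache redundancy:    {with_redundancy:.2f}")
```

Under these assumptions the extra cache costs dies per wafer but barely moves the defect rate, which is the nonintuitive part.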
|
|
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by OreoCookie
The cost of increasing the cache is very big. It makes Intel chips a lot more expensive, because they fit fewer (working) dies on a wafer. That's why IBM and AMD have smaller caches, they don't need to be so large. Especially the first few Pentium Ms had a very slow bus (FSB533), so a larger cache did allow them to compensate for that somewhat.
Just some random figures: If you compare a Prescott P4 Extreme Edition (0.13 micron) to early G5s (PPC970, also in 0.13 micron), the P4EE's die is roughly twice the size (240 mm2 compared to 121 mm2).
The PPC970FX's die size is a mere 66 mm2 -- contrast that to a Dothan Pentium M (84 mm2) or a P4 (Prescott) (112 mm2). The PPC970MP is more than twice as big, due to the increase in cache size to 1 MB, 154 mm2, which is still small compared to a Pentium D (I think around 230 mm2). This means, Intel can put a lot less Pentium Ds on a wafer than IBM.
While cache/chip size is one factor in price, it's not everything. Despite having twice the cache, a 2.8Ghz Pentium D is still $100 (33%) cheaper than a 1.8Ghz Athlon X2.
The number of chips you can get from a wafer is not only a function of the size of the chip, but also what your yield is. Intel did a lot of work for Itanium on getting good yields (a must with a billion plus transistors) that they can presumably reuse for other chips.
The 533Mhz bus works very well with the Pentium M; the increase from 400Mhz was noticeable (they doubled cache at the same time), but not that much. Even adding a second core Intel is only going to push it up to 667Mhz.
Originally Posted by OreoCookie
Any multi-cpu system from Intel scales not that well compared to -- say -- Opteron systems, that's a pretty good benchmark, I would say. That's also one reason why Itanium isn't really taking off. And (I bet) never will.
Absolutely. Anything below 8-way and AMD's ptp system mops the floor with the Xeons and Itaniums; but on the high end (16-512 proc) I2 has its advantages.
Originally Posted by OreoCookie
Also, G5s scale better with frequency than Intel CPUs.
Link?
Originally Posted by OreoCookie
To be technically correct, the G5, the Opteron and most other cpus don't use FSBs (front side buses), they use point-to-point interconnects.
The Opteron is the odd man out, but the G5 and P4 are pretty much the same; one bus for one chip and they must access memory over that bus. Walks like a duck, swims like a duck, there's no meaningful differentiation between calling it a point to point interconnect and a front side bus when you have one CPU and one link.
Originally Posted by OreoCookie
It does help. G5 Macs whose bus is clocked at only 1/3 of the cpu speed are marginally slower. Granted, it's not much, but people in the PC world would buy memory twice the price for these kinds of performance gains
Show me. The only comparison I can think of is the PowerMac 1.8 (600FSB and 900FSB), but I haven't seen anyone compare the two.
Originally Posted by OreoCookie
The other huge advantage is that G5s don't have to share their bandwidth, as mentioned before. Obviously, this doesn't have to do with the actual throughput, but the topology of the interconnect. I'd rather have the G5's problem of too much bandwidth than Intel's of too little.
That's all great, but what are you going to use the paper bandwidth for? The next fastest part of the system is the RAM and it's a good bit slower (especially with the latency issues in the G5 system controller). HDD, GPU, and network don't come close.
|
|
|
Forum Regular
Join Date: Nov 2005
Status:
Offline
|
|
Originally Posted by Salty
That said I don't think Apple will intro any new Power Macs until they can be certain the new ones will beat the current line up in benchmarks.
Apple could put Yonah into the PowerMacs and they would beat most of the current G5 line, except the Quad (and then, the Quad would only win on heavily-multithreaded benchmarks). The P-M architecture is really fast per-clock. If you look at the SPEC integer benchmarks, the Pentium-M 780 is like 800 SPECint/GHz, while the G5 is only ~550 SPECint/GHz. And that's considering IBM's compiler. The fact that GCC's x86 backend is much better than its PPC backend skews things even more for most OS X apps (which are compiled with GCC, not IBM's XLC). If Intel fixes their FP performance (and initial indications suggest that Yonah is a great step in that direction), then there should be no worries about the Conroe-based PowerMacs being slower than the existing line.
|
|
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally Posted by rhashem
Apple could put Yonah into the PowerMacs and they would beat most of the current G5 line, except the Quad (and then, the Quad would only win on heavily-multithreaded benchmarks). The P-M architecture is really fast per-clock. If you look at the SPEC integer benchmarks, the Pentium-M 780 is like 800 SPECint/GHz, while the G5 is only ~550 SPECint/GHz. And that's considering IBM's compiler. The fact that GCC's x86 backend is much better than its PPC backend skews things even more for most OS X apps (which are compiled with GCC, not IBM's XLC). If Intel fixes their FP performance (and initial indications suggest that Yonah is a great step in that direction), then there should be no worries about the Conroe-based PowerMacs being slower than the existing line.
SPECint isn't exactly a nice thing to do to the G5  Having only two int pipes, high latency memory and 2-cycle minimum instruction latency on the int pipes doesn't make the G5 a real integer performance monster. Unfortunately, that makes it rather hard to compare the P-M and the G5, since they're both good where the other is bad. Hopefully Conroe will beef up the FPU further.
|
|
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Floating point is a bit closer, but the PM is still more efficient.
Even in SPECfp (speed benchmark) the Pentium M gets 608/599 (peak/base) per Ghz (at 2.26ghz) and the G5 gets 564/535 per Ghz (at 2.2Ghz).
In SPECfp_rate (throughput benchmark) the Pentium M gets 6.41/6.35 per Ghz per CPU (since it is multithreaded) and the G5 gets 4.54/4.36 per Ghz per CPU.
All of those numbers are using platform-specific compilers (ICC/IF and XLC/XLF), not GCC.
Of course, the G5 clocks 250-450Mhz faster than the Pentium M does.
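To see how the clock advantage plays against the per-GHz figures, here is a quick sketch in Python using the SPECfp peak numbers above. Treating the score as linear in clock speed is a simplifying assumption of mine; real scores do not scale that cleanly.

```python
def absolute_score(score_per_ghz, clock_ghz):
    """Undo the per-GHz normalization to get the overall score."""
    return score_per_ghz * clock_ghz

def clock_to_match(target_score, score_per_ghz):
    """Clock (GHz) needed to reach a target score, assuming linear scaling."""
    return target_score / score_per_ghz

# SPECfp peak per GHz and test clocks as quoted in this post.
pentium_m_peak = absolute_score(608, 2.26)
g5_peak = absolute_score(564, 2.2)

# Clock a G5 would need to match the Pentium M result (linear-scaling assumption).
g5_clock_needed = clock_to_match(pentium_m_peak, 564)

print(f"Pentium M: {pentium_m_peak:.0f}, G5: {g5_peak:.0f}")
print(f"G5 clock needed to match: {g5_clock_needed:.2f} GHz")
```

Under that assumption a G5 needs only a modest clock bump to close the SPECfp gap, which fits with the clock advantage mentioned above.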
|
|
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by mduell
While cache/chip size is one factor in price, it's not everything. Despite having twice the cache, a 2.8Ghz Pentium D is still $100 (33%) cheaper than a 1.8Ghz Athlon X2.
The number of chips you can get from a wafer is not only a function of the size of the chip, but also what your yield is. Intel did a lot of work for Itanium on getting good yields (a must with a billion plus transistors) that they can presumably reuse for other chips.
The 533Mhz bus works very well with the Pentium M; the increase from 400Mhz was noticeable (they doubled cache at the same time), but not that much. Even adding a second core Intel is only going to push it up to 667Mhz.
Itanium is pretty much irrelevant from a practical perspective. I know yields are important, but it's easier to get good yields on a smaller chip.
Originally Posted by mduell
The Opteron is the odd man out, but the G5 and P4 are pretty much the same; one bus for one chip and they must access memory over that bus. Walks like a duck, swims like a duck, there's no meaningful differentiation between calling it a point to point interconnect and a front side bus when you have one CPU and one link.
No, it's not. The G5 and the P4 have a very different architecture. One is a point-to-point connection, the other is a shared bus. It doesn't walk or quack like a duck. If cpu 1 accesses the memory, it has full access to the memory and doesn't have to share the bandwidth with cpu 0. This is fundamentally different. If you have one cpu, this makes some difference, but then, the Opteron is also `the same'. I think you confuse the integrated memory controller with a different kind of interconnect topology.
If you say, there is no meaningful differentiation between the different concepts for a single-cpu system, then this includes the Opteron (and all other systems with point-to-point interconnects, single-cpu Alphas, whatnot).
Originally Posted by mduell
That's all great, but what are you going to use the paper bandwidth for? The next fastest part of the system is the RAM and it's a good bit slower (especially with the latency issues in the G5 system controller). HDD, GPU, and network don't come close.
Take a look at the cinebench benchmarks of the quad: the gains are quite substantial (for mp-aware apps) as neither of the cores seems to be starved too much. It's not just on paper. A dual dual-core Xeon system will have a bandwidth problem, while the G5s will not. Even if it is `too much', I'd rather have too much bandwidth than too little, wouldn't you agree?
Merry Christmas 
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by OreoCookie
No, it's not. The G5 and the P4 have a very different architecture. One is a point-to-point connection, the other is a shared bus. It doesn't walk or quack like a duck. If cpu 1 accesses the memory, it has full access to the memory and doesn't have to share the bandwidth with cpu 0. This is fundamentally different. If you have one cpu, this makes some difference, but then, the Opteron is also `the same'. I think you confuse the integrated memory controller with a different kind of interconnect topology.
I was referring to the P4, which is uniprocessor only. The front side bus is an interconnect from a point (northbridge) to a point (CPU); that is effectively the same as a single-CPU (single or dual core) G5.
I differentiated the Opteron because the CPU does not use the bus between it and the chipset (HyperTransport in this case) to access memory.
Originally Posted by OreoCookie
Take a look at the cinebench benchmarks of the quad: the gains are quite substantial (for mp-aware apps) as neither of the cores seems to be starved too much. It's not just on paper. A dual dual-core Xeon system will have a bandwidth problem, while the G5s will not. Even if it is `too much', I'd rather have too much bandwidth than too little, wouldn't you agree?
I don't have any Cinebench numbers handy for quad Xeons, but looking at SPEC (int and fp, speed and rate) the duals are within a couple points of the singles (score per Ghz per CPU), but the quads all fall down miserably (they have less L2 cache and a slower FSB than the duals). The dual dual Xeons may do better than the quad Xeons due to having more L2 cache and a faster FSB, but I don't have any numbers for them handy.
Originally Posted by OreoCookie
Merry Christmas
And the same to you. 
|
|
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by mduell
I was referring to the P4, which is uniprocessor only. The front side bus is an interconnect from a point (northbridge) to a point (CPU); that is effectively the same as a single-CPU (single or dual core) G5.
I differentiated the Opteron because the CPU does not use the bus between it and the chipset (HyperTransport in this case) to access memory.
The same applies to Xeons, they use the same bus. Those were and are the competitors to the PowerMacs.
And apparently you are under the mistaken impression that the separate interconnect (to the memory) is the thing which differentiates a K8 (be it Opteron or Athlon) from a Pentium cpu (4 or M), but it's not. Even the K7 used a point-to-point connection (a flavor of the Alpha's EV6 bus interconnect). Neither the G5, the Athlons (any flavor, even K7s) nor the 1xx Opterons have a Front Side Bus (FSB); they use a point-to-point connection. A shared bus means that different devices cannot use it concurrently.
Also, it does not really matter for the bandwidth whether there is a separate bus (here, one HT link) for the memory, provided the bus is sufficiently fast; the advantage of a separate bus, low memory-access latency, is not connected to bandwidth.
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by OreoCookie
The same applies to Xeons, they use the same bus. Those were and are the competitors to the PowerMacs.
You made reference to Pentium 4 and the majority of P4/G5 models and systems are uniprocessor, so it's reasonable to talk about them.
Low-end (2 and 4-way) Xeons share a bus, multiprocessor Opterons and G5s have multiple busses. I mentioned how the 4-way Xeons with slow FSBs and small caches fall down in benchmarks like SPEC in my previous post.
Originally Posted by OreoCookie
And apparently you are under the mistaken impression that the separate interconnect (to the memory) is the thing which differentiates a K8 (be it Opteron or Athlon) from a Pentium cpu (4 or M), but it's not. Even the K7 used a point-to-point connection (a flavor of the Alpha's EV6 bus interconnect). Neither the G5, the Athlons (any flavor, even K7s) nor the 1xx Opterons have a Front Side Bus (FSB); they use a point-to-point connection. A shared bus means that different devices cannot use it concurrently.
In uniprocessor systems the choice in naming between a front side bus and point to point interconnect is a distinction without a difference. In multiprocessor systems you could have multiple front side busses (as the larger Xeon systems do); you can also have one or more point to point interconnects in a system.
Direct access between the memory and the CPU is a major architectural difference between K8 and P4/G5/K7. The K8 architecture uses a HyperTransport link between the CPU and the chipset and other CPUs.
AMD says the K7 Athlons have front side busses.
Originally Posted by OreoCookie
Also, it does not really matter for the bandwidth whether there is a separate bus (here, one HT link) for the memory, provided the bus is sufficiently fast; the advantage of a separate bus, low memory-access latency, is not connected to bandwidth.
Earlier in this thread someone attributed the PowerMac's relatively poor memory throughput to the G5's high memory latency (about double that of a P4, and 2.5x that of a K8).
|
|
|
Senior User
Join Date: Jan 2001
Location: Seattle
Status:
Offline
|
|
Apple could put Yonah into the PowerMacs and they would beat most of the current G5 line, except the Quad (and then, the Quad would only win on heavily-multithreaded benchmarks). The P-M architecture is really fast per-clock. If you look at the SPEC integer benchmarks, the Pentium-M 780 is like 800 SPECint/GHz, while the G5 is only ~550 SPECint/GHz.
No they couldn't. Judging a processor by SPEC is horrible. SPEC won't take into account the extra variables that a CPU has like SSE/Altivec and in some cases SPEC scores for Macs didn't even account for the second proc. I'll guarantee you that a Yonah won't beat a G5 with any kind of decent software. With Yonah you're looking at a max of 3 execution units. A G5 has 5 execution units if memory serves me correctly.
I'm under no delusion that a Yonah is going to bring workstation performance to laptops yet. Look for Merom to do that with 4 execution units, SSE3, EM64T and other goodies. Yonah is going to make a nice low-end to midrange proc though. G5 FPU performance eats Pentiums for breakfast though.
|
|
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally Posted by hmurchison2001
No they couldn't. Judging a processor by SPEC is horrible. SPEC won't take into account the extra variables that a CPU has like SSE/Altivec and in some cases SPEC scores for Macs didn't even account for the second proc. I'll guarantee you that a Yonah won't beat a G5 with any kind of decent software. With Yonah you're looking at a max of 3 execution units. A G5 has 5 execution units if memory serves me correctly.
I'm under no delusion that a Yonah is going to bring workstation performance to laptops yet. Look for Merom to do that with 4 execution units, SSE3, EM64T and other goodies. Yonah is going to make a nice low-end to midrange proc though. G5 FPU performance eats Pentiums for breakfast though.
A few misconceptions here. You're talking about issue width, not number of execution units (the G5 has waaaay more than 5 execution units, but dispatches 4 instructions + 1 branch instruction per clock, max). Next misconception: that all of the execution width can be used for any given piece of software. The G5 has only two integer units, so if you're just doing int ops, it can only use 2/5s of its instructions/clock. Those two int units also have a minimum two cycle latency, so something issuing serially dependent int instructions can only use 2/10ths of the peak dispatch rate (although out of order execution helps get around this, as do decent compilers). Third misconception: that some SPECcpu runs are multiprocessor and some aren't. ALL SPECcpu runs, not just some, use only one processor. SPECrate is the multi-processor one. Fourth misconception: that SPECcpu can't use altivec/SSE. It can, if the compiler can autovectorize (GCC4/ICC).
That said, you were right that judging a CPU using SPECcpu alone is a bad idea.
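To put numbers on the issue-width point above, here is a minimal back-of-the-envelope sketch in Python. The figures (5-wide dispatch, 2 integer units, 2-cycle integer latency) are the ones quoted in this post; the function name and the simplified model (each unit waits out its full latency before starting the next dependent op, no out-of-order overlap, no other bottlenecks) are assumptions for illustration, not a description of the real pipeline.

```python
from fractions import Fraction

def usable_dispatch_fraction(dispatch_width, units, latency_cycles=1, serial=False):
    """Fraction of peak dispatch that one instruction class can use.

    Toy model (an assumption, not a simulator):
    - independent ops are limited only by the number of units;
    - serially dependent ops are also limited by unit latency, so each
      unit completes one op every `latency_cycles` clocks.
    """
    per_clock = Fraction(units, latency_cycles) if serial else Fraction(units)
    return per_clock / dispatch_width

# Figures quoted in the post for the G5.
independent_int = usable_dispatch_fraction(5, 2)
dependent_int = usable_dispatch_fraction(5, 2, latency_cycles=2, serial=True)

print(independent_int)  # 2/5
print(dependent_int)    # 1/5, i.e. 2/10ths
```

Out-of-order execution and compiler scheduling move real code somewhere between these two figures, which is why the peak width alone says little.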
|
|
|
| |
|
|
|
 |
|
 |
|
Forum Regular
Join Date: Nov 2005
Status:
Offline
|
|
Originally Posted by hmurchison2001
No they couldn't. Judging a processor by SPEC is horrible.
I'm curious. Are you aware of exactly what SPEC consists of? SPECint consists of real-world kernels, ranging from gzip to gcc. SPECfp consists of real scientific kernels that perform useful work. You can read about them at http://www.ideasinternational.com/in.../ispecint.html and http://www.ideasinternational.com/infofile/ispecfp.html. Yes, the best benchmark is ultimately your own application, but in lieu of that, a SPEC score is a very useful alternative. In my experience, it's a pretty good approximation of how my own code will behave without platform-specific tweaking.
<i>SPEC won't take into account the extra variables that a CPU has like SSE/Altivec and in some cases SPEC scores for Macs didn't even account for the second proc.</i>
The common SPEC scores cited (peak/base) never take into account a second CPU. They are a per-core figure, which is what you're interested in when comparing CPUs (rather than systems).
<i>I'll guarantee you that a Yonah won't beat a G5 with any kind of decent software.</i>
I'd love to take you up on that! In anything integer related, Yonah will toy with the G5, but that's not really unexpected, since the G5 is more of a media-oriented chip. This weakness is significant, because the G5 isn't just Apple's current media workstation chip, but it is the chip that powers everything from their mid-range desktops on up. There is a lot of "decent software" that doesn't consist of running single-precision vector operations all day. Also, let's define "decent software". In my experience (and I say this as someone who has a dual G5 next to his dual Opteron), the G5 wins only on floating-point heavy software optimized for it. That's great if you're running FCP all day (which is undoubtedly an Apple stronghold!), but a lot of people in the workstation realm want to run something other than the handful of highly G5-optimized media apps out there. Some of these programs (Mathematica, Matlab) aren't usually optimized for a particular platform, and the G5 really suffers on those, especially when using GCC.
<i>With Yonah you're looking at a max of 3 execution units. A G5 has 5 execution units if memory serves me correctly.</i>
Yonah can dispatch 3 instructions per clock, the G5 can dispatch 5. However, the two numbers are fairly incomparable. The P4 can dispatch 6 uops per cycle, but we know how that performs clock-for-clock! Some of Yonah's 3 instructions can be fused macro-ops. For example, Yonah can dispatch a load + operation as one instruction, while on the G5 they must be dispatched as separate load + operation instructions. The G5 also has complex rules about what combinations of 5 instructions can be dispatched, something that apparently holds it back quite a bit, as IBM gained a lot of performance per-clock in the Power5 processor (as compared to the Power4 on which the G5 is based) by relaxing some of these rules. Yonah also appears to have a better branch predictor, and has features like loop and indirect branch prediction, which push its actual IPC much closer to its theoretical IPC. On top of all that, the G5 is really dependent on a good compiler to squeeze out maximum performance, and GCC/PPC isn't that.
G5 FPU performance eats Pentiums for breakfast though.
I'm sure it eats Pentiums for breakfast, but those are a decade old at this point. Whether it beats Yonah or Opteron is a different question. Initial benchmarks show Yonah holding its own with the Opteron on at least some FPU code. The Opteron, in turn, can beat the G5 on a lot of FPU code not explicitly optimized for AltiVec (significant, since AltiVec is useless if you need high precision). The G5 *can* be an FPU monster, if you take great care to use AltiVec when possible, and avoid hitting the PowerMac's high-latency memory controller too often, but if you're running your average portable C code, you won't often see all this potential.
|
|
|
| |
|
|
|
 |
|
 |
|
Senior User
Join Date: Jan 2001
Location: Seattle
Status:
Offline
|
|
Catfish, thanks for that explanation. It's difficult to compare procs from different manufacturers because terminology varies.
rhashem, excellent reply. Come to think of it I haven't seen many reviews pitting a Dothan against a Power Mac G5. It would indeed be interesting to see the comparison. I'll see if I can find anything.
I think that the improvement in integer performance with Intel chips is evident. Many of the people running the OS X Intel version do indeed perceive a difference in speed which I attribute to the better integer performance of the Pentium 4.
This excites me because if Intel is able to deliver Merom with improved FPU and solid integer performance we're going to have Power Macs in 2007 that are well balanced. I myself think FPU performance is pretty important but then again I'm into audio and video applications. Integer is equally important to many others.
I'll be the last person to lament the passing of the G5. Not because it's a bad chip but because there's that potential for a balanced processor and I hope we have that next year.
Regards.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by hmurchison2001
No they couldn't. Judging a processor by SPEC is horrible. SPEC won't take into account the extra variables that a CPU has like SSE/Altivec and in some cases SPEC scores for Macs didn't even account for the second proc. I'll guarantee you that a Yonah won't beat a G5 with any kind of decent software. With Yonah you're looking at a max of 3 execution units. A G5 has 5 execution units if memory serves me correctly.
Are you seriously suggesting that IBM's highly optimized compilers (XLC and XLF) for the G5 don't take advantage of the VMX unit? Many (most?) OSX apps are using less optimized compilers (GCC, etc).
SPECint and SPECfp are single-threaded on all platforms (speed test) and SPECint_rate and SPECfp_rate are multithreaded on all platforms (throughput test).
Originally Posted by hmurchison2001
G5 FPU performance eats Pentiums for breakfast though.
Since SPEC says the opposite, what benchmark would you like to use?
Originally Posted by rhashem
The common SPEC scores cited (peak/base) never take into account a second CPU. They are a per-core figure, which is what you're interested in when comparing CPUs (rather than systems).
In my post about fp performance I listed both SPEC and SPEC_rate scores for this reason.
Originally Posted by rhashem
Yonah also appears to have a better branch predictor, and has features like loop and indirect branch prediction, which push its actual IPC much closer to its theoretical IPC.
Intel was forced (by the 20 or 31 stage pipeline) to put a lot of work into branch prediction with NetBurst. I hope that carries over into Intel Core (Merom/Conroe), although I'm not sure if they backported it to P6 (Dothan/Yonah).
Originally Posted by hmurchison2001
Come to think of it I haven't seen many reviews pitting a Dothan against a Power Mac G5. It would indeed be interesting to see the comparison. I'll see if I can find anything.
Outside of blade servers (where each is only a minor player), there aren't many (any?) places where PPC970 and Dothan compete. System Shootouts' rough comparison rates them as equals at the same clockrate.
|
|
|
| |
|
|
|
 |
|
 |
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by mduell
Since SPEC says the opposite, what benchmark would you like to use?
...
In my post about fp performance I listed both SPEC and SPEC_rate scores for this reason.
IBM's SPECmarks are 1428 SPECint and 2076 SPECfp for the PPC970MP … which certainly is competitive with Intel's Pentium 4 cpus (and their derivatives) (scroll down to Intel, I found only one cpu with higher SPECfp marks, although I'm not sure what cpu it is exactly, they just list it as 3.73 GHz Intel cpu on a motherboard with D955XBK chipset, so I suspect it's the latest P4 EE).
However, these are IBM's figures and I have yet to see results for quad G5s.
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally Posted by OreoCookie
IBM's SPECmarks are 1428 SPECint and 2076 SPECfp for the PPC970MP … which certainly is competitive with Intel's Pentium 4 cpus (and their derivatives) (scroll down to Intel, I found only one cpu with higher SPECfp marks, although I'm not sure what cpu it is exactly, they just list it as 3.73 GHz Intel cpu on a motherboard with D955XBK chipset, so I suspect it's the latest P4 EE).
However, these are IBM's figures and I have yet to see results for quad G5s.
I find these scores to be very interesting, as they're a dramatic improvement over previous ones, and it's unclear what significant changes were made.
mduell: Are you sure XLC/XLF can autovectorize? I thought they were still working on that (which would make sense, since only one IBM processor has Altivec).
|
|
|
| |
|
|
|
 |
|
 |
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by Catfish_Man
I find these scores to be very interesting, as they're a dramatic improvement over previous ones, and it's unclear what significant changes were made.
mduell: Are you sure XLC/XLF can autovectorize? I thought they were still working on that (which would make sense, since only one IBM processor has Altivec).
It seems a bit much to attribute all of these improvements to the doubled L2 cache, I agree, so I assume gradual improvements in the core and improvements in the compiler also play a role. Maybe the PPC970MP benefitted from the improvements of the VMX unit put into the X-CPU (the cpu for the XBox 360): a wider bus which connects the vmx unit and the rest of the core and some new operations.
I forgot to add the SPECrate marks: 32.3 SPECint_rate, 42.8 SPECfp_rate. I think those are not bad either. Some cpus do better (e. g. a single cpu system with a dual-core Opteron 254), but still, it's very competitive with what's available now.
As to your second question, gcc autovectorizes, so I think (I'm not sure) IBM's compilers do, too.
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
| |
|
|
|
 |
|
 |
|
Forum Regular
Join Date: Nov 2005
Status:
Offline
|
|
Originally Posted by Catfish_Man
I find these scores to be very interesting, as they're a dramatic improvement over previous ones, and it's unclear what significant changes were made.
The extra cache probably helps a bit, but the biggest contributor is likely the fact that the recent SPEC benchmarks are taken using IBM's XLC 8.0 (the older benchmarks were taken with XLC 6.0). Unfortunately, the most recent version of XLC for OS X is 6.1, and IBM won't even sell you that anymore.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by OreoCookie
IBM's SPECmarks are 1428 SPECint and 2076 SPECfp for the PPC970MP … which certainly is competitive with Intel's Pentium 4 cpus (and their derivatives) (scroll down to Intel, I found only one cpu with higher SPECfp marks, although I'm not sure what cpu it is exactly, they just list it as 3.73 GHz Intel cpu on a motherboard with D955XBK chipset, so I suspect it's the latest P4 EE).
However, these are IBM's figures and I have yet to see results for quad G5s.
So we have:
PPC970MP (2.5Ghz) at 1428/2076 and 32.3/42.8
Pentium 4 at 1834/2118 (3.8Ghz) and 33.3/32.7 (3.2Ghz dual core)
Pentium M (2Ghz) at 1839/1375 and 16.6 (single core only, no fp_rate score)
That comes to 571/830 SPECint/fp per Ghz and 6.46/8.56 SPECint/fp_rate per Ghz per core for PPC970MP, which is dramatically better than the old 472/564 and 4.88/4.54 for PPC970FX (at 2.2Ghz). I'm surprised that those scores aren't in the official SPEC database, since that article is over a month old; perhaps they're like Apple's original G5 SPEC scores (from Veritest), where they were never submitted because they used relaxed math standards.
The quads should get the same SPEC scores and twice the SPEC_rate scores; or it may not be so rosy for the quads due to the compiler availability issue that rhashem mentions.
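For anyone who wants to check the normalization, here is a minimal Python sketch of the arithmetic above. The scores and the 2.5 GHz clock are the ones listed in this post; the helper name `per_ghz` and the core count of 2 (implied by the "per Ghz per core" figures for the dual-core 970MP) are assumptions made for the example.

```python
def per_ghz(score, ghz, cores=1):
    """Normalize a SPEC score by clock rate (and by core count for _rate scores)."""
    return score / ghz / cores

# PPC970MP figures as quoted in the post (IBM's numbers, not official SPEC results).
GHZ = 2.5
CORES = 2  # assumption: dual-core part, as the per-core figures imply

int_per_ghz = per_ghz(1428, GHZ)
fp_per_ghz = per_ghz(2076, GHZ)
int_rate_per_ghz_core = per_ghz(32.3, GHZ, CORES)
fp_rate_per_ghz_core = per_ghz(42.8, GHZ, CORES)

print(round(int_per_ghz), round(fp_per_ghz))  # 571 830
print(round(int_rate_per_ghz_core, 2), round(fp_rate_per_ghz_core, 2))  # 6.46 8.56
```

Dividing by clock rate is only a rough way to compare designs, since scores do not scale linearly with frequency once memory latency starts to dominate.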
Originally Posted by Catfish_Man
mduell: Are you sure XLC/XLF can autovectorize? I thought they were still working on that (which would make sense, since only one IBM processor has Altivec).
In 2004 IBM said auto-vectorization would be in the next release (which is 8.x, out now AFAIK), and I know they did a lot of work with auto-vectorization for PPC440 (although that work may or may not be applicable to VMX on PPC970).
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Nov 2000
Location: in front of my Mac
Status:
Offline
|
|
Here's some beef. AppleInsider claims that Intel will design the new PowerMac for Apple using Conroe (no more NetBurst, yeah!). They think it could be introduced as early as Q3 2006. 
|
|
•
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
I think I've got it! Did anyone notice the "automatic parallelization" mentioned in the PDF mduell linked? I'll bet that the SPECcpu scores for a single core G5 will be significantly lower than for a dual core, despite there not being any explicit threading in SPECcpu. I think the drastic improvement in SPECcpu scores for the 970MP is attributable to some or all of the following factors:
1) Doubled L2 cache
2) Minor core tweaks
3) Improved northbridge latency (iirc a minor improvement was measured on arstechnica a little while ago)
4) Automatic parallelization
5) Automatic vectorization
6) General compiler tweaks
7) Higher clock frequency than previous 970 SPECcpu submissions
8) Possible math library tricks, ala Apple's SPEC submissions
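As a rough illustration of item 4, this is the kind of transformation automatic parallelization performs, written out by hand in Python. It is only a sketch: the function names are made up for the example, and nothing here reflects how XLC actually does it. The point is that a loop with independent iterations can be split across workers without the source containing any explicit threading.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(values):
    """Serial loop: every iteration is independent of the others."""
    total = 0
    for v in values:
        total += v * v
    return total

def sum_of_squares_split(values, workers=2):
    """Hand-written version of the loop split an auto-parallelizer might apply."""
    chunk = max(1, (len(values) + workers - 1) // workers)
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker runs the original loop body on its own slice;
        # the partial results are combined at the end.
        return sum(pool.map(sum_of_squares, parts))

data = list(range(1000))
print(sum_of_squares_split(data) == sum_of_squares(data))  # True
```

In CPython the threads will generally not make this CPU-bound loop faster, so treat it as a picture of the transformation rather than of the speedup. A compiler doing the same thing to native code on a dual-core chip could raise a nominally single-threaded score.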
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by Simon
Here's some beef. AppleInsider claims that Intel will design the new PowerMac for Apple using Conroe (no more NetBurst, yeah!). They think it could be introduced as early as Q3 2006.
That article supports my guess that Apple will use Intel chipsets and motherboards, rather than designing their own or using a third party (nVidia/ULi, VIA, SiS).
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by Catfish_Man
7) Higher clock frequency than previous 970 SPECcpu submissions
8) Possible math library tricks, ala Apple's SPEC submissions
7) That's why I compared normalized performance figures.
8) Although I was the one who suggested it, I still have a hard time seeing IBM using bogus figures for marketing; but the performance improvement is quite dramatic.
|
|
|
| |
|
|
|
 |
|
 |
|
Moderator 
Join Date: May 2001
Location: Hilbert space
Status:
Offline
|
|
Originally Posted by Catfish_Man
8) Possible math library tricks, ala Apple's SPEC submissions
SPECmarks are YAB, yet another benchmark. In pretty much every compiler, the cpu manufacturer's compiler in particular, there are specific optimizations for the SPEC suite. Also, it depends on how much money companies want to invest, there are some rules for base and peak results, e. g. the number of compiler switches you can use (some are good for one benchmark, but may be detrimental to performance for others, etc.).
|
|
I don't suffer from insanity, I enjoy every minute of it.
|
| |
|
|
|
 |
|
 |
|
Mac Elite
Join Date: Aug 2001
Status:
Offline
|
|
Originally Posted by OreoCookie
SPECmarks are YAB, yet another benchmark. In pretty much every compiler, the cpu manufacturer's compiler in particular, there are specific optimizations for the SPEC suite. Also, it depends on how much money companies want to invest, there are some rules for base and peak results, e. g. the number of compiler switches you can use (some are good for one benchmark, but may be detrimental to performance for others, etc.).
Certainly. My 8) was referring to library tricks that weren't allowed by SPEC rules, specifically. That would be one possible explanation as to why it hasn't been submitted. My main point, though, was that autoparallelization may be a significant factor.
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
Originally Posted by OreoCookie
SPECmarks are YAB, yet another benchmark. In pretty much every compiler, the cpu manufacturer's compiler in particular, there are specific optimizations for the SPEC suite. Also, it depends on how much money companies want to invest, there are some rules for base and peak results, e. g. the number of compiler switches you can use (some are good for one benchmark, but may be detrimental to performance for others, etc.).
Any idiot (not implying that anyone is, just a figure of speech) can run the SPEC benchmark on their hardware. However if you want a score that is "official" and comparable to the others in the database, you have to follow the rules (no SPEC specific optimizations, limited and fixed compiler flags, etc).
Apple (SJobs) made a big deal about their scores at the keynote, but they never got the scores into the SPEC database. I think it may be because of some of the following tweaks that Veritest used on the G5 (but not on the Intel chips they tested in the same report):
- Using the “Reggie” tool available from CHUD, modify CPU registers to enable memory Read By-pass. As Read requests are speculatively sent to the memory controller, this eliminates the need to wait for the snoop response required in a multiprocessor configuration thus reducing the time required for a read request.
- Installed a high performance, single threaded malloc library. This library implementation is geared for speed rather than memory efficiency and is single-threaded which makes it unsuitable for many uses. Special provisions are made for very small allocations (less than 4 bytes).
- -fast This flag ... enables the use of C99 aliasing rules and relaxed IEEE math operations.
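To show why "relaxed IEEE math operations" matters for comparability, here is a small Python sketch. Relaxed floating-point modes typically let a compiler regroup operations, and floating-point addition is not associative, so the same source can give different results. The values below are arbitrary, and whether the Veritest runs were affected in this particular way is not something this thread establishes.

```python
# Same three numbers, two groupings. Under strict IEEE rules a compiler
# must keep the grouping written in the source; relaxed modes may change it.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c
regrouped = a + (b + c)

print(left_to_right == regrouped)  # False
print(abs(left_to_right - regrouped))
```

The difference is tiny here, but it shows that results produced under relaxed rules are not strictly comparable with results produced under strict ones.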
As soon as I see those new PPC970MP scores (and their details) from IBM in the database, I'll believe them; until then they're a bit fishy (like IBM's marketing claims for PPC970FX power consumption, which the engineering docs disputed).
|
|
|
| |
|
|
|
 |
|
 |
|
Posting Junkie
Join Date: Oct 2005
Location: Houston, TX
Status:
Offline
|
|
MacNN's forums have issues.
|
|
|
| |