I followed your instructions, AbnRanger, and the experience was exactly as you predicted. The change in performance is sudden and jarring, and it depends entirely on brush size. It's speedy and just fine, then it drops off completely with just a tiny increase in brush size. Shrink the brush back down a tiny fraction and it goes right back to being speedy.
Knowing Nvidia (and how corporate types think in general), their gaming cards are probably designed to work great only in games, while their pro cards are designed to work well only in CG apps. That way you're forced to buy both, or so they hope. Greed makes people do strangely illogical things. It would be interesting to see how AMD's 7970 would perform with 3DC if Andrew were to add OpenCL support.
The article posted by L'Ancien Regime is an interesting read. Thanks for sharing it with us! I'll probably replace my GTX 670 with whatever blows away the AMD 7970. I try not to upgrade too often because, even though it can be fun, it's also time-consuming, and I do so hate the inevitable troubleshooting that tends to go with it lol.
About memory with XMP, I had to turn it off because the timings it set would prevent my PC from getting past the BIOS screen, and sometimes not even that far. What I did was write down the settings it wanted to use, then enter identical settings into the BIOS myself using its manual override mode. Then it would boot perfectly fine, and it even ended up being super stable that way. I don't know why one way would work and the other wouldn't when the settings were identical, but there you have it. FWIW they were Mushkin Enhanced Blackline Frostbyte DDR3-1600 rated for 9-9-9-24 timings at 1.5 V. They easily ran at higher clock speeds so long as the timings were loosened, but after a lot of benchmarking I found that a slower frequency with tighter timings was actually a fair bit faster than a higher frequency with loose timings. Naturally YMMV.
Actually after all this discussion, I'm thinking that the whole GPU card parallel programming business may not be the right way to go. Andrew thinks it's a bitch to program and so do the guys over at Vray.
Vray has a much more interesting take on it: they're going with Intel Xeons and the Xeon Phi coprocessor, which is much easier to program for multithreading and parallel computing. For not much more than an Nvidia Titan you get a lot more cores: 240 threads, with 8 GB of GDDR5 RAM and 320 GB/s of peak memory bandwidth. So one Xeon Phi works out to roughly four 8-core Xeons, for under $2000. And it will have MKL (Intel's Math Kernel Library) built in.
I'm still not sure how many of these you could plug into your PCIe slots per Xeon CPU... I would think at least two.
This is the route SGI is taking with its SGI UV chassis.
http://www.sgi.com/p...cts/servers/uv/

So forget CUDA, pass on OpenCL, and go for the Xeon Phi. Just get an AMD 7970 or a Titan for the viewport, or, if you've got money to burn, a Quadro or FireGL.
If Andrew can make 3D-Coat scale to all those threads (and the new Xeon Phi coprocessor coming out in July will have 480 threads), that will be the real deal, rather than trying to make your GPUs do a CPU's job.
And the Nvidia 780 will be crippled just like the 680. What a joke.
Somebody send Andrew this for his birthday:


http://www.amazon.ca...g/dp/0124104142
"Reinders and Jeffers have written an outstanding book about much more than the Intel® Xeon Phi™. This is a comprehensive overview of the challenges in realizing the performance potential of advanced architectures, including modern multi-core processors and many-core coprocessors. The authors provide a cogent explanation of the reasons why applications often fall short of theoretical performance, and include steps that application developers can take to bridge the gap. This will be recommended reading for all of my staff." -James A. Ang, Ph.D. Senior Manager, Extreme-scale Computing, Sandia National Laboratories
Edited by L'Ancien Regime, Today, 09:07 AM.