20 September 2013

517. Very briefly: Prime95 (GIMPS) on linux

I'm very unhappy about a newly built node which randomly crashes and reboots when running long jobs. More about that later, but here are the specs: FX 8350, 4x8 GB G.Skill Ripjaws RAM, ASRock 990FX Extreme3, Corsair GS700, MSI N210, ASUS NX1101 in an Antec GX700 case, running Wheezy with the stock kernel (3.2.0-4 amd64).

I've tested the RAM using memtest86+ and found no errors. The rig uses a 700 W Corsair PSU, which /should/ provide enough power, and I see no evidence of overheating based on a cronjob which runs every 2 minutes. Anyway, the first step in troubleshooting is finding a way of reproducing the error reliably, and prime95 is what the Windows overclockers use to stress test.
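
For reference, a cron-driven temperature logger along these lines can be sketched as follows. This is a minimal sketch and not the exact script running on the node: it assumes the kernel exposes sensor readings as temp*_input files under /sys/class/hwmon (in millidegrees Celsius), and the function name log_temps, the log location and the 'run' argument are all made up for illustration.

```shell
#!/bin/sh
# Sketch: append one timestamped line per temperature sensor to a log file.
# log_temps, its arguments and the default paths below are hypothetical.

log_temps() {
    root=$1    # directory holding the hwmon*/ subdirectories
    log=$2     # file to append readings to
    now=$(date '+%Y-%m-%d %H:%M:%S')
    for f in "$root"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue                    # glob matched nothing, or unreadable
        milli=$(cat "$f" 2>/dev/null) || continue  # some sensors refuse to be read
        echo "$now $f $((milli / 1000)) C" >> "$log"
    done
    sync       # flush to disk so the last reading can survive a sudden reboot
}

# Only log when called with 'run', so the function can also be sourced and tested.
if [ "${1:-}" = run ]; then
    log_temps /sys/class/hwmon "$HOME/temps.log"
fi
```

Saved as e.g. ~/bin/log_temps.sh (a hypothetical path), a crontab entry such as `*/2 * * * * $HOME/bin/log_temps.sh run` would call it every 2 minutes. After a crash, the last few lines of the log show what the temperatures were shortly before the reboot.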

Turns out prime95 (actually GIMPS) can run in a few different modes which test different aspects of your system, which makes it sound like a pretty good program for my purposes.

See here for more information: http://www.mersenne.org/freesoft/

mkdir ~/tmp/mprime -p
cd ~/tmp/mprime
wget http://www.mersenne.info/gimps/p95v279.linux64.tar.gz
tar xvf p95v279.linux64.tar.gz
./mprime
Welcome to GIMPS, the hunt for huge prime numbers. You will be asked a few simple questions and then the program will contact the primenet server to get some work for your computer. Good luck!

Attention OVERCLOCKERS!! Mprime has gained a reputation as a useful stress testing tool for people that enjoy pushing their hardware to the limit. You are more than welcome to use this software for that purpose. Please select the stress testing choice below to avoid interfering with the PrimeNet server. Use the Options/Torture Test menu choice for your stress tests. Also, read the stress.txt file.

If you want to both join GIMPS and run stress tests, then Join GIMPS and answer the questions. After the server gets some work for you, stop mprime, then run mprime -m and choose Options/Torture Test.

Join Gimps? (Y=Yes, N=Just stress testing) (Y): N
Number of torture test threads to run (3): 2
Choose a type of torture test to run.
  1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much).
  2 = In-place large FFTs (maximum heat and power consumption, some RAM tested).
  3 = Blend (tests some of everything, lots of RAM tested).
  11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the small FFT test then your problem is likely bad memory or a bad memory controller.
Type of torture test to run (3): 1
Accept the answers above? (Y): Y
[Main thread Sep 20 11:06] Starting workers.
[Worker #1 Sep 20 11:06] Worker starting
[Worker #1 Sep 20 11:06] Setting affinity to run worker on any logical CPU.
[Worker #2 Sep 20 11:06] Worker starting
[Worker #2 Sep 20 11:06] Setting affinity to run worker on any logical CPU.
[Worker #1 Sep 20 11:06] Beginning a continuous self-test to check your computer.
[Worker #1 Sep 20 11:06] Please read stress.txt. Hit ^C to end this test.
[Worker #2 Sep 20 11:06] Beginning a continuous self-test to check your computer.
[Worker #2 Sep 20 11:06] Please read stress.txt. Hit ^C to end this test.
[Worker #1 Sep 20 11:06] Test 1, 180000 Lucas-Lehmer iterations of M580673 using AMD K10 type-1 FFT length 28K, Pass1=112, Pass2=256.
[Worker #2 Sep 20 11:06] Test 1, 180000 Lucas-Lehmer iterations of M580673 using AMD K10 type-1 FFT length 28K, Pass1=112, Pass2=256.
CTRL+C
And so on.
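
Since the failure mode here is a spontaneous reboot, it helps to know afterwards how long the test ran before the box died. Below is a rough sketch of one way to do that: a heartbeat loop writes a timestamp to disk while mprime runs. The heartbeat function, the log path, the 60 second interval and the 'run' argument are all hypothetical, and I'm assuming that mprime -t starts the torture test directly (as far as I know it is the command-line equivalent of Options/Torture Test; check the readme that ships in the tarball).

```shell
#!/bin/sh
# Sketch: keep a heartbeat on disk while a stress test runs, so that after a
# spontaneous reboot the last line of the log shows roughly when the box died.
# heartbeat(), the log path and the interval are hypothetical choices.

heartbeat() {
    log=$1
    interval=$2    # seconds between heartbeats
    count=$3       # number of heartbeats to write; 0 means loop until killed
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') alive" >> "$log"
        sync       # force the line to disk before a possible crash
        i=$((i + 1))
        sleep "$interval"
    done
}

# Only start the test when called with 'run', so the function can be sourced.
if [ "${1:-}" = run ]; then
    heartbeat "$HOME/tmp/mprime/heartbeat.log" 60 0 &
    hb_pid=$!
    # Stop the background heartbeat when the script ends or is interrupted.
    trap 'kill "$hb_pid" 2>/dev/null' EXIT INT TERM
    cd "$HOME/tmp/mprime" && ./mprime -t   # assumed: -t launches the torture test
fi
```

After a reboot, `tail -n 1 ~/tmp/mprime/heartbeat.log` gives the last time the node was known to be alive, to within the heartbeat interval.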

9 comments:

  1. I recently got an fx-9590 and asus m5a99fx rev2.0 and had no end of troubles with g03 (bogus results during geometry optimizations that led to ***** for the gradients) and hard locks with HPL (both only when running 8 threads). Turned out the mobo couldn't get enough juice from the psu, and when I replaced it with a crosshair v (1x24, 1x8, and 1x4 power connectors) these problems went away. Is it possible that you might have something similar? Perhaps you could reduce the cpu multiplier and voltage and see if your stability goes up (with mine I found it was less stable when I increased the voltage).

    1. I'm suspecting that you might be right -- I'm trying to finish up a project on my cluster at the moment, but once that's done I'm going to swap PSUs between my FX8150 and my FX 8350 nodes to see whether anything changes. The FX8150 node has an 800W PSU (Corsair GS800). Still, /shouldn't/ 700 W really be enough?

      I'll have a look at the multiplier/voltage settings as well, although I have to admit that that's an area where I have no clue as to what I'm doing.

      In my case I get spontaneous reboots every now and again, but it's stable enough to throw jobs at the node and hope that they finish. I haven't had issues with garbled results though.

    2. I agree wrt the bios. I had a good feel for what I was doing back in the socket 939 era, but now there are so many options I honestly have no clue what half of them do. One thing I can add is that in troubleshooting my setup I put an 8350 into the m5a99fx. It was solid at stock parameters, but if I bumped up its voltage I could get it to lock while running HPL. In this case I have a coolermaster v1000, so it seemed unlikely it couldn't provide enough juice. To answer your question, I would think the 700 W on the 12 V rail would be fine (it's even got 150 W on the 3.3 + 5 V rails, which is more than the v1000 at 125 W). Perhaps this means the VRM on your asrock can't quite handle the 8350 (the m5a99fx, and the m5a97 I also have, both have larger sections devoted to VRM).

      On a side note, I would suggest HPL (or HPCC if you'd like to have benchmarks for DGEMM and FFT) because it was what worked my system hard enough to get quickly reproducible hard locks.

    3. Sorry to spam your thread, but I have an idea for the power supply. If you really want to make sure it has enough juice you could leave the 700 W plugged into the 24 pin connector and unplug the 8 pin connector. Then plug the 800 W into the 8 pin connector and short the green and a black pin on the 800 W's 24 pin connector so that it switches on.

    4. No worries about 'spamming' -- you obviously know what you are talking about.

      I think my current plan of action (once my cluster frees up so I can start troubleshooting) is to first see if I can get the node to lock up quickly (hence Prime95). HPCC is in the repos, so it sounds like a lazy and easy thing to try.

      Once that's done, I reckon simply swapping the supplies would be an easy test, then try what you're suggesting with mixing power supplies.

      I'm not looking forward to trying to return the mobo though -- sure, I should've looked at the list of officially supported CPUs before committing, but having built too many boxes without any hardware issues (other than bad RAM) has made me careless. I guess it was time for a wake-up call.

      Also, as a linux user you quickly start ignoring the whole 'supported' thing.

      At any rate, I welcome your suggestions, and will report back once I get around to troubleshooting.

      So far I've sort of ruled out RAM (based on memtest -- not fool proof, but indicative), and I don't think overheating is an issue either, which would've been the obvious. Mobo or PSU it is, then.

      Not sure if it's related, but my UPS died shortly after setting up that new node...

    5. Good luck and I look forward to hearing what you find.

      PS I only found your blog today and spent a good hour going through posts. I'm glad to see somebody document such things (sadly I have been too lazy to do it myself).

  2. One further thought, do you have the latest (170) bios, released only about a week ago?

    1. No, when I started troubleshooting they only had the 1.5 bios at http://www.asrock.com/mb/AMD/990FX%20Extreme3/?cat=Download&os=BIOS, and that was the stock BIOS. It's certainly going to be the first thing I'll try, hopefully tomorrow.

      On a related note: the node has been stable for a week now. It could be due to the change in usage pattern (opt + vib jobs which typically take 2-5 hours instead of PES jobs that take days). Paradoxically, I'd be much happier having a magic bullet to reproduce the crashes...electronic devices tend not to magically fix themselves with continued usage.

      Anyway, thanks for the feedback -- I'll be making a post on the troubleshooting process later.

    2. The troubleshooting post is here: http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

      I've added a couple of screenshots of the bios settings as well -- I'm currently trying with a lower multiplier setting and a lower voltage.

      So far I've had no luck solving the reboots, but at least I have a way of (seemingly) reliably triggering them in 36-48 hours.
