23 October 2013

524. Generating bonds, angles and dihedrals for a molecule for GROMACS

Generating bond, angle and dihedral parameters for GROMACS molecular dynamics simulations is a real PITA when it comes to reasonably large and complex molecules.

Since I am currently looking at the solvation dynamics of a series of isomers of a metal oxide cluster I quickly got tired of working out the bonds and bond constraints for my 101 atom clusters by hand.

So here's a python (2.7.x) script that generates parameters suitable for a GROMACS .top or .itp file.

Let's call the script genparam. You can call it as follows:
genparam molecule.xyz 0.21 3

where molecule.xyz is the molecule in xyz coordinates. 0.21 (nm) is used to generate bonds -- atoms that are 0.21 nm or closer to each other are bonded. The final number should be 1, 2 or 3. If it is 1, then only bonds (and constraints for bonds that are 0.098 nm or shorter -- such as O-H bonds. It's specific for my use, so you can edit the code.) are printed. If it is 2, then bonds and angles are printed. If it is 3, then bonds, angles and dihedrals are printed.

A simple and understandable example using methanol looks like this:
./genparams.py methanol.xyz 0.15 3
[bonds] [bonds] 1 2 6 0.143 50000.00 ;C OH 1 3 6 0.108 50000.00 ;C H 1 4 6 0.108 50000.00 ;C H 1 5 6 0.108 50000.00 ;C H [constraints] 2 6 2 0.096 ;OH HO [ angles ] 2 1 3 1 109.487 50000.00 ;OH C H 2 1 4 1 109.487 50000.00 ;OH C H 2 1 5 1 109.466 50000.00 ;OH C H 1 2 6 1 109.578 50000.00 ;C OH HO 3 1 4 1 109.542 50000.00 ;H C H 3 1 5 1 109.534 50000.00 ;H C H 4 1 5 1 109.534 50000.00 ;H C H [ dihedrals ] 6 2 1 3 1 60.01 50000.00 2 ;HO OH C H 6 2 1 4 1 60.01 50000.00 2 ;HO OH C H 6 2 1 5 1 180.00 50000.00 2 ;HO OH C H

where methanol.xyz looks like this:
6 Methanol C -0.000000010000 0.138569980000 0.355570700000 OH -0.000000010000 0.187935770000 -1.074466460000 H 0.882876920000 -0.383123830000 0.697839450000 H -0.882876940000 -0.383123830000 0.697839450000 H -0.000000010000 1.145042790000 0.750208830000 HO -0.000000010000 -0.705300580000 -1.426986340000

Note that the bond length value may need to be tuned for each molecule -- e.g. 0.21 nm is fine for my metal oxides which have long M-O bonds, while 0.15 nm is better for my methanol.xyz. In the end, you'll probably have to do some manual editing to remove excess bonds, angles and dihedrals.

Also note that there's a switch (prevent_hhbonds) which prevent the formation of bonds between atoms that start with H -- this is to prevent bonds between e.g. H in CH3 units (177 pm). You can change this in the code in the __main__ section.

Finally, note that the function types, multiplicities and forces may need to be edited. I suggest not doing that in the code, but in the output as it will depend on each molecule.

This script won't do science for you, but it'll take care of some of the drudgery of providing geometric descriptors.

Anyway, having said all that, here's the code:


#!/usr/bin/python
import sys
from math import sqrt, pi
from itertools import permutations
from math import acos,atan2

# see table 5.5 (p 132, v 4.6.3) in the gromacs manual
# for the different function types


#from
#http://stackoverflow.com/questions/1984799/cross-product-of-2-different-\
#vectors-in-python
def crossproduct(a, b):
 c = [a[1]*b[2] - a[2]*b[1],
  a[2]*b[0] - a[0]*b[2],
  a[0]*b[1] - a[1]*b[0]]
 return c
#end of code snippet

# mostly from 
# http://www.emoticode.net/python/calculate-angle-between-two-vectors.html
def dotproduct(a,b):
 return sum([a[i]*b[i] for i in range(len(a))])

def veclength(a):
 length=sqrt(a[0]**2+a[1]**2+a[2]**2)
 return length

def ange(a,b,la,lb,angle_unit):
 dp=dotproduct(a,b)
 costheta=dp/(la*lb)
 angle=acos(costheta)
 if angle_unit=='deg':
  angle=angle*180/pi
 return angle
# end of code snippet

def diheder(a,b,c,angle_unit):
 dihedral=atan2(veclength(crossproduct(crossproduct(a,b),\
 crossproduct(b,c))), dotproduct(crossproduct(a,b),crossproduct(b,c)))
 if angle_unit=='deg':
  dihedral=dihedral*180/pi
 return dihedral

def readatoms(infile):
 positions=[]
 f=open(infile,'r')
 atomno=-2
 for line in f:
  atomno+=1
  if atomno >=1:
   position=filter(None,line.rstrip('\n').split(' '))
   if len(position)>3:
    positions+=[[position[0],int(atomno),\
    float(position[1]),float(position[2]),\
    float(position[3])]]
 return positions

def makebonds(positions,rcutoff,prevent_hhbonds):
 
 bonds=[]
 
 for firstatom in positions:
  for secondatom in positions:
   distance=round(sqrt((firstatom[2]-secondatom[2])**2\
+(firstatom[3]-secondatom[3])**2+(firstatom[4]-secondatom[4])**2)/10.0,3)
   xyz=[[firstatom[2],firstatom[3],firstatom[4]],\
[secondatom[2],secondatom[3],secondatom[4]]]
   
   if distance<=rcutoff and firstatom[1]!=secondatom[1]:
    if prevent_hhbonds and (firstatom[0][0:1]=='H' and\
     secondatom[0][0:1]=='H'):
      pass
    elif firstatom[1]<secondatom[1]:
     bonds+=[[firstatom[1],secondatom[1],\
distance,firstatom[0],secondatom[0],xyz[0],xyz[1]]]
    else:
     bonds+=[[secondatom[1],firstatom[1],\
distance,firstatom[0],secondatom[0],xyz[1],xyz[0]]]
 return bonds

def dedupe_bonds(bonds):
 
 newbonds=[]
 for olditem in bonds:
  dupe=False
  for newitem in newbonds:
   if newitem[0]==olditem[0] and newitem[1]==olditem[1]:
    dupe=True
    break;
  if dupe==False:
   newbonds+=[olditem]
 return(newbonds)

def genvec(a,b):
 vec=[b[0]-a[0],b[1]-a[1],b[2]-a[2]]
 return vec

def findangles(bonds,angle_unit):
 # for atoms 1,2,3 we can have the following situations
 # 1-2, 1-3
 # 1-2, 2-3
 # 1-3, 2-3
 # The indices are sorted so that the lower number is always first
 
 angles=[]
 for firstbond in bonds:
    
  for secondbond in bonds:
   if firstbond[0]==secondbond[0] and not (firstbond[1]==\
secondbond[1]): # 1-2, 1-3
    vec=[genvec(firstbond[6],firstbond[5])]
    vec+=[genvec(secondbond[6],secondbond[5])]
    angle=ange(vec[0],vec[1],firstbond[2]*10,\
secondbond[2]*10,angle_unit)
    angles+=[[firstbond[1],firstbond[0],\
secondbond[1],angle,firstbond[4],firstbond[3],secondbond[4],firstbond[6],\
firstbond[5],secondbond[6]]]
   
   if firstbond[0]==secondbond[1] and not (firstbond[1]==\
secondbond[1]): # 1-2, 3-1
    #this should never be relevant since we've sorted the atom numbers
    pass
   
   if firstbond[1]==secondbond[0] and not (firstbond[0]==\
secondbond[1]): # 1-2, 2-3
    vec=[genvec(firstbond[5],firstbond[6])]
    vec+=[genvec(secondbond[6],secondbond[5])]
    angle=ange(vec[0],vec[1],firstbond[2]*10,\
secondbond[2]*10,angle_unit)
    angles+=[[firstbond[0],firstbond[1],\
secondbond[1],angle,firstbond[3],firstbond[4],secondbond[4],firstbond[5],\
firstbond[6],secondbond[6]]]
   
   if firstbond[1]==secondbond[1] and not (firstbond[0]==\
secondbond[0]): # 1-3, 2-3
    vec=[genvec(firstbond[6],firstbond[5])]
    vec+=[genvec(secondbond[6],secondbond[5])]
    angle=ange(vec[0],vec[1],firstbond[2]*10,\
secondbond[2]*10,angle_unit)
    angles+=[[firstbond[0],firstbond[1],\
secondbond[0],angle,firstbond[3],firstbond[4],secondbond[3],firstbond[5],\
firstbond[6],secondbond[5]]]    
 return angles

def dedupe_angles(angles):
 dupe=False
 newangles=[]
 for item in angles:
  dupe=False
  for anotheritem in newangles:
   if item[0]==anotheritem[2] and (item[2]==anotheritem[0]\
 and item[1]==anotheritem[1]):
    dupe=True
    break
  if dupe==False:
   newangles+=[item]
 
 newerangles=[]
 dupe=False
 
 for item in newangles:
  dupe=False
  for anotheritem in newerangles:
   if item[2]==anotheritem[2] and (item[0]==anotheritem[1]\
 and item[1]==anotheritem[0]):
    dupe_item=anotheritem
    dupe=True
    break
  
  if dupe==False:
   newerangles+=[item]
  elif dupe==True:
   if dupe_item[3]>item[3]:
    pass;
   else:
    newerangles[len(newerangles)-1]=item

 newestangles=[]
 dupe=False
 
 for item in newerangles:
  dupe=False
  for anotheritem in newestangles:
   if (sorted(item[0:3]) == sorted (anotheritem[0:3])):
    dupe_item=anotheritem
    dupe=True
    break
  
  if dupe==False:
   newestangles+=[item]
  elif dupe==True:
   if dupe_item[3]>item[3]:
    pass;
   else:
    newestangles[len(newestangles)-1]=item
 return newestangles

def finddihedrals(angles,bonds,angle_unit):
 dihedrals=[]
 for item in angles:
  for anotheritem in bonds:
   if item[2]==anotheritem[0]:
    vec=[genvec(item[7],item[8])]
    vec+=[genvec(item[8],item[9])]
    vec+=[genvec(item[9],anotheritem[6])]
    dihedral=diheder(vec[0],vec[1],vec[2],\
angle_unit)
    dihedrals+=[ [ item[0],item[1],item[2],\
anotheritem[1], dihedral, item[4],item[5],item[6], anotheritem[4] ] ]

   if item[0]==anotheritem[0] and not item[1]==anotheritem[1]:
    vec=[genvec(anotheritem[6],item[7])]
    vec+=[genvec(item[7],item[8])]
    vec+=[genvec(item[8],item[9])]
    dihedral=diheder(vec[0],vec[1],vec[2],\
angle_unit)
    dihedrals+=[ [ anotheritem[1],item[0],item[1],\
item[2], dihedral, anotheritem[4],item[4],item[5], item[6] ] ]
 return dihedrals

def dedup_dihedrals(dihedrals):
 newdihedrals=[]
 
 for item in dihedrals:
  dupe=False
  for anotheritem in newdihedrals:
   rev=anotheritem[0:4]
   rev.reverse()
   if item[0:4] == rev:
    dupe=True
  if not dupe:
   newdihedrals+=[item]
 return newdihedrals

def print_bonds(bonds,func):
 constraints=""
 funcconstr='2'
 print '[bonds]'
 
 for item in bonds:
  if item[2]<=0.098:
   constraints+=(str(item[0])+'\t'+str(item[1])+'\t'+\
funcconstr+'\t'+str("%1.3f"% item[2])+'\t;'+str(item[3])+'\t'+str(item[4])+'\n')
  else: 
   print str(item[0])+'\t'+str(item[1])+'\t'+func+'\t'+\
str("%1.3f"% item[2])+'\t'+'50000.00'+'\t;'+str(item[3])+'\t'+str(item[4])
 print '[constraints]'
 print constraints
 return 0

def print_angles(angles,func):
 print '[ angles ]'
 for item in angles:
  print str(item[0])+'\t'+str(item[1])+'\t'+str(item[2])+'\t'+\
func+'\t'+str("%3.3f" % item[3])+'\t'+'50000.00'+'\t;'\
  +str(item[4])+'\t'+str(item[5])+'\t'+str(item[6])
 return 0

def print_dihedrals(dihedrals,func):
 print "[ dihedrals ]"
 force='50000.00'
 mult='2'
 for item in dihedrals:
  print str(item[0])+'\t'+str(item[1])+'\t'+str(item[2])+'\t'+\
str(item[3])+\
  '\t'+func+'\t'+str("%3.2f" % item[4])+'\t'+force+'\t'+mult+\
  '\t;'+str(item[5])+'\t'+str(item[6])+\
  '\t'+str(item[7])+'\t'+str(item[8])
 return 0

if __name__ == "__main__":
 infile=sys.argv[1]
 rcutoff=float(sys.argv[2]) # in nm
 itemstoprint=int(sys.argv[3])
 
 angle_unit='deg' #{'rad'|'deg'}
 prevent_hhbonds=True # False is safer -- it prevents bonds between
 # atoms whose names start with H

 positions=readatoms(infile)

 bonds=makebonds(positions,rcutoff,prevent_hhbonds)
 bonds=dedupe_bonds(bonds) 
 
 print_bonds(bonds,'6')
 
 angles=findangles(bonds,angle_unit)
 angles=dedupe_angles(angles)
 
 if itemstoprint>=2:
  print_angles(angles,'1')

 dihedrals=finddihedrals(angles,bonds,angle_unit)
 dihedrals=dedup_dihedrals(dihedrals)
 if itemstoprint>=3:
  print_dihedrals(dihedrals,'1') 

21 October 2013

523. Random Reboots -- troubleshooting. Diagnosed: incompatible motherboard.

Update 8 Jan 2014: I've been putting the FX8350 through its paces together with the other mobo and it's completely stable.  The FX8150 box is also stable. Note that I thought I had a crash a few days after making the swap -- I have not had any issues since whatsoever in spite of running very heavy jobs. Either way, it should remind me to check whether a mobo is compatible with a CPU before making my purchase in the future.

Update 18 Nov 2013: I swapped the CPUs between to boxes, so that I was now using a mobo that officially supported FX8350. Only the CPU moved, nothing else.

Update 5 Nov 2013: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to look at the list over supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:
 http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

Also see this thread: http://www.techpowerup.com/forums/showthread.php?t=184061
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:

Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs...
and
my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp.

and it seemed that low (CPU) voltages precipitated crashes.

Update 4 Nov 2013: swapped CPUs with a different box. Will test in a couple of days.
Update 4 Nov 2013: Changing the multiplier back to 20 from 17 (but keeping voltage stable) caused a crash -- this time in a record 13 minutes.
Update 4 Nov 2013: System stable with new voltage/multiplier settings.
Update 27 Oct 2013: I'm currently looking at BIOS voltage.
---

I recently built a new node (http://verahill.blogspot.com.au/2013/10/520-new-node-amd-fx-835032-gb-ram990-fx.html). While that's always exciting, it quickly left a sour taste due to random reboots when running long (days) computational jobs.

Note that the motherboard (asrock 990 fx extreme3) does not officially support FX8350, which is something that I shouldn't have ignored. I might eventually move my fx 8350 to my gigabyte 990 fxa and put my 8150 on my asrock instead.

Short description
* Both Gaussian 09 and NWChem 6.3 cause the reboots.
* I've set up a cron job that logs a lot of data every minute and there's nothing odd in there. No overheating, the wattage seems ok etc.
* Running only smaller jobs (even though they are running non-stop) which take less than a day, the node has stayed up for 11 days now.
* I have never seen it reboot, so I don't know if there's any beeping etc.
* There's nothing in the logs, and nothing in the output from tailing dmesg using a cronjob.
* The only real output is in last:

reboot   system boot  3.11.5           Fri Oct 18 14:08 - 11:57 (2+21:48)   
reboot   system boot  3.8.10           Fri Oct 18 13:23 - 14:07  (00:44)    
reboot   system boot  3.8.10           Tue Oct  8 10:46 - 13:18 (10+02:31)  
me       tty1                          Mon Oct  7 13:25 - crash  (21:21)    
me       pts/0        beryllium        Mon Oct  7 12:29 - crash  (22:17)    
reboot   system boot  3.8.10           Mon Oct  7 12:27 - 13:18 (11+00:51)  
me       pts/0        beryllium        Sat Oct  5 20:59 - crash (1+14:27)   
reboot   system boot  3.8.10           Sat Oct  5 20:58 - 13:18 (12+15:19)
reboot   system boot  3.8.10           Tue Oct  1 14:09 - 11:54 (19+20:45)  
me       pts/0        beryllium        Sun Sep 29 11:39 - crash (2+02:29)   
reboot   system boot  3.8.10           Sun Sep 29 11:39 - 11:54 (21+23:14)  
me       pts/0        beryllium        Mon Sep 23 11:09 - crash (6+00:30)   
reboot   system boot  3.8.10           Mon Sep 23 11:07 - 11:54 (27+23:46)  
me       pts/0        beryllium        Fri Sep 20 12:59 - crash (2+22:08)   
reboot   system boot  3.8.0            Fri Sep 20 12:50 - 11:54 (30+22:04)  
reboot   system boot  3.8.0            Fri Sep 20 12:49 - 12:49  (00:00)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 11:52 - 12:48  (00:56)    
reboot   system boot  3.2.0-4-amd64    Fri Sep 20 06:29 - 08:08  (01:38)    
me       pts/0        beryllium        Wed Sep 18 14:51 - crash (1+15:38)   
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 14:40 - 08:08 (1+17:27)   
me       pts/8        beryllium        Wed Sep 18 09:02 - crash  (05:38)    
reboot   system boot  3.2.0-4-amd64    Wed Sep 18 01:51 - 08:08 (2+06:17)   
me       pts/0        beryllium        Tue Sep 17 18:11 - crash  (07:40)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 18:08 - 08:08 (2+14:00)   
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 17:55 - 17:56  (00:01)    
me       pts/0        beryllium        Tue Sep 17 13:12 - crash  (04:43)    
reboot   system boot  3.2.0-4-amd64    Tue Sep 17 12:23 - 17:56  (05:33)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 20:05 - 12:17  (16:12)    
me       pts/0        beryllium        Mon Sep 16 16:03 - crash  (04:02)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:31 - 12:17  (20:46)    
reboot   system boot  3.2.0-4-amd64    Mon Sep 16 15:20 - 15:30  (00:09)

Looking at the output it does seems that the crashes are happening less frequently. Part of the reason for the is probably a change in how I use the node, but I don't think that explains everything, and I don't like the idea of a piece of electronic hardware 'fixing' itself.

Another thing that puzzles me is the repeating numbers -- e.g. 08:08, 11:54 and 13:18  -- in the ouput. There's no cronjob or anything like that running at any of those times.

Other things that have changed are the kernel versions and that I removed the UPS around the 1st of October  (the UPS died, which is a bad sign, power-wise. I should probably also look into the warranty on it).

The chief challenge here is that I can't reliable trigger the reboots, which makes it difficult to see whether I've solved the issue or not.

On an older node I could trigger errors by compiling the kernel, but not using any other technique. On that node the RAM was faulty: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

==> <== indicates what I'm currently doing.


0.  RAM
The most common reason for unstable nodes if faulty RAM, so if your computer is behaving strangely and randomly crashes, always suspect the RAM first. It's a more likely culprit than software, and the most likely of the hardware components to be at fault.

I ran a full cycle of memtest86+ which took some 4-5 hours if I remember correctly. No errors shown. Note that if memtest86+ does not show any errors it is no guarantee that the RAM is fine. However, the likelihood that it is indeed corrupt goes down.

1. Overheating
The second thing to investigate when something like this happens, in particular if it's associated with prolonged and heavy use, is the possibility of overheating. You can install sensors-lm and configure it to track various temperatures. Note that these aren't always correct.

At any rate, I've logged the output from sensors every minute and there's nothing indicating that the temperature is rising prior to a crash.

--------------------------------------------------
Intermission -- trying to trigger a reboot

* It's stable while compiling a kernel (in my case 11.5). Not surprising as it is intense, but short.

* Prime95
Number of torture test threads to run (8): Choose a type of torture test to run. 1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much). 2 = In-place large FFTs (maximum heat and power consumption, some RAM tested). 3 = Blend (tests some of everything, lots of RAM tested). 11,12,13 = Allows you to fine tune the above three selections. Blend is the default. NOTE: if you fail the blend test, but can pass the small FFT test then your problem is likely bad memory or a bad memory controller. Type of torture test to run (3): 2 Accept the answers above? (Y): y
I ran this for three days and the node was stable.
I then ran test type 3 for 30 hours and it too was stable.

I accidentally ran the tests above without mounting ~/oxygen to the head node using NFS. Shouldn't matter, but in order to troubleshoot it's better to keep everything as constant as possible.

* PES scan
I think I saw reboots triggered using all sort of jobs, but due to their long run times, I saw it more consistently with PES scans.

So I ran a long PES scan in nwchem 6.3, and lo, it crashed after just under two days running this job (and having been up for 6 day and 30 minutes). It's not quite the quick, efficient way of crashing the computer that I was looking for, but it will do.

Note that this crash didn't lead to a reboot, but simply to the computer locking up and become unresponsive. No screen, no network, no harddrive activity.

The only errors I can spot in the dmesg are two warnings about 'perf samples too long (2545>2500)' at 30 minutes and at 14 hours of uptime, i.e. well before I started the PES job.

me pts/1 beryllium Fri Oct 25 08:59 still logged in me pts/0 beryllium Fri Oct 25 08:58 still logged in me pts/0 beryllium Fri Oct 25 08:57 - 08:58 (00:01) me tty1 Fri Oct 25 08:52 still logged in reboot system boot 3.11.5 Fri Oct 25 08:52 - 08:59 (00:07) me pts/4 beryllium Sun Oct 20 17:32 - 17:32 (00:00)

--------------------------------------------------

3. BIOS version
The BIOS version (1.5) at the time of purchase of the motherboard was the same version as the BIOS available at the motherboard manufacturer's site. Since then an update has been released, as pointed out by a commentator. Nothing in the description of the update indicates that it would fix the issue I'm having, but upgrading the bios is just one of those things that should be tried.

mkdir ~/tmp/bios -p
cd ~/tmp/bios
wget 'ftp://download.asrock.com/bios/AM3+/990FX%20Extreme3(1.70)ROM.zip'
unzip 990FX\ Extreme3\(1.70\)ROM.zip 
Archive: 990FX Extreme3(1.70)ROM.zip inflating: 990EX31.70

I copied the file to a USB stick formatted with FAT32 since my guess is that the uefi might not recognise extX. Booting with the USB stick plugged in and hitting F6 ('Instant flash') lead to the UEFI finding the flash file. Now click on the file name -- don't click on the buttons (e.g. 'Configuration' and 'Refresh device'). During the bios update the usual goes: don't power off during the update, and make sure that your usb stick isn't old and damaged.

Once the update is done you get a message saying 'Programming success, press Enter to reboot system'.

I reran the PES scan, and I had a crash after less than two days (ca 40 hours). This crash caused a reboot.
me       tty1                          Sun Oct 27 12:00   still logged in   
me       pts/1        beryllium        Sun Oct 27 11:55   still logged in   
me       pts/0        beryllium        Sun Oct 27 11:54   still logged in   
reboot   system boot  3.11.5           Sun Oct 27 11:52 - 12:14  (00:22)    
me       tty1                          Fri Oct 25 09:36 - crash (2+02:16)   
me       pts/0        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
me       pts/2        beryllium        Fri Oct 25 09:17 - crash (2+02:35)   
reboot   system boot  3.11.5           Fri Oct 25 09:17 - 12:14 (2+02:57) 
.

4. BIOS settings -- voltage
The older I get, the more comfortable I become with admitting when I don't really know what I'm doing. This  -- the tweaking of voltage settings -- is one of those areas where I definitely lack expertise.

Luckily, I've got some advice from a commentator: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

Anyway, I've always treated electricity a bit like magic (I always had issues with electrochemistry as a youngster, which is why I'm forcing myself to teach it these days), and the older I get the more I wish I had done chemical engineering rather than chemistry. Perhaps we want to benefit society more with age, rather than just benefit from it?

Anyway, here are some literal screenshots -- taken with my trusty old phone:

Main
 I don't see anything odd here.

First half of OC Tweaker
The alternative to 'Manual' in the OC Mode is 'CPU OC Mode', which sounds like something I want to avoid. Anyway, what bothers me is that there's no 'OC OFF' button. I don't know if the BIOS is doing something odd.

More OC Tweaker

HW Monitor. The Vcore and +12v lines fluctuate by about 5-10 mv.
Changes: 
* turn off Cool'n'Quiet.
* Change Multiplier/Voltage Change from Automatic to Manual
** Set CPU Freq multiplier to 17.0x (3400 MHz) instead of 20x (4.0 GHZ) under OC Tweaker
** Set CPU voltage to 1.35 instead of stock 1.3750 V, (OC Tweaker/CPU Voltage)

I've managed to run a full PES scan -- which hasn't worked before -- and re-ran it without issue. Looks like the issue was solved.

I then set the multiplier back to 20 (4000 MHz)  while keeping the CPU voltage at 1.35 V, and relaunched the PES scan. Almost immediate crash:.
me       pts/0        beryllium        Mon Nov  4 13:48   still logged in   
reboot   system boot  3.11.5           Mon Nov  4 13:48 - 14:30  (00:42)    
me       pts/0        beryllium        Mon Nov  4 13:35 - crash  (00:13)    
me       tty1                          Mon Nov  4 13:34 - 13:34  (00:00)    
reboot   system boot  3.11.5           Mon Nov  4 13:34 - 14:30  (00:56)

Not sure what to do now. Did it crash faster because the CPU voltage is low while the multiplier is high? Can it be solved by increasing -- rather than decreasing -- the voltage?

Re-running the PES job again gave a more interesting result -- the job crashed, but not the node (note that the job is exactly the same each time, so it's not a matter of the input):

 Grid integrated density:     191.999968853836
 Requested integration accuracy:   0.10E-06
 d= 0,ls=0.0,diis    13  -1092.4755719318 -1.63D-05  8.49D-06  2.73D-05   719.5
 Grid integrated density:     191.999968846937
 Requested integration accuracy:   0.10E-06
  Singularity in Pulay matrix. Error and Fock matrices removed. 
 PeIGS error from dstebz 4 ...trying dsterf 
 error from dsterf 516  
 error from dsterf 516  
 Error in pstein5. me = 0 argument 10 has an illegal value. 
 Error in pstein5. me = 1 argument 10 has an illegal value. 
 Error in pstein5. me = 2 argument 10 has an illegal value. 
  ME =                     2  Exiting via  
 Error in pstein5. me = 4 argument 10 has an illegal value. 
  ME =                     4  Exiting via  
 Error in pstein5. me = 5 argument 10 has an illegal value. 
  ME =                     5  Exiting via  
5:5: peigs error: mxpend:: -1
 Error in pstein5. me = 6 argument 10 has an illegal value. 
  ME =                     6  Exiting via  
 Error in pstein5. me = 7 argument 10 has an illegal value. 
  ME =                     7  Exiting via  
7:7: peigs error: mxpend:: -1
(rank:7 hostname:oxygen pid:3741):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0
  ME =                     0

And, while the frequency affects the thermal output, a thermal issue shouldn't lead to garbled stuff. This is looking more and more like what the poster wazoo42 recounted: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

5. Swapping the CPU to an approved MOBO.
This is a bit of a cop-out. Most people don't have multiple computers and which happen to have compatible hardware. On the other hand, I have a job to do.

To be able to continue to follow this post you'll need to know this:
There are two nodes (cpu, mobo, psu):
* oxygen: FX 8350, asrock 990fx extreme3, corsair GS700.
* neon: FX 8150, gigabyte GA-FX990D3, corsair GS800.
Both nodes have otherwise similar hardware: 32 gb ram, GT210 nvidia, one PCI network card. Both motherboards support 8150, but only GA-FX990D3 supports 8350.

So we'll move FX8350 to neon, and FX 8150 to oxygen.

I first set the Multiplier/Voltage Change back to Automatic on oxygen in preparation for the FX 8150.
I then shut down the two nodes and unplugged them. And here's where it's not funny anymore: I tried to gently remove the heatsink but the heatsink together with the CPU popped off in spite of the lever being locked. On both nodes.

The CPU was solidly glued to the heatsink in both cases. I managed to get the FX 8350 off its heatsink by gently scraping off the excess thermal paste (dry and solid), but the FX 8150 was a real struggle. In the end I used the back of a knife as a lever (gently). Not ideal.

Anyway, cleaned the fx 8350 heatsink and cpu, applied new thermal paste and installed on the gigabyte fxa990-d3 motherboard. Turned on -- no lights on the mobo. Fans etc all working. Dammit. Googled and saw that bios only supports 8350 from version 9 (incorrect -- looked at wrong mobo, but didn't discover that until later)

So now I have a CPU that'd work, but which I can't install since it's stuck to the heatsink, and one which I can install but not use, since the bios is wrong.

When I put the old CPU (fx8150) back in neon it wouldn't boot either -- the fans were spinning but no motherboard lights went on (e.g. LAN). PCI cards lit up, but nothing on Mobo. Took out the CPU, put back in, took out, put back in.

Put the FX8350 back in the oxygen case. Didn't work either, although here the LAN mobo light was on. Still didn't work though -- no video output and couldn't connect via LAN. Great. Killed two working nodes in on afternoon.

Finally, somehow, I managed to get neon working again. Popped a USB stick in. Set up a USB stick with a 1 GB W95 partition as shown in this post: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

Downloaded bios etc. Couldn't install -- BIOS check error. Googled again -- dammit. BIOS for wrong mother board. And the bios that's installed actually supports 8350.

OK, installed all the CPUs, and now they booted up. I must have installed the CPUs badly -- which doesn't speak well of my attention to detail.

For some reason the card that used to be eth0 now gets assigned as eth2 on oxygen. Checked udev -- doesn't make sense. Turned everything off and checked that the pci card (eth0) was seated properly. Booted -- now ok.

Not sure if I have to recompile all the computational code but I did anyway -- the only difference, according to the acml cpuid.exe util, is that 8350 supports FMA3 while 8150 doesn't. Both support SSE, SSE2, SSE3, AVX, FMA4.





[edit]
After ca two months both boxes are stable in spite of being subjected to heavy work loads. The reason for the crashes/reboots originally must have been due to incompatible mobo/cpu.

18 October 2013

522. Briefly: nvidia installer holding up dpkg

I'm normally using smxi to handle my graphics drivers on debian.

Because I'm working on a QM/MM problem at the moment I wanted to use gromacs (4.6) to generate a solvated molecule, but I kept getting errors about missing libcudart4.so. So I figured I'd install it using apt-get install libcudart4

Everything went fine until I got the following prompt:
Selecting 'No' stops everything, and selecting Yes hangs it.

You can't install or remove any other programs until this has been resolved, as everything ends with
E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

Running
sudo dpkg --configure -a
takes you back to the screenshot above.

The fix:
Run
nvidia-installer --uninstall

At this point you get
You can now select No and go through everything:

After this, you can run
sudo dpkg --configure -a

and now your system is working again so that packages can be installed and removed.

17 October 2013

520. New node: AMD FX 8350/32 Gb RAM/990 FX

Update 5 Nov 2012: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to look at the list over supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:
 http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

Also see this thread: http://www.techpowerup.com/forums/showthread.php?t=184061
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:

Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs...
and
my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp.

and it seemed that low (CPU) voltages precipitated crashes.

Original post:
So I built a new node at the beginning of October 2013, using the following parts:
  • AMD FX 8350 CPU
  • 4*8 Gb GSkill RAM
  • ASRock 990FX Extreme3 motherboard
  • 1 Tb Seagate Barracuda HDD
  • MSI N210 graphics card
  • ASUS NX1101 Gigabit NIC
  • Corsair GS700 PSU
  • Antec GX700 case
NOTE that I'm having issues with spontaneous reboots during extended periods (days) of heavy load (100% CPU) which do not appear to be associated with faulty RAM, so you might want to think twice before using the exact same permutation of parts as is listed above. Most likely there's a power issue -- either the PSU isn't supplying enough juice to the Mobo, or the Mobo isn't supplying enough power to the CPU. Note also that the CPU isn't listed as an officially supported CPU by the motherboard manufacturer.*

See here for my troubleshooting thread: http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

So the main value of this post are the photos which show how easy it is to build a computer. Just don't...well...build this particular computer -- use a different motherboard.

The other value of the post is purely personal -- I just wanted to write down the steps to take whenever installing a new node in my little cluster.

There will be a post later on troubleshooting (and hopefully fixing) the issue of the spontaneous reboots.

*I've built a number of computers for myself as well as for other people and haven't had any issues (other than bad RAM) before. I got lazy this time and am paying for it.

The first step was assembly:

I like the case -- it's metal and feels robust. Having two fans on top is a definite plus as well as it works well with my home-built rack.

Note that the case doesn't come with a printed manual -- to get the manual you need to go online. And it still falls short -- there's no guide as to how to use the many, many cables it comes with. However, it's not rocket science either. Turns out that the case has a molex plug for powering the case fans. So plug in the molex plug to the PSU, then plug in the fans to the four weird plug/cables that the case comes with. Note that the mobo has no plug for the internal connector USB 3 cable that the case comes with.

See here for more details re the case: http://www.hardocp.com/article/2013/09/12/antec_gx700_atx_computer_case_review
Case, closed

The glorious innards of The Case.

The 700GX does not come with a mobo panel -- not that they tend to be useful anyway

Luckily all (most?) mobos come with their own panels -- push it in place before doing anything else. It can need a bit of negotiation in order to snap in properly.

The case came with riser nuts, four PSU screws and lots of screws for the mobo.

Put the riser nuts in the case -- they are the golden thingies

And here's the mother board. I don't know if there are universally accepted recommendations, but I prefer to install the CPU and RAM onto the mother board before installing it in the case -- you have more space to manouver and the risk of breaking the board is smaller.

The heatsink (left) and CPU (right)

The heatsink comes with thermal paste pre-applied. Don't touch it -- you want it to be as smooth and even as possible.

Get the CPU out

Note the yellow triangle in the bottom right corner in the picture

That should match up with the triangle in the bottom left of this picture. Note the raised level on the right side of the CPU socket.

The CPU in place. Note the raised lever. There should be no pushing -- the CPU should fit perfectly without any force whatsoever. If you bend a pin...then good luck.

The lever is in the locked position.

Next put the heatsink on. Give this a bit of thought as you won't want to have to reseat it several times (in the worst case you'll have to go buy some thermal paste, clean the heatsink and reapply the paste). So make sure you line up the fasteners before pushing the heatsink in place.

Everything is locked down.

The motherboard with the processor and heatsink in place. In the picture the CPU fan is attached to the WRONG connector. Look for a connector saying 'CPU FAN' (in the picture it's attached to POWER FAN)

Open the RAM slot fasteners, and push the RAM sticks in place firmly, but without excessive violence. Once the fasteners snap shut by themselves the sticks are properly seated. Improperly seated RAM sticks tend to prevent you from booting and leads to a lot of noise.

All four RAM sticks in place, and the motherboard attached to the case via seven screws that screw into the riser nuts.

The PSU is in place.

Main power and auxiliary power cables attached.

This particular case has a special tray for the hard drives.

Hard drive in place

SATA data and power cables attached

The other end of the SATA data cable attaches to the motherboard (SATA 1)

After a bit of rewiring. 

PCI NIC and PCI-E graphics cards in place.
And below is a picture of the cluster -- each node is connected to a gigabit WAN (192.168.2.0/24) router  and a gigabit LAN switch (192.168.1.0/24). 8/32 means 8 cores, 32 gb ram. The cluster 'runs' on the LAN. Each of the four nodes in the picture (there are two three-core nodes in addition) are connected to a KVM. Jobs are managed using SGE.

It's questionable whether one can really call it a cluster though since I run each job on a single node for performance reasons. It still attracts attention from visitors to my office though.



Software:
I then installed debian wheezy on it. During the installation I was notified that I might want to consider enabling non-free to get the r8169 and tg3 firmwares

So after enabling non-free in the sources I did:
sudo apt-get install firmware-realtek firmware firmware-linux-nonfree

Didn't seem to change anything though -- everything was working fine before too.

I also installed amd64-microcode which, if I understand things correctly, should obviate the need for some of the full BIOS updates.

Other little housekeeping things:
I first sorted out
INIT: Id "co" respawning too fast: disabled for 5 minutes
as shown here: http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy-small-fixes.html

I then installed a few basic thing:
sudo apt-get install vim screen sinfo gawk lm-sensors

and made a ~/.vimrc:
set number set pastetoggle=<f3> nnoremap <f4> :set nonumber!<CR>

And set vim to the default editor in lieu of nano:
sudo update-alternatives --config editor

I edited /etc/default/sinfo to make it use the correct network:
OPTS="${OPTS} --quiet --bcastaddress=192.168.1.255"
I set up 'static' dhcp on the WAN router.

On the node, I then sorted out /etc/network/interfaces  to use dhcp on eth1 and 192.168.1.180 on eth0, and to route everything properly (i.e. local traffic over eth0, and everything else over eth1):

auto lo
iface lo inet loopback

auto eth1
iface eth1 inet dhcp

auto eth0
iface eth0 inet static
address 192.168.1.180
gateway 192.168.1.1
netmask 255.255.255.0

post-up ip route flush all
post-up route add default eth1
post-up route add -net 192.168.1.0 netmask 255.255.255.0 gw 192.168.1.1 eth0

SGE won't work properly unless you edit /etc/hosts:
127.0.0.1       localhost
#127.0.1.1      oxygen
192.168.1.180   oxygen

The way my cluster works is that every node has its own shared folder.
mkdir ~/oxygen
mkdir ~/scratch
chmod 777 ~/oxygen

Export it as shown here: http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-sharing-folder.html
Set up ssh key login in both directions:
ssh-keygen
vim ~/.ssh/authorized_keys

Then add the new node to the cluster: http://verahill.blogspot.com.au/2013/08/501-briefly-adding-new-node-to-sge.html
Build nwchem as shown here: http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html
Set up gaussian as shown here: http://verahill.blogspot.com.au/2012/05/settiing-up-gaussian-g09-on-debian.html
Fix shmem: http://verahill.blogspot.com.au/2012/10/shmmax-revisited-and-shmall-shmmni.html

Finally, to address this issue regarding corrupt packages during SSH sessions I then added to /etc/rc.local/sbin/ethtool -K eth1 rx off tx off

11 October 2013

519. Formatting an XYZ molecular geometry file using python

This is a silly little script -- ECCE, which I use to manage all my computations, is very particular about what XYZ files it can and cannot read -- if the number of spaces between the coordinates are wrong, it will fail to read the structure. So here's a python script, based on parts I've cannibalised from other scripts that I've written in the past (i.e. it could be made more elegant, but I had the parts ready and just wanted something that works) which takes a somewhat ugly XYZ file and turns it into something beautiful. At least in the eyes of ECCE.

The impetus for this was that someone gave me an XYZ file the had generated by copying columns from Excel. On Mac. mac2unix took care of the line endings, but there were tabs (\t) all over the place.

Usage:
polish_xyz ugly.xyz pretty.xyz

Script:
#!/usr/bin/python
import sys 

def getrawdata(infile):
    f=open(infile,'r')
    n=0 
    preamble=[]
    struct=[]
    
    for line in f:
        if n<2: data-blogger-escaped-if="" data-blogger-escaped-line.rstrip="" data-blogger-escaped-n="" data-blogger-escaped-preamble="">3:
            line=line.rstrip()
            struct+=[line]
        n+=1
    xyz=[struct]
    return xyz, preamble

def genxyzstring(coords,elementnumber):
    x_str='%10.5f'% coords[0]
    y_str='%10.5f'% coords[1]
    z_str='%10.5f'% coords[2]
    element=elementnumber
    xyz_string=element+(3-len(element))*' '+10*' '+\ 
    (8-len(x_str))*' '+x_str+10*' '+(8-len(y_str))*' '+y_str+10*' '+(8-len(z_str))*' '+z_str+'\n'
    
    return xyz_string

def getstructures(rawdata,preamble,outfile):
    n=0 
    for structure in rawdata:
        n=n+1
        num="%03d" % (n,)
        g=open(outfile,'w')
        itson=False
        cartesian=[]
    
        for item in structure:
            coordx=filter(None,item.split(' ')) 
            coordy=filter(None,item.split('\t'))
            if len(coordx)>len(coordy):
                coords=coordx
            else:
                coords=coordy
            coordinates=[float(coords[1]),float(coords[2]),float(coords[3])]
            element=coords[0]
            cartesian+=[genxyzstring(coordinates,element)]
        g.write(str(preamble[0])+'\n')
        g.write(str(preamble[1])+'\n')
        for line in cartesian:
            g.write(line)
        g.close()
        cartesian=[]
    return 0

if __name__ == "__main__":
    infile=sys.argv[1]
    outfile=sys.argv[2]
    xyz,preamble=getrawdata(infile)
    structures=getstructures(xyz,preamble,outfile)

518. Generating enantiomers of molecular structures given in XYZ coordinates using python.

What I'm showing here is probably overkill -- there may be better ways of doing this with sed/awk*. However, since I had most of the code below ready as it was part of another script, doing this is python was quick and easy. Plus it's portable-ish.

*[ECCE, which I use for managing my calculations, is very, very particular about the format of the XYZ file, including the number of chars between coordinates. So simply doing e.g.

cat molecule.xyz|gawk '{print $1,$2,$3,-$4}'

won't work. Not all pieces of software are that picky when it comes to xyz coordinates though -- jmol, for example, is very forgiving.]

Anyway, the script below flips a molecular structure by taking the geometry given as a (properly formatted) XYZ file, and changing the sign in front of the Z coordinates. It's that simple.
Save it, call it flip_xyz, make it executable and call it using e.g.

flip_xyz molecule.xyz flipped_molecule.xyz

Script:
#!/usr/bin/python
import sys 

def getrawdata(infile):
    f=open(infile,'r')
    n=0 
    preamble=[]
    struct=[]
  
    for line in f:
        if n<2: data-blogger-escaped-if="" data-blogger-escaped-line.rstrip="" data-blogger-escaped-n="" data-blogger-escaped-preamble="">1:
            line=line.rstrip()
            struct+=[line]
        n+=1
    xyz=[struct]
    
    return xyz, preamble

def genxyzstring(coords,elementnumber):
    x_str='%10.5f'% coords[0]
    y_str='%10.5f'% coords[1]
    z_str='%10.5f'% -coords[2]
    element=elementnumber
    xyz_string=element+(3-len(element))*' '+10*' '+\ 
    (8-len(x_str))*' '+x_str+10*' '+(8-len(y_str))*' '+y_str+10*' '+(8-len(z_str))*' '+z_str+'\n'
    
    return xyz_string

def getstructures(rawdata,preamble,outfile):
    n=0 
    for structure in rawdata:
        n=n+1
        num="%03d" % (n,)
        g=open(outfile,'w')
        cartesian=[]
    
        for item in structure:
            coordx=filter(None,item.split(' ')) 
            coordy=filter(None,item.split('\t'))
            if len(coordx)>len(coordy):
                coords=coordx
            else:
                coords=coordy
    
            coordinates=[float(coords[1]),float(coords[2]),float(coords[3])]
            element=coords[0]
            cartesian+=[genxyzstring(coordinates,element)]
    
        g.write(str(preamble[0])+'\n')
        g.write(str(preamble[1])+'\n')
        for line in cartesian:
            g.write(line)
        g.close()
        cartesian=[]
    return 0

if __name__ == "__main__":
    infile=sys.argv[1]
    outfile=sys.argv[2]
    xyz,preamble=getrawdata(infile)
    structures=getstructures(xyz,preamble,outfile)