Quick Help

 

Most Common Customer Support Questions

 

Host Troubleshooting Wizard

Our Host Troubleshooting Wizard helps you quickly resolve common driver and ULP (upper-layer protocol) technical problems.

 

Production Rollout Checklist

This Production Rollout Checklist describes the steps needed to verify a site installation before rolling it into production. Please consider all of the information below carefully while planning your grid production deployment.

 


Quick Troubleshooting Guide

Sometimes a little helpful direction can be just what you need to get through a tough installation problem. The Quick Troubleshooting Guide below provides instructions on how to isolate problems.

 

Grid Director Quick Help

RcvSwitchRelay errors: is it a problem?
No. This counter increases with every multicast ARP, and you should ignore it.
 
Which error counters do we use to verify the fabric?

Counter                          Importance
Width                            All links should be running at 4X. Any link reporting 1X is bad and needs to be repaired.
SymbolError                      Can increase without a significant problem being present.
LinkErrorRecovery                An increasing LinkErrorRecovery count may indicate a bad link.
LinkDowned                       Number of times the port has gone down (usually for valid reasons).
PortRcvErrors                    Should not be increasing. An increasing count indicates a bad link.
PortRcvRemotePhysicalErrors      A problem is occurring ELSEWHERE in the fabric: this port received a packet that was intentionally corrupted by another switch in the fabric.
PortRcvSwitchRelayErrors         Does not indicate a problem.
PortXmitDiscards                 May indicate that the HOQ (head-of-queue lifetime) or another parameter should be tuned.
PortXmitConstraintErrors         May indicate that a parameter should be tuned.
PortRcvConstraintErrors          May indicate that a parameter should be tuned.
LocalLinkIntegrityErrors         Should not be increasing. An increasing count indicates a bad link.
ExcessiveBufferOverrunErrors     May indicate that a parameter should be tuned.
Status                           A value of "IB-Timeout" means the link is in the Initialize state and failed to negotiate a logical link.
VL15Dropped                      Small increments in this counter are not considered a problem.
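The table above can be applied mechanically by comparing two readings of the same port taken a few minutes apart while traffic is running. The following is a minimal Python sketch, not a Voltaire tool: the function name classify_port is hypothetical, and it assumes each reading is a dict keyed by the counter names in the table (integer values) plus a Width string such as "4X" or "1X".

```python
# Hypothetical helper, not a Voltaire tool: compare two readings of one port.
# Assumption: each reading is a dict keyed by the counter names in the table
# above (integer values) plus a "Width" string such as "4X" or "1X".

# Growth in these counters indicates a bad link, per the table above.
BAD_LINK_COUNTERS = ("PortRcvErrors", "LocalLinkIntegrityErrors", "LinkErrorRecovery")

# Growth in these suggests the HOQ lifetime or another parameter may need tuning.
TUNING_COUNTERS = (
    "PortXmitDiscards",
    "PortXmitConstraintErrors",
    "PortRcvConstraintErrors",
    "ExcessiveBufferOverrunErrors",
)

# SymbolError, LinkDowned, PortRcvSwitchRelayErrors and VL15Dropped are
# deliberately not flagged, since the table does not treat them as faults.


def classify_port(before, after):
    """Return a list of findings for one port; an empty list means no concern."""
    findings = []
    if after.get("Width") == "1X":
        findings.append("link running at 1X: needs repair")

    def delta(name):
        # Counters missing from a reading are treated as zero.
        return after.get(name, 0) - before.get(name, 0)

    for name in BAD_LINK_COUNTERS:
        if delta(name) > 0:
            findings.append(f"{name} increased by {delta(name)}: possible bad link")
    for name in TUNING_COUNTERS:
        if delta(name) > 0:
            findings.append(f"{name} increased by {delta(name)}: consider tuning")
    if delta("PortRcvRemotePhysicalErrors") > 0:
        findings.append("PortRcvRemotePhysicalErrors increased: look ELSEWHERE in the fabric")
    return findings
```

A non-empty result points to a port worth inspecting; counters that the table treats as harmless are ignored.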
 
How to upgrade ASIC firmware
  1. To upgrade all SLB and SFB boards, log in to the CLI as admin (password 123456).
    Type:
    1. enable (password Voltaire)
    2. update firmware chassis
    3. Power-cycle the chassis.

Note! You can also upgrade the firmware on specific boards only.

 
How to get the PortCounters CSV file
  1. Make sure the switch PM (performance monitoring) report is enabled.
  2. Log in to the switch CLI as user admin (password 123456).
    Type:
    1. enable (password Voltaire)
    2. config
    3. sm
    4. pm
    5. pm report-enable set enable
  3. Run traffic and wait 2-3 minutes for data to be gathered.
  4. To get the report, either:
    1. Run export logs from the CLI, or
    2. Open your browser and go to http://switch-ip-address/PortCounters.csv
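If you collect the report regularly, the download can be scripted. This Python sketch assumes the switch serves the report at http://switch-ip-address/PortCounters.csv, as described above, and that the first CSV row is a header; the function names are hypothetical, and the actual column names depend on your switch software version.

```python
# Sketch: download and parse the switch PortCounters.csv report.
# Assumptions: the report is served at http://<switch-ip>/PortCounters.csv
# and its first row is a header. Column names vary by software version.
import csv
import io
import urllib.request


def parse_port_counters(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_port_counters(switch_ip, timeout=30):
    """Fetch PortCounters.csv from the switch web server and parse it."""
    url = f"http://{switch_ip}/PortCounters.csv"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Assumption: the report is plain ASCII/UTF-8 text.
        text = resp.read().decode("utf-8", errors="replace")
    return parse_port_counters(text)
```

Each returned row is a dict whose values are strings, so convert counters with int() before comparing them.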
 
How to upgrade software
  1. Download the switch software from the Voltaire FTP site.
  2. Place the software package on a local FTP server.
  3. Log in to the switch CLI as user admin (password 123456) and set the FTP settings using the following switch CLI commands.
    Type:
    1. enable (password Voltaire)
    2. config
    3. remote (ftp in previous versions)
    4. server (enter the local FTP server IP address)
    5. username (enter the username)
    6. password (enter the password)
    7. exit
    8. exit
    9. update software (enter the full path to the file directory)
  4. Upgrade the switch firmware (if needed):
    1. Log in to the switch CLI as user admin (password 123456).
    2. enable (password Voltaire)
    3. config
    4. update firmware chassis
    5. Once it is done, power-cycle the switch chassis.
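Step 2 can be done with any FTP client. As one option, here is a Python sketch using the standard ftplib module; the host, account, and paths are placeholders, and it assumes your local FTP server accepts uploads from that account. The returned path is only a convenience: confirm the path format your switch version expects for update software.

```python
# Sketch: upload a downloaded switch software package to a local FTP server.
# Host, credentials and paths are placeholders; the server must allow uploads.
import ftplib
import os


def upload_package(host, user, password, local_path, remote_dir="/"):
    """Upload local_path into remote_dir on the FTP server; return the remote path."""
    name = os.path.basename(local_path)
    with ftplib.FTP(host, timeout=60) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        with open(local_path, "rb") as fh:
            # Binary transfer keeps the package byte-for-byte identical.
            ftp.storbinary(f"STOR {name}", fh)
    return remote_dir.rstrip("/") + "/" + name
```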

 

How to Export Logs

In switch code version 3.4, the export logs CLI command combines all logs into a single tar file and sends it to a local FTP server. To export the logs:

  1. Log in to the switch CLI as user admin (password 123456).
    Type:
    1. enable (password Voltaire)
    2. config
    3. ftp
    4. username [set your FTP username]
    5. password [set your FTP login password]
    6. server [set the FTP server IP address]
    7. ExportLogs
    8. exit
    9. exit
  2. You will find the file LOGs.tar in your local FTP directory.
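Before sending LOGs.tar to Support, you can confirm that the archive is readable and see what it contains. A minimal Python sketch using the standard tarfile module, assuming the file has already been copied from the FTP directory to your machine; mode "r:*" also opens the compressed LOGs.tar.gz produced by version 5.1 and higher. The function name is hypothetical.

```python
# Sketch: list the members of an exported log archive.
# Assumption: the archive has been copied locally from the FTP directory.
import tarfile


def list_log_archive(path):
    """Return member names; raises tarfile.ReadError if the file is not a valid archive."""
    # "r:*" auto-detects plain, gzip, bz2 or xz compression.
    with tarfile.open(path, mode="r:*") as tar:
        return tar.getnames()
```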

 

For version 5.1 and higher

  • From the CLI Switch(config-remote)# menu, set the remote server parameters.
  • Issue the exportLOGs CLI command, as in the following example:

    ISR2004-38f2(config-remote)# exportLOGs ?
    exportLOGs   exportLOGs [remote path].
    ISR2004-38f2(config-remote)# server 172.25.2.86
    ISR2004-38f2(config-remote)# password
    password: *******
    ISR2004-38f2(config-remote)# username ftpuser
    ISR2004-38f2(config-remote)# remote show
    remote configuration
    -----------------------
    remote server: 172.25.2.86
    username: ftpuser
    protocol: FTP
    ISR2004-38f2(config-remote)# exportLOGs testing
    Creating archive (could take 1-3 minutes)
    Extracting info
    | 100%
    Logs tarball linked to http://172.28.4.80/voltaire/LOGs.tar.gz
    Succeeded to create log file


In this example, the tar file is uploaded to the remote path given to exportLOGs (testing) on the remote server.
The switch generates the log tar file and transfers it to the remote server by FTP.
Collect the tar file from your remote server and send it to Support.
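If you capture the CLI session output, the tarball link printed by exportLOGs can be picked out and downloaded automatically. This Python sketch assumes the output contains a line of the form shown above (Logs tarball linked to ...) and that the link is reachable from your machine; the function names are hypothetical.

```python
# Sketch: extract the LOGs.tar.gz link from exportLOGs output and download it.
# Assumption: the output contains "Logs tarball linked to <url>" as shown above.
import re
import shutil
import urllib.request

LINK_RE = re.compile(r"Logs tarball linked to (\S+)")


def find_log_url(cli_output):
    """Return the tarball URL from the CLI output, or None if it is absent."""
    match = LINK_RE.search(cli_output)
    return match.group(1) if match else None


def download(url, dest, timeout=120):
    """Stream the file at url into the local file dest."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```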

 

 

How to disable ports
  1. Log in to the CLI and go to utilities.
  2. Run port-manage -d {lid} {port}
    Note! port is the internal InfiniBand ASIC port. You can get the LID and port from the port counter CSV file or from netdiscover.

 

How to reset all 1X links in one command
  1. Log in to the CLI and go to utilities.
  2. Run the command netdiscover -r r1x
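To see which links are at 1X before or after the reset (and, per the counter table, which ones still need repair), you can filter the PortCounters.csv rows. This Python sketch assumes rows parsed with csv.DictReader and hypothetical column names LID, Port, and Width; adjust them to match your file's header.

```python
# Sketch: list links reporting 1X width in parsed PortCounters.csv rows.
# Assumption: column names "LID", "Port" and "Width" are hypothetical;
# check the header of your CSV file and pass the real names.


def find_1x_links(rows, lid_col="LID", port_col="Port", width_col="Width"):
    """Return (lid, port) pairs for rows whose width is reported as 1X."""
    return [
        (row[lid_col], row[port_col])
        for row in rows
        # Normalize case and surrounding whitespace before comparing.
        if row.get(width_col, "").strip().upper() == "1X"
    ]
```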

 

How to reconfigure routing to restore fabric balance
When servicing boards, removing or restoring a large number of links, or even rebooting many nodes, it is recommended to run the sm-info sm-initiate-fabric-reconfiguration set command to restore balanced fabric routing.

GridStack Quick Help

GridStack 3.x How to Upgrade
  1. Remove the old stack:
    1. ibhost.init stop
    2. rpm -qa | grep ibhost (to find the installed RPM name)
    3. rpm -e {rpm name}
    4. rm -rf /usr/voltaire /usr/mellanox /usr/mst
  2. Install the new RPM:
    1. rpm -ivh {rpm name}
    2. Start the stack: ibhost.init start
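The rpm -e step needs the package name found by rpm -qa. This Python sketch builds the removal command from rpm -qa output, assuming that every installed package whose name contains ibhost belongs to the old stack; the function name is hypothetical, and it only returns the command text, so review it before running it as root.

```python
# Sketch: build the "rpm -e" command for the old ibhost stack.
# Assumption: every package whose name contains "ibhost" should be removed.
import shlex


def ibhost_erase_command(rpm_qa_output):
    """Return an 'rpm -e ...' command string, or None if nothing matches."""
    # rpm -qa prints one installed package name per line.
    names = [line.strip() for line in rpm_qa_output.splitlines() if "ibhost" in line]
    if not names:
        return None
    return "rpm -e " + " ".join(shlex.quote(n) for n in names)
```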
 
GridStack 4.x - How to install
  1. Install the driver:
    1. tar -jxvf GridStack-4.x ........tar.bz2 (extract the archive)
    2. cd GridStack-4.x (the installation directory created by the previous step)
    3. Run: ./install.sh
  2. Start the driver:
    1. /etc/init.d/gridstack {start|stop}
    2. source /etc/profile (to load the environment variables)

 
GridStack 4.x How to Configure an IPoIB Interface
  1. Run ib-config-as-eth
  2. Example:

     ib-config-as-eth -network 192.168.80.05 -netmask 255.255.255.0 -ib ib -mtu 2044 -os RH -overwrite
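Before running ib-config-as-eth, it can help to check that the address and netmask are consistent. This Python sketch uses the standard ipaddress module and assumes that -network takes this host's IPoIB address; the other flags are copied unchanged from the example above, and the function name is hypothetical.

```python
# Sketch: validate IPoIB settings and format an ib-config-as-eth command line.
# Assumption: -network takes this host's IPoIB address; the other flags are
# copied from the example above. Nothing is executed here.
import ipaddress
import shlex


def ib_config_command(address, netmask, mtu=2044, os_name="RH"):
    """Validate address/netmask and return the command as a string."""
    # Raises ValueError for malformed addresses or non-contiguous netmasks.
    iface = ipaddress.IPv4Interface(f"{address}/{netmask}")
    net = iface.network
    # Reject the network and broadcast addresses (not meaningful for /31, /32).
    if net.prefixlen < 31 and iface.ip in (net.network_address, net.broadcast_address):
        raise ValueError(f"{address} is not a usable host address in {net}")
    args = [
        "ib-config-as-eth",
        "-network", str(iface.ip),
        "-netmask", str(iface.netmask),
        "-ib", "ib",
        "-mtu", str(mtu),
        "-os", os_name,
        "-overwrite",
    ]
    return " ".join(shlex.quote(a) for a in args)
```

Note that recent Python versions reject IPv4 octets written with leading zeros, so pass 192.168.80.5 rather than 192.168.80.05 to this function.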

 
Voltaire MPI - how to run a quick test
  1. cd /opt/vltmpi/bin (contains the MPI binaries)
  2. mpirun_ssh -np N [hostname] …… [hostname] ./[any MPI application]

Examples:

mpirun_ssh -np 2 host host ./hello_world

Hello world! I'm 0 of 2 on hydra5.support.voltaire.com

Cleaning up all processes ...

done.

mpirun_ssh -np 2 host host ./cpi

Process 0 of 2 on hydra5.support.voltaire.com

Process 1 of 2 on hydra6.support.voltaire.com

pi is approximately 3.1415926544231318, Error is 0.0000000008333387

wall clock time = 0.000386

Cleaning up all processes ...

done.
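When running cpi across many nodes, checking the output by eye gets tedious. This Python sketch checks that every rank from 0 to N-1 printed a "Process i of N on host" line, as in the cpi example above; it assumes that output format (hello_world prints differently), and the function name is hypothetical.

```python
# Sketch: verify that every MPI rank reported in cpi-style output.
# Assumption: the program prints "Process <rank> of <size> on <host>".
import re

RANK_RE = re.compile(r"Process (\d+) of (\d+) on (\S+)")


def check_ranks(output):
    """Return (size, {rank: host}); raise ValueError if any rank is missing."""
    ranks = {}
    size = None
    for rank, total, host in RANK_RE.findall(output):
        size = int(total)
        ranks[int(rank)] = host
    if size is None:
        raise ValueError("no 'Process N of M' lines found")
    missing = sorted(set(range(size)) - set(ranks))
    if missing:
        raise ValueError(f"ranks missing from output: {missing}")
    return size, ranks
```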