Other Checks and Tools

This section gives some additional background, ideas, tools, and routine checks for verifying and monitoring the operation of the ISS workstation.

Utilities

The following list mentions a few programs which are useful for diagnostic and maintenance work. Most of the programs belong to the Zebra system, so full descriptions may be found in the Zebra documentation, such as the Advanced User's Manual. Most of the commands will provide a detailed usage summary if invoked only with the -help option.

dsdump [platform...]

This produces a summary of all data files in the data store, either for the platforms named on the command line or, if none are named, for all platforms. The summary includes the data times and the number of observations in each file.

dsdelete {platform} {time}

Use this command to delete data from the data store prior to a given time. This command actually removes data files from the disk, so obviously it should be used carefully and only if the data to be deleted have been archived elsewhere. Run dsdelete -h, or just dsdelete with no options, to see the full summary of options.

dsrescan platform...

dsrescan -all

This command tells the data store daemon to rescan data directories and update its file database. With the -all option, the daemon rescans directories for all of the platforms. Without -all, the daemon rescans each of the platforms named as parameters on the command line. Use dsrescan to notify the data store of additions or deletions of files made directly to a data store directory, as opposed to through the Zebra commands.

mstatus

This command returns a listing of all processes connected to the Zebra message system. This is a handy way to check whether all of the correct programs are running. The listing also contains some useful information, including the process identifier for each program and the name by which the program is known to the message system. The name is needed to query the program with the zquery command.

ncdump {file}

The ncdump command can be used to examine the contents of datastore files stored in the netCDF format. Files in netCDF usually have a filename extension of .cdf or .nc.

zquery {process}

Some of the programs connected to the Zebra message system can be queried for their status. The zquery command performs this query and writes the response to the terminal window. Specify the name of the process to query in the process parameter. The Data Checks section gives an example of using zquery to check on the datastore daemon.

Data Checks

The usual first step in verifying proper ISS operation is to examine the graphical data displays to see that data products are being ingested, stored, and displayed properly. If data measurements are making it to the data plots, then it is a good bet they are being ingested and stored as well. To check all of the data platforms and their plots, cycle through each of the graphical display entries in the DISPLAY menu on the icon bar. Look carefully for data gaps, noisy values, or values that exceed reasonable ranges. Trust your intuition on which values look reasonable.

Next, check the Data Available display. Open this window using the DATA menu on the icon bar. For each platform, check that the end time of the data period includes the most recent data which should have been recorded. Click on a platform to display detailed information on the observations stored in that platform. In particular, check to see that data files exist at the expected frequencies and times.

Sometimes it can be reassuring to query the datastore daemon directly and ask it for its status. Among other things, a response assures that the daemon is actually running and able to respond to other Zebra programs. Use the zquery command to send a query to the datastore daemon, as shown below. In the Zebra message system, the daemon goes by the name of DS_Daemon.

104 iss2:/iss/home/iss> zquery DS_Daemon
Zebra data store daemon, proto 00980417
ZebraVersion: 5.0beta-ExportDate
ZebraVersion: Research Data Program, NCAR
Copyright: University Corporation for Atmospheric Research, 1987-2000

Up since Sat Apr 22 00:25:34 2000, 4 days, 19 hrs, and 17 mins ago  (1)
No full rescans have occurred.
Sources:
           default: /ds   (2)

  Platform classes: 5 used of 100 allocated, grow by 50
         Platforms: 5 used of 200 allocated, grow by 50
                    0 composite platforms with 0 subplatforms
      Memory usage: 540 bytes for platforms
                    1900 bytes for classes

         Variables: revision method: stat();
                    debugging: disabled;

(1)
The uptime line is perhaps the most useful part of the status message. It tells when the daemon started and how long it has been running.
(2)
In case there is ever any question, the status response includes the default data directory for all of the platforms. For the ISS, all of the platforms share the same data directory tree—the default—which should be /ds.
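
If you check the daemon often, the uptime line can be picked out of the response automatically. The following is only a sketch, assuming a POSIX sh shell and awk are available on the workstation; ds_uptime is a hypothetical helper name, not part of Zebra. It reads a zquery response on standard input, prints the "Up since" line, and returns a failure status if no such line is found.

```shell
#!/bin/sh
# ds_uptime (hypothetical helper): read a zquery response on stdin and
# print the line that tells when the daemon started.
ds_uptime() {
    # Print the first "Up since" line; exit non-zero if there is none.
    awk '/^Up since / { print; found = 1; exit } END { exit !found }'
}

# Typical use on the workstation (not executed here):
#   zquery DS_Daemon | ds_uptime || echo "no uptime line from DS_Daemon"
```

How zquery behaves when the daemon is not running is not covered here, so treat a missing uptime line as a reason to look closer rather than as a diagnosis.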

Checking Disk Space

Use the UNIX df command to display a disk usage table. The table shows the disk space used and the disk space remaining for each partition in the system. The absolute space remaining in a partition is more important than the percentage of space remaining. The df command is almost always invoked with the -k option so that it reports disk usage in kilobytes.

49 iss2:/iss/home/iss> df -k
                                       (1)
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t3d0s0     398982  213558  145526    60%    /          (2)
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
swap                   78628     188   78440     1%    /tmp
/dev/dsk/c0t1d0s2     963662  582866  322977    65%    /u2        (3)

(1)
The avail column shows how many kilobytes are still available on each partition. For example, in this table there are still about 323000 kilobytes free on the /u2 filesystem, or about 320 megabytes.
(2)
The root filesystem contains all of the files and directories not explicitly mounted from somewhere else. The main tree for the ISS software, /iss, is on this filesystem. The root filesystem may slowly grow as log files or temporary files grow, but dramatic increases in used space or capacities nearing 95% probably indicate a problem.
(3)
The /u2 filesystem houses all of the transient and persistent data and log files used by the ISS software. If the available free space drops below about 20 megabytes, there may be a problem. The datastore should be cleaning off old data files to free up space before the filesystem becomes that full.

If the Jaz disk is currently mounted, it will also appear in the disk usage table.
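
The check on the avail column can also be scripted. The sketch below assumes a POSIX sh shell and awk, and that df -k prints each partition on a single line as in the table above; low_space is a hypothetical helper name. It reads the df -k table on standard input and prints a warning when the named partition has fewer kilobytes available than the given minimum. The value 20480 is the 20 megabyte guideline for /u2 expressed in kilobytes.

```shell
#!/bin/sh
# low_space MOUNT MIN_KB (hypothetical helper): read `df -k` output on
# stdin and warn if the partition mounted on MOUNT has fewer than MIN_KB
# kilobytes available.
low_space() {
    awk -v mnt="$1" -v min="$2" '
        # Field 4 is the "avail" column; field 6 is the mount point.
        NR > 1 && $6 == mnt && $4 + 0 < min + 0 {
            printf "WARNING: %s has only %d KB free\n", mnt, $4
        }
    '
}

# Typical use on the workstation (not executed here):
#   df -k | low_space /u2 20480
```

The function prints nothing when the partition has enough space, so it can be run routinely without cluttering the terminal.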

Status Scripts

When the ISS transfers data back to ATD, it also sends along a summary of the system's status from the init.iss status script. The script runs several commands consecutively to generate a single status report. Some of the commands have already been mentioned, and some of the information is not very interesting for casual operations. However, if you ever encounter an unexplainable problem and seek assistance from someone else, especially over the phone, you may be asked to run this script and help analyze the output.

The command line is simply the name of the script and the method to run. The output, however, is large and not so simple, so either pipe the output into more, as shown below, or be prepared to use the terminal window scroll bars to scroll up to the beginning of the output:

105 iss2:/iss/home/iss> init.iss status | more

The output begins by identifying the host producing the output and the time of the output. Then the first section gives the output of several system status commands.

For example, the very first part of this section is the output from the disk space command df -k, which is mentioned above. In the output below, there is an entry for the Jaz drive because at the time the device /dev/jaz was mounted on the directory /jaz. The system mounts the Jaz drive automatically whenever it needs to access the /jaz directory and unmounts it after a minute without access, so the /dev/jaz mount entry will not always appear in df -k output.

The last part of this section is the output from the ifconfig command, which lists important parameters for each network interface in the system. The interfaces have names like lo0, le0, and ipdptp0. See the chapter Networking for more information on networking.

In the particular example below, you can tell that the PPP connection has been established over the ipdptp3 interface because the 1.1.1.n dummy addresses have been replaced with real 128.117.80.n internet addresses.

System status for iss1 on Thu Apr 27 04:05:10 GMT 2000
-----  
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t3d0s0     357638  260151   61724    81%    /
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
/dev/dsk/c0t1d0s0     388998  288727   61372    83%    /u2
swap                   13352     212   13140     2%    /tmp
/dev/jaz             1894319  679525 1157965    37%    /jaz
-----  
 10:05pm  up 118 day(s),  4:58,  1 user,  load average: 0.35, 0.10, 0.05
-----  
Users: iss
-----  
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s1 s3 s4   in   sy   cs us sy id
 0 0 0    516  1400   0   8  0  0  0  0  0  0  0  0  0   19   55   34  1  2 98
-----  
      tty          fd0           sd1           sd3           sd4          cpu
 tin tout kps tps serv  kps tps serv  kps tps serv  kps tps serv  us sy wt id
   2    6   0   0    0    2   0   46    2   0  133    0   0  718   1  2  0 97
 
-----  
lo0: flags=849<UP,LOOPBACK,RUNNING,MULTICAST> mtu 8232
	inet 127.0.0.1 netmask ff000000 
le0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST> mtu 1500
	inet 128.117.80.84 netmask ffffff00 broadcast 128.117.80.255
ipdptp0: flags=8d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 8232
	inet 1.1.1.1 --> 1.1.1.2 netmask ffffff00 
ipdptp1: flags=8d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 8232
	inet 1.1.1.3 --> 1.1.1.4 netmask ffffff00 
ipdptp2: flags=8d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 8232
	inet 1.1.1.5 --> 1.1.1.6 netmask ffffff00 
ipdptp3: flags=28d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST,UNNUMBERED> mtu 1500
	inet 128.117.80.84 --> 128.117.80.29 netmask ffffff00 
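
The test for an established PPP link, a point-to-point interface whose address is no longer a 1.1.1.n dummy, can be sketched as a small filter. This assumes a POSIX sh shell and awk, and that the interface listing has the layout shown above; ppp_links is a hypothetical helper name. It reads the ifconfig listing on standard input and prints each ipdptp interface that carries a real address.

```shell
#!/bin/sh
# ppp_links (hypothetical helper): read an ifconfig listing on stdin and
# print every ipdptp interface whose local address is not a 1.1.1.n dummy.
ppp_links() {
    awk '
        # An interface header starts in column one: "ipdptp3: flags=..."
        /^[a-z]/ { ifname = $1; sub(/:$/, "", ifname) }
        # The indented address line reads: inet LOCAL --> REMOTE netmask ...
        $1 == "inet" && ifname ~ /^ipdptp/ && $2 !~ /^1\.1\.1\./ {
            print ifname, $2, "-->", $4
        }
    '
}

# Typical use on the workstation (not executed here; -a lists all
# interfaces):
#   ifconfig -a | ppp_links
```

No output means that none of the point-to-point interfaces currently has a real address.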

Next, there is a short section listing the current crontab entries for user iss. This section will usually look something like the example below, containing only an entry to run the data transfers periodically. If data transfers have been disabled, then the crontab output may only contain the single line at the top.

----- Crontab:
# ISS boot crontab generated Thu Sep  2 22:38:53 GMT 1999
#   datasend:
#   datasend: periodic data transfer installed Wed Apr 26 20:40:17 GMT 2000
#   datasend:
5 10,22 * * * /iss/etc/init.d/datasend run
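
In the entry above, the first two fields are the minute and the hour, so the transfer runs at five minutes past hours 10 and 22 every day. To check quickly whether transfers are enabled, you can look for an active (uncommented) datasend line. The sketch below assumes a POSIX sh shell and grep; datasend_scheduled is a hypothetical helper name. It reads a crontab listing on standard input and succeeds only if such a line is present.

```shell
#!/bin/sh
# datasend_scheduled (hypothetical helper): read a crontab listing on
# stdin; succeed only if an uncommented line mentions datasend.
datasend_scheduled() {
    # Drop comment lines first, since the header comments also say "datasend".
    grep -v '^[[:space:]]*#' | grep -q 'datasend'
}

# Typical use on the workstation (not executed here):
#   crontab -l | datasend_scheduled && echo "data transfers are scheduled"
```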

While the first few sections provide general system status, the next section finally provides some feedback about the ISS software itself, particularly all of the Zebra processes.

----- Zebra:
(1)
'Message manager'@iss1: pid 332, uid 6000 (iss)
unix socket: /tmp/fcc.socket; internet socket: not enabled
session began: Thu Dec 30 16:09:46 1999; proto version 'V-1.5'
12217136 messages sent, 607537424 bytes (26834/1789580 broadcast)
	780 disconnects, with 4 pipe signals, 761282 del rd 0 wt
 Process  'Logger-334' on 5 (p 334), send 796/3512, rec 26838/1789764, nd 0
 Process  'Timer' on 6 (p 336), send 10152649/161500042, rec 266581/1279828, nd 0
 Process  'DS_Daemon' on 7 (p 338), send 925768/289463740, rec 809936/141083860, nd 0
 Process  'Archiver-19301' on 8 (p 19301), send 120510/30895632, rec 115306/37476576, nd 0
 Process  'is' on 9 (p 25189), send 2590/164559, rec 112681/1802940, nd 0
 Process  'Status reporter' on 10 (p 29127), send 3/80, rec 14/881, nd 0
 Process  'Logger-19341' on 11 (p 19341), send 660/2968, rec 24994/1658013, nd 0
 Process  'Displaymgr' on 12 (p 19350), send 116/125688, rec 3937/10150085, nd 0
 Process  'Clock' on 13 (p 19351), send 7/248, rec 9197375/147213546, nd 0
 Process  'issgp1' on 14 (p 19360), send 408518/18425710, rec 513529/103935194, nd 0
Event mask: 0x17
Groups:
 TimeChange (2): Clock, issgp1
 EventLoggers (2): Logger-334, Logger-19341
 Archiver (1): Archiver-19301
 Graphproc (2): Clock, issgp1
 DataStore (3): Archiver-19301, Displaymgr, issgp1
 Everybody (10): Logger-334, Timer, DS_Daemon, Archiver-19301, Logger-19341, Displaymgr, Clock, issgp1, is, Status reporter
 Client events (5): Logger-334, Timer, DS_Daemon, Logger-19341, Displaymgr

(2)
Timer module version $Revision: 1.2 $ $Date: 2002/04/30 06:00:55 $
Current time is 27-Apr-2000,4:05:12
	Clock: alarm 27-Apr-2000,4:05:13.693842 incr 10 N 9197314
	is: alarm 27-Apr-2000,4:06:10.413970 incr 600 N 28052
	is: alarm 27-Apr-2000,4:06:10.438972 incr 600 N 28052
	is: alarm 27-Apr-2000,4:06:10.462678 incr 600 N 28052
	is: alarm 27-Apr-2000,4:06:10.511596 incr 600 N 28052
	is: alarm 27-Apr-2000,4:34:10.535717 incr 72000 N 234
	is: alarm 27-Apr-2000,4:34:10.559653 incr 72000 N 234
	DS_Daemon: alarm 27-Apr-2000,5:09:57.012509 incr 72000 N 1418
	Archiver-19301: alarm 27-Apr-2000,6:00:00.000000 incr 72000 N 1278

(3)
Zebra data store daemon, proto 00980417
ZebraVersion: 4.2.x 
ZebraVersion: Research Data Program, NCAR 
Copyright: University Corporation for Atmospheric Research, 1987-1996 

Up since Thu Dec 30 16:09:55 1999, 118 days, 4 hrs, and 55 mins ago
No full rescans have occurred.
Sources: 
           default: /ds

  Platform classes: 5 used of 100 allocated, grow by 50
         Platforms: 5 used of 200 allocated, grow by 50
                    0 composite platforms with 0 subplatforms
      Memory usage: 540 bytes for platforms
                    1900 bytes for classes

         Variables: revision method: stat(); 
                    debugging: disabled; 

(1)
The first section should look familiar, since it is the output from the mstatus command.
(2)
The Zebra timer program provides timing services to other Zebra programs. The output shown is its response to zquery. The ingest scheduler, is, should normally have several entries here, one for each of its ingest procedures which must wake up periodically. Likewise, the archiver command will have an alarm entry here while it sleeps between archive writes. The Clock program, which shows the Clock icon on the display, naturally uses an alarm entry to update itself every second.
(3)
The last part of the status output for Zebra is the datastore daemon query response, which is described earlier in Data Checks.
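
Checking the process listing by eye becomes tedious when done often. The sketch below assumes a POSIX sh shell and grep, and relies on the "Process 'name' on" layout shown in the listing above; zebra_missing is a hypothetical helper name. It reads the listing on standard input and reports each name given on the command line that does not appear. The names in the usage line are examples taken from the sample output, not an authoritative list of required programs.

```shell
#!/bin/sh
# zebra_missing NAME... (hypothetical helper): read an mstatus listing on
# stdin and report every NAME that is not connected to the message system.
zebra_missing() {
    listing=$(cat)      # capture the whole listing so it can be searched repeatedly
    for name in "$@"; do
        # Each client appears as:  Process  'NAME' on N (p PID), ...
        printf '%s\n' "$listing" | grep -F -q -- "'$name' on " ||
            echo "missing: $name"
    done
}

# Typical use on the workstation (not executed here):
#   mstatus | zebra_missing Timer DS_Daemon Displaymgr
```

Names that include a process identifier, such as Logger-334, change from one session to the next, so this exact-name check suits only the fixed names.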

Finally, the init.iss status script appends the last several lines of the system log file, /var/adm/messages. Most of the messages in this file should be informational only, such as notices of successful adjustments of the system clock by ntpdate.

----- System log:
Apr 23 10:05:11 iss1 ntpdate[15518]: step time server 192.52.106.6 offset 1.253221 sec
Apr 23 22:05:10 iss1 ntpdate[21699]: step time server 192.52.106.6 offset 1.386205 sec
Apr 24 06:00:24 iss1 named[25886]: starting.  named 4.9.4-P1
Apr 24 06:00:24 iss1 named[25887]: Ready to answer queries.
Apr 24 10:05:10 iss1 ntpdate[28000]: step time server 192.52.106.6 offset 1.308586 sec
Apr 24 22:05:11 iss1 ntpdate[4179]: step time server 192.52.106.6 offset 1.446363 sec
Apr 25 06:00:24 iss1 named[8363]: starting.  named 4.9.4-P1
Apr 25 06:00:24 iss1 named[8364]: Ready to answer queries.
Apr 25 10:05:10 iss1 ntpdate[10475]: step time server 192.52.106.6 offset 1.116340 sec
Apr 25 22:05:10 iss1 ntpdate[16627]: step time server 192.52.106.6 offset 1.448122 sec
Apr 26 06:00:24 iss1 named[20840]: starting.  named 4.9.4-P1
Apr 26 06:00:25 iss1 named[20841]: Ready to answer queries.
Apr 26 06:01:03 iss1 unix: ip_rput: DL_ERROR_ACK for 7, errno 3, unix 0
Apr 26 06:01:38 iss1 last message repeated 3 times
Apr 26 06:02:18 iss1 unix: ip_rput: DL_ERROR_ACK for 7, errno 3, unix 0
Apr 26 06:08:41 iss1 last message repeated 24 times
Apr 26 06:08:48 iss1 unix: ip_rput: DL_ERROR_ACK for 7, errno 3, unix 0
Apr 26 06:10:03 iss1 last message repeated 8 times
Apr 26 10:05:15 iss1 ntpdate[22960]: step time server 192.52.106.6 offset 1.354962 sec
Apr 26 22:05:10 iss1 ntpdate[29109]: step time server 192.52.106.6 offset 1.224128 sec
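
When the log is long, it can help to hide the routine entries so that anything unusual stands out. The sketch below assumes a POSIX sh shell and grep; unusual_messages is a hypothetical helper name, and treating the named startup notices as routine alongside the ntpdate notices is an assumption based only on the sample above. It reads log lines on standard input and prints everything except those two kinds of entries.

```shell
#!/bin/sh
# unusual_messages (hypothetical helper): read system log lines on stdin
# and print everything except the routine ntpdate and named notices.
unusual_messages() {
    grep -v -e ' ntpdate\[[0-9]*\]: ' -e ' named\[[0-9]*\]: '
}

# Typical use on the workstation (not executed here):
#   unusual_messages < /var/adm/messages
```

The remaining lines are not necessarily errors; the filter only narrows down what is left for a person to read.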