Managing the USB Disks and Recording at S-Pol
Gan, Maldives Sep-Dec 2011 Jan 2012
This is an incomplete procedure, as not all details have been developed.
The machine spol-dm has been configured for recording S-PolKa
primary data, and for recording a local copy of SMART-R data. Two
removable USB disks should be attached to spol-dm at all times.
spol-dm must be operational at all times, and it is best if that
machine is not used for other activities, or at least not for any
activity that could force or require a reboot. (Scientists, in
particular, should be discouraged from using spol-dm.)
sci3 is configured the same as spol-dm, and serves as a fail-over
machine for data recording. sci3 also has two USB disks attached, and
those disks are recorded at the same time as those on spol-dm.
However, the sci3 disks will be cleaned off once the spol-dm disks are
verified as "good". The sci3 disks get scrubbed and then transferred
to spol-dm (only after the spol-dm disks are safely removed).
How many days on a disk? This is a somewhat uncertain issue.
Quite a few of the data sets have data thresholded or removed under
certain specific circumstances. For instance, reflectivity with SNR
below, say, -3 dB might be removed and blank-filled. This allows
file compression to make the data sets smaller. The data sets will be
smaller on a clear day than on a day with widespread precip. Still,
on average, we expect to get 5 to 7 days out of a disk.
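As a rough aid for the estimate above, the sketch below projects how many days remain before a disk reaches a chosen fill level. The function name, the 90% fill level and the sample figures are illustrative assumptions only; they are not measured S-PolKa values.

```python
def days_until_full(capacity_gb, used_gb, daily_gb, fill_fraction=0.9):
    """Estimate the days left before usage reaches fill_fraction of capacity.

    All inputs are supplied by the operator. daily_gb should be an average
    over several days, since clear days write less data than days with
    widespread precip.
    """
    if daily_gb <= 0:
        raise ValueError("daily_gb must be positive")
    remaining_gb = capacity_gb * fill_fraction - used_gb
    # A disk already past the fill level has zero days left, not a negative count.
    return max(remaining_gb, 0.0) / daily_gb


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    print(f"{days_until_full(1000, 300, 100):.1f} days remaining")
```

The daily volume is the uncertain input here, so the result is only as good as the average fed into it.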
Monitoring
- Ensure that two USB disks are always attached to spol-dm
- There are dim flashing lights on each disk showing activity
- Available disk space is monitored in Nagios for each disk
- Each day, a timeline is produced for each of the USB disks,
comparing the USB data to pgen2 and pgen1 (the primary and
secondary RAIDs); review this timeline and flag any problems.
[Find the "Matlab Monitoring" or the "Data Timelines" link on
the http://control1 homepage.]
- Ensure that two disks are always attached to sci3, as well.
Monitor the flashing lights, occasionally, and check the disks
while logged into sci3, using the command:
df -h /usb/*
(see notes on this, below). The sci3 disks are not monitored by Nagios.
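Since the sci3 disks are not watched by Nagios, a small script could stand in for the manual df check. The following is a minimal sketch, assuming Python 3 is available on sci3 and that each disk is mounted directly under /usb; the function names and the 80% threshold are hypothetical.

```python
import glob
import os
import shutil


def usb_usage(pattern="/usb/*"):
    """Return {mount_point: percent_used} for each mounted path matching pattern."""
    report = {}
    for path in sorted(glob.glob(pattern)):
        # Skip empty mount-point directories; disk_usage would otherwise
        # silently report the parent filesystem instead of a USB disk.
        if not os.path.ismount(path):
            continue
        usage = shutil.disk_usage(path)  # total, used, free in bytes
        if usage.total > 0:
            # df computes Use% slightly differently (it excludes reserved
            # blocks), so these figures may differ from df by a few percent.
            report[path] = 100.0 * usage.used / usage.total
    return report


def nearly_full(report, threshold_pct=80.0):
    """List the mount points at or above threshold_pct percent used."""
    return [path for path, pct in report.items() if pct >= threshold_pct]


if __name__ == "__main__":
    report = usb_usage()
    for path, pct in report.items():
        print(f"{path}: {pct:.0f}% used")
    if len(report) != 2:
        print(f"WARNING: expected 2 USB disks, found {len(report)}")
    for path in nearly_full(report):
        print(f"WARNING: {path} is nearly full")
```

This does not replace looking at the activity lights; it only reports what the operating system believes is mounted.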
Handling
- To the extent possible, use the USB disks in numerical order
- Disks are stored in the grey suitcase-shaped box under the spol-dm machine;
there is a smaller cardboard box behind this one
- Disks are labeled as SPOL-HDnn (where nn is the disk number)
- Secondary labels are also included in the box. These labels should be filled out
when starting a disk, with the end time added when the disk is removed.
- Disks have been uglified with blue paint along the edges
- If a disk fails to mount, move on to the next disk
- It should require about 6 days to fill a disk (we're working on a more solid number)
- Disks are managed in pairs, and should ultimately be exact images of each other
- Disks for spol-dm are used in increasing numerical order. Disks on sci3 are used in
descending order, starting with numbers in the 60s.
- Disks on sci3 will be re-used multiple times on sci3 (they will not be transferred to
spol-dm). The sci3 disks are "safeties", and will be scrubbed when the data on spol-dm's
disks are verified.
- Sci3 disks can probably be reused about 5 or 6 times before we get
concerned with "wearing them out." [Someone needs to create a label and keep hash
marks on these disks; maybe changing at the start of each month makes sense.]
A Note on logging into Sci3
Assuming you are working on the spol-dm terminal, you can just open
another window, then "shell" into sci3. So:
- Open another X-term (use the "start" menu, or the big X on the
application bar at the bottom of the screen).
- Within that new window (or, for that matter, in any old window), do
ssh sci3
You will see that your window prompt changes to sci3.gate.rtf
- To avoid confusion, exit the window when you are done. Just type "exit", or kill the window
in the usual way.
Changing disks
When Nagios says the spol-dm disks are nearly full (80% to 90% ?), do the following:
- This procedure is ideally done between about 0300 UTC and 0500 UTC. This is
because a newly inserted disk will try to "catch up" on all the data for the
current day. If we are too far into a day when a disk is changed, these slow
USB disks might never catch up.
[Note that it is not possible to do this prior to 0300: JVA's dismount routine is
non-operational until after 0300]
- Check the ID label on the spol-dm disks
- Find the timeline outputs for these disks, and review them. If there are six days
of data on the disk, there should be six timeline plots. The timelines are
available on the S-Pol web (http://control1), and are labeled with the
date of the plot. Within a plot, there are labels on the timelines showing the
disk IDs. Ensure that the pgen2/pgen1 timelines
are complete, and match the SPOL-HDnn timelines.
- Log in to spol-dm as user "operator". In an available window, execute the command
df -h /usb/*
This command will show the usage of the usb disks.
Check that the disks are mostly full, and that the amount of space
used is the same for the two disks.
- If you notice any problems with unequal disk sizes, or anything suspicious,
contact the on-site software engineer. Remember that you may need to preserve
the data on the sci3 disks if the spol-dm disks have failed.
- Execute a dismount, using the "un-plug" application, accessible from
the "applications" bar at the bottom of the screen (the icon looks like a
plug being disconnected; it is near the CIDD application buttons)
- Watch for a change in the USB icons (upper left of desktop), to show that a disk is dismounted
- Disconnect the disks at the disk side of the cable (we're short on cables -- leave the
cables connected to the CPU)
- Allow a half-minute for the disks to spin-down
- Complete the info on the larger disk label.
- Place the disks into their numbered bubble envelopes.
- Make an entry in the S-Pol logbook (behind the first tab). Enter the disk number,
start/end dates (no time is required, since the disks should always start at 00 Z).
Provide any comments.
- Plug the next two disks in the numerical sequence into spol-dm. Watch the icons appear,
then change to a shape like a memory stick.
- Monitor the new disks in a few hours with the
df -h /usb/*
command.
- Find a courier going directly back to Boulder, and have them carry a few disks.
The two disks of a matched pair should never be sent on the same flight back to
the US. I would expect that no one person would have to carry more than three disks.
- Note the name of the courier in the logbook.
- Send Bob Rilling an email detailing who is carrying which disks.
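Two of the checks in the procedure above lend themselves to a quick script: whether the current time falls in the preferred 0300 to 0500 UTC window, and whether the two disks show the same amount of space used. This is only a sketch. The function names are hypothetical, and the 1% tolerance is an assumption, since the procedure simply says the amounts should be the same.

```python
from datetime import datetime, time, timezone


def in_swap_window(now_utc, start=time(3, 0), end=time(5, 0)):
    """True if now_utc falls within the preferred 0300-0500 UTC window."""
    return start <= now_utc.time() <= end


def usage_matches(used_a, used_b, tolerance=0.01):
    """True if two used-space figures agree to within a fractional tolerance.

    The 1% default is an assumption; tighten or loosen it as experience
    with the disks dictates.
    """
    largest = max(used_a, used_b)
    if largest == 0:
        return True  # two empty disks trivially match
    return abs(used_a - used_b) / largest <= tolerance


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("In swap window:", in_swap_window(now))
    # Hypothetical used-space figures (same units for both disks).
    print("Disks match:", usage_matches(500, 502))
```

A mismatch reported here is a reason to contact the on-site software engineer before dismounting, not a diagnosis in itself.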
Handling the backup disks from sci3
A note about general disk reliability
It is expected that the USB Passport Drives could have a high
failure rate. JVA (?) had a discussion with a Western Digital
engineer. The engineer stated that the error rates are phenomenally
high on these small, consumer USB disks. However, most of these
errors are somewhat transient, and they are able to fix the errors in
the software (using pretty fancy parity/error checking). When these
disks fail hard, they fail completely and quickly.
We purchased 72 of the Passport drives. Each was slow-formatted,
completely written with test bits, then purged. Three did not pass
the tests, and another two were very slow to write, so those five
were not brought to Gan. Web reviews complain about failures, but an
actual failure rate cannot be determined. We simply expect some of
the disks to fail.
See here for formatting of USB disks for use by
the archival system.
Failure modes might include:
- Very slow write rates (evidenced by gaps in the timeline, possibly)
- Failure of disk to mount
- Disappearance of all data from the disk
If any of these happen, put a note on the disk, and use another. If
the failed disk should already hold data, replace it with one from sci3.
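Gaps in a timeline, the first failure mode listed above, can also be looked for numerically. The sketch below assumes a list of data-file timestamps has already been gathered for one disk (how they are gathered is not covered here) and reports any interval longer than a chosen limit. The function name and the 30-minute limit are hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive timestamps differ by more than max_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


if __name__ == "__main__":
    # Hypothetical file times, chosen only to show one gap being reported.
    t0 = datetime(2011, 10, 2, 0, 0)
    times = [t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=90)]
    for start, end in find_gaps(times):
        print(f"gap from {start} to {end}")
```

A gap on the USB timeline that is absent from the pgen2/pgen1 timelines points at the disk rather than at the radar.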
--- Bob Rilling ---
NCAR Earth Observing Laboratory
Created: Sun Oct 2 09:57:31 GMT 2011