NWrecover cannot change browse time in GUI on Solaris 10

July 13, 2010

I ran across another problem shortly after I wrote the first post of today. I was attempting to run nwrecover from a Solaris 10 server to restore some data. I was unable to change the browse time, receiving this error:

“An invalid or incomplete time value has been entered. Ensure the value entered includes hours, minutes and seconds and then try again.”

All I attempted to do was change the date; I did not change the time at all. Fortunately, I found a report of a similar problem elsewhere that was solved by installing Solaris patch 119397-09. I am now able to run nwrecover and change the browse time without any problem.
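If you want to check whether that patch is already on a system before calling support, something like this should do it on Solaris 10 (the /var/tmp path below is just an example of wherever you unpacked the patch, not a standard location):

# showrev -p | grep 119397
# patchadd /var/tmp/119397-09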

Update

July 13, 2010

It’s been quite some time since I’ve updated this site. It’s been a busy year, both personally and professionally. Since my last update, we have added two Quantum Scalar i2000 libraries. We now have three libraries: one with 48 drives, one with 32, and one with 16, all LTO-4. We are thinking of upgrading the 16-drive library to an i6000 with LTO-5 drives and moving its existing 16 LTO-4 drives to the library that already has 32. We have upgraded to Networker 7.5.2, still running Solaris 10 on a Sun T2000, and we now have four Networker Storage Nodes to handle all the network backup traffic. We now back up about 50 TB per night. Things got so busy on the first Storage Node that we were running at 80-90% of the capacity of a dual-trunked Gigabit Ethernet card. Once this was discovered, we quickly added a third port to the trunk and traffic calmed down to 20-30%.
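For what it’s worth, assuming the trunk is a standard Solaris 10 dladm link aggregation (the interface name e1000g2 and aggregation key 1 below are placeholders, not our actual configuration), adding an extra port and then watching the traffic spread across the trunk looks roughly like this:

# dladm add-aggr -d e1000g2 1
# dladm show-aggr -s -i 5 1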

The COPAN VTL was a disaster. We discovered that there was a bug in the software that caused the VTL controller to hang if a backup was running at the same time it tried to start de-duplication. This took COPAN almost a year to fix. Then we discovered that the de-duplicated area was almost impossible to read from: we were trying to restore a 300 GB file and the restore was running at 2-5 MB/s. As you can figure out, it timed out before it could finish. This took them several more months to fix. Their solution was to provide us with some “always-on” disk for the landing area with some cache for the de-duped data disks. The reconstitution of the data would run until the cache filled, then the cache would write back to the landing area at about 20 MB/s. This was still not acceptable. Their next solution was to charge us to replace the de-duped data disks with their “always-on” disk. I estimated that the cost to replace these disks would have been over a hundred thousand dollars. Needless to say, COPAN never came back with a quote to do this and we de-commissioned the COPAN. It is now serving to hold down some tile in the data center.

“nsrclone: error, no matching devices on ‘server’”

January 5, 2009

I owe numerous posts on our COPAN situation, but I don’t have the time now. However, COPAN and FalconStor seem to have fixed the major problem the VTL had, so we have put the COPAN VTL back into production. Since we first used it back in March 2008, though, we have upgraded from Networker 7.2.2 to Networker 7.4.2, and EMC has added a field in the Properties of the jukebox which needs to be filled out. The “Read hostname” attribute in the jukebox resource specifies the host that will be used to read the VTL tapes for cloning. Of course, it must be a host that is attached to the jukebox.
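If you would rather set it from the command line than the GUI, nsradmin should also work; the server, jukebox, and storage node names below are made up for illustration:

# nsradmin -s backupserver
nsradmin> . type: NSR jukebox; name: COPAN_VTL1
nsradmin> update read hostname: storagenode1
nsradmin> quit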

“Enabler type is for a different platform”

September 24, 2008

A few months ago, we finally got around to buying Networker licenses for the two COPAN VTL’s we purchased. We had been running on temporary licenses for two jukeboxes because the temporary VTL licenses failed to work.

Once I entered the license codes, an error message popped up: “Enabler type is for a different platform”. We opened a ticket with Networker support and the end result was to load a hotfix, “LGTsc07430”. Unfortunately, this hotfix was supposed to be for Networker 7.4, but after my objection, EMC admitted it should work just fine for 7.3.3 as well.

Unfortunately, installing this fix also forces you to begin using the infamous “daemon.raw” file instead of the old daemon.log we know and hate/love. I tend to keep an open window running a “tail -f” on the daemon.log file. Although you can get the gist of daemon.raw by reading it directly, it really needs to be rendered into plain text to be read properly. Still, I run a tail on the daemon.raw file as well as on the daemon.log file (which is still used and needed).
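For the record, the tool that turns daemon.raw into something readable is NetWorker’s own nsr_render_log utility; the path below is the usual default location and may differ on your install:

# nsr_render_log /nsr/logs/daemon.raw | tail -50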

Sun T2000 failed to boot – fc-fabric Method or service exit time out.

June 20, 2008

In February of this year, after installing some new LTO-4 tape drives in our Quantum (ADIC) i2000, we began having problems with our primary Networker 7.3.3 server, a Sun T2000. It refused to boot while the switch ports to the i2000 robot were enabled. This was the error message we saw:

NIS domain name is chi.navtech.com
svc.startd[7]: svc:/system/device/fc-fabric:default: Method or service exit timed out. Killing contract 31.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method "/lib/svc/method/fc-fabric" failed due to signal KILL.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method or service exit timed out. Killing contract 34.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method "/lib/svc/method/fc-fabric" failed due to signal KILL.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method or service exit timed out. Killing contract 36.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method "/lib/svc/method/fc-fabric" failed due to signal KILL.
svc.startd[7]: system/device/fc-fabric:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
Console login service(s) cannot run.

After a while, we received the following workaround from Sun:

After you add the latest patches and reboot the host, and assuming your problem hasn’t been resolved, do the following:

Get the host booted into multi-user by disabling the involved switch ports or removing the fiber connections.

# cd /etc/cfg/fp

I noticed from the Explorer run on January 15 that you currently have in /etc/cfg/fp:

fabric_WWN_map fabric_WWN_map.old fabric_WWN_map.old2

You need to rename them all to something that doesn’t begin with the word “fabric”, or remove them.

Remove device paths:

# rm /dev/rmt/*

# rm /dev/dsk/c4*; rm /dev/rdsk/c4*; rm /dev/cfg/c4*

# rm /dev/dsk/c5*; rm /dev/rdsk/c5*; rm /dev/cfg/c5*

# rm /dev/dsk/c6*; rm /dev/rdsk/c6*; rm /dev/cfg/c6*

# rm /dev/dsk/c7*; rm /dev/rdsk/c7*; rm /dev/cfg/c7*

# mv /etc/path_to_inst /etc/old_path

Make sure there are no files in /etc that start with the word “path”.

Re-enable switch ports (or re-attach fiber cables)

# luxadm -e forcelip /devices/pci@7c0/pci@0/pci@8/QLGC,qlc@0/fp@0,0:devctl
# luxadm -e forcelip /devices/pci@7c0/pci@0/pci@9/QLGC,qlc@0/fp@0,0:devctl

# luxadm insert

Rebuild device paths:

# devfsadm -p /etc/path_to_inst

# cfgadm -c configure <c#> (for each controller)

# cfgadm -al

If it doesn’t look right at this point, do a reconfiguration reboot (reboot -- -r).

This procedure was used multiple times over the last four months. In fact, it was necessary each time we needed to reboot the Networker server.

Further prodding and a furious email exchange over several weeks finally resulted in this:

The failure is against LUN=ff (255). Please check with tape vendor
to see if it’s possible to unmap LUN 255 from the tape library.
The error which is preventing boot is due to SFK’s inability to configure
LUN 255 (which is of type raid-ctrl and not recognized by Solaris).

So I contacted Quantum’s tech support to get their opinion on this matter. I spoke with a Quantum tech, and he stated that in our environment our NetApps require the Control LUN on the Quantum to be at LUN 255. He also stated that the ANSI standards require it to be at LUN 255.

I reported this to the Sun Engineer and he replied:

escalation engineer reply:

Task Summary: unable to Boot Bug re-occurring need faster solution than previous case 65808366
Note:         Update:

- Since LUN 255 (of type 'array_ctrl') is causing cfgadm to fail because
  there is no target driver for array_ctrl, besides unmapping this LUN
  from the Storage, another workaround is to bind this LUN to our sgen
  (Generic SCSI) driver.

- Disclaimer: I haven't tried this out as I don't have a 3rd party
  storage that maps a LUN of type 'array_ctrl' to Solaris host.

Action(s):

- Customer to try binding "array_ctrl" luns to sgen driver.

  1. Add the following lines to /kernel/drv/sgen.conf to allow array_ctrl
     LUNs to be bound to sgen (Generic SCSI) driver.

     device-type-config-list="array_ctrl";

  2. Add the following lines to /etc/driver_aliases

     sgen "scsiclass,0c"

  3. Reboot the machine to see if the problem is resolved.

After following these instructions and rebooting the Sun T2000 several times, I discovered that the problem was fixed!
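For reference, the same driver alias can be added with update_drv instead of hand-editing /etc/driver_aliases (the sgen.conf entry is still needed either way); I did it the manual way above, so treat this as an untested alternative:

# update_drv -a -i '"scsiclass,0c"' sgen
# grep sgen /etc/driver_aliases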

VTL and the KISS principle

May 29, 2008

We are in the process of adding a COPAN VTL to our backup environment. It consists of two FalconStor landing areas with about 13 TB each. Behind them are two COPAN MAID SIR disk trays with about 60 TB each. Each VTL should see about 500 MB/s of throughput from the SAN. Unfortunately, we are seeing about 400 MB/s on one and 300 MB/s on the other, and as of yet we do not know why we are not seeing the full performance. Currently, all but one of the six servers are running 2 Gb fiber channel; the faster VTL has the one server running 4 Gb fiber channel. Each server has a port on the HBA dedicated to the traffic going to the VTL. The obvious thing to try is to convert all those ports to 4 Gb fiber channel.
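Before swapping optics and HBAs, it is worth confirming what speed each HBA port actually negotiated; on Solaris 10 with the Sun SAN stack, fcinfo reports this per port (the output below is illustrative, not from our servers):

# fcinfo hba-port | grep -i 'current speed'
        Current Speed: 2Gb
        Current Speed: 4Gb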

NFS mount from NetApp hangs

May 22, 2008

We had a problem that would come and go for months on a Solaris 10 server. Any NFS mount from the old NetApp 940C’s on this server would hang, and the Networker backup of the root filesystem (/) would fail each time. The workaround was to unmount any NFS filesystems from the two NetApps; the backups would then succeed.

We discovered that an ipsecinit.conf file was present in /etc/inet, and it was only present on this server. The fix was to remove the file so the IPsec policy would not be loaded again at boot. We also issued ipsecconf -F, which flushes the active policy and more or less turns IPsec off as well. This works without a reboot.
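For anyone else who hits this, the whole fix amounts to the following; moving the file aside instead of deleting it outright is a slightly more cautious variant of what we did, and ipsecconf -l just verifies that no policy is still loaded:

# mv /etc/inet/ipsecinit.conf /etc/inet/ipsecinit.conf.disabled
# ipsecconf -F
# ipsecconf -l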

Introduction

May 22, 2008

I have been working as a UNIX administrator since 1992, starting with AIX and moving to Solaris and Linux. I have been using Legato (now EMC) Networker off and on since 1996, but I have only been directly responsible for Networker for the last four years.

We are currently running Networker 7.3.3 on a Sun T2000 with 16 threads and 16 GB of memory. We are trunking two of the copper gigabit Ethernet adapters. There are three QLogic 2462 HBA’s in this system: two ports are connected to SAN disk, two ports are connected to two 9509 2 Gb FC switches (connected to the tape library), and two ports are connected to two 9513 4 Gb FC switches (connected to the two VTL’s).

We have 139 Windows clients, 44 Linux clients, 118 Solaris clients and 13 AIX clients, along with 6 clusters, 6 NDMP clients, and 3 Exchange servers. We also have two COPAN VTL’s (more on this later), one storage node, and 25 dedicated storage nodes. Our tape library is a Quantum (ADIC) i2000 with 9 LTO-2’s, 11 LTO-3’s and 16 LTO-4’s. Quite a varied installation!