Archive for the ‘SAN’ Category

Update

July 13, 2010

It’s been quite some time since I’ve updated this site. It’s been a busy year, both personally and professionally. Since my last update, we have added two Quantum Scalar i2000 libraries. We now have three libraries: one with 48 drives, one with 32 drives, and one with 16 drives, all LTO-4. We are thinking of upgrading the 16-drive library to an i6000 with LTO-5 drives and moving its existing 16 LTO-4 drives into the library that already has 32. We have upgraded to Networker 7.5.2, still running on Solaris 10 on a Sun T2000, with four Networker Storage Nodes to help carry all the network backup traffic. We now back up about 50 TB per night. Things got so busy on the first Storage Node that it was running at 80-90% of the capacity of its dual-trunked Gigabit Ethernet card. Once this was discovered, we quickly added a third port to the trunk and utilization calmed down to 20-30%.
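(In case it helps anyone doing the same thing: if the trunk is a Solaris 10 dladm link aggregation, adding a third port comes down to something like the following, where aggregation key 1 and interface e1000g2 are just placeholder names, not necessarily ours.)

# dladm add-aggr -d e1000g2 1
# dladm show-aggr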

The COPAN VTL was a disaster. We discovered a bug in the software that caused the VTL controller to hang if a backup was running at the same time it tried to start de-duplication. That took COPAN almost a year to fix. Then we discovered that the de-duplicated area was almost impossible to read from: we were trying to restore a 300 GB file and the restore was running at 2-5 MB/s. As you can figure out, it timed out before it could finish. That took them several more months to fix. Their solution was to provide us with some "always-on" disk for the landing area, plus some cache in front of the de-duped data disks. The reconstitution of the data would run until the cache filled, and the cache would then write back to the landing area at about 20 MB/s. This was still not acceptable. Their next proposal was to charge us to replace the de-duped data disks with their "always-on" disk; I estimated the cost of replacing those disks at over a hundred thousand dollars. Needless to say, COPAN never came back with a quote and we decommissioned the COPAN. It is now serving to hold down some tile in the data center.

Sun T2000 failed to boot – fc-fabric Method or service exit time out.

June 20, 2008

In February of this year, after installing some new LTO-4 tape drives in our Quantum (ADIC) i2000, we began having problems with our primary Networker 7.3.3 server, a Sun T2000. It refused to boot while the switch ports to the i2000 robot were enabled. This was the error message we saw:

NIS domain name is chi.navtech.com
svc.startd[7]: svc:/system/device/fc-fabric:default: Method or service exit timed out. Killing contract 31.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method "/lib/svc/method/fc-fabric" failed due to signal KILL.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method or service exit timed out. Killing contract 34.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method "/lib/svc/method/fc-fabric" failed due to signal KILL.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method or service exit timed out. Killing contract 36.
svc.startd[7]: svc:/system/device/fc-fabric:default: Method "/lib/svc/method/fc-fabric" failed due to signal KILL.
svc.startd[7]: system/device/fc-fabric:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
Console login service(s) cannot run
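(When a boot drops to maintenance mode like this, the standard SMF check from the console login is the one the message itself points to; it reports why the service failed and where its log file lives:)

# svcs -xv svc:/system/device/fc-fabric:default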

After a while, we received the following work-around from Sun:

After you add the latest patches and reboot the host, and assuming your problem hasn’t been resolved, do the following:

Get the host booted into multi-user by disabling the involved switch ports or removing the fiber connections.
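(How you disable the ports depends on your switch. On a Brocade fabric, for example, it would be portdisable now and portenable later, where the port numbers below are placeholders for wherever your library is attached:)

switch:admin> portdisable 4
switch:admin> portdisable 5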

# cd /etc/cfg/fp

I noticed from the Explorer run on January 15 that you currently have in
/etc/cfg/fp:

fabric_WWN_map fabric_WWN_map.old fabric_WWN_map.old2

You need to rename them all to something that doesn’t begin with the word “fabric”, or remove them.
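For example (the new names are arbitrary, as long as they don’t start with “fabric”):

# mv fabric_WWN_map WWN_map.bak
# mv fabric_WWN_map.old WWN_map.old.bak
# mv fabric_WWN_map.old2 WWN_map.old2.bak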

Remove device paths:

# rm /dev/rmt/*

# rm /dev/dsk/c4*; rm /dev/rdsk/c4*; rm /dev/cfg/c4*

# rm /dev/dsk/c5*; rm /dev/rdsk/c5*; rm /dev/cfg/c5*

# rm /dev/dsk/c6*; rm /dev/rdsk/c6*; rm /dev/cfg/c6*

# rm /dev/dsk/c7*; rm /dev/rdsk/c7*; rm /dev/cfg/c7*

# mv /etc/path_to_inst /etc/old_path

Make sure there are no files in /etc that start with the word “path”.
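(A quick check; this should come back with “No such file or directory”:)

# ls /etc/path*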

Re-enable switch ports (or re-attach fiber cables)

# luxadm -e forcelip /devices/pci@7c0/pci@0/pci@8/QLGC,qlc@0/fp@0,0:devctl
# luxadm -e forcelip /devices/pci@7c0/pci@0/pci@9/QLGC,qlc@0/fp@0,0:devctl

# luxadm insert

Rebuild device paths:

# devfsadm -p /etc/path_to_inst

# cfgadm -c configure <c#> (for each controller)
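(With the four controllers removed above, that works out to something like:)

# for c in c4 c5 c6 c7; do cfgadm -c configure $c; done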

# cfgadm -al

If it doesn’t look right at this point, reboot -- -r.

This procedure was used multiple times over the last four months. In fact, it was necessary each time we needed to reboot the Networker server.

Further prodding and a furious email exchange over several weeks resulted in this:

The failure is against LUN=ff (255). Please check with tape vendor
to see if it’s possible to unmap LUN 255 from the tape library.
The error which is preventing boot is due to SFK’s inability to configure
LUN 255 (which is of type raid-ctrl and not recognized by Solaris).

So I proceeded to contact Quantum’s tech support to get their opinion on this matter. The Quantum tech I spoke with stated that in our environment our NetApps require the Control LUN on the Quantum to be at LUN 255, and that the ANSI standard requires it to be at LUN 255 as well.

I reported this to the Sun Engineer and he replied:

escalation engineer reply:

Task Summary: unable to Boot Bug re-occurring need faster solution than previous case 65808366
Note:         Update:

- Since LUN 255 (of type 'array_ctrl') is causing cfgadm to fail because
  there is no target driver for array_ctrl, besides unmapping this LUN
  from the Storage, another workaround is to bind this LUN to our sgen
  (Generic SCSI) driver.

- Disclaimer: I haven't tried this out as I don't have a 3rd party
  storage that maps a LUN of type 'array_ctrl' to Solaris host.

Action(s):

- Customer to try binding "array_ctrl" luns to sgen driver.

  1. Add the following lines to /kernel/drv/sgen.conf to allow array_ctrl
     LUNs to be bound to sgen (Generic SCSI) driver.

     device-type-config-list="array_ctrl";

  2. Add the following lines to /etc/driver_aliases

     sgen "scsiclass,0c"

  3. Reboot the machine to see if the problem is resolved.
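(Not part of Sun’s instructions, but after the reboot the binding can be sanity-checked by confirming the alias took and that cfgadm now lists the LUNs, including the array_ctrl one, without erroring:)

# grep sgen /etc/driver_aliases
# cfgadm -al -o show_SCSI_LUN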

After following these instructions and rebooting the Sun T2000 several times, I discovered that the problem was fixed!

VTL and the KISS principle

May 29, 2008

We are in the process of adding a COPAN VTL to our backup environment. It consists of two FalconStor landing areas of about 13 TB each. Behind them sit two COPAN MAID SIR disk trays of about 60 TB each. Each VTL should deliver about 500 MB/s of throughput from the SAN. Unfortunately, we are seeing about 400 MB/s on one and 300 MB/s on the other, and we do not yet know why. Currently, all but one of the six servers are running 2 Gb Fibre Channel; the faster VTL has the one server running 4 Gb Fibre Channel. Each server has a port on its HBA dedicated to the traffic going to the VTL. The obvious thing to try is to convert all of those ports to 4 Gb Fibre Channel.
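As a first check, Solaris 10 ships the fcinfo utility, which reports the speed each HBA port actually negotiated; something like the following should confirm which hosts are still linked at 2 Gb:

# fcinfo hba-port | grep -i speed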