Archive for October, 2009
Re-scan the SCSI bus after detecting a missing media changer for the tape drive
Today I ran into a nice problem: the media changer on a Dell 124T had disappeared, and because the backup server had been neglected for a while, Amanda had been sending out backup-failed reports for some time. A bit of investigation showed that the media changer had “gone”. More poking showed the tape drive itself was still present, hmm.
As this media changer and server live in another state (1200km away) and it was Friday night, the chances of getting the tape drive power cycled == 0. Luckily a rescan of the correct SCSI bus resulted in the media changer being found and usable again.
For reference, check out your SCSI devices by running:
cat /proc/scsi/scsi
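If you want a quick yes/no answer rather than reading the whole listing, the check can be sketched roughly as below. This assumes the changer shows up with a `Type:` line of `Medium Changer`, which is how the kernel normally labels that device class; the function name `has_media_changer` and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a SCSI medium changer is listed.
# $1 is an optional path to a /proc/scsi/scsi style listing
# (defaults to the real file), which makes dry runs possible.
has_media_changer() {
    listing="${1:-/proc/scsi/scsi}"
    # Assumption: changers are reported as "Type:   Medium Changer"
    grep -q 'Type:[[:space:]]*Medium Changer' "$listing"
}
```

Something like `has_media_changer || echo "changer missing"` could then go into a monitoring script.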
In my case the host bus I wanted to re-scan was 1. Running this command tells the controller to rescan all channels, IDs and LUNs on that host:
echo "- - -" > /sys/class/scsi_host/host1/scan
After this, I was able to interact with the tape changer device via the Amanda tools again.
Note: Always be careful when running commands that hot add/remove SCSI devices. I have seen some servers crash from this, and others lose access to all SCSI devices until they were rebooted (mostly it is okay, though).
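For repeat use, the rescan can be wrapped in a small function that refuses to write anywhere if the host number does not exist. This is only a sketch: the function name `rescan_scsi_host` and the `SYSFS_ROOT` override (handy for trying it against a scratch directory first) are my own inventions, not part of any tool.

```shell
#!/bin/sh
# Hypothetical wrapper around the sysfs rescan shown above.
# $1 is the SCSI host number, e.g. 1 for host1.
# SYSFS_ROOT is an illustrative override; it defaults to /sys.
rescan_scsi_host() {
    host="$1"
    root="${SYSFS_ROOT:-/sys}"
    scan_file="$root/class/scsi_host/host$host/scan"

    # Bail out instead of creating a stray file for an unknown host
    if [ ! -e "$scan_file" ]; then
        echo "no such SCSI host: host$host" >&2
        return 1
    fi

    # "- - -" is a wildcard for channel, target ID and LUN
    echo "- - -" > "$scan_file"
}
```

Run as root, `rescan_scsi_host 1` does the same thing as the one-liner above; the same caution about hot add/remove applies.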
LSI MegaRAID physical device error counters and what SMART errors they imply
At work we have a number of LSI MegaRAID controllers and Dell PERC cards. Today I ran across a system with an “Other Error Count: XX” reported against both drives in a RAID 1 array. It took some googling, but it turns out these are not critical drive media errors, but some other SMART errors.
For reference, here is a list of the LSI MegaRAID error counters and the SMART failure counts they correspond to (thanks to the Dell Linux mailing list).
- Predictive Failure Count == Number of SMART errors.
- Media Error Count == Number of SMART errors related to the drive media.
- Other Error Count == Number of SMART errors not related to the drive media.
See Wikipedia for the SMART error codes.
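To pull just these three counters out of the controller's physical drive listing, a filter along these lines should do. It is a sketch that assumes the listing prints one `Name: value` pair per line, with `Slot Number` identifying each drive; the function name `pd_error_counters` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: read a physical drive listing on stdin and
# keep only the slot number and the three error counters.
# Assumption: each field is on its own line as "Name: value".
pd_error_counters() {
    grep -E '^(Slot Number|Media Error Count|Other Error Count|Predictive Failure Count)'
}
```

With LSI's MegaCli utility installed this would be used as `MegaCli -PDList -aALL | pd_error_counters`. The binary name varies between installs (for example `MegaCli64`), so adjust to whatever your system has.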