Solaris Disk Check

Section Index

Section 1: Incident Type
Section 2: Support Policy
Section 3: Associated Monitoring Notifications
Section 4: Troubleshooting Procedure
Section 5: Tracking and Escalation
Section 6: Related Support Documents

Incident Type

Solaris OS - Hardware/Disk - Recommended Guidelines for Status Check

Support Policy

  • Business Hours
  • Off Hours
  • SLA

Associated Monitoring Notifications

sdx: Error for Command: read(10) Error Level: Fatal Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: write(10) Error Level: Retryable Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: write Error Level: Retryable Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: [undecoded cmd 0x25] Error Level: Fatal Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: [undecoded cmd 0x3c] Error Level: Retryable Requested Block: x Error Block: x Vendor: x
sdx: SCSI transport failed: reason 'reset': retrying command
sdx: SCSI transport failed: reason 'timeout': retrying command
sdx: SCSI transport failed: reason 'tran_err': giving up
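
In /var/adm/messages these notifications appear as multi-line syslog entries. An illustrative example only; the hostname, device path, instance name, blocks, and vendor will vary:

    Oct 11 03:22:10 hostname scsi: WARNING: /pci@1f,4000/scsi@3/disk@1,0 (sd1):
    Oct 11 03:22:10 hostname        Error for Command: read(10)   Error Level: Fatal
    Oct 11 03:22:10 hostname        Requested Block: 23936        Error Block: 23936
    Oct 11 03:22:10 hostname        Vendor: SEAGATE               Serial Number: 0123456789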

Troubleshooting Procedure

    General Procedure to check the status of disks:

  1. Verify that format can read all of the disk device paths:

    echo | /usr/sbin/format | more
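
    On a healthy system, every disk appears under AVAILABLE DISK SELECTIONS with a readable label. Illustrative output only; device names, disk types, and geometry will vary:

        AVAILABLE DISK SELECTIONS:
               0. c0t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
                  /pci@1f,4000/scsi@3/disk@0,0
               1. c0t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
                  /pci@1f,4000/scsi@3/disk@1,0

    A disk shown as "<drive type unknown>", or a device path that hangs the listing, warrants further investigation.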

  2. For disks under Veritas control, run:

    /usr/sbin/vxdisk list

    All configured disks should show a status of "online".
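
    Illustrative output (the device, disk, and group names are examples only):

        DEVICE       TYPE      DISK         GROUP        STATUS
        c0t0d0s2     sliced    rootdisk     rootdg       online
        c1t0d0s2     sliced    datadg01     datadg       online
        c1t1d0s2     sliced    -            -            error

    Any status other than "online" for a configured disk needs attention before proceeding.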

  3. Check for any DISABLED or FAILED Veritas volumes:

    /usr/sbin/vxprint -ht | egrep 'DISA|FAIL'

    If a volume is marked DISABLED: check whether the volume is in a restore operation and confirm the normal configuration.
    If a volume is marked FAILED: confirm that the volume is mirrored (i.e. the volume has redundant plexes).
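
    To confirm redundancy, count the plexes attached to the volume. A sketch only; "datadg" and "vol01" are placeholder disk group and volume names:

        # plex records in vxprint -ht output begin with "pl"
        /usr/sbin/vxprint -g datadg -ht vol01 | grep '^pl' | wc -l

    A mirrored volume has two or more plexes; if only one plex remains on a FAILED volume, the data is at risk.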

  4. Verify that all file systems are online and accessible:

    df -kF ufs; df -kF vxfs

    I/O errors or a hung listing indicate a serious issue.
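
    If the combined listing hangs, checking each mount point individually isolates the culprit. A Bourne shell sketch; the awk parsing assumes the standard df -k column layout with the mount point in the last field:

        # walk every vxfs mount point and flag any that cannot be read
        for fs in `df -F vxfs -k | awk 'NR>1 {print $NF}'`
        do
            echo "checking $fs"
            ls $fs > /dev/null 2>&1 || echo "  *** $fs not readable"
        done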

  5. For network storage disks (EMC):

    Verify the Solaris version. If Solaris 10, go to Knowledge Base: EMC Procedures/Solaris 10. For versions 5.9 and under, run:

    /etc/powermt display

    Both LUN device paths should show a status of "optimal".
    I/O path totals should be equal.
    Errors should be 0. The error count is cumulative since boot or since the last time a "restore" was run.

    If there is an error count, run:

    /etc/powermt restore

    Recheck. The restore resets the error count to 0. If the error count resumes after the restore, see Knowledge Base: EMC Procedures for additional diagnostics and escalation procedures.
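
    A rough filter for spotting nonzero error counters across all devices (a sketch; it simply prints any line ending in a positive number, and column layout varies by PowerPath version, so verify hits against the full display):

        # path rows end with the error counter; print rows where it is nonzero
        /etc/powermt display dev=all | awk '$NF ~ /^[0-9]+$/ && $NF > 0'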

  6. Check /var/adm/messages for more detailed error information. The following factors should be taken into consideration to determine a course of action (a log-survey sketch follows this list):

    • Error frequency: grep the error and monitoring logs for counts and timestamps.

    • Does the error frequency indicate a failing disk, even if the FAILED/DEGRADED flag has not yet been set? Is there a recent marked increase in the rate?

    • Is the fault notification transient (a timeout) related to system/network load or scheduled work?

    • Is the error Fatal or Retryable? Was it a READ (more serious) or a WRITE error?

    • Was Veritas able to correct the bad block from the mirror?

    • Is it the same block in every instance, or multiple blocks?

    • Run a search for the host in the [event tracking database/Remedy].

    • Is this the first occurrence? Has a disk repair or analyze already been run?
      Has the issue already been escalated? If so, can fault monitoring be disabled until the disk is replaced?
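
    A quick log survey (a sketch; the grep patterns assume the standard sd driver message text shown in Section 3, and "sd1" is a placeholder instance name):

        # total disk error messages in the current log
        grep -c "Error for Command" /var/adm/messages

        # timestamps of the most recent occurrences for one suspect device
        grep "(sd1)" /var/adm/messages | tail -20

        # transport-level failures (resets, timeouts)
        grep -c "SCSI transport failed" /var/adm/messages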

Tracking and Escalation

  1. Update [event tracking] documentation with the output from the command line checks.

  2. If the event is found to be transient or under the error threshold for action, update it to a "Resolved" status.

  3. For a DEGRADED or FAILED disk that meets the criteria for replacement:

    1. For a network disk (EMC), open an escalation ticket to the EMC Support Group. Follow standard procedures for escalation to another support group (see Knowledge Base: Standard Procedures/Escalation).

    2. For a local Sun disk, see Knowledge Base: Solaris Architecture Table to determine whether it is hot-swappable, then send notification to the client server owner groups to schedule downtime or a low-activity period for the replacement (see Knowledge Base: Solaris Disk Replacement Procedure).
  4. Update [event tracking] documentation with timestamped, specific reference information: escalation ticket numbers, client contact information, and responses.

  5. Follow Knowledge Base: Standard Procedures/Open Issues.

  6. Send a hardware failure notification to the shift manager to be added to the shift report.


Related Support Documents

Knowledge Base: EMC Procedures
Knowledge Base: EMC Procedures/Solaris 10
Knowledge Base: Solaris Architecture Table
Knowledge Base: Solaris Disk Replacement Procedure
Knowledge Base: Veritas

Knowledge Base: Standard Procedures/Escalation
Knowledge Base: Standard Procedures/Open Issues