Solaris Disk Check

Section Index

Section 1: Incident Type
Section 2: Support Policy
Section 3: Associated Monitoring Notifications
Section 4: Troubleshooting Procedure
Section 5: Tracking and Escalation
Section 6: Related Support Documents

Incident Type

Solaris OS - Hardware/Disk - Recommended Guidelines for Status Check

Support Policy

  • Business Hours
  • Off Hours
  • SLA

Associated Monitoring Notifications

sdx: Error for Command: read(10) Error Level: Fatal Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: write(10) Error Level: Retryable Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: write Error Level: Retryable Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: [undecoded cmd 0x25] Error Level: Fatal Requested Block: x Error Block: x Vendor: x
sdx: Error for Command: [undecoded cmd 0x3c] Error Level: Retryable Requested Block: x Error Block: x Vendor: x
sdx: SCSI transport failed: reason 'reset': retrying command
sdx: SCSI transport failed: reason 'timeout': retrying command
sdx: SCSI transport failed: reason 'tran_err': giving up
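
In /var/adm/messages these notifications appear as multi-line syslog entries. An illustrative example only; the hostname, device path, instance name, blocks, and vendor will vary:

    Oct 11 03:22:10 hostname scsi: WARNING: /pci@1f,4000/scsi@3/disk@1,0 (sd1):
    Oct 11 03:22:10 hostname        Error for Command: read(10)   Error Level: Fatal
    Oct 11 03:22:10 hostname        Requested Block: 23936        Error Block: 23936
    Oct 11 03:22:10 hostname        Vendor: SEAGATE               Serial Number: 0123456789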

Troubleshooting Procedure

    General Procedure to check the status of disks:

  1. Verify that format can read all of the disk device paths:

    echo | /usr/sbin/format | more
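
    On a healthy system, every disk appears under AVAILABLE DISK SELECTIONS with a readable label. Illustrative output only; device names, disk types, and geometry will vary:

        AVAILABLE DISK SELECTIONS:
               0. c0t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
                  /pci@1f,4000/scsi@3/disk@0,0
               1. c0t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
                  /pci@1f,4000/scsi@3/disk@1,0

    A disk shown as "<drive type unknown>", or a device path that hangs the listing, warrants further investigation.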

  2. For disks under Veritas control, run:

    /usr/sbin/vxdisk list

    All configured disks should show a status of "online".
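
    Illustrative output (the device, disk, and group names are examples only):

        DEVICE       TYPE      DISK         GROUP        STATUS
        c0t0d0s2     sliced    rootdisk     rootdg       online
        c1t0d0s2     sliced    datadg01     datadg       online
        c1t1d0s2     sliced    -            -            error

    Any status other than "online" for a configured disk needs attention before proceeding.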

  3. Check for any DISABLED or FAILED Veritas volumes:

    /usr/sbin/vxprint -ht | egrep 'DISA|FAIL'

    If a volume is marked DISABLED: check whether the volume is in a restore operation and confirm the normal configuration.
    If a volume is marked FAILED: confirm that the volume is mirrored (i.e. the volume has redundant plexes).
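
    To confirm redundancy, count the plexes attached to the volume. A sketch only; "datadg" and "vol01" are placeholder disk group and volume names:

        # plex records in vxprint -ht output begin with "pl"
        /usr/sbin/vxprint -g datadg -ht vol01 | grep '^pl' | wc -l

    A mirrored volume has two or more plexes; if only one plex remains on a FAILED volume, the data is at risk.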

  4. Verify that all file systems are online and accessible:

    df -kF ufs; df -kF vxfs

    I/O errors or a hung listing indicate a serious issue.
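
    If the combined listing hangs, checking each mount point individually isolates the culprit. A Bourne shell sketch; the awk parsing assumes the standard df -k column layout with the mount point in the last field:

        # walk every vxfs mount point and flag any that cannot be read
        for fs in `df -F vxfs -k | awk 'NR>1 {print $NF}'`
        do
            echo "checking $fs"
            ls $fs > /dev/null 2>&1 || echo "  *** $fs not readable"
        done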

  5. For network storage disks (EMC):

    Verify the Solaris version. If Solaris 10, go to Knowledge Base: EMC Procedures/Solaris 10. For versions 5.9 and under, run:

    /etc/powermt display

    Both LUN device paths should show a status of "optimal".
    I/O path totals should be equal.
    Errors should be 0. The error count is cumulative since boot or since the last time a "restore" was run.

    If there is an error count, run:

    /etc/powermt restore

    Recheck. The restore resets the error count to 0. If the error count resumes after the restore, see Knowledge Base: EMC Procedures for additional diagnostics and escalation procedures.
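
    A rough filter for spotting nonzero error counters across all devices (a sketch; it simply prints any line ending in a positive number, and column layout varies by PowerPath version, so verify hits against the full display):

        # path rows end with the error counter; print rows where it is nonzero
        /etc/powermt display dev=all | awk '$NF ~ /^[0-9]+$/ && $NF > 0'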

  6. Check /var/adm/messages for more detailed error information. The following factors should be taken into consideration to determine a course of action (a log-survey sketch follows this list):

    • Error frequency: grep the error and monitoring logs for counts and timestamps.

    • Does the error frequency indicate a failing disk, even if the FAILED/DEGRADED flag has not yet been set? Is there a recent marked increase in the rate?

    • Is the fault notification transient (a timeout) related to system/network load or scheduled work?

    • Is the error Fatal or Retryable? Was it a READ (more serious) or a WRITE error?

    • Was Veritas able to correct the bad block from the mirror?

    • Is it the same block in every instance, or multiple blocks?

    • Run a search for the host in the [event tracking database/Remedy].

    • Is this the first occurrence? Has a disk repair or analyze already been run?
      Has the issue already been escalated? If so, can fault monitoring be disabled until the disk is replaced?
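
    A quick log survey (a sketch; the grep patterns assume the standard sd driver message text shown in Section 3, and "sd1" is a placeholder instance name):

        # total disk error messages in the current log
        grep -c "Error for Command" /var/adm/messages

        # timestamps of the most recent occurrences for one suspect device
        grep "(sd1)" /var/adm/messages | tail -20

        # transport-level failures (resets, timeouts)
        grep -c "SCSI transport failed" /var/adm/messages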

Tracking and Escalation

  1. Update [event tracking] documentation with the output from the command line checks.

  2. If the event is found to be transient or under the error threshold for action, update it to a "Resolved" status.

  3. For a DEGRADED or FAILED disk that meets the criteria for replacement:

    1. For a network disk (EMC), open an escalation ticket to the EMC Support Group. Follow standard procedures for escalation to another support group (see Knowledge Base: Standard Procedures/Escalation).

    2. For a local Sun disk, see Knowledge Base: Solaris Architecture Table to determine whether it is hot-swappable, then send notification to the client server owner groups to schedule downtime or a low-activity period for the replacement (see Knowledge Base: Solaris Disk Replacement Procedure).
  4. Update [event tracking] documentation with timestamped, specific reference information: escalation ticket numbers, client contact information, and responses.

  5. Follow Knowledge Base: Standard Procedures/Open Issues.

  6. Send a hardware failure notification to the shift manager to be added to the shift report.


Related Support Documents

Knowledge Base: EMC Procedures
Knowledge Base: EMC Procedures/Solaris 10
Knowledge Base: Solaris Architecture Table
Knowledge Base: Solaris Disk Replacement Procedure
Knowledge Base: Veritas

Knowledge Base: Standard Procedures/Escalation
Knowledge Base: Standard Procedures/Open Issues