
[EE-17529] LISTRAK: Multiple SSD failures had caused the cluster to stop Created: 30/Dec/20 Updated: 31/Dec/20

Status: Pending L2/Customer


Project: Escalation Engineering
Affects Version/s: None

Type: Problem
Reporter: Thomas Tobiasz Assignee: Petr A Dushkin
Resolution: Unresolved Votes: 0
Labels: DTS_Supriya, ETS_Primary
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

L2 Owner: Thomas Tobiasz


EE Owner: Edwin Gardineriv
Customer name: LISTRAK
Severity: 2 - High
Installation Type: Production
DU/DL/DI: DU - Data Unavailability
Case Origin: Phone
SR Number (CSI): 100484369
View SR: http://nova.corp.emc.com/view/sr/100484369
XtremApp version: 4.0.27-1
XMS Version: 6.2.1-36
Number of clusters managed by XMS: Please select
System State: Stopped-orderly
Encryption: Enabled
Exec Summary: Current status: cluster stopped on Dec 24th due to triple SSD failure in X3-DAE. SSDs failed due to media errors one
after another.


Action Taken: Reviewed logs.

Plan of Action: Since both SCs on X3 could not detect the 3 SSDs, it is recommended to reseat the drives in an attempt to get more diagnostic data from the SSDs.
Number of bricks: 6
ESRS Type: No
PSNT: FNM00155000684
Search SRs by PSNT: http://nova.corp.emc.com/view/customer-product?id=FNM00155000684
Case Reason: HW problem
Type: Please select
Handover: No
Fixed in Version: Please select
Exec Summary Updated Date: 31/Dec/20 7:17 AM

Description
THIS IS A T&M ACCOUNT

Current cluster status is stopped with reason multiple_disk_failures

*** Log bundle will be UPLOADED IN 15 MINUTES ***

. Cluster stopped on Dec 24th.

. At that time two SSDs had failed, and 4 others reported that the SSD lifecycle state was changed from disconnected to healthy.

. The SSD failures happened at around the same time on the 24th and the 28th, as noted below.

The ask from L3 is to help determine whether in fact all 6 SSDs had failed, or whether the failures are the result of a bad/noisy LCC or SAS cables.

CLUSTER STATUS

Cluster general status: As of 2020-12-30 12:02:17 the cluster was stopped, connect status was connected, stop reason was multiple_disk_failures.
As of 2020-12-30 12:02:17 the gate was closed. Proactive_metadata_loading: True, is_any_c_mdl_lazy_load_in_progress: True, is_any_d_mdl_lazy_load_in_progress: True.
X1-DPG was proactive_metadata_loading ... X2-DPG was proactive_metadata_loading ... X3-DPG was proactive_metadata_loading ... X4-DPG was proactive_metadata_loading ... X5-DPG was proactive_metadata_loading ... X6-DPG was proactive_metadata_loading ...

Close gate and Open gate messages

crit 2020-12-27 16:33:29.092197 PAVAFXT01-x1-n2 xtremapp: M [log_id: 5874][26776(26848 nb_truck_0)]complete_system_activation: #TIME_ME system initialization: complete_system_activation entered!!! (#activation - sym connected to xenvs); elapsed_time=2370 milliseconds
crit 2020-12-27 16:33:29.480981 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][26776(26848 nb_truck_0)]ham_rule_action_close_gates: event ID 385 closing system gates! Reason: multiple_disk_failures
crit 2020-12-27 16:33:29.767257 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][26776(26848 nb_truck_0)]ham_rule_action_close_gates: event ID 648 closing system gates! Reason: ha_failure
crit 2020-12-28 18:32:40.644980 PAVAFXT01-x1-n2 xtremapp: M [log_id: 5874][121408(121485 nb_truck_0)]complete_system_activation: #TIME_ME system initialization: complete_system_activation entered!!! (#activation - sym connected to xenvs); elapsed_time=2106 milliseconds
crit 2020-12-28 18:32:41.046342 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][121408(121485 nb_truck_0)]ham_rule_action_close_gates: event ID 373 closing system gates! Reason: multiple_disk_failures
crit 2020-12-28 18:32:41.341187 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][121408(121485 nb_truck_0)]ham_rule_action_close_gates: event ID 619 closing system gates! Reason: ha_failure

----------------------------------------------------------------------------------------------------------------------------------

Shared memory: utilization is 0, the status is healthy. Total memory utilization is 0.

Timestamp            obj_name   severity  description
===================  =========  ========  =====================================================================================
2020-12-24 20:05:12  PAVAFXT01  critical  Cluster has stopped due to multiple SSD failures in DAE.
2020-12-28 18:36:26  PAVAFXT01  critical  The cluster service has stopped. Stopped type is: stopping, stopped reason is: multiple_disk_failures.

Storage Controllers information:

SC-Name  Index  Mgr-Addr       State    Health    Enabled  Conn-State  Jour-Stat  SW-Version  Sym    Stop-Reason
X1-SC1   1      10.205.255.22  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X1-SC2   2      10.205.255.23  healthy  degraded  enabled  connected   healthy    4.0.27-1    True   ha_failure
X2-SC1   3      10.205.255.24  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X2-SC2   4      10.205.255.25  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X3-SC1   5      10.205.255.26  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X3-SC2   6      10.205.255.27  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X4-SC1   7      10.205.255.28  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X4-SC2   8      10.205.255.29  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X5-SC1   9      10.205.255.30  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X5-SC2   10     10.205.255.31  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X6-SC1   11     10.205.255.32  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X6-SC2   12     10.205.255.33  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure

Timestamp            obj_name  severity  description
===================  ========  ========  =====================================================================================
2020-12-28 18:35:16  X2-SC1    major     Storage Controller has stopped.
2020-12-28 18:35:17  X2-SC2    major     Storage Controller has stopped.
2020-12-28 18:35:22  X4-SC1    major     Storage Controller has stopped.
2020-12-28 18:35:23  X4-SC2    major     Storage Controller has stopped.
2020-12-28 18:35:57  X3-SC1    major     Storage Controller has stopped.
2020-12-28 18:35:58  X3-SC2    major     Storage Controller has stopped.
2020-12-28 18:35:59  X5-SC1    major     Storage Controller has stopped.
2020-12-28 18:36:00  X5-SC2    major     Storage Controller has stopped.
2020-12-28 18:36:02  X6-SC1    major     Storage Controller has stopped.
2020-12-28 18:36:02  X6-SC2    major     Storage Controller has stopped.
2020-12-28 18:36:10  X1-SC1    major     Storage Controller has stopped.
2020-12-28 18:36:38  X1-SC2    major     Storage Controller has stopped.

XEnvs information:

Name       Index  CSID  XEnv_state  CPU_usage
X1-SC1-E1  1      10    inactive    0
X1-SC1-E2  2      11    inactive    0
X1-SC2-E1  3      12    inactive    0
X1-SC2-E2  4      13    inactive    0
X2-SC1-E1  5      16    inactive    0
X2-SC1-E2  6      17    inactive    0
X2-SC2-E1  7      18    inactive    0
X2-SC2-E2  8      19    inactive    0
X3-SC1-E1  9      22    inactive    0
X3-SC1-E2  10     23    inactive    0
X3-SC2-E1  11     24    inactive    0
X3-SC2-E2  12     25    inactive    0
X4-SC1-E1  13     28    inactive    0
X4-SC1-E2  14     29    inactive    0
X4-SC2-E1  15     30    inactive    0
X4-SC2-E2  16     31    inactive    0
X5-SC1-E1  17     34    inactive    0
X5-SC1-E2  18     35    inactive    0
X5-SC2-E1  19     36    inactive    0
X5-SC2-E2  20     37    inactive    0
X6-SC1-E1  21     40    inactive    0
X6-SC1-E2  22     41    inactive    0
X6-SC2-E1  23     42    inactive    0
X6-SC2-E2  24     43    inactive    0

Timestamp            obj_name   severity  description
===================  =========  ========  =====================================================================================
2020-12-24 20:04:47  X1-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X1-SC1-E2  major     XENV is not active.
2020-12-24 20:04:48  X1-SC2-E1  major     XENV is not active.
2020-12-24 20:04:48  X1-SC2-E2  major     XENV is not active.
2020-12-24 20:04:48  X2-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X2-SC1-E2  major     XENV is not active.
2020-12-24 20:04:48  X2-SC2-E1  major     XENV is not active.
2020-12-24 20:04:48  X2-SC2-E2  major     XENV is not active.
2020-12-24 20:04:48  X4-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X4-SC1-E2  major     XENV is not active.
2020-12-24 20:04:48  X4-SC2-E1  major     XENV is not active.
2020-12-24 20:04:48  X4-SC2-E2  major     XENV is not active.
2020-12-24 20:04:48  X5-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X5-SC1-E2  major     XENV is not active.
2020-12-24 20:04:49  X5-SC2-E1  major     XENV is not active.
2020-12-24 20:04:49  X5-SC2-E2  major     XENV is not active.
2020-12-24 20:04:49  X6-SC1-E1  major     XENV is not active.
2020-12-24 20:04:49  X6-SC1-E2  major     XENV is not active.
2020-12-24 20:04:49  X6-SC2-E1  major     XENV is not active.
2020-12-24 20:04:49  X6-SC2-E2  major     XENV is not active.
2020-12-24 20:05:09  X3-SC1-E1  major     XENV is not active.
2020-12-24 20:05:09  X3-SC1-E2  major     XENV is not active.

SSDs information:

Name                    Index  Slot#  SSD-Size    DPG-Name  XDP-State     State         End-Rem%  Encry-Status
wwn-0x5000cca04f740f04  48     0      781.422768  X2-DPG    in_rg         disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f743dec  50     0      781.422768  X2-DPG    in_rg         disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f4c6b50  66     15     781.422768  X3-DPG    in_rg         disconnected  97        enc_supported_locked_cluster_pin
wwn-0x5000cca04f4c7348  67     16     781.422768            failed_in_rg  disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f4c6b90  68     17     781.422768            failed_in_rg  disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f74460c  100    0      781.422768  X4-DPG    in_rg         disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca02b256794  151    15     781.422768            not_in_rg     healthy       0         not_supported

Timestamp            obj_name                severity  description
===================  ======================  ========  =====================================================================================
2020-12-24 19:59:07  wwn-0x5000cca04f4c7348  major     SSD has failed.
2020-12-24 19:59:13  wwn-0x5000cca04f4c7348  major     SSD is disconnected.
2020-12-24 20:03:23  wwn-0x5000cca04f4c6b90  major     SSD has failed.
2020-12-24 20:03:24  wwn-0x5000cca04f4c6b90  major     SSD is disconnected.
2020-12-28 18:36:27  wwn-0x5000cca04f740f04  major     SSD is disconnected.
2020-12-28 18:36:28  wwn-0x5000cca04f743dec  major     SSD is disconnected.
2020-12-28 18:36:32  wwn-0x5000cca04f74460c  major     SSD is disconnected.
2020-12-28 18:50:38  wwn-0x5000cca04f4c6b50  major     SSD is disconnected.

JBOD LCC SAS error counters information:

LCC_Name      Phy_Index  Invalid-Dwords  Disparity-Errors  Loss-Dword-Sync  Phy-Resets
X3-DAE-LCC-B  0          0               0                 0                0
X3-DAE-LCC-B  1          0               0                 0                0
X3-DAE-LCC-B  2          0               0                 0                0
X3-DAE-LCC-B  3          0               0                 0                0
X3-DAE-LCC-B  4          0               0                 0                0
X3-DAE-LCC-B  5          0               0                 0                0
X3-DAE-LCC-B  6          0               0                 0                0
X3-DAE-LCC-B  7          0               0                 0                0
X3-DAE-LCC-B  11         0               0                 0                0
X3-DAE-LCC-B  12         0               0                 0                0
X3-DAE-LCC-B  13         0               0                 0                0
X3-DAE-LCC-B  14         0               0                 0                0
X3-DAE-LCC-B  15         0               0                 0                0
X3-DAE-LCC-B  16         0               0                 0                0
X3-DAE-LCC-B  17         0               0                 0                0
X3-DAE-LCC-B  18         0               0                 0                0
X3-DAE-LCC-B  19         0               0                 0                0
X3-DAE-LCC-B  20         0               0                 0                0
X3-DAE-LCC-B  21         0               0                 0                0
X3-DAE-LCC-B  22         0               0                 0                0
X3-DAE-LCC-B  23         0               0                 0                0
X3-DAE-LCC-B  24         0               0                 0                0
X3-DAE-LCC-B  25         0               0                 0                0
X3-DAE-LCC-B  26         0               0                 0                0
X3-DAE-LCC-B  27         1483            1439              4                0
X3-DAE-LCC-B  28         1066            1034              2                0
X3-DAE-LCC-B  29         0               0                 0                0
X3-DAE-LCC-B  30         0               0                 0                0
X3-DAE-LCC-B  31         0               0                 0                0
X3-DAE-LCC-B  32         0               0                 0                0
X3-DAE-LCC-B  33         0               0                 0                0
X3-DAE-LCC-B  34         0               0                 0                0
X3-DAE-LCC-B  35         0               0                 0                0
X3-DAE-LCC-A  0          0               0                 0                0
X3-DAE-LCC-A  1          0               0                 0                0
X3-DAE-LCC-A  2          0               0                 0                0
X3-DAE-LCC-A  3          0               0                 0                0
X3-DAE-LCC-A  4          0               0                 0                0
X3-DAE-LCC-A  5          0               ...
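
The only nonzero counters in this dump are on X3-DAE-LCC-B phys 27 and 28. As an illustration only (not part of the original ticket), a small sketch like the following could scan a saved plain-text copy of this counter table and print just the phys with any nonzero error counter; the file path and the exact column layout are assumptions.

# Hypothetical helper: print only the phys with nonzero SAS error counters,
# assuming the table was saved as plain text in the column order shown above.
import sys

def noisy_phys(path: str):
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 6 or not parts[1].isdigit():
                continue  # skip the header and any truncated/malformed rows
            lcc, phy = parts[0], int(parts[1])
            counters = [int(x) for x in parts[2:]]
            if any(counters):
                yield lcc, phy, counters

if __name__ == "__main__":
    for lcc, phy, counters in noisy_phys(sys.argv[1]):
        print(lcc, phy, counters)

Run against the table above, this would surface X3-DAE-LCC-B phy 27 and phy 28 only.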

Detail log information:

X3-SC1:xtremapp: D22 PANIC D4680 csid 22 2020-12-24 20:04:42.907748
X3-SC1:xtremapp: D23 PANIC D4680 csid 23 2020-12-24 20:04:42.909298
X3-SC2:xtremapp: D24 PANIC D4680 csid 24 2020-12-24 20:04:42.915026
X3-SC2:xtremapp: D25 PANIC D4680 csid 25 2020-12-24 20:04:42.923559
X3-SC2:xtremapp*: X25 PANIC S11 csid 25 info 2020-12-27 16:33:35.397335
X6-SC1:xtremapp*: X41 PANIC S11 csid 41 info 2020-12-27 16:33:35.397472
X1-SC2:xtremapp*: X13 PANIC S11 csid 13 info 2020-12-28 18:32:47.145229

Detail log information:

info 2020-12-27 16:22:01.912467 PAVAFXT01-x3-n2 kernel:[44940173.179988] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-27 16:22:01.913395 PAVAFXT01-x3-n2 kernel:[44940173.181201] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-27 16:22:01.955357 PAVAFXT01-x3-n1 kernel:[44941003.302116] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-27 16:22:01.957428 PAVAFXT01-x3-n1 kernel:[44941003.304544] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.265404 PAVAFXT01-x3-n2 kernel:[45030964.804579] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.268387 PAVAFXT01-x3-n2 kernel:[45030964.807411] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.275512 PAVAFXT01-x3-n2 kernel:[45030964.814813] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.279383 PAVAFXT01-x3-n2 kernel:[45030964.819299] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.411325 PAVAFXT01-x3-n1 kernel:[45031795.881564] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.414357 PAVAFXT01-x3-n1 kernel:[45031795.884898] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.415317 PAVAFXT01-x3-n1 kernel:[45031795.886212] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.417304 PAVAFXT01-x3-n1 kernel:[45031795.887547] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.657310 PAVAFXT01-x3-n1 kernel:[45032139.264530] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.658348 PAVAFXT01-x3-n1 kernel:[45032139.265668] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.757481 PAVAFXT01-x3-n2 kernel:[45031308.430551] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.760409 PAVAFXT01-x3-n2 kernel:[45031308.433207] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.756431 PAVAFXT01-x3-n2 kernel:[45031382.457810] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.758360 PAVAFXT01-x3-n2 kernel:[45031382.460037] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.907345 PAVAFXT01-x3-n1 kernel:[45032213.543862] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.908291 PAVAFXT01-x3-n1 kernel:[45032213.545040] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]

148         SSD is disconnected.  major  Mon Dec 28 18:50:38 2020  SSD wwn-0x5000cca04f4c6b50  66  PAVAFXT01  1  ssd_fru_disconnected  outstanding
0900704150  SSD is disconnected   major  Mon Dec 28 18:36:32 2020
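
For reference when reading the entries above: sense key 0x04 is HARDWARE ERROR and ASC 0x44 is "internal target failure" per the SCSI spec (the ASCQ 0xa2 qualifier is vendor specific), while the per-disk kernel lines later in this ticket report MEDIUM ERROR with ASC 0x11, i.e. the unrecovered read error family. The small, hypothetical helper below (not from the ticket) translates only the triplets that appear in these logs.

# Minimal sketch: decode the SCSI sense triplets seen in this ticket's kernel logs.
# The tables are intentionally partial and only cover the codes present here.

SENSE_KEYS = {
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
}

ASC_CODES = {
    0x11: "UNRECOVERED READ ERROR (family)",
    0x44: "INTERNAL TARGET FAILURE",
}

def decode(sense_key: int, asc: int, ascq: int) -> str:
    key = SENSE_KEYS.get(sense_key, f"sense key 0x{sense_key:02x}")
    asc_txt = ASC_CODES.get(asc, f"ASC 0x{asc:02x}")
    return f"{key} / {asc_txt} (ASC=0x{asc:02x}, ASCQ=0x{ascq:02x})"

if __name__ == "__main__":
    # Triplets lifted from the mpt2sas and sd lines in this ticket.
    for triplet in [(0x04, 0x44, 0xa2), (0x03, 0x11, 0x3b)]:
        print(decode(*triplet))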

Comments
Comment by Petr A Dushkin [ 30/Dec/20 ]
Hi Thomas Tobiasz,

I will review the logs to see the status of the SSDs in question. What's your overall plan for this cluster? Are you trying to validate whether it's okay to do a fresh install based on the SSDs' status?
Comment by Thomas Tobiasz [ 30/Dec/20 ]
Log bundle is now available on the FTP folder //incoming/EE-17529


The plan at this point is to determine whether any of the 6 SSDs can be recovered. What is odd is the time frame in which they failed.

Comment by Petr A Dushkin [ 30/Dec/20 ]


Those 3 affected SSDs are in the same X3-DAE, and I think that's why the cluster has stopped:

wwn-0x5000cca04f4c6b50  66  PAVAFXT01  1  X3  3  15  HITACHI HUSMM118 CLAR800  C250  no_error  5051058  745.223G  X3-DPG  3  in_rg         disconnected  97  ok  enc_supported_locke
wwn-0x5000cca04f4c7348  67  PAVAFXT01  1  X3  3  16  HITACHI HUSMM118 CLAR800        no_error  5051058  745.223G            failed_in_rg  disconnected  0   ok  enc_supported_locke
wwn-0x5000cca04f4c6b90  68  PAVAFXT01  1  X3  3  17  HITACHI HUSMM118 CLAR800        no_error  5051058  745.223G            failed_in_rg  disconnected  0

Comment by Petr A Dushkin [ 30/Dec/20 ]


It appears both X3-SC1 and X3-SC2 are reporting a gap in slots 15-17:

cat X3-SC*/system/disks/disk_in_slot.txt
# SLOT WWN SERIAL SD_enc0 SG_enc0 SD_enc1 SG_enc1
slot 0 ===> 0x5000cca04f743ad4 0RX1XLAA /dev/sdac /dev/sg28 /dev/sdad /dev/sg30
slot 1 ===> 0x5000cca04f7441d4 0RX1Y1TA /dev/sdab /dev/sg27 /dev/sdae /dev/sg31
slot 2 ===> 0x5000cca04f74398c 0RX1XHPA /dev/sdaa /dev/sg26 /dev/sdaf /dev/sg32
slot 3 ===> 0x5000cca04f7438c8 0RX1XG3A /dev/sdz /dev/sg25 /dev/sdag /dev/sg33
slot 4 ===> 0x5000cca04f4cae20 0RWB59RA /dev/sdy /dev/sg24 /dev/sdah /dev/sg34
slot 5 ===> 0x5000cca04f4c9cb4 0RWB44SA /dev/sdx /dev/sg23 /dev/sdai /dev/sg35
slot 6 ===> 0x5000cca04f4c7690 0RWB1M0A /dev/sdw /dev/sg22 /dev/sdaj /dev/sg36
slot 7 ===> 0x5000cca04f4cb018 0RWB5ETA /dev/sdv /dev/sg21 /dev/sdak /dev/sg37
slot 8 ===> 0x5000cca04f509dd8 0RWEAE5A /dev/sdu /dev/sg20 /dev/sdal /dev/sg38
slot 9 ===> 0x5000cca04f50b598 0RWED06A /dev/sdt /dev/sg19 /dev/sdam /dev/sg39
slot 10 ===> 0x5000cca04f5093d4 0RWE9SHA /dev/sds /dev/sg18 /dev/sdan /dev/sg40
slot 11 ===> 0x5000cca04f4c6970 0RWB0RXA /dev/sdr /dev/sg17 /dev/sdao /dev/sg41
slot 12 ===> 0x5000cca04f4c6944 0RWB0RKA /dev/sdq /dev/sg16 /dev/sdap /dev/sg42
slot 13 ===> 0x5000cca04f4ca948 0RWB4ZRA /dev/sdp /dev/sg15 /dev/sdaq /dev/sg43
slot 14 ===> 0x5000cca04f4c9d60 0RWB464A /dev/sdo /dev/sg14 /dev/sdar /dev/sg44
slot 18 ===> 0x5000cca04f4ca8f8 0RWB4Z2A /dev/sdk /dev/sg10 /dev/sdav /dev/sg48
slot 19 ===> 0x5000cca04f4ca888 0RWB4Y5A /dev/sdj /dev/sg9 /dev/sdaw /dev/sg49
slot 20 ===> 0x5000cca04f4ca880 0RWB4Y3A /dev/sdi /dev/sg8 /dev/sdax /dev/sg50
slot 21 ===> 0x5000cca04f4c7314 0RWB1BUA /dev/sdh /dev/sg7 /dev/sday /dev/sg51
slot 22 ===> 0x5000cca04f4c9090 0RWB3APA /dev/sdg /dev/sg6 /dev/sdaz /dev/sg52
slot 23 ===> 0x5000cca04f4ca90c 0RWB4Z7A /dev/sdf /dev/sg5 /dev/sdba /dev/sg53
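
The same gap can also be spotted programmatically rather than by eye. Below is a quick, hypothetical sketch (the file format is taken from the listing above; the expected slot count is passed in as a parameter, since it depends on the DAE model) that prints the slot numbers missing from a disk_in_slot.txt dump.

# Hypothetical helper, not from the ticket: report DAE slots that have no entry in
# an XtremIO disk_in_slot.txt dump (format as in the listing above).
import re
import sys

def missing_slots(path: str, expected_slots: int) -> list:
    seen = set()
    with open(path) as fh:
        for line in fh:
            m = re.match(r"slot\s+(\d+)\s+===>", line)
            if m:
                seen.add(int(m.group(1)))
    return sorted(set(range(expected_slots)) - seen)

if __name__ == "__main__":
    # e.g. python find_gaps.py X3-SC1/system/disks/disk_in_slot.txt 24
    path, expected = sys.argv[1], int(sys.argv[2])
    print(missing_slots(path, expected))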

Comment by Thomas Tobiasz [ 30/Dec/20 ]


Those 3 SSDs show as disconnected in the alert log on the 28th; however, on the 24th, when the array went down, the following entries were in the alert log for those 3 SSDs:

2020-12-24 21:34:28,498 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f740f04:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:29,440 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f743dec:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:29,508 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f4c6b50:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:30,177 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f74460c:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:33,912 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: X1-SC1: Storage Controller
stop type was changed from stopped to none. Stop reason is: none
2020-12-24 21:34:34,266 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: X1-SC1: The Storage
Controller state was changed from stopped to stopping.
2020-12-24 21:34:34,299 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: X1-SC1: Storage Controller
journal state was changed from healthy to dumping.

Comment by Petr A Dushkin [ 30/Dec/20 ]


Thomas Tobiasz,

Is it possible to capture the messages from the X3 brick for 12-24 as well?


Comment by Thomas Tobiasz [ 31/Dec/20 ]
Message logs for both X3-SC1 and X3-SC2 for the 24th of December are now available in the FTP folder.
Comment by Petr A Dushkin [ 31/Dec/20 ]
Thomas Tobiasz,

Yes, you are correct: 3 SSDs failed one after another, and the cluster stopped once the maximum number of failed drives was reached. We can see from the xms.log that X3-DPG was undergoing a rebuild after the dual SSD failure (wwn-0x5000cca04f4c7348 and wwn-0x5000cca04f4c6b90) when the 3rd SSD failed (wwn-0x5000cca04f4c6b50):

2020-12-24 19:59:02,797 - [20632:13f6::PSY899920] - PAVAFXT01: Raised alert: "Diagnostics detected a minor problem in the SSD." object: wwn-0x5000cca04f4c7348 severity: minor threshold:
2020-12-24 19:59:04,784 - [20632:13f6::PSY899920] - PAVAFXT01: Raised alert: "DPG rebuild has started." object: X3-DPG severity: information threshold:
2020-12-24 19:59:05,587 - [20632:13f6::PSY899920] - PAVAFXT01: Raised alert: "An SSD has failed and the DPG resiliency is degraded." object: X3-DPG severity: major threshold:
2020-12-24 19:59:08,871 - [20748:13f6::PSYadd508] - PAVAFXT01: Raised alert: "SSD has failed." object: wwn-0x5000cca04f4c7348 severity: major threshold:
2020-12-24 19:59:13,346 - [20839:13f6::PSY14eec6] - PAVAFXT01: Removed alert: "Diagnostics detected a minor problem in the SSD." object: wwn-0x5000cca04f4c7348
2020-12-24 19:59:14,445 - [20839:13f6::PSY14eec6] - PAVAFXT01: Raised alert: "SSD is disconnected." object: wwn-0x5000cca04f4c7348 severity: major threshold:
2020-12-24 20:03:20,507 - [24707:13f6::PSYc293a2] - PAVAFXT01: Cleared alert: "An SSD has failed and the DPG resiliency is degraded." object: X3-DPG
2020-12-24 20:03:21,785 - [24707:13f6::PSYc293a2] - PAVAFXT01: Raised alert: "DPG has two simultaneous SSD failures and is in degraded protection mode." object: X3-DPG severity: critical threshold:
2020-12-24 20:03:24,797 - [24805:13f6::PSYf6656e] - PAVAFXT01: Raised alert: "SSD has failed." object: wwn-0x5000cca04f4c6b90 severity: major threshold:
2020-12-24 20:03:25,736 - [24805:13f6::PSYf6656e] - PAVAFXT01: Raised alert: "SSD is disconnected." object: wwn-0x5000cca04f4c6b90 severity: major threshold:
2020-12-24 20:04:51,621 - [26066:13f6::PSYd49141] - PAVAFXT01: Removed alert: "DPG rebuild has started." object: X3-DPG
2020-12-24 20:04:51,648 - [26066:13f6::PSYd49141] - X3-DPG: alert c36bc420d8034000952dcfdb807559b9 not found, skipping

I will review the messages from 12-24 and will add my findings to the ticket.
Comment by Petr A Dushkin [ 31/Dec/20 ]
2020-12-24 19:58:57 - Physical errors on SSD wwn-0x5000cca04f4c7348:

<info>2020-12-24 19:58:57.511791 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 2195][12046(12399 pm_ssd)]pm_disk_diag_test: disk /dev/sdm is_fault: false -> true, smart_problem
<info>2020-12-24 19:58:57.511851 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 2197][12046(12399 pm_ssd)]pm_disk_diag_test:915: Device = /dev/sg12, SSD SENSE/ASC/ASCQ = SCSI_SENSE_KEY_NO_SENSE/ (0x00/0x0b/0xfb), IO SENSE/ASC/ASCQ = SCSI_SENSE_KEY_NO_SENSE/NO ADDITIONAL SENSE INFORMATION (0x00/0x00/0x00)
<info>2020-12-24 19:58:57.511859 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1696][12046(12166 nb_truck_0)]ssd_update_in_mom: updating SSD wwn-0x5000cca04f4c7348 in mom, diagnostic_health_state has changed MGMT_SENSOR_SEVERITY_CLEAR -> MGMT_SENSOR_SEVERITY_WARNING
<info>2020-12-24 19:58:57.511883 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 21937][12046(12166 nb_truck_0)]send_mgmt_events_from_buffer: NOTIF_FLOW: module MODULE_TYPE_PLATFORM(csid=20) sending event type mom_object_update (event_idx=12121999) on obj_type MGMT_OBJTYPE_SSD(guid=dd8790ba091f48f2af92dea559cf1f93)
<info>2020-12-24 19:58:57.541064 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1704][12046(12399 pm_ssd)]pm_ssd_poller:1030: clst update is needed: ssd failed
<info>2020-12-24 19:58:57.698460 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 21937][12046(12166 nb_truck_0)]send_mgmt_events_from_buffer: NOTIF_FLOW: module MODULE_TYPE_PLATFORM(csid=20) sending event type mom_object_update (event_idx=12122000) on obj_type MGMT_OBJTYPE_IB_SWITCH(guid=e9cbcda298304f9aaa6f2315ef0b6d3f)
...
<info>2020-12-24 19:58:59.109169 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 3][12046(12166 nb_truck_0)]handle_mbe_p_check_disk: MBE_P message received
<info>2020-12-24 19:58:59.109194 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 22659][12046(12166 nb_truck_0)]pm_ssd_check_disk_flow:7852: check disk called in EXTENDED MODE on ssd (wwn=wwn-0x5000cca04f4c7348)
...
<info>2020-12-24 19:59:01.111246 PAVAFXT01-x3-n1 kernel:[44694724.496151] sd 2:0:8:0: [sdm] Unhandled sense code

2020-12-24 19:59:04 DPG rebuild started:


<crit>2020-12-24 19:59:04.224720 PAVAFXT01-x3-n1 xtremapp: D22 [log_id: 7934][106072(106147 nb_truck_2)]pl_par_mgmt_disk_status_changed_listener_callback: #RAID rebuild start for owner 0 disk 16 all_status 'HHHHHHHHHHHHHHHHRHHHHHHHHXX'
<info>2020-12-24 19:59:04.224726 PAVAFXT01-x3-n1 xtremapp: D22 [log_id: 7811][106072(106147 nb_truck_2)]reserve_entire_disk: owner 0 disk 16 force 0 forced_non_secure_reservation 0 degraded_mode 0 num_disks 24 max_num_seen_disks 25 num_spare_disks 0 count_disk_for_free 0 reserve_from_common 0 reserve_from_a2h 0 result 1; before: a2h_reserve 1048576 (type 0 free 368787579 stripes 609375) (type 1 free 249358930 stripes 208286) (type 2 free 44791728 stripes 18866); after: a2h_reserve 1048576 (type 0 free 368787579 stripes 609375) (type 1 free 249358930 stripes 208286) (type 2 free 44791728 stripes 18866)

2020-12-24 19:59:10 - both paths reported ssd 0x5000cca04f4c7348 as disconnected:

<info>2020-12-24 19:59:10.288630 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 19242][12046(12399 pm_ssd)]pm_ssd_update_lcc_phys_health_state: lcc_phys: lcc JWXEL153801482 phy 27 is inactive! Setting PHY health to MGMT_HEALTH_LEVEL_5_MAJOR
<crit>2020-12-24 19:59:10.503763 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1783][12046(12399 pm_ssd)]pm_ssd_cycle: #SSD wwn 0x5000cca04f4c7348 (/dev/sdm ; /dev/sdat) removed from logical slot #17 (physical #16)
<crit>2020-12-24 19:59:10.510559 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 22265][12046(12399 pm_ssd)]update_ssd: SSD slot=16 wwn=0x5000cca04f4c7348 event=2
<info>2020-12-24 19:59:10.523440 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1690][12046(12166 nb_truck_0)]ssd_update_in_mom:765: updating SSD wwn-0x5000cca04f4c7348 [0RWB1D7A ] in mom PM_SLOT_STATE_SIGNED->PM_SLOT_STATE_EMPTY
<info>2020-12-24 19:59:10.523459 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1692][12046(12166 nb_truck_0)]ssd_update_in_mom:810: SSD 0x5000cca04f4c7348 changed FRU state healthy -> disconnected

2020-12-24 20:03:11 - Physical errors on SSD 0x5000cca04f4c6b90 and it was lost as well:

<crit>2020-12-24 20:03:11.649829 PAVAFXT01-x3-n1 xtremapp: X22 [log_id: 9823][106072(106150 nb_truck_5)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b90-JWXEL153801482 owner 0 op 1 res -5 res2 0 log_params_idx 0 hvol 0xc78d722e40
<warn>2020-12-24 20:03:11.650211 PAVAFXT01-x3-n1 kernel:[44694975.133659] __ratelimit: 1 callbacks suppressed
<info>2020-12-24 20:03:11.650222 PAVAFXT01-x3-n1 kernel:[44694975.133662] sd 2:0:7:0: [sdl] Unhandled sense code
<info>2020-12-24 20:03:11.650224 PAVAFXT01-x3-n1 kernel:[44694975.133664] sd 2:0:7:0: [sdl] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:03:11.650226 PAVAFXT01-x3-n1 kernel:[44694975.133667] sd 2:0:7:0: [sdl] Sense Key : Medium Error [current]
<warn>2020-12-24 20:03:11.650231 PAVAFXT01-x3-n1 kernel:[44694975.133671] Info fld=0x8b5db20
<info>2020-12-24 20:03:11.650232 PAVAFXT01-x3-n1 kernel:[44694975.133672] sd 2:0:7:0: [sdl] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:03:11.650234 PAVAFXT01-x3-n1 kernel:[44694975.133677] sd 2:0:7:0: [sdl] CDB: Read(10): 28 00 08 b5 db 20 00 00 10 00
<warn>2020-12-24 20:03:11.650235 PAVAFXT01-x3-n1 kernel:[44694975.133683] __ratelimit: 1 callbacks suppressed
<crit>2020-12-24 20:03:11.662193 PAVAFXT01-x3-n1 xtremapp: X23 [log_id: 9823][106074(106111 nb_truck_5)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b90-JWXEL153801047 owner 0 op 1 res -5 res2 0 log_params_idx 1 hvol 0xc78d722e40
<info>2020-12-24 20:03:11.662248 PAVAFXT01-x3-n1 kernel:[44694975.146039] sd 2:0:43:0: [sdau] Unhandled sense code
<info>2020-12-24 20:03:11.662252 PAVAFXT01-x3-n1 kernel:[44694975.146043] sd 2:0:43:0: [sdau] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:03:11.662254 PAVAFXT01-x3-n1 kernel:[44694975.146046] sd 2:0:43:0: [sdau] Sense Key : Medium Error [current]
<warn>2020-12-24 20:03:11.662255 PAVAFXT01-x3-n1 kernel:[44694975.146051] Info fld=0x1ffbe3c0
<info>2020-12-24 20:03:11.662256 PAVAFXT01-x3-n1 kernel:[44694975.146052] sd 2:0:43:0: [sdau] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:03:11.662257 PAVAFXT01-x3-n1 kernel:[44694975.146057] sd 2:0:43:0: [sdau] CDB: Read(10): 28 00 1f fb e3 c0 00 00 10 00
...

<info>2020-12-24 20:03:16.188165 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 11814][12046(12166 nb_truck_0)]pm_ssd_check_disk_ext:7817: Marking

2020-12-24 20:03:18 - Restarted X3-DPG rebuild due to the 2nd SSD failure:

<info>2020-12-24 20:03:18.762330 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7943][106074(106106 nb_truck_0)]: Changing owner 0 partition 16 status from 4 to 15 process status was 2 code was 0 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'
<err>2020-12-24 20:03:18.762333 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7935][106074(106106 nb_truck_0)]pl_par_mgmt_disk_status_changed_listener_callback: #RAID rebuild failed/aborted for owner 0 disk 16 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'
...
<info>2020-12-24 20:03:18.762330 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7943][106074(106106 nb_truck_0)]: Changing owner 0 partition 16 status from 4 to 15 process status was 2 code was 0 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'
<err>2020-12-24 20:03:18.762333 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7935][106074(106106 nb_truck_0)]pl_par_mgmt_disk_status_changed_listener_callback: #RAID rebuild failed/aborted for owner 0 disk 16 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'

2020-12-24 20:04:41 - Physical errors on SSD 0x5000cca04f4c6b50:

<crit>2020-12-24 20:04:41.575171 PAVAFXT01-x3-n1 xtremapp: X23 [log_id: 9823][106074(106110 nb_truck_4)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b50-JWXEL153801047 owner 0 op 1 res -5 res2 0 log_params_idx 0 hvol 0xc78d71d9c0
<warn>2020-12-24 20:04:41.575254 PAVAFXT01-x3-n1 kernel:[44695065.094531] __ratelimit: 24 callbacks suppressed
<info>2020-12-24 20:04:41.575270 PAVAFXT01-x3-n1 kernel:[44695065.094536] sd 2:0:41:0: [sdas] Unhandled sense code
<info>2020-12-24 20:04:41.575272 PAVAFXT01-x3-n1 kernel:[44695065.094538] sd 2:0:41:0: [sdas] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:04:41.575276 PAVAFXT01-x3-n1 kernel:[44695065.094541] sd 2:0:41:0: [sdas] Sense Key : Medium Error [current]
<warn>2020-12-24 20:04:41.575277 PAVAFXT01-x3-n1 kernel:[44695065.094546] Info fld=0x1a72c3d6
<info>2020-12-24 20:04:41.575279 PAVAFXT01-x3-n1 kernel:[44695065.094547] sd 2:0:41:0: [sdas] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:04:41.575280 PAVAFXT01-x3-n1 kernel:[44695065.094552] sd 2:0:41:0: [sdas] CDB: Read(10): 28 00 1a 72 c3 d0 00 00 10 00
<warn>2020-12-24 20:04:41.575282 PAVAFXT01-x3-n1 kernel:[44695065.094559] __ratelimit: 24 callbacks suppressed
<crit>2020-12-24 20:04:41.725413 PAVAFXT01-x3-n1 xtremapp: X22 [log_id: 9823][106072(106147 nb_truck_2)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b50-JWXEL153801482 owner 0 op 1 res -5 res2 0 log_params_idx 0 hvol 0xc78d71d9c0
<info>2020-12-24 20:04:41.726188 PAVAFXT01-x3-n1 kernel:[44695065.244826] sd 2:0:9:0: [sdn] Unhandled sense code
<info>2020-12-24 20:04:41.726197 PAVAFXT01-x3-n1 kernel:[44695065.244829] sd 2:0:9:0: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:04:41.726199 PAVAFXT01-x3-n1 kernel:[44695065.244832] sd 2:0:9:0: [sdn] Sense Key : Medium Error [current]
<warn>2020-12-24 20:04:41.726215 PAVAFXT01-x3-n1 kernel:[44695065.244836] Info fld=0x112800e8
<info>2020-12-24 20:04:41.726217 PAVAFXT01-x3-n1 kernel:[44695065.244838] sd 2:0:9:0: [sdn] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:04:41.726219 PAVAFXT01-x3-n1 kernel:[44695065.244842] sd 2:0:9:0: [sdn] CDB: Read(10): 28 00 11 28 00 e8 00 00 08 00
<info>2020-12-24 20:04:41.852264 PAVAFXT01-x3-n1 kernel:[44695065.371742] sd 2:0:41:0: [sdas] Unhandled sense code
<info>2020-12-24 20:04:41.852280 PAVAFXT01-x3-n1 kernel:[44695065.371746] sd 2:0:41:0: [sdas] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:04:41.852282 PAVAFXT01-x3-n1 kernel:[44695065.371750] sd 2:0:41:0: [sdas] Sense Key : Medium Error [current]

2020-12-24 20:04:42 - XENVs PANIC on X3 due to 3rd SSD failure:


<crit>2020-12-24 20:04:42.907748 PAVAFXT01-x3-n1 xtremapp: D22 [log_id: -1][106072(106145 nb_truck_0)]: PANIC <D4680> csid 22 at pl_cas.c:6775 cas_report_failure (timestamp 314641822509234537): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 15,16,17 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'
..
<crit>2020-12-24 20:04:42.909298 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: -1][106074(106111 nb_truck_5)]: PANIC <D4680> csid 23 at pl_cas.c:6775 cas_report_failure (timestamp 314641822512475718): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 16,17,15 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'
...
<crit>2020-12-24 20:04:42.915026 PAVAFXT01-x3-n2 xtremapp: D24 [log_id: -1][108928(109007 nb_truck_6)]: PANIC <D4680> csid 24 at pl_cas.c:6775 cas_report_failure (timestamp 314640019858762433): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 15,16,17 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'
...
<crit>2020-12-24 20:04:42.923559 PAVAFXT01-x3-n2 xtremapp: D25 [log_id: -1][108929(108962 nb_truck_0)]: PANIC <D4680> csid 25 at pl_cas.c:6775 cas_report_failure (timestamp 314640019876639139): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 15,16,17 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'

Cluster closed the gates due to triple ssd failure as expected:

X1-SC2/system/logs/crit-messages:<crit>2020-12-24 20:04:44.983998 PAVAFXT01-x1-n2 xtremapp: M [log_id: 4380][125030(125047 nb_truck_0)]bl_rg_close_gates_if_triple_degraded: Triple failure: 2 SSDs failed_in_rg, 2 SSDs pending rebuild, SSD wwn-0x5000cca04f4c6b50 (slot 15) ejected, did not change raid index state, closing gates
X1-SC2/system/logs/crit-messages:<crit>2020-12-24 20:44:02.547851 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][65531(65548 nb_truck_0)]ham_rule_action_close_gates: event ID 383, closing system gates! Reason: multiple_disk_failures
X1-SC2/system/logs/crit-messages:<crit>2020-12-24 20:44:02.882455 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][65531(65548 nb_truck_0)]ham_rule_action_close_gates: event ID 536, closing system gates! Reason: ha_failure

Comment by Petr A Dushkin [ 31/Dec/20 ]


Thomas Tobiasz,

It appears those 3 SSDs are disconnected due to HW issues.

Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x3.
....

2020-12-24 20:01:43
Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x3.
....

2020-12-24 20:02:54
Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x3.
....

2020-12-24 20:04:12
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.

In order to attempt to read the SMART data from the SSDs, you can arrange for an on-site CE to reseat the SSDs and check whether the system detects them.
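
As an illustration only (assuming smartctl is available on the Storage Controller and the reseated drive re-enumerates as a /dev/sg* node - both assumptions, not confirmed in this ticket), a post-reseat check could look roughly like this:

# Rough sketch under the assumptions stated above: wait for the reseated SSD's
# /dev/sg* node to reappear, then dump its SMART/health pages for review.
import os
import subprocess
import sys
import time

def wait_for_device(dev: str, timeout_s: int = 120) -> bool:
    """Poll until the device node exists or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(dev):
            return True
        time.sleep(5)
    return False

def dump_smart(dev: str) -> None:
    """smartctl -a prints identity, health and error-log pages for SCSI/SAS devices."""
    result = subprocess.run(["smartctl", "-a", dev], capture_output=True, text=True)
    print(result.stdout or result.stderr)

if __name__ == "__main__":
    # /dev/sg12 is used as an example only because it appears in the failures above.
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sg12"
    if wait_for_device(dev):
        dump_smart(dev)
    else:
        print(f"{dev} did not come back after the reseat")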
Comment by Petr A Dushkin [ 31/Dec/20 ]
Previous attempts to start the cluster:

2020-12-24 20:43:31,879 - [5869:aaad::XRS5f21d0] - User: admin, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=1}
2020-12-24 20:53:57,113 - [5869:05ca::RSTebefc6] - User: admin, Command: start_cluster, Failed: sys_start_error
...
2020-12-24 21:34:05,808 - [5869:2f21::XRSa8caa3] - User: admin, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=2}
2020-12-24 21:44:15,062 - [5869:05ca::RSTd431ba] - User: admin, Command: start_cluster, Failed: sys_start_error
...
2020-12-27 16:33:17,677 - [5869:f9a0::XRS329967] - User: admin, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=3}
2020-12-27 16:43:20,819 - [5869:05ca::RST4dddad] - User: admin, Command: start_cluster, Failed: sys_start_error
...
2020-12-28 18:32:24,268 - [5869:ae0c::XRS78fbb6] - User: tech, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=3}
2020-12-28 18:42:37,674 - [5869:05ca::RST7eff95] - User: tech, Command: start_cluster, Failed: sys_start_error

Comment by Thomas Tobiasz [ 31/Dec/20 ]


Update from the customer regarding the actions they had performed on the array:

. The first 2 drives, in slots 16 & 17, failed initially, and then slot 15. The drives in these slots were reseated in initial efforts to regain access after they had already gone offline.

I will have a CE go onsite and perform the reseat(s) of the SSDs. Reseat one and monitor to see if it is detected, then move on to the next depending on the result.


Comment by Petr A Dushkin [ 31/Dec/20 ]


Thomas Tobiasz,

Let me know if you need any additional feedback regarding the SSDs and cluster status.
Comment by Petr A Dushkin [ 31/Dec/20 ]
It seems that the SSD in slot 15 was reseated here:

Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x7.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x8.
....27:03:23:42:920 dev/phy.26: Drive Slot 15 no device detected.
27:03:23:42:999 dev/phy.26: attached SAS address 5000cca0_4f4c6b52->00000000_00000000

2020-12-27 16:19:53
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x3.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x7.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x8.
...r27:03:24:11:324 dev/phy.26: Drive slot 15 device detected.
27:03:24:11:330 dev/phy.26: DISABLED ERROR->ENABLED
27:03:24:11:938 dev/phy.26: ready
27:03:24:11:942 dev/phy.26: link ready
27:03:24:11:946 dev/phy.26: rate unknown->6G
27:03:24:11:951 dev/phy.26: attached SAS address 00000000_00000000->5000cca0_2b256796

followed by slot 16:

2020-12-27 16:21:23
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x3.
...I27:03:26:23:255 dev/phy.27: not ready
27:03:26:23:258 dev/phy.27: link not ready
27:03:26:23:265 dev/phy.27: rate 6G->unknown
27:03:26:23:269 dev/phy.27: attached phy id 0x01->0xff
27:03:26:23:953 dev/phy.27: Drive Slot 16 no device detected.
27:03:26:24:120 dev/phy.27: attached SAS address 5000cca0_4f4c734a->00000000_00000000


2020-12-27 16:21:43
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x3.
...>27:03:26:47:608 dev/phy.27: Drive slot 16 device detected.
27:03:26:48:989 dev/phy.27: ready
27:03:26:48:992 dev/phy.27: link ready
27:03:26:48:997 dev/phy.27: rate unknown->6G
...

Generated at Thu Dec 31 03:25:21 IST 2020 by Mingli Bi using JIRA 7.8.0#78000-sha1:4568b9d484113d74dfb6f152fb925b5fa1be2ef7.
