
[EE-17529] LISTRAK: Multiple SSD failures had caused the cluster to stop Created: 30/Dec/20 Updated: 31/Dec/20

Status: Pending L2/Customer


Project: Escalation Engineering
Affects Version/s: None

Type: Problem
Reporter: Thomas Tobiasz Assignee: Petr A Dushkin
Resolution: Unresolved Votes: 0
Labels: DTS_Supriya, ETS_Primary
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

L2 Owner: Thomas Tobiasz


EE Owner: Edwin Gardineriv
Customer name: LISTRAK
Severity: 2 - High
Installation Type: Production
DU/DL/DI: DU - Data Unavailability
Case Origin: Phone
SR Number (CSI): 100484369
View SR: http://nova.corp.emc.com/view/sr/100484369
XtremApp version: 4.0.27-1
XMS Version: 6.2.1-36
Number of clusters managed by XMS: Please select
System State: Stopped-orderly
Encryption: Enabled
Exec Summary: Current status: cluster stopped on Dec 24th due to triple SSD failure in X3-DAE. SSDs failed due to media errors one
after another.


Action Taken: Reviewed logs.

Plan of Action: Since both SCs on X3 could not detect the 3 SSDs, it is recommended to reseat the drives in an attempt to get more diagnostic data from the SSDs.
Number of bricks: 6
ESRS Type: No
PSNT: FNM00155000684
Search SRs by PSNT: http://nova.corp.emc.com/view/customer-product?id=FNM00155000684
Case Reason: HW problem
Type: Please select
Handover: No
Fixed in Version: Please select
Exec Summary Updated Date: 31/Dec/20 7:17 AM

Description
THIS IS A T&M ACCOUNT

Current cluster status is stopped with reason multiple_disk_failures

*** Log bundle will be UPLOADED IN 15 MINUTES ***

. Cluster stopped on Dec 24th.

. At that time two SSDs had failed, and 4 others reported that the SSD lifecycle state was changed from disconnected to healthy.

. The SSD failures happened at around the same time on the 24th and the 28th, as noted below.

The ask from L3 is to help determine whether in fact all 6 SSDs had failed, or whether the failures are the result of a bad/noisy LCC or SAS cables.

CLUSTER STATUS

Cluster general status: As of 2020-12-30 12:02:17 the cluster was stopped, connect status was connected, stop reason was multiple_disk_failures.
As of 2020-12-30 12:02:17 the gate was closed. Proactive_metadata_loading: True, is_any_c_mdl_lazy_load_in_progress: True, is_any_d_mdl_lazy_load_in_progress: True.
X1-DPG was proactive_metadata_loading ... X2-DPG was proactive_metadata_loading ... X3-DPG was proactive_metadata_loading ... X4-DPG was proactive_metadata_loading ... X5-DPG was proactive_metadata_loading ... X6-DPG was proactive_metadata_loading ...

Close gate and Open gate messages

crit 2020-12-27 16:33:29.092197 PAVAFXT01-x1-n2 xtremapp: M [log_id: 5874][26776(26848 nb_truck_0)]complete_system_activation: #TIME_ME system initialization: complete_system_activation entered!!! (#activation - sym connected to xenvs); elapsed_time=2370 milliseconds
crit 2020-12-27 16:33:29.480981 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][26776(26848 nb_truck_0)]ham_rule_action_close_gates: event ID 385 closing system gates! Reason: multiple_disk_failures
crit 2020-12-27 16:33:29.767257 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][26776(26848 nb_truck_0)]ham_rule_action_close_gates: event ID 648 closing system gates! Reason: ha_failure
crit 2020-12-28 18:32:40.644980 PAVAFXT01-x1-n2 xtremapp: M [log_id: 5874][121408(121485 nb_truck_0)]complete_system_activation: #TIME_ME system initialization: complete_system_activation entered!!! (#activation - sym connected to xenvs); elapsed_time=2106 milliseconds
crit 2020-12-28 18:32:41.046342 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][121408(121485 nb_truck_0)]ham_rule_action_close_gates: event ID 373 closing system gates! Reason: multiple_disk_failures
crit 2020-12-28 18:32:41.341187 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][121408(121485 nb_truck_0)]ham_rule_action_close_gates: event ID 619 closing system gates! Reason: ha_failure

----------------------------------------------------------------------------------------------------------------------------------

Shared memory: utilization is 0, the status is healthy. Total memory utilization is 0.

Timestamp            obj_name   severity  description
===================  =========  ========  =====================================================================================
2020-12-24 20:05:12  PAVAFXT01  critical  Cluster has stopped due to multiple SSD failures in DAE.
2020-12-28 18:36:26  PAVAFXT01  critical  The cluster service has stopped. Stopped type is: stopping, stopped reason is: multiple_disk_failures.

Storage Controllers information:

SC-Name  Index  Mgr-Addr       State    Health    Enabled  Conn-State  Jour-Stat  SW-Version  Sym    Stop-Reason
X1-SC1   1      10.205.255.22  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X1-SC2   2      10.205.255.23  healthy  degraded  enabled  connected   healthy    4.0.27-1    True   ha_failure
X2-SC1   3      10.205.255.24  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X2-SC2   4      10.205.255.25  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X3-SC1   5      10.205.255.26  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X3-SC2   6      10.205.255.27  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X4-SC1   7      10.205.255.28  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X4-SC2   8      10.205.255.29  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X5-SC1   9      10.205.255.30  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X5-SC2   10     10.205.255.31  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X6-SC1   11     10.205.255.32  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure
X6-SC2   12     10.205.255.33  healthy  degraded  enabled  connected   healthy    4.0.27-1    False  ha_failure

Timestamp            obj_name  severity  description
===================  ========  ========  =====================================================================================
2020-12-28 18:35:16  X2-SC1    major     Storage Controller has stopped.
2020-12-28 18:35:17  X2-SC2    major     Storage Controller has stopped.
2020-12-28 18:35:22  X4-SC1    major     Storage Controller has stopped.
2020-12-28 18:35:23  X4-SC2    major     Storage Controller has stopped.
2020-12-28 18:35:57  X3-SC1    major     Storage Controller has stopped.
2020-12-28 18:35:58  X3-SC2    major     Storage Controller has stopped.
2020-12-28 18:35:59  X5-SC1    major     Storage Controller has stopped.
2020-12-28 18:36:00  X5-SC2    major     Storage Controller has stopped.
2020-12-28 18:36:02  X6-SC1    major     Storage Controller has stopped.
2020-12-28 18:36:02  X6-SC2    major     Storage Controller has stopped.
2020-12-28 18:36:10  X1-SC1    major     Storage Controller has stopped.
2020-12-28 18:36:38  X1-SC2    major     Storage Controller has stopped.

XEnvs information:

Name       Index  CSID  XEnv_state  CPU_usage
X1-SC1-E1  1      10    inactive    0
X1-SC1-E2  2      11    inactive    0
X1-SC2-E1  3      12    inactive    0
X1-SC2-E2  4      13    inactive    0
X2-SC1-E1  5      16    inactive    0
X2-SC1-E2  6      17    inactive    0
X2-SC2-E1  7      18    inactive    0
X2-SC2-E2  8      19    inactive    0
X3-SC1-E1  9      22    inactive    0
X3-SC1-E2  10     23    inactive    0
X3-SC2-E1  11     24    inactive    0
X3-SC2-E2  12     25    inactive    0
X4-SC1-E1  13     28    inactive    0
X4-SC1-E2  14     29    inactive    0
X4-SC2-E1  15     30    inactive    0
X4-SC2-E2  16     31    inactive    0
X5-SC1-E1  17     34    inactive    0
X5-SC1-E2  18     35    inactive    0
X5-SC2-E1  19     36    inactive    0
X5-SC2-E2  20     37    inactive    0
X6-SC1-E1  21     40    inactive    0
X6-SC1-E2  22     41    inactive    0
X6-SC2-E1  23     42    inactive    0
X6-SC2-E2  24     43    inactive    0

Timestamp            obj_name   severity  description
===================  =========  ========  =====================================================================================
2020-12-24 20:04:47  X1-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X1-SC1-E2  major     XENV is not active.
2020-12-24 20:04:48  X1-SC2-E1  major     XENV is not active.
2020-12-24 20:04:48  X1-SC2-E2  major     XENV is not active.
2020-12-24 20:04:48  X2-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X2-SC1-E2  major     XENV is not active.
2020-12-24 20:04:48  X2-SC2-E1  major     XENV is not active.
2020-12-24 20:04:48  X2-SC2-E2  major     XENV is not active.
2020-12-24 20:04:48  X4-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X4-SC1-E2  major     XENV is not active.
2020-12-24 20:04:48  X4-SC2-E1  major     XENV is not active.
2020-12-24 20:04:48  X4-SC2-E2  major     XENV is not active.
2020-12-24 20:04:48  X5-SC1-E1  major     XENV is not active.
2020-12-24 20:04:48  X5-SC1-E2  major     XENV is not active.
2020-12-24 20:04:49  X5-SC2-E1  major     XENV is not active.
2020-12-24 20:04:49  X5-SC2-E2  major     XENV is not active.
2020-12-24 20:04:49  X6-SC1-E1  major     XENV is not active.
2020-12-24 20:04:49  X6-SC1-E2  major     XENV is not active.
2020-12-24 20:04:49  X6-SC2-E1  major     XENV is not active.
2020-12-24 20:04:49  X6-SC2-E2  major     XENV is not active.
2020-12-24 20:05:09  X3-SC1-E1  major     XENV is not active.
2020-12-24 20:05:09  X3-SC1-E2  major     XENV is not active.

SSDs information:

Name                    Index  Slot#  SSD-Size    DPG-Name  XDP-State     State         End-Rem%  Encry-Status
wwn-0x5000cca04f740f04  48     0      781.422768  X2-DPG    in_rg         disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f743dec  50     0      781.422768  X2-DPG    in_rg         disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f4c6b50  66     15     781.422768  X3-DPG    in_rg         disconnected  97        enc_supported_locked_cluster_pin
wwn-0x5000cca04f4c7348  67     16     781.422768            failed_in_rg  disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f4c6b90  68     17     781.422768            failed_in_rg  disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca04f74460c  100    0      781.422768  X4-DPG    in_rg         disconnected  0         enc_supported_locked_cluster_pin
wwn-0x5000cca02b256794  151    15     781.422768            not_in_rg     healthy       0         not_supported

Timestamp            obj_name                severity  description
===================  ======================  ========  =====================================================================================
2020-12-24 19:59:07  wwn-0x5000cca04f4c7348  major     SSD has failed.
2020-12-24 19:59:13  wwn-0x5000cca04f4c7348  major     SSD is disconnected.
2020-12-24 20:03:23  wwn-0x5000cca04f4c6b90  major     SSD has failed.
2020-12-24 20:03:24  wwn-0x5000cca04f4c6b90  major     SSD is disconnected.
2020-12-28 18:36:27  wwn-0x5000cca04f740f04  major     SSD is disconnected.
2020-12-28 18:36:28  wwn-0x5000cca04f743dec  major     SSD is disconnected.
2020-12-28 18:36:32  wwn-0x5000cca04f74460c  major     SSD is disconnected.
2020-12-28 18:50:38  wwn-0x5000cca04f4c6b50  major     SSD is disconnected.

JBOD LCC SAS error counters information:

LCC_Name      Phy_Index  Invalid-Dwords  Disparity-Errors  Loss-Dword-Sync  Phy-Resets
X3-DAE-LCC-B  0          0               0                 0                0
X3-DAE-LCC-B  1          0               0                 0                0
X3-DAE-LCC-B  2          0               0                 0                0
X3-DAE-LCC-B  3          0               0                 0                0
X3-DAE-LCC-B  4          0               0                 0                0
X3-DAE-LCC-B  5          0               0                 0                0
X3-DAE-LCC-B  6          0               0                 0                0
X3-DAE-LCC-B  7          0               0                 0                0
X3-DAE-LCC-B  11         0               0                 0                0
X3-DAE-LCC-B  12         0               0                 0                0
X3-DAE-LCC-B  13         0               0                 0                0
X3-DAE-LCC-B  14         0               0                 0                0
X3-DAE-LCC-B  15         0               0                 0                0
X3-DAE-LCC-B  16         0               0                 0                0
X3-DAE-LCC-B  17         0               0                 0                0
X3-DAE-LCC-B  18         0               0                 0                0
X3-DAE-LCC-B  19         0               0                 0                0
X3-DAE-LCC-B  20         0               0                 0                0
X3-DAE-LCC-B  21         0               0                 0                0
X3-DAE-LCC-B  22         0               0                 0                0
X3-DAE-LCC-B  23         0               0                 0                0
X3-DAE-LCC-B  24         0               0                 0                0
X3-DAE-LCC-B  25         0               0                 0                0
X3-DAE-LCC-B  26         0               0                 0                0
X3-DAE-LCC-B  27         1483            1439              4                0
X3-DAE-LCC-B  28         1066            1034              2                0
X3-DAE-LCC-B  29         0               0                 0                0
X3-DAE-LCC-B  30         0               0                 0                0
X3-DAE-LCC-B  31         0               0                 0                0
X3-DAE-LCC-B  32         0               0                 0                0
X3-DAE-LCC-B  33         0               0                 0                0
X3-DAE-LCC-B  34         0               0                 0                0
X3-DAE-LCC-B  35         0               0                 0                0
X3-DAE-LCC-A  0          0               0                 0                0
X3-DAE-LCC-A  1          0               0                 0                0
X3-DAE-LCC-A  2          0               0                 0                0
X3-DAE-LCC-A  3          0               0                 0                0
X3-DAE-LCC-A  4          0               0                 0                0
X3-DAE-LCC-A  5          0               ...
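
The only nonzero counters in this dump are on X3-DAE-LCC-B phys 27 and 28. As an illustration only (not part of the original ticket), a small sketch like the following could scan a saved plain-text copy of this counter table and print just the phys with any nonzero error counter; the file path and the exact column layout are assumptions.

# Hypothetical helper: print only the phys with nonzero SAS error counters,
# assuming the table was saved as plain text in the column order shown above.
import sys

def noisy_phys(path: str):
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 6 or not parts[1].isdigit():
                continue  # skip the header and any truncated/malformed rows
            lcc, phy = parts[0], int(parts[1])
            counters = [int(x) for x in parts[2:]]
            if any(counters):
                yield lcc, phy, counters

if __name__ == "__main__":
    for lcc, phy, counters in noisy_phys(sys.argv[1]):
        print(lcc, phy, counters)

Run against the table above, this would surface X3-DAE-LCC-B phy 27 and phy 28 only.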

Detail log information:

X3-SC1:xtremapp: D22 PANIC D4680 csid 22 2020-12-24 20:04:42.907748
X3-SC1:xtremapp: D23 PANIC D4680 csid 23 2020-12-24 20:04:42.909298
X3-SC2:xtremapp: D24 PANIC D4680 csid 24 2020-12-24 20:04:42.915026
X3-SC2:xtremapp: D25 PANIC D4680 csid 25 2020-12-24 20:04:42.923559
X3-SC2:xtremapp*: X25 PANIC S11 csid 25 info 2020-12-27 16:33:35.397335
X6-SC1:xtremapp*: X41 PANIC S11 csid 41 info 2020-12-27 16:33:35.397472
X1-SC2:xtremapp*: X13 PANIC S11 csid 13 info 2020-12-28 18:32:47.145229

Detail log information:

info 2020-12-27 16:22:01.912467 PAVAFXT01-x3-n2 kernel:[44940173.179988] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-27 16:22:01.913395 PAVAFXT01-x3-n2 kernel:[44940173.181201] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-27 16:22:01.955357 PAVAFXT01-x3-n1 kernel:[44941003.302116] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-27 16:22:01.957428 PAVAFXT01-x3-n1 kernel:[44941003.304544] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.265404 PAVAFXT01-x3-n2 kernel:[45030964.804579] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.268387 PAVAFXT01-x3-n2 kernel:[45030964.807411] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.275512 PAVAFXT01-x3-n2 kernel:[45030964.814813] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.279383 PAVAFXT01-x3-n2 kernel:[45030964.819299] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.411325 PAVAFXT01-x3-n1 kernel:[45031795.881564] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.414357 PAVAFXT01-x3-n1 kernel:[45031795.884898] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.415317 PAVAFXT01-x3-n1 kernel:[45031795.886212] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:34:38.417304 PAVAFXT01-x3-n1 kernel:[45031795.887547] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.657310 PAVAFXT01-x3-n1 kernel:[45032139.264530] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.658348 PAVAFXT01-x3-n1 kernel:[45032139.265668] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.757481 PAVAFXT01-x3-n2 kernel:[45031308.430551] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:40:21.760409 PAVAFXT01-x3-n2 kernel:[45031308.433207] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.756431 PAVAFXT01-x3-n2 kernel:[45031382.457810] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.758360 PAVAFXT01-x3-n2 kernel:[45031382.460037] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.907345 PAVAFXT01-x3-n1 kernel:[45032213.543862] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]
info 2020-12-28 17:41:35.908291 PAVAFXT01-x3-n1 kernel:[45032213.545040] mpt2sas0: [sense_key asc ascq]: [0x04 0x44 0xa2]

148         SSD is disconnected.  major  Mon Dec 28 18:50:38 2020  SSD wwn-0x5000cca04f4c6b50  66  PAVAFXT01  1  ssd_fru_disconnected  outstanding
0900704150  SSD is disconnected   major  Mon Dec 28 18:36:32 2020
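
For reference when reading the entries above: sense key 0x04 is HARDWARE ERROR and ASC 0x44 is "internal target failure" per the SCSI spec (the ASCQ 0xa2 qualifier is vendor specific), while the per-disk kernel lines later in this ticket report MEDIUM ERROR with ASC 0x11, i.e. the unrecovered read error family. The small, hypothetical helper below (not from the ticket) translates only the triplets that appear in these logs.

# Minimal sketch: decode the SCSI sense triplets seen in this ticket's kernel logs.
# The tables are intentionally partial and only cover the codes present here.

SENSE_KEYS = {
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
}

ASC_CODES = {
    0x11: "UNRECOVERED READ ERROR (family)",
    0x44: "INTERNAL TARGET FAILURE",
}

def decode(sense_key: int, asc: int, ascq: int) -> str:
    key = SENSE_KEYS.get(sense_key, f"sense key 0x{sense_key:02x}")
    asc_txt = ASC_CODES.get(asc, f"ASC 0x{asc:02x}")
    return f"{key} / {asc_txt} (ASC=0x{asc:02x}, ASCQ=0x{ascq:02x})"

if __name__ == "__main__":
    # Triplets lifted from the mpt2sas and sd lines in this ticket.
    for triplet in [(0x04, 0x44, 0xa2), (0x03, 0x11, 0x3b)]:
        print(decode(*triplet))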

Comments
Comment by Petr A Dushkin [ 30/Dec/20 ]
Hi Thomas Tobiasz,

I will review the logs to see the status of the SSDs in question. What's your overall plan for this cluster? Are you trying to validate whether it's okay to do a fresh install based on the SSDs' status?
Comment by Thomas Tobiasz [ 30/Dec/20 ]
Log bundle is now available on the FTP folder //incoming/EE-17529


The plan at this point is to determine whether any of the 6 SSDs can be recovered. What is odd is the time frame in which they failed.

Comment by Petr A Dushkin [ 30/Dec/20 ]


Those 3 affected SSDs are in the same X3-DAE, and I think that's why the cluster has stopped:

wwn-0x5000cca04f4c6b50  66  PAVAFXT01  1  X3  3  15  HITACHI HUSMM118 CLAR800  C250  no_error  5051058  745.223G  X3-DPG  3  in_rg         disconnected  97  ok  enc_supported_locke
wwn-0x5000cca04f4c7348  67  PAVAFXT01  1  X3  3  16  HITACHI HUSMM118 CLAR800        no_error  5051058  745.223G            failed_in_rg  disconnected  0   ok  enc_supported_locke
wwn-0x5000cca04f4c6b90  68  PAVAFXT01  1  X3  3  17  HITACHI HUSMM118 CLAR800        no_error  5051058  745.223G            failed_in_rg  disconnected  0

Comment by Petr A Dushkin [ 30/Dec/20 ]


It appears both X3-SC1 and X3-SC2 are reporting a gap in slots 15-17:

cat X3-SC*/system/disks/disk_in_slot.txt
# SLOT WWN SERIAL SD_enc0 SG_enc0 SD_enc1 SG_enc1
slot 0 ===> 0x5000cca04f743ad4 0RX1XLAA /dev/sdac /dev/sg28 /dev/sdad /dev/sg30
slot 1 ===> 0x5000cca04f7441d4 0RX1Y1TA /dev/sdab /dev/sg27 /dev/sdae /dev/sg31
slot 2 ===> 0x5000cca04f74398c 0RX1XHPA /dev/sdaa /dev/sg26 /dev/sdaf /dev/sg32
slot 3 ===> 0x5000cca04f7438c8 0RX1XG3A /dev/sdz /dev/sg25 /dev/sdag /dev/sg33
slot 4 ===> 0x5000cca04f4cae20 0RWB59RA /dev/sdy /dev/sg24 /dev/sdah /dev/sg34
slot 5 ===> 0x5000cca04f4c9cb4 0RWB44SA /dev/sdx /dev/sg23 /dev/sdai /dev/sg35
slot 6 ===> 0x5000cca04f4c7690 0RWB1M0A /dev/sdw /dev/sg22 /dev/sdaj /dev/sg36
slot 7 ===> 0x5000cca04f4cb018 0RWB5ETA /dev/sdv /dev/sg21 /dev/sdak /dev/sg37
slot 8 ===> 0x5000cca04f509dd8 0RWEAE5A /dev/sdu /dev/sg20 /dev/sdal /dev/sg38
slot 9 ===> 0x5000cca04f50b598 0RWED06A /dev/sdt /dev/sg19 /dev/sdam /dev/sg39
slot 10 ===> 0x5000cca04f5093d4 0RWE9SHA /dev/sds /dev/sg18 /dev/sdan /dev/sg40
slot 11 ===> 0x5000cca04f4c6970 0RWB0RXA /dev/sdr /dev/sg17 /dev/sdao /dev/sg41
slot 12 ===> 0x5000cca04f4c6944 0RWB0RKA /dev/sdq /dev/sg16 /dev/sdap /dev/sg42
slot 13 ===> 0x5000cca04f4ca948 0RWB4ZRA /dev/sdp /dev/sg15 /dev/sdaq /dev/sg43
slot 14 ===> 0x5000cca04f4c9d60 0RWB464A /dev/sdo /dev/sg14 /dev/sdar /dev/sg44
slot 18 ===> 0x5000cca04f4ca8f8 0RWB4Z2A /dev/sdk /dev/sg10 /dev/sdav /dev/sg48
slot 19 ===> 0x5000cca04f4ca888 0RWB4Y5A /dev/sdj /dev/sg9 /dev/sdaw /dev/sg49
slot 20 ===> 0x5000cca04f4ca880 0RWB4Y3A /dev/sdi /dev/sg8 /dev/sdax /dev/sg50
slot 21 ===> 0x5000cca04f4c7314 0RWB1BUA /dev/sdh /dev/sg7 /dev/sday /dev/sg51
slot 22 ===> 0x5000cca04f4c9090 0RWB3APA /dev/sdg /dev/sg6 /dev/sdaz /dev/sg52
slot 23 ===> 0x5000cca04f4ca90c 0RWB4Z7A /dev/sdf /dev/sg5 /dev/sdba /dev/sg53
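
The same gap can also be spotted programmatically rather than by eye. Below is a quick, hypothetical sketch (the file format is taken from the listing above; the expected slot count is passed in as a parameter, since it depends on the DAE model) that prints the slot numbers missing from a disk_in_slot.txt dump.

# Hypothetical helper, not from the ticket: report DAE slots that have no entry in
# an XtremIO disk_in_slot.txt dump (format as in the listing above).
import re
import sys

def missing_slots(path: str, expected_slots: int) -> list:
    seen = set()
    with open(path) as fh:
        for line in fh:
            m = re.match(r"slot\s+(\d+)\s+===>", line)
            if m:
                seen.add(int(m.group(1)))
    return sorted(set(range(expected_slots)) - seen)

if __name__ == "__main__":
    # e.g. python find_gaps.py X3-SC1/system/disks/disk_in_slot.txt 24
    path, expected = sys.argv[1], int(sys.argv[2])
    print(missing_slots(path, expected))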

Comment by Thomas Tobiasz [ 30/Dec/20 ]


Those 3 SSDs show as disconnected in the alert log on the 28th; however, on the 24th, when the array went down, the following entries were in the alert log for those 3 SSDs:

2020-12-24 21:34:28,498 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f740f04:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:29,440 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f743dec:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:29,508 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f4c6b50:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:30,177 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: wwn-0x5000cca04f74460c:
SSD lifecycle state was changed from disconnected to healthy.
2020-12-24 21:34:33,912 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: X1-SC1: Storage Controller
stop type was changed from stopped to none. Stop reason is: none
2020-12-24 21:34:34,266 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: X1-SC1: The Storage
Controller state was changed from stopped to stopping.
2020-12-24 21:34:34,299 - XmsLogger[10922:b52e::PSY4e94c8] - INFO - mom::check_property_change_events:828 - PAVAFXT01: X1-SC1: Storage Controller
journal state was changed from healthy to dumping.

Comment by Petr A Dushkin [ 30/Dec/20 ]


Thomas Tobiasz,

Is it possible to capture the messages from the X3 brick for 12-24 as well?


Comment by Thomas Tobiasz [ 31/Dec/20 ]
Message logs for both X3-SC1 and X3-SC2 for the 24th of December are now available in the FTP folder.
Comment by Petr A Dushkin [ 31/Dec/20 ]
Thomas Tobiasz,

Yes, you are correct: 3 SSDs failed one after another, and the cluster stopped once the maximum number of failed drives was reached. We can see from the xms.log that X3-DPG was undergoing a rebuild after the dual SSD failure (wwn-0x5000cca04f4c7348 and wwn-0x5000cca04f4c6b90) when the 3rd SSD failed (wwn-0x5000cca04f4c6b50):

2020-12-24 19:59:02,797 - [20632:13f6::PSY899920] - PAVAFXT01: Raised alert: "Diagnostics detected a minor problem in the SSD." object: wwn-0x5000cca04f4c7348 severity: minor threshold:
2020-12-24 19:59:04,784 - [20632:13f6::PSY899920] - PAVAFXT01: Raised alert: "DPG rebuild has started." object: X3-DPG severity: information threshold:
2020-12-24 19:59:05,587 - [20632:13f6::PSY899920] - PAVAFXT01: Raised alert: "An SSD has failed and the DPG resiliency is degraded." object: X3-DPG severity: major threshold:
2020-12-24 19:59:08,871 - [20748:13f6::PSYadd508] - PAVAFXT01: Raised alert: "SSD has failed." object: wwn-0x5000cca04f4c7348 severity: major threshold:
2020-12-24 19:59:13,346 - [20839:13f6::PSY14eec6] - PAVAFXT01: Removed alert: "Diagnostics detected a minor problem in the SSD." object: wwn-0x5000cca04f4c7348
2020-12-24 19:59:14,445 - [20839:13f6::PSY14eec6] - PAVAFXT01: Raised alert: "SSD is disconnected." object: wwn-0x5000cca04f4c7348 severity: major threshold:
2020-12-24 20:03:20,507 - [24707:13f6::PSYc293a2] - PAVAFXT01: Cleared alert: "An SSD has failed and the DPG resiliency is degraded." object: X3-DPG
2020-12-24 20:03:21,785 - [24707:13f6::PSYc293a2] - PAVAFXT01: Raised alert: "DPG has two simultaneous SSD failures and is in degraded protection mode." object: X3-DPG severity: critical threshold:
2020-12-24 20:03:24,797 - [24805:13f6::PSYf6656e] - PAVAFXT01: Raised alert: "SSD has failed." object: wwn-0x5000cca04f4c6b90 severity: major threshold:
2020-12-24 20:03:25,736 - [24805:13f6::PSYf6656e] - PAVAFXT01: Raised alert: "SSD is disconnected." object: wwn-0x5000cca04f4c6b90 severity: major threshold:
2020-12-24 20:04:51,621 - [26066:13f6::PSYd49141] - PAVAFXT01: Removed alert: "DPG rebuild has started." object: X3-DPG
2020-12-24 20:04:51,648 - [26066:13f6::PSYd49141] - X3-DPG: alert c36bc420d8034000952dcfdb807559b9 not found, skipping

I will review the messages from 12-24 and will add my findings to the ticket.
Comment by Petr A Dushkin [ 31/Dec/20 ]
2020-12-24 19:58:57 - Physical errors on SSD wwn-0x5000cca04f4c7348:

<info>2020-12-24 19:58:57.511791 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 2195][12046(12399 pm_ssd)]pm_disk_diag_test: disk /dev/sdm is_fault: false -> true, smart_problem
<info>2020-12-24 19:58:57.511851 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 2197][12046(12399 pm_ssd)]pm_disk_diag_test:915: Device = /dev/sg12, SSD SENSE/ASC/ASCQ = SCSI_SENSE_KEY_NO_SENSE/ (0x00/0x0b/0xfb), IO SENSE/ASC/ASCQ = SCSI_SENSE_KEY_NO_SENSE/NO ADDITIONAL SENSE INFORMATION (0x00/0x00/0x00)
<info>2020-12-24 19:58:57.511859 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1696][12046(12166 nb_truck_0)]ssd_update_in_mom: updating SSD wwn-0x5000cca04f4c7348 in mom, diagnostic_health_state has changed MGMT_SENSOR_SEVERITY_CLEAR -> MGMT_SENSOR_SEVERITY_WARNING
<info>2020-12-24 19:58:57.511883 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 21937][12046(12166 nb_truck_0)]send_mgmt_events_from_buffer: NOTIF_FLOW: module MODULE_TYPE_PLATFORM(csid=20) sending event type mom_object_update (event_idx=12121999) on obj_type MGMT_OBJTYPE_SSD(guid=dd8790ba091f48f2af92dea559cf1f93)
<info>2020-12-24 19:58:57.541064 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1704][12046(12399 pm_ssd)]pm_ssd_poller:1030: clst update is needed: ssd failed
<info>2020-12-24 19:58:57.698460 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 21937][12046(12166 nb_truck_0)]send_mgmt_events_from_buffer: NOTIF_FLOW: module MODULE_TYPE_PLATFORM(csid=20) sending event type mom_object_update (event_idx=12122000) on obj_type MGMT_OBJTYPE_IB_SWITCH(guid=e9cbcda298304f9aaa6f2315ef0b6d3f)
...
<info>2020-12-24 19:58:59.109169 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 3][12046(12166 nb_truck_0)]handle_mbe_p_check_disk: MBE_P message received
<info>2020-12-24 19:58:59.109194 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 22659][12046(12166 nb_truck_0)]pm_ssd_check_disk_flow:7852: check disk called in EXTENDED MODE on ssd (wwn=wwn-0x5000cca04f4c7348)
...
<info>2020-12-24 19:59:01.111246 PAVAFXT01-x3-n1 kernel:[44694724.496151] sd 2:0:8:0: [sdm] Unhandled sense code

2020-12-24 19:59:04 DPG rebuild started:


<crit>2020-12-24 19:59:04.224720 PAVAFXT01-x3-n1 xtremapp: D22 [log_id: 7934][106072(106147 nb_truck_2)]pl_par_mgmt_disk_status_changed_listener_callback: #RAID rebuild start for owner 0 disk 16 all_status 'HHHHHHHHHHHHHHHHRHHHHHHHHXX'
<info>2020-12-24 19:59:04.224726 PAVAFXT01-x3-n1 xtremapp: D22 [log_id: 7811][106072(106147 nb_truck_2)]reserve_entire_disk: owner 0 disk 16 force 0 forced_non_secure_reservation 0 degraded_mode 0 num_disks 24 max_num_seen_disks 25 num_spare_disks 0 count_disk_for_free 0 reserve_from_common 0 reserve_from_a2h 0 result 1; before: a2h_reserve 1048576 (type 0 free 368787579 stripes 609375) (type 1 free 249358930 stripes 208286) (type 2 free 44791728 stripes 18866); after: a2h_reserve 1048576 (type 0 free 368787579 stripes 609375) (type 1 free 249358930 stripes 208286) (type 2 free 44791728 stripes 18866)

2020-12-24 19:59:10 - both paths reported ssd 0x5000cca04f4c7348 as disconnected:

<info>2020-12-24 19:59:10.288630 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 19242][12046(12399 pm_ssd)]pm_ssd_update_lcc_phys_health_state: lcc_phys: lcc JWXEL153801482 phy 27 is inactive! Setting PHY health to MGMT_HEALTH_LEVEL_5_MAJOR
<crit>2020-12-24 19:59:10.503763 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1783][12046(12399 pm_ssd)]pm_ssd_cycle: #SSD wwn 0x5000cca04f4c7348 (/dev/sdm ; /dev/sdat) removed from logical slot #17 (physical #16)
<crit>2020-12-24 19:59:10.510559 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 22265][12046(12399 pm_ssd)]update_ssd: SSD slot=16 wwn=0x5000cca04f4c7348 event=2
<info>2020-12-24 19:59:10.523440 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1690][12046(12166 nb_truck_0)]ssd_update_in_mom:765: updating SSD wwn-0x5000cca04f4c7348 [0RWB1D7A ] in mom PM_SLOT_STATE_SIGNED->PM_SLOT_STATE_EMPTY
<info>2020-12-24 19:59:10.523459 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 1692][12046(12166 nb_truck_0)]ssd_update_in_mom:810: SSD 0x5000cca04f4c7348 changed FRU state healthy -> disconnected

2020-12-24 20:03:11 - Physical errors on SSD 0x5000cca04f4c6b90 and it was lost as well:

<crit>2020-12-24 20:03:11.649829 PAVAFXT01-x3-n1 xtremapp: X22 [log_id: 9823][106072(106150 nb_truck_5)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b90-JWXEL153801482 owner 0 op 1 res -5 res2 0 log_params_idx 0 hvol 0xc78d722e40
<warn>2020-12-24 20:03:11.650211 PAVAFXT01-x3-n1 kernel:[44694975.133659] __ratelimit: 1 callbacks suppressed
<info>2020-12-24 20:03:11.650222 PAVAFXT01-x3-n1 kernel:[44694975.133662] sd 2:0:7:0: [sdl] Unhandled sense code
<info>2020-12-24 20:03:11.650224 PAVAFXT01-x3-n1 kernel:[44694975.133664] sd 2:0:7:0: [sdl] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:03:11.650226 PAVAFXT01-x3-n1 kernel:[44694975.133667] sd 2:0:7:0: [sdl] Sense Key : Medium Error [current]
<warn>2020-12-24 20:03:11.650231 PAVAFXT01-x3-n1 kernel:[44694975.133671] Info fld=0x8b5db20
<info>2020-12-24 20:03:11.650232 PAVAFXT01-x3-n1 kernel:[44694975.133672] sd 2:0:7:0: [sdl] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:03:11.650234 PAVAFXT01-x3-n1 kernel:[44694975.133677] sd 2:0:7:0: [sdl] CDB: Read(10): 28 00 08 b5 db 20 00 00 10 00
<warn>2020-12-24 20:03:11.650235 PAVAFXT01-x3-n1 kernel:[44694975.133683] __ratelimit: 1 callbacks suppressed
<crit>2020-12-24 20:03:11.662193 PAVAFXT01-x3-n1 xtremapp: X23 [log_id: 9823][106074(106111 nb_truck_5)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b90-JWXEL153801047 owner 0 op 1 res -5 res2 0 log_params_idx 1 hvol 0xc78d722e40
<info>2020-12-24 20:03:11.662248 PAVAFXT01-x3-n1 kernel:[44694975.146039] sd 2:0:43:0: [sdau] Unhandled sense code
<info>2020-12-24 20:03:11.662252 PAVAFXT01-x3-n1 kernel:[44694975.146043] sd 2:0:43:0: [sdau] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:03:11.662254 PAVAFXT01-x3-n1 kernel:[44694975.146046] sd 2:0:43:0: [sdau] Sense Key : Medium Error [current]
<warn>2020-12-24 20:03:11.662255 PAVAFXT01-x3-n1 kernel:[44694975.146051] Info fld=0x1ffbe3c0
<info>2020-12-24 20:03:11.662256 PAVAFXT01-x3-n1 kernel:[44694975.146052] sd 2:0:43:0: [sdau] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:03:11.662257 PAVAFXT01-x3-n1 kernel:[44694975.146057] sd 2:0:43:0: [sdau] CDB: Read(10): 28 00 1f fb e3 c0 00 00 10 00
...

<info>2020-12-24 20:03:16.188165 PAVAFXT01-x3-n1 xtremapp-pm: P [log_id: 11814][12046(12166 nb_truck_0)]pm_ssd_check_disk_ext:7817: Marking

2020-12-24 20:03:18 - Restarted X3-DPG rebuild due to the 2nd SSD failure:

<info>2020-12-24 20:03:18.762330 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7943][106074(106106 nb_truck_0)]: Changing owner 0 partition 16 status from 4 to 15 process status was 2 code was 0 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'
<err>2020-12-24 20:03:18.762333 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7935][106074(106106 nb_truck_0)]pl_par_mgmt_disk_status_changed_listener_callback: #RAID rebuild failed/aborted for owner 0 disk 16 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'
...
<info>2020-12-24 20:03:18.762330 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7943][106074(106106 nb_truck_0)]: Changing owner 0 partition 16 status from 4 to 15 process status was 2 code was 0 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'
<err>2020-12-24 20:03:18.762333 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: 7935][106074(106106 nb_truck_0)]pl_par_mgmt_disk_status_changed_listener_callback: #RAID rebuild failed/aborted for owner 0 disk 16 all_status 'HHHHHHHHHHHHHHHHFFHHHHHHHXX'

2020-12-24 20:04:41 - Physical errors on SSD 0x5000cca04f4c6b50:

<crit>2020-12-24 20:04:41.575171 PAVAFXT01-x3-n1 xtremapp: X23 [log_id: 9823][106074(106110 nb_truck_4)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b50-JWXEL153801047 owner 0 op 1 res -5 res2 0 log_params_idx 0 hvol 0xc78d71d9c0
<warn>2020-12-24 20:04:41.575254 PAVAFXT01-x3-n1 kernel:[44695065.094531] __ratelimit: 24 callbacks suppressed
<info>2020-12-24 20:04:41.575270 PAVAFXT01-x3-n1 kernel:[44695065.094536] sd 2:0:41:0: [sdas] Unhandled sense code
<info>2020-12-24 20:04:41.575272 PAVAFXT01-x3-n1 kernel:[44695065.094538] sd 2:0:41:0: [sdas] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:04:41.575276 PAVAFXT01-x3-n1 kernel:[44695065.094541] sd 2:0:41:0: [sdas] Sense Key : Medium Error [current]
<warn>2020-12-24 20:04:41.575277 PAVAFXT01-x3-n1 kernel:[44695065.094546] Info fld=0x1a72c3d6
<info>2020-12-24 20:04:41.575279 PAVAFXT01-x3-n1 kernel:[44695065.094547] sd 2:0:41:0: [sdas] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:04:41.575280 PAVAFXT01-x3-n1 kernel:[44695065.094552] sd 2:0:41:0: [sdas] CDB: Read(10): 28 00 1a 72 c3 d0 00 00 10 00
<warn>2020-12-24 20:04:41.575282 PAVAFXT01-x3-n1 kernel:[44695065.094559] __ratelimit: 24 callbacks suppressed
<crit>2020-12-24 20:04:41.725413 PAVAFXT01-x3-n1 xtremapp: X22 [log_id: 9823][106072(106147 nb_truck_2)]: volio: got error from kernel dev /dev/disk/by-id/xio-wwn-0x5000cca04f4c6b50-JWXEL153801482 owner 0 op 1 res -5 res2 0 log_params_idx 0 hvol 0xc78d71d9c0
<info>2020-12-24 20:04:41.726188 PAVAFXT01-x3-n1 kernel:[44695065.244826] sd 2:0:9:0: [sdn] Unhandled sense code
<info>2020-12-24 20:04:41.726197 PAVAFXT01-x3-n1 kernel:[44695065.244829] sd 2:0:9:0: [sdn] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:04:41.726199 PAVAFXT01-x3-n1 kernel:[44695065.244832] sd 2:0:9:0: [sdn] Sense Key : Medium Error [current]
<warn>2020-12-24 20:04:41.726215 PAVAFXT01-x3-n1 kernel:[44695065.244836] Info fld=0x112800e8
<info>2020-12-24 20:04:41.726217 PAVAFXT01-x3-n1 kernel:[44695065.244838] sd 2:0:9:0: [sdn] ASC=0x11 ASCQ=0x3b
<info>2020-12-24 20:04:41.726219 PAVAFXT01-x3-n1 kernel:[44695065.244842] sd 2:0:9:0: [sdn] CDB: Read(10): 28 00 11 28 00 e8 00 00 08 00
<info>2020-12-24 20:04:41.852264 PAVAFXT01-x3-n1 kernel:[44695065.371742] sd 2:0:41:0: [sdas] Unhandled sense code
<info>2020-12-24 20:04:41.852280 PAVAFXT01-x3-n1 kernel:[44695065.371746] sd 2:0:41:0: [sdas] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<info>2020-12-24 20:04:41.852282 PAVAFXT01-x3-n1 kernel:[44695065.371750] sd 2:0:41:0: [sdas] Sense Key : Medium Error [current]

2020-12-24 20:04:42 - XENVs PANIC on X3 due to 3rd SSD failure:


<crit>2020-12-24 20:04:42.907748 PAVAFXT01-x3-n1 xtremapp: D22 [log_id: -1][106072(106145 nb_truck_0)]: PANIC <D4680> csid 22 at pl_cas.c:6775 cas_report_failure (timestamp 314641822509234537): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 15,16,17 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'
..
<crit>2020-12-24 20:04:42.909298 PAVAFXT01-x3-n1 xtremapp: D23 [log_id: -1][106074(106111 nb_truck_5)]: PANIC <D4680> csid 23 at pl_cas.c:6775 cas_report_failure (timestamp 314641822512475718): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 16,17,15 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'
...
<crit>2020-12-24 20:04:42.915026 PAVAFXT01-x3-n2 xtremapp: D24 [log_id: -1][108928(109007 nb_truck_6)]: PANIC <D4680> csid 24 at pl_cas.c:6775 cas_report_failure (timestamp 314640019858762433): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 15,16,17 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'
...
<crit>2020-12-24 20:04:42.923559 PAVAFXT01-x3-n2 xtremapp: D25 [log_id: -1][108929(108962 nb_truck_0)]: PANIC <D4680> csid 25 at pl_cas.c:6775 cas_report_failure (timestamp 314640019876639139): [3xDEGRADED] Third failure not supported. Failed owner 0 disks: 15,16,17 all_disks_status='HHHHHHHHHHHHHHHHRRHHHHHHHXX'

Cluster closed the gates due to triple ssd failure as expected:

X1-SC2/system/logs/crit-messages:<crit>2020-12-24 20:04:44.983998 PAVAFXT01-x1-n2 xtremapp: M [log_id: 4380][125030(125047 nb_truck_0)]bl_rg_close_gates_if_triple_degraded: Triple failure: 2 SSDs failed_in_rg, 2 SSDs pending rebuild, SSD wwn-0x5000cca04f4c6b50 (slot 15) ejected, did not change raid index state, closing gates
X1-SC2/system/logs/crit-messages:<crit>2020-12-24 20:44:02.547851 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][65531(65548 nb_truck_0)]ham_rule_action_close_gates: event ID 383, closing system gates! Reason: multiple_disk_failures
X1-SC2/system/logs/crit-messages:<crit>2020-12-24 20:44:02.882455 PAVAFXT01-x1-n2 xtremapp: M [log_id: 6554][65531(65548 nb_truck_0)]ham_rule_action_close_gates: event ID 536, closing system gates! Reason: ha_failure

Comment by Petr A Dushkin [ 31/Dec/20 ]


Thomas Tobiasz,

It appears those 3 SSDs are disconnected due to HW issues.

Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x3.
....

2020-12-24 20:01:43
Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x3.
....

2020-12-24 20:02:54
Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x3.
....

2020-12-24 20:04:12
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.

In order to attempt to read the SMART data from the SSDs, you can arrange for an on-site CE to reseat the SSDs and check whether the system detects them.
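
As an illustration only (assuming smartctl is available on the Storage Controller and the reseated drive re-enumerates as a /dev/sg* node - both assumptions, not confirmed in this ticket), a post-reseat check could look roughly like this:

# Rough sketch under the assumptions stated above: wait for the reseated SSD's
# /dev/sg* node to reappear, then dump its SMART/health pages for review.
import os
import subprocess
import sys
import time

def wait_for_device(dev: str, timeout_s: int = 120) -> bool:
    """Poll until the device node exists or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(dev):
            return True
        time.sleep(5)
    return False

def dump_smart(dev: str) -> None:
    """smartctl -a prints identity, health and error-log pages for SCSI/SAS devices."""
    result = subprocess.run(["smartctl", "-a", dev], capture_output=True, text=True)
    print(result.stdout or result.stderr)

if __name__ == "__main__":
    # /dev/sg12 is used as an example only because it appears in the failures above.
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sg12"
    if wait_for_device(dev):
        dump_smart(dev)
    else:
        print(f"{dev} did not come back after the reseat")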
Comment by Petr A Dushkin [ 31/Dec/20 ]
Previous attempts to start the cluster:

2020-12-24 20:43:31,879 - [5869:aaad::XRS5f21d0] - User: admin, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=1}
2020-12-24 20:53:57,113 - [5869:05ca::RSTebefc6] - User: admin, Command: start_cluster, Failed: sys_start_error
...
2020-12-24 21:34:05,808 - [5869:2f21::XRSa8caa3] - User: admin, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=2}
2020-12-24 21:44:15,062 - [5869:05ca::RSTd431ba] - User: admin, Command: start_cluster, Failed: sys_start_error
...
2020-12-27 16:33:17,677 - [5869:f9a0::XRS329967] - User: admin, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=3}
2020-12-27 16:43:20,819 - [5869:05ca::RST4dddad] - User: admin, Command: start_cluster, Failed: sys_start_error
...
2020-12-28 18:32:24,268 - [5869:ae0c::XRS78fbb6] - User: tech, Command: start_cluster, Arguments: {sys_obj_id=[], force_start=False, receiver_id=3}
2020-12-28 18:42:37,674 - [5869:05ca::RST7eff95] - User: tech, Command: start_cluster, Failed: sys_start_error

Comment by Thomas Tobiasz [ 31/Dec/20 ]


Update from the customer regarding the actions they had performed on the array:

. The first 2 drives, in slots 16 & 17, failed initially, and then slot 15. The drives in these slots were reseated in initial efforts to regain access after they had already gone offline.

I will have a CE go onsite and perform the reseat(s) of the SSDs. Reseat one and monitor to see if it is detected, then move on to the next depending on the result.


Comment by Petr A Dushkin [ 31/Dec/20 ]


Thomas Tobiasz,

Let me know if you need any additional feedback regarding the SSDs and cluster status.
Comment by Petr A Dushkin [ 31/Dec/20 ]
It seems that the SSD in slot 15 was reseated here:

Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x7.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x8.
....27:03:23:42:920 dev/phy.26: Drive Slot 15 no device detected.
27:03:23:42:999 dev/phy.26: attached SAS address 5000cca0_4f4c6b52->00000000_00000000

2020-12-27 16:19:53
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg12, scsi:2.
failed to add dev /dev/sg12: scsi2, hdl=0x3.
Failed to inquiry device:/dev/sg46, scsi:2.
failed to add dev /dev/sg46: scsi2, hdl=0x7.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x8.
...r27:03:24:11:324 dev/phy.26: Drive slot 15 device detected.
27:03:24:11:330 dev/phy.26: DISABLED ERROR->ENABLED
27:03:24:11:938 dev/phy.26: ready
27:03:24:11:942 dev/phy.26: link ready
27:03:24:11:946 dev/phy.26: rate unknown->6G
27:03:24:11:951 dev/phy.26: attached SAS address 00000000_00000000->5000cca0_2b256796

followed by slot 16:

2020-12-27 16:21:23
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x3.
...I27:03:26:23:255 dev/phy.27: not ready
27:03:26:23:258 dev/phy.27: link not ready
27:03:26:23:265 dev/phy.27: rate 6G->unknown
27:03:26:23:269 dev/phy.27: attached phy id 0x01->0xff
27:03:26:23:953 dev/phy.27: Drive Slot 16 no device detected.
27:03:26:24:120 dev/phy.27: attached SAS address 5000cca0_4f4c734a->00000000_00000000


2020-12-27 16:21:43
Failed to inquiry device:/dev/sg11, scsi:2.
failed to add dev /dev/sg11: scsi2, hdl=0x0.
Failed to inquiry device:/dev/sg47, scsi:2.
failed to add dev /dev/sg47: scsi2, hdl=0x3.
...>27:03:26:47:608 dev/phy.27: Drive slot 16 device detected.
27:03:26:48:989 dev/phy.27: ready
27:03:26:48:992 dev/phy.27: link ready
27:03:26:48:997 dev/phy.27: rate unknown->6G
...

Generated at Thu Dec 31 03:25:21 IST 2020 by Mingli Bi using JIRA 7.8.0#78000-sha1:4568b9d484113d74dfb6f152fb925b5fa1be2ef7.
