Kernel error ata3: SError: { HostInt PHYRdyChg 10B8B DevExch } failed command: FLUSH CACHE EXT (unresolved, open for discussion)

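An x86 machine that had run for several years started logging errors after I replaced its SSD last year. I no longer remember the exact messages, only that they included lines such as status: { DRDY } and ata3: hard resetting link. The errors became more and more frequent, until eventually the disk could no longer be reached at all.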

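Along the way I reseated the SATA cable, switched ports, and returned two SSDs. The problem only went away after I finally replaced the PCIe-SATA card and the SATA cable as well, so I cannot tell which of the two was at fault.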


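In October 2023 I simply replaced the whole platform: a brand-new motherboard of an older model, plus memory and CPU, and a new SATA cable as well.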

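After that the system ran stably for almost a full year. Then, while reading the logwatch report for September 21 of this year, I suddenly found an ATA error similar to the earlier ones.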

The complete error is as follows:

Sep 21 22:55:35 yxhserver kernel: ata3.00: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Sep 21 22:55:48 yxhserver kernel: ata3.00: irq_stat 0x00400040, connection status changed
Sep 21 22:55:48 yxhserver kernel: ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
Sep 21 22:55:48 yxhserver kernel: ata3.00: failed command: FLUSH CACHE EXT
Sep 21 22:55:48 yxhserver kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 10
                                           res 40/00:48:c0:86:07/00:00:16:00:00/40 Emask 0x50 (ATA bus error)
Sep 21 22:55:48 yxhserver kernel: ata3.00: status: { DRDY }
Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link
Sep 21 22:55:48 yxhserver kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep 21 22:55:48 yxhserver kernel: ata3: COMRESET failed (errno=-16)
Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link
Sep 21 22:55:48 yxhserver kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 21 22:55:48 yxhserver kernel: ata3.00: configured for UDMA/133
Sep 21 22:55:48 yxhserver kernel: ata3.00: retrying FLUSH 0xea Emask 0x50
Sep 21 22:55:48 yxhserver kernel: ata3: EH complete
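
For anyone trying to read the log above, here is a minimal sketch of how the SErr value 0x4090800 maps to the names printed in the SError line. The bit positions and short names are taken from the SERR_* constants in the Linux kernel's include/linux/ata.h as I understand them; the function name decode_serr is my own and purely illustrative.

```python
# Bit number -> short name, following the SERR_* constants in the Linux
# kernel's include/linux/ata.h (assumed here; check against your kernel source).
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


# SErr value copied from the "exception Emask 0x50 ... SErr 0x4090800" line.
print(decode_serr(0x4090800))
# -> ['HostInt', 'PHYRdyChg', '10B8B', 'DevExch']
```

The decoded names match the SError line in the log. PHYRdyChg, 10B8B and DevExch are all link-level (PHY) events, which is consistent with the "connection status changed" message, but this alone does not say whether the cable, the port, the power supply or the drive is responsible.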

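This is the first time this particular error has been logged, and I am not sure what caused it. The mechanical hard drive installed alongside the SSD shows no problems.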

The SSD's SMART information is as follows:

SMART Attributes Data Structure revision number: 20
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       7898
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       181
175 Program_Fail_Count_Chip 0x0022   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   000   000   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   000   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       62
190 Airflow_Temperature_Cel 0x0022   024   024   000    Old_age   Always       -       24 (Min/Max 10/39)
196 Reallocated_Event_Count 0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       4112
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       2888

The SMART information of the mechanical hard drive in the same machine is as follows:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   168   021    Pre-fail  Always       -       2350
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       942
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       30240
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       934
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       95
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       32626
194 Temperature_Celsius     0x0022   110   098   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
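
To compare the two tables more easily, here is a small sketch that pulls the RAW_VALUE column out of smartctl-style attribute rows. The helper name parse_smart_raw is hypothetical, and the sample rows are copied from the two tables above. I picked UDMA_CRC_Error_Count because it is generally understood to count CRC errors on the interface link, though how a given vendor maintains it is an assumption on my part.

```python
def parse_smart_raw(table_text):
    """Map attribute name -> raw value string for smartctl-style attribute rows."""
    result = {}
    for line in table_text.splitlines():
        # Split into at most 10 columns so a raw value such as
        # "24 (Min/Max 10/39)" stays together in the last field.
        fields = line.split(None, 9)
        # Data rows have 10 columns and begin with a numeric attribute ID;
        # the header row starts with "ID#" and is skipped.
        if len(fields) == 10 and fields[0].isdigit():
            result[fields[1]] = fields[9].strip()
    return result


# Rows copied from the SSD table in this post.
ssd_rows = """\
190 Airflow_Temperature_Cel 0x0022   024   024   000    Old_age   Always       -       24 (Min/Max 10/39)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
"""

# Row copied from the mechanical drive table in this post.
hdd_rows = """\
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
"""

ssd = parse_smart_raw(ssd_rows)
hdd = parse_smart_raw(hdd_rows)
print("SSD CRC errors:", ssd["UDMA_CRC_Error_Count"])
print("HDD CRC errors:", hdd["UDMA_CRC_Error_Count"])
```

On the data above, the SSD reports 0 and the mechanical drive reports 2. These are lifetime counters, so the value on the mechanical drive may well date from the old motherboard and cables.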

 

The system is Debian bookworm. I cannot determine where the problem lies; anyone who has run into something similar is welcome to join the discussion.


The problem occurred again on 2024-09-30. I still cannot find the cause, which is really frustrating.

WARNING:  Kernel Errors Present
             res 40/00:e8:a0:11:84/00:00:13:00:00/40 Emask 0x50 (ATA bus error) ...:  1 Time(s)
    ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100) ...:  1 Time(s)
    ata3: SError: { HostInt PHYRd ...:  1 Time(s)

Comments: