I have an MD3000 that's been running completely trouble-free for many years (other than an occasional drive failure). It's connected to a single host, with each RAID module connected to its own Dell SAS5/E controller. Each power supply is connected to a different UPS. The firmware is version 07.35.39.64.
Less than a week ago the unit experienced an event that seemed to cause one of the RAID modules to reset. During this event, an Exchange server noted in its logs that some IOs did complete but took an abnormally long time. Fast-forward to today, and seemingly the exact same sequence of events has happened again, but this time the other RAID module seems to have reset. The same Exchange server again reported delayed IOs. Even more troubling, later today, at a moment when there were no reported MD3000 issues, the Exchange server also reported, "The database engine attempted a clean write operation on page 4167892 of database xxxx. This action was performed in an attempt to correct a previous problem reading from the page." This is the first message of its kind and would seem to indicate there's an underlying storage issue.
The only thing the first set of events (a few days ago) and today's events have in common is that both occurred around the same time in the morning (6-7:30am).
I am attaching the events from today. I also have a support bundle if that would be helpful. Without knowing more, I am inclined to purchase an extra RAID module and power supply from eBay. The only hint of a power supply problem is two isolated events from today indicating that the supplies changed to optimal (with no message about them being degraded or failed). There were no known UPS or power problems at the time, so it's a mystery.
The MD3000 itself has been running for a very long time. I previously saw the "sample period" reported as more than 840 days, but now both modules are back to zero because of the recent resets. I am considering shutting the entire system down and power cycling the MD3000.
I once had a SAS5/E card fail because of bad capacitors. Is it possible the MD3000 is starting to experience something similar? I find it odd that both RAID controllers seemingly experienced the same trouble within a few days of each other. If the MD3000 is powered off completely, can I safely remove the RAID modules and open them to inspect the capacitors?
Please see attached log. Any help would be greatly appreciated!