How to troubleshoot RAID 5 in the server
Because server technology keeps advancing, different types of servers call for different handling after a RAID 5 failure.
At present, large-scale applications generally adopt a C/S (client/server) or B/S (browser/server) structure, and at least one server hosting a large database has to be placed in the central computer room. For the security and reliability of that server, RAID (Redundant Array of Inexpensive Disks) is usually used to give its disks redundancy. RAID 5 is a striped array with distributed parity and no dedicated parity disk. It uses data blocking and independent access, so it can process multiple access requests in parallel on the same array, and it tolerates the failure of any one hard disk in the array.
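The reason a RAID 5 array survives the loss of one disk is that the parity block of each stripe is the XOR of its data blocks, so any single missing block can be recomputed from the rest. The following is a minimal sketch of that idea only. The block contents and the `xor_blocks` helper are made up for illustration, and a real controller also rotates the parity position from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe (hypothetical contents).
data = [b"ERP1", b"DB02", b"LOG3"]

# The parity block stored on the fourth disk for this stripe.
parity = xor_blocks(data)

# Simulate losing the disk that holds the second block,
# then rebuild that block from the two survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])

print(rebuilt == data[1])  # True
```

The same calculation cannot recover two missing blocks from one parity block, which is why a second failed disk takes the whole array down.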
In practice, some array failures are unavoidable. The most common case is that a hard disk goes offline on its own and its status is displayed as DDD (defunct disk drive), which means the disk has a physical or a logical fault. A physical fault can only be fixed by replacing the hard disk. A logical fault can be repaired with targeted techniques that bring the hard disk back online, preserve the striped distribution of its data in the original array, and keep the storage system consistent.
However, data recovery on some old HP servers (such as the HP LH6000) differs from that on new HP servers (such as the HP ProLiant series), so RAID 5 failures are handled differently on each. I have dealt with RAID 5 array failures on two servers, both caused by an unexpected power failure, and solved each with a different strategy.
Fault repair
One is an HP LH6000 server with four 18GB hard disks configured as a RAID 5 array on a NetRaid array card. The other is an HP ProLiant ML370 server with four 146GB hard disks configured as a RAID 5 array on a Smart Array 642 card, plus a Hot Spare hard disk. Both run Windows 2000 as the operating system and SQL Server 2000 as the database.
The fault on the HP LH6000 was as follows: the red light on one hard disk was flashing while the machine kept running normally, but before long the system could no longer run, and the red light on a second hard disk was then found to be flashing as well.
The solution is as follows:
1. Start the server, and press Ctrl+M to enter the NetRaid management program when the self-check reaches the array. Check the array information, which shows the hard disk status as Failed, and modify the configuration to force one of the Failed hard disks OnLine. Restart the server. In this case the hardware self-check before the operating system loads did not pass, and startup failed.
2. Start the server again, and press Ctrl+M to enter the NetRaid management program when the self-check reaches the array. Select the disk array, manually Fail the hard disk that was forced OnLine in step 1, then manually set the other Failed hard disk to OnLine, and restart the server to enter the system.
3. After checking that the system and database are running normally, enter the array configuration tool and manually set the Failed hard disk to Rebuild. Restart the server once the rebuild reaches 100%. The array and the system are then restored to their original state.
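The article does not say why forcing the first disk OnLine failed while forcing the second one worked. One plausible explanation, offered here as an assumption rather than as the documented behaviour of the NetRaid card, is that the disk that dropped out first holds stale data, because writes made while the array was degraded were recorded only in parity. The sketch below simulates one stripe under that assumption. The disk numbering, the block contents, and the `read_stripe` helper are all hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: data on disks 0-2, parity on disk 3 (hypothetical contents).
disk = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC"}
disk[3] = xor_blocks([disk[0], disk[1], disk[2]])

# Disk 0 goes offline first. A later write of b"ZZZZ" to its block cannot
# reach the disk, so the controller records it in the parity block only.
disk[3] = xor_blocks([b"ZZZZ", disk[1], disk[2]])

# Disk 1 then drops offline as well, with its contents still up to date.
expected = [b"ZZZZ", b"BBBB", b"CCCC"]


def read_stripe(failed):
    """Read data blocks 0-2, rebuilding the `failed` disk from the other three."""
    blocks = dict(disk)
    blocks[failed] = xor_blocks([disk[i] for i in disk if i != failed])
    return [blocks[0], blocks[1], blocks[2]]


# Step 1: stale disk 0 forced OnLine, disk 1 left Failed -> corrupt data.
print(read_stripe(failed=1) == expected)  # False

# Step 2: disk 1 forced OnLine, disk 0 left Failed -> consistent data.
print(read_stripe(failed=0) == expected)  # True
```

Under this assumption, the disk to force OnLine is the one that failed last, and the disk that failed first is the one to Rebuild afterwards.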
The other server, an HP ProLiant ML370 running an ERP system, has four 146GB hot-swappable hard disks configured as a RAID 5 array on a Smart Array card. One of the hard disks suddenly failed during operation, and the RAID 5 array automatically brought in the Hot Spare hard disk to logically replace the damaged one. Data access across the array continued in the original read/write sequence, and the application and database were not affected.
Checking the hard disk status with HP's bundled ACU (Array Configuration Utility) tool showed that the hard disk with the red warning light was offline. If two hard disks in a RAID 5 array on an HP ProLiant server show red lights, the system has crashed and the database cannot be accessed, although the server does not shut down automatically. Once the second hard disk turns red, the data cannot be recovered by conventional means, and a professional third-party data recovery company has to be paid to recover it.
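The difference between one red light and two can be summarised as a simple rule on the number of failed member disks. The following is a minimal sketch of that rule only. The state labels and the `raid5_state` function are hypothetical and do not correspond to any output of the ACU tool.

```python
def raid5_state(drive_states):
    """Classify a RAID 5 array from the states of its member drives.

    drive_states: list of strings, "online" or "failed" (hypothetical labels).
    """
    failed = sum(1 for state in drive_states if state != "online")
    if failed == 0:
        return "optimal"
    if failed == 1:
        # Still usable: parity covers the one missing drive.
        return "degraded"
    # Two or more members lost: parity alone cannot rebuild the data.
    return "failed"


print(raid5_state(["online", "online", "online", "online"]))  # optimal
print(raid5_state(["online", "failed", "online", "online"]))  # degraded
print(raid5_state(["failed", "failed", "online", "online"]))  # failed
```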
The array design of the old HP LH6000 series servers is therefore quite different from that of the current HP ProLiant series. In terms of operation, the HP LH6000 offers many options, including deleting the array and rebuilding it after a failure, and initialization is also chosen manually. On the HP ProLiant series, by contrast, initialization runs automatically in the background once the array is configured, so a ProLiant server cannot reconfigure the array after an array error.
On the HP LH6000, disks in the array can drop offline for unexpected reasons other than physical damage, so maintenance personnel can manually choose Online, Offline, Rebuild, and so on to restore the data. The current HP ProLiant series no longer shows this kind of disk dropout, so when a hard disk turns red it is essentially damaged and needs to be replaced. You can still choose to Rebuild the hot-plug hard disk and see whether it remains usable for a while.
Do a good job of technical backup
The two examples above show that RAID 5 troubleshooting differs between server series of the same brand, because the embedded technology differs. In both cases the data was saved after a Rebuild, and the following lessons can be drawn:
No technology, however advanced, is foolproof. To keep data safe you must back it up properly, ideally with a remote backup of the database once a day. Keep at least one new hard disk as a spare. Note that the capacity of a hard disk joining the array must be greater than or equal to that of the failed hard disk.
If conditions permit, the "RAID 5 + hot spare" array scheme is recommended. It gives us two chances to replace a hard disk before data is lost. For general applications, RAID 5 alone is enough, since it provides data access performance, reliability, and efficient use of disk space at the same time.
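The capacity rules in this section come down to simple arithmetic: a RAID 5 array gives up the equivalent of one disk to parity, and a replacement disk must be at least as large as the one it replaces. The following is a minimal sketch using the nominal disk sizes of the two servers described earlier. The function names are hypothetical, and the figures ignore formatting and controller overhead.

```python
def raid5_usable_gb(disk_count, disk_gb):
    """Nominal usable capacity of a RAID 5 array of identical disks."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    # One disk's worth of space is consumed by distributed parity.
    return (disk_count - 1) * disk_gb


def replacement_fits(failed_gb, replacement_gb):
    """A replacement must be at least as large as the failed disk."""
    return replacement_gb >= failed_gb


print(raid5_usable_gb(4, 18))      # 54  (HP LH6000, four 18GB disks)
print(raid5_usable_gb(4, 146))     # 438 (HP ProLiant ML370, four 146GB disks)
print(replacement_fits(146, 146))  # True
print(replacement_fits(146, 72))   # False
```

A hot spare adds no usable space. It is one more disk held in reserve so that a rebuild can start without waiting for a manual replacement.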
Administrators must keep watching the status of the array, including the yellow warning light on the disk array and the drive status in the management software, and troubleshoot promptly. Whatever the array level, back up the data before troubleshooting.