As you do, I was reading up on RAID levels while in the bath. The topic of atomicity came up, and it’s something I wanted to share.
Wikipedia isn’t usually the most reliable source of technical data, but I’ll quote it here to help explain atomicity and set the stage. Taken from http://en.wikipedia.org/wiki/RAID, under the “Problems with RAID” section…
This is a little understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote “Update in Place is a Poison Apple” during the early days of relational database commercialization. However, this warning largely went unheeded and fell by the wayside upon the advent of RAID, which many software engineers mistook as solving all data storage integrity and reliability problems. Many software programs update a storage object “in-place”; that is, they write a new version of the object on to the same disk addresses as the old version of the object. While the software may also log some delta information elsewhere, it expects the storage to present “atomic write semantics,” meaning that the write of the data either occurred in its entirety or did not occur at all.
This has come to light again recently, but under a different guise, with SSD write-failure problems. Many SSD manufacturers and enterprise storage vendors are addressing this with new firmware that writes all data sequentially, never over-writing a data block until the whole disk has been written, and only then starting to over-write blocks from the start (blocks that have obviously been freed up first).
However, this issue also affects traditional spinning media, where it is often overlooked or dismissed without a clear explanation or understanding. The idea is that many systems over-write data in place: the write is confirmed as successful, but the data on disk is not necessarily verified against what the host sent. Checking this would add considerable overhead, as every write would need an additional read and checksum before the write could be confirmed and the write cache flushed.
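To make the overhead concrete, here is a minimal sketch of the difference between a plain in-place write and a verified one. The function names and the dict-as-disk model are purely illustrative, not any real storage API:

```python
import hashlib

# A toy "disk": a dict mapping block addresses to bytes.
disk = {}

def write_block(addr, data):
    """Plain in-place write: the disk simply acknowledges the write.
    A real disk could silently store something other than `data`."""
    disk[addr] = data

def write_block_verified(addr, data):
    """Write, then read the block back and compare checksums before
    confirming. This is the extra read + checksum per write that makes
    verification expensive in practice."""
    disk[addr] = data
    readback = disk[addr]
    if hashlib.sha256(readback).digest() != hashlib.sha256(data).digest():
        raise IOError("block %r failed read-back verification" % addr)
    return True  # only now could the write cache entry safely be flushed
```

Every verified write costs an extra read plus two checksum computations, which is exactly the load most storage systems choose not to pay.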
This can be compounded by so-called “Copy on Write” snapshot technologies. Rather than preserving the data already written to a particular sector on disk, the original data is copied to a snapshot area in a different part of the storage system before the original sector is overwritten. So a high-transaction application that overwrites its data regularly (say, a temp DB, or replay logs that get flushed regularly, like Oracle logs before archiving) can be quite susceptible to this sort of error. The main issue is that once the data is written and confirmed, there is no way of correcting it, as the storage system will report it as intact. This can have a massive knock-on effect on data de-duplication: if the initial block is written to a corrupt sector without being identified, it can then be linked to hundreds of other data blocks as part of the de-duplication process, causing massive data corruption.
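The de-duplication amplification is easy to demonstrate with a toy model. This is a sketch under stated assumptions (content-hash dedup, a silently corrupt first write); the structures and names are hypothetical, not how any particular vendor implements dedup:

```python
import hashlib

# Toy dedup store: physical blocks keyed by content hash,
# logical blocks are just references to those hashes.
physical = {}      # content hash -> bytes actually on disk
references = {}    # logical block id -> content hash

def dedup_write(logical_id, data, corrupt_on_first_write=False):
    h = hashlib.sha256(data).hexdigest()
    if h not in physical:
        stored = data
        if corrupt_on_first_write:
            # Simulate a silently corrupted sector: the hash was taken
            # from the host's data, but the bytes landing on disk differ.
            stored = b"\x00" * len(data)
        physical[h] = stored
    references[logical_id] = h  # every later duplicate just links here

# The first write of this content lands on a bad sector, unnoticed.
dedup_write("block-1", b"important data", corrupt_on_first_write=True)
# Later writes of identical content only add references...
for i in range(2, 6):
    dedup_write("block-%d" % i, b"important data")
# ...so every one of them now reads back corrupt.
corrupt = [b for b in references if physical[references[b]] != b"important data"]
```

One bad sector, never re-verified, and every deduplicated reference to it is silently bad too.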
This can’t always be fixed by RAID parity, as parity is calculated after a stripe is written. Nor can it always be calculated in memory, because a full stripe isn’t always written: for a partial stripe, the parity has to be calculated from existing data on disk as well as the data not yet written. If data is written to disk and then read back in order to calculate parity, it is not necessarily checked against the source. There are several ways to address this, but mostly the check needs to happen in memory, and a checksum is generally considered the acceptable approach. Reading the data back later, after a confirmed write, guarantees nothing, as there is nothing left to compare it against; the integrity needs to be checked while the data is still active in memory.
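The partial-stripe case can be sketched with the standard RAID-5 XOR arithmetic. This is a minimal illustration of read-modify-write parity, assuming a three-data-disk stripe; it is not any vendor’s implementation:

```python
def xor_blocks(a, b):
    """XOR two equal-length byte blocks, as RAID-5 parity does."""
    return bytes(x ^ y for x, y in zip(a, b))

# Full-stripe write: parity computed over all data members in memory.
d0, d1, d2 = b"\x01" * 4, b"\x02" * 4, b"\x04" * 4
parity = xor_blocks(xor_blocks(d0, d1), d2)

# Partial-stripe update of d1 (read-modify-write): new parity is
# old_parity XOR old_d1 XOR new_d1. Note that old_d1 and old_parity are
# read back from disk, and nothing re-verifies them against what the
# host originally sent -- corruption there flows straight into parity.
new_d1 = b"\x07" * 4
parity = xor_blocks(xor_blocks(parity, d1), new_d1)
d1 = new_d1

# Parity now reconstructs d1 from the other stripe members.
assert xor_blocks(xor_blocks(parity, d0), d2) == d1
```

Parity here is internally consistent with whatever was on disk, which is exactly why it cannot detect a write that was already silently wrong.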
There are several ways that storage vendors tackle this, and as you would expect I’ll cover what NetApp do. The WAFL file system writes to free data blocks and never actively over-writes a data block. To create free data blocks, a scrub process runs in the background: it walks the entire storage system block by block and interrogates whether a snapshot or the active filesystem is pointing at each particular data block. If nothing is, it clears the block and marks it as free (or un-marks it as in-use, which would probably be more correct). This not only confirms that a data block is genuinely not in use; as a side effect it also spreads writes across the entire surface area of the disk, which negates or minimises the effects of the atomicity problem. Additionally, the WAFL scrub process tests data blocks for disk integrity. This is how disks can be pre-failed on the basis of surface integrity rather than physical failure: after a defined threshold of failed sectors, the disk is failed, a recovery is attempted and a hot spare is activated. So in a NetApp system, the same data blocks are rarely written to repeatedly, even (or especially) in a high-repeat-transaction system.
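The scrub logic described above can be sketched in a few lines. This is loosely modelled on the background process as described here; the structures and names are illustrative assumptions, not NetApp internals:

```python
# Toy block map: every block starts out marked in-use.
blocks = {n: {"in_use": True} for n in range(8)}

# Which blocks the active filesystem and snapshots still point at.
active_fs_refs = {0, 2, 3}
snapshot_refs = {3, 5}

def scrub():
    """Walk every block; if neither the active filesystem nor any
    snapshot references it, un-mark it as in-use (i.e. free it)."""
    freed = []
    for n, blk in blocks.items():
        referenced = n in active_fs_refs or n in snapshot_refs
        if blk["in_use"] and not referenced:
            blk["in_use"] = False
            freed.append(n)
    return freed

freed = scrub()  # unreferenced blocks become available for new writes
```

New writes then always land on freed blocks rather than over-writing live data in place, which is what spreads the write load across the disk.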
Take all of the above, and you also start to realise that a full filesystem is bad for your storage in other ways too. If your storage system is full, there are fewer free blocks to write to, so a smaller pool of data blocks is written to continuously. This compounds the chances of atomicity-related corruption and generally increases disk wear. So it’s a good reason to look at data archiving and de-duplication, and generally to keep your file systems clean and not abuse your storage systems!
So please ask your storage vendor how they protect your data against these issues.