4K-sector drives and Linux
Almost exactly one year ago, LWN examined the problem of 4K-sector drives and the reasons for their existence. In short, going to 4KB physical sectors allows drive manufacturers to increase storage density, always welcome in that competitive market. Recently, there have been a number of reports that Linux is not ready to work with these drives; kernel developer Tejun Heo even posted an extensive, worth-reading summary stating that "4 KiB logical sector support is broken in both the kernel and partitioners." As the subsequent discussion revealed, though, the truth of the matter is that we're not quite that badly prepared.
Linux is fully prepared for a change in the size of physical sectors on a storage device, and has been for a long time. The block layer was written with an avoidance of hardwired sector sizes in mind. Sector counts and offsets are indeed managed as 512-byte units at that level of the kernel, but the block layer is careful to perform all I/O in units of the correct size. So, one would hope, everything would Just Work.
But, as Tejun's document notes, "unfortunately, there are complications." These complications result from the fact that the rest of the world is not prepared to deal with anything other than 512-byte sectors, starting with the BIOS found on almost all systems. In fact, a BIOS which can boot from a 4K-sector drive is an exceedingly rare item - if, indeed, it exists at all. Fixing the BIOS is evidently harder than one might think, and, evidently, there is little motivation to do so. Martin Petersen, who has done much of the work around supporting these drives in Linux, noted:
The problem does not just exist at the BIOS level: bootloaders (whether they are Linux-oriented or not) are not set up to handle larger sectors; neither are partitioning tools, not to mention a wide variety of other operating systems. Something must be done to enable 4K-sector drives to work with all of this software.
That something, of course, is to interpose a mapping layer in the middle. So most 4K-sector drives will implement separate logical and physical sector sizes, with the logical size - the one presented to the host computer - remaining 512 bytes. The system can then pretend that it's dealing with the same kind of hardware it has always dealt with, and everything just works as desired.
Except that, naturally enough, there are complications. A 512-byte sector written to a 4K-sector drive will now force the drive to perform a read-modify-write cycle to avoid losing the data in the rest of the sector. That slows things down, of course, and also increases the risk of data loss should something go wrong in the middle. To avoid this kind of problem, the operating system should do transfers that are a multiple of the physical sector size whenever possible. But, to do that, it must know the physical sector size. As it happens, that information has been made available; the kernel makes use of this information internally and exports it via sysfs.
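As a rough illustration of the sysfs export mentioned above, the following sketch reads the block-queue attributes that current kernels place under /sys/block/<dev>/queue/. The device name "sda", the function name read_io_limits, and the sysfs_root parameter are illustrative assumptions rather than anything taken from the article; attributes that a given kernel does not provide are simply reported as None.

```python
from pathlib import Path

# Block-queue attributes exported under /sys/block/<dev>/queue/;
# the kernel reports each of them in bytes.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_io_limits(device, sysfs_root="/sys/block"):
    """Return the sector and I/O sizes the kernel reports for `device`.

    `device` is a bare name such as "sda" (an example only).
    Attributes that cannot be read are returned as None.
    """
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for attr in QUEUE_ATTRS:
        try:
            limits[attr] = int((queue / attr).read_text().strip())
        except (OSError, ValueError):
            limits[attr] = None
    return limits


if __name__ == "__main__":
    # "sda" is an assumed device name; substitute the drive of interest.
    for name, value in read_io_limits("sda").items():
        print(f"{name}: {value}")
```

A drive that hides 4K physical sectors behind 512-byte logical sectors would be expected to show different values for logical_block_size and physical_block_size.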
It is not quite that simple, though. The Linux kernel can go out of its way to use the physical sector size, and to align all transfers on 4KB boundaries from the beginning of the partition. But that goes badly wrong if the partition itself is not properly aligned; in this case, every carefully-arranged 4KB block will overlap two physical sectors - hardly an optimal outcome.
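The overlap described here can be checked with a little arithmetic. The sketch below is a simplified model that assumes 512-byte logical sectors, 4096-byte physical sectors, and a drive whose logical sector 0 sits at the start of a physical sector; the function name is made up for the example.

```python
LOGICAL = 512    # bytes per logical sector, as presented to the host
PHYSICAL = 4096  # bytes per physical sector on the media


def physical_sectors_touched(partition_start_lba, block_index, block_size=4096):
    """Count the physical sectors covered by one filesystem block."""
    first_byte = partition_start_lba * LOGICAL + block_index * block_size
    last_byte = first_byte + block_size - 1
    return last_byte // PHYSICAL - first_byte // PHYSICAL + 1


for start in (63, 64):
    # A partition starting at sector 63 makes each 4KB block straddle two
    # physical sectors; starting at sector 64 keeps it within one.
    print(start, physical_sectors_touched(start, block_index=0))
```

Under these assumptions, a partition starting at logical sector 63 spreads every 4KB block over two physical sectors, while one starting at sector 64 touches only one.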
As it happens, badly-aligned partitions are not just common; they are the norm. Consider an example: your editor was a lucky recipient of an Intel solid-state drive at the Kernel Summit which was quickly plugged into his system and partitioned for use. It has been a great move: git repositories on an SSD are much nicer to work with. A quick look at the partition table, though, shows this:
    Disk /dev/sda: 80.0 GB, 80026361856 bytes
    255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x5361058c

       Device Boot      Start         End      Blocks   Id  System
    /dev/sda1              63    52452224    26226081   83  Linux
Note that fdisk, despite having been taken out of the "DOS compatibility" mode, is displaying the drive dimensions in units of heads and cylinders. Needless to say, this device has neither; even on rotating media, those numbers are entirely fictional; they are a legacy from a dark time before Linux even existed. But that legacy is still making life difficult now.
Once upon a time, it was determined that 63 (512-byte) sectors was far more than anybody would be able to fit into a single disk track. Since track-aligned I/O is faster on a rotating drive, it made sense to align partitions so that the data began at the beginning of a track. So, traditionally, the first partition on a drive begins at (logical) sector 63, the last sector of the first track. That sector holds the boot block; any filesystem stored on the partition will follow at the beginning of the next track. That placement, of course, misaligns the filesystem with regard to any physical sector size larger than 512 bytes; logical sector 64 (the first data sector in the partition) will be placed at the end of a 4K physical sector. Any subsequent partitions on the device will almost certainly be misaligned in the same way.
One might argue that the right thing to do is to simply ditch this particular practice and align partitions properly; it should not be all that hard to teach partitioning tools about physical sector sizes. This can certainly be done. The tools have been slow to catch on, but a suitably motivated system administrator can usually convince them to place partitions sensibly even now. So weird alignments should not be an insurmountable problem.
Unfortunately, there are complications. It would appear that Windows XP not only expects misaligned partitions; it actually will not function properly without them. One simply cannot run XP on a device which has been properly partitioned for 4K physical sector sizes. To cope with that, drive manufacturers have introduced an even worse hack: shifting all 512-byte logical sectors forward by one, so that logical sector 64 lands at the beginning of a physical sector. So any partitioning tool which wants to lay things out properly must know where the origin of the device actually is - and not all devices are entirely forthcoming with that information.
With luck, the off-by-one problem will go away before it becomes a big issue. As James Bottomley put it: "...fortunately very few of these have been seen in the wild and we're hopeful they can be shot before they breed." But that doesn't fix the problem with the alignment of partitions for use by XP. Later versions of Windows need not concern themselves with this problem, since they rarely coexist with XP (and Windows has never been greatly concerned about coexistence with other systems in general). Linux, though, may well be installed on the same drive as XP; that leads to differing alignment requirements for different partitions. Making that just work is not going to be fun.
Martin suggests that it might be best to just ignore the XP issue:
It may well be that there will not be a significant number of XP installations on new-generation storage devices, but failure to support XP may still create some misery in some quarters.
A related issue pointed out by Tejun is that the DOS partition format, which is still widely used, tops out at 2TB, which just does not seem all that large anymore. Using 4K logical sectors in the partition table can extend that limit as far as 16TB, but, again, that requires cooperation from the BIOS - and it still does not seem all that large. The long-term solution would appear to be moving to a partition format like GPT, but that is not likely to be an easy migration.
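The 2TB and 16TB figures follow from the DOS partition table storing sector numbers as 32-bit values. A minimal sketch of the arithmetic (the function name is illustrative):

```python
def mbr_limit_bytes(sector_size):
    """Largest size addressable with a 32-bit sector count."""
    return (2 ** 32) * sector_size


TIB = 2 ** 40
for size in (512, 4096):
    print(f"{size}-byte sectors: {mbr_limit_bytes(size) // TIB} TiB")
```

With 512-byte logical sectors the limit works out to 2 TiB; with 4096-byte logical sectors it is 16 TiB.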
In summary: Linux is not all that badly placed to support 4K-sector drives,
especially when there is no need to share a drive with older operating
systems. There is still work required at the tools level to make that
support work optimally without the need for low-level intervention by
system administrators, but that is, as they say, just a matter of a bit of
programming. As these drives become more widely available, we will be able
to make good use of them.
Index entries for this article | |
---|---|
Kernel | Block layer/Large physical sectors |
4K-sector drives and Fedora
Posted Mar 9, 2010 23:24 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link]
This comment adds a lot of important details
4K-sector drives and Fedora
Posted Mar 10, 2010 13:33 UTC (Wed) by Trelane (subscriber, #56877) [Link]
4K-sector drives and Fedora
Posted Mar 11, 2010 16:28 UTC (Thu) by msnitzer (subscriber, #57232) [Link]
stack has already been updated upstream and will be included in Fedora 13:
http://lkml.org/lkml/2010/3/11/230
I think it is unfortunate that the article took the time to highlight Linux's preparedness
for 4K sector drives but failed to accurately convey the true state of the various
pieces involved, even inaccurately concluding that there is much work ahead for
the various Linux tools.
The article focused on Tejun's misunderstanding and worked from there rather
than incorporating the more promising state of Linux's preparedness that was
revealed in reply to Tejun's post.
4K-sector drives and Fedora
Posted Mar 11, 2010 16:45 UTC (Thu) by corbet (editor, #1) [Link]
Interesting...I thought I wrote an article saying that, Tejun's worries notwithstanding, we're not in all that bad a shape. As far as I can tell, the only thing I got really wrong was regarding XP, which can handle things better than I had thought. As for the rest...what does it take to make a partitioning utility a little smarter? Anyway, my apologies if you feel I misrepresented the situation.
4K-sector drives and Fedora
Posted Mar 11, 2010 17:09 UTC (Thu) by msnitzer (subscriber, #57232) [Link]
Not a big deal.. I was just saying that many of the partition tools and others have
already been updated. The "what does it take" to update them is somewhat
moot; as it has been done. Updating LVM was a bit more involved than
mkfs.ext3. Updating virtio and qemu was also somewhat intrusive.
But for those tools that haven't been updated they will either need to use libblkid
(like e2fsprogs does) or use the new "IO topology" block ioctls. Please see this
for more info (contains the specific ioctls and more):
http://people.redhat.com/msnitzer/docs/io-limits.txt
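For readers curious what the block ioctl route might look like from a script, here is a hedged sketch. The request numbers are those defined in the kernel's <linux/fs.h>; the helper names (_io, query, io_topology) are invented for this example, and opening a block device normally requires suitable privileges.

```python
import fcntl
import os
import struct


def _io(ioc_type, nr):
    """Encode a Linux _IO() request number (no direction, no argument size)."""
    return (ioc_type << 8) | nr


# Request numbers as defined in <linux/fs.h>.
BLKSSZGET = _io(0x12, 104)    # logical sector size (int)
BLKIOMIN = _io(0x12, 120)     # minimum I/O size (unsigned int)
BLKIOOPT = _io(0x12, 121)     # optimal I/O size (unsigned int)
BLKALIGNOFF = _io(0x12, 122)  # alignment offset (int)
BLKPBSZGET = _io(0x12, 123)   # physical sector size (unsigned int)


def query(fd, request, fmt="I"):
    """Issue one ioctl that fills in a single native integer."""
    buf = bytearray(struct.calcsize(fmt))
    fcntl.ioctl(fd, request, buf, True)
    return struct.unpack(fmt, buf)[0]


def io_topology(path):
    """Return the I/O topology of the block device at `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return {
            "logical_sector_size": query(fd, BLKSSZGET, "i"),
            "physical_sector_size": query(fd, BLKPBSZGET),
            "minimum_io_size": query(fd, BLKIOMIN),
            "optimal_io_size": query(fd, BLKIOOPT),
            "alignment_offset": query(fd, BLKALIGNOFF, "i"),
        }
    finally:
        os.close(fd)


# Example call (device path is an assumption): io_topology("/dev/sda")
```

Tools written in C would more likely use libblkid or call these ioctls directly, as the comment suggests; the Python form is only meant to show which values are available.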
4K-sector drives and Fedora
Posted Mar 11, 2010 18:47 UTC (Thu) by ricwheeler (subscriber, #4980) [Link]
One thing that seems to have been skipped in the discussion so far is that this is not just an issue with local, 4KB sector drives. The changes we have in the kernel and in the tool chain will help with external arrays which have long had larger internal "sectors" but pretended to have 512 byte sectors like a local disk.
The larger impact of the change is that it should all "just work" if we got all of the bits in place correctly :-)
Testing on a variety of storage hardware from various vendors, with and without DM and MD, is really, really interesting right now to help us uncover any bits we did miss.
4K-sector drives and Linux
Posted Mar 9, 2010 23:43 UTC (Tue) by pheldens (guest, #19366) [Link]
RAID: was 4K-sector drives and Linux
Posted Mar 10, 2010 2:36 UTC (Wed) by smoogen (subscriber, #97) [Link]
But that would be my guess. If it's different let us know.
4K-sector drives and Linux
Posted Mar 10, 2010 21:21 UTC (Wed) by pheldens (guest, #19366) [Link]
Default fdisk misaligns the sdb1 data area (63)
changing it to 64 makes formats 20-30% faster
changing it to 65-71 slower again,
changing it to 72 faster again. (+8 presumably 4096/512)
Will see what happens when I put it in an array aligned
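The pattern reported above (64 and 72 fast, 65-71 slow) is what one would expect if only start sectors that are multiples of 4096/512 = 8 fall on a physical-sector boundary. A small sketch, assuming no device-level offset and using an illustrative function name:

```python
def aligned_starts(first, last, logical=512, physical=4096):
    """Start sectors in [first, last] that fall on a physical-sector boundary."""
    per_physical = physical // logical
    return [lba for lba in range(first, last + 1) if lba % per_physical == 0]


print(aligned_starts(63, 72))  # [64, 72]
```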
4K-sector drives and Linux
Posted Mar 9, 2010 23:54 UTC (Tue) by pheldens (guest, #19366) [Link]
"...For /dev/sdc, I used fdisk the same as with sdd, but after creating the partition, I realigned it. I did this by entering expert mode ("x"), then setting the start sector ("b") to 64."
for about twice the write performance, compared to defaults (63).
thanks for this important tip.
4K-sector drives and Linux
Posted Mar 10, 2010 9:25 UTC (Wed) by Darkmere (subscriber, #53695) [Link]
parted even goes as far as to yell at you if you have the wrong alignment, and offers to fix it for you.
However, you may have to use % based partitions for parted to be able to fix the auto-alignment, otherwise it complains.
mkpart primary ext3 0% +200M
mkpart primary ext4 200M 100%
Or however you want to work things.
4K-sector drives and Linux
Posted Mar 18, 2010 20:38 UTC (Thu) by till (guest, #50712) [Link]
4K-sector drives and Linux
Posted Mar 11, 2010 16:15 UTC (Thu) by jackb (guest, #41909) [Link]
4K-sector drives and Linux
Posted Mar 10, 2010 9:25 UTC (Wed) by ringerc (subscriber, #3071) [Link]
I have six, and another six on back-order :-(
The WD Green series (at least the 1TB drive WDC WD10EARS-00Y5B1) have 4kb sectors and offset-by-one. Thankfully, that awful hack is disabled by default and must be turned on by the application of a jumper across a pin pair on the back of the drive.
This is fine so long as the host OS can probe the disk to find out whether or not it's so jumpered, but somehow I doubt it's so easy.
4K-sector drives and Linux
Posted Mar 10, 2010 21:49 UTC (Wed) by kjp (guest, #39639) [Link]
4K-sector drives and Linux
Posted Mar 11, 2010 19:41 UTC (Thu) by cmccabe (guest, #60281) [Link]
I agree with you. We have to move to GPT soon anyway, unless you think that 2 TB should be enough for anyone.
Rather than fooling with this C/H/S nonsense, we should just fix BIOSes and such so that they can use the new format.
> And does it fix XP as well?
The 32 bit version of Windows XP can only work with MBRs, never GPTs. There is some weird way that you can have both an MBR and a GPT on your disk, but it looks ugly... very ugly. And it still doesn't let you break the 2 TB limit under XP.
On the other hand, the 64-bit version of Windows XP can read and write GPT partitions, but can't boot from them.
4K-sector drives and Linux
Posted Mar 12, 2010 21:46 UTC (Fri) by cmccabe (guest, #60281) [Link]
So that should buy the old MBR scheme another few years.
4K-sector drives and Linux
Posted Mar 15, 2010 10:16 UTC (Mon) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]
But only for disks with 4096 bytes per LOGICAL sector and 4096 bytes per PHYSICAL sector; you can't find those drives on the market.
Only drives with 512 bytes per LOGICAL sector and 4096 bytes per PHYSICAL sector are for sale; those are 100% compatible with current 512-bytes-per-sector drives, just slower in some cases (so they have the same limits).
XP compatibility
Posted Mar 10, 2010 21:49 UTC (Wed) by mrpippy (guest, #57134) [Link]
(http://www.anandtech.com/storage/showdoc.aspx?i=3691&p=2), I got the impression that
Windows XP can function with aligned partitions, it's just not possible to create them with any of
XP's partition editors.
That's why XP either needs to run with the sector+1 hack, or use the WD Align tool after
installation that literally shifts the entire partition back one sector. Both methods result in an
aligned partition that XP can run off of.
4K-sector drives and Linux
Posted Mar 11, 2010 5:44 UTC (Thu) by ranmachan (guest, #21283) [Link]
That's not true.
If you use 32 sectors per track you'll create only correctly aligned partitions. (If the disk doesn't use an offset)
Of course you'll need to go to expert mode in fdisk to change the sector count and you can only do that on a yet unpartitioned disk because IIRC it doesn't recalculate the CHS values for already existing partitions.
And you're still screwed with drives that use an offset. :)
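The 32-sectors-per-track suggestion can be checked numerically: a track boundary is also a 4K boundary only when it is a multiple of eight 512-byte sectors. The sketch below assumes no device-level offset, and the function name is made up for the example.

```python
from math import gcd


def first_aligned_track_start(sectors_per_track, logical=512, physical=4096):
    """First nonzero track boundary that is also a physical-sector boundary."""
    per_physical = physical // logical
    # Least common multiple of the track length and sectors per physical sector.
    return sectors_per_track * per_physical // gcd(sectors_per_track, per_physical)


for spt in (63, 32):
    print(spt, first_aligned_track_start(spt))
```

With 63 sectors per track only every eighth track boundary (sector 504, 1008, ...) is aligned, whereas with 32 sectors per track every track boundary is.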
Why partition alignment?
Posted Mar 11, 2010 8:43 UTC (Thu) by PO8 (guest, #41661) [Link]
Why partition alignment?
Posted Mar 11, 2010 15:50 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]
Why partition alignment?
Posted Mar 11, 2010 17:15 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]
1. The EXTxFS superblock is no longer located at 1 Kbyte from the beginning of the partition but at the 3rd sector, i.e. LBA=2.
That then only makes unreadable the EXTxFS located on DVD-RAM or the EXTxFS images written to CDROM/DVDs.
Also, it seems strange to search for a signature in the middle of a sector when the device has 4096 bytes/sector.
2. The EXTxFS superblock is located at the 3rd *physical* sector of the partition.
Then to mount the FS the software has to scan a few sectors to see if it finds an EXT* superblock, and the old mount command can probably handle the "-o offset=1" parameter.
Why partition alignment?
Posted Mar 11, 2010 18:39 UTC (Thu) by PO8 (guest, #41661) [Link]
Why partition alignment?
Posted Mar 11, 2010 19:33 UTC (Thu) by cmccabe (guest, #60281) [Link]
So when they get a request for a 512-byte write, rather than doing the read-modify-write of a 16k block, they wait to see if the user wants to do any more I/O to that erase block.
The disadvantages of write coalescing are kind of obvious-- it's complex, requires temporary storage (for the un-coalesced 512-byte chunks). More buffering also means there's a longer window when power failures can result in data loss.
Overall, it's not something you want to do unless you absolutely have to. Performance and stability would be a lot better if the kernel knew about the real situation on the hardware.
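To make the trade-off concrete, here is a toy model of write coalescing. It is a hypothetical sketch, not a description of any real drive's firmware: the class name, the 16k block size taken from the comment above, and the counters are all assumptions made for illustration.

```python
class CoalescingBuffer:
    """Hypothetical model of write coalescing; not any real drive's firmware."""

    def __init__(self, block_size=16384, sector_size=512):
        self.sectors_per_block = block_size // sector_size
        self.pending = {}      # block index -> {slot within block: data}
        self.full_writes = 0   # blocks written whole, no read needed
        self.rmw_writes = 0    # blocks that needed read-modify-write

    def write(self, lba, data):
        block, slot = divmod(lba, self.sectors_per_block)
        slots = self.pending.setdefault(block, {})
        slots[slot] = data
        if len(slots) == self.sectors_per_block:
            # Every sector of the block is buffered: write it out whole.
            del self.pending[block]
            self.full_writes += 1

    def flush(self):
        # Partially filled blocks must be merged with data already on the media.
        self.rmw_writes += len(self.pending)
        self.pending.clear()


buf = CoalescingBuffer()
for lba in range(32):          # 32 x 512 bytes = one full 16k block
    buf.write(lba, b"\0" * 512)
buf.write(40, b"\0" * 512)     # a lone sector in the next block
buf.flush()
print(buf.full_writes, buf.rmw_writes)
```

Anything still sitting in pending when power is lost is gone, which is the longer data-loss window the comment refers to.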
4K-sector drives and Linux
Posted Mar 11, 2010 12:32 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]
http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx
> Formatting Volumes: Formatting also aligns clusters at the cluster size boundary.
Same for FAT created on the other OS, we can read a bit further:
> Because formatting in Windows Server 2003 aligns FAT data clusters at the cluster size boundary
The FAT filesystem cluster alignment can be modified (and it seems to be the same for NTFS) depending on the alignment of the first sector; it means that you will not generate the same FAT for two partitions which have the same size but are aligned differently - as a consequence you cannot directly copy them either (by "dd").
The Gujin bootloader is aware of that when creating FATs.
I did not find such a field in the EXT2/3/4 filesystem to ignore some sectors at the beginning of the FS.
About bootloaders and 4096 sectors, the Gujin bootloader may be able to help thanks to its minimal IDE driver in the 512 bytes MBR, but the problem is a lack of hardware to test:
http://www.wdc.com/en/products/products.asp?driveid=336
says drive WD10EACS has 1,000,204 MB and 1,953,525,168 sectors, i.e. (1,000,204 * 1000 * 1000) / 1,953,525,168 = 512 bytes/sectors
http://www.wdc.com/en/products/products.asp?driveid=763
says drive WD10EARS has exactly the same 512 bytes/sectors
and WD10EARS-00Y5B1 doesn't even have a hit on WD web site... Is that available in UK?
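The division quoted above can be reproduced as follows, using only the figures given in the comment. Note that, as pointed out elsewhere in this thread, drives with 4096-byte physical sectors still present 512-byte logical sectors, so a quotient like this would reflect the logical size only.

```python
capacity_bytes = 1_000_204 * 1000 * 1000   # 1,000,204 MB, decimal megabytes
sector_count = 1_953_525_168               # sectors listed for the drive

bytes_per_sector = capacity_bytes / sector_count
print(round(bytes_per_sector))
```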
BTW, GPT is quite easy to use to define partitions.
4K-sector drives and Linux
Posted Mar 18, 2010 15:44 UTC (Thu) by welinder (guest, #4699) [Link]
That wastes 7 extra "cylinders" (7*63*512 = ~250KB) but is aligned
any way you look at it. XP should be able to read it.
4K-sector drives and Linux
Posted Mar 18, 2010 20:24 UTC (Thu) by till (guest, #50712) [Link]
4K-sector drives and Linux
Posted Apr 25, 2010 15:23 UTC (Sun) by ramiro_morales (guest, #65623) [Link]
> [...] it made sense to align partitions so that the data began at the beginning of a track. So, traditionally, the first partition on a drive begins at (logical) sector 63, the last sector of the first track.
> That sector holds the boot block; any filesystem stored on the partition will follow at the beginning of the next track.
I think the two last paragraphs are incorrect. Logical sector 63 is the first sector of the second track, sectors 0-62 are in the first track. So the first partition is completely (both administrative overhead and data) located on the second track.
Not that this matters now that legacy emulated geometry is finally getting obsoleted.
4K-sector drives and Linux
Posted Dec 12, 2011 20:17 UTC (Mon) by derickmoore (guest, #81787) [Link]
I can't speak for any 'other' BIOS, but the 2nd Generation SAS products from LSI do have 4K sector support under INT13 Boot.
The only drawback to that support is the RMW (Read/Modify/Write) cycles that must take place in an environment that has no concept of 4K sectors.
The BIOS is smart enough to 'package' accesses and minimize reads in consecutive blocks, but of necessity is unable to do anything about the many one sector (512 byte) accesses.
On the other hand it does 'remember' previous reads into a 4K block, so that consecutive 512 byte accesses don't reread the 'same' 4K block when it hasn't changed. (Remember that INT13 is single threaded)
I don't know who this might help, but there it is!
Derick
P.S. I wish someone would update the DOS version of GDISK to work with 4K drives. Currently, it blows up when it sees one!