On 11/26/2009 04:31 AM, Hans de Goede wrote:
> Hi Doug,
>
> That is a lot of information in there, let me try to summarize it, and please let me know if I've missed anything:
>
> 1) The default chunk size for raid4/5/6 is changing. This should not be a problem, as we do not specify a chunk size when creating new arrays.

I thought we did specify a chunk size. Oh well, that just means our default raid array performance will improve dramatically. The old default of 64K was horrible for performance relative to the new 512K default.

           4 disks on MB        5 disks on MB        4 disks on PM
           write     read       write     read       write     read
    64K    509.373   388.870    403.947   370.963    103.743    61.127
    512K   502.123   498.510    460.817   487.720    113.897   111.980

    MB = motherboard ports
    PM = single eSATA port to a port multiplier

Note: going from 4 disks to 5 disks on this one machine resulted in a performance drop, which is a likely indicator that there were bus saturation issues between the memory subsystem and the southbridge, and that 5 disks simply oversaturated the southbridge's capacity.

> 2) The default bitmap chunk size changed; again not a problem, as we don't use bitmaps in anaconda at the moment.
>
> 3) We need to change that non-use of a bitmap: we should use a bitmap by default, except when the array will be used for /boot or swap.

Correct. The typical /boot array is too small to worry about; it can usually be resynced in its entirety in a matter of seconds. Swap partitions shouldn't use a bitmap because we don't want the overhead of sync operations on the swap subsystem, especially since its data is, generally speaking, transient. Other filesystems, especially once you get to 10GB or larger, can benefit from the bitmap in the event of an improper shutdown.

> Questions:
> 1) What command line option should we pass to "mdadm --create" to achieve this?

--bitmap={none,internal}

In the future, if we opt for something other than the default bitmap chunk, then when the above is internal we would also pass:

--bitmap-chunk=

> 4) We need to start specifying a superblock version, preferably version 1.1.

No, we *must* start specifying a superblock version, or else we will no longer be able to boot our machines after a clean install. The new default is 1.1, and I'm perfectly happy to use that as the default, but as far as I'm aware the only boot loader that can use a 1.1 superblock based raid1 /boot partition is grub2, so all the other arches would not be able to boot, and we would have to forcibly upgrade all systems using grub to grub2.

> 5) Specifying a superblock version of 1.1 will render systems non-bootable. I assume this only applies to systems which have a raid1 /boot, so I guess that we need to specify a superblock version of 1.1, except when the raid set will be used for /boot, where we should keep using 0.90.
>
> Questions:
> 1) Is the above correct?

No, not quite. You can use superblock version 1.0 on /boot and grub will then work. Both version 0.90 and version 1.0 superblocks are at the end of the device and do not confuse boot loaders.
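As a concrete sketch of those rules (the device names, raid levels, and partition layout below are made-up placeholders, not anything anaconda currently emits), the two cases might look like this on the command line:

    # /boot: raid1 with metadata at the end of the device (1.0) so grub can
    # still read the filesystem, and no write-intent bitmap
    mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
          --metadata=1.0 --bitmap=none /dev/sda1 /dev/sdb1

    # any other filesystem: raid5 with the new defaults made explicit --
    # 1.1 metadata, 512K chunk, internal write-intent bitmap
    mdadm --create /dev/md1 --run --level=5 --raid-devices=3 \
          --metadata=1.1 --chunk=512 --bitmap=internal \
          /dev/sda2 /dev/sdb2 /dev/sdc2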
Here's a summary of superblock format differences:

Version 0.90:
  - Stored at the end of the device.
  - Has no homehost field in the superblock, but most recent versions of mdadm would hash the name of the machine and use that for half of the UUID, which provided a pseudo homehost entry.
  - Limited to 27 constituent devices.
  - Has no name field in the superblock.
  - Has a preferred-minor field in the superblock.
  - Does not contain sufficient information to distinguish between a superblock at the end of a whole device and a superblock at the end of a single partition on that device (i.e., create a single partition that uses the whole drive, place a version 0.90 superblock on it, and you can pass either the whole disk or the partition to an mdadm assemble command; mdadm can't tell from the superblock whether you passed in the right device).

Common to all version 1.x superblocks:
  - Has homehost and name fields (actually, one field with a max length of 32 chars).
  - Full UUID is generated, none of it hashed, so more bits of randomness in the UUID.
  - No limit on the number of constituent devices.
  - Has no preferred-minor field in the superblock, but one can be emulated by an appropriate entry in the name field.

Version 1.0:
  - Located at the end of the device, where version 0.90 superblocks are also located.
  - Contains sufficient information to differentiate between being a superblock for the whole device and one for just a partition on the device.

Version 1.1:
  - Located at the very beginning of the device. If placed on a whole-disk device, it occupies the same space as the MBR and partition table and does not leave room for them.
  - Data is offset after the superblock, so the normal device can not be used to access the data, only the md device.

Version 1.2:
  - Located at the beginning of the device + 4K. This offset leaves the first 4K for the MBR and partition table. This can, however, cause confusing situations when used on whole-disk devices: you are able to partition the device, but the entire device is the raid device, so the partition is meaningless even if present. It does, however, allow for booting off of these devices (theoretically; I don't think anyone is doing so, and I suspect even grub2 would need more work to make this operational).

> 6) When creating 1.1 superblock sets we need to pass in:
>    --homehost=
>    --name=
>    -e{1.0,1.1,1.2}
>
> Questions:
> 1) Currently when creating a set, we do for example:
>    mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
>    What would this look like with the new mdadm, and especially, what would happen to the /dev/md0 argument?

The /dev/md0 argument is arbitrary. It could be /dev/md0, it could be /dev/md/foobar. However, if we insist on sticking with the old numbered device files, then it is certain that we should also do our best to make sure that the --name field we pass in is in the special format needed to get mdadm to automatically assume we want numbered devices. In this case, --name=0 would be appropriate (a sketch contrasting the two forms follows below).

But this actually ignores a real situation that some of us use to get around the brokenness of anaconda for many releases now. I typically start any install by first burning the install image to CD, then booting into rescue mode, then hand running fdisk on all my disks to get the layout I want, then hand creating md raid arrays with the options I want, then hand creating filesystems or swap spaces on those arrays with the options I want.
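Purely as an illustration of the two forms referred to above — the numbered style with its special-format name, and the named style used when hand-creating arrays — with made-up hostname, array name, and devices:

    # numbered style: the special-format --name ("0") tells mdadm to keep
    # treating this as a numbered array even with a 1.1 superblock
    mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
          --metadata=1.1 --homehost=myhost --name=0 /dev/sda1 /dev/sdb1

    # named style: the /dev/mdX argument becomes a name under /dev/md/
    mdadm --create /dev/md/root --run --level=1 --raid-devices=2 \
          --metadata=1.1 --homehost=myhost --name=root /dev/sda1 /dev/sdb1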
Then I reboot into install mode on the same CD, and when it gets to the disk layout I specify a custom layout and simply use all the filesystems and md raid devices I created previously. However, even if I use version 1.x superblocks, and even if I use named md raid arrays, anaconda always insists on ignoring the names I've given them and assigning them numbers. Of course, the numbers don't necessarily match up to the order in which I created them, so I have to guess at which numbered array corresponds to which named array (unless there are obvious hints like different sizes, but the last time I did this I had 7 arrays that were all the same size, each intended to be a root filesystem for a different version of either RHEL or Fedora). Then, once the install is complete, I have to go back into rescue mode, remount the root filesystem, hand edit mdadm.conf to use names instead of numbers, remake the initrd images (now dracut images), and change any fstab entries before I can finally use the names. Really, it's *very* annoying that this minor number dependence in anaconda has gone on so long. It was outdated 7 or 8 Fedora releases ago.

> If we can still specify which minor to use when creating a new array, even though that minor may change after the first reboot, then the amount of changes needed to the installer is minimal and we can likely do this without problems for RHEL-6.

I don't understand. Please enlighten me as to these requirements on minor numbers in the installer. After all, it's not like there isn't a simple means of naming these things:

  - If the md raid device is used for an lvm pv, name it /dev/md/pv-#
  - If the md raid device is used for swap, name it /dev/md/swap-#
  - If the md raid device is used for /, name it /dev/md/root
  - If the md raid device is used for any other data partition, name it /dev/md/<name>

And it's not like anaconda doesn't already have that information available when it's creating filesystem labels, so I'm curious why it's so hard to use names instead of numbers for arrays in anaconda.

> Regards,
> Hans

On 11/26/2009 03:59 AM, Doug Ledford wrote:

Please keep me on the Cc: as I'm not on this list. Upstream recently released mdadm-3.1.1, which I intend to include in Fedora soon. It finally updates three default settings that should have been updated a long time ago.

The default chunk size for raid4/5/6 is now 512K. Anaconda needs to be updated to either leave the default alone or use 512K itself. In the past it has passed in 256K, but extensive performance testing shows that 512K is indeed the sweet spot on pretty much any SATA device, and since SATA makes up the overwhelming majority of disks we run on today, its sweet spot should be our default.

It updates the default bitmap chunk to be at least 65536K when using an internal bitmap. Performance tests showed as much as a 10% performance penalty for the old default bitmap chunk (8192K). The new bitmap chunk reduces that performance penalty (although we don't have solid numbers on how much... I'll work on that). However, we've never used a bitmap by default on any arrays we create. That needs to change. The simple logic is this: no bitmap on /boot or any swap partitions, use a bitmap on anything else. If we need a bitmap chunk other than the default, I'll follow up here.

It updates the default superblock format from the old, antiquated, deprecated version 0.90 superblock that we should have quit using years ago to version 1.1. This is the real kicker.
Since anaconda has never actively set the superblock metadata version (even though we should have been using 1.1 long ago), it's now going to have to start. The reason is that unless you upgrade machines to use an md raid aware boot loader, such as grub2 for x86 (I have no idea what would work on non-x86 arches), version 1.1 superblocks will render all installs unbootable. More importantly, unless the anaconda team decides to blindly set all superblocks back to the old version 0.90 format, this change necessitates more than just a change to controlling which version of 1.x superblock we use on any given array; it also requires a change to how we create and name arrays in general.

Version 0.90 superblocks are from back in the day when we thought it was smart/reasonable to name arrays by number and to mount scsi devices in fstab by their /dev/ entry. That day is long since gone, dead, and buried. We switched filesystems to mount by label so they are immune to device number changes, and similarly, version 1.x superblocks totally do away with the preferred-minor field in the superblock. Instead, they have homehost and name fields that are used to control device *naming*, not numbering, and on a properly running version 1.x superblock system the device numbers are not guaranteed to be static from boot to boot (although they usually are). This doesn't appear to be much of a problem for dracut, but as an example, I'm attaching the mkinitrd patch I have to apply to an F11 system after every mkinitrd update in order to get initrd images that mount by name properly.

So, those are the major differences. Switching to any of the version 1.x superblocks necessitates that anaconda pass a few arguments that it hasn't in the past. Right now, these are the things anaconda is going to need to start passing in on any mdadm create commands (that I don't currently believe it does, but I haven't checked and could be wrong):

  --homehost=
  --name=
  -e{1.0,1.1,1.2}

In addition, we should start passing the bitmap option as I outlined above. We will also likely need to set the HOMEHOST entry in mdadm.conf, and possibly the AUTO entry as well. And this brings me to a different point. Hans asked me to comment on bz537329. I would suggest people look at my comments there for some additional explanation of why ideas like trying to make things work without mdadm.conf are probably a bad idea.

So here are a few additional things that I think are worth taking into consideration. If an array is listed in mdadm.conf, then *every* item on the array line must match the array or else it will fail to start. This means that ARRAY lines listing things that can later be changed with mdadm --grow can result in the array failing to be found on the next reboot. Therefore, it would be best if each new ARRAY line we write includes nothing besides the name of the array, the metadata version, and the UUID; a sketch of such a minimal config follows below.

If an array is listed in mdadm.conf, then both the --homehost and --name settings will be overridden by the name in the mdadm.conf file, so do not depend on either having an effect for arrays listed in mdadm.conf. However, homehost and name are both used heavily any time the array is not listed in mdadm.conf, so setting them correctly is still important.
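For illustration, a minimal mdadm.conf along those lines might look like the following (the array names and UUIDs are placeholders):

    # /etc/mdadm.conf -- illustrative sketch only
    HOMEHOST <system>
    # (an AUTO line could also be set here, as mentioned above)

    # ARRAY lines kept minimal: name, metadata version, UUID -- nothing
    # that a later mdadm --grow could invalidate
    ARRAY /dev/md/boot metadata=1.0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
    ARRAY /dev/md/root metadata=1.1 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx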
There are a number of common scenarios that make getting homehost and name right important: when you are carrying an array from machine to machine (an external drive tower, a raid1 usb flash drive, etc.), when an array is visible to multiple hosts (such as arrays built over SAN devices), or when you've built a machine to replace an existing machine and you temporarily install the drives from the machine being replaced in the new machine to copy data across, in which case you are starting both your new array and the old array on the same machine. They are also relied upon heavily in order to attempt to satisfy those people who think the md raid stack should work without any mdadm.conf file at all. And there is a special-case exception in the name field that is used to attempt to preserve backward compatibility. The intersection of all these attempts to satisfy various needs is tricky. Here's how names are determined:

1) If the array is identified in mdadm.conf, the name from the ARRAY line is used.

2) If HOMEHOST has been set in the config:
   a) If the array uses a version 0.90 superblock, check whether the HOMEHOST has been encoded in the UUID via hash. If not, treat as foreign; if so, treat as local.
   b) For version 1.x superblocks, check the homehost in the superblock against the configured homehost. If they match, treat as local; else, if the homehost in the superblock is not empty, treat as named foreign; else treat as foreign.

3) Otherwise:
   a) For version 0.90 superblocks, treat the array as foreign.
   b) For version 1.x superblocks, if the homehost in the superblock is set, treat as named foreign; else treat as foreign.

In case 1, the name exactly as it appears in the file is used. In the remainder of the cases, local means to attempt to create the array with the requested number (in the case of 0.90 superblocks) or requested name (in the case of version 1.x superblocks). Foreign means that the array will be started with the requested name plus a suffix. For example, a version 0.90 superblock with a preferred-minor of 0 would get created with a random *actual* minor number and the name /dev/md0_0 (or md0_1 if md0_0 already exists, and so on). A version 1.x superblock with the name root would get created as /dev/md/root_0. Named foreign is used whenever a version 1.x superblock can't be identified as local but has a valid homehost entry in the superblock. The format is /dev/md/homehost:name, so if you were to mount an array from workstation2:root on workstation1, it would show up as /dev/md/workstation2:root.

There is a special exception for version 1.x superblock arrays. If the name field of the superblock contains a specially formatted name, then it will be treated as a request to create the device with a given minor number and name, identical to an old version 0.90 superblock array. Those special-case names are:
   a) a bare number (e.g., 0)
   b) a bare name using the standard number format (e.g., md0 or md_d0)
   c) a full name using the standard number format (e.g., /dev/md0 or /dev/md_d0)

If an array uses a name instead of a number, then the named entry created in /dev/md/ will be a symlink to a random numeric md device in /dev/. For example, /dev/md/root, since it's the first device started and since we start grabbing md devices at 127 and counting backwards when starting named devices, will almost always point to /dev/md127. The /dev/md127 file will be the real device file, while the entries in /dev/md/ are always symlinks. This is in order to be consistent with the fact that our /sys/block entry will be md127 and our entry in /proc/mdstat will also be md127.
This is because the current /sys/block setup does not allow /sys/block/md/root, only md<number>.
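Tying the naming rules together, a purely hypothetical example (hostname, array name, and devices are invented for illustration):

    mdadm --create /dev/md/root --run --level=1 --raid-devices=2 \
          --metadata=1.1 --homehost=workstation1 --name=root \
          --bitmap=internal /dev/sda2 /dev/sdb2

    # On the matching homehost the array is treated as "local":
    #   /dev/md127     <- real device node (named arrays count down from 127)
    #   /dev/md/root   <- symlink to /dev/md127
    # Assembled on a different machine it is treated as "named foreign" and
    # shows up as /dev/md/workstation1:root instead.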