Discussion: zones on shared storage proposal
Edward Pilatowicz
2009-05-21 08:55:15 UTC
hey all,

i've created a proposal for my vision of how zones hosted on shared
storage should work. if anyone is interested in this functionality then
please give my proposal a read and let me know what you think. (fyi,
i'm leaving on vacation next week so if i don't reply to comments right
away please don't take offence, i'll get to it when i get back. ;)

ed
Mike Gerdts
2009-05-21 16:59:22 UTC
On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
Post by Edward Pilatowicz
hey all,
i've created a proposal for my vision of how zones hosted on shared
storage should work.  if anyone is interested in this functionality then
please give my proposal a read and let me know what you think.  (fyi,
i'm leaving on vacation next week so if i don't reply to comments right
away please don't take offence, i'll get to it when i get back.  ;)
ed
I'm very happy to see this. Comments appear below.
Post by Edward Pilatowicz
" please ensure that the vim modeline option is not disabled
vim:textwidth=72
-------------------------------------------------------------------------------
Zones on shared storage (v1.0)
[snip]
Post by Edward Pilatowicz
----------
C.1.i Zonecfg(1m)
The zonecfg(1m) command will be enhanced with the following two new
resources and their associated properties:

rootzpool resource
    src resource property
    install-size resource property
    zpool-preserve resource property
    dataset resource property
zpool resource
    src resource property
    install-size resource property
    zpool-preserve resource property
    name resource property
"rootzpool"
- Description: Identifies a shared storage object (and its
associated parameters) which will be used to contain the root
zfs filesystem for a zone.
"zpool"
- Description: Identifies a shared storage object (and its
associated parameters) which will be made available to the
zone as a delegated zfs dataset.
That is to say "put your OS stuff in rootzpool, put everything else in
zpool" - right?
Post by Edward Pilatowicz
"src"
- Status: Required.
- Format: Storage object uri (so-uri). (See definition below.)
- Description: Identifies the storage object associated with this
resource.
"install-size"
- Status: Optional.
- Format: Integer. Defaults to bytes, but can be flagged as
kilobytes, megabytes, or gigabytes with a k, m, or g suffix,
respectively.
- Description: If the specified storage object doesn't exist at zone
install time it will be created with this specific size. This
property has no effect for storage objects which already exist and
have a pre-defined size.
"zpool-preserve"
- Status: Optional.
- Format: Boolean. Defaults to false.
- Description: When doing an install, if this property is set
to true and a zpool already exists on the specified storage
object it will be used. When doing a destroy, if this property
is set to true, the root zpool will not be destroyed.
"dataset"
- Status: Optional
- Format: zfs filesystem name component (can't contain a '/')
- Description: Name of a dataset within the root zpool to delegate
to the zone.
"name"
- Status: Required
- Format: zfs filesystem name component (can't contain a '/')
- Description: Used as part of the name for a zpool which will be
delegated to the zone.
Zonecfg(1m) "verify" will verify the syntax of any "rootzpool" resource
group (and its properties), but it will NOT verify the accessibility of
any storage specified by by a so-uri. (This is because accessing the
storage specified by an so-uri could require configuration changes to
other subsystems.)
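For illustration only, a zonecfg(1m) session using these proposed
resources might look roughly as follows. The exact syntax is a sketch
based on the resource and property names above, not a final interface,
and the host name and paths are placeholders:

    # zonecfg -z zone1
    zonecfg:zone1> add rootzpool
    zonecfg:zone1:rootzpool> set src=nfs://nfsserver/vol/zones/zone1/root.disk
    zonecfg:zone1:rootzpool> set install-size=8g
    zonecfg:zone1:rootzpool> end
    zonecfg:zone1> commit
    zonecfg:zone1> exit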
----------
C.1.ii Storage object uri (so-uri) format
The storage object uri (so-uri) syntax[03] will conform to the standard
uri format defined in RFC 3986 [04]. The nfs URI scheme is defined in
RFC 2224. The following so-uri formats will be supported:
path:///<file-absolute>
nfs://<host>[:port]/<file-absolute>
vpath:///<file-absolute>
vnfs://<host>[:port]/<file-absolute>
File storage objects point to plain files on local, nfs, or cifs
filesystems. These files are used to contain zpools which store zone
datasets. These are the simplest types of storage objects. Once
created, they have a fixed size, can't be grown, and don't support
advanced features like snapshotting, etc. Some example file so-uri's
are:
path:///export/xvm/vm1.disk
- a local file
path:///net/heaped.sfbay/export/xvm/1.disk
- a nfs file accessible via autofs
nfs://heaped.sfbay/export/xvm/1.disk
- same file specified directly via a nfs so-uri
Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying feature sets, with support for
things like snapshotting, etc. Some common vdisk formats are VDI,
VMDK, and VHD. Some example vdisk so-uri's are:
vpath:///export/xvm/vm1.vmdk
- a local vdisk image
vpath:///net/heaped.sfbay/export/xvm/1.vmdk
- a nfs vdisk image accessible via autofs
vnfs://heaped.sfbay/export/xvm/1.vmdk
- same vdisk image specified directly via a nfs so-uri
Device storage objects specify block storage devices in a host
independent fashion. When configuring FC or iSCSI storage on different
hosts, the storage configuration normally lives outside of zonecfg, and
the configured storage may have varying /dev/dsk/cXtXdX* names. The
so-uri syntax provides a way to specify storage in a host independent
fashion, and during zone management operations, the zones framework can
map this storage to a host specific device path. Some example device
so-uri's are:
- lun 0 of a fc disk with the specified wwn
- lun 0 of an iscsi disk with the specified alias.
iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
- lun 0 of an iscsi disk with the specified target id.
What about if there is already the necessary layer of abstraction that
provides a consistent namespace? For example,
/dev/vx/dsk/zone1dg/rootvol would refer to a block device named rootvol
in the disk group zone1dg. That may reside on a single disk or span
many disks and will have the same name regardless of which host the disk
group is imported on. Since this VxVM volume may span many disks, it
would be inappropriate to refer to a single LUN that makes up that disk
group.

Perhaps the following is appropriate for such situations.

dev:///dev/vx/dsk/zone1dg/rootvol
Post by Edward Pilatowicz
----------
C.1.iii Zoneadm(1m) install
When a zone is installed via the zoneadm(1m) "install" subcommand, the
zones subsystem will first verify that any required so-uris exist and
are accessible.
If an so-uri points to a plain file, nfs file, or vdisk, and the object
does not exist, the object will be created with the install-size that
was specified via zonecfg(1m). If the so-uri does not exist and an
install-size was not specified via zonecfg(1m) an error will be
generated and the install will fail.
If an so-uri points to an explicit nfs server, the zones framework will
need to mount the nfs filesystem containing the storage object. The nfs
filesystem will be auto-mounted at:
/var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
Just for clarity, I think you mean:

- "will be mounted at". I think "auto-mounted" conjures up the idea
that there is integration with autofs.
- <host> is the NFS server
- <nfs-share-name> is the path on the NFS server. Is this the exact
same thing as <path-absolute> in the URI specification? Is this the
file that is mounted or the directory above the file?

My storage administrators give me grief if I create too many NFS mounts
(but I am not sure I've heard a convincing reason). As I envision NFS
server layout, I think I would see something like:

vol
  zones
    zone1
      rootzpool
      zpool
    zone2
      rootzpool
      zpool
    zone3
      rootzpool
      zpool

It seems as though if these three zones are all running on the same box
the box will have at least the following mounts:

/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3

But maybe as many as:

/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/rootzpool
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/zpool
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/rootzpool
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/zpool
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/rootzpool
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/zpool

With a slightly different arrangement this could be reduced to one.
Change
Post by Edward Pilatowicz
/var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
To:

/var/zones/nfsmount/<host>/<nfs-share-name>/<zonename>/<file>

I can see that this would complicate things a bit because it would be
hard to figure out how far up the path is the right place for the mount.

Perhaps if this is what I would like I would be better off adding a
global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
the path:/// uri instead.

Thoughts?
Post by Edward Pilatowicz
If an so-uri points to a fibre channel lun, the zones subsystem will
verify that the specified wwn corresponds to a global zone accessible
fibre channel disk device.
If an so-uri points to an iSCSI target or alias, the zones subsystem
will verify that the iSCSI device is accessible on the local system. If
an so-uri points to a static iSCSI target and that target is not
already accessible on the local host, then the zones subsystem will
enable static discovery for the local iSCSI initiator and attempt to
apply the specified static iSCSI configuration. If the iSCSI target
device is not accessible then the install will fail.
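For reference, the manual equivalent of the static-discovery setup
described above would be something along these lines (the target name
is taken from the example so-uri above; the target address is a
made-up example, and the zones framework would perform these steps
internally):

    # enable static discovery on the local iSCSI initiator
    iscsiadm modify discovery --static enable
    # add the static target configuration (target-name,target-address[:port])
    iscsiadm add static-config iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740,192.168.1.10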
Once a zones install has verified that any required so-uri exists and is
accessible, the zones subsystem will need to initialise the so-uri. In
the case of a path or nfs path, this will involve creating a zpool
within the specified file. In the case of a vdisk, fibre channel lun,
or iSCSI lun, this will involve creating an EFI/GPT partition on the
device which uses the entire disk, then a zpool will be created within
this partition. For data protection purposes, if a storage object
contains any pre-existing partitions, zpools, or ufs filesystems, the
install will fail will fail with an appropriate error message. To
s/will fail will fail/will fail/
Post by Edward Pilatowicz
continue the installation and overwrite any pre-existing data, the user
will be able to specify a new '-f' option to zoneadm(1m) install. (This
option mimics the '-f' option used by zpool(1m) create.)
If zpool-preserve is set to true, then before initialising any target
storage objects, the zones subsystem will attempt to import a
pre-existing zpool from those objects. This will allow users to
pre-create a zpool with custom creation time options, for use with
zones. To successfully import a pre-created zpool for a zone install,
that zpool must not be attached. (Ie, any pre-created zpool must be
exported from the system where it was created before a zone can be
installed on it.) Once the zpool is imported the install process will
check for the existence of a /ROOT filesystem within the zpool. If this
filesystem exists the install will fail with an appropriate error
message. To continue the installation the user will need to specify the
'-f' option to zoneadm(1m) install, which will cause the zones framework
to delete the pre-existing /ROOT filesystem within the zpool.
Is this because the zone root will be installed <zonepath>/ROOT/<bename>
rather than <zonepath>/root?
Post by Edward Pilatowicz
The newly created or imported root zpool will be named after the zone to
which it is associated, with the assigned name being "<zonename>_rpool".
This zpool will then be mounted at the zones rootpath and then the
install process will continue normally[07].
This seems odd... why not have the root zpool mounted at zonepath rather
than zoneroot? This way (e.g.) SUNWdetached.xml would follow the zone
during migrations.
Post by Edward Pilatowicz
XXX: use altroot at zpool creation or just manually mount zpool?
If the user has specified a "zpool" resource, then the zones framework
will configure, initialize, and/or import it in a similar manner to a
zpool specified by the "rootzpool" resource. The key differences are
that the name of the newly created or imported zpool will be
"<zonename>_<name>". The specified zpool will also have the zfs "zoned"
property set to "on", hence it will not be mounted anywhere in the
global zone.
XXX: do we need "zpool import -O file-system-property=" to set the
zoned property upon import.
Once a zone configured with a so-uri is in the installed state, the
zones framework needs a mechanism to mark that storage as in use to
prevent it from being accessed by multiple hosts simultaneously. The
most likely situation where this could happen is via a zoneadm(1m)
attach on a remote host. The easiest way to achieve this is to keep the
zpools associated with the storage imported and mounted at all times,
and leverage the existing zpool support for detecting and preventing
multi-host access.
So whenever a global zone boots and the zones smf service runs, it will
attempt to configure and import any shared storage objects associated
with installed zones. It will then continue to behave as it does today
and boot any installed zones that have the autoboot property set. If
any of this shared storage cannot be configured or imported, then:
- the zones associated with the failed storage will be transitioned
to the "uninstalled" state.
Is "uninstalled" a real state? Perhaps "configured" is more
appropriate, as this allows a transition to "installed" via "zoneadm
attach".
Post by Edward Pilatowicz
- an error message will be emitted to the zones smf log file.
- after booting any remaining installed zones that have autoboot set
to true, the zones smf service will enter the "maintenance" state,
thereby prompting the administrator to look at the zones smf log
file.
After fixing any problems with shared storage accessibility, the
admin should be able to simply re-attach the zone to the system.
Currently the zones smf service is dependent upon multi-user-server, so
all networking services required for access to shared storage should be
properly configured well before we try to import any shared storage
associated with zones.
May I propose a fix to the zones SMF service as part of this? The
current integration with the global zone's SMF is rather weak in
reporting the real status of zones and allowing the use of SMF for
controlling the zones service. In particular:

- If a zone fails to start, the state of svc:/system/zones:default does
not reflect a maintenance or degraded state.
- If an admin wishes to start a zone the same way that the system would
do it, "svcadm restart" and similar have the side effect of rebooting
all zones on the system.
- There is no way to establish dependencies between zones or between a
zone and something that needs to happen in the global zone.
- There isn't a good way to allow certain individuals within the global
zone the ability to start/stop specific zones with RBAC or
authorizations.

I propose that:

- zonecfg creates a new service instance svc:/system/zones:zonename
when the zone is configured. Its initial state is disabled. If the
service already exists sanity checking may be performed but it should
not whack things like dependencies and authorizations.
- After zoneadm installs a zone, the general/enabled property of
svc:/system/zones:zonename is set to match the zonecfg autoboot
property.
- "zoneadm boot" is the equivalent of
"svcadm enable -t svc:/system/zones:zonename"
- A new command "zoneadm shutdown" is the equivalent of
"svcadm disable -t svc:/system/zones:zonename"
- "zoneadm halt" is the equivalent of "svcadm mark maintenance
svc:/system/zones:zonename:" followed by the traditional ungraceful
teardown of the zone.
- Modification of the autoboot property with zonecfg (so long as the
zone has been installed/attached) triggers the corresponding
general/enabled property change in SMF. This should set the property
general/enabled without causing an immediate state change.
- zoneadm uninstall and zoneadm detach set the service to not autostart.
- zonecfg delete also deletes the service.
- A new property be added to zonecfg to disable SMF integration of this
particular zone. This will be important for people that have already
worked around this problem (including ISV's providing clustering
products) that don't want SMF getting in the way of their already
working solution.
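To make the proposed mapping concrete, the command equivalences above
would look roughly like this (the per-zone FMRIs are hypothetical --
no such instances exist today -- and "zone1" is just an example):

    zoneadm -z zone1 boot      # ~ svcadm enable -t  svc:/system/zones:zone1
    zoneadm -z zone1 shutdown  # ~ svcadm disable -t svc:/system/zones:zone1
    zoneadm -z zone1 halt      # ~ svcadm mark maintenance svc:/system/zones:zone1
                               #   (followed by the ungraceful teardown)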
Post by Edward Pilatowicz
On system shutdown, the zones system will NOT export zpools contained
within storage objects used by the zone. Zpools contained within storage
objects assigned to installed zones will only be exported during zone
detach. More details about the behaviour of zone detach are provided
below.
----------
C.1.iv Zoneadm(1m) attach
[snip]
Post by Edward Pilatowicz
----------
C.1.v Zoneadm(1m) boot
[snip]
Post by Edward Pilatowicz
----------
C.1.vi Zoneadm(1m) detach
[snip]
Post by Edward Pilatowicz
----------
C.1.vii Zoneadm(1m) uninstall
[snip]
Post by Edward Pilatowicz
----------
C.1.viii Zoneadm(1m) clone
Normally when cloning a zone which lives on a zfs filesystem the zones
framework will take a zfs(1m) snapshot of the source zone and then do a
zfs(1m) clone operation to create a filesystem for the new zone which is
being instantiated. This works well when all the zones on a given
system live on local storage in a single zfs filesystem, but this model
doesn't work well for zones with encapsulated roots. First, with
encapsulated roots each zone has its own zpool, and zfs(1m) does not
support cloning across zpools. Second, zfs(1m) snapshotting/cloning
within the source zpool and then mounting the resultant filesystem onto
the target zone's zoneroot would introduce dependencies between zones,
complicating things like zone migration.
Hence, for cloning operations, if the source zone has an encapsulated
root, zoneadm(1m) will not use zfs(1m) snapshot/clone. Currently
zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
is unable to use zfs(1m) snapshot/clone. We could just fall back to
this default behaviour for encapsulated root zones, but find+cpio are
not error free and can have problems with large files. So we propose to
update zoneadm(1m) clone to detect when both the source and target zones
are using separate zfs filesystems, and in that case attempt to use zfs
send/recv before falling back to find+cpio.
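As a rough sketch of the send/recv path (pool and dataset names are
illustrative, and the exact flags zoneadm(1m) would use are not
specified here), the copy would amount to something like:

    # snapshot the source zone's boot environment and replicate it
    # into the target zone's zpool
    zfs snapshot -r zone1_rpool/ROOT/zbe@clone
    zfs send -R zone1_rpool/ROOT/zbe@clone | zfs receive -F zone2_rpool/ROOT/zbe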
Can a provision be added for running an external command to produce the
clone? I envision this being used to make a call to a storage device to
tell the storage device to create a clone of the storage. (This implies
that the super-secret tool to re-write the GUID would need to become
available.)

The alternative seems to be to have everyone invent their own mechanism
with the same external commands and zoneadm attach.
Post by Edward Pilatowicz
Today, the zoneadm(1m) clone operation ignores any additional storage
(specified via the "fs", "device", or "dataset" resources) that may be
associated with the zone. Similarly, the clone operation will ignore
additional storage associated with any "zpool" resources.
Since zoneadm(1m) clone will be enhanced to support cloning between
encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
clone will be documented as the recommended migration mechanism for
users who wish to migrate existing zones from one format to another.
----------
C.2 Storage object uid/gid handling
One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS. This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible, regardless
of the uid of the accessing process (as long as the accessing process
has the necessary privileges). But when accessing files and vdisks via
NFS, the accessing process cannot use privileges to circumvent
restrictive file access permissions. This issue is also complicated by
the fact that by default most NFS servers will map all accesses by the
remote root user to a different uid, usually "nobody". (A process known
as "root squashing".)
In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.
Additionally, whenever the zones framework attempts to access a storage
object file or vdisk it will temporarily switch its uid and gid to match
the owner and group of the file/vdisk, ensure that the file is readable
and writeable by its owner (updating the file/vdisk permissions if
necessary), and finally set up the file/vdisk for access via a zpool
import or lofiadm -a. This will allow the zones framework to access
storage object files/vdisks that were created by any user, regardless
of their ownership, simplifying file ownership and management issues
for administrators.
This implies that the xvm user is getting some additional privileges.
What are those privileges?
Post by Edward Pilatowicz
----------
C.3 Taskq enhancements
The integration of Duckhorn[08] greatly simplifies the management of cpu
resources assigned to a zone. This management is partially implemented
through the use of dynamic resource pools, where zones and their
associated cpu resources can both be bound to a pool.
Internally, zfs has worker threads associated with each zpool. These
are kernel taskq threads which can run on any cpu which has not been
explicitly allocated to a cpu set/partition/pool.
So today, for any zones living on zfs filesystems, and running in a
dedicated cpu pool, any zfs disk processing associated with that zone is
not done by the cpus bound to that zone's pool. Essentially all the
zone's zfs processing is done for "free" by the global zone.
With the introduction of zpools encapsulated within storage objects,
which are themselves associated with specific zones, it would be
desirable to have the zpool worker threads bound to the cpus currently
allocated to the zone. Currently, zfs uses taskq threads for each
zpool, so one way of doing this would be to introduce a mechanism that
allows for the binding of taskqs to pools.
zfs_poolbind(char *, poolid_t);
taskq_poolbind(taskq_t, poolid_t);
When a zone, which is bound to a pool, is booted, the zones framework
will call zfs_poolbind() for each zpool associated with an encapsulated
storage object bound to the zone being booted.
Zfs will in turn use the new taskq pool binding interfaces to bind all
its taskqs to the specified pools. This mapping is transient and zfs
will not record or persist this binding in any way.
The taskq implementation will be enhanced to allow for binding worker
threads to a specific pool. If new taskq threads are created for a
taskq which is bound to a specific pool, those new threads will also
inherit the same pool bindings. The taskq to pool binding will remain
in effect until the taskq is explicitly rebound or the pool to which it
is bound is destroyed.
Any thoughts of doing something similar for dedicated NICs? From
dladm(1M):

cpus

Bind the processing of packets for a given data link to
a processor or a set of processors. The value can be a
comma-separated list of one or more processor ids. If
the list consists of more than one processor, the
processing will spread out to all the processors.
Connection to processor affinity and packet ordering for
any individual connection will be maintained.

That is, the enhancement is already there, it's just a matter of making
use of it.
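That is, something along these lines already works today (the link
name and cpu ids are just examples):

    # bind packet processing for e1000g0 to cpus 1-3
    dladm set-linkprop -p cpus=1,2,3 e1000g0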
Post by Edward Pilatowicz
----------
C.4 Zfs enhancements
In addition to the zfs_poolbind() interface proposed above, the
zpool(1m) "import" command will need to be enhanced. Currently,
zpool(1m) import by default scans all storage devices on the system
looking for pools to import. The caller can also use the '-d' option to
specify a directory within which the zpool(1m) command will scan for
zpools that may be imported. This scanning involves sampling many
objects. When dealing with zpools encapsulated in storage objects, this
scanning is unnecessary since we already know the path to the objects
which contain the zpool. Hence, the '-d' option will be enhanced to
allow for the specification of a file or device. The user will also be
able to specify this option multiple times, in case the zpool spans
multiple objects.
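Under the proposed enhancement, an import could then point directly at
the backing object(s), along these lines (the path and pool name are
illustrative only; today '-d' accepts only a directory):

    zpool import -d /var/zones/nfsmount/nfsserver/vol/zones/zone1/root.disk zone1_rpool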
----------
C.5 Lofi and lofiadm(1m) enhancements
Currently, there is no way for a global zone to access the contents of a
vdisk. Vdisk support was first introduced in VirtualBox. xVM then
adopted the VirtualBox code for vdisk support. With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.
To allow zones to use vdisk devices we propose to leverage the code
introduced by xVM by incorporating it into lofi. This will allow any
solaris system to access the contents of vdisk devices. The interface
changes to lofi to allow for this are fairly straightforward.
A new '-l' option will be added to the lofiadm(1m) "-a" device creation
mode. The '-l' option will indicate to lofi that the new device should
have a label associated with it. Normally lofi devices are named
/dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the lofi device number.
When a disk device has a label associated with it, it exports many
device nodes with different names. Therefore lofi will need to be
enhanced to support these new device names, which include multiple
nodes per device:
/dev/lofi/dsk<I>/p<j> - block device partitions
/dev/lofi/dsk<I>/s<j> - block device slices
/dev/rlofi/dsk<I>/p<j> - char device partitions
/dev/rlofi/dsk<I>/s<j> - char device slices
One of the big weaknesses with lofi is that you can't count on the
device name being the same between boots. Could -l take an argument
to be used instead of "dsk<I>"? That is:

lofiadm -a -l coolgames /media/coolgames.iso

Creates:

/dev/lofi/coolgames/p<j>
/dev/lofi/coolgames/s<j>
/dev/rlofi/coolgames/p<j>
/dev/rlofi/coolgames/s<j>

For those cases where legacy behavior is desired, an optional %d can be
used to create the names you suggest above.

lofiadm -a -l dsk%d /nfs/server/zone/stuff

[snip]
Post by Edward Pilatowicz
----------
C.6 Performance considerations
As previously mentioned, this proposal primarily simplifies the process
of configuring zones on shared storage. In most cases these proposed
configurations can be created today, but no one has actually verified
that these configurations perform acceptably. Hence, in conjunction
with providing functionality to simplify the setup of these configs,
we also need to quantify their performance to make sure that
none of the configurations suffer from gross performance problems.
The most straightforward configurations, with the least potential for
poor performance, are ones using local devices, fibre channel luns, and
iSCSI luns. These configurations should perform identically to the
configurations where the global zone uses these objects to host zfs
filesystems without zones. Additionally, the performance of these
configurations will mostly be dependent upon the hardware associated
with the storage devices. Hence the performance of these configurations
is for the most part uninteresting, and performance analysis of these
configurations can be skipped.
Looking at the performance of storage objects which are local files or
nfs files is more interesting. In these cases the zpool that hosts the
zone will be accessing its storage via the zpool vdev_file vdev_ops_t
interface. Currently, this interface doesn't receive as much use and
performance testing as some of the other zpool vdev_ops_t interfaces.
Hence it will be worthwhile to measure the performance of a zpool backed by
a file within another zfs filesystem. Likewise we will want to measure
the performance of a zpool backed by a file on an NFS filesystem.
Finally, we should compare these two performance points to a zone which
is not encapsulated within a zpool, but is instead installed directly on
a local zfs filesystem. (These comparisons are not really that
interesting when dealing with block device based storage objects.)
Reminder for when I am testing: is this a case where forcedirectio will
make a lot of sense? That is, zfs is already buffering, don't make NFS
do it too.
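For reference, that would be the standard NFS mount option, e.g. (the
server and paths here are examples only):

    mount -F nfs -o forcedirectio nfsserver:/vol/zones /var/zones/nfsmount/nfsserver/vol/zones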
Post by Edward Pilatowicz
Currently, while it is very common to deploy large numbers of zfs
filesystems, systems with large numbers of zpools are not very common.
The solution proposed in this project will likely result in an increase
of zpools on systems hosting zones. Hence, we should evaluate the
impact of an increasing number of zpools on performance scalability.
This could be done by comparing the io performance drop-off of an
increasing number of zones hosted in multiple zfs filesystems in a
single zpool vs zones hosted in separate zpools.
Finally, it will be important to do performance measurements for vdisk
configurations. These configurations are similar to the local file or
nfs configurations, but they will be utilising the vdev_disk backend and
they will have an additional layer of indirection through lofi.
XXX: impact of multiple zpools on arc and l2 arc? talk to mark maybee.
----------
C.7 Phased delivery
Customers have been asking for a simple mechanism to allow hosting of
zones on NFS since the introduction of zones. Hence we'd like to get
this functionality into the hands of customers as quickly as possible.
Also, the approach taken by this proposal to supporting zones on shared
storage is different from what was originally anticipated, hence we'd
like to get practical experience with this approach at customer sites
asap to determine if there are situations where this approach may not
meet their requirements. To accelerate the delivery of the previously
Sounds quite reasonable.

[snip]
Post by Edward Pilatowicz
-------------------------------------------------------------------------------
--
Mike Gerdts
http://mgerdts.blogspot.com/
Edward Pilatowicz
2009-05-22 07:11:22 UTC
hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias. i'll send some follow-up emails which include the revised
proposal.

thanks again,
ed
Post by Mike Gerdts
On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
Post by Edward Pilatowicz
hey all,
i've created a proposal for my vision of how zones hosted on shared
storage should work.  if anyone is interested in this functionality then
please give my proposal a read and let me know what you think.  (fyi,
i'm leaving on vacation next week so if i don't reply to comments right
away please don't take offence, i'll get to it when i get back.  ;)
ed
I'm very happy to see this. Comments appear below.
Post by Edward Pilatowicz
" please ensure that the vim modeline option is not disabled
vim:textwidth=72
-------------------------------------------------------------------------------
Zones on shared storage (v1.0)
[snip]
Post by Edward Pilatowicz
----------
C.1.i Zonecfg(1m)
The zonecfg(1m) command will be enhanced with the following two new
rootzpool resource
src resource property
install-size resource property
zpool-preserve resource property
dataset resource property
zpool resource
src resource property
install-size resource property
zpool-preserve resource property
name resource property
"rootzpool"
- Description: Identifies a shared storage object (and it's
associated parameters) which will be used to contain the root
zfs filesystem for a zone.
"zpool"
- Description: Identifies a shared storage object (and it's
associated parameters) which will be made available to the
zone as a delegated zfs dataset.
That is to say "put your OS stuff in rootzpool, put everything else in
zpool" - right?
yes. as i see it, this proposal allows for multiple types of deployment
configurations.

- a zone with a single encapsulated "rootzpool" zpool.
the OS will reside in <zonename>_rpool/ROOT/zbeXXX
everything else will also reside in <zonename>_rpool/ROOT/zbeXXX

- a zone with a single encapsulated "rootzpool" zpool.
the OS will reside in <zonename>_rpool/ROOT/zbeXXX
everything else will reside in <zonename>_rpool/dataset/<dataset>

- a zone with multiple encapsulated zpools.
the OS will reside in <zonename>_rpool/ROOT/zbeXXX
everything else will reside in other encapsulated "zpool"s

i've added some text to this section of the proposal to explain these
different configuration scenarios.
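for example, the third scenario might be configured roughly like this
(the syntax is only a sketch of the proposed resources, and the names
and so-uris are made up):

    # zonecfg -z zone1
    zonecfg:zone1> add rootzpool
    zonecfg:zone1:rootzpool> set src=iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
    zonecfg:zone1:rootzpool> end
    zonecfg:zone1> add zpool
    zonecfg:zone1:zpool> set name=data
    zonecfg:zone1:zpool> set src=nfs://nfsserver/vol/zones/zone1/data.disk
    zonecfg:zone1:zpool> set install-size=20g
    zonecfg:zone1:zpool> end
    zonecfg:zone1> commit

the resulting delegated zpool would be named "zone1_data", per the
"<zonename>_<name>" convention described in the proposal.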
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.ii Storage object uri (so-uri) format
The storage object uri (so-uri) syntax[03] will conform to the standard
uri format defined in RFC 3986 [04]. The nfs URI scheme is defined in
path:///<file-absolute>
nfs://<host>[:port]/<file-absolute>
vpath:///<file-absolute>
vnfs://<host>[:port]/<file-absolute>
File storage objects point to plain files on a local, nfs, or cifs
filesystems. These files are used to contain zpools which store zone
datasets. These are the simplest types of storage objects. Once
created, they have a fixed size, can't be grown, and don't support
advanced features like snapshotting, etc. Some example file so-uri's
path:///export/xvm/vm1.disk
- a local file
path:///net/heaped.sfbay/export/xvm/1.disk
- a nfs file accessible via autofs
nfs://heaped.sfbay/export/xvm/1.disk
- same file specified directly via a nfs so-uri
Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying featuresets, with support for things
like snapshotting, etc.. Some common vdisk formats are: VDI, VMDK and
vpath:///export/xvm/vm1.vmdk
- a local vdisk image
vpath:///net/heaped.sfbay/export/xvm/1.vmdk
- a nfs vdisk image accessible via autofs
vnfs://heaped.sfbay/export/xvm/1.vmdk
- same vdisk image specified directly via a nfs so-uri
Device storage objects specify block storage devices in a host
independant fashion. When configuring FC or iscsi storage on different
hosts, the storage configuration normally lives outsize of zonecfg, and
the configured storage may have varying /dev/dsk/cXtXdX* names. The
so-uri syntax provides a way to specify storage in a host independent
fashion, and during zone management operations, the zones framework can
map this storage to a host specific device path. Some example device
- lun 0 of a fc disk with the specified wwn
- lun 0 of an iscsi disk with the specified alias.
iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
- lun 0 of an iscsi disk with the specified target id.
What about if there is already the necessary layer of abstraction that
provides a consistent namespace? For example,
/dev/vx/dsk/zone1dg/rootvol would refer to a block device named rootvol
in the disk group zone1dg. That may reside on a single disk or span
many disks and will have the same name regardless of which host the disk
group is imported on. Since this VxVM volume may span many disks, it
would be inappropriate to refer to a single LUN that makes up that disk
group.
Perhaps the following is appropriate for such situations.
dev:///dev/vx/dsk/zone1dg/rootvol
good point. but rather than adding another URI type i'd rather just re-use
the "path:///" uri.

i've updated the doc to describe this use case and i've added an
example.
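e.g. for the VxVM case above, that would look something like:

    path:///dev/vx/dsk/zone1dg/rootvol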
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.iii Zoneadm(1m) install
When a zone is installed via the zoneadm(1m) "install" subcommand, the
zones subsystem will first verify that any required so-uris exist and
are accessible.
If an so-uri points to a plain file, nfs file, or vdisk, and the object
does not exist, the object will be created with the install-size that
was specified via zonecfg(1m). If the so-uri does not exist and an
install-size was not specified via zonecfg(1m) an error will be
generated and the install will fail.
If an so-uri points to an explicit nfs server, the zones framework will
need to mount the nfs filesystem containing storage object. The nfs
/var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
- "will be mounted at". I think "auto-mounted" conjures up the idea
that there is integration with autofs.
- <host> is the NFS server
- <nfs-share-name> is the path on the NFS server. Is this the exact
same thing as <path-absolute> in the URI specification? Is this the
file that is mounted or the directory above the file?
My storage administrators give me grief if I create too many NFS mounts
(but I am not sure I've heard a convincing reason). As I envision NFS
vol
  zones
    zone1
      rootzpool
      zpool
    zone2
      rootzpool
      zpool
    zone3
      rootzpool
      zpool
It seems as though if these three zones are all running on the same box
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
well, it all depends on what nfs shares are actually being exported.

if the nfs server has the following share(s) exported:
nfsserver:/vol
then you would have the following mount(s):
/var/zones/nfsmount/zone1/nfsserver/vol
/var/zones/nfsmount/zone2/nfsserver/vol
/var/zones/nfsmount/zone3/nfsserver/vol

if the nfs server has the following share(s) exported:
nfsserver:/vol/zones
then you would have the following mount(s):
/var/zones/nfsmount/zone1/nfsserver/vol/zones
/var/zones/nfsmount/zone2/nfsserver/vol/zones
/var/zones/nfsmount/zone3/nfsserver/vol/zones

if the nfs server has the following share(s) exported:
nfsserver:/vol/zones/zone1
nfsserver:/vol/zones/zone2
nfsserver:/vol/zones/zone3
then you would have the following mount(s):
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
Post by Mike Gerdts
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/rootzpool
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/zpool
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/rootzpool
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/zpool
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/rootzpool
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/zpool
hm. afaik, you can only share directories via nfs, and i'm assuming
that "zpool" and "rootzpool" above are files (or volumes) which can
actually store data. in which case you would never mount them directly.
Post by Mike Gerdts
With a slightly different arrangment this could be reduced to one.
Change
Post by Edward Pilatowicz
/var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
/var/zones/nfsmount/<host>/<nfs-share-name>/<zonename>/<file>
nice catch.

in early versions of my proposal, the nfs:// uri i was planning to
support allowed for the specification of mount options. this required
allowing for per-zone nfs mounts with potentially different mount
options. since then i've simplified things (realizing that most people
really don't need or want to specify mount options) and i've switched to
using the nfs uri defined in rfc 2224. this means we can do away
with the '<zonename>' path component as you suggest.

i've updated the doc.
Post by Mike Gerdts
I can see that this would complicate things a bit because it would be
hard to figure out how far up the path is the right place for the mount.
afaik, determining the mount point should be pretty straightforward.
i was planning to get a list of all the shares exported by the specified
nfs server, and then do a strncmp() of all the exported shares against
the specified path. the longest matching share name is the mount path.

for example. if we have:
nfs://jurassic/a/b/c/d/file

and jurassic is exporting:
jurassic:/a
jurassic:/a/b
jurassic:/a/b/c

then our mount path will be:
/var/zones/nfsmount/jurassic/a/b/c

and our encapsulated zvol will be accessible at:
/var/zones/nfsmount/jurassic/a/b/c/d/file

afaik, this is actually the only way that this could be implemented.
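a rough sketch of that lookup (purely illustrative; the real logic
would live in the zones framework, not a shell script, and the server
and path are the example values above):

    # list the shares exported by the server and keep the longest one
    # that is a prefix of the object path
    server=jurassic
    objpath=/a/b/c/d/file
    best=""
    for share in $(showmount -e "$server" | awk 'NR > 1 { print $1 }'); do
        case "$objpath" in
        "$share"/*) [ ${#share} -gt ${#best} ] && best="$share" ;;
        esac
    done
    echo "mount point: /var/zones/nfsmount/${server}${best}"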
Post by Mike Gerdts
Perhaps if this is what I would like I would be better off adding a
global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
the path:/// uri instead.
Thoughts?
i'm not sure i understand how you would like to see this functionality
behave.

wrt vfstab, i'd rather you not use that since that moves configuration
outside of zonecfg. so later, if you want to migrate the zone, you'll
need to remember about that vfstab configuration and move it as well.
if at all possible i'd really like to keep all the configuration within
zonecfg(1m).

perhaps you could explain your issues with the currently planned
approach in a different way to help me understand it better?
Post by Mike Gerdts
Post by Edward Pilatowicz
If an so-uri points to a fibre channel lun, the zones subsystem will
verify that the specified wwn corresponds to a global zone accessible
fibre channel disk device.
If an so-uri points to an iSCSI target or alias, the zones subsystem
will verify that the iSCSI device is accessible on the local system. If
an so-uri points to a static iSCSI target and that target is not
already accessible on the local host, then the zones subsystem will
enable static discovery for the local iSCSI initiator and attempt to
apply the specified static iSCSI configuration. If the iSCSI target
device is not accessible then the install will fail.
Once a zones install has verified that any required so-uri exists and is
accessible, the zones subsystem will need to initialise the so-uri. In
the case of a path or nfs path, this will involve creating a zpool
within the specified file. In the case of a vdisk, fibre channel lun,
or iSCSI lun, this will involve creating a EFI/GPT partition on the
device which uses the entire disk, then a zpool will be created within
this partition. For data protection purposes, if a storage object
contains any pre-existing partitions, zpools, or ufs filesystems, the
install will fail will fail with an appropriate error message. To
s/will fail will fail/will fail/
oops. thanks. ;)
Post by Mike Gerdts
Post by Edward Pilatowicz
continue the installation and overwrite any pre-existing data, the user
will be able to specify a new '-f' option to zoneadm(1m) install. (This
option mimics the '-f' option used by zpool(1m) create.)
If zpool-preserve is set to true, then before initialising any target
storage objects, the zones subsystem will attempt to import a
pre-existing zpool from those objects. This will allow users to
pre-create a zpool with custom creation time options, for use with
zones. To successfully import a pre-created zpool for a zone install,
that zpool must not be attached. (Ie, any pre-created zpool must be
exported from the system where it was created before a zone can be
installed on it.) Once the zpool is imported the install process will
check for the existence of a /ROOT filesystem within the zpool. If this
filesystem exists the install will fail with an appropriate error
message. To continue the installation the user will need to specify the
'-f' option to zoneadm(1m) install, which will cause the zones framework
to delete the pre-existing /ROOT filesystem within the zpool.
Is this because the zone root will be installed <zonepath>/ROOT/<bename>
rather than <zonepath>/root?
yes.

the current zones zfs filesystem layout and management for
opensolaris is documented here:
http://www.opensolaris.org/jive/thread.jspa?messageID=272726&#272726

i've mentioned this and referred the user to '[07]'. (which references
the link above.)
Post by Mike Gerdts
Post by Edward Pilatowicz
The newly created or imported root zpool will be named after the zone to
which it is associated, with the assigned name being "<zonename>_rpool".
This zpool will then be mounted at the zones rootpath and then the
install process will continue normally[07].
This seems odd... why not have the root zpool mounted at zonepath rather
than zoneroot? This way (e.g.) SUNWdetached.xml would follow the zone
during migrations.
oops. that's a mistake. it will be mounted on the zonepath. i've fixed
this.
Post by Mike Gerdts
Post by Edward Pilatowicz
XXX: use altroot at zpool creation or just manually mount zpool?
If the user has specified a "zpool" resource, then the zones framework
will configure, initialize, and/or import it in a similar manaer to a
zpool specified by the "rootzpool" resource. The key differences are
that the name of the newly created or imported zpool will be
"<zonename>_<name>". The specified zpool will also have the zfs "zoned"
property set to "on", hence it will not be mounted anywhere in the
global zone.
XXX: do we need "zpool import -O file-system-property=" to set the
zoned property upon import.
Once a zone configured with a so-uri is in the installed state, the
zones framework needs a mechanism to mark that storage as in use to
prevent it from being accessed by multiple hosts simultaneously. The
most likely situation where this could happen is via a zoneadm(1m)
attach on a remote host. The easiest way to achieve this is to keep the
zpools associated with the storage imported and mounted at all times,
and leverage the existing zpool support for detecting and preventing
multi-host access.
So whenever a global zone boots and the zones smf service runs, it will
attempt to configure and import any shared storage objects associated
with installed zones. It will then continue to behave as it does today
and boot any installed zones that have the autoboot property set. If
- the zones associated with the failed storage will be transitioned
to the "uninstalled" state.
Is "uninstalled" a real state? Perhaps "configured" is more
appropriate, as this allows a transition to "installed" via "zoneadm
attach".
oops. another bug. fixed.
Post by Mike Gerdts
Post by Edward Pilatowicz
- an error message will be emitted to the zones smf log file.
- after booting any remaning installed zones that have autoboot set
to true, the zones smf service will enter the "maintainence" state,
there by prompting the administrator to look at the zones smf log
file.
After fixing any problems with shared storage accessibility, the
admin should be able to simply re-attach the zone to the system.
Currently the zones smf service is dependant upon multi-user-server, so
all networking services required for access to shared storage should be
propertly configured well before we try to import any shared storage
associated with zones.
May I propose a fix to the zones SMF service as part of this? The
current integration with the global zone's SMF is rather weak in
reporting the real status of zones and allowing the use of SMF for
- If a zone fails to start, the state of svc:/system/zones:default does
not reflect a maintenance or degraded state.
- If an admin wishes to start a zone the same way that the system would
do it, "svcadm restart" and similar have the side effect of rebooting
all zones on the system.
- There is no way to establish dependencies between zones or between a
zone and something that needs to happen in the global zone.
- There isn't a good way to allow certain individuals within the global
zone the ability to start/stop specific zones with RBAC or
authorizations.
- zonecfg creates a new services instance svc:/system/zones:zonename
when the zone is configured. Its initial state is disabled. If the
service already exists sanity checking may be performed but it should
not whack things like dependencies and authorizations.
- After zoneadm installs a zone, the general/enabled property of
svc:/system/zones:zonename is set to match the zonecfg autoboot
property.
- "zoneadm boot" is the equivalent of
"svcadm enable -t svc:/system/zones:zonename"
- A new command "zoneadm shutdown" is the equivalent of
"svcadm disable -t svc:/system/zones:zonename"
- "zoneadm halt" is the equivalent of "svcadm mark maintenance
svc:/system/zones:zonename:" followed by the traditional ungraceful
teardown of the zone.
- Modification of the autoboot property with zonecfg (so long as the
zone has been installed/attached) triggers the corresponding
general/enabled property change in SMF. This should set the property
general/enabled without causing an immediate state change.
- zoneadm uninstall and zoneadm detach set the service to not autostart.
- zonecfg delete also deletes the service.
- A new property be added to zonecfg to disable SMF integration of this
particular zone. This will be important for people that have already
worked around this problem (including ISV's providing clustering
products) that don't want SMF getting in the way of their already
working solution.
yeah. the zones team is well aware that our current smf integration
story is pretty poor. :( we really want to improve our smf integration
by moving all our configuration into smf, adding per-zone smf services,
etc. so while this project proposes some minor changes to the behavior
of our existing smf service, i think that an overhaul of our smf
integration is really a project in and of itself, and out of scope for
this proposal. (this proposal already has plenty of scope that could
take a while to deliver. ;)
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.viii Zoneadm(1m) clone
Normally when cloning a zone which lives on a zfs filesystem the zones
framework will take a zfs(1m) snapshot of the source zone and then do a
zfs(1m) clone operation to create a filesystem for the new zone which is
being instantiated. This works well when all the zones on a given
system live on local storage in a single zfs filesystem, but this model
doesn't work well for zones with encapsulated roots. First, with
encapsulated roots each zone has it's own zpool, and zfs (1m) does not
support cloning across zpools. Second, zfs(1m) snapshotting/cloning
within the source zpool and then mounting the resultant filesystem onto
the target zones zoneroot would introduce dependencies between zones,
complicating things like zone migration.
Hence, for cloning operations, if the source zone has an encapsulated
root, zoneadm(1m) will not use zfs(1m) snapshot/clone. Currently
zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
is unable to use zfs(1m) snapshot/clone. We could just fall back to
this default behaviour for encapsulated root zones, but find+cpio are
not error free and can have problem with large files. So we propose to
update zoneadm(1m) clone to detect when both the source and target zones
are using separate zfs filesystems, and in that case attempt to use zfs
send/recv before falling back to find+cpio.
Can a provision be added for running an external command to produce the
clone? I envision this being used to make a call to a storage device to
tell the storage device to create a clone of the storage. (This implies
that the super-secret tool to re-write the GUID would need to become
available.)
The alternative seems to be to have everyone invent their own mechanism
with the same external commands and zoneadm attach.
hm. currently there are internal brand hooks which are run during a
clone operation, but i don't think it would be appropriate to expose
these.

a "zoneadm clone" is basically a copy + sys-unconfig. if you have a
storage device that can be used to do the copy for you, perhaps you
could simply do the copy on the storage device, and then do a "zoneadm
attach" of the new zone image? if you want, i think it would be a
pretty trivial RFE to add a sys-unconfig option to "zoneadm attach".
that should let you get the same essential functionality as clone,
without having to add any new callbacks. thoughts?
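roughly, the flow i have in mind would look something like this (the
storage-side clone step is vendor specific and not shown, and since
the sys-unconfig-on-attach option is only a suggested RFE, it's done
by hand here):

    # 1. clone zone1's backing storage object on the array (not shown)
    # 2. configure a new zone reusing zone1's config, then point its
    #    so-uri at the cloned storage
    zonecfg -z zone2 "create -t zone1"
    # ... adjust zonepath and the rootzpool src property for zone2 ...
    # 3. attach, boot, and unconfigure the copy
    zoneadm -z zone2 attach
    zoneadm -z zone2 boot
    zlogin zone2 /usr/sbin/sys-unconfig   # prompts for confirmation, then halts the zone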
Post by Mike Gerdts
Post by Edward Pilatowicz
Today, the zoneadm(1m) clone operations ignores any additional storage
(specified via the "fs", "device", or "dataset" resources) that may be
associated with the zone. Similarly, the clone operation will ignore
additional storage associated with any "zpool" resources.
Since zoneadm(1m) clone will be enhanced to support cloning between
encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
clone will be documented as the recommended migration mechanism for
users who which to migrate existing zones from one format to another.
----------
C.2 Storage object uid/gid handling
One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS. This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible, regardless
of the uid of the access process (as long as the accessing process has
the necessary privileges). But when accessing files and vdisk via NFS,
the accessing process can not use privileges to circumvent restrictive
file access premissions. This issue is also complicated by the fact
that by default most NFS servier will map all accesses by remote root
user to a different uid, usually "nobody". (a process known as "root
squashing".)
In order to avoid root squashing, or requiring users to setup special
configurations on their NFS servers, whenever the zone framework
attempts to create a storage object file or vdisk, it will temporarily
change it's uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.
Additionally, whenever the zones framework attempts to access an storage
object file or vdisk it will temporarily switch its uid and gid to match
the owner and group of the file/vdisk, ensure that the file is readable
and writeable by it's owner (updating the file/vdisk permissions if
necessary), and finally setup the file/vdisk for access via a zpool
import or lofiadm -a. This should will allow the zones framework to
access storage object files/vdisks that we created by any user,
regardless of their ownership, simplifying file ownership and management
issues for administrators.
This implies that the xvm user is getting some additional privileges.
What are those privileges?
hm. afaik, the xvm user isn't defined as having any particular
privileges. (/etc/user_attr doesn't have an xvm entry.) i wasn't
planning on defining any privilege requirements for the xvm user.

zoneadmd currently runs as root with all privs. so zoneadmd will be
able to switch to the xvm user to create encapsulated zpool
files/vdisks. similarly, zoneadmd will also be able to switch uid to
the owner of any other objects it may need to access.
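in effect, the file creation described in C.2 boils down to something
like this (shown as commands only to illustrate the resulting
ownership and permissions; zoneadmd would do the equivalent
internally, the path is made up, and this assumes the xvm account can
run commands via su):

    # create the backing file as the unprivileged "xvm" user, with
    # 0600 permissions, so the NFS server sees a non-root owner
    su xvm -c "/usr/sbin/mkfile 8g /net/nfsserver/vol/zones/zone1/root.disk && \
        chmod 0600 /net/nfsserver/vol/zones/zone1/root.disk"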
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.3 Taskq enhancements
The integration of Duckhorn[08] greatly simplifies the management of cpu
resources assigned to zone. This management is partially implemented
through the use of dynamic resource pools, where zones and their
associated cpu resources can both be bound to a pool.
Internally, zfs has worker threads associated with each zpool. These
are kernel taskq threads which can run on any cpu which has not been
explicitly allocated to a cpu set/partition/pool.
So today, for any zones living on zfs filesystems, and running in a
dedicated cpu pool, any zfs disk processing associated with that zone is
not done by the cpu's bound to that zones pool. Essentially all the
zones zfs processing is done for "free" by the global zone.
With the introduction of zpools encapsulated within storage objects,
which are themselves associated with specific zones, it would be
desirable to have the zpool worker threads bound to the cpus currently
allocated to the zone. Currently, zfs uses taskq threads for each
zpool, so one way of doing this would be to introduce a mechanism that
allows for the binding of taskqs to pools.
zfs_poolbind(char *, poolid_t);
taskq_poolbind(taskq_t, poolid_t);
When a zone, which is bound to a pool, is booted, the zones framework
will call zfs_poolbind() for each zpool associated with an encapsulated
storage object bound to the zone being booted.
Zfs will in turn use the new taskq pool binding interfaces to bind all
it's taskqs to the specified pools. This mapping is transient and zfs
will not record or persist this binding in any way.
The taskq implementation will be enhanced to allow for binding worker
threads to a specific pool. If taskqs threads are created for a taskq
which is bound to a specific pool, those new thread will also inherit
the same pool bindings. The taskq to pool binding will remain in effect
until the taskq is explicitly rebound or the pool to which it is bound
is destroyed.
Any thoughts of dooing something similar for dedicated NICs? From
cpus
Bind the processing of packets for a given data link to
a processor or a set of processors. The value can be a
comma-separated list of one or more processor ids. If
the list consists of more than one processor, the
processing will spread out to all the processors.
Connection to processor affinity and packet ordering for
any individual connection will be maintained.
That is, the enhancement is already there, it's just a matter of making
use of it.
i'm currently engaged with someone on the crossbow team who is working
on a proposal to allow for binding datalinks to pools. but once again,
that's a separate project.  ;)
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.4 Zfs enhancements
In addition to the zfs_poolbind() interface proposed above, the
zpool(1m) "import" command will need to be enhanced. Currently the
zpool(1m) import by default scans all storage devices on the system
looking for pools to import. The caller can also use the '-d' option to
specify a directory within which the zpool(1m) command will scan for
zpools that may be imported. This scanning involves sampling many
objects. When dealing with zpools encapsulated in storage objects, this
scanning is unnecessary since we already know the path to the objects
which contain the zpool.  Hence, the '-d' option will be enhanced to
allow for the specification of a file or device. The user will also be
able to specify this option multiple times, in case the zpool spans
multiple objects.
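For illustration only, with the proposed '-d <object>' enhancement the
import of an encapsulated zpool could look roughly like the sketch
below (here shown as a system(3C) invocation; the pool and object names
are examples):

#include <sys/param.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * sketch: import an encapsulated zpool straight from its backing
 * file/device using the proposed '-d <object>' option, avoiding any
 * device scanning.
 */
static int
import_encapsulated_zpool(const char *poolname, const char *obj)
{
        char cmd[MAXPATHLEN * 2];

        /* e.g. "zpool import -d /var/zones/nfsmount/.../1.disk zone1_rpool" */
        (void) snprintf(cmd, sizeof (cmd),
            "/usr/sbin/zpool import -d %s %s", obj, poolname);
        return (system(cmd));
}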
----------
C.5 Lofi and lofiadm(1m) enhancements
Currently, there is no way for a global zone to access the contents of a
vdisk. Vdisk support was first introduced in VirtualBox. xVM then
adopted the VirtualBox code for vdisk support. With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.
To allow zones to use vdisk devices we propose to leverage the code
introduced by xVM by incorporating it into lofi.  This will allow any
solaris system to access the contents of vdisk devices.  The interface
changes to lofi to allow for this are fairly straightforward.
A new '-l' option will be added to the lofiadm(1m) "-a" device creation
mode. The '-l' option will indicate to lofi that the new device should
have a label associated with it.  Normally lofi devices are named
/dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the lofi device number.
When a disk device has a label associated with it, it exports many
device nodes with different names. Therefore lofi will need to be
enhanced to support these new device names, which include multiple nodes:
/dev/lofi/dsk<I>/p<j> - block device partitions
/dev/lofi/dsk<I>/s<j> - block device slices
/dev/rlofi/dsk<I>/p<j> - char device partitions
/dev/rlofi/dsk<I>/s<j> - char device slices
One of the big weaknesses with lofi is that you can't count on the
device name being the same between boots. Could -l take an argument
lofiadm -a -l coolgames /media/coolgames.iso
/dev/lofi/coolgames/p<j>
/dev/lofi/coolgames/s<j>
/dev/rlofi/coolgames/p<j>
/dev/rlofi/coolgames/s<j>
For those cases where legacy behavior is desired, an optional %d can be
used to create the names you suggest above.
lofiadm -a -l dsk%d /nfs/server/zone/stuff
so there are a lot of improvements that could be done to lofi. one
improvement that i think we should do is to allow for persistent lofi
devices that come back after reboots. custom device naming is another.
but once again, i think that is outside the scope of this project.
(this project will facilitate these other changes because it is creating
an smf service for lofi, where persistent configuration could be stored,
but adding that functionality will have to be another project.)
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.6 Performance considerations
As previously mentioned, this proposal primarily simplifies the process
of configuring zones on shared storage. In most cases these proposed
configurations can be created today, but no one has actually verified
that these configurations perform acceptably. Hence, in conjunction
with providing functionality to simplify the setup of these configs,
we also need to quantify their performance to make sure that
none of the configurations suffer from gross performance problems.
The most straightforward configurations, with the least possibilities for
poor performance, are ones using local devices, fibre channel luns, and
iSCSI luns.  These configurations should perform identically to the
configurations where the global zone uses these objects to host zfs
filesystems without zones. Additionally, the performance of these
configurations will mostly be dependent upon the hardware associated
with the storage devices.  Hence the performance of these configurations
is for the most part uninteresting, and performance analysis of these
configurations can be skipped.
Looking at the performance of storage objects which are local files or
nfs files is more interesting. In these cases the zpool that hosts the
zone will be accessing its storage via the zpool vdev_file vdev_ops_t
interface. Currently, this interface doesn't receive as much use and
performance testing as some of the other zpool vdev_ops_t interfaces.
Hence it will be worthwhile to measure the performance of a zpool backed by
a file within another zfs filesystem. Likewise we will want to measure
the performance of a zpool backed by a file on an NFS filesystem.
Finally, we should compare these two performance points to a zone which
is not encapsulated within a zpool, but is instead installed directly on
a local zfs filesystem. (These comparisons are not really that
interesting when dealing with block device based storage objects.)
Reminder for when I am testing: is this a case where forcedirectio will
make a lot of sense? That is, zfs is already buffering, don't make NFS
do it too.
this is a great question, and i don't know the answer. i'll have to
ask some nfs folks and do some perf testing to determine what should
be done here.  i've added a note about forcedirectio to the doc.
Frank Batschulat
2009-11-27 17:12:33 UTC
Permalink
Hey Ed, I want to comment on the NFS aspects involved here,
Post by Edward Pilatowicz
well, it all depends on what nfs shares are actually being exported.
I definitely think we do want to abstain from too many programmatic
attempts inside the Zones framework at making assumptions about what
an NFS server does export, how the NFS server's exported namespace
may look and how the NFS client (which is running the Zone)
handles those exports upon access as opposed to explicit mounting.

It is really only okay for the NFS v2/v3 (and their helper) protocols world,
but it is not always adequate for the V4 protocol and all the work/features
in V4 and V4.1 towards a unified, global namespace.

I'll show why in the context of V4 on the examples you
mentioned below.
Post by Edward Pilatowicz
nfsserver:/vol
/var/zones/nfsmount/zone1/nfsserver/vol
/var/zones/nfsmount/zone2/nfsserver/vol
/var/zones/nfsmount/zone3/nfsserver/vol
nfsserver:/vol/zones
/var/zones/nfsmount/zone1/nfsserver/vol/zones
/var/zones/nfsmount/zone2/nfsserver/vol/zones
/var/zones/nfsmount/zone3/nfsserver/vol/zones
in those 2 examples, we'd have to consider how the
V4 server constructs its pseudo namespace starting
at the server's root, including what we call pseudo exports
that build the bridge to the real exported share points
at the server and how the V4 client may handle this.

for instance, on the V4 server the export:

/vol

may (and probably will) have different ZFS datasets
that host our zones underneath /vol eg.:

/vol/zone1
/vol/zone2
/vol/zone3

since they are separate ZFS datasets, we would
cross file system boundaries while traversing from
the exported server's root / over the share point /vol
down to the (also presumably exported, otherwise it
wouldn't be useful in our context anyway) share points
zone1/zone2/zone3.

We'll distinguish between the different file systems
based on the FSID attribute; if it changes, we'd cross
server file system boundaries.

With V2/V3 that'd stop us and the client cannot travel into
the new file system below the initial mount; a separate mount
would have to be performed (unless we've explicitly mounted
the entire path of course)

However, with V4, the client has the (in our implementation)
so-called Mirror Mount feature. That allows the client
to transparently mount those new file systems on access below
the starting share point /vol (provided they are shared as well)
and make them immediately visible without requiring the user
to perform any additional mounts.

Those mirror mounts will be done automatically by our V4 client
in the kernel as it detects it'd cross server side file system
boundaries (based on the FSID) on any access other than
VOP_LOOKUP() or VOP_GETATTR().

Ie. if the global zone did already have mounted

server:/vol

an attempt by the zone utilities to access (as opposed to
explicit mounting) of

server:/vol/zone1

will automatically mount server:/vol/zone1 into
the clients namespace and you'd get on the client
(nfsstat -m) 2 mounts:

server:/vol (already existing regular mount)
server:/vol/zone1 (the mirror mount done by the client)

if we'd really perform a mount though, that'd just induce
the mount of

server:/vol/zone1

into the clients namespace running the zone.

With the advent of the upcoming NFS v4 Referrals support
in the V4 server and V4 client, another 'automatism'
in the client can possibly change our observation of
the mounted server exports on the client running the zone.

On the V4 server (that is hosting our zone image) the administrator
might decide to relocate the export to a different server and
then might establish a so-called 'reparse point' (in essence
a symlink containing special information) that will redirect a client
to a different server hosting this export.

NB: other vendors' NFS servers might hand out referrals too
NB2: the same feature will be supported by our CIFS client

The V4 client can get a specific referral event (NFS4ERR_MOVED)
on VOP_LOOKUP(), VOP_GETATTR() and during initial mount processing
by observing the NFS4ERR_MOVED error, and it'll fetch the new location
information from the server via the 'fs_locations' attribute.
Our client will then go off and automatically mount the file system
from the different server it had been referred to from the initial
server. Again like mirror mounts, this is done transparently for the
user and inside the kernel.

The minor but important quirk involved here as far as our
observation from the Zone NFS client is concerned is that
we might get for our mount attempt (or access to)

server_A:/vol/zones1

a mount established instead for

server_B:/vol/zones1

It is planned to even provide our V2/V3 clients with
Referral support when talking to our NFS servers, although
the implementation will slightly differ and I'm not
yet sure how that V2/V3 client's referral mount will
be observed on the NFS client.

While this (Referrals) currently only affects initial access and
mounting, in the future with Migration and Replication support being
implemented, literally every NFS v4 OTW OP may get a 'migration
event', aka. receive NFS4ERR_MOVED.

This is still in the early design stages, but we have to expect
that from the Zone NFS client's observability standpoint
the 'nfsserver' portion of the mounted export may silently
be 're-written' behind the scenes instead of doing a
separate 2nd mount, i.e.:

our initial zone initiated access/mount:

server_OLD:/vol/zone1

Oops, a migration event happens to the client; this will now
silently become:

server_NEW:/vol/zone1

this will be reflected in things like nfsstat(1M) output
as well.
Post by Edward Pilatowicz
nfsserver:/vol/zones/zone1
nfsserver:/vol/zones/zone2
nfsserver:/vol/zones/zone3
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
as I tried to explain above, the 'nfsserver' part
can be a moving target as far as our observability from
the Zone NFS client is concerned.
Post by Edward Pilatowicz
afaik, determining the mount point should be pretty straightforward.
i was planning to get a list of all the shares exported by the specified
nfs server, and then do a strncmp() of all the exported shares against
the specified path. the longest matching share name is the mount path.
Well, that in turn is anything but straight forward and almost
impossible for NFS v4 servers.

For the V2/V3 clients that do use the mount protocol to instantiate
a mount, the server's mountd(1M) on a V2/V3 server can be asked by
the client using the MOUNTPROC_EXPORT/MOUNTPROC3_EXPORT RPC procedure
to return a list of exported file systems.

This is used by commands like showmount(1M) or dfshares(1M)
to list a server's exported file systems; however, there's no API available
to do that other than writing your own RPC aware application doing
essentially rpc_clnt_calls(3NSL) talking to a remote V2/V3 server's
mountd(1M).

But, the V4 protocol does not use the mount protocol at all anymore
so there's no real programmatic way to retrieve a list of
exported file systems from a V4 server. This would not make
much sense in the context of the V4 protocol anyway because
of the way the V4 server constructs its pseudo namespace starting
from the server's root /, potentially involving pseudo export nodes
that eventually bridge to the real share points.

You may be lucky and the exported file systems are shared for
V3 and V4 in which case you can make an educated guess at least.
Post by Edward Pilatowicz
nfs://jurassic/a/b/c/d/file
jurassic:/a
jurassic:/a/b
jurassic:/a/b/c
/var/zones/nfsmount/jurassic/a/b/c
/var/zones/nfsmount/jurassic/a/b/c/d/file
afaik, this is actually the only way that this could be implemented.
for the above reasons I'd rather stay away from implementing some
logic to figure out what to mount based on a potential
list of exported file systems from the server and rather stick
with some basics configured via zonecfg in the way:

NFS path = 'nfs://<host>[:port]/<export>'
Zone image = '<[dir to]filename>'

that way we avoid the problem of having to parse the entire
current proposed SO-URI like:

'nfs://<host>[:port]/<file-absolute>'.

and probe what part of that pathname may be suitable as a mount.

Of course we could always say that anything before the image file
name itself shall be in essence an exported path suitable for
performing a mount.

Also when talking to V4 servers we could always just mount
the server's root /, and then any access to the <file-absolute>
path will trigger a mirror mount; but then, this does not
work for V2/V3 servers.

I think we may want to elaborate a bit more on the use
of the current proposed NFS SO-URI of:

'nfs://<host>[:port]/<file-absolute>'

and its use from Zone land to perform mounts and access the
zone image.

cheers
frankB
Edward Pilatowicz
2009-11-30 20:05:57 UTC
Permalink
Post by Frank Batschulat
Hey Ed, I want to comment on the NFS aspects involved here,
Post by Edward Pilatowicz
well, it all depends on what nfs shares are actually being exported.
I definitively think we do want to abstain from that much programmatic
attempts inside the Zones framework on making assumptions about what
an NFS server does export, how the NFS servers exported namespace
may look like and how the NFS client (who's running the Zone)
handles those exports upon access as opposed to explicit mounting.
It is merely okay for the NFS v2/v3 (and their helper) protocols world
but it is not always adequate for the V4 protocol and all the work/features
in V4 and V4.1 towards a unified, global namespace.
I'll show why in the context of V4 on the examples you
mentioned below.
Post by Edward Pilatowicz
nfsserver:/vol
/var/zones/nfsmount/zone1/nfsserver/vol
/var/zones/nfsmount/zone2/nfsserver/vol
/var/zones/nfsmount/zone3/nfsserver/vol
nfsserver:/vol/zones
/var/zones/nfsmount/zone1/nfsserver/vol/zones
/var/zones/nfsmount/zone2/nfsserver/vol/zones
/var/zones/nfsmount/zone3/nfsserver/vol/zones
in those 2 examples, we'd have to consider how the
V4 server constructs it's pseudo namespace starting
at the servers root, including what we call pseudo exports
that build the bridge to the real exported share points
at the server and how the V4 client may handle this.
/vol
may (and probably will) have different ZFS datasets
/vol/zone1
/vol/zone2
/vol/zone3
since they are separate ZFS datasets, we would
cross file system boundaries while traversing from
the exported servers root / over the share point /vol
down to the (also presumably exported, otherwise it
wouldn't be useful in our context anyway) share points
zone1/zone2/zone3.
We'll distinguish between the different file systems
based on the FSID attribute; if it changes, we'd cross
server file system boundaries.
With V2/V3 that'd stop us and the client cannot travel into
the new file system below the initial mount; a separate mount
would have to be performed (unless we've explicitly mounted
the entire path of course)
However, with V4, the client has the (in our implementation)
so-called Mirror Mount feature. That allows the client
to transparently mount those new file systems on access below
the starting share point /vol (provided they are shared as well)
and make them immediately visible without requiring the user
to perform any additional mounts.
Those mirror mounts will be done automatically by our V4 client
in the kernel as it detects it'd cross server side file system
boundaries (based on the FSID) on any access other than
VOP_LOOKUP() or VOP_GETATTR().
Ie. if the global zone did already have mounted
server:/vol
an attempt by the zone utilities to access (as opposed to
explicit mounting) of
server:/vol/zone1
will automatically mount server:/vol/zone1 into
the clients namespace and you'd get on the client
server:/vol (already existing regular mount)
server:/vol/zone1 (the mirror mount done by the client)
if we'd really perform a mount though, that'd just induce
the mount of
server:/vol/zone1
into the clients namespace running the zone.
With the advent of the upcoming NFS v4 Referrals support
in the V4 server and V4 client, another 'automatism'
in the client can possibly change our observation of
the mounted server exports on the client running the zone.
On the V4 server (that is hosting our zone image) the administrator
might decide to relocate the export to a different server and
then might establish a so-called 'reparse point' (in essence
a symlink containing special information) that will redirect a client
to a different server hosting this export.
NB: other Vendors NFS servers might hand out referrals to
NB2: the same feature will be supported by our CIFS client
The V4 client can get a specific referral event (NFS4ERR_MOVED)
on VOP_LOOKUP(), VOP_GETATTR() and during initial mount processing
by observing the NFS4ERR_MOVED error and it'll fetch the new location
information from the server via the 'fs_locations' attribute.
Our client will then go off and automatically mount the file system
from the different server it had been referred to from the initial
server. Again like mirror mounts, this is done transparently for the
user and inside the kernel.
The minor but important quirk involved here as far as our
observation from the Zone NFS client is concerned is that
we might get for our mount attempt (or access to)
server_A:/vol/zones1
a mount established instead for
server_B:/vol/zones1
It is planned to even provide our V2/V3 clients with
Referral support when talking to our NFS servers, although
the implementation will slightly differ and I'm not
yet sure how that V2/V3 client's referral mount will
be observed on the NFS client.
While this (Referrals) currently only affects initial access and
mounting, in the future with Migration and Replication support being
implemented, literally every NFS v4 OTW OP may get a 'migration
event', aka. receive NFS4ERR_MOVED.
This is still in the early design stages, but we have to expect
that from the Zone NFS client's observability standpoint
the 'nfsserver' portion of the mounted export may silently
be 're-written' behind the scenes instead of doing a
separate 2nd mount; i.e. our initial zone initiated access/mount:
server_OLD:/vol/zone1
Oops, a migration event happens to the client; this will now
silently become:
server_NEW:/vol/zone1
this will be reflected in things like nfsstat(1M) output
as well.
first, one thing to keep in mind. all these mounts are being set up by
the global zone to access encapsulated zones. (ie, zones stored in
files, vdisks, etc.) these nfs filesystems won't be visible from within
a zone. everything we're describing here happens in the global zone.
Post by Frank Batschulat
Post by Edward Pilatowicz
nfsserver:/vol/zones/zone1
nfsserver:/vol/zones/zone2
nfsserver:/vol/zones/zone3
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
as I tried to explain above, the 'nfsserver' part
can be a moving target as far as our observability from
the Zone NFS client is concerned.
as i mentioned, none of these mounts will be visible from within any
non-global zone.
Post by Frank Batschulat
Post by Edward Pilatowicz
afaik, determining the mount point should be pretty straightforward.
i was planning to get a list of all the shares exported by the specified
nfs server, and then do a strncmp() of all the exported shares against
the specified path. the longest matching share name is the mount path.
Well, that in turn is anything but straight forward and almost
impossible for NFS v4 servers.
For the V2/V3 clients that do use the mount protocol to instantiate
a mount the servers mountd(1M) from a V2/V3 server can be asked by
the client using the MOUNTPROC_EXPORT/MOUNTPROC3_EXPORT RPC procedure
to return a list of exported file systems.
This is used by commands like showmount(1M) or dfshares(1M)
to list servers exported file systems, however there's no API available
to do that other than writing your own RPC aware application doing
essentially rpc_clnt_calls(3NSL) talking to a remote V2/V3 servers
mountd(1M).
But, the V4 protocol does not use the mount protocol at all anymore
so there's no real programmatic way to retrieve a list of
exported file systems from a V4 server. This would not make
much sense in the context of the V4 protocol anyways because
of the way the V4 server constructs its pseudo namespace starting
from servers root / potentially involving pseudo export nodes
that eventually bridge to the real share points.
You may be lucky and the exported file systems are shared for
V3 and V4 in which case you can make an educated guess at least.
Post by Edward Pilatowicz
nfs://jurassic/a/b/c/d/file
jurassic:/a
jurassic:/a/b
jurassic:/a/b/c
/var/zones/nfsmount/jurassic/a/b/c
/var/zones/nfsmount/jurassic/a/b/c/d/file
afaik, this is actually the only way that this could be implemented.
for above reasons I'd rather stay away from implementing some
logic to figure out what to mount based on a potential
list of exported file systems from the server and rather stick
NFS path = 'nfs://<host>[:port]/<export>'
Zone image = '<[dir to]filename>'
i don't like the idea of having multiple objects that need to be
specified. it requires the addition of an extra variable that is only
needed for nfs uris.
Post by Frank Batschulat
that way we avoid the problem of having to parse the entire
'nfs://<host>[:port]/<file-absolute>'.
and probe what part of that pathname may be suitable as a mount.
Of course we could always say that anything before the image file
name itself shall be in essence an exported path suitable for
performing a mount.
so due to redirects and v4, we can't really do a "showmounts" and scan
for what's available. ok.

we also can't do a top-down probe (ie, first probe /a/b/c/d, then probe
/a/b/c) due to redirects. ok.

that means if we want to support the current uri format (with arbitrary
export + file path combinations) our only option is a bottom up probe.
ie:

- attempt to mount jurassic:/a
  if that fails, error

- attempt to mount jurassic:/a/b
  if that fails, attempt to access path b/c/d/file under the previous
  mount; if that fails, return an error

it's not exactly elegant, but it only needs to be done once during zone
boot, so assuming it would work, it would probably be ok.
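fwiw, a very rough sketch of that probe (a slight generalization of the
steps above; placeholder names, and mount options, mountpoint cleanup,
etc. are omitted):

#include <sys/param.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * rough sketch of the bottom up probe: try to mount successively
 * deeper prefixes of the path, stop at the deepest one the server will
 * hand out, then check that the file is reachable under that mount
 * (v4 mirror mounts should cover any remaining crossings).
 */
static int
nfs_probe_mount(const char *host, const char *file_abs, const char *mntbase)
{
        char prefix[MAXPATHLEN], cmd[MAXPATHLEN * 3], probe[MAXPATHLEN];
        const char *slash = file_abs;
        int mounted = 0;

        while ((slash = strchr(slash + 1, '/')) != NULL) {
                (void) strlcpy(prefix, file_abs,
                    (size_t)(slash - file_abs) + 1);
                (void) snprintf(cmd, sizeof (cmd),
                    "/usr/bin/mkdir -p %s%s && "
                    "/usr/sbin/mount -F nfs %s:%s %s%s",
                    mntbase, prefix, host, prefix, mntbase, prefix);
                if (system(cmd) == 0)
                        mounted = 1;
                else if (mounted)
                        break;          /* deepest mountable prefix found */
        }
        if (!mounted)
                return (-1);
        (void) snprintf(probe, sizeof (probe), "%s%s", mntbase, file_abs);
        return (access(probe, R_OK | W_OK));
}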
Post by Frank Batschulat
Also when talking to V4 servers we could always just mount
the server's root /, and then any access to the <file-absolute>
path will trigger a mirror mount, but then, this does not
work for V2/V3 servers though.
i'm less concerned about v2/v3 servers. v4 has been the default since
s10. i expect people to use it. that means i think it'd be ok to
design an initial solution that works for v4, and if we get actual
customer requests to support v2/v3 then we can do that as a follow-on
RFE.
Post by Frank Batschulat
I think we may want to elaborate a bit more on the use
'nfs://<host>[:port]/<file-absolute>'
and its use from Zone land to perform mounts and access the
zone image.
first off, i'm not sure of where "zone land" is. ;)

that said, i'd be ok with doing any of the following:

1) do a bottom up probe (as described above)

2) change the uri format to add a separator between the export and the
mount path. say
nfs://<host>[:port]/<export>?path=<file-absolute>

3) as you suggested above, require "that anything before the image file
name itself shall be in essence an exported path suitable for
performing a mount."

although the last idea seems a bit restrictive.

ed
Frank Batschulat
2009-11-27 22:47:37 UTC
Permalink
Hey Ed, in addition to my previous posting: I just noticed something I'd totally
forgotten about....
Post by Edward Pilatowicz
afaik, determining the mount point should be pretty
straightforward. i was planning to get a list of all the shares
exported by the specified nfs server, and then do a strncmp() of all the
exported shares against the specified path. the longest matching share name
is the mount path.
nfs://jurassic/a/b/c/d/file
jurassic:/a
jurassic:/a/b
jurassic:/a/b/c
/var/zones/nfsmount/jurassic/a/b/c
/var/zones/nfsmount/jurassic/a/b/c/d/file
afaik, this is actually the only way that this could
be implemented.
I just recognized (my bad) that the SO-URI of 'nfs://<host>[:port]/<file-absolute>'
is actually compliant with the WebNFS URL syntax of RFC 2224 / RFC 2054 / RFC 2055,
i.e. one could directly mount that.

so it looks like we can avoid all the parsing handstands and directly mount
such a URL, aka. SO-URI, if the server does support public file handles.
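For illustration, assuming the server does support public file handles,
the mount could then be as simple as handing the directory portion of
the URL to mount(1M) (a sketch only; the paths are examples):

#include <sys/param.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * sketch: mount_nfs(1M) understands WebNFS style nfs:// URLs, so the
 * directory portion of the so-uri can be mounted directly when the
 * server supports public file handles.
 */
static int
so_uri_mount(const char *uri_dir, const char *mntpnt)
{
        char cmd[MAXPATHLEN * 2];

        /* e.g. uri_dir = "nfs://jurassic/a/b/c/d" */
        (void) snprintf(cmd, sizeof (cmd),
            "/usr/bin/mkdir -p %s && /usr/sbin/mount -F nfs %s %s",
            mntpnt, uri_dir, mntpnt);
        return (system(cmd));
}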

I'll look into the URL mount scheme and requirements in a bit more detail
to discover potential problems lying around, generic support and availability.

ideally it should be something that works with almost every NFS server,
presumably without much setup on the server side, in order to serve our
NFS implementation independent needs.

cheers
frankB
Edward Pilatowicz
2009-11-30 20:12:11 UTC
Permalink
Post by Frank Batschulat
Hey Ed, addition to my previous posting as I just noticed something I've totally
forgotten about....
Post by Edward Pilatowicz
afaik, determining the mount point should be pretty
strait forward. i was planning to get a list of all the shares
exported by the specified nfs server, and then do a strncmp() of all the
exported shares against the specified path. the longest matching share name
is the mount path.
nfs://jurassic/a/b/c/d/file
jurassic:/a
jurassic:/a/b
jurassic:/a/b/c
/var/zones/nfsmount/jurassic/a/b/c
/var/zones/nfsmount/jurassic/a/b/c/d/file
afaik, this is acutally the only way that this could
be implemented.
I just recognized (my bad) that the SO-URI of 'nfs://<host>[:port]/<file-absolute>'.
is actually compliant to the WebNFS URL syntax of RFC 2224 / RFC 2054 / RFC 2055
ie.one could directly mount that.
so it looks like we can avoid all the parsing handstands and directly mount
such an URL aka. SO-URI if the server does support public file handles.
I'll look into the URL mount schema and requirements in a bit more detail
to discover potential problems laying around, generic support and availability.
ideally it should be something that should work with almost every NFS server and
presumably without much setup on the server side in order to serve our
NFS implementation independend needs.
sounds good.

in the future i'll make sure to read all your follow up emails before
replying to your initial emails. ;)

ed

Edward Pilatowicz
2009-05-22 07:12:48 UTC
Permalink
[ second reply, includes revised proposal ]

hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias. i'll send some follow up emails which include the revised
proposal.

thanks again,
ed
Edward Pilatowicz
2009-05-22 07:13:58 UTC
Permalink
[ third reply, includes revised proposal + change bars from previous
version ]

hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias. i'll send some follow up emails which include the revised
proposal.

thanks again,
ed
Mike Gerdts
2009-05-22 16:26:06 UTC
Permalink
On Fri, May 22, 2009 at 1:57 AM, Edward Pilatowicz
Post by Edward Pilatowicz
hey mike,
thanks for all the great feedback.
my replies to your individual comments are inline below.
Thanks. I've responded inline where needed.
Post by Edward Pilatowicz
i've attached an updated version of the proposal (v1.1) which addresses
your feedback.  (i've also attached a second copy of the new proposal
that includes change bars, in case you want to review the updates.)
As I was reading through it again, I fixed a few picky things (mostly
spelling) that don't change the meaning. I don't think that I "fixed"
anything that was already right in British English.

diff attached.
Post by Edward Pilatowicz
thanks again,
ed
Post by Mike Gerdts
On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
Post by Edward Pilatowicz
hey all,
i've created a proposal for my vision of how zones hosted on shared
storage should work.  if anyone is interested in this functionality then
please give my proposal a read and let me know what you think.  (fyi,
i'm leaving on vacation next week so if i don't reply to comments right
away please don't take offence, i'll get to it when i get back.  ;)
ed
I'm very happy to see this.  Comments appear below.
Post by Edward Pilatowicz
" please ensure that the vim modeline option is not disabled
vim:textwidth=72
-------------------------------------------------------------------------------
Zones on shared storage (v1.0)
[snip]
Post by Edward Pilatowicz
----------
C.1.i Zonecfg(1m)
The zonecfg(1m) command will be enhanced with the following two new
    rootzpool                               resource
            src                             resource property
            install-size                    resource property
            zpool-preserve                  resource property
            dataset                         resource property
    zpool                                   resource
            src                             resource property
            install-size                    resource property
            zpool-preserve                  resource property
            name                            resource property
"rootzpool"
    - Description: Identifies a shared storage object (and it's
    associated parameters) which will be used to contain the root
    zfs filesystem for a zone.
"zpool"
    - Description: Identifies a shared storage object (and it's
    associated parameters) which will be made available to the
    zone as a delegated zfs dataset.
That is to say "put your OS stuff in rootzpool, put everything else in
zpool" - right?
yes.  as i see it, this proposal allows for multiple types of deployment
configurations.
- a zone with a single encapsulated "rootzpool" zpool.
       the OS will reside in <zonename>_rpool/ROOT/zbeXXX
       everything else will also reside in <zonename>_rpool/ROOT/zbeXXX
- a zone with a single encapsulated "rootzpool" zpool.
       the OS will reside in <zonename>_rpool/ROOT/zbeXXX
       everything else will reside in <zonename>_rpool/dataset/<dataset>
- a zone with multiple encapsulated zpools.
       the OS will reside in <zonename>_rpool/ROOT/zbeXXX
       everything else will reside in other encapsulated "zpool"s
i've added some text to this section of the proposal to explain these
different configuration scenarios.
Thanks, looks good.
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.ii Storage object uri (so-uri) format
The storage object uri (so-uri) syntax[03] will conform to the standard
uri format defined in RFC 3986 [04].  The nfs URI scheme is defined in
RFC 2224.  The supported so-uri formats are:
    path:///<file-absolute>
    nfs://<host>[:port]/<file-absolute>
    vpath:///<file-absolute>
    vnfs://<host>[:port]/<file-absolute>
File storage objects point to plain files on local, nfs, or cifs
filesystems.  These files are used to contain zpools which store zone
datasets.  These are the simplest types of storage objects.  Once
created, they have a fixed size, can't be grown, and don't support
advanced features like snapshotting, etc.  Some example file so-uri's are:
path:///export/xvm/vm1.disk
    - a local file
path:///net/heaped.sfbay/export/xvm/1.disk
    - a nfs file accessible via autofs
nfs://heaped.sfbay/export/xvm/1.disk
    - same file specified directly via a nfs so-uri
Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying featuresets, with support for things
like snapshotting, etc.  Some common vdisk formats are VDI, VMDK, and
others.  Some example vdisk so-uri's are:
vpath:///export/xvm/vm1.vmdk
    - a local vdisk image
vpath:///net/heaped.sfbay/export/xvm/1.vmdk
    - a nfs vdisk image accessible via autofs
vnfs://heaped.sfbay/export/xvm/1.vmdk
    - same vdisk image specified directly via a nfs so-uri
Device storage objects specify block storage devices in a host
independent fashion.  When configuring FC or iscsi storage on different
hosts, the storage configuration normally lives outside of zonecfg, and
the configured storage may have varying /dev/dsk/cXtXdX* names.  The
so-uri syntax provides a way to specify storage in a host independent
fashion, and during zone management operations, the zones framework can
map this storage to a host specific device path.  Some example device
so-uri's are:
    - lun 0 of a fc disk with the specified wwn
    - lun 0 of an iscsi disk with the specified alias.
iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
    - lun 0 of an iscsi disk with the specified target id.
What about if there is already the necessary layer of abstraction that
provides a consistent namespace?  For example,
/dev/vx/dsk/zone1dg/rootvol would refer to a block device named rootvol
in the disk group zone1dg.  That may reside on a single disk or span
many disks and will have the same name regardless of which host the disk
group is imported on.  Since this VxVM volume may span many disks, it
would be inappropriate to refer to a single LUN that makes up that disk
group.
Perhaps the following is appropriate for such situations.
dev:///dev/vx/dsk/zone1dg/rootvol
good point.  but rather than adding another URI type i'd rather just re-use
the "path:///" uri.
i've updated the doc to describe this use case and i've added an
example.
Oh yeah, UNIX presents devices as files. Duh. :)
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.iii Zoneadm(1m) install
When a zone is installed via the zoneadm(1m) "install" subcommand, the
zones subsystem will first verify that any required so-uris exist and
are accessible.
If an so-uri points to a plain file, nfs file, or vdisk, and the object
does not exist, the object will be created with the install-size that
was specified via zonecfg(1m).  If the so-uri does not exist and an
install-size was not specified via zonecfg(1m) an error will be
generated and the install will fail.
If an so-uri points to an explicit nfs server, the zones framework will
need to mount the nfs filesystem containing the storage object.  The nfs
filesystem will be auto-mounted at:
    /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
- "will be mounted at".  I think "auto-mounted" conjures up the idea
  that there is integration with autofs.
- <host> is the NFS server
- <nfs-share-name> is the path on the NFS server.  Is this the exact
  same thing as <path-absolute> in the URI specification?  Is this the
  file that is mounted or the directory above the file?
My storage administrators give me grief if I create too many NFS mounts
(but I am not sure I've heard a convincing reason).  As I envision NFS
server layouts, they look something like:
vol
  zones
    zone1
      rootzpool
      zpool
    zone2
      rootzpool
      zpool
    zone3
      rootzpool
      zpool
It seems as though if these three zones are all running on the same box,
the global zone would end up with mounts like:
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
well, it all depends on what nfs shares are actually being exported.
       nfsserver:/vol
       /var/zones/nfsmount/zone1/nfsserver/vol
       /var/zones/nfsmount/zone2/nfsserver/vol
       /var/zones/nfsmount/zone3/nfsserver/vol
       nfsserver:/vol/zones
       /var/zones/nfsmount/zone1/nfsserver/vol/zones
       /var/zones/nfsmount/zone2/nfsserver/vol/zones
       /var/zones/nfsmount/zone3/nfsserver/vol/zones
In either of these cases I'll get nagged about having three mounts
when one would suffice. I'm OK being nagged about that if it means
that I don't have something guessing how far up the tree they should
try to mount.
Post by Edward Pilatowicz
       nfsserver:/vol/zones/zone1
       nfsserver:/vol/zones/zone2
       nfsserver:/vol/zones/zone3
       /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
       /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
       /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
Post by Mike Gerdts
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/rootzpool
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/zpool
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/rootzpool
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/zpool
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/rootzpool
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/zpool
hm.  afaik, you can only share directories via nfs, and i'm assuming
that "zpool" and "rootzpool" above are files (or volumes) which can
actually store data.  in which case you would never mount them directly.
You can only share directories (I think) but you can mount files, much
like lofi allows you to mount files. The only place I've seen this
done is by the Solaris installer when it mounts a flash archive file
directly rather than mounting the parent directory. But... if zoneadm
needs to create files, it is hard to mount the file before it is
created. Mounting the parent directory seems to be the right thing to
do.
Post by Edward Pilatowicz
Post by Mike Gerdts
With a slightly different arrangment this could be reduced to one.
Change
Post by Edward Pilatowicz
    /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
      /var/zones/nfsmount/<host>/<nfs-share-name>/<zonename>/<file>
nice catch.
in early versions of my proposal, the nfs:// uri i was planning to
support allowed for the specification of mount options.  this required
allowing for per-zone nfs mounts with potentially different mount
options.  since then i've simplified things (realizing that most people
really don't need or want to specify mount options) and i've switched to
using the nfs uri defined in rfc 2224.  this means we can do away
with the '<zonename>' path component as you suggest.
That was actually something I thought about after the fact. When I've
been involved in performance problems in the past, being able to tune
mount options (e.g. protocol versions, block sizes, caching behavior,
etc.) has been important.
Post by Edward Pilatowicz
i've updated the doc.
Post by Mike Gerdts
I can see that this would complicate things a bit because it would be
hard to figure out how far up the path is the right place for the mount.
afaik, determining the mount point should be pretty straightforward.
i was planning to get a list of all the shares exported by the specified
nfs server, and then do a strncmp() of all the exported shares against
the specified path.  the longest matching share name is the mount path.
       nfs://jurassic/a/b/c/d/file
       jurassic:/a
       jurassic:/a/b
       jurassic:/a/b/c
       /var/zones/nfsmount/jurassic/a/b/c
       /var/zones/nfsmount/jurassic/a/b/c/d/file
afaik, this is actually the only way that this could be implemented.
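fwiw, the matching step itself would be something like this rough
sketch (it assumes the export list has already been fetched somehow;
the names are placeholders):

#include <string.h>

/* return the longest exported share name that is a prefix of 'path' */
static const char *
longest_matching_share(const char *path, const char **shares, int nshares)
{
        const char *best = NULL;
        size_t bestlen = 0, len;
        int i;

        for (i = 0; i < nshares; i++) {
                len = strlen(shares[i]);
                if (strncmp(shares[i], path, len) == 0 &&
                    (path[len] == '/' || path[len] == '\0') &&
                    len > bestlen) {
                        best = shares[i];
                        bestlen = len;
                }
        }
        return (best);          /* e.g. "/a/b/c" for path "/a/b/c/d/file" */
}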
So long as we don't try to do one mount that covers the needs of
multiple zones, it is quite simple. It gets difficult if jurassic is
exporting:

jurassic:/a (ro)
jurassic:/a/zones (ro)
jurassic:/a/zones/zone1 (rw)
jurassic:/a/zones/zone2 (rw)

Depending on the NFS (v3) server implementation (not this way with the
Solaris NFS implementation, but I think it is with NetApp) this is
problematic if the global zone mounts:

jurassic:/a/zones on /var/zones/nfsmount/jurassic/a/zones

which makes /var/zones/nfsmount/jurassic/a/zones/zone{1,2} readable
but not writable.

If it doesn't try to be clever and simply mounts

jurassic:/a/zones/zone1 on /var/zones/nfsmount/jurassic/a/zones/zone1
jurassic:/a/zones/zone2 on /var/zones/nfsmount/jurassic/a/zones/zone2

Then all is well.

The optimization of a single mount is where things get ugly. As such,
I'll let my storage people complain about having multiple mounts.
Post by Edward Pilatowicz
Post by Mike Gerdts
Perhaps if this is what I would like I would be better off adding a
global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
the path:/// uri instead.
Thoughts?
i'm not sure i understand how you would like to see this functionality
behave.
wrt vfstab, i'd rather you not use that since that moves configuration
outside of zonecfg.  so later, if you want to migrate the zone, you'll
need to remember about that vfstab configuration and move it as well.
if at all possible i'd really like to keep all the configuration within
zonecfg(1m).
perhaps you could explain your issues with the currently planned
approach in a different way to help me understand it better?
The key thing here is that all of my zones are served from one or two
NFS servers. Let's pretend that I have a T5440 with 200 zones on it.
The way the proposal is written, I would have 200 mounts in the global
zone of the form:

$nfsserver:/vol/zones/zone$i
on /var/zones/nfsmount/nfsserver/vol/zones/zone$i

When in reality, all I need is a single mount (subject to
implementation-specific details, as discussed above with ro vs. rw
shares):

$nfsserver:/vol/zones
on /var/myzones/nfs/$nfsserver/vol/zones

If my standard global zone deployment mechanism adds a vfstab entry
for $nfsserver:/vol/zones and configure each zone via path:/// I avoid
a storm of NFS mount requests at zone boot time as the global zone
boots. The NFS mount requests are UDP-based RPC calls, which
sometimes get lost on the wire. The timeout/retransmit may be such
that we add a bit of time to the overall zone startup process. Not a
huge deal in most cases, but a confusing problem to understand.

In this case, I wouldn't consider the NFS mounts as being something
specific to a particular zone. Rather, it is a common configuration
setting across all members of a particular "zone farm".
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
If an so-uri points to a fibre channel lun, the zones subsystem will
verify that the specified wwn corresponds to a global zone accessible
fibre channel disk device.
If an so-uri points to an iSCSI target or alias, the zones subsystem
will verify that the iSCSI device is accessible on the local system.  If
an so-uri points to a static iSCSI target and that target is not
already accessible on the local host, then the zones subsystem will
enable static discovery for the local iSCSI initiator and attempt to
apply the specified static iSCSI configuration.  If the iSCSI target
device is not accessible then the install will fail.
Once a zones install has verified that any required so-uri exists and is
accessible, the zones subsystem will need to initialise the so-uri.  In
the case of a path or nfs path, this will involve creating a zpool
within the specified file.  In the case of a vdisk, fibre channel lun,
or iSCSI lun, this will involve creating a EFI/GPT partition on the
device which uses the entire disk, then a zpool will be created within
this partition.  For data protection purposes, if a storage object
contains any pre-existing partitions, zpools, or ufs filesystems, the
install will fail will fail with an appropriate error message.  To
s/will fail will fail/will fail/
oops.  thanks.  ;)
Post by Mike Gerdts
Post by Edward Pilatowicz
continue the installation and overwrite any pre-existing data, the user
will be able to specify a new '-f' option to zoneadm(1m) install.  (This
option mimics the '-f' option used by zpool(1m) create.)
If zpool-preserve is set to true, then before initialising any target
storage objects, the zones subsystem will attempt to import a
pre-existing zpool from those objects.  This will allow users to
pre-create a zpool with custom creation time options, for use with
zones.  To successfully import a pre-created zpool for a zone install,
that zpool must not be attached.  (Ie, any pre-created zpool must be
exported from the system where it was created before a zone can be
installed on it.)  Once the zpool is imported the install process will
check for the existence of a /ROOT filesystem within the zpool.  If this
filesystem exists the install will fail with an appropriate error
message.  To continue the installation the user will need to specify the
'-f' option to zoneadm(1m) install, which will cause the zones framework
to delete the pre-existing /ROOT filesystem within the zpool.
Is this because the zone root will be installed <zonepath>/ROOT/<bename>
rather than <zonepath>/root?
yes.
the current zones zfs filesystem layout and management is described here:
       http://www.opensolaris.org/jive/thread.jspa?messageID=272726&#272726
i've mentioned this and referred the user to '[07]' (which references
the link above.)
Post by Mike Gerdts
Post by Edward Pilatowicz
The newly created or imported root zpool will be named after the zone to
which it is associated, with the assigned name being "<zonename>_rpool".
This zpool will then be mounted at the zones rootpath and then the
install process will continue normally[07].
This seems odd... why not have the root zpool mounted at zonepath rather
than zoneroot?  This way (e.g.) SUNWdetached.xml would follow the zone
during migrations.
oops.  that's a mistake.  it will be mounted on the zonepath.  i've fixed
this.
Post by Mike Gerdts
Post by Edward Pilatowicz
XXX: use altroot at zpool creation or just manually mount zpool?
If the user has specified a "zpool" resource, then the zones framework
will configure, initialize, and/or import it in a similar manner to a
zpool specified by the "rootzpool" resource.  The key differences are
that the name of the newly created or imported zpool will be
"<zonename>_<name>".  The specified zpool will also have the zfs "zoned"
property set to "on", hence it will not be mounted anywhere in the
global zone.
XXX: do we need "zpool import -O file-system-property=" to set the
     zoned property upon import.
Once a zone configured with a so-uri is in the installed state, the
zones framework needs a mechanism to mark that storage as in use to
prevent it from being accessed by multiple hosts simultaneously.  The
most likely situation where this could happen is via a zoneadm(1m)
attach on a remote host.  The easiest way to achieve this is to keep the
zpools associated with the storage imported and mounted at all times,
and leverage the existing zpool support for detecting and preventing
multi-host access.
So whenever a global zone boots and the zones smf service runs, it will
attempt to configure and import any shared storage objects associated
with installed zones.  It will then continue to behave as it does today
and boot any installed zones that have the autoboot property set.  If
- the zones associated with the failed storage will be transitioned
  to the "uninstalled" state.
Is "uninstalled" a real state?  Perhaps "configured" is more
appropriate, as this allows a transition to "installed" via "zoneadm
attach".
oops.  another bug.  fixed.
Post by Mike Gerdts
Post by Edward Pilatowicz
- an error message will be emitted to the zones smf log file.
- after booting any remaining installed zones that have autoboot set
  to true, the zones smf service will enter the "maintenance" state,
  thereby prompting the administrator to look at the zones smf log
  file.
After fixing any problems with shared storage accessibility, the
admin should be able to simply re-attach the zone to the system.
Currently the zones smf service is dependent upon multi-user-server, so
all networking services required for access to shared storage should be
properly configured well before we try to import any shared storage
associated with zones.
May I propose a fix to the zones SMF service as part of this?  The
current integration with the global zone's SMF is rather weak in
reporting the real status of zones and allowing the use of SMF for
zone management:
- If a zone fails to start, the state of svc:/system/zones:default does
  not reflect a maintenance or degraded state.
- If an admin wishes to start a zone the same way that the system would
  do it, "svcadm restart" and similar have the side effect of rebooting
  all zones on the system.
- There is no way to establish dependencies between zones or between a
  zone and something that needs to happen in the global zone.
- There isn't a good way to allow certain individuals within the global
  zone the ability to start/stop specific zones with RBAC or
  authorizations.
- zonecfg creates a new services instance svc:/system/zones:zonename
  when the zone is configured.  Its initial state is disabled.  If the
  service already exists sanity checking may be performed but it should
  not whack things like dependencies and authorizations.
- After zoneadm installs a zone, the general/enabled property of
  svc:/system/zones:zonename is set to match the zonecfg autoboot
  property.
- "zoneadm boot" is the equivalent of
  "svcadm enable -t svc:/system/zones:zonename"
- A new command "zoneadm shutdown" is the equivalent of
  "svcadm disable -t svc:/system/zones:zonename"
- "zoneadm halt" is the equivalent of "svcadm mark maintenance
  svc:/system/zones:zonename:" followed by the traditional ungraceful
  teardown of the zone.
- Modification of the autoboot property with zonecfg (so long as the
  zone has been installed/attached) triggers the corresponding
  general/enabled property change in SMF.  This should set the property
  general/enabled without causing an immediate state change.
- zoneadm uninstall and zoneadm detach set the service to not autostart.
- zonecfg delete also deletes the service.
- A new property be added to zonecfg to disable SMF integration of this
  particular zone.  This will be important for people that have already
  worked around this problem (including ISV's providing clustering
  products) that don't want SMF getting in the way of their already
  working solution.
yeah.  the zones team is well aware that our current smf integration
story is pretty poor.  :(  we really want to improve our smf integration
by moving all our configuration into smf, adding per-zone smf services,
etc.  so while this project proposes some minor changes to the behavior
of our existing smf service, i think that an overhaul of our smf
integration is really a project in and of itself, and out of scope for
this proposal.  (this proposal already has plenty of scope that could
take a while to deliver.  ;)
Very well...
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.viii Zoneadm(1m) clone
Normally when cloning a zone which lives on a zfs filesystem the zones
framework will take a zfs(1m) snapshot of the source zone and then do a
zfs(1m) clone operation to create a filesystem for the new zone which is
being instantiated.  This works well when all the zones on a given
system live on local storage in a single zfs filesystem, but this model
doesn't work well for zones with encapsulated roots.  First, with
encapsulated roots each zone has its own zpool, and zfs(1m) does not
support cloning across zpools.  Second, zfs(1m) snapshotting/cloning
within the source zpool and then mounting the resultant filesystem onto
the target zones zoneroot would introduce dependencies between zones,
complicating things like zone migration.
Hence, for cloning operations, if the source zone has an encapsulated
root, zoneadm(1m) will not use zfs(1m) snapshot/clone.  Currently
zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
is unable to use zfs(1m) snapshot/clone.  We could just fall back to
this default behaviour for encapsulated root zones, but find+cpio are
not error free and can have problems with large files.  So we propose to
update zoneadm(1m) clone to detect when both the source and target zones
are using separate zfs filesystems, and in that case attempt to use zfs
send/recv before falling back to find+cpio.
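For illustration only, the send/recv path might look roughly like the
sketch below (the pool and snapshot names are examples; the actual
dataset layout would follow [07]):

#include <stdlib.h>

/* sketch: replicate the source zone's zpool into the target zone's zpool */
static int
clone_via_send_recv(void)
{
        /* recursively snapshot the source zone's datasets */
        if (system("/usr/sbin/zfs snapshot -r zone1_rpool@clone") != 0)
                return (-1);

        /* send the whole tree into the new zone's (already created) zpool */
        return (system("/usr/sbin/zfs send -R zone1_rpool@clone | "
            "/usr/sbin/zfs recv -F zone2_rpool"));
}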
Can a provision be added for running an external command to produce the
clone?  I envision this being used to make a call to a storage device to
tell the storage device to create a clone of the storage.  (This implies
that the super-secret tool to re-write the GUID would need to become
available.)
The alternative seems to be to have everyone invent their own mechanism
with the same external commands and zoneadm attach.
hm.  currently there are internal brand hooks which are run during a
clone operation, but i don't think it would be appropriate to expose
these.
a "zoneadm clone" is basically a copy + sys-unconfig.  if you have a
storage device that can be used to do the copy for you, perhaps you
could simply do the copy on the storage device, and then do a "zoneadm
attach" of the new zone image?  if you want, i think it would be a
pretty trivial RFE to add a sys-unconfig option to "zoneadm attach".
that should let you get the same essential functionality as clone,
without having to add any new callbacks.  thoughts?
Since cloning already requires the zone to be down, I don't think that
too many people are probably cloning anything other than zones that
are intended to be template zones that are never booted. Such zones
can be pre-sys-unconfig'd to work around this problem, and in my
opinion is not worth a lot of effort.

I further suspect that most places would prefer that zones were not
sys-unconfig'd so that they could just tweak the few things that need
to be tweaked rather than putting bogus information in /etc/sysidcfg
then going back and fixing things afterwards. For example, sysidcfg
is unable to cope with the notion that you might use LDAP for
accounts, DNS for hosts, and files for things like services.
Patching, upgrades, etc. also tend to break things related to sysidcfg
(e.g. disabling various SMF services required by name services).
Hopefully sysidcfg goes away or gets fixed...
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
Today, the zoneadm(1m) clone operations ignores any additional storage
(specified via the "fs", "device", or "dataset" resources) that may be
associated with the zone.  Similarly, the clone operation will ignore
additional storage associated with any "zpool" resources.
Since zoneadm(1m) clone will be enhanced to support cloning between
encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
clone will be documented as the recommended migration mechanism for
users who wish to migrate existing zones from one format to another.
----------
C.2 Storage object uid/gid handling
One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS.  This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible, regardless
of the uid of the access process (as long as the accessing process has
the necessary privileges).  But when accessing files and vdisks via NFS,
the accessing process cannot use privileges to circumvent restrictive
file access permissions.  This issue is also complicated by the fact
that by default most NFS servers will map all accesses by the remote root
user to a different uid, usually "nobody" (a process known as "root
squashing").
In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.
Additionally, whenever the zones framework attempts to access an storage
object file or vdisk it will temporarily switch its uid and gid to match
the owner and group of the file/vdisk, ensure that the file is readable
and writeable by it's owner (updating the file/vdisk permissions if
necessary), and finally setup the file/vdisk for access via a zpool
import or lofiadm -a.  This should will allow the zones framework to
access storage object files/vdisks that we created by any user,
regardless of their ownership, simplifying file ownership and management
issues for administrators.
This implies that the xvm user is getting some additional privileges.
What are those privileges?
hm.  afaik, the xvm user isn't defined as having any particular
privileges.  (/etc/user_attr doesn't have an xvm entry.)  i wasn't
planning on defining any privilege requirements for the xvm user.
zoneadmd currently runs as root with all privs.  so zoneadmd will be
able to switch to the xvm user to create encapsulated zpool
files/vdisks.  similarly, zoneadmd will also be able to switch uid to
the owner of any other objects it may need to access.
Gotcha. It will be along the lines of:

seteuid(xvmuid);
system("/sbin/zpool ...");

Rather than:

system("/usr/bin/su - xvm /sbin/zpool ...");

Assuming you are using system(3C) and not libzfs.
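To make the idea concrete, here is a rough sketch of the create path
with a temporary uid/gid switch.  This is just my illustration, not
text from the proposal; the function name, the hard-coded "xvm" lookup,
and the error handling are all hypothetical:

    /*
     * Sketch only.  Create a storage object file as the "xvm" user so
     * that later NFS access is not subject to root squashing.  Assumes
     * the caller (e.g. zoneadmd) runs as root with all privileges.
     */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <pwd.h>
    #include <unistd.h>

    int
    so_create_as_xvm(const char *path)
    {
            struct passwd *pw;
            int fd;

            if ((pw = getpwnam("xvm")) == NULL)
                    return (-1);

            /* drop the group first, then the user, while still euid 0 */
            if (setegid(pw->pw_gid) != 0 || seteuid(pw->pw_uid) != 0)
                    return (-1);

            /* create the backing file with 0600 permissions */
            fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);

            /* restore root credentials before doing anything else */
            (void) seteuid(0);
            (void) setegid(0);

            if (fd < 0)
                    return (-1);
            (void) close(fd);
            return (0);
    }

The same effective-uid switch would presumably wrap the zpool/lofiadm
invocations at access time as well.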
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.3 Taskq enhancements
The integration of Duckhorn[08] greatly simplifies the management of cpu
resources assigned to a zone.  This management is partially implemented
through the use of dynamic resource pools, where zones and their
associated cpu resources can both be bound to a pool.
Internally, zfs has worker threads associated with each zpool.  These
are kernel taskq threads which can run on any cpu which has not been
explicitly allocated to a cpu set/partition/pool.
So today, for any zone living on a zfs filesystem and running in a
dedicated cpu pool, any zfs disk processing associated with that zone is
not done by the cpus bound to that zone's pool.  Essentially all the
zone's zfs processing is done for "free" by the global zone.
With the introduction of zpools encapsulated within storage objects,
which are themselves associated with specific zones, it would be
desirable to have the zpool worker threads bound to the cpus currently
allocated to the zone.  Currently, zfs uses taskq threads for each
zpool, so one way of doing this would be to introduce a mechanism that
allows for the binding of taskqs to pools.
    zfs_poolbind(char *, poolid_t);
    taskq_poolbind(taskq_t, poolid_t);
When a zone, which is bound to a pool, is booted, the zones framework
will call zfs_poolbind() for each zpool associated with an encapsulated
storage object bound to the zone being booted.
Zfs will in turn use the new taskq pool binding interfaces to bind all
its taskqs to the specified pools.  This mapping is transient and zfs
will not record or persist this binding in any way.
The taskq implementation will be enhanced to allow for binding worker
threads to a specific pool.  If taskq threads are created for a taskq
which is bound to a specific pool, those new threads will also inherit
the same pool binding.  The taskq to pool binding will remain in effect
until the taskq is explicitly rebound or the pool to which it is bound
is destroyed.
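As an illustrative sketch only (not a committed interface), the zone
boot path might drive the new binding roughly as follows.  The function
and parameter names below are hypothetical; only zfs_poolbind() comes
from the prototypes above, and the list of zpool names and the poolid
are assumed to be supplied by the zone's configuration:

    /*
     * Sketch: at zone boot, bind each zpool backing one of the zone's
     * encapsulated storage objects to the resource pool the zone is
     * bound to.  zfs would then internally call taskq_poolbind() on
     * each of that zpool's taskqs.
     */
    #include <sys/types.h>

    extern int zfs_poolbind(char *, poolid_t);      /* proposed above */

    void
    zone_bind_encapsulated_zpools(char **zpools, int nzpools,
        poolid_t poolid)
    {
            int i;

            for (i = 0; i < nzpools; i++)
                    (void) zfs_poolbind(zpools[i], poolid);
    }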
Any thoughts of doing something similar for dedicated NICs?  From the
documentation for the "cpus" datalink property:
     cpus
         Bind the processing of packets for a given data link  to
         a  processor  or a set of processors. The value can be a
         comma-separated list of one or more  processor  ids.  If
         the  list  consists of more than one processor, the pro-
         cessing will spread out to all the  processors.  Connec-
         tion  to  processor affinity and packet ordering for any
         individual connection will be maintained.
That is, the enhancement is already there, it's just a matter of making
use of it.
i'm currently engaged with someone on the crossbow team who is working
on a proposal to allow for binding datalinks to pools.  but once again,
that's a separate project.  ;)
ok
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.4 Zfs enhancements
In addition to the zfs_poolbind() interface proposed above, the
zpool(1m) "import" command will need to be enhanced.  Currently,
zpool(1m) import by default scans all storage devices on the system
looking for pools to import.  The caller can also use the '-d' option to
specify a directory within which the zpool(1m) command will scan for
zpools that may be imported.  This scanning involves sampling many
objects.  When dealing with zpools encapsulated in storage objects, this
scanning is unnecessary since we already know the path to the objects
which contain the zpool.  Hence, the '-d' option will be enhanced to
allow for the specification of a file or device.  The user will also be
able to specify this option multiple times, in case the zpool spans
multiple objects.
----------
C.5 Lofi and lofiadm(1m) enhancements
Currently, there is no way for a global zone to access the contents of a
vdisk.  Vdisk support was first introduced in VirtualBox.  xVM then
adopted the VirtualBox code for vdisk support.  With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.
To allow zones to use vdisk devices we propose to leverage the code
introduced by xVM by incorporating it into lofi.  This will allow any
solaris system to access the contents of vdisk devices.  The interface
changes to lofi to allow for this are fairly straightforward.
A new '-l' option will be added to the lofiadm(1m) "-a" device creation
mode.  The '-l' option will indicate to lofi that the new device should
have a label associated with it.  Normally lofi devices are named
/dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the lofi device number.
When a disk device has a label associated with it, it exports many
device nodes with different names.  Therefore lofi will need to be
enhanced to support these new device names, with multiple nodes per
device:
    /dev/lofi/dsk<I>/p<j>           - block device partitions
    /dev/lofi/dsk<I>/s<j>           - block device slices
    /dev/rlofi/dsk<I>/p<j>          - char device partitions
    /dev/rlofi/dsk<I>/s<j>          - char device slices
One of the big weaknesses with lofi is that you can't count on the
device name being the same between boots.  Could -l take an argument?
For example:
   lofiadm -a -l coolgames /media/coolgames.iso
would give:
   /dev/lofi/coolgames/p<j>
   /dev/lofi/coolgames/s<j>
   /dev/rlofi/coolgames/p<j>
   /dev/rlofi/coolgames/s<j>
For those cases where legacy behavior is desired, an optional %d can be
used to create the names you suggest above.
   lofiadm -a -l dsk%d /nfs/server/zone/stuff
so there are a lot of improvements that could be done to lofi.  one
improvement that i think we should do is to allow for persistent lofi
devices that come back after reboots.  custom device naming is another.
but once again, i think that is outside the scope of this project.
(this project will facilitate these other changes because it is creating
an smf service for lofi, where persistent configuration could be stored,
but adding that functionality will have to be another project.)
ok
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.6 Performance considerations
As previously mentioned, this proposal primarily simplifies the process
of configuring zones on shared storage.  In most cases these proposed
configurations can be created today, but no one has actually verified
that these configurations perform acceptably.  Hence, in conjunction
with providing functionality to simplify the setup of these configs,
we also need to quantify their performance to make sure that
none of the configurations suffer from gross performance problems.
The most straightforward configurations, with the least potential for
poor performance, are ones using local devices, fibre channel luns, and
iSCSI luns.  These configurations should perform identically to the
configurations where the global zone uses these objects to host zfs
filesystems without zones.  Additionally, the performance of these
configurations will mostly be dependent upon the hardware associated
with the storage devices.  Hence the performance of these configurations
is for the most part uninteresting and performance analysis of these
configurations can be skipped.
Looking at the performance of storage objects which are local files or
nfs files is more interesting.  In these cases the zpool that hosts the
zone will be accessing its storage via the zpool vdev_file vdev_ops_t
interface.  Currently, this interface doesn't receive as much use and
performance testing as some of the other zpool vdev_ops_t interfaces.
Hence it will be worthwhile to measure the performance of a zpool backed by
a file within another zfs filesystem.  Likewise we will want to measure
the performance of a zpool backed by a file on an NFS filesystem.
Finally, we should compare these two performance points to a zone which
is not encapsulated within a zpool, but is instead installed directly on
a local zfs filesystem.  (These comparisons are not really that
interesting when dealing with block device based storage objects.)
Reminder for when I am testing: is this a case where forcedirectio will
make a lot of sense?  That is, zfs is already buffering, don't make NFS
do it too.
this is a great question, and i don't know the answer.  i'll have to
ask some nfs folks and do some perf testing to determine what should
be done here.  i've added a note about forcedirectio to the doc.
Sounds good
--
Mike Gerdts
http://mgerdts.blogspot.com/
Edward Pilatowicz
2009-05-23 03:01:44 UTC
Permalink
comments inline below.
ed
Post by Mike Gerdts
On Fri, May 22, 2009 at 1:57 AM, Edward Pilatowicz
Post by Edward Pilatowicz
i've attached an updated version of the proposal (v1.1) which addresses
your feedback.  (i've also attached a second copy of the new proposal
that includes change bars, in case you want to review the updates.)
As I was reading through it again, I fixed a few picky things (mostly
spelling) that don't change the meaning. I don't think that I "fixed"
anything that was already right in British English.
diff attached.
i've merged your fixes in. thanks.
Post by Mike Gerdts
Post by Edward Pilatowicz
Post by Mike Gerdts
On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
nice catch.
in early versions of my proposal, the nfs:// uri i was planning to
support allowed for the specification of mount options.  this required
allowing for per-zone nfs mounts with potentially different mount
options.  since then i've simplified things (realizing that most people
really don't need or want to specify mount options) and i've switched to
using the nfs uri defined in rfc 2224.  this means we can do away
with the '<zonename>' path component as you suggest.
That was actually something I thought about after the fact. When I've
been involved in performance problems in the past, being able to tune
mount options (e.g. protocol versions, block sizes, caching behavior,
etc.) has been important.
yeah. so the idea is to keep it simple for the initial functionality.
i figure that i'll evaluate different options and provide the best
defaults possible. if customer requests come in for supporting
different options, well, first they can easily work around the issue by
using autofs + path:/// (and if the autofs config is in nis/ldap, then
migration will still work). then we can just come up with a new uri
spec that allows the user to specify mount options. the non-obvious and
unfortunate part of having a uri that allows for the specification of
mount options is that we'll probably have to require that the user
percent-encode certain chars in the uri. :( leaving this off for now
gives me a simpler nfs uri format. (that should be good enough for most
people.)
Post by Mike Gerdts
Post by Edward Pilatowicz
Post by Mike Gerdts
Perhaps if this is what I would like I would be better off adding a
global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
the path:/// uri instead.
Thoughts?
i'm not sure i understand how you would like to see this functionality
behave.
wrt vfstab, i'd rather you not use that since that moves configuration
outside of zonecfg.  so later, if you want to migrate the zone, you'll
need to remember about that vfstab configuration and move it as well.
if at all possible i'd really like to keep all the configuration within
zonecfg(1m).
perhaps you could explain your issues with the currently planned
approach in a different way to help me understand it better?
The key thing here is that all of my zones are served from one or two
NFS servers. Let's pretend that I have a T5440 with 200 zones on it.
The way the proposal is written, I would have 200 mounts in the global
zone:
$nfsserver:/vol/zones/zone$i
on /var/zones/nfsmount/nfsserver/vol/zones/zone$i
When in reality, all I need is a single mount (subject to
implementation-specific details, as discussed above with ro vs. rw):
$nfsserver:/vol/zones
on /var/myzones/nfs/$nfsserver/vol/zones
If my standard global zone deployment mechanism adds a vfstab entry
for $nfsserver:/vol/zones and configures each zone via path:///, I avoid
a storm of NFS mount requests at zone boot time as the global zone
boots. The NFS mount requests are UDP-based RPC calls, which
sometimes get lost on the wire. The timeout/retransmit may be such
that we add a bit of time to the overall zone startup process. Not a
huge deal in most cases, but a confusing problem to understand.
In this case, I wouldn't consider the NFS mounts as being something
specific to a particular zone. Rather, it is a common configuration
setting across all members of a particular "zone farm".
so if your nfs server is exporting a bunch of filesystems like:
$nfsserver:/vol/zones/zone$i

then yes, you'll end up with mounts for each. but if your nfs server
is exporting
$nfsserver:/vol/zones

then you'll only end up with one.

that said, if your nfs server is exporting
$nfsserver:/vol/zones
$nfsserver:/vol/zones/zone$i

i really don't see any way to avoid having mounts for each zone. afaik,
if the nfs server has a nested export, the exported subdirectory is only
accessible via a mount. so you couldn't mount $nfsserver:/vol/zones and
then access $nfsserver:/vol/zones/zone5 without first mounting
$nfsserver:/vol/zones/zone5. (i could always be wrong about this, but
this is my current understanding of how this works.)
Post by Mike Gerdts
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.1.viii Zoneadm(1m) clone
Normally when cloning a zone which lives on a zfs filesystem the zones
framework will take a zfs(1m) snapshot of the source zone and then do a
zfs(1m) clone operation to create a filesystem for the new zone which is
being instantiated.  This works well when all the zones on a given
system live on local storage in a single zfs filesystem, but this model
doesn't work well for zones with encapsulated roots.  First, with
encapsulated roots each zone has its own zpool, and zfs(1m) does not
support cloning across zpools.  Second, zfs(1m) snapshotting/cloning
within the source zpool and then mounting the resultant filesystem onto
the target zone's zoneroot would introduce dependencies between zones,
complicating things like zone migration.
Hence, for cloning operations, if the source zone has an encapsulated
root, zoneadm(1m) will not use zfs(1m) snapshot/clone.  Currently
zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
is unable to use zfs(1m) snapshot/clone.  We could just fall back to
this default behaviour for encapsulated root zones, but find+cpio are
not error free and can have problems with large files.  So we propose to
update zoneadm(1m) clone to detect when both the source and target zones
are using separate zfs filesystems, and in that case attempt to use zfs
send/recv before falling back to find+cpio.
Can a provision be added for running an external command to produce the
clone?  I envision this being used to tell a storage device to create a
clone of the storage itself.  (This implies that the super-secret tool
to re-write the GUID would need to become available.)
The alternative seems to be that everyone invents their own mechanism
with the same external commands and zoneadm attach.
hm.  currently there are internal brand hooks which are run during a
clone operation, but i don't think it would be appropriate to expose
these.
a "zoneadm clone" is basically a copy + sys-unconfig.  if you have a
storage device that can be used to do the copy for you, perhaps you
could simply do the copy on the storage device, and then do a "zoneadm
attach" of the new zone image?  if you want, i think it would be a
pretty trivial RFE to add a sys-unconfig option to "zoneadm attach".
that should let you get the same essential functionality as clone,
without having to add any new callbacks.  thoughts?
Since cloning already requires the zone to be down, I don't think that
many people are cloning anything other than zones that are intended to
be template zones that are never booted. Such zones can be
pre-sys-unconfig'd to work around this problem, and in my opinion this
is not worth a lot of effort.
I further suspect that most places would prefer that zones were not
sys-unconfig'd so that they could just tweak the few things that need
to be tweaked rather than putting bogus information in /etc/sysidcfg
then going back and fixing things afterwards. For example, sysidcfg
is unable to cope with the notion that you might use LDAP for
accounts, DNS for hosts, and files for things like services.
Patching, upgrades, etc. also tend to break things related to sysidcfg
(e.g. disabling various SMF services required by name services).
Hopefully sysidcfg goes away or gets fixed...
yeah. i'm not sure what the future plans for sysidcfg are, but i'm
hoping that the AI install project will replace/blow-up all that old
sysid stuff...
Post by Mike Gerdts
Post by Edward Pilatowicz
Post by Mike Gerdts
Post by Edward Pilatowicz
----------
C.2 Storage object uid/gid handling
One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS.  This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible, regardless
of the uid of the accessing process (as long as the accessing process
has the necessary privileges).  But when accessing files and vdisks via
NFS, the accessing process cannot use privileges to circumvent
restrictive file access permissions.  This issue is also complicated by
the fact that by default most NFS servers will map all accesses by the
remote root user to a different uid, usually "nobody".  (a process
known as "root squashing".)
In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.
Additionally, whenever the zones framework attempts to access a storage
object file or vdisk it will temporarily switch its uid and gid to match
the owner and group of the file/vdisk, ensure that the file is readable
and writable by its owner (updating the file/vdisk permissions if
necessary), and finally set up the file/vdisk for access via a zpool
import or lofiadm -a.  This will allow the zones framework to access
storage object files/vdisks that were created by any user, regardless of
their ownership, simplifying file ownership and management issues for
administrators.
This implies that the xvm user is getting some additional privileges.
What are those privileges?
hm.  afaik, the xvm user isn't defined as having any particular
privileges.  (/etc/user_attr doesn't have an xvm entry.)  i wasn't
planning on defining any privilege requirements for the xvm user.
zoneadmd currently runs as root with all privs.  so zoneadmd will be
able to switch to the xvm user to create encapsulated zpool
files/vdisks.  similarly, zoneadmd will also be able to switch uid to
the owner of any other objects it may need to access.
seteuid(xvmuid);
system("/sbin/zpool ...");
system("/usr/bin/su - xvm /sbin/zpool ...");
Assuming you are using system(3C) and not libzfs.
yep.
John Levon
2009-09-05 22:13:07 UTC
Permalink
Post by Edward Pilatowicz
path:///<file-absolute>
nfs://<host>[:port]/<file-absolute>
vpath:///<file-absolute>
vnfs://<host>[:port]/<file-absolute>
This makes me uncomfortable. The fact it's a vdisk is derivable except
in one case: creation. And when creating, we will already want some way
to specify the underlying format of the vdisk, so we could easily hook
the "make it a vdisk" option there.

That is, I think vdisks should just use path:/// and nfs:// not have
their own special schemes.
Post by Edward Pilatowicz
In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.
Hmmph. I really don't want the 'xvm' user to be exposed any more than it
is. It was always intended as an internal detail of the Xen least
privilege implementation. Encoding it as the official UID to access
shared storage seems very problematic to me. Not least, it means xend,
qemu-dm, etc. can suddenly write to all the shared storage even if it's
nothing to do with Xen.

Please make this be a 'user' option that the user can specify (with a
default of root or whatever). I'm pretty sure we'd agreed on that last
time we talked about this?
Post by Edward Pilatowicz
Additionally, whenever the zones framework attempts to access an storage
object file or vdisk it will temporarily switch its uid and gid to match
the owner and group of the file/vdisk, ensure that the file is readable
and writeable by it's owner (updating the file/vdisk permissions if
necessary), and finally setup the file/vdisk for access via a zpool
import or lofiadm -a. This should will allow the zones framework to
access storage object files/vdisks that we created by any user,
regardless of their ownership, simplifying file ownership and management
issues for administrators.
+1 on this bit, for sure.
Post by Edward Pilatowicz
For RAS purposes, we will need to ensure that this vdisk utility is
always running. Hence we will introduce a new lofi smf service
svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
daemon, which will manage the starting, stopping, monitoring, and
possible re-start of the vdisk utility. Re-starts of vdisk utility
I'm confused by this bit: isn't startd what manages "starting, stopping,
monitoring, and possible re-start" of daemons? Why isn't this
svc:/system/vdisk:default ? What is lofid actually doing?

regards
john
Edward Pilatowicz
2009-09-16 23:34:06 UTC
Permalink
thanks for taking the time to look at this and sorry for the delay in
replying. my comments are inline below.
ed
Post by John Levon
Post by Edward Pilatowicz
path:///<file-absolute>
nfs://<host>[:port]/<file-absolute>
vpath:///<file-absolute>
vnfs://<host>[:port]/<file-absolute>
This makes me uncomfortable. The fact it's a vdisk is derivable except
in one case: creation. And when creating, we will already want some way
to specify the underlying format of the vdisk, so we could easily hook
the "make it a vdisk" option there.
That is, I think vdisks should just use path:/// and nfs:// not have
their own special schemes.
this is easy enough to change.

but would you mind explaining what the detection techniques are for
the different vdisk formats? are they files with well known extensions?
all directories with well known extensions? directories with certain
contents?
Post by John Levon
Post by Edward Pilatowicz
In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.
Hmmph. I really don't want the 'xvm' user to be exposed any more than it
is. It was always intended as an internal detail of the Xen least
privilege implementation. Encoding it as the official UID to access
shared storage seems very problematic to me. Not least, it means xend,
qemu-dm, etc. can suddenly write to all the shared storage even if it's
nothing to do with Xen.
Please make this be a 'user' option that the user can specify (with a
default of root or whatever). I'm pretty sure we'd agreed on that last
time we talked about this?
i have no objections to adding a 'user' option.

but i'd still like to avoid defaulting to root and being subject to
root-squashing. the xvm user seems like a good way to do this. but if
you don't like this then i could always introduce a new user just for
this purpose, say the zonesnfs user.
Post by John Levon
Post by Edward Pilatowicz
For RAS purposes, we will need to ensure that this vdisk utility is
always running. Hence we will introduce a new lofi smf service
svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
daemon, which will manage the starting, stopping, monitoring, and
possible re-start of the vdisk utility. Re-starts of vdisk utility
I'm confused by this bit: isn't startd what manages "starting, stopping,
monitoring, and possible re-start" of daemons? Why isn't this
svc:/system/vdisk:default ? What is lofid actually doing?
well, as specified in the proposal, the administrative interface for
accessing vdisks is via lofi:

---8<---
Here's some examples of how this lofi functionality could be used
(outside of the zone framework). If there are no lofi devices on
the system, and an admin runs the following command:
lofiadm -a -l /export/xvm/vm1.disk

they would end up with the following devices:
/dev/lofi/dsk0/p# - for # == 0 - 4
/dev/lofi/dsk0/s# - for # == 0 - 15
/dev/rlofi/dsk0/p# - for # == 0 - 4
/dev/rlofi/dsk0/s# - for # == 0 - 15
---8<---

so in this case, the lofi service would be started, and it would manage
starting and stopping a vdisk utility process that services the
backend for this lofi device.

i originally made this a lofi service because i know that eventually it
would also be nice if we could persist lofi configuration across
reboots, and a lofi smf service would be a good way to do this.

there wouldn't really be any problem with changing this from a lofi
service to a vdisk service. both services would do the same thing.
each would have a daemon that keeps track of the current vdisks on the
system and ensures that a vdisk utility remains running for each one.

if you want smf to manage the vdisk utility processes directly, then
we'll have to create a new smf service each time a vdisk is accessed
and destroy that smf service each time the vdisk is taken down.

i don't really have a strong opinion on how this gets managed, so if you
have a preference then let me know and i can update the proposal.
John Levon
2009-09-17 00:13:53 UTC
Permalink
Post by Edward Pilatowicz
thanks for taking the time to look at this and sorry for the delay in
replying.
Compared to /my/ delay...
Post by Edward Pilatowicz
Post by John Levon
That is, I think vdisks should just use path:/// and nfs:// not have
their own special schemes.
this is easy enough to change.
but would you mind explaining what the detection techniques are for
the different vdisk formats? are they files with well known extensions?
all directories with well known extensions? directories with certain
contents?
Well, the format comes from the XML property file present in the vdisk.
At import time, it's a combination of sniffing the type from the file,
and some static checks on file name (namely .raw and .iso suffixes).
Post by Edward Pilatowicz
Post by John Levon
Hmmph. I really don't want the 'xvm' user to be exposed any more than it
is. It was always intended as an internal detail of the Xen least
privilege implementation. Encoding it as the official UID to access
shared storage seems very problematic to me. Not least, it means xend,
qemu-dm, etc. can suddenly write to all the shared storage even if it's
nothing to do with Xen.
Please make this be a 'user' option that the user can specify (with a
default of root or whatever). I'm pretty sure we'd agreed on that last
time we talked about this?
i have no objections to adding a 'user' option.
but i'd still like to avoid defaulting to root and being subject to
root-squashing.
How about defaulting to the owner of the containing directory? If it's
root, you won't be able to write if you're root-squashed (or not root
user) anyway.

Failing that, I'd indeed prefer a different user, especially one that's
configurable in terms of uid/gid.
Post by Edward Pilatowicz
there wouldn't really be any problem with changing this from a lofi
service to a vdisk service. both services would do the same thing.
each would have a daemon that keeps track of the current vdisks on the
system and ensures that a vdisk utility remains running for each one.
if you want smf to manage the vdisk utility processes directly, then
we'll have to create a new smf service each time a vdisk is accessed
and destroy that smf service each time the vdisk is taken down.
Ah, right, I see now. Yes, out of the two options, I'd prefer each vdisk
to have its own fault container (SMF service). You avoid the need for
another hierarchy of fault management process (lofid), and get the
benefit of enhanced visibility:

# svcs
...
online 15:33:19 svc:/system/lofi:dsk0
online 15:33:19 svc:/system/lofi:dsk1
maintenance 15:33:19 svc:/system/lofi:dsk2

Heck, if we ever do represent zones or domains as SMF instances, we
could even build dependencies on the lofi instances. (Presuming we
somehow rewhack xVM to start a service instead of an isolated vdisk
process.)

regards
john
Edward Pilatowicz
2009-09-17 04:33:11 UTC
Permalink
Post by John Levon
Post by Edward Pilatowicz
thanks for taking the time to look at this and sorry for the delay in
replying.
Compared to /my/ delay...
Post by Edward Pilatowicz
Post by John Levon
That is, I think vdisks should just use path:/// and nfs:// not have
their own special schemes.
this is easy enough to change.
but would you mind explaning what is the detection techniques are for
the different vdisk formats? are they files with well known extensions?
all directories with well known extensions? directories with certain
contents?
Well, the format comes from the XML property file present in the vdisk.
thereby implying that the vdisk path is a directory. ok. that's easy
enough to detect.
Post by John Levon
At import time, it's a combination of sniffing the type from the file,
and some static checks on file name (namely .raw and .iso suffixes).
well, as long as the suffixes above apply to directories and not to
files then i think we'd be ok. if the extensions above will apply to
files then we have a problem.

in the xvm world, you don't have any issues with accessing the files
above since you know that every object exported to a domain contains
a virtual disk, and therefore contains a label.

but with zones this isn't the case. in my proposal there are two access
modes for files: raw file mode, where a zpool is created directly
inside a file, and vdisk mode, where we first create a label within the
device and then create a zpool inside one of the partitions.

so previously if the user specified:
file:///.../foo.raw
then we would create a zpool directly within the file, no label.

and if the user specified:
vfile:///.../foo.raw

then we would use lofi with the newly proposed -l option to access the
file, then we'd put a label on it (via the lofi device), and then create
a zpool in one of the partitions (and once again, zfs would access the
file through the lofi device).

so in the two cases, how can we make the access mode determination
without having the separate uri syntax?
Post by John Levon
Post by Edward Pilatowicz
Post by John Levon
Hmmph. I really don't want the 'xvm' user to be exposed any more than it
is. It was always intended as an internal detail of the Xen least
privilege implementation. Encoding it as the official UID to access
shared storage seems very problematic to me. Not least, it means xend,
qemu-dm, etc. can suddenly write to all the shared storage even if it's
nothing to do with Xen.
Please make this be a 'user' option that the user can specify (with a
default of root or whatever). I'm pretty sure we'd agreed on that last
time we talked about this?
i have no objections to adding a 'user' option.
but i'd still like to avoid defaulting to root and being subject to
root-squashing.
How about defaulting to the owner of the containing directory? If it's
root, you won't be able to write if you're root-squashed (or not root
user) anyway.
Failing that, I'd indeed prefer a different user, especially one that's
configurable in terms of uid/gid.
if a directory is owned by a non-root user and i want to create a file
there, i think it's a great idea to switch to the uid of the directory
owner to do my file operations. i'll add that to the proposal.

but, say i'm on a host that is not subject to root squashing and i need
to create a file on a share that is only writable by root. in that
case, should i go ahead and create a file owned by root? imho, no.
instead, i'd rather create the file as some other user. why? because
if the administrator then tries to migrate that zone to a host that is
subject to root squashing from the server, then i'd lose access to that
file. eliminating all file accesses as root allows us to avoid
root-squashing and helps eliminate potential failure modes.

this would be my argument for adding a new non-root user that could be
used as a fallback for remote file access in cases that would otherwise
default to the root user.
Post by John Levon
Post by Edward Pilatowicz
there wouldn't really be any problem with changing this from a lofi
service to a vdisk service. both services would do the same thing.
each would have a daemon that keeps track of the current vdisks on the
system and ensures that a vdisk utility remains running for each one.
if you want smf to manage the vdisk utility processes directly, then
we'll have to create a new smf service each time a vdisk is accessed
and destroy that smf service each time the vdisk is taken down.
Ah, right, I see now. Yes, out of the two options, I'd prefer each vdisk
to have its own fault container (SMF service). You avoid the need for
another hierarchy of fault management process (lofid), and get the
# svcs
...
online 15:33:19 svc:/system/lofi:dsk0
online 15:33:19 svc:/system/lofi:dsk1
maintenance 15:33:19 svc:/system/lofi:dsk2
Heck, if we ever do represent zones or domains as SMF instances, we
could even build dependencies on the lofi instances. (Presuming we
somehow rewhack xVM to start a service instead of an isolated vdisk
process.)
it's a little fine grained for my tastes, but ok.

one other thing to consider is that all the services above will be
running the vdisk utility, which will be shuffling data between a lofi
device node and a vdisk file. and since lofi nodes don't persist across
reboots, the services above shouldn't persist across a reboot either. i
guess that the method script for the services above could delete the
service if it noticed that the corresponding device node associated with
the vdisk was missing.

i can write this into the proposal as well.

ed
John Levon
2009-09-21 15:25:30 UTC
Permalink
Post by Edward Pilatowicz
thereby implying that the vdisk path is a directory. ok. that's easy
Right.
Post by Edward Pilatowicz
enough to detect.
It's probably safer to directly use vdiskadm to sniff the directory, if
you can.
Post by Edward Pilatowicz
Post by John Levon
At import time, it's a combination of sniffing the type from the file,
and some static checks on file name (namely .raw and .iso suffixes).
well, as long as the suffixes above apply to directories and not to
files then i think we'd be ok. if the extensions above will apply to
files then we have a problem.
Once imported, the contents of the vdisk directory are private to vdisk.
The name of the containing directory can be anything.

That is, an import consists of taking the foo.raw file, and putting it
into a directory along with an XML properties file.
Post by Edward Pilatowicz
file:///.../foo.raw
then we would create a zpool directly within the file, no label.
vfile:///.../foo.raw
then we would use lofi with the newly proposed -l option to access the
file, then we'd put a label on it (via the lofi device), and then create
a zpool in one of the partitions (and once again, zfs would access the
file through the lofi device).
so in the two cases, how can we make the access mode determination
without having the separate uri syntax?
In the creation case, which I think we're talking about above, we create
the vdisk directory (rather than direct file access, which vdiskadm
can't do, even though vdisk itself can) and the container format is
clear.

If we want to configure access to a pre-existing raw file, I'm not sure
we'd be doing the labelling ourselves. Perhaps I don't quite understand
the use cases for what you're suggesting.
Post by Edward Pilatowicz
Post by John Levon
How about defaulting to the owner of the containing directory? If it's
root, you won't be able to write if you're root-squashed (or not root
user) anyway.
Failing that, I'd indeed prefer a different user, especially one that's
configurable in terms of uid/gid.
if a directory is owned by a non-root user and i want to create a file
there, i think it's a great idea to switch to the uid of the directory
owner to do my file operations. i'll add that to the proposal.
but, say i'm on a host that is not subject to root squashing and i need
to create a file on a share that is only writable by root. in that
case, should i go ahead and create a file owned by root? imho, no.
instead, i'd rather create the file as some other user.
I don't agree that second-guessing the user's intentions when they've
explicitly disabled root-squashing is a useful behaviour.
Post by Edward Pilatowicz
if the administrator then tries to migrate that zone to a host that is
subject to root squashing from the server, then i'd lose access to that
file. eliminating all file accesses as root allows us to avoid
root-squashing and helps eliminate potential failure modes.
Replacing it with a potentially more subtle issue: what's the zonenfs
user ID, is it configured on the server, and is it unique and reserved
across the organisation, and across all OSes?

Having access fail with a clear message is an understandable failure
mode, with a clear remedy: use a suitable uid /chosen by the
administrator/. NFS users are surely comfortable and familiar with root
squashing by now.

Having a MySQL security hole allow access to all your virtual shared
storage is a much more subtle problem (yes, I discovered despite my
initial research that UID 60 is used by some Linux machines as mysqld).

regards
john
Edward Pilatowicz
2009-09-21 17:23:21 UTC
Permalink
Post by John Levon
Post by Edward Pilatowicz
thereby implying that the vdisk path is a directory. ok. that's easy
Right.
Post by Edward Pilatowicz
enough to detect.
It's probably safer to directly use vdiskadm to sniff the directory, if
you can.
sure.
Post by John Levon
Post by Edward Pilatowicz
Post by John Levon
At import time, it's a combination of sniffing the type from the file,
and some static checks on file name (namely .raw and .iso suffixes).
well, as long as the suffixes above apply to directories and not to
files then i think we'd be ok. if the extensions above will apply to
files then we have a problem.
Once imported, the contents of the vdisk directory are private to vdisk.
The name of the containing directory can be anything.
That is, an import consists of taking the foo.raw file, and putting it
into a directory along with an XML properties file.
so in this context, an import is one method for creating a vdisk.
Post by John Levon
Post by Edward Pilatowicz
file:///.../foo.raw
then we would create a zpool directly within the file, no label.
vfile:///.../foo.raw
then we would use lofi with the newly proposed -l option to access the
file, then we'd put a label on it (via the lofi device), and then create
a zpool in one of the partitions (and once again, zfs would access the
file through the lofi device).
so in the two cases, how can we make the access mode determination
without having the separate uri syntax?
In the creation case, which I think we're talking about above, we create
the vdisk directory (rather than direct file access, which vdiskadm
can't do, even though vdisk itself can) and the container format is
clear.
If we want to configure access to a pre-existing raw file, I'm not sure
we'd be doing the labelling ourselves. Perhaps I don't quite understand
the use cases for what you're suggesting.
the two use cases above were creation use cases.

i think part of the confusion here is that in the raw case, i thought a
vdisk would just have a file, not a directory with an xml file and the
disk file. (when i was using xvm that was the format of all the vdisks
i created.)

the other part of the confusion is that i was trying to support
automatic creation for raw vdisks.

if we only support vdisks created via vdiskadm(1m), then we'll always
have a directory and we can always use vdiskadm(1m) to sniff out if it's
a valid vdisk and access it as such.

then for the implicit creation case we'll just support files containing
a zpool.

sound good?
Post by John Levon
Post by Edward Pilatowicz
Post by John Levon
How about defaulting to the owner of the containing directory? If it's
root, you won't be able to write if you're root-squashed (or not root
user) anyway.
Failing that, I'd indeed prefer a different user, especially one that's
configurable in terms of uid/gid.
if a directory is owned by a non-root user and i want to create a file
there, i think it's a great idea to switch to the uid of the directory
owner to do my file operations. i'll add that to the proposal.
but, say i'm on a host that is not subject to root squashing and i need
to create a file on a share that is only writable by root. in that
case, should i go ahead and create a file owned by root? imho, no.
instead, i'd rather create the file as some other user.
I don't agree that second-guessing the user's intentions when they've
explicitly disabled root-squashing is a useful behaviour.
Post by Edward Pilatowicz
if the administrator then tries to migrate that zone to a host that is
subject to root squashing from the server, then i'd lose access to that
file. eliminating all file accesses as root allows us to avoid
root-squashing and helps eliminate potential failure modes.
Replacing it with a potentially more subtle issue: what's the zonenfs
user ID, is it configured on the server, and is it unique and reserved
across the organisation, and across all OSes?
Having access fail with a clear message is an understandable failure
mode, with a clear remedy: use a suitable uid /chosen by the
administrator/. NFS users are surely comfortable and familiar with root
squashing by now.
Having a MySQL security hole allow access to all your virtual shared
storage is a much more subtle problem (yes, I discovered despite my
initial research that UID 60 is used by some Linux machines as mysqld).
ok. so how about we just generate an error if we need to create a file,
an explicit user id has not been specified, and root squashing is
enabled?  (because under these conditions we'd generate a file owned by
nobody.)

ed
John Levon
2009-09-21 19:21:51 UTC
Permalink
Post by Edward Pilatowicz
if we only support vdisks created via vdiskadm(1m), then we'll always
have a directory and we can always use vdiskadm(1m) to sniff out if it's
a valid vdisk and access it as such.
then for the implicit creation case we'll just support files containing
a zpool.
sound good?
Yes.
Post by Edward Pilatowicz
ok. so how about we just generate an error if we need to create a file,
an explicit user id has not been specified, and root squashing is
enabled?  (because under these conditions we'd generate a file owned by
nobody.)
Sounds good to me.

regards
john