Capfs: Capability file-system

(1) Introduction
    Capfs is a little experiment in usable security interface design.

    Capfs has three technical goals:
    (a) Allow an administrator to declare an initial allocation of
        capabilities that differs from the traditional one, and to
        control capabilities at runtime,
    (b) Allow developers to "bracket" the use of such capabilities,
        gaining them only when necessary and dropping them otherwise,
        in an easy-to-use, language-agnostic interface, and
    (c) Be as unintrusive as possible, requiring close to zero
        userland adaptation.

    The way capfs attempts to achieve these goals is by providing a
    pseudo file-system similar to procfs that is mounted together with
    a mapping between users and capabilities. Gaining and dropping
    capabilities is accomplished by using simple file operations like
    open, close, remove, etc., allowing interaction using existing
    and familiar tools for administrators and an intuitive learning
    process for developers.

    Section 2 of this document describes some basic concepts. Usage
    examples for programmers and administrators are in sections 3 and
    4, respectively. Section 5 details how to set up one or more (e.g.
    for chrooted programs) capfs instances. Finally, there are some
    implementation notes (section 6) and possible directions for
    future development (section 7).

    Code implementing capfs can be downloaded from:

        http://www.NetBSD.org/~elad/capfs/capfs.tar.gz

    This document can be found at:

        http://www.NetBSD.org/~elad/capfs/capfs.txt

(2) Basic concepts
    The capfs model gives administrators the opportunity to create a
    different allocation of privileges. Instead of the traditional
    Unix "super-user" concept, where effective user-id zero implicitly
    receives all privileges and all other users receive none, capfs
    allows specifying the initial set associated with each user in a
    configuration file.
	
    Under capfs, every process starts with two sets of capabilities:
    (a) Permitted capabilities: the capabilities the process may
        request. Having a capability permitted does not imply it is
        in use. Ideally, a process will begin by permanently removing
        from this list the capabilities it will never need.

    (b) Effective capabilities: the capabilities currently in effect.
        Ideally, this list should be kept as short as possible, with
        capabilities made effective only when needed.
		
    The hierarchy is /cap/<pid>/{permitted,effective}/<cap>, where
    <pid> is the process-id or "self" if referencing the current
    process, "permitted" or "effective" reference the relevant set
    as above, and <cap> is the capability name.

    For example, if a process with PID 1234 has the "raw_socket"
    capability (indicating it's allowed to open a raw socket)
    permitted, it will appear as /cap/1234/permitted/raw_socket.
    Once the process gains it by opening this pseudo file, it will
    also appear as /cap/1234/effective/raw_socket.
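
    The hierarchy above can be expressed as a small path builder. The
    following is a minimal sketch, assuming capfs is mounted on /cap;
    the helper name cap_path() and its parameters are hypothetical and
    not part of capfs:

```python
import posixpath

# Assumption: capfs is mounted on /cap, as in the examples in this document.
CAPFS_ROOT = "/cap"
CAP_SETS = ("permitted", "effective")


def cap_path(cap, cap_set="permitted", pid="self", root=CAPFS_ROOT):
    """Return the pseudo-file path for a capability (hypothetical helper)."""
    if cap_set not in CAP_SETS:
        raise ValueError("unknown capability set: %r" % (cap_set,))
    return posixpath.join(root, str(pid), cap_set, cap)


print(cap_path("raw_socket", pid=1234))
print(cap_path("raw_socket", "effective", 1234))
```

    With the default arguments the path refers to the permitted set of
    the calling process, i.e. /cap/self/permitted/<cap>.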
		
(3) Usage: Programmers
    Using capfs inside programs requires no additional linking or new
    APIs. All the programmer has to know is (a) where capfs is mounted
    (e.g. /cap) and (b) what each capability is named. A "permitted"
    capability is made "effective" with open(2) and is removed from
    the effective set with close(2) or unlink(2).
	
    For example, assuming capfs is mounted on /cap, in a C program
    the programmer can do the following:
	
        /* Get raw_socket capability. */
        cap_fd = open("/cap/self/permitted/raw_socket", O_RDONLY);
        if (cap_fd == -1)
            err(EXIT_FAILURE, "unable to get raw_socket capability");
		  
        /* Open raw socket. */
        sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
	  
        /* Drop the raw_socket capability. */
        close(cap_fd);
	  
    Effective capabilities can also be removed by using unlink(2):
	
        /* Drop the raw_socket capability. */
        unlink("/cap/self/effective/raw_socket");
	  
    (This may seem like duplicate functionality, but it is required
    to support the traditional behavior where a process begins with
    capabilities in effect without having to request them, and thus
    has no file descriptor to close(2).)
	  
    If the program will never need the "raw_socket" capability again,
    the programmer can permanently remove it:
	
        /* Permanently remove raw_socket capability. */
        unlink("/cap/self/permitted/raw_socket");
	  
    At any point the programmer can check for permitted or effective
    capabilities:

        /* Regain raw_socket capability if we don't have it. */
        if (stat("/cap/self/effective/raw_socket", &sb) == -1 &&
            errno == ENOENT) {
            /* Is it even permitted? stat(2) returns -1, sets errno. */
            if (stat("/cap/self/permitted/raw_socket", &sb) == -1 &&
                errno == ENOENT)
                errx(EXIT_FAILURE, "raw_socket capability not permitted");
            /* Permitted: regain it with open(2) as shown above. */
        }
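
    The same check can be wrapped in a single function. This is a
    sketch only: cap_state() is a hypothetical helper, and it relies on
    nothing beyond the existence of the pseudo files described in
    section 2:

```python
import os


def cap_state(cap, root="/cap", pid="self"):
    """Classify a capability as 'effective', 'permitted' or 'absent'.

    Hypothetical helper; `root` is wherever capfs is assumed to be mounted.
    """
    base = os.path.join(root, str(pid))
    # Check the effective set first: it is the stronger statement.
    if os.path.exists(os.path.join(base, "effective", cap)):
        return "effective"
    if os.path.exists(os.path.join(base, "permitted", cap)):
        return "permitted"
    return "absent"
```

    A caller would typically open the permitted entry only when the
    result is "permitted", and give up when it is "absent".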

    Capabilities can be used in practically every language. For
    example, Python:

        whatever$ python2.6
        Python 2.6.6 (r266:84292, Apr  4 2011, 12:12:56)
        [GCC 4.1.3 20080704 prerelease (NetBSD nb2 20081120)] on netbsd5
        Type "help", "copyright", "credits" or "license" for more information.
        >>> from socket import *
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/usr/pkg/lib/python2.6/socket.py", line 184, in __init__
            _sock = _realsocket(family, type, proto)
        socket.error: [Errno 1] Operation not permitted
        >>> cap_fd = open("/cap/self/permitted/raw_socket")
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        >>> cap_fd.close()
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/usr/pkg/lib/python2.6/socket.py", line 184, in __init__
            _sock = _realsocket(family, type, proto)
        socket.error: [Errno 1] Operation not permitted
        >>>

    Or Lua:

        whatever$ lua
        Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
        > os.execute("date -n 0420")
        date: settimeofday: Operation not permitted
        > cap_fd = io.open("/cap/self/permitted/change_time")
        > os.execute("date -n 0420")
        Sun Jul 10 04:20:00 IDT 2011
        > io.close(cap_fd)
        > os.execute("date -n 0420")
        date: settimeofday: Operation not permitted
        >

    (Note that since there are no dependencies on external libraries,
    bracketing code can always be compiled in and the presence of
    capfs determined at run-time rather than compile-time, making
    it semi-portable.)
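
    One way to package bracketing with run-time detection is a context
    manager. The sketch below is an illustration, not part of capfs:
    the name bracket() is hypothetical, and treating a missing
    <root>/self directory as "no capfs mounted" is an assumption:

```python
import contextlib
import os


@contextlib.contextmanager
def bracket(cap, root="/cap"):
    """Hold `cap` effective for the duration of a `with` block.

    Hypothetical helper. Yields True if the capability was gained through
    capfs, or False if no capfs appears to be mounted at `root`, in which
    case the caller runs with whatever privileges it already has.
    """
    if not os.path.isdir(os.path.join(root, "self")):
        yield False          # no capfs here: nothing to gain or drop
        return
    # open(2) on the permitted entry makes the capability effective.
    # If the capability is not permitted this raises OSError (ENOENT).
    fd = os.open(os.path.join(root, "self", "permitted", cap), os.O_RDONLY)
    try:
        yield True
    finally:
        os.close(fd)         # close(2) drops it again
```

    The privileged operation then goes inside the block, e.g.
    "with bracket("raw_socket"): ..." around the socket() call, so the
    capability is dropped even if the operation raises an exception.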

(4) Usage: Administrators
    Capfs also allows administrators to get a quick view of which
    program is allowed to do what at any given moment. Traditional
    tools can be used to filter and process the output. An
    administrator can also manipulate capabilities of running
    processes.
	
    For example, list processes currently allowed to change time:

        whatever# for pid in `ls -1 /cap/[0-9]*/effective/change_time | cut -d/ -f3`
        > do ps -up $pid | sed -e '1d'
        > done
        elad 10758  0.0  0.6 10016 3300 ?   I     5:24PM 0:02.26 sshd: elad@pts/1 (sshd)
        elad 13594  0.0  0.3 3428 1508 ttyp1 I+    4:23AM 0:00.05 lua
        root 13695  0.0  0.2 2968 1200 ttyp3 S    12:11PM 0:00.49 ksh
        elad 15464  0.0  0.3 3176 1416 ttyp1 Is    5:24PM 0:00.14 -sh
        root 567  0.0  0.5 6948 2388 ?   Is   Fri02AM 1:47.41 /usr/libexec/postfix/master
        root 9441  0.0  0.3 3176 1424 ttyp3 I     4:40AM 0:00.46 sh
        whatever#

    List permitted capabilities for a process:

        whatever# ls /cap/13594/permitted
        bind_privport  change_time    raw_socket
        whatever#
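
    The same queries can be scripted. The following sketch assumes
    only the directory layout from section 2; the function names
    pids_with_cap() and caps_of() are hypothetical:

```python
import os


def pids_with_cap(cap, cap_set="effective", root="/cap"):
    """Return the sorted PIDs that have `cap` in `cap_set`."""
    pids = []
    for entry in os.listdir(root):
        if not entry.isdigit():      # skip "self" and anything non-numeric
            continue
        if os.path.exists(os.path.join(root, entry, cap_set, cap)):
            pids.append(int(entry))
    return sorted(pids)


def caps_of(pid, cap_set="permitted", root="/cap"):
    """Return the sorted capability names in one set of a process."""
    try:
        return sorted(os.listdir(os.path.join(root, str(pid), cap_set)))
    except OSError:      # the process exited, or capfs is not mounted
        return []
```

    For instance, pids_with_cap("change_time") corresponds to the
    shell loop above without the ps(1) step.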

    Controlling capabilities externally is also simple. Below is an
    annotated example, showing how an administrator can change the
    capabilities of a running program (in this case Python) using
    basic tools like rm and touch:

    [ Start Python normally and try to open a raw socket. ]

        whatever$ python2.6
        Python 2.6.6 (r266:84292, Apr  4 2011, 12:12:56)
        [GCC 4.1.3 20080704 prerelease (NetBSD nb2 20081120)] on netbsd5
        Type "help", "copyright", "credits" or "license" for more information.
        >>> from socket import *
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/usr/pkg/lib/python2.6/socket.py", line 184, in __init__
            _sock = _realsocket(family, type, proto)
        socket.error: [Errno 1] Operation not permitted
        >>>

    [ We get EPERM, so we open the permitted capability and try
      again: ]

        >>> cap_fd = open("/cap/self/permitted/raw_socket")
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        >>>

    [ On a different terminal, a root user does the following: ]

        whatever# ps -au | grep python
        elad 9041  0.0  0.7 7668 3760 ttyp1 I+    6:27PM 0:00.09 python2.6
        whatever# ls /cap/9041/effective
        raw_socket
        whatever# rm /cap/9041/effective/raw_socket
        whatever#

    [ This removed the raw socket capability from the effective
      set. We verify back on the first terminal: ]

        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/usr/pkg/lib/python2.6/socket.py", line 184, in __init__
            _sock = _realsocket(family, type, proto)
        socket.error: [Errno 1] Operation not permitted
        >>>

    [ Because it's still in the permitted set though, we can gain it
      back: ]

        >>> cap_fd2 = open("/cap/self/permitted/raw_socket")
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        >>>

    [ However, if it is removed permanently, we cannot. On a
      different terminal: ]

        whatever# rm /cap/9041/effective/raw_socket
        whatever# rm /cap/9041/permitted/raw_socket
        whatever#

    [ And back in the Python terminal, we no longer have the raw socket
      capability: ]

        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/usr/pkg/lib/python2.6/socket.py", line 184, in __init__
            _sock = _realsocket(family, type, proto)
        socket.error: [Errno 1] Operation not permitted
        >>> cap_fd3 = open("/cap/self/permitted/raw_socket")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        IOError: [Errno 2] No such file or directory: '/cap/self/permitted/raw_socket'
        >>>

    [ At this point, a root user can grant capabilities back. For
      example, the raw socket capability can be added to the permitted
      set: ]

        whatever# touch /cap/9041/permitted/raw_socket
        whatever#

    [ Making it possible for us to open a raw socket: ]

        >>> cap_fd3 = open("/cap/self/permitted/raw_socket")
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        >>>

    [ Capabilities can also be added directly to the effective set.
      For example, we first drop the raw socket capability: ]

        >>> cap_fd3.close()
        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/usr/pkg/lib/python2.6/socket.py", line 184, in __init__
            _sock = _realsocket(family, type, proto)
        socket.error: [Errno 1] Operation not permitted
        >>>

    [ Then it's added externally: ]

        whatever# touch /cap/9041/effective/raw_socket
        whatever#

    [ And we can use it again without having to open it: ]

        >>> sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP)
        >>>

    Other tools can be written as scripts (or programs) and don't
    require modifications to any existing code. For example, a simple
    Python script that monitors changes to process capabilities is
    available from:

        http://www.NetBSD.org/~elad/capfs/capwatch.py

    Sample output while running Python in another terminal, gaining
    and dropping the raw_socket capability and exiting:

        whatever$ ./capwatch.py /cap
        14874|new|raw_socket,bind_privport,change_time|
        14874|cap|effective|add|raw_socket
        14874|cap|effective|remove|raw_socket
        14874|gone
        ^C
        whatever$

    (Keep in mind that a proper monitor would receive notifications,
    but kqueue(2) seems inadequate.)
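
    For illustration, a polling monitor can be built from two steps:
    take a snapshot of every process's sets, then compare it with the
    previous one. The sketch below is not the capwatch.py script
    referenced above; the function names are hypothetical, and the
    event format is only loosely modeled on the sample output (in
    particular, reading the "new" line as permitted set followed by
    effective set is a guess):

```python
import os
import time

CAP_SETS = ("permitted", "effective")


def snapshot(root="/cap"):
    """Map each PID to {set name: frozenset of capability names}."""
    state = {}
    for entry in os.listdir(root):
        if not entry.isdigit():
            continue
        try:
            state[int(entry)] = {
                s: frozenset(os.listdir(os.path.join(root, entry, s)))
                for s in CAP_SETS
            }
        except OSError:
            pass  # the process exited while we were reading it
    return state


def diff(old, new):
    """Return event lines describing the change from `old` to `new`."""
    events = []
    for pid in sorted(set(old) | set(new)):
        if pid not in old:
            events.append("%d|new|%s|%s" % (
                pid,
                ",".join(sorted(new[pid]["permitted"])),
                ",".join(sorted(new[pid]["effective"]))))
        elif pid not in new:
            events.append("%d|gone" % pid)
        else:
            for s in CAP_SETS:
                for cap in sorted(new[pid][s] - old[pid][s]):
                    events.append("%d|cap|%s|add|%s" % (pid, s, cap))
                for cap in sorted(old[pid][s] - new[pid][s]):
                    events.append("%d|cap|%s|remove|%s" % (pid, s, cap))
    return events


def watch(root="/cap", interval=1.0):
    """Poll forever, printing one line per detected change."""
    old = snapshot(root)
    while True:
        time.sleep(interval)
        new = snapshot(root)
        for line in diff(old, new):
            print(line)
        old = new
```

    Polling can miss a capability that is gained and dropped between
    two snapshots, which is why a notification-based monitor would be
    preferable.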

(5) Setup
    Setting up capfs is simple. All that is necessary is a
    configuration file indicating the initial allocation of
    capabilities. Keywords can be used, simplifying the creation of an
    initial mapping.

    The configuration file is a plist, only because parsing it is
    easy using existing tools, although a JSON-like syntax could be
    more user-friendly. The general structure is a control element
    (named "flags") and a mapping element ("users"). Each user has a
    list of strings associated with it representing capabilities.

    Flags:
        traditional		if true, initial "permitted" capabilities
				are also made "effective." (default: false)
	
    Username entry keywords:
        $unspecified_users	any user not specified by an explicit entry

    Capability entry keywords:
        $all_caps		all capabilities
        $unprivileged_caps	traditional capabilities of an unprivileged user
        $privileged_caps	traditional capabilities of a privileged user

    For example, here's a configuration file specifying that a user
    called "ntpd" is allowed to bind(2) to a privileged port and
    change the system time:

        <plist version="1.0">
        <dict>
            <key>users</key>
            <array>
                <dict>
                    <key>username</key>
                    <string>ntpd</string>

                    <key>capabilities</key>
                    <array>
                        <string>$unprivileged_caps</string>
                        <string>bind_privport</string>
                        <string>change_time</string>
                    </array>
                </dict>
            </array>
        </dict>
        </plist>

    Theoretically, the traditional security model can be represented
    by the following configuration:

        <plist version="1.0">
        <dict>
            <key>flags</key>
            <dict>
                <key>traditional</key>
                <true/>
            </dict>


            <key>users</key>
            <array>
                <dict>
                    <key>username</key>
                    <string>root</string>

                    <key>capabilities</key>
                    <array>
                        <string>$all_caps</string>
                    </array>
                </dict>

                <dict>
                    <key>username</key>
                    <string>$unspecified_users</string>

                    <key>capabilities</key>
                    <array>
                        <string>$unprivileged_caps</string>
                    </array>
                </dict>
            </array>
        </dict>
        </plist>
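
    To make the semantics of the keywords and the "traditional" flag
    concrete, here is a minimal sketch that reads such a file with
    Python's plistlib and computes a user's initial sets. It is not
    the parser used by mount_capfs(8). The keyword expansions follow
    section 6(d); the function names, the rejection of unknown
    capability names, and the empty default for a user with no
    matching entry are assumptions:

```python
import plistlib

# Keyword expansions as described in section 6(d) of this document.
KNOWN_CAPS = frozenset(["bind_privport", "change_time", "raw_socket"])
KEYWORDS = {
    "$all_caps": KNOWN_CAPS,
    "$privileged_caps": KNOWN_CAPS,
    "$unprivileged_caps": frozenset(),
}


def load_config(xml_bytes):
    """Parse a capfs-style configuration from XML plist bytes."""
    return plistlib.loads(xml_bytes.strip(), fmt=plistlib.FMT_XML)


def initial_caps(config, username):
    """Return (permitted, effective) sets for `username`."""
    entries = {u["username"]: u["capabilities"]
               for u in config.get("users", [])}
    # Assumption: with no explicit entry and no $unspecified_users
    # entry, the user starts with no capabilities.
    names = entries.get(username, entries.get("$unspecified_users", []))
    permitted = set()
    for name in names:
        if name in KEYWORDS:
            permitted |= KEYWORDS[name]
        elif name in KNOWN_CAPS:
            permitted.add(name)
        else:
            raise ValueError("unknown capability: %r" % (name,))
    # In traditional mode, permitted capabilities start out effective.
    traditional = config.get("flags", {}).get("traditional", False)
    effective = set(permitted) if traditional else set()
    return permitted, effective


NTPD_CONF = b"""
<plist version="1.0">
<dict>
    <key>users</key>
    <array>
        <dict>
            <key>username</key>
            <string>ntpd</string>
            <key>capabilities</key>
            <array>
                <string>$unprivileged_caps</string>
                <string>bind_privport</string>
                <string>change_time</string>
            </array>
        </dict>
    </array>
</dict>
</plist>
"""

permitted, effective = initial_caps(load_config(NTPD_CONF), "ntpd")
print(sorted(permitted), sorted(effective))
```

    For the ntpd example, the permitted set holds bind_privport and
    change_time while the effective set is empty, because the
    "traditional" flag defaults to false.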

    Once a configuration file is present, it should be passed to
    mount_capfs(8):
	
        # mount_capfs -f capfs.conf /cap /cap

    (5.1) Multiple instances

          Multiple capfs instances can be present simultaneously,
          allowing chrooted programs to use capfs, but also, and more
          importantly, allowing identical users to be presented with
          different sets of capabilities depending on the root
          directory.

          For example, the root user can be limited. When working
          normally, with / as the root directory, the root user is
          unrestricted. However, when the root directory is /chroot/ntpd
          it is restricted to just the change_time capability, and
          when it's /chroot/httpd to just bind_privport.

          Setting up multiple instances is simple and requires only one
          additional argument to mount_capfs. For example, if we want
          to mount capfs for a chrooted ntpd user under /chroot/ntpd,
          and capfs_ntpd.conf is our (minimal) configuration file, we
          would use:

              # mount_capfs -f capfs_ntpd.conf -r /chroot/ntpd /cap /chroot/ntpd/cap          

          This ensures that when capabilities are initialized, if the
          credentials belong to a process whose root directory is
          /chroot/ntpd, the initial allocation will be according to
          that specified in capfs_ntpd.conf.

          An illustration follows. A program that sets its euid to 1000
          and tries to open a raw socket is in /chroot/rawsocket.
          Running it "normally" fails, since the top-level configuration
          has traditional mode disabled, meaning capabilities must be
          gained (as shown above) in order to become effective:

              whatever# /chroot/rawsocket
              setting euid to 1000
              opening raw socket: fail
              whatever#

          As can be expected, simply chrooting and running it inside
          the chroot fails as well:

              whatever# chroot /chroot /rawsocket
              setting euid to 1000
              opening raw socket: fail
              whatever#

          However, once we mount capfs inside the chroot with a
          configuration that allows the user with id 1000 to open a
          raw socket and enables traditional mode (meaning permitted
          capabilities are automatically effective), we succeed:

              whatever# mount_capfs -f capfs_chroot.conf -r /chroot /cap /chroot/cap
              whatever# chroot /chroot /rawsocket
              setting euid to 1000
              opening raw socket: success
              whatever#
	
(6) Implementation notes
    Capfs is implemented as a pseudo file-system and a set of kauth(9)
    listeners. No userland modifications or recompilations are
    required, except of course for the addition of mount_capfs(8).

    (a) Changes to kauth(9)
        Capabilities are attached to credentials as secmodel
        private data. The semantics of the KAUTH_CRED_INIT notification
        were modified and it's now called when credentials are really
        initialized (e.g. from a set-id context) for a certain user. A
        new KAUTH_CRED_ALLOC notification was added to reflect initial
        allocation of credentials that are not yet initialized.

        Another notification, KAUTH_CRED_CHROOT, was added to indicate
        that a process is being chrooted. Code similar to that used
        when changing ids was added to guarantee a new set of
        credentials. Similar changes are part of Aleksey Cheusov's
        securechroot.

    (b) Forking
        When a process forks, the child gets a reference to the
        parent's credentials. This worked fine in traditional Unix
        settings since credentials held only user/group ids and
        whenever those were changed in e.g. do_setresuid() the
        credentials were created anew, leaving each process with its
        own set.

        In a modern, extensible environment, where a secmodel can add
        its own credential data -- like capfs does with capabilities --
        this means that every location that changes secmodel-specific
        credentials should also deal with preventing improper
        propagation. There are four ways this can be addressed:
        (A) Code similar to that in do_setresuid() should be written
            and called by every secmodel when and if it changes its
            own private credential data,
        (B) Forking should always create a copy of the credentials,
            guaranteeing no two processes share the same instance,
        (C) The secmodel should copy the credentials itself during a
            fork, and properly handle the copy notification as well, or
        (D) We extend secmodel_register() to take e.g. a secmodel_id_t
            that describes the secmodel (name, description, etc.) and
            also contains a boolean indicating whether the secmodel
            needs credentials to be copied on fork. Every time a
            secmodel registers with this variable set to 'true' a
            reference count in kauth(9) is raised; when the secmodel
            deregisters, it's lowered. In kauth_proc_fork(), if this
            counter is zero we do the traditional kauth_cred_hold(),
            but when it's greater than zero we do a copy.
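
        The counter in option (D) can be modeled in a few lines. This
        is a user-space toy, not kernel code and not the kauth(9)
        interface: the class and method names are hypothetical, and a
        plain boolean argument stands in for the proposed
        secmodel_id_t field:

```python
import copy


class Cred:
    """Stand-in for a credential carrying secmodel-private data."""

    def __init__(self, uid, private=None):
        self.uid = uid
        self.private = dict(private or {})


class Kauth:
    """Toy model of the option (D) counter."""

    def __init__(self):
        self.copy_on_fork = 0      # registered secmodels that need copies

    def register(self, needs_copy_on_fork):
        if needs_copy_on_fork:
            self.copy_on_fork += 1

    def deregister(self, needs_copy_on_fork):
        if needs_copy_on_fork:
            self.copy_on_fork -= 1

    def proc_fork(self, parent_cred):
        """Return the credential the child should use."""
        if self.copy_on_fork == 0:
            return parent_cred                 # share, like kauth_cred_hold()
        return copy.deepcopy(parent_cred)      # private copy for the child


kauth = Kauth()
parent = Cred(1000, {"caps": {"raw_socket"}})
print(kauth.proc_fork(parent) is parent)   # shared while nobody needs copies
kauth.register(True)
print(kauth.proc_fork(parent) is parent)   # copied once a secmodel asks
```

        The point of the counter is that the copy cost is paid only
        while at least one registered secmodel asks for it.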

        Option (C) is clearly superior to options (A) and (B): not
        all secmodels have private credential data, so the overhead
        can be avoided for those that don't, and it requires much
        less code to implement. That said, if more than one secmodel
        has private credential data the number of copies will grow
        exponentially, making it very unscalable.

        Ideally option (D) should be implemented, but since this is
        beyond the scope of capfs, it remains to be done.

    (c) Mounting
        The user that mounts capfs is allowed to specify any
        capabilities in the configuration and they are not checked
        against its own, as ideally would have happened.
		
    (d) Supported capabilities
        So far the only capabilities supported are:

            bind_privport	bind to a privileged port
            change_time		change the system time
            raw_socket		open a raw socket

        The $all_caps and $privileged_caps sets are identical and
        contain all three. The $unprivileged_caps set is empty.

        New capabilities can be added by adding a definition for the
        capability to sys/secmodel/capfs/capfs_common.h, e.g.:

            #define CAP_REBOOT    CAP_BIT(3UL)

        and updating the relevant sets (further down in the same file)
        if needed. The CAP_FIRST/CAP_LAST definitions should also be
        kept in sync as indicated. Then the capability name should be
        added to the cap_descriptions array in
        sys/secmodel/capfs/capfs_common.c, e.g.:

            { CAP_REBOOT,    "reboot" },

        Finally, handling for the capability should be added to the
        secmodel code itself in sys/secmodel/capfs/secmodel_capfs.c,
        allowing an operation if it is present, e.g., in the system
        scope listener, secmodel_capfs_system_cb():

            [...]
            case KAUTH_SYSTEM_REBOOT:
                if (CAP_PRESENT(capdata->caps_effective, CAP_REBOOT))
                    result = KAUTH_RESULT_ALLOW;

                break;
            [...]
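
        The bitmask logic behind these steps can be modeled outside
        the kernel. The sketch below is illustrative only: the bit
        positions of the three existing capabilities are assumptions
        (only CAP_REBOOT as bit 3 appears above), the string action
        names are placeholders for the kauth(9) constants, and
        deferring when the capability is absent is assumed rather
        than taken from the capfs source:

```python
def CAP_BIT(n):
    """Mirror of the CAP_BIT() macro: a mask with only bit `n` set."""
    return 1 << n


# Assumption: bits 0-2 are illustrative; only CAP_REBOOT = CAP_BIT(3)
# appears in this document.
CAP_BIND_PRIVPORT = CAP_BIT(0)
CAP_CHANGE_TIME = CAP_BIT(1)
CAP_RAW_SOCKET = CAP_BIT(2)
CAP_REBOOT = CAP_BIT(3)

# Counterpart of the cap_descriptions array: mask -> name.
CAP_DESCRIPTIONS = {
    CAP_BIND_PRIVPORT: "bind_privport",
    CAP_CHANGE_TIME: "change_time",
    CAP_RAW_SOCKET: "raw_socket",
    CAP_REBOOT: "reboot",
}


def CAP_PRESENT(caps, cap):
    """True if every bit of `cap` is set in the mask `caps`."""
    return (caps & cap) == cap


def cap_names(caps):
    """Return the sorted names of the capabilities set in `caps`."""
    return sorted(name for bit, name in CAP_DESCRIPTIONS.items()
                  if CAP_PRESENT(caps, bit))


ALLOW, DEFER = "allow", "defer"


def system_cb(action, caps_effective):
    """Toy system-scope listener: allow only with the effective capability."""
    result = DEFER      # assumption: defer when the capability is absent
    if action == "reboot":
        if CAP_PRESENT(caps_effective, CAP_REBOOT):
            result = ALLOW
    return result


print(cap_names(CAP_REBOOT | CAP_RAW_SOCKET))
print(system_cb("reboot", CAP_RAW_SOCKET))
```

        Note that the check is against the effective set, so a
        capability that is merely permitted does not authorize the
        operation.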

    (e) Definition of capability
        The term "capability" is defined as "the ability to perform an
        operation a security model can hook and make a decision about."
        In this implementation, capfs implements a security model using
        kauth(9), and can only provide capabilities that correspond to
        decisions kauth(9) has a say about. If a certain task, e.g.
        connect(2), does not ask for authorization before proceeding,
        it falls outside the scope of what a capability is for this
        experiment because a security model cannot hook into the
        (non-existent) decision process.

(7) Future work
    In addition to the obvious necessity of adding more capabilities,
    future work could allow attaching capabilities to programs as
    well, replacing the set-id bits currently in use. This falls
    outside the scope of capfs per se, but can use the same
    interfaces, so that when a process starts, its entry is looked up
    and capabilities are loaded and attached to the credentials
    regardless of the user.

    Once this is done, a combination of the two can be supported, such
    that it's possible to specify "user foo has change_time only when
    executing /bin/date" and further limit capabilities only to
    approved programs and not have them as a general trait of the
    user.

    Both features should be based on fileassoc(9), and will first
    require finishing up the KPI to support plugging in to the entire
    file life-cycle. Code that can be used for an initial
    implementation was published in the following email:

        http://mail-index.netbsd.org/tech-security/2009/06/27/msg000238.html

(8) Status
    Capfs is new and has not been tested thoroughly. It may contain
    bugs with security or stability implications (or both). Due to
    items (c) and (d) in section 6 above, it is also not yet entirely
    flexible.

    For all of these reasons, capfs should be considered highly
    experimental and not relied on in any way.

(9) Author
    Elad Efrat <elad@NetBSD.org>