<!-- $NetBSD: chap-file-system.xml,v 1.2 2007/06/20 14:24:48 rumble Exp $ -->

<chapter id="chap-file-system">
  <title>File system internals</title>

  <para>This chapter describes in great detail the concepts behind
  file system development under NetBSD.  It presents some code examples
  under the name of egfs, a fictitious file system that stands for
  <emphasis>example file system</emphasis>.</para>

  <para>Throughout this chapter, the word <emphasis>file</emphasis> is
  used to refer to <emphasis>any kind of file</emphasis> that may exist
  in a file system; this includes directories, regular files, symbolic
  links, special devices and named pipes.  If there is a need to mention
  a file that stores data, the term <emphasis>regular file</emphasis>
  will be used explicitly.</para>

  <para>Understanding a complex subsystem such as the virtual file
  system (VFS) can be difficult.  The chapter starts by giving an
  overview of both the vnode (<xref linkend="vnode_layer" />) and the
  VFS (<xref linkend="vfs_layer" />) layers as well as of the existing
  file systems; these sections should be read in this order.  Together,
  the three sections provide a general outline of the whole subsystem,
  enabling the reader to read and understand existing code when
  needed.</para>

  <para>Later on, it describes all other details related to file system
  implementation and extends the explanations given in the layers'
  overview.  Note that the information is not duplicated, so reading
  the overview sections first is a must.  These sections may be read in
  any order, as they are heavily cross-referenced to ease navigation
  and are structured, more or less, as a reference guide.</para>
  
  <para>At the very end there is a section that summarizes, based on
  ready-to-copy-and-paste code examples, how to write a file system
  driver from scratch.  Note that this section does not contain
  explanations per se but only links to the appropriate sections where
  each point is described.</para>

  <!-- ================================================================ -->

  <sect1 id="vnode_layer">
    <title>vnode layer overview</title>

    <para>A vnode is an abstract representation of an active file
    within the NetBSD kernel; it provides a generic way to operate on
    the real file it represents regardless of the file system it lives
    on.  Thanks to this abstraction layer, all kernel subsystems only
    deal with vnodes.  It is important to note that there is a
    <emphasis>unique vnode for each active file</emphasis>.</para>

    <para>A vnode is described by the <type>struct vnode</type>
    structure; its definition can be found in the
    <filename>src/sys/sys/vnode.h</filename> file and information about
    its fields is available in the &man.vnode.9; manual page.  The
    following sections analyze the most important ideas related to this
    structure.</para>

    <para>As a rule, abstract representations must be specialized
    before they can be instantiated.  vnodes are no exception: each
    file system extends both the static and dynamic parts of a vnode as
    follows:</para>

    <itemizedlist>
      <listitem>
        <para>The static part &mdash; the data fields that represent the
        object &mdash; is extended by attaching a custom data structure
        to a vnode instance during its creation.  This is done through
        the <varname>v_data</varname> field as described in <xref
        linkend="vnode_data" />.</para>
      </listitem>

      <listitem>
        <para>The dynamic part &mdash; the operations applicable to the
        object &mdash; is extended by attaching a vnode operations
        vector to a vnode instance during its creation.  This is done
        through the <varname>v_op</varname> field as described in <xref
        linkend="vnode_ops_vector" />.</para>
      </listitem>
    </itemizedlist>

    <!-- ============================================================== -->

    <sect2 id="vnode_data">
      <title>The vnode data field</title>

      <para>The <varname>v_data</varname> field in the <type>struct
      vnode</type> type is a pointer to an external data structure used
      to represent a file within a concrete file system.  This field
      must be initialized after allocating a new vnode and must be set
      to <literal>NULL</literal> before releasing it (see <xref
      linkend="vnode_dealloc" />).</para>

      <para>This external data structure contains any additional
      information to describe a specific file inside a file system.  In
      an on-disk file system, this might include the file's initial
      cluster, its creation time, its size, etc.  As an example,
      NetBSD's Fast File System (FFS) uses the in-core
      representation of an inode as the vnode's data field.</para>
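
      <para>As a minimal sketch of this idea, the following
      self-contained model attaches a file system specific node to a
      vnode.  It is not kernel code: <type>struct vnode</type> is
      reduced to a stand-in with a single field, and <type>struct
      egfs_node</type>, its members, the
      <function>VP_TO_EGFS_NODE</function> macro and the helper
      functions are hypothetical names chosen only for
      illustration.</para>

      <programlisting>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Simplified stand-in for the kernel's struct vnode. */
struct vnode {
        void *v_data;                   /* file system private data */
};

/* Hypothetical per-file data kept by egfs. */
struct egfs_node {
        uint32_t en_first_cluster;      /* file's initial cluster */
        uint64_t en_size;               /* file size in bytes */
};

/* Convenience accessor from a vnode to its egfs node. */
#define VP_TO_EGFS_NODE(vp) ((struct egfs_node *)(vp)->v_data)

/* Attach the node right after the vnode has been allocated. */
void
egfs_node_attach(struct vnode *vp, struct egfs_node *node)
{
        vp->v_data = node;
}

/* Clear the field before the vnode is released. */
void
egfs_node_detach(struct vnode *vp)
{
        vp->v_data = NULL;
}

/* Example consumer: read the file size through the vnode. */
uint64_t
egfs_node_size(const struct vnode *vp)
{
        return VP_TO_EGFS_NODE(vp)->en_size;
}</programlisting>

      <para>Hiding the cast behind an accessor macro keeps the rest of
      the file system code free of repeated casts from
      <varname>v_data</varname>; whether a real file system uses a
      macro or an inline function for this is a matter of style.</para>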

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_ops">
      <title>vnode operations</title>

      <para>A vnode operation is implemented by a function that adheres
      to the following contract: it returns an integer describing the
      operation's exit status and takes a single <type>void *</type>
      parameter that points to a structure holding the operation's real
      arguments.</para>
      
      <para>Using an external structure to describe the operation's
      arguments instead of using a regular argument list has a reason:
      some file systems extend the vnode with additional, non-standard
      operations; having a common prototype makes this possible.</para>

      <para>The following table summarizes the standard vnode
      operations.  Keep in mind, though, that each file system is free
      to extend this set as it wishes.  Also note that the operation's
      name is shown in the table as the macro used to call it (see <xref
      linkend="vnode_ops_exec" />).</para>

      <table id="table-vnd-ops">
        <title>vnode operations summary</title>

        <tgroup cols="3">
          <thead>
            <row>
              <entry>Operation</entry>
              <entry>Description</entry>
              <entry>See also</entry>
            </row>
          </thead>

          <tbody>
            <row>
              <entry><function>VOP_LOOKUP</function></entry>
              <entry>Performs a path name lookup.</entry>
              <entry>See <xref linkend="lookup" />.</entry>
            </row>

            <row>
              <entry><function>VOP_CREATE</function></entry>
              <entry>Creates a new file.</entry>
              <entry>See <xref linkend="vop_create" />.</entry>
            </row>

            <row>
              <entry><function>VOP_MKNOD</function></entry>
              <entry>Creates a new special file (a device or a named
              pipe).</entry>
              <entry>See <xref linkend="special_nodes" />.</entry>
            </row>

            <row>
              <entry><function>VOP_LINK</function></entry>
              <entry>Creates a new hard link for a file.</entry>
              <entry>See <xref linkend="vop_link" />.</entry>
            </row>

            <row>
              <entry><function>VOP_RENAME</function></entry>
              <entry>Renames a file.</entry>
              <entry>See <xref linkend="vop_rename" />.</entry>
            </row>

            <row>
              <entry><function>VOP_REMOVE</function></entry>
              <entry>Removes a file.</entry>
              <entry>See <xref linkend="vop_remove" />.</entry>
            </row>

            <row>
              <entry><function>VOP_OPEN</function></entry>
              <entry>Opens a file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_CLOSE</function></entry>
              <entry>Closes a file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_ACCESS</function></entry>
              <entry>Checks access permissions on a file.</entry>
              <entry>See <xref linkend="vop_access" />.</entry>
            </row>

            <row>
              <entry><function>VOP_GETATTR</function></entry>
              <entry>Gets a file's attributes.</entry>
              <entry>See <xref linkend="vop_getattr" />.</entry>
            </row>

            <row>
              <entry><function>VOP_SETATTR</function></entry>
              <entry>Sets a file's attributes.</entry>
              <entry>See <xref linkend="vop_setattr" />.</entry>
            </row>

            <row>
              <entry><function>VOP_READ</function></entry>
              <entry>Reads a chunk of data from a file.</entry>
              <entry>See <xref linkend="vop_read_and_vop_write" />.</entry>
            </row>

            <row>
              <entry><function>VOP_WRITE</function></entry>
              <entry>Writes a chunk of data to a file.</entry>
              <entry>See <xref linkend="vop_read_and_vop_write" />.</entry>
            </row>

            <row>
              <entry><function>VOP_IOCTL</function></entry>
              <entry>Performs an &man.ioctl.2; on a file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_FCNTL</function></entry>
              <entry>Performs a &man.fcntl.2; on a file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_POLL</function></entry>
              <entry>Performs a &man.poll.2; on a file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_KQFILTER</function></entry>
              <entry>Registers a kqueue event filter (a knote) on a
              file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_REVOKE</function></entry>
              <entry>Revokes access to a vnode and all of its aliases.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_MMAP</function></entry>
              <entry>Maps a file on a memory region.</entry>
              <entry>See <xref linkend="vop_mmap" />.</entry>
            </row>

            <row>
              <entry><function>VOP_FSYNC</function></entry>
              <entry>Synchronizes the file with on-disk
              contents.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_SEEK</function></entry>
              <entry>Tests and informs the file system of a seek.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_MKDIR</function></entry>
              <entry>Creates a new directory.</entry>
              <entry>See <xref linkend="vop_mkdir" />.</entry>
            </row>

            <row>
              <entry><function>VOP_RMDIR</function></entry>
              <entry>Removes a directory.</entry>
              <entry>See <xref linkend="vop_rmdir" />.</entry>
            </row>

            <row>
              <entry><function>VOP_READDIR</function></entry>
              <entry>Reads directory entries from a directory.</entry>
              <entry>See <xref linkend="vop_readdir" />.</entry>
            </row>

            <row>
              <entry><function>VOP_SYMLINK</function></entry>
              <entry>Creates a new symbolic link for a file.</entry>
              <entry>See <xref linkend="vop_symlink" />.</entry>
            </row>

            <row>
              <entry><function>VOP_READLINK</function></entry>
              <entry>Reads the contents of a symbolic link.</entry>
              <entry>See <xref linkend="vop_readlink" />.</entry>
            </row>

            <row>
              <entry><function>VOP_TRUNCATE</function></entry>
              <entry>Truncates a file.</entry>
              <entry>See <xref linkend="vop_setattr" />.</entry>
            </row>

            <row>
              <entry><function>VOP_UPDATE</function></entry>
              <entry>Updates a file's times.</entry>
              <entry>See <xref linkend="vnode_times" />.</entry>
            </row>

            <row>
              <entry><function>VOP_ABORTOP</function></entry>
              <entry>Aborts an in-progress operation.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_INACTIVE</function></entry>
              <entry>Marks the vnode as inactive.</entry>
              <entry>See <xref linkend="vnode_life_cycle" />.</entry>
            </row>

            <row>
              <entry><function>VOP_RECLAIM</function></entry>
              <entry>Reclaims the vnode.</entry>
              <entry>See <xref linkend="vnode_life_cycle" />.</entry>
            </row>

            <row>
              <entry><function>VOP_LOCK</function></entry>
              <entry>Locks the vnode.</entry>
              <entry>See <xref linkend="vnode_locking" />.</entry>
            </row>

            <row>
              <entry><function>VOP_UNLOCK</function></entry>
              <entry>Unlocks the vnode.</entry>
              <entry>See <xref linkend="vnode_locking" />.</entry>
            </row>

            <row>
              <entry><function>VOP_ISLOCKED</function></entry>
              <entry>Checks whether the vnode is locked or not.</entry>
              <entry>See <xref linkend="vnode_locking" />.</entry>
            </row>

            <row>
              <entry><function>VOP_BMAP</function></entry>
              <entry>Maps a logical block number to a physical block
              number.</entry>
              <entry>See <xref linkend="vnode_ondisk" />.</entry>
            </row>

            <row>
              <entry><function>VOP_STRATEGY</function></entry>
              <entry>Performs a file transfer between the file system's
              backing store and memory.</entry>
              <entry>See <xref linkend="vnode_ondisk" />.</entry>
            </row>

            <row>
              <entry><function>VOP_PATHCONF</function></entry>
              <entry>Returns &man.pathconf.2; information.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_ADVLOCK</function></entry>
              <entry>Performs advisory record locking on a file.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_BWRITE</function></entry>
              <entry>Writes a system buffer.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>VOP_GETPAGES</function></entry>
              <entry>Reads memory pages from the file.</entry>
              <entry>See <xref linkend="vop_getpages_and_vop_putpages"
              />.</entry>
            </row>

            <row>
              <entry><function>VOP_PUTPAGES</function></entry>
              <entry>Writes memory pages to the file.</entry>
              <entry>See <xref linkend="vop_getpages_and_vop_putpages"
              />.</entry>
            </row>
          </tbody>
        </tgroup>
      </table>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_ops_vector">
      <title>The vnode operations vector</title>

      <para>The <varname>v_op</varname> field in the <type>struct
      vnode</type> type is a pointer to the vnode operations vector,
      which maps logical operations to real functions (as seen in <xref
      linkend="vnode_ops" />).  This vector is file system specific as
      the actions taken by each operation depend heavily on the file
      system where the file resides (consider reading a file, setting
      its attributes, etc.).</para>

      <para>As an example, consider the following snippet; it defines
      the <function>open</function> operation and retrieves two
      parameters from its arguments structure:</para>

      <programlisting>int
egfs_open(void *v)
{
        struct vnode *vp = ((struct vop_open_args *)v)->a_vp;
        int mode = ((struct vop_open_args *)v)->a_mode;

        ...
}</programlisting>

      <para>The whole set of vnode operations defined by the file system
      is added to a vector of <type>struct
      vnodeopv_entry_desc</type>-type entries, with each entry being the
      description of a single operation.  The purpose of this vector is
      to define a mapping from logical operations such as
      <function>vop_open</function> or <function>vop_read</function> to real
      functions such as <function>egfs_open</function>,
      <function>egfs_read</function>.  <emphasis>It is not directly used
      by the system</emphasis> under normal operation.  This vector is
      not tied to a specific layout: it only lists operations available
      in the file system it describes, in any order it wishes.  It can
      even list non-standard (and unknown) operations as well as lack
      some of the most basic ones.  (The reason is, again, extensibility
      by third parties.)</para>

      <para>There are two minor restrictions, though:</para>

      <itemizedlist>
        <listitem>
          <para>The first item always points to an operation used in
          case a non-existent one is called.  For example, if the file
          system does not implement the <function>vop_bmap</function>
          operation but some code calls it, the call will be redirected
          to this default-catch function.  As such, it is often used to
          provide a generic error routine but it is also useful in
          different scenarios.  E.g., layered file systems use it to
          pass the call down the stack.</para>

          <para>It is important to note that there are two standard
          error routines available that implement this functionality:
          <function>vn_default_error</function> and
          <function>genfs_eopnotsupp</function>.  The latter correctly
          cleans up vnode references and locks, while the former is the
          traditional error routine.  Since only the latter performs
          this cleanup, new code should use the latter.</para>
        </listitem>

        <listitem>
          <para>The last item is always a pair of null pointers.</para>
        </listitem>
      </itemizedlist>

      <para>Consider the following vector as an example:</para>

      <programlisting>const struct vnodeopv_entry_desc egfs_vnodeop_entries[] = {
        { &amp;vop_default_desc, vn_default_error },
        { &amp;vop_open_desc, egfs_open },
        { &amp;vop_read_desc, egfs_read },
        ... more operations here ...
        { NULL, NULL }
};</programlisting>

      <para>As stated above, this vector is not directly used by the
      system; in fact, it only serves to construct a secondary vector
      that follows strict ordering rules.  This secondary vector is
      automatically generated by the kernel during file system
      initialization, so the code only needs to instruct it to do the
      conversion.</para>

      <para>This secondary vector is defined as a pointer to an array of
      function pointers of type <type>int (**vops)(void *)</type>.  To
      tell the kernel where this vector is, a mapping between the two
      vectors is established through a third vector of <type>struct
      vnodeopv_desc</type>-type items.  This is easier to understand
      with an example:</para>

      <programlisting>int (**egfs_vnodeop_p)(void *);
const struct vnodeopv_desc egfs_vnodeop_opv_desc =
        { &amp;egfs_vnodeop_p, egfs_vnodeop_entries };</programlisting>

      <para>Outside the file system's scope, users of the vnode layer
      only deal with the <varname>egfs_vnodeop_p</varname> and
      <varname>egfs_vnodeop_opv_desc</varname> vectors.</para>
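
      <para>The conversion from the unordered entries vector to the
      strictly ordered secondary vector can be pictured with the
      following simplified userland model.  It is only a sketch and not
      the kernel's actual code: the structures are reduced stand-ins
      whose member names are assumed to resemble the kernel's,
      <literal>MODEL_NOPS</literal>, the slot numbers,
      <function>model_default_error</function> and
      <function>model_opv_init</function> are hypothetical, and only
      four operations exist in this model.</para>

      <programlisting>#include &lt;errno.h&gt;
#include &lt;stddef.h&gt;

#define MODEL_NOPS 4            /* operations known to this model */

typedef int (*vop_t)(void *);

/* Stand-in operation description: a fixed slot plus a name. */
struct vnodeop_desc {
        int vdesc_offset;
        const char *vdesc_name;
};

struct vnodeopv_entry_desc {
        const struct vnodeop_desc *opve_op;
        vop_t opve_impl;
};

struct vnodeopv_desc {
        vop_t **opv_desc_vector_p;
        const struct vnodeopv_entry_desc *opv_desc_ops;
};

const struct vnodeop_desc vop_default_desc = { 0, "vop_default" };
const struct vnodeop_desc vop_open_desc = { 1, "vop_open" };
const struct vnodeop_desc vop_read_desc = { 2, "vop_read" };
const struct vnodeop_desc vop_bmap_desc = { 3, "vop_bmap" };

int model_default_error(void *v) { (void)v; return EOPNOTSUPP; }
int egfs_open(void *v) { (void)v; return 0; }
int egfs_read(void *v) { (void)v; return 0; }

/* Entries may appear in any order; vop_bmap is deliberately missing. */
const struct vnodeopv_entry_desc egfs_vnodeop_entries[] = {
        { &amp;vop_default_desc, model_default_error },
        { &amp;vop_read_desc, egfs_read },
        { &amp;vop_open_desc, egfs_open },
        { NULL, NULL }
};

vop_t egfs_vector_storage[MODEL_NOPS];
vop_t *egfs_vnodeop_p;
const struct vnodeopv_desc egfs_vnodeop_opv_desc =
        { &amp;egfs_vnodeop_p, egfs_vnodeop_entries };

/* Build the slot-indexed vector described by opv. */
void
model_opv_init(const struct vnodeopv_desc *opv, vop_t *storage)
{
        const struct vnodeopv_entry_desc *e;
        vop_t fallback = opv->opv_desc_ops[0].opve_impl;
        int i;

        /* Every slot starts out pointing at the default routine. */
        for (i = 0; i &lt; MODEL_NOPS; i++)
                storage[i] = fallback;

        /* Each listed operation then overwrites its own slot. */
        for (e = opv->opv_desc_ops; e->opve_op != NULL; e++)
                storage[e->opve_op->vdesc_offset] = e->opve_impl;

        *opv->opv_desc_vector_p = storage;
}</programlisting>

      <para>After calling <function>model_opv_init</function> with
      <varname>egfs_vnodeop_opv_desc</varname> and
      <varname>egfs_vector_storage</varname>, the slot for
      <function>vop_bmap</function> holds the default routine because
      the entries vector does not list that operation.  This mirrors
      the role of the first entry described earlier in this
      section.</para>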

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_ops_exec">
      <title>Executing vnode operations</title>

      <para>All vnode operations are subject to a very strict locking
      protocol among several other call and return contracts.
      Furthermore, their prototype makes their call rather complex
      (remember that they receive a structure with the real arguments).
      These are some of the reasons why they cannot be called directly
      (with a few exceptions that will not be discussed here).</para>
      
      <para>The NetBSD kernel provides a set of macros and functions
      that make the execution of vnode operations trivial; please note
      that they are the standard call procedure.  These macros are named
      after the operation they refer to, all in uppercase, prefixed by
      the <literal>VOP_</literal> string.  Each macro takes, as regular
      parameters, the arguments that will be passed to the
      operation.</para>

      <para>For example, consider the following implementation for the
      access operation:</para>

      <programlisting>int
egfs_access(void *v)
{
        struct vnode *vp = ((struct vop_access_args *)v)->a_vp;
        int mode = ((struct vop_access_args *)v)->a_mode;
        struct ucred *cred = ((struct vop_access_args *)v)->a_cred;
        struct proc *p = ((struct vop_access_args *)v)->a_p;

        ...
}</programlisting>

      <para>A call to the previous method could look like this:</para>

      <programlisting>result = VOP_ACCESS(vp, mode, cred, p);</programlisting>

      <para>For more information, see the &man.vnodeops.9; manual page,
      which describes all the mappings between vnode operations and
      their corresponding macros.</para>
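
      <para>To make the calling convention more concrete, the following
      userland model shows what such a wrapper might do: pack its
      parameters into the arguments structure and dispatch through the
      vnode's operations vector.  This is a sketch under simplifying
      assumptions, not the kernel's implementation:
      <function>MODEL_VOP_ACCESS</function>,
      <literal>MODEL_ACCESS_OFFSET</literal>,
      <function>model_default_error</function> and
      <varname>egfs_model_vector</varname> are hypothetical names, the
      arguments structure is cut down to two members, and the locking
      protocol mentioned above is ignored entirely.</para>

      <programlisting>#include &lt;errno.h&gt;
#include &lt;stddef.h&gt;

typedef int (*vop_t)(void *);

/* Simplified stand-in for struct vnode: only the operations vector. */
struct vnode {
        vop_t *v_op;
};

#define MODEL_ACCESS_OFFSET 1   /* hypothetical slot of the access op */

/* Arguments structure, reduced to two members for brevity. */
struct vop_access_args {
        struct vnode *a_vp;
        int a_mode;
};

/* Wrapper in the spirit of VOP_ACCESS: pack arguments, then dispatch. */
static inline int
MODEL_VOP_ACCESS(struct vnode *vp, int mode)
{
        struct vop_access_args a;

        a.a_vp = vp;
        a.a_mode = mode;
        return (*vp->v_op[MODEL_ACCESS_OFFSET])(&amp;a);
}

/* Default routine stored in slot 0 of the model vector. */
int
model_default_error(void *v)
{
        (void)v;
        return EOPNOTSUPP;
}

/* File system side: unpack the arguments from the structure. */
int
egfs_access(void *v)
{
        struct vop_access_args *ap = v;

        /* Toy policy: grant the request only when no mode bits are set. */
        return (ap->a_mode == 0) ? 0 : EACCES;
}

/* Slot-indexed operations vector for this model. */
vop_t egfs_model_vector[] = { model_default_error, egfs_access };</programlisting>

      <para>With a vnode whose <varname>v_op</varname> field points to
      <varname>egfs_model_vector</varname>, a call such as
      <literal>MODEL_VOP_ACCESS(vp, 0)</literal> ends up in
      <function>egfs_access</function> without the caller ever building
      the arguments structure by hand.</para>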

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="vfs_layer">
    <title>VFS layer overview</title>

    <para>The kernel's Virtual File System (VFS) subsystem provides
    access to all available file systems in an abstract fashion, just as
    vnodes do with active files.  Each file system is described by a
    list of well-defined operations that can be applied to it together
    with a data structure that keeps its status.</para>

    <!-- ============================================================== -->

    <sect2 id="vfs_struct_mount">
      <title>The mount structure</title>

      <para>File systems are attached to the virtual directory tree by
      means of mount points.  A mount point is a redirection from a
      specific directory<footnote><para>Technically speaking, a mount
      point need not be a directory: since regular files can be
      NFS-mounted, the mount point could itself be a regular file.
      The restriction to directories is deliberately imposed because
      the system could otherwise run out of name space
      quickly.</para></footnote> to a different file
      system's root directory and is represented by the generic
      <type>struct mount</type> type, which is defined in
      <filename>src/sys/sys/mount.h</filename>.</para>

      <para>A file system extends the static part of a <type>struct
      mount</type> object by attaching a custom data structure to its
      <varname>mnt_data</varname> field.  As with vnodes, this happens
      when allocating the structure.</para>

      <para>The kind of information that a file system stores in its
      mount structure heavily depends on its implementation.  It will
      typically include a pointer (either physical or logical)
      to the file system's root node, used as the starting point for
      further accesses.  It may also include several accounting
      variables as well as other information whose context is the whole
      file system attached to a mount point.</para>
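
      <para>A minimal sketch of such a structure could look as follows.
      It is a userland model rather than kernel code: <type>struct
      mount</type> is reduced to a stand-in holding only
      <varname>mnt_data</varname>, and <type>struct egfs_mount</type>,
      its members, the <function>VFS_TO_EGFS</function> macro and
      <function>egfs_mount_has_room</function> are hypothetical names
      used only for illustration.</para>

      <programlisting>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Simplified stand-in for the kernel's struct mount. */
struct mount {
        void *mnt_data;                 /* file system private data */
};

struct egfs_node;                       /* opaque in this sketch */

/* Hypothetical per-mount data kept by egfs. */
struct egfs_mount {
        struct egfs_node *em_root;      /* root node of this instance */
        uint64_t em_nodes_in_use;       /* accounting: nodes allocated */
        uint64_t em_nodes_max;          /* accounting: node limit */
};

/* Convenience accessor from a mount point to its egfs data. */
#define VFS_TO_EGFS(mp) ((struct egfs_mount *)(mp)->mnt_data)

/* Returns 1 if another node may be allocated on this mount, else 0. */
int
egfs_mount_has_room(const struct mount *mp)
{
        const struct egfs_mount *emp = VFS_TO_EGFS(mp);

        return emp->em_nodes_in_use &lt; emp->em_nodes_max;
}</programlisting>

      <para>Keeping the accounting variables in the per-mount structure
      rather than in globals lets several instances of the same file
      system be mounted at once without interfering with each
      other.</para>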

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vfs_ops">
      <title>VFS operations</title>

      <para>A file system driver exposes a well-known interface to the
      kernel by means of a set of public operations.  The following table
      summarizes them all; note that they are sorted according to the
      order that they take in the VFS operations vector (see <xref
      linkend="vfs_ops_vector" />).</para>

      <table id="table-vfs-ops-summary">
        <title>VFS operations summary</title>

        <tgroup cols="4">
          <thead>
            <row>
              <entry>Operation</entry>
              <entry>Description</entry>
              <entry>Considerations</entry>
              <entry>See also</entry>
            </row>
          </thead>

          <tbody>
            <row>
              <entry><function>fs_mount</function></entry>
              <entry>Mounts a new instance of the file system.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="fs_mount_and_fs_unmount"
              />.</entry>
            </row>

            <row>
              <entry><function>fs_start</function></entry>
              <entry>Makes the file system operational.</entry>
              <entry>Must be defined.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>fs_unmount</function></entry>
              <entry>Unmounts an instance of the file system.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="fs_mount_and_fs_unmount"
              />.</entry>
            </row>

            <row>
              <entry><function>fs_root</function></entry>
              <entry>Gets the file system root vnode.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="fs_root" />.</entry>
            </row>

            <row>
              <entry><function>fs_quotactl</function></entry>
              <entry>Queries or modifies space quotas.</entry>
              <entry>Must be defined.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>fs_statvfs</function></entry>
              <entry>Gets file system statistics.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="fs_statvfs" />.</entry>
            </row>

            <row>
              <entry><function>fs_sync</function></entry>
              <entry>Flushes file system buffers.</entry>
              <entry>Must be defined.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>fs_vget</function></entry>
              <entry>Gets a vnode from a file identifier.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="vnode_alloc" />.</entry>
            </row>

            <row>
              <entry><function>fs_fhtovp</function></entry>
              <entry>Converts an NFS file handle to a vnode.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="vfs_nfs" />.</entry>
            </row>

            <row>
              <entry><function>fs_vptofh</function></entry>
              <entry>Converts a vnode to an NFS file handle.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="vfs_nfs" />.</entry>
            </row>

            <row>
              <entry><function>fs_init</function></entry>
              <entry>Initializes the file system driver.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="fs_init_and_fs_done" />.</entry>
            </row>

            <row>
              <entry><function>fs_reinit</function></entry>
              <entry>Reinitializes the file system driver.</entry>
              <entry>May be undefined (i.e., null).</entry>
              <entry>See <xref linkend="fs_init_and_fs_done" />.</entry>
            </row>

            <row>
              <entry><function>fs_done</function></entry>
              <entry>Finalizes the file system driver.</entry>
              <entry>Must be defined.</entry>
              <entry>See <xref linkend="fs_init_and_fs_done" />.</entry>
            </row>

            <row>
              <entry><function>fs_mountroot</function></entry>
              <entry>Mounts an instance of the file system as the root
              file system.</entry>
              <entry>May be undefined (i.e., null).</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>

            <row>
              <entry><function>fs_extattrctl</function></entry>
              <entry>Controls extended attributes.</entry>
              <entry>The generic <function>vfs_stdextattrctl</function>
              function is provided as a simple hook for file systems that
              do not support this operation.</entry>
              <!-- <entry>See <xref linkend="XXX" />.</entry> -->
            </row>
          </tbody>
        </tgroup>
      </table>

      <para>The list of VFS operations may eventually change.  When that
      happens, the kernel version number is bumped.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vfs_ops_vector">
      <title>The VFS operations structure</title>

      <para>Regardless of mount points, a file system provides a
      <type>struct vfsops</type> structure, as defined in
      <filename>src/sys/sys/mount.h</filename>, that describes the file
      system type itself.  Basically, it contains:</para>

      <itemizedlist>
        <listitem>
          <para>A public identifier, usually named after the file
          system's name suffixed by the <literal>fs</literal> string.
          As this identifier is used in multiple places (especially
          in both kernel space and userland), it is
          typically defined as a macro in
          <filename>src/sys/sys/mount.h</filename>.  For example:
          <literal>#define MOUNT_EGFS "egfs"</literal>.</para>
        </listitem>

        <listitem>
          <para>A set of function pointers to file system operations.
          As opposed to vnode operations, VFS ones have different
          prototypes because the set of possible VFS operations is well
          known and cannot be extended by third party file systems.
          Please see <xref linkend="vfs_ops" /> for more details on the
          exact contents of this vector.</para>
        </listitem>

        <listitem>
          <para>A pointer to a null-terminated vector of <type>struct
          vnodeopv_desc * const</type> items.  These objects are listed
          here because, as stated in <xref linkend="vnode_ops_vector"
          />, the system uses them to construct the real vnode
          operations vectors upon file system startup.</para>

          <para>It is interesting to note that this field may contain
          more than one pointer.  Some file systems may provide more
          than a single set of vnode operations; e.g., a vector for the
          normal operations, another one for operations related to named
          pipes and another one for operations that act on special
          devices.  See the FFS code for an example of this and <xref
          linkend="special_nodes" /> for details on these special
          vectors.</para>
        </listitem>
      </itemizedlist>

      <para>Consider the following code snippet, which illustrates the
      previous items:</para>

      <programlisting>const struct vnodeopv_desc * const egfs_vnodeopv_descs[] = {
        &amp;egfs_vnodeop_opv_desc,
        ... more pointers may appear here ...
        NULL
};

struct vfsops egfs_vfsops = {
        MOUNT_EGFS,
        egfs_mount,
        egfs_start,
        egfs_unmount,
        egfs_root,
        egfs_quotactl,
        egfs_statvfs,
        egfs_sync,
        egfs_vget,
        egfs_fhtovp,
        egfs_vptofh,
        egfs_init,
        NULL, /* fs_reinit: optional */
        egfs_done,
        NULL, /* fs_mountroot: optional */
        vfs_stdextattrctl,
        egfs_vnodeopv_descs
};</programlisting>

      <para>The kernel needs to know where each instance of this
      structure is located in order to keep track of the live file
      systems.  For file systems built inside the kernel's core, the
      <function>VFS_ATTACH</function> macro adds the given VFS
      operations structure to the appropriate link set.  See GNU ld's
      info manual for more details on this feature.</para>

      <programlisting>VFS_ATTACH(egfs_vfsops);</programlisting>

      <para>Standalone file system modules need not do this because the
      kernel will explicitly get a pointer to the information structure
      after the module is loaded.</para>
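
      <para>As a rough illustration of how a null-terminated vector
      laid out like <varname>egfs_vnodeopv_descs</varname> can be
      consumed, the following userland sketch walks one until it reaches
      the terminator.  The <type>struct egfs_desc</type> type and the
      <function>egfs_count_descs</function> helper are hypothetical
      stand-ins invented for this example; they are neither the kernel's
      <type>struct vnodeopv_desc</type> nor a real kernel
      routine.</para>

      <programlisting>#include &lt;stddef.h&gt;

/* Hypothetical stand-in for struct vnodeopv_desc. */
struct egfs_desc {
        const char *ed_name;
};

static const struct egfs_desc egfs_vnodeop_desc = { "vnodeop" };
static const struct egfs_desc egfs_specop_desc = { "specop" };
static const struct egfs_desc egfs_fifoop_desc = { "fifoop" };

/* Null-terminated vector, laid out like egfs_vnodeopv_descs. */
static const struct egfs_desc * const egfs_descs[] = {
        &amp;egfs_vnodeop_desc,
        &amp;egfs_specop_desc,
        &amp;egfs_fifoop_desc,
        NULL
};

/* Count the entries that precede the terminating NULL. */
static size_t
egfs_count_descs(const struct egfs_desc * const *descs)
{
        size_t n = 0;

        while (descs[n] != NULL)
                n++;
        return n;
}</programlisting>

      <para>Because the vector carries its own terminator, the consumer
      does not need to be told how many sets of operations a given file
      system provides.</para>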

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fs_overview">
    <title>File systems overview</title>

    <!-- ============================================================== -->

    <sect2 id="fs_overview_ondisk">
      <title>On-disk file systems</title>

      <para>On-disk file systems are those that store their contents on
      a physical drive.</para>

      <itemizedlist>
        <listitem>
          <para>Fast File System (ffs): XXX</para>
        </listitem>

        <listitem>
          <para>Log-structured File System (lfs): XXX</para>
        </listitem>

        <listitem>
          <para>Extended 2 File System (ext2fs): XXX</para>
        </listitem>

        <listitem>
          <para>FAT (msdosfs): XXX</para>
        </listitem>

        <listitem>
          <para>ISO 9660 (cd9660): XXX</para>
        </listitem>

        <listitem>
          <para>NTFS (ntfs): XXX</para>
        </listitem>
      </itemizedlist>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_overview_network">
      <title>Network file systems</title>

      <para></para>

      <itemizedlist>
        <listitem>
          <para>Network File System (nfs): XXX</para>
        </listitem>

        <listitem>
          <para>Coda (codafs): XXX</para>
        </listitem>
      </itemizedlist>
    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_overview_synthetic">
      <title>Synthetic file systems</title>

      <para></para>

      <itemizedlist>
        <listitem>
          <para>Memory File System (mfs): XXX</para>
        </listitem>

        <listitem>
          <para>Kernel File System (kernfs): XXX</para>
        </listitem>

        <listitem>
          <para>Portal File System (portalfs): XXX</para>
        </listitem>

        <listitem>
          <para>Pseudo-terminal File System (ptyfs): XXX</para>
        </listitem>

        <listitem>
          <para>Temporary File System (tmpfs): XXX</para>
        </listitem>
      </itemizedlist>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_overview_layered">
      <title>Layered file systems</title>

      <para></para>

      <itemizedlist>
        <listitem>
          <para>Null File System (nullfs): XXX</para>
        </listitem>

        <listitem>
          <para>Union File System (unionfs): XXX</para>
        </listitem>

        <listitem>
          <para>User-map File System (umapfs): XXX</para>
        </listitem>
      </itemizedlist>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_overview_helper">
      <title>Helper file systems</title>

      <para>Helper file systems are just a set of functions used to
      easily implement other file systems.  As such, they can be
      considered as libraries.  These are:</para>

      <itemizedlist>
        <listitem>
          <para>fifofs: Implements all operations used to deal with
          named pipes in a file system.</para>
        </listitem>

        <listitem>
          <para>genfs: Implements generic operations shared across
          multiple file systems.</para>
        </listitem>

        <listitem>
          <para>layerfs: Implements generic operations shared across
          layered file systems (see <xref linkend="fs_overview_layered"
          />).</para>
        </listitem>

        <listitem>
          <para>specfs: Implements all operations used to deal with
          special files in a file system.</para>
        </listitem>
      </itemizedlist>

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fs_init_and_fs_done">
    <title>Initialization and cleanup</title>

    <para>Drivers often have an initialization routine and a
    finalization one, called when the driver becomes active (e.g., at
    system startup) or inactive (e.g., unloading its module)
    respectively.  File systems are subject to these rules too, so that
    they can do global tasks as a whole, regardless of any mount
    point.</para>

    <para>These initialization and finalization tasks can be done from
    the <function>fs_init</function> and <function>fs_done</function>
    hooks, respectively.  If the driver is provided as a module, the
    initialization routine is called when it is loaded and the cleanup
    function is executed when it is unloaded.  Instead, if it is built
    into the kernel, the initialization code is executed at very early
    stages of kernel boot but <emphasis>the cleanup stuff is never
    run</emphasis>, not even when the system is shut down.</para>

    <para>Furthermore, the <function>fs_reinit</function> operation is
    provided to... XXX...</para>

    <para>These three operations take the following prototypes:</para>

    <funcsynopsis>
      <funcprototype>
        <funcdef>void <function>fs_init</function></funcdef>
        <paramdef>void</paramdef>
      </funcprototype>
    </funcsynopsis>

    <funcsynopsis>
      <funcprototype>
        <funcdef>void <function>fs_reinit</function></funcdef>
        <paramdef>void</paramdef>
      </funcprototype>
    </funcsynopsis>

    <funcsynopsis>
      <funcprototype>
        <funcdef>void <function>fs_done</function></funcdef>
        <paramdef>void</paramdef>
      </funcprototype>
    </funcsynopsis>

    <para>Note how they do not take any parameter, not even a mount
    point.</para>

    <para>As an example, consider the following functions that deal with
    a malloc type (see <xref linkend="malloc_types" />) defined for a
    specific file system:</para>

    <programlisting>MALLOC_JUSTDEFINE(M_EGFSMNT, "egfs mount", "egfs mount structures");

void
egfs_init(void)
{

        malloc_type_attach(M_EGFSMNT);

        ...
}

void
egfs_done(void)
{

        ...

        malloc_type_detach(M_EGFSMNT);
}</programlisting>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fs_mount_and_fs_unmount">
    <title>Mounting and unmounting</title>

    <para>The mount operation, namely <function>fs_mount</function>, is
    probably the most complex one in the VFS layer.  Its purpose is to
    set up a new mount point based on the arguments received from
    userland.  Basically, it receives the mount point it is operating on
    and a data structure that describes the mount call
    parameters.</para>
    
    <para>Unfortunately, this operation has been overloaded with some
    semantics that do not really belong to it.  More specifically, it is
    also in charge of updating the mount point parameters as well as
    fetching them from userland.  This ought to be cleaned up at some
    point.</para>

    <para>We will see all these details in the following
    subsections.</para>

    <!-- ============================================================== -->

    <sect2 id="fs_mount_args">
      <title>Mount call arguments</title>

      <para>Most file systems pass information from the userland mount
      utility to the kernel when a new mount point is set up; this
      information generally includes user-tunable properties that tell
      the kernel how to mount the file system.  This data set is
      encapsulated in what is known as the mount arguments structure
      and is often named after the file system, appending the
      <literal>_args</literal> string to it.</para>

      <para>Keep in mind that this structure is only used to communicate
      between userland and the kernel.  Once the call that passes the
      information finishes, it is discarded on the kernel side.</para>

      <para>The arguments structure is versioned to make sure that the
      kernel and the userland always use the same field layout and size.
      This is achieved by inserting a field at the very beginning of the
      object, holding its version.</para>

      <para>For example, imagine a virtual file system &mdash; one that
      is not stored on disk; for real (and very similar) code, you can
      look at tmpfs.  Its mount arguments structure could describe the
      ownership of the root directory or the maximum number of files
      that the file system may hold:</para>

      <programlisting>#define EGFS_ARGSVERSION 1
struct egfs_args {
        int ea_version;

        off_t ea_size_max;

        uid_t ea_root_uid;
        gid_t ea_root_gid;
        mode_t ea_root_mode;

        ...
};</programlisting>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_mount_utility">
      <title>The mount utility</title>

      <para>XXX: To be written.  Slightly describe how a userland mount
      utility works.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_mount_operation">
      <title>The fs_mount operation</title>

      <para>The <function>fs_mount</function> operation is called whenever
      a user issues a mount command from userland.  It has the following
      prototype:</para>

      <funcsynopsis>
        <funcprototype>
          <funcdef>int <function>fs_mount</function></funcdef>
          <paramdef>struct mount *<parameter>mp</parameter></paramdef>
          <paramdef>const char *<parameter>path</parameter></paramdef>
          <paramdef>void *<parameter>data</parameter></paramdef>
          <paramdef>struct nameidata *<parameter>ndp</parameter></paramdef>
          <paramdef>struct proc *<parameter>p</parameter></paramdef>
        </funcprototype>
      </funcsynopsis>

      <para>The caller, which is always the kernel, sets up a
      <type>struct mount</type> object and passes it to this routine
      through the <varname>mp</varname> parameter.  It also passes the
      mount arguments structure (as seen in <xref
      linkend="fs_mount_args" />) in the <varname>data</varname>
      parameter.  There are several other arguments, but they are not
      important at this point.</para>

      <para>The <varname>mp-&gt;mnt_flag</varname> field indicates what
      needs to be done (remember that this operation is semantically
      overloaded).  The following is an outline of all the tasks this
      function does and also describes the possible flags for the
      <varname>mnt_flag</varname> field:</para>

      <orderedlist>
        <listitem>
          <para>If the <literal>MNT_GETARGS</literal> flag is set in
          <varname>mp-&gt;mnt_flag</varname>, the operation returns the
          current mount parameters for the given mount point.</para>

          <para>This is further detailed in <xref
          linkend="fs_mount_getargs" />.</para>
        </listitem>

        <listitem>
          <para>Copy the mount arguments structure from userland to
          kernel space using &man.copyin.9;.</para>

          <para>This is further detailed in <xref
          linkend="fs_mount_copyin" />.</para>
        </listitem>

        <listitem>
          <para>If the <literal>MNT_UPDATE</literal> flag is set in
          <varname>mp-&gt;mnt_flag</varname>, the operation updates the
          current mount parameters of the given mount point based on the
          new arguments given (e.g., upgrade to read-write from
          read-only mode).</para>

          <para>This is further detailed in <xref
          linkend="fs_mount_update" />.</para>
        </listitem>

        <listitem>
          <para>At this point, if neither <literal>MNT_GETARGS</literal>
          nor <literal>MNT_UPDATE</literal> was set, the operation sets
          up a new mount point.</para>

          <para>This is further detailed in <xref
          linkend="fs_mount_doit" />.</para>
        </listitem>
      </orderedlist>
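
      <para>The outline above can be sketched as a small dispatch
      routine.  This is a userland model only: the flag values, the
      <type>enum egfs_action</type> type and the
      <function>egfs_mount_action</function> function are hypothetical
      and merely show the order in which the flags are examined.  The
      real <literal>MNT_GETARGS</literal> and
      <literal>MNT_UPDATE</literal> constants come from
      <filename>src/sys/sys/mount.h</filename>.</para>

      <programlisting>/* Hypothetical flag values; the real ones live in sys/mount.h. */
#define EG_MNT_UPDATE   0x0001
#define EG_MNT_GETARGS  0x0002

enum egfs_action {
        EGFS_GETARGS,   /* return the current parameters */
        EGFS_UPDATE,    /* modify an existing mount point */
        EGFS_NEWMOUNT   /* set up a new mount point */
};

/* Decide which task fs_mount performs, in the order of the outline. */
static enum egfs_action
egfs_mount_action(int mnt_flag)
{

        if (mnt_flag &amp; EG_MNT_GETARGS)
                return EGFS_GETARGS;

        /* The arguments would be copied in from userland here. */

        if (mnt_flag &amp; EG_MNT_UPDATE)
                return EGFS_UPDATE;

        return EGFS_NEWMOUNT;
}</programlisting>

      <para>Checking <literal>MNT_GETARGS</literal> first matters
      because in that case the arguments structure is an output of the
      call, so there is nothing to copy in.</para>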

      <!-- ============================================================ -->

      <sect3 id="fs_mount_getargs">
        <title>Retrieving mount parameters</title>

        <para>When the <function>fs_mount</function> operation is called
        with the <literal>MNT_GETARGS</literal> flag in
        <varname>mp-&gt;mnt_flag</varname>, the routine creates and
        fills the mount arguments structure based on the data of the
        given mount point and returns it to userland by using
        &man.copyout.9;.</para>

        <para>This heavily depends on the file system, but consider the
        following simple example:</para>

        <programlisting>if (mp->mnt_flag &amp; MNT_GETARGS) {
        struct egfs_args args;
        struct egfs_mount *emp;

        if (mp->mnt_data == NULL)
                return EIO;
        emp = (struct egfs_mount *)mp->mnt_data;

        args.ea_version = EGFS_ARGSVERSION;

        ... fill the args structure here ...

        return copyout(&amp;args, data, sizeof(args));
}</programlisting>

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="fs_mount_copyin">
        <title>Getting the arguments structure</title>

        <para>The <varname>data</varname> argument given to the
        <function>fs_mount</function> operation points to a memory
        region in user-space.  Therefore, it must be first copied into
        kernel-space by means of &man.copyin.9; to be able to access it
        in a safe fashion.</para>

        <para>Here is a little example:</para>

        <programlisting>int error;
struct egfs_args args;

if (data == NULL)
        return EINVAL;

error = copyin(data, &amp;args, sizeof(args));
if (error)
        return error;

if (args.ea_version != EGFS_ARGSVERSION)
        return EINVAL;</programlisting>

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="fs_mount_update">
        <title>Updating mount parameters</title>

        <para>When the <function>fs_mount</function> operation is called
        with the <literal>MNT_UPDATE</literal> flag in
        <varname>mp-&gt;mnt_flag</varname>, the routine modifies the
        current parameters of the given mount point based on the new
        parameters given in the mount arguments structure.</para>
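
        <para>What gets updated depends entirely on the file system.  A
        minimal sketch, assuming the mount structure mirrors the
        ownership fields of the arguments structure, might look as
        follows.  The structures are trimmed-down userland copies and
        <function>egfs_update</function> is a hypothetical helper, not
        part of any real file system.</para>

        <programlisting>#include &lt;sys/types.h&gt;
#include &lt;errno.h&gt;

#define EGFS_ARGSVERSION 1

/* Trimmed-down copy of the mount arguments structure. */
struct egfs_args {
        int ea_version;
        uid_t ea_root_uid;
        gid_t ea_root_gid;
        mode_t ea_root_mode;
};

/* Hypothetical per-mount structure, as attached to mp->mnt_data. */
struct egfs_mount {
        uid_t em_root_uid;
        gid_t em_root_gid;
        mode_t em_root_mode;
};

/* Apply new parameters to an existing mount point. */
static int
egfs_update(struct egfs_mount *emp, const struct egfs_args *args)
{

        if (args->ea_version != EGFS_ARGSVERSION)
                return EINVAL;

        emp->em_root_uid = args->ea_root_uid;
        emp->em_root_gid = args->ea_root_gid;
        emp->em_root_mode = args->ea_root_mode;

        return 0;
}</programlisting>

        <para>A real implementation would typically also react to
        changes in the generic mount flags, such as the switch between
        read-only and read-write modes mentioned earlier.</para>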

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="fs_mount_doit">
        <title>Setting up a new mount point</title>

        <para>If neither <literal>MNT_GETARGS</literal> nor
        <literal>MNT_UPDATE</literal> was set in
        <varname>mp-&gt;mnt_flag</varname> when calling
        <function>fs_mount</function>, the operation sets up a new mount
        point.  In other words: it fills the <type>struct mount</type>
        object given in <varname>mp</varname> with correct data.</para>

        <para>The very first thing that it usually does is to allocate a
        structure that defines the mount point.  This structure is named
        after the file system, appending the <literal>_mount</literal>
        string to it, and is often very similar to the mount arguments
        structure.  Once allocated and filled with appropriate data, the
        object is attached to the mount point by means of its
        <varname>mnt_data</varname> field.</para>

        <para>Later on, the operation gets a file system identifier for
        the mount point being set up using the &man.vfs.getnewfsid.9;
        function, which stores it in the mount point.</para>

        <para>At last, it sets up any statvfs-related information for
        the mount point by using the &man.set.statvfs.info.9;
        function.</para>

        <para>All of this becomes clearer when looking at a simple code
        example:</para>

        <programlisting>emp = (struct egfs_mount *)malloc(sizeof(struct egfs_mount), M_EGFSMNT, M_WAITOK);
KASSERT(emp != NULL);

/* Fill the emp structure with file system dependent values. */
emp->em_root_uid = args.ea_root_uid;
... more comes here ...

mp->mnt_data = emp;
mp->mnt_flag |= MNT_LOCAL;
mp->mnt_stat.f_namemax = MAXNAMLEN;
vfs_getnewfsid(mp);

return set_statvfs_info(path, UIO_USERSPACE, args.ea_fspec, UIO_SYSSPACE, mp, p);</programlisting>

      </sect3>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="fs_unmount">
      <title>The fs_unmount operation</title>

      <para>Unmounting a file system is often easier than mounting it,
      plus there is no need to write a file system dependent userland
      utility to do an unmount.  This is accomplished by the
      <function>fs_unmount</function> operation, which has the following
      signature:</para>

      <funcsynopsis>
        <funcprototype>
          <funcdef>int <function>fs_unmount</function></funcdef>
          <paramdef>struct mount *<parameter>mp</parameter></paramdef>
          <paramdef>int <parameter>mntflags</parameter></paramdef>
          <paramdef>struct proc *<parameter>p</parameter></paramdef>
        </funcprototype>
      </funcsynopsis>

      <para>The function's outline is similar to the following:</para>

      <orderedlist>
        <listitem>
          <para>Ask the kernel to finalize all pending I/O on the given
          mount point.  This is done through the &man.vflush.9;
          function.  Note that its last argument is a flags bitfield
          which must carry the <literal>FORCECLOSE</literal> flag if the
          file system is being forcibly unmounted &mdash; in other
          words, if the <literal>MNT_FORCE</literal> flag was set in
          <varname>mntflags</varname>.</para>
        </listitem>

        <listitem>
          <para>Free all resources attached to the mount point &mdash;
          i.e., to the mount structure pointed to by
          <varname>mp-&gt;mnt_data</varname>.  This heavily depends on
          the file system internals.</para>
        </listitem>

        <listitem>
          <para>Destroy the file system specific mount structure and
          detach it from the <varname>mp</varname> mount point.</para>
        </listitem>
      </orderedlist>

      <para>Here is a simple example of the previous outline:</para>

      <programlisting>int error, flags;
struct egfs_mount *emp;

flags = (mntflags &amp; MNT_FORCE) ? FORCECLOSE : 0;

error = vflush(mp, NULL, flags);
if (error != 0)
        return error;

emp = (struct egfs_mount *)mp->mnt_data;
... free emp contents here ...

free(mp->mnt_data, M_EGFSMNT);
mp->mnt_data = NULL;

return 0;</programlisting>

      </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fs_statvfs">
    <title>File system statistics</title>

    <para>The &man.statvfs.2; system call is used to retrieve
    statistical information about a mounted file system, such as its
    block size, number of used blocks, etc.  This is implemented in the
    file system driver by the <function>fs_statvfs</function> operation
    whose prototype is:</para>

    <funcsynopsis>
      <funcprototype>
        <funcdef>int <function>fs_statvfs</function></funcdef>
        <paramdef>struct mount *<parameter>mp</parameter></paramdef>
        <paramdef>struct statvfs *<parameter>sbp</parameter></paramdef>
        <paramdef>struct proc *<parameter>p</parameter></paramdef>
      </funcprototype>
    </funcsynopsis>

    <para>The execution flow of this operation is quite simple: it
    basically fills <varname>sbp</varname>'s fields with appropriate
    data.  This data is derivable from the current status of the file
    system &mdash; e.g., through the contents of
    <varname>mp-&gt;mnt_data</varname>.</para>

    <para>It is interesting to note that some of the information
    returned by this operation is stored in the generic part of the
    <varname>mp</varname> structure, shared across all file systems. The
    &man.copy.statvfs.info.9; function takes care of copying this common
    information into the resulting structure with minimal effort.
    Among other things, it copies the file system's identifier, the
    number of writes, the maximum length of file names, etc.</para>

    <para>As a general rule of thumb, the code in
    <function>fs_statvfs</function> manually initializes the following
    fields in the <varname>sbp</varname> structure:
    <varname>f_iosize</varname>, <varname>f_frsize</varname>,
    <varname>f_bsize</varname>, <varname>f_blocks</varname>,
    <varname>f_bavail</varname>, <varname>f_bfree</varname>,
    <varname>f_bresvd</varname>, <varname>f_files</varname>,
    <varname>f_favail</varname>, <varname>f_ffree</varname> and
    <varname>f_fresvd</varname>.  Detailed information about each field
    can be found in &man.statvfs.2;.</para>

    <para>For example, the operation's content may look like:</para>

    <programlisting>... fill sbp's fields as described above ...

copy_statvfs_info(sbp, mp);

return 0;</programlisting>
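
    <para>The fields filled in by <function>fs_statvfs</function> are
    what userland eventually sees through &man.statvfs.2;.  As a small
    illustration from the consumer's side, the sketch below computes
    the space available to unprivileged users from
    <varname>f_bavail</varname> and <varname>f_frsize</varname>.  The
    <function>egfs_avail_bytes</function> helper is a hypothetical name;
    only fields defined by POSIX are used, so the NetBSD-specific ones
    are left out.</para>

    <programlisting>#include &lt;sys/statvfs.h&gt;

/*
 * Store in *bytes the space available to unprivileged users on the
 * file system holding path.  Returns 0 on success, -1 on failure.
 */
static int
egfs_avail_bytes(const char *path, unsigned long long *bytes)
{
        struct statvfs sb;

        if (statvfs(path, &amp;sb) == -1)
                return -1;

        /* f_bavail is expressed in units of f_frsize. */
        *bytes = (unsigned long long)sb.f_bavail * sb.f_frsize;
        return 0;
}</programlisting>

    <para>A file system that reports inconsistent block counts or sizes
    from its <function>fs_statvfs</function> operation will therefore
    produce wrong figures in tools that perform this kind of
    arithmetic.</para>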

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="vnode_management">
    <title>vnode management</title>

    <!-- ============================================================== -->

    <sect2 id="vnode_life_cycle">
      <title>vnode's life cycle</title>

      <para>A vnode, like any other system object, has to be allocated
      before it can be used.  Similarly, it has to be released and
      deallocated when unused.  Things are a bit special when it comes to
      handling a vnode, hence this whole section is dedicated to
      explaining it.</para>

      <para>XXX: A graph could be excellent to have at this point.</para>

      <para>A vnode is first brought to life by the
      &man.getnewvnode.9; function; this returns a clean vnode that can be
      used to represent a file.  This new vnode is also marked as
      <emphasis>used</emphasis> and remains as such until it is marked
      inactive.  A vnode is inactivated by releasing the last
      reference to it.  When this happens, <function>VOP_INACTIVE</function>
      is called for the vnode and the vnode is placed on the free list.</para>

      <para>The <emphasis>free list</emphasis>, despite its confusing
      name, contains real, live, but not currently used vnodes.  It is
      like a big LRU list.  vnodes can be brought to life again from this
      list by using the &man.vget.9; function, and when that happens, they
      leave the free list and are marked as used again until they are
      inactivated.  Why does this list exist, anyway?  For example, think
      about all the commands that need to do path lookups on
      <filename>/usr</filename>.  Anything in
      <filename>/usr/bin</filename>, <filename>/usr/sbin</filename>,
      <filename>/usr/pkg/bin</filename> and
      <filename>/usr/pkg/sbin</filename> will need the
      <filename>/usr</filename> vnode.  If it had to be regenerated from
      scratch each time, it could be slow.  Therefore, it is kept around
      on the free list.</para>

      <para>vnodes on the free list can also be
      <emphasis>reclaimed</emphasis>, which means that they are effectively
      killed.  This can either happen because the vnode is being reused
      for a new vnode (through <function>getnewvnode</function>) or
      because it is being shut down (e.g., due to a
      &man.revoke.2;).</para>

      <para>Note that the <varname>kern.maxvnodes</varname> &man.sysctl.2;
      node specifies how many vnodes can be kept active at a time.</para>
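
      <para>The transitions described in this section can be summarized
      with a toy state machine.  It is a userland model only: the
      <type>enum egfs_vstate</type> type and the
      <function>egfs_v*</function> helpers are hypothetical names that
      mirror the prose, not kernel interfaces, and reference counting is
      reduced to a single counter.</para>

      <programlisting>enum egfs_vstate {
        EGFS_V_USED,            /* referenced by someone */
        EGFS_V_FREE,            /* inactive, on the free list */
        EGFS_V_RECLAIMED        /* detached from its file */
};

struct egfs_vmodel {
        enum egfs_vstate vm_state;
        int vm_refs;
};

/* Models getnewvnode: a clean vnode, marked as used. */
static void
egfs_vnew(struct egfs_vmodel *vm)
{

        vm->vm_state = EGFS_V_USED;
        vm->vm_refs = 1;
}

/* Models dropping a reference; the last one inactivates the vnode. */
static void
egfs_vrelease(struct egfs_vmodel *vm)
{

        if (vm->vm_refs > 0 &amp;&amp; --vm->vm_refs == 0)
                vm->vm_state = EGFS_V_FREE;
}

/* Models vget: revive a vnode from the free list.  Returns 0 if ok. */
static int
egfs_vrevive(struct egfs_vmodel *vm)
{

        if (vm->vm_state == EGFS_V_RECLAIMED)
                return -1;
        vm->vm_state = EGFS_V_USED;
        vm->vm_refs++;
        return 0;
}

/* Models reclaiming: only vnodes on the free list are killed. */
static int
egfs_vreclaim(struct egfs_vmodel *vm)
{

        if (vm->vm_state != EGFS_V_FREE)
                return -1;
        vm->vm_state = EGFS_V_RECLAIMED;
        return 0;
}</programlisting>

      <para>In this model a reclaimed vnode can never be revived, which
      matches the idea that a new vnode has to be obtained for the file
      once the old one has been killed.</para>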

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_tags">
      <title>vnode tags</title>

      <para>vnodes are tagged to identify their type.  The tag attached
      to them must not be used within the kernel; it is only provided
      to let userland applications (such as &man.pstat.8;) print
      information about vnodes.</para>
      
      <para>Note that the use of tags is deprecated because they are not
      extensible from dynamically loadable modules.  However, since they
      are still in use, each file system defines a tag to describe its
      own vnodes.  These tags can be found in
      <filename>src/sys/sys/vnode.h</filename> and &man.vnode.9;.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_alloc">
      <title>Allocation of a vnode</title>

      <para>vnodes are allocated in three different scenarios:</para>

      <itemizedlist>
        <listitem>
          <para>Access to existing files: the kernel does a file name
          lookup as described in <xref linkend="lookup_algorithm" />.
          When the vnode lookup operation finds a match, it allocates a
          vnode for the chosen file and returns it to the system.</para>
        </listitem>

        <listitem>
          <para>Creation of a new file: the file system specific code
          allocates a new vnode after the successful creation of the new
          file and returns it to the file system generic code.  This
          can happen as a result of the vnode create, mkdir, mknod and
          symlink operations.</para>
        </listitem>

        <listitem>
          <para>Access to a file through an NFS file handle: when the
          file system is asked to convert an NFS file handle to a vnode
          through the fhtovp VFS operation, it may need to allocate
          a new vnode to represent the file.  See <xref
          linkend="vfs_nfs" />.</para>
        </listitem>
      </itemizedlist>

      <para>It is important to recall that vnodes are unique per file.
      Special care is taken to avoid allocating more than one vnode for
      a single physical file. Each file system has its own method to
      achieve this; as an example, tmpfs keeps a map between file system
      nodes and vnodes, where the former are its keys.</para>

      <para>However, please do note that there may be files with no
      in-core representation (i.e., no vnode).  Only active and inactive
      but not-yet-reclaimed files are represented by a vnode.</para>

      <para>A simple example that illustrates vnode allocation can be
      found in the <function>tmpfs_alloc_vp</function> function of
      <filename>src/sys/fs/tmpfs/tmpfs_subr.c</filename>.</para>
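
      <para>One way to guarantee a single vnode per file is to keep a
      back pointer from the file system node to its vnode, as the
      <varname>en_vnode</varname> field used elsewhere in this chapter
      suggests.  The following userland sketch assumes that layout;
      <type>struct egfs_vnode</type>,
      <function>egfs_alloc_vp</function> and the use of
      <function>calloc</function> in place of &man.getnewvnode.9; are
      simplifications made up for this example, and locking is ignored
      altogether.</para>

      <programlisting>#include &lt;stdlib.h&gt;

struct egfs_node;

/* Hypothetical stand-in for struct vnode. */
struct egfs_vnode {
        struct egfs_node *v_data;
};

struct egfs_node {
        struct egfs_vnode *en_vnode;    /* NULL if no vnode exists */
};

/* Return the node's vnode, allocating it only if it has none yet. */
static struct egfs_vnode *
egfs_alloc_vp(struct egfs_node *node)
{
        struct egfs_vnode *vp;

        if (node->en_vnode != NULL)
                return node->en_vnode;

        vp = calloc(1, sizeof(*vp));
        if (vp == NULL)
                return NULL;

        /* Link both objects so later lookups find the same vnode. */
        vp->v_data = node;
        node->en_vnode = vp;
        return vp;
}</programlisting>

      <para>Calling the helper twice for the same node yields the same
      pointer, which is the property the lookup code relies on.</para>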

      <para>XXX: I think fs_vget has to be described in this
      section.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_dealloc">
      <title>Deallocation of a vnode</title>

      <para>The procedure to deallocate vnodes is usually trivial: it
      generally cleans up any file system specific information that may
      be attached to the vnode.</para>
      
      <para>Keep in mind that there is <emphasis>a single
      place</emphasis> in the code where vnodes can be detached from
      their underlying nodes and destroyed.  This place is in the vnode
      reclaim operation.  Doing it from any other place will surely
      cause further trouble because the vnode may still be active or
      reusable (see <xref linkend="vnode_life_cycle" />).</para>

      <para>Note that the <varname>v_data</varname> pointer must be set
      to null before exiting the reclaim vnode operation or the system
      will complain because the vnode was not properly cleaned.</para>

      <para>This function is also in charge of releasing the underlying
      real node, if needed.  For example, when a file is deleted the
      corresponding vnode operation is executed &mdash; be it a delete
      or a rmdir &mdash; but the vnode is not released until it is
      reclaimed.  This means that if the real node was deleted before
      this happened, the vnode would be left pointing to an invalid
      memory area.</para>

      <para>Consider the following sample operation:</para>

      <programlisting>int
egfs_reclaim(void *v)
{
        struct vnode *vp = ((struct vop_reclaim_args *)v)->a_vp;

        struct egfs_node *node;

        node = (struct egfs_node *)vp->v_data;

        cache_purge(vp);
        vp->v_data = NULL;
        node->en_vnode = NULL;

        if (node->en_nlinks == 0)
                ... free the underlying node ...

        return 0;
}</programlisting>

      <para>However, keep in mind that releasing a vnode (marking it
      inactive) is not the same as reclaiming it.  The real reclaiming will
      often happen at a much later time, unless explicitly requested.
      The operations that remove files from disk often execute the
      reclaim code on purpose so that the vnode and its associated disk
      space are released as soon as possible.  This can be done by using
      the &man.vrecycle.9; function.</para>

      <para>As an example:</para>

      <programlisting>int
egfs_inactive(void *v)
{
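        struct vnode *vp = ((struct vop_inactive_args *)v)->a_vp;
        /* Assumes the struct proc flavour of this interface. */
        struct proc *p = ((struct vop_inactive_args *)v)->a_p;

        struct egfs_node *node;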

        node = (struct egfs_node *)vp->v_data;

        if (node->en_nlinks == 0) {
                /* The file was deleted from the disk; reclaim it as
                 * soon as possible to free its physical space. */
                vrecycle(vp, NULL, p);
        }

        return 0;
}</programlisting>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_locking">
      <title>vnode's locking protocol</title>

      <para>vnodes have, as almost all other system objects, a locking
      protocol associated with them to avoid access interferences and
      deadlocks.  These may arise in two scenarios:</para>

      <itemizedlist>
        <listitem>
          <para>In uniprocessor systems: a vnode operation returns
          before the operation is complete, thus having to lock the
          vnode to prevent unrelated modifications until the operation
          finishes.  This happens because most file systems are
          asynchronous.</para>

          <para>For example: the read operation prepares a read to a
          file, launches it, puts the process requesting the read to
          sleep and yields execution to another process.  Some time
          later, the disk responds with the requested data, returning it
          to the original process, which is awoken.  The system must
          ensure that while the process was sleeping, the vnode suffers
          no changes.</para>
        </listitem>

        <listitem>
          <para>In multiprocessor systems: two different CPUs want to
          access the same file at the same time, thus needing to pass
          through the same vnode to reach it.  Furthermore, the same
          problems that appear in uniprocessor systems can also appear
          here.</para>
        </listitem>
      </itemizedlist>

      <para>Each vnode operation has a specific locking contract it must
      comply with, which is often different from other operations (this
      makes the protocol very complex and ought to be simplified).
      These contracts are described in &man.vnode.9; and
      &man.vnodeops.9;.  You can also find them in the form of
      assertions in tmpfs' code, should you want to see them expressed
      in logical notation.</para>

      <para>As regards vnode operations, each file system implements
      locking primitives in the vnode layer.  These primitives lock a
      vnode (<function>vop_lock</function>), unlock it
      (<function>vop_unlock</function>) and test whether it is locked or
      not (<function>vop_islocked</function>).  Given that these
      operations are common to all file systems, the genfs pseudo-file
      system provides a set of functions that can be used instead of
      having to write custom ones.  These are
      <function>genfs_lock</function>, <function>genfs_unlock</function>
      and <function>genfs_islocked</function>, and they are used in all
      but very rare cases.</para>

      <para>It is very important to note that
      <emphasis><function>vop_lock</function> is never used
      directly</emphasis>.  Instead, the &man.vn.lock.9; function is
      used to lock vnodes.  Unlocking, however, is done through
      <function>vop_unlock</function>.</para>
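
      <para>The idea of a locking contract can be sketched with a small
      userland model.  This is not kernel code: <type>struct
      toy_vnode</type> and all the <function>toy_</function> functions
      are hypothetical stand-ins, and the "locked on entry, locked on
      exit" contract is only an assumed example of what an operation
      may require.  The assertions play the same role as the ones
      mentioned above for tmpfs.</para>

      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; only the lock state is modelled. */
struct toy_vnode {
        bool tv_locked;
        int tv_size;
};

/* Stand-in for the locking entry point (vn_lock in the real kernel). */
void
toy_lock(struct toy_vnode *vp)
{
        assert(!vp->tv_locked);         /* the toy lock is not recursive */
        vp->tv_locked = true;
}

/* Stand-in for vop_unlock. */
void
toy_unlock(struct toy_vnode *vp)
{
        assert(vp->tv_locked);
        vp->tv_locked = false;
}

/* Stand-in for vop_islocked. */
bool
toy_islocked(const struct toy_vnode *vp)
{
        return vp->tv_locked;
}

/* An operation whose assumed contract is: locked on entry and on exit. */
int
toy_truncate(struct toy_vnode *vp, int size)
{
        assert(toy_islocked(vp));
        vp->tv_size = size;
        assert(toy_islocked(vp));
        return 0;
}

/* A caller that honours the contract: lock, operate, unlock. */
int
toy_caller(struct toy_vnode *vp, int size)
{
        int error;

        toy_lock(vp);
        error = toy_truncate(vp, size);
        toy_unlock(vp);
        return error;
}
```
]]></programlisting>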

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fs_root">
    <title>The root vnode</title>

    <para>As described in <xref linkend="lookup" />, the kernel does all
    path name lookups in an iterative way.  This means that in order to
    reach any file within a mount point, it must first traverse the
    mount point itself.  In other words, the mount point is the only
    place through which the system can access a file system and thus it
    must be able to resolve it.</para>

    <para>In order to accomplish this, each file system provides the
    <function>fs_root</function> hook which returns a vnode
    representing its root node.  The prototype for this function
    is:</para>

    <funcsynopsis>
      <funcprototype>
        <funcdef>int <function>fs_root</function></funcdef>
        <paramdef>struct mount *<parameter>mp</parameter></paramdef>
        <paramdef>struct vnode **<parameter>vpp</parameter></paramdef>
      </funcprototype>
    </funcsynopsis>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="lookup">
    <title>Path name resolution procedure</title>

    <para>XXX Write an introduction.</para>

    <!-- ============================================================== -->

    <sect2 id="path_name_components">
      <title>Path name components</title>

      <para>A path name component is a non-divisible part of a complete
      path name &mdash; one that does not contain the slash
      (<literal>/</literal>) character.  Any path name that includes one
      or more slashes in it can be divided into two or more different
      components.</para>
      
      <para>Path name components are represented by <type>struct
      componentname</type> objects (defined in
      <filename>src/sys/sys/namei.h</filename>), heavily used by several
      vnode operations.  The following are its most important
      fields:</para>

      <itemizedlist>
        <listitem>
          <para><varname>cn_flags</varname>: A bitfield that describes
          the element.  Of special interest is the
          <literal>HASBUF</literal> flag, which indicates that this
          object holds a valid path name buffer (see the
          <varname>cn_pnbuf</varname> field below).</para>
        </listitem>

        <listitem>
          <para><varname>cn_pnbuf</varname>: A pointer to the buffer
          holding the complete path name.  This is only valid if the
          <varname>cn_flags</varname> bitfield has the
          <literal>HASBUF</literal> flag.</para>

          <para>In most situations, this buffer is automatically
          allocated and deallocated by the system, but this is not
          always true.  Sometimes, it is necessary to free it in some of
          the vnode operations themselves; &man.vnodeops.9; gives more
          details about this.</para>
        </listitem>

        <listitem>
          <para><varname>cn_nameptr</varname>: A pointer within
          <varname>cn_pnbuf</varname> that specifies the start of the
          path name component described by this object.  Must
          <emphasis>always</emphasis> be used in conjunction with
          <varname>cn_namelen</varname>.</para>
        </listitem>

        <listitem>
          <para><varname>cn_namelen</varname>: The length of this path
          name component, starting at
          <varname>cn_nameptr</varname>.</para>
        </listitem>
      </itemizedlist>
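
      <para>Because <varname>cn_nameptr</varname> points into the
      middle of the complete path name, the component it designates is
      not necessarily followed by a terminating NUL character, which is
      why the length must always accompany it.  The following userland
      sketch illustrates this with a hypothetical, reduced structure
      (<type>struct toy_componentname</type>) and two hypothetical
      helpers; it is not the kernel's definition.</para>

      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced version of struct componentname. */
struct toy_componentname {
        const char *cn_pnbuf;           /* complete path name */
        const char *cn_nameptr;         /* start of the current component */
        size_t cn_namelen;              /* length of the current component */
};

/* Points cnp at the component that starts at offset off in its buffer. */
void
toy_set_component(struct toy_componentname *cnp, size_t off)
{
        assert(off <= strlen(cnp->cn_pnbuf));
        cnp->cn_nameptr = cnp->cn_pnbuf + off;
        cnp->cn_namelen = strcspn(cnp->cn_nameptr, "/");
}

/* Compares the current component with a NUL-terminated entry name. */
bool
toy_component_matches(const struct toy_componentname *cnp, const char *name)
{
        /* strcmp cannot be used: the component ends at cn_namelen,
         * not at a NUL character. */
        return strlen(name) == cnp->cn_namelen &&
            memcmp(cnp->cn_nameptr, name, cnp->cn_namelen) == 0;
}
```
]]></programlisting>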

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="lookup_algorithm">
      <title>The lookup algorithm</title>

      <para>To <emphasis>resolve a path name</emphasis> (or to
      <emphasis>look up a path name</emphasis>) means to get a vnode
      that uniquely represents the file named by a previously specified
      path name, be it absolute or relative.</para>

      <para>The NetBSD kernel uses a two-level iterative algorithm to
      resolve path names.  The first level is file system independent
      and is carried out by the &man.namei.9; function, while the second
      one relies on internal file system details and is achieved through
      the lookup vnode operation.</para>

      <para>The following list illustrates the lookup algorithm.  Lots
      of details have been left out from it to make things simpler;
      &man.namei.9; and &man.vnodeops.9; contain all the missing
      information:</para>

      <para>XXX: &lt;wrstuden&gt; I think you simplified the description
      too much.  You left out lookup(), and ascribe certain actions to
      namei() when they are performed by lookup().  While I like your
      attempt to keep it simple, I think both namei() and lookup() need
      describing.  lookup() takes a path name and turns it into a vnode,
      and namei() takes the result and handles symbolic link
      resolution.</para>

      <para>XXX: &lt;jmmv&gt; I currently don't know very much about the
      internals of lookup() and namei(), so I've left the simplified
      description in the document, temporarily.</para>

      <orderedlist>
        <listitem>
          <para><function>namei</function> constructs a
          <varname>cnp</varname> path name component (of type
          <type>struct componentname</type> as described in <xref
          linkend="path_name_components" />); its buffer holds the
          complete path name to look for.  The component pointers are
          adjusted to describe the path name's first component.</para>
        </listitem>

        <listitem>
          <para>The <function>namei</function> operation gets the vnode
          for the lookup's starting point (always a directory).  For
          absolute path names, this is the root directory's vnode.  For
          relative path names, it is the current working directory's
          vnode, as seen by the calling userland process.</para>

          <para>This vnode is generally called <varname>dvp</varname>,
          standing for <emphasis>directory vnode
          pointer</emphasis>.</para>
        </listitem>

        <listitem id="loop">
          <para><function>namei</function> calls the vnode lookup
          operation on the <varname>dvp</varname> vnode, telling it
          which is the component it has to resolve
          (<varname>cnp</varname>) starting from the given
          directory.</para>
        </listitem>

        <listitem>
          <para>If the component exists in the directory, the vnode
          lookup operation must return a vnode for its respective
          entry.</para>

          <para>However, if the component does not exist in the
          directory, the lookup will fail returning an appropriate error
          code.  There are several other error conditions that have to
          be reported, all of them appropriately described in
          &man.vnodeops.9;.</para>
        </listitem>

        <listitem>
          <para><function>namei</function> updates
          <varname>dvp</varname> to point to the returned vnode and
          advances <varname>cnp</varname> to the next component, only if
          there are more components to look for.  In that case, the
          procedure continues from <xref linkend="loop" />.</para>

          <para>In case there are no more components to look for,
          <function>namei</function> returns the vnode of the last entry
          it located.</para>
        </listitem>
      </orderedlist>
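
      <para>The simplified steps above can be modelled in userland as
      follows.  This is only a sketch of the iteration: the
      <type>struct toy_node</type> type and the
      <function>toy_lookup</function> and
      <function>toy_namei</function> functions are hypothetical, the
      caller chooses the starting directory, and symbolic links, mount
      points, locking and error codes are all ignored.</para>

      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAXENTRIES 4

/* Hypothetical in-memory node standing in for a vnode. */
struct toy_node {
        const char *tn_name;
        struct toy_node *tn_entries[TOY_MAXENTRIES];    /* NULL if unused */
};

/* Second level: resolve a single component inside the directory dvp. */
struct toy_node *
toy_lookup(struct toy_node *dvp, const char *name, size_t namelen)
{
        for (int i = 0; i < TOY_MAXENTRIES; i++) {
                struct toy_node *np = dvp->tn_entries[i];

                if (np != NULL && strlen(np->tn_name) == namelen &&
                    memcmp(np->tn_name, name, namelen) == 0)
                        return np;
        }
        return NULL;    /* plays the role of a "no such entry" error */
}

/* First level: walk the path name one component at a time. */
struct toy_node *
toy_namei(struct toy_node *start, const char *path)
{
        struct toy_node *dvp = start;
        const char *nameptr = path;

        while (dvp != NULL && *nameptr != '\0') {
                size_t namelen = strcspn(nameptr, "/");

                if (namelen > 0)
                        dvp = toy_lookup(dvp, nameptr, namelen);

                /* Advance to the next component, skipping slashes. */
                nameptr += namelen;
                while (*nameptr == '/')
                        nameptr++;
        }
        return dvp;
}
```
]]></programlisting>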

      <para>There are several reasons behind this two-level lookup
      mechanism, but they have been left out for simplicity.  XXX: The
      4.4BSD book gives them all; we should either link to it or explain
      these here in our own words (preferably the latter).</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="lookup_hints">
      <title>Lookup hints</title>

      <para>One of the arguments passed to the lookup algorithm is a
      hint that specifies the kind of lookup to execute.  This hint
      specifies whether the lookup is for a file creation
      (<literal>CREATE</literal>), a deletion
      (<literal>DELETE</literal>) or a name change
      (<literal>RENAME</literal>).  The file system uses these hints to
      speed up the corresponding operation &mdash; generally to cache
      some values that will be used while processing the real operation
      later on.</para>

      <para>For example, consider the &man.unlink.2; system call whose
      purpose is to delete the given file name.  This operation issues
      a lookup to ensure that the file exists and to get a vnode
      for it.  This way, it is able to call the vnode's remove
      operation.  So far, so good.  Now, the operation itself has to
      delete the file, but removing a file means, among other things,
      detaching it from the directory containing it.  How can the remove
      operation access the directory entry that pointed to the file
      being removed?  Obviously, it can do another lookup and traverse
      a potentially long directory.  But is this really needed?</para>

      <para>Remember that &man.unlink.2; first got a vnode for the entry
      to be removed.  This implied doing a lookup, which traversed the
      file's parent directory looking for its entry.  The algorithm
      reached the entry once, so there is no need to repeat the process
      once we are in the vnode operation itself.</para>

      <para>In the above situation, the second lookup is avoided
      by caching the affected directory entry while the lookup operation
      is executed.  This is only done when the <literal>DELETE</literal>
      hint is given.</para>

      <para>The same situation arises with file creations (because new
      entries may overwrite previously deleted entries in on-disk
      file systems) or name changes (because the operation needs to
      modify the associated directory entry).</para>
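
      <para>The effect of the <literal>DELETE</literal> hint can be
      illustrated with a userland model.  Everything in it is
      hypothetical (the <type>struct toy_dir</type> layout, the
      <varname>td_cached_slot</varname> field and the function names);
      real file systems keep this kind of cache in their own node
      structures and in their own way.  The sketch only shows that the
      removal does not need to scan the directory a second time.</para>

      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NENTRIES 8

enum toy_nameiop { TOY_LOOKUP, TOY_DELETE };

/* Hypothetical directory: fixed slots, NULL marks a free slot. */
struct toy_dir {
        const char *td_names[TOY_NENTRIES];
        int td_cached_slot;     /* slot found by the last DELETE lookup */
};

/* Returns the slot holding name, or -1.  With the DELETE hint, the
 * slot is also remembered for the removal that is about to follow. */
int
toy_dir_lookup(struct toy_dir *dp, const char *name, enum toy_nameiop op)
{
        dp->td_cached_slot = -1;
        for (int i = 0; i < TOY_NENTRIES; i++) {
                if (dp->td_names[i] != NULL &&
                    strcmp(dp->td_names[i], name) == 0) {
                        if (op == TOY_DELETE)
                                dp->td_cached_slot = i;
                        return i;
                }
        }
        return -1;
}

/* Detaches the entry located by the preceding DELETE lookup. */
int
toy_dir_remove(struct toy_dir *dp)
{
        if (dp->td_cached_slot < 0)
                return -1;      /* no DELETE lookup was done before */

        dp->td_names[dp->td_cached_slot] = NULL;        /* no second scan */
        dp->td_cached_slot = -1;
        return 0;
}
```
]]></programlisting>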

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="file_management">
    <title>File management</title>

    <para>XXX: Write an introduction.</para>

    <!-- ============================================================== -->

    <sect2 id="vop_create">
      <title>Creation of regular files</title>

      <para>XXX: To be written.  Describe vop_create.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_link">
      <title>Creation of hard links</title>

      <para>XXX: To be written.  Describe vop_link.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_remove">
      <title>Removal of a file</title>

      <para>XXX: To be written.  Describe vop_remove.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_rename">
      <title>Rename of a file</title>

      <para>XXX: To be written.  Describe vop_rename.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="read_and_write">
      <title>Reading and writing</title>

      <para>vnodes have an operation to read data from them
      (<function>vop_read</function>) and one to write data to them
      (<function>vop_write</function>), both called by their respective
      system calls, &man.read.2; and &man.write.2;.  The read operation
      receives an offset from which the read starts, the number of
      bytes to read (length) and a buffer into which the data will be
      stored.  Similarly, the write operation receives an offset from
      which the write starts, the number of bytes to write and a buffer
      from which the data is read.</para>

      <para>There is also the &man.mmap.2; system call which maps a file
      into memory and provides userland direct access to the mapped
      memory region.</para>

      <!-- ============================================================ -->

      <sect3 id="uio">
        <title>uio objects</title>

        <para>The <type>struct uio</type> type describes a data transfer
        between two different buffers.  One of them is stored within the
        uio object while the other one is external (often living in
        userland space).  These objects are created when a new data
        transfer starts and are alive until the transfer finishes
        completely; in other words, they identify a specific
        transfer.</para>

        <para>The following is a description of the most important
        fields in <type>struct uio</type> (the ones needed for basic
        understanding on how it works).  For a complete list, see
        &man.uiomove.9;.</para>

        <itemizedlist>
          <listitem>
            <para><varname>uio_offset</varname>: The offset within the
            file from which the transfer starts.  If the transfer is a
            read, the offset must be within the file size limits; if it
            is a write, it can extend beyond the end of the file &mdash;
            in which case the file is extended.</para>
          </listitem>

          <listitem>
            <para><varname>uio_resid</varname> (also known as the
            <emphasis>residual count</emphasis>): Number of bytes
            remaining to be transferred for this object.</para>
          </listitem>

          <listitem>
            <para>A set of pointers to buffers into/from which the data
            will be read/written.  These are not used directly and hence
            their names have been left out.</para>
          </listitem>

          <listitem>
            <para>A flag that indicates if data should be read from or
            written to the buffers described by the uio object.</para>
          </listitem>
        </itemizedlist>

        <para>This may be easier to understand by discussing a little
        example.  Consider the following userland program:</para>

        <programlisting>char buffer[1024];
lseek(fd, 100, SEEK_SET);
read(fd, buffer, 1024);</programlisting>

        <para>The &man.read.2; system call constructs a uio object
        containing an offset of 100 bytes and a residual count of 1024
        bytes, making the uio's buffers point to
        <varname>buffer</varname> and marking them as the data's target.
        If this were a write operation, the uio object's buffers would
        be the data's source.</para>

        <para>In order to simplify uio object management, the kernel
        provides the &man.uiomove.9; function, whose signature
        is:</para>

        <funcsynopsis>
          <funcprototype>
            <funcdef>int <function>uiomove</function></funcdef>
            <paramdef>void *<parameter>buf</parameter></paramdef>
            <paramdef>size_t <parameter>n</parameter></paramdef>
            <paramdef>struct uio *<parameter>uio</parameter></paramdef>
          </funcprototype>
        </funcsynopsis>

        <para>This function copies up to <varname>n</varname> bytes
        between the kernel buffer pointed to by <varname>buf</varname>
        and the addresses described by the <varname>uio</varname>
        instance.  If the transfer is successful, the uio object is
        updated so that <varname>uio_resid</varname> is decremented by
        the amount of data copied, <varname>uio_offset</varname> is
        increased by the same amount and the internal buffer pointers
        are updated accordingly.  This eases calling
        <function>uiomove</function> repeatedly (e.g., from within a
        loop) until the transfer is complete.</para>
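
        <para>The bookkeeping described above can be sketched in
        userland.  The <type>struct toy_uio</type> type below is a
        hypothetical reduction of <type>struct uio</type> to a single
        buffer, and <function>toy_uiomove</function> and
        <function>toy_read</function> are hypothetical too; the real
        function also deals with several buffers and with copies across
        the user/kernel boundary.  The chunk size of 4 bytes is
        arbitrary and only serves to force several iterations.</para>

        <programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum toy_uio_rw { TOY_UIO_READ, TOY_UIO_WRITE };

/* Hypothetical, single-buffer reduction of struct uio. */
struct toy_uio {
        char *uio_base;                 /* caller's buffer, next byte to use */
        size_t uio_resid;               /* bytes still to be transferred */
        long uio_offset;                /* current offset within the file */
        enum toy_uio_rw uio_rw;         /* direction of the transfer */
};

/* Moves up to n bytes between buf and the buffer described by uio,
 * then updates the residual count, the offset and the buffer pointer. */
int
toy_uiomove(void *buf, size_t n, struct toy_uio *uio)
{
        if (n > uio->uio_resid)
                n = uio->uio_resid;

        if (uio->uio_rw == TOY_UIO_READ)
                memcpy(uio->uio_base, buf, n);  /* file data to caller */
        else
                memcpy(buf, uio->uio_base, n);  /* caller data to file */

        uio->uio_base += n;
        uio->uio_resid -= n;
        uio->uio_offset += (long)n;
        return 0;
}

/* Reads from an in-memory "file" in chunks of at most 4 bytes. */
int
toy_read(char *file, size_t size, struct toy_uio *uio)
{
        int error = 0;

        if (uio->uio_offset < 0)
                return EINVAL;

        while (error == 0 && uio->uio_resid > 0 &&
            (size_t)uio->uio_offset < size) {
                size_t len = size - (size_t)uio->uio_offset;

                if (len > 4)
                        len = 4;
                error = toy_uiomove(file + uio->uio_offset, len, uio);
        }
        return error;
}
```
]]></programlisting>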

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="vop_getpages_and_vop_putpages">
        <title>Getting and putting pages</title>

        <para>As seen in <xref linkend="uio" />, data transfers are
        described by a high-level object that does not take into account
        any detail of the underlying file system.  More specifically,
        they are not tied to any specific on-disk block organization.
        (Remember that most on-disk file systems store data scattered
        across the disk (due to fragmentation); therefore, the transfers
        have to be broken up into pieces to read or write the data from
        the appropriate disk blocks.)</para>

        <para>Breaking the transfer into pieces, requesting them from
        the disk and handling the results is a (very) complex operation.
        Fortunately, the UVM memory subsystem (see <xref linkend="uvm"
        />) simplifies the whole task.  Each vnode has a <type>struct
        uvm_object</type> (as described in <xref linkend="uvm_object"
        />) associated with it, which is backed by the vnode
        itself.</para>

        <para>The vnode backs up the uobj through its
        <function>vop_getpages</function> and
        <function>vop_putpages</function> operations.  As these two
        operations are very generic (from the point of view of managing
        memory pages), genfs provides two generic functions to implement
        them.  These are <function>genfs_getpages</function> and
        <function>genfs_putpages</function>, which will usually suit the
        needs of any on-disk file system.  How they deal with specific
        file system details is something detailed in <xref
        linkend="vnode_ondisk" />.</para>

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="vop_mmap">
        <title>Memory-mapping a file</title>

        <para>Thanks to the particular UBC implementation in NetBSD (see
        <xref linkend="vop_getpages_and_vop_putpages" />), a file can be
        trivially mapped into memory.  The &man.mmap.2; system call is
        used to achieve this and the kernel handles it independently
        from the file system.</para>

        <para>The <function>VOP_MMAP</function> method is used only to
        inform the file system that the vnode is about to be
        memory-mapped and to ask it whether it allows the mapping to
        happen.</para>

	<para>After the file is memory-mapped, file system I/O is handled
	by UVM through the vnode pager and ends up in
	<function>vop_getpages</function> and <function>vop_putpages</function>.
	In a sense this is very much like regular reading and writing,
	but instead of explicitly calling <function>vop_read</function>
	and <function>vop_write</function>, which then use
	<function>uiomove</function>, the memory window is accessed
	directly.</para>

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="vop_read_and_vop_write">
        <title>The read and write operations</title>

        <para>Thanks to the particular UBC implementation in NetBSD (see
        <xref linkend="vop_getpages_and_vop_putpages" />), the vnode's
        read and write operations (<function>vop_read</function> and
        <function>vop_write</function> respectively) are very simple
        because they only deal with virtual memory.  Basically, all they
        need to do is memory-map the affected part of the file and then
        issue a simple memory copy operation.</para>
        
        <para>As an example, consider the following sample read
        code:</para>

        <programlisting>int
egfs_read(void *v)
{
        struct vnode *vp = ((struct vop_read_args *)v)->a_vp;
        struct uio *uio = ((struct vop_read_args *)v)->a_uio;

        int error;
        struct egfs_node *node;

        node = (struct egfs_node *)vp->v_data;

        if (uio->uio_offset &lt; 0)
                return EINVAL;

        if (uio->uio_resid == 0 || uio->uio_offset >= node->en_size)
                return 0;

        if (vp->v_type == VREG) {
                error = 0;
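                while (uio->uio_resid &gt; 0 &amp;&amp; error == 0) {
                        int flags;
                        vsize_t len;
                        void *win;

                        /* Never read past the end of the file. */
                        len = MIN(uio->uio_resid, node->en_size -
                            uio->uio_offset);
                        if (len == 0)
                                break;

                        /* Map a window of the file into kernel memory;
                         * ubc_alloc may shrink len to fit the window. */
                        win = ubc_alloc(&amp;vp->v_uobj, uio->uio_offset,
                            &amp;len, UBC_READ);
                        /* Copy the window's contents to the caller. */
                        error = uiomove(win, len, uio);
                        flags = UBC_WANT_UNMAP(vp) ? UBC_UNMAP : 0;
                        ubc_release(win, flags);
                }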
        } else {
                ... left out for simplicity (if needed) ...
        }

        return error;
}</programlisting>

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="vnode_ondisk">
        <title>Reading and writing pages</title>

        <para>As seen in <xref linkend="vop_getpages_and_vop_putpages"
        />, the <function>genfs_getpages</function> and
        <function>genfs_putpages</function> functions are enough for
        most on-disk file systems.  But if they are abstract, how do
        they deal with the specific details of each file system?  E.g.,
        if the system wants to fetch the third page of the
        <filename>/foo/bar</filename> file, how does it know which
        on-disk blocks it must read to bring the requested page into
        memory?  Where does the real transfer take place?</para>

        <para>The mapping between memory pages and disk blocks is done
        by the vnode's bmap operation, <function>vop_bmap</function>,
        called by the paging functions.  This receives the file's
        logical block number to be accessed and converts it to the
        internal, file system specific block number.</para>

        <para>Once bmap returns the physical block number to be
        accessed, the generic page handling functions check whether the
        block is already in memory or not.  If it is not, a transfer is
        done by using the vnode's strategy operation
        (<function>vop_strategy</function>).</para>

        <para>More information about these operations can be found in
        the &man.vnodeops.9; manual page.</para>
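
        <para>As an illustration only, consider an imaginary file
        system in which every file has a small, fixed table of direct
        blocks.  The userland sketch below shows the kind of
        translation a bmap operation performs in that case.  The
        <type>struct toy_inode</type> type, the constants and both
        functions are hypothetical, and the real
        <function>vop_bmap</function> receives additional arguments
        that are not modelled here.</para>

        <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define TOY_NDIRECT     4       /* direct blocks per file */
#define TOY_BSIZE       512     /* block size, in bytes */

/* Hypothetical node: each logical block maps to one physical block. */
struct toy_inode {
        long ti_blocks[TOY_NDIRECT];    /* physical block numbers */
};

/* Translates the logical block lbn into a physical block number,
 * stored in *bnp.  Returns 0 on success and -1 if out of range. */
int
toy_bmap(const struct toy_inode *ip, long lbn, long *bnp)
{
        assert(bnp != NULL);

        if (lbn < 0 || lbn >= TOY_NDIRECT)
                return -1;
        *bnp = ip->ti_blocks[lbn];
        return 0;
}

/* Finds the physical block that backs a byte offset within the file. */
int
toy_offset_to_block(const struct toy_inode *ip, long offset, long *bnp)
{
        if (offset < 0)
                return -1;
        return toy_bmap(ip, offset / TOY_BSIZE, bnp);
}
```
]]></programlisting>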

      </sect3>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_attrs">
      <title>Attributes management</title>

      <para>Within the NetBSD kernel, a file has a set of standard and
      well-known attributes associated to it.  These are:</para>

      <itemizedlist>
        <listitem>
          <para>A type: specifies whether the file is a regular file
          (<literal>VREG</literal>), a directory
          (<literal>VDIR</literal>), a symbolic link
          (<literal>VLNK</literal>), a special device
          (<literal>VCHR</literal> or <literal>VBLK</literal>), a named
          pipe (<literal>VFIFO</literal>) or a socket
          (<literal>VSOCK</literal>).  The constants mentioned here are
          the vnode types, which do not necessarily match the internal
          type representation of a file within a file system.</para>
        </listitem>

        <listitem>
          <para>An ownership: that is, a user id and a group id.</para>
        </listitem>

        <listitem>
          <para>An access mode.</para>
        </listitem>

        <listitem>
          <para>A set of flags: these include the immutable flag, the
          append-only flag, the archived flag, the opaque flag and the
          nodump flag.  See &man.chflags.2; for more information.</para>
        </listitem>

        <listitem>
          <para>A hard link count.</para>
        </listitem>

        <listitem>
          <para>A set of times: these include the birth time, the change
          time, the access time and the modification time.  See <xref
          linkend="vnode_times" /> for more details.</para>
        </listitem>

        <listitem>
          <para>A size: the exact size of the file, in bytes.</para>
        </listitem>

        <listitem>
          <para>A device number: in case of a special device (character
          or block ones), its number is also stored.</para>
        </listitem>
      </itemizedlist>

      <para>The NetBSD kernel uses the <type>struct vattr</type> type
      (detailed in &man.vattr.9;) to handle all these attributes in a
      compact way.  Based on this set, each file system typically
      supports these attributes in its node representation structure
      (unless they are fictitious and faked when accessed).  For
      example, FFS could store them in inodes, while FAT could save only
      some of them and fake the others at run time (such as the
      ownership).</para>

      <para>A <type>struct vattr</type> instance is initialized by using
      the <function>VATTR_NULL</function> macro, which sets its vnode
      type to <literal>VNON</literal> and all of its other fields to
      <literal>VNOVAL</literal>, indicating that they have no valid
      values.  After using this macro, it is the responsibility of the
      caller to set all the fields it wants to the correct values.  The
      consumer of the object shall not use those fields whose value is
      unset (<literal>VNOVAL</literal>).</para>
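
      <para>The convention can be sketched in userland as follows.  The
      <type>struct toy_vattr</type> type, the
      <literal>TOY_VNOVAL</literal> constant and the two functions are
      hypothetical reductions of the real ones, used only to show how a
      producer marks fields as unset and how a consumer skips
      them.</para>

      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define TOY_VNOVAL (-1)         /* hypothetical "no value" marker */

/* Hypothetical, reduced attribute set. */
struct toy_vattr {
        int va_mode;
        int va_uid;
        long va_size;
};

/* Plays the role of VATTR_NULL: marks every field as unset. */
void
toy_vattr_null(struct toy_vattr *vap)
{
        vap->va_mode = TOY_VNOVAL;
        vap->va_uid = TOY_VNOVAL;
        vap->va_size = TOY_VNOVAL;
}

/* Consumer: copies into dst only the fields that were set in src. */
void
toy_vattr_apply(struct toy_vattr *dst, const struct toy_vattr *src)
{
        assert(dst != NULL && src != NULL);

        if (src->va_mode != TOY_VNOVAL)
                dst->va_mode = src->va_mode;
        if (src->va_uid != TOY_VNOVAL)
                dst->va_uid = src->va_uid;
        if (src->va_size != TOY_VNOVAL)
                dst->va_size = src->va_size;
}
```
]]></programlisting>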

      <para>It is interesting to note that there are no vnode operations
      that match the regular system calls used to set the file
      ownership, its mode, etc.  Instead, vnodes provide two operations
      that act on the whole attribute set:
      <function>vop_getattr</function> to read them and
      <function>vop_setattr</function> to set them.  The rest of this
      section describes them.</para>

      <!-- ============================================================ -->

      <sect3 id="vop_getattr">
        <title>Getting file attributes</title>

        <para>The <function>vop_getattr</function> vnode operation
        fetches all the standard attributes from a given vnode.  All it
        does is fill the given <type>struct vattr</type>
        structure with the correct values.  For example:</para>

        <programlisting>int
egfs_getattr(void *v)
{
        struct vnode *vp = ((struct vop_getattr_args *)v)->a_vp;
        struct vattr *vap = ((struct vop_getattr_args *)v)->a_vap;

        struct egfs_node *node;

        node = (struct egfs_node *)vp->v_data;

        VATTR_NULL(vap);

        switch (node->en_type) {
        case EGFS_NODE_DIR:
                vap->va_type = VDIR;
                break;
        case ...:
        ...
        }
        vap->va_mode = node->en_mode;
        vap->va_uid = node->en_uid;
        vap->va_gid = node->en_gid;
        vap->va_nlink = node->en_nlink;
        vap->va_flags = node->en_flags;
        vap->va_size = node->en_size;
        ... continue filling values ...

        return 0;
}</programlisting>

      </sect3>

      <!-- ============================================================ -->

      <sect3 id="vop_setattr">
        <title>Setting file attributes</title>

        <para>Similarly to the <function>vop_getattr</function>
        operation, <function>vop_setattr</function> sets a subset of
        file attributes at once.  Only those attributes which are not
        <literal>VNOVAL</literal> are changed.  Furthermore, the
        operation ensures that the caller is not trying to set
        unsettable values; for example, one cannot set (i.e., change)
        the file type.</para>
        
        <para>Of special interest is that the file's size can be changed
        as an attribute.  In other words, this operation is the entry
        point for file truncation calls and it is its responsibility to
        call <function>vop_truncate</function> when appropriate.  The
        system never calls the vnode's truncate operation
        directly.</para>

        <para>A little sketch:</para>

        <programlisting>int
egfs_setattr(void *v)
{
        struct vnode *vp = ((struct vop_setattr_args *)v)->a_vp;
        struct vattr *vap = ((struct vop_setattr_args *)v)->a_vap;
        struct ucred *cred = ((struct vop_setattr_args *)v)->a_cred;
        struct proc *p = ((struct vop_setattr_args *)v)->a_p;

        int error;

        /* Do not allow setting unsettable values. */
        if (vap->va_type != VNON || vap->va_nlink != VNOVAL || ...)
                return EINVAL;

        if (vap->va_flags != VNOVAL) {
                ... set node flags here ...
                if (error != 0)
                        return error;
        }

        if (vap->va_size != VNOVAL) {
                ... verify file type ...
                error = VOP_TRUNCATE(vp, vap->va_size, 0, cred, p);
                if (error != 0)
                        return error;
        }

        ... etcetera ...

        return 0;
}</programlisting>

      </sect3>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vnode_times">
      <title>Time management</title>

      <para>Each node has four times associated to it, all of them
      represented by <type>struct timespec</type> objects.  These times
      are:</para>

      <itemizedlist>
        <listitem>
          <para>Birth time: the time the file was born.  Cannot be
          changed after the file is created.</para>
        </listitem>

        <listitem>
          <para>Access time: the time the file was last accessed.</para>
        </listitem>

        <listitem>
          <para>Change time: the time the file's node was last changed.
          For example, if a new hard link for an existing file is
          created, its change time is updated.</para>
        </listitem>

        <listitem>
          <para>Modification time: the time the file's contents were
          last modified.</para>
        </listitem>
      </itemizedlist>

      <para>Given that these times reflect the last accesses to the
      underlying files, they need to be modified extremely often.  If
      this were done synchronously, it could impose a big performance
      penalty on files accessed repeatedly.  This is why time updates
      are done in a delayed manner.</para>

      <para>Nodes usually have a set of flags (which are only kept in
      memory, never written to disk) that indicate their status
      to let asynchronous actions know what to do.  These flags are
      used, among other things, to indicate that a file's times have to
      be updated.  They are set as soon as the file is changed but the
      times are not really modified until the vnode's update operation
      (<function>vop_update</function>) is called; see &man.vnodeops.9;
      for more details on this.</para>

      <para><function>vop_update</function> is called asynchronously by
      the kernel from time to time.  However, a file system may also
      call it explicitly whenever it needs to; for example, a file
      system mounted synchronously calls it as soon as a change
      happens, so that the times are updated immediately.</para>
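
      <para>The following is a minimal sketch of this flag-based
      scheme, not code taken from any real file system.  The flag
      names, the <varname>en_status</varname> field, the time fields
      and the <function>egfs_itimes</function> helper are all
      assumptions made for illustration; an update operation could call
      such a helper with the current time before writing the node back
      to disk.</para>

      <programlisting>#include &lt;sys/time.h&gt;

/* Hypothetical in-memory status flags; never written to disk. */
#define EGFS_NODE_ACCESSED (1 &lt;&lt; 0)
#define EGFS_NODE_MODIFIED (1 &lt;&lt; 1)
#define EGFS_NODE_CHANGED (1 &lt;&lt; 2)

/* Hypothetical subset of the node structure. */
struct egfs_node {
        int en_status;
        struct timespec en_birthtime;
        struct timespec en_atime;
        struct timespec en_ctime;
        struct timespec en_mtime;
};

/* Apply the pending time updates recorded in en_status. */
static void
egfs_itimes(struct egfs_node *node, const struct timespec *now)
{

        if (node->en_status &amp; EGFS_NODE_ACCESSED)
                node->en_atime = *now;
        if (node->en_status &amp; EGFS_NODE_MODIFIED)
                node->en_mtime = *now;
        if (node->en_status &amp; EGFS_NODE_CHANGED)
                node->en_ctime = *now;

        /* The birth time is never touched after creation. */
        node->en_status &amp;= ~(EGFS_NODE_ACCESSED | EGFS_NODE_MODIFIED |
            EGFS_NODE_CHANGED);
}</programlisting>

      <para>With such a helper, operations like read or write only set
      a flag, which is cheap, and the cost of touching the times is
      paid once per update rather than once per access.</para>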

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_access">
      <title>Access control</title>

      <para>The file system is in charge of checking whether a request
      is valid, permission-wise.  This is done with the vnode's
      access operation (<function>vop_access</function>), which receives
      the caller's credentials and the requested access mode.  The
      operation then checks if these are compatible with the current
      attributes of the file being accessed.</para>

      <para>The operation generally follows this structure:</para>

      <orderedlist>
        <listitem>
          <para>If the file system is mounted read only, and the caller
          wants to write to a directory, to a link or to a regular file,
          then access must be denied.</para>
        </listitem>

        <listitem>
          <para>If the file is immutable and the caller wants to write
          to it, access is denied.</para>
        </listitem>

        <listitem>
          <para>Finally, &man.vaccess.9; is used to check all remaining
          access possibilities.  This greatly simplifies the code of
          this operation.</para>
        </listitem>
      </orderedlist>

      <para>For example:</para>

      <programlisting>int
egfs_access(void *v)
{
        struct vnode *vp = ((struct vop_access_args *)v)->a_vp;
        int mode = ((struct vop_access_args *)v)->a_mode;
        struct ucred *cred = ((struct vop_access_args *)v)->a_cred;

        struct egfs_node *node;

        node = (struct egfs_node *)vp->v_data;

        if (vp->v_type == VDIR || vp->v_type == VLNK || vp->v_type == VREG) {
                if ((mode &amp; VWRITE) &amp;&amp;
                    (vp->v_mount->mnt_flag &amp; MNT_RDONLY))
                        return EROFS;
        }

        if ((mode &amp; VWRITE) &amp;&amp; (node->en_flags &amp; IMMUTABLE))
                return EPERM;

        return vaccess(vp->v_type, node->en_mode, node->en_uid,
            node->en_gid, mode, cred);
}</programlisting>

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="symlink_management">
    <title>Symbolic link management</title>

    <!-- ============================================================== -->

    <sect2 id="vop_symlink">
      <title>Creation of symbolic links</title>

      <para>XXX: To be written.  Describe vop_symlink.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_readlink">
      <title>Read of symbolic link's contents</title>

      <para>XXX: To be written.  Describe vop_readlink.</para>

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="dir_management">
    <title>Directory management</title>

    <para>A directory maps file names to file system nodes.  The
    internal representation of a directory depends heavily on the file
    system, but the vnode layer provides an abstract way to access it.
    This includes the <function>vop_lookup</function>,
    <function>vop_mkdir</function>, <function>vop_rmdir</function> and
    <function>vop_readdir</function> operations.</para>

    <para>For the rest of this section, assume that the following simple
    <type>struct egfs_dirent</type> describes a directory entry:</para>

    <programlisting>struct egfs_dirent {
        char ed_name[MAXNAMLEN];
        int ed_namelen;
        off_t ed_fileid;
};</programlisting>

    <!-- ============================================================== -->

    <sect2 id="vop_mkdir">
      <title>Creation of directories</title>

      <para>XXX: To be written.  Describe vop_mkdir.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_rmdir">
      <title>Removal of directories</title>

      <para>XXX: To be written.  Describe vop_rmdir.</para>

    </sect2>

    <!-- ============================================================== -->

    <sect2 id="vop_readdir">
      <title>Reading directories</title>

      <para>The <function>vop_readdir</function> operation reads the
      contents of a directory in a file system independent way.  Remember
      that the regular read operation can also be used for this purpose,
      though all it returns is the exact contents of the directory; this
      cannot be used by programs that aim to be portable (not to mention
      that some file systems do not support this functionality).</para>

      <para>This operation returns a <type>struct dirent</type> object
      (as seen in &man.dirent.5;) for each directory entry it reads,
      starting at the offset it was given and stopping at that offset
      plus the transfer length.  Because it must read entire objects,
      the offset must always be aligned to a physical directory entry
      boundary; otherwise, the function returns an error.  This rule
      does not apply to every file system, though: some have
      variable-sized entries and use another metric to determine which
      entry to read (such as its ordering index).</para>

      <para>It is important to note that the size of the resulting
      <type>struct dirent</type> objects is variable: it depends on the
      name stored in them.  Therefore, the code first constructs each
      object (setting all its fields by hand) and then uses the
      <function>_DIRENT_SIZE</function> macro to calculate its size,
      which is assigned to the <varname>d_reclen</varname> field.  For
      example:</para>

      <programlisting>struct egfs_dirent de;
struct egfs_node *node;
struct dirent d;

... read a directory entry from disk into de ...
... make node point to the de.ed_fileid node ...

d.d_fileno = de.ed_fileid;
switch (node->en_type) {
case EGFS_NODE_DIR:
        d.d_type = DT_DIR;
        break;
case ...:
...
}

d.d_namlen = de.ed_namelen;
(void)memcpy(d.d_name, de.ed_name, de.ed_namelen);
d.d_name[de.ed_namelen] = '\0';
d.d_reclen = _DIRENT_SIZE(&amp;d);</programlisting>

      <para>With this in mind, the operation also ensures that the
      offset is correct, locates the first entry to return and loops
      until it has exhausted the transmission's length.  The following
      illustrates the process:</para>

<programlisting>int
egfs_readdir(void *v)
{
        struct vnode *vp = ((struct vop_readdir_args *)v)->a_vp;
        struct uio *uio = ((struct vop_readdir_args *)v)->a_uio;
        int *eofflag = ((struct vop_readdir_args *)v)->a_eofflag;

        int entry_counter;
        int error;
        off_t startoff;
        struct egfs_dirent de;
        struct egfs_node *dnode;
        struct egfs_node *node;

        if (vp->v_type != VDIR)
                return ENOTDIR;

        if (uio->uio_offset % sizeof(struct egfs_dirent) > 0)
                return EINVAL;

        dnode = (struct egfs_node *)vp->v_data;

        ... read the first directory entry into de ...
        ... make node point to the de.ed_fileid node ...

        entry_counter = 0;
        startoff = uio->uio_offset;
        do {
                struct dirent d;

                ... construct d from de ...

                error = uiomove(&amp;d, d.d_reclen, uio);

                entry_counter++;
                ... read the next directory entry into de ...
                ... make node point to the de.ed_fileid node ...
        } while (error == 0 &amp;&amp; uio->uio_resid > 0
            &amp;&amp; de is valid);

        /* Important: Update transfer offset to match on-disk
         * directory entries, not virtual ones. */
        uio->uio_offset = startoff +
            entry_counter * sizeof(struct egfs_dirent);

        if (eofflag != NULL)
                *eofflag = (de is invalid?);

        return error;
}</programlisting>

      <para>File systems that support NFS take some extra steps in this
      function.  See &man.vnodeops.9; for more details.  XXX: Cookies
      and the eof flag should really be explained here.</para>
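
      <para>As a partial illustration of the cookies mentioned above,
      the fragment below sketches how the example
      <function>egfs_readdir</function> might fill them in.  It assumes
      the <varname>a_cookies</varname> and
      <varname>a_ncookies</varname> arguments described in
      &man.vnodeops.9;, assumes that each cookie is the offset of the
      entry following the one returned, and reuses the
      <varname>startoff</varname> and
      <varname>entry_counter</varname> variables from the code above.
      It is a sketch to be checked against the manual page, not a
      reference implementation.</para>

      <programlisting>off_t **cookies = ((struct vop_readdir_args *)v)->a_cookies;
int *ncookies = ((struct vop_readdir_args *)v)->a_ncookies;

... after the loop that transfers the entries ...

/* The case in which no entry was returned is left out of this sketch. */
if (error == 0 &amp;&amp; cookies != NULL &amp;&amp; ncookies != NULL &amp;&amp;
    entry_counter > 0) {
        int i;

        /* One cookie per returned entry; the caller frees the array. */
        *cookies = malloc(entry_counter * sizeof(off_t), M_TEMP, M_WAITOK);
        *ncookies = entry_counter;

        for (i = 0; i &lt; entry_counter; i++)
                (*cookies)[i] = startoff +
                    (off_t)((i + 1) * sizeof(struct egfs_dirent));
}</programlisting>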

    </sect2>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="special_nodes">
    <title>Special nodes</title>

    <para>File systems that support named pipes and/or special devices
    implement the vnode's mknod operation
    (<function>vop_mknod</function>) in order to create them.  This is
    extremely similar to <function>vop_create</function>.  However, it
    takes some extra steps because named pipes and special devices are
    not like regular files: their contents are not stored in the file
    system and they have specific access methods.  Therefore, they
    cannot use the file system's regular vnode operations vector.</para>

    <para>In other words, the file system defines two additional vnode
    operations vectors: one for named pipes and one for special devices.
    Fortunately, this task is easy because the virtual fifofs
    (<filename>src/sys/miscfs/fifofs</filename>) and specfs
    (<filename>src/sys/miscfs/specfs</filename>) file systems provide
    generic vnode operations.  In general, these vectors use the
    generic operations for everything except a few functions.</para>

    <para>Because the on-disk file system has to update the node's times
    when accessing these special files, some operations are implemented
    on a file system basis and later call the generic operations
    implemented in fifofs and specfs.  This basically means that those
    file systems implement their own <function>vop_close</function>,
    <function>vop_read</function> and <function>vop_write</function>
    operations for named pipes and for special devices.</para>

    <para>As a little example of such an operation:</para>

    <programlisting>int
egfs_fifo_read(void *v)
{
        struct vnode *vp = ((struct vop_read_args *)v)->a_vp;

        ((struct egfs_node *)vp->v_data)->en_status |= EGFS_NODE_ACCESSED;
        return VOCALL(fifo_vnodeop_p, VOFFSET(vop_read), v);
}</programlisting>

    <para>Remember that these two additional operations vectors must be
    added to the vnode operations description structure; otherwise, they
    will not be initialized and therefore will not work.  See <xref
    linkend="vfs_ops_vector" />.</para>
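
    <para>A possible shape for the named pipes vector and its
    registration is sketched below.  The
    <varname>egfs_fifoop_p</varname>,
    <varname>egfs_fifoop_entries</varname>,
    <varname>egfs_fifoop_opv_desc</varname> and
    <varname>egfs_specop_opv_desc</varname> names, as well as the
    <function>egfs_fifo_close</function> and
    <function>egfs_fifo_write</function> wrappers, are assumptions made
    for this example, and the entries that simply reuse the generic
    fifofs operations are elided.  The special devices vector would be
    built the same way.</para>

    <programlisting>int (**egfs_fifoop_p)(void *);
const struct vnodeopv_entry_desc egfs_fifoop_entries[] = {
        { &amp;vop_default_desc, vn_default_error },
        /* Wrappers that mark the node's times for update first. */
        { &amp;vop_close_desc, egfs_fifo_close },
        { &amp;vop_read_desc, egfs_fifo_read },
        { &amp;vop_write_desc, egfs_fifo_write },
        ... entries that point to the generic fifofs operations ...
        { NULL, NULL }
};
const struct vnodeopv_desc egfs_fifoop_opv_desc =
        { &amp;egfs_fifoop_p, egfs_fifoop_entries };

/* In egfs_vfsops.c: list every vector so that all get initialized. */
const struct vnodeopv_desc * const egfs_vnodeopv_descs[] = {
        &amp;egfs_fifoop_opv_desc,
        &amp;egfs_specop_opv_desc,
        &amp;egfs_vnodeop_opv_desc,
        NULL,
};</programlisting>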

    <para>For more sample code, consult
    <filename>src/sys/fs/tmpfs/tmpfs_fifoops.c</filename>,
    <filename>src/sys/fs/tmpfs/tmpfs_fifoops.h</filename>,
    <filename>src/sys/fs/tmpfs/tmpfs_specops.c</filename> and
    <filename>src/sys/fs/tmpfs/tmpfs_specops.h</filename>.</para>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="vfs_nfs">
    <title>NFS support</title>

    <para>XXX: To be written.  Describe vfs_fhtovp and vfs_vptofh.</para>

  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fs_steps">
    <title>Step by step file system writing</title>

    <orderedlist>
      <listitem>
        <para>Create the <filename>src/sys/fs/egfs</filename>
        directory.</para>
      </listitem>

      <listitem>
        <para>Create a minimal
        <filename>src/sys/fs/egfs/files.egfs</filename> file:</para>

        <programlisting>
deffs fs_egfs.h EGFS
file fs/egfs/egfs_vfsops.c egfs
file fs/egfs/egfs_vnops.c egfs</programlisting>
      </listitem>

      <listitem>
        <para>Modify <filename>src/sys/conf/files</filename> to include
        <filename>files.egfs</filename>.  I.e., add the following
        line:</para>

        <programlisting>include "fs/egfs/files.egfs"</programlisting>
      </listitem>

      <listitem>
        <para>Define the file system's name in
        <filename>src/sys/sys/mount.h</filename>.  I.e., add the
        following line:</para>

        <programlisting>#define MOUNT_EGFS "egfs"</programlisting>
      </listitem>

      <listitem>
        <para>Define the file system's vnode tag type.</para>

        <para>See <xref linkend="vnode_tags" />.</para>
      </listitem>

      <listitem>
        <para>Add the file system's magic number in the Linux
        compatibility layer,
        <filename>src/sys/compat/linux/common/linux_misc.c</filename>
        and
        <filename>src/sys/compat/linux/common/linux_misc.h</filename>,
        if applicable.  Fall back to the default number if there is
        nothing appropriate for the file system.</para>
      </listitem>

      <listitem>
        <para>Create a minimal
        <filename>src/sys/fs/egfs/egfs_vnops.c</filename> file that
        contains stubs for all vnode operations.</para>

        <programlisting>#include &lt;sys/cdefs.h&gt;
__KERNEL_RCSID(0, "$NetBSD$");

#include &lt;sys/param.h&gt;
#include &lt;sys/vnode.h&gt;

#include &lt;miscfs/genfs/genfs.h&gt;

#define egfs_lookup genfs_eopnotsupp
#define egfs_create genfs_eopnotsupp
#define egfs_mknod genfs_eopnotsupp
#define egfs_open genfs_eopnotsupp
#define egfs_close genfs_eopnotsupp
#define egfs_access genfs_eopnotsupp
#define egfs_getattr genfs_eopnotsupp
#define egfs_setattr genfs_eopnotsupp
#define egfs_read genfs_eopnotsupp
#define egfs_write genfs_eopnotsupp
#define egfs_fcntl genfs_eopnotsupp
#define egfs_ioctl genfs_eopnotsupp
#define egfs_poll genfs_eopnotsupp
#define egfs_kqfilter genfs_eopnotsupp
#define egfs_revoke genfs_eopnotsupp
#define egfs_mmap genfs_eopnotsupp
#define egfs_fsync genfs_eopnotsupp
#define egfs_seek genfs_eopnotsupp
#define egfs_remove genfs_eopnotsupp
#define egfs_link genfs_eopnotsupp
#define egfs_rename genfs_eopnotsupp
#define egfs_mkdir genfs_eopnotsupp
#define egfs_rmdir genfs_eopnotsupp
#define egfs_symlink genfs_eopnotsupp
#define egfs_readdir genfs_eopnotsupp
#define egfs_readlink genfs_eopnotsupp
#define egfs_abortop genfs_eopnotsupp
#define egfs_inactive genfs_eopnotsupp
#define egfs_reclaim genfs_eopnotsupp
#define egfs_lock genfs_eopnotsupp
#define egfs_unlock genfs_eopnotsupp
#define egfs_bmap genfs_eopnotsupp
#define egfs_strategy genfs_eopnotsupp
#define egfs_print genfs_eopnotsupp
#define egfs_pathconf genfs_eopnotsupp
#define egfs_islocked genfs_eopnotsupp
#define egfs_advlock genfs_eopnotsupp
#define egfs_blkatoff genfs_eopnotsupp
#define egfs_valloc genfs_eopnotsupp
#define egfs_reallocblks genfs_eopnotsupp
#define egfs_vfree genfs_eopnotsupp
#define egfs_truncate genfs_eopnotsupp
#define egfs_update genfs_eopnotsupp
#define egfs_bwrite genfs_eopnotsupp
#define egfs_getpages genfs_eopnotsupp
#define egfs_putpages genfs_eopnotsupp

int (**egfs_vnodeop_p)(void *);
const struct vnodeopv_entry_desc egfs_vnodeop_entries[] = {
        { &amp;vop_default_desc, vn_default_error },
        { &amp;vop_lookup_desc, egfs_lookup },
        { &amp;vop_create_desc, egfs_create },
        { &amp;vop_mknod_desc, egfs_mknod },
        { &amp;vop_open_desc, egfs_open },
        { &amp;vop_close_desc, egfs_close },
        { &amp;vop_access_desc, egfs_access },
        { &amp;vop_getattr_desc, egfs_getattr },
        { &amp;vop_setattr_desc, egfs_setattr },
        { &amp;vop_read_desc, egfs_read },
        { &amp;vop_write_desc, egfs_write },
        { &amp;vop_ioctl_desc, egfs_ioctl },
        { &amp;vop_fcntl_desc, egfs_fcntl },
        { &amp;vop_poll_desc, egfs_poll },
        { &amp;vop_kqfilter_desc, egfs_kqfilter },
        { &amp;vop_revoke_desc, egfs_revoke },
        { &amp;vop_mmap_desc, egfs_mmap },
        { &amp;vop_fsync_desc, egfs_fsync },
        { &amp;vop_seek_desc, egfs_seek },
        { &amp;vop_remove_desc, egfs_remove },
        { &amp;vop_link_desc, egfs_link },
        { &amp;vop_rename_desc, egfs_rename },
        { &amp;vop_mkdir_desc, egfs_mkdir },
        { &amp;vop_rmdir_desc, egfs_rmdir },
        { &amp;vop_symlink_desc, egfs_symlink },
        { &amp;vop_readdir_desc, egfs_readdir },
        { &amp;vop_readlink_desc, egfs_readlink },
        { &amp;vop_abortop_desc, egfs_abortop },
        { &amp;vop_inactive_desc, egfs_inactive },
        { &amp;vop_reclaim_desc, egfs_reclaim },
        { &amp;vop_lock_desc, egfs_lock },
        { &amp;vop_unlock_desc, egfs_unlock },
        { &amp;vop_bmap_desc, egfs_bmap },
        { &amp;vop_strategy_desc, egfs_strategy },
        { &amp;vop_print_desc, egfs_print },
        { &amp;vop_islocked_desc, egfs_islocked },
        { &amp;vop_pathconf_desc, egfs_pathconf },
        { &amp;vop_advlock_desc, egfs_advlock },
        { &amp;vop_blkatoff_desc, egfs_blkatoff },
        { &amp;vop_valloc_desc, egfs_valloc },
        { &amp;vop_reallocblks_desc, egfs_reallocblks },
        { &amp;vop_vfree_desc, egfs_vfree },
        { &amp;vop_truncate_desc, egfs_truncate },
        { &amp;vop_update_desc, egfs_update },
        { &amp;vop_bwrite_desc, egfs_bwrite },
        { &amp;vop_getpages_desc, egfs_getpages },
        { &amp;vop_putpages_desc, egfs_putpages },
        { NULL, NULL }
};
const struct vnodeopv_desc egfs_vnodeop_opv_desc =
        { &amp;egfs_vnodeop_p, egfs_vnodeop_entries };</programlisting>
      </listitem>

      <listitem>
        <para>Create a minimal
        <filename>src/sys/fs/egfs/egfs_vfsops.c</filename> file that
        contains stubs for all VFS operations.</para>

        <programlisting>#include &lt;sys/cdefs.h&gt;
__KERNEL_RCSID(0, "$NetBSD$");

#include &lt;sys/param.h&gt;
#include &lt;sys/mount.h&gt;

static int egfs_mount(struct mount *, const char *, void *,
    struct nameidata *, struct proc *);
static int egfs_start(struct mount *, int, struct proc *);
static int egfs_unmount(struct mount *, int, struct proc *);
static int egfs_root(struct mount *, struct vnode **);
static int egfs_quotactl(struct mount *, int, uid_t, void *,
    struct proc *);
static int egfs_vget(struct mount *, ino_t, struct vnode **);
static int egfs_fhtovp(struct mount *, struct fid *, struct vnode **);
static int egfs_vptofh(struct vnode *, struct fid *);
static int egfs_statvfs(struct mount *, struct statvfs *, struct proc *);
static int egfs_sync(struct mount *, int, struct ucred *, struct proc *);
static void egfs_init(void);
static void egfs_done(void);
static int egfs_checkexp(struct mount *, struct mbuf *, int *,
    struct ucred **);
static int egfs_snapshot(struct mount *, struct vnode *,
    struct timespec *);

extern const struct vnodeopv_desc egfs_vnodeop_opv_desc;

const struct vnodeopv_desc * const egfs_vnodeopv_descs[] = {
        &amp;egfs_vnodeop_opv_desc,
        NULL,
};

struct vfsops egfs_vfsops = {
        MOUNT_EGFS,
        egfs_mount,
        egfs_start,
        egfs_unmount,
        egfs_root,
        egfs_quotactl,
        egfs_statvfs,
        egfs_sync,
        egfs_vget,
        egfs_fhtovp,
        egfs_vptofh,
        egfs_init,
        NULL, /* vfs_reinit: not yet (optional) */
        egfs_done,
        NULL, /* vfs_wassysctl: deprecated */
        NULL, /* vfs_mountroot: not yet (optional) */
        egfs_checkexp,
        egfs_snapshot,
        vfs_stdextattrctl,
        egfs_vnodeopv_descs
};
VFS_ATTACH(egfs_vfsops);

static int
egfs_mount(struct mount *mp, const char *path, void *data,
    struct nameidata *ndp, struct proc *p)
{

        return EOPNOTSUPP;
}

static int
egfs_start(struct mount *mp, int flags, struct proc *p)
{

        return EOPNOTSUPP;
}

static int
egfs_unmount(struct mount *mp, int mntflags, struct proc *p)
{

        return EOPNOTSUPP;
}

static int
egfs_root(struct mount *mp, struct vnode **vpp)
{

        return EOPNOTSUPP;
}

static int
egfs_quotactl(struct mount *mp, int cmd, uid_t uid, void *arg,
    struct proc *p)
{

        return EOPNOTSUPP;
}

static int
egfs_vget(struct mount *mp, ino_t ino, struct vnode **vpp)
{

        return EOPNOTSUPP;
}

static int
egfs_fhtovp(struct mount *mp, struct fid *fhp, struct vnode **vpp)
{

        return EOPNOTSUPP;
}

static int
egfs_vptofh(struct vnode *vp, struct fid *fhp)
{

        return EOPNOTSUPP;
}

static int
egfs_statvfs(struct mount *mp, struct statvfs *sbp, struct proc *p)
{

        return EOPNOTSUPP;
}

static int
egfs_sync(struct mount *mp, int waitfor, struct ucred *uc, struct proc *p)
{

        return EOPNOTSUPP;
}

static void
egfs_init(void)
{

        /* Nothing to do yet. */
}

static void
egfs_done(void)
{

        /* Nothing to do yet. */
}

static int
egfs_checkexp(struct mount *mp, struct mbuf *mb, int *wh,
    struct ucred **anon)
{

        return EOPNOTSUPP;
}

static int
egfs_snapshot(struct mount *mp, struct vnode *vp, struct timespec *ctime)
{

        return EOPNOTSUPP;
}</programlisting>
      </listitem>

      <listitem>
        <para>Define a new malloc type for the file system and modify
        the <function>egfs_init</function> and
        <function>egfs_done</function> hooks to attach and detach it in
        the LKM case.</para>

        <para>See <xref linkend="fs_init_and_fs_done" />.</para>
      </listitem>

      <listitem>
        <para>Create the <filename>src/sys/fs/egfs/egfs.h</filename>
        file, which will define all the structures needed by the file
        system.</para>

        <programlisting>#if !defined(_EGFS_H_)
#  define _EGFS_H_
#else
#  error "egfs.h cannot be included multiple times."
#endif

#if defined(_KERNEL)

struct egfs_mount {
        ...
};

struct egfs_node {
        ...
};

#endif /* defined(_KERNEL) */

#define EGFS_ARGSVERSION 1
struct egfs_args {
        char *ea_fspec;

        int ea_version;

        ...
};</programlisting>
      </listitem>

      <listitem>
        <para>Create the <filename>src/sbin/mount_egfs</filename>
        directory.</para>
      </listitem>

      <listitem>
        <para>Create a simple
        <filename>src/sbin/mount_egfs/Makefile</filename> file:</para>

        <programlisting>.include &lt;bsd.own.mk&gt;

PROG= mount_egfs
SRCS= mount_egfs.c
MAN= mount_egfs.8

CPPFLAGS+= -I${NETBSDSRCDIR}/sys
WARNS= 4

.include &lt;bsd.prog.mk&gt;</programlisting>
      </listitem>

      <listitem>
        <para>Create a simple
        <filename>src/sbin/mount_egfs/mount_egfs.c</filename> program
        that calls the &man.mount.2; system call.</para>

        <para>XXX: Add an example or link to the corresponding
        section.</para>
      </listitem>

      <listitem>
        <para>Create an empty
        <filename>src/sbin/mount_egfs/mount_egfs.8</filename> manual
        page.  Its details are outside the scope of this guide.</para>
      </listitem>

      <listitem>
        <para>Fill in the <function>egfs_mount</function> and
        <function>egfs_unmount</function> functions.</para>

        <para>See <xref linkend="fs_mount_and_fs_unmount" />.</para>
      </listitem>

      <listitem>
        <para>Fill in the <function>egfs_statvfs</function> function.
        Return correct data if possible at this point or leave it for a
        later step.</para>
      </listitem>

      <listitem>
        <para>Set the <function>vop_fsync</function>,
        <function>vop_bwrite</function> and
        <function>vop_putpages</function> operations to
        <function>genfs_nullop</function>.  These need to be defined and
        return successfully to avoid crashes during &man.sync.2; and
        &man.mount.2;.  We will fill them in at a later stage.</para>
      </listitem>

      <listitem>
        <para>Set the <function>vop_abortop</function> operation to
        <function>genfs_abortop</function>.</para>
      </listitem>

      <listitem>
        <para>Set the locking operations to
        <function>genfs_lock</function>,
        <function>genfs_unlock</function> and
        <function>genfs_islocked</function>.  You will most likely need
        locking, so it is better if you get it right from the
        beginning.</para>

        <para>See <xref linkend="vnode_locking" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>vop_reclaim</function> and
        <function>vop_inactive</function> operations to correctly
        destroy vnodes.</para>

        <para>See <xref linkend="vnode_dealloc" />.</para>
      </listitem>

      <listitem>
        <para>Fill in the <function>egfs_sync</function> function.  In
        case you do not know what to put in it, just return success
        (zero); otherwise, serious problems will arise because it will
        be impossible for the operating system to flush your file
        system.</para>
      </listitem>

      <listitem>
        <para>Fill in the <function>egfs_root</function> function.
        Assuming you already read the file system's root node from disk
        (or whichever backing store you use) and have it in memory,
        simply allocate and lock a vnode for it.</para>

        <para>See <xref linkend="vnode_alloc" />.</para>

        <programlisting>int
egfs_root(struct mount *mp, struct vnode **vpp)
{

        return egfs_alloc_vp(mp,
            ((struct egfs_mount *)mp->mnt_data)->em_root, vpp);
}</programlisting>
      </listitem>

      <listitem>
        <para>Improve the mount utility to support standard options (see
        getmntopts(3)) and possibly some file system specific options
        too.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_getattr</function> and
        <function>egfs_setattr</function> operations.  As a
        side effect, implement <function>egfs_update</function> and
        <function>egfs_sync</function> too.  For the latter, you only
        need a stub that returns success for now.</para>

        <para>See <xref linkend="vnode_attrs" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_access</function>
        operation.</para>

        <para>See <xref linkend="vop_access" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_print</function> function.
        This is trivial, as all it has to do is dump vnode information
        (its attributes, mostly) on screen, but it will help with
        debugging.</para>

        <para>See <xref linkend="vop_access" />.</para>
      </listitem>

      <listitem>
        <para>Implement a simple <function>egfs_lookup</function>
        function that can locate any given file; be careful to conform
        with the locking protocol described in &man.vnodeops.9;, as this
        part is really tricky.  At this point, you can forget about the
        lookup hints (<literal>CREATE</literal>,
        <literal>DELETE</literal> or <literal>RENAME</literal>); you
        will add them when needed.</para>

        <para>See <xref linkend="lookup" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_open</function> function.  In
        the general case, this one only needs to verify that the open
        mode is correct against the file flags.</para>

        <programlisting>int
egfs_open(void *v)
{
        struct vnode *vp = ((struct vop_open_args *)v)->a_vp;
        int mode = ((struct vop_open_args *)v)->a_mode;

        struct egfs_node *node;

        node = (struct egfs_node *)vp->v_data;

        if ((node->en_flags &amp; APPEND) &amp;&amp;
            (mode &amp; (FWRITE | O_APPEND)) == FWRITE)
                return EPERM;

        return 0;
}</programlisting>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_close</function> function.
        In the general case, this one needs to do nothing aside from
        returning success.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_readdir</function> operation
        so that you can start interacting with your file system.  After
        you add this function, you should be able to list any directory
        in it, and check that the files' attributes are shown correctly.
        And most likely, you will start seeing bugs ;-)</para>

        <para>See <xref linkend="vop_readdir" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_mkdir</function> operation.
        You may need to modify the <function>egfs_lookup</function>
        function to honour the <literal>CREATE</literal> hint.</para>

        <para>See <xref linkend="lookup_hints" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_rmdir</function> operation.
        You may need to modify the <function>egfs_lookup</function>
        function to honour the <literal>DELETE</literal> hint.  Note
        that adding an operation that removes stuff from the file system
        is tricky; problems will certainly pop up if you have got bugs
        in your vnode allocation code or in the
        <function>egfs_inactive</function> or
        <function>egfs_reclaim</function> functions.</para>

        <para>See <xref linkend="lookup_hints" /> and <xref
        linkend="vnode_dealloc" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_create</function> operation
        to create regular files (<literal>VREG</literal>) and local
        sockets (<literal>VSOCK</literal>).</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_remove</function> operation
        to delete files.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_link</function> operation to
        create hard links.  Be sure to control the file's hard link
        count correctly.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_rename</function> operation.
        This one may seem complex due to the number of arguments it
        takes, but it is not so difficult to implement.  Just keep in
        mind that it has to handle renames as well as moves, and the
        situations in which each happens.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_read</function> and
        <function>egfs_write</function> operations.  These are quite
        simple thanks to the indirection provided by the vnode's UVM
        object.</para>

        <para>See <xref linkend="read_and_write" />.</para>
      </listitem>

      <listitem>
        <para>Redirect the <function>egfs_getpages</function> and
        <function>egfs_putpages</function> operations to
        <function>genfs_getpages</function> and
        <function>genfs_putpages</function> respectively.  This should
        be enough for most file systems.</para>

        <para>See <xref linkend="vop_getpages_and_vop_putpages" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_bmap</function> and
        <function>egfs_strategy</function> operations.</para>

        <para>See <xref linkend="vnode_ondisk" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_truncate</function>
        operation.</para>
      </listitem>

      <listitem>
        <para>Redirect the <function>egfs_fcntl</function>,
        <function>egfs_ioctl</function>, <function>egfs_poll</function>,
        <function>egfs_revoke</function> and
        <function>egfs_mmap</function> operations to their corresponding
        ones in genfs.  This should be enough for most file systems;
        note that even FFS does this.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_pathconf</function>
        operation.  This one is trivial, although the documentation in
        &man.pathconf.2; and &man.vnodeops.9; is a bit
        inconsistent.</para>

        <programlisting>int
egfs_pathconf(void *v)
{
        int name = ((struct vop_pathconf_args *)v)->a_name;
        register_t *retval = ((struct vop_pathconf_args *)v)->a_retval;

        switch (name) {
        case _PC_LINK_MAX:
                *retval = LINK_MAX;
                break;
        case ...:
        ...
        default:
                /* Unknown or unsupported variable. */
                return EINVAL;
        }

        return 0;
}</programlisting>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_symlink</function> and
        <function>egfs_readlink</function> operations to manage symbolic
        links.</para>
        
        <para>See <xref linkend="symlink_management" />.</para>
      </listitem>

      <listitem>
        <para>Implement the <function>egfs_mknod</function> operation,
        which adds support for named pipes and special devices.</para>

        <para>See <xref linkend="special_nodes" />.</para>
      </listitem>

      <listitem>
        <para>Add NFS support.  This basically means implementing the
        <function>egfs_vptofh</function>,
        <function>egfs_checkexp</function> and
        <function>egfs_fhtovp</function> VFS operations.</para>
        
        <para>See <xref linkend="vfs_nfs" />.</para>
      </listitem>
    </orderedlist>
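
    <para>The redirections to genfs named in the steps above can be
    collected in <filename>src/sys/fs/egfs/egfs_vnops.c</filename>
    roughly as follows.  This is only a sketch: it assumes that each
    definition replaces the corresponding
    <function>genfs_eopnotsupp</function> stub, that
    <function>vop_putpages</function> has already moved from
    <function>genfs_nullop</function> to its final target, and that
    <function>genfs_enoioctl</function> is an acceptable target for
    <function>egfs_ioctl</function>, which the steps do not name
    explicitly.</para>

    <programlisting>/* Must return success early on; to be filled in later. */
#define egfs_fsync genfs_nullop
#define egfs_bwrite genfs_nullop

#define egfs_abortop genfs_abortop

/* Locking. */
#define egfs_lock genfs_lock
#define egfs_unlock genfs_unlock
#define egfs_islocked genfs_islocked

/* Paging. */
#define egfs_getpages genfs_getpages
#define egfs_putpages genfs_putpages

/* Operations that most file systems leave to genfs. */
#define egfs_fcntl genfs_fcntl
#define egfs_ioctl genfs_enoioctl
#define egfs_poll genfs_poll
#define egfs_revoke genfs_revoke
#define egfs_mmap genfs_mmap</programlisting>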

  </sect1>

  <!-- ================================================================ -->

</chapter>
