<!-- $NetBSD: chap-processes.xml,v 1.1 2007/06/09 11:33:38 dsieger Exp $ -->

<chapter id="chap-processes">
  <title>Processes and threads</title>
    <para>This chapter describes processes and threads in NetBSD. This includes
    process startup, traps and system calls, process and thread creation
    and termination, signal delivery, and thread scheduling.</para>
    <para>CAUTION! This chapter is a work in progress: it has not yet been
    reviewed for typos or technical mistakes.</para>
  
  <!-- ================================================================ -->

  <sect1 id="process_startup">
    <title>Process startup</title>

    <sect2 id="execve_usage">
      <title><function>execve</function> usage</title>
      <para>On Unix systems, new programs are started using the 
      <function>execve</function> system call. If successful, 
      <function>execve</function> replaces the currently executing program
      with a new one. This is done within the same process, by reinitializing 
      the whole virtual memory mapping and loading the new program binary into 
      memory. All of the process's threads except the calling one are
      terminated, and the CPU context of the calling thread is reset to
      execute the startup code of the new program.</para>

      <para>Here is the prototype of <function>execve</function>:</para>
        <funcsynopsis>
          <funcprototype>
            <funcdef>int <function>execve</function></funcdef>
            <paramdef>const char *<parameter>path</parameter></paramdef>
            <paramdef>char *const <parameter>argv</parameter>[]</paramdef>
            <paramdef>char *const <parameter>envp</parameter>[]</paramdef>
          </funcprototype>
        </funcsynopsis>

      <para><parameter>path</parameter> is the filesystem path to the new 
      executable. <parameter>argv</parameter> and <parameter>envp</parameter> 
      are two NULL-terminated string arrays that hold the arguments and
      environment variables of the new program. <function>execve</function> is
      responsible for copying the arrays to the new process stack.</para>
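
      <para>As a minimal userland sketch of the call described above, the
      following C function runs a program in a child process and returns
      its exit status. The helper name <function>run_program</function> is
      hypothetical, and &man.fork.2; and &man.waitpid.2; are only used so
      that the caller survives the <function>execve</function>:</para>
<programlisting>
<![CDATA[#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program at `path' in a child process
 * and return its exit status, or -1 on failure.
 */
int
run_program(const char *path, char *const argv[], char *const envp[])
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: on success execve() does not return. */
		execve(path, argv, envp);
		_exit(127);	/* only reached if execve() failed */
	}
	/* Parent: wait for the child and report how it ended. */
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}]]>
</programlisting>
      <para>Both <parameter>argv</parameter> and <parameter>envp</parameter>
      must end with a NULL pointer, and by convention the first element of
      <parameter>argv</parameter> is the program name.</para>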
    </sect2>

    <sect2 id="execve_path">
      <title>Overview of in-kernel <function>execve</function> code path</title>
      <para>Here is a top-down call diagram of the <function>execve</function>
      implementation in the NetBSD kernel when executing a native 32-bit ELF 
      binary on an i386 machine:</para>

      <itemizedlist>
        <listitem>
	<simpara>
        <filename>src/sys/kern/kern_exec.c</filename>: 
        <function>sys_execve</function>
	</simpara>
        <itemizedlist>
          <listitem>
	  <simpara>
          <filename>src/sys/kern/kern_exec.c</filename>: 
          <function>execve1</function>
	  </simpara>
          <itemizedlist>
            <listitem>
            <simpara>
            <filename>src/sys/kern/kern_exec.c</filename>: 
            <function>check_exec</function>
            </simpara>
            <itemizedlist>
              <listitem>
              <simpara>
              <filename>src/sys/kern/kern_verifiedexec.c</filename>: 
              <function>veriexec_verify</function>
              </simpara>
              </listitem>
              <listitem>
              <simpara>
              <filename>src/sys/kern/exec_conf.c</filename>: 
              <function>*execsw[]->es_makecmds</function> 
              </simpara>
	      <itemizedlist>
	        <listitem>
                <simpara>
                <filename>src/sys/kern/exec_elf32.c</filename>:
                <function>exec_elf_makecmds</function>
                </simpara>
                <itemizedlist>
                  <listitem>
                  <simpara>
                  <filename>src/sys/kern/exec_elf32.c</filename>:
                  <function>exec_check_header</function>
                  </simpara>
                  </listitem>
                  <listitem>
                  <simpara>
                  <filename>src/sys/kern/exec_elf32.c</filename>:
                  <function>exec_read_from</function>
                  </simpara>
                  </listitem>
                  <listitem>
                  <simpara>
                  <filename>src/sys/kern/exec_conf.c</filename>:
                  <function>*execsw[]->u.elf_probe_func</function>
                  </simpara>
                  <itemizedlist>
                    <listitem>
                    <simpara>
                    <filename>src/sys/kern/exec_elf32.c</filename>:
                    <function>netbsd_elf_probe</function>
                    </simpara>
                    </listitem>
                  </itemizedlist>
                  </listitem>
                  <listitem>
                  <simpara>
                  <filename>src/sys/kern/exec_elf32.c</filename>:
                  <function>elf_load_psection</function>
                  </simpara>
                  </listitem>
                  <listitem>
                  <simpara>
                  <filename>src/sys/kern/exec_elf32.c</filename>:
                  <function>elf_load_file</function>
                  </simpara>
                  </listitem>
                  <listitem>
                  <simpara>
                  <filename>src/sys/kern/exec_conf.c</filename>:
                  <function>*execsw[]->es_setup_stack</function>
                  </simpara>
                  <itemizedlist>
                    <listitem>
                    <simpara>
                    <filename>src/sys/kern/exec_subr.c</filename>:
                    <function>exec_setup_stack</function>
                    </simpara>
                    </listitem>
                  </itemizedlist>
                  </listitem>
                </itemizedlist>
	        </listitem>
	      </itemizedlist>
              </listitem>
            </itemizedlist>
          </listitem>
	  <listitem>
          <simpara>
          <function>*fetch_element</function> 
          </simpara>
          <itemizedlist>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/kern_exec.c</filename>:
            <function>execve_fetch_element</function>
            </simpara>
	    </listitem>
          </itemizedlist>
	  </listitem>
	  <listitem>
          <simpara>
          <function>*vcp->ev_proc</function>
          </simpara>
          <itemizedlist>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/exec_subr.c</filename>:
            <function>vmcmd_map_zero</function>
            </simpara>
            </listitem>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/exec_subr.c</filename>:
            <function>vmcmd_map_pagedvn</function>
            </simpara>
            </listitem>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/exec_subr.c</filename>:
            <function>vmcmd_map_readvn</function>
            </simpara>
            </listitem>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/exec_subr.c</filename>:
            <function>vmcmd_readvn</function>
            </simpara>
            </listitem>
	  </itemizedlist>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/exec_conf.c</filename>:
          <function>*execsw[]->es_copyargs</function> 
          </simpara>
          <itemizedlist>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/kern_exec.c</filename>:
            <function>copyargs</function>
            </simpara>
            </listitem>
	  </itemizedlist>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_clock.c</filename>:
          <function>stopprofclock</function>
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_descrip.c</filename>:
          <function>fdcloseexec</function>
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_sig.c</filename>:
          <function>execsigs</function>
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_ras.c</filename>:
          <function>ras_purgeall</function>
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/exec_subr.c</filename>:
          <function>doexechooks</function>
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/sys/event.h</filename>:
          <function>KNOTE</function>
          </simpara>
          <itemizedlist>
            <listitem>
            <simpara>
	    <filename>src/sys/kern/kern_event.c</filename>:
	    <function>knote</function>
            </simpara>
            </listitem>
          </itemizedlist>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/exec_conf.c</filename>:
          <function>*execsw[]->es_setregs</function>
          </simpara>
          <itemizedlist>
            <listitem>
            <simpara>
	    <filename>src/sys/arch/i386/i386/machdep.c</filename>:
            <function>setregs</function>
            </simpara>
            </listitem>
          </itemizedlist>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_exec.c</filename>:
          <function>exec_sigcode_map</function>
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_exec.c</filename>:
          <function>*p->p_emul->e_proc_exit</function> (NULL)
          </simpara>
	  </listitem>
	  <listitem>
          <simpara>
	  <filename>src/sys/kern/kern_exec.c</filename>:
          <function>*p->p_emul->e_proc_exec</function> (NULL)
          </simpara>
	  </listitem>
          </itemizedlist>
        </listitem>
        </itemizedlist>
      </listitem>
      </itemizedlist>

      <para><function>sys_execve</function> calls <function>execve1</function>
      with a pointer to a function called <varname>fetch_element</varname>,
      responsible for loading program arguments and environment variables
      into kernel space.
      The primary reason for this abstraction is to allow fetching
      pointers from a 32-bit process on a 64-bit system.</para>
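
      <para>The idea behind this callback can be sketched in plain C. The
      following is a userland model only: the names
      <function>fetch_fn</function>, <function>fetch_ptr32</function>,
      <function>fetch_ptr64</function> and
      <function>count_vector</function> are hypothetical, the signatures do
      not match the kernel ones, and ordinary memory reads stand in for the
      copy from user space:</para>
<programlisting>
<![CDATA[#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Model only: a fetch callback reads element `index' of a pointer
 * array whose element width depends on the ABI of the caller.
 */
typedef uint64_t (*fetch_fn)(const void *array, size_t index);

/* 64-bit process: each element is 8 bytes wide. */
uint64_t
fetch_ptr64(const void *array, size_t index)
{
	uint64_t v;

	memcpy(&v, (const unsigned char *)array + index * sizeof(v), sizeof(v));
	return v;
}

/* 32-bit process: each element is 4 bytes wide and gets widened. */
uint64_t
fetch_ptr32(const void *array, size_t index)
{
	uint32_t v;

	memcpy(&v, (const unsigned char *)array + index * sizeof(v), sizeof(v));
	return v;
}

/* Width-independent caller: count entries up to the terminating 0. */
size_t
count_vector(const void *array, fetch_fn fetch)
{
	size_t n = 0;

	while (fetch(array, n) != 0)
		n++;
	return n;
}]]>
</programlisting>
      <para>The caller never needs to know the pointer width: only the
      callback passed in changes between a 32-bit and a 64-bit
      process.</para>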

      <para><function>execve1</function> uses a variable of type
      <type>struct exec_package</type> (defined in 
      <filename>src/sys/sys/exec.h</filename>) to share information
      with the functions it calls.</para>

      <para>The <function>es_makecmds</function> method is responsible for
      checking whether the program can be loaded, and for building a set of
      virtual memory commands (vmcmds) that can be used later to set up the
      virtual memory space and to load the program code and data sections.
      The set of vmcmds is stored in the <varname>ep_vmcmds</varname> field
      of the exec package. Because the vmcmds are only recorded at this
      stage and executed later, the execution can still be cancelled before
      a commitment point.</para>
    </sect2>

    <sect2 id="exec_switch">
      <title>Multiple executable format support with the exec switch</title>
      <para>The exec switch is an array of <type>struct execsw</type>
      entries, <varname>execsw[]</varname>, defined in
      <filename>src/sys/kern/exec_conf.c</filename>.
      The <type>struct execsw</type> type itself is defined in 
      <filename>src/sys/sys/exec.h</filename>.</para>

      <para>Each entry in the exec switch is written for a given executable 
      format and a given kernel ABI. It contains test methods to check whether
      a binary fits the format and ABI, and the methods to load it and start
      it up if it does. Several of the methods called in the
      <function>execve</function> code path are found here.</para>
      <table id="table-execsw-fields">
        <title><type>struct execsw</type> fields summary</title>

        <tgroup cols="2">
          <thead>
            <row>
              <entry>Field name</entry>
              <entry>Description</entry>
            </row>
          </thead>

          <tbody>
            <row>
              <entry><varname>es_hdrsz</varname></entry>
              <entry>The size of the executable format header</entry>
            </row>
            <row>
              <entry><function>es_makecmds</function></entry>
              <entry>A method that checks whether the program can be executed
              and, if so, creates the vmcmds required to set up the virtual 
              memory space (this includes loading the executable code and
              data sections).</entry>
            </row>
            <row>
              <entry>
<function>u.elf_probe_func</function>
<function>u.ecoff_probe_func</function>
<function>u.macho_probe_func</function>
              </entry>
              <entry>Executable probe method, used by the
              <function>es_makecmds</function> method
              to check whether the binary can be executed. 
              The <varname>u</varname> field is a union that contains 
              probe methods for the ELF, ECOFF and Mach-O formats.</entry>
            </row>
            <row>
              <entry><varname>es_emul</varname></entry>
              <entry>The <type>struct emul</type> used for handling different
              kernel ABIs. It is covered in detail in 
             <xref linkend="emul_switch"/>.</entry>
            </row>
            <row>
              <entry><varname>es_prio</varname></entry>
              <entry>A priority level for this exec switch entry. This field
              helps determine the order in which exec switch entries are
              tested.</entry>
            </row>
            <row>
              <entry><varname>es_arglen</varname></entry>
              <entry>XXX ?</entry>
            </row>
            <row>
              <entry><function>es_copyargs</function></entry>
              <entry>Method used to copy the new program's arguments and
              environment variables into user space</entry>
            </row>
            <row>
              <entry><function>es_setregs</function></entry>
              <entry>Machine-dependent method used to set up the initial
              process CPU registers</entry>
            </row>
            <row>
              <entry><function>es_coredump</function></entry>
              <entry>Method used to produce a core from the process</entry>
            </row>
            <row>
              <entry><function>es_setup_stack</function></entry>
              <entry>Method called by <function>es_makecmds</function> 
              to produce a set of vmcmds for setting up the new process stack.
              </entry>
            </row>

          </tbody>
        </tgroup>
      </table>

      <para><function>execve1</function> iterates over the exec switch entries,
      using the <varname>es_prio</varname> field for ordering, and calls the
      <function>es_makecmds</function> method of each entry until it gets
      a match.</para>

      <para>The <function>es_makecmds</function> method fills the exec package's
      <varname>ep_vmcmds</varname> field with vmcmds that will be used later
      to set up the virtual memory space of the new process. See 
      <xref linkend="vmcmds"/> for details about the vmcmds.</para>
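
      <para>The lookup over the exec switch can be modelled in a few lines
      of userland C. Everything below is a simplified sketch: the
      <type>struct model_execsw</type> type, the two sample entries and
      <function>model_find_entry</function> are hypothetical, and the table
      is assumed to be already sorted in test order:</para>
<programlisting>
<![CDATA[#include <stddef.h>
#include <string.h>

/* Model only: a cut-down exec switch entry. */
struct model_execsw {
	const char *name;	/* label for this entry */
	int prio;		/* stands in for es_prio */
	/* stands in for es_makecmds: returns 0 on a match */
	int (*makecmds)(const unsigned char *hdr, size_t len);
};

static int
model_elf_makecmds(const unsigned char *hdr, size_t len)
{
	/* ELF files start with the bytes 0x7f 'E' 'L' 'F'. */
	return (len >= 4 && memcmp(hdr, "\177ELF", 4) == 0) ? 0 : -1;
}

static int
model_script_makecmds(const unsigned char *hdr, size_t len)
{
	/* Interpreter scripts start with "#!". */
	return (len >= 2 && memcmp(hdr, "#!", 2) == 0) ? 0 : -1;
}

/* Assumption: entries are already sorted in the order to test them. */
static const struct model_execsw model_switch[] = {
	{ "script", 0, model_script_makecmds },
	{ "elf",    1, model_elf_makecmds },
};

/* Try each entry in turn and return the first one that matches. */
const struct model_execsw *
model_find_entry(const unsigned char *hdr, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(model_switch) / sizeof(model_switch[0]); i++)
		if (model_switch[i].makecmds(hdr, len) == 0)
			return &model_switch[i];
	return NULL;	/* no entry can handle this binary */
}]]>
</programlisting>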

      <sect3 id="format_probe">
        <title>Executable format probe</title>
        <para>The executable format probe is called by the
        <function>es_makecmds</function> method. Its job is simply to check
        whether the executable binary can be handled by this exec switch entry.
        It can check a signature in the binary (e.g. an ELF note section), 
        the name of a dynamic linker embedded in the binary, and so on.</para>

        <para>Some probe functions act as wildcards and are used as a
        last resort, with the help of the <varname>es_prio</varname> field.
        This is the case for the native 32-bit ELF entry, for instance.</para>
      </sect3>

      <sect3 id="vmcmds">
        <title>Virtual memory space setup commands (vmcmds)</title>
        <para>Vmcmds are stored in an array of <type>struct exec_vmcmd</type>
        (defined in <filename>src/sys/sys/exec.h</filename>) in the 
        <varname>ep_vmcmds</varname> field of the exec 
        package, before <function>execve1</function> decides to execute or
        destroy them.</para>

	<para><type>struct exec_vmcmd</type> defines,
        in the <varname>ev_proc</varname> field, a pointer to the
        method that will perform the command. The other fields are 
        used to store the method's arguments.</para>

        <para>Four methods are available in 
        <filename>src/sys/kern/exec_subr.c</filename>:</para>

        <table id="table-vmcmd-methods">
          <title>vmcmd methods</title>
  
          <tgroup cols="2">
            <thead>
              <row>
                <entry>Name</entry>
                <entry>Description</entry>
              </row>
            </thead>
  
            <tbody>
              <row>
                <entry><function>vmcmd_map_pagedvn</function></entry>
                <entry>Map memory from a vnode. Appropriate for handling 
                demand-paged text and data segments.</entry>
              </row>
              <row>
                <entry><function>vmcmd_map_readvn</function></entry>
                <entry>Read memory from a vnode. Appropriate for handling 
                non-demand-paged text/data segments, i.e. impure objects 
                (a la OMAGIC and NMAGIC).</entry>
              </row>
              <row>
                <entry><function>vmcmd_readvn</function></entry>
                <entry>XXX ?</entry>
              </row>
              <row>
                <entry><function>vmcmd_map_zero</function></entry>
                <entry>Maps a region of zero-filled memory</entry>
              </row>
            </tbody>
          </tgroup>
        </table>

      <para>Vmcmds are created using <function>new_vmcmd</function>, 
      and can be destroyed using <function>kill_vmcmd</function>.</para>
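
      <para>The record-now, execute-later pattern behind vmcmds can be
      illustrated with a small userland model. This is a sketch under
      stated assumptions, not the kernel code: all
      <literal>model_</literal> names are hypothetical, and a plain byte
      buffer stands in for the virtual memory space:</para>
<programlisting>
<![CDATA[#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Model only: one deferred command and its arguments. */
struct model_vmcmd {
	int (*ev_proc)(unsigned char *mem, const struct model_vmcmd *cmd);
	size_t ev_addr;			/* destination offset in `mem' */
	size_t ev_len;			/* number of bytes affected */
	const unsigned char *ev_src;	/* source data, if any */
};

/* Model only: a growable set of commands. */
struct model_vmcmd_set {
	struct model_vmcmd *cmds;
	size_t used;
	size_t size;
};

/* Zero-fill a region (plays the role of a "map zero" command). */
int
model_cmd_zero(unsigned char *mem, const struct model_vmcmd *cmd)
{
	memset(mem + cmd->ev_addr, 0, cmd->ev_len);
	return 0;
}

/* Copy bytes in (plays the role of a "read from file" command). */
int
model_cmd_copy(unsigned char *mem, const struct model_vmcmd *cmd)
{
	memcpy(mem + cmd->ev_addr, cmd->ev_src, cmd->ev_len);
	return 0;
}

/* Queue a command; nothing is executed yet. Returns 0 or -1. */
int
model_new_vmcmd(struct model_vmcmd_set *set, struct model_vmcmd cmd)
{
	if (set->used == set->size) {
		size_t nsize = set->size ? set->size * 2 : 4;
		struct model_vmcmd *n = realloc(set->cmds, nsize * sizeof(*n));

		if (n == NULL)
			return -1;
		set->cmds = n;
		set->size = nsize;
	}
	set->cmds[set->used++] = cmd;
	return 0;
}

/* Commit: run every queued command in order. */
int
model_run_vmcmds(const struct model_vmcmd_set *set, unsigned char *mem)
{
	size_t i;

	for (i = 0; i < set->used; i++)
		if (set->cmds[i].ev_proc(mem, &set->cmds[i]) != 0)
			return -1;
	return 0;
}

/* Cancel or clean up: drop the queued commands. */
void
model_kill_vmcmds(struct model_vmcmd_set *set)
{
	free(set->cmds);
	set->cmds = NULL;
	set->used = set->size = 0;
}]]>
</programlisting>
      <para>Until <function>model_run_vmcmds</function> is called the
      buffer is untouched, so dropping the set is enough to cancel the
      whole operation.</para>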

      </sect3>

      <sect3 id="stack">
        <title>Stack virtual memory space setup</title>
        <para>The <function>es_setup_stack</function> field of the exec switch
        holds a pointer to the method in charge of generating the vmcmd
        for setting up the stack space. Filling the stack with arguments and
        environment is done later, by the <function>es_copyargs</function>
        method.</para>

        <para>For native ELF binaries, the 
        <function>netbsd32_elf32_copyargs</function> method
        (obtained by a macro from the <function>elf_copyargs</function> method 
        in <filename>src/sys/kern/exec_elf32.c</filename>) is used. It calls
        <function>copyargs</function> (from 
        <filename>src/sys/kern/kern_exec.c</filename>) for the part of the 
        job which is not specific to ELF.</para>

        <para><function>copyargs</function> has to copy the argument 
        and environment strings back from the kernel copy (in the exec package) 
        to the new process stack in userland. Then
        the arrays of pointers to the strings are reconstructed, and finally,
        the pointers to the arrays, and the argument count, are copied to the
        top of the stack. The new program stack pointer will be set to 
        point to the argument count, followed by the argument array pointer,
        as expected by any ANSI program.</para>

        <para>Dynamic ELF executables are special: they need a structure 
        called the ELF auxiliary table to be copied onto the stack. The
        table is an array of key/value pairs for various things
        such as the ELF header address in user memory, the page size, or
        the entry point of the ELF executable.</para>
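
        <para>A lookup in such a key/value table can be sketched as follows.
        This is a userland model: <type>struct model_auxinfo</type> and
        <function>model_aux_lookup</function> are hypothetical names, and
        the table is assumed to end with a null key, as in the ELF
        ABI:</para>
<programlisting>
<![CDATA[#include <stddef.h>

/* Model only: one key/value pair of the auxiliary table. */
struct model_auxinfo {
	unsigned long a_type;	/* key */
	unsigned long a_v;	/* value */
};

/* Keys used by this sketch, numbered like the usual ELF AT_* keys. */
enum {
	MODEL_AT_NULL   = 0,	/* marks the end of the table */
	MODEL_AT_PAGESZ = 6,	/* page size */
	MODEL_AT_ENTRY  = 9	/* entry point of the executable */
};

/*
 * Look up `type': store its value in *out and return 0,
 * or return -1 if the key is absent.
 */
int
model_aux_lookup(const struct model_auxinfo *aux, unsigned long type,
    unsigned long *out)
{
	for (; aux->a_type != MODEL_AT_NULL; aux++) {
		if (aux->a_type == type) {
			*out = aux->a_v;
			return 0;
		}
	}
	return -1;
}]]>
</programlisting>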

        <para>Note that when starting a dynamic ELF executable, the ELF
        loader (also known as the interpreter: 
        <filename>/usr/libexec/ld.elf_so</filename>) is loaded with the
        executable by the kernel. The ELF loader is started by
        the kernel and is responsible for starting the executable itself
        afterwards.</para>
      </sect3>

      <sect3 id="regs_init">
        <title>Initial register setup</title>
        <para><function>es_setregs</function> is a machine-dependent
        method responsible for setting up the initial 
        process CPU registers. On any machine, the method will 
        have to set the registers holding the instruction pointer, 
        the stack pointer and the machine state. Some ports will need
        more work (for instance, i386 will set up the segment registers
        and the Local Descriptor Table).</para>
        <para>The CPU registers are stored in a <type>struct trapframe</type>,
        available from <type>struct lwp</type>.</para>
        
      </sect3>

      <sect3 id="userland_return">
        <title>Return to userland</title>
        <para>After <function>execve</function> has finished its work,
        the new process is ready to run. It is available in the run
	queue and it will be picked up by the scheduler when 
        appropriate.</para>
        <para>From the scheduler's point of view, starting or resuming
        process execution is the same operation: returning to userland.
        This involves switching to the process virtual memory space 
        and loading the process CPU registers. By loading the machine
        state register with the system bit off, kernel privileges are
        dropped.</para>
        <para>XXX details</para>
      </sect3>

    </sect2>
  </sect1>

  <!-- ================================================================ -->

  <sect1 id="traps_syscalls">
    <title>Traps and system calls</title>

    <para>When the processor encounters an exception (memory fault, division
    by zero, system call instruction...), it executes a trap: control
    is transferred to the kernel, and after some assembly routines in 
    <filename>locore.S</filename>, execution reaches the 
    <function>syscall_plain</function> function
    (from <filename>src/sys/arch/i386/i386/syscall.c</filename> on i386) for
    system calls, or the
    <function>trap</function> function 
    (from <filename>src/sys/arch/i386/i386/trap.c</filename> on i386) for
    other traps.</para>
    <para>There is also a <function>syscall_fancy</function> system call
    handler which is only used when the process is being traced by 
    <command>ktrace</command>.</para>

    <sect2 id="traps">
      <title>Traps</title>
      <para>XXX write me</para>
    </sect2>
    
    <sect2 id="emul_switch">
      <title>Multiple kernel ABI support with the emul switch</title>
        <para><type>struct emul</type> is defined in 
        <filename>src/sys/sys/proc.h</filename>. It defines various methods
        and parameters to handle system calls and traps. Each kernel ABI
        supported by the NetBSD kernel has its own <type>struct emul</type>.
        For instance, the Linux ABI defines <varname>emul_linux</varname> in
        <filename>src/sys/compat/linux/common/linux_exec.c</filename>,
        and the native ABI defines <varname>emul_netbsd</varname> in
        <filename>src/sys/kern/kern_exec.c</filename>.</para>

        <para>The <type>struct emul</type> for the current ABI is obtained
        from the <varname>es_emul</varname> field of the exec switch entry 
        that was selected by <function>execve</function>. The kernel holds a 
        pointer to it in the process' <type>struct proc</type> (defined in
        <filename>src/sys/sys/proc.h</filename>).</para>

       <para>Most importantly, the <type>struct emul</type> defines the
       system call handler function, and the system call table.</para>
    </sect2>

    <sect2 id="syscalls_master">
      <title>The syscalls.master table</title>
      <para>Each kernel ABI has a system call table. The table maps system
      call numbers to functions implementing the system call in the kernel
      (e.g. system call number 2 is <function>fork</function>). The
      convention (for native syscalls) is that the kernel function
      implementing syscall <function>foo</function>
      is called <function>sys_foo</function>. Emulation syscalls have
      their own conventions, like the <literal>linux_sys_</literal> prefix for the Linux emulation.
      The native system call table can be found in 
      <filename>src/sys/kern/syscalls.master</filename>.</para>

      <para>This file is not written in C. After any change, it
      must be processed by the <filename>Makefile</filename> available 
      in the same directory. The processing of <filename>syscalls.master</filename>
      is controlled by the configuration found in 
      <filename>syscalls.conf</filename>, and it outputs several 
      files:</para>
      
      <table id="table-syscalls.master-files">
        <title>Files produced from <filename>syscalls.master</filename></title>

        <tgroup cols="2">
          <thead>
            <row>
              <entry>File name</entry>
              <entry>Description</entry>
            </row>
          </thead>

          <tbody>
            <row>
              <entry><filename>syscallargs.h</filename></entry>
              <entry>Defines the system call argument structures, used
              to pass data from the system call handler function to the
              function implementing the system call.</entry>
            </row>
            <row>
              <entry><filename>syscalls.c</filename></entry>
              <entry>An array of strings containing the names for 
              the system calls</entry>
            </row>
            <row>
              <entry><filename>syscall.h</filename></entry>
              <entry>Preprocessor defines for each system call name and 
              number &mdash; used in libc</entry>
            </row>
            <row>
              <entry><filename>sysent.c</filename></entry>
              <entry>An array containing for each system call an entry with
              the number of arguments, the size of the system call arguments
              structure, and a pointer to the function that implements the
              system call in the kernel</entry>
            </row>
          </tbody>
        </tgroup>
      </table>
      <para>In order to avoid namespace collisions, non-native ABIs have a
      <filename>syscalls.conf</filename> defining output file names prefixed
      by tags (e.g. <literal>linux_</literal> for the Linux ABI).</para>
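
      <para>The role of the table described for
      <filename>sysent.c</filename> above can be modelled in userland C.
      The sketch below is hypothetical throughout: the
      <literal>model_</literal> names, the single sample call and the
      simplified prototype are assumptions made for illustration, not the
      generated kernel code:</para>
<programlisting>
<![CDATA[#include <stddef.h>
#include <errno.h>

/* Model only: every implementation shares one prototype. */
typedef int (*model_sy_call_t)(void *args, long *retval);

/* Model only: one row of the system call table. */
struct model_sysent {
	int sy_narg;			/* number of arguments */
	size_t sy_argsize;		/* size of the argument structure */
	model_sy_call_t sy_call;	/* implementing function */
};

struct model_add_args {
	int a;
	int b;
};

/* Hypothetical "system call" number 1: add two integers. */
static int
model_sys_add(void *v, long *retval)
{
	struct model_add_args *uap = v;

	*retval = uap->a + uap->b;
	return 0;
}

/* Slot 0 is left empty on purpose. */
static const struct model_sysent model_table[] = {
	{ 0, 0, NULL },
	{ 2, sizeof(struct model_add_args), model_sys_add },
};

/* Look the number up and call through the table. */
int
model_syscall(size_t code, void *args, long *retval)
{
	size_t n = sizeof(model_table) / sizeof(model_table[0]);

	if (code >= n || model_table[code].sy_call == NULL)
		return ENOSYS;	/* unknown system call */
	return model_table[code].sy_call(args, retval);
}]]>
</programlisting>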
      
      <para>System call argument structures (syscallarg for short) are 
      always used to pass arguments to functions implementing the system
      calls. Each system call has its own syscallarg structure. This 
      encapsulation layer is here to hide endianness differences.</para>

      <para>All functions implementing system calls have the same prototype:
      </para>
        <funcsynopsis>
          <funcprototype>
            <funcdef>int <function>syscall</function></funcdef>
            <paramdef>struct lwp *<parameter>l</parameter></paramdef>
            <paramdef>void * <parameter>v</parameter></paramdef>
            <paramdef>register_t *<parameter>retval</parameter></paramdef>
          </funcprototype>
        </funcsynopsis>
      <para><parameter>l</parameter> is the <type>struct lwp</type>
      for the calling thread, <parameter>v</parameter> is the
      syscallarg structure pointer, and <parameter>retval</parameter>
      is a pointer to the return value. The function returns the error
      code (see &man.errno.2;) or 0 if there was no error.  Note that
      the prototype is not the same as the <quote>declaration</quote>
      in <filename>syscalls.master</filename>. The declaration in
      <filename>syscalls.master</filename> corresponds to the
      documented prototype for the system call. This is because system
      calls as seen from userland programs have different prototypes,
      but the <function>sys_<replaceable>...</replaceable></function>
      kernel functions implementing them must have the same prototype
      to unify the interface between MD syscall handlers and MI
      syscall implementation. In <filename>syscalls.master</filename>, the
      declaration shows the syscall arguments as seen by
      userland and determines the members of the syscallarg structure,
      which encapsulates the syscall arguments and has one member for
      each one.</para>

      <para>While generating the files listed above, some substitutions
      on the function name are performed: the syscalls tagged as
      <literal>COMPAT_XX</literal> are prefixed by
      <literal>compat_xx_</literal>, and the same goes for the syscallarg
      structure name. The actual kernel functions implementing those syscalls
      therefore have to be defined in a corresponding way. Example: if
      <filename>syscalls.master</filename> has a line
<programlisting>
<![CDATA[97	COMPAT_30	{ int sys_socket(int domain, int type, int protocol); }]]>
</programlisting>
	the actual syscall function will have this prototype:
        <funcsynopsis>
          <funcprototype>
            <funcdef>int <function>compat_30_sys_socket</function></funcdef>
            <paramdef>struct lwp *<parameter>l</parameter></paramdef>
            <paramdef>void * <parameter>v</parameter></paramdef>
            <paramdef>register_t *<parameter>retval</parameter></paramdef>
          </funcprototype>
	</funcsynopsis>
	and <parameter>v</parameter> is a pointer to <type>struct
	compat_30_sys_socket_args</type>, whose declaration is the
	following:
 	<programlisting>struct <structname>compat_30_sys_socket_args</structname> {
        <function>syscallarg</function>(int) <structfield>domain</structfield>;
        <function>syscallarg</function>(int) <structfield>type</structfield>;
        <function>syscallarg</function>(int) <structfield>protocol</structfield>;
};</programlisting>
	Note the correspondence with the documented prototype of the
	&man.socket.2; syscall and the declaration of
	<function>sys_socket</function> in
	<filename>syscalls.master</filename>. The types of syscall
	arguments are wrapped by <function>syscallarg</function>
	macro, which ensures that the structure members will be padded
	to a minimum size, again for unified interface between MD and
	MI code. That's why those members should not be accessed
	directly, but by the <function>SCARG</function> macro, which
	takes a pointer to the syscall arg structure and the argument
	name and extracts the argument's value. See
	<ulink url="#syscall_howto">below</ulink> for an example.
      </para>
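
      <para>The padding and accessor idea can be tried out in userland
      with simplified stand-ins for the two macros. This is a sketch only:
      the <literal>model_</literal> names are hypothetical, the macro
      bodies are deliberately reduced and ignore the endianness handling
      of the real ones, and the function just echoes its arguments instead
      of creating a socket:</para>
<programlisting>
<![CDATA[#include <stddef.h>

typedef long model_register_t;	/* stands in for register_t */
struct model_lwp;		/* opaque stand-in for struct lwp */

/*
 * Simplified stand-ins for syscallarg() and SCARG(): each member is
 * padded to at least the size of a register.
 */
#define model_syscallarg(type)	union { model_register_t pad; type datum; }
#define MODEL_SCARG(p, k)	((p)->k.datum)

struct model_socket_args {
	model_syscallarg(int) domain;
	model_syscallarg(int) type;
	model_syscallarg(int) protocol;
};

/* Same shape as the common prototype: lwp, argument pointer, retval. */
int
model_sys_socket(struct model_lwp *l, void *v, model_register_t *retval)
{
	struct model_socket_args *uap = v;

	(void)l;	/* unused in this sketch */
	/* Encode the three arguments into the return value. */
	*retval = MODEL_SCARG(uap, domain) * 100 +
	    MODEL_SCARG(uap, type) * 10 + MODEL_SCARG(uap, protocol);
	return 0;	/* 0 means "no error" */
}]]>
</programlisting>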

    </sect2>

    <sect2 id="libc_syscall">
      <title>System call implementation in libc</title> 
      <para>The system call implementation in libc is autogenerated
      from the kernel implementation. As an example, let's examine the
      implementation of the &man.access.2; function in libc. It can be
      found in the <filename>access.S</filename> file, which does not
      exist in the sources &mdash; it is autogenerated when libc is
      built. It uses macros defined in
      <filename>src/sys/sys/syscall.h</filename> and
      <filename>src/lib/libc/arch/<replaceable>MACHINE_ARCH</replaceable>/SYS.h</filename>:
      the <filename>syscall.h</filename> file contains defines which
      map the syscall names to syscall numbers. The syscall function
      names are changed by replacing the <literal>sys_</literal>
      prefix with <literal>SYS_</literal>. The
      <filename>syscall.h</filename> header file is also autogenerated
      from <filename>src/sys/kern/syscalls.master</filename> by
      running <command>make init_sysent.c</command> in
      <filename>src/sys/kern</filename>, as described above. By
      including <filename>SYS.h</filename>, we get
      <filename>syscall.h</filename> and the
      <function>RSYSCALL</function> macro, which accepts the syscall
      name, automatically adds the <literal>SYS_</literal> prefix,
      takes the corresponding number, and defines a function of the
      name given whose body is just the execution of the syscall
      itself with the right number.  (The method of execution and of
      transfer of the syscall number and its arguments is machine
      dependent, but this is hidden in the
      <function>RSYSCALL</function> macro.)
      </para>

      <para> To continue the example of &man.access.2;,
      <filename>syscall.h</filename> contains
<programlisting>
<![CDATA[#define SYS_access      33]]>
</programlisting>
      so <literal>RSYSCALL(access)</literal> will result
      in defining the function <function>access</function>, which will
      execute the syscall with number 33. Thus,
      <filename>access.S</filename> needs to contain just:
<programlisting>
<![CDATA[#include "SYS.h"
RSYSCALL(access)]]>
</programlisting>
      To automate this further, it is enough to add the name of this
      file to the <varname>ASM</varname> variable in
      <filename>src/lib/libc/sys/Makefile.inc</filename> and the file will be
      autogenerated with this content when libc is built.</para>

      <para>The above is true for libc functions which correspond exactly
      to the kernel syscalls. This is not always the case, even if the
      functions are found in section 2 of the manuals. For example, the
      &man.wait.2;, &man.wait3.2; and &man.waitpid.2; functions are
      implemented as wrappers of only one syscall, &man.wait4.2;. In
      such a case the procedure above yields the
      <function>wait4</function> function and the wrappers can
      reference it as if it were a normal C function. </para>
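      <para>A wrapper of this kind can be very small. The following
      sketch shows the idea for &man.waitpid.2;. The function is
      named <function>model_waitpid</function> (a hypothetical name)
      so that it does not clash with the real libc function, and it
      is not claimed to be the exact libc source:
<programlisting>
<![CDATA[#define _DEFAULT_SOURCE 1	/* makes wait4() visible on glibc; harmless elsewhere */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: forward to wait4(), which is the function
 * backed by the real system call.
 */
pid_t
model_waitpid(pid_t pid, int *status, int options)
{
	/* A NULL rusage pointer means resource usage is not requested. */
	return wait4(pid, status, options, NULL);
}]]>
</programlisting>
      The wrapper calls <function>wait4</function> like any other C
      function; only <function>wait4</function> itself traps into the
      kernel.</para>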
    </sect2>

    <sect2 id="syscall_howto"><title>How to add a new system
    call</title>
    <para>Let's pretend that the &man.access.2; syscall does not exist
    yet and you want to add it to the kernel. How to proceed?
    <itemizedlist>
    <listitem>
    <para>Add the syscall to the
    <filename>src/sys/kern/syscalls.master</filename> list:
    <programlisting>
<![CDATA[33      STD             { int sys_access(const char *path, int flags); }]]></programlisting>
    </para>
    </listitem>
    <listitem>
      <simpara>
	Run <command>make init_sysent.c</command> under
	<filename>src/sys/kern</filename>. This will update the
	autogenerated files: <filename>syscallargs.h</filename>,
	<filename>syscall.h</filename>,
	<filename>init_sysent.c</filename> and
	<filename>syscalls.c</filename>.
      </simpara>
    </listitem>
    <listitem>
      <para>
	Implement the kernel part of the system call, which will have
	the prototype:
	<funcsynopsis>
	  <funcprototype>
            <funcdef>int <function>sys_access</function></funcdef>
            <paramdef>struct lwp *<parameter>l</parameter></paramdef>
            <paramdef>void * <parameter>v</parameter></paramdef>
            <paramdef>register_t *<parameter>retval</parameter></paramdef>
          </funcprototype>
	</funcsynopsis>
	like all other syscalls.
	To get the syscall arguments, cast
	<parameter>v</parameter> to a pointer to <type>struct
	sys_access_args</type> and use the <function>SCARG</function>
	macro to retrieve them from that structure. For example, to get the
	<parameter>flags</parameter> argument if <varname>uap</varname> is a
	pointer to <type>struct sys_access_args</type> obtained by
	casting <parameter>v</parameter>, use:
	<programlisting>SCARG(uap, flags)</programlisting> The type
	<type>struct sys_access_args</type> and the function
	<function>sys_access</function> are declared in
	<filename>sys/syscallargs.h</filename>, which is autogenerated from
	<filename>src/sys/kern/syscalls.master</filename>. Use
	<programlisting><![CDATA[#include <sys/syscallargs.h>]]></programlisting> 
        to get those declarations.
      </para>
      <simpara>Look in
      <filename>src/sys/kern/vfs_syscalls.c</filename> for the real
      implementation of <function>sys_access</function>. </simpara>
    </listitem>
    <listitem>
      <simpara>
	Run <command>make includes</command> in
	<filename>src/sys/sys</filename>. This will copy the
	autogenerated include files (most importantly,
	<filename>syscall.h</filename>) to
	<filename>usr/include</filename> under
	<varname>DESTDIR</varname>, where the libc build will find them in
	the next steps.
      </simpara>
    </listitem>
    <listitem>
      <simpara>
	Add <literal>access.S</literal> to the
	<varname>ASM</varname> variable in
	<filename>src/lib/libc/sys/Makefile.inc</filename>.
      </simpara>
    </listitem>
    </itemizedlist>
    That is all. To test the new syscall, simply rebuild libc
    (<filename>access.S</filename> will be generated at this point) and
    reboot with a new kernel containing the new syscall. To make the
    new syscall generally useful, its prototype should be added to an
    appropriate header file for use by userspace programs. In
    the case of &man.access.2;, this is <filename>unistd.h</filename>, which is found in
    the NetBSD sources at <filename>src/include/unistd.h</filename>.
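    Once the prototype is visible, a userland check of the new
    syscall can be as small as the following sketch.
    <function>model_can_read</function> is a hypothetical helper
    name used only for this illustration:
<programlisting>
<![CDATA[#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the calling process may read
 * the given path, 0 otherwise (including when the path is missing).
 */
int
model_can_read(const char *path)
{
	return access(path, R_OK) == 0;
}]]>
</programlisting>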
    </para></sect2>

    <sect2 id="syscall_versioning">
      <title>Versioning a system call</title>
      <para>If the system call ABI (or even API) changes, it is
      necessary to implement the old syscall with the original semantics
      to be used by old binaries. The new version of the syscall has a
      different syscall number, while the original one retains the old
      number. This is called versioning.</para>

      <para>The naming conventions associated with versioning are
      complex. If the original system call is called
      <function>foo</function> (and implemented by a
      <function>sys_foo</function> function) and it is changed after the
      <emphasis>x.y</emphasis> release, the new syscall will be named
      <function>__fooxy</function>, with the function implementing it
      being named <function>sys___fooxy</function>. The original syscall
      (left for compatibility) will still be declared as <function>sys_foo</function> in
      <filename>syscalls.master</filename>, but will be tagged as
      <literal>COMPAT_XY</literal>, so the function will be named
      <function>compat_xy_sys_foo</function>. We will call
      <function>sys_foo</function> the original version,
      <function>sys___fooxy</function> the new version and
      <function>compat_xy_sys_foo</function> the compatibility version
      in the procedure described below.</para>

      <para>Now if the syscall is versioned again after version
      <emphasis>z.q</emphasis> has been released, the newest version
      will be called <function>__foozq</function>. The intermediate
      version (formerly the new version) will have to be retained for
      compatibility, so it will be tagged as
      <literal>COMPAT_ZQ</literal>, which will change the function
      name from <function>sys___fooxy</function> to
      <function>compat_zq_sys___fooxy</function>. The oldest version
      <function>compat_xy_sys_foo</function> will be unaffected by the
      second versioning.
      </para>

      <para>How do you change a system call ABI or API and add a
      compatibility version? Let's look at a real example: versioning
      of the &man.socket.2; system call after the error code returned
      for an unsupported address family changed from
      <errorcode>EPROTONOSUPPORT</errorcode> to
      <errorcode>EAFNOSUPPORT</errorcode> between NetBSD 3.0 and 4.0.
      <itemizedlist>
	<listitem>
	  <simpara>Tag the old version
	  (<function>sys_socket</function>) with the right
	  <literal>COMPAT_XY</literal> in
	  <filename>syscalls.master</filename>. In the case of
	  <function>sys_socket</function>, it is
	  <literal>COMPAT_30</literal>, because NetBSD 3.0 was the
	  last version before the system call changed.
	  </simpara>
	</listitem>
	<listitem>
	  <para>Add the new version at the end of
	  <filename>syscalls.master</filename> (this effectively allocates a
	  new syscall number). Name the new version as described
	  above. In our case, it will be <function>sys___socket30</function>:
    <programlisting><![CDATA[394	STD		{ int sys___socket30(int domain, int type, int protocol); }]]></programlisting>
	  </para>
	</listitem>
	<listitem>
	  <simpara>The function implementing the socket syscall now
	  needs to be renamed from <function>sys_socket</function> to
	  <function>sys___socket30</function> to match the change
	  above. Ideally, at this moment the change which requires
	  versioning would be made. (Though in practice it happens
	  that a change is made and only later it is realized that it
	  breaks compatibility and versioning is needed.)
	  </simpara>
	</listitem>
	<listitem>
	  <simpara>Implement the compatibility version and name it
	  <function>compat_xy_sys_...</function> as described above. The implementation belongs
	  under <filename>src/sys/compat</filename> and it shouldn't be a
	  modified copy of the new version, because the copies would
	  eventually diverge. Rather, it should be implemented in terms of
	  the new version, adding the adjustments needed for compatibility
	  (which means that it should behave exactly as the old
	  version did).
	  </simpara>
	  <simpara>In our example, the compatibility version would be
	  named <function>compat_30_sys_socket</function>. It can be found in
	  <filename>src/sys/compat/common/uipc_syscalls_30.c</filename>.
	  </simpara>
	</listitem>
	<listitem>
	  <simpara>Find all references to the old syscall function in the
	  kernel and point them to the compatibility version or to the new
	  version as appropriate. (The kernel would not link
	  otherwise.) For example, many of the compatibility syscalls
	  or the <filename>syscalls.master</filename> tables
	  for various emulations under
	  <filename>src/sys/compat</filename> used to refer to
	  <function>sys_socket</function>. The decision whether the references
	  should be changed to the compatibility version or to the new
	  version depends on the behavior of the OS that we intend to
	  emulate. E.g. FreeBSD uses the old error number, while
	  System V uses the new one.
	  </simpara>
	</listitem>
      </itemizedlist>
      Now the kernel should be compilable and old statically linked
      binaries should work, as should binaries using the old
      libc. Nothing uses the new syscall yet. We have to make a new
      libc, which will contain both the new and the compatibility
      syscall:
      <itemizedlist>
	<listitem>
	  <simpara>In
	  <filename>src/lib/libc/sys/Makefile.inc</filename>, replace
	  the name of the old syscall by the new syscall
	  (<function>__socket30</function> in our example). When libc is
	  rebuilt, it will contain the new function, but no programs use
	  this internal name with underscore, so it is not useful yet. Also,
	  we have lost the old name.</simpara>
	</listitem>
	<listitem>
	  <para>To make newly compiled programs use the new syscall
	  when they refer to the usual name
	  (<function>socket</function> in our example), we add a
	  <literal>__RENAME(newname)</literal> statement after the
	  declaration of the usual name. In the case of
	  <function>socket</function>, the declaration is in
	  <filename>src/sys/sys/socket.h</filename>:
<programlisting>
<![CDATA[int     socket(int, int, int)
#if !defined(__LIBC12_SOURCE__) && !defined(_STANDALONE)
__RENAME(__socket30)
#endif]]>
</programlisting>
	  Now, when a program is recompiled using this header,
	  references to <function>socket</function> will be replaced
	  by <function>__socket30</function>, except for compilation
	  of standalone tools (basically bootloaders), which define
	  <literal>_STANDALONE</literal>, and libc compat code itself,
	  which defines <literal>__LIBC12_SOURCE__</literal>. The
	  <literal>__RENAME</literal> causes the compiler to emit
	  references to the <function>__socket30</function> symbol
	  when <function>socket</function> is used in the source. The
	  symbol will then be resolved by the linker to the new
	  function (implemented by the new system call). Old binaries
	  are unaware of this and continue to reference
	  <function>socket</function>, which should be resolved to the
	  old function (having the same API as before the change). We
	  will re-add the old function in the next step.
	  </para>
	</listitem>
	<listitem>
	  <simpara>To make the old binaries work with the new libc, we
	  must add the old function. We add it under
	  <filename>src/lib/libc/compat/sys</filename>, implementing
	  it using the new function. Note that we did not use the
	  compatibility syscall in the kernel at all, so old programs
	  will work with the new libc, even if the kernel is built
	  without <literal>COMPAT_30</literal>. The compatibility
	  syscall is there only for the old libc, which is used if the
	  shared library was not upgraded, or internally by statically
	  linked programs. </simpara>
	</listitem>
      </itemizedlist>
      We are done. We have covered three cases: old binaries with the
      old libc and a new kernel (including statically linked
      binaries); old binaries with the new libc and a new kernel; and
      new binaries with the new libc and a new kernel.
      </para>
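      <para>The advice to implement the compatibility version in
      terms of the new version can be sketched as a small userland
      model. All function names carry a hypothetical
      <literal>model_</literal> prefix,
      <literal>MODEL_AF_SUPPORTED</literal> is an invented constant,
      and the single <parameter>domain</parameter> argument replaces
      the real kernel calling convention, so this only shows the
      shape of the error translation:
<programlisting>
<![CDATA[#include <errno.h>

/* Hypothetical: the only address family this model accepts. */
#define MODEL_AF_SUPPORTED 2

/* Model of the new version: unknown families yield EAFNOSUPPORT. */
int
model_sys___socket30(int domain)
{
	return (domain == MODEL_AF_SUPPORTED) ? 0 : EAFNOSUPPORT;
}

/*
 * Model of the compatibility version: call the new version, then
 * translate the new error code back to the one old binaries expect.
 */
int
model_compat_30_sys_socket(int domain)
{
	int error = model_sys___socket30(domain);

	if (error == EAFNOSUPPORT)
		error = EPROTONOSUPPORT;
	return error;
}]]>
</programlisting>
      Because the compatibility function only adjusts the result, a
      later fix to the new version automatically benefits old
      binaries as well.</para>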
    </sect2>

    <sect2 id="committing">
      <title>Committing changes to syscall tables</title>
      <para>When committing your work (either a new syscall or a new
      syscall version with the compatibility syscalls), you should
      remember to commit the source
      (<filename>syscalls.master</filename>) for the autogenerated files
      first, and then regenerate and commit the autogenerated
      files. They contain the RCS Id of the source file and this way,
      the RCS Id will refer to the current source version. The assembly
      files generated by
      <filename>src/lib/libc/sys/Makefile.inc</filename> are not kept in
      the repository at all; they are regenerated every time libc is
      built.</para>
    </sect2>

    <sect2 id="to64">
      <title>Managing 32 bit system calls on 64 bit systems</title>
      <para>When executing 32 bit binaries on a 64 bit system, care must be
      taken to only use addresses below 4 GB. This is a problem at 
      process creation, when the stack and heap are allocated, but also for
      each system call, where 32 bit pointers handled by the 32 bit process
      are manipulated by the 64 bit kernel.</para>

      <para>For a kernel built as a 64 bit binary, a 32 bit pointer is
      not something that makes sense: pointers can only be 64 bits long.
      This is why 32 bit pointers are defined as a <type>u_int32_t</type>
      synonym called <type>netbsd32_pointer_t</type>
      (in <filename>src/sys/compat/netbsd32/netbsd32.h</filename>).</para>

      <para>For <function>copyin</function> and <function>copyout</function>,
      true 64 bit pointers are required. They are obtained by casting the
      <type>netbsd32_pointer_t</type> through the 
      <function>NETBSD32PTR64</function> macro.</para>
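      <para>The conversion amounts to widening an unsigned 32 bit
      value to a native pointer. The following userland sketch models
      that step; <type>model32_pointer_t</type> and
      <function>model32_ptr_to_64</function> are hypothetical names,
      and the real macro should be looked up in
      <filename>src/sys/compat/netbsd32</filename>:
<programlisting>
<![CDATA[#include <stdint.h>

/* Hypothetical stand-in for a 32 bit user pointer held as an integer. */
typedef uint32_t model32_pointer_t;

/*
 * Widen the 32 bit value to a native pointer. Going through an
 * unsigned integer type zero-extends the value, so addresses above
 * 2 GB are not sign-extended into the upper half of the 64 bit space.
 */
void *
model32_ptr_to_64(model32_pointer_t p32)
{
	return (void *)(uintptr_t)p32;
}]]>
</programlisting>
      The resulting pointer always lies below 4 GB, which matches the
      address range available to the 32 bit process.</para>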

      <para>Most of the time, implementing a 32 bit system call is just
      a matter of casting pointers and calling the 64 bit version of the system call.
      An example of such a situation can be found in 
      <filename>src/sys/compat/netbsd32/netbsd32_time.c</filename>:
      <function>netbsd32_timer_delete</function>. Provided that the 32 bit
      system call argument structure pointer is called <varname>uap</varname>, 
      and the 64 bit one is called <varname>ua</varname>, then helper macros
      called <function>NETBSD32TO64_UAP</function>, 
      <function>NETBSD32TOP_UAP</function>, 
      <function>NETBSD32TOX_UAP</function>, and
      <function>NETBSD32TOX64_UAP</function> can be used. Sources in
      <filename>src/sys/compat/netbsd32</filename> provide multiple examples.
     </para>
    </sect2>
  </sect1>

  <!-- ================================================================ -->

  <sect1 id="fork">
    <title>Process and thread creation</title>

    <sect2 id="fork_usage">
      <title><function>fork</function>, <function>clone</function>, and 
      <function>pthread_create</function> usage</title>
      <para>XXX write me</para>
    </sect2>

    <sect2 id="fork_path">
      <title>Overview of <function>fork</function> code path</title>
      <para>XXX write me</para>
    </sect2>

    <sect2 id="pthread_create_path">
      <title>Overview of <function>pthread_create</function> code path</title>
      <para>XXX write me</para>
    </sect2>
  </sect1>

  <!-- ================================================================ -->

  <sect1 id="exit">
    <title>Process and thread termination</title>

    <sect2 id="exit_usage">
      <title><function>exit</function> and
      <function>pthread_exit</function> usage</title>
      <para>XXX write me</para>
    </sect2>

    <sect2 id="exit_path">
      <title>Overview of <function>exit</function> code path</title>
      <para>XXX write me</para>
    </sect2>

    <sect2 id="pthread_exit_path">
      <title>Overview of <function>pthread_exit</function> code path</title>
      <para>XXX write me</para>
    </sect2>
  </sect1>

  <!-- ================================================================ -->

  <sect1 id="signal">
    <title>Signal delivery</title>

    <sect2 id="signal_decision">
      <title>Deciding what to do with a signal</title>
      <para>XXX write me</para>
    </sect2>

    <sect2 id="sendsig">
      <title>The <function>sendsig</function> function</title>
      <para>For each kernel ABI, <type>struct emul</type> defines a 
      machine-dependent <function>sendsig</function> function, which 
      is responsible for altering the process user context so that it calls a 
      signal handler.</para>

      <para><function>sendsig</function> builds a stack frame containing
      the CPU registers before the signal handler invocation. The CPU
      registers are altered so that on return to userland, the process
      executes the signal handler and has its stack pointer set to the
      new stack frame.</para>

      <para>If requested at <function>sigaction</function> call time, 
      <function>sendsig</function> will also add a <type>struct siginfo</type>
      to the stack frame.</para> 
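      <para>From userland, the request is made with the
      <literal>SA_SIGINFO</literal> flag. The following sketch
      installs such a handler; the names
      <function>model_on_signal</function>,
      <function>model_install_handler</function> and
      <varname>model_last_signo</varname> are hypothetical:
<programlisting>
<![CDATA[#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Signal number recorded by the handler, for inspection afterwards. */
static volatile sig_atomic_t model_last_signo;

/* Three-argument handler: info points to the siginfo on the signal frame. */
static void
model_on_signal(int signo, siginfo_t *info, void *ctx)
{
	(void)signo;
	(void)ctx;
	model_last_signo = info->si_signo;
}

/* Install the handler and ask for siginfo delivery with SA_SIGINFO. */
int
model_install_handler(int signo)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = model_on_signal;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(signo, &sa, NULL);
}]]>
</programlisting>
      Without <literal>SA_SIGINFO</literal>, the handler would be
      installed through <structfield>sa_handler</structfield> and
      would receive only the signal number.</para>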

      <para>Last but not least, <function>sendsig</function> may copy
      a small piece of assembly code involved in signal cleanup, which is called the
      signal trampoline. This is detailed
      in the next section. Note that modern NetBSD native programs
      do not use that feature anymore: it is only used for older programs
      and for the emulation of other operating systems.</para>
    </sect2>

    <sect2 id="signal_cleanup">
      <title>Cleaning up state after signal handler execution</title>
      <para>Once the signal handler returns, the kernel must destroy the
      signal handler context and restore the previous process state. This
      can be achieved in two ways.</para>
      <para>The first method uses the kernel-provided signal trampoline:
      <function>sendsig</function> has copied the signal trampoline onto
      the stack and has prepared the stack and/or CPU registers so that the
      signal handler returns to the signal trampoline. The job of the
      signal trampoline is to call the <function>sigreturn</function>
      or the <function>setcontext</function> system call, passing a pointer
      to the CPU registers saved on the stack. This restores the CPU registers
      to their values before the signal handler invocation, so that the next time the
      process returns to userland, it resumes execution where it
      stopped.</para>
      <para>The native signal trampoline for i386 is called 
      <function>sigcode</function> and can be found in 
      <filename>src/sys/arch/i386/i386/locore.S</filename>. Each emulated ABI
      has its own signal trampoline, which can be quite close to the native 
      one, except usually for the <function>sigreturn</function> system call
      number.</para>
      <para>The second method is to use a signal trampoline provided by libc.
      This is what modern NetBSD native programs do. At the time the
      <function>sigaction</function> system call is invoked, the libc stub
      passes a pointer to a signal trampoline in libc, which is in charge
      of calling <function>setcontext</function>.
      </para>
      <para>
      <function>sendsig</function> will use that pointer as the return address
      for the signal handler. This method is better than the previous one, 
      because it removes the need for an executable stack page where the
      signal trampoline is stored. The trampoline is now stored in the code
      segment of libc. For instance, for i386, the signal trampoline 
      is named <function>__sigtramp_siginfo_2</function> and can be found in 
      <filename>src/lib/libc/arch/i386/sys/__sigtramp2.S</filename>.</para>
    </sect2>
  </sect1>

  <!-- ================================================================ -->

  <sect1 id="scheduling">
    <title>Thread scheduling</title>
      <sect2 id="overview">
      <title>Overview</title>
      <para>
      The NetBSD thread scheduler is based on a variation of the
      traditional round-robin scheduling algorithm called
      "multi-level feedback queues". By dynamically adjusting a
      thread's priority to reflect its CPU and resource utilization,
      this approach allows the system to be responsive even under
      heavy loads.
      </para>
      <para>
      The scheduler maintains a set of 32 runqueues in descending
      priority from 0 to 31. Each runnable thread is placed on one of
      the runqueues, according to its priority. Each individual runqueue
      is served in round-robin (FIFO) order. Since thread priorities
      can range from 0 to 127, each runqueue may hold threads with up to
      four different priorities.
      </para>
      <para>
      Each thread is allowed to run on the CPU for a fixed amount of
      time, its time-slice or quantum. Once the thread has used up its
      time-slice, it is placed at the back of its runqueue. When the
      scheduler searches for a new thread to run on the CPU, the first
      thread of the highest priority, non-empty runqueue is
      selected. In order to speed up the process of selecting a new
      thread, a bitmask of non-empty runqueues is maintained.
      </para>
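      <para>
      The mapping from priorities to runqueues and the bitmask lookup
      can be sketched as a small userland model. All
      <literal>model_</literal> and <literal>MODEL_</literal> names
      are hypothetical, and the sketch assumes that a lower queue
      number means a higher priority, as the description above
      suggests; it is not the kernel code:
<programlisting>
<![CDATA[#include <stdint.h>

#define MODEL_NQS	32				/* number of runqueues */
#define MODEL_NPRIO	128				/* priorities 0..127 */
#define MODEL_PPQ	(MODEL_NPRIO / MODEL_NQS)	/* 4 priorities per queue */

/* Bit q is set while runqueue q holds at least one runnable thread. */
static uint32_t model_whichqs;

/* Map a thread priority to the index of its runqueue. */
int
model_queue_of(int prio)
{
	return prio / MODEL_PPQ;
}

/* Record that a thread of the given priority became runnable. */
void
model_set_nonempty(int prio)
{
	model_whichqs |= (uint32_t)1 << model_queue_of(prio);
}

/* Record that the given runqueue has been drained. */
void
model_set_empty(int queue)
{
	model_whichqs &= ~((uint32_t)1 << queue);
}

/* Lowest-numbered non-empty queue, or -1 if nothing is runnable. */
int
model_pick_queue(void)
{
	for (int q = 0; q < MODEL_NQS; q++)
		if (model_whichqs & ((uint32_t)1 << q))
			return q;
	return -1;
}]]>
</programlisting>
      The bitmask lets the selection skip empty queues without
      walking each queue's thread list.
      </para>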
      <para>
      A thread's priority is dynamically adjusted as it accumulates
      CPU-time. CPU utilization is incremented in
      <function>hardclock</function> each time the system clock ticks
      and the thread is found to be executing. An estimate of a
      thread's recent CPU utilization is stored in
      <varname>p_estcpu</varname>, which is adjusted once per second
      in <function>schedcpu</function> via a digital decay
      filter. Note that <varname>p_estcpu</varname> is a per-process
      value, i.e. all LWPs belonging to a process have the same value
      for <varname>p_estcpu</varname>. Whenever a thread accumulates
      four ticks in its CPU utilization,
      <function>schedclock</function> invokes
      <function>resetpriority</function> to recalculate the process's
      scheduling priority.
      </para>
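      <para>
      The text above does not spell out the decay filter. As an
      assumption, the following sketch uses the classic 4.4BSD form,
      in which the estimate is multiplied once per second by
      2*load / (2*load + 1). The function name
      <function>model_decay_estcpu</function> is hypothetical and the
      load average is simplified to an integer, so the exact
      arithmetic in <function>schedcpu</function> should be checked
      against the source:
<programlisting>
<![CDATA[/*
 * Model of one decay step, applied once per second.
 * estcpu: current CPU usage estimate
 * loadav: load average, rounded to an integer for this model
 */
unsigned int
model_decay_estcpu(unsigned int estcpu, unsigned int loadav)
{
	unsigned int loadfac = 2 * loadav;

	/* Higher load gives a factor closer to 1, so usage decays more slowly. */
	return (loadfac * estcpu) / (loadfac + 1);
}]]>
</programlisting>
      With this form, an idle system forgets past CPU usage quickly,
      while a loaded system remembers it longer.
      </para>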
      </sect2>
      <sect2 id="references">
      <title>References</title>
      <para>The scheduler subsystem is implemented within the file
      <filename>src/sys/kern/kern_synch.c</filename>. Additional
      information can be found in &man.scheduler.9;.
      </para>
      </sect2>
  </sect1>

  <!-- ================================================================ -->
</chapter>
