Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) .. _addsyscalls:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Adding a New System Call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) This document describes what's involved in adding a new system call to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) Linux kernel, over and above the normal submission advice in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) System Call Alternatives
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) The first thing to consider when adding a new system call is whether one of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) the alternatives might be suitable instead.  Although system calls are the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) most traditional and most obvious interaction points between userspace and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) kernel, there are other possibilities -- choose what fits best for your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21)  - If the operations involved can be made to look like a filesystem-like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22)    object, it may make more sense to create a new filesystem or device.  This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23)    also makes it easier to encapsulate the new functionality in a kernel module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24)    rather than requiring it to be built into the main kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26)      - If the new functionality involves operations where the kernel notifies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27)        userspace that something has happened, then returning a new file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28)        descriptor for the relevant object allows userspace to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29)        ``poll``/``select``/``epoll`` to receive that notification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30)      - However, operations that don't map to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31)        :manpage:`read(2)`/:manpage:`write(2)`-like operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32)        have to be implemented as :manpage:`ioctl(2)` requests, which can lead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33)        to a somewhat opaque API.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35)  - If you're just exposing runtime system information, a new node in sysfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36)    (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37)    be more appropriate.  However, access to these mechanisms requires that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38)    relevant filesystem is mounted, which might not always be the case (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39)    in a namespaced/sandboxed/chrooted environment).  Avoid adding any API to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40)    debugfs, as this is not considered a 'production' interface to userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41)  - If the operation is specific to a particular file or file descriptor, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42)    an additional :manpage:`fcntl(2)` command option may be more appropriate.  However,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43)    :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44)    this option is best for when the new function is closely analogous to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45)    existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46)    (for example, getting/setting a simple flag related to a file descriptor).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47)  - If the operation is specific to a particular task or process, then an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48)    additional :manpage:`prctl(2)` command option may be more appropriate.  As
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49)    with :manpage:`fcntl(2)`, this system call is a complicated multiplexer, so it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50)    is best reserved for near-analogs of existing ``prctl()`` commands or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51)    getting/setting a simple flag related to a process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) Designing the API: Planning for Extension
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) -----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) A new system call forms part of the API of the kernel, and has to be supported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) indefinitely.  As such, it's a very good idea to explicitly discuss the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) interface on the kernel mailing list, and it's important to plan for future
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) extensions of the interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) (The syscall table is littered with historical examples where this wasn't done,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) together with the corresponding follow-up system calls --
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) ``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) ``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) learn from the history of the kernel and plan for extensions from the start.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) For simpler system calls that only take a couple of arguments, the preferred
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) way to allow for future extensibility is to include a flags argument to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) system call.  To make sure that userspace programs can safely use flags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) between kernel versions, check whether the flags value holds any unknown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) flags, and reject the system call (with ``EINVAL``) if it does::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74)     if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)         return -EINVAL;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) (If no flags values are used yet, check that the flags argument is zero.)
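
The flag check above can be wrapped in a small helper. The following is a minimal userspace sketch rather than kernel code; the numeric values of ``THING_FLAG1`` to ``THING_FLAG3`` and the helper name ``thing_check_flags()`` are assumptions made only for illustration.

```c
#include <errno.h>

/* Hypothetical flag values; only the names come from the text above. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/* Return 0 if every set bit is a known flag, -EINVAL otherwise. */
static int thing_check_flags(unsigned int flags)
{
    if (flags & ~THING_VALID_FLAGS)
        return -EINVAL;
    return 0;
}
```

Because unknown bits are rejected today, a later kernel can assign meaning to them without silently changing the behaviour of existing callers.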
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) For more sophisticated system calls that involve a larger number of arguments,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) it's preferred to encapsulate the majority of the arguments into a structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) that is passed in by pointer.  Such a structure can cope with future extension
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) by including a size argument in the structure::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84)     struct xyzzy_params {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85)         u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86)         u32 param_1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87)         u64 param_2;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88)         u64 param_3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89)     };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) As long as any subsequently added field, say ``param_4``, is designed so that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) zero value gives the previous behaviour, then this allows both directions of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) version mismatch:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95)  - To cope with a later userspace program calling an older kernel, the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96)    code should check that any memory beyond the size of the structure that it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97)    expects is zero (effectively checking that ``param_4 == 0``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98)  - To cope with an older userspace program calling a newer kernel, the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99)    code can zero-extend a smaller instance of the structure (effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)    setting ``param_4 = 0``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ``kernel/events/core.c``) for an example of this approach.
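
Both directions of version mismatch can be handled by one copy routine. The sketch below is a userspace simulation under stated assumptions: ``xyzzy_copy_params()`` is a hypothetical name, ``src`` stands in for a userspace buffer, and real kernel code would have to use the user-access helpers (for example ``copy_from_user()``) instead of ``memcpy()``. The ``-E2BIG`` return for non-zero trailing bytes follows the convention used by ``perf_copy_attr()``.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The kernel's current view of the structure, as in the example above. */
struct xyzzy_params {
    uint32_t size;
    uint32_t param_1;
    uint64_t param_2;
    uint64_t param_3;
};

/* Copy a caller-supplied structure of usize bytes into *dst. */
static int xyzzy_copy_params(struct xyzzy_params *dst,
                             const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t ksize = sizeof(*dst);
    size_t i;

    /* The caller must supply at least the size field itself. */
    if (usize < sizeof(dst->size))
        return -EINVAL;

    /* Newer caller: every byte this kernel does not know must be zero. */
    for (i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    /* Older caller: zero-fill first so that missing fields read as 0. */
    memset(dst, 0, ksize);
    memcpy(dst, src, usize < ksize ? usize : ksize);
    return 0;
}
```

A caller built against a larger structure succeeds as long as the extra fields are zero, and a caller built against a smaller one gets the defaults for the fields it does not know about.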
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) Designing the API: Other Considerations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) ---------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) If your new system call allows userspace to refer to a kernel object, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) should use a file descriptor as the handle for that object -- don't invent a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) new type of userspace object handle when the kernel already has mechanisms and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) well-defined semantics for using file descriptors.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) then the flags argument should include a value that is equivalent to setting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) ``O_CLOEXEC`` on the new FD.  This makes it possible for userspace to close
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) the timing window between ``xyzzy()`` and calling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) ``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) ``execve()`` in another thread could leak a descriptor to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) the exec'ed program. (However, resist the temptation to re-use the actual value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) numbering space of ``O_*`` flags that is fairly full.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) If your system call returns a new file descriptor, you should also consider
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) what it means to use the :manpage:`poll(2)` family of system calls on that file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) descriptor. Making a file descriptor ready for reading or writing is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) normal way for the kernel to indicate to userspace that an event has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) occurred on the corresponding kernel object.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) If your new :manpage:`xyzzy(2)` system call involves a filename argument::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132)     int sys_xyzzy(const char __user *path, ..., unsigned int flags);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)     int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) This allows more flexibility for how userspace specifies the file in question;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) in particular it allows userspace to request the functionality for an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) giving an :manpage:`fxyzzy(3)` operation for free::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)  - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144)  - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) (For more details on the rationale of the \*at() calls, see the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) :manpage:`openat(2)` man page; for an example of ``AT_EMPTY_PATH``, see the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) :manpage:`fstatat(2)` man page.)
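
The ``AT_EMPTY_PATH`` equivalence can be observed with an existing \*at() call. The sketch below assumes a Linux system; the helper name ``empty_path_matches_fstat()`` is hypothetical, and the fallback definition of ``AT_EMPTY_PATH`` is only there for C libraries that hide the constant.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* Linux UAPI value */
#endif

/*
 * Return 1 if fstatat(fd, "", ..., AT_EMPTY_PATH) and fstat(fd, ...)
 * describe the same file, 0 if they differ, -1 on error.
 */
static int empty_path_matches_fstat(const char *path)
{
    struct stat by_at, by_fd;
    int fd, ret = -1;

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;
    if (fstatat(fd, "", &by_at, AT_EMPTY_PATH) == 0 &&
        fstat(fd, &by_fd) == 0)
        ret = (by_at.st_dev == by_fd.st_dev &&
               by_at.st_ino == by_fd.st_ino);
    close(fd);
    return ret;
}
```

One entry point therefore covers the path-based, directory-relative and descriptor-based variants.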
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) supported even on 32-bit architectures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) it needs to be governed by the appropriate Linux capability bit (checked with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) page.  Choose an existing capability bit that governs related functionality,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) but try to avoid combining lots of only vaguely related functions together
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) under the same bit, as this goes against capabilities' purpose of splitting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) the power of root.  In particular, avoid adding new uses of the already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) overly-general ``CAP_SYS_ADMIN`` capability.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) If your new :manpage:`xyzzy(2)` system call manipulates a process other than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) the calling process, it should be restricted (using a call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) ``ptrace_may_access()``) so that only a calling process with the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) permissions as the target process, or with the necessary capabilities, can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) manipulate the target process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) Finally, be aware that some non-x86 architectures have an easier time if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) system call parameters that are explicitly 64-bit fall on odd-numbered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) registers.  (This concern does not apply if the arguments are part of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) structure that's passed in by pointer.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) Proposing the API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) To make new system calls easy to review, it's best to divide up the patchset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) into separate chunks.  These should include at least the following items as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) distinct commits (each of which is described further below):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)  - The core implementation of the system call, together with prototypes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)    generic numbering, Kconfig changes and fallback stub implementation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)  - Wiring up of the new system call for one particular architecture, usually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)    x86 (including all of x86_64, x86_32 and x32).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)  - A demonstration of the use of the new system call in userspace via a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)    selftest in ``tools/testing/selftests/``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)  - A draft man-page for the new system call, either as plain text in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)    cover letter, or as a patch to the (separate) man-pages repository.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) New system call proposals, like any change to the kernel's API, should always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) be cc'ed to linux-api@vger.kernel.org.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) Generic System Call Implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) ----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) The main entry point for your new :manpage:`xyzzy(2)` system call will be called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) ``sys_xyzzy()``, but you add this entry point with the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) ``SYSCALL_DEFINEn()`` macro rather than explicitly.  The 'n' indicates the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) number of arguments to the system call, and the macro takes the system call name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) followed by the (type, name) pairs for the parameters as arguments.  Using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) this macro allows metadata about the new system call to be made available for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) other tools.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) The new entry point also needs a corresponding function prototype, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) ``include/linux/syscalls.h``, marked as ``asmlinkage`` to match the way that system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) calls are invoked::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211)     asmlinkage long sys_xyzzy(...);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) Some architectures (e.g. x86) have their own architecture-specific syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) tables, but several other architectures share a generic syscall table. Add your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) new system call to the generic list by adding an entry to the list in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) ``include/uapi/asm-generic/unistd.h``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218)     #define __NR_xyzzy 292
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)     __SYSCALL(__NR_xyzzy, sys_xyzzy)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) Also update the ``__NR_syscalls`` count to reflect the additional system call, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) note that if multiple new system calls are added in the same merge window,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) your new syscall number may get adjusted to resolve conflicts.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) system call, returning ``-ENOSYS``.  Add your new system call here too::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)     COND_SYSCALL(xyzzy);
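
From userspace, the fallback stub is what a caller sees when the system call is not built in: the call fails with ``ENOSYS``. The following is a minimal probe sketch; the helper name ``syscall_is_unimplemented()`` is an assumption, and it should only be used with numbers for which a call with all-zero arguments is known to be harmless.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Return 1 if the kernel reports system call number nr as
 * unimplemented (ENOSYS), 0 otherwise. All arguments are passed as 0.
 */
static int syscall_is_unimplemented(long nr)
{
    errno = 0;
    return syscall(nr, 0, 0, 0, 0, 0, 0) == -1 && errno == ENOSYS;
}
```

Programs that must run on kernels both with and without the new system call can use such a check to select a fallback path. Note that a sandbox may report a different error for a blocked call.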
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Your new kernel functionality, and the system call that controls it, should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) normally be optional, so add a ``CONFIG`` option (typically to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) ``init/Kconfig``) for it. As usual for new ``CONFIG`` options:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)  - Include a description of the new functionality and system call controlled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235)    by the option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236)  - Make the option depend on ``EXPERT`` if it should be hidden from normal users.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237)  - Make any new source files implementing the function dependent on the ``CONFIG``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)    option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239)  - Double check that the kernel still builds with the new ``CONFIG`` option turned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)    off.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) To summarize, you need a commit that includes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)  - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245)  - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)  - corresponding prototype in ``include/linux/syscalls.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247)  - generic table entry in ``include/uapi/asm-generic/unistd.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)  - fallback stub in ``kernel/sys_ni.c``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) x86 System Call Implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) ------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) To wire up your new system call for x86 platforms, you need to update the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) master syscall tables.  Assuming your new system call isn't special in some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) way (see below), this involves a "common" entry (for x86_64 and x32) in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) ``arch/x86/entry/syscalls/syscall_64.tbl``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259)     333   common   xyzzy     sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263)     380   i386     xyzzy     sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) Again, these numbers are liable to be changed if there are conflicts in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) relevant merge window.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) Compatibility System Calls (Generic)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) For most system calls the same 64-bit implementation can be invoked even when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) the userspace program is itself 32-bit; even if the system call's parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) include an explicit pointer, this is handled transparently.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) However, there are a couple of situations where a compatibility layer is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) needed to cope with size differences between 32-bit and 64-bit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) The first is if the 64-bit kernel also supports 32-bit userspace programs, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) so needs to parse areas of (``__user``) memory that could hold either 32-bit or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) 64-bit values.  In particular, this is needed whenever a system call argument
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284)  - a pointer to a pointer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285)  - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)  - a pointer to a varying sized integral type (``time_t``, ``off_t``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)    ``long``, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288)  - a pointer to a struct containing a varying sized integral type.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) The second situation that requires a compatibility layer is if one of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) system call's arguments has a type that is explicitly 64-bit even on a 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) architecture, for example ``loff_t`` or ``__u64``.  In this case, a value that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) arrives at a 64-bit kernel from a 32-bit application will be split into two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) 32-bit values, which then need to be re-assembled in the compatibility layer.
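
The re-assembly step is a shift and an OR. In this minimal sketch the helper name ``xyzzy_join_halves()`` is hypothetical; which register carries the low half and which the high half depends on the architecture's calling convention, so the parameter order here is an assumption.

```c
#include <stdint.h>

/* Rebuild a 64-bit value from the two 32-bit halves passed by a 32-bit caller. */
static uint64_t xyzzy_join_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

The cast to ``uint64_t`` before the shift matters: shifting a 32-bit value by 32 bits is undefined behaviour in C.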
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) (Note that a system call argument that's a pointer to an explicit 64-bit type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) The compatibility version of the system call is called ``compat_sys_xyzzy()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) ``SYSCALL_DEFINEn()``.  This version of the implementation runs as part of a 64-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) kernel, but expects to receive 32-bit parameter values and does whatever is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) needed to deal with them.  (Typically, the ``compat_sys_`` version converts the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) values to 64-bit versions and either calls on to the ``sys_`` version, or both of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) them call a common inner implementation function.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) The compat entry point also needs a corresponding function prototype, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) ``include/linux/compat.h``, marked as ``asmlinkage`` to match the way that system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) calls are invoked::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312)     asmlinkage long compat_sys_xyzzy(...);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) If the system call involves a structure that is laid out differently on 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) header file should also include a compat version of the structure (``struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) compat_xyzzy_args``) where each variable-size field has the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) ``compat_`` type that corresponds to the type in ``struct xyzzy_args``.  The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) ``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) parse the arguments from a 32-bit invocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) For example, if there are fields::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)     struct xyzzy_args {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)         const char __user *ptr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)         __kernel_long_t varying_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327)         u64 fixed_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)         /* ... */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329)     };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333)     struct compat_xyzzy_args {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334)         compat_uptr_t ptr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)         compat_long_t varying_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336)         u64 fixed_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337)         /* ... */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)     };
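
The conversion that a compat entry point performs on such a structure can be sketched in userspace. Everything below is a simulation under stated assumptions: the ``compat_uptr_t`` and ``compat_long_t`` typedefs are stand-ins for the kernel's types, ``long`` replaces ``__kernel_long_t``, and ``xyzzy_args_from_compat()`` is a hypothetical helper name. Kernel code would first copy the compat structure in from user memory.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's compat types (assumptions). */
typedef uint32_t compat_uptr_t;
typedef int32_t compat_long_t;

struct xyzzy_args {
    const char *ptr;
    long varying_val;
    uint64_t fixed_val;
};

struct compat_xyzzy_args {
    compat_uptr_t ptr;
    compat_long_t varying_val;
    uint64_t fixed_val;
};

/* Widen each field of the 32-bit layout into the native layout. */
static void xyzzy_args_from_compat(struct xyzzy_args *dst,
                                   const struct compat_xyzzy_args *src)
{
    dst->ptr = (const char *)(uintptr_t)src->ptr; /* zero-extended */
    dst->varying_val = src->varying_val;          /* sign-extended */
    dst->fixed_val = src->fixed_val;              /* same width */
}
```

After this step the native implementation can run unchanged on the widened structure, which is the "common inner implementation" arrangement described earlier.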
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) The generic system call list also needs adjusting to allow for the compat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) ``__SC_COMP`` rather than ``__SYSCALL``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)     #define __NR_xyzzy 292
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)     __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) To summarize, you need:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)  - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)  - corresponding prototype in ``include/linux/compat.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351)  - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352)  - instance of ``__SC_COMP`` not ``__SYSCALL`` in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)    ``include/uapi/asm-generic/unistd.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) Compatibility System Calls (x86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) To wire up the x86 architecture of a system call with a compatibility version,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) the entries in the syscall tables need to be adjusted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) column to indicate that a 32-bit userspace program running on a 64-bit kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) should hit the compat entry point::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)     380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) Second, you need to figure out what should happen for the x32 ABI version of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) the new system call.  There's a choice here: the layout of the arguments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) should either match the 64-bit version or the 32-bit version.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) ILP32, so the layout should match the 32-bit version, and the entry in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) ``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) the compatibility wrapper::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377)     333   64       xyzzy     sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378)     ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379)     555   x32      xyzzy     __x32_compat_sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) If no pointers are involved, then it is preferable to re-use the 64-bit system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) call for the x32 ABI (and consequently the entry in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) ``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) In either case, you should check that the types involved in your argument
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) 64-bit (-m64) equivalents.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) System Calls Returning Elsewhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) For most system calls, once the system call is complete the user program
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) continues exactly where it left off -- at the next instruction, with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) stack the same and most of the registers the same as before the system call,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) and with the same virtual memory space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) However, a few system calls do things differently.  They might return to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) different location (``rt_sigreturn``) or change the memory space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) (``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) of the program.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) To allow for this, the kernel implementation of the system call may need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) save and restore additional registers to the kernel stack, allowing complete
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) control of where and how execution continues after the system call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) This is arch-specific, but typically involves defining assembly entry points
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) that save/restore additional registers and invoke the real system call entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) point.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) ``arch/x86/entry/entry_64.S``, and the entry in the syscall table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) (``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415)     333   common   xyzzy     stub_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) The equivalent for 32-bit programs running on a 64-bit kernel is normally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) with the corresponding syscall table adjustment in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) ``arch/x86/entry/syscalls/syscall_32.tbl``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422)     380   i386     xyzzy     sys_xyzzy    stub32_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) If the system call needs a compatibility layer (as in the previous section)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) of the system call rather than the native 64-bit version.  Also, if the x32 ABI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) implementation is not common with the x86_64 version, then its syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) table will also need to invoke a stub that calls on to the ``compat_sys_``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) version.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) For completeness, it's also nice to set up a mapping so that user-mode Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) still works -- its syscall table will reference ``stub_xyzzy``, but the UML build
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because UML
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) simulates registers etc).  Fixing this is as simple as adding a ``#define`` to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) ``arch/x86/um/sys_call_table_64.c``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437)     #define stub_xyzzy sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) Other Details
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) Most of the kernel treats system calls in a generic way, but there is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) occasional exception that may need updating for your particular system call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) The audit subsystem is one such special case; it includes (arch-specific)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) functions that classify some special types of system call -- specifically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) file open (``open``/``openat``), program execution (``execve``/``execveat``) or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) socket multiplexor (``socketcall``) operations. If your new system call is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) analogous to one of these, then the audit system should be updated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) More generally, if there is an existing system call that is analogous to your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) new system call, it's worth doing a kernel-wide grep for the existing system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) call to check there are no other special cases.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) A new system call should obviously be tested; it is also useful to provide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) reviewers with a demonstration of how user space programs will use the system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) call.  A good way to combine these aims is to include a simple self-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) program in a new directory under ``tools/testing/selftests/``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) For a new system call, there will obviously be no libc wrapper function and so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) the test will need to invoke it using ``syscall()``; also, if the system call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) involves a new userspace-visible structure, the corresponding header will need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) to be installed to compile the test.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) Make sure the selftest runs successfully on all supported architectures.  For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) and x32 (-mx32) ABI program.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) For more extensive and thorough testing of new functionality, you should also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) consider adding tests to the Linux Test Project, or to the xfstests project
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) for filesystem-related changes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478)  - https://linux-test-project.github.io/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479)  - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) Man Page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) --------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) All new system calls should come with a complete man page, ideally using groff
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) markup, but plain text will do.  If groff is used, it's helpful to include a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) pre-rendered ASCII version of the man page in the cover email for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) patchset, for the convenience of reviewers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) The man page should be cc'ed to linux-man@vger.kernel.org.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) For more details, see https://www.kernel.org/doc/man-pages/patches.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) Do Not Call System Calls in the Kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) System calls are, as stated above, interaction points between userspace and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) the kernel.  Therefore, system call functions such as ``sys_xyzzy()`` or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) ``compat_sys_xyzzy()`` should only be called from userspace via the syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) table, but not from elsewhere in the kernel.  If the syscall functionality is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) useful within the kernel, needs to be shared between an old and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) new syscall, or needs to be shared between a syscall and its compatibility
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) variant, it should be implemented by means of a "helper" function (such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) ``kern_xyzzy()``).  This kernel function may then be called within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) syscall stub (``sys_xyzzy()``), the compatibility syscall stub
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) (``compat_sys_xyzzy()``), and/or other kernel code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) At least on 64-bit x86, it has been a hard requirement since v4.17 to not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) call system call functions in the kernel.  It uses a different calling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) convention for system calls where ``struct pt_regs`` is decoded on-the-fly in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) syscall wrapper which then hands processing over to the actual syscall function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) This means that only those parameters which are actually needed for a specific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) syscall are passed on during syscall entry, instead of filling in six CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) registers with random user space content all the time (which may cause serious
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) trouble down the call chain).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) Moreover, rules on how data may be accessed may differ between kernel data and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) user data.  This is another reason why calling ``sys_xyzzy()`` is generally a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) bad idea.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) Exceptions to this rule are only allowed in architecture-specific overrides,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) architecture-specific compatibility wrappers, or other code in ``arch/``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) References and Sources
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528)  - LWN article from Michael Kerrisk on use of flags argument in system calls:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529)    https://lwn.net/Articles/585415/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530)  - LWN article from Michael Kerrisk on how to handle unknown flags in a system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531)    call: https://lwn.net/Articles/588444/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532)  - LWN article from Jake Edge describing constraints on 64-bit system call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533)    arguments: https://lwn.net/Articles/311630/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534)  - Pair of LWN articles from David Drysdale that describe the system call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535)    implementation paths in detail for v3.14:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537)     - https://lwn.net/Articles/604287/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538)     - https://lwn.net/Articles/604515/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540)  - Architecture-specific requirements for system calls are discussed in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541)    :manpage:`syscall(2)` man-page:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542)    http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543)  - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544)    https://yarchive.net/comp/linux/ioctl.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545)  - "How to not invent kernel interfaces", Arnd Bergmann,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546)    https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547)  - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548)    https://lwn.net/Articles/486306/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549)  - Recommendation from Andrew Morton that all related information for a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550)    system call should come in the same email thread:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551)    https://lkml.org/lkml/2014/7/24/641
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552)  - Recommendation from Michael Kerrisk that a new system call should come with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553)    a man page: https://lkml.org/lkml/2014/6/13/309
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554)  - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555)    commit: https://lkml.org/lkml/2014/11/19/254
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556)  - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557)    come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558)  - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559)    https://lkml.org/lkml/2014/6/3/411
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560)  - Suggestion from Ingo Molnar that system calls that involve multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561)    arguments should encapsulate those arguments in a struct, which includes a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562)    size field for future extensibility: https://lkml.org/lkml/2015/7/30/117
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563)  - Numbering oddities arising from (re-)use of O_* numbering space flags:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565)     - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566)       check")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567)     - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568)       conflict")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569)     - commit bb458c644a59 ("Safer ABI for O_TMPFILE")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571)  - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572)    https://lkml.org/lkml/2008/12/12/187
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573)  - Recommendation from Greg Kroah-Hartman that unknown flags should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574)    policed: https://lkml.org/lkml/2014/7/17/577
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575)  - Recommendation from Linus Torvalds that x32 system calls should prefer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576)    compatibility with 64-bit versions rather than 32-bit versions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577)    https://lkml.org/lkml/2011/8/31/244