^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) .. _addsyscalls:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Adding a New System Call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) This document describes what's involved in adding a new system call to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) Linux kernel, over and above the normal submission advice in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) System Call Alternatives
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) The first thing to consider when adding a new system call is whether one of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) the alternatives might be suitable instead. Although system calls are the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) most traditional and most obvious interaction points between userspace and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) kernel, there are other possibilities -- choose what fits best for your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) - If the operations involved can be made to look like a filesystem-like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) object, it may make more sense to create a new filesystem or device. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) also makes it easier to encapsulate the new functionality in a kernel module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) rather than requiring it to be built into the main kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) - If the new functionality involves operations where the kernel notifies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) userspace that something has happened, then returning a new file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) descriptor for the relevant object allows userspace to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) ``poll``/``select``/``epoll`` to receive that notification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) - However, operations that don't map to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) :manpage:`read(2)`/:manpage:`write(2)`-like operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) have to be implemented as :manpage:`ioctl(2)` requests, which can lead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) to a somewhat opaque API.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) - If you're just exposing runtime system information, a new node in sysfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) be more appropriate. However, access to these mechanisms requires that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) relevant filesystem is mounted, which might not always be the case (e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) in a namespaced/sandboxed/chrooted environment). Avoid adding any API to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) debugfs, as this is not considered a 'production' interface to userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) - If the operation is specific to a particular file or file descriptor, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) an additional :manpage:`fcntl(2)` command option may be more appropriate. However,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) this option is best for when the new function is closely analogous to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) (for example, getting/setting a simple flag related to a file descriptor).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) - If the operation is specific to a particular task or process, then an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) additional :manpage:`prctl(2)` command option may be more appropriate. As
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) with :manpage:`fcntl(2)`, this system call is a complicated multiplexor, so it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) is best reserved for near-analogs of existing ``prctl()`` commands or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) getting/setting a simple flag related to a process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) Designing the API: Planning for Extension
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) -----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) A new system call forms part of the API of the kernel, and has to be supported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) indefinitely. As such, it's a very good idea to explicitly discuss the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) interface on the kernel mailing list, and it's important to plan for future
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) extensions of the interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) (The syscall table is littered with historical examples where this wasn't done,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) together with the corresponding follow-up system calls --
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) ``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) ``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) learn from the history of the kernel and plan for extensions from the start.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) For simpler system calls that only take a couple of arguments, the preferred
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) way to allow for future extensibility is to include a flags argument to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) system call. To make sure that userspace programs can safely use flags
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) between kernel versions, check whether the flags value holds any unknown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) flags, and reject the system call (with ``EINVAL``) if it does::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)     return -EINVAL;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) (If no flags values are used yet, check that the flags argument is zero.)
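
As a rough userspace model of that check, the sketch below wraps the test in a
helper and exercises it. The flag values and the ``thing_check_flags()`` name
are hypothetical and exist only for illustration; a real system call would
perform the same test at the top of its entry point.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values, for illustration only. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/* Userspace model of the check a system call would make on entry. */
static long thing_check_flags(unsigned int flags)
{
	if (flags & ~THING_VALID_FLAGS)
		return -EINVAL;
	return 0;
}

/* Usage: known flags pass, a bit this "kernel" does not know is rejected. */
static void thing_flags_example(void)
{
	assert(thing_check_flags(THING_FLAG1 | THING_FLAG3) == 0);
	assert(thing_check_flags(1u << 3) == -EINVAL);
}
```

Because unknown bits are rejected rather than ignored, a program built against
a later kernel can probe for a newer flag and fall back cleanly when it gets
``EINVAL``.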
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) For more sophisticated system calls that involve a larger number of arguments,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) it's preferred to encapsulate the majority of the arguments into a structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) that is passed in by pointer. Such a structure can cope with future extension
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) by including a size field in the structure::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) struct xyzzy_params {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)     u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)     u32 param_1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)     u64 param_2;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)     u64 param_3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) As long as any subsequently added field, say ``param_4``, is designed so that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) zero value gives the previous behaviour, this copes with version mismatch in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) both directions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) - To cope with a later userspace program calling an older kernel, the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) code should check that any memory beyond the size of the structure that it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) expects is zero (effectively checking that ``param_4 == 0``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) - To cope with an older userspace program calling a newer kernel, the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) code can zero-extend a smaller instance of the structure (effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) setting ``param_4 = 0``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ``kernel/events/core.c``) for an example of this approach.
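
The two rules above can be modelled in userspace as follows. This is only a
sketch under stated assumptions: ``xyzzy_copy_params()`` is a hypothetical
helper, ``memcpy()`` stands in for ``copy_from_user()``, the ``E2BIG`` error
code is a choice made for this example, and a real implementation would also
reject a ``size`` smaller than the first published version of the structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout known to a newer "kernel": param_4 was appended later. */
struct xyzzy_params {
	uint32_t size;
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
	uint64_t param_4;	/* hypothetical addition; 0 == old behaviour */
};

/*
 * Copy a caller-supplied structure of usize bytes into the ksize-byte
 * layout this "kernel" understands.
 */
static int xyzzy_copy_params(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t common = usize < ksize ? usize : ksize;
	const unsigned char *p = src;
	size_t i;

	/* Newer caller, older kernel: the unknown tail must be all zero. */
	for (i = ksize; i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	/* Older caller, newer kernel: zero-fill the fields it did not supply. */
	memset(dst, 0, ksize);
	memcpy(dst, src, common);
	return 0;
}

/* Usage: an old caller that only knows the first 24 bytes. */
static void xyzzy_copy_example(void)
{
	struct {
		uint32_t size, param_1;
		uint64_t param_2, param_3;
	} old = { .size = 24, .param_1 = 1, .param_2 = 2, .param_3 = 3 };
	struct xyzzy_params k;

	assert(xyzzy_copy_params(&k, sizeof(k), &old, old.size) == 0);
	assert(k.param_3 == 3 && k.param_4 == 0);
}
```

Recent kernels also provide a ``copy_struct_from_user()`` helper that packages
this pattern, which may be preferable to open-coding it.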
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) Designing the API: Other Considerations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) ---------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) If your new system call allows userspace to refer to a kernel object, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) should use a file descriptor as the handle for that object -- don't invent a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) new type of userspace object handle when the kernel already has mechanisms and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) well-defined semantics for using file descriptors.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) then the flags argument should include a value that is equivalent to setting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) ``O_CLOEXEC`` on the new FD. This makes it possible for userspace to close
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) the timing window between ``xyzzy()`` and calling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) ``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) ``execve()`` in another thread could leak a descriptor to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) the exec'ed program. (However, resist the temptation to re-use the actual value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) numbering space of ``O_*`` flags that is fairly full.)
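
From the userspace side, the benefit looks like the sketch below, which uses
the existing :manpage:`eventfd(2)` call and its ``EFD_CLOEXEC`` flag as a
stand-in for the hypothetical ``xyzzy()``. The ``fd_is_cloexec()`` helper is
illustrative only.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	if (fdflags < 0)
		return -1;
	return (fdflags & FD_CLOEXEC) ? 1 : 0;
}

/* Usage: the descriptor never exists without close-on-exec set. */
static void cloexec_example(void)
{
	int fd = eventfd(0, EFD_CLOEXEC);

	assert(fd >= 0);
	assert(fd_is_cloexec(fd) == 1);
	close(fd);
}
```

Since the flag is applied when the descriptor is created, there is no window
in which another thread's ``fork()`` and ``execve()`` could inherit it.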
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) If your system call returns a new file descriptor, you should also consider
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) what it means to use the :manpage:`poll(2)` family of system calls on that file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) descriptor. Making a file descriptor ready for reading or writing is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) normal way for the kernel to indicate to userspace that an event has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) occurred on the corresponding kernel object.
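
As an illustration of that convention from userspace, the sketch below again
borrows :manpage:`eventfd(2)` as a stand-in for a new FD-returning call: the
descriptor is not readable until an event is posted, and becomes readable
afterwards. The ``fd_readable_now()`` helper is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error. */
static int fd_readable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, 0);	/* timeout 0: do not block */

	if (n < 0)
		return -1;
	return n == 1 && (pfd.revents & POLLIN);
}

/* Usage: readiness tracks whether an event is pending on the object. */
static void poll_example(void)
{
	int fd = eventfd(0, EFD_CLOEXEC);
	uint64_t one = 1;

	assert(fd >= 0);
	assert(fd_readable_now(fd) == 0);	/* counter is zero: no event */
	assert(write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one));
	assert(fd_readable_now(fd) == 1);	/* event posted: readable */
	close(fd);
}
```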
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) If your new :manpage:`xyzzy(2)` system call involves a filename argument::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) int sys_xyzzy(const char __user *path, ..., unsigned int flags);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) This allows more flexibility for how userspace specifies the file in question;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) in particular it allows userspace to request the functionality for an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) giving an :manpage:`fxyzzy(3)` operation for free::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) (For more details on the rationale of the \*at() calls, see the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) :manpage:`openat(2)` man page; for an example of ``AT_EMPTY_PATH``, see the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) :manpage:`fstatat(2)` man page.)
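
The equivalences can be observed with an existing call. The sketch below uses
:manpage:`fstatat(2)` in place of the hypothetical ``xyzzyat()`` and checks
that the path form, the ``AT_FDCWD`` form and the ``AT_EMPTY_PATH`` form all
reach the same inode. The helper name is illustrative, and the fallback
definition of ``AT_EMPTY_PATH`` is only there in case the C library headers do
not expose it.

```c
#define _GNU_SOURCE		/* asks glibc to expose AT_EMPTY_PATH */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the Linux UAPI headers */
#endif

/* Returns 1 if all three ways of naming path agree, 0 if not, -1 on error. */
static int same_file_three_ways(const char *path)
{
	struct stat by_path, by_at, by_fd;
	int fd, ok;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ok = stat(path, &by_path) == 0 &&
	     fstatat(AT_FDCWD, path, &by_at, 0) == 0 &&
	     fstatat(fd, "", &by_fd, AT_EMPTY_PATH) == 0 &&
	     by_path.st_dev == by_at.st_dev && by_path.st_ino == by_at.st_ino &&
	     by_path.st_dev == by_fd.st_dev && by_path.st_ino == by_fd.st_ino;
	close(fd);
	return ok;
}

/* Usage: the root directory is reachable all three ways. */
static void at_example(void)
{
	assert(same_file_three_ways("/") == 1);
}
```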
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) supported even on 32-bit architectures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) it needs to be governed by the appropriate Linux capability bit (checked with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) page. Choose an existing capability bit that governs related functionality,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) but try to avoid combining lots of only vaguely related functions together
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) under the same bit, as this goes against capabilities' purpose of splitting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) the power of root. In particular, avoid adding new uses of the already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) overly-general ``CAP_SYS_ADMIN`` capability.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) If your new :manpage:`xyzzy(2)` system call manipulates a process other than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) the calling process, it should be restricted (using a call to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) ``ptrace_may_access()``) so that only a calling process with the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) permissions as the target process, or with the necessary capabilities, can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) manipulate the target process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) Finally, be aware that some non-x86 architectures have an easier time if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) system call parameters that are explicitly 64-bit fall on odd-numbered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) registers. (This concern does not apply if the arguments are part of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) structure that's passed in by pointer.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) Proposing the API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) To make new system calls easy to review, it's best to divide up the patchset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) into separate chunks. These should include at least the following items as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) distinct commits (each of which is described further below):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) - The core implementation of the system call, together with prototypes,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) generic numbering, Kconfig changes and fallback stub implementation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) - Wiring up of the new system call for one particular architecture, usually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) x86 (including all of x86_64, x86_32 and x32).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) - A demonstration of the use of the new system call in userspace via a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) selftest in ``tools/testing/selftests/``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) - A draft man-page for the new system call, either as plain text in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) cover letter, or as a patch to the (separate) man-pages repository.
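
Until the C library gains a wrapper, a selftest typically reaches a new system
call through :manpage:`syscall(2)`. The sketch below shows the shape of such a
wrapper; ``SYS_getpid`` is used purely as a stand-in because ``__NR_xyzzy``
does not exist, and ``raw_getpid()`` is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Thin wrapper of the kind a selftest might carry. A test for a real new
 * call would pass its own __NR_ number and arguments instead.
 */
static long raw_getpid(void)
{
	return syscall(SYS_getpid);
}

/* Usage: the raw invocation agrees with the libc wrapper. */
static void raw_syscall_example(void)
{
	assert(raw_getpid() == (long)getpid());
}
```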
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) New system call proposals, like any change to the kernel's API, should always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) be cc'ed to linux-api@vger.kernel.org.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) Generic System Call Implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) ----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) The main entry point for your new :manpage:`xyzzy(2)` system call will be called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) ``sys_xyzzy()``, but you add this entry point with the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) ``SYSCALL_DEFINEn()`` macro rather than explicitly. The 'n' indicates the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) number of arguments to the system call, and the macro takes the system call name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) followed by the (type, name) pairs for the parameters as arguments. Using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) this macro allows metadata about the new system call to be made available for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) other tools.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) The new entry point also needs a corresponding function prototype, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) ``include/linux/syscalls.h``, marked as ``asmlinkage`` to match the way that system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) calls are invoked::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) asmlinkage long sys_xyzzy(...);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) Some architectures (e.g. x86) have their own architecture-specific syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) tables, but several other architectures share a generic syscall table. Add your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) new system call to the generic list by adding an entry to the list in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) ``include/uapi/asm-generic/unistd.h``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) #define __NR_xyzzy 292
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) __SYSCALL(__NR_xyzzy, sys_xyzzy)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) Also update the ``__NR_syscalls`` count to reflect the additional system call, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) note that if multiple new system calls are added in the same merge window,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) your new syscall number may get adjusted to resolve conflicts.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) system call, returning ``-ENOSYS``. Add your new system call here too::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) COND_SYSCALL(xyzzy);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Your new kernel functionality, and the system call that controls it, should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) normally be optional, so add a ``CONFIG`` option (typically to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) ``init/Kconfig``) for it. As usual for new ``CONFIG`` options:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) - Include a description of the new functionality and system call controlled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) by the option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) - Make the option depend on ``EXPERT`` if it should be hidden from normal users.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) - Make any new source files implementing the function dependent on the ``CONFIG``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) - Double check that the kernel still builds with the new ``CONFIG`` option turned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) off.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) To summarize, you need a commit that includes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) - corresponding prototype in ``include/linux/syscalls.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) - generic table entry in ``include/uapi/asm-generic/unistd.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) - fallback stub in ``kernel/sys_ni.c``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) x86 System Call Implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) ------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) To wire up your new system call for x86 platforms, you need to update the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) master syscall tables. Assuming your new system call isn't special in some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) way (see below), this involves a "common" entry (for x86_64 and x32) in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) ``arch/x86/entry/syscalls/syscall_64.tbl``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) 333 common xyzzy sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) 380 i386 xyzzy sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) Again, these numbers are liable to be changed if there are conflicts in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) relevant merge window.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) Compatibility System Calls (Generic)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) For most system calls the same 64-bit implementation can be invoked even when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) the userspace program is itself 32-bit; even if the system call's parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) include an explicit pointer, this is handled transparently.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) However, there are a couple of situations where a compatibility layer is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) needed to cope with size differences between 32-bit and 64-bit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) The first is if the 64-bit kernel also supports 32-bit userspace programs, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) so needs to parse areas of (``__user``) memory that could hold either 32-bit or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) 64-bit values. In particular, this is needed whenever a system call argument
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) - a pointer to a pointer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) - a pointer to a varying sized integral type (``time_t``, ``off_t``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) ``long``, ...)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) - a pointer to a struct containing a varying sized integral type.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) The second situation that requires a compatibility layer is if one of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) system call's arguments has a type that is explicitly 64-bit even on a 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) architecture, for example ``loff_t`` or ``__u64``. In this case, a value that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) arrives at a 64-bit kernel from a 32-bit application will be split into two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) 32-bit values, which then need to be re-assembled in the compatibility layer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) (Note that a system call argument that's a pointer to an explicit 64-bit type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)
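
The re-assembly step for a split 64-bit argument can be sketched as below.
All names here are hypothetical stand-ins, and which register carries which
half depends on the architecture's calling convention, so the helper simply
takes the halves by name.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 64-bit value from the two 32-bit halves a 32-bit caller passed. */
static uint64_t join_u64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Stand-in for the native implementation taking a loff_t-like offset. */
static int64_t xyzzy_native(int fd, int64_t offset)
{
	(void)fd;
	return offset;
}

/* Stand-in for the compat entry point: join the halves, then call on. */
static int64_t xyzzy_compat(int fd, uint32_t off_lo, uint32_t off_hi)
{
	return xyzzy_native(fd, (int64_t)join_u64(off_lo, off_hi));
}

/* Usage: an offset above 4GiB survives the round trip. */
static void join_example(void)
{
	assert(join_u64(0x89abcdefu, 0x01234567u) == 0x0123456789abcdefULL);
	assert(xyzzy_compat(3, 0, 1) == (int64_t)1 << 32);
}
```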
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) The compatibility version of the system call is called ``compat_sys_xyzzy()``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) ``SYSCALL_DEFINEn()``. This version of the implementation runs as part of a 64-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) kernel, but expects to receive 32-bit parameter values and does whatever is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) needed to deal with them. (Typically, the ``compat_sys_`` version converts the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) values to 64-bit versions and either calls on to the ``sys_`` version, or both of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) them call a common inner implementation function.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) The compat entry point also needs a corresponding function prototype, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) ``include/linux/compat.h``, marked as ``asmlinkage`` to match the way that system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) calls are invoked::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) asmlinkage long compat_sys_xyzzy(...);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) If the system call involves a structure that is laid out differently on 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) header file should also include a compat version of the structure (``struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) compat_xyzzy_args``) where each variable-size field has the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) ``compat_`` type that corresponds to the type in ``struct xyzzy_args``. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) ``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) parse the arguments from a 32-bit invocation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322) For example, if there are fields::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) struct xyzzy_args {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325)     const char __user *ptr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)     __kernel_long_t varying_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327)     u64 fixed_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)     /* ... */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) struct compat_xyzzy_args {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334)     compat_uptr_t ptr;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335)     compat_long_t varying_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336)     u64 fixed_val;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337)     /* ... */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) };
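
The conversion that ``compat_sys_xyzzy()`` would then perform can be modelled
in userspace roughly as follows. This is a sketch only: the ``_model`` types
are hypothetical stand-ins built from fixed-width integers (the kernel's own
``compat_`` types are not available outside the kernel), and the pointer is
carried as a plain 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for compat_uptr_t and compat_long_t, both 32 bits wide. */
typedef uint32_t compat_uptr_model_t;
typedef int32_t compat_long_model_t;

/* Layout as written by a 32-bit caller. */
struct compat_xyzzy_args_model {
	compat_uptr_model_t ptr;
	compat_long_model_t varying_val;
	uint64_t fixed_val;
};

/* Layout as a 64-bit kernel sees it natively. */
struct xyzzy_args_model {
	uint64_t ptr;		/* models const char __user * */
	int64_t varying_val;	/* models __kernel_long_t on 64-bit */
	uint64_t fixed_val;
};

/* Widen each field of the 32-bit layout into the native layout. */
static void xyzzy_args_from_compat(struct xyzzy_args_model *dst,
				   const struct compat_xyzzy_args_model *src)
{
	dst->ptr = src->ptr;			/* zero-extend the pointer */
	dst->varying_val = src->varying_val;	/* sign-extend the long */
	dst->fixed_val = src->fixed_val;	/* same width on both sides */
}

/* Usage: a high 32-bit address and a negative long both widen correctly. */
static void compat_args_example(void)
{
	struct compat_xyzzy_args_model c = {
		.ptr = 0xfffff000u, .varying_val = -1, .fixed_val = 42,
	};
	struct xyzzy_args_model n;

	xyzzy_args_from_compat(&n, &c);
	assert(n.ptr == 0xfffff000ULL);
	assert(n.varying_val == -1);
	assert(n.fixed_val == 42);
}
```

In the kernel, the pointer widening would normally go through ``compat_ptr()``
rather than a bare assignment.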
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) The generic system call list also needs adjusting to allow for the compat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) ``__SC_COMP`` rather than ``__SYSCALL``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) #define __NR_xyzzy 292
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) To summarize, you need:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) - corresponding prototype in ``include/linux/compat.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) - instance of ``__SC_COMP`` not ``__SYSCALL`` in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) ``include/uapi/asm-generic/unistd.h``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354)
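As an illustration of the layout mapping described above, the following
userspace sketch models the two structures and the widening that a compat
entry point would perform. The ``model_`` names and the fixed-width stand-in
types are assumptions made for this example only; real kernel code would use
``compat_uptr_t``, ``compat_long_t`` and ``compat_ptr()``, and would copy the
structure in from userspace first.

```c
#include <stdint.h>

/* Stand-ins for kernel types; assumptions for illustration only. */
typedef uint32_t model_compat_uptr_t;	/* 32-bit user pointer value */
typedef int32_t model_compat_long_t;	/* 32-bit userspace 'long' */

/* Native argument layout, modeled with 64-bit fields as on a 64-bit kernel. */
struct model_xyzzy_args {
	uint64_t ptr;		/* models: const char __user *ptr */
	int64_t varying_val;	/* models: __kernel_long_t varying_val */
	uint64_t fixed_val;
};

/* Layout as seen by a 32-bit program. */
struct model_compat_xyzzy_args {
	model_compat_uptr_t ptr;
	model_compat_long_t varying_val;
	uint64_t fixed_val;
};

/* Widen each field so that common code only ever sees the native layout. */
static struct model_xyzzy_args
model_xyzzy_from_compat(const struct model_compat_xyzzy_args *c)
{
	struct model_xyzzy_args n = {
		.ptr = c->ptr,			/* zero-extended */
		.varying_val = c->varying_val,	/* sign-extended */
		.fixed_val = c->fixed_val,	/* same width in both layouts */
	};
	return n;
}
```

Only the pointer and the ``long`` change width; the ``u64`` field is copied
unchanged, which is why fixed-width types are the easier choice for new
interfaces.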
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) Compatibility System Calls (x86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) To wire up a system call with a compatibility version on the x86 architecture,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) the entries in the syscall tables need to be adjusted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) column to indicate that a 32-bit userspace program running on a 64-bit kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) should hit the compat entry point::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) 380 i386 xyzzy sys_xyzzy __ia32_compat_sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) Second, you need to figure out what should happen for the x32 ABI version of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) the new system call. There's a choice here: the layout of the arguments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) should either match the 64-bit version or the 32-bit version.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) ILP32, so the layout should match the 32-bit version, and the entry in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) ``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) the compatibility wrapper::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) 333 64 xyzzy sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) 555 x32 xyzzy __x32_compat_sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) If no pointers are involved, then it is preferable to re-use the 64-bit system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) call for the x32 ABI (and consequently the entry in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) ``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) In either case, you should check that the types involved in your argument
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) 64-bit (-m64) equivalents.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388)
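One way to perform this check is to compile a small probe for each ABI and
compare the results. The sketch below is an illustration only;
``struct probe_args`` and ``probe_get_layout()`` are hypothetical names. It
uses explicit padding so that the 64-bit field sits at the same offset under
``-m32``, ``-mx32`` and ``-m64``; without the padding, the i386 ABI would
align a 64-bit structure member to 4 bytes while the other two align it to 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument structure built only from fixed-width types. */
struct probe_args {
	uint32_t flags;
	uint32_t pad;	/* explicit padding keeps 'value' at offset 8 */
	uint64_t value;
	uint64_t ptr;	/* user pointer carried as a 64-bit integer */
};

/* Layout facts worth comparing between ABIs. */
struct probe_layout {
	size_t size;
	size_t off_flags;
	size_t off_value;
	size_t off_ptr;
};

/* Report the layout the compiler chose for the ABI being built. */
static struct probe_layout probe_get_layout(void)
{
	struct probe_layout l = {
		.size = sizeof(struct probe_args),
		.off_flags = offsetof(struct probe_args, flags),
		.off_value = offsetof(struct probe_args, value),
		.off_ptr = offsetof(struct probe_args, ptr),
	};
	return l;
}
```

Building the same file once per ABI and printing these fields should give
identical numbers if the layout really is ABI-independent.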
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) System Calls Returning Elsewhere
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) For most system calls, once the system call is complete the user program
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) continues exactly where it left off -- at the next instruction, with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) stack the same and most of the registers the same as before the system call,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) and with the same virtual memory space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) However, a few system calls do things differently. They might return to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) different location (``rt_sigreturn``) or change the memory space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) (``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) of the program.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) To allow for this, the kernel implementation of the system call may need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) save and restore additional registers to the kernel stack, allowing complete
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) control of where and how execution continues after the system call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407) This is arch-specific, but typically involves defining assembly entry points
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) that save/restore additional registers and invoke the real system call entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) point.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) ``arch/x86/entry/entry_64.S``, and the entry in the syscall table
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) (``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) 333 common xyzzy stub_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) The equivalent for 32-bit programs running on a 64-bit kernel is normally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) with the corresponding syscall table adjustment in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) ``arch/x86/entry/syscalls/syscall_32.tbl``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) 380 i386 xyzzy sys_xyzzy stub32_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) If the system call needs a compatibility layer (as in the previous section)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) of the system call rather than the native 64-bit version. Also, if the x32 ABI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) implementation is not common with the x86_64 version, then its syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) table will also need to invoke a stub that calls on to the ``compat_sys_``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) version.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) For completeness, it's also nice to set up a mapping so that user-mode Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) still works -- its syscall table will reference ``stub_xyzzy``, but the UML
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) build doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) UML simulates registers etc.). Fixing this is as simple as adding a ``#define`` to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) ``arch/x86/um/sys_call_table_64.c``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) #define stub_xyzzy sys_xyzzy
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) Other Details
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443) Most of the kernel treats system calls in a generic way, but there is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) occasional exception that may need updating for your particular system call.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) The audit subsystem is one such special case; it includes (arch-specific)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) functions that classify some special types of system call -- specifically
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) file open (``open``/``openat``), program execution (``execve``/``execveat``) or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) socket multiplexor (``socketcall``) operations. If your new system call is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) analogous to one of these, then the audit system should be updated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) More generally, if there is an existing system call that is analogous to your
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) new system call, it's worth doing a kernel-wide grep for the existing system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) call to check there are no other special cases.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) A new system call should obviously be tested; it is also useful to provide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) reviewers with a demonstration of how user space programs will use the system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) call. A good way to combine these aims is to include a simple self-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) program in a new directory under ``tools/testing/selftests/``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) For a new system call, there will obviously be no libc wrapper function and so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) the test will need to invoke it using ``syscall()``; also, if the system call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) involves a new userspace-visible structure, the corresponding header will need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) to be installed to compile the test.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) Make sure the selftest runs successfully on all supported architectures. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) and x32 (-mx32) ABI program.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) For more extensive and thorough testing of new functionality, you should also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) consider adding tests to the Linux Test Project, or to the xfstests project
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) for filesystem-related changes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) - https://linux-test-project.github.io/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480)
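The ``syscall()`` invocation described above might look roughly like the
following sketch. Because ``xyzzy`` is not a real system call, the example
borrows the number of ``getpid`` as a stand-in; that substitution, the
``xyzzy()`` wrapper and ``test_xyzzy_basic()`` are assumptions made purely for
illustration. A real selftest would take ``__NR_xyzzy`` from the installed
kernel headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Assumption for illustration: xyzzy does not exist, so borrow getpid's
 * number.  A real test would rely on __NR_xyzzy from the kernel headers.
 */
#ifndef __NR_xyzzy
#define __NR_xyzzy SYS_getpid
#endif

/* Minimal wrapper, standing in for the libc wrapper that does not exist yet. */
static long xyzzy(void)
{
	return syscall(__NR_xyzzy);
}

/* Returns 0 if the call behaved as expected, nonzero otherwise. */
static int test_xyzzy_basic(void)
{
	long ret;

	errno = 0;
	ret = xyzzy();
	if (ret < 0)
		return errno ? errno : EINVAL;	/* syscall() failed */
	return ret == (long)getpid() ? 0 : EINVAL;
}
```

Keeping the wrapper separate from the checks makes it easy to replace it with
the libc function once one becomes available.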
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) Man Page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) --------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) All new system calls should come with a complete man page, ideally using groff
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) markup, but plain text will do. If groff is used, it's helpful to include a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) pre-rendered ASCII version of the man page in the cover email for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) patchset, for the convenience of reviewers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) The man page should be cc'ed to linux-man@vger.kernel.org.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) For more details, see https://www.kernel.org/doc/man-pages/patches.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) Do Not Call System Calls in the Kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) --------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) System calls are, as stated above, interaction points between userspace and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) the kernel. Therefore, system call functions such as ``sys_xyzzy()`` or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) ``compat_sys_xyzzy()`` should only be called from userspace via the syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) table, but not from elsewhere in the kernel. If the syscall functionality is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) also useful within the kernel, needs to be shared between an old and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) new syscall, or needs to be shared between a syscall and its compatibility
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) variant, it should be implemented by means of a "helper" function (such as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) ``kern_xyzzy()``). This kernel function may then be called within the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) syscall stub (``sys_xyzzy()``), the compatibility syscall stub
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) (``compat_sys_xyzzy()``), and/or other kernel code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) At least on 64-bit x86, it has been a hard requirement since v4.17 not to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) call system call functions in the kernel. That architecture uses a different calling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) convention for system calls where ``struct pt_regs`` is decoded on-the-fly in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) syscall wrapper which then hands processing over to the actual syscall function.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) This means that only those parameters which are actually needed for a specific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) syscall are passed on during syscall entry, instead of filling in six CPU
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) registers with random user space content all the time (which may cause serious
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) trouble down the call chain).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) Moreover, rules on how data may be accessed may differ between kernel data and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) user data. This is another reason why calling ``sys_xyzzy()`` is generally a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) bad idea.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) Exceptions to this rule are only allowed in architecture-specific overrides,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) architecture-specific compatibility wrappers, or other code in arch/.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523)
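The helper-function arrangement described in this section can be sketched as
follows. This is a plain C model rather than kernel code: the ``model_``
functions stand in for the entry points that ``SYSCALL_DEFINEn()`` and
``COMPAT_SYSCALL_DEFINEn()`` would generate, and the argument list and the
behaviour of ``kern_xyzzy()`` are assumptions chosen only to keep the example
small.

```c
#include <errno.h>
#include <stdint.h>

/* Common helper: the only place that implements the operation. */
static long kern_xyzzy(unsigned int flags, long value)
{
	if (flags)		/* no flags are defined in this sketch */
		return -EINVAL;
	return value;
}

/* Model of the native entry point: arguments already have native width. */
static long model_sys_xyzzy(unsigned int flags, long value)
{
	return kern_xyzzy(flags, value);
}

/* Model of the compat entry point: widen 32-bit arguments, share the helper. */
static long model_compat_sys_xyzzy(uint32_t flags, int32_t value)
{
	return kern_xyzzy(flags, (long)value);
}

/* Other in-kernel users call the helper, never the entry points. */
static long model_other_kernel_user(void)
{
	return kern_xyzzy(0, 7);
}
```

Both entry points stay thin, so argument checking and the actual work live in
one place and cannot drift apart between the native and compat paths.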
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525) References and Sources
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) - LWN article from Michael Kerrisk on use of flags argument in system calls:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) https://lwn.net/Articles/585415/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530) - LWN article from Michael Kerrisk on how to handle unknown flags in a system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) call: https://lwn.net/Articles/588444/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532) - LWN article from Jake Edge describing constraints on 64-bit system call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) arguments: https://lwn.net/Articles/311630/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) - Pair of LWN articles from David Drysdale that describe the system call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) implementation paths in detail for v3.14:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) - https://lwn.net/Articles/604287/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538) - https://lwn.net/Articles/604515/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) - Architecture-specific requirements for system calls are discussed in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) :manpage:`syscall(2)` man-page:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544) https://yarchive.net/comp/linux/ioctl.html
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) - "How to not invent kernel interfaces", Arnd Bergmann,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546) https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) https://lwn.net/Articles/486306/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) - Recommendation from Andrew Morton that all related information for a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) system call should come in the same email thread:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551) https://lkml.org/lkml/2014/7/24/641
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) - Recommendation from Michael Kerrisk that a new system call should come with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) a man page: https://lkml.org/lkml/2014/6/13/309
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) commit: https://lkml.org/lkml/2014/11/19/254
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557) come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559) https://lkml.org/lkml/2014/6/3/411
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) - Suggestion from Ingo Molnar that system calls that involve multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) arguments should encapsulate those arguments in a struct, which includes a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) size field for future extensibility: https://lkml.org/lkml/2015/7/30/117
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563) - Numbering oddities arising from (re-)use of O_* numbering space flags:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566) check")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) conflict")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) - commit bb458c644a59 ("Safer ABI for O_TMPFILE")
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571) - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) https://lkml.org/lkml/2008/12/12/187
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) - Recommendation from Greg Kroah-Hartman that unknown flags should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) policed: https://lkml.org/lkml/2014/7/17/577
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575) - Recommendation from Linus Torvalds that x32 system calls should prefer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) compatibility with 64-bit versions rather than 32-bit versions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) https://lkml.org/lkml/2011/8/31/244