[REPOST] docs: Add struct file refcounting and SCM_RIGHTS mess info

Message ID 20190411142308.179247a4@lwn.net
State New
Series [REPOST] docs: Add struct file refcounting and SCM_RIGHTS mess info

Commit Message

Jonathan Corbet April 11, 2019, 8:23 p.m. UTC
docs: Add struct file refcounting and SCM_RIGHTS mess info

Work up some text posted by Al and add it to the filesystem manual.

Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

---
This is the third sending of this patch.  Al, you *did* ask me to run it
past you before putting it in; do you think you might get a chance to have
a look sometime soon?

 Documentation/filesystems/index.rst      |   1 +
 Documentation/filesystems/lifecycles.rst | 357 +++++++++++++++++++++++
 2 files changed, 358 insertions(+)
 create mode 100644 Documentation/filesystems/lifecycles.rst

-- 
2.20.1

Comments

Al Viro April 11, 2019, 9:30 p.m. UTC | #1
On Thu, Apr 11, 2019 at 02:23:08PM -0600, Jonathan Corbet wrote:
> docs: Add struct file refcounting and SCM_RIGHTS mess info
> 
> Work up some text posted by Al and add it to the filesystem manual.
> 
> Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> This is the third sending of this patch.  Al, you *did* ask me to run it
> past you before putting it in; do you think you might get a chance to have
> a look sometime soon?

Will do, hopefully tonight...
Al Viro April 12, 2019, 9:53 p.m. UTC | #2
On Thu, Apr 11, 2019 at 02:23:08PM -0600, Jonathan Corbet wrote:
> +======================
> +Lifecycles and locking
> +======================
> +
> +This manual aspires to cover the lifecycles of VFS objects and the locking
> +that protects them.
> +
> +Reference counting for file structures
> +======================================
> +
> +(The following text derives from `this email from Al Viro
> +<https://lwn.net/ml/linux-fsdevel/20190207040058.GW2217@ZenIV.linux.org.uk/>`_).
> +
> +The :c:type:`struct file` type represents an open file in the kernel.  Its

That wants a separate section in the beginning, with an overview of notions for
which the word "file" gets misused ;-)  I have something, but it's too much
of a rant at the moment (mostly regarding the unknown linguistic genius
somewhere in the bowels of POSIX committees, responsible for the term
"file description" used along with and in contrast to "file descriptor").

> +lifetime is controlled by a simple reference count (f_count) in that
> +structure.  References are obtained with functions like fget(), fdget(),
> +and fget_raw(); they are returned with fput().

... and fdput().  BTW, are you sure that "returned" is the right verb here?
Would "released" or "dropped" be less overloaded?

The set is actually
	fget{,_raw}()		fput()
	fdget{,_raw}()		fdput()
	fdget_pos()		fdput_pos()
	get_file()		fput()
	get_file_rcu()		fput()		[not for general use]

Not sure if it's worth putting here, but _raw variants are the ones that
accept O_PATH files and fdget_pos() is "with XSI 2.9.7-inflicted exclusion".

> +.. FIXME we should have kerneldoc comments for those functions
> +

> +Descriptor tables
> +-----------------

That wants an overview of WTF those are and what they are for; again, fodder
for the overview section in the beginning...

> +execve() does almost the same thing: if the pre-exec descriptor table is
> +shared, it will switch to a new copy first.  In case of success the
> +reference to the original table is dropped, in case of failure we revert to
> +the original and drop the copy.  Note that handling of close-on-exec is
> +done in the *copy*; the original is unaffected, so failing in execve() does
> +not disrupt the descriptor table.

The last part is actually a bit misleading - close-on-exec is handled only after
the point of no return on execve(), so any execve() failure either happens
before we get to doing that, or the caller gets killed.

What that really gives is that descriptor table sharing is terminated by
execve(), even if it had been across the thread group boundary, and those
who used to share it are unaffected by our close-on-exec (everyone in the
same thread group will be simply killed by execve(); however, other processes
might be sharing the damn thing as well and _they_ will keep going - with
the original table).

> +The SCM_RIGHTS datagram option with Unix-domain sockets can be used to
> +transfer a file descriptor, and its associated struct file reference, to
> +the receiving process.

"transfer X and Y to Z" implies that both X and Y are transferred; in reality
only Y is...   I'm not sure how to put it in one sentence, though ;-/

Perhaps something along the lines of "to pass opened files between the
processes" might be better, with details explained after that?  Those who
understand what SCM_RIGHTS is will be fine either way, but I had a bad
experience explaining what SCM_RIGHTS was to people who'd never looked
into that thing before.  And most of the problems had been precisely
because of confusion re what gets passed around...  The variant I'd
found to work best is along the lines of

################
	Sender has an opened file and wants to give it to the recipient.
Of course, telling the recipient "the file I want you to use is
descriptor 42" is useless - _their_ descriptor 42 has nothing to do
with yours, to start with.  What's needed is
	* getting the opened file reference from sender's descriptor
table into the recipient's
	* telling the recipient where in _their_ descriptor table
it has ended up.
	The way it's done is
1) sender specifies which of its opened files should be passed (by
their locations in sender's descriptor table, i.e. their file descriptors).
2) references to these opened files are attached to a packet and sent
over.
3) the packet is received; unused slots in the recipient's descriptor
table are chosen (same as open(2), pipe(2), etc. would have done),
opened file references from the packet are inserted into those slots
and their numbers (i.e. file descriptors for the recipient to use) are
given to the recipient.  They are more than likely to be different
from whatever descriptors sender used, of course; it's the opened
files they refer to that got passed.
	That requires cooperation from the recipient, of course - if
nothing else, they need to tell recvmsg(2) where to put that
array of descriptors.  Details are more than slightly clumsy;
look around in cmsg(3) for sad boilerplate.
################

I think it's too verbose to use here, though, and I've no idea how to do it
in more concise form...

[snip]

> +Note that, in both cases, the leaks are caused by loops passing through

s/loops/loops in the graph of references/, maybe?

> +some SCM_RIGHTS datagrams that can never be received.  So locating those,
> +removing them from the queues they sit in and then discarding the suckers,
> +is enough to resolve the situation. Furthermore, in both cases the loop
> +passes through the unix_sock of something that got sent over in an
> +SCM_RIGHTS datagram.  So we can do the following:

I wonder if the subsequent discussion of the garbage collector guts (as
opposed to the need to have it + "the strategy is to find a set of undeliverable
SCM_RIGHTS datagrams and kick them out, breaking the unreachable loops")
would be better off living separately...

Patch

diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6..44ff355e0be6 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -16,6 +16,7 @@  algorithms work.
 .. toctree::
    :maxdepth: 2
 
+   lifecycles
    path-lookup.rst
    api-summary
    splice
diff --git a/Documentation/filesystems/lifecycles.rst b/Documentation/filesystems/lifecycles.rst
new file mode 100644
index 000000000000..b30f566cfe0d
--- /dev/null
+++ b/Documentation/filesystems/lifecycles.rst
@@ -0,0 +1,357 @@ 
+======================
+Lifecycles and locking
+======================
+
+This manual aspires to cover the lifecycles of VFS objects and the locking
+that protects them.
+
+Reference counting for file structures
+======================================
+
+(The following text derives from `this email from Al Viro
+<https://lwn.net/ml/linux-fsdevel/20190207040058.GW2217@ZenIV.linux.org.uk/>`_).
+
+The :c:type:`struct file` type represents an open file in the kernel.  Its
+lifetime is controlled by a simple reference count (f_count) in that
+structure.  References are obtained with functions like fget(), fdget(),
+and fget_raw(); they are returned with fput().
+
+.. FIXME we should have kerneldoc comments for those functions
+
+The struct file destructor (__fput() and the filesystem-specific
+->release() function called from it) is called once the counter hits zero.
+Each file descriptor counts as a reference.  Thus, dup() will increment
+the refcount by 1, close() will decrement it, fork() will increment it
+by the number of descriptors in your descriptor table referring to this
+struct file, destruction of the descriptor table on exit() will decrement
+by the same amount, etc.
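+
+For instance, from userspace (an illustrative sketch; the f_count values in
+the comments assume no other references to this struct file exist)::
+
+  #include <fcntl.h>
+  #include <unistd.h>
+
+  int main(void)
+  {
+      int fd  = open("/etc/hostname", O_RDONLY); /* f_count == 1 */
+      int fd2 = dup(fd);                         /* f_count == 2 */
+
+      close(fd);                                 /* f_count == 1 */
+      close(fd2);       /* f_count == 0: __fput()/->release() run */
+      return 0;
+  }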
+
+Syscalls like read() and friends turn descriptors into struct file
+references.  If the descriptor table is shared, that counts as a new
+reference that must be dropped at the end of the syscall; otherwise we are
+guaranteed that the reference in the descriptor table will stay around
+until the end of the syscall, so we may use it without bumping the file
+refcount.  That's the difference between fget() and fdget() - the former
+will bump the refcount, while the latter will try to avoid that.  Of
+course, if we do not intend to drop the reference we'd acquired by the end
+of the syscall, we want fget(); fdget() is for transient references only.
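+
+To make the distinction concrete, here is a minimal sketch in the style of
+kernel code (vfs_do_something() is a made-up helper standing in for
+whatever the syscall does with the file)::
+
+  /* Transient use: the reference only lives until the syscall returns. */
+  long do_something(unsigned int fd)
+  {
+      struct fd f = fdget(fd);   /* may or may not bump f_count */
+      long ret = -EBADF;
+
+      if (f.file) {
+          ret = vfs_do_something(f.file);
+          fdput(f);              /* undoes whatever fdget() did */
+      }
+      return ret;
+  }
+
+  /* Persistent use: the reference will outlive the syscall. */
+  struct file *grab_file(unsigned int fd)
+  {
+      struct file *file = fget(fd);  /* always bumps f_count */
+
+      return file;  /* whoever stores this must eventually fput() it */
+  }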
+
+Descriptor tables
+-----------------
+
+Descriptor tables (:c:type:`struct files_struct`) *can* be shared; several
+processes (usually threads that share address spaces as well, but that's
+not necessary) may be working with the same set of struct files so, for
+example, an open() call in one of them is seen by the others.  The same
+goes for close(), dup(), dup2(), etc.
+
+That makes for an interesting corner case: what if two threads happen to
+share a descriptor table, and one of them closes a file descriptor while
+another is in the middle of a read() call on that same descriptor?  That's
+one area where Unices differ; one variant is to abort the read() call,
+another would have close() wait for the read() call to finish, etc.  What
+we do is:
+
+  * close() succeeds immediately; the reference is removed from
+    the descriptor table and dropped.
+
+  * If the close() call happens before read(fd, ...) has converted the file
+    descriptor to a struct file reference, read() will fail with -EBADF.
+
+  * Otherwise, read() proceeds unmolested.  The reference it has acquired
+    is dropped at the end of the syscall.  If that's the last reference to
+    the file, the file structure will get shut down at that point.
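+
+The third case can be observed from userspace.  In this deliberately crude
+sketch (the sleep() stands in for real synchronization), read() completes
+successfully even though the descriptor it was given has been closed by a
+sibling thread::
+
+  #include <pthread.h>
+  #include <stdio.h>
+  #include <unistd.h>
+
+  static int pfd[2];
+
+  static void *reader(void *arg)
+  {
+      char c;
+      /* Blocks while holding its own struct file reference. */
+      ssize_t n = read(pfd[0], &c, 1);
+      printf("read returned %zd\n", n);  /* 1, not -EBADF */
+      return NULL;
+  }
+
+  int main(void)
+  {
+      pthread_t t;
+
+      pipe(pfd);
+      pthread_create(&t, NULL, reader, NULL);
+      sleep(1);               /* let the reader get into read() */
+      close(pfd[0]);          /* descriptor gone; read() unaffected */
+      write(pfd[1], "x", 1);  /* wake the reader */
+      pthread_join(t, NULL);
+      return 0;
+  }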
+
+A call to clone() will result in the child sharing the parent's descriptor
+table if CLONE_FILES is in the flags.  Note that, in this case, struct file
+refcounts are not modified at all, since no new references to files are
+created.  Without CLONE_FILES, it's the same as fork(): an independent copy
+of the descriptor table is created and populated by copies of references to
+files, each bumping file's refcount.
+
+Calling unshare() with CLONE_FILES in the flags will create a copy of the
+descriptor table (same as done on fork(), etc.) and switch to using it; the
+old reference will be dropped (note: it'll only bother with that if the
+descriptor table used to be shared in the first place; if we hold the only
+reference to the descriptor table, we'll just keep using it).
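+
+From userspace that looks like this (a minimal sketch; error handling
+trimmed to the essentials)::
+
+  #define _GNU_SOURCE
+  #include <sched.h>
+  #include <stdio.h>
+
+  int main(void)
+  {
+      /*
+       * After this the calling thread has a private copy of its
+       * descriptor table; descriptors it closes from here on remain
+       * open in any threads it used to share the table with.
+       */
+      if (unshare(CLONE_FILES) == -1) {
+          perror("unshare");
+          return 1;
+      }
+      return 0;
+  }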
+
+execve() does almost the same thing: if the pre-exec descriptor table is
+shared, it will switch to a new copy first.  In case of success the
+reference to the original table is dropped, in case of failure we revert to
+the original and drop the copy.  Note that handling of close-on-exec is
+done in the *copy*; the original is unaffected, so failing in execve() does
+not disrupt the descriptor table.
+
+exit() will drop the reference to the descriptor table.  When the last
+reference is dropped, all file references are removed from it (and dropped).
+
+The thread's pointer to its descriptor table (current->files) is never
+modified by other threads; something like::
+
+  ls /proc/<pid>/fd 
+
+will fetch it, so stores need to be protected (by task_lock(current)), but
+only the thread itself can do them.
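+
+In kernel terms the store is of this form (a sketch only; see
+unshare_files() and friends for the real thing)::
+
+  task_lock(current);          /* keeps /proc/<pid>/fd readers away */
+  current->files = new_files;  /* only ever done by current itself */
+  task_unlock(current);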
+
+Note that, while extra references to the descriptor table can appear at any
+time (/proc/<pid>/fd accesses, for example), such references may not be
+used for modifications.  In particular, you can't switch to another
+thread's descriptor table, unless it had been yours at some earlier point
+*and* you've kept a reference to it.
+
+That's about it for descriptor tables; that, by far, is the main source of
+persistently held struct file references.  Transient references are grabbed
+by syscalls when they resolve a descriptor to a struct file pointer, which
+ought to be done once per syscall *and* reasonably early in it.
+Unfortunately, that's not all; there are other persistent struct file
+references.
+
+Other persistent references
+---------------------------
+
+A key point so far is that references to file structures are not held
+(directly or indirectly) in other file structures.  If that were
+universally true, life would be simpler, since we would never have to worry
+about reference-count loops.  Unfortunately, there are some more
+complicated cases that the kernel has to worry about.
+
+Some things, such as the case of a LOOP_SET_FD ioctl() call grabbing a
+reference to a file structure and stashing it in the lo_backing_file field
+of a loop_device structure, are reasonably simple.  The struct file
+reference will be dropped later, either directly by a LOOP_CLR_FD operation
+(if nothing else holds the thing open at the time) or later in
+lo_release().
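+
+In outline (a simplification of drivers/block/loop.c, not the literal
+code)::
+
+  /* LOOP_SET_FD: */
+  file = fget(arg);              /* persistent reference */
+  if (!file)
+      return -EBADF;
+  lo->lo_backing_file = file;    /* stashed in the loop_device */
+
+  /* LOOP_CLR_FD / lo_release(): */
+  fput(lo->lo_backing_file);     /* reference finally dropped */
+  lo->lo_backing_file = NULL;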
+
+Note that, in the latter case, things can get a bit more complicated.  A
+process closing /dev/loop might drop the last reference to it, triggering a
+call to bdput() that releases the last reference holding a block device
+open.  That will trigger a call to lo_release(), which will drop the
+reference on the underlying file structure, which is almost certainly the
+last one at that point.  This case is still not a problem; while we do have
+the underlying struct file pinned by something held by another struct file,
+the dependency graph is acyclic, so the plain refcounts we are using work
+fine.
+
+The same goes for things like ecryptfs opening an underlying (encrypted)
+file on open() and dropping it when the last reference to the ecryptfs
+file is dropped; the only difference here is that the underlying
+struct file never appears in anyone's descriptor tables.
+
+However, in a couple of cases we do have something trickier.
+
+File references and SCM_RIGHTS
+------------------------------
+
+The SCM_RIGHTS datagram option with Unix-domain sockets can be used to
+transfer a file descriptor, and its associated struct file reference, to
+the receiving process.  That brings about a couple of situations where
+things can go wrong.
+
+Case 1: an SCM_RIGHTS datagram can be sent to an AF_UNIX socket.  That
+converts the caller-supplied array of descriptors into an array of struct
+file references, which gets attached to the packet we queue.  When the
+datagram is received, the struct file references are moved into the
+descriptor table of the recipient or, in case of error, dropped.  Note that
+sending some descriptors in an SCM_RIGHTS datagram and closing them
+immediately is perfectly legitimate: as soon as sendmsg() returns you can
+go ahead and close the descriptors you've sent.  The references for the
+recipient are already acquired, so you don't need to wait for the packet to
+be received.
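+
+For reference, the sending side looks roughly like this in userspace (a
+sketch; error handling omitted, see cmsg(3) for the full boilerplate)::
+
+  #include <string.h>
+  #include <sys/socket.h>
+  #include <sys/uio.h>
+
+  /* Send descriptor fd over the connected AF_UNIX socket sock. */
+  static int send_fd(int sock, int fd)
+  {
+      char dummy = '*';
+      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
+      union {   /* ensures alignment of the control buffer */
+          struct cmsghdr align;
+          char buf[CMSG_SPACE(sizeof(int))];
+      } u;
+      struct msghdr msg = {
+          .msg_iov = &iov,
+          .msg_iovlen = 1,
+          .msg_control = u.buf,
+          .msg_controllen = sizeof(u.buf),
+      };
+      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+
+      cmsg->cmsg_level = SOL_SOCKET;
+      cmsg->cmsg_type = SCM_RIGHTS;
+      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
+
+      /* Once sendmsg() returns, close(fd) here is already safe. */
+      return sendmsg(sock, &msg, 0);
+  }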
+
+That would still be simple, if not for the fact that there's nothing to
+stop you from passing AF_UNIX sockets themselves around in the same way.
+In fact, that has legitimate uses and, most of the time, doesn't cause any
+complications at all.  However, it is possible to get into a situation where
+the following happens:
+
+  * struct file instances A and B are both AF_UNIX sockets.
+  * The only reference to A is in the SCM_RIGHTS packet that sits in the
+    receiving queue of B.
+  * The only reference to B is in the SCM_RIGHTS packet that sits in the
+    receiving queue of A.
+
+That, of course, is where pure refcounting of any kind will break.
+
+The SCM_RIGHTS datagram that contains the sole reference to A can't be
+received without the recipient getting hold of a reference to B.  That
+cannot happen until somebody manages to receive the SCM_RIGHTS datagram
+containing the sole reference to B.  But that cannot happen until that
+somebody manages to get hold of a reference to A, which cannot happen until
+the first SCM_RIGHTS datagram is received.
+
+Dropping the last reference to A would have discarded everything in its
+receiving queue, including the SCM_RIGHTS datagram that contains the
+reference to B; however, that can't happen either; the other SCM_RIGHTS
+datagram would have to be either received or discarded first, etc.
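+
+To see how easily this situation can be constructed, here is a sketch
+using the send_fd() helper from above (illustrative; it leaks the sockets
+on purpose)::
+
+  #include <sys/socket.h>
+  #include <unistd.h>
+
+  int main(void)
+  {
+      int sv[2];  /* sv[0] is A, sv[1] is B */
+
+      socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
+      send_fd(sv[0], sv[0]);  /* reference to A now queued on B */
+      send_fd(sv[1], sv[1]);  /* reference to B now queued on A */
+      close(sv[0]);           /* last descriptor for A gone... */
+      close(sv[1]);           /* ...and for B; only the loop is left */
+      return 0;
+  }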
+
+Case 2: similar, with a bit of a twist.  An AF_UNIX socket used for
+descriptor passing is normally set up by socket(), followed by connect().
+As soon as connect() returns, one can start sending.  Note that connect()
+does *NOT* wait for the recipient to call accept(); it creates the object
+that will serve as the low-level part of the other end of connection
+(complete with received packet queue) and stashes that object into the
+queue of the *listener's* socket.  A subsequent accept() call fetches it
+from there and attaches it to a new socket, completing the setup; in the
+meanwhile, sending packets works fine.  Once accept() is done, it'll see
+the stuff you'd sent already in the queue of the new socket and everything
+works fine.
+
+If the listening socket gets closed without accept() having been called,
+its queue is flushed, discarding all pending connection attempts, complete
+with *their* queues.  Which is the same effect as accept() + close(), so
+again, normally everything just works.  However, consider the case when we
+have:
+
+  * struct file instances A and B being AF_UNIX sockets.
+  * A is a listener
+  * B is an established connection, with the other end yet to be accepted
+    on A 
+  * The only references to A and B are in an SCM_RIGHTS datagram sent over
+    to A by B.
+
+That SCM_RIGHTS datagram could have been received if somebody had managed
+to call accept() on A and recvmsg() on the socket created by that accept()
+call.  But that can't happen without that somebody getting hold of a
+reference to A in the first place, which can't happen without having
+received that SCM_RIGHTS datagram.  It can't be discarded either, since
+that can't happen without dropping the last reference to A, which sits
+right in it.
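+
+Again, a deliberately leaky userspace sketch (reusing send_fd() from
+above; error handling omitted, and two single-descriptor datagrams are
+used where the text describes one)::
+
+  #include <sys/socket.h>
+  #include <sys/un.h>
+  #include <unistd.h>
+
+  int main(void)
+  {
+      struct sockaddr_un sun = { .sun_family = AF_UNIX,
+                                 .sun_path   = "/tmp/leaky" };
+      int a = socket(AF_UNIX, SOCK_STREAM, 0);  /* will listen */
+      int b = socket(AF_UNIX, SOCK_STREAM, 0);  /* will connect */
+
+      bind(a, (struct sockaddr *)&sun, sizeof(sun));
+      listen(a, 1);
+      connect(b, (struct sockaddr *)&sun, sizeof(sun));
+      /* Nobody ever calls accept(); the embryo sits in A's queue. */
+      send_fd(b, a);  /* reference to A queued on the embryo... */
+      send_fd(b, b);  /* ...and likewise a reference to B */
+      close(a);
+      close(b);       /* the only remaining references form the loop */
+      return 0;
+  }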
+
+The difference from the previous case is that there we had:
+
+  * A holds unix_sock of A
+  * unix_sock of A holds SCM_RIGHTS with reference to B
+  * B holds unix_sock of B
+  * unix_sock of B holds SCM_RIGHTS with reference to A
+
+and here we have:
+
+  * A holds unix_sock of A
+  * unix_sock of A holds the packet with reference to embryonic unix_sock
+    created by connect() 
+  * that embryonic unix_sock holds SCM_RIGHTS with references to A and B.
+
+The dependency graph is different, but the problem is the same; there are
+unreachable loops in it.  Note that neither class of situations
+would occur normally; in the best case it's "somebody had been
+doing rather convoluted descriptor passing, but everyone involved
+got hit with kill -9 at the wrong time; please, make sure nothing
+leaks".  That can happen, but a userland race (e.g. botched protocol
+handling of some sort) or deliberate abuse is much more likely.
+
+Catching the loop creation is hard and paying for that every time we do
+descriptor-passing would be a bad idea.  Besides, the loop per se is not
+fatal; if, for example, in the second case the descriptor for A had been
+kept around, close(accept()) would've cleaned everything up.  Which means
+that we need a garbage collector to deal with the (rare) leaks.
+
+Note that, in both cases, the leaks are caused by loops passing through
+some SCM_RIGHTS datagrams that can never be received.  So locating those,
+removing them from the queues they sit in and then discarding the suckers,
+is enough to resolve the situation. Furthermore, in both cases the loop
+passes through the unix_sock of something that got sent over in an
+SCM_RIGHTS datagram.  So we can do the following:
+
+  1) Keep the count of references to file structures of AF_UNIX sockets
+     held by SCM_RIGHTS; this value is kept in unix_sock->inflight.  Any
+     struct unix_sock instance without such references is not a part of an
+     unreachable loop.  Maintain the set of unix_sock that are not excluded
+     by that (i.e. the ones that have some references from SCM_RIGHTS
+     instances).  Note that we don't need to maintain those counts in
+     struct file; we care only about unix_sock here.
+
+  2) Any struct file of an AF_UNIX socket with some references *NOT* from
+     SCM_RIGHTS datagrams is also not a part of an unreachable loop.
+
+  3) For each unix_sock, consider the following set of SCM_RIGHTS
+     datagrams: everything in the queue of that unix_sock if it's a
+     non-listener, and everything in queues of *all* embryonic unix_sock
+     structs in the queue of a listener.  Let's call those the SCM_RIGHTS
+     associated with our unix_sock.
+
+  4) All SCM_RIGHTS associated with a reachable unix_sock are themselves
+     reachable.
+
+  5) If some references to the struct file of a unix_sock are in reachable
+     SCM_RIGHTS, that struct file is reachable.
+
+The garbage collector starts with calculating the set of potentially
+unreachable unix_socks: the ones not excluded by rules (1) and (2).  No unix_sock
+instances outside of that set need to be considered.
+
+If some unix_sock in that set has a counter that is *not* entirely covered
+by SCM_RIGHTS associated with the elements of the set, we can conclude that
+there are references to it in SCM_RIGHTS associated with something outside
+of our set and therefore it is reachable and can be removed from the set.
+
+If that process converges to a non-empty set, we know that everything left
+in that set is unreachable - all references to their struct file come from
+some SCM_RIGHTS datagrams, and all those SCM_RIGHTS datagrams are among
+those that can't be received or discarded without getting hold of a
+reference to struct file of something in our set.
+
+Everything outside of that set is reachable, so taking the SCM_RIGHTS with
+references to stuff in our set (all of them to be found among those
+associated with elements of our set) out of the queues they are in will
+break all unreachable loops.  Discarding the collected datagrams will do
+the rest - the file references in those will be dropped, etc.
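+
+In pseudocode, the convergence pass looks like this (illustrative only;
+the real logic lives in net/unix/garbage.c and is shaped rather
+differently)::
+
+  /* Everything with SCM_RIGHTS references and nothing else. */
+  candidates = all unix_socks not excluded by rules (1) and (2);
+
+  do {
+      progress = false;
+      for each s in candidates {
+          /* Count references to s in SCM_RIGHTS associated with
+             members of the candidate set. */
+          if (refs_from_candidates(s) < s->inflight) {
+              /* Something reachable also refers to s. */
+              remove s from candidates;
+              progress = true;
+          }
+      }
+  } while (progress);
+
+  /* Whatever is left is unreachable; extract and discard the
+     SCM_RIGHTS datagrams referring to it. */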
+
+One thing to keep in mind here is the locking.  What the garbage
+collector relies upon is:
+
+  * Changes to ->inflight are serialized with respect to it (on
+    unix_gc_lock; increments are done by unix_inflight(), decrements by
+    unix_notinflight()).
+
+  * Any references extracted from SCM_RIGHTS during the garbage collector
+    run will not be actually used until the end of garbage collection.  For
+    a normal recvmsg() call, this behavior is guaranteed by having
+    unix_notinflight() called between the extraction of scm_fp_list from
+    the packet and doing anything else with the references extracted.  For
+    a MSG_PEEK recvmsg() call, it's actually broken and lacks
+    synchronization; Miklos has proposed to grab and release unix_gc_lock
+    in those, between scm_fp_dup() and doing anything else with the
+    references copied.
+
+.. FIXME: The above should be updated when the fix happens.
+
+  * adding SCM_RIGHTS in the middle of garbage collection is possible, but
+    in that case it will contain no references to anything in the initial
+    candidate set.
+
+The last one is delicate.  SCM_RIGHTS creation has unix_inflight() called
+for each reference we put there, so it's serialized with respect to
+unix_gc(); however, insertion into the queue is *NOT* covered by that.
+Queue rescans are covered, but each queue has a lock of its own and they
+are definitely not going to be held throughout the whole thing.
+
+So in theory it would be possible to have:
+
+  * thread A: sendmsg() has SCM_RIGHTS created and populated, complete with
+    file refcount and ->inflight increments implied, at which point it gets
+    preempted and loses the timeslice.
+
+  * thread B: gets to run and removes all references from descriptor table
+    it shares with thread A.
+
+  * on another CPU we have the garbage collector triggered; it determines
+    the set of potentially unreachable unix_sock and everything in our
+    SCM_RIGHTS _is_ in that set, now that no other references remain.
+
+  * on the first CPU, thread A regains the timeslice and inserts its
+    SCM_RIGHTS into queue.  And it does contain references to sockets from
+    the candidate set of running garbage collector, confusing the hell out
+    of it.
+
+That is avoided by a convoluted dance around the SCM_RIGHTS creation
+and insertion - we use fget() to obtain struct file references,
+then _duplicate_ them in SCM_RIGHTS (bumping a refcount for each, so
+we are holding *two* references), do unix_inflight() on them, then
+queue the damn thing, then drop each reference we got from fget().
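+
+In outline (a sketch of the ordering only, not the literal net/unix code;
+queue_datagram() is a stand-in)::
+
+  file = fget(fd);              /* reference #1, the sender's */
+  newf = get_file(file);        /* reference #2, owned by the datagram */
+  unix_inflight(...);           /* account newf; serialized with unix_gc() */
+  queue_datagram();             /* only now does it show up in a queue */
+  fput(file);                   /* reference #1 dropped last */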
+
+That way everything referred to in that SCM_RIGHTS is going to have
+extra struct file references (and thus be excluded from the initial
+candidate set) until after it gets inserted into queue.  In other
+words, if it does appear in a queue between two passes, it's
+guaranteed to contain no references to anything in the initial
+candidate set.