Message ID | 20220811152446.281723-1-idryomov@gmail.com |
---|---|
State | New |
Headers | show |
Series | [GIT,PULL] Ceph updates for 5.20-rc1 | expand |
On Thu, Aug 11, 2022 at 8:25 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > [..] Several patches > touch files outside of our normal purview to set the stage for bringing > in Jeff's long awaited ceph+fscrypt series in the near future. All of > them have appropriate acks and sat in linux-next for a while. What? No. I'm looking at the fs/dcache.c change, for example, and don't see the relevant maintainers having acked it at all. The mmdebug.h file similarly seems to not have the actual maintainer acks, and seems just plain stupid (why does it add that new folio warning macro, when the existing folio warning macro VM_WARN_ON_ONCE_FOLIO() is *better*? Those are some very core files, and while the changes seem harmless, they sure don't seem obviously ok. What's the point of warning about bogus folios more than once? That's a debug warning - if it hits even once, that's already "uhhuh, something is bad". Showing the warning more than once is likely just going to cause more problems, not give you more information. And did somebody verify that d_same_name() is still inlined in the place that truly *matters*? Because from my quick test, that patch broke it. Now __d_lookup() does a function call. And I _suspect_ it's all ok, because it turns out that __d_lookup_rcu() is the *really* hot case, and that one has inlined it all manually. But this kind of "we touch some *truly* core functionality, without the acks from the maintainers, and then we *claim* to have relevant acks" is really not even remotely ok. I've pulled this because I suspect that d_same_name() thing is fine, and I think the VM_WARN_ON_FOLIO() addition is completely wrong but not horrendous. But you're on my tentative shit-list just for having claimed to have appropriate acks and having been found wanting. Just for your information: fs/dcache.c is some of the most optimized code in the kernel, and some of the subtlest. That RCU pathname lookup is serious business. You don't make changes to pathname lookup just willy nilly. There's a reason I start looking at individual patches when I see it in the diffstat. And please just get rid of VM_WARN_ON_FOLIO() again, and use the VM_WARN_ON_ONCE_FOLIO() version. Because if you expect to have multiple, then you probably shouldn't have that warning at ALL. Linus
On Thu, Aug 11, 2022 at 10:04 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Aug 11, 2022 at 8:25 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > [..] Several patches > > touch files outside of our normal purview to set the stage for bringing > > in Jeff's long awaited ceph+fscrypt series in the near future. All of > > them have appropriate acks and sat in linux-next for a while. > > What? No. > > I'm looking at the fs/dcache.c change, for example, and don't see the > relevant maintainers having acked it at all. > > The mmdebug.h file similarly seems to not have the actual maintainer > acks, and seems just plain stupid (why does it add that new folio > warning macro, when the existing folio warning macro > VM_WARN_ON_ONCE_FOLIO() is *better*? > > Those are some very core files, and while the changes seem harmless, > they sure don't seem obviously ok. > > What's the point of warning about bogus folios more than once? That's > a debug warning - if it hits even once, that's already "uhhuh, > something is bad". Showing the warning more than once is likely just > going to cause more problems, not give you more information. Xiubo and Jeff used it to track down some issues between netfs library and folio code that have been randomly plaguing our automated tests for a couple of releases. We already knew that there were issues in that area and the actual occurrences mattered. This was done in cooperation with Willy and, since he was involved and this is a no-impact change, I didn't think twice. > > And did somebody verify that d_same_name() is still inlined in the > place that truly *matters*? Because from my quick test, that patch > broke it. Now __d_lookup() does a function call. > > And I _suspect_ it's all ok, because it turns out that > __d_lookup_rcu() is the *really* hot case, and that one has inlined it > all manually. > > But this kind of "we touch some *truly* core functionality, without > the acks from the maintainers, and then we *claim* to have relevant > acks" is really not even remotely ok. I raised the lack of a formal Acked-by from Al on the dcache change with Jeff a while ago and my understanding was that he reached out to Al and got the ack (after some ghosting on Al's behalf). I apologize if I got that wrong -- all this happened in the middle of Jeff transitioning his maintainership duties. > > I've pulled this because I suspect that d_same_name() thing is fine, > and I think the VM_WARN_ON_FOLIO() addition is completely wrong but > not horrendous. > > But you're on my tentative shit-list just for having claimed to have > appropriate acks and having been found wanting. > > Just for your information: fs/dcache.c is some of the most optimized > code in the kernel, and some of the subtlest. That RCU pathname lookup > is serious business. You don't make changes to pathname lookup just > willy nilly. There's a reason I start looking at individual patches > when I see it in the diffstat. Understood. Thanks, Ilya
On Thu, 2022-08-11 at 22:55 +0200, Ilya Dryomov wrote: > On Thu, Aug 11, 2022 at 10:04 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, Aug 11, 2022 at 8:25 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > [..] Several patches > > > touch files outside of our normal purview to set the stage for bringing > > > in Jeff's long awaited ceph+fscrypt series in the near future. All of > > > them have appropriate acks and sat in linux-next for a while. > > > > What? No. > > > > I'm looking at the fs/dcache.c change, for example, and don't see the > > relevant maintainers having acked it at all. > > > > The mmdebug.h file similarly seems to not have the actual maintainer > > acks, and seems just plain stupid (why does it add that new folio > > warning macro, when the existing folio warning macro > > VM_WARN_ON_ONCE_FOLIO() is *better*? > > > > Those are some very core files, and while the changes seem harmless, > > they sure don't seem obviously ok. > > > > What's the point of warning about bogus folios more than once? That's > > a debug warning - if it hits even once, that's already "uhhuh, > > something is bad". Showing the warning more than once is likely just > > going to cause more problems, not give you more information. > > Xiubo and Jeff used it to track down some issues between netfs library > and folio code that have been randomly plaguing our automated tests for > a couple of releases. We already knew that there were issues in that > area and the actual occurrences mattered. This was done in cooperation > with Willy and, since he was involved and this is a no-impact change, > I didn't think twice. > > > > > And did somebody verify that d_same_name() is still inlined in the > > place that truly *matters*? Because from my quick test, that patch > > broke it. Now __d_lookup() does a function call. > > > > And I _suspect_ it's all ok, because it turns out that > > __d_lookup_rcu() is the *really* hot case, and that one has inlined it > > all manually. > > > > But this kind of "we touch some *truly* core functionality, without > > the acks from the maintainers, and then we *claim* to have relevant > > acks" is really not even remotely ok. > > I raised the lack of a formal Acked-by from Al on the dcache change > with Jeff a while ago and my understanding was that he reached out to > Al and got the ack (after some ghosting on Al's behalf). I apologize > if I got that wrong -- all this happened in the middle of Jeff > transitioning his maintainership duties. > Actually, I never got a formal ack from Al. I did send it repeatedly, but I assume he has been too busy to respond. We've had it sitting in linux-next for a couple of months, and he did suggest that approach in the first place, but I too would also prefer to see his official ack on it. > > > > I've pulled this because I suspect that d_same_name() thing is fine, > > and I think the VM_WARN_ON_FOLIO() addition is completely wrong but > > not horrendous. > > > > But you're on my tentative shit-list just for having claimed to have > > appropriate acks and having been found wanting. > > > > Just for your information: fs/dcache.c is some of the most optimized > > code in the kernel, and some of the subtlest. That RCU pathname lookup > > is serious business. You don't make changes to pathname lookup just > > willy nilly. There's a reason I start looking at individual patches > > when I see it in the diffstat. > > Understood. > > Thanks, > > Ilya
On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote: > Actually, I never got a formal ack from Al. I did send it repeatedly, > but I assume he has been too busy to respond. We've had it sitting in > linux-next for a couple of months, and he did suggest that approach in > the first place, but I too would also prefer to see his official ack on > it. "Suggested approach" had been about inode_insert5() changes, right? But that's fs/inode.c side of things... I have to admit that I'd missed the unlining d_same_name() - exporting the sucker per se didn't look insane and I hadn't looked at that in details ;-/ Looking at it now... might be worth renaming it into __d_same_name(), leaving it inlined and exporting a wrapper; not sure if the impact on d_lookup()/__d_lookup()/d_alloc_parallel() is worth worrying about it, though. Profiling a case when we have a plenty of files in the same directory on tmpfs, with something earlier in the pathname to kick out of RCU mode (e.g. going through /proc/self/cwd) might be interesting...
On Thu, Aug 11, 2022 at 10:22:37PM +0100, Al Viro wrote: > On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote: > > > Actually, I never got a formal ack from Al. I did send it repeatedly, > > but I assume he has been too busy to respond. We've had it sitting in > > linux-next for a couple of months, and he did suggest that approach in > > the first place, but I too would also prefer to see his official ack on > > it. > > "Suggested approach" had been about inode_insert5() changes, right? > But that's fs/inode.c side of things... I have to admit that I'd missed > the unlining d_same_name() - exporting the sucker per se didn't look s/un/&in/ and I really need to grab some coffee...
On Thu, Aug 11, 2022 at 1:56 PM Ilya Dryomov <idryomov@gmail.com> wrote: > > On Thu, Aug 11, 2022 at 10:04 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > What's the point of warning about bogus folios more than once? That's > > a debug warning - if it hits even once, that's already "uhhuh, > > something is bad". Showing the warning more than once is likely just > > going to cause more problems, not give you more information. > > Xiubo and Jeff used it to track down some issues between netfs library > and folio code that have been randomly plaguing our automated tests for > a couple of releases. We already knew that there were issues in that > area and the actual occurrences mattered. This was done in cooperation > with Willy and, since he was involved and this is a no-impact change, > I didn't think twice. I don't mind the warning. I mind the "more than ONCE" part. If it's a "this shouldn't happen, but if it ever happens I want to know about it" situation, then the ONCE variant should be what you want. And that variant already existed, and adding a new and inferior macro seems to just have been pointless. As to dcache issues - if you really don't get an ack from Al, you can at least make me aware of it before-hand. That's one of the files that I at least personally care about, and while I would much prefer an ack from Al for anything that touches it, at least I'll likely be less unhappy about changes if I was made aware of them ahead of time, instead of seeing a pull request that suddenly changes that file. Linus
On Thu, 2022-08-11 at 22:22 +0100, Al Viro wrote: > On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote: > > > Actually, I never got a formal ack from Al. I did send it repeatedly, > > but I assume he has been too busy to respond. We've had it sitting in > > linux-next for a couple of months, and he did suggest that approach in > > the first place, but I too would also prefer to see his official ack on > > it. > > "Suggested approach" had been about inode_insert5() changes, right? Right. I was talking about this patch (which I think is sane): fs: change test in inode_insert5 for adding to the sb list > But that's fs/inode.c side of things... I have to admit that I'd missed > the unlining d_same_name() - exporting the sucker per se didn't look > insane and I hadn't looked at that in details ;-/ > > Looking at it now... might be worth renaming it into __d_same_name(), > leaving it inlined and exporting a wrapper; not sure if the impact on > d_lookup()/__d_lookup()/d_alloc_parallel() is worth worrying about it, > though. > > Profiling a case when we have a plenty of files in the same directory > on tmpfs, with something earlier in the pathname to kick out of RCU > mode (e.g. going through /proc/self/cwd) might be interesting... The d_name_name changes seemed ok to me, but it would be good to have your ack (or qualified NAK) if possible.
The pull request you sent on Thu, 11 Aug 2022 17:24:46 +0200:
> https://github.com/ceph/ceph-client.git tags/ceph-for-5.20-rc1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/786da5da5671c2d4cf812fe1ccc980bdde30c69e
Thank you!
On Thu, Aug 11, 2022 at 05:30:03PM -0400, Jeff Layton wrote: > On Thu, 2022-08-11 at 22:22 +0100, Al Viro wrote: > > On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote: > > > > > Actually, I never got a formal ack from Al. I did send it repeatedly, > > > but I assume he has been too busy to respond. We've had it sitting in > > > linux-next for a couple of months, and he did suggest that approach in > > > the first place, but I too would also prefer to see his official ack on > > > it. > > > > "Suggested approach" had been about inode_insert5() changes, right? > > Right. I was talking about this patch (which I think is sane): > > fs: change test in inode_insert5 for adding to the sb list It is, AFAICS. > > But that's fs/inode.c side of things... I have to admit that I'd missed > > the unlining d_same_name() - exporting the sucker per se didn't look > > insane and I hadn't looked at that in details ;-/ > > > > Looking at it now... might be worth renaming it into __d_same_name(), > > leaving it inlined and exporting a wrapper; not sure if the impact on > > d_lookup()/__d_lookup()/d_alloc_parallel() is worth worrying about it, > > though. > > > > Profiling a case when we have a plenty of files in the same directory > > on tmpfs, with something earlier in the pathname to kick out of RCU > > mode (e.g. going through /proc/self/cwd) might be interesting... > > The d_name_name changes seemed ok to me, but it would be good to have > your ack (or qualified NAK) if possible. Exporting the functionality? Sure, no problem. Uninlining that one... I suspect that it's OK, but I'd like to see profiling data; it's not as if it would be hard to return to having it inlined, obviously. Again, my apologies for not spotting that one.
On Thu, Aug 11, 2022 at 2:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > Exporting the functionality? Sure, no problem. Uninlining that one... > I suspect that it's OK, but I'd like to see profiling data; it's not > as if it would be hard to return to having it inlined, obviously. The only case where I think it might matter is in __d_lookup(), and it's probably not measurable. Yes, __d_lookup() does matter, but it only matters once you've fallen out of RCU mode, and at that point the cost of the function call is likely in the noise. I don't particularly like how it's inside that dentry hash chain loop, but realistically by then we've already done a function call for the dentry lock spinlock, so that loop already has to deal with it. Again, __d_lookup_rcu() is the place where adding a function call would matter more, because that one really does show up on profiles regularly. And it so carefully tries to avoid function calls (but the DCACHE_OP_COMPARE case causes problems: at one time a few years ago I actually wanted to move the DCACHE_OP_COMPARE *out* of the loop entirely, because it's loop invariant and having that unlikely cause inside the loop still causes bad things for register allocation). So I think the uninlining is fine. If I had been really unhappy about it I would have undone the pull. It was more the "I was told there would be cake, but there was no cake" that annoyed me. Linus
On Thu, Aug 11, 2022 at 02:52:26PM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2022 at 2:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Exporting the functionality? Sure, no problem. Uninlining that one... > > I suspect that it's OK, but I'd like to see profiling data; it's not > > as if it would be hard to return to having it inlined, obviously. > > The only case where I think it might matter is in __d_lookup(), and > it's probably not measurable. > > Yes, __d_lookup() does matter, but it only matters once you've fallen > out of RCU mode, and at that point the cost of the function call is > likely in the noise. > > I don't particularly like how it's inside that dentry hash chain loop, > but realistically by then we've already done a function call for the > dentry lock spinlock, so that loop already has to deal with it. FWIW, I wonder if we should do if (READ_ONCE(dentry->d_parent) != parent) continue; before grabbing ->d_lock (and repeat the check after grabbing it, of course). It's OK from correctness POV - we are OK with false negatives from __d_lookup() if concurrent rename happens. And it just might be a sufficiently large performance win...
On Thu, Aug 11, 2022 at 3:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > FWIW, I wonder if we should do > if (READ_ONCE(dentry->d_parent) != parent) > continue; > before grabbing ->d_lock (and repeat the check after grabbing it, It kind of makes sense. We already do that d_name.hash check outside of the lock, so we already have that "we might race with a rename" situation. That said, I do think __d_lookup_rcu() is the more important of the two. Here's a recreation of that patch I mentioned where the OP_COMPARE is moved out of the loop. Just for fun, look at how much better the code generation is for the common case when you don't have the call messing up the clobbered registers etc. Entirely untested, and I might have messed something up, but I suspect this is a much bigger deal than whether d_same_name() is inlined or not in the non-RCU path. Linus
On Thu, Aug 11, 2022 at 3:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Here's a recreation of that patch I mentioned where the OP_COMPARE is > moved out of the loop. Just for fun, look at how much better the code > generation is for the common case when you don't have the call messing > up the clobbered registers etc. Oh, sadly, clang does much worse here. Gcc ends up being able to not have a stack frame at all for __d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out. The gcc code really looks very nice. Clang, not so much, and it still has spills and reloads. The loop still ends up better with clang (since that test is no longer in the loop), but the code generated doesn't go from "ugly to really nice", it just goes from "ugly to still somewhat ugly". Linus
On Thu, Aug 11, 2022 at 03:22:38PM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2022 at 3:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > FWIW, I wonder if we should do > > if (READ_ONCE(dentry->d_parent) != parent) > > continue; > > before grabbing ->d_lock (and repeat the check after grabbing it, > > It kind of makes sense. We already do that d_name.hash check outside > of the lock, so we already have that "we might race with a rename" > situation. > > That said, I do think __d_lookup_rcu() is the more important of the two. > > Here's a recreation of that patch I mentioned where the OP_COMPARE is > moved out of the loop. Just for fun, look at how much better the code > generation is for the common case when you don't have the call messing > up the clobbered registers etc. > > Entirely untested, and I might have messed something up, but I suspect > this is a much bigger deal than whether d_same_name() is inlined or > not in the non-RCU path. Looks sane at the first pass, but right now I'm really half-asleep - 5 hours of sleep tonight, and about the same yesterday ;-/ I'll try to grab at least an hour or two and reread it once I'm more or less awake...
On Thu, Aug 11, 2022 at 3:43 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Oh, sadly, clang does much worse here. > > Gcc ends up being able to not have a stack frame at all for > __d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out. > The gcc code really looks very nice. > > Clang, not so much, and it still has spills and reloads. I ended up looking at the clang code generation more than I probably should have, because I found it so odd. Our code is literally written to not need that many values, and it should be easy to keep everything in registers. It turns out that clang is trying much too hard to be clever in dentry_string_cmp(). The code is literally written so that we keep the count of remaining characters in 'tcount', and then at the end we can generate a 'mask' from that to ignore the parts of the pathname that are beyond the size. So it creates the mask by just doing mask = bytemask_from_count(tcount); and the bytemask_from_count() is a very straightforward "take the byte mask, multiply by 8 to get the bitmask, and then you can use that to shift bits around and you get the byte mask: #define bytemask_from_count(cnt) (~(~0ul << (cnt)*8)) But clang seems to decide that this "multiply by 8" all is a very costly operation, and seems to have figured out that since we always do everyting one word at a time, so 'tcount' is always updated in sizeof(long), and that means that at the end 'tcount' is guaranteed to be the original 'tcount & 7' (on a 64-bit machine). End result: the above mask can be pre-computed before entering the loop at all. Which is absolutely true, but adds to register pressure, and means that you pre-compute that mask whether you actually need to or not. But hey, even that is fine, because you can actually move the mask outside *both* of the loops (both the hash chain *and* the inner "check each pathname" loop, because the initial value of 'tcount' is very much a fixed value per function call. The computation is pretty cheap, the bigger expense is that now you have one more thing you need to keep track of. But on its own, that would probably still be ok, because hey, we have extra registers. But it does make the liveness of that extra value be quite large. But the extra registers are then used by the fact that we have that b = load_unaligned_zeropad(ct); which *ALSO* has that incredibly expensive "multiply by 8" operation, except now it's a different value that needs to be multiplied by 8, namely the offset within an aligned long-word, which we use to fix up any unaligned faults that take exceptions. And since clang seems to - once again - think that multiplying by 8 is INCREDIBLY EXPENSIVE, what it does is say "hey, I see all the places where that base pointer is updated, and I can have my own internal variable that tracks that value, except I do "update-by-8" there. So now clang has created _another_ special internal variable that it updates inside the inner loop, which tracks the bit offset of the *aligned* part of the 'ct' variable, so that in the *extremely* unlikely (read: never actually happens) situation where that load_unaligned_zeropad takes an exception, it has the shift value pre-computed. And now clang runs out of registers, and honestly, it took me a *loong* time to understand what the generated code was even trying to do, because this was all so incredibly pointless and the code looked so very very odd. Anyway, it's amusing mostly because both of those "optimizations" that clang did actually required some very clever compiler work. It's just that they were both doing extra work in order to avoid an operation that was _less_ work in the first place. I suspect that there's some core clang optimization that goes "addition is much cheaper than multiplication, and if we have an index variable that is multiplied by a value, we can keep a shadow variable of that index that is 'pre-multiplied', and thus avoid the multiplication entirely". It's absolute genius. It just happens to generate really quite bad code, because multiplying by 8 outside the loop is _so_ much cheaper than what clang ends up generating. On the whole, I think that any value that needs one or two ALU operations to be calculated should *not* be seen as some prime example of something that needs to be pre-computed in order to move it outside the loop. The extra register pressure will make it a loss. But admittedly we also have no way to tell clang that that exception basically never happens, so maybe it thinks it is really important. I'm adding clang people to the list, in case anybody wants to look at it. The patch that started my path down this insanity is at https://lore.kernel.org/all/CAHk-=wjCa=Xf=pA2Z844WnwEeYgy9OPoB2kWphvg7PVn3ohScw@mail.gmail.com/ in case somebody gets an itch to look at a case where clang generates some very odd code. That said, this code has been streamlined so much that even clang can't *really* make the end result slow. Just not as good as what gcc does, because clang tries a bit too hard and bites off a bit more than it can chew. Linus
On Thu, Aug 11, 2022 at 08:58:54PM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2022 at 3:43 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Oh, sadly, clang does much worse here. > > > > Gcc ends up being able to not have a stack frame at all for > > __d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out. > > The gcc code really looks very nice. > > > > Clang, not so much, and it still has spills and reloads. > > I ended up looking at the clang code generation more than I probably > should have, because I found it so odd. > > Our code is literally written to not need that many values, and it > should be easy to keep everything in registers. > > It turns out that clang is trying much too hard to be clever in > dentry_string_cmp(). The code is literally written so that we keep the > count of remaining characters in 'tcount', and then at the end we can > generate a 'mask' from that to ignore the parts of the pathname that > are beyond the size. [snip] There's a cheap way to reduce the register pressure: seq = raw_seqcount_begin(&dentry->d_seq); if (dentry->d_parent != parent) continue; if (d_unhashed(dentry)) continue; if (dentry->d_name.hash_len != hashlen) continue; if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0) continue; *seqp = seq; could move the last store to before dentry_cmp(). Sure, we might get some extra stores out of that. Into a hot cacheline, and if we really hit many of those extra stores, we already have a problem - a lot of collisions both in ->d_parent and ->d_name.hash_len. If that happens, the cost of those extra stores is going to be trivial noise.
On Sun, Aug 14, 2022 at 12:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > There's a cheap way to reduce the register pressure: > seq = raw_seqcount_begin(&dentry->d_seq); > if (dentry->d_parent != parent) > continue; > if (d_unhashed(dentry)) > continue; > if (dentry->d_name.hash_len != hashlen) > continue; > if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0) > continue; > *seqp = seq; > could move the last store to before dentry_cmp(). I actually tried that, it doesn't really end up helping. Gcc does well regardless, and clang ends up really wanting to move so much out of the dentry_cmp() loop that it runs out of registers and always ends up doing a couple of spills. I think it reduced the spills by one, but not enough to generate the nice non-frame code that gcc does. It's a bit ugly, but the code probably performs quite well - the spills aren't in the inner loop. Linus
On Sun, Aug 14, 2022 at 1:03 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Gcc does well regardless, and clang ends up really wanting to move so > much out of the dentry_cmp() loop that it runs out of registers and > always ends up doing a couple of spills. > > I think it reduced the spills by one, but not enough to generate the > nice non-frame code that gcc does. Note that that code was basically written to make gcc happy, so the fact that clang then does worse is not hugely surprising. I dug into it some more, and it is really "load_unaligned_zeropad()" that makes clang really uncomfortable. The problem ends up being that clang sees that it's inside that inner loop, and tries very hard to optimize the shift-and-mask that happens if the exception happens. The fact that the exception *never* happens unless DEBUG_PAGEALLOC is enabled - and very very seldom even then - is not something we can really explain to clang. So it thinks that code is really hot in the inner loop, and does all kinds of silly things due to that. Gcc, in contrast, generates the obvious straightforward code, and that's what I wrote and optimized that code for. Linus
On Sun, Aug 14, 2022 at 1:29 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I dug into it some more, and it is really "load_unaligned_zeropad()" > that makes clang really uncomfortable. > > The problem ends up being that clang sees that it's inside that inner > loop, and tries very hard to optimize the shift-and-mask that happens > if the exception happens. > > The fact that the exception *never* happens unless DEBUG_PAGEALLOC is > enabled - and very very seldom even then - is not something we can > really explain to clang. Probably if we could express that the do_exception label in load_unaligned_zeropad was cold then that might help. https://github.com/llvm/llvm-project/issues/46831 Otherwise, could we put the exceptional case statements in a noinline or cold attributed function?