[v2,0/8] scheduler tinification

Message ID 20170606232450.30278-1-nicolas.pitre@linaro.org

Message

Nicolas Pitre June 6, 2017, 11:24 p.m. UTC
Many embedded systems don't need the full scheduler support. Most of the
time, user space is tightly controlled and many of the scheduler facilities
are simply unused.

This patch series makes it possible to configure out some parts of the
scheduler, such as the deadline and realtime scheduler classes. The savings
in kernel footprint are non-negligible.

Small ARM kernel config before this series:

   text    data     bss     dec     hex filename
  28623    3404     128   32155    7d9b kernel/sched/built-in.o

With this series and dl and rt classes disabled:

   text    data     bss     dec     hex filename
  20734    3334      40   24108    5e2c kernel/sched/built-in.o

A significant part of the remaining code is support for various system calls
that could be automatically removed when user space doesn't use them, but that
is a topic for another day.

Changes from v1:

- the deadline class is configurable independently from the realtime class
- split of the PI futex code to make non-PI futexes available when RT
  is configured out
- removal of many #ifdefs to keep the code more readable

diffstat for this series:

 include/linux/futex.h          |    7 +-
 include/linux/init_task.h      |   15 +-
 include/linux/rtmutex.h        |   69 +
 include/linux/sched.h          |    4 +
 include/linux/sched/deadline.h |    8 +-
 include/linux/sched/rt.h       |   10 +-
 init/Kconfig                   |   28 +-
 kernel/futex.c                 | 2829 ++++++++--------------------------
 kernel/futex_pi.c              | 1563 +++++++++++++++++++
 kernel/locking/Makefile        |    3 +
 kernel/locking/locktorture.c   |    4 +-
 kernel/locking/rtmutex.c       |    6 +-
 kernel/sched/Makefile          |    7 +-
 kernel/sched/core.c            |  759 +--------
 kernel/sched/cpudeadline.h     |    7 +-
 kernel/sched/deadline.c        |  340 ++++
 kernel/sched/debug.c           |    6 +
 kernel/sched/rt.c              |  315 +++-
 kernel/sched/sched.h           |   88 +-
 kernel/sched/stop_task.c       |    6 +
 kernel/sysctl.c                |    4 +-
 kernel/time/posix-cpu-timers.c |    7 +-
 lib/Kconfig.debug              |    2 +-
 23 files changed, 3190 insertions(+), 2897 deletions(-)

Comments

Ingo Molnar June 7, 2017, 4 p.m. UTC | #1
* Nicolas Pitre <nicolas.pitre@linaro.org> wrote:

> Many embedded systems don't need the full scheduler support. Most of the
> time, user space is tightly controlled and many of the scheduler facilities
> are simply unused.

Sorry, NAK:

>  23 files changed, 3190 insertions(+), 2897 deletions(-)

That's a lot of extra code plus churn for a code base that is already pretty
#ifdef heavy.

Also, the savings are marginal, even with significant functionality disabled:

>   text    data     bss     dec     hex filename
>  28623    3404     128   32155    7d9b kernel/sched/built-in.o
>
> With this series and dl and rt classes disabled:
>
>   text    data     bss     dec     hex filename
>  20734    3334      40   24108    5e2c kernel/sched/built-in.o

With 1GHz + 1GB RAM SoCs being well below $10 in bulk we worry about code 
complexity, predictability, testability, behavioral and ABI uniformity a lot more 
than about the last 10-20k of kernel text footprint...

So I think the 'tiny' efforts are fundamentally misguided and are shooting for an 
ever shrinking market of RAM/ROM starved products whose share is shrinking every 
month.

We want to _remove_ kernel options and reduce complexity, not increase it.

So unless there's convincing counter arguments, or Linus overrules me, this NAK is 
pretty firm.

I'd love to see scheduler complexity reduction patches though, the "CPP count" of 
the scheduler code base is pretty damn high:

  triton:~/tip> git grep -h '^#[^ ]' kernel/sched/  | cut -d' ' -f1 | sort | uniq -c | sort -n | tail -10
      2 #ifdef  CONFIG_SCHED_DEBUG
      4 #endif  /*
     19 #if
     26 #ifndef
     27 #undef
     97 #else
    161 #define
    199 #include
    317 #ifdef
    361 #endif

Thanks,

	Ingo
Nicolas Pitre June 7, 2017, 5:09 p.m. UTC | #2
On Wed, 7 Jun 2017, Ingo Molnar wrote:

> * Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>
> > Many embedded systems don't need the full scheduler support. Most of the
> > time, user space is tightly controlled and many of the scheduler facilities
> > are simply unused.
>
> Sorry, NAK:
>
> >  23 files changed, 3190 insertions(+), 2897 deletions(-)
>
> That's a lot of extra code plus churn for a code base that is already pretty
> #ifdef heavy.
>
> Also, the savings are marginal, even with significant functionality disabled:
>
> >   text    data     bss     dec     hex filename
> >  28623    3404     128   32155    7d9b kernel/sched/built-in.o
> >
> > With this series and dl and rt classes disabled:
> >
> >   text    data     bss     dec     hex filename
> >  20734    3334      40   24108    5e2c kernel/sched/built-in.o
>
> With 1GHz + 1GB RAM SoCs being well below $10 in bulk we worry about code
> complexity, predictability, testability, behavioral and ABI uniformity a lot more
> than about the last 10-20k of kernel text footprint...
>
> So I think the 'tiny' efforts are fundamentally misguided and are shooting for an
> ever shrinking market of RAM/ROM starved products whose share is shrinking every
> month.

I'm rather seeing the opposite: an ever growing market of 
internet-connected coin-cell-battery-powered tiny devices where the 
amount of RAM is counted in kilobytes rather than megabytes.

Let me repeat some background as to what my fundamental motivation is, 
and then maybe you'll understand why I'm doing this.

What is the biggest buzzword in the IT industry besides AI right now?
It is IOT.

Most IOT targets are so small that people are rewriting new operating 
systems from scratch for them. Lots of fragmentation already exists. 
We're talking about systems with less than one megabyte of RAM, 
sometimes much less.  Still, those things are being connected to the 
internet. And this is going to be a total security nightmare.

I wish to be able to leverage the Linux ecosystem for as much of the IOT 
space as possible to avoid the worst of those nightmares.  The Linux 
ecosystem has a *lot* of knowledgeable people around it, a lot of 
testing infrastructure and tooling available already, etc.  If a 
security issue turns up on Linux, it has a greater chance of being 
caught early, or fixed quickly otherwise, and finding people with the 
right knowledge is easier on Linux than it could be on any RTOS out 
there. Still with me so far?

Yes we have tools that can automatically reduce the kernel size. We can 
use LTO with the compiler, etc.  LTO is pretty good already. It can 
typically reduce the kernel size by 20%.  If all system calls are 
disabled except for a few, then LTO can get rid of another 20%. The 
minimal kernel I get is still 400-500 KB in size.  That's still too big.

There is this 120 KB of VFS code that is always there even though there 
is no real filesystem at all configured in the kernel. There is that 
other 100 KB of core driver support code despite the fact that the set 
of drivers I'm using are very simple and make no use of most of that 
core driver code. Etc.

There comes a point where there is no option but to explicitly trim out 
parts of the kernel as such decisions cannot be automated, hence this 
patch series. Bringing the scheduler under 20KB in size is therefore 
very useful in that context. Alternatively I could push for a parallel 
implementation as I did with the TTY layer where I obtained a 6x size 
reduction. But in the scheduler case I obtained only a 2x size reduction 
so I thought it could be more profitable to get about the same saving by 
reworking the existing code instead, and eventually contributing a very 
bare scheduler class that would be a smaller alternative to the fair 
scheduler for deployments where that makes sense. Unless you actually 
changed your mind about alternative whole scheduler implementations that 
is...

For Linux to be suitable for small IoT, it has to be small, damn small. 
My target is 256 KB of RAM.  And if you look at the kind of application 
those 256-KB systems are doing, it's basically one main task typically 
acquiring sensor data and sending it in some encrypted protocol over a 
wireless network on the internet, and possibly accepting commands back.  
So what do you need from the OS to achieve that?  A few system calls, a 
minimal scheduler, minimal memory management, minimal filesystem 
structure and minimal network stack. And your user app.

So, why not have each of those blocks created using the existing 
Linux syscall interface and internal API?  At that point, it should be 
possible to take your standard full-featured Linux workstation and 
develop your user app on it, run it there using all the existing native 
debugging tools, etc. In the end you just pick the mini version of 
everything for the final target and you're done.  And you don't have to 
learn a whole new OS, development environment and program model, etc.

Next on my list would be a cache-less, completely serialized VFS bypass 
that has only what's needed to make the link between the read/write 
syscalls, a filesystem driver and a block driver while preserving the 
existing kernel APIs. And by being really small, the maintenance cost of 
a "parallel" implementation isn't very high, certainly much less than 
trying to maintain a single code path that can scale to both extremes 
in that case.

PS: As far as I remember, Linus didn't condemn the idea last time I 
    brought up this topic in his presence. I therefore hope we could 
    find ways for allowing Linux usage into the largest computing device 
    deployment to come.


Nicolas
Alan Cox June 7, 2017, 6:49 p.m. UTC | #3
> Next on my list would be a cache-less, completely serialized VFS bypass
> that has only what's needed to make the link between the read/write
> syscalls, a filesystem driver and a block driver while preserving the
> existing kernel APIs. And by being really small, the maintenance cost of
> a "parallel" implementation isn't very high, certainly much less than
> trying to maintain a single code path that can scale to both extremes
> in that case.

So once you've rewritten the tty layer, the device drivers, the VFS and
removed most of the syscalls why even pretend it's Linux any more. It's
something else, and that something else is totally architecturally
incompatible with Linux. That's btw a good thing - trying to fit Linux
directly into such a tiny device isn't sensible because the core
assumptions you make about scalability are just totally different.

IMHO it would be far far better to just borrow the bits that look handy,
and the bits of the ABI you need and put them together as a new OS
kernel. When you look at tiny hardware even core bits of the Linux
architecture like the wait queues are just not sensible uses of memory
and cause fragmentation. The dcache is completely insane in that
environment, the scheduler is total overkill and the networking is easy
to DoS in a tiny memory. The device layer assumes dynamic hot pluggable
device architecture - and that's extremely expensive but nonsensical for
most µcontrollers.

It's easy to put a Unixlike OS in 256K of RAM and a pile of flash. It's
going to be pretty easy to put all the major bits of the Linux API into
it. You can run 2.11BSD with only 256K of writable memory (you need more
in your PDP-11 to run it but if you look all of that in a µcontroller
would live in flash).


Alan
Nicolas Pitre June 7, 2017, 9:15 p.m. UTC | #4
On Wed, 7 Jun 2017, Alan Cox wrote:

> > Next on my list would be a cache-less, completely serialized VFS bypass
> > that has only what's needed to make the link between the read/write
> > syscalls, a filesystem driver and a block driver while preserving the
> > existing kernel APIs. And by being really small, the maintenance cost of
> > a "parallel" implementation isn't very high, certainly much less than
> > trying to maintain a single code path that can scale to both extremes
> > in that case.
>
> So once you've rewritten the tty layer, the device drivers, the VFS and
> removed most of the syscalls why even pretend it's Linux any more. It's
> something else, and that something else is totally architecturally
> incompatible with Linux.

You got at least one thing wrong. One huge benefit is to leverage 
existing device drivers of which Linux is plentiful. So there is no 
point rewriting device drivers.

If most syscalls are removed then *of course* you won't be able to 
boot a standard "Linux" distro on it. But that's not the point either. 
However the compatibility is preserved the other way around i.e. user 
space from this Linux subset should just work as is on a full Linux 
kernel. And it would still be a Linux code base i.e. architecturally 
compatible with Linux at the source level.

> That's btw a good thing - trying to fit Linux
> directly into such a tiny device isn't sensible because the core
> assumptions you make about scalability are just totally different.

For a couple core components that's true, hence my approach with the TTY 
layer. But many other parts aren't that bad. And given that a small 
system can't afford that many whistles and bells, it is not as if
the whole of Linux would be rewritten anyway.

> IMHO it would be far far better to just borrow the bits that look
> handy, and the bits of the ABI you need and put them together as a new
> OS kernel.

Hasn't that been attempted and failed already? One nasty effect of such 
an approach is effectively the creation of a fork, then you completely 
lose the community leverage and gravitational effect, create 
fragmentation, fixes are not propagated across, etc.

> When you look at tiny hardware even core bits of the Linux
> architecture like the wait queues are just not sensible uses of memory
> and cause fragmentation. The dcache is completely insane in that
> environment, the scheduler is total overkill and the networking is easy
> to DoS in a tiny memory. The device layer assumes dynamic hot pluggable
> device architecture - and that's extremely expensive but nonsensical for
> most µcontrollers.

Why do you think I'm proposing scheduler patches? And TTY patches before 
that, and having plans for the VFS? Obviously, all those things could be 
reimplemented for small scale in a new and separate tiny OS. But what if 
those things could just live in the Linux source tree alongside their 
big cousins and be swapped according to your needs? Why couldn't the 
arguments that have been served to the embedded people for years about 
joining the mainline effort be extended to this use case as well?

> It's easy to put a Unixlike OS in 256K of RAM and a pile of flash. It's
> going to be pretty easy to put all the major bits of the Linux API into
> it. You can run 2.11BSD with only 256K of writable memory (you need more
> in your PDP-11 to run it but if you look all of that in a µcontroller
> would live in flash).

Would be nice if that could share the same source code whenever 
possible, and also the same source tree, no?


Nicolas
Alan Cox June 7, 2017, 9:53 p.m. UTC | #5
> You got at least one thing wrong. One huge benefit is to leverage
> existing device drivers of which Linux is plentiful. So there is no
> point rewriting device drivers.

So you want to keep a common interface for some of the common driver
APIs. Several people have managed that.

> > IMHO it would be far far better to just borrow the bits that look
> > handy, and the bits of the ABI you need and put them together as a new
> > OS kernel.
>
> Hasn't that been attempted and failed already? One nasty effect of such
> an approach is effectively the creation of a fork, then you completely
> lose the community leverage and gravitational effect, create
> fragmentation, fixes are not propagated across, etc.

Almost nothing can be shared though. As for the drivers you want to
re-use, if you can re-use them you can share the code for that.

> Why do you think I'm proposing scheduler patches? And TTY patches before
> that, and having plans for the VFS? Obviously, all those things could be
> reimplemented for small scale in a new and separate tiny OS. But what if
> those things could just live in the Linux source tree alongside their
> big cousins and be swapped according to your needs? Why couldn't the
> arguments that have been served to the embedded people for years about
> joining the mainline effort be extended to this use case as well?

I don't think it works like that. The overhead of the duplication
and trying to keep them aligned rapidly exceeds the value they give. The
moment you try and do the job well you also ...

> > It's easy to put a Unixlike OS in 256K of RAM and a pile of flash. It's
> > going to be pretty easy to put all the major bits of the Linux API into
> > it. You can run 2.11BSD with only 256K of writable memory (you need more
> > in your PDP-11 to run it but if you look all of that in a µcontroller
> > would live in flash).
>
> Would be nice if that could share the same source code whenever
> possible, and also the same source tree, no?

But that will never work. The fundamental architecture of a tiny system
is different because the scaling rules and underlying algorithms are
different. Wait queues don't work sanely on tiny devices, TCP queues need
a totally different architecture, scheduling is quite different, memory
management is totally different, things like the dcache which is fairly
fundamental to the VFS internals make no sense, the locking model for
file systems makes no sense because you can't use all that expensive
scaling. Even the device core which is designed for dynamically managed
trees of devices with hotplug, discovery and power management hierarchies
is basically a large resource expensive paper weight.

It goes on and on. Add any desire to do hard real time or meet things
like ASIL-B to that and you hit a brick wall pretty damned quick.

When you proposed the tty changes I was dubious, now you are talking
about basically writing a new OS kernel in the same git tree that shares
the drivers, and it looks even less sensible from a Linux perspective.

Alan
Ingo Molnar June 8, 2017, 7:59 a.m. UTC | #6
Also, let me make it clear at the outset that we do care about RAM footprint all 
the time, and I've applied countless data structure and .text reducing patches to 
the kernel. But there's a cost/benefit analysis to be made, and this series fails 
that test in my view, because it increases the complexity of an already complex 
code base:

* Nicolas Pitre <nicolas.pitre@linaro.org> wrote:

> Most IOT targets are so small that people are rewriting new operating systems
> from scratch for them. Lots of fragmentation already exists.

Let me offer a speculative if somewhat cynical prediction: 90% of those ghastly 
IOT hardware hacks won't survive the market. The remaining 10% will be successful 
financially, despite being ghastly hardware hacks and will eventually, in the next 
iteration or so, get a proper OS.

As users ask for more features the hardware capabilities will increase 
dramatically and home-grown microcontroller derived code plus minimal OSes will be 
replaced by a 'real' OS. Because both developers and users will demand IPv6 
compatibility, or Bluetooth connectivity, or storage support, or any random range 
of features we have in the Linux kernel.

With the stroke of a pen from the CFO: "yes, we can spend more on our next 
hardware design!" the problem goes away, overnight, and nobody will look back at 
the hardware hack that had only 1MB of RAM.

> [...] We're talking about systems with less than one megabyte of RAM, sometimes
> much less.

Two data points:

Firstly, by the time any Linux kernel change I commit today gets to a typical 
distro it's at least 0.5-1 years, 2 years for it to get widely used by hardware 
shops - 5 years to get used by enterprises. More latency in more conservative 
places.

Secondly, I don't see Moore's Law reversing:

   http://nerdfever.com/wp-content/uploads/2015/06/2015-06_Moravec_MIPS.png

If you combine those two time frames, the consequence is this:

Even taking the 1MB size at face value (which I don't: a networking enabled system 
can probably not function very well with just 1MB of RAM) - the RAM-starved 1 MB 
system today will effectively be a 2 MB system in 2 years.

And yes, I don't claim Moore's law will go on forever and I'm oversimplifying - 
maybe things are slowing down and it will only be 1.5 MB, but the point remains: 
the importance of your 20kb .text savings will become a 10-15k .text savings in 
just 2 years. In 8 years today's 1 MB system will be a 32 MB system if that trend 
holds up.

You can already fit a mostly full Linux system into 32 MB just fine, i.e. the 
problem has solved itself just by waiting a bit or by increasing the hardware 
capabilities a bit.

But the kernel complexity you introduce with this series stays with us! It will be 
an additional cost added to many scheduler commits going forward. It's an added 
cost for all the other usecases.

Also, it's not like 20k .text savings will magically enable Linux to fit into 1MB 
of RAM - it won't. The smallest still practical more or less generic Linux system 
in existence today is around 16 MB. You can shrink it more, but the effort 
increases exponentially once you go below a natural minimum size.

> [...]  Still, those things are being connected to the internet. [...]

So while I believe small size has its value, I think it's far more important to be 
able to _trust_ those devices than to squeeze the last kilobyte out of the kernel.

In that sense these qualities:

 - reducing complexity,
 - reducing actual line count,
 - increasing testability,
 - increasing reviewability,
 - offering behavioral and ABI uniformity

are more important than 1% of RAM of very, very RAM starved system which likely 
won't use Linux to begin with...

So while the "complexity vs. kernel size" trade-off will obviously always be a 
judgement call, for the scheduler it's not really an open question what we need to 
do at this stage: we need to reduce complexity and #ifdef variants, not increase 
it.

Thanks,

	Ingo
Alan Cox June 8, 2017, 6:14 p.m. UTC | #7
> As users ask for more features the hardware capabilities will increase
> dramatically and home-grown microcontroller derived code plus minimal OSes will be
> replaced by a 'real' OS. Because both developers and users will demand IPv6
> compatibility, or Bluetooth connectivity, or storage support, or any random range
> of features we have in the Linux kernel.

There are already tiny OS's with that feature set but they don't feel
Unixish and aren't quite so fun to program.

> Even taking the 1MB size at face value (which I don't: a networking enabled system
> can probably not function very well with just 1MB of RAM) - the RAM-starved 1 MB
> system today will effectively be a 2 MB system in 2 years.

Probably not - I may be wrong but power and what you can and can't put on
the same die are likely to mean that small RAM devices are here for a
while and in fact the CFO will be ordering the engineers to get it in
less RAM to save 20 cents a unit.

> And yes, I don't claim Moore's law will go on forever and I'm oversimplifying -
> maybe things are slowing down and it will only be 1.5 MB, but the point remains:
> the importance of your 20kb .text savings will become a 10-15k .text savings in
> just 2 years. In 8 years today's 1 MB system will be a 32 MB system if that trend
> holds up.

Power means it's more likely IMHO that todays 256K RAM system will in a
few years be either a 64K RAM system or have tons of persistent memory.

Alan
Nicolas Pitre June 8, 2017, 8:16 p.m. UTC | #8
On Thu, 8 Jun 2017, Ingo Molnar wrote:

> Also, let me make it clear at the outset that we do care about RAM footprint all
> the time, and I've applied countless data structure and .text reducing patches to
> the kernel. But there's a cost/benefit analysis to be made, and this series fails
> that test in my view, because it increases the complexity of an already complex
> code base:
>
> * Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>
> > Most IOT targets are so small that people are rewriting new operating systems
> > from scratch for them. Lots of fragmentation already exists.
>
> Let me offer a speculative if somewhat cynical prediction: 90% of those ghastly
> IOT hardware hacks won't survive the market. The remaining 10% will be successful
> financially, despite being ghastly hardware hacks and will eventually, in the next
> iteration or so, get a proper OS.

Your prediction is based on a false premise. There is simply no money to 
be made with IoT hardware, especially in the low end.  Those little 
devices will be given away for free because it is in the service 
subscription that the money is. So the hardware has to, and will be, 
extremely cheap to produce. If a serious bug turns up in one of those 
devices, my own cynical prediction is that no one will bother with field 
upgradability and they will ask you to throw the device away instead 
while they ship you a replacement (field upgradability implies at least 
twice the flash memory size and that comes with a cost so some will 
gamble that obsolescence will happen before a serious bug turns up).

> As users ask for more features the hardware capabilities will increase
> dramatically and home-grown microcontroller derived code plus minimal OSes will be
> replaced by a 'real' OS. Because both developers and users will demand IPv6
> compatibility, or Bluetooth connectivity, or storage support, or any random range
> of features we have in the Linux kernel.

The "Cloud" is taking care of most of that. For the rest, your cellphone 
or IoT gateway will take over. IPv6 stacks are already used in tiny 
microcontrollers with as low as 32KB of RAM.

> With the stroke of a pen from the CFO: "yes, we can spend more on our next
> hardware design!" the problem goes away, overnight, and nobody will look back at
> the hardware hack that had only 1MB of RAM.

Of course hobbyists can already get a Raspberry Pi Zero and run a full 
featured Linux distro on it... for a mere 5 bucks. That comes with 512MB 
of RAM so my patches certainly don't make a difference there.

But that's not that simple.  First there is a fundamental constraint 
which is power consumption. If you want your device to run for months 
(some will hope years) from the same tiny battery then you just cannot 
afford SDRAM. So we're talking static RAM here. And to keep costs down 
because you want to give away your thingies by the millions for free it 
usually means single-chip designs with on-chip sub-megabyte static RAM.  
And in that field the 256KB mark is located towards the high end of the 
spectrum.  Many IPv6-capable chips available today have less than that.

And the thing is: people already manage to do an awful lot of stuff in 
such a constrained device. Some probably did a good job of it, but most 
of them likely suck and we don't know about their bugs because we have 
no idea what's running inside.

And because it is rather easy to write a new OS from scratch for such a 
small environment (and who didn't dream of writing his own OS, right?) 
then about every company in that field did so. That's not counting most 
Open Source ones which usually are close to single-person projects. So 
you get a lot of fragmentation, very very little peer review, and no 
incentive for proper maintenance because the cost saving simply isn't 
significant enough.

It is just like asteroids. Some of them collapse to form bigger objects 
like planets, while others have too weak a gravitational field to gather 
more matter. My vision is about leveraging the Linux gravitational power 
to bring the tiny embedded space together because, on its own, the tiny 
embedded space simply has not enough community power to actually 
organize itself.

Of course there are important parts of Linux that couldn't be reused as 
is in such a setup, but yet many other things still can be reused with 
either some modifications or a tiny parallel subsystem substitution. 
Technically, it is always possible to find ways to make it low on 
maintenance and beneficial to the wider community. But first and 
foremost you have to agree with the fundamental principle of gathering 
more people around a common codebase to make it better for everyone and 
not suggest that they stick to themselves. If you agree to that then we 
can move back to a technical discussion.

> > [...] We're talking about systems with less than one megabyte of RAM, sometimes
> > much less.
>
> Two data points:
>
> Firstly, by the time any Linux kernel change I commit today gets to a typical
> distro it's at least 0.5-1 years, 2 years for it to get widely used by hardware
> shops - 5 years to get used by enterprises. More latency in more conservative
> places.

Don't forget that you are also merging patches today from the Android 
folks that have been deployed into actual products years ago. So the 
enterprise distro comparison simply has no commonalities here.

> Secondly, I don't see Moore's Law reversing:
>
>    http://nerdfever.com/wp-content/uploads/2015/06/2015-06_Moravec_MIPS.png
>
> If you combine those two time frames, the consequence is this:
>
> Even taking the 1MB size at face value (which I don't: a networking enabled system
> can probably not function very well with just 1MB of RAM) - the RAM-starved 1 MB
> system today will effectively be a 2 MB system in 2 years.

As surprising as it might be, IPv6 stacks requiring only a few dozen 
kilobytes of memory do exist. Not so surprisingly though, some people 
think that the existing stacks simply suck and they are rewriting yet 
another one ... because they think their own will be better of course.

So there *is* still a huge market for sub-megabyte systems. I was also 
counting on Moore's law so that by the time Linux actually has the 
ability to be tailored for such systems then typical SRAM in those 
10-cents microcontrollers will be 512KB instead of 128 or 32.

> You can already fit a mostly full Linux system into 32 MB just fine, i.e. the
> problem has solved itself just by waiting a bit or by increasing the hardware
> capabilities a bit.

You just can't procure SDRAM chips smaller than 32MB on the market 
anymore. That's why Linux didn't get any pressure to fit in smaller than 
that for quite a while. But I've heard of some people having use cases 
for thousands if not millions of Linux VMs on a single server and 
they're looking at 10MB VMs or smaller for their application.

> But the kernel complexity you introduce with this series stays with us! It will be
> an additional cost added to many scheduler commits going forward. It's an added
> cost for all the other usecases.

OK, let's talk about that a bit. How is sched/core.c with its 7387 
lines not overly complex already? How is my moving of rt related code to 
rt.c and dl related code to dl.c not helping things? Isn't it easier to 
understand the 3500 lines of code in futex.c when half of it i.e. the PI 
specific code is split into a separate file? I ask you.

If you want to pick only those patches for now then please be my guest. 
At least the first two patches of the series should be mergeable without 
even a doubt.

As to the actual complexity I'm introducing... this is just about not 
compiling some files in and stubbing calls to them out. Isn't that a 
sign of good isolation when you can stub the dl class out with only 9 
insertions and 6 deletions to sched/core.c? I'm not saying the 
complexity is nonexistent here, but just the _ability_ to remove a 
scheduler class enforces code abstractions which should be a good thing 
maintenance wise, no?
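
To illustrate the pattern (the config symbol and function name below
are placeholders for illustration, not necessarily what the series
uses), the private scheduler header can provide a no-op stub when a
class is configured out, so call sites in core.c stay free of #ifdefs:

  /* kernel/sched/sched.h -- illustrative sketch only */
  #ifdef CONFIG_SCHED_DL
  /* real implementation lives in deadline.c, built only when set */
  extern void init_dl_rq(struct dl_rq *dl_rq);
  #else
  /* empty inline stub: the call in core.c compiles away entirely */
  static inline void init_dl_rq(struct dl_rq *dl_rq) { }
  #endif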

> Also, it's not like 20k .text savings will magically enable Linux to fit into 1MB
> of RAM - it won't. The smallest still practical more or less generic Linux system
> in existence today is around 16 MB. You can shrink it more, but the effort
> increases exponentially once you go below a natural minimum size.

Again, I'm not after a tiny-and-generic Linux target. I'm after a 
tiny-and-heavily-tailored Linux subset that shares the same ABI and API 
as the generic Linux. Once you start compiling out pieces of the core 
kernel, it obviously isn't generic anymore, but the potential for size 
reduction becomes much bigger.

Anyway... as I said, you have to agree with the high level goal and 
principle of leveraging the Linux codebase to gather the tiny embedded 
people around it. The tiny embedded community simply will never take 
hold otherwise. If we cannot agree on that then any other point of 
discussion is moot. In which case I'll simply drop this project entirely 
and move on.


Nicolas
Ingo Molnar June 11, 2017, 9:23 a.m. UTC | #9
* Nicolas Pitre <nicolas.pitre@linaro.org> wrote:

> > But the kernel complexity you introduce with this series stays with us! It
> > will be an additional cost added to many scheduler commits going forward. It's
> > an added cost for all the other usecases.
>
> OK, let's talk about that a bit. How is sched/core.c with its 7387
> lines not overly complex already? How is my moving of rt related code to
> rt.c and dl related code to dl.c not helping things? Isn't it easier to
> understand the 3500 lines of code in futex.c when half of it i.e. the PI
> specific code is split into a separate file? I ask you.
>
> If you want to pick only those patches for now then please be my guest.
> At least the first two patches of the series should be mergeable without
> even a doubt.

That's a strawman argument - I was reacting to the combined effect of your series:

> > >  23 files changed, 3190 insertions(+), 2897 deletions(-)

A subset of the patches might be fine and note that in fact I already picked a 
patch from your series that made sense, I committed this patch of yours three days 
ago:

  f5832c1998af: sched/core: Omit building stop_sched_class when !SMP

I'll pick others as well as long as they don't complicate the code. Please send a 
revised series that only does unambiguous complexity reduction/cleanups.

Thanks,

	Ingo
Ingo Molnar June 11, 2017, 9:42 a.m. UTC | #10
* Nicolas Pitre <nicolas.pitre@linaro.org> wrote:

> > With the stroke of a pen from the CFO: "yes, we can spend more on our next
> > hardware design!" the problem goes away, overnight, and nobody will look back at
> > the hardware hack that had only 1MB of RAM.
>
> Of course hobbyists can already get a Raspberry Pi Zero and run a full
> featured Linux distro on it... for a mere 5 bucks. That comes with 512MB
> of RAM so my patches certainly don't make a difference there.

Note that those mere 5 bucks are probably 50 cents or less in bulk. Perfectly fine 
economics for many types of 'throw away IoT hardware' products.

> But that's not that simple.  First there is a fundamental constraint
> which is power consumption. If you want your device to run for months
> (some will hope years) from the same tiny battery then you just cannot
> afford SDRAM. So we're talking static RAM here. And to keep costs down
> because you want to give away your thingies by the millions for free it
> usually means single-chip designs with on-chip sub-megabyte static RAM.
> And in that field the 256KB mark is located towards the high end of the
> spectrum.  Many IPv6-capable chips available today have less than that.
>
> And the thing is: people already manage to do an awful lot of stuff in
> such a constrained device. Some probably did a good job of it, but most
> of them likely suck and we don't know about their bugs because we have
> no idea what's running inside.

Ok, let me put it this way: there's no way in hell I see a viable Linux kernel 
running (no matter how stripped down) in 32K or even 64K of RAM. 256K is a stretch 
as well - but that RAM size you claim to be already 'high end', so it probably 
wouldn't be used as a standardized solution anyway...

Today a Linux 'allnoconfig' kernel, i.e. a kernel with no device drivers and no 
filesystems whatsoever and with everything optional turned off (including all 
networking!), is over 2MB large text+data (on x86, which has a compressed 
instruction set - it would possibly be larger on simpler CPUs):

 triton:~/tip> size vmlinux
    text    data     bss     dec  filename
  926056  208624 1215904 2350584  vmlinux

A series that shrinks the .text size of the allnoconfig core Linux kernel from 1MB 
to 9.9MB in isolation is not proof.

There will literally have to be two orders of magnitude more patches than that to 
reach the 32K size envelope, if I (very) optimistically assume that the difficulty 
to shrink code is constant (which it most certainly is not).

I.e. the whole stated premise of the series is wildly not realistic AFAICS, the 
series does not make Linux more usable at all on that category of devices (Linux 
is totally inadequate because it's way too large), it only increases its 
complexity.

But you can prove me wrong: show me a Linux kernel for a real device that fits 
into 32KB of RAM (or even 256 KB) and _then_ I'll consider the cost/benefit 
equation.

Until that happens I consider most forms of additional complexity on the 
non-hardware dependent side of the kernel a net negative.

Thanks,

	Ingo
Nicolas Pitre June 11, 2017, 3:26 p.m. UTC | #11
On Sun, 11 Jun 2017, Ingo Molnar wrote:

> * Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>
> > If you want to pick only those patches for now then please be my guest.
> > At least the first two patches of the series should be mergeable without
> > even a doubt.
>
> That's a strawman argument - I was reacting to the combined effect of your series:
>
> > > >  23 files changed, 3190 insertions(+), 2897 deletions(-)

As I mentioned, the bulk of that count comes from moving rt and dl code 
out of sched/core.c into their respective .c files:


    sched/deadline: move dl related code out of sched/core.c
    
    ... to sched/deadline.c. This helps making sched/core.c smaller and
    hopefully easier to understand and maintain. This also will help
    configuring the deadline scheduling class out of the kernel build.
    
    Signed-off-by: Nicolas Pitre <nico@linaro.org>


 kernel/sched/core.c     | 335 +----------------------------------------
 kernel/sched/deadline.c | 336 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  14 ++
 3 files changed, 356 insertions(+), 329 deletions(-)


    sched/rt: move rt related code out of sched/core.c
    
    ... to sched/rt.c. This helps making sched/core.c smaller and hopefully
    easier to understand and maintain. This also will make it easier to
    configure the realtime scheduling class out of the kernel build.
    
    Signed-off-by: Nicolas Pitre <nico@linaro.org>


 kernel/sched/core.c  | 315 ---------------------------------------------
 kernel/sched/rt.c    | 310 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   5 +
 3 files changed, 315 insertions(+), 315 deletions(-)

I also untangled the futex code so the PI support is gathered in a file 
of its own:


    futex: make PI support optional
    
    Split out the priority inheritance support to a file of its own
    to make futex.c easier to understand and, hopefully, to maintain.
    This also makes it possible to compile out the PI support when RT
    task support is not available.
    
    Signed-off-by: Nicolas Pitre <nico@linaro.org>


 include/linux/futex.h |    7 +-
 init/Kconfig          |    7 +-
 kernel/futex.c        | 2829 ++++++++++---------------------------------
 kernel/futex_pi.c     | 1563 ++++++++++++++++++++++++
 4 files changed, 2233 insertions(+), 2173 deletions(-)

Granted I made a mistake in this last description above. It should have 
said "RT mutex support" instead of "RT task support". But those 3 
patches are making the code easier to understand I'd say.

> A subset of the patches might be fine and note that in fact I already picked a
> patch from your series that made sense, I committed this patch of yours three days
> ago:
>
>   f5832c1998af: sched/core: Omit building stop_sched_class when !SMP

Good. That was patch #2/8.  Why did you skip over #1/8 "cpuset/sched: 
cpuset makes sense for SMP only"? It is the same kind of simple cleanup 
as the one you did apply.

> I'll pick others as well as long as they don't complicate the code. Please send a
> revised series that only does unambiguous complexity reduction/cleanups.

Tell me from the above which patches would qualify and I'll repost them.


Nicolas
Nicolas Pitre June 11, 2017, 4:45 p.m. UTC | #12
On Sun, 11 Jun 2017, Ingo Molnar wrote:

> * Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>
> > But that's not that simple.  First there is a fundamental constraint
> > which is power consumption. If you want your device to run for months
> > (some will hope years) from the same tiny battery then you just cannot
> > afford SDRAM. So we're talking static RAM here. And to keep costs down
> > because you want to give away your thingies by the millions for free it
> > usually means single-chip designs with on-chip sub-megabyte static RAM.
> > And in that field the 256KB mark is located towards the high end of the
> > spectrum.  Many IPv6-capable chips available today have less than that.
> >
> > And the thing is: people already manage to do an awful lot of stuff in
> > such a constrained device. Some probably did a good job of it, but most
> > of them likely suck and we don't know about their bugs because we have
> > no idea what's running inside.
>
> Ok, let me put it this way: there's no way in hell I see a viable Linux kernel
> running (no matter how stripped down) in 32K or even 64K of RAM. 256K is a stretch
> as well - but that RAM size you claim to be already 'high end', so it probably
> wouldn't be used as a standardized solution anyway...

I never pretended to make Linux runnable in 32KB of RAM. Therefore we 
strongly agree here. I however mentioned that some 32KB chips are IPv6 
capable, just to give you a different perspective given that you're more 
acquainted with multi-gigabyte systems.

And as you did mention Moore's law previously, while 256KB of RAM might
be somewhat high-end today in that space, it should become pretty common
in the near future. The test board in front of me has 384KB of SRAM and
bigger ones exist.

> Today a Linux 'allnoconfig' kernel, i.e. a kernel with no device drivers and no
> filesystems whatsoever and with everything optional turned off (including all
> networking!), is over 2MB large text+data (on x86, which has a compressed
> instruction set - it would possibly be larger on simpler CPUs):
>
>  triton:~/tip> size vmlinux
>     text    data     bss     dec  filename
>   926056  208624 1215904 2350584  vmlinux

On ARM, allnoconfig produces:

   text    data     bss     dec     hex filename
 548144   95480   24252  667876   a30e4 vmlinux

But more realistically, the test system I'm using currently runs the 
kernel XIP from flash, so the text size is an indirect metric. It uses 
external RAM as the 384KB of SRAM still doesn't allow for a successful 
boot. But here's what I get once booted:

/ # free
             total       used       free     shared    buffers     cached
Mem:          7936       1624       6312          0          0        492
-/+ buffers/cache:       1132       6804
/ # uname -a
Linux (none) 4.12.0-rc4-00013-g32352a9367 #35 PREEMPT Sun Jun 11 10:45:02 EDT 2017 armv7ml GNU/Linux

I could make user space XIP from flash as well, but right now it is just 
some initramfs living in RAM.

Obviously you can't use the native Linux networking stack in such small 
systems. But a few IPv6 stacks have been made to work in a few kilobytes 
already.

> A series that shrinks the .text size of the allnoconfig core Linux kernel from 1MB
> to 9.9MB in isolation is not proof.

I assume you meant 0.9MB.

It is no proof of course. But I'm following the well known and proven 
"release early, release often" mantra here... unless this is no longer 
promoted?

> There will literally have to be two orders of magnitude more patches than that to
> reach the 32K size envelope, if I (very) optimistically assume that the difficulty
> to shrink code is constant (which it most certainly is not).

Once again, my goal is _not_ 32KB.

And I don't intend to shrink code. Most of the time I just want to 
_remove_ code. Compiling it out to be precise. The goal of this series 
is all about compiling out code. And to achieve that with the scheduler, 
I simply moved some code to different source files and not including 
those source files in the final build. That keeps the number of #ifdef's 
to a minimum but it makes a big diffstat due to the code movement.
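
Concretely, the build-level exclusion amounts to something like this
in kernel/sched/Makefile (a sketch; the CONFIG symbols here are
illustrative names, not necessarily those used by the series):

  obj-y                  += core.o fair.o idle_task.o
  obj-$(CONFIG_SCHED_DL) += deadline.o
  obj-$(CONFIG_SCHED_RT) += rt.o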

In the TTY layer case, I found out that writing a simplistic parallel 
equivalent that doesn't have to scale to server class systems and 
remains compatible with existing drivers allowed a 6x factor in size 
reduction. The same strategy could be employed with the VFS where any 
kind of file caching doesn't make sense in a tiny system. Don't worry, 
I'm not looking forward to using BTRFS in 256KB of RAM either.

To give you an idea, here's the size repartition from that booting 
kernel above:

$ size */built-in.o
   text    data     bss     dec     hex filename
 290669   41864    3616  336149   52115 drivers/built-in.o
 173275    1189    5472  179936   2bee0 fs/built-in.o
  10135   14084      84   24303    5eef init/built-in.o
 198624   22000   25160  245784   3c018 kernel/built-in.o
  79064     133      53   79250   13592 lib/built-in.o
  97034    6328    3532  106894   1a18e mm/built-in.o
   2135       0       0    2135     857 security/built-in.o
 146046       0       0  146046   23a7e usr/built-in.o
      0       0       0       0       0 virt/built-in.o

That's without LTO (because with LTO there's no way to size individual 
parts) and without syscall trimming. From previous experiments, LTO 
brings a 20% reduction on the final build size, and LTO with syscall 
trimming together provide about 40% reduction. One nice thing about LTO 
is that part of the 75KB of lib code automatically gets discarded when 
not referenced, etc.  This is not always the case for most of the core 
driver infrastructure despite most of it not being used in my case.

But there are pieces of the kernel that can't automatically be 
eliminated, such as scheduler classes, because the compiler just can't 
tell if they'll be used at run time.

Some "memory hogs" (relatively speaking) might need a tiny version to 
cope with a handful of processes max and a few static drivers. As Alan 
said, wait queues as they are right now consume a lot of memory. But 
since they're well defined and encapsulated already, it is possible to 
provide a light alternative implemented in a way that uses much less 
memory with the side effect of being much less scalable. But scalability 
is not a huge concern when you have only 256KB of RAM.
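
As a sketch of what I mean (illustrative only, not code from any
actual series): where a full wait_queue_head_t carries a spinlock and
a doubly-linked list of waiters, a tiny build with at most one waiter
per event could get away with something like:

  /* hypothetical "light" wait structure for tiny builds */
  struct tiny_wait {
          struct task_struct *waiter;     /* at most one sleeping task */
  };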

So it is a combination of strategies that will make the 256KB goal 
possible. And as you can see from the free output above, this is not 
_that_ far off already.

> But you can prove me wrong: show me a Linux kernel for a real device that fits
> into 32KB of RAM (or even 256 KB) and _then_ I'll consider the cost/benefit
> equation.

Your insistence on 32KB in this discussion is simply disingenuous.

So you are basically saying that you want me to work another year on 
this project "behind closed doors" and come out with "a final solution" 
before you tell me if my approach is worthy of your consideration? 
Thanks but no thanks. As I said elsewhere, the value in this proposal is 
mainline inclusion in an ongoing process, otherwise there is no gain over 
those small OSes out there, and my time is more valuable than that.


Nicolas
Ingo Molnar June 13, 2017, 7:12 a.m. UTC | #13
* Nicolas Pitre <nicolas.pitre@linaro.org> wrote:

> > A series that shrinks the .text size of the allnoconfig core Linux kernel from 1MB
> > to 9.9MB in isolation is not proof.
>
> I assume you meant 0.9MB.

0.992 MB actually if we apply the ~8k .text savings. 0.9MB would imply 100k of 
savings on an allnoconfig kernel.

> It is no proof of course. But I'm following the well known and proven
> "release early, release often" mantra here... unless this is no longer
> promoted?

I'm following that same pattern: I gave you negative review feedback as early as 
possible. Fragmentation of the scheduler ABI increases complexity and has knock-on 
costs - and the kernel size reduction for the usecase you cited are still 1-2 
orders of magnitude away from making a practical difference.

> > There will literally have to be two orders of magnitude more patches than that
> > to reach the 32K size envelope, if I (very) optimistically assume that the
> > difficulty to shrink code is constant (which it most certainly is not).
>
> Once again, my goal is _not_ 32KB.
>
> And I don't intend to shrink code. Most of the time I just want to
> _remove_ code. Compiling it out to be precise. The goal of this series
> is all about compiling out code. And to achieve that with the scheduler,
> I simply moved some code to different source files and not including
> those source files in the final build. That keeps the number of #ifdef's
> to a minimum but it makes a big diffstat due to the code movement.

So I'm fine with most of the code movement - let's try this series without any of 
the more controversial bits which should make future arguments easier.

Thanks,

	Ingo
Nicolas Pitre June 13, 2017, 12:29 p.m. UTC | #14
On Tue, 13 Jun 2017, Ingo Molnar wrote:

> > I simply moved some code to different source files and not including
> > those source files in the final build. That keeps the number of #ifdef's
> > to a minimum but it makes a big diffstat due to the code movement.
>
> So I'm fine with most of the code movement - let's try this series without any of
> the more controversial bits which should make future arguments easier.

You should then be able to merge patches #1 to #5 already (you already 
have #2) as the more controversial ones are at the end.


Nicolas