[1/2] Docs: An initial automarkup extension for sphinx

Message ID	20190425200125.12302-2-corbet@lwn.net
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Jonathan Corbet <corbet@lwn.net> To: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Matthew Wilcox <willy@infradead.org>, Mauro Carvalho Chehab <mchehab@kernel.org>, Jonathan Corbet <corbet@lwn.net> Subject: [PATCH 1/2] Docs: An initial automarkup extension for sphinx Date: Thu, 25 Apr 2019 14:01:24 -0600 Message-Id: <20190425200125.12302-2-corbet@lwn.net> In-Reply-To: <20190425200125.12302-1-corbet@lwn.net> References: <20190425200125.12302-1-corbet@lwn.net> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk
Series	RFC: Automatic :c:func: annotation in the sphinx build \| expand [0/2] RFC: Automatic :c:func: annotation in the sphinx build [1/2] Docs: An initial automarkup extension for sphinx [2/2] docs: remove :c:func: annotations from xarray.rst

Jonathan Corbet April 25, 2019, 8:01 p.m. UTC

Rather than fill our text files with :c:func:`function()` syntax, just do
the markup via a hook into the sphinx build process.  As is always the
case, the real problem is detecting the situations where this markup should
*not* be done.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>

---
 Documentation/conf.py              |  3 +-
 Documentation/sphinx/automarkup.py | 90 ++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/sphinx/automarkup.py

-- 
2.21.0

Jani Nikula April 26, 2019, 9:06 a.m. UTC | #1

On Thu, 25 Apr 2019, Jonathan Corbet <corbet@lwn.net> wrote:
> Rather than fill our text files with :c:func:`function()` syntax, just do

> the markup via a hook into the sphinx build process.  As is always the

> case, the real problem is detecting the situations where this markup should

> *not* be done.


This is basically a regex based pre-processing step in front of Sphinx,
but it's not independent as it embeds a limited understanding/parsing of
reStructuredText syntax. This is similar to what we do in kernel-doc the
Perl monster, except slightly different.

I understand the motivation, and I sympathize with the idea of a quick
regex hack to silence the mob. But I fear this will lead to hard to
solve corner cases and the same style of "impedance mismatches" we had
with the kernel-doc/docproc/docbook Rube Goldberg machine of the past.

It's more involved, but I think the better place to do this (as well as
the kernel-doc transformations) would be in the doctree-read event,
after the rst parsing is done. You can traverse the doctree and find the
places which weren't special for Sphinx, and replace the plain text
nodes in-place. I've toyed with this in the past, but alas I didn't have
(and still don't) have the time to finish the job. There were some
unresolved issues with e.g. replacing nodes that had syntax highlighting
(because I wanted to make the references work also within preformatted
blocks).

If you decide to go with regex anyway, I'd at least consider pulling the
transformations/highlights from kernel-doc the script to the Sphinx
extension, and use the exact same transformations for stuff in source
code comments and rst files.

BR,
Jani.

>

> Signed-off-by: Jonathan Corbet <corbet@lwn.net>

> ---

>  Documentation/conf.py              |  3 +-

>  Documentation/sphinx/automarkup.py | 90 ++++++++++++++++++++++++++++++

>  2 files changed, 92 insertions(+), 1 deletion(-)

>  create mode 100644 Documentation/sphinx/automarkup.py

>

> diff --git a/Documentation/conf.py b/Documentation/conf.py

> index 72647a38b5c2..ba7b2846b1c5 100644

> --- a/Documentation/conf.py

> +++ b/Documentation/conf.py

> @@ -34,7 +34,8 @@ needs_sphinx = '1.3'

>  # Add any Sphinx extension module names here, as strings. They can be

>  # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom

>  # ones.

> -extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']

> +extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain',

> +              'kfigure', 'sphinx.ext.ifconfig', 'automarkup']

>  

>  # The name of the math extension changed on Sphinx 1.4

>  if major == 1 and minor > 3:

> diff --git a/Documentation/sphinx/automarkup.py b/Documentation/sphinx/automarkup.py

> new file mode 100644

> index 000000000000..c47469372bae

> --- /dev/null

> +++ b/Documentation/sphinx/automarkup.py

> @@ -0,0 +1,90 @@

> +# SPDX-License-Identifier: GPL-2.0

> +#

> +# This is a little Sphinx extension that tries to apply certain kinds

> +# of markup automatically so we can keep it out of the text files

> +# themselves.

> +#

> +# It's possible that this could be done better by hooking into the build

> +# much later and traversing through the doctree.  That would eliminate the

> +# need to duplicate some RST parsing and perhaps be less fragile, at the

> +# cost of some more complexity and the need to generate the cross-reference

> +# links ourselves.

> +#

> +# Copyright 2019 Jonathan Corbet <corbet@lwn.net>

> +#

> +from __future__ import print_function

> +import re

> +import sphinx

> +

> +#

> +# Regex nastiness.  Of course.

> +# Try to identify "function()" that's not already marked up some

> +# other way.  Sphinx doesn't like a lot of stuff right after a

> +# :c:func: block (i.e. ":c:func:`mmap()`s" flakes out), so the last

> +# bit tries to restrict matches to things that won't create trouble.

> +#

> +RE_function = re.compile(r'(^|\s+)([\w\d_]+\(\))([.,/\s]|$)')

> +#

> +# Lines consisting of a single underline character.

> +#

> +RE_underline = re.compile(r'^([-=~])\1+$')

> +#

> +# Starting a literal block.

> +#

> +RE_literal = re.compile(r'^(\s*)(.*::\s*|\.\.\s+code-block::.*)$')

> +#

> +# Just get the white space beginning a line.

> +#

> +RE_whitesp = re.compile(r'^(\s*)')

> +

> +def MangleFile(app, docname, text):

> +    ret = [ ]

> +    previous = ''

> +    literal = False

> +    for line in text[0].split('\n'):

> +        #

> +        # See if we might be ending a literal block, as denoted by

> +        # an indent no greater than when we started.

> +        #

> +        if literal and len(line) > 0:

> +            m = RE_whitesp.match(line)  # Should always match

> +            if len(m.group(1).expandtabs()) <= lit_indent:

> +                literal = False

> +        #

> +        # Blank lines, directives, and lines within literal blocks

> +        # should not be messed with.

> +        #

> +        if literal or len(line) == 0 or line[0] == '.':

> +            ret.append(line)

> +        #

> +        # Is this an underline line?  If so, and it is the same length

> +        # as the previous line, we may have mangled a heading line in

> +        # error, so undo it.

> +        #

> +        elif RE_underline.match(line):

> +            if len(line) == len(previous):

> +                ret[-1] = previous

> +            ret.append(line)

> +        #

> +        # Normal line - perform substitutions.

> +        #

> +        else:

> +            ret.append(RE_function.sub(r'\1:c:func:`\2`\3', line))

> +        #

> +        # Might we be starting a literal block?  If so make note of

> +        # the fact.

> +        #

> +        m = RE_literal.match(line)

> +        if m:

> +            literal = True

> +            lit_indent = len(m.group(1).expandtabs())

> +        previous = line

> +    text[0] = '\n'.join(ret)

> +

> +def setup(app):

> +    app.connect('source-read', MangleFile)

> +

> +    return dict(

> +        parallel_read_safe = True,

> +        parallel_write_safe = True

> +    )


-- 
Jani Nikula, Intel Open Source Graphics Center

Markus Heiser April 26, 2019, 11:31 a.m. UTC | #2

Am 26.04.19 um 11:06 schrieb Jani Nikula:
> On Thu, 25 Apr 2019, Jonathan Corbet<corbet@lwn.net>  wrote:

>> Rather than fill our text files with :c:func:`function()` syntax, just do

>> the markup via a hook into the sphinx build process.  As is always the

>> case, the real problem is detecting the situations where this markup should

>> *not*  be done.

> This is basically a regex based pre-processing step in front of Sphinx,

> but it's not independent as it embeds a limited understanding/parsing of

> reStructuredText syntax. This is similar to what we do in kernel-doc the

> Perl monster, except slightly different.

> 

> I understand the motivation, and I sympathize with the idea of a quick

> regex hack to silence the mob. But I fear this will lead to hard to

> solve corner cases and the same style of "impedance mismatches" we had

> with the kernel-doc/docproc/docbook Rube Goldberg machine of the past.

> 

> It's more involved, but I think the better place to do this (as well as

> the kernel-doc transformations) would be in the doctree-read event,

> after the rst parsing is done. You can traverse the doctree and find the

> places which weren't special for Sphinx, and replace the plain text

> nodes in-place. I've toyed with this in the past, but alas I didn't have

> (and still don't) have the time to finish the job. There were some

> unresolved issues with e.g. replacing nodes that had syntax highlighting

> (because I wanted to make the references work also within preformatted

> blocks).

> 

> If you decide to go with regex anyway, I'd at least consider pulling the

> transformations/highlights from kernel-doc the script to the Sphinx

> extension, and use the exact same transformations for stuff in source

> code comments and rst files.


FWIW mentioning: I fully agree with Jan.

   -- Markus --

Jonathan Corbet April 26, 2019, 4:52 p.m. UTC | #3

On Fri, 26 Apr 2019 12:06:42 +0300
Jani Nikula <jani.nikula@linux.intel.com> wrote:

> It's more involved, but I think the better place to do this (as well as

> the kernel-doc transformations) would be in the doctree-read event,

> after the rst parsing is done. You can traverse the doctree and find the

> places which weren't special for Sphinx, and replace the plain text

> nodes in-place. I've toyed with this in the past, but alas I didn't have

> (and still don't) have the time to finish the job.

I had thought about this; indeed, there's a comment in the posted patch to
that effect.  The tradeoff comes down to this, I think:

 - The fragility and inelegance of embedding a bit of redundant RST
   parsing into this extension v. that of adding a late parsing stage as a
   transform, trying to enumerate every node type that you might want to
   change, and digging into the C domain code to emulate its reference
   generation.

 - The time required to implement a solution; I'm having a bit of a hard
   time keeping up with docs at the moment as it is.

 - Regexes.  This solution has more of them, but we're not going to get
   away from them regardless.

I am not at all opposed to a more proper solution that might, in the
long term, produce more deterministic results.  I can even try to work in
that direction.  But this is something that can be done now that, IMO,
doesn't in any way close off a better implementation in the future.  If we
agree that we should automatically generate references for occurrences of
"function()", we can change how that is actually done later.

I'll look into this further, but my inclination is to go forward with what
I have now.  It's simple and easy to understand, and doesn't seem to screw
up anywhere in the current body of kernel docs as far as I can tell.

> If you decide to go with regex anyway, I'd at least consider pulling the

> transformations/highlights from kernel-doc the script to the Sphinx

> extension, and use the exact same transformations for stuff in source

> code comments and rst files.

Pulling all RST manipulation out of kerneldoc has a great deal of appeal;
assuming this goes forward that should indeed be high on the todo list.

Thanks,

jon

Mauro Carvalho Chehab April 26, 2019, 5:54 p.m. UTC | #4

Em Fri, 26 Apr 2019 10:52:09 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:

> On Fri, 26 Apr 2019 12:06:42 +0300

> Jani Nikula <jani.nikula@linux.intel.com> wrote:

> 

> > It's more involved, but I think the better place to do this (as well as

> > the kernel-doc transformations) would be in the doctree-read event,

> > after the rst parsing is done. You can traverse the doctree and find the

> > places which weren't special for Sphinx, and replace the plain text

> > nodes in-place. I've toyed with this in the past, but alas I didn't have

> > (and still don't) have the time to finish the job.  

> 

> I had thought about this; indeed, there's a comment in the posted patch to

> that effect.  The tradeoff comes down to this, I think:

> 

>  - The fragility and inelegance of embedding a bit of redundant RST

>    parsing into this extension v. that of adding a late parsing stage as a

>    transform, trying to enumerate every node type that you might want to

>    change, and digging into the C domain code to emulate its reference

>    generation.

> 

>  - The time required to implement a solution; I'm having a bit of a hard

>    time keeping up with docs at the moment as it is.

> 

>  - Regexes.  This solution has more of them, but we're not going to get

>    away from them regardless.

> 

> I am not at all opposed to a more proper solution that might, in the

> long term, produce more deterministic results.  I can even try to work in

> that direction.  But this is something that can be done now that, IMO,

> doesn't in any way close off a better implementation in the future.  If we

> agree that we should automatically generate references for occurrences of

> "function()", we can change how that is actually done later.

> 

> I'll look into this further, but my inclination is to go forward with what

> I have now.  It's simple and easy to understand, and doesn't seem to screw

> up anywhere in the current body of kernel docs as far as I can tell.

> 

> > If you decide to go with regex anyway, I'd at least consider pulling the

> > transformations/highlights from kernel-doc the script to the Sphinx

> > extension, and use the exact same transformations for stuff in source

> > code comments and rst files.  

> 

> Pulling all RST manipulation out of kerneldoc has a great deal of appeal;

> assuming this goes forward that should indeed be high on the todo list.


While I didn't review the python code yet, and while I agree with
Jani and Markus that it would be better only parse functions on texts
after the Sphinx parser, I agree with Jon on that: if the current code works
well enough with the current documentation, I would proceed and apply it. 

Later, the script can be improved.

Thanks,
Mauro

Mauro Carvalho Chehab April 26, 2019, 6:32 p.m. UTC | #5

Em Thu, 25 Apr 2019 14:01:24 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:

> Rather than fill our text files with :c:func:`function()` syntax, just do

> the markup via a hook into the sphinx build process.  As is always the

> case, the real problem is detecting the situations where this markup should

> *not* be done.

> 

> Signed-off-by: Jonathan Corbet <corbet@lwn.net>

> ---

>  Documentation/conf.py              |  3 +-

>  Documentation/sphinx/automarkup.py | 90 ++++++++++++++++++++++++++++++

>  2 files changed, 92 insertions(+), 1 deletion(-)

>  create mode 100644 Documentation/sphinx/automarkup.py

> 

> diff --git a/Documentation/conf.py b/Documentation/conf.py

> index 72647a38b5c2..ba7b2846b1c5 100644

> --- a/Documentation/conf.py

> +++ b/Documentation/conf.py

> @@ -34,7 +34,8 @@ needs_sphinx = '1.3'

>  # Add any Sphinx extension module names here, as strings. They can be

>  # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom

>  # ones.

> -extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']

> +extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain',

> +              'kfigure', 'sphinx.ext.ifconfig', 'automarkup']

>  

>  # The name of the math extension changed on Sphinx 1.4

>  if major == 1 and minor > 3:

> diff --git a/Documentation/sphinx/automarkup.py b/Documentation/sphinx/automarkup.py

> new file mode 100644

> index 000000000000..c47469372bae

> --- /dev/null

> +++ b/Documentation/sphinx/automarkup.py

> @@ -0,0 +1,90 @@

> +# SPDX-License-Identifier: GPL-2.0

> +#

> +# This is a little Sphinx extension that tries to apply certain kinds

> +# of markup automatically so we can keep it out of the text files

> +# themselves.

> +#

> +# It's possible that this could be done better by hooking into the build

> +# much later and traversing through the doctree.  That would eliminate the

> +# need to duplicate some RST parsing and perhaps be less fragile, at the

> +# cost of some more complexity and the need to generate the cross-reference

> +# links ourselves.

> +#

> +# Copyright 2019 Jonathan Corbet <corbet@lwn.net>

> +#

> +from __future__ import print_function

> +import re

> +import sphinx

> +

> +#

> +# Regex nastiness.  Of course.

> +# Try to identify "function()" that's not already marked up some

> +# other way.  Sphinx doesn't like a lot of stuff right after a

> +# :c:func: block (i.e. ":c:func:`mmap()`s" flakes out), so the last

> +# bit tries to restrict matches to things that won't create trouble.

> +#

> +RE_function = re.compile(r'(^|\s+)([\w\d_]+\(\))([.,/\s]|$)')


IMHO, this looks good enough to avoid trouble, maybe except if one
wants to write a document explaining this functionality at the
doc-guide/kernel-doc.rst.

Anyway, the way it is written, we could still explain it by adding
a "\ " after the func, e. g.:

	When you write a function like: func()\ , the automarkup
	extension will automatically convert it into:
	``:c:func:`func()```.

So, this looks OK on my eyes.

> +#

> +# Lines consisting of a single underline character.

> +#

> +RE_underline = re.compile(r'^([-=~])\1+$')


Hmm... why are you calling this "underline"? Sounds a bad name to me,
as it took me a while to understand what you meant.

From the code I'm inferring that this is meant to track 3 of the
possible symbols used as a (sub).*title markup. On several places 
we use other symbols:'^', '~', '.', '*' (and others) as sub-sub(sub..)
title markups.

I would instead define this Regex as:

	RE_title_markup = re.compile(r'^([^\w\d])\1+$')

You should probably need another regex for the title itself:

	RE_possible_title = re.compile(r'^(\S.*\S)\s*$')

in order to get the size of the matched line. Doing a doing len(previous)
will get you false positives.

As on Sphinx, **all** titles should start at the first column,
or it will produce a severe error[1], we can use such regex to
minimize parsing errors.

[1] and either crash or keep running some endless loop internally.
Not being bad enough, it will also invalidate all the previously
cached data, losing a lot of time next time you try to build the
docs.

---

on a separate matter (but related to automarkup matter - and to what
I would name underline), as a future feature, perhaps we could also add
a parser for:

	_something that requires underlines_

Underlined text is probably the only feature that we use on several docs
with Sphinx doesn't support (there are some extensions for that - I guess,
but it sounds simple enough to have a parser here).

This can be tricky to get it right, as just underlines_ is a
cross reference markup - so, I would only add this after we improve the
script to come after Sphinx own markup processing.

---

> +#

> +# Starting a literal block.

> +#

> +RE_literal = re.compile(r'^(\s*)(.*::\s*|\.\.\s+code-block::.*)$')

> +#

> +# Just get the white space beginning a line.

> +#

> +RE_whitesp = re.compile(r'^(\s*)')

> +

> +def MangleFile(app, docname, text):

> +    ret = [ ]

> +    previous = ''

> +    literal = False

> +    for line in text[0].split('\n'):

> +        #

> +        # See if we might be ending a literal block, as denoted by

> +        # an indent no greater than when we started.

> +        #

> +        if literal and len(line) > 0:

> +            m = RE_whitesp.match(line)  # Should always match

> +            if len(m.group(1).expandtabs()) <= lit_indent:

> +                literal = False

> +        #

> +        # Blank lines, directives, and lines within literal blocks

> +        # should not be messed with.

> +        #

> +        if literal or len(line) == 0 or line[0] == '.':

> +            ret.append(line)



> +        #

> +        # Is this an underline line?  If so, and it is the same length

> +        # as the previous line, we may have mangled a heading line in

> +        # error, so undo it.

> +        #

> +        elif RE_underline.match(line):

> +            if len(line) == len(previous):


No, that doesn't seem enough. I would, instead, use the regex I
proposed before, in order to check if the previous line starts with
a non-space, and getting the length only up to the last non-space
(yeah, unfortunately, we have some text files that have extra blank
spaces at line's tail).

> +                ret[-1] = previous

> +            ret.append(line)

> +        #

> +        # Normal line - perform substitutions.

> +        #

> +        else:

> +            ret.append(RE_function.sub(r'\1:c:func:`\2`\3', line))

> +        #

> +        # Might we be starting a literal block?  If so make note of

> +        # the fact.

> +        #

> +        m = RE_literal.match(line)

> +        if m:

> +            literal = True

> +            lit_indent = len(m.group(1).expandtabs())

> +        previous = line

> +    text[0] = '\n'.join(ret)

> +

> +def setup(app):

> +    app.connect('source-read', MangleFile)

> +

> +    return dict(

> +        parallel_read_safe = True,

> +        parallel_write_safe = True

> +    )


The remaining looks fine to me - although I'm not a Sphinx-extension
expert, and my knowledge of python is far from being perfect.

Thanks,
Mauro

Jani Nikula April 26, 2019, 6:58 p.m. UTC | #6

On Fri, 26 Apr 2019, Jonathan Corbet <corbet@lwn.net> wrote:
> I am not at all opposed to a more proper solution that might, in the

> long term, produce more deterministic results.  I can even try to work in

> that direction.  But this is something that can be done now that, IMO,

> doesn't in any way close off a better implementation in the future.  If we

> agree that we should automatically generate references for occurrences of

> "function()", we can change how that is actually done later.

>

> I'll look into this further, but my inclination is to go forward with what

> I have now.  It's simple and easy to understand, and doesn't seem to screw

> up anywhere in the current body of kernel docs as far as I can tell.


Fair enough. It's most important that this doesn't block us from
switching to a different implementation later once someone figures it
out.

BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center

Mauro Carvalho Chehab April 26, 2019, 7:16 p.m. UTC | #7

Em Fri, 26 Apr 2019 15:32:55 -0300
Mauro Carvalho Chehab <mchehab+samsung@kernel.org> escreveu:

> Em Thu, 25 Apr 2019 14:01:24 -0600

> Jonathan Corbet <corbet@lwn.net> escreveu:

> 

> > Rather than fill our text files with :c:func:`function()` syntax, just do

> > the markup via a hook into the sphinx build process.  As is always the

> > case, the real problem is detecting the situations where this markup should

> > *not* be done.

> > 

> > Signed-off-by: Jonathan Corbet <corbet@lwn.net>

> > ---

> >  Documentation/conf.py              |  3 +-

> >  Documentation/sphinx/automarkup.py | 90 ++++++++++++++++++++++++++++++

> >  2 files changed, 92 insertions(+), 1 deletion(-)

> >  create mode 100644 Documentation/sphinx/automarkup.py

> > 

> > diff --git a/Documentation/conf.py b/Documentation/conf.py

> > index 72647a38b5c2..ba7b2846b1c5 100644

> > --- a/Documentation/conf.py

> > +++ b/Documentation/conf.py

> > @@ -34,7 +34,8 @@ needs_sphinx = '1.3'

> >  # Add any Sphinx extension module names here, as strings. They can be

> >  # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom

> >  # ones.

> > -extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']

> > +extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain',

> > +              'kfigure', 'sphinx.ext.ifconfig', 'automarkup']

> >  

> >  # The name of the math extension changed on Sphinx 1.4

> >  if major == 1 and minor > 3:

> > diff --git a/Documentation/sphinx/automarkup.py b/Documentation/sphinx/automarkup.py

> > new file mode 100644

> > index 000000000000..c47469372bae

> > --- /dev/null

> > +++ b/Documentation/sphinx/automarkup.py

> > @@ -0,0 +1,90 @@

> > +# SPDX-License-Identifier: GPL-2.0

> > +#

> > +# This is a little Sphinx extension that tries to apply certain kinds

> > +# of markup automatically so we can keep it out of the text files

> > +# themselves.

> > +#

> > +# It's possible that this could be done better by hooking into the build

> > +# much later and traversing through the doctree.  That would eliminate the

> > +# need to duplicate some RST parsing and perhaps be less fragile, at the

> > +# cost of some more complexity and the need to generate the cross-reference

> > +# links ourselves.

> > +#

> > +# Copyright 2019 Jonathan Corbet <corbet@lwn.net>

> > +#

> > +from __future__ import print_function

> > +import re

> > +import sphinx

> > +

> > +#

> > +# Regex nastiness.  Of course.

> > +# Try to identify "function()" that's not already marked up some

> > +# other way.  Sphinx doesn't like a lot of stuff right after a

> > +# :c:func: block (i.e. ":c:func:`mmap()`s" flakes out), so the last

> > +# bit tries to restrict matches to things that won't create trouble.

> > +#

> > +RE_function = re.compile(r'(^|\s+)([\w\d_]+\(\))([.,/\s]|$)')  

> 

> IMHO, this looks good enough to avoid trouble, maybe except if one

> wants to write a document explaining this functionality at the

> doc-guide/kernel-doc.rst.

> 

> Anyway, the way it is written, we could still explain it by adding

> a "\ " after the func, e. g.:

> 

> 	When you write a function like: func()\ , the automarkup

> 	extension will automatically convert it into:

> 	``:c:func:`func()```.

> 

> So, this looks OK on my eyes.

> 

> > +#

> > +# Lines consisting of a single underline character.

> > +#

> > +RE_underline = re.compile(r'^([-=~])\1+$')  

> 

> Hmm... why are you calling this "underline"? Sounds a bad name to me,

> as it took me a while to understand what you meant.

> 

> From the code I'm inferring that this is meant to track 3 of the

> possible symbols used as a (sub).*title markup. On several places 

> we use other symbols:'^', '~', '.', '*' (and others) as sub-sub(sub..)

> title markups.

> 

> I would instead define this Regex as:

> 

> 	RE_title_markup = re.compile(r'^([^\w\d])\1+$')


In time:

	RE_title_markup = re.compile(r'^([^\w\d\s])\1+$')

As otherwise it would get whitespaces/tabs :-)

Also, please notice that this is pure review - didn't actually try to
apply your patch or test with or without the proposed changes :-)

> 

> You should probably need another regex for the title itself:

> 

> 	RE_possible_title = re.compile(r'^(\S.*\S)\s*$')

> 

> in order to get the size of the matched line. Doing a doing len(previous)

> will get you false positives.

> 

> As on Sphinx, **all** titles should start at the first column,

> or it will produce a severe error[1], we can use such regex to

> minimize parsing errors.

> 

> [1] and either crash or keep running some endless loop internally.

> Not being bad enough, it will also invalidate all the previously

> cached data, losing a lot of time next time you try to build the

> docs.

> 

> ---

> 

> on a separate matter (but related to automarkup matter - and to what

> I would name underline), as a future feature, perhaps we could also add

> a parser for:

> 

> 	_something that requires underlines_

> 

> Underlined text is probably the only feature that we use on several docs

> with Sphinx doesn't support (there are some extensions for that - I guess,

> but it sounds simple enough to have a parser here).

> 

> This can be tricky to get it right, as just underlines_ is a

> cross reference markup - so, I would only add this after we improve the

> script to come after Sphinx own markup processing.

> 

> ---

> 

> > +#

> > +# Starting a literal block.

> > +#

> > +RE_literal = re.compile(r'^(\s*)(.*::\s*|\.\.\s+code-block::.*)$')

> > +#

> > +# Just get the white space beginning a line.

> > +#

> > +RE_whitesp = re.compile(r'^(\s*)')

> > +

> > +def MangleFile(app, docname, text):

> > +    ret = [ ]

> > +    previous = ''

> > +    literal = False

> > +    for line in text[0].split('\n'):

> > +        #

> > +        # See if we might be ending a literal block, as denoted by

> > +        # an indent no greater than when we started.

> > +        #

> > +        if literal and len(line) > 0:

> > +            m = RE_whitesp.match(line)  # Should always match

> > +            if len(m.group(1).expandtabs()) <= lit_indent:

> > +                literal = False

> > +        #

> > +        # Blank lines, directives, and lines within literal blocks

> > +        # should not be messed with.

> > +        #

> > +        if literal or len(line) == 0 or line[0] == '.':

> > +            ret.append(line)  

> 

> 

> > +        #

> > +        # Is this an underline line?  If so, and it is the same length

> > +        # as the previous line, we may have mangled a heading line in

> > +        # error, so undo it.

> > +        #

> > +        elif RE_underline.match(line):

> > +            if len(line) == len(previous):  

> 

> No, that doesn't seem enough. I would, instead, use the regex I

> proposed before, in order to check if the previous line starts with

> a non-space, and getting the length only up to the last non-space

> (yeah, unfortunately, we have some text files that have extra blank

> spaces at line's tail).

> 

> > +                ret[-1] = previous

> > +            ret.append(line)

> > +        #

> > +        # Normal line - perform substitutions.

> > +        #

> > +        else:

> > +            ret.append(RE_function.sub(r'\1:c:func:`\2`\3', line))

> > +        #

> > +        # Might we be starting a literal block?  If so make note of

> > +        # the fact.

> > +        #

> > +        m = RE_literal.match(line)

> > +        if m:

> > +            literal = True

> > +            lit_indent = len(m.group(1).expandtabs())

> > +        previous = line

> > +    text[0] = '\n'.join(ret)

> > +

> > +def setup(app):

> > +    app.connect('source-read', MangleFile)

> > +

> > +    return dict(

> > +        parallel_read_safe = True,

> > +        parallel_write_safe = True

> > +    )  

> 

> The remaining looks fine to me - although I'm not a Sphinx-extension

> expert, and my knowledge of python is far from being perfect.

> 

> Thanks,

> Mauro




Thanks,
Mauro

Jonathan Corbet April 26, 2019, 7:37 p.m. UTC | #8

On Fri, 26 Apr 2019 15:32:55 -0300
Mauro Carvalho Chehab <mchehab+samsung@kernel.org> wrote:

> > +# Try to identify "function()" that's not already marked up some

> > +# other way.  Sphinx doesn't like a lot of stuff right after a

> > +# :c:func: block (i.e. ":c:func:`mmap()`s" flakes out), so the last

> > +# bit tries to restrict matches to things that won't create trouble.

> > +#

> > +RE_function = re.compile(r'(^|\s+)([\w\d_]+\(\))([.,/\s]|$)')  

> 

> IMHO, this looks good enough to avoid trouble, maybe except if one

> wants to write a document explaining this functionality at the

> doc-guide/kernel-doc.rst.


Adding something to the docs is definitely on my list.

> Anyway, the way it is written, we could still explain it by adding

> a "\ " after the func, e. g.:

> 

> 	When you write a function like: func()\ , the automarkup

> 	extension will automatically convert it into:

> 	``:c:func:`func()```.

> 

> So, this looks OK on my eyes.


Not sure I like that; the whole point is to avoid extra markup here.  Plus
I like that it catches all function references whether the author thought
to mark them or not.

> > +#

> > +# Lines consisting of a single underline character.

> > +#

> > +RE_underline = re.compile(r'^([-=~])\1+$')  

> 

> Hmm... why are you calling this "underline"? Sounds a bad name to me,

> as it took me a while to understand what you meant.


Seemed OK to me, but I can change it :)

> From the code I'm inferring that this is meant to track 3 of the

> possible symbols used as a (sub).*title markup. On several places 

> we use other symbols:'^', '~', '.', '*' (and others) as sub-sub(sub..)

> title markups.


I picked the ones that were suggested in our docs; it was enough to catch
all of the problems in the current kernel docs.

Anyway, The real documentation gives the actual set, so I'll maybe make it:

	=-'`":~^_*+#<>

I'd prefer that to something more wildcardish.

> You should probably need another regex for the title itself:

> 

> 	RE_possible_title = re.compile(r'^(\S.*\S)\s*$')

> 

> in order to get the size of the matched line. Doing a doing len(previous)

> will get you false positives.


This I don't quite get.  It's easy enough to trim off the spaces with
strip() if that turns out to be a problem (which it hasn't so far).  I can
add that.

> on a separate matter (but related to automarkup matter - and to what

> I would name underline), as a future feature, perhaps we could also add

> a parser for:

> 

> 	_something that requires underlines_

> 

> Underlined text is probably the only feature that we use on several docs

> with Sphinx doesn't support (there are some extensions for that - I guess,

> but it sounds simple enough to have a parser here).

> 

> This can be tricky to get it right, as just underlines_ is a

> cross reference markup - so, I would only add this after we improve the

> script to come after Sphinx own markup processing.


That does indeed sound tricky.  It would also probably have to come
*before* Sphinx does its thing or it's unlikely to survive.

> > +        #

> > +        # Is this an underline line?  If so, and it is the same length

> > +        # as the previous line, we may have mangled a heading line in

> > +        # error, so undo it.

> > +        #

> > +        elif RE_underline.match(line):

> > +            if len(line) == len(previous):  

> 

> No, that doesn't seem enough. I would, instead, use the regex I

> proposed before, in order to check if the previous line starts with

> a non-space, and getting the length only up to the last non-space

> (yeah, unfortunately, we have some text files that have extra blank

> spaces at line's tail).


So I'll make it "if len(line) == len(previous.strip())

Thanks,

jon

Mauro Carvalho Chehab April 26, 2019, 8:51 p.m. UTC | #9

Em Fri, 26 Apr 2019 13:37:19 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:

> On Fri, 26 Apr 2019 15:32:55 -0300

> Mauro Carvalho Chehab <mchehab+samsung@kernel.org> wrote:

> 

> > > +# Try to identify "function()" that's not already marked up some

> > > +# other way.  Sphinx doesn't like a lot of stuff right after a

> > > +# :c:func: block (i.e. ":c:func:`mmap()`s" flakes out), so the last

> > > +# bit tries to restrict matches to things that won't create trouble.

> > > +#

> > > +RE_function = re.compile(r'(^|\s+)([\w\d_]+\(\))([.,/\s]|$)')    

> > 

> > IMHO, this looks good enough to avoid trouble, maybe except if one

> > wants to write a document explaining this functionality at the

> > doc-guide/kernel-doc.rst.  

> 

> Adding something to the docs is definitely on my list.

> 

> > Anyway, the way it is written, we could still explain it by adding

> > a "\ " after the func, e. g.:

> > 

> > 	When you write a function like: func()\ , the automarkup

> > 	extension will automatically convert it into:

> > 	``:c:func:`func()```.

> > 

> > So, this looks OK on my eyes.  

> 

> Not sure I like that; the whole point is to avoid extra markup here.  Plus

> I like that it catches all function references whether the author thought

> to mark them or not.


Yes, but I'm pretty sure that there will be cases where one may want
to explicitly force the parser to not recognize it. One of such examples
is the document explaining this feature.

> 

> > > +#

> > > +# Lines consisting of a single underline character.

> > > +#

> > > +RE_underline = re.compile(r'^([-=~])\1+$')    

> > 

> > Hmm... why are you calling this "underline"? Sounds a bad name to me,

> > as it took me a while to understand what you meant.  

> 

> Seemed OK to me, but I can change it :)


I'm pretty sure that, on my last /79 patch series, I used some
patterns that would be placing function names at the title,
and that were not using '-', '=' or '~'. If the parser
would pick it or not is a separate matter[1] :-)

[1] It would probably reject on titles, as very often what
we write on titles are things like:

foo(int i, int j)
.................

As it has the variables inside, the parser won't likely get it.

Yet, I vaguely remember I saw or wrote some title that had
a pattern like:

Usage of foo()
^^^^^^^^^^^^^^

(but I can't really remember what was the used markup)

I would prefer if you change. I usually use myself:

	'=' '-' '^' and '.'

(usually on the above order - as it makes some sense to my
brain  to use the above indentation levels)

In general, function descriptions are sub-sub-title or
sub-sub-sub-title, so, on the places I wrote, it would likely
be either '^' or '.'.

But I've seen other symbols being used too to mark titles
(like '*' and '#').

> > From the code I'm inferring that this is meant to track 3 of the

> > possible symbols used as a (sub).*title markup. On several places 

> > we use other symbols:'^', '~', '.', '*' (and others) as sub-sub(sub..)

> > title markups.  

> 

> I picked the ones that were suggested in our docs; it was enough to catch

> all of the problems in the current kernel docs.

> 

> Anyway, The real documentation gives the actual set, so I'll maybe make it:

> 

> 	=-'`":~^_*+#<>


I'm pretty sure a single dot works as well, as I used this already.

> I'd prefer that to something more wildcardish.


Yeah, makes sense, provided that it will reflect what Sphinx actually
uses internally.

> > You should probably need another regex for the title itself:

> > 

> > 	RE_possible_title = re.compile(r'^(\S.*\S)\s*$')

> > 

> > in order to get the size of the matched line. Doing a doing len(previous)

> > will get you false positives.  

> 

> This I don't quite get.  It's easy enough to trim off the spaces with

> strip() if that turns out to be a problem (which it hasn't so far).  I can

> add that.



What I'm saying is that the title markup should always start at the first
position. So, this is a valid title:

Foo valid title
===============

But this causes Sphinx to crash badly:

 Foo invalid title
 =================

Knowing that, we can use a regex for the previous line assuming that it
will always start with a non-spaced character[2], and checking only the 
length of non-blank characters.

[2] Strictly speaking, I guess Sphinx would accept something like:

   Foo weirdly marked title - probably non-compliant with ReST spec
===================================================================

But I don't think we have any occurrence of something like that - and
I don't think we should concern about that, as it would be a very bad 
documentation style anyway.

So, what I'm saying is that we could use such knowledge in our benefit,
considering a valid title to be something like: ^(\S.*\S)\s*$ - e. g.
the title itself starts on a non-space char and ends on another 
non-space char.

> 

> > on a separate matter (but related to automarkup matter - and to what

> > I would name underline), as a future feature, perhaps we could also add

> > a parser for:

> > 

> > 	_something that requires underlines_

> > 

> > Underlined text is probably the only feature that we use on several docs

> > with Sphinx doesn't support (there are some extensions for that - I guess,

> > but it sounds simple enough to have a parser here).

> > 

> > This can be tricky to get it right, as just underlines_ is a

> > cross reference markup - so, I would only add this after we improve the

> > script to come after Sphinx own markup processing.  

> 

> That does indeed sound tricky.  It would also probably have to come

> *before* Sphinx does its thing or it's unlikely to survive.

> 

> > > +        #

> > > +        # Is this an underline line?  If so, and it is the same length

> > > +        # as the previous line, we may have mangled a heading line in

> > > +        # error, so undo it.

> > > +        #

> > > +        elif RE_underline.match(line):

> > > +            if len(line) == len(previous):    

> > 

> > No, that doesn't seem enough. I would, instead, use the regex I

> > proposed before, in order to check if the previous line starts with

> > a non-space, and getting the length only up to the last non-space

> > (yeah, unfortunately, we have some text files that have extra blank

> > spaces at line's tail).  

> 

> So I'll make it "if len(line) == len(previous.strip())

> 

> Thanks,

> 

> jon




Thanks,
Mauro

[1/2] Docs: An initial automarkup extension for sphinx

Commit Message

Comments

Patch