diff mbox series

mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging

Message ID 51e6d7fb-ac9e-a59f-ea63-ad06219b429d@collabora.com
State New
Headers show
Series mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging | expand

Commit Message

Guillaume Tucker Dec. 11, 2017, 10:30 a.m. UTC
Hi Daniel,

Please see below, I've had several bisection results pointing at
that commit over the week-end on mainline but also on linux-next
and net-next.  While the peach-pi is a bit flaky at the moment
and is likely to have more than one issue, it does seem like this
commit is causing some well reproducible kernel hang.

Here's a re-run with v4.15-rc3 showing the issue:

   https://lava.collabora.co.uk/scheduler/job/1018478

and here's another one with the change mentioned below reverted:

   https://lava.collabora.co.uk/scheduler/job/1018479

They both show a warning about "unbalanced disables for lcd_vdd",
I don't know if this is related as I haven't investigated any
further.  It does appear to reliably hang with v4.15-rc3 and
boot most of the time with the commit reverted though.

The automated kernelci.org bisection is still an experimental
tool and it may well be a false positive, so please take this
result with a pinch of salt...

Hope this helps!

Best wishes,
Guillaume


-------- Forwarded Message --------
Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)
From: kernelci.org bot <bot@kernelci.org>

To: guillaume.tucker@collabora.com

Bisection result for mainline/master (v4.15-rc3) on peach-pi

Good known revision:

     c6b3e96 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Bad known revision:

     50c4c4e Linux 4.15-rc3

Extra parameters:

     Tree:      mainline
     Branch:    master
     Target:    peach-pi
     Lab:       lab-collabora
     Defconfig: exynos_defconfig
     Plan:      boot


Breaking commit found:

-------------------------------------------------------------------------------
commit a703c55004e1c5076d57e43771b3e11117796ea0
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Dec 4 21:48:18 2017 +0100

     drm: safely free connectors from connector_iter
     
     In
     
     commit 613051dac40da1751ab269572766d3348d45a197
     Author: Daniel Vetter <daniel.vetter@ffwll.ch>
     Date:   Wed Dec 14 00:08:06 2016 +0100
     
         drm: locking&new iterators for connector_list
     
     we've went to extreme lengths to make sure connector iterations works
     in any context, without introducing any additional locking context.
     This worked, except for a small fumble in the implementation:
     
     When we actually race with a concurrent connector unplug event, and
     our temporary connector reference turns out to be the final one, then
     everything breaks: We call the connector release function from
     whatever context we happen to be in, which can be an irq/atomic
     context. And connector freeing grabs all kinds of locks and stuff.
     
     Fix this by creating a specially safe put function for connetor_iter,
     which (in this rare case) punts the cleanup to a worker.
     
     Reported-by: Ben Widawsky <ben@bwidawsk.net>
     Cc: Ben Widawsky <ben@bwidawsk.net>
     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
     Cc: Dave Airlie <airlied@gmail.com>
     Cc: Chris Wilson <chris@chris-wilson.co.uk>
     Cc: Sean Paul <seanpaul@chromium.org>
     Cc: <stable@vger.kernel.org> # v4.11+
     Reviewed-by: Dave Airlie <airlied@gmail.com>

     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

     Link: https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch

-------------------------------------------------------------------------------


Git bisection log:

-------------------------------------------------------------------------------
git bisect start
# good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94
# bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3
git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
# bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d
# bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d
# bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12
# bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992
# bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU
git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80
# good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the shrink request to prevent skip huge pool
git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6
# good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
git bisect good db8f884ca7fe6af64d443d1510464efe23826131
# bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828
# bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter
git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0
# first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter
-------------------------------------------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Daniel Vetter Dec. 11, 2017, 5:05 p.m. UTC | #1
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:
> Hi Daniel,

>

> Please see below, I've had several bisection results pointing at

> that commit over the week-end on mainline but also on linux-next

> and net-next.  While the peach-pi is a bit flaky at the moment

> and is likely to have more than one issue, it does seem like this

> commit is causing some well reproducible kernel hang.

>

> Here's a re-run with v4.15-rc3 showing the issue:

>

>   https://lava.collabora.co.uk/scheduler/job/1018478

>

> and here's another one with the change mentioned below reverted:

>

>   https://lava.collabora.co.uk/scheduler/job/1018479

>

> They both show a warning about "unbalanced disables for lcd_vdd",

> I don't know if this is related as I haven't investigated any

> further.  It does appear to reliably hang with v4.15-rc3 and

> boot most of the time with the commit reverted though.

>

> The automated kernelci.org bisection is still an experimental

> tool and it may well be a false positive, so please take this

> result with a pinch of salt...


The patch just very minimal moves the connector cleanup around (so
timing change), but except when you unload a driver (or maybe that
funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
more info than "seems to hang a bit more" I have no idea what's wrong.
The patch itself should work, at least it survived quite some serious
testing we do on everything.
-Daniel

> Hope this helps!

>

> Best wishes,

> Guillaume

>

>

> -------- Forwarded Message --------

> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging

> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)

> From: kernelci.org bot <bot@kernelci.org>

> To: guillaume.tucker@collabora.com

>

> Bisection result for mainline/master (v4.15-rc3) on peach-pi

>

> Good known revision:

>

>     c6b3e96 Merge branch 'for-linus' of

> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

>

> Bad known revision:

>

>     50c4c4e Linux 4.15-rc3

>

> Extra parameters:

>

>     Tree:      mainline

>     Branch:    master

>     Target:    peach-pi

>     Lab:       lab-collabora

>     Defconfig: exynos_defconfig

>     Plan:      boot

>

>

> Breaking commit found:

>

> -------------------------------------------------------------------------------

> commit a703c55004e1c5076d57e43771b3e11117796ea0

> Author: Daniel Vetter <daniel.vetter@ffwll.ch>

> Date:   Mon Dec 4 21:48:18 2017 +0100

>

>     drm: safely free connectors from connector_iter

>         In

>         commit 613051dac40da1751ab269572766d3348d45a197

>     Author: Daniel Vetter <daniel.vetter@ffwll.ch>

>     Date:   Wed Dec 14 00:08:06 2016 +0100

>             drm: locking&new iterators for connector_list

>         we've went to extreme lengths to make sure connector iterations

> works

>     in any context, without introducing any additional locking context.

>     This worked, except for a small fumble in the implementation:

>         When we actually race with a concurrent connector unplug event, and

>     our temporary connector reference turns out to be the final one, then

>     everything breaks: We call the connector release function from

>     whatever context we happen to be in, which can be an irq/atomic

>     context. And connector freeing grabs all kinds of locks and stuff.

>         Fix this by creating a specially safe put function for

> connetor_iter,

>     which (in this rare case) punts the cleanup to a worker.

>         Reported-by: Ben Widawsky <ben@bwidawsk.net>

>     Cc: Ben Widawsky <ben@bwidawsk.net>

>     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")

>     Cc: Dave Airlie <airlied@gmail.com>

>     Cc: Chris Wilson <chris@chris-wilson.co.uk>

>     Cc: Sean Paul <seanpaul@chromium.org>

>     Cc: <stable@vger.kernel.org> # v4.11+

>     Reviewed-by: Dave Airlie <airlied@gmail.com>

>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

>     Link:

> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch

>

> diff --git a/drivers/gpu/drm/drm_connector.c

> b/drivers/gpu/drm/drm_connector.c

> index 25f4b2e..4820141 100644

> --- a/drivers/gpu/drm/drm_connector.c

> +++ b/drivers/gpu/drm/drm_connector.c

> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)

>         connector->funcs->destroy(connector);

>  }

>  +static void drm_connector_free_work_fn(struct work_struct *work)

> +{

> +       struct drm_connector *connector =

> +               container_of(work, struct drm_connector, free_work);

> +       struct drm_device *dev = connector->dev;

> +

> +       drm_mode_object_unregister(dev, &connector->base);

> +       connector->funcs->destroy(connector);

> +}

> +

>  /**

>   * drm_connector_init - Init a preallocated connector

>   * @dev: DRM device

> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,

>         if (ret)

>                 return ret;

>  +      INIT_WORK(&connector->free_work, drm_connector_free_work_fn);

> +

>         connector->base.properties = &connector->properties;

>         connector->dev = dev;

>         connector->funcs = funcs;

> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device

> *dev,

>  }

>  EXPORT_SYMBOL(drm_connector_list_iter_begin);

>  +/*

> + * Extra-safe connector put function that works in any context. Should only

> be

> + * used from the connector_iter functions, where we never really expect to

> + * actually release the connector when dropping our final reference.

> + */

> +static void

> +drm_connector_put_safe(struct drm_connector *conn)

> +{

> +       if (refcount_dec_and_test(&conn->base.refcount.refcount))

> +               schedule_work(&conn->free_work);

> +}

> +

>  /**

>   * drm_connector_list_iter_next - return next connector

>   * @iter: connectr_list iterator

> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct

> drm_connector_list_iter *iter)

>         spin_unlock_irqrestore(&config->connector_list_lock, flags);

>         if (old_conn)

> -               drm_connector_put(old_conn);

> +               drm_connector_put_safe(old_conn);

>         return iter->conn;

>  }

> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct

> drm_connector_list_iter *iter)

>  {

>         iter->dev = NULL;

>         if (iter->conn)

> -               drm_connector_put(iter->conn);

> +               drm_connector_put_safe(iter->conn);

>         lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);

>  }

>  EXPORT_SYMBOL(drm_connector_list_iter_end);

> diff --git a/drivers/gpu/drm/drm_mode_config.c

> b/drivers/gpu/drm/drm_mode_config.c

> index cda8bfa..cc78b3d 100644

> --- a/drivers/gpu/drm/drm_mode_config.c

> +++ b/drivers/gpu/drm/drm_mode_config.c

> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)

>                 drm_connector_put(connector);

>         }

>         drm_connector_list_iter_end(&conn_iter);

> +       /* connector_iter drops references in a work item. */

> +       flush_scheduled_work();

>         if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {

>                 drm_connector_list_iter_begin(dev, &conn_iter);

>                 drm_for_each_connector_iter(connector, &conn_iter)

> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h

> index df9807a..a4649c5 100644

> --- a/include/drm/drm_connector.h

> +++ b/include/drm/drm_connector.h

> @@ -916,6 +916,14 @@ struct drm_connector {

>         uint8_t num_h_tile, num_v_tile;

>         uint8_t tile_h_loc, tile_v_loc;

>         uint16_t tile_h_size, tile_v_size;

> +

> +       /**

> +        * @free_work:

> +        *

> +        * Work used only by &drm_connector_iter to be able to clean up a

> +        * connector from any context.

> +        */

> +       struct work_struct free_work;

>  };

>   #define obj_to_connector(x) container_of(x, struct drm_connector, base)

> -------------------------------------------------------------------------------

>

>

> Git bisection log:

>

> -------------------------------------------------------------------------------

> git bisect start

> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus'

> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94

> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3

> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36

> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge

> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d

> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2'

> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d

> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag

> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux

> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12

> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag

> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel

> into drm-fixes

> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992

> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop

> NONCONTIG flag for buffers allocated without IOMMU

> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80

> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the

> shrink request to prevent skip huge pool

> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6

> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch

> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

> git bisect good db8f884ca7fe6af64d443d1510464efe23826131

> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag

> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc

> into drm-fixes

> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828

> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free

> connectors from connector_iter

> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0

> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely

> free connectors from connector_iter

> -------------------------------------------------------------------------------




-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Javier Martinez Canillas Dec. 11, 2017, 10:28 p.m. UTC | #2
[adding Marek and Shuah to cc list]

On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker

> <guillaume.tucker@collabora.com> wrote:

>> Hi Daniel,

>>

>> Please see below, I've had several bisection results pointing at

>> that commit over the week-end on mainline but also on linux-next

>> and net-next.  While the peach-pi is a bit flaky at the moment

>> and is likely to have more than one issue, it does seem like this

>> commit is causing some well reproducible kernel hang.

>>

>> Here's a re-run with v4.15-rc3 showing the issue:

>>

>>   https://lava.collabora.co.uk/scheduler/job/1018478

>>

>> and here's another one with the change mentioned below reverted:

>>

>>   https://lava.collabora.co.uk/scheduler/job/1018479

>>

>> They both show a warning about "unbalanced disables for lcd_vdd",

>> I don't know if this is related as I haven't investigated any

>> further.  It does appear to reliably hang with v4.15-rc3 and

>> boot most of the time with the commit reverted though.

>>

>> The automated kernelci.org bisection is still an experimental

>> tool and it may well be a false positive, so please take this

>> result with a pinch of salt...

>

> The patch just very minimal moves the connector cleanup around (so

> timing change), but except when you unload a driver (or maybe that

> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have

> more info than "seems to hang a bit more" I have no idea what's wrong.

> The patch itself should work, at least it survived quite some serious

> testing we do on everything.

> -Daniel

>


Marek was pointing to a different culprit [0] in this [1] thread. I
see that both commits made it to v4.15-rc3, which is the first version
where boot fails. So maybe is a combination of both? Or rather
reverting one patch masks the error in the other.

I've access to the machine but unfortunately not a lot of time to dig
on this, I could try to do it in the weekend though.

[0]: https://patchwork.kernel.org/patch/10067711/
[1]: https://www.spinics.net/lists/arm-kernel/msg622152.html

Best regards,
Javier

>> Hope this helps!

>>

>> Best wishes,

>> Guillaume

>>

>>

>> -------- Forwarded Message --------

>> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging

>> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)

>> From: kernelci.org bot <bot@kernelci.org>

>> To: guillaume.tucker@collabora.com

>>

>> Bisection result for mainline/master (v4.15-rc3) on peach-pi

>>

>> Good known revision:

>>

>>     c6b3e96 Merge branch 'for-linus' of

>> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

>>

>> Bad known revision:

>>

>>     50c4c4e Linux 4.15-rc3

>>

>> Extra parameters:

>>

>>     Tree:      mainline

>>     Branch:    master

>>     Target:    peach-pi

>>     Lab:       lab-collabora

>>     Defconfig: exynos_defconfig

>>     Plan:      boot

>>

>>

>> Breaking commit found:

>>

>> -------------------------------------------------------------------------------

>> commit a703c55004e1c5076d57e43771b3e11117796ea0

>> Author: Daniel Vetter <daniel.vetter@ffwll.ch>

>> Date:   Mon Dec 4 21:48:18 2017 +0100

>>

>>     drm: safely free connectors from connector_iter

>>         In

>>         commit 613051dac40da1751ab269572766d3348d45a197

>>     Author: Daniel Vetter <daniel.vetter@ffwll.ch>

>>     Date:   Wed Dec 14 00:08:06 2016 +0100

>>             drm: locking&new iterators for connector_list

>>         we've went to extreme lengths to make sure connector iterations

>> works

>>     in any context, without introducing any additional locking context.

>>     This worked, except for a small fumble in the implementation:

>>         When we actually race with a concurrent connector unplug event, and

>>     our temporary connector reference turns out to be the final one, then

>>     everything breaks: We call the connector release function from

>>     whatever context we happen to be in, which can be an irq/atomic

>>     context. And connector freeing grabs all kinds of locks and stuff.

>>         Fix this by creating a specially safe put function for

>> connetor_iter,

>>     which (in this rare case) punts the cleanup to a worker.

>>         Reported-by: Ben Widawsky <ben@bwidawsk.net>

>>     Cc: Ben Widawsky <ben@bwidawsk.net>

>>     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")

>>     Cc: Dave Airlie <airlied@gmail.com>

>>     Cc: Chris Wilson <chris@chris-wilson.co.uk>

>>     Cc: Sean Paul <seanpaul@chromium.org>

>>     Cc: <stable@vger.kernel.org> # v4.11+

>>     Reviewed-by: Dave Airlie <airlied@gmail.com>

>>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

>>     Link:

>> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch

>>

>> diff --git a/drivers/gpu/drm/drm_connector.c

>> b/drivers/gpu/drm/drm_connector.c

>> index 25f4b2e..4820141 100644

>> --- a/drivers/gpu/drm/drm_connector.c

>> +++ b/drivers/gpu/drm/drm_connector.c

>> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)

>>         connector->funcs->destroy(connector);

>>  }

>>  +static void drm_connector_free_work_fn(struct work_struct *work)

>> +{

>> +       struct drm_connector *connector =

>> +               container_of(work, struct drm_connector, free_work);

>> +       struct drm_device *dev = connector->dev;

>> +

>> +       drm_mode_object_unregister(dev, &connector->base);

>> +       connector->funcs->destroy(connector);

>> +}

>> +

>>  /**

>>   * drm_connector_init - Init a preallocated connector

>>   * @dev: DRM device

>> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,

>>         if (ret)

>>                 return ret;

>>  +      INIT_WORK(&connector->free_work, drm_connector_free_work_fn);

>> +

>>         connector->base.properties = &connector->properties;

>>         connector->dev = dev;

>>         connector->funcs = funcs;

>> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device

>> *dev,

>>  }

>>  EXPORT_SYMBOL(drm_connector_list_iter_begin);

>>  +/*

>> + * Extra-safe connector put function that works in any context. Should only

>> be

>> + * used from the connector_iter functions, where we never really expect to

>> + * actually release the connector when dropping our final reference.

>> + */

>> +static void

>> +drm_connector_put_safe(struct drm_connector *conn)

>> +{

>> +       if (refcount_dec_and_test(&conn->base.refcount.refcount))

>> +               schedule_work(&conn->free_work);

>> +}

>> +

>>  /**

>>   * drm_connector_list_iter_next - return next connector

>>   * @iter: connectr_list iterator

>> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct

>> drm_connector_list_iter *iter)

>>         spin_unlock_irqrestore(&config->connector_list_lock, flags);

>>         if (old_conn)

>> -               drm_connector_put(old_conn);

>> +               drm_connector_put_safe(old_conn);

>>         return iter->conn;

>>  }

>> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct

>> drm_connector_list_iter *iter)

>>  {

>>         iter->dev = NULL;

>>         if (iter->conn)

>> -               drm_connector_put(iter->conn);

>> +               drm_connector_put_safe(iter->conn);

>>         lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);

>>  }

>>  EXPORT_SYMBOL(drm_connector_list_iter_end);

>> diff --git a/drivers/gpu/drm/drm_mode_config.c

>> b/drivers/gpu/drm/drm_mode_config.c

>> index cda8bfa..cc78b3d 100644

>> --- a/drivers/gpu/drm/drm_mode_config.c

>> +++ b/drivers/gpu/drm/drm_mode_config.c

>> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)

>>                 drm_connector_put(connector);

>>         }

>>         drm_connector_list_iter_end(&conn_iter);

>> +       /* connector_iter drops references in a work item. */

>> +       flush_scheduled_work();

>>         if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {

>>                 drm_connector_list_iter_begin(dev, &conn_iter);

>>                 drm_for_each_connector_iter(connector, &conn_iter)

>> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h

>> index df9807a..a4649c5 100644

>> --- a/include/drm/drm_connector.h

>> +++ b/include/drm/drm_connector.h

>> @@ -916,6 +916,14 @@ struct drm_connector {

>>         uint8_t num_h_tile, num_v_tile;

>>         uint8_t tile_h_loc, tile_v_loc;

>>         uint16_t tile_h_size, tile_v_size;

>> +

>> +       /**

>> +        * @free_work:

>> +        *

>> +        * Work used only by &drm_connector_iter to be able to clean up a

>> +        * connector from any context.

>> +        */

>> +       struct work_struct free_work;

>>  };

>>   #define obj_to_connector(x) container_of(x, struct drm_connector, base)

>> -------------------------------------------------------------------------------

>>

>>

>> Git bisection log:

>>

>> -------------------------------------------------------------------------------

>> git bisect start

>> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus'

>> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

>> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94

>> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3

>> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36

>> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge

>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

>> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d

>> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2'

>> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

>> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d

>> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag

>> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux

>> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12

>> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag

>> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel

>> into drm-fixes

>> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992

>> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop

>> NONCONTIG flag for buffers allocated without IOMMU

>> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80

>> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the

>> shrink request to prevent skip huge pool

>> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6

>> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch

>> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

>> git bisect good db8f884ca7fe6af64d443d1510464efe23826131

>> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag

>> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc

>> into drm-fixes

>> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828

>> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free

>> connectors from connector_iter

>> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0

>> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely

>> free connectors from connector_iter

>> -------------------------------------------------------------------------------

>

>

>

> --

> Daniel Vetter

> Software Engineer, Intel Corporation

> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

> --

> To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in

> the body of a message to majordomo@vger.kernel.org

> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Javier Martinez Canillas Dec. 11, 2017, 10:54 p.m. UTC | #3
On Mon, Dec 11, 2017 at 11:28 PM, Javier Martinez Canillas
<javier@dowhile0.org> wrote:
> [adding Marek and Shuah to cc list]


[snip]

>>>

>>> Please see below, I've had several bisection results pointing at

>>> that commit over the week-end on mainline but also on linux-next

>>> and net-next.  While the peach-pi is a bit flaky at the moment

>>> and is likely to have more than one issue, it does seem like this

>>> commit is causing some well reproducible kernel hang.

>>>

>>> Here's a re-run with v4.15-rc3 showing the issue:

>>>

>>>   https://lava.collabora.co.uk/scheduler/job/1018478

>>>

>>> and here's another one with the change mentioned below reverted:

>>>

>>>   https://lava.collabora.co.uk/scheduler/job/1018479

>>>

>>> They both show a warning about "unbalanced disables for lcd_vdd",

>>> I don't know if this is related as I haven't investigated any

>>> further.  It does appear to reliably hang with v4.15-rc3 and

>>> boot most of the time with the commit reverted though.

>>>

>>> The automated kernelci.org bisection is still an experimental

>>> tool and it may well be a false positive, so please take this

>>> result with a pinch of salt...

>>

>> The patch just very minimal moves the connector cleanup around (so

>> timing change), but except when you unload a driver (or maybe that

>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have

>> more info than "seems to hang a bit more" I have no idea what's wrong.

>> The patch itself should work, at least it survived quite some serious

>> testing we do on everything.

>> -Daniel

>>

>

> Marek was pointing to a different culprit [0] in this [1] thread. I

> see that both commits made it to v4.15-rc3, which is the first version

> where boot fails. So maybe is a combination of both? Or rather

> reverting one patch masks the error in the other.

>

> I've access to the machine but unfortunately not a lot of time to dig

> on this, I could try to do it in the weekend though.

>

> [0]: https://patchwork.kernel.org/patch/10067711/

> [1]: https://www.spinics.net/lists/arm-kernel/msg622152.html

>


So I gave a quick look to this, and at the very least there's a bug in
the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
exynos: Add status property to Exynos 542x Mixer nodes").

I've posted a fix for that:

https://patchwork.kernel.org/patch/10105921/

I believe this could be also be the cause for the boot failure, since
I see in the boot log that things start to go wrong after exynos-drm
fails to bind the HDMI component:

[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
0xc1398690): -1

Anyway, I don't have access to the machine now, but it would be nice
if someone test. Or I would do in a few days.

Best regards,
Javier
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Russell King (Oracle) Dec. 11, 2017, 10:58 p.m. UTC | #4
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
> So I gave a quick look to this, and at the very least there's a bug in

> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

> exynos: Add status property to Exynos 542x Mixer nodes").

> 

> I've posted a fix for that:

> 

> https://patchwork.kernel.org/patch/10105921/

> 

> I believe this could be also be the cause for the boot failure, since

> I see in the boot log that things start to go wrong after exynos-drm

> fails to bind the HDMI component:

> 

> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

> 0xc1398690): -1


Umm, -1 ?  Looking that error code up in
include/uapi/asm-generic/errno-base.h says it's -EPERM.

I suspect that's someone just returning -1 because they're lazy...
which is real bad form and needs fixing.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Russell King (Oracle) Dec. 11, 2017, 11:02 p.m. UTC | #5
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:

> > So I gave a quick look to this, and at the very least there's a bug in

> > the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

> > exynos: Add status property to Exynos 542x Mixer nodes").

> > 

> > I've posted a fix for that:

> > 

> > https://patchwork.kernel.org/patch/10105921/

> > 

> > I believe this could be also be the cause for the boot failure, since

> > I see in the boot log that things start to go wrong after exynos-drm

> > fails to bind the HDMI component:

> > 

> > [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

> > 0xc1398690): -1

> 

> Umm, -1 ?  Looking that error code up in

> include/uapi/asm-generic/errno-base.h says it's -EPERM.

> 

> I suspect that's someone just returning -1 because they're lazy...

> which is real bad form and needs fixing.


Oh, it really is -EPERM:

struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
                                       enum exynos_drm_output_type out_type)
{
        struct drm_crtc *crtc;

        drm_for_each_crtc(crtc, drm_dev)
                if (to_exynos_crtc(crtc)->type == out_type)
                        return to_exynos_crtc(crtc);

        return ERR_PTR(-EPERM);
}

Does "Operation not permitted" really convey the error here?  It doesn't
look like a permission error to me.

Can we please avoid abusing errno codes?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shuah Khan Dec. 11, 2017, 11:25 p.m. UTC | #6
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:

>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:

>>> So I gave a quick look to this, and at the very least there's a bug in

>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

>>> exynos: Add status property to Exynos 542x Mixer nodes").

>>>

>>> I've posted a fix for that:

>>>

>>> https://patchwork.kernel.org/patch/10105921/

>>>

>>> I believe this could be also be the cause for the boot failure, since

>>> I see in the boot log that things start to go wrong after exynos-drm

>>> fails to bind the HDMI component:

>>>

>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

>>> 0xc1398690): -1

>>

>> Umm, -1 ?  Looking that error code up in

>> include/uapi/asm-generic/errno-base.h says it's -EPERM.

>>

>> I suspect that's someone just returning -1 because they're lazy...

>> which is real bad form and needs fixing.

> 

> Oh, it really is -EPERM:

> 

> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,

>                                        enum exynos_drm_output_type out_type)

> {

>         struct drm_crtc *crtc;

> 

>         drm_for_each_crtc(crtc, drm_dev)

>                 if (to_exynos_crtc(crtc)->type == out_type)

>                         return to_exynos_crtc(crtc);

> 

>         return ERR_PTR(-EPERM);

> }

> 

> Does "Operation not permitted" really convey the error here?  It doesn't

> look like a permission error to me.

> 

> Can we please avoid abusing errno codes?


I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
with top commit g968edbd worked just fine for me last Friday. I ran several
tests and everything checked out except the exynos-gsc lockdep issue I sent
a 4.14 patch for.

However, with 4.15-rc3, dmesg is gets filled with 

[  342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

I will start bisect and try to isolate the problem. I suspect this is related to dts
changes perhaps? I used to this problem a while back and it has been fixed.

thanks,
-- Shuah 
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski Dec. 12, 2017, 7:54 a.m. UTC | #7
Hi Shuah,

On 2017-12-12 00:25, Shuah Khan wrote:
> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:

>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:

>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:

>>>> So I gave a quick look to this, and at the very least there's a bug in

>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

>>>> exynos: Add status property to Exynos 542x Mixer nodes").

>>>>

>>>> I've posted a fix for that:

>>>>

>>>> https://patchwork.kernel.org/patch/10105921/

>>>>

>>>> I believe this could be also be the cause for the boot failure, since

>>>> I see in the boot log that things start to go wrong after exynos-drm

>>>> fails to bind the HDMI component:

>>>>

>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

>>>> 0xc1398690): -1

>>> Umm, -1 ?  Looking that error code up in

>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.

>>>

>>> I suspect that's someone just returning -1 because they're lazy...

>>> which is real bad form and needs fixing.

>> Oh, it really is -EPERM:

>>

>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,

>>                                         enum exynos_drm_output_type out_type)

>> {

>>          struct drm_crtc *crtc;

>>

>>          drm_for_each_crtc(crtc, drm_dev)

>>                  if (to_exynos_crtc(crtc)->type == out_type)

>>                          return to_exynos_crtc(crtc);

>>

>>          return ERR_PTR(-EPERM);

>> }

>>

>> Does "Operation not permitted" really convey the error here?  It doesn't

>> look like a permission error to me.

>>

>> Can we please avoid abusing errno codes?

> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+

> with top commit g968edbd worked just fine for me last Friday. I ran several

> tests and everything checked out except the exynos-gsc lockdep issue I sent

> a 4.14 patch for.

>

> However, with 4.15-rc3, dmesg is gets filled with

>

> [  342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

> [  402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

>

> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

>

> I will start bisect and try to isolate the problem. I suspect this is related to dts

> changes perhaps? I used to this problem a while back and it has been fixed.


This warning has been added intentionally, see following discussions:
https://patchwork.kernel.org/patch/10034919/
https://patchwork.kernel.org/patch/10070475/

This means that your test apps should be updated or you should enable Exynos
IOMMU support in your config. Maybe it is a good time to finally enable it
in exynos_defconfig.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Javier Martinez Canillas Dec. 12, 2017, 8 a.m. UTC | #8
Hello Marek,

On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi Shuah,

>

>

> On 2017-12-12 00:25, Shuah Khan wrote:

>>

>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:

>>>

>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:

>>>>

>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas

>>>> wrote:

>>>>>

>>>>> So I gave a quick look to this, and at the very least there's a bug in

>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

>>>>> exynos: Add status property to Exynos 542x Mixer nodes").

>>>>>

>>>>> I've posted a fix for that:

>>>>>

>>>>> https://patchwork.kernel.org/patch/10105921/

>>>>>

>>>>> I believe this could be also be the cause for the boot failure, since

>>>>> I see in the boot log that things start to go wrong after exynos-drm

>>>>> fails to bind the HDMI component:

>>>>>

>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

>>>>> 0xc1398690): -1

>>>>

>>>> Umm, -1 ?  Looking that error code up in

>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.

>>>>

>>>> I suspect that's someone just returning -1 because they're lazy...

>>>> which is real bad form and needs fixing.

>>>

>>> Oh, it really is -EPERM:

>>>

>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device

>>> *drm_dev,

>>>                                         enum exynos_drm_output_type

>>> out_type)

>>> {

>>>          struct drm_crtc *crtc;

>>>

>>>          drm_for_each_crtc(crtc, drm_dev)

>>>                  if (to_exynos_crtc(crtc)->type == out_type)

>>>                          return to_exynos_crtc(crtc);

>>>

>>>          return ERR_PTR(-EPERM);

>>> }

>>>

>>> Does "Operation not permitted" really convey the error here?  It doesn't

>>> look like a permission error to me.

>>>

>>> Can we please avoid abusing errno codes?

>>

>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+

>> with top commit g968edbd worked just fine for me last Friday. I ran

>> several

>> tests and everything checked out except the exynos-gsc lockdep issue I

>> sent

>> a 4.14 patch for.

>>

>> However, with 4.15-rc3, dmesg is gets filled with

>>

>> [  342.337181] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  342.337470] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  342.337851] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.382346] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.396682] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.399244] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.399496] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.399848] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.400163] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.400495] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.401294] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>> [  402.401595] [drm] Non-contiguous allocation is not supported without

>> IOMMU, falling back to contiguous buffer

>>

>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

>>

>> I will start bisect and try to isolate the problem. I suspect this is

>> related to dts

>> changes perhaps? I used to this problem a while back and it has been

>> fixed.

>

>

> This warning has been added intentionally, see following discussions:

> https://patchwork.kernel.org/patch/10034919/

> https://patchwork.kernel.org/patch/10070475/

>

> This means that your test apps should be updated or you should enable Exynos

> IOMMU support in your config. Maybe it is a good time to finally enable it

> in exynos_defconfig.

>


Has the issue that the boot-loader keeps the display controller
enabled and scanning pages on the Exynos Chromebooks resolved?

I think that's that preventing to enable it by default in
exynos_defconfig since it caused boot failures when enabled on these
machines. I don't follow exynos development too closely nowadays so
maybe there's a fix in place now.

> Best regards

> --

> Marek Szyprowski, PhD

> Samsung R&D Institute Poland


Best regards,
Javier
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski Dec. 12, 2017, 8:47 a.m. UTC | #9
Hi Javier,

On 2017-12-12 09:00, Javier Martinez Canillas wrote:
> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski

> <m.szyprowski@samsung.com> wrote:

>> On 2017-12-12 00:25, Shuah Khan wrote:

>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:

>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:

>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas

>>>>> wrote:

>>>>>> So I gave a quick look to this, and at the very least there's a bug in

>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").

>>>>>>

>>>>>> I've posted a fix for that:

>>>>>>

>>>>>> https://patchwork.kernel.org/patch/10105921/

>>>>>>

>>>>>> I believe this could be also be the cause for the boot failure, since

>>>>>> I see in the boot log that things start to go wrong after exynos-drm

>>>>>> fails to bind the HDMI component:

>>>>>>

>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

>>>>>> 0xc1398690): -1

>>>>> Umm, -1 ?  Looking that error code up in

>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.

>>>>>

>>>>> I suspect that's someone just returning -1 because they're lazy...

>>>>> which is real bad form and needs fixing.

>>>> Oh, it really is -EPERM:

>>>>

>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device

>>>> *drm_dev,

>>>>                                          enum exynos_drm_output_type

>>>> out_type)

>>>> {

>>>>           struct drm_crtc *crtc;

>>>>

>>>>           drm_for_each_crtc(crtc, drm_dev)

>>>>                   if (to_exynos_crtc(crtc)->type == out_type)

>>>>                           return to_exynos_crtc(crtc);

>>>>

>>>>           return ERR_PTR(-EPERM);

>>>> }

>>>>

>>>> Does "Operation not permitted" really convey the error here?  It doesn't

>>>> look like a permission error to me.

>>>>

>>>> Can we please avoid abusing errno codes?

>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+

>>> with top commit g968edbd worked just fine for me last Friday. I ran

>>> several

>>> tests and everything checked out except the exynos-gsc lockdep issue I

>>> sent

>>> a 4.14 patch for.

>>>

>>> However, with 4.15-rc3, dmesg is gets filled with

>>>

>>> [  342.337181] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  342.337470] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  342.337851] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.382346] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.396682] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.399244] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.399496] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.399848] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.400163] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.400495] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.401294] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>> [  402.401595] [drm] Non-contiguous allocation is not supported without

>>> IOMMU, falling back to contiguous buffer

>>>

>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

>>>

>>> I will start bisect and try to isolate the problem. I suspect this is

>>> related to dts

>>> changes perhaps? I used to this problem a while back and it has been

>>> fixed.

>>

>> This warning has been added intentionally, see following discussions:

>> https://patchwork.kernel.org/patch/10034919/

>> https://patchwork.kernel.org/patch/10070475/

>>

>> This means that your test apps should be updated or you should enable Exynos

>> IOMMU support in your config. Maybe it is a good time to finally enable it

>> in exynos_defconfig.

>>

> Has the issue that the boot-loader keeps the display controller

> enabled and scanning pages on the Exynos Chromebooks resolved?

>

> I think that's that preventing to enable it by default in

> exynos_defconfig since it caused boot failures when enabled on these

> machines. I don't follow exynos development too closely nowadays so

> maybe there's a fix in place now.


Not directly. I still didn't find time to properly add support for
devices, which were left in-working state (with active DMA
transactions) by bootloader, but due to some other changes in the
order of operations during boot process, power domains are
initialized very early and due to temporary lack of devices (which
are not yet added to the system), are turned off. This practically
stops FIMD for scanning framebuffer and "solves" this issue.

I've checked now and Exynos Snow Chromebook boots fine with IOMMU
support enabled, both with v4.15-rc3 and linux-next.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Krzysztof Kozlowski Dec. 12, 2017, 10:10 a.m. UTC | #10
On Tue, Dec 12, 2017 at 9:47 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi Javier,

>

> On 2017-12-12 09:00, Javier Martinez Canillas wrote:

>>

>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski

>> <m.szyprowski@samsung.com> wrote:


(...)

>>> This warning has been added intentionally, see following discussions:

>>> https://patchwork.kernel.org/patch/10034919/

>>> https://patchwork.kernel.org/patch/10070475/

>>>

>>> This means that your test apps should be updated or you should enable

>>> Exynos

>>> IOMMU support in your config. Maybe it is a good time to finally enable

>>> it

>>> in exynos_defconfig.

>>>

>> Has the issue that the boot-loader keeps the display controller

>> enabled and scanning pages on the Exynos Chromebooks resolved?

>>

>> I think that's that preventing to enable it by default in

>> exynos_defconfig since it caused boot failures when enabled on these

>> machines. I don't follow exynos development too closely nowadays so

>> maybe there's a fix in place now.

>

>

> Not directly. I still didn't find time to properly add support for

> devices, which were left in-working state (with active DMA

> transactions) by bootloader, but due to some other changes in the

> order of operations during boot process, power domains are

> initialized very early and due to temporary lack of devices (which

> are not yet added to the system), are turned off. This practically

> stops FIMD for scanning framebuffer and "solves" this issue.

>

> I've checked now and Exynos Snow Chromebook boots fine with IOMMU

> support enabled, both with v4.15-rc3 and linux-next.


Then it looks like we could give EXYNOS_IOMMU a try. At least only on
exynos_defconfig which would leave multi_v7 as a platform to compare.

Best regards,
Krzysztof
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski Dec. 12, 2017, 11:38 a.m. UTC | #11
Hi All,

On 2017-12-11 23:28, Javier Martinez Canillas wrote:
> [adding Marek and Shuah to cc list]

>

> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker

>> <guillaume.tucker@collabora.com> wrote:

>>> Hi Daniel,

>>>

>>> Please see below, I've had several bisection results pointing at

>>> that commit over the week-end on mainline but also on linux-next

>>> and net-next.  While the peach-pi is a bit flaky at the moment

>>> and is likely to have more than one issue, it does seem like this

>>> commit is causing some well reproducible kernel hang.

>>>

>>> Here's a re-run with v4.15-rc3 showing the issue:

>>>

>>>    https://lava.collabora.co.uk/scheduler/job/1018478

>>>

>>> and here's another one with the change mentioned below reverted:

>>>

>>>    https://lava.collabora.co.uk/scheduler/job/1018479

>>>

>>> They both show a warning about "unbalanced disables for lcd_vdd",

>>> I don't know if this is related as I haven't investigated any

>>> further.  It does appear to reliably hang with v4.15-rc3 and

>>> boot most of the time with the commit reverted though.

>>>

>>> The automated kernelci.org bisection is still an experimental

>>> tool and it may well be a false positive, so please take this

>>> result with a pinch of salt...

>> The patch just very minimal moves the connector cleanup around (so

>> timing change), but except when you unload a driver (or maybe that

>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have

>> more info than "seems to hang a bit more" I have no idea what's wrong.

>> The patch itself should work, at least it survived quite some serious

>> testing we do on everything.

>> -Daniel

>>

> Marek was pointing to a different culprit [0] in this [1] thread. I

> see that both commits made it to v4.15-rc3, which is the first version

> where boot fails. So maybe is a combination of both? Or rather

> reverting one patch masks the error in the other.

>

> I've access to the machine but unfortunately not a lot of time to dig

> on this, I could try to do it in the weekend though.


After a recent discussion on the Javier's patch:
https://patchwork.kernel.org/patch/10106417/
I've managed to reproduce this issue also on Exynos5250 based Samsung
Snow Chromebook and investigate a bit.

It is caused by a deadlock in the main kernel workqueue. Here are details:

1. Exynos DRM fails to initialize due to missing regulators and gets moved
to deferred probe device list

2. Deferred probe is triggered and kernel "events" workqueue calls
deferred_probe_work_func()

3. exynos_drm_bind() is called, component_bind_all() fails due to missing
Exynos Mixer device

4. error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()

5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
deadlock.

Do You have idea how to fix this issue properly?

Taking a look at git blame, this indeed shows that the issue has been
introduced by the commit a703c55004e1 ("drm: safely free connectors from
connector_ite"), which added a call to flush_scheduled_work() in
drm_mode_config_cleanup().

drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
called from the workqueue, but I don't have idea how to check that. The
other way of fixing it would be to resurrect separate workqueue for DRM
related events.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shuah Khan Dec. 12, 2017, 2:39 p.m. UTC | #12
Hi Marek,

On 12/12/2017 04:38 AM, Marek Szyprowski wrote:
> Hi All,

> 

> On 2017-12-11 23:28, Javier Martinez Canillas wrote:

>> [adding Marek and Shuah to cc list]

>>

>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker

>>> <guillaume.tucker@collabora.com> wrote:

>>>> Hi Daniel,

>>>>

>>>> Please see below, I've had several bisection results pointing at

>>>> that commit over the week-end on mainline but also on linux-next

>>>> and net-next.  While the peach-pi is a bit flaky at the moment

>>>> and is likely to have more than one issue, it does seem like this

>>>> commit is causing some well reproducible kernel hang.

>>>>

>>>> Here's a re-run with v4.15-rc3 showing the issue:

>>>>

>>>>    https://lava.collabora.co.uk/scheduler/job/1018478

>>>>

>>>> and here's another one with the change mentioned below reverted:

>>>>

>>>>    https://lava.collabora.co.uk/scheduler/job/1018479

>>>>

>>>> They both show a warning about "unbalanced disables for lcd_vdd",

>>>> I don't know if this is related as I haven't investigated any

>>>> further.  It does appear to reliably hang with v4.15-rc3 and

>>>> boot most of the time with the commit reverted though.

>>>>

>>>> The automated kernelci.org bisection is still an experimental

>>>> tool and it may well be a false positive, so please take this

>>>> result with a pinch of salt...

>>> The patch just very minimal moves the connector cleanup around (so

>>> timing change), but except when you unload a driver (or maybe that

>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have

>>> more info than "seems to hang a bit more" I have no idea what's wrong.

>>> The patch itself should work, at least it survived quite some serious

>>> testing we do on everything.

>>> -Daniel

>>>

>> Marek was pointing to a different culprit [0] in this [1] thread. I

>> see that both commits made it to v4.15-rc3, which is the first version

>> where boot fails. So maybe is a combination of both? Or rather

>> reverting one patch masks the error in the other.

>>

>> I've access to the machine but unfortunately not a lot of time to dig

>> on this, I could try to do it in the weekend though.

> 

> After a recent discussion on the Javier's patch:

> https://patchwork.kernel.org/patch/10106417/

> I've managed to reproduce this issue also on Exynos5250 based Samsung

> Snow Chromebook and investigate a bit.

> 

> It is caused by a deadlock in the main kernel workqueue. Here are details:

> 

> 1. Exynos DRM fails to initialize due to missing regulators and gets moved

> to deferred probe device list

> 

> 2. Deferred probe is triggered and kernel "events" workqueue calls

> deferred_probe_work_func()

> 

> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing

> Exynos Mixer device

> 

> 4. error handling path is executed in exynos_drm_bind(), which calls

> drm_mode_config_cleanup()

> 

> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes

> deadlock.

> 

> Do You have idea how to fix this issue properly?

> 

> Taking a look at git blame, this indeed shows that the issue has been

> introduced by the commit a703c55004e1 ("drm: safely free connectors from

> connector_ite"), which added a call to flush_scheduled_work() in

> drm_mode_config_cleanup().


This commit is making its way into stable releases. It has been added to
4.14-6 stable. If this patch poses problems, maybe somebody should comment
on the stable release thread.

> 

> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if

> called from the workqueue, but I don't have idea how to check that. The

> other way of fixing it would be to resurrect separate workqueue for DRM

> related events.

> 


Especially since there is no solution :)

thanks,
-- Shuah

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shuah Khan Dec. 12, 2017, 2:47 p.m. UTC | #13
On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
> Hi Javier,

> 

> On 2017-12-12 09:00, Javier Martinez Canillas wrote:

>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski

>> <m.szyprowski@samsung.com> wrote:

>>> On 2017-12-12 00:25, Shuah Khan wrote:

>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:

>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:

>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas

>>>>>> wrote:

>>>>>>> So I gave a quick look to this, and at the very least there's a bug in

>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").

>>>>>>>

>>>>>>> I've posted a fix for that:

>>>>>>>

>>>>>>> https://patchwork.kernel.org/patch/10105921/

>>>>>>>

>>>>>>> I believe this could be also be the cause for the boot failure, since

>>>>>>> I see in the boot log that things start to go wrong after exynos-drm

>>>>>>> fails to bind the HDMI component:

>>>>>>>

>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

>>>>>>> 0xc1398690): -1

>>>>>> Umm, -1 ?  Looking that error code up in

>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.

>>>>>>

>>>>>> I suspect that's someone just returning -1 because they're lazy...

>>>>>> which is real bad form and needs fixing.

>>>>> Oh, it really is -EPERM:

>>>>>

>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device

>>>>> *drm_dev,

>>>>>                                          enum exynos_drm_output_type

>>>>> out_type)

>>>>> {

>>>>>           struct drm_crtc *crtc;

>>>>>

>>>>>           drm_for_each_crtc(crtc, drm_dev)

>>>>>                   if (to_exynos_crtc(crtc)->type == out_type)

>>>>>                           return to_exynos_crtc(crtc);

>>>>>

>>>>>           return ERR_PTR(-EPERM);

>>>>> }

>>>>>

>>>>> Does "Operation not permitted" really convey the error here?  It doesn't

>>>>> look like a permission error to me.

>>>>>

>>>>> Can we please avoid abusing errno codes?

>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+

>>>> with top commit g968edbd worked just fine for me last Friday. I ran

>>>> several

>>>> tests and everything checked out except the exynos-gsc lockdep issue I

>>>> sent

>>>> a 4.14 patch for.

>>>>

>>>> However, with 4.15-rc3, dmesg is gets filled with

>>>>

>>>> [  342.337181] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  342.337470] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  342.337851] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.382346] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.396682] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.399244] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.399496] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.399848] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.400163] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.400495] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.401294] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>> [  402.401595] [drm] Non-contiguous allocation is not supported without

>>>> IOMMU, falling back to contiguous buffer

>>>>

>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

>>>>

>>>> I will start bisect and try to isolate the problem. I suspect this is

>>>> related to dts

>>>> changes perhaps? I used to this problem a while back and it has been

>>>> fixed.

>>>

>>> This warning has been added intentionally, see following discussions:

>>> https://patchwork.kernel.org/patch/10034919/

>>> https://patchwork.kernel.org/patch/10070475/

>>>

>>> This means that your test apps should be updated or you should enable Exynos

>>> IOMMU support in your config. Maybe it is a good time to finally enable it

>>> in exynos_defconfig.

>>>

>> Has the issue that the boot-loader keeps the display controller

>> enabled and scanning pages on the Exynos Chromebooks resolved?

>>

>> I think that's that preventing to enable it by default in

>> exynos_defconfig since it caused boot failures when enabled on these

>> machines. I don't follow exynos development too closely nowadays so

>> maybe there's a fix in place now.

> 

> Not directly. I still didn't find time to properly add support for

> devices, which were left in-working state (with active DMA

> transactions) by bootloader, but due to some other changes in the

> order of operations during boot process, power domains are

> initialized very early and due to temporary lack of devices (which

> are not yet added to the system), are turned off. This practically

> stops FIMD for scanning framebuffer and "solves" this issue.

> 

> I've checked now and Exynos Snow Chromebook boots fine with IOMMU

> support enabled, both with v4.15-rc3 and linux-next.

> 


Good to know it doesn't break Exynos Snow. This is why I test without
IOMMU and then enable IOMMU on odroid-xu4 for test with IOMMU

thanks,
-- Shuah

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Vetter Dec. 12, 2017, 6:14 p.m. UTC | #14
On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi All,

>

>

> On 2017-12-11 23:28, Javier Martinez Canillas wrote:

>>

>> [adding Marek and Shuah to cc list]

>>

>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch>

>> wrote:

>>>

>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker

>>> <guillaume.tucker@collabora.com> wrote:

>>>>

>>>> Hi Daniel,

>>>>

>>>> Please see below, I've had several bisection results pointing at

>>>> that commit over the week-end on mainline but also on linux-next

>>>> and net-next.  While the peach-pi is a bit flaky at the moment

>>>> and is likely to have more than one issue, it does seem like this

>>>> commit is causing some well reproducible kernel hang.

>>>>

>>>> Here's a re-run with v4.15-rc3 showing the issue:

>>>>

>>>>    https://lava.collabora.co.uk/scheduler/job/1018478

>>>>

>>>> and here's another one with the change mentioned below reverted:

>>>>

>>>>    https://lava.collabora.co.uk/scheduler/job/1018479

>>>>

>>>> They both show a warning about "unbalanced disables for lcd_vdd",

>>>> I don't know if this is related as I haven't investigated any

>>>> further.  It does appear to reliably hang with v4.15-rc3 and

>>>> boot most of the time with the commit reverted though.

>>>>

>>>> The automated kernelci.org bisection is still an experimental

>>>> tool and it may well be a false positive, so please take this

>>>> result with a pinch of salt...

>>>

>>> The patch just very minimal moves the connector cleanup around (so

>>> timing change), but except when you unload a driver (or maybe that

>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have

>>> more info than "seems to hang a bit more" I have no idea what's wrong.

>>> The patch itself should work, at least it survived quite some serious

>>> testing we do on everything.

>>> -Daniel

>>>

>> Marek was pointing to a different culprit [0] in this [1] thread. I

>> see that both commits made it to v4.15-rc3, which is the first version

>> where boot fails. So maybe is a combination of both? Or rather

>> reverting one patch masks the error in the other.

>>

>> I've access to the machine but unfortunately not a lot of time to dig

>> on this, I could try to do it in the weekend though.

>

>

> After a recent discussion on the Javier's patch:

> https://patchwork.kernel.org/patch/10106417/

> I've managed to reproduce this issue also on Exynos5250 based Samsung

> Snow Chromebook and investigate a bit.

>

> It is caused by a deadlock in the main kernel workqueue. Here are details:

>

> 1. Exynos DRM fails to initialize due to missing regulators and gets moved

> to deferred probe device list

>

> 2. Deferred probe is triggered and kernel "events" workqueue calls

> deferred_probe_work_func()

>

> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing

> Exynos Mixer device

>

> 4. error handling path is executed in exynos_drm_bind(), which calls

> drm_mode_config_cleanup()

>

> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes

> deadlock.

>

> Do You have idea how to fix this issue properly?

>

> Taking a look at git blame, this indeed shows that the issue has been

> introduced by the commit a703c55004e1 ("drm: safely free connectors from

> connector_ite"), which added a call to flush_scheduled_work() in

> drm_mode_config_cleanup().

>

> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if

> called from the workqueue, but I don't have idea how to check that. The

> other way of fixing it would be to resurrect separate workqueue for DRM

> related events.


We need to flush the work there, or things will go wrong on unload. I
think the real fix is to make sure that the connector cleanup work
isn't run on the same work stuff as any driver bind stuff, which yes
means new separate workqueue just for this.

I guess the simple fix is to do that in the drm, but tbh I'm surprised
that driver bind/deferred probe hasn't run into this problem anywhere
else yet.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shuah Khan Dec. 12, 2017, 6:26 p.m. UTC | #15
On 12/12/2017 07:47 AM, Shuah Khan wrote:
> On 12/12/2017 01:47 AM, Marek Szyprowski wrote:

>> Hi Javier,

>>

>> On 2017-12-12 09:00, Javier Martinez Canillas wrote:

>>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski

>>> <m.szyprowski@samsung.com> wrote:

>>>> On 2017-12-12 00:25, Shuah Khan wrote:

>>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:

>>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:

>>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas

>>>>>>> wrote:

>>>>>>>> So I gave a quick look to this, and at the very least there's a bug in

>>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:

>>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").

>>>>>>>>

>>>>>>>> I've posted a fix for that:

>>>>>>>>

>>>>>>>> https://patchwork.kernel.org/patch/10105921/

>>>>>>>>

>>>>>>>> I believe this could be also be the cause for the boot failure, since

>>>>>>>> I see in the boot log that things start to go wrong after exynos-drm

>>>>>>>> fails to bind the HDMI component:

>>>>>>>>

>>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops

>>>>>>>> 0xc1398690): -1

>>>>>>> Umm, -1 ?  Looking that error code up in

>>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.

>>>>>>>

>>>>>>> I suspect that's someone just returning -1 because they're lazy...

>>>>>>> which is real bad form and needs fixing.

>>>>>> Oh, it really is -EPERM:

>>>>>>

>>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device

>>>>>> *drm_dev,

>>>>>>                                          enum exynos_drm_output_type

>>>>>> out_type)

>>>>>> {

>>>>>>           struct drm_crtc *crtc;

>>>>>>

>>>>>>           drm_for_each_crtc(crtc, drm_dev)

>>>>>>                   if (to_exynos_crtc(crtc)->type == out_type)

>>>>>>                           return to_exynos_crtc(crtc);

>>>>>>

>>>>>>           return ERR_PTR(-EPERM);

>>>>>> }

>>>>>>

>>>>>> Does "Operation not permitted" really convey the error here?  It doesn't

>>>>>> look like a permission error to me.

>>>>>>

>>>>>> Can we please avoid abusing errno codes?

>>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+

>>>>> with top commit g968edbd worked just fine for me last Friday. I ran

>>>>> several

>>>>> tests and everything checked out except the exynos-gsc lockdep issue I

>>>>> sent

>>>>> a 4.14 patch for.

>>>>>

>>>>> However, with 4.15-rc3, dmesg is gets filled with

>>>>>

>>>>> [  342.337181] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  342.337470] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  342.337851] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.382346] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.396682] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.399244] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.399496] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.399848] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.400163] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.400495] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.401294] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>> [  402.401595] [drm] Non-contiguous allocation is not supported without

>>>>> IOMMU, falling back to contiguous buffer

>>>>>

>>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

>>>>>

>>>>> I will start bisect and try to isolate the problem. I suspect this is

>>>>> related to dts

>>>>> changes perhaps? I used to this problem a while back and it has been

>>>>> fixed.

>>>>

>>>> This warning has been added intentionally, see following discussions:

>>>> https://patchwork.kernel.org/patch/10034919/

>>>> https://patchwork.kernel.org/patch/10070475/

>>>>

>>>> This means that your test apps should be updated or you should enable Exynos

>>>> IOMMU support in your config. Maybe it is a good time to finally enable it

>>>> in exynos_defconfig.

>>>>

>>> Has the issue that the boot-loader keeps the display controller

>>> enabled and scanning pages on the Exynos Chromebooks resolved?

>>>

>>> I think that's that preventing to enable it by default in

>>> exynos_defconfig since it caused boot failures when enabled on these

>>> machines. I don't follow exynos development too closely nowadays so

>>> maybe there's a fix in place now.

>>

>> Not directly. I still didn't find time to properly add support for

>> devices, which were left in-working state (with active DMA

>> transactions) by bootloader, but due to some other changes in the

>> order of operations during boot process, power domains are

>> initialized very early and due to temporary lack of devices (which

>> are not yet added to the system), are turned off. This practically

>> stops FIMD for scanning framebuffer and "solves" this issue.

>>

>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU

>> support enabled, both with v4.15-rc3 and linux-next.


Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent
a patch to do that a while back. The decision at the time to not pull
that patch is was based on systems not booting with it enabled.

Is it time to revisit that or the recommendation is for IOMMU to be
enabled in configs manually on systems it is safe to do so?

thanks,
-- Shuah
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Javier Martinez Canillas Dec. 12, 2017, 6:58 p.m. UTC | #16
Hello Shuah,

On Tue, Dec 12, 2017 at 7:26 PM, Shuah Khan <shuahkh@osg.samsung.com> wrote:

[snip]

>>>

>>> Not directly. I still didn't find time to properly add support for

>>> devices, which were left in-working state (with active DMA

>>> transactions) by bootloader, but due to some other changes in the

>>> order of operations during boot process, power domains are

>>> initialized very early and due to temporary lack of devices (which

>>> are not yet added to the system), are turned off. This practically

>>> stops FIMD for scanning framebuffer and "solves" this issue.

>>>

>>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU

>>> support enabled, both with v4.15-rc3 and linux-next.

>

> Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent

> a patch to do that a while back. The decision at the time to not pull

> that patch is was based on systems not booting with it enabled.

>

> Is it time to revisit that or the recommendation is for IOMMU to be

> enabled in configs manually on systems it is safe to do so?

>


Yes, I think it would be good to have it enabled by default if that
doesn't cause boot issues anymore.

Could you please resend your patch and cc Marek and me so we can test
on Snow and Peach Chromebooks?

> thanks,

> -- Shuah


Best regards,
Javier
--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski Dec. 13, 2017, 9:02 a.m. UTC | #17
Hi Daniel,

On 2017-12-12 19:14, Daniel Vetter wrote:
> On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski

> <m.szyprowski@samsung.com> wrote:

>> Hi All,

>>

>>

>> On 2017-12-11 23:28, Javier Martinez Canillas wrote:

>>> [adding Marek and Shuah to cc list]

>>>

>>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch>

>>> wrote:

>>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker

>>>> <guillaume.tucker@collabora.com> wrote:

>>>>> Hi Daniel,

>>>>>

>>>>> Please see below, I've had several bisection results pointing at

>>>>> that commit over the week-end on mainline but also on linux-next

>>>>> and net-next.  While the peach-pi is a bit flaky at the moment

>>>>> and is likely to have more than one issue, it does seem like this

>>>>> commit is causing some well reproducible kernel hang.

>>>>>

>>>>> Here's a re-run with v4.15-rc3 showing the issue:

>>>>>

>>>>>     https://lava.collabora.co.uk/scheduler/job/1018478

>>>>>

>>>>> and here's another one with the change mentioned below reverted:

>>>>>

>>>>>     https://lava.collabora.co.uk/scheduler/job/1018479

>>>>>

>>>>> They both show a warning about "unbalanced disables for lcd_vdd",

>>>>> I don't know if this is related as I haven't investigated any

>>>>> further.  It does appear to reliably hang with v4.15-rc3 and

>>>>> boot most of the time with the commit reverted though.

>>>>>

>>>>> The automated kernelci.org bisection is still an experimental

>>>>> tool and it may well be a false positive, so please take this

>>>>> result with a pinch of salt...

>>>> The patch just very minimal moves the connector cleanup around (so

>>>> timing change), but except when you unload a driver (or maybe that

>>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have

>>>> more info than "seems to hang a bit more" I have no idea what's wrong.

>>>> The patch itself should work, at least it survived quite some serious

>>>> testing we do on everything.

>>>> -Daniel

>>>>

>>> Marek was pointing to a different culprit [0] in this [1] thread. I

>>> see that both commits made it to v4.15-rc3, which is the first version

>>> where boot fails. So maybe is a combination of both? Or rather

>>> reverting one patch masks the error in the other.

>>>

>>> I've access to the machine but unfortunately not a lot of time to dig

>>> on this, I could try to do it in the weekend though.

>>

>> After a recent discussion on the Javier's patch:

>> https://patchwork.kernel.org/patch/10106417/

>> I've managed to reproduce this issue also on Exynos5250 based Samsung

>> Snow Chromebook and investigate a bit.

>>

>> It is caused by a deadlock in the main kernel workqueue. Here are details:

>>

>> 1. Exynos DRM fails to initialize due to missing regulators and gets moved

>> to deferred probe device list

>>

>> 2. Deferred probe is triggered and kernel "events" workqueue calls

>> deferred_probe_work_func()

>>

>> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing

>> Exynos Mixer device

>>

>> 4. error handling path is executed in exynos_drm_bind(), which calls

>> drm_mode_config_cleanup()

>>

>> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes

>> deadlock.

>>

>> Do You have idea how to fix this issue properly?

>>

>> Taking a look at git blame, this indeed shows that the issue has been

>> introduced by the commit a703c55004e1 ("drm: safely free connectors from

>> connector_ite"), which added a call to flush_scheduled_work() in

>> drm_mode_config_cleanup().

>>

>> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if

>> called from the workqueue, but I don't have idea how to check that. The

>> other way of fixing it would be to resurrect separate workqueue for DRM

>> related events.

> We need to flush the work there, or things will go wrong on unload. I

> think the real fix is to make sure that the connector cleanup work

> isn't run on the same work stuff as any driver bind stuff, which yes

> means new separate workqueue just for this.

>

> I guess the simple fix is to do that in the drm, but tbh I'm surprised

> that driver bind/deferred probe hasn't run into this problem anywhere

> else yet.


Well, this means that no-one tested the error paths in deferred probe
case. It's not that surprising, if we assume that typically platform
devices are deferred only once. Second probe() call (which is done from
workqueue) is successful in that case.

I've managed to fix this deadlock without additional workqueue:
https://patchwork.freedesktop.org/patch/193069/

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

--
To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c
index 25f4b2e..4820141 100644
--- a/drivers/gpu/drm/drm_connector.c
+++ b/drivers/gpu/drm/drm_connector.c
@@ -152,6 +152,16 @@  static void drm_connector_free(struct kref *kref)
  	connector->funcs->destroy(connector);
  }
  
+static void drm_connector_free_work_fn(struct work_struct *work)
+{
+	struct drm_connector *connector =
+		container_of(work, struct drm_connector, free_work);
+	struct drm_device *dev = connector->dev;
+
+	drm_mode_object_unregister(dev, &connector->base);
+	connector->funcs->destroy(connector);
+}
+
  /**
   * drm_connector_init - Init a preallocated connector
   * @dev: DRM device
@@ -181,6 +191,8 @@  int drm_connector_init(struct drm_device *dev,
  	if (ret)
  		return ret;
  
+	INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
+
  	connector->base.properties = &connector->properties;
  	connector->dev = dev;
  	connector->funcs = funcs;
@@ -529,6 +541,18 @@  void drm_connector_list_iter_begin(struct drm_device *dev,
  }
  EXPORT_SYMBOL(drm_connector_list_iter_begin);
  
+/*
+ * Extra-safe connector put function that works in any context. Should only be
+ * used from the connector_iter functions, where we never really expect to
+ * actually release the connector when dropping our final reference.
+ */
+static void
+drm_connector_put_safe(struct drm_connector *conn)
+{
+	if (refcount_dec_and_test(&conn->base.refcount.refcount))
+		schedule_work(&conn->free_work);
+}
+
  /**
   * drm_connector_list_iter_next - return next connector
   * @iter: connectr_list iterator
@@ -561,7 +585,7 @@  drm_connector_list_iter_next(struct drm_connector_list_iter *iter)
  	spin_unlock_irqrestore(&config->connector_list_lock, flags);
  
  	if (old_conn)
-		drm_connector_put(old_conn);
+		drm_connector_put_safe(old_conn);
  
  	return iter->conn;
  }
@@ -580,7 +604,7 @@  void drm_connector_list_iter_end(struct drm_connector_list_iter *iter)
  {
  	iter->dev = NULL;
  	if (iter->conn)
-		drm_connector_put(iter->conn);
+		drm_connector_put_safe(iter->conn);
  	lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
  }
  EXPORT_SYMBOL(drm_connector_list_iter_end);
diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c
index cda8bfa..cc78b3d 100644
--- a/drivers/gpu/drm/drm_mode_config.c
+++ b/drivers/gpu/drm/drm_mode_config.c
@@ -431,6 +431,8 @@  void drm_mode_config_cleanup(struct drm_device *dev)
  		drm_connector_put(connector);
  	}
  	drm_connector_list_iter_end(&conn_iter);
+	/* connector_iter drops references in a work item. */
+	flush_scheduled_work();
  	if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {
  		drm_connector_list_iter_begin(dev, &conn_iter);
  		drm_for_each_connector_iter(connector, &conn_iter)
diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
index df9807a..a4649c5 100644
--- a/include/drm/drm_connector.h
+++ b/include/drm/drm_connector.h
@@ -916,6 +916,14 @@  struct drm_connector {
  	uint8_t num_h_tile, num_v_tile;
  	uint8_t tile_h_loc, tile_v_loc;
  	uint16_t tile_h_size, tile_v_size;
+
+	/**
+	 * @free_work:
+	 *
+	 * Work used only by &drm_connector_iter to be able to clean up a
+	 * connector from any context.
+	 */
+	struct work_struct free_work;
  };
  
  #define obj_to_connector(x) container_of(x, struct drm_connector, base)