[00/16] Modernize rest of the krait drivers

Message ID	20220313190419.2207-1-ansuelsmth@gmail.com
Headers
Return-Path: <linux-arm-msm-owner@kernel.org>
From: Ansuel Smith <ansuelsmth@gmail.com>
To: Rob Herring <robh+dt@kernel.org>,
 Bjorn Andersson <bjorn.andersson@linaro.org>,
 Andy Gross <agross@kernel.org>,
 Michael Turquette <mturquette@baylibre.com>,
 Stephen Boyd <sboyd@kernel.org>,
 Peter De Schrijver <pdeschrijver@nvidia.com>,
 Prashant Gaikwad <pgaikwad@nvidia.com>,
 Thierry Reding <thierry.reding@gmail.com>,
 Jonathan Hunter <jonathanh@nvidia.com>,
 Ansuel Smith <ansuelsmth@gmail.com>,
 devicetree@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-arm-msm@vger.kernel.org, linux-clk@vger.kernel.org,
 linux-tegra@vger.kernel.org
Subject: [PATCH 00/16] Modernize rest of the krait drivers
Date: Sun, 13 Mar 2022 20:04:03 +0100
Message-Id: <20220313190419.2207-1-ansuelsmth@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Christian Marangi March 13, 2022, 7:04 p.m. UTC

This is a follow-up to the ipq806x gcc modernize series. It contains
many cleanup changes, and also fixes for some wrong definitions that
were noticed only thanks to all these conversions.

The first patch is an improvement of clk_hw_get_parent_index. The
original idea of clk_hw_get_parent_index was to give a way to access the
parent index, but for some reason the final version limited it to the
current parent. We change it to return the index of the current parent
if no parent is provided, and the index of the requested parent if one
is provided. Every user of this function is updated to follow the new
implementation.
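
To make the intended semantics concrete, here is a minimal user-space
sketch. It is not the kernel code: struct mock_clk_hw and
mock_get_parent_index are hypothetical names made up for illustration,
and the -1 return stands in for whatever negative errno the real
function uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct clk_hw; not the kernel type. */
struct mock_clk_hw {
	const char *name;
	struct mock_clk_hw **parents;   /* possible parents, in mux order */
	size_t num_parents;
	struct mock_clk_hw *cur_parent; /* currently selected parent, or NULL */
};

/*
 * Model of the behaviour described in the cover letter:
 *  - parent == NULL: return the index of the current parent
 *  - parent != NULL: return the index of the requested parent
 * Returns -1 if hw is NULL or the parent is not among hw's parents.
 */
int mock_get_parent_index(const struct mock_clk_hw *hw,
			  const struct mock_clk_hw *parent)
{
	size_t i;

	if (!hw)
		return -1;

	/* A clock that claims to have parents must provide the array. */
	assert(hw->num_parents == 0 || hw->parents != NULL);

	if (!parent)
		parent = hw->cur_parent;
	if (!parent)
		return -1;

	for (i = 0; i < hw->num_parents; i++)
		if (hw->parents[i] == parent)
			return (int)i;

	return -1;
}
```

With a lookup of this shape, a driver can ask for the index of a
specific parent (for example a safe parent of a mux) instead of
hardcoding the index, so a change in parent order does not silently
select the wrong input.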

Patches 2 and 3 are some additional fixes for gcc.
Patch 2 is a fix that registers the pxo and cxo fixed clocks only if
they are not defined in DTS.
Patch 3 requires some explanation. In short, it is a big HACK to prevent
a kernel panic with this series.

The kpss-xcc driver is a mess.
The Documentation declares that the clocks should be provided, but for
some reason this was never followed.
In fact, in the ipq8064 DTSI only the clocks for l2cc are declared; for
cpu0 and cpu1 the clocks are not defined.
The kpss-xcc driver uses parent_names, so the clks are ignored and never
used. Until now this wasn't a problem (ignoring the fact that the DTS
doesn't follow the documentation at all).
On top of that, the l2cc node declares the pxo clock in a really strange
way. It's declared using the PXO_SRC gcc clock, which is never defined in
the gcc ipq8064 clock table. (The correct way would have been to declare
a fixed clock in DTS and reference that.)
To prevent any kind of problem we use patch 3 to provide the clk for
PXO_SRC in the gcc clock table. We manually provide the clk after gcc
probe.

Patch 4 is just a minor cleanup where we use the poll macro.
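
For readers unfamiliar with the pattern, here is a rough user-space
sketch of what a poll-with-timeout helper does. This is not the macro
used by the patch (the kernel has a family of such helpers, e.g.
readl_poll_timeout() in linux/iopoll.h, which bound the wait by time
rather than by iteration count). poll_bits_set, reg_read_fn and
fake_pll are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register accessor; stands in for an MMIO/regmap read. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Read the register until (value & mask) is non-zero or max_polls reads
 * have been made. Returns 0 on success, -1 on timeout. The last value
 * read is stored in *last when last is not NULL.
 */
int poll_bits_set(reg_read_fn read, void *ctx, uint32_t mask,
		  unsigned int max_polls, uint32_t *last)
{
	uint32_t val = 0;
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		val = read(ctx);
		if (val & mask)
			break;
	}

	if (last)
		*last = val;

	return (val & mask) ? 0 : -1;
}

/* Fake status register: reports bit 0 after a fixed number of reads. */
struct fake_pll {
	unsigned int reads_until_lock;
	unsigned int reads;
};

uint32_t fake_pll_status(void *ctx)
{
	struct fake_pll *p = ctx;

	p->reads++;
	return p->reads >= p->reads_until_lock ? 0x1u : 0x0u;
}
```

The point of using such a helper is that the open-coded "read, test,
delay, repeat" loop and its timeout handling live in one place instead
of being repeated at each call site.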

Patch 5 is the actual kpss-xcc conversion to parent data.

Patches 6-7 should be fixups for a real corner case.

Patch 8 converts krait-cc to parent_data.
Patch 9 gives some love to the code with some minor fixups.
Patch 10 drops the hardcoded safe sel and uses the new
clk_hw_get_parent_index to get the safe parent index.
(I also discovered that the parent order was wrong.)

Patch 11 is an additional fixup to force the reset of the muxes even
more.

Patches 12-13 are some additions taken from the qsdk that were missing
in the upstream driver.

Patch 14 converts krait-cc to yaml (should I also convert the kpss-scc
driver?).

Patch 15 finally adds all this stuff to the ipq8064 dtsi (and fixes the
stupid PXO_SRC phandle).

Patch 16 converts the kpss driver to yaml and fixes some Documentation
errors.

I tested this series on an ipq8064 SoC by running a cache benchmark test
to make sure the changes are correct and we don't silently cause
regressions. I also compared the output of clk_summary every time,
and we finally have a sane output where the muxes are correctly placed
under the correct parent. (Until now we had the cpu aux clock all over
the place; it probably never caused problems, but who knows.)

Ansuel Smith (16):
  clk: permit to define a custom parent for clk_hw_get_parent_index
  clk: qcom: gcc-ipq806x: skip pxo/cxo fixed clk if already present
  clk: qcom: gcc-ipq806x: add PXO_SRC in clk table
  clk: qcom: clk-hfpll: use poll_timeout macro
  clk: qcom: kpss-xcc: convert to parent data API
  clk: qcom: clk-krait: unlock spin after mux completion
  clk: qcom: clk-krait: add hw_parent check for div2_round_rate
  clk: qcom: krait-cc: convert to parent_data API
  clk: qcom: krait-cc: drop pr_info and register qsb only if needed
  clk: qcom: krait-cc: drop hardcoded safe_sel
  clk: qcom: krait-cc: force sec_mux to QSB
  clk: qcom: clk-krait: add 8064 errata workaround
  clk: qcom: clk-krait: add enable disable ops
  dt-bindings: clock: Convert qcom,krait-cc to yaml
  dts: qcom-ipq8064: add missing krait-cc compatible and clocks
  dt-bindings: arm: msm: Convert kpss driver Documentation to yaml

 .../bindings/arm/msm/qcom,kpss-acc.txt        |  49 -----
 .../bindings/arm/msm/qcom,kpss-acc.yaml       |  97 +++++++++
 .../bindings/arm/msm/qcom,kpss-gcc.txt        |  44 ----
 .../bindings/arm/msm/qcom,kpss-gcc.yaml       |  62 ++++++
 .../bindings/clock/qcom,krait-cc.txt          |  34 ---
 .../bindings/clock/qcom,krait-cc.yaml         |  63 ++++++
 arch/arm/boot/dts/qcom-ipq8064.dtsi           |  20 +-
 drivers/clk/clk.c                             |  14 +-
 drivers/clk/qcom/clk-hfpll.c                  |  13 +-
 drivers/clk/qcom/clk-krait.c                  |  44 +++-
 drivers/clk/qcom/clk-krait.h                  |   1 +
 drivers/clk/qcom/gcc-ipq806x.c                |  27 ++-
 drivers/clk/qcom/kpss-xcc.c                   |  25 +--
 drivers/clk/qcom/krait-cc.c                   | 201 ++++++++++--------
 drivers/clk/tegra/clk-periph.c                |   2 +-
 drivers/clk/tegra/clk-sdmmc-mux.c             |   2 +-
 drivers/clk/tegra/clk-super.c                 |   4 +-
 include/linux/clk-provider.h                  |   2 +-
 18 files changed, 448 insertions(+), 256 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/arm/msm/qcom,kpss-acc.txt
 create mode 100644 Documentation/devicetree/bindings/arm/msm/qcom,kpss-acc.yaml
 delete mode 100644 Documentation/devicetree/bindings/arm/msm/qcom,kpss-gcc.txt
 create mode 100644 Documentation/devicetree/bindings/arm/msm/qcom,kpss-gcc.yaml
 delete mode 100644 Documentation/devicetree/bindings/clock/qcom,krait-cc.txt
 create mode 100644 Documentation/devicetree/bindings/clock/qcom,krait-cc.yaml

Stephen Boyd March 15, 2022, 9:34 p.m. UTC | #1

Quoting Ansuel Smith (2022-03-14 05:43:20)
> On Mon, Mar 14, 2022 at 11:20:21AM +0300, Dmitry Baryshkov wrote:
> > On 13/03/2022 22:04, Ansuel Smith wrote:
> > > Add 8064 errata workaround where the sec_src clock gating needs to be
> > 
> > Could you please be more specific whether the errata applies only to the
> > ipq8064 or to the apq8064 too? 8064 is not specific enough.
> >
> 
> That's a good question... Problem is that we really don't know the
> answer. This errata comes from qsdk on an old sourcecode. I assume this
> is specific to ipq8064 and apq8064 have different mux configuration.
> 

I think it was some glitch that happened when the automatic clk gating
was enabled during a switch. The automatic clk gating didn't know that
software was running and switching the input so it killed the CPU and
stopped the clk. That led to hangs and super badness. I assume it was
applicable to apq8064 as well because ipq8064 is basically apq8064 with
the multimedia subsystem replaced by the networking subsystem. Also I
wouldn't remember all these details because I worked on apq8064 but not
so much on ipq8064 :)

Stephen Boyd March 15, 2022, 10:41 p.m. UTC | #2

Quoting Ansuel Smith (2022-03-15 14:47:56)
> On Tue, Mar 15, 2022 at 02:34:30PM -0700, Stephen Boyd wrote:
> > Quoting Ansuel Smith (2022-03-14 05:43:20)
> > > On Mon, Mar 14, 2022 at 11:20:21AM +0300, Dmitry Baryshkov wrote:
> > > > On 13/03/2022 22:04, Ansuel Smith wrote:
> > > > > Add 8064 errata workaround where the sec_src clock gating needs to be
> > > > 
> > > > Could you please be more specific whether the errata applies only to the
> > > > ipq8064 or to the apq8064 too? 8064 is not specific enough.
> > > >
> > > 
> > > That's a good question... Problem is that we really don't know the
> > > answer. This errata comes from qsdk on an old sourcecode. I assume this
> > > is specific to ipq8064 and apq8064 have different mux configuration.
> > > 
> > 
> > I think it was some glitch that happened when the automatic clk gating
> > was enabled during a switch. The automatic clk gating didn't know that
> > software was running and switching the input so it killed the CPU and
> > stopped the clk. That lead to hangs and super badness. I assume it was
> > applicable to apq8064 as well because ipq8064 is basically apq8064 with
> > the multimedia subsystem replaced by the networking subsystem. Also I
> > wouldn't remember all these details because I worked on apq8064 but not
> > so much on ipq8064 :)
> 
> Honest question. Do you remember other glitch present on the platform?
> We are trying to bisect an instability problem and we still needs to
> find the reason. We really can't understand if it's just a power
> delivery problem or a scaling problem from muxes or other things.
> 
> The current problem is that after some time the device kernel panics
> with a number of strange reason like invalid kernel paging and other
> strange (or the device just freze and reboots, not even a crash log)
> Many kernel panics reports the crash near the mux switch (like random
> error right before the mux switch) So I suspect there is a problem
> there. But due to the fact that is very random we have NO exact way to
> repro it. I manage sometime, while playing with the code, to repo
> similar kernel crash but still i'm not sure of the real cause.
> 
> I know it's OT but do you have any idea about it? If you remember
> anything about it?
> (To scale the freq i'm using a dedicated cpufreq driver that works this
> way:
> - We first scale the cache to the max freq across all core, we set the
>   voltage
> - We scale the cpu to the correct target.
> This is all done under a lock. Do you see anything wrong in this logic?

I honestly don't remember much anymore about this. It's been a decade.
Scaling the cache used to be an independent clk and operation vs. the
CPU. Basically the clk domain and power domain for the cache were
separate from the CPU. There's also the fuse stuff that means you have
to read the fuse to know what OPP table to use. Otherwise you may be
overclocking the CPU or undervolting it. It may also be that cpuidle
can't happen during a frequency transition. Otherwise the clk gating
will be reenabled when the cpu startup code reinitializes all the cpu
registers? I'd have to look through some old vendor kernels to see if
anything jogs my memory.

> To mee these random crash looks to be really related to something wrong
> with the mux or with the cache set to a wrong state)
> 
> Thx for any suggestion about this.
> (also I will update this commit and mention both apq and ipq in the
> comments)

Stephen Boyd March 15, 2022, 10:45 p.m. UTC | #3

Quoting Ansuel Smith (2022-03-13 12:04:12)
> Drop pr_info and change them to dev_info.

Replace pr_info() with dev_info() to provide better diagnostics.

> Register qsb fixed clk only if it's not declared in DTS.
> Also reorganize variable order.

Please don't reorganize variable order.

Christian Marangi March 16, 2022, 3:46 p.m. UTC | #4

On Tue, Mar 15, 2022 at 03:41:14PM -0700, Stephen Boyd wrote:
> Quoting Ansuel Smith (2022-03-15 14:47:56)
> > On Tue, Mar 15, 2022 at 02:34:30PM -0700, Stephen Boyd wrote:
> > > Quoting Ansuel Smith (2022-03-14 05:43:20)
> > > > On Mon, Mar 14, 2022 at 11:20:21AM +0300, Dmitry Baryshkov wrote:
> > > > > On 13/03/2022 22:04, Ansuel Smith wrote:
> > > > > > Add 8064 errata workaround where the sec_src clock gating needs to be
> > > > > 
> > > > > Could you please be more specific whether the errata applies only to the
> > > > > ipq8064 or to the apq8064 too? 8064 is not specific enough.
> > > > >
> > > > 
> > > > That's a good question... Problem is that we really don't know the
> > > > answer. This errata comes from qsdk on an old sourcecode. I assume this
> > > > is specific to ipq8064 and apq8064 have different mux configuration.
> > > > 
> > > 
> > > I think it was some glitch that happened when the automatic clk gating
> > > was enabled during a switch. The automatic clk gating didn't know that
> > > software was running and switching the input so it killed the CPU and
> > > stopped the clk. That lead to hangs and super badness. I assume it was
> > > applicable to apq8064 as well because ipq8064 is basically apq8064 with
> > > the multimedia subsystem replaced by the networking subsystem. Also I
> > > wouldn't remember all these details because I worked on apq8064 but not
> > > so much on ipq8064 :)
> > 
> > Honest question. Do you remember other glitch present on the platform?
> > We are trying to bisect an instability problem and we still needs to
> > find the reason. We really can't understand if it's just a power
> > delivery problem or a scaling problem from muxes or other things.
> > 
> > The current problem is that after some time the device kernel panics
> > with a number of strange reason like invalid kernel paging and other
> > strange (or the device just freze and reboots, not even a crash log)
> > Many kernel panics reports the crash near the mux switch (like random
> > error right before the mux switch) So I suspect there is a problem
> > there. But due to the fact that is very random we have NO exact way to
> > repro it. I manage sometime, while playing with the code, to repo
> > similar kernel crash but still i'm not sure of the real cause.
> > 
> > I know it's OT but do you have any idea about it? If you remember
> > anything about it?
> > (To scale the freq i'm using a dedicated cpufreq driver that works this
> > way:
> > - We first scale the cache to the max freq across all core, we set the
> >   voltage
> > - We scale the cpu to the correct target.
> > This is all done under a lock. Do you see anything wrong in this logic?
> 
> I honestly don't remember much anymore about this. It's been a decade.
> Scaling the cache used to be an independent clk and operation vs. the
> CPU. Basically the clk domain and power domain for the cache was
> separate from the CPU. There's also the fuse stuff that means you have
> to read the fuse to know what OPP table to use. Otherwise you may be
> overclocking the CPU or undervolting it. It may also be that cpuidle
> can't happen during a frequency transition. Otherwise the clk gating
> will be reenabled when the cpu startup code reinitializes all the cpu
> registers? I'd have to look through some old vendor kernels to see if
> anything jogs my memory.
> 
> > To mee these random crash looks to be really related to something wrong
> > with the mux or with the cache set to a wrong state)
> > 
> > Thx for any suggestion about this.
> > (also I will update this commit and mention both apq and ipq in the
> > comments)

Hi, I'm checking the spm qcom idle driver and something doesn't look
right to me... Aside from the different sequence used for the boot cpu
and the absent l2 sequence, it looks to me like WFI is enabled anyway
(even if it's not defined in the DTS or is set disabled), and on top of
that it looks like we overwrite the WFI logic but actually set it to
enter power collapse (spc). Why?

Also I think we are missing the assembly code to enter wfi on krait cpu.
Am I totally confused, or are there some problems in the code that
nobody noticed?

Stephen Boyd March 17, 2022, 7:34 p.m. UTC | #5

Quoting Ansuel Smith (2022-03-16 08:46:54)
> On Tue, Mar 15, 2022 at 03:41:14PM -0700, Stephen Boyd wrote:
> > Quoting Ansuel Smith (2022-03-15 14:47:56)
> > > On Tue, Mar 15, 2022 at 02:34:30PM -0700, Stephen Boyd wrote:
> > > > Quoting Ansuel Smith (2022-03-14 05:43:20)
> > > > > On Mon, Mar 14, 2022 at 11:20:21AM +0300, Dmitry Baryshkov wrote:
> > > > > > On 13/03/2022 22:04, Ansuel Smith wrote:
> > > > > > > Add 8064 errata workaround where the sec_src clock gating needs to be
> > > > > > 
> > > > > > Could you please be more specific whether the errata applies only to the
> > > > > > ipq8064 or to the apq8064 too? 8064 is not specific enough.
> > > > > >
> > > > > 
> > > > > That's a good question... Problem is that we really don't know the
> > > > > answer. This errata comes from qsdk on an old sourcecode. I assume this
> > > > > is specific to ipq8064 and apq8064 have different mux configuration.
> > > > > 
> > > > 
> > > > I think it was some glitch that happened when the automatic clk gating
> > > > was enabled during a switch. The automatic clk gating didn't know that
> > > > software was running and switching the input so it killed the CPU and
> > > > stopped the clk. That lead to hangs and super badness. I assume it was
> > > > applicable to apq8064 as well because ipq8064 is basically apq8064 with
> > > > the multimedia subsystem replaced by the networking subsystem. Also I
> > > > wouldn't remember all these details because I worked on apq8064 but not
> > > > so much on ipq8064 :)
> > > 
> > > Honest question. Do you remember other glitch present on the platform?
> > > We are trying to bisect an instability problem and we still needs to
> > > find the reason. We really can't understand if it's just a power
> > > delivery problem or a scaling problem from muxes or other things.
> > > 
> > > The current problem is that after some time the device kernel panics
> > > with a number of strange reason like invalid kernel paging and other
> > > strange (or the device just freze and reboots, not even a crash log)
> > > Many kernel panics reports the crash near the mux switch (like random
> > > error right before the mux switch) So I suspect there is a problem
> > > there. But due to the fact that is very random we have NO exact way to
> > > repro it. I manage sometime, while playing with the code, to repo
> > > similar kernel crash but still i'm not sure of the real cause.
> > > 
> > > I know it's OT but do you have any idea about it? If you remember
> > > anything about it?
> > > (To scale the freq i'm using a dedicated cpufreq driver that works this
> > > way:
> > > - We first scale the cache to the max freq across all core, we set the
> > >   voltage
> > > - We scale the cpu to the correct target.
> > > This is all done under a lock. Do you see anything wrong in this logic?
> > 
> > I honestly don't remember much anymore about this. It's been a decade.
> > Scaling the cache used to be an independent clk and operation vs. the
> > CPU. Basically the clk domain and power domain for the cache was
> > separate from the CPU. There's also the fuse stuff that means you have
> > to read the fuse to know what OPP table to use. Otherwise you may be
> > overclocking the CPU or undervolting it. It may also be that cpuidle
> > can't happen during a frequency transition. Otherwise the clk gating
> > will be reenabled when the cpu startup code reinitializes all the cpu
> > registers? I'd have to look through some old vendor kernels to see if
> > anything jogs my memory.
> > 
> > > To mee these random crash looks to be really related to something wrong
> > > with the mux or with the cache set to a wrong state)
> > > 
> > > Thx for any suggestion about this.
> > > (also I will update this commit and mention both apq and ipq in the
> > > comments)
> 
> Hi, i'm checking the spm qcom idle driver and something doesn't look
> right to me... Aside from the different sequence used for boot cpu and
> the abset l2 sequence, it looks like to me that WFI is enabled anyway
> (even if it's not defined in the DTS or set disabled) and on top of that
> it looks like we overwrite the WFI logic but we actually set to
> enter power collapse (spc). Why?

When the CPU is power collapsed they need to notify software running in
the secure world that the CPU is going to be reset. The CPU comes out of
reset in secure mode and it has to jump to non-secure mode. It's still a
WFI, but we don't see it in the kernel because the secure world code
executes the wfi and that runs the power collapse sequence to turn all
the power off. On power up the secure world will restore various cpu
registers (*cough* workarounds *cough*) and then switch to non-secure
mode wherever linux told it to execute at on warm boot.

> 
> Also I think we are missing the assembly code to enter wfi on krait cpu.
> Am I totally confused or there are some problems in the code that nobody
> notice?
> 

I'd expect that to run through some scm_call() path into the secure
world. The wfi can still be run by the kernel in non-secure mode, but
that will only gate the CPU clk and not actually power collapse the
core. It's a "light sleep" for the CPU. All this stuff predates PSCI but
it is very similar, just a bespoke solution instead of a standard
calling format.
