[v4,0/9] Add Tegra Quad SPI driver

Message ID	1608236927-28701-1-git-send-email-skomatineni@nvidia.com
Headers	Return-Path: <linux-spi-owner@kernel.org>
From: Sowjanya Komatineni <skomatineni@nvidia.com>
To: <thierry.reding@gmail.com>, <jonathanh@nvidia.com>, <broonie@kernel.org>, <robh+dt@kernel.org>, <lukas@wunner.de>
CC: <skomatineni@nvidia.com>, <bbrezillon@kernel.org>, <p.yadav@ti.com>, <tudor.ambarus@microchip.com>, <linux-spi@vger.kernel.org>, <linux-tegra@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <devicetree@vger.kernel.org>
Subject: [PATCH v4 0/9] Add Tegra Quad SPI driver
Date: Thu, 17 Dec 2020 12:28:38 -0800
Message-ID: <1608236927-28701-1-git-send-email-skomatineni@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Sowjanya Komatineni Dec. 17, 2020, 8:28 p.m. UTC

This series adds a Quad SPI driver for Tegra210, Tegra186, and Tegra194 and
enables Quad SPI on Jetson Nano and Jetson Xavier NX.

The QSPI controller is available on Tegra210, Tegra186 and Tegra194.

Tegra186 and Tegra194 have an additional feature, combined sequence mode,
where command, address and data can all be transferred in a single transfer.
Combined sequence mode is useful only when using DMA mode transfers.

This series does not include the combined sequence mode feature because the
Tegra186/Tegra194 GPCDMA driver is not upstream yet.

This series includes
- dt-binding document
- QSPI driver for Tegra210/Tegra186/Tegra194
- Device tree changes enabling QSPI on Jetson Nano and Jetson Xavier NX

Delta between patch versions:
[v4]:	Updated dummy cycles implementation based on v3 feedback
	- Added dummy_data bit field in spi_transfer to indicate that the
	  corresponding transfer is a dummy bytes transfer.
	- Updated Tegra QSPI transfer_one_message to identify dummy transfers
	  and to use the HW-supported dummy bytes transfer when the dummy cycles
	  are within the maximum HW dummy cycles supported by Tegra QSPI;
	  otherwise it falls back to transferring dummy bytes from software.
	- Updated dt-bindings based on v3 feedback.

[v3]:	v2 had some mixed patches sent out accidentally.
	v3 sends the proper patches with the fixes mentioned in v2.

[v2]:	Addressed the v1 feedback below
	- Added SPI_MASTER_USES_HW_DUMMY_CYCLES flag for controllers supporting
	  hardware dummy cycles; dummy bytes transfer from software is skipped
	  for these controllers.
	- Updated dt-binding doc with tx/rx tap delay properties.
	- Added qspi_out clock to dt-binding doc, which will be used later with
	  DDR mode support.
	- Addressed all other v1 feedback on cleanup.
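
For illustration only (this is not code from the series), the v4 fallback rule
described above can be sketched as follows. The limit
EXAMPLE_MAX_HW_DUMMY_CYCLES and both helper names are hypothetical; the real
maximum is whatever the Tegra QSPI hardware supports. The sketch assumes the
dummy phase is given in bytes and converts it to clock cycles using the bus
width.

```c
#include <stdbool.h>

/* Hypothetical limit, chosen only for this sketch. */
#define EXAMPLE_MAX_HW_DUMMY_CYCLES 255u

/*
 * Convert a dummy phase of nbytes sent on a bus that is buswidth bits wide
 * into clock cycles. Returns 0 for an invalid (zero) bus width.
 */
static unsigned int example_dummy_cycles(unsigned int nbytes,
					 unsigned int buswidth)
{
	if (buswidth == 0)
		return 0;
	return nbytes * 8u / buswidth;
}

/*
 * True when the dummy phase fits in the hardware dummy cycle counter;
 * false means the driver would fall back to sending dummy bytes from
 * software.
 */
static bool example_can_use_hw_dummy(unsigned int nbytes,
				     unsigned int buswidth)
{
	unsigned int cycles = example_dummy_cycles(nbytes, buswidth);

	return cycles > 0 && cycles <= EXAMPLE_MAX_HW_DUMMY_CYCLES;
}
```

With these assumptions, one dummy byte on a single-bit bus is 8 cycles and can
use the hardware path, while 64 dummy bytes on a single-bit bus is 512 cycles
and falls back to software.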


Sowjanya Komatineni (9):
  dt-bindings: clock: tegra: Add clock ID TEGRA210_CLK_QSPI_PM
  dt-bindings: spi: Add Tegra Quad SPI device tree binding
  MAINTAINERS: Add Tegra Quad SPI driver section
  spi: tegra210-quad: Add support for Tegra210 QSPI controller
  spi: spi-mem: Mark dummy transfers by setting dummy_data bit
  spi: tegra210-quad: Add support for hardware dummy cycles transfer
  arm64: tegra: Enable QSPI on Jetson Nano
  arm64: tegra: Add QSPI nodes on Tegra194
  arm64: tegra: Enable QSPI on Jetson Xavier NX

 .../bindings/spi/nvidia,tegra210-quad.yaml         |  117 ++
 MAINTAINERS                                        |    8 +
 .../dts/nvidia/tegra194-p3509-0000+p3668-0000.dts  |   12 +
 arch/arm64/boot/dts/nvidia/tegra194.dtsi           |   28 +
 arch/arm64/boot/dts/nvidia/tegra210-p3450-0000.dts |   12 +
 arch/arm64/boot/dts/nvidia/tegra210.dtsi           |    5 +-
 drivers/spi/Kconfig                                |    9 +
 drivers/spi/Makefile                               |    1 +
 drivers/spi/spi-mem.c                              |    1 +
 drivers/spi/spi-tegra210-quad.c                    | 1421 ++++++++++++++++++++
 include/dt-bindings/clock/tegra210-car.h           |    2 +-
 include/linux/spi/spi.h                            |    2 +
 12 files changed, 1615 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml
 create mode 100644 drivers/spi/spi-tegra210-quad.c

Pratyush Yadav Dec. 18, 2020, 9:21 a.m. UTC | #1

Hi Sowjanya,

On 17/12/20 12:28PM, Sowjanya Komatineni wrote:
> This patch marks dummy transfer by setting dummy_data bit to 1.
> 
> Controllers supporting dummy transfer by hardware use this bit field
> to skip software transfer of dummy bytes and use hardware dummy bytes
> transfer.


What is the benefit you get from this change? You add complexity in 
spi-mem and the controller driver, so that must come with some benefits. 
Here I don't see any. The transfer will certainly take the same amount 
of time because the number or period of the dummy cycles has not 
changed. So why is this needed?
 
> Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>
> ---
>  drivers/spi/spi-mem.c   | 1 +
>  include/linux/spi/spi.h | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c
> index f3a3f19..c64371c 100644
> --- a/drivers/spi/spi-mem.c
> +++ b/drivers/spi/spi-mem.c
> @@ -354,6 +354,7 @@ int spi_mem_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)
>  		xfers[xferpos].tx_buf = tmpbuf + op->addr.nbytes + 1;
>  		xfers[xferpos].len = op->dummy.nbytes;
>  		xfers[xferpos].tx_nbits = op->dummy.buswidth;
> +		xfers[xferpos].dummy_data = 1;
>  		spi_message_add_tail(&xfers[xferpos], &msg);
>  		xferpos++;
>  		totalxferlen += op->dummy.nbytes;
> diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> index aa09fdc..708f2f5 100644
> --- a/include/linux/spi/spi.h
> +++ b/include/linux/spi/spi.h
> @@ -827,6 +827,7 @@ extern void spi_res_release(struct spi_controller *ctlr,
>   *      transfer. If 0 the default (from @spi_device) is used.
>   * @bits_per_word: select a bits_per_word other than the device default
>   *      for this transfer. If 0 the default (from @spi_device) is used.
> + * @dummy_data: indicates transfer is dummy bytes transfer.
>   * @cs_change: affects chipselect after this transfer completes
>   * @cs_change_delay: delay between cs deassert and assert when
>   *      @cs_change is set and @spi_transfer is not the last in @spi_message
> @@ -939,6 +940,7 @@ struct spi_transfer {
>  	struct sg_table tx_sg;
>  	struct sg_table rx_sg;
>  
> +	unsigned	dummy_data:1;
>  	unsigned	cs_change:1;
>  	unsigned	tx_nbits:3;
>  	unsigned	rx_nbits:3;
> -- 
> 2.7.4
> 


-- 
Regards,
Pratyush Yadav
Texas Instruments India

Boris Brezillon Dec. 18, 2020, 9:57 a.m. UTC | #2

On Fri, 18 Dec 2020 14:51:08 +0530
Pratyush Yadav <p.yadav@ti.com> wrote:

> Hi Sowjanya,
> 
> On 17/12/20 12:28PM, Sowjanya Komatineni wrote:
> > This patch marks dummy transfer by setting dummy_data bit to 1.
> > 
> > Controllers supporting dummy transfer by hardware use this bit field
> > to skip software transfer of dummy bytes and use hardware dummy bytes
> > transfer.  
> 
> What is the benefit you get from this change? You add complexity in 
> spi-mem and the controller driver, so that must come with some benefits. 
> Here I don't see any. The transfer will certainly take the same amount 
> of time because the number or period of the dummy cycles has not 
> changed. So why is this needed?


Well, you don't have to queue TX bytes if you use HW-based dummy
cycles, but I agree, I'd expect the overhead to be negligible,
especially since we're talking about emitting a few bytes, not hundreds.
This being said, the complexity added to the core is reasonable IMHO,
so if it really helps reduce the CPU overhead (we might need some
numbers to prove that), I guess it's okay.


Sowjanya Komatineni Dec. 18, 2020, 6:09 p.m. UTC | #3

On 12/18/20 1:57 AM, Boris Brezillon wrote:
> On Fri, 18 Dec 2020 14:51:08 +0530
> Pratyush Yadav <p.yadav@ti.com> wrote:
>
>> Hi Sowjanya,
>>
>> On 17/12/20 12:28PM, Sowjanya Komatineni wrote:
>>> This patch marks dummy transfer by setting dummy_data bit to 1.
>>>
>>> Controllers supporting dummy transfer by hardware use this bit field
>>> to skip software transfer of dummy bytes and use hardware dummy bytes
>>> transfer.
>> What is the benefit you get from this change? You add complexity in
>> spi-mem and the controller driver, so that must come with some benefits.
>> Here I don't see any. The transfer will certainly take the same amount
>> of time because the number or period of the dummy cycles has not
>> changed. So why is this needed?
> Well, you don't have to queue TX bytes if you use HW-based dummy
> cycles, but I agree, I'd expect the overhead to be negligible,
> especially since we're talking about emitting a few bytes, not hundreds.
> This being said, the complexity added to the core is reasonable IMHO,
> so if it really helps reducing the CPU overhead (we might need some
> numbers to prove that), I guess it's okay.


The hardware dummy cycles feature of Tegra QSPI saves the software transfer
cycle that would otherwise be needed to send the dummy bytes by filling the
FIFO.

I don't have numbers, as we always use hardware dummy cycles with Tegra QSPI.


Mark Brown Dec. 18, 2020, 8:41 p.m. UTC | #4

On Sat, Dec 19, 2020 at 12:49:38AM +0530, Pratyush Yadav wrote:

> Anyway, if the SPI maintainers think this is worth it, I won't object.


This gets kind of circular, for me it's a question of if there's some
meaningful benefit from using the feature vs the cost to support it and
from the sounds of it we don't have numbers on the benefits from using
it at present.

Sowjanya Komatineni Dec. 18, 2020, 10:01 p.m. UTC | #5

On 12/18/20 12:44 PM, Mark Brown wrote:
> On Fri, Dec 18, 2020 at 08:41:02PM +0000, Mark Brown wrote:
>> On Sat, Dec 19, 2020 at 12:49:38AM +0530, Pratyush Yadav wrote:
>>> Anyway, if the SPI maintainers think this is worth it, I won't object.
>> This gets kind of circular, for me it's a question of if there's some
>> meaningful benefit from using the feature vs the cost to support it and
>> from the sounds of it we don't have numbers on the benefits from using
>> it at present.
> ...although I do have to say looking at the implementation that the cost
> seems low, it's just a flag set on an existing transfer.  The only issue
> is if we'd get more win from coalescing the entire transaction (or entire
> transmit) into a single transfer that could be DMAed and/or requires
> fewer trips through the stack which does make it seem like an unclear
> tradeoff from the point of view of client drivers

Using HW dummy cycles saves an extra software transfer cycle, which
involves transfer setup register writes, writing dummy bytes to the TX FIFO,
and interrupt processing.

Implementation-wise, it is just a single bit field added to spi_transfer;
the Tegra controller driver programs the dummy cycles with the prior transfer
and skips the SW dummy transfer, which is actually not complex.

From a quick check, a 128KB transfer using HW dummy cycles shows 18 MB/s,
while SW transfer of dummy bytes shows 17.3 MB/s on average.

When back-to-back read commands are executed, using HW dummy cycles will
definitely save cycles.
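
As a rough sketch of the idea above (programming the dummy cycles with the
prior transfer and skipping the SW dummy transfer), the model below counts how
many transfers software still has to set up for one message. It is not the
driver code: struct example_xfer, example_count_sw_transfers() and the limit
EXAMPLE_MAX_HW_DUMMY_CYCLES are all hypothetical names made up for this
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit, chosen only for this sketch. */
#define EXAMPLE_MAX_HW_DUMMY_CYCLES 255u

/* Simplified stand-in for the fields of spi_transfer used here. */
struct example_xfer {
	unsigned int len;	/* transfer length in bytes */
	unsigned int nbits;	/* bus width in bits */
	bool dummy_data;	/* set for a dummy bytes transfer */
};

/*
 * Walk a message and count the transfers that software must set up.
 * When hw_dummy is true and the next transfer is a dummy transfer that
 * fits in the hardware counter, it is folded into the current transfer
 * and skipped.
 */
static size_t example_count_sw_transfers(const struct example_xfer *x,
					 size_t n, bool hw_dummy)
{
	size_t sw = 0;

	for (size_t i = 0; i < n; i++) {
		sw++;	/* this transfer is set up by software */

		if (hw_dummy && i + 1 < n && x[i + 1].dummy_data &&
		    x[i + 1].nbits != 0) {
			unsigned int cycles =
				x[i + 1].len * 8u / x[i + 1].nbits;

			if (cycles <= EXAMPLE_MAX_HW_DUMMY_CYCLES)
				i++;	/* dummy phase handled by hardware */
		}
	}

	return sw;
}
```

For a message made of command, address, dummy and data transfers, this model
gives three software-driven transfers with hardware dummy cycles and four
without.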

Mark Brown Dec. 21, 2020, 4:50 p.m. UTC | #6

On Fri, Dec 18, 2020 at 02:01:56PM -0800, Sowjanya Komatineni wrote:

> From quick check, I see HW dummy cycles transfer of 128KB shows 18 MB/s
> while SW transfer of bytes shows 17.3 MB/s on average.


OK, it's not going to revolutionize the world or anything but that's
definitely a speedup.
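
To put a number on that, here is a small worked calculation, assuming both
throughput figures quoted above are in the same unit. The helper name
example_speedup_percent() is made up for this sketch.

```c
/* Relative gain of `fast` over `slow`, in percent. */
static double example_speedup_percent(double fast, double slow)
{
	return (fast - slow) / slow * 100.0;
}
```

Calling example_speedup_percent(18.0, 17.3) gives roughly 4 percent, since
0.7 / 17.3 is about 0.0405.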
