
riscv: set the width of the physical address/size data type based on arch

Message ID 20250506092401.646595-1-sughosh.ganu@linaro.org
State New
Series riscv: set the width of the physical address/size data type based on arch

Commit Message

Sughosh Ganu May 6, 2025, 9:24 a.m. UTC
U-Boot supports both 32-bit and 64-bit RISC-V platforms. Set the width
of the phys_{addr,size}_t data types based on the register size of the
architecture.

Currently, even 32-bit RISC-V platforms use 64-bit phys_{addr,size}_t
data types. This causes issues on 32-bit platforms, where the upper 32
bits of variables of these types can contain junk data, which leads to
all kinds of side effects.

This was discovered on the QEMU RISC-V 32-bit platform when checking
the return value of an LMB API: an LMB allocation that should not have
failed was failing because the upper 32 bits of the address variable
contained garbage.

Signed-off-by: Sughosh Ganu <sughosh.ganu@linaro.org>
---

Note:
Although the LMB API cleanup series depends on this patch, I am
sending it separately so that the RISC-V maintainers notice it. A patch
may not get the required attention when it is sent as part of a
seemingly unrelated series.


 arch/riscv/include/asm/types.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Heinrich Schuchardt May 6, 2025, 9:49 a.m. UTC | #1
On 5/6/25 11:24, Sughosh Ganu wrote:
> U-Boot has support for both the 32-bit and 64-bit RiscV platforms. Set
> the width of the phys_{addr,size}_t data types based on the register
> size of the architecture.
> 
> Currently, even the 32-bit RiscV platforms have a 64-bit
> phys_{addr,size}_t data types. This causes issues on the 32-bit
> platforms, where the upper 32-bits of the variables of these types
> can have junk data, and that can cause all kinds of side-effects.

How could the upper 32 bits contain junk data?

When a value is converted from a narrower unsigned type, the compiler
should fill the upper bits with zeros.
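You can check this point with a small standalone C sketch. It is not U-Boot code: plain `<stdint.h>` types stand in for U-Boot's u32/u64, and the helper name is hypothetical. Converting an unsigned 32-bit value to a 64-bit unsigned type always zero-extends, so a value conversion on its own cannot leave junk in the upper half.

```c
#include <assert.h>   /* assert() for the usage checks */
#include <stdint.h>

/* Stand-ins for U-Boot's fixed-width types (assumption for this sketch). */
typedef uint32_t u32;
typedef uint64_t u64;

/*
 * Passing a 32-bit unsigned value where a 64-bit unsigned value is
 * expected is a value conversion: C guarantees the result equals the
 * original value, so bits 63..32 are zero.
 */
static u64 widen_by_value(u32 addr)
{
	return addr;
}
```

If junk shows up in bits 63..32, the value was most likely not produced by such a conversion.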

> 
> This was discovered on the qemu Riscv 32-bit platform  when the return
> value of an LMB API was checked, and some LMB allocation that ought
> not to have failed, was failing. The upper 32-bits of the address
> variable contained garbage, resulting in failures.
> 
> Signed-off-by: Sughosh Ganu <sughosh.ganu@linaro.org>
> ---
> 
> Note:
> Although the LMB API cleanup series depends on this patch, I am
> sending it separately so that it gets noticed by the RiscV
> maintainers. Sometimes a patch may not get the required attention when
> sent as part of another seemingly unrelated series.
> 
> 
>   arch/riscv/include/asm/types.h | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/types.h b/arch/riscv/include/asm/types.h
> index 49f7a5d6b3a..45d806c83eb 100644
> --- a/arch/riscv/include/asm/types.h
> +++ b/arch/riscv/include/asm/types.h
> @@ -35,8 +35,13 @@ typedef u64 dma_addr_t;
>   typedef u32 dma_addr_t;
>   #endif
>   
> -typedef unsigned long long phys_addr_t;
> -typedef unsigned long long phys_size_t;
> +#ifdef CONFIG_PHYS_64BIT
> +typedef u64 phys_addr_t;
> +typedef u64 phys_size_t;
> +#else
> +typedef u32 phys_addr_t;
> +typedef u32 phys_size_t;

This matches what has been done for ARM.

86c915628d58 ("riscv: Change phys_addr_t and phys_size_t to 64-bit")
changed phys_addr_t to 64 bits to support 34-bit physical addresses
(similar to LPAE on arm32).

I don't think that we need 34-bit support currently. But I would prefer
to find the root cause of why the upper 32 bits get corrupted, as this
might point to a generic problem.

Best regards

Heinrich

> +#endif
>   
>   #endif /* __KERNEL__ */
>
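Here is some context on the 34-bit figure. Under the Sv32 scheme in the RISC-V privileged specification, a page-table entry holds a 22-bit physical page number. With 4 KiB pages, the resulting physical address is 34 bits wide. The sketch below shows that arithmetic for a 4 KiB leaf page only; it ignores superpages and PTE flag bits. The helper name is hypothetical.

```c
#include <assert.h>   /* assert() for the usage checks */
#include <stdint.h>

/*
 * Hypothetical helper: physical address of a byte within a 4 KiB page
 * mapped by an Sv32 leaf PTE. The PPN occupies PTE bits 31..10 (22 bits);
 * shifting it left by the 12-bit page offset width gives a 34-bit physical
 * address, which does not fit in a 32-bit phys_addr_t.
 */
static uint64_t sv32_leaf_pa(uint32_t pte, uint32_t page_offset)
{
	uint64_t ppn = pte >> 10;   /* 22-bit physical page number */

	return (ppn << 12) | (page_offset & 0xfffu);
}
```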
Sughosh Ganu May 6, 2025, 10:49 a.m. UTC | #2
On Tue, 6 May 2025 at 15:19, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
>
> On 5/6/25 11:24, Sughosh Ganu wrote:
> > U-Boot has support for both the 32-bit and 64-bit RiscV platforms. Set
> > the width of the phys_{addr,size}_t data types based on the register
> > size of the architecture.
> >
> > Currently, even the 32-bit RiscV platforms have a 64-bit
> > phys_{addr,size}_t data types. This causes issues on the 32-bit
> > platforms, where the upper 32-bits of the variables of these types
> > can have junk data, and that can cause all kinds of side-effects.
>
> How could it be that the upper 32-bit have junk data?
>
> When we convert from a shorter variable the compiler should fill the
> upper bits with zero.

That does not seem to be happening. The efi_fit test fails on the
qemu-riscv32 platform when attempting to boot the OS from the FIT
image.

These are the values of the base address that I see in the
_lmb_alloc_addr() function.

_lmb_alloc_addr: 755, rgn => -1, base => 0x1a1c0e00802000bc, size => 0x50b1

The actual value that gets passed is 0x802000bc, but the upper 32 bits
are not zeroed out, which causes _lmb_alloc_addr() to return an error.
You can check my branch [1], where I have added a temporary commit that
prints the values causing the issue.
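Splitting the logged base value into its two 32-bit halves makes the symptom explicit. The low word is exactly the address that was passed, and only the high word is unexpected. Here is a minimal sketch; the helper names are illustrative.

```c
#include <assert.h>   /* assert() for the usage checks */
#include <stdint.h>

/* Low and high 32-bit halves of a 64-bit phys_addr_t-style value. */
static uint32_t low_word(uint64_t v)
{
	return (uint32_t)v;
}

static uint32_t high_word(uint64_t v)
{
	return (uint32_t)(v >> 32);
}
```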

-sughosh

[1] - https://source.denx.de/u-boot/contributors/sng/u-boot/-/commits/lmb_apis_cleanup_bisectable_v2

Heinrich Schuchardt May 6, 2025, 11:05 a.m. UTC | #3
Sughosh Ganu <sughosh.ganu@linaro.org> wrote on Tue, 6 May 2025, 12:50:

> That does not seem to be happening. The efi_fit test fails on the
> qemu-riscv32 platform, when attempting to boot the OS from the FIT
> image.
>
> These are the values of the base address that I see in the
> _lmb_alloc_addr() function.
>
> _lmb_alloc_addr: 755, rgn => -1, base => 0x1a1c0e00802000bc, size => 0x50b1
>

As you are running on QEMU, you should be able to use gdb to track down
where the value is actually assigned. This could, for instance, be a
buffer overrun.

If we don't correct the root cause, it will hit us again.

Best regards

Heinrich


Sughosh Ganu May 6, 2025, 11:13 a.m. UTC | #4
On Tue, 6 May 2025 at 16:35, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
> As you are running on QEMU you should be able to track down where the value is actually assigned with gdb. This could for instance be a buffer overrun.
>
> If we don't correct the root cause, it will hit us again.

How would we hit this issue if phys_addr_t is set to a 32-bit type? We
can still investigate why the compiler does not set the upper 32 bits
to 0. But given your earlier comment that LPAE-style 34-bit addressing
is not currently needed, shouldn't the change proposed by this patch
work? What issue do you see with a 32-bit address space?

-sughosh

Sughosh Ganu May 7, 2025, 7:49 a.m. UTC | #5
On Tue, 6 May 2025 at 16:35, Heinrich Schuchardt
<heinrich.schuchardt@canonical.com> wrote:
> As you are running on QEMU you should be able to track down where the value is actually assigned with gdb. This could for instance be a buffer overrun.

I was able to hook up gdb and recreate the issue. What I observe is
that when lmb_allocate_mem() is called, the 64-bit base address
parameter holds a value whose upper 32 bits are not zeroed out. So
this looks like a compiler issue. FWIW, this shows up both with the
compiler used in the CI environment and with the one I am using.

/toolchains/riscv32-linux/bin/riscv32-linux-gcc --version
riscv32-linux-gcc (GCC) 13.2.0

/toolchains/riscv32-linux/bin/riscv32-linux-ld --version
GNU ld (GNU Binutils) 2.41

Given that LPAE is not being used on riscv32 platforms in U-Boot, I
think this patch should be considered. Thanks.

-sughosh

Sughosh Ganu May 7, 2025, 9:41 a.m. UTC | #6
On Wed, 7 May 2025 at 13:19, Sughosh Ganu <sughosh.ganu@linaro.org> wrote:
> I was able to hook up gdb and re-create the issue. What I observe is
> that when the lmb_allocate_mem() function is called, the base address
> parameter, which is 64-bits, shows a value with the upper 32-bits not
> zeroed out. So, this looks like a compiler issue, where the upper
> 32-bits are not being zeroed out. Fwiw, this shows up with the
> compiler being used in the CI environment, as well as the one that I
> am using.

Thinking a bit more about this, I don't think this is a compiler issue.
The problem is that we use the ulong type in some places (especially in
the boot* commands) to store addresses, while we use phys_addr_t in
others. A pointer to the variable is passed across functions. When the
pointed-to type changes from a 32-bit ulong to a 64-bit phys_addr_t,
the callee reads 64 bits, and the upper 32 bits come from whatever
memory follows the ulong. So the issue is that we mix ulong and
phys_addr_t for storing addresses.
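The mechanism described above can be modelled in portable C. This is a sketch under stated assumptions, not U-Boot code:

- The struct layout stands in for whatever the compiler happens to place after a 32-bit ulong.
- memcpy() replaces the pointer cast, so the model itself has no undefined behaviour.
- The expected result assumes a little-endian target, which is what RISC-V uses.

```c
#include <assert.h>   /* assert() for the usage checks */
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical memory layout on RV32: a 32-bit ulong holding the address,
 * followed by an unrelated 32-bit value (e.g. another local variable).
 */
struct rv32_frame {
	uint32_t addr;       /* ulong written by the caller */
	uint32_t neighbour;  /* whatever happens to follow it in memory */
};

/*
 * Models a callee that is handed &frame->addr but treats it as a pointer
 * to a 64-bit phys_addr_t: it reads 8 bytes instead of 4. On a
 * little-endian target the neighbouring word ends up in bits 63..32.
 */
static uint64_t read_as_phys_addr(const struct rv32_frame *frame)
{
	uint64_t wide;

	memcpy(&wide, frame, sizeof(wide));
	return wide;
}
```

With `addr = 0x802000bc` and `neighbour = 0x1a1c0e00`, the model yields the same 0x1a1c0e00802000bc pattern seen in the debug log.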

But I think that the solution (at least for now) is to set phys_addr_t
based on the underlying architecture. In the long run, the use of
ulong for storing addresses needs to be audited and changed to
phys_addr_t.

-sughosh

Tom Rini May 7, 2025, 3:48 p.m. UTC | #7
On Wed, May 07, 2025 at 03:11:38PM +0530, Sughosh Ganu wrote:
> Thinking a bit on this, I don't think this is a compiler issue. The
> problem is that we are using the ulong type in some places(especially
> in the boot* commands) for storing the address values, while we use
> phys_addr_t in other places. And because this is a pointer being
> passed across functions, when the data-type that the pointer is
> pointing to changes from a 32-bit to 64-bit value, the upper 32-bits
> get considered. So the issue is that we use ulong in some places, and
> phys_addr_t in others for storing the addresses.
> 
> But I think that the solution for this(at least for now) is to set
> phys_addr_t based on the underlying architecture. In the long run,
> there needs to be an audit of the usage of ulong for storing
> addresses, and that needs to be changed to phys_addr_t.

Thanks for digging in to this more. I agree with what you're saying here
for both the short and long term.
Sughosh Ganu May 7, 2025, 6:37 p.m. UTC | #8
On Wed, 7 May 2025 at 21:18, Tom Rini <trini@konsulko.com> wrote:
> Thanks for digging in to this more. I agree with what you're saying here
> for both the short and long term.

Heinrich and I discussed this on IRC. For the short term, we decided
instead to copy the ulong values into local variables of type
phys_addr_t before calling the LMB API. This approach also works for
now. Heinrich prefers not to change the riscv32 header, since the
maintainers consider u64 appropriate for phys_addr_t. I will make this
change in the next version of the LMB API series. I will be on leave
next week and will send v2 once I am back. Thanks.
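The short-term approach agreed on IRC might look roughly like the sketch below: the ulong value is copied into a local phys_addr_t before the LMB API is called. Note the assumptions:

- lmb_api_stub() and call_lmb_with_ulong() are hypothetical names, not the real LMB interface.
- Copying the result back is an assumption for APIs that update the address.

```c
#include <assert.h>   /* assert() for the usage checks */
#include <stdint.h>

typedef unsigned long ulong;
typedef uint64_t phys_addr_t;   /* riscv32 keeps the 64-bit definition */

/* Hypothetical stand-in for an LMB call taking a phys_addr_t pointer. */
static int lmb_api_stub(phys_addr_t *addr)
{
	/* Illustrative check only: fail if anything is set in bits 63..32. */
	return (*addr >> 32) ? -1 : 0;
}

/*
 * Copy the ulong into a correctly sized local first, so the callee reads
 * a properly zero-extended 64-bit value instead of reinterpreting 32-bit
 * storage.
 */
static int call_lmb_with_ulong(ulong *load_addr)
{
	phys_addr_t addr = *load_addr;   /* value conversion zero-extends */
	int ret = lmb_api_stub(&addr);

	if (!ret)
		*load_addr = (ulong)addr;    /* copy back (assumption) */
	return ret;
}
```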

-sughosh


Patch

diff --git a/arch/riscv/include/asm/types.h b/arch/riscv/include/asm/types.h
index 49f7a5d6b3a..45d806c83eb 100644
--- a/arch/riscv/include/asm/types.h
+++ b/arch/riscv/include/asm/types.h
@@ -35,8 +35,13 @@ typedef u64 dma_addr_t;
 typedef u32 dma_addr_t;
 #endif
 
-typedef unsigned long long phys_addr_t;
-typedef unsigned long long phys_size_t;
+#ifdef CONFIG_PHYS_64BIT
+typedef u64 phys_addr_t;
+typedef u64 phys_size_t;
+#else
+typedef u32 phys_addr_t;
+typedef u32 phys_size_t;
+#endif
 
 #endif /* __KERNEL__ */