mbox series

[RFC,0/9] net: dsa: microchip: PTP support for KSZ956x

Message ID 20201019172435.4416-1-ceggers@arri.de
Headers show
Series net: dsa: microchip: PTP support for KSZ956x | expand

Message

Christian Eggers Oct. 19, 2020, 5:24 p.m. UTC
This series adds support for PTP to the KSZ956x and KSZ9477 devices.

1/9: Convert device tree binding from .txt to .yaml
2/9: Split ksz_common.h, so it can be included in tag_ksz.c
3/9: ksz9477.c --> ksz9477_main.c (ksz9477_ptp.c will be added soon)
4/9: Add dt-bindings for interrupts
5/9: Infrastructure for interrupts
6/9: Posix clock routines for chip PTP clock
7/9: Support for hardware time stamping
8/9: Support for PPS
9/9: Support for perout

There is only little documentation for PTP available on the data sheet
[1] (more or less only the register reference). Questions to the
Microchip support were seldom answered comprehensively or in reasonable
time. So this is more or less the result of reverse engineering.

[1]
http://ww1.microchip.com/downloads/en/DeviceDoc/KSZ9563R-Data-Sheet-DS00002419D.pdf

Comments

Christian Eggers Oct. 22, 2020, 7:30 a.m. UTC | #1
Hi Richard,

On Thursday, 22 October 2020, 04:42:01 CEST, Richard Cochran wrote:
> I'm just catching up with this.
> 
> Really. Truly. Please -- Include the maintainer on CC for such patches!
sorry for missing you on the recipients list. I blindly trusted the output of
get_maintainer.pl.

I recently sent two other patches which may also be of interest for you. They
related to handling of SO_TIMESTAMPING on 32 bit platforms with newer C 
libraries:

https://patchwork.ozlabs.org/project/netdev/patch/20201012093542.15504-1-ceggers@arri.de/
https://patchwork.ozlabs.org/project/netdev/patch/20201012093542.15504-2-ceggers@arri.de/

> On Thu, Oct 22, 2020 at 02:39:35AM +0300, Vladimir Oltean wrote:
> > On Mon, Oct 19, 2020 at 07:24:33PM +0200, Christian Eggers wrote:
> > > The PTP hardware performs internal detection of PTP frames (likely
> > > similar as ptp_classify_raw() and ptp_parse_header()). As these filters
> > > cannot be disabled, the current delay mode (E2E/P2P) and the clock mode
> > > (master/slave) must be configured via sysfs attributes.
> 
> This is a complete no-go.  NAK.
I didn't design the hardware nor do I have access to adequate documentation.
I will try to figure out what functionality is concretely affected by these
two settings.

If I am correct, the KSZ hardware consists of two main building blocks:
1. A TC on the switch path.
2. An OC on the DSA host port.
Vladimir Oltean Oct. 22, 2020, 11:32 a.m. UTC | #2
On Thu, Oct 22, 2020 at 01:11:40PM +0200, Christian Eggers wrote:
> Hi Vladimir,

> 

> On Thursday, 22 October 2020, 12:50:14 CEST, Vladimir Oltean wrote:

> after applying the RX timestamp correctly to the correction field (shifting

> the nanoseconds by 16),


That modification should have been done anyway, since the unit of
measurement for correctionField is scaled ppb (48 bits nanoseconds, 16
bits scaled nanoseconds), and not nanoseconds.

> it seems that "moving" the timestamp back to the tail tag on TX is not

> required anymore. Keeping the RX timestamp simply in the correction

> field (negative value), works fine now. So this halves the effort in

> the tag_ksz driver.


Ok, this makes sense.
Depending on what Richard responds, it now looks like the cleanest
approach would be to move your implementation that is currently in
ksz9477_update_ptp_correction_field() into a generic function called

static inline void ptp_onestep_p2p_move_t2_to_correction(struct sk_buff *skb,
							 unsigned int ptp_type,
							 struct ptp_header *ptp_header,
							 ktime_t t2)

You should then clearly document that this function is needed for
hardware capable of one-step P2P that does not already modify the
correctionField of Pdelay_Req event messages on ingress.
Richard Cochran Oct. 22, 2020, 2:34 p.m. UTC | #3
On Thu, Oct 22, 2020 at 02:32:43PM +0300, Vladimir Oltean wrote:
> On Thu, Oct 22, 2020 at 01:11:40PM +0200, Christian Eggers wrote:
> 
> > it seems that "moving" the timestamp back to the tail tag on TX is not
> > required anymore. Keeping the RX timestamp simply in the correction
> > field (negative value), works fine now. So this halves the effort in
> > the tag_ksz driver.
> 
> Ok, this makes sense.
> Depending on what Richard responds, it now looks like the cleanest
> approach would be to move your implementation that is currently in
> ksz9477_update_ptp_correction_field() into a generic function called

+1
Christian Eggers Nov. 1, 2020, 9:35 a.m. UTC | #4
Hi Vladimir,

On Friday, 30 October 2020, 19:24:47 CET, Vladimir Oltean wrote:
> On Thu, Oct 22, 2020 at 12:17:48PM +0200, Christian Eggers wrote:
> > I tried to study the effect of setting the ocmode bit on the KSZ either to
> > master or to slave. The main visible change is, that some PTP message
> > types
> > are be filtered out on RX:
> > - in "master" mode, "Sync" messages from other nodes will not be received
> > (but everything else like "Announce" seem to work)
> > - in "slave" mode, "Delay_Req" messages from other nodes will not be
> > received
> Could you dump the contents of your REG_PTP_MSG_CONF2 register?
runtime register value is 0x1004 (matches default value from the data sheet).
The Linux driver doesn't touch this register. Below is a dump of all PTP
related (global) registers.

regards
Christian

    KSZ9563 (Ethernet switch)

      Global

        IEEE 1588 PTP
        CLKCTRL      0002      SWFA         enabled        CLKSADJ        NOP              PTPSD        subtract
                               CLKREAD      NOP            CLKWRITE       NOP              CLKCADJ      disabled
                               EN           enabled        RESET          normal
        RTCCP        0002      PHASE        16ns
        RTCNS        17D72FF0  NANOSECONDS    399978480
        RTCS         00000023  SECONDS               35
        RTCSUBNS     00000000  RATEDIR      subtract       TEMPADJ        permanent
                               SUBNS                  0
        RTCTMPADJ    00000000  CYCLES                 0
        MSGCFG1      007D      MODEEN       enabled        IEEE802.3      enabled          UDPv4        enabled
                               UDPv6        enabled        TCMODE         P2P              OCMODE       slave
        MSGCFG2      1004      UNICASTEN    both           ALTMASTER      disabled         PRIOTX       event only
                               CHKSYNFU     disabled       CHKDLY         disabled         CHKPDLY      disabled
                               DROP         disabled       CHKDOM         disabled         IPv4CHKSUM   calc
        DOMVER       0200      VERSION      2              DOMAIN            0
        UNITIDX      00000000  GPIO_IDX     GPIO_1         TS_IDX         Unit 0           TRIG_IDX     Unit 0
Vladimir Oltean Nov. 1, 2020, 11:10 a.m. UTC | #5
On Sun, Nov 01, 2020 at 10:35:01AM +0100, Christian Eggers wrote:
> Hi Vladimir,
> 
> On Friday, 30 October 2020, 19:24:47 CET, Vladimir Oltean wrote:
> > On Thu, Oct 22, 2020 at 12:17:48PM +0200, Christian Eggers wrote:
> > > I tried to study the effect of setting the ocmode bit on the KSZ either to
> > > master or to slave. The main visible change is, that some PTP message
> > > types
> > > are be filtered out on RX:
> > > - in "master" mode, "Sync" messages from other nodes will not be received
> > > (but everything else like "Announce" seem to work)
> > > - in "slave" mode, "Delay_Req" messages from other nodes will not be
> > > received
> > Could you dump the contents of your REG_PTP_MSG_CONF2 register?
> runtime register value is 0x1004 (matches default value from the data sheet).
> The Linux driver doesn't touch this register. Below is a dump of all PTP
> related (global) registers.

So the bit 5 ("Enable Dropping of Sync/Follow_Up and Delay_Req PTP
Messages") is not set. When the PTP messages are dropped, do you know
which error counter in ethtool -S is increasing?
Vladimir Oltean Nov. 1, 2020, 11:55 p.m. UTC | #6
On Mon, Nov 02, 2020 at 01:41:49AM +0200, Vladimir Oltean wrote:
> In principle I don't see any reason why this switch would not be able
> to operate as a one-step peer delay BC.

What I meant to say was "one-step E2E BC", since I was talking about
having to receive both Sync and Delay_Req at the same time, of course.
Vladimir Oltean Nov. 2, 2020, 12:28 p.m. UTC | #7
On Mon, Nov 02, 2020 at 11:35:00AM +0100, Christian Eggers wrote:
> Maybe my mail from October, 22 was ambiguous. I meant that despite of the
> presence of filtering, a BCMA algorithm should be about to work (as no
> Announce messages are filtered out).
> 
> Additionally I said, that switching between "master" and "slave" mode could
> not be done automatically by the driver, as the driver could at most detect
> the presence of Sync messages (indication for master mode), but would do hard
> to detect a transition to slave mode.
> 
> I see a chance that user space (ptp4l) could configure the appropriate
> "hardware filter setup" for master/slave mode.

The concept that you want from user space is hard to define.
You want ptp4l to make the driver aware of the port state, which is
something that happens at runtime and is not part of the "hardware
filter setup" as you say. Then, even if ptp4l notifies the driver of
port state, E2E BC setups would still be broken because that's how the
hardware works and no user space assistance can help with that. Also,
what abstraction do you plan using for programming the PTP port state
into the kernel.

Maybe you should optimize for what you plan to use. If you need to use
an E2E profile, maybe it would be worth the effort to find an
appropriate abstraction for this port state thing. If you only use it
for testing, then maybe it would be a good idea to keep the sysfs in
your tree.

> > Why am I mentioning this? Because the setting that's causing trouble for
> > us is 'port state of the host port OC', which in the context of what I
> > said above is nonsense. There _is_ no host port OC. There are 2 switch
> > ports which can act as individual OCs, or as a BC, or as a TC.
> But the switch has only one clock at all. I assume that the switch cannot be a
> boundary clock, only TC seems possible.

Why would a P2P BC not work? It does not require more than one clock.

> As said above, having "filter setups" for E2E/P2P and for MASTER/SLAVE would
> probably fit well for this kind of hardware.

For E2E/P2P is one thing, for master/slave is a completely different
thing.
Christian Eggers Nov. 5, 2020, 8:18 p.m. UTC | #8
Hi Vladimir,

On Thursday, 22 October 2020, 13:32:43 CET, Vladimir Oltean wrote:
> On Thu, Oct 22, 2020 at 01:11:40PM +0200, Christian Eggers wrote:

> > On Thursday, 22 October 2020, 12:50:14 CEST, Vladimir Oltean wrote:

> > after applying the RX timestamp correctly to the correction field

> > (shifting

> > the nanoseconds by 16),

> 

> That modification should have been done anyway, since the unit of

> measurement for correctionField is scaled ppb (48 bits nanoseconds, 16

> bits scaled nanoseconds), and not nanoseconds.

> 

> > it seems that "moving" the timestamp back to the tail tag on TX is not

> > required anymore. Keeping the RX timestamp simply in the correction

> > field (negative value), works fine now. So this halves the effort in

> > the tag_ksz driver.


unfortunately I made a mistake when testing. Actually the timestamp *must* be
moved from the correction field (negative) to the egress tail tag.

> Ok, this makes sense.

> Depending on what Richard responds, it now looks like the cleanest

> approach would be to move your implementation that is currently in

> ksz9477_update_ptp_correction_field() into a generic function called

> 

> static inline void ptp_onestep_p2p_move_t2_to_correction(struct sk_buff

> *skb, unsigned int ptp_type,

>  struct ptp_header *ptp_header,

> ktime_t t2)

I have implemented this in ptp_classify.h. Passing t2 instead of the correction
field itself is fine for rx, but as this function is now still required for
transmit, it looks a little bit misused there (see below).

Shall I keep it as below, or revert it to passing value of the correction field
itself?

regards
Christian


static void ksz9477_xmit_timestamp(struct sk_buff *skb)
{
	struct sk_buff *clone = DSA_SKB_CB(skb)->clone;
	struct ptp_header *ptp_hdr;
	u32 tstamp_raw = 0;
	u64 correction;

	if (!clone)
		goto out_put_tag;

	/* Use cached PTP header and type from ksz9477_ptp_should_tstamp(). Note
	 * that KSZ9477_SKB_CB(clone)->ptp_header != NULL implies that this is a
	 * Pdelay_resp message.
	 */
	ptp_hdr = KSZ9477_SKB_CB(clone)->ptp_header;
	if (!ptp_hdr)
		goto out_put_tag;

	correction = get_unaligned_be64(&ptp_hdr->correction);

	/* For PDelay_Resp messages we will likely have a negative value in the
	 * correction field (see ksz9477_rcv()). The switch hardware cannot
	 * correctly update such values, so it must be moved to the time stamp
	 * field in the tail tag.
	 */
	if ((s64)correction < 0) {
		unsigned int ptp_type = KSZ9477_SKB_CB(clone)->ptp_type;
		struct timespec64 ts;
		u64 ns;

		/* Move ingress time stamp from PTP header's correction field to
		 * tail tag. Format of the correction filed is 48 bit ns + 16
		 * bit fractional ns.  Avoid shifting negative numbers.
		 */
		ns = -((s64)correction) >> 16;
		ts = ns_to_timespec64(ns);
		tstamp_raw = ((ts.tv_sec & 3) << 30) | ts.tv_nsec;

>>>		/* Set correction field to 0 (by subtracting the negative value)

>>>		 * and update UDP checksum.

>>>		 */

>>>		ptp_onestep_p2p_move_t2_to_correction(skb, ptp_type, ptp_hdr, ns_to_ktime(-ns));

	}

out_put_tag:
	put_unaligned_be32(tstamp_raw, skb_put(skb, KSZ9477_PTP_TAG_LEN));
}


Addtionally ptp_onestep_p2p_move_t2_to_correction() must be able to handle negative values:

static inline
void ptp_onestep_p2p_move_t2_to_correction(struct sk_buff *skb,
					   unsigned int type,
					   struct ptp_header *hdr,
					   ktime_t t2)
{
	u8 *ptr = skb_mac_header(skb);
	struct udphdr *uhdr = NULL;
	s64 ns = ktime_to_ns(t2);
	__be64 correction_old;
	s64 correction;

	/* previous correction value is required for checksum update. */
	memcpy(&correction_old,  &hdr->correction, sizeof(correction_old));
	correction = (s64)be64_to_cpu(correction_old);

	/* PTP correction field consists of 32 bit nanoseconds and 16 bit
	 * fractional nanoseconds.  Avoid shifting negative numbers.
	 */
>>>	if (ns >= 0)

>>>		correction -= ns << 16;

>>>	else

>>>		correction += -ns << 16;


	/* write new correction value */
	put_unaligned_be64((u64)correction, &hdr->correction);
...
}
Vladimir Oltean Nov. 10, 2020, 1:42 a.m. UTC | #9
Sorry for getting back late to you. It did not compute when I read your
email the first time around, then I let it sit for a while.

On Thu, Nov 05, 2020 at 09:18:04PM +0100, Christian Eggers wrote:
> unfortunately I made a mistake when testing. Actually the timestamp *must* be

> moved from the correction field (negative) to the egress tail tag.


That doesn't sound very good at all.

I have a simple question. Can your driver, when operating as a PTP
master clock, distribute a time in 2020 into your network? (you can
check with "phc_ctl /dev/ptp0 get")
Vladimir Oltean Nov. 10, 2020, 4:40 p.m. UTC | #10
On Tue, Nov 10, 2020 at 03:36:02PM +0100, Christian Eggers wrote:
> On Tuesday, 10 November 2020, 02:42:34 CET, Vladimir Oltean wrote:
> > Sorry for getting back late to you. It did not compute when I read your
> > email the first time around, then I let it sit for a while.
> >
> > On Thu, Nov 05, 2020 at 09:18:04PM +0100, Christian Eggers wrote:
> > > unfortunately I made a mistake when testing. Actually the timestamp *must*
> > > be moved from the correction field (negative) to the egress tail tag.
> > That doesn't sound very good at all.
> I think that is no drawback. It is implemented and works.

It works, but how?

The reason why I'm asking you is this. I can guess that this hardware
plays by a very simple song when performing one-step TX timestamping.

correctionField_New = correctionField_Old + (t_TX - t_Tail_Tag)

It does this because, as far as the hardware is concerned, it needs to
make no difference between event messages, be they Sync, Pdelay_Req or
Pdelay_Resp messages.

First, let's see if we agree what happens for a Sync.

          |                    |
       t1 |------\ Sync        |
          |       ------\      |
          |              ----->| t2
Master    |                    |     Slave
Clock     |                    |     Clock
          |                    |

When receiving the one-step Sync message, the Slave Clock must add the
correctionField to the originTimestamp in order to figure out what is
the t1 it should use. Usually, the correctionField would only be updated
by transparent clocks. But when using one-step TX timestamping, it is
not mandatory for master clocks to send the correctionField as zero
(something which is mandatory for two-step). So they don't.

Here's what the IEEE 1588 standard says:

-----------------------------[cut here]-----------------------------
9.5.9.3 One-step clocks
-----------------------

The originTimestamp field of the Sync message shall be an estimate no
worse than ±1 s of the <syncEventEgressTimestamp> excluding any
fractional nanoseconds.
-----------------------------[cut here]-----------------------------

In practice, the Master Clock will probably set the originTimestamp to
something more or less arbitrary by reading the current PTP time of the
NIC, then it will let the MAC rewrite the correctionField with the
precise t_TX (hardware TX timestamp) minus the value from the
originTimestamp field*.

*Actually parsing the packet at that rate, while also rewriting and
timestamping it, might be too tricky and too late, so the MAC, instead
of reading the originTimestamp from the actual Sync message, will
actually require you, the driver writer, to pass them the value of the
originTimestamp twice, this time also through the tail tag (or through
whatever other buffer metadata that hardware might use).

So this formula would do that job:

correctionField_New = correctionField_Old + (t_TX - t_Tail_Tag)

where t_TX is the MAC TX timestamp and t_Tail_Tag should be the
approximate originTimestamp of the Sync message.

I am fairly confident that this is how your hardware works, because
that's also how peer delay wants to be timestamped, it seems. It's just
that in the case of Pdelay_Resp, t_Tail_Tag must be set to t2 (the
precise timestamp of the Pdelay_Req), in order for the correctionField
for that Pdelay_Resp message to contain (t3 - t2):

          |                    |
       t1 |------\ Pdelay_Req  |
          |       ------\      |
          |              ----->| t2
Clock A   |                    |     Clock B
          | Pdelay_Resp /------| t3
          |       ------       |
       t4 |<-----/             |
          |                    |

You don't do any of that here. So what must be happening in your case is
that the originTimestamp for Sync messages is always zero, and the
correctionField provides the rest of t1, in its entirety. That's because
in your tagger, you also leave the t_Tail_Tag as zero for Sync messages
(you only touch it for peer delay). Is that ok? Well, on paper yes,
because 0 + t1 = t1 + 0 = t1 (either the correctionField or the
originTimestamp could be zero, and they would add up to the same
number), but in practice, the bit width of the originTimestamp is 64
bits, whereas the correctionField only has 48 bits for nanoseconds, the
lower 16 bits being for fixed point. So if you're going to keep your
entire t1 into the correctionField, then in practice your master clock
can only advertise a time that holds 48 bits worth of nanoseconds, i.e.
3 1/4 days worth of TAI starting with Jan 1st 1970.  Then it would wrap
around. So how come this is not what happens in your case? I don't see
you update the originTimestamp nor the t_Tail_Tag for Sync messages.

The arithmetic that is done on the correctionField should be simple
48-bit two's complement. It should not matter if the correctionField is
positive or negative. The only case I believe it would matter that the
correctionField is negative is if the arithmetic is not actually on 48
bit two's complement, but on the partial timestamps directly (i.e. 32
bit two's complement). What I'm saying is that this formula is probably
calculated in hardware in two steps:

correctionField_New = correctionField_Old + (t_TX - t_Tail_Tag)

Step 1:
t_tmp = t_TX (32 bits) - t_Tail_Tag (32 bits), using 32-bit two's complement arithmetic

Step 2:
correctionField_New (48 bits) = correctionField_Old (48 bits) + t_tmp (32 bits),
using 48-bit two's complement arithmetic, and where t_tmp is probably
treated as an unsigned integer that is zero-padded up to 48 bits

I am getting the feeling that there are still some things we're missing
in this series. I would not hurry to send a v2 until they're clear.

I hope that you're not testing PTP just between yourself and yourself.
It's really easy for bugs to cancel out.
Vladimir Oltean Nov. 10, 2020, 7:32 p.m. UTC | #11
On Tue, Nov 10, 2020 at 06:40:45PM +0200, Vladimir Oltean wrote:
> I am fairly confident that this is how your hardware works, because

> that's also how peer delay wants to be timestamped, it seems.


So I was confident and also wrong, it appears. My KSZ9477 datasheet
says:

In the host-to-switch direction, the 4-byte timestamp field is always
present when PTP is enabled, as shown in Figure 4-6. This is true for
all packets sent by the host, including IBA packets. The host uses this
field to insert the receive timestamp from PTP Pdelay_Req messages into
the Pdelay_Resp messages. For all other traffic and PTP message types,
the host should populate the timestamp field with zeros.

Hm. Does that mean that the switch updates the originTimestamp field of
the Sync frames by itself? Ok... Very interesting that they decided to
introduce a field in the tail tag for a single type of PTP message.

But something is still wrong if you need to special-case the negative
correctionField, it looks like the arithmetic is not done on the correct
number of bits, either by the driver or by the hardware.

And zeroing out the correctionField of the Pdelay_Resp on transmission,
to put that value into t_Tail_Tag? How can you squeeze a 48-bit value
into a 32-bit value without truncation?
Christian Eggers Nov. 17, 2020, 11:27 a.m. UTC | #12
On Thursday, 12 November 2020, 16:28:44 CET, Christian Eggers wrote:
> Hi Vladimir,

> 

> On Tuesday, 10 November 2020, 20:32:45 CET, Vladimir Oltean wrote:

> > But something is still wrong if you need to special-case the negative

> > correctionField, it looks like the arithmetic is not done on the correct

> > number of bits, either by the driver or by the hardware.

> 

> I got it! 

I got it not!

While keeping the (negative) correction field works perfect when using
PTP over L2 (what I did the last weeks), this causes an unwanted side effect
when using UDP:

...
> User Datagram Protocol, Src Port: 319, Dst Port: 319

> 

>     Source Port: 319

>     Destination Port: 319

>     Length: 62

>  

>  Checksum: 0x2285 incorrect, should be 0x2286 (maybe caused by "UDP checksum offload"?)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>     [Checksum Status: Bad]

>     [Stream index: 0]

>     [Timestamps]

> 

> Precision Time Protocol (IEEE1588)

> 

>     0000 .... = transportSpecific: 0x0

>     .... 0011 = messageId: Peer_Delay_Resp Message (0x3)

>     0000 .... = Reserved: 0

>     .... 0010 = versionPTP: 2

>     messageLength: 54

>     subdomainNumber: 0

>     Reserved: 0

>     flags: 0x0000

>     correction: 5579788,000000 nanoseconds

>     Reserved: 0

>     ClockIdentity: 0x849000fffe0980f7

>     SourcePortID: 1

>     sequenceId: 785

>     control: Other Message (5)

>     logMessagePeriod: 127

>     requestreceiptTimestamp (seconds): 0

>     requestreceiptTimestamp (nanoseconds): 0

>     requestingSourcePortIdentity: 0x849000fffe0980f6

>     requestingSourcePortId: 2

While correction field is ok (residential delay ~5ms, using one printk...),
the UDP checksum is off by one in all PDelay_Resp messages.

The KSZ device offers on option to set the UDP checksum to zero, but this also
didn't help and additionally wouldn't work for IPv6.

It seems that I should return to "moving T2 from the correction field to the
tail tag" on tx.
 
regards
Christian