[RFC,00/48] Target cluster implementation over DLM

Message ID 20220803162857.27770-1-d.bogdanov@yadro.com

Message

Dmitry Bogdanov Aug. 3, 2022, 4:04 p.m. UTC
Hi Linux target community,

Let me present an RFC implementation of cluster features for Target
Core, needed when backstore devices are shared across cluster nodes.

The patchset is big and consists of several subsets, but it contains some
arguable things, and discussing them separately would take too much time.

Patches 1-9:
Make RTPI part of se_tpg instead of se_lun. This is a must because
there is no way to assign an RTPI on a LUN.
This data model differs from SCST and from the current LIO one, but it
does not contradict SAM and is even closer to SAM: the whole TCM
is a SCSI Device, and all its ports are SCSI Ports with unique RTPIs.
 + unique identification of a TPG across the cluster.
 + the ability to assign an RTPI.
 - the total number of TPGs is limited to 65535.
This patchset was first published 2 years ago [1]. In the previous
version the peers' RTPIs were put in the <device>/alua/... folder. In this
version the peers' RTPIs are part of TPGs on the remote fabric (patch 35).
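
As an illustration only (this is not code from the patchset), the uniqueness
rule for RTPIs on enabled TPGs could be sketched in user-space C as below.
struct tpg, tpg_enable() and the table size are hypothetical names chosen for
the sketch, and it assumes that RTPI value 0 is reserved, which is what leaves
65535 usable values in a 16-bit field.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TPGS 8              /* tiny table for the sketch only */

struct tpg {                    /* hypothetical stand-in for se_tpg */
    uint16_t rtpi;              /* 0 means "not assigned" (assumed reserved) */
    bool enabled;
};

static struct tpg tpgs[MAX_TPGS];

/* Return true if an enabled TPG other than 'self' already uses 'rtpi'. */
static bool rtpi_in_use(const struct tpg *self, uint16_t rtpi)
{
    for (size_t i = 0; i < MAX_TPGS; i++) {
        if (&tpgs[i] != self && tpgs[i].enabled && tpgs[i].rtpi == rtpi)
            return true;
    }
    return false;
}

/* Enable a TPG with an explicit RTPI: 0 on success, -1 if the value
 * is reserved or already taken by another enabled TPG. */
static int tpg_enable(struct tpg *t, uint16_t rtpi)
{
    if (rtpi == 0 || rtpi_in_use(t, rtpi))
        return -1;
    t->rtpi = rtpi;
    t->enabled = true;
    return 0;
}
```

In a cluster, the same check would also have to cover the RTPIs configured for
peer nodes, which is what the TPGs on the remote fabric are meant to describe.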

Patches 10-29:
Fix some bugs and deviations from the standard in the PR code.
Decouple pr_reg from se_nacl and se_tpg so that it is just a registration
holder.
Make APTPL registrations (not linked to se_dev_entry) full-fledged
registrations.

Patches 30-34:
DLM_CKV module that uses DLM and provides:
 * Cluster Lock service (pure wrapper over DLM).
 * Cluster Key-Value service (in-memory storage).
 * Cluster Notification service with a blocking acknowledge.
 * Cluster membership callbacks.
This module is intended to be used by TCM and nvmet to implement cluster
operations.

Patch 35:
New 'remote' (in fact dummy) fabric module. Configuration on this fabric
gives TCM a view of the TPG/LUN/ACL configuration on peer nodes.

Patch 36:
Introduce cluster ops and functions to register modules that implement
cluster ops. There can be several different modules.
The device attrib cluster_impl selects which implementation to use
for that device. 'single' is the default (no cluster) implementation.
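
A minimal user-space sketch of such a registry, for illustration only: the
struct layout, the function names and the "dlm" implementation name are
assumptions made for the example, not the interface from patch 36. Only the
'single' default name comes from the description above.

```c
#include <stddef.h>
#include <string.h>

struct cluster_ops {            /* hypothetical; not the patchset's struct */
    const char *name;           /* string matched against cluster_impl */
    int (*pr_sync)(void);       /* placeholder for one cluster operation */
};

#define MAX_IMPLS 4

static int single_pr_sync(void) { return 0; }  /* no cluster: nothing to sync */
static int dlm_pr_sync(void) { return 1; }     /* stub for a clustered variant */

static const struct cluster_ops single_ops = { "single", single_pr_sync };
static const struct cluster_ops dlm_ops = { "dlm", dlm_pr_sync };

/* The default implementation is always present in slot 0. */
static const struct cluster_ops *impls[MAX_IMPLS] = { &single_ops };

/* Look up an implementation by name; NULL if it is not registered. */
static const struct cluster_ops *cluster_ops_find(const char *name)
{
    for (size_t i = 0; i < MAX_IMPLS; i++) {
        if (impls[i] && strcmp(impls[i]->name, name) == 0)
            return impls[i];
    }
    return NULL;
}

/* Register an implementation: -1 if the name is taken or the table is full. */
static int cluster_ops_register(const struct cluster_ops *ops)
{
    if (cluster_ops_find(ops->name))
        return -1;
    for (size_t i = 0; i < MAX_IMPLS; i++) {
        if (!impls[i]) {
            impls[i] = ops;
            return 0;
        }
    }
    return -1;
}
```

Writing a name to cluster_impl would then map to a cluster_ops_find() call,
with an unknown name rejected instead of silently falling back to 'single'.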

Patches 37-48:
TCM Cluster over DLM module implementation, inspired by SCST.
 * Use the DLM_CKV Lock service to serialize the order of PR OUT commands.
 * Use the DLM_CKV Key-Value storage service to store PR cluster data.
   Sync it after successful execution of a PR OUT command.
 * Use the DLM_CKV Notification service to notify (in a blocking manner)
   other nodes to fetch the PR cluster data. Handling of the PR OUT command
   is blocked until the other nodes have read the cluster PR data.
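
The order of operations above can be sketched as a single-threaded user-space
simulation. This is an illustration under strong assumptions, not the module's
code: the lock is a plain flag (a real cluster lock would block rather than
fail), the key-value store is one integer, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NODES 3

static bool cluster_lock_held;      /* stands in for the cluster lock */
static uint32_t kv_pr_generation;   /* stands in for the key-value store */
static uint32_t node_view[NODES];   /* each node's local copy of the PR data */

/* A notified peer fetches the PR data from the store and acknowledges. */
static bool peer_fetch_and_ack(int node)
{
    node_view[node] = kv_pr_generation;
    return true;
}

/* Handle one PR OUT on node 'self'; 0 once every peer has acknowledged. */
static int handle_pr_out(int self)
{
    int acks = 0;

    if (cluster_lock_held)
        return -1;                      /* simplification: real lock blocks */
    cluster_lock_held = true;           /* 1. serialize PR OUT commands */

    node_view[self]++;                  /* 2. execute the command locally */
    kv_pr_generation = node_view[self]; /* 3. store PR data after success */

    for (int n = 0; n < NODES; n++) {   /* 4. notify peers, wait for acks */
        if (n != self && peer_fetch_and_ack(n))
            acks++;
    }

    cluster_lock_held = false;          /* 5. release the lock */
    return acks == NODES - 1 ? 0 : -1;
}
```

The sketch does not model a node dying between steps 3 and 4, which is
exactly the kind of failure a real implementation has to handle.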

It provides:
 * Cluster lock per LBA for Compare And Write.
 * Full support of SCSI-3 Persistent Reservations including
   PREEMPT AND ABORT and REGISTER AND MOVE.
 * Normal PR APTPL implementation (persistence across power loss)
 * Shared LUN RESET
 * Shared SCSI-2 Reservations.
 * Unit Attentions for all TPGs in cluster




How to test

1. Set up DLM over a working corosync & pacemaker cluster
 $ zypper install libdlm libdlm3
 $ crm configure primitive dlm ocf:pacemaker:controld args="-q0 -f0" allow_stonith_disabled=true  op monitor interval="60" timeout="60"
 $ crm configure clone clone-dlm dlm meta interleave=true target-role=Started
2. Set up the TCM cluster configuration
You can use my script [2], which I used while developing the patchset.
3. Test whatever you want.
For example, see my test report [3].

[1] https://lore.kernel.org/all/20200429094443.43937-1-r.bolshakov@yadro.com/
[2] https://pastebin.com/HgCfjywh
[3] https://pastebin.com/AgLSgnWn


Dmitry Bogdanov (39):
  target: core: check RTPI uniquity for enabled TPG
  target: core: fix preempt and abort for allreg res
  target: core: fix memory leak in preempt_and_abort
  target: core: abort all preempted regs if requested
  target: core: new key must be used for moved PR
  target: core: remove unused variable in se_dev_entry
  target: core: undepend PR registrant of nacl
  target: core: make some functions public
  target: core: proper clear reservation on LUN RESET
  target: core: remove superfluous checks
  target: core: proper check of SCSI-2 reservation
  target: core: checks against peer node SCSI2 reservation
  target: core: UA on all luns after reset
  target: core: refactor LUN_RESET code
  target: core: pr: use RTPI in APTPL
  target: core: pr: have Transport ID stored
  target: core: pr: remove se_tpg from pr_reg
  target: core: fix parsing PR OUT TID
  target: core: add function to compare TransportID
  target: core: store proto_id in APTPL
  target: core: rethink APTPL registrations
  dlm_ckv: introduce DLM cluster key-value storage
  dlm_ckv: add notification service
  dlm_ckv: add key-value storage service
  dlm_ckv: add KV get/set async API
  target: add virtual remote target
  target: cluster: introduce cluster ops
  target: cluster: introduce dlm cluster
  target: cluster: store PR data in DLM cluster
  target: cluster: read PR data from cluster
  target: cluster: sync PR for dynamic acls
  target: cluster: sync-up PR data on cluster join
  target: cluster: sync SPC-2 reservations
  target: cluster: allocate UAs on PR sync
  target: cluster: support PR OUT preempt and abort
  target: cluster: add reset cluster function
  target: cluster: implement LUN reset in DLM cluster
  target: cluster: split cluster sync function
  target: cluster: request data on initial sync

Konstantin Shelekhin (1):
  scsi: target/core: Unlock PR generation bump

Roman Bolshakov (8):
  scsi: target/core: Add a way to hide a port group
  scsi: target/core: Set MULTIP bit for se_device with multiple ports
  scsi: target/core: Add cleanup sequence in core_tpg_register()
  scsi: target/core: Add RTPI field to target port
  scsi: target/core: Use RTPI from target port
  scsi: target/core: Drop device-based RTPI
  scsi: target/core: Add common port attributes
  scsi: target/core: Add RTPI attribute for target port

 drivers/target/Kconfig                       |    7 +
 drivers/target/Makefile                      |    4 +
 drivers/target/dlm_ckv.c                     |  757 +++++++++++++
 drivers/target/dlm_ckv.h                     |   44 +
 drivers/target/target_cluster_dlm.c          | 1012 ++++++++++++++++++
 drivers/target/target_core_alua.c            |   12 +-
 drivers/target/target_core_configfs.c        |  191 +++-
 drivers/target/target_core_device.c          |  244 ++++-
 drivers/target/target_core_fabric_configfs.c |   68 +-
 drivers/target/target_core_fabric_lib.c      |  262 +++--
 drivers/target/target_core_internal.h        |   29 +-
 drivers/target/target_core_pr.c              |  613 +++++------
 drivers/target/target_core_pr.h              |   33 +-
 drivers/target/target_core_sbc.c             |   12 +-
 drivers/target/target_core_spc.c             |   19 +-
 drivers/target/target_core_stat.c            |    6 +-
 drivers/target/target_core_tmr.c             |   38 +-
 drivers/target/target_core_tpg.c             |  193 +++-
 drivers/target/target_core_transport.c       |    9 +-
 drivers/target/target_core_ua.c              |    1 +
 drivers/target/tcm_remote/Kconfig            |    8 +
 drivers/target/tcm_remote/Makefile           |    2 +
 drivers/target/tcm_remote/tcm_remote.c       |  405 +++++++
 drivers/target/tcm_remote/tcm_remote.h       |   29 +
 include/target/target_core_base.h            |   56 +-
 25 files changed, 3436 insertions(+), 618 deletions(-)
 create mode 100644 drivers/target/dlm_ckv.c
 create mode 100644 drivers/target/dlm_ckv.h
 create mode 100644 drivers/target/target_cluster_dlm.c
 create mode 100644 drivers/target/tcm_remote/Kconfig
 create mode 100644 drivers/target/tcm_remote/Makefile
 create mode 100644 drivers/target/tcm_remote/tcm_remote.c
 create mode 100644 drivers/target/tcm_remote/tcm_remote.h

Comments

Mike Christie Aug. 3, 2022, 5:36 p.m. UTC | #1
On 8/3/22 11:04 AM, Dmitry Bogdanov wrote:
> Hi Linux target community,
> 
> Let me present an RFC implementation of cluster features for Target
> Core, needed when backstore devices are shared across cluster nodes.
> 
> The patchset is big and consists of several subsets, but it contains some
> arguable things, and discussing them separately would take too much time.
> 
> Patches 1-9:
> Make RTPI part of se_tpg instead of se_lun. This is a must because
> there is no way to assign an RTPI on a LUN.
> This data model differs from SCST and from the current LIO one, but it
> does not contradict SAM and is even closer to SAM: the whole TCM
> is a SCSI Device, and all its ports are SCSI Ports with unique RTPIs.
>  + unique identification of a TPG across the cluster.
>  + the ability to assign an RTPI.
>  - the total number of TPGs is limited to 65535.
> This patchset was first published 2 years ago [1]. In the previous
> version the peers' RTPIs were put in the <device>/alua/... folder. In this
> version the peers' RTPIs are part of TPGs on the remote fabric (patch 35).
> 
> Patches 10-29:
> Fix some bugs and deviations from the standard in the PR code.
> Decouple pr_reg from se_nacl and se_tpg so that it is just a registration
> holder.
> Make APTPL registrations (not linked to se_dev_entry) full-fledged
> registrations.


What are the arguable parts? Do you think it will be the DLM part
and coordinating it with nvmet developers? Or was it patches 1-9
and the multi-node support? Or both :)

Is it possible and would it be valuable to at least kind of break this
up a little?

I would break this up and post the fixes in one set. I'll help you get
them in as soon as possible.

For patches 1-9, I think I remember you posting them before, but I was in
the middle of starting a new job so I didn't review them. I really needed
something like that at my last 2 jobs so I think it's a valuable feature
and I'll review that as well.

If we could at least get those 2 chunks separated then it would make the DLM
parts below easier to get eyeballs on. I'm ok with the idea in general. I
think every nvmet developer will see the massive patchset and not even look at
this first 0/48 email :)


Dmitry Bogdanov Aug. 4, 2022, 11:01 a.m. UTC | #2
On Wed, Aug 03, 2022 at 12:36:56PM -0500, Mike Christie wrote:
> What are the arguable parts? Do you think it will be the DLM part
> and coordinating it with nvmet developers? Or was it patches 1-9
> and the multi-node support? Or both :)
In fact every subset can be a subject of argument :)
* RTPI patchset - changing the data model from an RTPI set per backstore
  device to an RTPI set for the whole node.
* PR refactoring - too many changes; the APTPL changes may not be
  backward compatible.
* remote/dummy fabric - the name.
* DLM_CKV - the name, the location and even the purpose of the module.
* tcm_cluster - too many new exported symbols; not resilient to
  node death in the middle of storing PR data in DLM_CKV, nor to other
  error cases.

> Is it possible and would it be valuable to at least kind of break this
> up a little?
> 
> I would break this up and post the fixes in one set. I'll help you get
> them in as soon as possible.
Once the idea is approved, I can break the patchset into several smaller
ones and start posting them without the RFC prefix. The only problem is
that they all depend on the previous ones, so I have to post each one
after the previous one gets merged.
> 
> For patches 1-9, I think I remember you posting them before, but I was in
> the middle of starting a new job so I didn't review them. I really needed
> something like that at my last 2 jobs so I think it's a valuable feature
> and I'll review that as well.
> 
> If we could at least get those 2 chunks separated then it would make the DLM
> parts below easier to get eyeballs on. I'm ok with the idea in general. I
> think every nvmet developer will see the massive patchset and not even look at
> this first 0/48 email :)
I am not going to send this patchset to the nvmet dev list :)
nvmet does not yet have a local version of the CompareAndWrite and
Reservations features, so it is too early for them.