[v3,0/6] iscsi: Fix in kernel conn failure handling

Message ID	20210424221755.124438-1-michael.christie@oracle.com
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: Mike Christie <michael.christie@oracle.com> To: khazhy@google.com, lduncan@suse.com, martin.petersen@oracle.com, rbharath@google.com, krisman@collabora.com, linux-scsi@vger.kernel.org, jejb@linux.ibm.com Subject: [PATCH v3 0/6] iscsi: Fix in kernel conn failure handling Date: Sat, 24 Apr 2021 17:17:49 -0500 Message-Id: <20210424221755.124438-1-michael.christie@oracle.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 X-MS-Exchange-MessageSentRepresentingType: 1 Precedence: bulk
Series	iscsi: Fix in kernel conn failure handling \| expand [v3,0/6] iscsi: Fix in kernel conn failure handling [v3,1/6] scsi: iscsi: force immediate failure during shutdown [v3,2/6] scsi: iscsi: use system_unbound_wq for destroy_work [v3,3/6] scsi: iscsi: have callers do the put on iscsi_lookup_endpoint [v3,4/6] scsi: iscsi: fix in-kernel conn failure handling [v3,5/6] scsi: iscsi_tcp: set no linger [v3,6/6] scsi: iscsi_tcp: start socket shutdown during conn stop

Message ID

20210424221755.124438-1-michael.christie@oracle.com

Headers

From: Mike Christie <michael.christie@oracle.com>
To: khazhy@google.com, lduncan@suse.com, martin.petersen@oracle.com,
	rbharath@google.com, krisman@collabora.com,
	linux-scsi@vger.kernel.org, jejb@linux.ibm.com
Subject: [PATCH v3 0/6] iscsi: Fix in kernel conn failure handling
Date: Sat, 24 Apr 2021 17:17:49 -0500
Message-Id: <20210424221755.124438-1-michael.christie@oracle.com>
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
Received: from localhost.localdomain (73.88.28.6) by
	DM3PR12CA0079.namprd12.prod.outlook.com (2603:10b6:0:57::23)
	with Microsoft SMTP Server (version=TLS1_2,
	cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
	15.20.4065.21 via Frontend Transport;
	Sat, 24 Apr 2021 22:18:02 +0000
X-MS-PublicTrafficType: Email
X-MS-Office365-Filtering-Correlation-Id: 595eafbb-2567-4f91-791a-08d9076ed3cd
X-MS-TrafficTypeDiagnostic: BYAPR10MB2885:
X-MS-Exchange-Transport-Forked: True
X-Microsoft-Antispam-PRVS: <BYAPR10MB2885811B38CF7E70121B1993F1449@BYAPR10MB2885.namprd10.prod.outlook.com>
X-MS-Oob-TLC-OOBClassifiers: OLM:9508;
X-MS-Exchange-SenderADCheck: 1
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: 2MC8KL5l8is7tq5mUebv1Q+r13gHK1m/PuwwQIL1yucDN0tC0HomVMeCbBesOKGvx6nqpHRJRdK8M23EASUNbEM3cjYXQcyJH3DIdo2k9gjGVaM0Xk6X9tr+Df1T+fmxSqVfUJZEhI0ryoXbQzFFDeV9FXsOg6aWi3iGvmXMBuC3xmrtFezL0PurZECrtMywAphiHgX1uoJ8bNQWwTC78eD8PkWeuAORYEb9KzZHUkyWYBYoZJ+75Oyeb3qprevJZBQYsDC9qswJyBXS/7IGFMKdldGtUFfYPWbKSwIxOKXZotPhhkZdra3VKTZckK6DndPk0dQfYcGG3JpYHzDGhg9Elnsl8R+2B4LTQVhJDNr6plKO73zyFMvwf+of3g4sGDZzgQNaMolTakPeBSkntcPAY7XlmVNR4tE3QTl+BjfNUQvMI1ZYbq5BTL3c35P+AHjlgAkTQrpPsEa76Na5p99sdBpK3eYf1R78KfIfZReUpd08hlapa+AoVI536kaBNEudaV0ci09mjwVOeFZxn9QLnBsO2m5GopHx5//w4FOKKG5niaTiYK57qoxzvN2mCC5FV+QFfDaWKnKeUBAsRlUyh6CoSfP4mh3zF+ND3/mjOEy/VabLaEaEQ2oiFABZU6LaqdJGX9M8QbuBBPrvTPoAFs3B83cm52wKj8BO1aXnhfBErgLnq+oyFxanNeSF0ac8pTK9+5FxAR8oMbeemtAWyNUmpTrlJC5Cgt+JeExDDEYWNQhvsQwNHQEVWuVCZCVp2P4NasV7Gob1ZJ5YQg==
X-Forefront-Antispam-Report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:;
	IPV:NLI; SFV:NSPM; H:BYAPR10MB3573.namprd10.prod.outlook.com;
	PTR:; CAT:NONE;
	SFS:(136003)(346002)(39860400002)(376002)(396003)(366004)(83380400001)(6486002)(66946007)(66556008)(5660300002)(66476007)(38100700002)(966005)(6666004)(6506007)(478600001)(6512007)(2616005)(186003)(316002)(36756003)(52116002)(26005)(86362001)(2906002)(1076003)(956004)(8936002)(8676002)(16526019)(38350700002)(69590400013);
	DIR:OUT; SFP:1101; 
X-MS-Exchange-AntiSpam-MessageData: ikLOHZlsHO/ZK999rxRkxbqI2JqibDTufoIFfuJCOmrezdmdY7EJSGOBuG7IRnrt9/QegN5egWBM3pWHaVWEcanxMb4xNzNMKfm2fHNOypmBeK0pQNsSOBZZDxZIu+877Co9EZxsQmwa/yMHscP4siX7C1NyZ/ELWs3EDLv3kZJRWmD1/X+EO1NfXrTpSbG973Yyy7f6RWo6r5kpiobSW39Jo4IkMbCKHq0UiVYy/Y/8rgv7QI63wOb9R6/OY4Ga1kTQypb51mofuyoGIGq8nkIGpjgMdyBXML4S5UHV/N+WctSleSrXXLq3kJ4WYlwQthJJd2Tx9a3qmImJFls4TbDwOYsh46v7gyPopm+rvdWJDYAjt3b46uwtNZC65Abijuh1Lei0cU1fXx3Vu8OvhRpLFM4taCTN0gias3hFNXAbalM//UnpvvyNQPyBZ4HsRzxk3lpHimSb9tY8X9Iv0HD/jgBkdvz7g7gacD42Bfch/3L71lwkawh2WaNUGwYTmE4f5s8TZouMlY6SJoOylNA63etCjmWjiNu/0AdF7kr/pUkuEOeJWwfYgY7o8NpTJgNVrnsz3S9NGM3k3N39TZ7sUlNsIoJnVZMmOrcTozO/MKJWUvCMrsQlnQx8cnkaLAj9/Mz6oGMLh83BaDtW0y1xl9hklz2hNqu8K2CEI5G/mxRYJbhcaF2cM4J2pA5AjmwaR7IgL3Wi6f94H3WVj6YFAeLP5vDQBZoAaglCKDduvjaMELfU7u+eI20SbZjJVNsLHMDsw+HbGL67B85NQjk0PJ12Vf82jBApSDHoZJhuF2QsCWfTs2fwobGSlcuSJ75XGjdggRMflt/BkHBmE0aU7bcgZ5j+UJxuYOlbA7QBDu/J5meWIuDE4BiWl/gwSx7yvcZqchWfLSX4bLG+WHXvt0pnXrxN43eXkEI8K7jteoY8CmJSv6fjVEvg/SbV9kbsb51s/gvvHFtJrIGDTBckeuDtKrvaagH/N5zXs8WEWG0uPLIENKHW4+1wyx1NT5g1linoEbGkLdRr5MmWUam6QAxzvJrRgdFG+vNnwK7l+3cobnYwBvm7NvLrnOUA2tQQIOqqKa9Kli/uxQDRb2tOOjudTcXVwn6sxfpd8VT9pe9l7ReyFJX7PLCxAUDpioaFle6wHDYlQFw3Pe9P9K5WOvAN6BbPdkFbWNYswhx/bfdxbIkjvYzh/f7umc9GODVZe+Lb+Xe1AQqi4oeOyKmyvrwJF/PQ5FQ/dSBXaQnmyxOal7qei82DHqnHNR/iJBD9Rynt2uNjGb/Y30dF2l8P52rulQr79RpN2lBwi+MdXsu/sUhlITJ360NHSAgh
X-OriginatorOrg: oracle.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 595eafbb-2567-4f91-791a-08d9076ed3cd
X-MS-Exchange-CrossTenant-AuthSource: BYAPR10MB3573.namprd10.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Apr 2021 22:18:03.2980
	(UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 4e2c6054-71cb-48f1-bd6c-3a9705aca71b
X-MS-Exchange-CrossTenant-MailboxType: HOSTED
X-MS-Exchange-CrossTenant-UserPrincipalName: dNy22eisCdRFhXIwwk890ohsps3vgv1gfwr1WYFUH63laC09sOkrQ2jsAenyLbzEQe8COEVS3oi5mNC3wnbLLTI/J+whOZItSjvbk2goXZw=
X-MS-Exchange-Transport-CrossTenantHeadersStamped: BYAPR10MB2885
X-Proofpoint-Virus-Version: vendor=nai engine=6200 definitions=9964
	signatures=668683
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0
	suspectscore=0 phishscore=0 spamscore=0
	mlxlogscore=999 malwarescore=0 bulkscore=0 adultscore=0 mlxscore=0
	classifier=spam adjust=0 reason=mlx scancount=1
	engine=8.12.0-2104060000
	definitions=main-2104240169
X-Proofpoint-ORIG-GUID: iPtf8-TtNyh3wBRb4KbJIDMv-vbW84x2
X-Proofpoint-GUID: iPtf8-TtNyh3wBRb4KbJIDMv-vbW84x2
X-Proofpoint-Virus-Version: vendor=nai engine=6200 definitions=9964
	signatures=668683
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 adultscore=0
	lowpriorityscore=0
	mlxlogscore=999 malwarescore=0 bulkscore=0 suspectscore=0 spamscore=0
	priorityscore=1501 clxscore=1015 impostorscore=0 phishscore=0
	mlxscore=0
	classifier=spam adjust=0 reason=mlx scancount=1
	engine=8.12.0-2104060000
	definitions=main-2104240169
Precedence: bulk
List-ID: <linux-scsi.vger.kernel.org>
X-Mailing-List: linux-scsi@vger.kernel.org

Series

iscsi: Fix in kernel conn failure handling | expand

Message

Mike Christie April 24, 2021, 10:17 p.m. UTC

These patches are made over the qedi/libiscsi TMF fixes:

https://lore.kernel.org/linux-scsi/20210424220603.123703-1-michael.christie@oracle.com/T/#t

The patch 0ab710458da1 ("scsi: iscsi: Perform connection failure entirely
in kernel space") has the following regressions or bugs that this patch
set fixes.

1. It can return cmds to upper layers like dm-multipath where that can
retry them. After they are successful the fs/app can send new IO to the
same sectors, but we've left the cmds running in FW or in the net layer.
We need to be calling ep_disconnect.

2. The drivers that implement ep_disconnect expect that it's called
before conn_stop. Besides crashes, if the cleanup_task callout is called
before ep_disconnect it might free up driver/card resources for session1
then they could be allocated for session2. But because the driver's
ep_disconnect is not called it has not cleaned up the firmware so the
card is still using the resources for the original cmd.

3. The system shutdown case does not work for the eh path. Passing
stop_conn STOP_CONN_TERM will never block the session and start the
recovery timer, because for that flag userspace will do the unbind
and destroy events which would remove the devices and wake up and kill
the eh. We should be using STOP_CONN_RECOVER.

4. The stop_conn_work_fn can run after userspace has done it's
recovery and we are happily using the session. We will then end up
with various bugs depending on what is going on at the time.

We may also run stop_conn_work_fn late after userspace has called
stop_conn and ep_disconnect and is now going to call start/bind
conn. If stop_conn_work_fn runs after bind but before start,
we would leave the conn in a unbound but sort of started state where
IO might be allowed even though the drivers have been set in a state
where they no longer expect IO.

5. returning -EAGAIN in iscsi_if_destroy_conn if we haven't yet run
the in kernel stop_conn function is breaking userspace. We should have
been doing this for the caller.


V3:
- Rebase over
https://lore.kernel.org/linux-scsi/20210424220603.123703-1-michael.christie@oracle.com/T/#t
- Add fix for if we are doing offload boot and hit a conn failure at the
same time userspace is syncing up it's state and doing a relogin.
- Drop RFC as google has tested.

V2:
- Handle second part of #4 above and fix missing locking
- Include iscsi_tcp kernel sock shutdown patch