Message ID | 20250102183232.115279-1-robdclark@gmail.com |
---|---|
State | New |
Series | [v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2 |
On 1/3/2025 1:00 AM, Akhil P Oommen wrote:
> On 1/3/2025 12:02 AM, Rob Clark wrote:
>> From: Rob Clark <robdclark@chromium.org>
>>
>> On mmu-500, stall-on-fault seems to stall all context banks, causing the
>> GMU to misbehave. So limit this feature to smmu-v2 for now.
>>
>> This fixes an issue with an older mesa bug taking out the system
>> because of GMU going off into the weeds.
>>
>> What we _think_ is happening is that, if the GPU generates 1000's of
>> faults at ~once (which is something that GPUs can be good at), it can
>> result in a sufficient number of stalled translations preventing other
>> transactions from entering the same TBU.
>>
>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>
> Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>

Btw, if stall is not enabled, I think there is no point in capturing
coredump from adreno pagefault handler. By the time we start coredump,
gpu might have switched context.

-Akhil.

> -Akhil
>
On Mon, Jan 6, 2025 at 12:11 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 1/3/2025 1:00 AM, Akhil P Oommen wrote:
> > On 1/3/2025 12:02 AM, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> >> GMU to misbehave. So limit this feature to smmu-v2 for now.
> >>
> >> This fixes an issue with an older mesa bug taking out the system
> >> because of GMU going off into the weeds.
> >>
> >> What we _think_ is happening is that, if the GPU generates 1000's of
> >> faults at ~once (which is something that GPUs can be good at), it can
> >> result in a sufficient number of stalled translations preventing other
> >> transactions from entering the same TBU.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >
> > Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >
>
> Btw, if stall is not enabled, I think there is no point in capturing
> coredump from adreno pagefault handler. By the time we start coredump,
> gpu might have switched context.
>
> -Akhil.
>
> > -Akhil
> >
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 6372f3e25c4b..3239bbf18514 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -16,6 +16,10 @@
 
 #define QCOM_DUMMY_VAL	-1
 
+static int enable_stall = -1;
+MODULE_PARM_DESC(enable_stall, "Enable stall on iova fault (1=on , 0=disable, -1=auto (default))");
+module_param(enable_stall, int, 0600);
+
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 {
 	return container_of(smmu, struct qcom_smmu, smmu);
@@ -210,7 +214,9 @@ static bool qcom_adreno_can_do_ttbr1(struct arm_smmu_device *smmu)
 static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 		struct io_pgtable_cfg *pgtbl_cfg, struct device *dev)
 {
+	const struct device_node *np = smmu_domain->smmu->dev->of_node;
 	struct adreno_smmu_priv *priv;
+	bool stall_enabled;
 
 	smmu_domain->cfg.flush_walk_prefer_tlbiasid = true;
 
@@ -237,8 +243,17 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
 	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
-	priv->set_stall = qcom_adreno_smmu_set_stall;
-	priv->resume_translation = qcom_adreno_smmu_resume_translation;
+
+	if (enable_stall < 0) {
+		stall_enabled = of_device_is_compatible(np, "qcom,smmu-v2");
+	} else {
+		stall_enabled = !!enable_stall;
+	}
+
+	if (stall_enabled) {
+		priv->set_stall = qcom_adreno_smmu_set_stall;
+		priv->resume_translation = qcom_adreno_smmu_resume_translation;
+	}
 
 	return 0;
 }
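
For anyone skimming the last hunk, the tri-state handling of enable_stall can be sketched outside the kernel roughly as below. This is only an illustration, not driver code: resolve_stall() and is_smmu_v2() are hypothetical names, and the plain string compare is an assumed stand-in for of_device_is_compatible(), which in the kernel checks the node's compatible list rather than a single string.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for of_device_is_compatible():
 * compares one compatible string instead of walking a device node.
 */
static bool is_smmu_v2(const char *compatible)
{
	assert(compatible != NULL);
	return strcmp(compatible, "qcom,smmu-v2") == 0;
}

/*
 * Mirrors the tri-state in the patch:
 *   enable_stall < 0  -> auto: stall only on smmu-v2
 *   enable_stall == 0 -> stall disabled
 *   otherwise         -> stall enabled
 */
static bool resolve_stall(int enable_stall, const char *compatible)
{
	if (enable_stall < 0)
		return is_smmu_v2(compatible);
	return enable_stall != 0;
}
```

With the default of -1, an mmu-500 compatible resolves to false, so set_stall and resume_translation stay unset; passing 1 forces stall on regardless of the compatible string.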