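These notes record the problems I ran into when publishing model services with kfserving.
Core principles
1. Read the GitHub issues often
2. Read the project's source code often
Diagnostic commands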
- kubectl get inferenceservice -A
- kubectl describe inferenceservice.serving.kubeflow.org/sklearn-iris -n kubeflow
- View the kfserving logs: kubectl logs StatefulSet/kfserving-controller-manager -c manager -n kubeflow
- kubectl get revision sklearn-iris-predictor-default-ngztb -n kubeflow -o yaml
- kubectl get ksvc -n kubeflow
- kubectl describe ksvc/sklearn-iris-predictor-default -n kubeflow
- kubectl get configuration -A
- kubectl -n kubeflow get events
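The checks above can be collected into a small helper so they run in one pass. This is a minimal sketch, assuming kubectl is on the PATH and already points at the cluster. The function names `diagnostic_commands` and `run_all` are hypothetical, and the default service name and namespace are simply the ones used in these notes.

```python
import subprocess
from typing import List


def diagnostic_commands(name: str = "sklearn-iris",
                        namespace: str = "kubeflow") -> List[List[str]]:
    """Return the kubectl invocations from the checklist as argv lists."""
    return [
        ["kubectl", "get", "inferenceservice", "-A"],
        ["kubectl", "describe",
         f"inferenceservice.serving.kubeflow.org/{name}", "-n", namespace],
        ["kubectl", "get", "ksvc", "-n", namespace],
        ["kubectl", "describe", f"ksvc/{name}-predictor-default", "-n", namespace],
        ["kubectl", "get", "configuration", "-A"],
        ["kubectl", "-n", namespace, "get", "events"],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in turn and print its output; keep going on failures."""
    for argv in commands:
        print("$ " + " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        # Show stdout when there is any, otherwise whatever kubectl wrote to stderr.
        print(result.stdout or result.stderr)
```

Calling `run_all(diagnostic_commands())` prints each command followed by its output, which is handy to paste into an issue report.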
Problem 1: no endpoints available for service "kfserving-webhook-server-service"
Solution:
kubectl get mutatingwebhookconfigurations
kubectl delete mutatingwebhookconfigurations inferenceservice.serving.kubeflow.org
kubectl delete validatingwebhookconfigurations inferenceservice.serving.kubeflow.org
kubectl delete po kfserving-controller-manager-0 -n kfserving-system
Redeploy: kfctl apply -f $(pwd)/kfctl_k8s_istio.v1.0.1.yaml -V
kubectl get revision -n kubeflow
kubectl describe revision xgb-kfserving-predictor-default -n kubeflow
Problem 2: authentication against the image registry
Warning InternalError 12m (x32 over 7h20m) revision-controller failed to resolve image to digest: failed to fetch image information: Get https:/xxx.xxx.com/v2/: x509: certificate is valid for *.parkingcrew.net, parkingcrew.net, not harbor.prd.com
Solution:
Command: kubectl -n knative-serving edit configmap config-deployment
Change the value registriesSkippingTagResolving: "ko.local,dev.local" to registriesSkippingTagResolving: "xx.xx.com"
Remember to comment out the example.
For a permanent fix, edit the configuration file: kustomize/knative-install/base/config-map.yaml
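To check whether a given image would be covered by the new setting, the matching can be modelled in a few lines. This is a simplified sketch and not Knative's own code: it assumes the setting is a comma-separated list of registry hosts and that every image reference starts with its registry host. Both function names are hypothetical.

```python
def registry_of(image: str) -> str:
    """Return the registry host of a reference such as host/repo/name:tag."""
    return image.split("/", 1)[0]


def skips_tag_resolution(image: str, setting: str) -> bool:
    """True if the image's registry is listed in the comma-separated setting."""
    skipped = {entry.strip() for entry in setting.split(",") if entry.strip()}
    return registry_of(image) in skipped
```

With the default value "ko.local,dev.local" an image under xx.xx.com is not matched, which is consistent with the digest resolution failure shown above; after the change it is.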
kubectl get pods -n kubeflow
kubectl describe pod sklearn-iris-predictor-default-zxgdn-deployment-b89978b-fhzw6 -n kubeflow
Problem 3: Normal BackOff 28s kubelet, docker-dsu-sitsvr-kubeflow011 Back-off pulling image "gcr.io/kfserving/storage-initializer:0.2.2"
Solution
Download knative-releases_knative_dev_serving_cmd_queue:0.0.2 again and push it to Harbor
Download source: https://hub.docker.com/search?q=sklearnserver&type=image
docker pull adamjm32/storage-initializer:0.2.2
docker tag adamjm32/storage-initializer:0.2.2 xx.xx.com/kubeflow/storage-initializer:0.2.2
docker push xx.xx.com/kubeflow/storage-initializer:0.2.2
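The same pull, tag and push sequence is needed for every image that cannot be fetched from gcr.io, so it may help to generate the commands. This is a minimal sketch: the helper names `mirrored_ref` and `mirror_commands` are hypothetical, and it assumes the mirrored image keeps its original name and tag under the private registry prefix.

```python
from typing import List


def mirrored_ref(source: str, target_prefix: str) -> str:
    """Keep the final name:tag component and place it under target_prefix."""
    name_and_tag = source.rsplit("/", 1)[-1]
    return f"{target_prefix.rstrip('/')}/{name_and_tag}"


def mirror_commands(source: str, target_prefix: str) -> List[str]:
    """Return the docker commands that copy source into the private registry."""
    target = mirrored_ref(source, target_prefix)
    return [
        f"docker pull {source}",
        f"docker tag {source} {target}",
        f"docker push {target}",
    ]
```

`mirror_commands("adamjm32/storage-initializer:0.2.2", "xx.xx.com/kubeflow")` yields the three commands listed above.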
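Command: kubectl -n knative-serving edit configmap config-deployment
Change the value queueSidecarImage: gcr.io/kfserving/storage-initializer:0.2.2 to queueSidecarImage: xx.xx.com/kubeflow/knative-releases_knative_dev_serving_cmd_queue:0.0.2
If you then hit the error: standard_init_linux.go:211: exec user process caused "exec format error"
the image itself is the problem (this error usually means the binary was built for a different CPU architecture); switching to another image fixes it.
kubectl get pods -n kubeflow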
kubectl logs sklearn-iris-predictor-default-vj4jx-deployment-66f78fb5cd4btsr -n kubeflow -c storage-initializer
Problem 4: Warning InspectFailed 9s (x2 over 24s) kubelet, docker-dsu-sitsvr-kubeflow011 Failed to apply default image tag "harbor.prd.com/kubeflow/sklearnserver:": couldn't parse image reference "harbor.prd.com/kubeflow/sklearnserver:": invalid reference format
Add the following to kustomize/kfserving-install/base/config-map.yaml
"sklearn": {
    "image": "harbor.prd.com/kubeflow/sklearnserver",
    "defaultImageVersion": "0.2.2",
    "allowedImageVersions": [
        "0.2.2",
        "0.2.3",
        "0.2.4"
    ]
},
The code has a check that uses runtimeVersion
https://github.com/kubeflow/kfserving/blob/master/pkg/apis/serving/v1alpha2/framework_scikit.go
Add a runtimeVersion to the YAML of the model service
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kubeflow
spec:
  default:
    predictor:
      minReplicas: 1
      sklearn:
        storageUri: "pvc://kfserving-pvc-source/sklearn_iris/model.joblib"
        runtimeVersion: "0.2.4"
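The trailing colon in the failing reference suggests that the image name was joined with an empty version. The sketch below is a simplified Python model of that behaviour, not a translation of the Go code in framework_scikit.go; the function names are hypothetical, and the validity check is deliberately rough (it only looks for a non-empty tag after the last colon).

```python
from typing import Dict, Optional


def image_reference(config: Dict, runtime_version: Optional[str] = None) -> str:
    """Join image and version, falling back to the configured default version."""
    version = runtime_version or config.get("defaultImageVersion", "")
    return f'{config["image"]}:{version}'


def has_tag(reference: str) -> bool:
    """Rough check: something non-empty must follow the last colon."""
    name, _, tag = reference.rpartition(":")
    return bool(name) and bool(tag)


# Config without a default version and no runtimeVersion in the service YAML.
broken = {"image": "harbor.prd.com/kubeflow/sklearnserver"}
print(image_reference(broken))           # ends with ":" and has no tag
print(image_reference(broken, "0.2.4"))  # runtimeVersion supplies the tag
```

Either fix from this section (a defaultImageVersion in the config map, or a runtimeVersion in the service YAML) gives the reference a non-empty tag.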
kubectl logs sklearn-iris-predictor-default-pjfxj-deployment-6c95bbf6557l4rl -n kubeflow -c kfserving-container
Problem 5
[I 201118 12:17:40 storage:35] Copying contents of /mnt/models to local
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sklearnserver/sklearnserver/__main__.py", line 33, in <module>
    model.load()
  File "/sklearnserver/sklearnserver/model.py", line 33, in load
    model_path = kfserving.Storage.download(self.model_dir)
  File "/kfserving/kfserving/storage.py", line 58, in download
    (_GCS_PREFIX, _S3_PREFIX, _LOCAL_PREFIX))
Analysis
The model path cannot be found; a PVC needs to be mounted
Reference code: https://github.com/kubeflow/kfserving/blob/master/pkg/webhook/admission/pod/storage_initializer_injector.go
PvcSourceMountName = "kfserving-pvc-source"
Solution
Make the PVC name match: kfserving-pvc-source
Create a new PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kfserving-pvc-source
  namespace: kubeflow
spec:
  resources:
    requests:
      storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: cbs
Update the YAML of the model service
storageUri: "pvc://kfserving-pvc-source/sklearn_iris/model.joblib"
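To see which claim name a storageUri refers to, the URI can be split into the PVC name and the path inside the volume. This is an illustrative sketch only: `split_pvc_uri` is a hypothetical helper, not part of kfserving, and it assumes the form pvc://<claim-name>/<path>.

```python
from typing import Tuple

PVC_PREFIX = "pvc://"


def split_pvc_uri(storage_uri: str) -> Tuple[str, str]:
    """Split pvc://<claim-name>/<path> into (claim-name, path)."""
    if not storage_uri.startswith(PVC_PREFIX):
        raise ValueError(f"not a pvc:// URI: {storage_uri!r}")
    claim, _, path = storage_uri[len(PVC_PREFIX):].partition("/")
    if not claim:
        raise ValueError(f"missing claim name in {storage_uri!r}")
    return claim, path


claim, path = split_pvc_uri("pvc://kfserving-pvc-source/sklearn_iris/model.joblib")
print(claim)  # the PersistentVolumeClaim that must exist in the namespace
print(path)   # the model location inside that volume
```

The claim name returned here has to match metadata.name of the PersistentVolumeClaim created above.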
Problem 6:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sklearnserver/sklearnserver/__main__.py", line 33, in <module>
    model.load()
  File "/sklearnserver/sklearnserver/model.py", line 37, in load
    self._model = joblib.load(model_file) #pylint:disable=attribute-defined-outside-init
  File "/usr/local/lib/python3.6/dist-packages/joblib/numpy_pickle.py", line 585, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/usr/local/lib/python3.6/dist-packages/joblib/numpy_pickle.py", line 504, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python3.6/pickle.py", line 1388, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.svm._classes'
Solution
https://github.com/kubeflow/kfserving/issues/1214
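The scikit-learn versions do not match: the model was pickled with a newer scikit-learn than the one available where it is loaded, so the version needs to be changed to match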
pip3 install --upgrade scikit-learn==0.20.3
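A mismatch like this can be caught before deployment by comparing the version that trained the model with the one in the server image. This is a minimal sketch under the assumption that matching major.minor releases is enough for the pickle to load; both helper names are hypothetical, and the version strings in the example are only illustrative.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.20.3'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release_line(trained_with: str, served_with: str) -> bool:
    """True when both versions share the same major.minor release."""
    return major_minor(trained_with) == major_minor(served_with)


# Illustrative values: a model trained on 0.22.x served by a 0.20.x image.
print(same_release_line("0.22.2", "0.20.3"))
print(same_release_line("0.20.3", "0.20.1"))
```

In the training environment, `sklearn.__version__` gives the value to record alongside the exported model file.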