kubeflow/fate-operator

[bug] Errors when installing Helm chart in KubeFate

Closed this issue · 8 comments

I am using config/samples to create the FateCluster, but the operator enters a livelock and never creates it successfully.

It creates the cluster, then removes it in the next reconciliation.

spec:
  clusterSpec:
    chartName: fate
    chartVersion: v1.4.0-a
    istio: {}
    modules:
    - rollsite
    - clustermanager
    - nodemanager
    - mysql
    - python
    - client
    mysql:
      accessMode: ReadWriteOnce
      database: eggroll_meta
      ip: mysql
      nodeSelector: {}
      password: fate_dev
      port: 3306
      size: 1Gi
      storageClass: mysql
      user: fate
    name: fatecluster-sample
    namespace: fate-9999
    nodemanager:
      count: 3
      list:
      - accessMode: ReadWriteOnce
        name: nodemanager
        nodeSelector: {}
        sessionProcessorsPerNode: 2
        size: 1Gi
        storageClass: nodemanager
        subPath: nodemanager
      sessionProcessorsPerNode: 4
    partyId: 9999
    python:
      fateflowNodePort: 30109
      fateflowType: NodePort
      nodeSelector: {}
    rollsite:
      exchange:
        ip: 192.168.1.1
        port: 30000
      nodePort: 30009
      nodeSelector: {}
      partyList:
      - partyId: 10000
        partyIp: 192.168.10.1
        partyPort: 30010
      type: NodePort
    servingIp: 192.168.9.1
    servingPort: 30209
  kubefate:
    name: kubefate-sample
    namespace: kube-fate
status:
  jobId: 6d6703c3-a521-4cf6-a930-f191babe61a9
  clusterId: d57d0daa-5479-4638-ae45-7b6309d2efaf
  status: Creating

After a while, the status becomes:

status:
  status: Creating

Logs: https://paste.ubuntu.com/p/MdSj8x7YRH/

/kind bug

Found the reason

2020-07-02T12:03:18.515+0800	DEBUG	controllers.FateCluster	request success	{"body": "{\"data\":{\"uuid\":\"3b3cf865-0c97-41c3-a6be-bc1981ef24b4\",\"start_time\":\"2020-07-02T04:03:17.297Z\",\"end_time\":\"2020-07-02T04:03:17.316Z\",\"method\":\"ClusterInstall\",\"result\":\"failed to download \\\"kubefate/fate\\\" (hint: running `helm repo update` may help)\",\"cluster_id\":\"afa78a89-c27c-4182-acc0-3ddcb65beb6c\",\"creator\":\"admin\",\"sub-jobs\":null,\"status\":\"Failed\",\"time_limit\":3600000000000}}\n"}

It would be better to surface this error as an event on the FateCluster CR.
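As a rough illustration of what that could look like, here is a minimal Go sketch that decodes the job response body shown in the log above and builds a message suitable for an event. The `jobResponse` type and the `failureMessage` helper are hypothetical names, not taken from the operator's code, and the struct covers only the fields visible in the log. The sample body in `main` is shortened from the logged one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobResponse is a hypothetical type that mirrors only the fields
// visible in the logged KubeFATE job response body.
type jobResponse struct {
	Data struct {
		UUID      string `json:"uuid"`
		Method    string `json:"method"`
		Result    string `json:"result"`
		ClusterID string `json:"cluster_id"`
		Status    string `json:"status"`
	} `json:"data"`
}

// failureMessage decodes a job response body. It reports failed=true and a
// human-readable message when the job status is "Failed"; for any other
// status it returns an empty message and failed=false.
func failureMessage(body []byte) (msg string, failed bool, err error) {
	var r jobResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", false, fmt.Errorf("decode job response: %w", err)
	}
	if r.Data.Status != "Failed" {
		return "", false, nil
	}
	msg = fmt.Sprintf("%s job %s failed: %s", r.Data.Method, r.Data.UUID, r.Data.Result)
	return msg, true, nil
}

func main() {
	// Shortened version of the body from the log; the hint text is omitted.
	body := []byte(`{"data":{"uuid":"3b3cf865-0c97-41c3-a6be-bc1981ef24b4","method":"ClusterInstall","result":"failed to download \"kubefate/fate\"","status":"Failed"}}`)

	msg, failed, err := failureMessage(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if failed {
		// In the operator this message would presumably be passed to an
		// event recorder (for example client-go's record.EventRecorder)
		// as a Warning event on the FateCluster object.
		fmt.Println(msg)
	}
}
```

With something like this, `kubectl describe` on the FateCluster would show the download failure directly instead of requiring a look at the operator's debug logs.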

/kind feature

2020-07-02T03:56:39Z ERR pkg/service/chart.go:356 > repoAdd error="looks like \"https://federatedai.github.io/KubeFATE/\" is not a valid chart repository or cannot be reached: Get \"https://federatedai.github.io/KubeFATE/index.yaml\": dial tcp: lookup federatedai.github.io on 10.0.0.10:53: server misbehaving"
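The log above shows the chart repository's `index.yaml` could not be fetched because of a DNS failure. A small pre-flight check along these lines could report that up front. This is only a sketch: `indexURL` and `checkRepo` are hypothetical helpers, not part of KubeFATE, and the 5-second timeout is an arbitrary choice.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// indexURL returns the index.yaml location for a chart repository URL.
// It rejects URLs without a scheme or host.
func indexURL(repo string) (string, error) {
	u, err := url.Parse(repo)
	if err != nil {
		return "", fmt.Errorf("parse repo URL: %w", err)
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("repo URL %q needs a scheme and a host", repo)
	}
	u.Path = strings.TrimSuffix(u.Path, "/") + "/index.yaml"
	return u.String(), nil
}

// checkRepo fetches index.yaml and returns an error if the request fails
// (DNS, connection, timeout) or the server does not answer 200 OK.
func checkRepo(client *http.Client, repo string) error {
	idx, err := indexURL(repo)
	if err != nil {
		return err
	}
	resp, err := client.Get(idx)
	if err != nil {
		return fmt.Errorf("fetch %s: %w", idx, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetch %s: unexpected status %s", idx, resp.Status)
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	if err := checkRepo(client, "https://federatedai.github.io/KubeFATE/"); err != nil {
		fmt.Println("chart repository not usable:", err)
		return
	}
	fmt.Println("chart repository reachable")
}
```

Running this from a pod in the same namespace as KubeFATE would show whether the failure is a cluster DNS or egress problem rather than something in the operator.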

Can we support offline installation for the cluster? Downloading from GitHub is impractical in many industries.

Actually, there is an offline installation solution based on Harbor: Helm charts (what cluster? FATE? FATE-Serving? what version?) and images are stored in Harbor (https://github.com/FederatedAI/KubeFATE/blob/8ef9c24813c01b05a99abb84a03f7e85cd97beca/registry/README.md). We will refine it and add it to this repo.

Fixes merged

/close

@LaynePeng: Closing this issue.
