Buildkit is returning 500 errors

Hey all,

I’m seeing some strange behaviour from our self-hosted buildkit pod. When we try to build one specific namespace of ours, buildkit logs the following error:

time="2022-11-11T13:32:30Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = unexpected status: 500 Internal Server Error\n"
time="2022-11-11T13:55:56Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = unexpected status: 500 Internal Server Error\n"
time="2022-11-11T14:58:57Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = unexpected status: 500 Internal Server Error\n"

It’s fairly reproducible, and it then seems to break builds for all other namespaces. After a quick restart of the pod, the other namespaces work again.

Any ideas? Debugging steps?

Thanks!

NB: Still learning k8s so if you would like logs etc. please do be specific :+1:

That error typically happens when the buildkit pods are not healthy.

The first thing to check is the overall status of your buildkit pods. Run the command below and look for pods with the prefix okteto-buildkit:

kubectl get -n=okteto pod

A healthy pod looks like this:

NAME                           READY   STATUS    RESTARTS   AGE
okteto-buildkit-8634cf04bf-0   1/1     Running   0          6d8h
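
If there are many pods in the namespace, a small filter can make the unhealthy ones stand out. This is only a sketch: the function name unhealthy_buildkit_pods is made up for illustration, and it assumes the default column layout shown above (NAME, READY, STATUS, RESTARTS).

```shell
# Hypothetical helper: reads `kubectl get pod` output on stdin and prints
# buildkit pods that are not fully ready, not Running, or have restarted.
unhealthy_buildkit_pods() {
  awk 'NR > 1 && $1 ~ /^okteto-buildkit/ {
    split($2, ready, "/")   # READY column looks like "1/1"
    if (ready[1] != ready[2] || $3 != "Running" || $4 + 0 > 0)
      print $1, $2, $3, $4
  }'
}

# Usage (needs access to the cluster):
#   kubectl get -n=okteto pod | unhealthy_buildkit_pods
```

No output means every okteto-buildkit pod is ready, Running, and has zero restarts.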

You can then get more information on each pod by running the following command (replace the pod name with the name of your buildkit pod):

kubectl get pod -n=okteto -oyaml okteto-buildkit-8634cf04bf-0

If that doesn’t show anything odd, you can also check the pod logs by running the command below:

kubectl logs -n=okteto okteto-buildkit-8634cf04bf-0
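
Buildkit logs can be noisy, so it may help to reduce them to the distinct error messages and how often each one occurs. A minimal sketch, assuming the time="..." level=... log format from the original post; buildkit_error_summary is a hypothetical name:

```shell
# Hypothetical helper: reads buildkit logs on stdin, keeps error-level lines,
# strips the timestamp, and counts how often each distinct message appears.
buildkit_error_summary() {
  grep 'level=error' \
    | sed -e 's/^time="[^"]*" //' \
    | sort | uniq -c | sort -rn
}

# Usage (needs access to the cluster):
#   kubectl logs -n=okteto okteto-buildkit-8634cf04bf-0 | buildkit_error_summary
```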

If a restart solves it, it could mean that the pod is running out of space (on restart, the cache is cleaned). Would you mind sharing the logs from the buildkit pods, and your config.yaml file?
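
To check the out-of-space theory the next time it happens, something like the sketch below could be run before restarting the pod. The function name is made up, and the path /var/lib/buildkit is an assumption (it is BuildKit's default root directory, but your chart may mount the cache elsewhere).

```shell
# Hypothetical helper: collects the basics for a buildkit pod.
#   $1 = pod name, $2 = namespace (default: okteto),
#   $3 = buildkit data directory (ASSUMED default: /var/lib/buildkit)
buildkit_diagnostics() {
  pod="$1"
  ns="${2:-okteto}"
  root="${3:-/var/lib/buildkit}"

  # Events and last container state (e.g. OOMKilled, Evicted).
  kubectl describe pod -n "$ns" "$pod"

  # Logs from the previous container instance, if the container restarted.
  kubectl logs -n "$ns" "$pod" --previous --tail=200 || true

  # Disk usage of the volume holding the build cache.
  kubectl exec -n "$ns" "$pod" -- df -h "$root"
}

# Usage (needs access to the cluster):
#   buildkit_diagnostics okteto-buildkit-8634cf04bf-0
```

Note that --previous only returns something if the container inside the pod restarted; if the whole pod was deleted and recreated, the old logs are gone.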

Hi @ramiro, thank you for getting back to me - I had previously run the above commands and couldn’t see anything too out of the ordinary. Unfortunately/fortunately the issue isn’t occurring anymore so I can’t get a log dump for you to use.

It appears one of our PRs was causing the outage: on a namespace rebuild it would cause buildkit to freeze, and then all other namespaces would be unable to build. Since re-deploying the pod fixed this, it was most likely running out of some resource (we have quite a restrictive environment with regard to resources).

Again, thanks for the support, and if we ever enter this state again I’ll get some logs for you!