Possible to get a list of docker images that are published to our okteto registry?

Hi! Been a minute. We somehow horked our self-hosted installation during an upgrade, and now we’re seeing some errors like this in our preview environments:

container 'migrate' image 'registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto' not found or it is private and 'imagePullSecrets' is not properly configured.

Curiously, the dev envs work okay. I don’t think it’s an issue with pull secrets - we’re just using the standard okteto registry inside the namespace for the preview env.

another error:

2023-07-29 17:08:48.00 UTC  fuji-service-6b655f4c94-kct4z  [pod-event]  Error: ImagePullBackOff
2023-07-29 17:22:17.00 UTC  fuji-service-6bd9d5d58f-wwbf6  [pod-event]  Pulling image "registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto"
2023-07-29 17:22:17.00 UTC  fuji-service-6bd9d5d58f-wwbf6  [pod-event]  Failed to pull image "registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto": rpc error: code = NotFound desc = failed to pull and unpack image "registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto": failed to resolve reference "registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto": registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto: not found

Is there a way we can list the catalogs? My k8s fu just isn’t quite what I wish it were to be able to suss this out.
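For what it's worth, a registry that speaks the standard Docker Registry HTTP API v2 exposes catalog and tag-listing endpoints, so something along these lines can work as a stopgap. This is a sketch, not an Okteto feature: the auth variables (`OKTETO_USER`, `OKTETO_TOKEN`) are assumptions, and your registry may gate these endpoints differently.

```shell
# Sketch: list repositories/tags via the standard Docker Registry HTTP API v2.
# How the Okteto registry authenticates these calls is an assumption here.
REGISTRY="registry.dev.upshift.earth"
REPO="pr-81/fuji-service-fuji-service"

CATALOG_URL="https://${REGISTRY}/v2/_catalog"        # lists all repositories
TAGS_URL="https://${REGISTRY}/v2/${REPO}/tags/list"  # lists tags for one repository

echo "$CATALOG_URL"
echo "$TAGS_URL"

# Hypothetical credentials; substitute whatever auth your registry expects:
# curl -s -u "$OKTETO_USER:$OKTETO_TOKEN" "$CATALOG_URL"
# curl -s -u "$OKTETO_USER:$OKTETO_TOKEN" "$TAGS_URL"
```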

Hi Ben!

Yes, I don’t think this is an issue with secrets. In the latest versions of Okteto, we introduced a cache to improve the build and image pull speed. I think you might be hitting something related to how we internally name images.

Could you share your okteto manifest?

Not at the moment. It’s a feature request we are considering though.

No worries about the catalog cmd. That would be useful, but I’m sure we can muddle through.

Here is the manifest:

❯ cat okteto.yml
name: fuji-service

# The build section defines how to build the images of your development environment
# More info: https://www.okteto.com/docs/reference/manifest/#build
build:
  fuji-service:
    # You can use the following env vars to refer to this image in your deploy commands:
    #  - OKTETO_BUILD_HELLO_ROCKET_SHA: image tag sha256
    context: .
    dockerfile: Dockerfile

# The deploy section defines how to deploy your development environment
# More info: https://www.okteto.com/docs/reference/manifest/#deploy
deploy:
  - name: Deploy the helms
    # if SYSTEM_TOKEN is supplied, we assume we have other preview env vars for now.
    command: |
      if [ -f ".env" ]; then
        source .env
        helm upgrade --install chart ./helm \
          --set db_url="${DATABASE_URL}" \
          --set oso_api_key="${OSO_API_KEY}"
      else
        helm upgrade --install chart ./helm \
          --set db_url="postgresql://fuji@db:26257/defaultdb" \
          --set oso_api_key="${OSO_API_KEY}" \
          --set cockroachdb.enabled=true
      fi

# The dependencies section defines other git repositories to be deployed as part of your development environment
# More info: https://www.okteto.com/docs/reference/manifest/#dependencies
# dependencies:
#   - https://github.com/okteto/sample

# The dev section defines how to activate a development container
# More info: https://www.okteto.com/docs/reference/manifest/#dev
dev:
  fuji-service:
    selector:
      app: fuji-service
    command: bash
    workdir: /usr/src/app
    sync:
      - .:/usr/src/app
    volumes:
      - /usr/local/cargo/registry
      - /home/root/app/target
    persistentVolume:
      enabled: true
      #storageClass: okteto-standard
      size: 10Gi

Thanks for looking into this, Ramiro!

I also see this issue in the admin UI (not sure if it’s related, but since you mentioned caching):

I tried setting some buildkit values in the config.yaml:

buildkit:
  ingress:
    enabled: false
  service:
    type: LoadBalancer
  persistence:
    enabled: true
    storageClass: ssd
    size: 180Gi
    cache: 150000

But that didn’t get the error to go away.

Self-hosted, Amazon EKS, Kubernetes 1.24.

Another thing that has seemed to stop working is that I can no longer build from local:

❯ okteto build
 i  Building 'Dockerfile' in tcp://buildkit.dev.upshift.earth:443...
[+] Building 0.0s (0/0)
 x  Error building service 'fuji-service': error building image 'registry.dev.upshift.earth/bennidhamma/fuji-service-fuji-service:okteto': build failed: failed to get status: rpc error: code = Unavailable desc = connection closed

Looking at the last error you shared, it seems like BuildKit stopped functioning after the upgrade. Could you start a separate topic for this? It’ll be useful if you mention which version you upgraded from and to, the error you are seeing on the build pods, and your values.yaml (the one used to install Okteto).

It does seem to build from inside of okteto, though (see attached). I just can’t seem to connect to buildkit from within my okteto context on my dev machine… ?

Interesting. I rolled back to 1.7.1 and it looks like it’s able to pull the images again!

Note: I fixed the buildkit issue by adding a separate Route 53 entry for buildkit. Seems like some versions of okteto use the shared NLB, and some put it on a different load balancer maybe? I forget…

Hi @benjoldersma ,

I have some questions to better understand the issue.

Regarding the pull error you are getting: how do you refer to the image generated in the build phase within your helm manifest? I don’t see it being set in the deploy command, so I assume it is somehow hardcoded in the helm manifest of your application?

Regarding the BuildKit error: was BuildKit running OK before the upgrade to 1.10, or were you getting the "connection closed" errors then too?

When did you add this to the chart values?

buildkit:
  ingress:
    enabled: false
  service:
    type: LoadBalancer

With that change, BuildKit is exposed on its own LoadBalancer instead of behind Okteto’s main ingress controller. That’s why you needed to add the separate Route 53 entry for BuildKit. Reading your previous messages, it is not clear to me whether the BuildKit error started after the upgrade to 1.10 or after this change in the chart values, so I would like to better understand the issue and the timeline to help you.

If you are running an NLB for Okteto’s ingress controller, we recommend running BuildKit on its own LoadBalancer. In this post there is an explanation of how to configure it, which is mainly what you already did.

Thanks in advance,

Hi Nacho,

This is an example from the helm chart:

        image: {{ .Values.docker_reg }}/{{ .Values.docker_repo }}:{{ .Values.docker_tag }}

and then in values.yaml:

docker_reg: okteto.dev
docker_repo: fuji-service-fuji-service
docker_tag: okteto

Regarding the buildkit values, I’m not sure when they were added. We’ve been using okteto for the last 6 months or so, but there’s been a few of us jumping in and changing things / fixing it / upgrading it. My apologies. But it does seem like buildkit is working okay now. It was definitely working before the 1.10 upgrade; I suspect I horked something when I was reinstalling the okteto chart and ended up mashing up some Route 53 settings. Not my forte lol.

Hi @benjoldersma ,

thanks for all the information.

Regarding the issue with the images failing to pull: the error you are getting is caused by the cache system Ramiro mentioned. When an image is built and the working directory is clean, images are pushed to <registry-url>/okteto/<image-name> (the shortcut is okteto.global/<image>) instead of <registry-url>/<namespace>/<image-name> (the shortcut is okteto.dev/<image>). The reason is that an image pushed to okteto.global is accessible by all users, so they don’t have to build it again.
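To make the naming concrete, here is a minimal sketch of how those two shortcuts expand, using the registry URL and preview namespace from the errors earlier in the thread. This is a paraphrase of the naming rule just described, not Okteto’s actual code:

```shell
# Sketch (not Okteto's actual code): how the okteto.dev / okteto.global
# shortcuts expand into full image references.
REGISTRY="registry.dev.upshift.earth"
NAMESPACE="pr-81"

expand() {
  case "$1" in
    okteto.dev/*)    echo "${REGISTRY}/${NAMESPACE}/${1#okteto.dev/}" ;;  # per-namespace image
    okteto.global/*) echo "${REGISTRY}/okteto/${1#okteto.global/}" ;;     # shared, cached image
    *)               echo "$1" ;;
  esac
}

expand "okteto.dev/fuji-service-fuji-service:okteto"
# -> registry.dev.upshift.earth/pr-81/fuji-service-fuji-service:okteto
expand "okteto.global/fuji-service-fuji-service:okteto"
# -> registry.dev.upshift.earth/okteto/fuji-service-fuji-service:okteto
```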

So the issue seems to be that the image is being pushed to okteto.global, but the manifest is pulling from okteto.dev, where it doesn’t exist. You can verify this by checking where the image was actually pushed. The recommended way to achieve what you want is to do something similar to what we have in our movies example: use the environment variables we generate (OKTETO_BUILD_<service-name>_XXXX) and pass them to helm, so it deploys the right image built during the process.
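A minimal sketch of that advice, wired to the chart values shown earlier in the thread (docker_reg/docker_repo/docker_tag). The variable values below are simulated stand-ins; in a real deploy, Okteto injects the OKTETO_BUILD_* variables for you, and the exact variable names depend on your build service name:

```shell
# Sketch: instead of hardcoding okteto.dev/... in values.yaml, pass the
# image Okteto actually built via the OKTETO_BUILD_<SERVICE>_* variables.
# The values below are simulated stand-ins for illustration only.
OKTETO_BUILD_FUJI_SERVICE_REGISTRY="registry.dev.upshift.earth"
OKTETO_BUILD_FUJI_SERVICE_REPOSITORY="okteto/fuji-service-fuji-service"
OKTETO_BUILD_FUJI_SERVICE_TAG="okteto"

# Reassemble the full reference the same way the chart template does:
IMAGE="${OKTETO_BUILD_FUJI_SERVICE_REGISTRY}/${OKTETO_BUILD_FUJI_SERVICE_REPOSITORY}:${OKTETO_BUILD_FUJI_SERVICE_TAG}"
echo "$IMAGE"

# In the okteto.yml deploy command, this would look roughly like:
# helm upgrade --install chart ./helm \
#   --set docker_reg="${OKTETO_BUILD_FUJI_SERVICE_REGISTRY}" \
#   --set docker_repo="${OKTETO_BUILD_FUJI_SERVICE_REPOSITORY}" \
#   --set docker_tag="${OKTETO_BUILD_FUJI_SERVICE_TAG}"
```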

On the other hand, regarding the BuildKit issue: I wanted to know whether BuildKit stopped working because the upgrade to 1.10 also changed your BuildKit values and, with the Route 53 entry missing, the failure was expected; or whether, on the contrary, BuildKit stopped working after the upgrade and you then changed the config values to try to fix it. In any case, we will try to reproduce the issue to see if something in the upgrade breaks BuildKit. Thank you so much for the information.

Let us know if you have any other questions or if the proposed solution doesn’t work for the image issue.

Nacho, thanks so much - I’ve updated our helm templates to use the image env variable and things look better now. back up to 1.10 :slight_smile: