After running the command okteto stack deploy --build --log-level debug, 5 out of 7 containers started successfully. However, two of them have been stuck at the pulling stage, and at this point the memory allocation is 0/6GB and the CPU allocation is 0.0/20.
Are those images overly big? Did they eventually start?
Yes, they eventually started after I manually restarted them. It looks like the deployment had hung. I have also been noticing some performance issues.
We are training a Rasa chatbot and have fed it barely a quarter of the training data, but now the training takes like 3 hours to finish from one of the pods.
At times the pod shows 75.6MB / 6GB and 0.00 / 2.0 CPU.
What does this mean? Does it mean that no CPU capacity and only 75.6MB of memory are left, and that this is why training is extremely slow?
If this is the case, how can I estimate the capacity I will need, especially while training models? I have noticed that this happens only while the model is training; once the model has been trained, everything is fine and faster.
Right now, for example, nothing has happened for the past hour and the CPU shows 0.0.
This pod has refused to start. It keeps restarting by itself pulling image continuously since morning. It cannot be because of training; I think it has something to do with resources. The logs just show continuous restarts and image pulls: processing reaches a certain point where the CPU gets to 1.8/2.0, and then the process restarts again.
Please advise on this urgently.
It cannot be right that this issue has persisted for over 12 hours now. The pod is not restarting. Is it resources? Am I supposed to upgrade to a higher band? I just need to know so that we can move forward. The CPU is maxing out at 1.9/2.0 and showing red. What am I supposed to do?
Hi Hebifa,
- The CPU/Memory indicator in the UI shows how much CPU and memory you have requested; it’s not meant to represent the CPU/memory consumption at that specific moment. We are aware that this is confusing, and we have plans to improve it. For now, if your pods are being created, the issue is not related to quotas or tier.
- but now the training takes like 3 hours to finish from one of the pods.
→ The Okteto Starter tier doesn’t guarantee consistent performance since you are running on shared resources (this is similar to how other platforms like Gitpod or Heroku work). At busy times, your pod’s “real-life” speed can fluctuate. Okteto Starter is designed for development environments, not for workloads that are CPU sensitive or that are meant to run over a long period. This is why you are seeing this behavior only during training rather than during the execution of your bot.
- It keeps restarting by itself pulling image continuously since morning
→ If this is still happening, can you open a support ticket? Please add a screenshot of the issue, and include as much diagnostic information as you can (okteto.yaml, name of the namespace, approximate time of the issue) so our team can look into this further.
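On the first point, since the UI shows requested rather than consumed resources, one way to see what the training process actually uses is to measure it from inside the pod. The snippet below is only a minimal sketch using Python's standard resource module (Unix only). The function names usage_snapshot and cpu_cores_used are made up for illustration, and it only counts the current process, not any child processes it spawns.

```python
import resource
import sys
import time


def usage_snapshot():
    """Return CPU seconds and peak resident memory used so far by this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return {
        "cpu_seconds": ru.ru_utime + ru.ru_stime,
        "peak_rss_mb": ru.ru_maxrss / divisor,
    }


def cpu_cores_used(before, after, wall_seconds):
    """Average number of cores kept busy between two snapshots."""
    if wall_seconds <= 0:
        return 0.0
    return (after["cpu_seconds"] - before["cpu_seconds"]) / wall_seconds


if __name__ == "__main__":
    start, t0 = usage_snapshot(), time.monotonic()
    sum(i * i for i in range(2_000_000))  # stand-in for one training step
    end, elapsed = usage_snapshot(), time.monotonic() - t0
    print(f"cores used: {cpu_cores_used(start, end, elapsed):.2f}")
    print(f"peak memory: {end['peak_rss_mb']:.1f} MB")
```

If the "cores used" figure stays close to your CPU limit for the whole run, the workload is CPU bound; if it sits near zero for long stretches, the process is probably waiting on something else.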
Hope this helps!
Thank you for the response.
Just for correction and clarity, I wish to confirm that we are not on Okteto Starter. We are on Okteto Starter Pro. Is there a need to move to the next tier? We really need your guidance, because Okteto and how it works are still relatively new to us, and with your guidance we will learn faster how to optimize and use the environment better.
Yes, this is still happening. I have submitted a support ticket with the requested information.
Looking forward to your feedback and support as usual.
Thank you
@ramiro I have sent numerous emails with the information you requested, and three days later I have not received any response. Should I expect a response on the issue affecting us? We are still down five days later, and we have paid for the services.
Is it even fair to ignore messages for this long?
I am now begging for the services we have paid for, and asking whether I need to pay for additional support to get the services back up. Please let me know. I am not getting any feedback from you.
@rlamana @rubengarcia0510 @RuslanT @relston @Rohit-Ranjane
Can anyone help us, please? We are completely stuck.
Hello. I take note from your suggestions and resolve the problem
I don’t understand. Do you mean you have resolved the problem, or are you working on it? Please clarify, and thank you very much for getting back to me on this.
I truly appreciate your feedback and help.
@Hebifa.Rafiki I will follow up with you on the support ticket.
Support tickets are the best escalation path for issues related to infrastructure.
For guidance on how to use Okteto, or how to troubleshoot your application, I recommend that you continue to use this forum. The Okteto community and technical staff will do our best to help you, but there is no SLA on responses in the forum.
Regarding your support question, the Starter Pro plan includes US business-hours ticket support, Monday to Friday (excluding US holidays). Extended support hours and response SLAs are available as part of our Scale and Enterprise plans.
Yes. I’ve resolved the problem
I have noted this and will engage you on the support ticket. Thank you
Thank you. Let me validate this and get back to you.
A few other things to try that might help you here:
- Split the “training” phase from the “running” phase: This way, even if the training fails, your application can be online.
- Save the progress of your training: Don’t assume that the pod will be active for X hours. A lot of things can happen that will make your pod exit (e.g. node being too full, pod running out of memory, network failures, etc). This is a best practice for cloud services in general, but it’s especially important for long-running models. Can you save the progress as a file? For this, you should set up a persistent volume, so that the saved file survives even if your pod goes away.
- Parallelize the training across multiple pods. This could also help you make your process more resilient.
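For the second point, saving progress could look roughly like the sketch below. It is a generic pattern, not something specific to Okteto or Rasa. The path /data/training_progress.json, the CHECKPOINT_PATH environment variable, and the train_one_step callback are all assumptions for illustration, and it assumes your training can be split into steps that can be resumed, which you would need to confirm for Rasa.

```python
import json
import os
import tempfile

# Hypothetical location; assumes a persistent volume is mounted at /data.
CHECKPOINT_PATH = os.environ.get("CHECKPOINT_PATH", "/data/training_progress.json")


def load_progress(path=CHECKPOINT_PATH):
    """Return the saved progress, or a fresh state if nothing was saved yet."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"next_step": 0}


def save_progress(state, path=CHECKPOINT_PATH):
    """Write the state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def run_training(total_steps, train_one_step, path=CHECKPOINT_PATH):
    """Resume from the last saved step and checkpoint after every step."""
    state = load_progress(path)
    for step in range(state["next_step"], total_steps):
        train_one_step(step)  # placeholder for the real unit of work
        state["next_step"] = step + 1
        save_progress(state, path)
    return state
```

If the pod exits partway through, the next run picks up at the first step that was not recorded as finished, so at most one step of work is repeated.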
Hope this helps!
Thank you very much for this feedback.
I believe this will help us become more resilient. The only challenge is that I honestly do not know where to begin.
- Do you have documentation on how to implement the above recommendations on your platform, so that we can follow it?
- Do you offer training that we can plan and pay for, to make us self-sufficient?
- Do you have someone who can walk us through the above recommendations, and how much would such a service cost?
The chatbot we are building and the services we are offering require a lot of work and continuous improvement, and the more people interact with it, the more traction and feedback we get to make it better. An extended outage such as the one we have experienced has a huge negative impact on the traction we need as a team.
Please advise on where I can start.
Once again, thank you very much for this feedback
Regards
Thank you for the attempt.
Unfortunately, the restarts still persist.
I have attached the logs.
(Attachment okteto-rasa-log-file-02.log is missing)
I have great news: after rebuilding and restarting the pod, it took a long time, but the issue is now resolved.
Please forgive the annoyance and persistence, and accept my sincere appreciation for your support.
Thank you.
We don’t offer those services, since we are not experts on building RASA bots. Have you tried reaching out to the RASA community? The practices mentioned above are good practices when building cloud services in general; there is nothing Okteto-specific about them.
One place to start is to use separate namespaces for the different phases of development. Have one namespace per developer, another for training the model, etc. This will help you contain any failures and prevent them from propagating to the next development stage.
Overall, it is a bad practice to make your development environment available to your end users. We always recommend that our users use Okteto for all their development and testing needs, and then use something like Google Cloud or AWS for their production needs. Based on what you describe above, it sounds like you are mixing your development and production requirements.