
EKS example is not working #37

Open
edonD opened this issue Nov 3, 2023 · 5 comments

@edonD

edonD commented Nov 3, 2023

I have tried many times now to implement the EKS example, but it is not working. After following all the steps and signing in with the username and password, JupyterHub just gets stuck. The console log shows 0, then after 5 minutes jumps to 100, but the launch fails.

@ethanfah
Contributor

@edonD thank you for bringing this to my attention, and apologies that you wasted time having to debug this.
I went through the solution and was able to replicate the issue you saw.
The jupyter notebook environment was not able to start because the pod was not able to be scheduled onto a node in the EKS cluster.
I found a thread elsewhere where other users were running into the same issue, and multiple users there confirmed that changing this setting solved it:
scheduling.userScheduler.enabled = false

To implement this back in the EKS example project here, one can add a block to the 'daskhub.yaml' file so that the beginning of the file looks like this:

jupyterhub:
  scheduling:
    userScheduler:
      enabled: false

I have also gone through the solution and updated the software packages to the latest versions of EKS, eksctl, etc. Pull requests are pending; once they are merged, all of these changes will be back in the solution.
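As a sanity check on the nesting of that override, the values structure can be verified programmatically before running `helm upgrade`. A minimal sketch using only the standard library; the `deep_merge` helper and the sample `existing` values are hypothetical, not part of the example repo:

```python
# Override that disables JupyterHub's user scheduler, matching the
# daskhub.yaml block above.
override = {
    "jupyterhub": {
        "scheduling": {
            "userScheduler": {"enabled": False},
        },
    },
}

def deep_merge(base, extra):
    """Recursively merge `extra` into a copy of `base` (dicts only)."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical pre-existing values; the override leaves unrelated keys intact.
existing = {"jupyterhub": {"proxy": {"secretToken": "<redacted>"}}}
effective = deep_merge(existing, override)
print(effective["jupyterhub"]["scheduling"]["userScheduler"]["enabled"])  # False
```

This mirrors how Helm itself layers values files, so the printed `False` confirms the block is nested at the right depth.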

@jflasher
Collaborator

@ethanfah I just merged all the PRs in, so latest should be in place.

@edonD if you get a chance to rerun with the latest and confirm it's working, just leave a comment here or go ahead and close the issue. Thanks!

@edonD
Author

edonD commented Nov 13, 2023

Wow, impressed with the fast feedback from you guys. I will try it out and let you know. Thank you nonetheless!

@edonD
Author

edonD commented Nov 13, 2023

The configuration is smooth and works perfectly. The only problem I can see so far is that it somehow doesn't create the gateway cluster. I am using the cmip6_zarr.ipynb example. It gets stuck at:

cluster = GatewayCluster(worker_cores=0.8, worker_memory=3.3)
cluster.scale(32)
client = cluster.get_client()
cluster

I tried to initiate it with fewer workers, but it is still the same. I guess this can have many reasons, so if it works on your setup I can spend some more time and try to debug it.

@ethanfah
Contributor

The configuration is smooth and works perfectly. The only problem I can see so far is that it somehow doesn't create the gateway cluster. I am using the cmip6_zarr.ipynb example. It gets stuck at:

cluster = GatewayCluster(worker_cores=0.8, worker_memory=3.3)
cluster.scale(32)
client = cluster.get_client()
cluster

I tried to initiate it with fewer workers, but it is still the same. I guess this can have many reasons, so if it works on your setup I can spend some more time and try to debug it.

If you got as far as logging into the notebook, then at least the minimum number of EC2 instances were provisioned correctly. The step you are running into issues with requires additional EC2 instances to be created. The way it works is that Dask worker pods are created, and because the cluster does not have enough room to fit all of them, they sit in Pending until the cluster autoscaler steps in and creates additional nodes for those pods to be scheduled on.

So the first thing to check is whether all of those Dask worker pods were created and are waiting to be scheduled. If they are, then the next thing to check is whether the cluster autoscaler was installed correctly and is actually trying to create more EC2 instances. Provisioning more EC2 instances can take a while, sometimes 5-10 minutes.
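For the first check, the output of `kubectl get pods -o json` can be tallied by phase. A small sketch with sample output inlined for illustration; the namespace, pod names, and the `dask-worker` name prefix are assumptions and may differ in your deployment:

```python
import json
from collections import Counter

# Trimmed sample of what `kubectl -n daskhub get pods -o json` might
# return; pod names here are illustrative.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "dask-worker-abc12"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "dask-worker-def34"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "jupyter-edond"}, "status": {"phase": "Running"}},
    ]
})

def pod_phases(kubectl_json, prefix="dask-worker"):
    """Tally pod phases for pods whose name starts with `prefix`."""
    pods = json.loads(kubectl_json)["items"]
    return Counter(
        pod["status"]["phase"]
        for pod in pods
        if pod["metadata"]["name"].startswith(prefix)
    )

# A nonzero Pending count is what the cluster autoscaler reacts to.
print(sorted(pod_phases(sample).items()))  # [('Pending', 1), ('Running', 1)]
```

If workers are stuck in Pending with no new nodes appearing, the autoscaler (or its IAM permissions) is the next thing to inspect.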

One other thing to keep in mind: the default configuration for this solution uses Spot Instances for the worker nodes, and Spot capacity is not always available. In some cases you could find that your EC2 instances are not launching simply because there is not enough Spot capacity in your region/AZ, and the solution would work if you tried it the next day.
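Since node provisioning can take 5-10 minutes (and longer when Spot capacity is scarce), it can help to poll explicitly instead of assuming workers appear right away. A generic sketch; the helper and the commented usage line are hypothetical, not part of the example notebook:

```python
import time

def wait_for(predicate, timeout=600.0, interval=15.0):
    """Poll `predicate` every `interval` seconds until it returns True
    or `timeout` seconds elapse; returns whether it ever succeeded."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Usage sketch (illustrative), run in the notebook after cluster.scale(32):
#   wait_for(lambda: len(client.scheduler_info()["workers"]) >= 32)
```

If the predicate is still failing after the timeout, that points back at Pending pods, the autoscaler, or Spot availability rather than the notebook code itself.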

Let me know what you find; I'm hoping to help you get this working.
