JupyterHub with K8s: Shared /home volume?

Hello Jovyans,

I’m trying to figure out a way to have /home exist as a shared volume mount, with users’ home directories, i.e. /home/$USER, all found on this mount.

Here’s my motivation. I want to:

  1. use fewer volumes (provided by the underlying cloud), as all user data will exist on a single volume as opposed to one volume per user
  2. simplify volume backups/snapshots; 1 vs many
  3. allow JHub admins (in my case, classroom instructors) to more easily access student files for grading assignments, while students cannot peek at each other’s work

Roughly, I see a path forward to get me most of the way there:

  1. Export an NFS volume from a dedicated NFS server Pod
  2. Use extraVolumes and extraVolumeMounts to mount this shared volume at /home for each single-user server
  3. Use hub.extraConfig to subclass KubeSpawner and:
    • define a function that returns some necessary environment variables as a dictionary*
    • set KubeSpawner.environment to what’s returned by this function
  4. Start the single user container as root

* The necessary environment variables are:

  • NB_USER = <jhub-username>
  • NB_GROUP = "users" or NB_GROUP = "admin" (for JHub administrators)
  • NB_UID = <some unique uid>
  • NB_GID = "100" (users) or NB_GID = "200" (admin)

The last two steps are so that the user-switching section of start.sh, provided as part of jupyter-docker-stacks, runs appropriately and starts the JupyterLab server as the right user, UID, group, and GID, with a home directory at /home/$NB_USER.
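
Roughly, the values.yaml I have in mind looks something like this. It’s only a sketch: the PVC name (nfs-home-pvc) and the lookup_uid() helper are placeholders for whatever ends up providing the UIDs, which is exactly the part I haven’t figured out, and instead of setting KubeSpawner.environment directly it overrides get_env(), which amounts to the same thing:

singleuser:
  uid: 0                          # start the container as root so start.sh can switch users
  storage:
    type: none                    # skip the default per-user PVC
  extraVolumes:
    - name: shared-home
      persistentVolumeClaim:
        claimName: nfs-home-pvc   # placeholder: the PVC backed by the NFS export
  extraVolumeMounts:
    - name: shared-home
      mountPath: /home

hub:
  extraConfig:
    shared-home-env: |
      from kubespawner import KubeSpawner

      def lookup_uid(username):
          # placeholder: return a stable, unique UID for this user
          raise NotImplementedError

      class SharedHomeSpawner(KubeSpawner):
          def get_env(self):
              env = super().get_env()
              name = self.user.name
              env["NB_USER"] = name
              env["NB_UID"] = str(lookup_uid(name))
              env["NB_GROUP"] = "admin" if self.user.admin else "users"
              env["NB_GID"] = "200" if self.user.admin else "100"
              env["CHOWN_HOME"] = "yes"   # have start.sh fix ownership of the home directory
              return env

      c.JupyterHub.spawner_class = SharedHomeSpawner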

I’m sure there’s some unintentional hand-waving in the steps I’ve described.


The part that I’m having trouble figuring out is how to give users <some unique uid>, which will allow us to set home directory ownership to <uid>:200 (i.e. <user>:admin) and permissions to 770 (rwx for <user> and admins, inaccessible to everybody else).

My gut tells me that I should store user UIDs in the JupyterHub database (specifically in each user’s Spawner state). My function that returns KubeSpawner.environment would then have to either read this value from the database or, if it doesn’t exist, create the next available UID. I don’t know how to do this!

After reading through the docs and some of the source for JupyterHub and KubeSpawner, I’ve decided I should reach out for help, since I’m having trouble understanding how data gets to/from the database and the spawner instances.


To be explicit in what my questions are:

  1. First of all, based on my motivations, is having a shared /home directory an appropriate solution?
  2. Is this an appropriate implementation?
  3. If yes to the above, how can I interact with the JHub database to create/read unique user UIDs?

Thanks!

ana v e

Which authenticator are you using, and which is the identity provider? Ideally this sort of user-related information should come from the IdP. I am assuming your IdP does not store/provide a uid for the users. In this case, creating the uid in the authenticator and passing it to the spawner via auth_state can be a solution. You can save the created uid info in a simple text file or a SQLite DB in the post_auth_hook of the Authenticator.
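
Something like this, roughly (completely untested; the SQLite path and the 2000 starting UID are just examples, and auth_state needs JUPYTERHUB_CRYPT_KEY to be configured):

hub:
  extraConfig:
    uid-allocation: |
      import sqlite3

      DB = "/srv/jupyterhub/uids.sqlite"   # example path on the hub's persistent volume

      def allocate_uid(username):
          # return a stable UID for this username, creating the next free one if needed
          con = sqlite3.connect(DB)
          con.execute(
              "CREATE TABLE IF NOT EXISTS uids (name TEXT PRIMARY KEY, uid INTEGER)"
          )
          row = con.execute("SELECT uid FROM uids WHERE name=?", (username,)).fetchone()
          if row is None:
              max_uid = con.execute("SELECT COALESCE(MAX(uid), 1999) FROM uids").fetchone()[0]
              uid = max_uid + 1
              con.execute("INSERT INTO uids VALUES (?, ?)", (username, uid))
              con.commit()
          else:
              uid = row[0]
          con.close()
          return uid

      def post_auth_hook(authenticator, handler, authentication):
          # stash the uid in auth_state so the spawner can read it later
          auth_state = authentication.get("auth_state") or {}
          auth_state["uid"] = allocate_uid(authentication["name"])
          authentication["auth_state"] = auth_state
          return authentication

      def auth_state_hook(spawner, auth_state):
          # pass the uid on to the single-user container
          if auth_state:
              spawner.environment["NB_UID"] = str(auth_state["uid"])

      c.Authenticator.enable_auth_state = True
      c.Authenticator.post_auth_hook = post_auth_hook
      c.KubeSpawner.auth_state_hook = auth_state_hook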

2 Likes

Mahendra,

Thanks for the response!

We’re using GitHub OAuth.

Indeed, your suggestion seems simpler than my approach. I’ll be looking further into this and will share out when I have something of substance.

Best,

ave

1 Like

Hi Ava,

I do something similar to what I think you are asking. First I create a PV, then a PVC in the namespace. This is my PVC YAML:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fcba-su25-pvc
spec:
  storageClassName: fcba-su25-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
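
and, for reference, the PV it binds to looks roughly like this (the name, server address, and export path are placeholders, not my actual values):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fcba-su25-pv
spec:
  storageClassName: fcba-su25-sc
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.example.internal   # placeholder NFS server address
    path: /export/fcba-su25        # placeholder export path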

Then in the jupyterhub config I have:

  storage:
    type: static
    static:
      pvcName: 'fcba-su25-pvc'
      subPath: '{username}'

All users’ files are under one folder in my NFS-mounted storage on the cluster. For example:

/home/fcba-su25/user1
/home/fcba-su25/user2
/home/fcba-su25/user3

The folders are automatically created when users log in. I have used this method with GitHub auth and Google OAuth2.

This may or may not be the correct way to do it, but it works well for me.

Tony

3 Likes

Hey Tony,

Thanks for the feedback. This seems like a handy method and I’ll give it a try when I have a chance. I assume, however, that each of these directories is owned by the jovyan user, as is the nature of the JupyterLab containers. Would this not give users rwx permissions on each other’s files? The motivation behind giving users their own UID is to prevent this.

Best,

ana v e

Hi Ana,

When I started using this method, I had Yuvi (a Zero to JupyterHub developer) look at it and he did not see any security issues. Check this out if I open a terminal:

jovyan@jupyter-montereytony-40gmail-2ecom:~$ pwd
/home/jovyan
jovyan@jupyter-montereytony-40gmail-2ecom:~$ cd ..
jovyan@jupyter-montereytony-40gmail-2ecom:/home$ ls
jovyan
jovyan@jupyter-montereytony-40gmail-2ecom:/home$ cd /
jovyan@jupyter-montereytony-40gmail-2ecom:/$ ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
jovyan@jupyter-montereytony-40gmail-2ecom:/$ cd
jovyan@jupyter-montereytony-40gmail-2ecom:~$ pwd
/home/jovyan
jovyan@jupyter-montereytony-40gmail-2ecom:~$

I do not see anyone else’s folder.

Tony

1 Like

Tony,

Excellent! I think I understand what is going on here now. Only the subPath of the PVC is being mounted, not the full volume (with every user’s data) itself?

Thanks again for the input, I’ll be looking more into this to make sure I see the full picture.

ana v e

1 Like

Hey Ana,

I am not sure of the terminology, best to wait for experts to chime in. But for sure, the users can only see their own data.

Tony

1 Like

You are right. We use the same mechanism to manage submissions: The lecturer receives the “full volume” (i.e., a directory with sub-directories for each student), and the students receive the volume with a subPath (i.e., their username).
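
In values.yaml terms: the students get the storage.static block with subPath: '{username}' that Tony showed, and the lecturer’s server additionally gets the same claim mounted without a subPath, something like this sketch (the claim name and mount path are placeholders, and in our case the extra mount is attached only to the lecturer’s spawner):

singleuser:
  extraVolumes:
    - name: all-homes
      persistentVolumeClaim:
        claimName: course-home-pvc    # the same claim that backs the per-user subPath mounts
  extraVolumeMounts:
    - name: all-homes
      mountPath: /home/all-students   # the lecturer sees every student directory under here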

1 Like

Thanks for the help everybody!

I’ve successfully been able to get a similar setup working in our cluster. Next week I’ll reply with a more substantial description for others to have a look at and replicate if they come across this thread. In short, I’ve:

  • Created a cloud-backed PVC that acts as the shared storage
  • Created an NFS server Deployment and Service; I followed one of the archived Kubernetes examples, adapted to my needs
  • Created a PV of NFS type (rough sketch after this list)
  • In the same namespace as my JupyterHub, created a PVC that binds to the above PV
  • Edited my values.yaml to have every pod mount a subPath of the above PVC
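
As a preview, the NFS-type PV and the PVC that binds to it look roughly like this (names, size, and namespace are placeholders; the server field points at the NFS Service):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-home-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    # some clusters cannot resolve Service DNS names for NFS mounts,
    # in which case the Service's ClusterIP goes here instead
    server: nfs-server.jhub.svc.cluster.local
    path: /
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-home-nfs
  namespace: jhub
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # bind to the pre-created PV rather than a dynamic provisioner
  volumeName: shared-home-nfs
  resources:
    requests:
      storage: 100Gi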

There are still a few things I’ve yet to figure out:

  1. Tony’s example shows a storage request of 10Gi. Does this correspond to an enforced storage quota? I suppose this is easy enough to test myself!

  2. How does one give instructor/admin-only access to the full volume? In principle, it’s as easy as mounting it as an extraVolume without the subPath. Is this accomplished by modifying KubeSpawner to configure the extra volume (and volumeMount) when the authenticated user is an admin? Or is this yet another case of me over-complicating things?

Best,

ana v e

NFS quotas on Kubernetes depend on the NFS implementation; many implementations ignore the requested size and give you unlimited storage.

Adding a conditional extra volume for instructors should work. group_overrides might work; if not, you should be able to add the volume using modify_pod_hook.
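
A minimal sketch of the modify_pod_hook approach (untested; the claim name and mount path are placeholders, and older kubespawner releases import the models from kubernetes.client instead of kubernetes_asyncio.client):

hub:
  extraConfig:
    admin-full-home: |
      from kubernetes_asyncio.client.models import (
          V1PersistentVolumeClaimVolumeSource,
          V1Volume,
          V1VolumeMount,
      )

      def add_admin_volume(spawner, pod):
          # mount the whole shared claim (no subPath) for admin users only
          if spawner.user.admin:
              pod.spec.volumes = (pod.spec.volumes or []) + [
                  V1Volume(
                      name="all-homes",
                      persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                          claim_name="shared-home-nfs"   # placeholder claim name
                      ),
                  )
              ]
              pod.spec.containers[0].volume_mounts.append(
                  V1VolumeMount(name="all-homes", mount_path="/mnt/all-homes")
              )
          return pod

      c.KubeSpawner.modify_pod_hook = add_admin_volume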

1 Like