
`(env) ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_ting-1-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s bash -login -c -i 'true & source ~/.bashrc & export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore & (docker pull rayproject/ray-ml:latest-gpu)'`

`latest-gpu: Pulling from rayproject/ray-ml`

`c4623a7ed5da: Extracting 3.453GB/3.453GB`

Running `docker pull rayproject/ray-ml:latest-gpu`

Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_ting-1-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s bash -login -c -i 'true & source ~/.bashrc & export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore & (docker pull rayproject/ray-ml:latest-gpu)'`

Note: I have altered the project name in my copy/paste to protect myself.

(On the latest GPU VMs I am working with, it does not always fail at this part.) It’s late, but tomorrow I will try to isolate it as requested. When I re-ran it (forgetting to add the -vvv to the ssh), it succeeded.

I ran it just now and it failed on the docker pull.

`Status: Downloaded newer image for rayproject/ray-ml:latest-gpu`

Why does Ray repeatedly die with messages like this? I’ve been using Ray for a few months and have gotten it to work, but the amount of time spent setting up and tearing down VMs is ridiculous because of issues like this.

Could you provide a little background on why these settings fix the problem?

`ServerAliveInterval 120` means "If there is no activity for 120 seconds on the connection, send a request to the server, requesting a response." I believe this is useful because some servers are configured to drop inactive ssh sessions. This keeps the connection active so that doesn't happen.

`TCPKeepAlive no` means "do not send keepalive messages to the server". If the converse, `TCPKeepAlive yes`, were set, then the client sends keepalive messages to the server and requires a response in order to maintain its end of the connection. This will detect if the server goes down, reboots, etc. The trouble with this is that if the connection between the client and server is broken for a brief period of time, the keepalive messages fail, and the client ends the connection with "broken pipe". Setting `TCPKeepAlive no` tells the client to just assume the connection is still good until proven otherwise by a user request, meaning that temporary connection breakages while your ssh terminal is sitting idle in the background won't kill the connection.
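
To make the timing concrete, here is a rough sketch of how long ssh waits before giving up on an unresponsive server. Per ssh_config(5), ssh disconnects after roughly `ServerAliveInterval * ServerAliveCountMax` seconds without a reply, and `ServerAliveCountMax` defaults to 3. The function name `dead_peer_timeout` is made up for illustration; the first call uses the values from the Ray-generated command (`ServerAliveInterval=5`, `ServerAliveCountMax=3`), the second uses the `ServerAliveInterval 120` setting with the default count.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate seconds of server silence that ssh
# tolerates before it terminates the session.
dead_peer_timeout() {
    local interval=$1
    local count_max=${2:-3}   # 3 is the OpenSSH default for ServerAliveCountMax
    echo $(( interval * count_max ))
}

# Values taken from the Ray-generated ssh command.
ray_timeout=$(dead_peer_timeout 5 3)

# ServerAliveInterval 120 with the default ServerAliveCountMax.
relaxed_timeout=$(dead_peer_timeout 120)

echo "ray: ${ray_timeout}s, relaxed: ${relaxed_timeout}s"
```

If this reading is right, the Ray command drops the session after about 15 seconds of silence, so a short network stall during a long `docker pull` could be enough to kill it, while the 120-second setting tolerates about 6 minutes. Either set of options can be passed with `-o` on the command line or placed in `~/.ssh/config`.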
