How to wait for child processes to finish and then restart the systemctl service to avoid downtime?
I want to migrate a systemd service from the system slice to the ABC slice. The service also runs child processes while handling communication from the data plane to the control plane. I am looking for a way to restart the service safely, with no or minimal downtime, so that communication between the control plane and the data plane is not interrupted.
To achieve this, we need to wait for all child processes to complete before changing the slice from system to ABC. If we restart without waiting, the running child processes remain in the system slice and the service's cgroup does not reflect the ABC slice. If we restart only after all child processes have completed, the service comes up in the ABC slice.
Below are the journalctl logs we get when restarting the service to apply the ABC slice while child processes are still running.
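The waiting step described above could look roughly like the following. This is a minimal sketch, not a tested solution: it assumes the unit's processes can be listed from `cgroup.procs` under a cgroup v2 mount at `/sys/fs/cgroup` (on cgroup v1 the root would typically be `/sys/fs/cgroup/systemd`), and the helper names `parse_show_value`, `read_cgroup_pids`, `wait_until_only_main` and `restart_when_idle` are made up for illustration.

```python
import subprocess
import time


def parse_show_value(output, key):
    """Extract VALUE from the KEY=VALUE line printed by `systemctl show -p KEY`."""
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name == key:
            return value
    raise KeyError(key)


def unit_property(unit, key):
    """Ask systemd for a single property of the unit, e.g. MainPID or ControlGroup."""
    result = subprocess.run(
        ["systemctl", "show", "-p", key, unit],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    )
    return parse_show_value(result.stdout, key)


def read_cgroup_pids(cgroup, root="/sys/fs/cgroup"):
    """Return the set of PIDs currently in the unit's cgroup.

    Assumption: cgroup v2 layout; adjust `root` for cgroup v1 hosts.
    """
    with open(root + cgroup + "/cgroup.procs") as fh:
        return {int(token) for token in fh.read().split()}


def wait_until_only_main(get_pids, main_pid, timeout=600.0, interval=2.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the cgroup holds no PID other than main_pid.

    Returns True once only the main process is left, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        extra = get_pids() - {main_pid}
        if not extra:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def restart_when_idle(unit):
    """Restart the unit only after its child processes have exited."""
    main_pid = int(unit_property(unit, "MainPID"))
    cgroup = unit_property(unit, "ControlGroup")
    if wait_until_only_main(lambda: read_cgroup_pids(cgroup), main_pid):
        subprocess.run(["systemctl", "restart", unit], check=True)
        return True
    return False
```

It would be called as `restart_when_idle("pqr-agent.service")`. Note that a new child process could still be spawned between the last poll and the restart, so this only narrows the window and does not close it.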
systemd: Stopped pqr-agent Service.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: pqr-agent.service: Found left-over process 8309 (cmd_execute) in control group while starting unit. Ignoring.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: pqr-agent.service: Found left-over process 8314 (pqr-upgrade) in control group while starting unit. Ignoring.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: pqr-agent.service: Found left-over process 8331 (python3) in control group while starting unit. Ignoring.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: pqr-agent.service: Found left-over process 8334 (quota-governanc) in control group while starting unit. Ignoring.
Jan 03 20:21:45 ip-10-10-80-7.ap-south-1.pqr.internal systemd: This usually indicates unclean termination of a previous run, or service implementation deficiencies
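To see at a glance which processes were left behind, the "Found left-over process" lines above can be parsed into (unit, PID, command) tuples. This is a small sketch that assumes the message wording shown in the log; the name `leftover_processes` is hypothetical.

```python
import re

# Matches e.g. "pqr-agent.service: Found left-over process 8309 (cmd_execute) in control group"
LEFTOVER = re.compile(
    r"(?P<unit>\S+): Found left-over process (?P<pid>\d+) "
    r"\((?P<comm>[^)]*)\) in control group"
)


def leftover_processes(journal_text):
    """Return (unit, pid, command) for every left-over process line in the text."""
    return [
        (m.group("unit"), int(m.group("pid")), m.group("comm"))
        for m in LEFTOVER.finditer(journal_text)
    ]
```

The input could come from something like `journalctl -u pqr-agent.service`, and the resulting PIDs are the ones that keep the old cgroup populated.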
Options considered so far:
1. systemd does not support an ExecStopPre directive. If it were supported, we would have used it to wait for all the child processes and then restarted the service.
2. ExecStop would cause downtime. The service first moves to the deactivating (stopping) state, in which it cannot receive further commands, so communication between the control plane and the data plane is broken for as long as it waits for the child processes to complete.
3. With ExecStartPost we cannot track the child processes. A restart is a stop followed by a start, so the MainPID changes and the new MainPID has no child processes. The restart would take effect only if we wait until all the child processes in the old cgroup have completed.
The restart should apply the ABC slice with minimal downtime of the service.
Asked by samay varshney
(1 rep)
Feb 10, 2025, 07:11 AM