Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

1 vote
1 answer
162 views
Shell Variable Expansion in qsub command through drmaa
I am running a bulk job submission to SGE (Sun Grid Engine) using python drmaa bindings. For the bulk submission I am submitting a Python script that takes one argument and is command-line executable via a shebang. To properly parameterize the bulk submission I am setting environment variables, propagated to the Python script through the -v option. I am trying to do an indirect variable expansion in my zsh environment based on the $TASK_ID/$SGE_TASK_ID environment variable that SGE exports during job submission. As a minimal reproducible example of the indirect variable expansion, I am trying to do something like this, which works in my shell:
export foo1=2
export num=1

echo $(tmp=foo$num; echo ${(P)tmp})
which produces 2.

The example script, job_script.py:
#! /usr/bin/python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("input_path", type=os.path.realpath)

def main(input_path):
    # do stuff
    ...

if __name__ == "__main__":
    args = parser.parse_args()  # must be called, not referenced
    input_path = args.input_path
    main(input_path)
The example drmaa submission script:
import os

# add path to libs
os.environ["DMRAA_LIBRARY_PATH"] = "path to DMRAA shared object"
os.environ["SGE_ROOT"] = "path to SGE root directory"
import drmaa

input_dir_suffixes = [1, 2, 5, 7, 10, 11]

INPUT_BASE_DIR = "/home/mel/input_data"

base_qsub_options = {
    "P": "project",
    "q": "queue",
    "b": "y", # means is an executable
    "shell": "y", # start up shell
}
native_specification = " ".join(f"-{k} {v}" for k,v in base_qsub_options.items())
remote_command = "job_script.py"

num_task_ids = len(input_dir_suffixes)
task_start = 1
task_stop = num_task_ids + 1
task_step = 1
task_id_zip = zip(range(1, num_task_ids + 1), input_dir_suffixes) 
task_id_env_vars = {
   f"TASK_ID_{task_id}_SUFFIX": str(suffix) for task_id, suffix in task_id_zip 
}

io_task_id = r"$(tmp=SUFFIX_TASK_ID_$TASK_ID; echo ${(P)tmp})"
arg_task_id = r"$(tmp=SUFFIX_TASK_ID_$SGE_TASK_ID; echo ${(P)tmp})"

with drmaa.Session() as session:
    
    template = session.createJobTemplate()
    template.nativeSpecification = native_specification
    template.remoteCommand = remote_command
    template.jobEnvironment = task_id_env_vars
    template.outputPath = f":{INPUT_BASE_DIR}/output/{io_task_id}.o"
    template.errorPath = f":{INPUT_BASE_DIR}/error/{io_task_id}.e"

    args_list = [f"{INPUT_BASE_DIR}/data{arg_task_id}"]
    template.args = args_list
    session.runBulkJobs(template, task_start, task_stop - 1, task_step)
    session.deleteJobTemplate(template)
Apologies if there is a syntax error; I had to hand-copy this, as it's on a different system. With the submission done, if I do a qstat -j on the job number I get the following settings displayed:
sge_o_shell:         /usr/bin/zsh
stderr_path_list:    NONE::/home/mel/input_data/error_log/$(tmp=SUFFIX_TASK_ID_$TASK_ID; echo ${(P)tmp}).e
stdout_path_list:    NONE::/home/mel/input_data/output_log/$(tmp=SUFFIX_TASK_ID_$TASK_ID; echo ${(P)tmp}).o
job_args:            /home/mel/input_data/data$(tmp=SUFFIX_TASK_ID$SGE_TASK_ID; echo ${(P)tmp})
script_file:         job_script.py

env_list: 
SUFFIX_TASK_ID_1=1,SUFFIX_TASK_ID_2=2,SUFFIX_TASK_ID_3=5,SUFFIX_TASK_ID_4=7,SUFFIX_TASK_ID_5=10,SUFFIX_TASK_ID_6=11
The error and output logs do get created, but there is only a partial expansion. Examples:
$(tmp=SUFFIX_TASK_ID1; echo ${(P)tmp}).e
$(tmp=SUFFIX_TASK_ID1; echo ${(P)tmp}).o
If we cat the error logs we see Illegal variable name. Is what I am trying to do possible? I am presuming something somewhere is not invoking my zsh correctly.
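A minimal diagnostic sketch (my own addition, not from the question): ${(P)name} is zsh-only syntax, so one first step is to submit a trivial probe job and confirm which shell actually executes the command on the node:

#!/bin/sh
# probe.sh - report which interpreter SGE hands the command to
echo "argv0: $0"
ps -p $$ -o pid= -o comm=

If the .o file reports sh or csh rather than zsh, the ${(P)tmp} expansion can never work there; csh in particular rejects such syntax with exactly the "Illegal variable name" error seen in the logs.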
Melendowski (111 rep)
May 31, 2023, 10:09 PM • Last activity: Jun 4, 2023, 11:45 AM
0 votes
0 answers
32 views
Small TaskQueue shared on two computers
There are two computers with 12 physical cores each. I want to set up computers A and B such that:

- A will accept jobs (via ssh) and distribute them among A and B (more or less intelligently)
- if possible, I'd like to block 4 cores on each computer as a "personal requirement"

Jobs are expected to be either Python scripts or executables written in C++ (which can involve MPI code). I have read of Slurm and the Sun Grid Engine, but those seem a bit too powerful/complicated for this use case (I don't want to spend a week reading how to do it and troubleshooting). Is there an easier solution that satisfies the requirements?
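One lightweight possibility (my suggestion, not from the question) is GNU Parallel as a minimal cross-machine queue; the n/host form caps each host at n concurrent jobs, which leaves 4 of the 12 cores free:

# sketch: run a list of job scripts, at most 8 at a time per machine
# (hostname and job scripts are placeholders; assumes a shared filesystem,
# otherwise see parallel's --transferfile option)
parallel -S 8/localhost -S 8/computerB --wd . ::: ./job1.sh ./job2.sh ./job3.sh

Accepting jobs over ssh would still need a small wrapper, e.g. appending commands to a file that a long-running parallel invocation consumes.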
infinitezero (207 rep)
Mar 14, 2022, 03:12 PM • Last activity: Mar 14, 2022, 04:38 PM
0 votes
2 answers
1592 views
Syntax for number of cores in a Sun Grid Engine job file
I want to use the HPC of my university to qsub an array job of **3** tasks. Each task runs a Matlab code which uses a solver (MOSEK) that exploits multiple **threads** to solve an optimization problem. A parameter can control the number of threads we want the solver to use. The maximum number of threads allowed should never exceed the number of cores. Suppose I want the solver to use **4 threads**. Hence, I should ensure that each task is assigned to a machine with at least 4 cores free. How can I request that in the bash file? How should I count, in turn, the memory usage (i.e., should I declare the memory per core or the total memory)? At the moment this is my bash file:

#$ -S /bin/bash
#$ -l h_vmem=18G
#$ -l tmem=18G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y

#Run 3 tasks
#$ -t 1-3
#$ -N try

date
hostname

#Output the Task ID
echo "Task ID is $SGE_TASK_ID"

matlab -nodisplay -nodesktop -nojvm -nosplash -r "main_1; ID = $SGE_TASK_ID; f_1; exit"
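A hedged sketch of the usual approach (PE names are site-specific; qconf -spl lists what exists): cores are requested as slots through a parallel environment, and per-slot limits such as h_vmem then apply to each slot, so totals get divided by the slot count:

#$ -pe smp 4        # 4 slots on one host; "smp" is an assumed PE name
#$ -l h_vmem=4.5G   # h_vmem is commonly enforced per slot: 4 x 4.5G = 18G total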
Star (125 rep)
Sep 9, 2020, 01:50 PM • Last activity: Jun 26, 2021, 04:21 PM
0 votes
1 answer
807 views
Syntax for memory request in a Sun Grid Engine job file
I'm submitting a Matlab job in the cluster of my university using qsub after having logged in a node using ssh. The job runs out of memory. This is the advice I received to fix my issue: "**Possible solutions are run on a bigger machine or buy more RAM**." What does this mean in practice for my bash file? Which lines of the bash file control the size of the machine or the RAM? At the moment, in my bash file (see below) I request vmem and tmem. Is either of these RAM?

#$ -S /bin/bash
#$ -l h_vmem=18G
#$ -l tmem=18G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y

#Run 600 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 600
#$ -t 1-600
#$ -N try

date
hostname

#Output the Task ID
echo "Task ID is $SGE_TASK_ID"

matlab -nodisplay -nodesktop -nojvm -nosplash -r "main_1; ID = $SGE_TASK_ID; f_1; exit"
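For orientation, a hedged note (site configurations differ): h_vmem is usually a hard cap on the job's virtual memory, i.e. effectively the RAM request, and the job is killed when it exceeds it; tmem is a site-specific consumable typically kept equal to it. A sketch of the change the advice implies:

#$ -l h_vmem=32G   # raise the hard virtual-memory limit (32G is an assumed value)
#$ -l tmem=32G     # keep the site's memory consumable in step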
Star (125 rep)
Sep 9, 2020, 11:01 AM • Last activity: Sep 9, 2020, 02:24 PM
1 vote
0 answers
77 views
Determine slot ID for a running job
On a compute node with multiple slots, are the running jobs each explicitly assigned a slot ID as they start, and if so how can the user or submission script see it? To see the job ID, one can use the $JOB_ID environment variable within the submission script. What about the slot number? I looked for slot information using qstat -j but the information about the job does not contain any information about which of the slots the job is using. I was hoping there would be an integer variable related to the slot number. EDIT: in the general case, a job might be assigned multiple slots if it is parallelized, so in this case the list of slot IDs could be determined.
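As far as I know, standard Grid Engine does not expose a per-slot ID, but it does export placement information; a sketch of what a job script can inspect (standard SGE environment variables):

echo "job $JOB_ID was granted $NSLOTS slot(s)"
# for parallel jobs, the granted hosts and slot counts per host are listed here:
[ -n "$PE_HOSTFILE" ] && cat "$PE_HOSTFILE"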
feedMe (219 rep)
Feb 6, 2019, 11:47 AM • Last activity: Feb 7, 2019, 09:27 AM
0 votes
1 answer
240 views
Accessing Job ID during gridengine submission
I am using a bash script to submit jobs to gridengine. Is there a way for the script to know the job ID assigned to it by the scheduler?
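Two sketches, one for each side (the -terse flag exists in SGE-derived engines): inside the running job the scheduler exports $JOB_ID, and at submission time qsub -terse prints only the job ID so the submitting script can capture it:

jobid=$(qsub -terse job.sh)   # submitting side
echo "submitted as job $jobid"
# inside job.sh itself:
echo "I am job $JOB_ID"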
feedMe (219 rep)
Feb 7, 2019, 08:30 AM • Last activity: Feb 7, 2019, 09:20 AM
1 vote
2 answers
2206 views
Qsub to any node with more than n cores available
I have a program that is parallelized using MPI. It thinks that it is able to run across multiple nodes on our (CentOS 6.6)-based HPC grid, when in actual fact it only runs successfully on multiple cores *of the same compute node*. E.g. if I qsub a job to the grid asking for 20 cores, and Grid Engine decides to split it over two different nodes, the program fails. However, if there is a node with 20 cores available, and Grid Engine sends it all to that one, the program runs successfully. The qsub script contains the command

#$ -pe mpi 20

to select the number of cores. So at the moment, I do a qstat -f -u "*" to manually identify a compute node with 20 available cores, and submit to that node with

qsub -q general.q@node-X-X

What I am looking for is a way to tell Grid Engine to wait and only submit the job to a single compute node that has the required number of available cores. This will allow me to automate my job submission. I am considering writing a bash script to parse the qstat -f -u "*" output, but there must be a more elegant solution. I have looked through the qsub manual but am unable to find a suitable flag or command line argument. I'm not able to modify the program itself at this time and I am not a system administrator. Here is some information on the different software versions I have available:

ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)

Grid engine version is: OGS/GE 2011.11p1
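One possibility worth checking (an assumption about site configuration, not something given in the question): how slots are spread across hosts is decided by the parallel environment's allocation_rule, and a PE with allocation_rule $pe_slots packs all requested slots onto a single host:

qconf -sp mpi    # show the PE definition; look at the allocation_rule line
qconf -spl       # list all PEs; sites often provide one (e.g. "smp") with
                 # allocation_rule $pe_slots for single-node jobs

If such a PE exists, requesting #$ -pe smp 20 would make the scheduler hold the job until one node has 20 free slots.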
feedMe (219 rep)
May 15, 2017, 09:15 AM • Last activity: Oct 22, 2018, 09:52 AM
0 votes
1 answer
74 views
SSH connections difficulties
I'm using the RED HAT 5.9 OS on my grid, having 3 machines: 1 head node (known as ilmn-qm.ilmn) and 2 compute nodes (aka compute-00-00 and compute-00-01). **Problem is that I can't use SSH from either one of the compute nodes.** I tried:

1) SSH FROM and TO the head node works perfectly.
2) SSH from the head node to the compute nodes works.
3) Vice versa, SSH from the compute nodes to the head node works as well.
4) The head node is defined as gateway:

[root@compute-00-01 ~]# route
Kernel IP routing table
Destination   Gateway        Genmask         Flags Metric Ref  Use Iface
172.20.22.0   *              255.255.255.0   U     0      0      0 eth1
172.20.20.0   *              255.255.255.0   U     0      0      0 eth0
169.254.0.0   *              255.255.0.0     U     0      0      0 eth0
default       ilmn-qm.ilmn   0.0.0.0         UG    0      0      0 eth0

5) I've checked that IPv4 forwarding is enabled on the head node:

cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 1

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename
# Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

and yet any ssh attempt ends up with:

ssh: connect to host 132.68.107.69 port 22: Connection timed out

From the head node:

root@ilmn-qm ~ # ip a show
1: lo: mtu 16436 qdisc noqueue
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
   inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:b9 brd ff:ff:ff:ff:ff:ff
   inet 132.68.106.1/28 brd 132.68.106.15 scope global eth0
   inet6 fe80::f24d:a2ff:fe0b:2db9/64 scope link
      valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:bb brd ff:ff:ff:ff:ff:ff
   inet 172.20.20.5/24 brd 172.20.20.255 scope global eth1
   inet6 fe80::f24d:a2ff:fe0b:2dbb/64 scope link
      valid_lft forever preferred_lft forever
4: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:bd brd ff:ff:ff:ff:ff:ff
   inet 172.20.21.2/24 brd 172.20.21.255 scope global eth2
   inet6 fe80::f24d:a2ff:fe0b:2dbd/64 scope link
      valid_lft forever preferred_lft forever
5: eth3: mtu 1500 qdisc noop qlen 1000
   link/ether f0:4d:a2:0b:2d:bf brd ff:ff:ff:ff:ff:ff
6: sit0: mtu 1480 qdisc noop
   link/sit 0.0.0.0 brd 0.0.0.0

root@ilmn-qm ~ # ip route show
132.68.106.0/28 dev eth0 proto kernel scope link src 132.68.106.1
172.20.21.0/24 dev eth2 proto kernel scope link src 172.20.21.2
172.20.20.0/24 dev eth1 proto kernel scope link src 172.20.20.5
169.254.0.0/16 dev eth2 scope link
default via 132.68.106.14 dev eth0

From compute-00-00:

[root@compute-00-00 ~]# ip a show
1: lo: mtu 16436 qdisc noqueue
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
   inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:c2 brd ff:ff:ff:ff:ff:ff
   inet 172.20.20.6/24 brd 172.20.20.255 scope global eth0
   inet6 fe80::f24d:a2ff:fe0b:2dc2/64 scope link
      valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:c4 brd ff:ff:ff:ff:ff:ff
   inet 172.20.22.6/24 brd 172.20.22.255 scope global eth1
4: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:c6 brd ff:ff:ff:ff:ff:ff
5: eth3: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether f0:4d:a2:0b:2d:c8 brd ff:ff:ff:ff:ff:ff
6: sit0: mtu 1480 qdisc noop
   link/sit 0.0.0.0 brd 0.0.0.0

[root@compute-00-00 ~]# ip route show
172.20.22.0/24 dev eth1 proto kernel scope link src 172.20.22.6
172.20.20.0/24 dev eth0 proto kernel scope link src 172.20.20.6
169.254.0.0/16 dev eth1 scope link
default via 172.20.20.5 dev eth0

From compute-00-01:

[root@compute-00-01 ~]# ip a show
1: lo: mtu 16436 qdisc noqueue
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
   inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether 84:2b:2b:f9:9e:11 brd ff:ff:ff:ff:ff:ff
   inet 172.20.20.7/24 brd 172.20.20.255 scope global eth0
   inet6 fe80::862b:2bff:fef9:9e11/64 scope link
      valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether 84:2b:2b:f9:9e:13 brd ff:ff:ff:ff:ff:ff
   inet 172.20.22.7/24 brd 172.20.22.255 scope global eth1
4: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether 84:2b:2b:f9:9e:15 brd ff:ff:ff:ff:ff:ff
5: eth3: mtu 1500 qdisc pfifo_fast qlen 1000
   link/ether 84:2b:2b:f9:9e:17 brd ff:ff:ff:ff:ff:ff
6: sit0: mtu 1480 qdisc noop
   link/sit 0.0.0.0 brd 0.0.0.0

[root@compute-00-01 ~]# ip route show
172.20.22.0/24 dev eth1 proto kernel scope link src 172.20.22.7
172.20.20.0/24 dev eth0 proto kernel scope link src 172.20.20.7
169.254.0.0/16 dev eth0 scope link
default via 172.20.20.5 dev eth0
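Since ip_forward only routes packets and the compute nodes sit on private 172.20.x addresses, replies from public hosts such as 132.68.x have no route back to them unless the head node also performs NAT. A hedged sketch of the commonly missing piece (to be verified against the actual firewall setup):

# on the head node: masquerade compute-node traffic leaving the public interface
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
service iptables save   # persist across reboots on RHEL 5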
hamaor (3 rep)
Feb 6, 2018, 11:52 AM • Last activity: Feb 16, 2018, 11:50 AM
0 votes
1 answer
622 views
Grid engine/cluster management and job scheduler for Debian/ubuntu
I need to perform a large amount of computation on something resembling a cluster; the hardware and the OS are identical (the OS is Ubuntu) but no central management software or grid engine is installed. Web searches turn up mostly outdated or proprietary software. I hope my question is not too general, but: what are the cluster management and job scheduling options for Debian and its derivatives? For the general management of the cluster I use cssh, but this approach is not very efficient when it comes to job scheduling and monitoring. I have experience using the venerable Sun Grid Engine, RIP. Thanks for reading this!
lazaraza (3 rep)
Jul 15, 2017, 09:18 AM • Last activity: Aug 5, 2017, 09:44 PM
1 vote
1 answer
173 views
Stack screen output into columns to make use of screen width and avoid scrolling
I often use gridengine's qstat command on our HPC cluster, but since I have many jobs running on the cluster the output is too long to fit on my screen, and I end up doing a lot of scrolling to see the upper section of the output. My terminal has enough space for two columns, so it would be nice if the output could flow into columns and be shown side by side.

**Example using simple data file:** Obviously this should be general to any screen output, so to illustrate, here is a simpler example. My file data1.txt contains 100 lines of "This is a test".

>> cat data1.txt
This is a test
This is a test
This is a test
This is a test
(etc. until 100th line)
>>

**Desired output:**

>> cat data1.txt | something | something_else -n 2
This is a test    This is a test
This is a test    This is a test
This is a test    This is a test
This is a test    This is a test
(etc. until 50 rows)

Of course, it would be nice to specify any arbitrary number of columns. The only similar question/answer that I found was this one, but I'm hoping there is a simpler way to do this in one line using pipes and no script files.
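A one-line sketch with a standard tool: pr reflows stdin into a fixed number of columns (filling down, then across), with -t suppressing its page headers:

cat data1.txt | pr -2 -t -w "$COLUMNS"   # 2 columns; assumes $COLUMNS is exported
qstat | pr -3 -t -w 240                  # any arbitrary column count and width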
feedMe (219 rep)
Feb 22, 2017, 04:12 PM
1 vote
1 answer
268 views
Generalising Grid Engine qsub job file for multiple programs and input file names
I am using Grid Engine on a Linux cluster. I am running many jobs with different programs and different input files. I don't want to create multiple specific job scripts for each pair of program and input file. Instead I want to be able to specify the program name and the input file on the qsub line. Therefore I can use

qsub job.sh <program> <input file>

where job.sh takes two arguments. This works fine. But there is another twist: my programs are located in a very, very long directory which I don't want to type every time I submit a job - so aliases are an obvious choice. So I want to do something like

qsub job.sh <program alias> <input file>

I initially set the alias in my .bashrc but was getting the error:

: command not found

So I set the alias in submit.sh. But I am getting the same error. Thoughts on how I can get the command qsub job.sh $1 $2 to accept aliases also?
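Aliases are only expanded in interactive shells, so neither qsub nor the job script will see them; a hedged sketch of the usual substitute, an exported variable (names are placeholders):

# in ~/.bashrc:
export PROGDIR=/the/very/very/long/path/to/programs
# then submit with:
qsub job.sh "$PROGDIR/myprog" input.dat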
cyuut (11 rep)
Jan 19, 2017, 07:11 PM • Last activity: Jan 22, 2017, 05:04 PM
0 votes
1 answer
182 views
Grid Engine for program that needs X11 but doesn't require user input
I have a bash script that calls an executable (some commercial software) in "batch mode". On the command line, if X is available the program runs to completion and then quits, but if not, the program hangs. I think this because:

- It works over VNC
- It doesn't work over ssh if ssh -X has not been specified
- It works over ssh if -X has been specified
- It doesn't work with Grid Engine. When I qsub the script it just stays on status 'r' indefinitely and I cannot see any output in the .sh.o.XXX or .sh.e.XXX files

The upshot is, I want to submit this script to Grid Engine, but I can't! The program never asks for user input when in the so-called "batch mode". Is there some way to provide an X environment in Grid Engine, just to allow the program to complete on its own? I guess one problem is that, since I cannot see the source code, it is difficult to see exactly what the program is asking for.
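A common workaround is a virtual framebuffer, which satisfies programs that merely open a display without showing anything; a sketch assuming Xvfb is installed on the compute nodes (the executable name is a placeholder):

# inside the submitted job script:
xvfb-run -a ./commercial_program -batch input.cfg

xvfb-run starts a throwaway X server, points DISPLAY at it, and tears it down when the program exits.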
feedMe (219 rep)
Jun 15, 2016, 08:03 AM • Last activity: Dec 15, 2016, 10:30 AM
0 votes
0 answers
345 views
submitting a job array script on SGE
I am trying to make a job array script to do a particular task for several files. Let's assume as a start only 2 fastq files, named abc.fastq and def.fastq.

#!/bin/bash
file=$(ls -1 *.fastq | tail -n +${SGE_TASK_ID} | head -1)
filename=${file%.fastq}
awk 'NR % 2 == 0{print substr($1,7,100)};NR % 2 ==1' $file > ${filename}_BR.fastq

I submitted the script as:

qsub -t 1-2:1 -cwd -j y -N array_job ./jobarray.sh

But only 1 file is processed, which is abc.fastq. What happened to the def.fastq file? I provided the -t parameter with 2 jobs, and SGE_TASK_ID is used in my script. Hope to hear from you guys soon.
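A small diagnostic sketch (my addition): log the task ID and the selected file into each task's output, and pick the Nth file with sed, which reads slightly more directly than tail | head:

echo "task $SGE_TASK_ID running on $(hostname)"
file=$(ls -1 *.fastq | sed -n "${SGE_TASK_ID}p")
echo "selected file: $file"

If the second task's .o file never appears at all, the task was not scheduled, which points at the qsub call rather than the script.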
user3138373 (2589 rep)
Sep 2, 2015, 04:36 PM • Last activity: Aug 25, 2016, 01:39 AM
1 vote
2 answers
6913 views
How do I check if a job is running on a cluster using the job name (CentOS)
I am running a bash script to submit multiple jobs. A job should only be submitted if it is not already running or queued. I want to use an if statement inside my bash script to simply check if "job123" is already running or in the queue. I have tried different options with qstat and qstatus but I can't seem to check by job name. How can this information be retrieved? Also, these outputs are just strings; I did not have any luck using grep either, but I think there must be a specific command.
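A sketch of the name-based check (assuming a Grid Engine whose qstat -j accepts a job name and returns non-zero when no such job exists; note that plain qstat truncates long names, which is why grepping it often fails):

if qstat -j "job123" >/dev/null 2>&1; then
    echo "job123 already queued or running - skipping"
else
    qsub -N job123 job123.sh
fi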
Herman Toothrot (353 rep)
Aug 1, 2016, 03:00 PM • Last activity: Aug 2, 2016, 10:56 AM
0 votes
1 answer
2177 views
How to tell the memory usage of each background job
I am working on SGE, and I am logged on to it. I use qlogin -l mf=30G so as to get onto one compute node. I am running 2 jobs on this compute node in the background:

[1] 4408 Running /apps1/sratoolkit/2.3.5-2/bin/fastq-dump --split-files SRR1660.sra &
[2] 4415 Running /apps1/sratoolkit/2.3.5-2/bin/fastq-dump --split-files SRR1661.sra &

I want to know how much memory each of my background jobs is consuming out of the 30G I assigned in the beginning. Is there a command to find that out? Thanks
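A sketch using ps (standard options; RSS is resident memory actually in RAM, VSZ the virtual size, both in KiB):

ps -o pid,rss,vsz,comm -p 4408,4415
# or watch the values update every 5 seconds:
watch -n 5 'ps -o pid,rss,vsz,comm -p 4408,4415'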
user3138373 (2589 rep)
Mar 24, 2015, 04:32 PM • Last activity: Mar 24, 2015, 04:41 PM
3 votes
1 answer
1266 views
stdout redirect. sh: resource temporarily unavailable
I have large batches of bash processes. Each bash script invokes executables which have their stdout redirected to distinct log files. About 5% of the runs end up with:

_sh: [name of log]: Resource temporarily unavailable_

I tried to reduce the number of jobs running in parallel, but the error still persisted on some of the bash scripts.

### Additional info: ###

- Ubuntu 14.04 LTS running on a VM using ESXi
- Happens on a new partition, allocated with gparted and LVM (new logical volume consisting of the entire partition)
- The LV is exported using nfs-kernel-server
- The LV is also shared to Windows using Samba
- The LV is formatted using ext4
- I have admin rights on this machine

### More detailed info ###

- Everything is run in a cluster, using Sun-Grid-Engine
- There are 4 virtual machines: m1, m2, m3, m4
- m1 runs sge master, sge exec, and ldap server
- m2, m3, m4 run sge exec
- m3 runs nfs-kernel-server, exporting a _home_ folder sitting in a logical volume (using LVM) that uses a partition on a local disk, to m1, m2, m4
- m3 has a soft link to the _home_ folder
- m1, m2, m4 mount the _home_ folder through fstab, so all machines end up pointing to the same _home_ folder
- m3, m2, m4 run ldap clients, connecting to m1
- All jobs are submitted to the cluster through m1 (configured as a submission host)
- Jobs fail exclusively on m3 (which exports the disk). Most of the jobs on m3 pass, though. Failures are random, but consistently on m3 alone.
- m3 also shares the _home_ via Samba to Windows clients

Any help would be greatly appreciated :) (how to debug, which logs are relevant, how to get more info out of the system, etc...) Thank you in advance!
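"Resource temporarily unavailable" is errno EAGAIN; a sketch for pinpointing which syscall returns it (my addition, standard strace usage):

# run one failing batch script under strace, following children, and keep the log
strace -f -o /tmp/batch.trace sh ./batch_script.sh
grep -n EAGAIN /tmp/batch.trace

Seeing whether the EAGAIN comes from open, write, or fork narrows the cause down to file-handle limits, NFS behaviour, or process limits respectively.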
lev haikin (131 rep)
Dec 31, 2014, 07:47 AM • Last activity: Jan 5, 2015, 09:56 AM
2 votes
2 answers
2093 views
Remotely compile and run program using ssh and screen
I'm trying to compile and run a program remotely. However, I'd like to do this within a screen, and I'd also like to run this using grid engine on another node after I ssh. Currently I have:

ssh me@server screen -R session 'qlogin; cd path; mvn options program'

This basically works, but I get a message saying that I must be connected to a terminal. I read about this and added the -t option to ssh. With that, my command breaks: it seems like I ssh over, screen starts, then it doesn't know about the "mvn" command and terminates my session. I'm wondering why this is happening and how to correctly launch jobs from my local machine, within a screen, on a remote node while using grid engine.
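Part of the problem is that qlogin opens its own interactive session, so cd and mvn run only after it exits, back on the login node. A hedged sketch using qrsh (Grid Engine's non-interactive remote shell) inside a detached screen (paths and options are placeholders):

ssh me@server "screen -dmS session qrsh 'cd path && mvn options program'"
# reattach later to watch it:
ssh -t me@server screen -r session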
akobre01 (121 rep)
Aug 8, 2013, 10:35 PM • Last activity: Sep 18, 2014, 04:20 AM
2 votes
1 answer
10558 views
/usr/bin/xterm Xt error: Can't open display: /usr/bin/xterm: DISPLAY is not set?
I'm trying to submit a job to a school server (HPC) with:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -o ./out_$JOB_ID.txt
#$ -e ./err_$JOB_ID.txt
#$ -notify
#$ -pe orte 1

date
pwd

##################################
RESULT_DIR=~/Results
SCRIPT_FILE=sample_job
##################################

. /etc/profile
. /etc/bashrc

module load packages/comsol/4.4
module load packages/matlab/r2012b

comsol server matlab "sample_job, exit" -nodesktop -mlnosplash

/bin/uname -a
mkdir $RESULT_DIR/$name
cp *.csv $RESULT_DIR/$name

The job aborts saying:

Sun Jun 8 14:20:21 EDT 2014
COMSOL 4.4 (Build: 150) started listening on port 2036
Use the console command 'close' to exit the program
/usr/bin/xterm Xt error: Can't open display:
/usr/bin/xterm: DISPLAY is not set
Program_did_not_exit_normally
Exception:
    com.comsol.util.exceptions.FlException: Program did not exit normally
Messages:
    Program did not exit normally
Stack trace:
    at com.comsol.mli.application.a.a(Unknown Source)
    at com.comsol.mli.application.MatlabApplication.doStart(Unknown Source)
    at com.comsol.util.application.ComsolApplication.doStart(Unknown Source)
    at com.comsol.util.application.ComsolApplication.doRun(Unknown Source)
    at com.comsol.bridge.Bridge$2.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
ERROR: Could not start COMSOL Application.
See log file: /home/.comsol/v44/logs/server2.log
java.lang.IllegalStateException: Shutdown in progress
    at java.lang.ApplicationShutdownHooks.add(Unknown Source)
    at java.lang.Runtime.addShutdownHook(Unknown Source)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:699)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:322)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:451)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at com.comsol.util.application.ServerApplication.a(Unknown Source)
    at com.comsol.util.application.ServerApplication.a(Unknown Source)
    at com.comsol.util.application.ServerApplication.a(Unknown Source)
    at com.comsol.util.application.ServerApplication.main(Unknown Source)

What might be the reason, and how should I fix it?
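The trace shows COMSOL trying to launch MATLAB through /usr/bin/xterm, which needs an X display that batch nodes don't have. A hedged sketch of one workaround, giving the job a dummy display via Xvfb (assuming Xvfb is installed on the compute nodes):

# before the comsol line in the job script:
Xvfb :99 -screen 0 1024x768x16 &
export DISPLAY=:99
comsol server matlab "sample_job, exit" -nodesktop -mlnosplash
kill %1   # stop the dummy X server afterwards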
Sibbs Gambling (1746 rep)
Jun 8, 2014, 06:33 PM • Last activity: Jun 8, 2014, 09:58 PM
3 votes
1 answer
2112 views
State meanings of compute nodes
I submitted a job to a Linux cluster which uses the SGE job scheduler. The job state is qw for a long time, so I inspected the states of the compute nodes using qstat -f. I found that many nodes were labelled with states "d", "adu" and "E". I wonder what these states mean. The Grid Engine man pages list these states for filtering queue instances ( -qs {a|c|d|o|s|u|A|C|D|E|S} ), but give no further explanation of their meaning. What do the states mean?
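For reference, the scheduler can explain some of these itself; a sketch (the -explain option exists in SGE-derived engines, and the state letters combine, e.g. "adu" is three states at once):

qstat -f -explain E   # annotate Error-state queues with the reason; also accepts a, A, c
qstat -j 12345        # placeholder job id: the "scheduling info" section shows why it sits in qw
# common state letters (standard SGE; worth verifying against the local man page):
#   a = load-threshold alarm    d = disabled by an admin (qmod -d)
#   u = host unknown/unreachable (execd not reporting)
#   E = error state             s = suspended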
Dejian (828 rep)
May 16, 2014, 01:53 PM • Last activity: May 16, 2014, 02:10 PM
2 votes
1 answer
8156 views
What is the difference between qsub and ./
Can anyone tell me the difference between the following ways of submitting a script:

$ qsub script_name.sh

and

./script_name.sh

What are the differences between the above two ways of submitting a job to a cluster? Also, how come I sometimes need to type

$ chmod +x script_name.sh

...before I can type ./script_name.sh to submit a job? How come sometimes I just need to type qsub script_name.sh? Sorry, I'm not very familiar with Unix.
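A sketch contrasting the two (standard Unix/SGE behaviour): direct execution runs the script immediately on the current machine and needs the execute bit, while qsub hands the file to the scheduler, which runs it later on a compute node, so no execute bit is required:

chmod +x script_name.sh   # needed once before direct execution
./script_name.sh          # runs now, here, as an ordinary program
qsub script_name.sh       # queued; Grid Engine runs it on some node later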
john_w (153 rep)
Feb 25, 2014, 11:00 PM • Last activity: Feb 25, 2014, 11:18 PM