
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

40 votes
8 answers
132444 views
Parallelise rsync using GNU Parallel
I have been using an rsync script to synchronize data at one host with the data at another host. The data consists of numerous small files that add up to almost 1.2TB. In order to sync those files, I have been using the rsync command as follows:

```
rsync -avzm --stats --human-readable --include-from proj.lst /data/projects REMOTEHOST:/data/
```

The contents of proj.lst are as follows:

```
+ proj1
+ proj1/*
+ proj1/*/*
+ proj1/*/*/*.tar
+ proj1/*/*/*.pdf
+ proj2
+ proj2/*
+ proj2/*/*
+ proj2/*/*/*.tar
+ proj2/*/*/*.pdf
...
...
...
- *
```

As a test, I picked two of those projects (8.5GB of data) and executed the command above. Being a sequential process, it took 14 minutes 58 seconds to complete. So, for 1.2TB of data, it would take several hours. If I could run multiple rsync processes in parallel (using &, xargs or parallel), it would save me time. I tried the command below with parallel (after cd-ing to the source directory) and it took 12 minutes 37 seconds to execute:

```
parallel --will-cite -j 5 rsync -avzm --stats --human-readable {} REMOTEHOST:/data/ ::: .
```

This should have taken up to 5 times less time, but it didn't. I think I'm going wrong somewhere. How can I run multiple rsync processes in order to reduce the execution time?
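A minimal sketch of one commonly suggested direction, not taken from the thread, and it ignores the include-list filtering from proj.lst: give parallel the top-level project directories so that each job runs its own rsync for one subtree.

```bash
# assumes project names contain no whitespace; --relative recreates the
# projects/projN path under REMOTEHOST:/data/ like the original command does
cd /data
ls -d projects/proj* | parallel -j 5 rsync -avzm --stats --human-readable --relative {} REMOTEHOST:/data/
```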
Mandar Shinde (3374 rep)
Mar 13, 2015, 06:51 AM • Last activity: Jul 26, 2025, 08:00 PM
5 votes
2 answers
2032 views
Running GNU Parallel on 2 or more nodes with Slurm scheduler
I am trying to distribute independent runs of a process using GNU Parallel on an HPC cluster that uses the Slurm workload manager. Briefly, here is the data analysis setup:

Script #1: myCommands

```
./myscript --input infile.txt --setting 1 --output out1
./myscript --input infile.txt --setting 2 --output out2
./myscript --input infile.txt --setting 3 --output out3
./myscript --input infile.txt --setting 4 --output out4
```

Script #2: run.sh

```
#SBATCH --time=00:02:00
#SBATCH --nodes=2
#SBATCH --cpus-per-task=2
cat myCommands | parallel -j 4
```

This works, however it only uses one node. The two cores on that node are split into 4 threads to make room for the 4 jobs requested by parallel. That is not desirable. My searching indicates I will need a nodefile and an sshloginfile to accomplish this, but I see no examples online that work with Slurm, only with the PBS system. How can I make the script (1) use both nodes, and (2) not split cores into threads?
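A minimal sketch of one commonly used pattern, assuming passwordless SSH between the allocated nodes and a shared filesystem: turn the Slurm node list into a file and point parallel at it with --sshloginfile, with -j giving the number of jobs per node.

```bash
#!/bin/bash
#SBATCH --time=00:02:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

# expand the compact Slurm node list (e.g. node[01-02]) into one hostname per line
scontrol show hostnames "$SLURM_JOB_NODELIST" > nodefile.$SLURM_JOB_ID

# run 2 jobs per listed node; --workdir keeps remote jobs in the submission directory
cat myCommands | parallel --sshloginfile nodefile.$SLURM_JOB_ID -j 2 --workdir "$PWD"

rm nodefile.$SLURM_JOB_ID
```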
cryptic0 (191 rep)
Jan 30, 2017, 06:24 PM • Last activity: Jun 15, 2025, 05:03 PM
1 vote
1 answer
57 views
Expected behaviour of GNU parallel --memfree <size> and --memsuspend <size> when size much bigger than RAM
While experimenting with GNU parallel I found that the following cases all hang with decreasing CPU usage on a Fedora 41 VM with 8GB RAM. Is this expected behaviour?
```
parallel --halt now,fail=1 --timeout 2s --memfree 30G echo ::: a b c
parallel --halt now,fail=1 --timeout 2s --memsuspend 30G echo ::: a b c
parallel --timeout 2s --memsuspend 30G echo ::: a b c
parallel --timeout 2s --memfree 30G echo ::: a b c
```
I'd have expected at least the first or second command to actually time out and exit with error code 3. The [strace log](https://paste.centos.org/view/5d24131f) shows that it's basically spinning, continuously reading /proc/meminfo with an awk subprocess, which is in line with the expected behaviour (the memfreescript), even though polling every second seems pretty wasteful.

**Why does it allow --memfree and --memsuspend values much greater than physical RAM?**

Could someone also clarify this section in the manual for --memfree. Does it mean the youngest *running* job would be killed?

> If the jobs take up very different amount of RAM, GNU parallel will only start as many as there is memory for. If less than size bytes are free, no more jobs will be started. If less than 50% size bytes are free, the youngest job will be killed (as per --term-seq), and put back on the queue to be run later.

The kill_youngster_if_not_enough_mem code is relevant, but it isn't something I quite grasp in relation to the full GNU parallel codebase.
```
$ parallel --version
GNU parallel 20241222

$ uname -a
Linux host 6.11.4-301.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Oct 20 15:02:33 UTC 2024 x86_64 GNU/Linux
```
Somniar (113 rep)
May 31, 2025, 05:47 PM • Last activity: May 31, 2025, 09:36 PM
275 votes
10 answers
347288 views
Parallelize a Bash FOR Loop
I have been trying to parallelize the following script, specifically each of the three FOR loop instances, using GNU Parallel, but haven't been able to. The 4 commands contained within the FOR loop run in series, each loop taking around 10 minutes.

```
#!/bin/bash

kar='KAR5'
runList='run2 run3 run4'
mkdir normFunc

for run in $runList
do
  fsl5.0-flirt -in $kar"deformed.nii.gz" -ref normtemp.nii.gz -omat $run".norm1.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12
  fsl5.0-flirt -in $run".poststats.nii.gz" -ref $kar"deformed.nii.gz" -omat $run".norm2.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12
  fsl5.0-convert_xfm -concat $run".norm1.mat" -omat $run".norm.mat" $run".norm2.mat"
  fsl5.0-flirt -in $run".poststats.nii.gz" -ref normtemp.nii.gz -out $PWD/normFunc/$run".norm.nii.gz" -applyxfm -init $run".norm.mat" -interp trilinear

  rm -f *.mat
done
```
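A minimal sketch of one way to parallelize this, assuming the three runs are independent: wrap the loop body in a function, export it, and let GNU Parallel run one instance per run name. The blanket rm -f *.mat from the original would race between concurrent runs, so the sketch deletes only that run's .mat files.

```bash
#!/bin/bash
kar='KAR5'
export kar
mkdir -p normFunc

process_run() {
  run=$1
  fsl5.0-flirt -in "${kar}deformed.nii.gz" -ref normtemp.nii.gz -omat "${run}.norm1.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12
  fsl5.0-flirt -in "${run}.poststats.nii.gz" -ref "${kar}deformed.nii.gz" -omat "${run}.norm2.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12
  fsl5.0-convert_xfm -concat "${run}.norm1.mat" -omat "${run}.norm.mat" "${run}.norm2.mat"
  fsl5.0-flirt -in "${run}.poststats.nii.gz" -ref normtemp.nii.gz -out "$PWD/normFunc/${run}.norm.nii.gz" -applyxfm -init "${run}.norm.mat" -interp trilinear
  rm -f "${run}".*.mat   # delete only this run's matrices, not the other runs'
}
export -f process_run

# one job per run, up to 3 at a time
parallel -j 3 process_run ::: run2 run3 run4
```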
Ravnoor S Gill (2853 rep)
Dec 5, 2013, 09:04 PM • Last activity: May 13, 2025, 05:51 AM
1 vote
2 answers
414 views
Uncompress .lzo files in parallel in both folders simultaneously and then delete the original .lzo files
So I have .lzo files in the /test01/primary folder which I need to uncompress and then delete all the .lzo files. I need to do the same thing in the /test02/secondary folder as well. I will have around 150 .lzo files in each folder, so around 300 .lzo files in total. From the command line I was uncompressing one file at a time like this: lzop -d file_name.lzo. What is the fastest way to uncompress all the .lzo files and then delete them from both folders simultaneously? Below is the code I have:

```
#!/bin/bash
set -e

export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary

parallel lzop -dU -- ::: {"$PRIMARY","$SECONDARY"}/*.lzo
```

I want to uncompress and delete .lzo files in parallel in both the PRIMARY and SECONDARY folders simultaneously. With my code above, it works on PRIMARY first and then on SECONDARY. How can I achieve parallelism in PRIMARY and SECONDARY simultaneously? Also, does it uncompress all the files and then delete them later on, or does it uncompress one file, delete that file, and then move on to the next one? I tried this, but it doesn't work. It just works on the first 40 files and after that it doesn't work at all:

```
#!/bin/bash
set -e

export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary

parallel -j 40 lzop -dU -- ::: "$PRIMARY"/*.lzo &
parallel -j 40 lzop -dU -- ::: "$SECONDARY"/*.lzo &
wait
```
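A minimal sketch of one way to keep both folders busy at once without oversubscribing the machine, assuming roughly half the CPU cores should work on each folder (-j 50% is GNU parallel's percentage-of-cores syntax). As for deletion: lzop -dU removes each input file as soon as that file has been decompressed successfully, so it is per-file, not a separate pass.

```bash
#!/bin/bash
set -e

PRIMARY=/test01/primary
SECONDARY=/test02/secondary

# two parallel instances, each limited to half the CPU cores, so PRIMARY and
# SECONDARY are processed at the same time instead of one after the other
parallel -j 50% lzop -dU -- ::: "$PRIMARY"/*.lzo &
parallel -j 50% lzop -dU -- ::: "$SECONDARY"/*.lzo &
wait
```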
user1950349 (841 rep)
Oct 9, 2015, 11:34 PM • Last activity: May 12, 2025, 08:12 AM
1 vote
1 answer
213 views
How to make GNU parallel report progress in a way suitable for use outside of a terminal?
I want to run a command in parallel on a bunch of files as part of a Github CI workflow (on an ubuntu runner) in order to speed up the CI job. I would also like the parallel command to report its progress. Currently my command looks something like this:
```
# ci/clang-tidy-parallel.sh
find src \
  ! -path "path/to/exclude/*" \
  -type f \( -name "*.cpp" -o -name "*.h" \) \
  | parallel --progress "clang-tidy-19 {}"
```
This works great when run from a shell on my own machine: the jobs are executed in parallel and a single line of output is shown with how many jobs are in progress and how many have finished already. However, when run as part of the Github workflow the output is kind of nasty:

1. It prints the error `sh: 1: cannot open /dev/tty: No such device or address` a bunch of times.
2. It prints _way_ more progress output than necessary. Something like 1700 lines of progress reports, while there are only about 80 jobs to run. Most of these lines are duplicates. E.g., the first couple of lines are:
```
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
local:4/0/100%/0.0s
```
If I run the command locally and redirect stderr to a file, I observe similar behavior:

```
ci/clang-tidy-parallel.sh 2>log
```

When the command has finished, the log file contains hundreds of lines of output. (Though no errors about missing /dev/tty.) On the other hand, without the --progress option, the job just sits there with no visible output until it has completed, which is also not desirable. Is there a way to configure GNU parallel so that it reports progress in a way that is friendly to non-terminal environments? In particular, I would like it to only print a line of output when the status of a parallel job has changed (which should mean getting one line per job if everything goes smoothly).

---

Thanks to Ole Tange for pointing me in the right direction. Based on his solution and some AI-assisted coding I came up with this monstrosity:
```
file_list=$(find src \
  ! -path "path/to/exclude/*" \
  -type f \( -name "*.cpp" -o -name "*.h" \))

length=$(wc -w <<< "$file_list")
echo "Running clang-tidy on $length files"

echo "$file_list" | parallel --bar "clang-tidy-19 {}" 2> >(
    perl -pe 'BEGIN{$/="\r";$|=1};s/\r/\n/g' |
    grep '%' |
    perl -pe 'BEGIN{$|=1}s/\e\[[0-9;]*[a-zA-Z]//g' |
    perl -pe "BEGIN{\$length=$length;$|=1} s|(\d+)% (\d+):\d+=\S+ (\S+).*|\$1% (\$2/\$length) -- \$3|" |
    perl -ne 'BEGIN{$|=1}$s{$_}++ or print')
```
The raw output from --bar looks something like this:
```
#   0 sec src/tuner/Utilities.h
3.65853658536585
[7m3% 3:7[0m9=0s src/tuner/Utilities.h                                             [0m
```
(With escape sequences to print the progress bar.) The successive commands processing that output perform the following transformations:

- Transform carriage returns into newlines.
- Find lines containing percentage output.
- Strip out escape sequences.
- Perform a regex replacement to extract and format the number of files processed, the completion percentage, and the name of the file being processed. It also includes the total number of files to be processed via a shell variable.
- Print unique lines.

The BEGIN{$|=1} on the perl invocations is necessary to ensure output gets flushed immediately. The p option will run perl on each line of input and print the result. The n option runs on each line of input but does not automatically print. The e option provides the script as a CLI argument. It generates output similar to this:
```
Running clang-tidy on 82 files
1% (1/82) -- src/tuner/LoadPositions.h
2% (2/82) -- src/tuner/Main.cpp
2% (2/82) -- src/tuner/Utilities.h
3% (3/82) -- src/tuner/Utilities.h
...
```
I'm sure there's a better way to do those perl scripts (and not have 4 of them). But this works, and my perl-foo is very weak.
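A simpler, tty-free alternative worth sketching, offered as an assumption rather than the approach from the thread: drop --progress/--bar and have every job announce its own completion, using the {#} job-number replacement string, so the log contains exactly one line per job. The sketch glosses over preserving clang-tidy's exit status, which a CI job would want to keep.

```bash
find src \
  ! -path "path/to/exclude/*" \
  -type f \( -name "*.cpp" -o -name "*.h" \) |
  parallel --line-buffer 'clang-tidy-19 {} ; echo "[job {#}] finished {}" >&2'
```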
JSQuareD (113 rep)
Apr 26, 2025, 04:25 AM • Last activity: Apr 28, 2025, 01:02 AM
1 vote
1 answer
102 views
Parallel processing of single huge .bz2 or .gz file
I would like to use GNU Parallel to process a huge .gz or .bz2 file. I know I can do:

```
bzcat huge.bz2 | parallel --pipe ...
```

But it would be nice if there was a way, similar to --pipe-part, that could read multiple parts of the file in parallel. One option is to decompress the file:

```
bzcat huge.bz2 > huge
parallel --pipe-part -a huge ...
```

but huge.bz2 is huge, and I would much prefer decompressing it multiple times to storing it uncompressed.
Ole Tange (37348 rep)
Mar 28, 2025, 11:58 AM • Last activity: Mar 29, 2025, 10:33 AM
3 votes
2 answers
566 views
Is there a way to tell GNU parallel to hold off spawning new jobs until all jobs in a batch has finished?
I want to run four processes in parallel, but not spawn any new jobs until all of these four have finished.

EDIT: My command looks like this:

```
find . -name "*.log" | parallel -j 4 './process.sh {}'
```
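A minimal sketch of one way to get strict batches of four, using xargs instead of parallel and assuming the ordering inside a batch does not matter: hand the helper shell four file names at a time and have it wait for all four before the next batch starts.

```bash
# each invocation starts its 4 files in the background and waits for all of
# them, so a new batch only begins once the previous batch has fully finished
find . -name "*.log" -print0 |
  xargs -0 -n 4 sh -c 'for f in "$@"; do ./process.sh "$f" & done; wait' sh
```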
user2724383 (33 rep)
Jun 27, 2019, 02:35 PM • Last activity: Mar 26, 2025, 10:50 AM
2 votes
1 answer
45 views
GNU parallel: substitute a string or nothing depending on argument value
I'm trying to use GNU parallel to run a command for all combinations of several behavior-changing flags:
```
parallel 'cmd --foo {1} --bar {2} {3} out.foo={1}.bar={2}/{3/}' ::: 0 1 ::: 0 1 ::: in/*
```
This should yield a series of invocations like:
```
cmd --foo 0 --bar 1 'in/fileXYZ' 'out.foo=0.bar=1/fileXYZ'
cmd --foo 1 --bar 0 'in/fileXYZ' 'out.foo=1.bar=0/fileXYZ'
```
However, the CLI of this tool is irregular: --foo accepts a boolean argument, whereas --bar does not accept an argument (and has no negative form either). Thus, the invocations must instead look like this:
```
cmd --foo 0 --bar 'in/fileXYZ' 'out.foo=0.bar=1/fileXYZ'
cmd --foo 1       'in/fileXYZ' 'out.foo=1.bar=0/fileXYZ'
```
---

What is the best (least verbose) way in GNU parallel to transform an argument of 1 or 0 into the presence or absence of --bar on the command line?
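A minimal sketch of one workable, if not maximally terse, approach: do the 0/1-to-flag translation in the shell snippet that parallel runs, so an empty variable simply vanishes from the command line. GNU parallel's perl-expression replacement strings ({= ... =}) may allow something shorter, but the shell form below is easy to verify.

```bash
parallel 'bar=""; [ {2} = 1 ] && bar=--bar; cmd --foo {1} $bar {3} out.foo={1}.bar={2}/{3/}' \
  ::: 0 1 ::: 0 1 ::: in/*
```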
intelfx (5699 rep)
Mar 15, 2025, 05:21 PM • Last activity: Mar 17, 2025, 11:48 AM
1 vote
2 answers
61 views
recursive call to gnu parallel
I have a very slow samba share that I can access through WSL on my Windows laptop. On Windows, accessing the share doesn't take much time, but on Linux it is on the order of seconds. I basically replicate the file structure of the share, so I made this poor man's rsync bash script. Given that it takes several seconds to get the contents of one directory, I thought it would be a good idea to 'enqueue' the directories in parallel to accelerate the process. So I made the following bash script. The replicate function processes the current directory and then calls itself, via parallel, on all the subdirectories.

Running this program, I noticed the most used resources were memory and CPU. Disk and network I/O are barely used. I understand that memory needs to be allocated for all the processes that are created, but I supposed that the processes would spend most of their time idling, either waiting for network I/O or for their children to terminate. My question is: what is the CPU doing during all this time? All my cores are used at 100%. For reference, there are at all times around 800 to 1000 processes running, using around 4-6G of memory.
```
create_file(){
    if [ ! -f "$1" ] ; then
        touch "$1";
    fi
}
export -f create_file

better_ls(){
    find "$1" -maxdepth 1 -type f ! -name '~*' -exec bash -c 'echo "- $(basename "{}")"' \; -o -type d -exec bash -c 'echo "d $(basename "{}")"' \; | tail -n+2 | sort
}
export -f better_ls


replicate() {
    DIR="$(realpath "$1")"
    OUTPUT="$2"
    if [ ! -e "$DIR" ] ; then
        echo "File not found: $DIR";
        return
    fi
    OUTPUT="$OUTPUT$DIR"

    echo "$DIR" | sed -e "s!/mnt/!!" -e "s;/;  ;g";

    mkdir -p "$OUTPUT" ;

    content="$(better_ls "$DIR")";
    alreadycopied="$(better_ls "$OUTPUT" | cut -d' ' -f2-)";

    newcontent="$(echo "$content" | cut -d' ' -f2-)";

    echo "$content" |
        grep '^-' |
        cut -d\  -f2- |
        xargs -I {} bash -c "create_file  \"$OUTPUT/{}\""

    diff <(echo "$alreadycopied") <(echo "$newcontent") | grep '^<' | cut -c 3- | xargs -I{} rm -r "$OUTPUT/{}"


    echo "$content" |
        grep '^d' |
        cut -d\  -f2- |
        parallel --line-buffer -I {} replicate "\"$DIR\"/{}" "$2"

}
export -f replicate;
```
PS: this idea was not so great and I went back to using rsync, but I still wanted to understand what was happening here.
Rouge a (11 rep)
Mar 7, 2025, 10:02 AM • Last activity: Mar 10, 2025, 03:47 PM
2 votes
2 answers
113 views
Parallel for-loop in bash with simultaneous sequential execution of another task with dependencies on the parallelized loop
In my bash script, I need to execute two different functions, `taskA` and `taskB`, which take an integer (`$i`) as an argument. Since `taskB $i` depends on the completion of `taskA $i`, the following abbreviated piece of code does the job:
```
#!/bin/bash

taskA(){
  ...
}

taskB(){
  ...
}

for i in {1..100};
do
  taskA $i
  taskB $i
done
```
As `taskA` can be run at different `$i` independently, I can create a semaphore (taken from here: Parallelize a Bash FOR Loop) and execute it in parallel. However, `taskB $i` requires the completion of `taskA $i` and the previous `taskB $(i-1)`. Therefore, I just run them sequentially afterwards:
```
#!/bin/bash

open_sem(){
  mkfifo pipe-$$
  exec 3<>pipe-$$
  rm pipe-$$
  local i=$1
  for((;i>0;i--)); do
    printf %s 000 >&3
  done
}

run_with_lock(){
  local x
  read -u 3 -n 3 x && ((0==x)) || exit $x
  (
   ( "$@"; )
  printf '%.3d' $? >&3
  )&
}

taskA(){
  ...
}

taskB(){
  ...
}

N=36
open_sem $N
for i in {1..100};
do
  run_with_lock taskA $i
done

wait

for i in {1..100};
do
  taskB $i
done
```
In order to further optimize the procedure, is it possible to keep the semaphore for the parallel execution of `taskA` and run `taskB` simultaneously, in such a way that it does not "overtake" `taskA` and waits for the completion of the `taskA` it depends on?
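A minimal sketch of one possible arrangement, an assumption on my part rather than something from the post: keep the semaphore for the taskA jobs, have each completed taskA drop a per-index marker file, and run the sequential taskB loop concurrently, blocking on the marker so taskB $i never overtakes taskA $i.

```bash
tmpdir=$(mktemp -d)

taskA_and_mark() {
  taskA "$1"
  touch "$tmpdir/done.$1"            # signal that taskA $1 has completed
}

N=36
open_sem $N
for i in {1..100}; do
  run_with_lock taskA_and_mark $i
done &                               # launch loop in the background so taskB can start

for i in {1..100}; do
  until [ -e "$tmpdir/done.$i" ]; do # block until taskA $i has finished
    sleep 1
  done
  taskB $i
done

wait
rm -r "$tmpdir"
```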
Schnarco (121 rep)
Apr 20, 2024, 08:12 PM • Last activity: Mar 3, 2025, 10:36 PM
15 votes
2 answers
18955 views
GNU Parallel: immediately display job stderr/stdout one-at-a-time by jobs order
I know that GNU Parallel buffers stdout/stderr because it doesn't want jobs' output to be mangled, but if I run my jobs with `parallel do_something ::: task_1 task_2 task_3`, is there any way for task_1's output to be displayed immediately, then, after task_1 finishes, task_2's output up to its current point, and so on? If Parallel cannot solve this problem, is there any other similar program that could?
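A minimal sketch based on my reading of the GNU parallel documentation for --line-buffer: combined with --keep-order it prints the first job's lines as they arrive and holds back later jobs' lines until the earlier jobs have finished, which sounds like the behaviour asked for here.

```bash
parallel --keep-order --line-buffer do_something ::: task_1 task_2 task_3
```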
Hai Luong Dong (706 rep)
Apr 21, 2016, 03:36 AM • Last activity: Oct 9, 2024, 01:51 PM
9 votes
3 answers
13882 views
How to get GNU parallel on Amazon Linux?
Preferably without having to compile it from source. I tried adding repositories I found on Google: CentOS 6 and CentOS 5, but both give me:

```
[ec2-user@ip-10-0-1-202 yum.repos.d]$ sudo yum install parallel -y
Loaded plugins: priorities, update-motd, upgrade-helper
amzn-main/2016.03 | 2.1 kB 00:00
amzn-updates/2016.03 | 2.3 kB 00:00
952 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package parallel.noarch 0:20160522-1.1 will be installed
--> Processing Dependency: /usr/bin/fish for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /usr/bin/ksh for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /usr/bin/zsh for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /bin/pdksh for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /usr/bin/ksh for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /usr/bin/zsh for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /usr/bin/fish for package: parallel-20160522-1.1.noarch
--> Processing Dependency: /bin/pdksh for package: parallel-20160522-1.1.noarch
--> Finished Dependency Resolution
Error: Package: parallel-20160522-1.1.noarch (home_tange)
       Requires: /bin/pdksh
Error: Package: parallel-20160522-1.1.noarch (home_tange)
       Requires: /usr/bin/fish
Error: Package: parallel-20160522-1.1.noarch (home_tange)
       Requires: /usr/bin/zsh
Error: Package: parallel-20160522-1.1.noarch (home_tange)
       Requires: /usr/bin/ksh
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
```
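A minimal sketch of the "minimal install" route described in GNU parallel's own documentation, which sidesteps the RPM dependency problem entirely: parallel is a self-contained Perl script, so there is nothing to compile, and copying it onto the PATH is enough (assuming outbound network access and that the parallel-latest tarball is still published on the GNU mirrors).

```bash
wget https://ftpmirror.gnu.org/parallel/parallel-latest.tar.bz2
tar xjf parallel-latest.tar.bz2
sudo cp parallel-*/src/parallel /usr/local/bin/
parallel --version   # should print the date-based release version
```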
Matt Chambers (241 rep)
Jun 15, 2016, 10:36 PM • Last activity: Sep 5, 2024, 05:42 PM
1 vote
1 answer
53 views
How do I ask gnu parallel to preserve stdin?
I'm using GNU parallel like this (:::: is a form of --arg-file):

```
parallel -0Xuj1 my-command -- :::: <(find … -print0)
```

But it seems like the command's standard input is managed by GNU parallel, which doesn't work for an interactive command. The command complains when its stdin gets closed. I'm guessing GNU parallel wants to use stdin to append to the arg list. I've been able to use the command interactively with parallel's --tmux flag and reattaching, but that's needlessly complicated. So far my workaround is to use plain xargs.

I'm not using find -exec my-command -- {} + because it builds an argument list that's too large for my-command (it's Python, it reexecs itself and breaks; parallel and xargs have a flag to leave some headroom). Does anyone know a flag to tell GNU parallel to leave stdin alone?

```
xargs -0a <(find … -print0) my-command --
```
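A sketch of one option that might fit, offered as an assumption rather than a confirmed fix: --tty asks GNU parallel to run one job at a time (matching the -j1 already in use) with the job attached to the terminal instead of the usual managed stdin, which is what an interactive command needs.

```bash
parallel -0X --tty my-command -- :::: <(find … -print0)
```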
Tobu (6769 rep)
Jul 30, 2024, 08:55 AM • Last activity: Jul 30, 2024, 12:59 PM
1 vote
1 answer
90 views
GNU parallel: how to call exported bash function with empty string argument?
Scenario:

```
$ process(){ echo "[$1] [$2] [$3]" ; } ; export -f process

$ process "x" "" "a.txt"
[x] [] [a.txt]
```

Here we see that the 2nd argument is empty string (expected).

```
$ find -name "*.txt" -print | SHELL=$(type -p bash) parallel process "x" ""
[x] [./a.txt] []
[x] [./b.txt] []
[x] [./c.txt] []
```

Here we see that the 2nd argument is the output of find (unexpected). Expected output:

```
[x] [] [./a.txt]
[x] [] [./b.txt]
[x] [] [./c.txt]
```

How to fix?

---

Note: if the 2nd argument is changed from "" to "y", then the output of find is present as the 3rd argument (expected):

```
$ find -name "*.txt" -print | SHELL=$(type -p bash) parallel process "x" "y"
[x] [y] [./a.txt]
[x] [y] [./b.txt]
[x] [y] [./c.txt]
```

Why _isn't_ the output of find present as the 3rd argument with ""?

---

UPD: It seems that the solution is \"\":

```
$ find -name "*.txt" -print | SHELL=$(type -p bash) parallel process "x" \"\"
[x] [] [./a.txt]
[x] [] [./b.txt]
[x] [] [./c.txt]
```

However, I'm not sure that this is the correct general solution. Here is the counterexample:

```
$ VAR="" ; find -name "*.txt" -print | SHELL=$(type -p bash) parallel process "x" "$VAR"
[x] [./a.txt] []
[x] [./b.txt] []
[x] [./c.txt] []
```
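One thing worth trying, offered as an assumption rather than a confirmed fix: -q makes GNU parallel quote the command and its arguments itself, which should let an empty argument (literal or coming from a variable) survive as a real positional parameter instead of collapsing when the command line is composed.

```bash
VAR=""
# if -q preserves the empty argument, each output line should look like: [x] [] [./a.txt]
find -name "*.txt" -print | SHELL=$(type -p bash) parallel -q process "x" "$VAR"
```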
pmor (665 rep)
Feb 29, 2024, 10:10 AM • Last activity: Feb 29, 2024, 10:46 AM
2 votes
1 answer
226 views
Use GNU parallel with very long lines
I have a very large SQL dump file (30GB) that I need to edit (do some find/replace) before loading it back into the database. Besides its large size, the file also contains very long lines. Except for the first 40 and last 12 lines, all other lines have lengths of ~1MB. These lines are all INSERT INTO commands that all look alike:
```
cat bigdumpfile.sql | cut -c-100
INSERT INTO table1 VALUES (951068,1407592,0.0267,0.0509,0.121),(285
INSERT INTO table1 VALUES (238317,1407664,0.008,0.0063,0.1286),(241
INSERT INTO table1 VALUES (938922,1407739,0.0053,0.0024,0.031),(226
INSERT INTO table1 VALUES (44678,1407886,0.0028,0.0028,0.0333),(234
INSERT INTO table1 VALUES (910412,1407961,0.001,0.0014,0),(911017,1
INSERT INTO table1 VALUES (903890,1408050,0.0066,0.01,0.0287),(9095
INSERT INTO table1 VALUES (257090,1408136,0.0023,0.0037,0.0196),(56
INSERT INTO table1 VALUES (593367,1408237,0.0066,0.0117,0.0286),(95
INSERT INTO table1 VALUES (870488,1408339,0.0131,0.009,0.0135),(870
INSERT INTO table1 VALUES (282798,1408414,0.0015,0.014,0.014),(2830
...
```
Parallel ends with an error on long lines:
```
parallel -a bigdumpfile.sql -k sed -i.bak 's/table1/newtable/'
parallel: Error: Command line too long (1018952 >= 63543) at input 0: INSERT INTO `table1...
```
Because all lines are similar, and I only need the find/replace to happen at the beginning of each line, I followed the advice [in this similar question here](https://unix.stackexchange.com/questions/642939/use-gnu-parallel-when-file-has-a-single-long-line) with its nice suggestion to use `--recstart` and `--recend`. However, these are not working:
```
parallel -a bigdumpfile.sql -k --recstart 'INSERT' --recend 'VALUES' sed -i.bak 's/table/newtable/'
parallel: Error: Command line too long (1018952 >= 63543) at input 0: INSERT INTO `table1...
```
I tried a number of variations using --block but could not get it working. I am a GNU parallel newbie and am probably doing something wrong or just missing something obvious. Any help appreciated. Thanks! This is using GNU parallel 20240122.
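A minimal sketch of one direction worth trying, based on how --pipepart is documented rather than on a verified answer for this exact file: the input is streamed to sed on stdin in large blocks, so line length no longer matters, and the result is written to a new file instead of being edited in place with -i.

```bash
parallel -a bigdumpfile.sql -k --pipepart --block 100M \
  "sed 's/table1/newtable/'" > newdump.sql
```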
fernan (23 rep)
Feb 24, 2024, 03:25 PM • Last activity: Feb 27, 2024, 07:21 AM
0 votes
1 answer
65 views
gnu parallel: how to control output of program?
Fast and simple. This command works:

```
locate -i mymovieormysong | parallel mplayer
```

The song (or movie) plays, but I cannot control mplayer with the keyboard. How can I do this (if it is possible)? Actually, when I use the keyboard to go forward or backward I get this:

```
^[[C^[[C^[[C^[[C^[[C^[[C^[[C^[[D^[[D^[[D
```

Edit 1: using the -u (ungroup) option, the output appears, but when I press keys to control mplayer, [C and [D still appear.
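A minimal sketch of the usual remedy for interactive programs under GNU parallel, assuming mplayer just needs the terminal: --tty runs one job at a time and attaches it to the tty, so the arrow keys reach mplayer instead of being echoed as escape sequences.

```bash
locate -i mymovieormysong | parallel --tty mplayer
```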
elbarna (13690 rep)
Feb 18, 2024, 11:20 PM • Last activity: Feb 23, 2024, 10:42 AM
1 vote
1 answer
61 views
Why does /usr/bin/time coupled with GNU parallel output results before command's output rather than after command's output?
Scenario:
```
$ cat libs.txt
lib.a
lib1.a

$ cat t1a.sh
f1()
{
        local lib=$1
        stdbuf -o0 printf "job for $lib started\n"
        sleep 2
        stdbuf -o0 printf "job for $lib done\n"
}
export -f f1
/usr/bin/time -f "elapsed time %e" cat libs.txt | SHELL=$(type -p bash) parallel --line-buffer --jobs 2 f1
```

```
$ bash t1a.sh
elapsed time 0.00
job for lib.a started
job for lib1.a started
job for lib.a done
job for lib1.a done
```
Here we see that elapsed time 0.00 appears _before_ the command's output. Why? How can I make elapsed time 0.00 appear _after_ the command's output?
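A sketch of a likely explanation and fix, based on my reading of the pipeline rather than a verified answer: /usr/bin/time only wraps cat, which exits almost immediately, so its report goes to stderr before parallel has printed anything; timing the whole pipeline instead makes the report come last.

```bash
/usr/bin/time -f "elapsed time %e" bash -c '
    cat libs.txt | SHELL=$(type -p bash) parallel --line-buffer --jobs 2 f1
'
# f1 is still visible inside bash -c because it was exported with `export -f f1`
```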
pmor (665 rep)
Feb 2, 2024, 01:52 PM • Last activity: Feb 7, 2024, 01:25 PM
2 votes
1 answer
51 views
GNU parallel: why does diagnostic output look like sequential execution rather than parallel execution?
Scenario:
```
$ cat libs.txt
lib.a
lib1.a

$ cat t1a.sh
f1()
{
        local lib=$1
        stdbuf -o0 printf "job for $lib started\n"
        sleep 2
        stdbuf -o0 printf "job for $lib done\n"
}
export -f f1
cat libs.txt | SHELL=$(type -p bash) parallel --jobs 2 f1
```
Invocation and output:
```
$ time bash t1a.sh
job for lib.a started
job for lib.a done
job for lib1.a started
job for lib1.a done

real    0m2.129s
user    0m0.117s
sys     0m0.033s
```
Here we see that the execution of f1 was indeed parallel (real 0m2.129s). However, the diagnostic output looks as if the execution was sequential. I expected the following diagnostic output:
```
job for lib.a started
job for lib1.a started
job for lib.a done
job for lib1.a done
```
Why does the diagnostic output look like sequential execution rather than parallel execution? How can I fix the diagnostic output so that it reflects the parallel execution?
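The explanation is in GNU parallel's defaults: it groups each job's output and prints it only when that job finishes, so logs read as if the jobs ran back to back even though they overlapped. A minimal sketch of the fix:

```bash
# --line-buffer prints lines as soon as they are produced (per line, not per
# job), so both "started" lines appear before either "done" line
cat libs.txt | SHELL=$(type -p bash) parallel --line-buffer --jobs 2 f1
```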
pmor (665 rep)
Feb 1, 2024, 12:45 PM • Last activity: Feb 1, 2024, 02:35 PM
5 votes
1 answer
3393 views
Parallel Copy from local folders to remote servers at the same time
I have multiple folders, and each folder has about 1500 files. I have a kind of for loop going over each folder and then sending the files to either one or four remote hosts, depending upon the environment. Currently I am using rdist. Almost every file I have changes on a daily basis; sometimes only the date and time inside the file change. I came across a few commands like pscp and prsync, as well as GNU parallel. I experimented with pscp and rdist on multiple hosts; both give similar results.

1. What is the difference between rdist and prsync in terms of performance? My understanding is that prsync can migrate files to multiple hosts, and the same goes for rdist. My understanding from my tests is that neither prsync nor rdist copies multiple files in parallel to a single host; they can only copy file by file, in parallel, to multiple hosts. So is there any difference between the two on the performance side? For rdist, my scripts create a distfile like

```
HOSTS( user@server user@server2 user@server3 )
RUN:(/var/inputpath/folder) -> ${HOSTS}
    install (/var/outputpath/folder)
```

then I run rdist like the following:

```
rdist -f /dist-file-path -P /path/to/ssh
```

2. I tested GNU parallel for local copying using cp and for zipping using zip. It is really very fast. It allows copying multiple files in parallel even on the local computer. So my question is: is there a possibility to combine GNU parallel with, say, pscp or rdist or prsync?
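A minimal sketch of one way to combine GNU parallel with rsync for this, treating the host names and folder names as placeholders: fan out over every (host, folder) combination so several folders are being pushed to several hosts at the same time.

```bash
# up to 8 concurrent transfers; {1} is the destination host, {2} the folder
parallel -j 8 rsync -az /var/inputpath/{2}/ {1}:/var/outputpath/{2}/ \
  ::: user@server1 user@server2 user@server3 ::: folder1 folder2 folder3
```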
user321507 (51 rep)
Nov 17, 2018, 10:09 PM • Last activity: Jan 25, 2024, 07:03 PM