Make parameter substitution in newline-separated string more efficient

The following code demonstrates, and helps with testing, inefficient pattern-matching expressions in parameter substitution on a newline-separated string variable versus an array.

The goal is to achieve performance at least on par with grep when filtering git status -s output for only those entries that involve index changes (fully or partially staged).

That means every change entry that starts with a Git short-status flag character from [MTARDC] (signalling a staged/index change), including double flags (signalling partially staged changes in both index and worktree), but excluding untracked or unstaged-only changes (which begin with [ ?]).

Note that R (rename) change flags can be followed by multiple digits; see the examples in the test data below (possibly even for both index and worktree, e.g. R104R104).
See: short format of Git status 
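
As a rough illustration of the selection rule described above, the following sketch classifies a handful of entries by their first column. The sample entries and the names `sample` and `staged` are made up for this example; they only mimic the format of the test data further down.

```bash
#!/bin/bash
# Hypothetical sample entries, mimicking the format of the test data below.
sample=(
	'M  staged.txt'
	'MM partly-staged.txt'
	' M unstaged.txt'
	'?? untracked.txt'
	'R104R104 renamed-both.txt'
	' R101 renamed-worktree.txt'
)

staged=()
for entry in "${sample[@]}"; do
	# An index (staged) change has one of MTARDC in the first column;
	# a leading space or '?' means unstaged-only or untracked.
	case $entry in
		[MTARDC]*) staged+=("$entry") ;;
	esac
done

# Prints the three entries whose first column is an index flag.
printf '%s\n' "${staged[@]}"
```

The loop is only meant to make the rule explicit; it is not a candidate for the performance comparison.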

The test data also contains file names with potentially problematic special characters, such as backslash escape sequences, spaces, dollar signs, asterisks or double quotes: [\ $*"].

Also note that the substitution based on pattern matching requires negated patterns, compared to a grep regex that yields the same results: grep keeps the lines that match, whereas the substitution deletes whatever matches. To print the results, simply comment out the &>/dev/null parts.
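
To make the negation concrete, here is a minimal sketch on a four-element array. The array `entries` and the variables `kept_by_grep`, `filtered` and `kept_by_subst` are names invented for this illustration; the two patterns are the ones used in the test script.

```bash
#!/bin/bash
shopt -s extglob

# Hypothetical entries in git status -s style.
entries=('A  added.txt' ' M unstaged.txt' '?? untracked.txt' 'R102  renamed.txt')

# grep keeps what matches: lines whose first column is an index flag.
kept_by_grep=$(printf '%s\n' "${entries[@]}" | grep '^[MTARDC]')

# The substitution deletes what matches, so the pattern has to describe
# the entries to drop: first column is a space or '?'.
filtered=("${entries[@]/#[? ][?MTARDC]*([0-9]) *}")

# Dropped entries are left behind as empty elements; skip them here.
kept_by_subst=$(for e in "${filtered[@]}"; do
	if [[ -n $e ]]; then printf '%s\n' "$e"; fi
done)

printf '%s\n' "$kept_by_subst"
```

Both variables end up holding the same two entries, although the grep regex names the flags to keep and the substitution pattern names the prefixes to remove. Note that the substitution leaves empty elements in place of the removed entries, which the timing test prints as empty lines.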

	#! /bin/bash

    # Set extended pattern matching active
	shopt -s extglob 

	clear
	unset -v tmpVar tmpArr

    # Populate tmpVar and tmpArr for testing
	for i in {1..3}; do
		tmpVar+=' A addedW1'[$i]$'\n'
		tmpVar+='A  addedI1'[$i]$'\n'
		tmpVar+='AM addedA1'[$i]$'\n'
		tmpVar+=' C copiedW1'[$i]$'\n'
		tmpVar+='C  copiedI1'[$i]$'\n'
		tmpVar+='CR copied A1'[$i]$'\n'
		tmpVar+=' D removedW1'[$i]$'\n'
		tmpVar+='D  removedI1'[$i]$'\n'
		tmpVar+='DM removedA1'[$i]$'\n'
		tmpVar+=' M modifiedW1'[$i]$'\n'
		tmpVar+='M  modifiedW1'[$i]$'\n'
		tmpVar+='MR modifiedA1'[$i]$'\n'
		tmpVar+=' R101 renamedW1'[$i]$'\n'
		tmpVar+='R102  renamedI2'[$i]$'\n'
		tmpVar+='R103D renamedA1'[$i]$'\n'
		tmpVar+=' T typeChangedW1'[$i]$'\n'
		tmpVar+='T  typeChangedI1'[$i]$'\n'
		tmpVar+='TM typeChangedA1'[$i]$'\n'
		tmpVar+='?? exec2.bin'[$i]$'\n'
		tmpVar+='?? file1.txt'[$i]$'\n'
		tmpVar+='?? test.launch2'[$i]$'\n'
		tmpVar+='A  file00 0.bin'[$i]$'\n'
		tmpVar+='A  file11*1.bin'[$i]$'\n'
		tmpVar+='A  file22\03457zwei.bin'[$i]$'\n'
		tmpVar+='A  file33\t3.bin'[$i]$'\n'
		tmpVar+='A  file44$4.bin'[$i]$'\n'
		tmpVar+='A  file55"$(echo EXE)"5.bin'[$i]$'\n'
		tmpVar+='M  exec1.bin'[$i]$'\n'
		tmpVar+=' M test.launch1'[$i]$'\n'
		tmpVar+=' M myproject/src/main/java/util/MyUtil.java'[$i]$'\n'
		tmpVar+='M  myproject/src/test/util/MyUtilTest.java'[$i]$'\n'
		tmpVar+='R104R104 myproject/src/test/util/MyUtil2Test.java'[$i]$'\n'
		tmpVar+=' A invalidAdd'[$i]$'\n'
		tmpVar+='R invalidRename'[$i]$'\n'
		tmpArr+=(" A addedW1[$i]")
		tmpArr+=("A  addedI1[$i]")
		tmpArr+=("AM addedA1[$i]")
		tmpArr+=(" C copiedW1[$i]")
		tmpArr+=("C  copiedI1[$i]")
		tmpArr+=("CR copied A1[$i]")
		tmpArr+=(" D removedW1[$i]")
		tmpArr+=("D  removedI1[$i]")
		tmpArr+=("DM removedA1[$i]")
		tmpArr+=(" M modifiedW1[$i]")
		tmpArr+=("M  modifiedW1[$i]")
		tmpArr+=("MR modifiedA1[$i]")
		tmpArr+=(" R101 renamedW1[$i]")
		tmpArr+=("R102  renamedI2[$i]")
		tmpArr+=("R103D renamedA1[$i]")
		tmpArr+=(" T typeChangedW1[$i]")
		tmpArr+=("T  typeChangedI1[$i]")
		tmpArr+=("TM typeChangedA1[$i]")
		tmpArr+=("?? exec2.bin[$i]")
		tmpArr+=("?? file1.txt[$i]")
		tmpArr+=("?? test.launch2[$i]")
		tmpArr+=("A  file00 0.bin[$i]")
		tmpArr+=("A  file11*1.bin[$i]")
		tmpArr+=("A  file22\03457zwei.bin[$i]")
		tmpArr+=("A  file33\t3.bin[$i]")
		tmpArr+=('A  file44$4.bin['$i']')
		tmpArr+=('A  file55"$(echo EXE)"5.bin['$i']')
		tmpArr+=("M  exec1.bin[$i]")
		tmpArr+=(" M test.launch1[$i]")
		tmpArr+=(" M myproject/src/main/java/util/MyUtil.java[$i]")
		tmpArr+=("M  myproject/src/test/util/MyUtilTest.java[$i]")
		tmpArr+=("R104R104 myproject/src/test/util/MyUtil2Test.java[$i]")
		tmpArr+=(" A invalidAdd[$i]")
		tmpArr+=("R invalidRename[$i]")
	done

    # Perf-test array or string var filtering via grep
	_IFS="$IFS"; IFS=$'\n'
	startTime="$EPOCHREALTIME"
	grep '^[MTARDC]' <<<"${tmpArr[*]}" &>/dev/null
	stopTime="$EPOCHREALTIME"
	IFS="$_IFS"
	echo
	awk 'BEGIN { printf "ELAPSED TIME via grep filtering from ARRAY: "; print '"$stopTime"' - '"$startTime"' }'

    # Perf-test array filtering via Pattern Matching in Parameter Substitution 
	startTime="$EPOCHREALTIME"
	printf '%s\n' "${tmpArr[@]/#[? ][?MTARDC]*([0-9]) *}" &>/dev/null
	stopTime="$EPOCHREALTIME"
	echo
	awk 'BEGIN { printf "ELAPSED TIME via parameter substitution from ARRAY: "; print '"$stopTime"' - '"$startTime"' }'

    # Perf-test string var filtering via Pattern Matching in Parameter Substitution 
	startTime="$EPOCHREALTIME"
	printf '%s\n' "${tmpVar//[? ][?MTARDC]*([0-9]) *([^$'\n'])?($'\n')}" &>/dev/null
	stopTime="$EPOCHREALTIME"
	echo
	awk 'BEGIN { printf "ELAPSED TIME via parameter substitution from VAR: "; print '"$stopTime"' - '"$startTime"' }'

    # RESULT:
    #ELAPSED TIME via grep filtering from ARRAY: 0.054975
    #ELAPSED TIME via parameter substitution from ARRAY: 0.00031805
    #ELAPSED TIME via parameter substitution from VAR: 4.546

As can be seen, grep performs well, but variant #2 (array filtering via parameter substitution) is far faster, so for huge arrays it is a good alternative to grep.

Filtering the string variable via parameter substitution, on the other hand, is terribly slow.

This is mostly because the matching pattern cannot end with * (which would remove everything from the first match to the end of the string) and needs *([^$'\n'])?($'\n') instead, in order to match (and remove) everything up to and including the next newline. To some extent it is also due to the greedy tmpVar// matching.
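
The difference between the two pattern endings can be seen on a tiny string. This is only a sketch; `text`, `too_much` and `line_wise` are names made up for the illustration.

```bash
#!/bin/bash
shopt -s extglob

# Hypothetical four-line input: two staged entries, two to be dropped.
text=$'A  keep1\n M drop1\nM  keep2\n?? drop2\n'

# Pattern ending in *: the first match swallows the rest of the string,
# including the later staged entry.
too_much=${text//[? ][?MTARDC]*([0-9]) *}

# Pattern bounded by the newline: each match stops at the end of its line.
line_wise=${text//[? ][?MTARDC]*([0-9]) *([^$'\n'])?($'\n')}

printf 'ending with *:\n%s\n' "$too_much"
printf 'newline-bounded:\n%s' "$line_wise"
```

With the trailing *, only the first line survives, whereas the newline-bounded pattern keeps both staged lines. That correctness is what the slower pattern buys.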

Is there another way or pattern for this example to process string variables with pattern matching, as with the array, without the problematic and slow negated-newline character matching, and to get near the speed of the array example?
                        
Asked by fozzybear (59 rep)
Jan 23, 2025, 06:19 PM
Last activity: Mar 8, 2025, 06:03 AM