Grep (BRE) on surrounding delimiters w/o consuming the delimiter? Counting delimiter-separated strings between filename and extension
I have a dataset of images labeled/classified by characteristics, where an image can have more than one label. I want to count how many of each identifier I have. A toy dataset is created below, with different colors being the labels.
bballdave025@MY-MACHINE /home/bballdave025/toy
$ touch elephant_grey.jpg && touch zebra_white_black.jpg && touch rubik-s_cube_-_1977-first-prod_by_ErnoRubik-_red_orange_yellow_white_blue_green.jpg && touch Radio_Hotel.Washington_Heights.NYC-USA_green_yellow_blue_red_orange_reddish-purple_teal_grey-brown.jpg && touch Big_Bird__yellow_orange_red.jpg
Let's make it more easily visible. The files in the initially labeled dataset are shown below. (The `| awk -F'/' '{print $NF}'` is just meant to take off the `./` or `path/to/where/the/jpegs/are/` that would otherwise be before the filename.)
$ find . -type f | awk -F'/' '{print $NF}'
Big_Bird__yellow_orange_red.jpg
elephant_grey.jpg
Radio_Hotel.Washington_Heights.NYC-USA_green_yellow_blue_red_orange_reddish-purple_teal_grey-brown.jpg
rubik-s_cube_-_1977-first-prod_by_ErnoRubik-_red_orange_yellow_white_blue_green.jpg
zebra_white_black.jpg
Those are the filenames for labeled versions of the images. The corresponding originals are below:
$ find ../toy_orig_bak/ -type f | awk -F'/' '{print $NF}'
Big_Bird_.jpg
elephant.jpg
Radio_Hotel.Washington_Heights.NYC-USA.jpg
rubik-s_cube_-_1977-first-prod_by_ErnoRubik-.jpg
zebra.jpg
This is to show that the color labels are inserted between the filename and the dot extension. They are separated from each other and from the original filename by a (delimiting) `_` character. (There are rules for the label names and for the filenames \[1\].) The only allowed color strings at this initial point are any of {black, white, grey, red, orange, yellow, green, blue, reddish-purple, teal, grey-brown}.
I further want to show that other labels may be added, as long as they're part of my controlled vocabulary, something which can be changed only by me. Imagine a file named `rainbow.jpg` gets put in with the original filenames (`touch ../toy_orig_bak/rainbow.jpg`, for those of you following along for reproducibility). I decide that I want to add `indigo` and `violet` to my controlled vocabulary list, so I can create the labeled filename,
$ touch rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
Desired Output
Again, I want a count of each of the labels. For the dataset I've set up (including that last labeled picture of a rainbow), the correct output would be
1 black
3 blue
3 green
1 grey
1 grey-brown
1 indigo
4 orange
4 red
1 reddish-purple
1 teal
1 violet
2 white
4 yellow
(The counts were performed somewhat manually, due to my `grep` confusion.)
Attempts and a note on the details of the solution I want
Research below
My first thought (although I did worry about delimiter consumption) was to look at the surrounding delimiters: '`_`' before and '`_`' or '`.`' after.
Here's my first `grep` attempt
find . -type f -iname "*.jpg" | \
\
grep -o "[_]\(black\|white\|grey\|red\|orange\|yellow\|green\|blue\|"\
"reddish-purple\|teal\|grey-brown\|indigo\|violet\)[_.]" | \
\
tr -d [_.] | sort | uniq -c
and its output
3 blue
1 green
1 grey
1 orange
3 red
1 teal
1 violet
1 white
3 yellow
Which is not the same as before. Here's the comparison.
Before            | Now
------------------|------------------
1 black           |
3 blue            | 3 blue
3 green           | 1 green
1 grey            | 1 grey
1 grey-brown      |
1 indigo          |
4 orange          | 1 orange
4 red             | 3 red
1 reddish-purple  |
1 teal            | 1 teal
1 violet          | 1 violet
2 white           | 1 white
4 yellow          | 3 yellow
I know this is happening because the regex engine consumes the second delimiter \[2\].
Here is the crux of my main question: (I do want to solve my count problem, and I'll talk about some solutions I've researched and considered myself, but) the detail I want to know is about truly regular expressions and consuming the delimiter.
I want to get a count of each identifier string, and I'm wondering if I can do it with this approach and (POSIX) Basic Regular Expressions (BRE; see Note \[2\] and this [reddit thread](https://www.reddit.com/r/askscience/comments/5rttyo/do_extended_regular_expressions_still_denote_the/) ([archived as a gist](https://gist.github.com/bballdave025/b2f7a190907146151696eed394079a64))), specifically with `grep`.
Any of `sed`, `awk`, `IFS` with `read`, etc. are welcome, too. I'm sure someone has a way to solve this problem with `Perl` (dermis and feline can be divorced by manifold methods), and I'd be glad to get that one, too.
Basically, I am absolutely okay with other solutions to the task of getting a count of each identifier string. However, if it's true that there's no way of stepping back the engine with a Basic Regular Expression engine (that's truly regular), I want to know. I've thought of zero-width matches, lookaheads, and look-behinds, but I don't know how these play out in POSIX Basic Regular Expressions or in mathematically/grammatically regular language parsers.
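For the counting task itself, one approach that never needs overlapping matches is to put each underscore-separated token on its own line and keep only the lines that are exactly a vocabulary word. The following is a minimal sketch, not a definitive answer: the function name `count_labels` is made up for illustration, and it assumes bare filenames (no directory part) arrive on standard input, one per line.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads bare filenames on stdin.
count_labels() {
    # 1. Drop the final dot and extension.
    # 2. Put every underscore-separated token on its own line.
    # 3. Keep only lines that are exactly one vocabulary word (-x),
    #    so no delimiter has to be matched, and none can be consumed.
    sed 's/\.[^.]*$//' |
    tr '_' '\n' |
    grep -x \
        -e black -e white -e grey -e red -e orange -e yellow \
        -e green -e blue -e reddish-purple -e teal -e grey-brown \
        -e indigo -e violet |
    LC_ALL=C sort | uniq -c
}

# Possible usage with the toy dataset:
#   find . -type f -iname '*.jpg' | awk -F'/' '{print $NF}' | count_labels
```

Because every token is tested as a whole line, `grey` cannot match inside `grey-brown`. Like the original pattern, this still counts a vocabulary word that happens to appear in the original part of a filename.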
One thing I realize I wasn't taking into account
The point of the rules (see note \[1\]) was to let the regex rely on the fact that we only pick up the label part of the classified filename: we allow only one of a finite set of strings, preceded by an underscore and followed by either an underscore or a dot, with the dot only happening before the file extension. (I guess we can't be absolutely certain, as the original, pre-labeled filename could have one of the labels immediately preceding the dot, something like `a_sunburn_that_is_bright_red.jpg`, but that's something for which I check and correct by adding a specific non-label string before the dot and extension.)
My regex, even if it could get past the delimiter being consumed, would still allow the following example problems.
the_new_red_car_-_1989_red_black_silver.jpg
- would return {red, red, silver} as is,
- {red, red, black, silver} if working without consuming the 2nd '_',
- whereas {red, black, silver} is desired
parrot_at_blue_gold_banquet_-_a_black_tie_affair_yellow_red_green.jpg
- would return {blue, black, yellow, green} as is,
- {blue, gold, black, yellow, red, green} if not consuming the 2nd '_',
- whereas {yellow, red, green} is desired
Extra points for answers and discussions that take that into account. ; )
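The two problem filenames above suggest counting only the unbroken run of labels that sits right before the extension. Here is a hedged sketch of that idea in `awk`, which walks the underscore-separated tokens from the end and stops at the first one that is not in the vocabulary. The function name `trailing_labels` and the way the vocabulary is passed (a space-separated string as the first argument) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name and argument style are assumptions):
#   trailing_labels "space separated vocabulary" < bare-filenames
# Prints one label per line, last label first.
trailing_labels() {
    awk -v vocab="$1" '
        BEGIN {
            # Build a lookup table from the vocabulary string.
            n = split(vocab, words, " ")
            for (i = 1; i <= n; i++) ok[words[i]] = 1
        }
        {
            name = $0
            sub(/\.[^.]*$/, "", name)   # drop the dot and extension
            m = split(name, tok, "_")
            # Walk backwards from the last token and stop at the first
            # token that is not a label. The first token (i == 1) is
            # never counted, because a label must follow an underscore.
            for (i = m; i > 1 && (tok[i] in ok); i--) print tok[i]
        }'
}

# Possible usage:
#   find . -type f -iname "*.jpg" | awk -F/ "{print \$NF}" |
#       trailing_labels "black white grey red orange yellow green blue" |
#       sort | uniq -c
```

This does not help when the original filename itself ends in a label, which is the `a_sunburn_that_is_bright_red.jpg` case; that still needs the separate check described above.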
Research and ideas
There are a few discussions on different StackExchange sites, like [this one](https://web.archive.org/web/20230925145242/https://stackoverflow.com/questions/63821591/how-to-split-a-string-by-underscore-and-extract-an-element-as-a-variable-in-bash) , [that one](https://web.archive.org/web/20250602171231/https://stackoverflow.com/questions/49784912/regex-of-underscore-delimited-string) , [another one](https://web.archive.org/web/20250602171046/https://unix.stackexchange.com/questions/267677/add-quotes-and-new-delimiter-around-space-delimited-words) , but I think the [Unix & Linux discussion here](https://unix.stackexchange.com/a/334551/291375) ([archived](https://web.archive.org/web/20230324152449/https://unix.stackexchange.com/questions/334549/how-do-i-extract-multiple-strings-which-are-comma-delimited-from-a-log-file)) is the best one. I think that one of the approaches in this answer from @terdon ♦ or in the answer with hashes – from @Sobrique – might be useful.
I keep thinking that some version of `^.*\([_][]\)\+[.]jpg$` might be key to the situation, but I haven't been able to put together that solution today. If you know how it can help, you're welcome to give an answer using it; I'm going to wait for a fresh brain tomorrow morning.
Edit: @ilkkachu successfully used this idea.
Why am I doing this? I'm training a CNN to recognize different occurrences (not colors) in pictures of old and often handwritten books. I want to make sure the classes are balanced as I want. Also, I'll compare this with another method that doesn't look at the delimiter to make sure I don't have any problems like a '`_yllow`' (instead of '`_yellow`'), or a '`_whiteorange`' (instead of '`_white_orange`'). Most of the labels are put on through a Java program I've put together, but I've given a little leeway for people to change the filenames themselves in case of multiple labels for one file. Having given that permission, I have the responsibility of verifying legal labeled filenames.
Notes
\[1\] The rules for the identifying/classifying labels are:
The identifiers can be any of a finite set of strings which can contain only characters in `[A-Za-z0-9-]` but not underscores.
The bare filenames (without dot and extension) can consist of any ASCII characters except: 1) non-printable/control characters; 2) spaces or tabs ; OR 3) any of
[!"#$%&/)(\]\[}{*?]
See the next paragraph for the real 3). (Note that this means the bare filenames CAN have an underscore, '`_`', or even several of them.)
Edit: I had my no-no list of characters as is now crossed (struck) out above when @ilkkachu gave the accepted answer. One option of that answer makes excellent use of the '`@`' which was then not in the excluded character group, but which I actually don't allow in my filenames. There are other omissions in the original character group. As I actually want it, the above paragraph should be amended with the following.
3) any of
'[] ~@#$%^&|/)(}{[*?>\
Edit: Now this compiles as a BRE. (This was the simplest and most-readable BRE I could come up with.)
That beautifully crazy character group means that any of { [ , ] , (space) , ~ , @ , # , $ , % , ^ , & , | , \ , / , ) , ( , } , { , [ , * , ? , > , ` } is not allowed, and neither is any tab (\t, ...), nor any non-printing/control characters. Some of these are already standard on the no-no list for filenames on different OSs, but I give _my_ complete set (when I'm in charge of creating the filenames).
\[2\] Here is what I mean by the delimiter being consumed. I'll do my best to illustrate an example with our (Basic) Reg(ular)Ex(pression),
"[_]\(black\|white\|grey\|red\|orange\|yellow\|green\|blue\|"\
"reddish-purple\|teal\|grey-brown\|indigo\|violet\)[_.]"
Here goes.
This missing of some of the color strings is happening because the regex engine consumes the second delimiter.
For example, using `O` to denote part of a miss (non-match) and `X` to denote part of a hit (match), with `YYYYY` denoting a complete match for the whole regex pattern, we get the following behavior.
Engine goes along looking for '_'
engine is here
      |
      v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOO
Matches
[_]
with '_'
engine is at
       |
       v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOX
Matches
\(...\|red\|...\)
with 'red'
engine is at
          |
          v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOXXXX
Matches
[_.]
with '_'
engine is at
           |
           v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOXXXXX
We have a whole match!
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOYYYYY
Given the -o flag, the engine outputs
'_red_'
The 'tr -d [_.]' takes off the surrounding underscores,
and our output line becomes
'red'
The problem now is that the engine cannot go back to
find the '_' before 'orange', or at least it can't do
so using any process I know about from my admittedly
imperfect knowledge of Basic Regular Expressions. As far
as a REGULAR expression engine, using a REGULAR grammar and
a REGULAR language parser knows, the whole universe in which
it's searching now consists of
orange_yellow_green_blue_indigo_violet.jpg
(I don't know if this statement is correct from a mathematical/formal-language point of view, and I'd be interested to know.)
And the process continues as from the first, beginning with Engine goes along looking for '_'
orange_yellow_green_blue_indigo_violet.jpg
OOOOOOXXXXXXXX
Match!
orange_yellow_green_blue_indigo_violet.jpg
OOOOOOYYYYYYYY
Engine spits out '_yellow_' which is 'tr -d [_.]'-ed
Engine cannot go back, so its search universe is now
green_blue_indigo_violet.jpg
and we continue with
green_blue_indigo_violet.jpg
OOOOOXXXXXX
Match!
green_blue_indigo_violet.jpg
OOOOOYYYYYYOOOOOOYYYYYYYY
That last match being on the '.' from [_.]
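The walkthrough above also points at one way to keep this exact pattern and still catch every label: give each label its own pair of delimiters before matching. The sketch below doubles every underscore with `sed`, so the '_' consumed at the end of one match is not needed at the start of the next. The function name `count_labels_doubled` is made up for illustration, and the sketch assumes GNU `grep` (for `-o` and for `\|` in a basic regular expression) and bare filenames on standard input.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads bare filenames on stdin.
# Assumes GNU grep, which accepts -o and \| in basic regular expressions.
count_labels_doubled() {
    # Double every underscore: "_red_orange_" becomes "__red__orange__".
    # After "_red_" is matched and consumed, "_orange_" still has a
    # leading underscore of its own.
    sed 's/_/__/g' |
    grep -o '_\(black\|white\|grey\|red\|orange\|yellow\|green\|blue\|reddish-purple\|teal\|grey-brown\|indigo\|violet\)[_.]' |
    tr -d '_.' |
    LC_ALL=C sort | uniq -c
}

# Possible usage with the toy dataset:
#   find . -type f -iname '*.jpg' | awk -F'/' '{print $NF}' | count_labels_doubled
```

The matching is still done by a plain BRE with no lookahead; only the input is changed. It does not address labels that appear inside the original part of a filename.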
More formally, I want to know if it can be done with a real regular expression, i.e. one which can define a regular language and whose language is a context-free language, cf. Wikipedia's Regex article (archived). I think this is the same as a POSIX regular expression, but I'm not sure.
Refs. [A] (archived), [B] (archived), [C] (archived).
Dang it, I know there's a missing ending parenthesis up there in the text, somewhere, because I noticed it and went up to fix it. When I got up into the text, I couldn't remember the context of the parenthesis, so it's still there, just mocking me. I found it, and I bolded it! I'll probably take the bold formatting and this note down, soon, but I'm sharing my happiness right now.
Asked by bballdave025
(418 rep)
Jun 3, 2025, 04:56 AM
Last activity: Jul 25, 2025, 03:50 AM