ack : get the 10th (or bigger nth) matching/capturing group
I think I might have just searched wrong, but I didn't find any answer. If there's a duplicate, please just let me know, and I can take this down.
Problem Background
I'm using `ack` ([link](https://beyondgrep.com/)), which has Perl 5 under the hood, to get n-grams - especially higher-order n-grams. I can get up to 9-grams using the syntax I know (basically up to `$9`), but I haven't been able to get the 10-grams. Using `$10` just gives me `$1` with a `0` after it. Things like `$(10)` and `${10}` did not solve the problem. I'm _NOT_ interested in a solution using a language-modelling toolkit; I want to use `ack`.
One dataset I'm using is the complete works of Mark Twain (`wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt`).
I've parsed things clean (see the _Parsing Note_ at the end of the post) and saved the parsed result as `TWAIN_Mark_complete_parsed.txt`.
I've been fine getting from 2-grams, with the code and partial results for that being
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) +(?=(\S+) +)' \
--output '$1 $2' | \
sort | uniq -c | \
sort -rn > Twain_2grams.txt
## time info not shown
$ head -n 2 Twain_2grams.txt
18176 of the
13288 in the
all the way up to 9-grams, with
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
--output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
17 to mrs jane clemens and mrs moffett in st
17 mrs jane clemens and mrs moffett in st louis
(N.B. I meta-program the `ack` commands, rather than just typing every single one.)
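The meta-programming itself is not shown in this post. As a minimal sketch of how such a pattern could be generated, assuming a hypothetical helper named `ngram_pattern` and the same pattern shape as the 9-gram command above (one capture before the lookahead, the rest inside it):

```shell
# Hypothetical helper: print an ack/Perl pattern for n-grams (n >= 2),
# in the same shape as the 9-gram pattern used above.
ngram_pattern() {
  n=$1
  pat='(\S+) (?=(\S+)'        # first word, then the second word inside the lookahead
  i=3
  while [ "$i" -le "$n" ]; do
    pat="$pat +(\\S+)"        # one more captured word per extra n
    i=$((i + 1))
  done
  printf '%s)\n' "$pat"       # close the lookahead
}
```

Something like `ack "$(ngram_pattern 9)" --output '$1 $2 $3 $4 $5 $6 $7 $8 $9'` would then reproduce the 9-gram command; the `--output` string would need to be generated in a similar loop.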
The Problem / What I've Tried
My first try with 10-grams, as well as the result, was
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
--output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
sort | uniq -c | sort -rn > Twain_10grams.txt
$ head -n 2 Twain_10grams.txt
17 to mrs jane clemens and mrs moffett in st to0
17 mrs jane clemens and mrs moffett in st louis mrs0
To better see what's happening,
[image: diff -u]
Expected/Desired Output
Note that there _is_ a statistical (_very_ non-zero and finite) possibility of the real output being different from the one shown here. The top two results for 9-grams were not distinct sequences of words. Other possible parts of a more-common 10-gram might be found by looking at the top 10 most frequent 9-grams - using `head` instead of `head -n 2`. Even so, I'm fairly certain that not even this would guarantee that we have the two most frequent 10-grams. I hope, however, that I'm making it clear enough what I'm wanting to accomplish.
17 to mrs jane clemens and mrs moffett in st louis
3 mrs jane clemens and mrs moffett in st louis honolulu
**Edit** I've already found another 10-gram that changes the expected output from the simple model I used before. This is possibly still not the actual output:
17 to mrs jane clemens and mrs moffett in st louis
7 happiness in his home had been wounded and bruised almost
That would be for the `head -n 2` that I've been using to show what kind of results I get.
I don't want to get it by the same process I'm going to use here.
$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
17 to mrs jane clemens and mrs moffett in st louis
$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
3 mrs jane clemens and mrs moffett in st louis honolulu
2 mrs jane clemens and mrs moffett in st louis san
2 mrs jane clemens and mrs moffett in st louis no
2 mrs jane clemens and mrs moffett in st louis 224
1 mrs jane clemens and mrs moffett in st louis wash
1 mrs jane clemens and mrs moffett in st louis wailuku
1 mrs jane clemens and mrs moffett in st louis virginia
1 mrs jane clemens and mrs moffett in st louis the
1 mrs jane clemens and mrs moffett in st louis sept
1 mrs jane clemens and mrs moffett in st louis on
1 mrs jane clemens and mrs moffett in st louis hartford
1 mrs jane clemens and mrs moffett in st louis carson
**Edit** The code used to find the newer second-place frequency was
$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
6 shelley's happiness in his home had been wounded and bruised
1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
1 home had been wounded and bruised almost to death thirdly
1 home had been wounded and bruised almost to death secondly
1 home had been wounded and bruised almost to death it
1 home had been wounded and bruised almost to death fourthly
1 home had been wounded and bruised almost to death first
1 home had been wounded and bruised almost to death fifthly
1 home had been wounded and bruised almost to death and
Edit from Comment
@Inian made a great [comment](https://unix.stackexchange.com/questions/593467/ack-get-the-10th-or-bigger-nth-matching-capturing-group#comment1107135_593467):
> This is documented in the release notes - [github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md](https://github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md) - You're now restricted to the following variables: `$1` thru `$9`, `$_`, `$.`, `$&`, `` $` ``, `$'` and `$+`

For [future people](https://xkcd.com/979/), I'm putting a [version, archived today](https://web.archive.org/web/20200617164916/https://github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md#ack-3s---output-allows-fewer-special-variables), of the `RELEASE-NOTES`.
The `man` page for `ack` does have the lines
> $1 through $9
> The subpattern from the corresponding set of capturing parentheses.
> If your pattern is "(.+) and (.+)", and the string is "this and that',
> then $1 is "this" and $2 is "that".
but I was hoping there was a way to get higher numbers. With the info from the `RELEASE-NOTES`, that hope seems mostly gone.
*However*, I still wonder if anyone has a work-around or hack, whether using `ack` or any of the more 'standard' *NIX-type terminal tools. My preference, in order, would be `perl`, `grep`, `awk`, `sed`. If there's something similar to `ack` (i.e. just command-line parsing, _NOT_ an NLP-toolkit-based solution), I'm interested in that, too.
I think it might be better to pose this as a new question. If you answer here, great. If I end up posting a new question, I will put the link here: [for now, this is just a link to this same question](https://unix.stackexchange.com/q/593467/291375) .
Parsing Note
To get my corpus ready for n-gram analysis, here was my parsing.
tr [:upper:] [:lower:] TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:
Yes, that could all be on one line (and without the trailing `&& :`), but this makes for easier reading as well as explanation of why I'm doing what I'm doing.
System Details
$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name: Microsoft Windows 10 Enterprise
Version: 10.0.17134 N/A Build 17134
Manufacturer: Microsoft Corporation
Configuration: Member Workstation
Build Type: Multiprocessor Free
Asked by bballdave025
(418 rep)
Jun 17, 2020, 03:30 PM
Last activity: Jun 18, 2020, 07:29 AM