
ack : get the 10th (or bigger nth) matching/capturing group

4 votes
2 answers
343 views
I think I might have just searched wrong, but I didn't find any answer. If there's a duplicate, please just let me know, and I can take this down.

Problem Background

I'm using ack ([link](https://beyondgrep.com/)), which has Perl 5 under the hood, to get n-grams, especially higher-order n-grams. I can get up to 9-grams using the syntax I know (capture groups up to $9), but I haven't been able to get 10-grams. Using $10 just gives me $1 with a 0 after it, and things like $(10) and ${10} did not solve the problem. I'm _NOT_ interested in a solution using a language-modelling toolkit; I want to use ack.

One dataset I'm using is the complete works of Mark Twain ( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt ). I've parsed things clean (see the _Parsing Note_ at the end of the post) and saved the parsed result as TWAIN_Mark_complete_parsed.txt. Everything works from 2-grams, with the code and partial results for that being
time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) +(?=(\S+) +)' \
    --output '$1 $2' | \
    sort | uniq -c | \
    sort -rn > Twain_2grams.txt
## time info not shown
$ head -n 2 Twain_2grams.txt
  18176 of the
  13288 in the
all the way up to 9-grams, with
time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
    sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
     17 to mrs jane clemens and mrs moffett in st
     17 mrs jane clemens and mrs moffett in st louis
(N.B. I meta-program the ack commands, rather than just typing every single one.)
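
For readers wondering what that meta-programming might look like, here is a minimal sketch. The function name `make_ngram_ack_args` is hypothetical (it is not from my actual scripts); it only builds the pattern and the `--output` string in the shape of the 9-gram command above and does not run ack itself.

```shell
# Hypothetical helper: print the ack pattern (line 1) and the --output
# template (line 2) for an n-gram, following the 9-gram command's shape.
# Only meaningful for 2 <= n <= 9, since the commands above stop at $9.
make_ngram_ack_args() {
    n=$1
    pattern='(\S+) (?=(\S+)'
    output='$1 $2'
    i=3
    while [ "$i" -le "$n" ]; do
        pattern="$pattern +(\\S+)"   # one more word inside the lookahead
        output="$output \$$i"        # and one more group in the output
        i=$((i + 1))
    done
    pattern="$pattern)"              # close the lookahead
    printf '%s\n%s\n' "$pattern" "$output"
}

make_ngram_ack_args 3
# prints:
# (\S+) (?=(\S+) +(\S+))
# $1 $2 $3
```

The two printed lines can then be passed to ack as the pattern and the `--output` argument.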

The Problem / What I've Tried

My first try with 10-grams, as well as the result, was
time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
    sort | uniq -c | sort -rn > Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
     17 to mrs jane clemens and mrs moffett in st to0
     17 mrs jane clemens and mrs moffett in st louis mrs0
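
The trailing `to0` and `mrs0` look like what a substitution that recognizes only a single digit after `$` would produce. The sketch below is an emulation with sed, not ack's actual code; the function name `expand_single_digit` and the `<group N>` placeholder are made up for illustration.

```shell
# Emulation only (NOT ack's source): expand "$" followed by ONE digit 1-9.
# Any further digit is left in place as literal text.
expand_single_digit() {
    sed 's/\$\([1-9]\)/<group \1>/g'
}

printf '%s\n' '$9 $10' | expand_single_digit
# prints: <group 9> <group 1>0
```

Under that assumption, `$10` is read as `$1` followed by a literal `0`, which matches the output above.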
To better see what's happening: [image: diff -u]

Expected/Desired Output

Note that there _is_ a statistical (_very_ non-zero and finite) possibility of the real output being different from the one shown here. The top two results for 9-grams were not distinct sequences of words. Other possible parts of a more-common 10-gram might be found by looking at the top 10 most frequent 9-grams - using head instead of head -n 2. Even so, I'm fairly certain that not even this would guarantee that we have the two most frequent 10-grams. I hope, however, that I'm making it clear enough what I'm wanting to accomplish.
     17 to mrs jane clemens and mrs moffett in st louis
      3 mrs jane clemens and mrs moffett in st louis honolulu
**Edit** I've already found another set that changes the expected output to the following (possibly not the actual output, but one that changes it from the simple model I used before).
17 to mrs jane clemens and mrs moffett in st louis
      7 happiness in his home had been wounded and bruised almost
That would be for the head -n 2 that I've been using to show what kind of results I get. I don't want to get it by the same process I'm going to use here.
$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      3 mrs jane clemens and mrs moffett in st louis honolulu
      2 mrs jane clemens and mrs moffett in st louis san
      2 mrs jane clemens and mrs moffett in st louis no
      2 mrs jane clemens and mrs moffett in st louis 224
      1 mrs jane clemens and mrs moffett in st louis wash
      1 mrs jane clemens and mrs moffett in st louis wailuku
      1 mrs jane clemens and mrs moffett in st louis virginia
      1 mrs jane clemens and mrs moffett in st louis the
      1 mrs jane clemens and mrs moffett in st louis sept
      1 mrs jane clemens and mrs moffett in st louis on
      1 mrs jane clemens and mrs moffett in st louis hartford
      1 mrs jane clemens and mrs moffett in st louis carson
**Edit** The code used to find the newer second-place frequency was
$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
      1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      1 home had been wounded and bruised almost to death thirdly
      1 home had been wounded and bruised almost to death secondly
      1 home had been wounded and bruised almost to death it
      1 home had been wounded and bruised almost to death fourthly
      1 home had been wounded and bruised almost to death first
      1 home had been wounded and bruised almost to death fifthly
      1 home had been wounded and bruised almost to death and
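
The repeated grep pipelines above can be wrapped in a small function. This is only a sketch: the name `next_words` is hypothetical, the sample text is made up, and the phrase is treated as an extended regular expression, so phrases containing regex metacharacters would need escaping.

```shell
# Hypothetical helper: count the words that directly follow a fixed phrase.
# Usage: next_words "phrase" file
next_words() {
    grep -oE "$1 [^ ]+" "$2" | sort | uniq -c | sort -rn
}

# Tiny made-up sample, only to show the call.
sample=$(mktemp)
printf '%s\n' 'a b c a b d a b c' > "$sample"
next_words 'a b' "$sample"
```

On the sample this reports `a b c` twice and `a b d` once, with the usual leading padding from uniq -c.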

Edit from Comment

@Inian made a great [comment](https://unix.stackexchange.com/questions/593467/ack-get-the-10th-or-bigger-nth-matching-capturing-group#comment1107135_593467):

> This is documented in the release notes - [github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md](https://github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md) - You're now restricted to the following variables: $1 thru $9, $_, $., $&, $`, $' and $+

For [future people](https://xkcd.com/979/), I'm putting a [version, archived today](https://web.archive.org/web/20200617164916/https://github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md#ack-3s---output-allows-fewer-special-variables), of the RELEASE-NOTES.

The man page for ack does have the lines

> $1 through $9
> The subpattern from the corresponding set of capturing parentheses.
> If your pattern is "(.+) and (.+)", and the string is "this and that',
> then $1 is "this" and $2 is "that".

but I was hoping there was a way to get higher numbers. With the info from the RELEASE-NOTES, that hope seems mostly gone.

*However*, I still wonder if anyone has a work-around or hack, whether using ack or any of the more 'standard' *NIX-type terminal tools. My preference, in order, would be perl, grep, awk, sed. If there's something similar to ack (i.e. just command-line parsing, _NOT_ an NLP-toolkit-based solution), I'm interested in that, too.

I think it might be better to pose this as a new question. If you answer here, great. If I end up posting a new question, I will put the link here: [for now, this is just a link to this same question](https://unix.stackexchange.com/q/593467/291375).
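
To make the kind of work-around I'm asking about concrete, here is one possible sketch using awk (one of the tools listed above). It is not an ack solution and does not use capture groups at all: it splits each line on whitespace and prints a sliding window of n words, so the `$9` limit never comes up. The function name `ngrams` is hypothetical, and n-grams do not span line breaks.

```shell
# Hypothetical helper: print every n-gram of each input line, one per line.
# Usage: ngrams N < file
ngrams() {
    awk -v n="$1" '{
        # Slide a window of n fields across the line.
        for (i = 1; i + n - 1 <= NF; i++) {
            line = $i
            for (j = i + 1; j < i + n; j++) line = line " " $j
            print line
        }
    }'
}

printf '%s\n' 'a b c d' | ngrams 3
# prints:
# a b c
# b c d
```

For the corpus here, the call would presumably be `ngrams 10 < TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn > Twain_10grams.txt`, though I have not verified its counts against the ack results for n up to 9.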

Parsing Note

To get my corpus ready for n-gram analysis, here was my parsing.
tr [:upper:] [:lower:]  TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:
Yes, that could all be on one line (and without the trailing && :), but this makes for easier reading as well as explanation of why I'm doing what I'm doing.
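
Since the listing above did not survive intact, here is a stand-in sketch of just the two steps that are still visible in it: lowercasing, and collapsing runs of spaces (including tabs) to one space. The function name `normalize_text` is hypothetical, and this is not the full cleaning I actually ran.

```shell
# Hypothetical stand-in for two cleaning steps (NOT the full original parsing):
# 1. lowercase everything, 2. turn tabs into spaces, 3. squeeze repeated spaces.
normalize_text() {
    tr '[:upper:]' '[:lower:]' | tr '\t' ' ' | tr -s ' '
}

printf 'Hello   World\tAgain\n' | normalize_text
# prints: hello world again
```

Quoting the character classes keeps the shell from treating `[:upper:]` as a glob pattern.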

System Details

$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name:                   Microsoft Windows 10 Enterprise
Version:                10.0.17134 N/A Build 17134
Manufacturer:           Microsoft Corporation
Configuration:          Member Workstation
Build Type:             Multiprocessor Free
Asked by bballdave025 (418 rep)
Jun 17, 2020, 03:30 PM
Last activity: Jun 18, 2020, 07:29 AM