Command line method to find repeat-word typos, with line numbers

bash command-line text-processing awk aspell
**Updated**: Clarify line number requirement, some verbosity reductions

From the command line, is there a way to:

* check a file of English text
* to find repeat-word typos,
* along with line numbers where they are found,

in order to help correct them?

## Example 1 ##

Currently, to help finish an article or other piece of English writing, aspell -c text.txt is helpful for catching spelling errors. However, it does not help when the error is an unintentional consecutive repetition of a word.

highlander_typo.txt:

    There can be only one one.

Running aspell:

    $ aspell -c highlander_typo.txt

This is probably because aspell is a spell-checker, not a grammar-checker, so repeat-word typos are beyond its intended scope. As a result, the file passes aspell's check, because nothing is "wrong" in terms of individual word spelling.

The correct sentence is [There can be only one.](https://www.youtube.com/watch?v=sqcLjcSloXs); the second "one" is an unintended repeat-word typo.

## Example 2 ##

A different situation is, for example, kylie_minogue.txt:

    La la la

Here the repetition is not a typo, as it is part of an artist's [song lyrics](http://www.azlyrics.com/lyrics/kylieminogue/cantgetyououtofmyhead.html).

So the solution should not presume to "fix" anything by itself; otherwise it could overwrite intentionally repeated words.

## Example 3: Multi-line ##

jefferson_typo.txt:

    He has has refused his Assent to Laws, the most wholesome and necessary
    for the public good.
    He has forbidden his Governors to pass Laws of immediate and
    and pressing importance, unless suspended in their operation till his
    Assent should be be obtained; and when so suspended, he has utterly
    neglected to attend to them.

Modified from [The Declaration of Independence](https://www.gutenberg.org/ebooks/16780) 

In the above six lines,

* 1: "He has has refused" should be "He has refused"; the second "has" is a repeat-word typo
* 5: "should be be obtained" should be "should be obtained"; the second "be" is a repeat-word typo

However, did you notice a third repeat-word typo?

* 3: ... immediate and
* 4: and pressing ...

This is also a repeat-word typo: although the two words are on separate lines, they are part of the same English sentence. The word at the end of one line has accidentally been repeated at the start of the next. This is rather tricky to detect by eye, because the two copies sit on opposite sides of the passage.

## Intended output ##

* an interactive program with a process similar to aspell -c yet able to detect repeat-words, or,

* a script or combination of commands able to extract line numbers and the suspected repeat words. This info makes it easier to use an editor such as vim to jump to the repeat words and make fixes where appropriate.

Using the multi-line jefferson_typo.txt above, the desired output would be something like:

    1: has has
    3: and
    4: and
    5: be be

or:

    1: He [has has] refused his Assent to Laws, the most wholesome and necessary
    3: He has forbidden his Governors to pass Laws of immediate [and]
    4: [and] pressing importance, unless suspended in their operation till his
    5: Assent should [be be] obtained; and when so suspended, he has utterly

I am not entirely sure how to display the difficult case of a cross-line repeat-word, such as the "and" repetition above, so don't worry if your solution doesn't resemble this exactly.

But I hope that, like the above, it shows:

* the line number in the original input
* some way to draw attention to what was repeated, which is especially helpful if the line of text is long
* if the full line is displayed to give context (credit: @Wildcard), the repeated word or words need to be rendered distinctively. The example shown here marks the repetition by enclosing it in the ASCII characters [ ]. Alternatively, perhaps mimic grep --color=always to colorize the line's matches for display in a color terminal
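
For the colorized variant in the last bullet, here is a minimal sketch. It assumes GNU grep (back-references with -E and the -w option are not guaranteed by POSIX), and the function name mark_repeats is made up for illustration. It only sees repeats within a single line, so the cross-line "and" is not reported.

```shell
# Recreate the Example 3 sample file.
cat > jefferson_typo.txt <<'EOF'
He has has refused his Assent to Laws, the most wholesome and necessary
for the public good.
He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
Assent should be be obtained; and when so suspended, he has utterly
neglected to attend to them.
EOF

# mark_repeats FILE [WHEN]   (hypothetical helper name)
#   -n  prefix each matching line with its line number
#   -w  only accept matches that start and end on word boundaries,
#       so "this is" is not reported as "is is"
#   -E  extended regex; \1 refers back to the first captured word
#   WHEN is passed to --color (always, never, auto); default: always
mark_repeats() {
    grep -nwE --color="${2:-always}" '([[:alpha:]]+) +\1' "$1"
}

mark_repeats jefferson_typo.txt
```

With color turned off, this should print lines 1 and 5 in full, each prefixed by its line number. The match is case-sensitive as written; adding -i would also catch "La la".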

## Other considerations ##

* text files should stay as plain text
* no GUI solutions please, just textual ones. ssh -X (X11 forwarding) is not reliably available, and I need to edit over ssh

## Unsuccessful attempts ##

To find duplicates, uniq came to mind, so the plan was to first get repeat-word recognition working on a single line.

In order to use uniq, the words on a line first need to be converted to one word per line.

    $ tr ' ' '\n' < highlander_typo.txt
    There
    can
    be
    only
    one
    one.

Unfortunately:

    $ tr ' ' '\n' < highlander_typo.txt | uniq -D

Nothing.

This is because the -D option, which normally prints all duplicate lines, only matches when adjacent lines are exactly identical. Unfortunately, the period . at the end of the second "one" breaks this: "one." just looks like a different line from "one". I am not sure how to work around arbitrary punctuation marks such as this period, or how to add them back after the tr processing.
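
One possible way around the punctuation problem, as a minimal sketch: have tr treat every non-letter as a separator instead of only spaces, so the trailing period never reaches uniq. The function name repeated_words is made up for illustration. This variant does not put the punctuation back and reports no line numbers; it only lists the suspect words.

```shell
# Recreate the Example 1 sample file.
printf 'There can be only one one.\n' > highlander_typo.txt

# repeated_words FILE   (hypothetical helper name)
#   tr -c  complement the set: operate on everything that is NOT a letter
#   tr -s  squeeze runs of the replacement into a single newline
#   uniq -d  print one copy of each line that is repeated on adjacent lines
repeated_words() {
    tr -cs '[:alpha:]' '\n' < "$1" | uniq -d
}

repeated_words highlander_typo.txt
```

Because the whole file becomes one stream of words, a repeat that straddles a line break is caught too. The comparison is case-sensitive, and a word such as "don't" is split at the apostrophe.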

So the tr and uniq -D attempt was unsuccessful. Even if it had worked, there would still need to be a way to include the line number, since the input file could have hundreds of lines and it helps to know which line of the input the repeat-word was detected on.

This single-line processing would perhaps sit inside a loop that goes through the file line by line, but unfortunately even getting single-line repeat-word recognition to work has been problematic.
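
One direction that might carry the line numbers along, sketched here under assumptions rather than offered as a finished solution: let awk walk the words of every line and remember the previous word together with the line it came from. The function name find_repeats is made up for illustration, and stripping everything except a-z is a simplification that suits plain English text only.

```shell
# Recreate the Example 3 sample file.
cat > jefferson_typo.txt <<'EOF'
He has has refused his Assent to Laws, the most wholesome and necessary
for the public good.
He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
Assent should be be obtained; and when so suspended, he has utterly
neglected to attend to them.
EOF

# find_repeats FILE   (hypothetical helper name)
find_repeats() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            word = tolower($i)
            gsub(/[^a-z]/, "", word)       # drop punctuation and digits
            if (word != "" && word == prev) {
                if (prev_line == NR) {
                    # both copies are on the current line
                    print NR ": " word " " word
                } else {
                    # the repeat straddles a line break
                    print prev_line ": " word
                    print NR ": " word
                }
            }
            prev = word                    # remember word and its line
            prev_line = NR
        }
    }' "$1"
}

find_repeats jefferson_typo.txt
```

Since punctuation is discarded before comparing, a legitimate pair across a sentence boundary (for example "... it. It ...") would also be flagged, so the output is a list of suspects to review by hand, not something to fix automatically.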
                        
Asked by clarity123 (3589 rep)
Jan 19, 2016, 10:08 PM
Last activity: Jul 28, 2016, 04:58 PM