Command line method to find repeat-word typos, with line numbers
**Updated**: Clarify line number requirement, some verbosity reductions
From the command line, is there a way to:
* check a file of English text
* to find repeat-word typos,
* along with line numbers where they are found,
in order to help correct them?
## Example 1 ##
Currently, to help finish an article or other piece of English writing, `aspell -c text.txt` is helpful for catching spelling errors. It is not helpful, however, when the error is an unintentional consecutive repetition of a word.
`highlander_typo.txt`:

    There can be only one one.

Running `aspell`:

    $ aspell -c highlander_typo.txt

Probably because `aspell` is a spell-checker, not a grammar-checker, repeat-word typos are beyond its intended scope. As a result, this file passes `aspell`'s check, because nothing is "wrong" in terms of individual word spelling.
The correct sentence is [There can be only one.](https://www.youtube.com/watch?v=sqcLjcSloXs); the second `one` is an unintended repeat-word typo.
## Example 2 ##
A different situation is, for example, `kylie_minogue.txt`:

    La la la

Here the repetition is not a typo, as these words are part of an artist's [song lyrics](http://www.azlyrics.com/lyrics/kylieminogue/cantgetyououtofmyhead.html).
So the solution should not presume and "fix" anything by itself; otherwise it could overwrite intentionally repeated words.
## Example 3: Multi-line ##
`jefferson_typo.txt`:

    He has has refused his Assent to Laws, the most wholesome and necessary
    for the public good.
    He has forbidden his Governors to pass Laws of immediate and
    and pressing importance, unless suspended in their operation till his
    Assent should be be obtained; and when so suspended, he has utterly
    neglected to attend to them.

Modified from [The Declaration of Independence](https://www.gutenberg.org/ebooks/16780)
In the above six lines,
* 1: `He has has refused` should be `He has refused`; the second `has` is a repeat-word typo
* 5: `should be be obtained` should be `should be obtained`; the second `be` is a repeat-word typo
However, did you notice a third repeat-word typo?
* 3: ... immediate and
* 4: and pressing ...
This is also a repeat-word typo: although the two words are on separate lines, they are still part of the same English sentence. The word at the end of one line has been accidentally repeated at the start of the next. This is rather tricky to detect by eye because the repetition sits on opposite sides of a passage of text.
## Intended output ##
* an interactive program with a process similar to `aspell -c`, yet able to detect repeat words, or
* a script or combination of commands able to extract line numbers and the suspected repeat words. This information makes it easier to use an editor such as `vim` to jump to the repeat words and make fixes where appropriate.
Using the multi-line `jefferson_typo.txt` above, the desired output would be something like:

    1: has has
    3: and
    4: and
    5: be be

or:

    1: He [has has] refused his Assent to Laws, the most wholesome and necessary
    3: He has forbidden his Governors to pass Laws of immediate [and]
    4: [and] pressing importance, unless suspended in their operation till his
    5: Assent should [be be] obtained; and when so suspended, he has utterly

I am actually not entirely sure how to display the difficult case of a cross-line repeat word, such as the `and` repetition above, so don't worry if your solution doesn't resemble this exactly.
But I hope that, like the above, it shows:
* the relevant line number from the original input
* some way to draw attention to what is repeated, which is especially helpful if the line of text is quite long
* if the full line is displayed to give context (credit: @Wildcard), then there needs to be a way to render the repeated word or words distinctively. The example shown here marks the repetition by enclosing it within the ASCII characters `[` and `]`. Alternatively, perhaps mimic `grep --color=always` to colorize the line's matches for display in a color terminal
## Other considerations ##
* text should stay as plain text files
* no GUI solutions please, just textual: `ssh -X` X11 forwarding is not reliably available, and I need to edit over ssh
## Unsuccessful attempts ##
To try to find duplicates, `uniq` came to mind, so the plan was to first get repeat-word recognition working on a single line.
In order to use `uniq`, the words on a line first need to be converted to one word per line.

    $ tr ' ' '\n' < highlander_typo.txt
    There
    can
    be
    only
    one
    one.

Unfortunately:

    $ tr ' ' '\n' < highlander_typo.txt | uniq -D

Nothing.
This is because the `-D` option, which normally reveals duplicates, requires the input lines to be exact duplicates. Unfortunately the period `.` at the end of the repeated word `one` defeats this: `one.` just looks like a different line from `one`. I am not sure how to work around arbitrary punctuation marks such as this period, and somehow add them back after `tr` processing.
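One possible workaround for the punctuation problem, sketched under the assumption that punctuation and letter case can simply be discarded for the comparison (nothing is added back afterwards). The function name `repeated_words` is hypothetical; only standard `tr` and `uniq` options are used, with `uniq -d` printing one copy of each consecutively repeated line.

```shell
# Hypothetical helper: read text on stdin, print each consecutively repeated word once.
repeated_words() {
  tr -s '[:space:]' '\n' |          # one word per line, squeezing runs of whitespace
    tr -d '[:punct:]' |             # drop punctuation so "one." compares equal to "one"
    tr '[:upper:]' '[:lower:]' |    # ignore case so "The the" also matches
    uniq -d                         # print one copy of each adjacent duplicate
}

# Example with the single-line text from the question:
printf 'There can be only one one.\n' | repeated_words   # prints: one
```

This only covers the single-line case: converting to one word per line throws away the original line numbers, so it cannot say where in the file the repetition occurred.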
This was unsuccessful. Had it worked, the next step would have been to include the line number, since the input file could have hundreds of lines and it would help to indicate which line of the input file the repeat word was detected on.
This single-line processing would perhaps be part of a parent loop that handles the file line by line, but unfortunately getting past even single-line repeat-word recognition has been problematic.
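As a rough illustration of the line-by-line idea, the sketch below uses `awk` instead of `tr` and `uniq`, since `awk` tracks the current line number in `NR` and can remember the last word of the previous line. It is a minimal sketch, not a definitive solution, and it rests on several assumptions: the text is ASCII, words are compared case-insensitively with punctuation stripped, a blank line ends a passage, and a single input file is given. The function name `find_repeats` is hypothetical.

```shell
# Hypothetical helper: report consecutive repeated words with line numbers.
# Reads the named file, or stdin when no file is given.
find_repeats() {
  awk '
    NF == 0 { prev = ""; next }          # blank line: do not match across paragraphs
    {
      for (i = 1; i <= NF; i++) {
        word = tolower($i)
        gsub(/[^a-z0-9]/, "", word)      # strip punctuation, for comparison only
        if (word == "") continue         # token was pure punctuation
        if (word == prev) {
          if (prev_line == NR)           # both copies are on the same line
            printf "%d: %s %s\n", NR, word, word
          else                           # repeat spans a line break
            printf "%d: %s\n%d: %s\n", prev_line, word, NR, word
        }
        prev = word
        prev_line = NR
      }
    }
  ' "$@"
}

# Small demonstration on two lines adapted from the question:
find_repeats <<'EOF'
pass Laws of immediate and
and pressing importance, should be be obtained
EOF
```

Run as `find_repeats jefferson_typo.txt`, this should print `1: has has`, `3: and`, `4: and` and `5: be be`, each on its own line. It only reports suspects and never edits the file, so intentional repetition such as `La la la` is flagged (once per adjacent pair) but left alone.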
Asked by clarity123
(3589 rep)
Jan 19, 2016, 10:08 PM
Last activity: Jul 28, 2016, 04:58 PM