Frequency of words in non-English language text: how can I merge singular and plural forms etc.?

5 votes

3 answers

766 views

shell-script text-processing sed portability natural-language

I'm sorting *French* language words in some text files according to *frequency*, with a focus on *insight* rather than statistical significance. The challenge is preserving accented characters and handling the elided article forms in front of vowels (l', d') when shaping word tokens for sorting.

The topic of the most frequent words in a file takes many shapes (1 | 2 | 3 | 4). So I put together this function using *GNU* utilities:

    compt1 () {
    for i in *.txt; do
    	echo "File: $i"
    	sed -e 's/ /\
    /g'

1. I cannot provide source data, but I can provide this file as an example. The words *heure* and *enfant* in the text illustrate the problem. The former appears twice in the text, including once as "l'heure", and helps validate whether the command works. The latter appears in both singular and plural forms (*enfant*/*enfants*) and would benefit from being merged here.
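To make the goal concrete, here is a minimal sketch of the kind of pipeline this is aiming for. It is not the function above. The name `word_freq` is hypothetical, and the plural rule (fold a word ending in "s" into the same word without the "s", but only when that shorter form also occurs in the input) is a crude assumption of mine rather than real French morphology. Only ASCII letters are lowercased, and only ASCII punctuation is used to split, so the bytes of accented characters pass through untouched.

```shell
# Hypothetical helper (not the function from the question):
# prints "count word" lines, most frequent first.
word_freq () {
    cat -- "$@" |
    # Lowercase ASCII letters only; bytes of accented letters are left alone.
    tr 'A-Z' 'a-z' |
    # One token per line: split on whitespace and common ASCII punctuation,
    # keeping the apostrophe and hyphen inside tokens for now.
    tr -s ' \t\r.,;:!?()"' '[\n*]' |
    # Drop the elided articles l' and d' at the start of a token,
    # then drop any empty lines.
    sed -e "s/^[ld]'//" -e '/^$/d' |
    awk '
        { count[$0]++ }
        END {
            for (w in count) {
                stem = substr(w, 1, length(w) - 1)
                # Assumed heuristic: merge "xs" into "x" only if "x" was seen.
                if (w ~ /s$/ && (stem in count))
                    merged[stem] += count[w]
                else
                    merged[w] += count[w]
            }
            for (w in merged) print merged[w], w
        }' |
    # Highest count first; ties broken by byte order of the word.
    LC_ALL=C sort -k1,1nr -k2,2
}
```

For example, `printf '%s\n' "L'heure des enfants. Un enfant attend l'heure." | word_freq` should list `2 enfant` and `2 heure` first, followed by `1 attend`, `1 des` and `1 un`. The heuristic will also merge unrelated pairs such as *a*/*as*, and it misses plurals in *-x* or *-aux*, so it only approximates the merging I am after.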

Asked by user44370

Jul 19, 2014, 01:59 PM
Last activity: Jan 22, 2019, 07:58 PM
