Frequency of words in non-English language text: how can I merge singular and plural forms etc.?
5 votes, 3 answers, 766 views
I'm sorting *French* language words in some text files by *frequency*, with a focus on *insight* rather than statistical significance. The challenge is preserving accented characters and handling the elided article forms that precede vowels (l', d') when shaping word tokens for sorting.
The topic of the most frequent words in a file takes many shapes (1 | 2 | 3 | 4). So I put together this function using *GNU* utilities:
compt1 () {
    for i in *.txt; do
        echo "File: $i"
        sed -e 's/ /\
/g'
[...]

1. I cannot provide the source data, but I can provide this file as an example. The words *heure* and *enfant* in it make good test cases. The former appears twice, including once as "l'heure", which helps validate whether the command works. The latter appears in both singular and plural forms (*enfant*/*enfants*), which would benefit from being merged.
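To make the goal concrete, here is a minimal sketch of the token shaping described above. It is not the original function: the name `freq_fr`, the list of elided prefixes (`l'`, `d'`, `qu'`, and so on), and the "drop a final s only if the shorter form also occurs" rule are all assumptions chosen for illustration. It works on bytes, so accented letters pass through untouched, but accented capitals such as É are not lowercased.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not the original compt1 function):
# read text on stdin, print "count word" lines, most frequent first.
freq_fr() {
    # 1. Normalise the typographic apostrophe and turn guillemets into spaces.
    # 2. Put each whitespace-separated token on its own line.
    # 3. Lowercase ASCII letters only (accented capitals are left as-is).
    # 4. Strip punctuation at both ends, then a leading elided form
    #    such as l', d', j', qu' (assumed prefix list).
    # 5. Count tokens; fold "Xs" into "X" only when "X" also occurs.
    sed -e "s/’/'/g" -e 's/«/ /g' -e 's/»/ /g' |
    tr -s '[:space:]' '\n' |
    tr 'A-Z' 'a-z' |
    sed -E -e 's/^[[:punct:]]*//' \
           -e 's/[[:punct:]]*$//' \
           -e "s/^(qu|[cdjlmnst])'//" |
    awk '
        NF { n[$0]++ }                     # skip empty lines, count the rest
        END {
            for (w in n) {
                stem = w
                if (w ~ /s$/) {            # naive plural check
                    cand = substr(w, 1, length(w) - 1)
                    if (cand in n) stem = cand
                }
                m[stem] += n[w]            # merged counts go in a second array
            }
            for (w in m) print m[w], w
        }' |
    sort -k1,1nr -k2,2
}
```

It could be used per file as `freq_fr < "$i" | head -n 20` inside the existing `for i in *.txt` loop. The plural rule is deliberately crude: it misses forms like *cheval*/*chevaux* and can wrongly merge unrelated pairs such as *fil*/*fils*, so a real lemmatizer would be needed for anything beyond a rough overview.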
Asked by user44370
Jul 19, 2014, 01:59 PM
Last activity: Jan 22, 2019, 07:58 PM