
Parsing msgcat merge conflicts into "nice looking" console errors using bash and assorted CLI tools

I was asked by @terdon to post this follow-up to a more specific issue I had and resolved [over here](https://superuser.com/a/1870371/563952). Apparently my choice of tools has not been a particularly smart one, so I shall describe the use case in detail and ask:

**What assortment of CLI tools provides the most concise-yet-maintainable solution to emit msgcat merge conflicts into the log of IDEs and CI tools, such that developers unfamiliar with gettext or, say, plain git have a chance to act on them?**

... Quite the mouthful, is it not? As is so often the case in... "ripe"... software projects, this is not a simple problem, even though _it really should be_.

The issue: I have to roll my own conflict detection and logging, because...

- ["No one" seems to know about](https://stackoverflow.com/a/79352922/3434465) the GettextResourceManager, so in my current project we use Mono.Unix.Catalog, which only supports loading a single catalog per application. This project also has many old glade-2 components, which themselves do not support multiple catalogs (there is no way to tell them which catalog to look in; they always go for the one currently bound). So, in this project, all .po files of all dependencies are merged together with msgcat into a single large catalog.
- msgcat, as of the Debian 12 gettext version 0.21, does not seem to provide any means to express merge conflicts via exit code or stdout/stderr output. The conflicts are written into the resulting .po file and look similar to git merge conflicts (or svn merge conflicts). Loading an .mo file generated from such a conflicted .po leads either to the conflict markers making their way into the GUI (with --use-fuzzy) or to the translation missing completely, but silently.
- msguniq allegedly is the gettext-native tool to perform duplicate checks, but it works only on a single .po file.
Invoking it on the output of msgcat (with conflicting lines present), however, produces no output, so I have no clue how it is supposed to be used:

      msguniq --repeated

- msgcomm can find msgids used in multiple .po files (nice...), but does not at all care whether they conflict or not (... - but useless):

      msgcomm --more-than=1 file1.po file2.po [...]

Previously I had worked around this issue by passing --use-first to msgcat, but this has the remarkable downside that it is utterly silent (by design): Having to deduplicate a msgid used in different sources is a perfectly likely occurrence (as natural languages are imprecise), but we can only notice that it happened in a manual test - or in production. Furthermore, as this can happen for different messages _in different languages_, no single person is capable of testing for it.

So instead of --use-first I want to actually emit an error when a merge conflict happens, such that the developer has a realistic chance to check and correct it (either by copying one text over the other or by deduplicating the msgids).

To that end, several constraints appear:

- The conflict marker msgcat emits has the form "#-#-#-#-# source file name (Project-Id-Version) if set in the source file #-#-#-#-#\n", e.g. "#-#-#-#-# pt.po (My Library Catalogue) #-#-#-#-#\n", or "#-#-#-#-# pt.po #-#-#-#-#\n" for a source file 'pt.po' with a missing or empty Project-Id-Version.
- The metadata entry msgid "" _always_ conflicts if any source file has a meaningful Project-Id-Version set. (More generally, if any metadata field differs; the X-Poedit-* family of extensions provides other likely candidates.) Reporting it is not useful, so we want to skip the metadata entry and only fail if there are _other_ conflicts.
- .po files must be named after the translation they provide, e.g. de.po, pt.po. (A rather annoying convention in this project, not the fault of gettext et al.)
- gettext is not exactly mainstream (case in point: the Mono and glade implementations, which have caused this mess), so the emitted error message should be approachable for developers who have never heard of it, e.g. by looking similar to a compiler error and otherwise as clean as possible. (This is a bit subjective, but #-#-#-#-# does not look clean - it is very "visually busy", especially in the middle of a long logfile containing many other tool outputs.)
- bash and many of the _nominally common_ CLI tools are _also_ not mainstream _in practice_, so we have to somehow keep the script at once concise and uncomplex... I will gladly admit that I myself am not Linux-savvy enough to achieve both at the same time. (E.g. "everyone knows" grep, but sed is already quite arcane for many people and awk - _any_ awk - risks reflexive techno-paralysis. ... Note: This may not be the case _for you_ but it is certainly my experience with _any and all_ people I have cooperated with, including decades-long Linux natives. _Yes, this really is the case!!!_)

Here are two minimal test files which, when merged, will produce a conflict:

- library/pt.po
msgid ""
msgstr "Project-Id-Version: My Library Catalogue\n"
"Language: pt_BR\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

msgid "Done"
msgstr "Feito"

msgid "Does not conflict."
msgstr ""
- application/pt.po
msgid ""
msgstr "Project-Id-Version: My Application Catalogue\n"
"Language: pt_BR\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

msgid "Done"
msgstr "Pronto"

msgid "Does not conflict."
msgstr ""
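
For orientation, here is a minimal sketch of what a conflicted merge of the two files above might look like and how plain grep finds the markers. The catalog in the heredoc is hand-written from the marker format described earlier (with an abbreviated metadata entry); it is an assumption, not captured msgcat output, and the real file may differ in spacing and header details.

```shell
#!/bin/bash
# Hand-written approximation of a conflicted merge result (assumption, not real msgcat output).
merged=$(mktemp)
trap 'rm -f "$merged"' EXIT
cat > "$merged" <<'EOF'
msgid ""
msgstr ""
"#-#-#-#-#  pt.po (My Application Catalogue)  #-#-#-#-#\n"
"#-#-#-#-#  pt.po (My Library Catalogue)  #-#-#-#-#\n"
"Language: pt_BR\n"

#, fuzzy
msgid "Done"
msgstr ""
"#-#-#-#-#  pt.po (My Application Catalogue)  #-#-#-#-#\n"
"Pronto\n"
"#-#-#-#-#  pt.po (My Library Catalogue)  #-#-#-#-#\n"
"Feito"

msgid "Does not conflict."
msgstr ""
EOF

# -F: treat the marker as a fixed string, -n: prefix each hit with its line number.
# Note that the hits inside the metadata entry (before the first empty line) show up as well.
grep -n -F '#-#-#-#-#' "$merged"
```

grep alone already locates every marker, but it cannot tell the expected metadata conflict from a real one, which is why the metadata entry has to be skipped separately.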
For these files my current solution emits:
$ bash postbuild.sh
Found localisation: application/pt.po
Found localisation: library/pt.po
Creating ./pt/LC_MESSAGES/Domain.mo
Merge conflicts found in './pt/LC_MESSAGES/Domain.po':
22 from library 'My Application Catalogue'
24 from library 'My Library Catalogue'
The last three lines are the conflict report. It certainly has shortcomings:

- no source file
- no visual delimiter between conflicts

but I deemed my script already too bloated to try for more.

----------

My current solution: Per application, there is one file postbuild.sh that collects the relevant .po files into a SOURCES array and then calls the merge script:
#!/bin/bash
SOURCES=(application/pt.po library/pt.po)
source compile_mo_files.sh; compile_mo_files SOURCES . true
I call those in MSBuild's PostBuildEvents, so all applications generate their localisation when built.

The merge script compile_mo_files.sh:
#!/bin/bash

# recursively seeks through the passed SOURCE_DIRECTORIES for .po files, echoes each file found,
# msgcats them all together and msgfmts a single 'Domain.mo' file from them
compile_mo_files()
{
  # parameters https://mywiki.wooledge.org/BashFAQ/048#The_problem_with_bash.27s_name_references 
  if [[ $1 != SOURCE_DIRECTORIES ]]; then
    local -n SOURCE_DIRECTORIES=$1 # array of directories containing .po files
  fi

  if [[ $2 != OUTPUT_DIRECTORY ]]; then
    local OUTPUT_DIRECTORY=$2 # path wherein the individual language folders be created
  fi
  if [ -z "${OUTPUT_DIRECTORY}" ]; then
    OUTPUT_DIRECTORY='.'
  fi

  if [[ $3 != FAIL_ON_GETTEXT_ISSUES ]]; then
    # if set to any truthy value,
    # - call msgcat such that duplicate msgids create merge conflicts,
    # - check msgcat warnings
    # and abort, if any are found afterwards
    local FAIL_ON_GETTEXT_ISSUES=$3
  fi

  # collect .po files
  shopt -s lastpipe
  declare -A POS
  for SOURCE in "${SOURCE_DIRECTORIES[@]}"; do
    PREFIX=$(dirname "$(dirname "$SOURCE")")/
    find "$SOURCE" -name "*.po" |
    {
      while read -r PO; do
        LOCALISATION=$OUTPUT_DIRECTORY/$(basename "$PO" .po)/LC_MESSAGES
        MO=$LOCALISATION/Domain.mo
        echo Found localisation: "${PO#"$PREFIX"}"
        mkdir -p "$LOCALISATION"
        POS["$MO"]="${POS[$MO]}"" $PO"
      done
    }
  done
  # merge .po files, generate .mo files
  if [ ! "$FAIL_ON_GETTEXT_ISSUES" ]; then
    # Workaround: Proper solution would be to create one .mo per .po and load them under different domains,
    #             but Mono.Unix.Catalog does not support querying multiple domains.
    #             But when we merge these, we get false-positive "fuzzy"s, e.g. for msgid "".
    # FIXME:
    #   Implement or find own/proper wrapper for intl, switch to one .mo per language per project!
    #   https://www.gnu.org/software/gettext/manual/html_node/C_0023.html#C_0023-1  would likely be best.
    #   Took the liberty of adding it to 'builddeps/gettext/gettext-runtime/intl-csharp' but glade does not like it,
    #   so we are probably stuck with this workaround for pre-Avalonia GUI projects.
    #   Putting my tinkering into 'proper-intl' branches for reference.
    # -Zsar 2024-10-17
    MSGCAT_WORKAROUND='--use-first'
  fi
  EXIT_STATUS=0
  for MO in "${!POS[@]}"; do
    echo Creating "$MO"
    PO=${MO/.mo/.po}
    # FIXME: msgcat stdout is lost!
    # Sadly msgcat pretty much never emits a status code that is non-zero, so we have to do this by hand. -Zsar 2025-01-06
    WARNINGS=$({ msgcat ${MSGCAT_WORKAROUND:+"$MSGCAT_WORKAROUND"} --no-wrap -o "$PO" ${POS[$MO]} 1>/dev/null; } 2>&1)
    msgfmt --use-fuzzy -o "$MO" "$PO" # --use-fuzzy so we can pre-translate pt et al. with DeepL and still mark it as "please verify" for external translators
    # check for msgcat warnings
    if [ "$WARNINGS" ]; then
      printf 'msgcat warnings should be fixed to avoid surprising behaviour:\n%s\n' "$WARNINGS"
      EXIT_STATUS=2 # continue to emit _all_ issues into the log; 3 is worse than 2 so no harm overwriting it later -Zsar 2025-01-06
    fi
    # check for merge conflicts
    if [ "$FAIL_ON_GETTEXT_ISSUES" ]; then
      # Note: awk (or more generally POSIX Extended Regular Expressions) _does not_ support non-capturing groups
      #       and will do nonsense if you use them! see https://stackoverflow.com/a/57059535 
      #       That is why we are printing capture group 2 instead of making the first group non-capturing.
      # -Zsar 2025-01-06
      CONFLICT_MARKER='#-#-#-#-#\s+\S+\s+(\(([^()]+)\)\s+)?#-#-#-#-#'
      METADATA_LINE_NUMBER=$(sed -n '/^$/{=;q}' "$PO")
      CONFLICTS=$(gawk "NR > $METADATA_LINE_NUMBER && /$CONFLICT_MARKER/ && match(\$0, /$CONFLICT_MARKER/, library_name) { print NR, \"from library \047\"library_name[2] \"\047\" }" "$PO")
      if [ "$CONFLICTS" ]; then
        printf "Merge conflicts found in '%s':\n%s\n" "$PO" "$CONFLICTS"
        EXIT_STATUS=3
      fi
    fi
  done
  shopt -u lastpipe
  exit $EXIT_STATUS
}
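
One possible direction for the report itself, as a sketch under stated assumptions rather than a finished solution: bash's own `[[ ... =~ ... ]]` with `BASH_REMATCH` can extract the source file and Project-Id-Version from a marker without sed or gawk, and a `file:line: error: ...` layout resembles compiler output. `report_conflicts` is a hypothetical helper name, the sample catalog is hand-written (not real msgcat output), and, like the `sed` call above, the sketch assumes the metadata entry ends at the first empty line.

```bash
#!/bin/bash
# Sketch only: report_conflicts is a hypothetical helper, not part of gettext.
# Prints one 'file:line: error: ...' line per conflict marker outside the metadata entry
# and returns 1 if any were found, 0 otherwise.
report_conflicts()
{
  local po=$1 line source_file project origin
  local number=0 conflicts=0 in_metadata=true msgid='(unknown)'
  # marker layout: #-#-#-#-# <source file> [(<Project-Id-Version>)] #-#-#-#-#
  local marker='#-#-#-#-#[[:space:]]+([^[:space:]]+)[[:space:]]+(\(([^()]+)\)[[:space:]]+)?#-#-#-#-#'

  while IFS= read -r line || [[ -n $line ]]; do
    number=$((number + 1))
    if $in_metadata; then
      # the metadata entry ends at the first empty line; its conflicts are expected, so skip it
      if [[ -z $line ]]; then in_metadata=false; fi
      continue
    fi
    if [[ $line == 'msgid '* ]]; then
      msgid=${line#msgid } # remember which message the following markers belong to
    elif [[ $line =~ $marker ]]; then
      source_file=${BASH_REMATCH[1]}
      project=${BASH_REMATCH[3]} # empty if the source file sets no Project-Id-Version
      origin=$source_file
      if [[ -n $project ]]; then origin+=" '$project'"; fi
      printf '%s:%d: error: conflicting translation of %s from %s\n' "$po" "$number" "$msgid" "$origin"
      conflicts=$((conflicts + 1))
    fi
  done < "$po"
  return $((conflicts > 0))
}

# Hand-written stand-in for a conflicted merge result (assumed layout, not real msgcat output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
msgid ""
msgstr ""
"#-#-#-#-#  pt.po (My Application Catalogue)  #-#-#-#-#\n"
"Language: pt_BR\n"

#, fuzzy
msgid "Done"
msgstr ""
"#-#-#-#-#  pt.po (My Application Catalogue)  #-#-#-#-#\n"
"Pronto\n"
"#-#-#-#-#  pt.po (My Library Catalogue)  #-#-#-#-#\n"
"Feito"
EOF
report_conflicts "$sample" || echo "build should fail here"
rm -f "$sample"
```

For the sample, this prints one line per conflicting source of the form `<file>:<line>: error: conflicting translation of "Done" from pt.po 'My Application Catalogue'`, followed by the failure notice. Because of the project's naming convention the source file name is the same for every origin, so the Project-Id-Version remains the distinguishing part. A msgid that spans several lines would be shown only by its first line.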
FWIW: The project may, one hopes, rid itself of glade-2 and thereby of the necessity to merge these .po files to begin with - but _I_ will not be around to see it. Yet that is the reason why I am loath to e.g. challenge the .po naming convention - it would, eventually, simply stop mattering. (E.g. since posting the original question, we at least moved from Debian 10 to 12, but there is _so much more_ to do that it will likely take years. Moving off Mono-compatible .NET 4.7.2 would be next.)
Asked by Zsar (111 rep)
Jan 17, 2025, 06:28 PM
Last activity: Feb 3, 2025, 07:16 PM