Convenient-sized strings for translation with minimal markup #29

aitap · 2024-09-05T16:15:15Z

The gettext project has a number of recommendations for what the translatable strings should look like to make it most convenient to translate them. In particular, it is recommended to split text at paragraphs and minimise "unusual markup".

While I don't think we are going to entirely avoid Rd markup inside translatable strings, I think it's possible to minimise it. I would like to suggest the following approach:

walk the Rd tree in search of lists containing text (objects with attr(., 'Rd_tag') == 'TEXT')
remember the place of the list in the tree
deparse the plain text elements
deparse all their neighbours that are Rd markup directives containing only plain text (for example, objects with attr(., 'Rd_tag') == '\\emph' whose contents are all TEXT)
replace the remaining neighbours with placeholders and remember their place in the tree
combine the block of text together, canonicalise whitespace, split by paragraph
store in a .pot file for translation
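
A minimal sketch of the extraction steps above, not the actual Rd2gettext code. To stay self-contained it runs on a hand-built tree that only mimics the shape of tools::parse_Rd() output (nested lists and strings carrying an Rd_tag attribute). The helpers node(), txt(), tag_of(), is_simple() and extract_strings() are hypothetical names, and paragraphs cut from the same list simply share one "pos" here.

```r
# Hypothetical constructors for a hand-built stand-in of a parsed Rd tree.
node <- function(tag, ...) structure(list(...), Rd_tag = tag)
txt <- function(s) structure(s, Rd_tag = "TEXT")

tag_of <- function(x) {
  tag <- attr(x, "Rd_tag")
  if (is.null(tag)) "" else tag
}

# TRUE for markup such as \emph{...} whose children are all plain TEXT.
is_simple <- function(x) {
  is.list(x) && length(x) > 0 && all(vapply(x, tag_of, "") == "TEXT")
}

extract_strings <- function(x, pos = integer()) {
  if (!is.list(x)) return(list())
  tags <- vapply(x, tag_of, "")
  inlined <- logical(length(x))
  out <- list()
  if (any(tags == "TEXT")) {
    parts <- character(length(x))
    for (i in seq_along(x)) {
      if (tags[i] == "TEXT") {
        # Plain text goes into the translatable string as is.
        parts[i] <- as.character(x[[i]])
        inlined[i] <- TRUE
      } else if (is_simple(x[[i]])) {
        # Simple markup is deparsed back into Rd source.
        parts[i] <- paste0(tags[i], "{", paste(unlist(x[[i]]), collapse = ""), "}")
        inlined[i] <- TRUE
      } else {
        # Everything else becomes a placeholder naming the child index.
        parts[i] <- sprintf("<%d>", i)
      }
    }
    # Split on blank lines (paragraphs), then canonicalise whitespace.
    paras <- strsplit(paste(parts, collapse = ""), "\n[ \t]*\n")[[1]]
    paras <- trimws(gsub("[ \t\n]+", " ", paras))
    out <- lapply(paras[nzchar(paras)], structure, pos = pos)
  }
  # Visit the children that were not inlined; they may hold text of their own.
  for (i in which(!inlined)) out <- c(out, extract_strings(x[[i]], c(pos, i)))
  out
}

rd <- node("Rd",
  node("\\title", txt("Functions for Rd translation")),
  node("\\description",
    txt("Extract strings from a "),
    node("\\emph", txt("parsed")),
    txt(" tree,\n  or pass them to "),
    node("\\code", node("\\link", txt("gettext"))),
    txt(".\n\nSecond paragraph.\n")))

strings <- extract_strings(rd)
```

Here strings[[2]] holds "Extract strings from a \emph{parsed} tree, or pass them to <4>." with pos 2, and the contents of the placeholder are extracted separately with pos 2 4 1. Writing the result to a .pot file is left out of the sketch.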

When rendering a translated help file, perform the same process in reverse:

split the translations by Rd child placeholders
parse the resulting Rd fragments and graft the child elements between them
graft the resulting lists into the original Rd object
return it for rendering
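
The reverse direction could look roughly like the sketch below. It is a simplification under stated assumptions: placeholders have the form <i> and refer to child i of the original list, and the translated fragments between them are kept as plain TEXT rather than being re-parsed as Rd. The helpers node(), txt() and graft() are hypothetical names, and the translated string is only an illustration.

```r
# Hypothetical constructors for a hand-built stand-in of a parsed Rd list.
node <- function(tag, ...) structure(list(...), Rd_tag = tag)
txt <- function(s) structure(s, Rd_tag = "TEXT")

# Rebuild the children of one Rd list from a translated string.
# A placeholder <i> in `translated` refers to children[[i]].
graft <- function(translated, children) {
  m <- gregexpr("<[0-9]+>", translated)
  placeholders <- regmatches(translated, m)[[1]]
  fragments <- regmatches(translated, m, invert = TRUE)[[1]]
  out <- list()
  for (k in seq_along(fragments)) {
    # Translated text between placeholders becomes a TEXT element.
    if (nzchar(fragments[k])) out <- c(out, list(txt(fragments[k])))
    # The original child is grafted back where its placeholder stood.
    if (k <= length(placeholders)) {
      i <- as.integer(gsub("[<>]", "", placeholders[k]))
      out <- c(out, list(children[[i]]))
    }
  }
  attr(out, "Rd_tag") <- attr(children, "Rd_tag")
  out
}

original <- node("\\description",
  txt("Strings loaded from "),
  node("\\code", node("\\link", txt("Rd_db"))),
  txt(" or produced by "),
  node("\\code", node("\\link", txt("parse_Rd"))),
  txt("."))

# The translator is free to reorder the placeholders.
translated <- "Cadenas producidas por <4> o cargadas desde <2>."
result <- graft(translated, original)
```

Because the placeholders carry the child index, the translation can change the word order without losing track of which subtree belongs where.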

The resulting translatable strings look very manageable:

tools::Rd_db('Rd2gettext')$'Rd_extract_strings.Rd' |>
 Rd2gettext::Rd_extract_strings() |>
 head(3)
# [[1]]
# [1] "Functions for Rd translation"
# attr(,"pos")
# [1] 1
# 
# [[2]]
# [1] "Extract translatable strings from a parsed <3> tree or translate it by giving the
# extracted strings to <6>."
# attr(,"pos")
# [1] 6
# 
# [[3]]
# [1] "A pre-parsed Rd tree loaded from <3> or otherwise produced by <6>."
# attr(,"pos")
# [1] 8 3 2

If you're interested in an approach like this, I can try to integrate it into rhelpi18n. My original use case is help(whittaker2, albatross), which is very unwieldy to translate as a flattened representation.

eliocamp · 2024-11-05T02:59:47Z

Trying to remove/minimise formatting would be great, but how would this work in practice?

For example, I wouldn't really know how to translate "Extract translatable strings from a parsed <3> tree or translate it by giving the extracted strings to <6>." From context <3> and <6> are probably nouns, but I don't have any information on their number or grammatical gender.

What happens if the formatted strings need to be translated? Is it possible that <3> is a noun or phrase that needs translation?

BTW, for some reason if I run your example I don't get the same output:

tools::Rd_db('Rd2gettext')$'Rd_extract_strings.Rd' |> 
  Rd2gettext::Rd_extract_strings() |>
  head(3)
#> [[1]]
#> [1] "Functions for Rd translation"
#> attr(,"pos")
#> [1] 1
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)
#> 
#> [[2]]
#> [1] "Extract translatable strings from a parsed \\code{Rd} tree or translate it by giving the extracted strings to \\code{\\link[base]{gettext}}."
#> attr(,"pos")
#> [1] 6
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)
#> 
#> [[3]]
#> [1] "A pre-parsed Rd tree loaded from \\code{tools::\\link[tools]{Rd_db}} or otherwise produced by \\code{tools::\\link[tools]{parse_Rd}}."
#> attr(,"pos")
#> [1] 8 3 2
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)

Although I do get the placeholders if I pass the mean() documentation:

tools::Rd_db('base')$'mean' |>
  Rd2gettext::Rd_extract_strings() |>
  _[[3]]
#> [1] "an <2> object. Currently there are methods for numeric/logical vectors and \\link[=Dates]{date}, \\link{date-time} and \\link{time interval} objects. Complex vectors are allowed for \\code{trim = 0}, only."
#> attr(,"pos")
#> [1] 8 3 2
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)

aitap · 2024-11-05T18:18:20Z

Thank you for giving this a try! You're absolutely right: I've translated slightly more than half of the data.table help and put some of the markup back in because it contained the required context. A better rule for including Rd markup instead of a placeholder might be "let markup in if the immediate child of the tag is a plain TEXT/RCODE/VERB value". This includes most \code{}, \sQuote{}/\dQuote{}, \strong{}/\emph{} blocks, but misses \dots and \R (which have no translation or internal content, but are still important for context).
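
One possible reading of that rule, as a sketch rather than the package's actual test: a tag is kept as markup when it has at least one child and every immediate child is a leaf tagged TEXT, RCODE or VERB. The names node(), leaf() and keep_inline() are hypothetical, and the nodes are hand-built stand-ins for parsed Rd objects.

```r
# Hypothetical constructors for hand-built stand-ins of parsed Rd nodes.
node <- function(tag, ...) structure(list(...), Rd_tag = tag)
leaf <- function(tag, s) structure(s, Rd_tag = tag)

# TRUE when the markup can stay in the translatable string.
keep_inline <- function(x) {
  # Empty tags such as \dots or \R have no children and fail the rule.
  if (!is.list(x) || length(x) == 0L) return(FALSE)
  child_tags <- vapply(x, function(el) {
    tag <- attr(el, "Rd_tag")
    # Nested markup (a list) never counts as a plain value.
    if (is.list(el) || is.null(tag)) "" else tag
  }, "")
  all(child_tags %in% c("TEXT", "RCODE", "VERB"))
}

simple_code <- node("\\code", leaf("RCODE", "trim = 0"))
linked_code <- node("\\code", node("\\link", leaf("TEXT", "Rd_db")))
dots <- node("\\dots")

keep_inline(simple_code)  # kept as markup
keep_inline(linked_code)  # replaced by a placeholder
keep_inline(dots)         # also a placeholder, despite being useful context
```

Handling \dots and \R would need a separate allow-list of empty tags on top of this predicate.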

> What happens if the formatted strings need to be translated? Is it possible that <3> is a noun or phrase that needs translation?

The core of the suggested approach is to walk the parse tree recursively. If <3> is a block with its own potentially translatable plain text inside, the algorithm visits it and extracts its contents into separate translatable strings. With no splitting, some of the strings extracted from data.table get absolutely enormous and stop fitting into my Poeditor window.
