Convenient-sized strings for translation with minimal markup #29

Open
aitap opened this issue Sep 5, 2024 · 2 comments

Comments

@aitap
Contributor

aitap commented Sep 5, 2024

The gettext project has a number of recommendations for what the translatable strings should look like to make it most convenient to translate them. In particular, it is recommended to split text at paragraphs and minimise "unusual markup".

While I don't think we are going to entirely avoid Rd markup inside translatable strings, I think it's possible to keep it to a minimum. I would like to suggest the following approach:

  • walk the Rd tree in search of lists containing text (objects with attr(., 'Rd_tag') == 'TEXT')
  • remember the place of the list in the tree
  • deparse the plain text elements
  • deparse all their neighbours that are Rd markup directives whose contents are all plain TEXT (for example, objects with attr(., 'Rd_tag') == '\\emph')
  • replace the remaining neighbours with placeholders and remember their place in the tree
  • combine the block of text together, canonicalise whitespace, split by paragraph
  • store in a .pot file for translation
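The deparsing steps in the list above can be sketched for a single list as follows. This is only an illustration under stated assumptions, not the Rd2gettext implementation: leaf(), node(), has_tag() and deparse_children() are hypothetical helpers, the tree is built by hand to mimic the Rd_tag attributes that tools::parse_Rd() sets, and walking the tree and writing the .pot file are left out.

```r
# Hypothetical constructors that mimic how tools::parse_Rd() tags its output.
leaf <- function(x, tag = "TEXT") structure(x, Rd_tag = tag)
node <- function(tag, ...) structure(list(...), Rd_tag = tag)
has_tag <- function(x, tag) identical(attr(x, "Rd_tag"), tag)

# Deparse the children of one list: keep TEXT as is, inline markup whose
# contents are all TEXT, and replace everything else with a <index> placeholder.
deparse_children <- function(parent) {
  parts <- vapply(seq_along(parent), function(i) {
    child <- parent[[i]]
    if (has_tag(child, "TEXT")) return(as.character(child))
    simple <- is.list(child) && length(child) > 0 &&
      all(vapply(child, has_tag, logical(1), tag = "TEXT"))
    if (simple) {
      paste0(attr(child, "Rd_tag"), "{", paste(unlist(child), collapse = ""), "}")
    } else {
      sprintf("<%d>", i)
    }
  }, character(1))
  text <- paste(parts, collapse = "")
  # Split on blank lines first, then collapse the remaining whitespace runs.
  paragraphs <- strsplit(text, "\n[ \t]*\n")[[1]]
  trimws(gsub("[[:space:]]+", " ", paragraphs))
}

description <- node("\\description",
  leaf("Extract strings from a parsed "),
  node("\\code", leaf("Rd", "RCODE")),   # RCODE, not TEXT: becomes <2>
  leaf(" tree.\n\nThen "),
  node("\\emph", leaf("translate")),     # all TEXT: kept inline
  leaf(" them.")
)
strings <- deparse_children(description)
print(strings)
```

The placeholder number is the child's index within its parent, so the original element can be looked up again when the translation is grafted back.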

When rendering a translated help file, perform the same process in reverse:

  • split the translations by Rd child placeholders
  • parse the resulting Rd fragments and graft the child elements between them
  • graft the resulting lists into the original Rd object
  • return it for rendering
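A minimal sketch of the reverse direction, assuming placeholders of the form <n> where n is the child's index in the original list. split_by_placeholders() and graft() are hypothetical names, and the sketch simplifies one step: the text between placeholders is stored as plain TEXT rather than being parsed as an Rd fragment, so inline markup in the translation would not be restored here.

```r
# Split a translated string into the text around placeholders and the
# indices of the original children the placeholders refer to.
split_by_placeholders <- function(translated) {
  m <- gregexpr("<[0-9]+>", translated)
  placeholders <- regmatches(translated, m)[[1]]
  list(
    fragments = regmatches(translated, m, invert = TRUE)[[1]],
    child_index = as.integer(gsub("[<>]", "", placeholders))
  )
}

# Rebuild a list of children: translated text interleaved with the
# untouched original elements that the placeholders stood for.
graft <- function(translated, original) {
  parts <- split_by_placeholders(translated)
  out <- list()
  for (k in seq_along(parts$fragments)) {
    if (nzchar(parts$fragments[k])) {
      out <- c(out, list(structure(parts$fragments[k], Rd_tag = "TEXT")))
    }
    if (k <= length(parts$child_index)) {
      out <- c(out, list(original[[parts$child_index[k]]]))
    }
  }
  attr(out, "Rd_tag") <- attr(original, "Rd_tag")
  out
}

# Hand-built stand-in for a parsed \description section.
original <- structure(list(
  structure("Extract strings from a parsed ", Rd_tag = "TEXT"),
  structure(list(structure("Rd", Rd_tag = "RCODE")), Rd_tag = "\\code"),
  structure(" tree.", Rd_tag = "TEXT")
), Rd_tag = "\\description")

translated <- "Zeichenketten aus einem geparsten <2>-Baum extrahieren."
result <- graft(translated, original)
```

Because the placeholder carries the index, a translation is free to move <2> anywhere in the sentence and the right element is still grafted in.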

The resulting translatable strings look very manageable:

tools::Rd_db('Rd2gettext')$'Rd_extract_strings.Rd' |>
 Rd2gettext::Rd_extract_strings() |>
 head(3)
# [[1]]
# [1] "Functions for Rd translation"
# attr(,"pos")
# [1] 1
# 
# [[2]]
# [1] "Extract translatable strings from a parsed <3> tree or translate it by giving the
# extracted strings to <6>."
# attr(,"pos")
# [1] 6
# 
# [[3]]
# [1] "A pre-parsed Rd tree loaded from <3> or otherwise produced by <6>."
# attr(,"pos")
# [1] 8 3 2

If you're interested in an approach like this, I can try to integrate it into rhelpi18n. My original use case is help(whittaker2, albatross), which is very unwieldy to translate as a flattened representation.

@eliocamp
Owner

eliocamp commented Nov 5, 2024

Trying to remove/minimise formatting would be great, but how would this work in practice?

For example, I wouldn't really know how to translate "Extract translatable strings from a parsed <3> tree or translate it by giving the extracted strings to <6>." From context <3> and <6> are probably nouns, but I don't have any information on their number or grammatical gender.

What happens if the formatted strings need to be translated? Is it possible that <3> is a noun or phrase that needs translation?

BTW, for some reason if I run your example I don't get the same output:

tools::Rd_db('Rd2gettext')$'Rd_extract_strings.Rd' |> 
  Rd2gettext::Rd_extract_strings() |>
  head(3)
#> [[1]]
#> [1] "Functions for Rd translation"
#> attr(,"pos")
#> [1] 1
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)
#> 
#> [[2]]
#> [1] "Extract translatable strings from a parsed \\code{Rd} tree or translate it by giving the extracted strings to \\code{\\link[base]{gettext}}."
#> attr(,"pos")
#> [1] 6
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)
#> 
#> [[3]]
#> [1] "A pre-parsed Rd tree loaded from \\code{tools::\\link[tools]{Rd_db}} or otherwise produced by \\code{tools::\\link[tools]{parse_Rd}}."
#> attr(,"pos")
#> [1] 8 3 2
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)

Although I do get the placeholders if I pass the mean() documentation:

tools::Rd_db('base')$'mean' |>
  Rd2gettext::Rd_extract_strings() |>
  _[[3]]
#> [1] "an <2> object. Currently there are methods for numeric/logical vectors and \\link[=Dates]{date}, \\link{date-time} and \\link{time interval} objects. Complex vectors are allowed for \\code{trim = 0}, only."
#> attr(,"pos")
#> [1] 8 3 2
#> attr(,"subpos")
#> [1] 1
#> attr(,"extra")
#> named character(0)

@aitap
Contributor Author

aitap commented Nov 5, 2024

Thank you for giving this a try! You're absolutely right: I've translated slightly more than half of the data.table help and put some of the markup back in because it carried context the translator needs. A better rule for including Rd markup instead of a placeholder might be "let markup in if the immediate child of the tag is a plain TEXT/RCODE/VERB value". This covers most \code{}, \sQuote{}/\dQuote{} and \strong{}/\emph{} blocks, but misses \dots and \R (which have no translation or internal content, but are still important for context).
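Read literally, that rule could be written as a predicate like the one below. This is a sketch of one possible reading, not the code in Rd2gettext: keep_inline() is a hypothetical name, the nodes are built by hand, and \dots is modelled as an empty tagged list, which is an assumption about how the parser represents it.

```r
# TRUE when every immediate child of a tagged list is a plain leaf value
# tagged TEXT, RCODE or VERB; such markup would be kept in the string.
keep_inline <- function(node) {
  leaf_tags <- c("TEXT", "RCODE", "VERB")
  is.list(node) && length(node) > 0 &&
    all(vapply(node, function(child) {
      !is.list(child) && isTRUE(attr(child, "Rd_tag") %in% leaf_tags)
    }, logical(1)))
}

code_node <- structure(list(structure("trim = 0", Rd_tag = "RCODE")),
                       Rd_tag = "\\code")
emph_node <- structure(list(structure("only", Rd_tag = "TEXT")),
                       Rd_tag = "\\emph")
# Markup nested inside markup: the immediate child is itself a list.
nested_node <- structure(list(emph_node), Rd_tag = "\\strong")
# Assumed representation of \dots: a tagged list with no contents.
dots_node <- structure(list(), Rd_tag = "\\dots")

keep_inline(code_node)    # TRUE
keep_inline(emph_node)    # TRUE
keep_inline(nested_node)  # FALSE
keep_inline(dots_node)    # FALSE: the case the rule misses
```

Catching \dots and \R would need an extra allow-list of content-free tags on top of this check.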

> What happens if the formatted strings need to be translated? Is it possible that <3> is a noun or phrase that needs translation?

The core of the suggested approach is to walk the parse tree recursively. If <3> is a block with its own potentially translatable plain text inside, the algorithm visits it and extracts its contents into separate translatable strings. With no splitting, some of the strings extracted from data.table get absolutely enormous and stop fitting into my Poeditor window.
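To illustrate the recursion, here is a rough sketch in which a block that was replaced by a placeholder is itself visited and yields its own string, each tagged with its path in the tree. extract_strings() here is a hypothetical stand-in written for this comment, not the function exported by Rd2gettext, and the tree is built by hand rather than parsed.

```r
leaf <- function(x, tag = "TEXT") structure(x, Rd_tag = tag)
node <- function(tag, ...) structure(list(...), Rd_tag = tag)

# Visit every list in the tree. A list with direct TEXT children yields one
# string (non-TEXT children become <index> placeholders) and every non-TEXT
# child is then visited in turn, so nested text ends up in separate strings.
extract_strings <- function(x, pos = integer(0)) {
  if (!is.list(x)) return(list())
  is_text <- vapply(x, function(ch) identical(attr(ch, "Rd_tag"), "TEXT"),
                    logical(1))
  out <- list()
  if (any(is_text)) {
    parts <- character(length(x))
    parts[is_text] <- unlist(x[is_text])
    parts[!is_text] <- sprintf("<%d>", which(!is_text))
    out <- list(structure(paste(parts, collapse = ""), pos = pos))
  }
  for (i in which(!is_text)) {
    out <- c(out, extract_strings(x[[i]], c(pos, i)))
  }
  out
}

rd <- list(
  node("\\details",
    leaf("Supported inputs: "),
    node("\\itemize",
      node("\\item"),
      leaf(" numeric vectors")
    )
  )
)
strings <- extract_strings(rd)
print(strings)
```

The "pos" attribute is a path that can be used directly for recursive indexing, for example rd[[c(1, 2)]] returns the \itemize block that the second string came from.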
