Make Translate extension compatible with Parsoid
Closed, Resolved (Public)

Description

Current status

Initial requirements gathering, research, and further discussion led to four proposals, which are listed below. Proposal 4 has been implemented in Parsoid and was deployed to production at the end of December 2021. We are currently fixing bugs and working through issues in Parsoid's support.

Page translation

Docs: https://www.mediawiki.org/wiki/Help:Extension:Translate/Page_translation_administration

Requirements

Support translation of wiki pages

  • Pages are living documents, so changes must be tracked
  • Each document has one source language (the source) and any number of translations
  • Each translated version is its own page with a separate history
  • The source is annotated to mark:
    • translatable and non-translatable parts
    • the units into which the translatable parts are further divided, each individually translatable
    • non-translatable holes (variables) within units

Translators want:

  • Small but meaningful units with minimal amount of mark-up

Authors want:

  • Control of what can be translated
  • Minimal degradation of editing experience, e.g. minimal amount of new mark-up to understand

Current architecture

Translate uses the ParserBeforeInternalParse hook to mangle the wikitext of a translatable source page. For example:

<languages/>
<translate>
== Heading ==

You have <tvar|1>999</> bugs.
</translate>

This is mangled as follows (note the whitespace handling) before the parser parses it:

<languages/>
== Heading ==

You have 999 bugs.

<languages/> is a normal extension tag that just outputs some HTML and will be easy to convert.

Translation pages have no such markup. They are generated by Translate.
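The mangling step described above can be approximated in a few lines. This is only an illustrative Python sketch, not Translate's actual PHP code: the function name `mangle` and the regular expressions are assumptions, and the whitespace rule (dropping the newline next to each <translate> tag) is inferred from the single example shown here.

```python
import re

# Matches both the old variable syntax (<tvar|1>...</>) and the newer
# XML-like one (<tvar name="1">...</tvar>); the group captures the value.
TVAR = re.compile(r'<tvar(?:\|[^>]*|\s+name="[^"]*")>(.*?)</(?:tvar)?>', re.DOTALL)


def mangle(wikitext: str) -> str:
    """Hypothetical stand-in for the mangling step; not Translate's real code."""
    # Drop <translate> tags together with the newline directly after an
    # opening tag or directly before a closing tag (assumed whitespace rule).
    text = re.sub(r"<translate>\n?", "", wikitext)
    text = re.sub(r"\n?</translate>", "", text)
    # Replace each translation variable with its plain value.
    return TVAR.sub(r"\1", text)


source = (
    "<languages/>\n"
    "<translate>\n"
    "== Heading ==\n"
    "\n"
    "You have <tvar|1>999</> bugs.\n"
    "</translate>"
)
print(mangle(source))
```

Run on the example page, the sketch prints the mangled wikitext shown above. Translation unit markers such as <!--T:1--> are left untouched by this sketch.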

Proposal 1: concept of preprocessing

Parsoid would have a new kind of hook for “preprocess” (name up for discussion) that would be run for a whole page of a wikitext. Translate would register such a hook and mangle the wikitext before Parsoid starts processing it.

Anticipated issues

  • May break some of Parsoid's assumptions and would probably break html2wt.

Pros and cons:

  • Have absolute control over parsing.
  • Requires minimal changes to our code.
  • No need to worry about balanced DOM.
  • Does not harm wikitext editing
  • Visual editing would become even worse or impossible
  • Does not address the underlying architecture issues going forward

Proposal 2: <translatablepage> wrapper

For translatable pages, either implicitly or explicitly wrap the whole page under another tag, such as <translatablepage>. This tag would do the current mangling: basically removing <translate> tags and converting variable syntax to the actual value.

<translatablepage> would be type 3 without postprocessing.

Pros and cons:

  • Have absolute control over parsing.
  • Requires minimal changes to our code.
  • Does not harm wikitext editing, except a bit for the wrapper tag.
  • No need to worry about balanced DOM
  • Visual editing could get even a bit worse (but maybe with some effort it could get better, just allow editing the whole contents as wikitext instead of the weird mix it currently has)
  • Does not address the underlying architecture issues going forward

Proposal 3: <translate> is an extension tag

We would register type 3 <translate> tags with Parsoid and have them do parseWT( mangle( $input ) ).

Anticipated issues:

  • Block-vs-inline rendering based on content may be difficult to implement.
  • There may be a lot of "unbalanced DOM" type of mark-up usage on existing pages.

Pros and cons:

  • Enables better VE support in the future
  • Likely requires the most effort to implement
  • Likely requires a lot of effort to support migration
  • May require introducing new mark-up to better support cases where unbalanced mark-up is used currently

Proposal 4: <translate> is an annotation

See T261181#6476451. In short, Parsoid treats translation tags as transparent annotations and represents them as such in Parsoid HTML. The new Parsoid markup spec is documented at https://www.mediawiki.org/wiki/Specs/HTML/2.4.0#Annotation_tags and enables editing of translatable content in VE.
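To make "transparent annotation" concrete, the sketch below reads annotation boundaries back out of an HTML string. It assumes that the boundaries appear as <meta> markers whose typeof is "mw:Annotation/translate" (start) and "mw:Annotation/translate/End" (end); the linked spec is the authority on the exact shape. The sample HTML is hand-written for illustration and omits attributes such as data-mw, and the class name `AnnotationMarkers` is hypothetical.

```python
from html.parser import HTMLParser

PREFIX = "mw:Annotation/"
END_SUFFIX = "/End"


class AnnotationMarkers(HTMLParser):
    """Collects (kind, annotation name) pairs from <meta> range markers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        typeof = dict(attrs).get("typeof") or ""
        if not typeof.startswith(PREFIX):
            return
        name = typeof[len(PREFIX):]
        if name.endswith(END_SUFFIX):
            self.events.append(("end", name[: -len(END_SUFFIX)]))
        else:
            self.events.append(("start", name))


# Hand-written sample (an assumption, not real Parsoid output).
sample = (
    '<meta typeof="mw:Annotation/translate"/>'
    "<p>You have 999 bugs.</p>"
    '<meta typeof="mw:Annotation/translate/End"/>'
)
collector = AnnotationMarkers()
collector.feed(sample)
print(collector.events)
```

The point of the design is that the paragraph between the two markers is ordinary content: the markers only delimit a range and do not wrap or transform it.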

Plural parsing

Validation of MediaWiki plural syntax is part of the message validation framework used on translatewiki.net.

Requirements

Validate plural syntax in messages

  • Should match the MediaWiki core parsing behavior.

Developers want:

  • Simple way to parse (expand) plural syntax to validate the number of forms.

Current architecture

Translate installs a custom parser function that overrides the normal plural function to gather the parser function arguments.

Proposals

Not investigated yet.

Open questions

1. Block vs. inline rendering

Block vs. inline rendering is currently proposed to be a setting per extension tag. How would <translate> keep working while supporting e.g. the following cases:

Some stuff <translate>inline text</translate>

<translate>
A paragraph goes here.

Another here.
</translate>

The current logic is: IF tag contents contains a newline THEN block context ELSE inline context
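The rule just stated is small enough to write down directly. The following Python sketch applies it to the two cases above; the function names and the regular expression used to pull out tag contents are illustrative assumptions, not Translate's code.

```python
import re

TRANSLATE = re.compile(r"<translate>(.*?)</translate>", re.DOTALL)


def rendering_context(contents: str) -> str:
    """Block context if the tag contents contain a newline, else inline."""
    return "block" if "\n" in contents else "inline"


def contexts(wikitext: str) -> list:
    """Rendering context of every <translate> tag, in document order."""
    return [rendering_context(body) for body in TRANSLATE.findall(wikitext)]


page = (
    "Some stuff <translate>inline text</translate>\n"
    "\n"
    "<translate>\n"
    "A paragraph goes here.\n"
    "\n"
    "Another here.\n"
    "</translate>"
)
print(contexts(page))
```

Because the decision depends only on the contents of each tag, the same tag name yields different contexts on the same page, which is why a single per-tag setting does not fit.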

Having to specify explicitly whether the context is inline or block would be, IMHO, too much overhead for translation admins.

sourceToDom allows specifying a wrapper tag. Can this tag have attributes? Can it be different for different calls of the tag?

2. Balanced DOM

How to find out what would break currently?

Can a linter be written? In a way that doesn't affect current workflow?

What exactly are the rules of a balanced DOM? For example, could <translate> tags span over multiple sections? Can they stop or start in the middle of a section when spanning multiple sections?

What would the migration process look like?
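On the linter question, one possible starting point is a stack-based check for tags that close out of order. This is a rough Python sketch under stated assumptions: it only looks at XML-like tags, knows nothing about wikitext constructs such as templates, headings, or the old </> variable syntax, and would report false positives for void HTML elements such as <br>. The function name is hypothetical.

```python
import re

# Opening, closing, and self-closing XML-like tags. Comments such as
# <!--T:1--> do not match because a tag name must start with a letter.
TAG = re.compile(r"<(/?)([A-Za-z]\w*)[^<>]*?(/?)>")


def find_misnested_tags(markup: str) -> list:
    """Return (closing tag, innermost open tag) pairs that do not match."""
    stack, problems = [], []
    for closing, name, self_closing in TAG.findall(markup):
        if self_closing:
            continue
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append((name, stack[-1] if stack else None))
            if name in stack:
                # Recover by discarding everything opened since that tag.
                del stack[len(stack) - 1 - stack[::-1].index(name):]
    return problems


unit = (
    '<translate><!--T:1--> This translation unit contains some '
    '<tvar name="begin"><span style="color:red"></tvar>styled text'
    '<tvar name="end"></span></tvar>.</translate>'
)
print(find_misnested_tags(unit))
```

Run over existing translatable pages without saving anything, a check like this could give a first estimate of how much markup would break, without touching the current workflow.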

3. Is <translate> actually a new type of an extension tag, or an extension tag at all?

The tech talk mostly focuses on transforming content. The main point of <translate> is to just annotate parts of a page in a machine readable way, and any effects on parsing are to be considered unwanted implementation details.

4. Parsing plural syntax

Does Parsoid provide a way to do this:

$input = 'Some translation here with {{PLURAL:$1|a house|$1 houses}}';
$output = $parsoid->doSomething( $input );

$output being something like:

[
  [ 'a house', '$1 houses' ]
]

Must support multiple plurals in one string. Nice to support nested plurals. Must handle {} inside plural options.
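For comparison, here is a minimal Python sketch of the kind of extraction being asked for. It is not a Parsoid API and does not claim to reproduce core's preprocessor: it only tracks nested {{ }} pairs, treats lone braces as literal text, and ignores other constructs such as [[link|text]] pipes. The name `plural_forms` is hypothetical.

```python
import re

PLURAL_START = re.compile(r"\{\{PLURAL:", re.IGNORECASE)


def plural_forms(text: str) -> list:
    """Return one list of forms per PLURAL call, outer calls before nested ones."""
    results = []
    pos = 0
    while True:
        match = PLURAL_START.search(text, pos)
        if match is None:
            return results
        args, current, depth = [], [], 0
        i = match.end()
        closed = False
        while i < len(text):
            if text.startswith("{{", i):
                depth += 1  # entering a nested {{ ... }} construct
                current.append("{{")
                i += 2
            elif text.startswith("}}", i):
                i += 2
                if depth == 0:  # this closes the PLURAL call itself
                    args.append("".join(current))
                    closed = True
                    break
                depth -= 1
                current.append("}}")
            elif text[i] == "|" and depth == 0:
                args.append("".join(current))  # argument separator
                current = []
                i += 1
            else:
                current.append(text[i])  # literal text, including lone { or }
                i += 1
        if not closed:
            return results  # unterminated call: stop rather than guess
        forms = args[1:]  # args[0] is the count expression, e.g. $1
        results.append(forms)
        for form in forms:
            results.extend(plural_forms(form))  # pick up nested plurals
        pos = i


print(plural_forms("Some translation here with {{PLURAL:$1|a house|$1 houses}}"))
```

For validation, the number of forms per call is then simply the length of each inner list.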

Event Timeline

Maybe it is too late, but I have an idea:

Current process:

  1. Page is created in source language (SL).
  2. Translation admins add <translate> tags.
  3. Translate extension creates Page/SL removing translate tags.

What about slightly reversing the process?

  1. Page is created in source language.
  2. Translate extension duplicates Page into Page/Translate.
  3. Translation admins add <translate> tags in Page/Translate.

This way, source language remains unchanged and editable using VisualEditor.

But this would require an efficient diff tool able to keep Page/Translate in sync with the source, to minimize translation admins’ workload each time the source page changes.

@ssastry T274881: Change translation variable (tvar) syntax is now done pending QA. Does that unblock your work? Is there anything else we should do on our side?

While migrating to the new XML-ish translation variable syntax, I realized that even this one may screw up things. Consider the following markup:

<translate><!--T:1--> Here is a {{<tvar name="1">ll|Page title</tvar>|template-powered auto-localized link}}.</translate>

This is perfectly valid from Translate’s POV, and often seen in the wild. However, I’m not sure if it can be properly tokenized—the tvar contains the template title, a pipe, and the first parameter, while the curly brackets and the second parameter are outside of it.

Or even such things occur:

<translate><!--T:1--> This translation unit contains some <tvar name="begin"><span style="color:red"></tvar>styled text<tvar name="end"></span></tvar>.</translate>

I’m pretty sure it’s theoretically impossible to build a parse tree from this, since the tvar should be parent and child of the span at the same time.

Of course we can ignore these (rather common, at least for the first one) edge cases and say Translate became more compatible with Parsoid thanks to the new syntax, but it won’t be fully compatible ever without seriously limiting how complex situations can be handled in translation syntax.


@Nikerabbit I also noticed while migrating that MediaWiki-extensions-CodeMirror highlights <translate> tags in green and (for example) the {{PAGELANGUAGE}} magic word in brown, but it doesn’t highlight <tvar> tags, and the {{TRANSLATIONLANGUAGE}} magic word is highlighted in purple (as if it were a template, not a magic word-like thing). This is a bit annoying (especially as typos don’t stand out), but I can live with it. However, I wonder whether this difference has any implications for how Parsoid can handle them.

> While migrating to the new XML-ish translation variable syntax, I realized that even this one may screw up things

Good points. I am not sure if I can prevent that inside the Translate extension, but Parsoid may be able to do it in the future?

Does CodeMirror do intelligent introspecting of the Parser to identify parser tags and magic words, or is it a manual list somewhere?

<tvar> tags are not registered as parser functions. With the old syntax that was not even possible. With the new syntax it is possible and should be a no-op, as Translate sees the text before the parser sees these tags.

Similarly, {{TRANSLATIONLANGUAGE}} is not a real magic word, but we could register it as one and have it just produce "This magic word does not work outside translation units" if the parser sees it (or, if that causes trouble, just return the page content language instead).

> @ssastry T274881: Change translation variable (tvar) syntax is now done pending QA. Does that unblock your work? Is there anything else we should do on our side?

Thanks! We'll pick this up again in a bit. We are discussing solution strategies for T275082 right now. We'll pick something for translate that doesn't foreclose future improvements on the Parsoid side.

Cleaning up my assignments. Currently nothing for me to do here.

Change 702996 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] WIP - Support translate extension in parsoid

https://gerrit.wikimedia.org/r/702996

Change 724709 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Add annotation tags to SiteConfig.php

https://gerrit.wikimedia.org/r/724709

Change 724709 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add annotation tags to SiteConfig.php

https://gerrit.wikimedia.org/r/724709

Change 725297 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/VisualEditor@master] Handle annotation tags as nodes rather than metaitems.

https://gerrit.wikimedia.org/r/725297

Change 726027 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/vendor@master] Bump parsoid to 0.15.0-a2

https://gerrit.wikimedia.org/r/726027

Change 726027 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.15.0-a2

https://gerrit.wikimedia.org/r/726027

Change 726574 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Refactor WTSUtils::origSrcValidInEditedContent to pass SerializerState

https://gerrit.wikimedia.org/r/726574

Change 725297 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Handle annotation tags as nodes rather than metaitems.

https://gerrit.wikimedia.org/r/725297

Test wiki created on Patch Demo by IHurbainPalatin (WMF) using patch(es) linked to this task:

https://patchdemo.wmflabs.org/wikis/a5af6673ba/w/

On https://patchdemo.wmflabs.org/wikis/a5af6673ba/wiki/Test_2?veaction=edit translation unit markers (<!--T:n-->) are visible as comments. These are handled specially by Translate (it identifies translation units based on these comments), so maybe they should be presented specially.

@Tacsipacsi The main goal was to have the TU markers visible *somehow* so that editors would be aware of them while editing in VE as well (and so that they would be handled as carefully as required). The current handling has the benefit of not adding new markup/interface to what people are used to, and of using the existing code base as is.
I agree that having them presented differently would be nice (although I'd need someone more versed in UI/UX to chime in on how 😄 ), but I'm afraid it may add significant complexity to the current handling of comments in VE. In particular, I don't know of any other extension using comments in a special way handled by VE, so "allowing extensions to handle comments differently" would be a prerequisite, and I believe this starts stretching the scope of the current Phab task.

Test wiki created on Patch Demo by IHurbainPalatin (WMF) using patch(es) linked to this task:

https://patchdemo.wmflabs.org/wikis/a030c77438/w/

Change 735339 had a related patch set uploaded (by RhinosF1; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/VisualEditor@REL1_37] Handle annotation tags as nodes rather than metaitems.

https://gerrit.wikimedia.org/r/735339

Change 735339 abandoned by Bartosz Dziewoński:

[mediawiki/extensions/VisualEditor@REL1_37] Handle annotation tags as nodes rather than metaitems.

Reason:

Looks like we're going to disable the broken feature in Translate on 1.37 (see https://gerrit.wikimedia.org/r/c/735958) rather than backporting it.

https://gerrit.wikimedia.org/r/735339

Change 702996 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add support for annotation tags in Parsoid

https://gerrit.wikimedia.org/r/702996

Test wiki on Patch Demo by IHurbainPalatin (WMF) using patch(es) linked to this task was deleted:

https://patchdemo.wmflabs.org/wikis/a5af6673ba/w/

Test wiki on Patch Demo by IHurbainPalatin (WMF) using patch(es) linked to this task was deleted:

https://patchdemo.wmflabs.org/wikis/a030c77438/w/

T55974 should probably be marked as a parent task or a duplicate of this task.

Change 757937 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Translate@master] WIP: started poking at <languages/> tag for Parsoid

https://gerrit.wikimedia.org/r/757937

Change 779863 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Translate@master] Update the warning message for VisualEditor

https://gerrit.wikimedia.org/r/779863

Change 779863 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Update the warning message for VisualEditor

https://gerrit.wikimedia.org/r/779863

This has been deployed for a long while by now, and it doesn't seem to be breaking the world. There are still a few bugs, but they can/will be investigated in separate tickets.