How do you handle paste from Word? Could you share some code? :)

wiktor_walc · on Nov 16, 2018

(CKEditor team member speaking) Properly handling pasting from Word really takes a lot of time, and it's hardly possible to build a high-quality solution alone. We will deliver some basic (in our understanding) support for pasting from Office in our most recent editor (https://ckeditor.com/ckeditor-5/) in 2-3 weeks, and it has already taken long weeks of development (see https://github.com/ckeditor/ckeditor5-paste-from-office/issu...), even with our years of experience in rich text editing and with this particular feature.

If pasting from Word is an absolutely critical feature for you, you may want to check older editors on the market, like CKEditor 4 or even... TinyMCE, one of our competitors. These editors have been on the market for 6+ years and have had enough time and people to deal with the crap that MS Word produces: correctly preserving as much formatting as possible, without wasting a lot of your end users' time on recreating the same content again in an online rich text editor.

stronglikedan · on Nov 15, 2018

It seems to somewhat work in the Paste HTML example [0], for which the source is available [1]. I just did a cursory test with some simple formatting, so YMMV, but it may be a good starting point.

[0] https://www.slatejs.org/#/paste-html [1] https://github.com/ianstormtaylor/slate/tree/master/examples...

paperpunk · on Nov 15, 2018

My issue is mainly that Word doesn't really maintain the same structured hierarchy in the XML that HTML would – it's more like a sequential format. The users wanted a way to indicate certain types of content or annotations, and they do so by coloring text in certain ways. In practice I found this very hard to reconcile, since there is a lot of invisible formatting in Word: an element may terminate and start again, with a new invisible element in between. Invisible to the user, but very visible to the parser.

Essentially it's a balance of attempting to remove all the spurious elements (`<o:p>`, or invisible empty formatting, etc.) and then reason about what remains. Much of that involves a lot of walking the tree to inspect neighbouring nodes because them being co-located can indicate something.
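To make that concrete, here is a minimal sketch in Python (standard library only) of the kind of cleanup pass described above. It is not the actual system: the `Node` class, the `clean_word_html` function and the rules themselves (drop every `o:`-prefixed element along with its contents, drop spans with no text, merge neighbouring spans whose attributes are identical) are assumptions chosen for illustration. Real Word markup needs many more rules than this.

```python
import html
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class Node:
    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag                  # None marks a text node
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a very small tree from an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(Node(None, text=data))

def has_text(node):
    if node.tag is None:
        return node.text != ""
    return any(has_text(child) for child in node.children)

def clean(node):
    kept = []
    for child in node.children:
        if child.tag is None:
            kept.append(child)
            continue
        if child.tag.startswith("o:"):
            continue                    # Office elements such as <o:p>
        clean(child)
        if child.tag == "span" and not has_text(child):
            continue                    # invisible empty formatting
        prev = kept[-1] if kept else None
        if (prev is not None and prev.tag == "span" and child.tag == "span"
                and prev.attrs == child.attrs):
            # Same formatting resumed after a dropped element: merge.
            prev.children.extend(child.children)
            continue
        kept.append(child)
    node.children = kept

def serialize(node):
    if node.tag is None:
        return html.escape(node.text, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(
        ' {}="{}"'.format(name, html.escape(value or "", quote=True))
        for name, value in node.attrs.items()
    )
    if node.tag in VOID_TAGS:
        return "<{}{}>".format(node.tag, attrs)
    return "<{0}{1}>{2}</{0}>".format(node.tag, attrs, inner)

def clean_word_html(fragment):
    builder = TreeBuilder()
    builder.feed(fragment)
    builder.close()
    clean(builder.root)
    return serialize(builder.root)
```

Note that the merge only happens once the empty element between two spans has been removed, which is why the order of the steps matters. Whitespace-only spans are kept on purpose, since a lone space between two words is real content.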

Look, you may be recoiling in horror by now – it sounds horrific. Actually what we have is a remarkably stable system, all things considered, but it was built up over time. I think the only approach you can take is to write a large number of unit tests for the schema normaliser, with real MS Word samples and expected outputs, and then really put the system through its paces. Every time you find an example that breaks your model, add a unit test for that snippet, and evolve.
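A rough sketch of what such a sample-driven test table could look like, using Python's `unittest`. Everything here is hypothetical: `normalise` is a tiny regex stand-in for a real schema normaliser, and the entries in `REGRESSION_CASES` are invented placeholders rather than real Word output. In practice each entry would be a snippet captured from an actual paste that once broke the model.

```python
import re
import unittest

def normalise(fragment: str) -> str:
    """Hypothetical stand-in for the real schema normaliser."""
    without_office = re.sub(r"<o:p>.*?</o:p>", "", fragment, flags=re.S)
    return re.sub(r"<span[^>]*></span>", "", without_office)

# Each entry: (name, pasted markup, expected normalised output).
# Append a new entry every time a paste breaks the normaliser.
REGRESSION_CASES = [
    ("office paragraph marker", "<p>Hi<o:p></o:p></p>", "<p>Hi</p>"),
    ("empty span", '<p>A<span lang="EN-GB"></span>B</p>', "<p>AB</p>"),
]

class NormaliserRegressionTest(unittest.TestCase):
    def test_samples(self):
        for name, pasted, expected in REGRESSION_CASES:
            # subTest reports each failing sample separately.
            with self.subTest(name):
                self.assertEqual(normalise(pasted), expected)

def run_regression_suite() -> bool:
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(NormaliserRegressionTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Keeping the cases in a plain table means adding a regression is a one-line change, which lowers the bar for recording every odd snippet you come across.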

God forbid a Word update ever introduces a new format.

tomspeak · on Nov 15, 2018

Also very interested in this; we've had issues reconciling external copy and paste.

sbr464 · on Nov 15, 2018

Curious also about any gotchas, etc.

c-smile · on Nov 15, 2018

Here is the "DOM canibalizer" module from my HTML-NOTEPAD (https://html-notepad.com):

https://github.com/c-smile/sciter-sdk/blob/master/notepad/re...

While it looks pretty simple, it cleans MS Word and browser artifacts in pasted markup pretty well.

But I must admit that such simplicity is possible only with sciter (which html-notepad is based on). E.g. canonicalizeDOM gets called before the content appears in the target document, so none of this affects the undo/redo stack, etc.