Migrate from MS Word to Asciidoctor

Pandoc

Pandoc is a Swiss army knife for converting one markup format to another. It does an admirable job converting simple docx files to AsciiDoc.

Generally, we don’t like to recommend pandoc because it doesn’t create AsciiDoc the way we prefer. However, in this case, it’s a good choice.

To convert an MS Word docx file to AsciiDoc, run the command that matches your pandoc version:

pandoc (older than 2.11.2)

$ pandoc input.docx -f docx -t asciidoc --wrap=none --atx-headers \
  --extract-media=extracted-media -o output.adoc

pandoc (2.11.2 or newer)

$ pandoc input.docx -f docx -t asciidoc --wrap=none --markdown-headings=atx \
  --extract-media=extracted-media -o output.adoc

pandoc (2.11.2 or newer) and docker

If you have Docker, you can run the latest version of pandoc from its container image without installing pandoc locally:

$ docker run --rm --volume "$PWD:/data" --user `id -u`:`id -g` pandoc/core input.docx -f docx \
  -t asciidoc --wrap=none --markdown-headings=atx --extract-media=extracted-media -o output.adoc
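
If the same script has to run against both older and newer pandoc releases, the choice between the two heading flags can be made from the version string. The following is a minimal sketch, not part of pandoc itself; the function name `heading_flag` is hypothetical, and it assumes the version is given as dot-separated numbers.

```shell
# Hypothetical helper: print the heading flag that suits a pandoc version string.
heading_flag() {
  # Split "MAJOR.MINOR.PATCH" on dots; missing fields default to 0.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  major=${1:-0}; minor=${2:-0}; patch=${3:-0}

  # 2.11.2 or newer uses --markdown-headings=atx; older releases use --atx-headers.
  if [ "$major" -gt 2 ] ||
     { [ "$major" -eq 2 ] && [ "$minor" -gt 11 ]; } ||
     { [ "$major" -eq 2 ] && [ "$minor" -eq 11 ] && [ "$patch" -ge 2 ]; }; then
    echo "--markdown-headings=atx"
  else
    echo "--atx-headers"
  fi
}

# Usage (assumes pandoc is on PATH and prints "pandoc X.Y.Z" on its first line):
#   version=$(pandoc --version | awk 'NR == 1 { print $2 }')
#   pandoc input.docx -f docx -t asciidoc --wrap=none "$(heading_flag "$version")" \
#     --extract-media=extracted-media -o output.adoc
```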

Then, edit the output file to tidy it up.

The following table shows the conversions you can expect. It was tested using MS Word 2010 and 2013 with pandoc 2.0 and 2.5 on Windows.

Pandoc MS Word to AsciiDoc conversions
MS Word Feature	Conversion
Headings (using MS Word Heading styles 1-5)	Yes
Tables	Yes. Merged cells are unmerged. Column widths are ignored.
Unordered and ordered lists	Yes
Footnotes	Yes
Figure and table captions	Normal paragraph
Other MS Word styled paragraphs	Normal paragraph
Embedded images	Yes
Character formatting (bold, underline and italic)	Yes
URL links	No. Remove the hyperlinks before converting, as described under Optimizing for Pandoc.
Email links	Yes
Document automation (fields, auto-generated figure and table numbers)	Ignored
Internal references (e.g., “See Figure 3”)	Plain text
Drawing canvas	Ignored
Text boxes	Ignored
Linked (not embedded) images	Ignored
Vector graphics (MS Word “insert shape”)	Ignored

Optimizing for Pandoc

The basic usage documented above is fine for one-off imports. If you have a lot to do, it’s worthwhile cleaning the input document first, and automating the post-conversion tidy up.
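
For a folder of documents, the conversion itself can be wrapped in a loop. The following is a minimal sketch under stated assumptions: the function name `convert_all` and the `PANDOC` override variable are hypothetical, each document gets its own media directory next to it, and the flags are those for pandoc 2.11.2 or newer.

```shell
# Hypothetical helper: convert every .docx file in a directory to AsciiDoc.
# Set PANDOC=echo for a dry run that only prints the arguments.
convert_all() {
  for f in "$1"/*.docx; do
    [ -e "$f" ] || continue        # no .docx files: the glob stays literal, so skip it
    base=${f%.docx}                # path without the .docx extension
    ${PANDOC:-pandoc} "$f" -f docx -t asciidoc --wrap=none \
      --markdown-headings=atx --extract-media="$base-media" -o "$base.adoc"
  done
}

# Usage:
#   PANDOC=echo convert_all ./word-docs    # show what would be run
#   convert_all ./word-docs                # run the conversions
```

Keeping a separate media directory per document avoids image name clashes, because Word tends to reuse names such as image1.png in every file.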

Clean up the MS Word document:
- Remove non-essential material (title pages, headers and footers, table of contents etc.); it is usually easiest to copy just the body text into a new blank document
- Switch off tracking and accept all changes
- Ensure you have used Heading styles for headings
- Ensure table titles are immediately above their table
- Ensure figure titles are immediately above their figure
- Ensure images are inserted as embedded files, not as links
- Remove canvases and put any images they contained into the main flow as paragraph images (this limitation may be removed in the next release of pandoc)
- Remove hyperlinks and replace all internal references and auto-generated sequence numbers with their literal values (Ctrl+A, Ctrl+Shift+F9)
- Remove text boxes and put their text into the main flow
- Replace special characters: smart quotes with simple quotes, non-breaking hyphens with normal hyphens.
- Remove all character formatting (Ctrl+A, Ctrl+B, Ctrl+B, Ctrl+I, Ctrl+I, Ctrl+U, Ctrl+U)
- Optional: insert ids and cross references using AsciiDoc notation (You might find it easier doing it now rather than in the AsciiDoc document later.)
- Save as “Strict Open XML Document (docx)”
Convert using pandoc as shown above.

Check that the output document looks OK, and that all images have been extracted.

If pandoc does not extract the images for some reason, you can extract them yourself with the unzip tool. A docx file is just a zip archive with a docx file extension. Embedded images are located in the word/media directory.

$ unzip input.docx -d input-docx
$ ls input-docx/word/media/

Fix up the output, preferably with an editor that can do regular expressions:
- Delete automatic ids (those beginning with underscores)
- Replace long table delimiters with short ones.
- Insert line breaks to get one sentence per line.
- Re-insert images and turn caption paragraphs back into Asciidoctor captions.
- Replace the hard cross references with AsciiDoc references.
- Fix tables: merged cells will have been unmerged, and column widths need putting back.
Try to convert the result with Asciidoctor, and fix any errors that come up.

The following are POSIX shell one-liners to automate some of these steps (adjust the regexps to match your particular document):

Delete automatically inserted ids

$ perl -W -pe 's!\[\[_.*?\]\]!!g' -i output.adoc

Shorten table delimiters

$ perl -W -pe  's!\|==*!|====!g' -i output.adoc

One sentence per line. Be careful not to match list items. The pattern treats abbreviations (such as “etc.”) as sentence ends, and there is no simple way around that.

$ perl -W -pe 's!(\w\w+)\.\s+(\w)!$1.\n$2!g' -i output.adoc

Replace figure captions with id and title

$ perl -W -pe 's!^Figure (\d+)\s?(.*)![[fig-$1]]\n.$2\n!g' -i output.adoc

Replace references to figures with AsciiDoc xref

$ perl -W -pe 's!Figure (\d+)!<<fig-$1>>!g' -i output.adoc

Google Docs

Google Docs can already open and edit uploaded MS Word docx files. Using the AsciiDoc Processor add-on by Guillaume Grossetie, you can copy and paste part or all of the document from Google Docs as AsciiDoc text. The add-on appears to handle substantially fewer features than pandoc, but expect further development. The source for the add-on can be found in its repository.

Plain text

This method is only useful for very small files or if the other methods are not available.

It keeps the text and converts fields such as auto-numbered lists and cross references to their literal values.
It loses tables (converted to plain paragraphs), images, symbols, form fields, and text boxes.

In MS Word, use Save as Plain text. When the File Conversion dialog appears, set:

- Other encoding: UTF-8
- Do not insert line breaks
- Allow character substitution

Save the file then apply AsciiDoc markup manually.

Experiment with the encoding. Try UTF-8 first, but if you get problems you can always revert to US-ASCII.
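
One quick way to spot encoding problems is to check whether the saved file decodes as UTF-8. The following is a minimal sketch, assuming the `iconv` utility is available; the function name `is_valid_utf8` is hypothetical.

```shell
# Hypothetical check: succeed only if the file is valid UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage:
#   is_valid_utf8 input.txt || echo "input.txt is not valid UTF-8"
```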