Migrate from Confluence XHTML to Asciidoctor

You can convert Atlassian Confluence XHTML pages to Asciidoctor using this Groovy script.

The script calls Pandoc to convert one or more HTML files exported from Confluence into AsciiDoc files, so Pandoc must be installed before you run it. If you have trouble running the script, you can run the Pandoc command it contains directly to convert XHTML files to AsciiDoc manually.
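As a rough sketch of the manual route, the snippet below wraps the Pandoc options used in the script in a small shell function and applies it to every `.html` file in the current directory. The function name `convert_page` and the output naming are assumptions made for illustration. Note that this bypasses the HtmlCleaner pass the script performs, so tables and styled elements may convert differently.

```shell
#!/bin/sh
# Hypothetical helper: convert one exported page with the Pandoc options from convert.groovy.
convert_page() {
    input=$1
    # Assumption: name the output after the input, swapping .html for .adoc
    output=${input%.html}.adoc
    pandoc -f html-native_divs -t asciidoctor "$input" --wrap=none -o "$output"
}

# Convert every .html file in the current directory; do nothing if there are none
for page in ./*.html; do
    [ -e "$page" ] || continue
    convert_page "$page"
done
```

To convert a single file instead, call `convert_page` with its path, for example `convert_page page.html`.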

Example 1. convert.groovy
// This script is provided by melix.
// The source can be found at https://gist.github.com/melix/6020336

@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.4')
import org.htmlcleaner.*

def src = new File('html').toPath()
def dst = new File('asciidoc').toPath()

def cleaner = new HtmlCleaner()
def props = cleaner.properties
props.translateSpecialEntities = false
def serializer = new SimpleHtmlSerializer(props)

src.toFile().eachFileRecurse { f ->
    def relative = src.relativize(f.toPath())
    def target = dst.resolve(relative)
    if (f.isDirectory()) {
        // mkdirs() also creates missing parent directories, including the output root
        target.toFile().mkdirs()
    } else if (f.name.endsWith('.html')) {
        // The suffix needs a leading dot to produce a .html extension
        def tmpHtml = File.createTempFile('clean', '.html')
        println "Converting $relative"
        def result = cleaner.clean(f)
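        result.traverse({ tagNode, htmlNode ->
            // Drop class attributes before conversion
            tagNode?.attributes?.remove 'class'
            // Flatten table cells to plain text and treat header cells as regular cells
            if ('td' == tagNode?.name || 'th' == tagNode?.name) {
                tagNode.name = 'td'
                String txt = tagNode.text
                tagNode.removeAllChildren()
                tagNode.insertChild(0, new ContentNode(txt))
            }
            // Returning true continues the traversal
            true
        } as TagNodeVisitor)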
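        serializer.writeToFile(
                result, tmpHtml.absolutePath, "utf-8"
        )
        // Pass the arguments as a list so that paths containing spaces are not split
        ['pandoc', '-f', 'html-native_divs', '-t', 'asciidoctor', tmpHtml.absolutePath,
         '--wrap=none', '-o', target.toString() + '.adoc'].execute().waitFor()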
        tmpHtml.delete()
    }/* else {
        "cp html/$relative $target".execute()
    }*/
}

This script was created by Cédric Champeau (melix). The source is hosted at https://gist.github.com/melix/6020336.

The script is designed to be run locally on HTML files or directories containing HTML files exported from Confluence.

Usage

  1. Save the script contents to a convert.groovy file in a working directory.

  2. Make the file executable according to your specific OS requirements.

  3. Create an html directory for input files and an asciidoc directory for output files, both inside the working directory.

  4. Place individual files, or a directory containing files, into the html directory.

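  5. Run groovy convert from the working directory to convert the files inside the html directory. The script resolves the html and asciidoc directories relative to the directory it is run from.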

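  6. Look for the generated output files in the asciidoc directory and confirm they meet your requirements. Each output file keeps the name of its source file with .adoc appended (for example, page.html.adoc).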
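Steps 3 through 6 above could be sketched as the following shell session. The working directory name `confluence-migration` and the `EXPORT_DIR` variable are hypothetical, introduced only for illustration. The sketch assumes convert.groovy has already been saved in the working directory, and it skips the conversion step if Groovy or the script is missing.

```shell
#!/bin/sh
# Assumption: the working directory is called confluence-migration (hypothetical name)
workdir=${WORKDIR:-confluence-migration}

# Step 3: create the input and output directories inside the working directory
mkdir -p "$workdir/html" "$workdir/asciidoc"

# Step 4: copy the exported pages in; EXPORT_DIR is a hypothetical path to your Confluence export
if [ -n "${EXPORT_DIR:-}" ] && [ -d "$EXPORT_DIR" ]; then
    cp -R "$EXPORT_DIR"/. "$workdir/html/"
fi

# Step 5: run the script from the working directory (needs groovy, pandoc, and convert.groovy)
if command -v groovy >/dev/null 2>&1 && [ -f "$workdir/convert.groovy" ]; then
    (cd "$workdir" && groovy convert)
fi

# Step 6: list the generated AsciiDoc files for review
find "$workdir/asciidoc" -type f -name '*.adoc'
```

Set `EXPORT_DIR` to the location of your export before running, for example `EXPORT_DIR=~/Downloads/confluence-export sh migrate.sh`, where `migrate.sh` is whatever name you save the sketch under.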