MarkLogic Server 11.0 Product Documentation
xdmp.tidy

xdmp.tidy(
   doc as String,
   [options as Object?]
) as Sequence

Summary

Run tidy on the specified HTML document to convert it to clean, well-formed XHTML. Returns two nodes: the first is a status node reporting any errors or warnings from tidy, and the second is an html node containing the cleaned XHTML.
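
Because the result is a two-item sequence, callers typically separate the status node from the cleaned markup. The following is a minimal sketch of that step. splitTidyResult is a hypothetical helper name, not part of the MarkLogic API, and the typeof xdmp guard is only there so the snippet also loads outside MarkLogic.

```javascript
// Hypothetical helper: names the two items returned by xdmp.tidy.
// `result` may be a MarkLogic Sequence (which has toArray()) or a plain array.
function splitTidyResult(result) {
  const items = Array.isArray(result) ? result : result.toArray();
  return { status: items[0], html: items[1] };
}

// xdmp is a global inside MarkLogic; skip the call anywhere else.
if (typeof xdmp !== 'undefined') {
  const parts = splitTidyResult(xdmp.tidy('<p>hello', {}));
  // parts.status is the status node; parts.html is the cleaned XHTML node.
}
```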

Parameters
doc A string representing the html document you want to tidy.
options The options are based on the open source HTML Tidy configuration options, available at http://tidy.sourceforge.net/docs/quickref.html. Most of the tidy options are available to this function, with the following exceptions:
  • The character encoding for the output is always UTF-8.
  • The filesystem options which allow you to specify where to save output are not supported (although there are many ways to achieve this through functions such as xdmp.save).
  • The output is always XHTML.
  • Entities other than the built-in HTML entities will always be output in numeric form.

This function supports the following options:

HTML, XHTML, and XML Options

addXmlDecl

Default Value: no

Description: This option specifies if Tidy should add the XML declaration when outputting XML or XHTML. Note that if the input already includes an <?xml ... ?> declaration then this option will be ignored.

addXmlSpace

Default Value: no

Description: This option specifies if Tidy should add xml:space="preserve" to elements such as <PRE>, <STYLE> and <SCRIPT> when generating XML. This is needed if the whitespace in such elements is to be parsed appropriately without having access to the DTD.

altText

Default Value: n/a

Description: This option specifies the default "alt=" text Tidy uses for <IMG> elements. This feature is dangerous as it suppresses further accessibility warnings. You are responsible for making your documents accessible to people who cannot see the images!

assumeXmlProcins

Default Value: no

Description: This option specifies if Tidy should change the parsing of processing instructions to require ?> as the terminator rather than >. This option is automatically set if the input is in XML.

bare

Default Value: no

Description: This option specifies if Tidy should strip Microsoft specific HTML from Word 2000 documents, and output spaces rather than non-breaking spaces where they exist in the input.

clean

Default Value: no

Description: This option specifies if Tidy should strip out surplus presentational tags and attributes replacing them by style rules and structural markup as appropriate. It works well on the HTML saved by Microsoft Office products.

cssPrefix

Default Value: n/a

Description: This option specifies the prefix that Tidy uses for style rules. By default, "c" will be used.

doctype

Default Value: auto

Possible Values: auto, omit, strict, loose, transitional, or user-specified fpi string

Description: This option specifies the DOCTYPE declaration generated by Tidy. If set to omit the output won't contain a DOCTYPE declaration. If set to auto (the default) Tidy will use an educated guess based upon the contents of the document. If set to strict, Tidy will set the DOCTYPE to the strict DTD. If set to loose, the DOCTYPE is set to the loose (transitional) DTD. Alternatively, you can supply a string for the formal public identifier (FPI). For example:

doctype: "-//ACME//DTD HTML 3.14159//EN"

If you specify the FPI for an XHTML document, Tidy will set the system identifier to the empty string. Tidy leaves the DOCTYPE for generic XML documents unchanged. Specifying a doctype of omit implies that the numericEntities option is set to yes.
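
As a small illustration of the two kinds of value this option accepts, the sketch below separates the keywords listed above from a formal public identifier. DOCTYPE_KEYWORDS and doctypeKind are hypothetical names used only for this example. The doctypeOptions object is what would be passed as the second argument to xdmp.tidy.

```javascript
// Keywords taken from the "Possible Values" list for the doctype option.
const DOCTYPE_KEYWORDS = ['auto', 'omit', 'strict', 'loose', 'transitional'];

// Hypothetical helper: reports whether a value is a keyword or a user-supplied FPI.
function doctypeKind(value) {
  return DOCTYPE_KEYWORDS.includes(value) ? 'keyword' : 'fpi';
}

// Options object using the FPI string from the example above.
const doctypeOptions = { doctype: '-//ACME//DTD HTML 3.14159//EN' };
```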

dropEmptyParas

Default Value: yes

Description: This option specifies if Tidy should discard empty paragraphs. If set to no, empty paragraphs are replaced by a pair of <BR> elements as HTML4 precludes empty paragraphs.

dropFontTags

Default Value: no

Description: This option specifies if Tidy should discard <FONT> and <CENTER> tags without creating the corresponding style rules. This option can be set independently of the clean option.

dropProprietaryAttributes

Default Value: no

Description: This option specifies if Tidy should strip out proprietary attributes, such as MS data binding attributes.

encloseBlockText

Default Value: no

Description: This option specifies if Tidy should insert a <P> element to enclose any text it finds in any element that allows mixed content for HTML transitional but not HTML strict.

encloseText

Default Value: no

Description: This option specifies if Tidy should enclose any text it finds in the body element within a <P> element. This is useful when you want to take existing HTML and use it with a style sheet.

escapeCdata

Default Value: no

Description: This option specifies if Tidy should convert <![CDATA[]]> sections to normal text.

fixBackslash

Default Value: yes

Description: This option specifies if Tidy should replace backslash characters "\" in URLs by forward slashes "/".

fixBadComments

Default Value: yes

Description: This option specifies if Tidy should replace unexpected hyphens with "=" characters when it comes across adjacent hyphens. The default is yes. This option is provided for users of Cold Fusion which uses the comment syntax: <!--- --->

fixUri

Default Value: yes

Description: This option specifies if Tidy should check attribute values that carry URIs for illegal characters and if such are found, escape them as HTML 4 recommends.

hideComments

Default Value: no

Description: This option specifies if Tidy should omit comments from the output.

hideEndtags

Default Value: no

Description: This option specifies if Tidy should omit optional end-tags when generating the pretty printed markup. This option is ignored if you are outputting to XML.

indentCdata

Default Value: no

Description: This option specifies if Tidy should indent <![CDATA[]]> sections.

inputXml

Default Value: no

Description: This option specifies if Tidy should use the XML parser rather than the error correcting HTML parser.

joinClasses

Default Value: no

Description: This option specifies if Tidy should combine class names to generate a single new class name, if multiple class assignments are detected on an element.

joinStyles

Default Value: yes

Description: This option specifies if Tidy should combine styles to generate a single new style, if multiple style values are detected on an element.

literalAttributes

Default Value: no

Description: This option specifies if Tidy should ensure that whitespace characters within attribute values are passed through unchanged.

logicalEmphasis

Default Value: no

Description: This option specifies if Tidy should replace any occurrence of <I> by <EM> and any occurrence of <B> by <STRONG>. In both cases, the attributes are preserved unchanged. This option can be set independently of the clean and dropFontTags options.

lowerLiterals

Default Value: yes

Description: This option specifies if Tidy should convert the value of an attribute that takes a list of predefined values to lower case. This is required for XHTML documents.

mergeDivs

Default Value: yes

Description: Can be used to modify behavior of setting the clean option to yes. This option specifies if Tidy should merge nested <div> such as <div><div>...</div></div>.

ncr

Default Value: yes

Description: This option specifies if Tidy should allow numeric character references.

newBlocklevelTags

Default Value: none

Description: This option specifies new block-level tags. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags. Note you can't change the content model for elements such as <TABLE>, <UL>, <OL> and <DL>.

newEmptyTags

Default Value: none

Description: This option specifies new empty inline tags. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags. Remember to also declare empty tags as either inline or blocklevel.

newInlineTags

Default Value: none

Description: This option specifies new non-empty inline tags. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags.

newPreTags

Default Value: none

Description: This option specifies new tags that are to be processed in exactly the same way as HTML's <PRE> element. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags. Note you can not as yet add new CDATA elements (similar to <SCRIPT>).
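
The four new-tag options all take the same kind of list, so an options object can be assembled from arrays of names. This is a minimal sketch: tagList is a hypothetical helper, and the tag names (callout, term, marker) are invented for illustration. It also follows the advice above by declaring the empty tag as inline as well.

```javascript
// Hypothetical helper: joins tag names into the space-separated form the options accept.
function tagList(names) {
  return names.join(' ');
}

const emptyTags = ['marker'];                   // invented custom empty tag
const inlineTags = ['term'].concat(emptyTags);  // empty tags are also declared inline

const newTagOptions = {
  newBlocklevelTags: tagList(['callout']),
  newInlineTags: tagList(inlineTags),
  newEmptyTags: tagList(emptyTags)
};
```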

numericEntities

Default Value: no

Description: This option specifies if Tidy should output entities other than the built-in HTML entities (&amp;, &lt;, &gt; and &quot;) in the numeric rather than the named entity form.

outputHtml

Default Value: no

Description: This option specifies if Tidy should generate pretty printed output, writing it as HTML.

outputXhtml

Default Value: yes

Description: This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML. If a DOCTYPE or namespace is given, they will be checked for consistency with the content of the document. In the case of an inconsistency, the corrected values will appear in the output. For XHTML, entities can be written as named or numeric entities according to the setting of the numericEntities option. The original case of tags and attributes will be preserved, regardless of other options.

outputXml

Default Value: yes

Description: This option specifies if Tidy should pretty print output, writing it as well-formed XML. Any entities not defined in XML 1.0 will be written as numeric entities to allow them to be parsed by a XML parser. The original case of tags and attributes will be preserved, regardless of other options.

quoteAmpersand

Default Value: yes

Description: This option specifies if Tidy should output unadorned & characters as &amp;.

quoteMarks

Default Value: no

Description: This option specifies if Tidy should output " characters as &quot;, as is preferred by some editing environments. The apostrophe character ' is written out as &#39; since many web browsers don't yet support &apos;.

quoteNbsp

Default Value: yes

Description: This option specifies if Tidy should output non-breaking space characters as entities, rather than as the Unicode character value 160 (decimal).

repeatedAttributes

Default Value: keep-last

Possible Values: keep-first, keep-last

Description: This option specifies if Tidy should keep the first or last attribute, if an attribute is repeated (for example, if a tag has two align attributes).
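
The difference between the two policies can be shown with a plain-JavaScript model. This sketch only mirrors the behavior described above; it is not Tidy's implementation, and resolveRepeated is a hypothetical name.

```javascript
// Models keep-first / keep-last for a list of [name, value] attribute pairs.
function resolveRepeated(attrs, policy) {
  const kept = new Map();
  for (const [name, value] of attrs) {
    // Under keep-first, later duplicates are ignored; under keep-last, they overwrite.
    if (policy === 'keep-first' && kept.has(name)) continue;
    kept.set(name, value);
  }
  return Object.fromEntries(kept);
}

// A tag with two align attributes, as in the description above.
const attrs = [['align', 'left'], ['class', 'x'], ['align', 'right']];
const first = resolveRepeated(attrs, 'keep-first'); // align resolves to 'left'
const last = resolveRepeated(attrs, 'keep-last');   // align resolves to 'right'
```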

replaceColor

Default Value: no

Description: This option specifies if Tidy should replace numeric values in color attributes by HTML/XHTML color names where defined, e.g. replace "#ffffff" with "white".

showBodyOnly

Default Value: no

Description: This option specifies if Tidy should print only the contents of the body tag as an HTML fragment. Useful for incorporating existing whole pages as a portion of another page.

uppercaseAttributes

Default Value: no

Description: This option specifies if Tidy should output attribute names in upper case. The default is no, which results in lower case attribute names, except for XML input, where the original case is preserved.

uppercaseTags

Default Value: no

Description: This option specifies if Tidy should output tag names in upper case. The default is no, which results in lower case tag names, except for XML input, where the original case is preserved.

word2000

Default Value: no

Description: This option specifies if Tidy should go to great pains to strip out all the surplus stuff Microsoft Word 2000 inserts when you save Word documents as "Web pages". Doesn't handle embedded images or VML.

Diagnostic Options

accessibilityCheck

Default Value: 0

Possible Values: 0, 1, 2, or 3

Description: This option specifies what level of accessibility checking, if any, that Tidy should do. Level 0 is equivalent to Tidy Classic's accessibility checking. For more information on Tidy's accessibility checking, see the web site for the Adaptive Technology Resource Centre at the University of Toronto.

showErrors

Default Value: 6

Possible Values: Any integer.

Description: This option specifies the number Tidy uses to determine if further errors should be shown. If set to 0, then no errors are shown.

showWarnings

Default Value: yes

Description: This option specifies if Tidy should show warnings. Setting it to no suppresses them, which is useful when a few errors are hidden among many warning messages.
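
The three diagnostic options can be set together. The sketch below is one way to build them, under two assumptions: diagnosticOptions is a hypothetical helper, and every value is written as a string, mirroring the 'yes' string used in the examples at the end of this page. Whether numeric literals are also accepted is not stated here.

```javascript
// Hypothetical helper: builds the diagnostic part of an xdmp.tidy options object.
function diagnosticOptions(level, maxErrors, showWarnings) {
  // accessibilityCheck only allows the levels listed under "Possible Values".
  if (![0, 1, 2, 3].includes(level)) {
    throw new RangeError('accessibilityCheck must be 0, 1, 2, or 3');
  }
  return {
    accessibilityCheck: String(level),
    showErrors: String(maxErrors),
    showWarnings: showWarnings ? 'yes' : 'no'
  };
}

// Classic accessibility checking, show up to 20 errors, no warnings.
const errorsOnly = diagnosticOptions(0, 20, false);
```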

Pretty Print Options

breakBeforeBr

Default Value: no

Description: This option specifies if Tidy should output a line break before each <BR> element.

indent

Default Value: no

Possible Values: no, yes, auto

Description: This option specifies if Tidy should indent block-level tags. If set to auto, this option causes Tidy to decide whether or not to indent the content of tags such as TITLE, H1-H6, LI, TD, or P depending on whether or not the content includes a block-level element. You are advised to avoid setting indent to yes as this can expose layout bugs in some browsers.

indentAttributes

Default Value: no

Description: This option specifies if Tidy should begin each attribute on a new line.

indentSpaces

Default Value: 2

Possible Values: Any integer.

Description: This option specifies the number of spaces Tidy uses to indent content, when indentation is enabled.

markup

Default Value: yes

Description: This option specifies if Tidy should generate a pretty printed version of the markup. Note that Tidy won't generate a pretty printed version if it finds significant errors (see forceOutput).

punctuationWrap

Default Value: no

Description: This option specifies if Tidy should line wrap after some Unicode or Chinese punctuation characters.

split

Default Value: no

Description: This option specifies if Tidy should create a sequence of slides from the input, splitting the markup prior to each successive <H2>. The slides are written to "slide001.html", "slide002.html" etc.

tabSize

Default Value: 8

Possible Values: Any integer.

Description: This option specifies the number of columns that Tidy uses between successive tab stops. It is used to map tabs to spaces when reading the input. Tidy never outputs tabs.
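
To make "columns between successive tab stops" concrete, the following plain-JavaScript sketch expands tabs the way a tab-stop model works: each tab advances to the next multiple of the tab size. It illustrates the concept only and is not Tidy's code. expandTabs is a hypothetical name, and columns are counted as UTF-16 code units for simplicity.

```javascript
// Replaces each tab with enough spaces to reach the next tab stop.
function expandTabs(line, tabSize) {
  let out = '';
  for (const ch of line) {
    if (ch === '\t') {
      // Distance from the current column to the next multiple of tabSize.
      const pad = tabSize - (out.length % tabSize);
      out += ' '.repeat(pad);
    } else {
      out += ch;
    }
  }
  return out;
}

// With the default tabSize of 8, the character after the tab lands in column 8 (zero-based).
const expanded = expandTabs('a\tb', 8);
```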

verticalSpace

Default Value: no

Description: This option specifies if Tidy should add some empty lines for readability.

wrap

Default Value: 68

Possible Values: Any integer.

Description: This option specifies the right margin Tidy uses for line wrapping. Tidy tries to wrap lines so that they do not exceed this length. Set wrap to zero if you want to disable line wrapping.

wrapAsp

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within ASP pseudo elements, which look as follows:
<% ... %>.

wrapAttributes

Default Value: no

Description: This option specifies if Tidy should line wrap attribute values, for easier editing. This option can be set independently of wrapScriptLiterals.

wrapJste

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within JSTE pseudo elements, which look as follows:
<# ... #>.

wrapPhp

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within PHP pseudo elements, which look as follows:
<?php ... ?>.

wrapScriptLiterals

Default Value: no

Description: This option specifies if Tidy should line wrap string literals that appear in script attributes. Tidy wraps long script string literals by inserting a backslash character before the line break.

wrapSections

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within <![ ... ]> section tags.

Miscellaneous Options

forceOutput

Default Value: no

Description: This option specifies if Tidy should produce output even if errors are encountered. Use this option with care - if Tidy reports an error, this means Tidy was not able to, or is not sure how to, fix the error, so the resulting output may not be what you expect.
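
Since forced output may not be what you expect, it is sensible to keep the status node alongside the markup and inspect it. The following is a minimal sketch under stated assumptions: tidyLenient is a hypothetical wrapper, and the typeof xdmp guard makes it return null when run outside MarkLogic.

```javascript
// Options that request output even when Tidy reports errors.
const lenientOptions = { forceOutput: 'yes' };

// Hypothetical wrapper: returns both result nodes so the caller can check the status.
function tidyLenient(html) {
  if (typeof xdmp === 'undefined') {
    return null; // not running inside MarkLogic
  }
  const items = xdmp.tidy(html, lenientOptions).toArray();
  return { status: items[0], cleaned: items[1] };
}
```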

keepTime

Default Value: no

Description: This option specifies if Tidy should keep the original modification time of files that Tidy modifies in place. The default is no. Setting the option to yes allows you to tidy files without causing these files to be uploaded to a web server when using a tool such as SiteCopy. Note this feature is not supported on some platforms.

quiet

Default Value: no

Description: This option specifies if Tidy should suppress the summary of the numbers of errors and warnings, as well as the welcome and informational messages.

tidyMark

Default Value: yes

Description: This option specifies if Tidy should add a meta element to the document head to indicate that the document has been tidied. Tidy won't add a meta element if one is already present.

Example

// A template literal is used because the markup spans several lines.
var html = `
<htm>
 <h1>This is a heading 1
 <p>This is paragraph tag
`;

xdmp.tidy(html, {});

// Returns a tidy-status node with any errors and warnings and
// an html node containing the clean and well-formed XHTML.

Example

xdmp.tidy(xdmp.documentGet("/space/test.html",
               {'format':'text'}),
     {'outputXhtml':'yes'}).toArray()[1]

// Returns the html document from the filesystem converted to xhtml
