Skip to main content

Using MarkLogic Content Pump (mlcp)

How mlcp Determines Document Type

The document type determines what kind of database document mlcp inserts from input content: Text, XML, JSON, or binary. Document type is determined in the following ways:

  • Document type can be inherent in the input file type. For example, aggregates and rdf input files always insert XML documents. For details, see Supported Input Format Summary.

  • You can specify a document type explicitly with -document_type. For example, to load documents as XML, use -input_file_type documents -document_type xml. You cannot set an explicit type for all input file types.

  • mlcp can determine document type dynamically from the output document URI and the MarkLogic Server MIME type mappings when you use -input_file_type documents -document_type mixed.

If you set -document_type to an explicit type such as -document_type json, then mlcp inserts all documents as that type.

If you use -document_type mixed, then mlcp determines the document type from the output URI suffix and the MIME type mapping configured into MarkLogic Server. Mixed is the default behavior for -input_file_type documents.

Note

  • You can only use -document_type mixed when the input file type is documents.

  • If an unrecognized or unmapped file extension is encountered when loading mixed documents, mlcp creates a binary document.

The following table contains examples of applying the default MIME type mappings to output URIs with various file extensions, an unknown extension, and no extension. The default mapping includes many additional suffixes. You can examine and create MIME type mappings under the Mimetypes section of the Admin Interface. For more information, see Implicitly Setting the Format Based on the MIME Type in the Loading Content Into MarkLogic Server Guide.

URI

Document Type

/path/doc.xml

XML

/path/doc.json

JSON

/path/doc.jpg

binary

/path/doc.txt

text

/path/doc.unknown

binary

/path/doc-nosuffix

binary

The MIME type mapping is applied to the final output URI. That is, the URI that results from applying the URI transformation options described in Controlling Database URIs During Ingestion. The following table contains examples of how URI transformations can affect the output document type in mixed mode, assuming the default MIME type mappings.

Input Filename

URI Options

Output URI

Doc Type

/path/doc.1

None

/path/file.1

binary

/path/doc.1

Add a .xml suffix:

-output_uri_suffix ".xml"

/path/file.xml

XML

/path/doc.1

Replace the unmapped suffix with .txt:

-output_uri_replace "\.\d+,'.txt'"

/path/file.txt

text

Document type determination is completed prior to invoking server-side transformations. If you change the document type in a transformation function, you are responsible for changing the output document to match. For details, see Transforming Content During Ingestion.