Using MarkLogic Content Pump (mlcp)

Default Document URI Construction

The default database URI assigned to ingested documents depends on the input source. Loading content from the local filesystem can create different URIs than loading the same content from a ZIP file or archive. You can use command line options to modify this behavior and generate different URIs; for details, see Transforming the Default URI.

The following table summarizes the default behavior with several input sources:

| Input Source | Default URI | Example |
|---|---|---|
| documents in a native directory | /path/filename. Note that on Windows, the device (“c:”) becomes a path step, so c:\path\file becomes /c:/path/file. | /space/data/bill/dream.xml and /c:/data/bill/dream.xml |
| documents in a ZIP or GZIP file | /compressed-file-path/path/inside/zip/filename | If the input file is /space/data/big.zip and it contains a directory entry bill/, then the document URI for dream.xml in that directory is /space/data/big.zip/bill/dream.xml |
| a GZIP compressed document | /path/filename-without-gzip-suffix | If the input is /space/data/big.xml.gz, the result is /space/data/big.xml. |
| delimited text file | The value in the column used as the id (the first column, by default). | For a record of the form “first,second,third” where column 1 is the id: first |
| archive or forest | The document URI from the source database. | |
| sequence file | The key in a key-value pair. | |
| aggregate XML, line delimited JSON | /path/filename-split_start-seqnum, where /path/filename is the full path to the input file, split_start is the byte position from the beginning of the split, and seqnum begins with 1 and increments for each document created. | For input file /space/data/big.xml: /space/data/big.xml-0-1 and /space/data/big.xml-0-2. For input file /space/data/big.json: /space/data/big.json-0-1 and /space/data/big.json-0-2. |
| RDF | A generated unique name. | c7f92bccb4e2bfdc-0-100.xml |
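Two of the simpler rules in the table can be modeled in a few lines of Python. This is only an illustrative sketch of the documented behavior, not mlcp's implementation: the function names are hypothetical, the GZIP suffix is assumed to be ".gz", and every row passed to the delimited text helper is assumed to be a data record (header handling is left out).

```python
import csv
import io


def delimited_text_uris(text, id_column=0):
    """Return the id-column value of each record (first column by default)."""
    reader = csv.reader(io.StringIO(text))
    # Skip blank lines; every remaining row is treated as a data record.
    return [row[id_column] for row in reader if row]


def strip_gzip_suffix(path):
    """Drop a trailing '.gz' (assumed suffix), leaving other paths untouched."""
    return path[:-3] if path.endswith(".gz") else path


print(delimited_text_uris("first,second,third\n"))   # ['first']
print(strip_gzip_suffix("/space/data/big.xml.gz"))   # /space/data/big.xml
```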

As an example of the filesystem case, the following command loads all files from the directory /space/bill/data into the database attached to the App Server on port 8000. The inserted documents have URIs of the form /space/bill/data/filename.

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill/data -mode local
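To preview the URIs such an import should produce, the filesystem rule can be approximated as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, only drive-letter Windows paths are considered, and the code models the documented rule rather than calling mlcp.

```python
from pathlib import PurePosixPath, PureWindowsPath


def default_uri_for_local_file(path):
    """Model the documented default URI for a file loaded from a directory."""
    windows_path = PureWindowsPath(path)
    if windows_path.drive:
        # On Windows the device ("c:") becomes the first path step.
        steps = [windows_path.drive] + list(windows_path.parts[1:])
        return "/" + "/".join(steps)
    # Elsewhere the URI is simply the file's path.
    return str(PurePosixPath(path))


print(default_uri_for_local_file("/space/bill/data/dream.xml"))
print(default_uri_for_local_file(r"c:\data\bill\dream.xml"))
```

The first call leaves the POSIX path unchanged; the second yields /c:/data/bill/dream.xml, matching the Windows note in the table.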

If the /space/bill/data directory is zipped up into bill.zip, such that bill/ is the root directory in the ZIP file, then the following command inserts documents with URIs of the form bill/data/filename:

# Windows users, see Modifying the Example Commands for Windows
$ cd /space; zip -r bill.zip bill
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill.zip \
    -mode local -input_compressed true
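The path inside the ZIP file is determined by how the archive was built, so it can help to inspect the entry names before importing. The sketch below uses Python's standard zipfile module on a small in-memory archive; the file names dream.xml and notes.xml and both helper functions are hypothetical stand-ins for the contents of bill.zip.

```python
import io
import zipfile


def build_sample_zip():
    """Create an in-memory ZIP whose root directory is bill/ (illustrative data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("bill/data/dream.xml", "<dream/>")
        archive.writestr("bill/data/notes.xml", "<notes/>")
    return buffer.getvalue()


def zip_entry_names(zip_bytes):
    """List the file entries stored in the archive, skipping directory entries."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


print(zip_entry_names(build_sample_zip()))
# ['bill/data/dream.xml', 'bill/data/notes.xml']
```

Entry names are stored relative to the directory where zip was run, which is why the cd /space step in the command above matters.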

When you use the -generate_uri option to have mlcp generate URIs for you, the generated URIs follow the same pattern as for aggregate XML and line delimited JSON:

/path/filename-split_start-seqnum

The generated URIs are unique across a single import operation, but they are not globally unique. For example, if you repeatedly import data from some file /tmp/data.csv, the generated URIs will be the same each time (modulo differences in the number of documents inserted by the job).
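The repeatability described above follows directly from the pattern, since every part of the URI is derived from the input. A minimal sketch, assuming a single split that starts at byte 0 and a hypothetical helper name:

```python
def generated_uris(input_path, split_start, count):
    """Build URIs of the form /path/filename-split_start-seqnum (seqnum starts at 1)."""
    return [f"{input_path}-{split_start}-{seqnum}" for seqnum in range(1, count + 1)]


first_run = generated_uris("/tmp/data.csv", 0, 3)
second_run = generated_uris("/tmp/data.csv", 0, 3)

print(first_run)                # ['/tmp/data.csv-0-1', '/tmp/data.csv-0-2', '/tmp/data.csv-0-3']
print(first_run == second_run)  # True
```

Because a second import of the same file yields the same URIs, it targets the same documents rather than creating new ones.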