Skip to main content

Using MarkLogic Content Pump (mlcp)

Splitting Large XML Files Into Multiple Documents

Very large XML files often contain aggregate data that can be disaggregated by splitting it into multiple smaller documents rooted at a recurring element. Disaggregating large XML files consumes fewer resources during loading and improves performance when searching and retrieving content. For aggregate JSON handling, see Creating Documents from Line-Delimited JSON Files.

The following mlcp options support creating multiple documents from aggregate data:

  • -aggregate_record_element

  • -uri_id

  • -aggregate_record_namespace

You can disaggregate XML when loading from either flat or compressed files. For more information about working with compressed files, see Loading Documents From Compressed Files.

Follow this procedure to create documents from aggregate XML input:

  1. Set -input_file_path:

    • To load from a single file, set -input_file_path to the path to the aggregate XML file.

    • To load from multiple files, set -input_file_path to a directory containing the aggregate files. The directory must not contain other kinds of files.

  2. If you are loading from a compressed file, set -input_compressed.

  3. Set -input_file_type to aggregates.

  4. Set -aggregate_record_element to the element QName of the node to use as the root for all inserted documents. See the example below. The default is the first child element under the root element.

    Note

    The element QName should appear at only one level. You cannot specify the element name using a path, so disaggregation occurs everywhere that name is found.

  5. Optionally, override the default document URI by setting -uri_id to the name of the element from which to derive the document URI.

  6. If the aggregate record element is in a namespace, set -aggregate_record_namespace to the input namespace.

The default URI is hashcode-seqnum in local mode. If there are multiple matching elements, the first match is used.

If your aggregate URI IDs are not unique, you can overwrite one document in your input set with another. Importing documents with non-unique URI IDs from multiple threads can also cause deadlocks.

The example below uses the following input data:

$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person>
    <first>George</first>
    <last>Washington</last>
  </person>
  <person>
    <first>Betsy</first>
    <last>Ross</last>
  </person>
</people>

The following command breaks the input data into a document for each <person> element. The -uri_id and other URI options give the inserted documents meaningful names. The command creates URIs of the form /people/lastname.xml by using the <last/> element as the aggregate URI id, along with an output prefix and suffix:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml

The command creates two documents: /people/Washington.xml and /people/Ross.xml. For example, /people/Washington.xml contains:

<?xml version="1.0" encoding="UTF-8"?>
<person>
    <first>George</first>
    <last>Washington</last>
</person>

If the input data is in a namespace, set -aggregate_record_namespace to that namespace. For example, if the input data is modified to include a namespace:

$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people xmlns="http://marklogic.com/examples">...</people>

Then mlcp ingests no documents unless you set -aggregate_record_namespace. Setting the namespace creates two documents in the namespace http://marklogic.com/examples. For example, after running the following command:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml \
    -aggregate_record_namespace "http://marklogic.com/examples"

The document with URI /people/Washington.xml contains:

<?xml version="1.0" encoding="UTF-8"?>
<person xmlns="http://marklogic.com/examples">
    <first>George</first>
    <last>Washington</last>
</person>