Splitting Large XML Files into Multiple Documents
Very large XML files often contain aggregate data that can be disaggregated by splitting it into multiple smaller documents rooted at a recurring element. Disaggregating large XML files consumes fewer resources during loading and improves performance when searching and retrieving content. For aggregate JSON handling, see Creating Documents from Line-Delimited JSON Files.
The following mlcp options support creating multiple documents from aggregate data:
-aggregate_record_element
-uri_id
-aggregate_record_namespace
You can disaggregate XML when loading from either flat or compressed files. For more information about working with compressed files, see Loading Documents from Compressed Files.
Follow this procedure to create documents from aggregate XML input:
1. Set -input_file_path:
   To load from a single file, set -input_file_path to the path to the aggregate XML file.
   To load from multiple files, set -input_file_path to a directory containing the aggregate files. The directory must not contain other kinds of files.
2. If you are loading from a compressed file, set -input_compressed.
3. Set -input_file_type to aggregates.
4. Set -aggregate_record_element to the element QName of the node to use as the root for all inserted documents. See the example below. The default is the first child element under the root element.
   Note: The element QName should appear at only one level. You cannot specify the element name using a path, so disaggregation occurs everywhere that name is found.
5. Optionally, override the default document URI by setting -uri_id to the name of the element from which to derive the document URI.
6. If the aggregate record element is in a namespace, set -aggregate_record_namespace to the input namespace.
The default URI is hashcode-seqnum in local mode. If there are multiple matching elements, the first match is used.
If your aggregate URI IDs are not unique, one document in your input set can overwrite another. Importing documents with non-unique URI IDs from multiple threads can also cause deadlocks.
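Before importing, it can help to check that the values you plan to use as URI IDs are unique. The following is a minimal sketch and not an mlcp feature: it relies only on standard grep, sed, sort, and uniq, and it assumes each <last> element holds simple text and does not span lines. The file name dupcheck.xml and the sample records are hypothetical.

```shell
# Hypothetical sample input containing a repeated <last> value.
cat > dupcheck.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person><first>George</first><last>Washington</last></person>
  <person><first>Betsy</first><last>Ross</last></person>
  <person><first>Diana</first><last>Ross</last></person>
</people>
EOF

# Extract the text of every <last> element, then keep only the
# values that occur more than once.
dupes=$(grep -o '<last>[^<]*</last>' dupcheck.xml \
  | sed -e 's/<last>//' -e 's/<\/last>//' \
  | sort | uniq -d)

if [ -n "$dupes" ]; then
  echo "Duplicate URI IDs: $dupes"
else
  echo "No duplicate URI IDs found"
fi
```

For this sample the check reports Ross as a duplicate. A text-based check like this is only a rough guard; for input with attributes on the element, nested markup, or multi-line values, an XML-aware tool would be a safer choice.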
The example below uses the following input data:
$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person>
    <first>George</first>
    <last>Washington</last>
  </person>
  <person>
    <first>Betsy</first>
    <last>Ross</last>
  </person>
</people>
The following command breaks the input data into a document for each <person> element. The -uri_id and other URI options give the inserted documents meaningful names. The command creates URIs of the form /people/lastname.xml by using the <last/> element as the aggregate URI id, along with an output prefix and suffix:
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml
The command creates two documents: /people/Washington.xml and /people/Ross.xml. For example, /people/Washington.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<person>
  <first>George</first>
  <last>Washington</last>
</person>
If the input data is in a namespace, set -aggregate_record_namespace to that namespace. For example, if the input data is modified to include a namespace:
$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people xmlns="http://marklogic.com/examples">...</people>
Then mlcp ingests no documents unless you set -aggregate_record_namespace. Setting the namespace creates two documents in the namespace http://marklogic.com/examples. For example, after running the following command:
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml \
    -aggregate_record_namespace "http://marklogic.com/examples"
The document with URI /people/Washington.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<person xmlns="http://marklogic.com/examples">
<first>George</first>
<last>Washington</last>
</person>
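To decide whether -aggregate_record_namespace is needed, you can look for a default namespace declaration in the input before importing. The following is a rough sketch that uses grep and sed rather than an XML parser. It assumes the namespace is declared as a double-quoted default xmlns attribute, as in the example above, and that the record element inherits it from the first declaration in the file. The file name nscheck.xml is hypothetical.

```shell
# Hypothetical sample input with a default namespace on the root element.
cat > nscheck.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<people xmlns="http://marklogic.com/examples">
  <person><first>George</first><last>Washington</last></person>
</people>
EOF

# Take the first default namespace declaration and strip the
# attribute name and surrounding quotes.
ns=$(grep -o 'xmlns="[^"]*"' nscheck.xml \
  | head -n 1 \
  | sed -e 's/^xmlns="//' -e 's/"$//')

if [ -n "$ns" ]; then
  echo "Pass: -aggregate_record_namespace \"$ns\""
else
  echo "No default namespace declared"
fi
```

This does not handle prefixed namespaces (xmlns:prefix) or a record element that declares its own namespace; inspect the input directly in those cases.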