Information Studio Developer's Guide — Chapter 4

Creating and Configuring Flows

This chapter describes how to use Information Studio to create flows that load content into a database.

Accessing Information Studio

When you start Application Services, the Application Services page opens. The Information Studio section is near the bottom of the page.

The Information Studio Flows section lists all of the flows in the App-Services database and enables you to create a new flow, configure an existing flow, or delete a flow.

The Information Studio Flows section provides the following actions:

Action | Description
New Flow | Creates a new flow. See Creating a New Flow.
Configure | Click the flow name to configure an existing flow.
Delete | Permanently delete a flow from the App-Services database. See Deleting a Flow.

Creating a New Flow

To create a new flow, use the following steps:

  1. In the Information Studio Flows section of the Application Services page, click New Flow.

    A flow named "Untitled" is created and the Flow Editor appears.

  2. In the Flow Editor, click Edit next to the flow name.

  3. Type the name of the flow in the text field and click Done.

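  4. Configure your flow by selecting a collector and ingestion options, adding transforms, and configuring the destination database. For details on these tasks, see Selecting a Collector, Configuring Ingestion Options, Transforming Content During Ingestion, and Selecting Database Load Settings.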

Selecting a Collector

A collector is a plugin that accumulates content to be loaded into a MarkLogic Server database. Collectors enable you to specify how the files are to be loaded into the database. Different collectors gather content in different ways. For example, one collector scans and loads files from a filesystem directory in a single pass. Another collector monitors and mirrors a directory.

The following collectors are shipped with MarkLogic Server:

  • The Filesystem Directory collector loads files from a specified directory.
  • The Browser Drop-Box collector loads files dropped into a browser window.
  • The External Binary Filesystem Directory collector loads files from a specified directory as external binary documents. See Working With Binary Documents in the Application Developer's Guide.

You can also create custom collectors as described in Creating Custom Collectors.

Changing the Default Collector

  1. The Filesystem Directory collector is the default collector for a flow. To change the collector, click Change Collector in the Collect section of the Flow Editor.

    The Select a Collector dialog appears.

  2. Select the type of collector to use.

Using the Filesystem Directory Collector

The Filesystem Directory collector loads all of the files from a specified directory into the database. Collector options specify where the ingested documents are found on the filesystem. Ingestion options specify how the documents are ingested into the database.

To configure the Filesystem Directory Collector options, use the following steps:

  1. In the Information Studio Flows section of the Application Services page, click the name of the flow you want to configure.

    The Flow Editor appears.

  2. In the Collect section of the Flow Editor, confirm the collector is Filesystem Directory and select Configure.

  3. In the Directory Path text box, enter the filesystem location of the documents to be loaded into the database.

    A fully qualified directory path is required and must be readable by MarkLogic Server. All the files in the specified directory and its subdirectories are ingested.

  4. Click Done to save the directory path.
  5. To specify ingestion options, see Configuring Ingestion Options.
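
To preview which files a Filesystem Directory flow would pick up, the behavior described above (a fully qualified path is required, and every file in the directory and its subdirectories is ingested) can be approximated outside MarkLogic Server. The following Python sketch is only an illustration under that reading; the function name `list_candidate_files` is hypothetical and this is not the collector's actual implementation.

```python
import os

def list_candidate_files(directory_path):
    """Return every file under directory_path, including subdirectories."""
    # The collector requires a fully qualified path, so reject relative ones.
    if not os.path.isabs(directory_path):
        raise ValueError("a fully qualified directory path is required")
    found = []
    # os.walk descends into each subdirectory of directory_path.
    for current_dir, _subdirs, filenames in os.walk(directory_path):
        for name in filenames:
            found.append(os.path.join(current_dir, name))
    return sorted(found)
```

Running this against the directory you plan to enter in the Directory Path text box gives a rough count to compare with the number of documents reported after loading.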

Using the Browser Drop-Box Collector

The Browser Drop-Box collector loads all the files dropped into a browser window into the database. The ingestion options specify how the documents are ingested into the database.

To use the Browser Drop-Box collector, use the following steps:

  1. In the Information Studio Flows section of the Application Services page, click the name of the flow you want to configure.

    The Flow Editor appears.

  2. To set the ingestion options, see Configuring Ingestion Options.
  3. In the Collect section of the Flow Editor, click Start Loading.

  4. Initially, you are prompted to log in to MarkLogic Server and accept a certificate. Enter the same login credentials that you used to log in to Information Studio.

    If you run MarkLogic Server on Mac OS and want the option to use the Drop-Box Collector with different credentials, do not check the box that instructs Java to remember your login information.

  5. When you accept the certificate from MarkLogic Server, click 'Always trust content from this publisher' if you do not want to be prompted to accept the certificate on future uploads. Then click Run.

  6. When you are ready to upload your files, select the files from your filesystem (for example, from an Explorer window) and drag them into the Drag Files Here area of the Collector.

    Navigating away from the Information Studio page during uploading cancels the upload operation.

  7. When you finish uploading your files, click Stop Loading.

  8. To review the documents you loaded, go to the Query Console. Select the Content Source and click Explore. The documents in the database are listed in the results pane at the bottom of the page.

Using the External Binary Filesystem Directory Collector

The External Binary Filesystem Directory collector loads external binary files, as defined in Working With Binary Documents in the Application Developer's Guide. Ingestion options specify how documents are ingested into the database.

The source directory and its contents must remain accessible to any MarkLogic Server instances that make queries against the external binary documents. See Selecting a Location For Binary Content in the Application Developer's Guide.

To configure the External Binary Filesystem Directory collector options, use the following steps:

  1. In the Information Studio Flows section of the Application Services page, click the name of the flow you want to configure.

    The Flow Editor appears.

  2. In the Collect section of the Flow Editor, select Configure.

  3. In the Directory Path text box, enter the filesystem location of the documents to be loaded into the database.

    A fully qualified directory path is required and must be readable by MarkLogic Server. All of the files in the specified directory and its subdirectories are ingested.

  4. Click Done to save the directory path.
  5. To set the ingestion options, see Configuring Ingestion Options.

Using the Oscars Example Data Loader Collector

The Oscars Example Data Loader collector loads the complete set of Oscar data files from the MarkLogic Developer Network into the database used by the Example Oscars application created by Application Builder. For details on how to use the Oscars Example Data Loader collector, see Loading the Complete Set of Oscars Data in the Application Builder Developer's Guide.

Configuring Ingestion Options

Ingestion options specify how documents are ingested into the database and under what URI they are stored. Using these options, you can fine-tune which documents are ingested by the collector. The available options are described in the following table.

Field | Description
Documents per transaction | The maximum number of documents to be ingested in a single transaction. If more documents than the maximum are ingested, the ingest operation schedules more than one transaction.
Filtering | The filter used to select the documents in the filesystem. This can be any XQuery regular expression. The default regular expression specifies all documents in the directory and subdirectories.
Repair XML documents | Check this option to attempt to repair malformed XML content in each document during ingestion. If the box is left unchecked, malformed XML content is rejected and an error is generated.
Format | Ingest documents as a particular format, such as XML, Text, or Binary. The default setting ingests documents as any format. Documents that are not originally of the specified format are converted to that format.
Encoding | Specify the encoding of your source documents, such as UTF-8 or ASCII. See the Search Developer's Guide for a list of character set encodings by language. Source document encodings are translated into UTF-8. The string specified for the encoding option is matched to an encoding name according to the Unicode Charset Alias Matching rules (see http://www.unicode.org/reports/tr22/#Charset_Alias_Matching). The Auto setting uses an automatic encoding detector. If no encoding can be detected, the encoding defaults to UTF-8.
Language | Add an xml:lang attribute to the root element node of all ingested documents to indicate that they are written in a particular language, such as English or French. The Default setting does not tag ingested documents with an xml:lang attribute.
Default namespace | The namespace to use if there is no namespace at the root node of the document. The default value is "".
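
The Encoding option relies on the Unicode Charset Alias Matching rules, which explain why spellings such as UTF-8 and utf8 are treated as the same name. The following Python sketch applies those rules as they are stated in Unicode Technical Report #22 (keep only letters and digits, lowercase them, and drop each 0 that is not preceded by a digit). The function name `normalize_charset_alias` is hypothetical, and how MarkLogic Server applies the rules internally is not shown here.

```python
import re

def normalize_charset_alias(name):
    """Reduce an encoding name to its comparison form per the TR22 rules."""
    # Rules 1 and 2: delete everything except ASCII letters and digits,
    # then map uppercase to lowercase.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", name).lower()
    # Rule 3: from left to right, delete each 0 not preceded by a digit.
    kept = []
    for ch in cleaned:
        if ch == "0" and not (kept and kept[-1].isdigit()):
            continue
        kept.append(ch)
    return "".join(kept)

def same_charset(a, b):
    """Two names match when their normalized forms are equal."""
    return normalize_charset_alias(a) == normalize_charset_alias(b)
```

For example, `same_charset("UTF-8", "u.t.f-008")` is true under these rules, while `same_charset("UTF-8", "utf-80")` is not.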

If you specify a default namespace in a collector, you must also specify the same namespace in the transforms you use in the flow. Otherwise the transform does not work.

Most of the ingestion options are the same as the options passed to the xdmp:document-load function.
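
The Filtering and Documents per transaction options can be pictured as two stages: select the files that match a regular expression, then split the selection into transactions no larger than the configured maximum. The following Python sketch models only that idea. It uses Python's regular expression syntax as a stand-in for XQuery regular expressions (the two are similar but not identical), and the function names and the sample pattern are hypothetical.

```python
import re

def select_documents(paths, filter_pattern):
    """Keep only the paths that match the filter pattern."""
    matcher = re.compile(filter_pattern)
    return [p for p in paths if matcher.search(p)]

def plan_transactions(documents, docs_per_transaction):
    """Split documents into batches of at most docs_per_transaction."""
    if docs_per_transaction < 1:
        raise ValueError("docs_per_transaction must be at least 1")
    return [
        documents[i:i + docs_per_transaction]
        for i in range(0, len(documents), docs_per_transaction)
    ]

paths = [
    "/data/a.xml",
    "/data/b.txt",
    "/data/sub/c.xml",
    "/data/d.xml",
    "/data/e.xml",
]
selected = select_documents(paths, r"\.xml$")   # illustrative pattern only
batches = plan_transactions(selected, 3)
```

With four matching files and a maximum of three documents per transaction, the sketch produces two batches, which mirrors the statement that exceeding the maximum schedules more than one transaction.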

To configure the ingestion options for a collector, use the following steps:

  1. In the Collect section of the Flow Editor, click Ingestion.

    The Ingestion Settings dialog appears.

  2. In the Ingestion Settings dialog, configure the settings. Each option is described in the table above.

Transforming Content During Ingestion

Transforms are plugins that modify your documents as they are loaded into the database. Several transforms are shipped with MarkLogic Server. You can also create custom transforms, as described in Creating Custom Transforms.

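MarkLogic Server provides transforms for deleting elements or attributes, normalizing dates, validating documents against a schema, applying a custom XSLT stylesheet, extracting metadata from binary content, renaming elements or attributes, and running custom XQuery.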

Adding a Transform To A Flow

To add a transform, use the following steps:

  1. In the Information Studio Flows section of the Application Services page, click New Flow or click the name of the existing flow to access the Flow Editor.

  2. In the Transform section of the Flow Editor, click Add Transformation Step.

    The Select A Transform dialog appears.

  3. Select the transform that you want to add to your flow. You can add multiple transforms to your flow.

Refer to the following sections for information on using each type of transform.

Deleting Elements or Attributes

The Delete transform enables you to remove an element or attribute. You can also restrict the deletion to an element or attribute in a specific namespace. Replace <match-expression> with the name of the element to delete, or with an expression using an XSLT match pattern. To delete an attribute, replace <match-expression> with a match pattern that matches the attribute (for example, @attribute-name). You can test your <match-expression> by clicking Test. You can specify namespace bindings to use in your <match-expression> using the Add a Namespace Binding button. Click Done to save the transform.

If you specified a default namespace in your collector, click 'Add a namespace binding' and specify the same default namespace.
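
The reason the namespace binding matters is that an unqualified element name does not match an element that lives in a namespace. The following Python sketch demonstrates the effect with the standard xml.etree.ElementTree module rather than with the Delete transform itself; the function name `delete_elements` and the namespace URI http://example.com/ns are hypothetical.

```python
import xml.etree.ElementTree as ET

def delete_elements(xml_text, local_name, namespace=""):
    """Remove every element with the given name, optionally in a namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced name as {namespace-uri}local-name.
    target = f"{{{namespace}}}{local_name}" if namespace else local_name
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == target:
                parent.remove(child)
    return root

doc = '<doc xmlns="http://example.com/ns"><title>t</title><note>n</note></doc>'

# Without the namespace, "note" matches nothing and both children remain.
unchanged = delete_elements(doc, "note")

# With the same namespace the document uses, the note element is removed.
trimmed = delete_elements(doc, "note", "http://example.com/ns")
```

The same reasoning applies to the other transforms that take a match expression: the binding must name the namespace that the ingested documents actually use.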

Normalizing Dates

The Normalize Dates transform enables you to specify a text transformation into an xs:date or xs:dateTime value on certain elements in the documents.

Specify the date format using the drop-down list.

You can add the date as a new element, as an attribute to an existing element, or overwrite the value of an existing element or attribute with the date.

If you have specified a default namespace in your collector, click 'Add a namespace binding' and specify the same default namespace for your element and attribute (if applicable).
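
To make the idea of normalizing text into an xs:date or xs:dateTime value concrete, the following Python sketch parses a date string in a known source format and emits the ISO 8601 lexical form that those types use. It is an illustration only: the function names are hypothetical, and the strptime format strings are assumptions that stand in for whichever formats the drop-down list offers.

```python
from datetime import datetime

def normalize_date(text, source_format):
    """Return the xs:date lexical form (YYYY-MM-DD) of a date string."""
    return datetime.strptime(text.strip(), source_format).date().isoformat()

def normalize_datetime(text, source_format):
    """Return the xs:dateTime lexical form (YYYY-MM-DDThh:mm:ss)."""
    return datetime.strptime(text.strip(), source_format).isoformat()

# A US-style month/day/year string becomes 2012-03-05.
as_date = normalize_date("03/05/2012", "%m/%d/%Y")

# A space-separated timestamp becomes 2012-03-05T14:30:00.
as_datetime = normalize_datetime("2012-03-05 14:30:00", "%Y-%m-%d %H:%M:%S")
```

A string that does not fit the chosen format raises ValueError in this sketch, which is a reminder to pick the format that matches the source documents.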

Validating Documents Against a Schema

The Schema Validation transform validates the XML of loaded documents against the schema according to the validation mode as follows:

  • Select strict to cause validation errors to stop processing with a fatal error
  • Select lax to cause validation errors to produce a warning and continue processing

The default setting is strict. For details on the validation levels, see Validate Expression in the XQuery and XSLT Reference Guide.

Applying a Custom XSLT Stylesheet

The XSLT transform enables you to create a custom XSLT stylesheet to apply to the loaded documents. The XSLT stylesheet enables you to modify document content, properties, permissions, and collections.

XSLT cannot be used on binary documents, so they are ignored by this transform and are passed through unchanged. Additionally, you cannot use the <xsl:result-document> XSLT instruction in an Information Studio transformation; if you use it, any result documents are not propagated to your content database.

If you are importing an XSLT stylesheet, the stylesheet must have read permission for the infostudio-user role and should be stored under the /actions root in the App-Services database. Information Studio expects XSLT stylesheets to be located in the /actions directory, so it is not necessary to specify the root directory in your import statement.

For example, to import the stylesheet /actions/a.xsl, the import should look like the following:

<xsl:import href="a.xsl"/>
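
The rule that import paths are taken relative to the /actions root can be modeled as a simple path join. The following Python sketch shows only that model, with a hypothetical function name; the actual lookup is performed by MarkLogic Server and may differ in detail.

```python
import posixpath

# Root under which Information Studio expects stylesheets, per the text above.
ACTIONS_ROOT = "/actions"

def resolve_import(href):
    """Resolve an xsl:import href against the /actions root."""
    # posixpath.join returns href unchanged when it is already absolute.
    return posixpath.join(ACTIONS_ROOT, href)

resolved = resolve_import("a.xsl")
```

Under this model, an href of a.xsl resolves to /actions/a.xsl, which is why the import statement can omit the root directory.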

Extracting Metadata from Binary Content With the Filter Documents Transform

The Filter Documents transform extracts text and metadata from binary content during ingestion, with options to store the text and metadata either as document properties or in a separate XHTML document. For example, for Microsoft Word or PDF input, the transform extracts metadata such as author, creation date, heading, and body text as properties or into an XHTML document.

Text extracted from binary documents contains little formatting. This text is usually used to support search, classification, and other text processing. For an example of the types of metadata extracted and the supported document formats, see xdmp:document-filter in the XQuery and XSLT Reference Guide.

To choose whether to store the metadata as properties or as separate XHTML documents, select your choice from the drop-down list.

Renaming Elements or Attributes

The Rename transform enables you to change the name of an element or attribute in one or more namespaces.

When you select the Rename transform, a configure settings dialog appears.

Replace <match-expression> with the name of the existing element or attribute. Replace <new-qname> with the new name. Click Test to see the results of your entries.

If you specified a default namespace in your collector, click 'Add a namespace binding' and specify the same default namespace.

Adding a Custom XQuery Transform

The XQuery transform enables you to add a Content Processing Framework (CPF) action module to modify the loaded documents. The settings window provides a basic template for the CPF action module and directions on where to add your code. Alternatively, you can paste in your completed action module.

The module you provide must be a valid CPF action module, or the pipeline does not function properly.

Selecting Database Load Settings

The Load section of the Flow Editor enables you to select the destination database and to configure a number of document settings, such as the following:

  • URI structure under which the documents are to be loaded
  • Document access permissions
  • Collections under which the documents are to be grouped

Selecting the Destination Database

To select the database for the flow, do the following:

  1. In the Information Studio Flows section of the Application Services page, click the name of the flow you want to configure.

    The Flow Editor appears.

  2. In the Load section of the Flow Editor, select the database into which you want this flow to load the content.

    The total number of documents currently in the database is displayed on the right.

Document Settings

Click Document Settings to assign attributes to the documents that are loaded into the database.

The Document Settings dialog has the following tabs: URI, Permissions, Collections, and Quality Boost.

Configuring the URI Structure

The URI Structure Configuration dialog enables you to specify the URI structure of incoming documents and how to handle conflicts between incoming documents and documents that already exist in the database. The URI setting defines the structure of the URI under which files are loaded into the database. A URI can be made up of some or all the following elements and they can be organized in any order:

Element | What it is | Example
{$guid} | Globally unique ID. Generates a globally unique ID for each file loaded into the database. | 15908936213503297716
{$path} or {$path strip-prefix=' '} | The directory path to the file. By default, this is the same path under which the file is located in the filesystem. You can add a strip-prefix inside {$path} to remove a prefix from the upload directory. (See the example below.) | C:/latest/docs
{$filename} | The name of the file. | myfile
{$ext} | The file extension. Note that the dot (.) must be specified in the URI structure as a literal. | xml

To configure the URI structure, do the following:

  1. Select the destination database and click Document Settings.
  2. In the Document Settings dialog, select the URI tab.

    The URI Structure Configuration dialog appears.

  3. To change the URI structure, modify the structure of the URI in the URI field.

    For example, the default URI is:

    /content{$path}/{$filename}.{$ext}

    This means that a file named C:/mydir/mydocument.xml is loaded with the following URI:

    /contentC:/mydir/mydocument.xml

    Changing the URI structure to:

    /content{$path strip-prefix="C:/mydir"}/{$filename}.{$ext}

    results in the following URI:

    /content/mydocument.xml

    Changing the URI structure to:

    /http://mydir/{$filename}.{$ext}

    results in the URI:

    /http://mydir/mydocument.xml

    Changing the URI structure to:

    /mydir/{$filename}

    results in the URI:

    /mydir/mydocument

  4. To handle incoming documents that have the same URI as an existing document in the database, select one of the following 'If a document already exists at this URI' options:

    Option | Description
    Ignore Incoming document | Do not update an existing document in the database with the incoming document.
    Replace | Replace the existing document in the database with the incoming document.
    Error | Generate an error if an existing document in the database has the same URI as the incoming document.
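
The URI structure examples and the three conflict options above can be reproduced with a small template expander. The following Python sketch is a model of the documented behavior, not Information Studio's implementation: the function names `build_uri` and `store` are hypothetical, the {$guid} value is a random stand-in rather than MarkLogic's generator, and an in-memory dict stands in for the database.

```python
import re
import uuid

# Matches {$guid}, {$path}, {$path strip-prefix="..."}, {$filename}, {$ext}.
TOKEN = r"""\{\$(guid|path|filename|ext)(\s+strip-prefix=["']([^"']*)["'])?\}"""

def build_uri(template, file_path):
    """Expand a URI structure such as /content{$path}/{$filename}.{$ext}."""
    directory, _, base = file_path.rpartition("/")
    filename, dot, ext = base.rpartition(".")
    if not dot:  # the file has no extension
        filename, ext = base, ""

    def expand(match):
        name, prefix = match.group(1), match.group(3)
        if name == "guid":
            return str(uuid.uuid4().int >> 64)  # stand-in unique number
        if name == "path":
            if prefix and directory.startswith(prefix):
                return directory[len(prefix):]
            return directory
        if name == "filename":
            return filename
        return ext  # name == "ext"

    return re.sub(TOKEN, expand, template)

def store(database, uri, document, on_conflict):
    """Apply an 'If a document already exists at this URI' option."""
    if uri in database:
        if on_conflict == "ignore":
            return False  # keep the existing document
        if on_conflict == "error":
            raise KeyError(f"a document already exists at {uri}")
    database[uri] = document  # new URI, or on_conflict == "replace"
    return True

source = "C:/mydir/mydocument.xml"
default_uri = build_uri("/content{$path}/{$filename}.{$ext}", source)
stripped_uri = build_uri(
    '/content{$path strip-prefix="C:/mydir"}/{$filename}.{$ext}', source
)
```

Here `default_uri` is /contentC:/mydir/mydocument.xml and `stripped_uri` is /content/mydocument.xml, matching the examples in step 3. Trying a structure in a sketch like this before loading can help you spot URIs that would collide.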

Configuring Document Access Permissions

As described in Document Permissions in the Understanding and Using Security Guide, you can specify permissions on the ingested documents to control which users can access them and in what manner.

To set the permissions for the documents ingested by this flow, do the following:

  1. In the Document Settings dialog, select the Permissions tab.
  2. Enter the name of the role, then select the permission to assign to the role from the drop-down list.

  3. To add a new permission, click New Permission and repeat the procedure described in the previous step. You can add as many permissions for as many roles as you like.

Configuring Collections

Collections are described in detail in Protected Collections in the Administrator's Guide. This section describes how to specify the collections to associate with the documents ingested by a flow.

To set the collections for the documents ingested by a flow, use the following steps:

  1. In the Document Settings dialog, select the Collections tab.
  2. Add or remove collections in the Destination Collections dialog.

Configuring Quality Boost

Quality Boost associates all ingested documents with the specified quality value. A positive value increases the relevance score of the document in text search functions. The converse is true for a negative value. Leaving this field blank specifies the default document quality, which is 0.

To set the quality boost for the documents ingested by a flow, select the Quality Boost tab in the Document Settings dialog.

Launching Ingestion and Tracking Status

The Status portion of the Flow Editor displays the status of the ingest operations and any resulting errors when ingestion has completed.

  1. To begin ingesting documents into the database, click Start Loading.

    • The Status section displays the ticket status and progress of the documents at the Collecting, Processing, and Loaded stages of the flow.

    • When the ingestion is complete, the Status indicates the ticket status as 'completed' and displays the number of documents that were successfully ingested into the database. If there are errors, you can click on the errors link. The errors window includes both collection errors and processing errors.

  2. (Optional) Click on an error for more detail.

  3. (Optional) To remove the loaded documents from the database, click Unload.

  4. (Optional) To review the documents you loaded, go to the Query Console. Select the Content Source and click Explore. The documents in the database are listed in the results pane at the bottom of the page.

Deleting a Flow

To delete a flow, click the Delete link associated with the flow in the Information Studio Flows section of the Application Services page. Click OK in the "Are you sure you'd like to delete this flow?" popup window.
