Search Developer's Guide — Chapter 11

Marking Up Documents With Entity Enrichment

This chapter describes how to use entity enrichment in MarkLogic Server to add XML markup around text for entities such as people and places. It contains the following sections:

  • Overview of Entity Enrichment
  • Built-In Entity Enrichment
  • Entity Enrichment Pipelines

Overview of Entity Enrichment

With XML, you can add element tags around text. If you add element tags around text that has a particular meaning, you can then search for those tags to find occurrences of that meaningful thing. People and places are common things to mark up, but many other things are useful to mark up as well, and many industries have domain-specific ones. For example, medical researchers might find it useful to mark up prescription drugs with a tag such as RX. The things to mark up are sometimes called entities, and the process of finding these entities and marking them up with meaningful, searchable tags is called entity enrichment.

MarkLogic Server includes built-in entity capabilities (licensed separately), and is also capable of integrating with third-party entity enrichment services. The Content Processing Framework makes it easy to call out to third-party tools, and there are some samples included to demonstrate this process, as described in Sample Pipelines Using Third-Party Technologies.

Built-In Entity Enrichment

The built-in entity enrichment in MarkLogic Server requires an entity enrichment license key and requires you to install a separate set of libraries. This section describes the installation procedure and provides some details on the core entity enrichment capabilities. This section includes the following parts:

  • Installing the Entity Enrichment Libraries
  • Details of MarkLogic Entity Enrichment

Installing the Entity Enrichment Libraries

Before you can use entity enrichment, you must first install the MarkLogicBasis package for your platform as follows:

  1. Ensure that you have a license key that includes entity enrichment. If you navigate to the Host Status page in the Admin Interface (Hosts > hostname > Status), it should list entity enrichment under Options. If you do not have the proper license, contact MarkLogic Support or your account manager.
  2. Obtain the entity enrichment download from MarkLogic Support and save the file on the machine on which MarkLogic Server runs.
  3. Install entity enrichment by performing the steps listed for your platform:
    Windows (64-bit)
    1. Locate the MarkLogicBasis-6.0-1-x86_64.msi file you downloaded and double-click it to run the installer.
    2. Accept the installer defaults.
    3. Restart MarkLogic Server (for example, from the Admin Interface).
    Linux (64-bit)
    1. Navigate to the directory in which you saved the MarkLogicBasis installation file. For example:
      % cd /space/tmp
    2. As the root user, run the following:
      % rpm -i MarkLogicBasis-6.0-1-x86_64.rpm
    3. Restart MarkLogic Server (for example, from the Admin Interface or by running /etc/init.d/MarkLogic restart as the root user).
    Solaris (64-bit SPARC and x64)
    1. Download the package to /var/spool/pkg.
    2. Unpack the compressed tar file in /var/spool/pkg with the following shell commands:
      % cd /var/spool/pkg
      % uncompress MARKlogicBasis-6.0-1-sparc.tar.Z
      % tar xf MARKlogicBasis-6.0-1-sparc.tar
      % rm MARKlogicBasis-6.0-1-sparc.tar

      The package name will end in amd64.tar.Z for x64 (64-bit AMD Opteron and Intel 64-bit EM64T). If you are installing a release other than 6.0-1, the package name will match the latest service release.

    3. As the root user, install the package with the following command:
      # pkgadd MARKlogicBasis
    Mac OS X
    Not available.

The entity enrichment package is installed in the <marklogic-dir>/MarkLogicBasis directory.

When MarkLogic Server restarts, you will see messages in the log file indicating that the entity enrichment libraries are loaded. The log messages are similar to the following:

2011-09-22 19:29:43.147 Info: Checking for Basis at /opt/MarkLogicBasis/rlp/lib/amd64-glibc25-gcc41/
2011-09-22 19:29:43.147 Info: Checking for Basis plugin at lib/BasisEntityExtractor.so
2011-09-22 19:29:43.160 Info: Loaded Basis entity extraction

Details of MarkLogic Entity Enrichment

You can enrich documents with entity markup in MarkLogic Server using the following APIs:

  • cts:entity-highlight
  • entity:enrich

The cts:entity-highlight API calls out to the built-in entity enrichment capabilities in MarkLogic Server (entity enrichment license key required for each language). The cts:entity-highlight API is similar to cts:highlight (see Highlighting Search Term Matches), except it finds entities and allows you to replace their content with other content. The entity:enrich API is implemented in an XQuery library module, and it uses cts:entity-highlight to mark up entities according to the <marklogic-dir>/Config/entity.xsd schema.

A valid entity enrichment license key is required to use cts:entity-highlight; without one, the function throws an exception. With a valid license for entity enrichment, you can enrich text in English and in any other language for which you have a valid license key. For text in a language for which you do not have a valid license key, cts:entity-highlight finds no entities. Additionally, you must have the entity enrichment package installed as described in Installing the Entity Enrichment Libraries.
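As a minimal sketch of calling cts:entity-highlight directly, the following query wraps each entity it finds in an element named after the entity type. It assumes the two-argument form of the function (a node and a replacement expression) and the $cts:entity-type and $cts:text variables that the function makes available inside that expression; check the function reference for your release before relying on it. The colon-to-hyphen replacement is there because entity type names can contain a colon, which is not valid inside a constructed element name.

```xquery
xquery version "1.0-ml";

(: Sketch only: requires the entity enrichment license key and libraries.
   Each entity found in $myxml is replaced by an element whose name is
   derived from the entity type and whose content is the matched text. :)
let $myxml := <node>George Washington never visited Norway.</node>
return
cts:entity-highlight($myxml,
  element { fn:replace($cts:entity-type, ":", "-") } { $cts:text })
```

Unlike entity:enrich, this form leaves the choice of markup to you, which can be useful if your content does not follow the entity.xsd schema.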

Entity enrichment finds entities of many types, including people and places. For a description of all of the entity types, see the cts:entity-highlight documentation in the MarkLogic XQuery and XSLT Function Reference. The following shows a simple example of using the built-in entity enrichment:

xquery version "1.0-ml";

import module namespace entity="http://marklogic.com/entity" 
    at "/MarkLogic/entity.xqy";

let $myxml := <node>George Washington never visited Norway.</node>
return
entity:enrich($myxml)

This returns the following:

<node>
  <entity:person xmlns:entity="http://marklogic.com/entity">George 
     Washington</entity:person> never visited 
  <entity:gpe xmlns:entity="http://marklogic.com/entity">Norway.
  </entity:gpe>
</node>
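
Once enriched documents are stored in a database, the entity elements can be searched like any other XML. The following is a minimal sketch, assuming that documents enriched with the entity namespace shown above have already been saved; the search term "Washington" is only an illustration.

```xquery
xquery version "1.0-ml";

declare namespace entity = "http://marklogic.com/entity";

(: Sketch only: return documents in which the word "Washington" occurs
   inside an entity:person element, that is, where it was marked up as
   part of a person's name rather than, say, as a place. :)
cts:search(fn:doc(),
  cts:element-word-query(xs:QName("entity:person"), "Washington"))
```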

Entity Enrichment Pipelines

MarkLogic Server includes Content Processing Framework (CPF) applications to perform entity enrichment on your XML. You can use the built-in capabilities, use the sample CPF applications for third-party entity extraction technologies, or create custom applications with your own technology or another third-party technology. This section includes the following parts:

  • MarkLogic Server Entity Enrichment Pipeline
  • Sample Pipelines Using Third-Party Technologies
  • Custom Entity Enrichment Pipelines

These CPF applications require you to install content processing on your database. For details on CPF, including information about domains and pipelines, see the Content Processing Framework Guide.

MarkLogic Server Entity Enrichment Pipeline

The MarkLogic Server entity enrichment pipeline uses the built-in entity enrichment capabilities in MarkLogic Server to add XML tags to text. When you set up the entity enrichment pipeline, all documents added to or modified in the domain are enriched using the built-in entity enrichment capabilities (entity enrichment license key required).

To configure the MarkLogic Server entity enrichment CPF application on your system, perform the following steps:

  1. Install content processing on the database in which you want to enrich your content.
  2. Set up your domain appropriately for your content.
  3. Attach the Status Change Handling and Entity Enrichment pipelines to your domain.

Any documents inserted or updated under the domain to which these pipelines are attached will now use the MarkLogic Server built-in entity enrichment, and the appropriate markup will be added to the documents.
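
The pipelines in step 3 are normally attached in the Admin Interface, but the same step can also be scripted. The following is a rough sketch, assuming that content processing is already installed (so that the pipelines are loaded), that the query is evaluated against the triggers database used by the domain, and that the pipeline is named as in the step above. The domain name "my-domain" is hypothetical. Run the query once for each pipeline you want to attach.

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
    at "/MarkLogic/cpf/domains.xqy";
import module namespace p = "http://marklogic.com/cpf/pipelines"
    at "/MarkLogic/cpf/pipelines.xqy";

(: Sketch only. "my-domain" is a placeholder for your own domain name.
   Change $pipeline-name to "Status Change Handling" and run the query
   again to attach the second pipeline. :)
let $domain-name := "my-domain"
let $pipeline-name := "Entity Enrichment"
return
  dom:add-pipeline($domain-name, p:get($pipeline-name)/p:pipeline-id)
```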

Sample Pipelines Using Third-Party Technologies

MarkLogic Server includes sample pipelines and CPF applications that connect to third-party entity enrichment tools. The sample pipelines are installed in the <marklogic-dir>/Installer/samples directory. There are sample pipelines for the following entity enrichment tools:

  • TEMIS Luxid®
  • Calais OpenCalais
  • SRA NetOwl
  • Janya
  • Data Harmony

MarkLogic Server connects to these tools via a web service. Sample code is provided on an as-is basis; the sample code is not intended for production applications and is not supported. For details, including setup instructions, see the README.txt file and the samples-license.txt file in the <marklogic-dir>/Installer/samples directory.

Custom Entity Enrichment Pipelines

You can create custom CPF applications to enrich your documents using other third-party enrichment applications. To create a custom CPF application, you need the third-party application and a way to connect to it (via a web service, for example), and you need to write XQuery code and a pipeline file similar to the ones used for the sample applications described in the previous section.
