AWS Database Blog

Building a knowledge graph in Amazon Neptune using Amazon Comprehend Events

On 28-Oct-22, the AWS CloudFormation template and Jupyter notebook linked in this post were updated to 1/ add openCypher queries alongside the existing Gremlin and SPARQL queries, 2/ use newer Amazon Linux 2 SageMaker notebook instances, 3/ fix a bug in the RDF generation code that improperly labeled a property as an RDF type, and 4/ improve a SPARQL query visualization in the notebook.

Organizations that need to keep track of financial events, such as mergers and acquisitions, bankruptcies, or leadership change announcements, do so by analyzing multiple documents, such as news articles, SEC filings, and press releases. This data is often unstructured or semi-structured text, which is hard to analyze without a predefined data model. You can use Amazon Comprehend to extract entities from unstructured text. At AWS re:Invent 2020, Amazon Comprehend launched Amazon Comprehend Events, a new API for extracting financial events from natural language text documents. With this launch, you can use Amazon Comprehend Events to extract granular details about real-world financial events and associated entities expressed in unstructured text.

After you extract the data, you need to organize it to find patterns, navigate relationships across different entities, and build knowledge graph applications for trading or detecting bad actors. Using a knowledge graph is an easy way to organize the information. For example, a financial analysis team can build a knowledge graph to explore the mergers and acquisitions of companies and connect people and organizations to the event. You can use these same techniques to improve your “know your customer” (KYC) and customer 360 or identity graph projects by leveraging unstructured text.

In the post Announcing the launch of Amazon Comprehend Events, we demonstrated how to use the Amazon Comprehend Events API to analyze Amazon press releases and visualize the extracted events as a graph in an Amazon SageMaker notebook. In this post, we show you how to build a knowledge graph of the extracted events and entities in Amazon Neptune by integrating the output from the Amazon Comprehend Events API on those press releases.

Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run business-critical graph applications. Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying your graph with millisecond latency. The Amazon Comprehend Events API can perform coreference resolution between events within a single document, which is the grouping of expressions in text that refer to the same thing or person. Neptune allows you to bring those documents together and analyze the contents as a connected graph at scale, using graph query languages such as Gremlin and SPARQL. Neptune Workbench allows you to visualize the documents to intuitively observe the events and relationships.

The following diagram is a Neptune Workbench visualization of the relationship between a document, a corporate acquisition event, and the organizations (with their roles) involved in that event.

In the following visual representation of our data model, we model every source document, extracted entity, and extracted event as a node in the graph. For example, a “press release” is a document, “Whole Foods Market” is an entity, and “acquires” is an event. These nodes are linked with edges from the source document to the events found within that document, as well as edges between the events and the key entities associated with them (referred to in Amazon Comprehend Events as arguments). Our nodes and edges are labeled with the type of entity and relationship detected, respectively.

Solution overview

Integrating Amazon Comprehend Events with Neptune has three steps:

  1. Retrieve the results from the Amazon Comprehend Events API.
  2. Transform the JSON-formatted output into the graph format that Neptune supports.
  3. Load the graph data into Neptune.

For this post, we already completed the first step by sampling 118 press releases from Amazon’s Press Center dated between 2017 and 2020, running them through Amazon Comprehend Events, and making the output available in a public Amazon Simple Storage Service (Amazon S3) bucket. Later in this post, we share an AWS CloudFormation template that you can use to set this up in your AWS account to try for yourself, and to use as a template for your own knowledge graph project.
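
For reference, if you want to produce the same kind of output for your own documents, you can start an asynchronous events detection job with the AWS SDK for Python (Boto3). The following is a minimal sketch; the bucket paths, IAM role ARN, job name, and list of target event types are placeholders you would replace with your own values.

import boto3

comprehend = boto3.client("comprehend")

# Start an asynchronous events detection job over documents stored in Amazon S3.
# The S3 URIs and the data access role ARN below are placeholders.
response = comprehend.start_events_detection_job(
    JobName="press-release-events",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    InputDataConfig={
        "S3Uri": "s3://your-input-bucket/press-releases/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://your-output-bucket/events-output/"},
    TargetEventTypes=[
        "CORPORATE_ACQUISITION",
        "CORPORATE_MERGER",
        "INVESTMENT_GENERAL",
    ],
)

# Poll the job; the JSON Lines output appears in the output S3 location when it completes.
job_id = response["JobId"]
status = comprehend.describe_events_detection_job(JobId=job_id)
print(status["EventsDetectionJobProperties"]["JobStatus"])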

Transforming the Amazon Comprehend Events output into a graph

Each document we sent to Amazon Comprehend Events appears as a JSON-formatted object on a single line in the output document (JSON Lines format). If we “pretty print” one of the lines in that output, it has the general structure shown in the following diagram (ellipses truncate fields we don’t use in this post).

The preceding figure illustrates the event structure in the document. The document contains two organizations, Whole Foods Market and Amazon, both playing roles in a corporate acquisition event (as the investee and investor, respectively), along with the phrases and locations within the text that signify each. The API provides confidence scores for both the assignment of a class label (a score for mentions, arguments, and triggers) and membership in a given coreference group (mentions and triggers). For more information about scores, see Detect Events.
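
To make that structure concrete, the following is a short sketch (assuming the field names shown above: Events, Triggers, Arguments, Entities, and Mentions) of how you could walk one line of the JSON Lines output in Python and print each event’s trigger along with the entities playing each role. The file name is a placeholder.

import json

# Placeholder path to the Amazon Comprehend Events output (JSON Lines, one document per line).
with open("events_output.jsonl") as f:
    doc = json.loads(f.readline())

# Each event carries trigger phrases and arguments that index into the document's Entities list.
for event in doc.get("Events", []):
    trigger = event["Triggers"][0]["Text"] if event.get("Triggers") else ""
    print(f"{event['Type']} (trigger: {trigger})")
    for argument in event.get("Arguments", []):
        entity = doc["Entities"][argument["EntityIndex"]]
        first_mention = entity["Mentions"][0]
        print(f"  {argument['Role']}: {first_mention['Text']} ({first_mention['Type']})")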

Using the graph model described earlier, we need to transform the JSON output into a format that Neptune supports. The output gives us three different types of nodes:

  • Documents – The documents analyzed by Amazon Comprehend Events
  • Events – The financial events detected in the document
  • Entities – The people, organizations, places, and other entities referenced in the document

The JSON structure provided by the API indicates that entities and events have a relationship to a specific document. Events also have relationships to one or more entities in specific roles. We also want to look at the relevant properties associated with each node. The model is as follows (a code sketch of the transformation appears after this list):

  • Documents (nodes labeled DOCUMENT) contain a unique reference to the input record.
  • Events are labeled with the type of event detected, as noted in the JSON object field Type (ACQUISITION in the example) and have two properties:
    • primaryName – The Text property of the first entry in the list of triggers (for example, buyout).
    • names – A set of all the Text property values found in the list of triggers.
  • Entities are labeled with the type of entity identified, as noted in the first object in the Mentions list for that entity (ORGANIZATION in the example). Like events, they have two properties:
    • primaryName – The Text property of the first entry in the list of mentions (Whole Foods Market, Inc.).
    • names – A set of all the Text property values found in the list of mentions (Whole Foods Market, Inc., Whole Foods Market).
  • Documents have an edge to the events found within that document. That edge is labeled as EVENT and has no associated properties.
  • Events have an edge to the entities referenced in the Arguments collection of that event. The edge is labeled with the value in the Role field (INVESTOR or INVESTEE in the example).
  • We want mentions representing the same real-world entity to share a node identifier so they point to the same node within our graph (also known as entity resolution). We use the entity resolution that Amazon Comprehend Events provides within documents and extend it across documents by naming our entities using the pattern {label}_{primaryName}. When this data is loaded into Neptune, any document containing an organization whose first mention is Amazon is associated with the same node through the identifier organization_Amazon.
  • Non-entity nodes may have common names that don’t represent a common real-world event. For example, two events of type ACQUISITION with a trigger word buyout likely represent two different acquisitions and not the same acquisition. Therefore, we ensure the identifiers of these nodes are named uniquely.
  • The Amazon Comprehend Events API returns a confidence score for all mentions, triggers, and arguments of a given event, as well as a group score for entity and trigger group membership. Depending on the use case, you may want to filter out the lower-confidence results to keep a higher-precision graph at the expense of potentially missing some harder-to-detect results.
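
The notebook provided with this post implements this transformation for you. To make the rules above concrete, here is a simplified sketch (not the notebook’s exact code) that converts Comprehend Events output into the node and edge CSV files used by the Neptune bulk loader for property graphs. The helper names and file paths are ours, and confidence-score filtering is omitted for brevity.

import csv
import json
import uuid

def node_id(label, text):
    # Shared identifier so the same real-world entity resolves to one node across documents.
    return f"node__{label.lower()}_{text.lower().replace(' ', '_')}"

nodes = {}   # node id -> (label, primaryName, names)
edges = []   # (edge id, from, to, label)

with open("events_output.jsonl") as f:
    for line_no, line in enumerate(f):
        doc = json.loads(line)
        doc_id = f"node__document_{line_no}"
        nodes[doc_id] = ("DOCUMENT", f"document_{line_no}", f"document_{line_no}")

        entity_ids = []
        for entity in doc.get("Entities", []):
            mentions = entity["Mentions"]
            label, primary = mentions[0]["Type"], mentions[0]["Text"]
            eid = node_id(label, primary)
            nodes[eid] = (label, primary, ";".join({m["Text"] for m in mentions}))
            entity_ids.append(eid)

        for event in doc.get("Events", []):
            # Event nodes get unique ids: two "buyout" triggers are not the same acquisition.
            ev_id = f"node__event_{uuid.uuid4().hex}"
            triggers = event.get("Triggers", [])
            primary = triggers[0]["Text"] if triggers else event["Type"]
            nodes[ev_id] = (event["Type"], primary, ";".join({t["Text"] for t in triggers}))
            edges.append((uuid.uuid4().hex, doc_id, ev_id, "EVENT"))
            for arg in event.get("Arguments", []):
                edges.append((uuid.uuid4().hex, ev_id, entity_ids[arg["EntityIndex"]], arg["Role"]))

# Write the node and edge CSV files expected by the Neptune bulk loader for Gremlin.
with open("nodes.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~label", "primaryName:String", "names:String[]"])
    w.writerows([nid, *props] for nid, props in nodes.items())

with open("edges.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~from", "~to", "~label"])
    w.writerows(edges)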

The following diagram shows the entity resolution between documents using a common naming pattern for the entity node.

Neptune provides the Neptune Workbench, an in-console experience to query your graph. The workbench lets you quickly and easily query your Neptune databases with Jupyter notebooks—a fully managed, interactive development environment with live code and narrative text. For this post, we include the notebook Neptune_Knowledge_Graph_Unstructured, which contains the Python code to transform the Amazon Comprehend Events output into the two CSV files for nodes and edges required by the Neptune bulk loader for Gremlin graphs, and an N-Triples file for loading the data as RDF. We then copy those files into an S3 bucket that the Neptune bulk loader uses.
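
For the RDF load, the notebook writes the same model out as N-Triples. The following is a rough sketch of the kind of triples involved for a single acquisition event; the URIs follow the example.org namespaces that appear in the SPARQL queries later in this post, but the event URI scheme and the casing of the role predicates are our assumptions, not necessarily what the notebook emits.

from urllib.parse import quote

ENTITIES = "http://example.org/entities"
RELATIONS = "http://example.org/relations"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def entity_uri(label, name):
    # For example: <http://example.org/entities/organization/whole%20foods%20market>
    return f"<{ENTITIES}/{label.lower()}/{quote(name.lower())}>"

def triple(subject, predicate, obj):
    return f"{subject} <{predicate}> {obj} ."

# A hypothetical corporate acquisition event linking two organizations, as N-Triples.
event_uri = "<http://example.org/events/event-1>"  # assumed URI scheme for event nodes
lines = [
    triple(entity_uri("organization", "amazon"), RDF_TYPE, f"<{ENTITIES}/organization>"),
    triple(entity_uri("organization", "whole foods market"), RDF_TYPE, f"<{ENTITIES}/organization>"),
    triple(event_uri, RDF_TYPE, "<http://example.org/events/corporate_acquisition>"),
    triple(event_uri, f"{RELATIONS}/investor", entity_uri("organization", "amazon")),
    triple(event_uri, f"{RELATIONS}/investee", entity_uri("organization", "whole foods market")),
]

with open("graph.nt", "w") as f:
    f.write("\n".join(lines) + "\n")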

Provisioning your resources

The easiest way to follow along with this post is to use the provided CloudFormation template. The template sets up a new Neptune cluster and adds a Neptune Workbench notebook with code for parsing the Amazon Comprehend Events JSON output, instructions for how to load the data into Neptune in both Gremlin and RDF formats using the notebook, and Gremlin and SPARQL queries to analyze and visualize the data. The following diagram illustrates the architecture for the stack.

To deploy the CloudFormation template and try running the solution yourself, complete the following steps:

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. For the Amazon S3 URL, enter https://aws-neptune-customer-samples.s3.amazonaws.com/knowledge-graph-unstructured/neptune-kg-unstructured-blog-stack.yml.
  4. Choose Next.
  5. Enter a stack name of your choosing.
  6. Choose Next.
  7. Continue through the remaining sections.
  8. Read and select the check boxes in the Capabilities section.
  9. Choose Create stack.
  10. When the stack creation is complete (approximately 15 minutes), on the Outputs tab for the stack, find the value for NeptuneSageMakerNotebook.
  11. Choose the notebook to navigate to the Neptune Workbench notebook instance.

You should now have a Jupyter Notebook instance as shown in the following screenshot.

  12. Choose Neptune_Knowledge_Graph_Unstructured.ipynb to open the notebook containing the code for this post.

The first two cells of the notebook contain Python code that streams the Amazon Comprehend Events output from a public S3 bucket, converts it to graph format using the model described previously, and saves it to files on the Neptune Workbench instance.

  13. Run the first two cells by choosing Play for each cell.

While each cell is running, an asterisk appears between the brackets after In. When it’s finished, the asterisk changes to a sequential number signifying the order of the steps. The word Complete also appears below the step.

When the first two cells are complete, you copy the generated files into your S3 bucket so you can load them into Neptune. The CloudFormation script injected the name of the S3 bucket it created into the notebook as an environment variable named S3_WORKING_BUCKET.

  14. Run the third cell to copy these files to the S3 bucket.

The output of this cell prints two different S3 paths: one for property graph (Gremlin) data and one for RDF data. These paths are needed for the bulk loading step.
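
If you want to perform the same copy outside the notebook, it amounts to a few S3 uploads. The following is a sketch using Boto3; the local file names are illustrative, and the prefixes match the pg/ and rdf/ paths used in the load steps that follow.

import os
import boto3

s3 = boto3.client("s3")
bucket = os.environ["S3_WORKING_BUCKET"]  # injected into the notebook by the CloudFormation stack

# Property graph CSVs go under the pg/ prefix and the N-Triples file under rdf/,
# matching the source paths used for the Neptune bulk loads below.
for local_path, key in [
    ("nodes.csv", "pg/nodes.csv"),
    ("edges.csv", "pg/edges.csv"),
    ("graph.nt", "rdf/graph.nt"),
]:
    s3.upload_file(local_path, bucket, key)
    print(f"Uploaded s3://{bucket}/{key}")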

Now continue with the following sections to follow along with the notebook.

Loading the data into Neptune

In this section, we use the Neptune Workbench magics to load data into Neptune.

The CloudFormation script creates an AWS Identity and Access Management (IAM) role that permits Neptune to access the S3 bucket we copied the data into, and attaches that IAM role to our cluster. It also creates an Amazon S3 VPC endpoint for our Neptune cluster’s VPC to facilitate accessing Amazon S3. To create a similar pattern without using our CloudFormation script, see Prerequisites: IAM Role and Amazon S3 Access.

We use the %load magic in the Workbench. Complete the following steps:

  1. Enter %load and run the cell.
  2. In the form, for Source, enter the S3 path to the folder containing the property graph files: s3://{bucketname}/pg/.

You can find this path on the stack Outputs tab.

  3. For Format, choose CSV.
  4. Confirm that Load ARN is automatically populated with the IAM role the CloudFormation stack created.
  5. Keep the remaining fields at their default values.
  6. Choose Submit.

  7. When you see the message Load_Completed, run the same cell again, but this time load the RDF file.
  8. For Source, enter the path s3://{bucketname}/rdf/.
  9. For Format, choose ntriples.
  10. Choose Submit.
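
The %load magic is a convenience wrapper around the Neptune bulk loader HTTP API. If you prefer to start the same loads from a script instead of the workbench form, a sketch like the following issues equivalent requests; the cluster endpoint, role ARN, and bucket name are placeholders, the caller must have network access to the cluster on port 8182, and you would need to add Signature Version 4 signing if IAM authentication is enabled on the cluster.

import requests

NEPTUNE_ENDPOINT = "https://your-neptune-cluster.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com:8182"
LOAD_ROLE_ARN = "arn:aws:iam::123456789012:role/NeptuneLoadFromS3"  # IAM role attached to the cluster

def start_load(source, data_format):
    # POST a load request to the Neptune bulk loader; the response contains a load id to poll.
    response = requests.post(f"{NEPTUNE_ENDPOINT}/loader", json={
        "source": source,
        "format": data_format,
        "iamRoleArn": LOAD_ROLE_ARN,
        "region": "us-east-1",
        "failOnError": "FALSE",
    })
    response.raise_for_status()
    return response.json()["payload"]["loadId"]

pg_load_id = start_load("s3://your-working-bucket/pg/", "csv")
rdf_load_id = start_load("s3://your-working-bucket/rdf/", "ntriples")

# Check the status of a load.
print(requests.get(f"{NEPTUNE_ENDPOINT}/loader/{pg_load_id}").json())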

Now that our graph has been loaded into Neptune, we can move on to analyzing the results.

Analyzing the results in Neptune

Now that we have populated our knowledge graph with the output from Amazon Comprehend Events, we can run queries to analyze our graph.

The first query we run shows our top six organizations in order of decreasing number of incoming edges (number of financial events they were associated with). See the following code in Gremlin:

%%gremlin
g.V().hasLabel('ORGANIZATION').
  order().
  by(inE().count(),decr).
  limit(6).
  project('primaryName','edgeCount','nodeId').
    by('primaryName').
    by(inE().count()).
    by(T.id)

1    {'edgeCount': 270, 'nodeId': 'node__organization_amazon', 'primaryName': 'Amazon'}
2    {'edgeCount': 40, 'nodeId': 'node__organization_businesses', 'primaryName': 'businesses'}
3    {'edgeCount': 36, 'nodeId': 'node__organization_amazon.com,_inc.', 'primaryName': 'Amazon.com, Inc.'}
4    {'edgeCount': 30, 'nodeId': 'node__organization_company', 'primaryName': 'company'}
5    {'edgeCount': 18, 'nodeId': 'node__organization_whole_foods_market', 'primaryName': 'Whole Foods Market'}
6    {'edgeCount': 13, 'nodeId': 'node__organization_amazon.com', 'primaryName': 'Amazon.com'}

In SPARQL, you can use the following query to get the edge count by node:

%%sparql

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX entities: <http://example.org/entities/>
PREFIX rels: <http://example.org/relations/>

SELECT DISTINCT ?org (COUNT(*) as ?cnt) WHERE {
    ?org rdf:type entities:organization .
    ?event ?role ?org .
} GROUP BY ?org
ORDER BY DESC(?cnt)
LIMIT 6

The list contains three different vertices representing Amazon: node__organization_amazon, node__organization_amazon.com,_inc., and node__organization_amazon.com. We also see Whole Foods Market (node__organization_whole_foods_market), so we use these organizations for our analysis.

Now let’s construct a query that shows all the extracted Amazon Comprehend Events connecting Amazon and Whole Foods Market. We also add in some parameters to the %%gremlin cell magic that tell the graph visualization engine in Neptune Workbench how to best render the graph. For more information about visualization hints in Neptune Workbench, see Graph visualization in the Neptune workbench. We know from our data model that organization nodes have incoming edges from event nodes labeled with the role the organization played in the event. Therefore, the following query illustrates this path and returns the values we want to display on the visualization:

%%gremlin -p v,ine,outv,oute,inv

g.V(['node__organization_amazon','node__organization_amazon.com,_inc.','node__organization_amazon.com']).as('amazon').
    inE().as('roleEdge').
    outV().as('eventNode').
    outE().as('otherRoleEdge').
    inV().hasId('node__organization_whole_foods_market').as('otherOrg').
    path().by('primaryName').by().by().by().by('primaryName')

The following visualization shows the various paths between Amazon and Whole Foods Market.

This graph illustrates that six events are detected in our corpus linking Amazon and Whole Foods Market. If we mouse over the event nodes, we can see from the full labels that these events are CORPORATE_MERGER, CORPORATE_ACQUISITION, and INVESTMENT_GENERAL. The edge labels show the roles of each company in those events.

Although we can visually follow the various paths between the two organizations, we may want to instead aggregate them as a list of paths with a count of the number of times each one occurs. This is the same data shown in the graph visualization, so you can confirm the counts are correct. See the following code:

%%gremlin
g.V(['node__organization_amazon','node__organization_amazon.com,_inc.','node__organization_amazon.com']).as('amazon').
    inE().as('roleEdge').
    outV().as('eventNode').
    outE().as('otherRoleEdge').
    inV().hasId('node__organization_whole_foods_market').as('otherOrg').
    groupCount().by(path().from('roleEdge').to('otherRoleEdge').by(label))

The results of this query are as follows:

{path[PARTICIPANT, CORPORATE_MERGER, PARTICIPANT]: 1, 
 path[INVESTOR, CORPORATE_ACQUISITION, INVESTEE]: 3, 
 path[INVESTOR, INVESTMENT_GENERAL, INVESTEE]: 2}

You can see the same results with SPARQL using the following query:

%%sparql

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX entities: <http://example.org/entities/>
PREFIX rels: <http://example.org/relations/>
PREFIX org: <http://example.org/entities/organization/>

SELECT ?role_1 ?event_type ?role_2 (COUNT(*) as ?cnt) WHERE {
    VALUES ?start {<http://example.org/entities/organization/amazon> <http://example.org/entities/organization/amazon.com%2C%20inc.> <http://example.org/entities/organization/amazon.com>}
    ?event ?role_1 ?start .
    ?event rdf:type ?event_type .
    ?event ?role_2 <http://example.org/entities/organization/whole%20foods%20market> .
} GROUP BY ?role_1 ?event_type ?role_2
ORDER BY DESC(?cnt)

The output shows that, of the six events, three are CORPORATE_ACQUISITION events where Amazon is in the INVESTOR role and Whole Foods Market in the INVESTEE role, one is CORPORATE_MERGER where Amazon and Whole Foods Market are both in the PARTICIPANT role, and two are INVESTMENT_GENERAL events with Amazon in the INVESTOR role and Whole Foods Market in the INVESTEE role.

You can use the graph model described earlier to build your own set of queries to incorporate the source document objects to view the source of each document, or examine all the events in a subset of the documents.

Cleaning up

When you’re done experimenting with the graph, you still have a Neptune cluster and a Neptune Workbench instance running. To avoid incurring ongoing costs for these services, first run the last cell in the notebook to delete the graph files that we copied to the S3 bucket, then return to the AWS CloudFormation console and delete the root stack that we created. This removes all the infrastructure created for this post. If stack deletion fails, the most likely cause is that you didn’t run the last cell in the notebook, so the S3 bucket wasn’t empty and couldn’t be deleted.

Conclusion

In this post, we demonstrated how you can use the Amazon Comprehend Events API and Neptune to create a knowledge graph, extracting granular details about real-world events and associated entities from your unstructured text with minimal expertise or analysis. We saw how to work backwards from our goals to create a data model for our use case and how to leverage that model to link events from multiple disparate documents into a single knowledge graph. We utilized the Neptune Workbench to transform the Amazon Comprehend Events output into Neptune bulk loader files for both the property graph and RDF, loaded that data, and ran both Gremlin and SPARQL queries on that data to discover findings across our knowledge graph.

You can use the solution in this post as a foundation to build your own knowledge-based effort or improve your existing projects in areas like “know your customer,” knowledge graphs, or customer 360 and identity graphs by linking the graph with your existing corporate knowledge. You can use the procedure highlighted in this post to analyze your organization’s financial data and create a knowledge graph that you can query for useful insights.


About the Authors

Brian O’Keefe is a Senior Specialist Solutions Architect at Amazon Web Services (AWS) focused on Neptune. He works with customers and partners to solve business problems using Amazon graph technologies. He has over two decades of experience in various software architecture and research roles, many of which involved graph-based applications.

Navtanay Sinha is a Senior Product Manager at AWS. He works with graph technologies to help Amazon Neptune customers fully realize the potential of their graph database.

Graham Horwood is a data scientist at Amazon AI. His work focuses on natural language processing technologies for customers in the public and commercial sectors.