In today’s rapidly evolving world, an increasing number of resources are becoming available on the Internet, leading to exponentially growing data in various forms. As a result, the concept of the Knowledge Graph has become a crucial topic from an industry perspective. These graphs are often constructed from semi-structured knowledge sources, such as Wikipedia, or harvested from the web using a combination of statistical and linguistic methods. The outcome is large-scale Knowledge Graphs that represent a trade-off between correctness and completeness.
1. Problem Statement
The main objective of this paper is to examine the relationships among various knowledge graphs available today, such as the Google Knowledge Graph, DBpedia, and the Wolfram Knowledge Graph, and to explain and compare them based on the following questions:
- How are these knowledge graphs implemented?
- What is the underlying technology used?
- How are they similar to each other?
- How do they differ from one another?
The paper is arranged in the following sections:
- Section 2 provides a detailed introduction to knowledge graphs.
- Section 3 explains the common underlying architecture of knowledge graphs.
- Section 4 analyzes different knowledge graphs in detail.
- Section 5 concludes the paper.
2. Introduction
According to Wikipedia, the Knowledge Graph is a knowledge base used by Google to enhance its search engine's results with semantic-search information gathered from a wide variety of sources.[6]
In summary, Knowledge Graphs on the web serve as the backbone and supporting structure for many information systems that require access to structured knowledge, whether domain-specific or domain-independent. The concept of providing intelligent systems and agents with general, formalized knowledge of the world dates back to classic Artificial Intelligence research in the 1980s and has been a significant asset to the computer science industry ever since.
In recent years, the representation of general knowledge as a graph has gained considerable attention, largely due to the advancements in Linked Open Data sources like DBpedia, as well as Google’s announcement of the Google Knowledge Graph in 2012. There are various methods for constructing these Knowledge Graphs: they can be curated, as in the cases of Cyc, Freebase, and Wikidata, or extracted from large-scale, semi-structured web knowledge bases such as Wikipedia, DBpedia, and YAGO. Regardless of the approach used to create a Knowledge Graph, the results often fall short of complete accuracy.
As models of either the entire real world or specific parts of it, Knowledge Graphs cannot reasonably achieve full coverage. It is inefficient to include information about every entity in the universe. Moreover, it is unlikely, particularly when heuristic methods are applied, that a Knowledge Graph will be entirely correct, resulting in a trade-off between coverage and correctness, with each Knowledge Graph addressing this balance in its own way.
From its early days, the Semantic Web has promoted a graph-based representation of knowledge, e.g., by pushing the RDF standards. In such a representation, entities (the nodes of the graph) are connected by relations (the edges of the graph); in lay terms, for example, Shakespeare has written Hamlet. Entities can also have types, denoted by an "is a" relation: in our example, Shakespeare is a writer and Hamlet is a play.[1]
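To make this concrete, the following is a minimal sketch of such a graph in Python using the rdflib library; the http://example.org/ namespace and the property names (wrote, Writer, Play) are illustrative assumptions rather than a published vocabulary.

```python
# A minimal sketch of the Shakespeare example as an RDF graph (rdflib).
# The example.org namespace and property names are illustrative assumptions.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.Shakespeare, EX.wrote, EX.Hamlet))   # relation edge: Shakespeare -- wrote --> Hamlet
g.add((EX.Shakespeare, RDF.type, EX.Writer))   # "is a" edge: Shakespeare is a Writer
g.add((EX.Hamlet, RDF.type, EX.Play))          # "is a" edge: Hamlet is a Play

print(g.serialize(format="turtle"))
```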
In most cases, the set of possible types and relations is organized in a schema or ontology, which further defines their interrelations and restrictions on their usage. With the advent of Linked Data, it was further proposed to interlink different datasets in the Semantic Web; through such interlinking, the resulting collection can be understood as one large, global knowledge graph. To date, about 1,000 datasets are interlinked in the Linked Open Data cloud, with the majority of links connecting identical entities in two datasets.
The term Knowledge Graph was coined by Google in 2012, referring to its use of semantic knowledge in web search ("things, not strings"), but more recently the term has also been used to refer to Semantic Web knowledge bases such as DBpedia or YAGO. From an overall perspective, any graph-based representation of knowledge could be considered a knowledge graph, including any kind of RDF dataset as well as description logic ontologies. However, there is still no common definition of what a knowledge graph is and what it is not.
Hence, instead of attempting a formal definition, most people working in this field restrict themselves to a minimum set of characteristics that distinguish knowledge graphs from other collections of knowledge. To summarize, a knowledge graph
- mainly describes real-world entities and their interrelations, organized in a graph;
- defines possible classes and relations of entities in a schema;
- consists of interrelated entities;
- is not constrained to a single domain.
3. Architecture
Due to their schemaless nature, knowledge graphs have been widely adopted, as this characteristic enables a knowledge graph to grow seamlessly while allowing new relationships and entities to be added. The graph-based nature of a knowledge graph also makes it possible to link it to other graphs, leading to easy and reliable integration of multiple kinds of information, which further enhances the integrity of the information.[3]
Exploring these graphs leads to the discovery of new connections, links, and commonalities between items and users; hence the knowledge graph has become a powerful tool for representing knowledge as a labeled directed graph that adds semantics to textual information. Like many other graphs, a knowledge graph is constructed by representing each of its items, entities, and users as nodes, and then linking the nodes that interact with each other via a set of edges.
A knowledge graph is by nature a multi-relational graph, composed of entities as nodes and relations as different types of edges. Each edge is an instance of a fact triple (head entity, relation, tail entity), usually denoted as (h, r, t). For example, the fact "The Fountainhead is written by Ayn Rand" can be stored in a knowledge graph by constructing two nodes, one for The Fountainhead and one for Ayn Rand, and one directed edge, written by, which depicts the relationship between these two nodes (see the sketch after the list below). In general, constructing a knowledge graph involves three main steps:
- Identifying the entities required to describe each aspect of a sentence
- Defining a specific data acquisition strategy
- Defining a specific data processing strategy
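To illustrate the (h, r, t) representation, here is a minimal sketch of a triple store in Python; the class and method names are illustrative assumptions, not part of any particular knowledge-graph system.

```python
# A minimal sketch of storing (head, relation, tail) triples for the
# Fountainhead example above. The class and method names are illustrative
# assumptions, not part of any particular knowledge-graph system.
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.triples = set()                    # {(h, r, t), ...}
        self.out_edges = defaultdict(list)      # h -> [(r, t), ...]

    def add(self, head, relation, tail):
        self.triples.add((head, relation, tail))
        self.out_edges[head].append((relation, tail))

    def neighbors(self, head):
        """Outgoing (relation, tail) pairs for a head entity."""
        return self.out_edges[head]

kg = TripleStore()
kg.add("The Fountainhead", "written_by", "Ayn Rand")
print(kg.neighbors("The Fountainhead"))         # [('written_by', 'Ayn Rand')]
```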
In a real-world setting, the first step mainly focuses on answering a couple of questions:
- What are the nodes?
In the context of a knowledge graph, nodes correspond to semantic concepts such as persons, entities, events, etc.
- What are the edges?
In a knowledge graph, edges are defined by the relationships between nodes based on their semantics.
The second step focuses on acquiring the data needed to build the knowledge graph. This data can be collected from various sources, such as reputable open semantic databases like Wikipedia, or general web content of higher quality such as newspaper and other article archives. In most of these sources, the content has already been cleaned of low-quality material such as spam, pornography, and low-quality ads. Other data acquisition sources include open crawl databases such as Common Crawl, which comprise unprocessed dumps of crawls of high-quality links.
The third step is processing, during which one devises algorithms and heuristics to identify and extract the required knowledge. In general, there are two main approaches to knowledge processing:
First, a simple heuristic approach, which includes text processing based on regular expressions or simple parsing using NLP techniques. For image data, this approach includes basic processing of metadata. The lack of depth of this approach can be compensated for by sheer numbers, i.e., by making use of lots of data.
Second, deep learning techniques, where one can employ novel methods to experiment with one's own kinds of knowledge. This approach can be very time-consuming but will likely keep you on the leading edge.
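As a concrete illustration of the first, heuristic approach, the following is a minimal sketch in Python that extracts (person, born_in, year) triples with a regular expression; the pattern and the example sentence are illustrative assumptions, not part of any production pipeline.

```python
# A minimal sketch of the simple heuristic approach: a regular expression
# that pulls (person, born_in, year) triples out of raw text. The pattern and
# the example sentence are illustrative assumptions.
import re

BORN_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<year>\d{4})"
)

def extract_birth_facts(text):
    """Return (head, relation, tail) triples found by the heuristic."""
    return [(m.group("person"), "born_in", m.group("year"))
            for m in BORN_PATTERN.finditer(text)]

print(extract_birth_facts("William Shakespeare was born in 1564."))
# [('William Shakespeare', 'born_in', '1564')]
```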
The Process:
1. Information Extraction: First, we need to ensure that we have access to the correct template. This phase involves collecting the data that we can use as a test set to refine the extraction and later feed into the linking phase.
2. Linking: This phase maps the relationships between entities, e.g., connecting an actor to the films he has starred in and to other actors he has worked with.
3. Analysis: This phase discovers and categorizes information about an entity from the content, such as the type of food a restaurant serves, or from sentiment data, such as whether the restaurant has positive reviews.
A. Knowledge Base Construction
Below is an example of how a knowledge base is constructed.[4] Say the input consists of sentences like "U.S. President Barack Obama's wife Michelle Obama.", and the output consists of tuples in a "has spouse" table representing facts such as "Barack Obama is married to Michelle Obama".
Knowledge base construction uses specific terms to refer to the objects it manipulates. In the above example (see the sketch after this list):
- "Barack Obama" is an entity.
- "Barack" in the sentence "Barack and Michelle are married" is a mention.
- "Barack", "Obama", or "the president" may all refer to the same entity, "Barack Obama"; entity linking is the process of figuring this out.
- In the sentence "Barack and Michelle are married", the two mentions "Barack" and "Michelle" have a mention-level relation of has spouse.
- The entities "Barack Obama" and "Michelle Obama" have an entity-level relation of has spouse.
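The following is a minimal, hypothetical sketch of these terms in Python: surface mentions are linked to canonical entities through a small alias table, and a mention-level has spouse relation is promoted to an entity-level fact. The alias table and the pattern are illustrative assumptions, not a real knowledge base construction system.

```python
# A hypothetical sketch of the terminology above: mentions are linked to
# canonical entities via a small alias table, and a mention-level has_spouse
# relation is promoted to an entity-level fact. The alias table and pattern
# are illustrative assumptions only.
import re

ALIASES = {                                  # entity linking: surface form -> canonical entity
    "Barack": "Barack Obama",
    "Obama": "Barack Obama",
    "the president": "Barack Obama",
    "Michelle": "Michelle Obama",
}

SPOUSE_PATTERN = re.compile(r"(\w+) and (\w+) are married")

def extract_has_spouse(sentence):
    """Return entity-level has_spouse tuples extracted from one sentence."""
    facts = []
    for m1, m2 in SPOUSE_PATTERN.findall(sentence):
        e1 = ALIASES.get(m1, m1)             # link each mention to its entity
        e2 = ALIASES.get(m2, m2)
        facts.append((e1, "has_spouse", e2))
    return facts

print(extract_has_spouse("Barack and Michelle are married"))
# [('Barack Obama', 'has_spouse', 'Michelle Obama')]
```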
4. Analysis of Existing Knowledge Graphs
Below is a detailed study of the most widely used knowledge graphs available today.
A. Cyc and OpenCyc
The Cyc knowledge graph, dating back to the 1980s, is one of the oldest knowledge graphs in existence. It is an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common-sense knowledge, with the goal of enabling AI applications to perform human-like reasoning.[8] It is a curated knowledge graph, developed and still maintained by Cycorp Inc.
Until early 2017, parts of Cyc were released as OpenCyc, which provided an API, RDF endpoint, and data dump under an open-source license. OpenCyc is a publicly available, reduced version of Cyc. The first version of OpenCyc was released in spring 2002 and contained only 6,000 concepts and 60,000 facts. A Semantic Web endpoint to OpenCyc also exists, which contains links to DBpedia and other LOD datasets.
The overall Cyc knowledge base is divided into many contexts, or microtheories, each of which is a collection of assertions that share a common set of assumptions. Particular microtheories are usually focused on a specific domain of knowledge, a particular interval of time, or a certain level of detail. This microtheory mechanism allows Cyc to maintain assertions independently and enhances the performance of the complete Cyc system by focusing the inferencing process.
Features:
- OpenCyc contains roughly 630,000 concepts, which form an ontology in the domain of human consensus reality.
- It has over 7,000,000 assertions (facts and rules) and about 38,000 relations which interrelate, constrain, and define the concepts.
- A compiled version of the Cyc Inference Engine and the Cyc Knowledge Base Browser is available for use.
- Natural language parsers and CycL-to-English generation functions are also available for common use.
- It contains a natural language query tool, which allows a user to specify powerful and flexible queries without any formal understanding of logic or complex knowledge representation.
- It contains an ontology exporter which makes it easy to export specified portions of the knowledge base to OWL files.
- Documentation and self-paced learning materials are available to help users achieve a basic- to intermediate-level understanding of knowledge representation and application development using Cyc.
- It contains a specification of CycL, the core language in which Cyc is written, and includes the CycL-to-Lisp, CycL-to-C, and other translators.
- Also available is a specification of the Cyc API, which a programmer can call to build a ResearchCyc application.
B. Freebase
Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members.[7] It was an online collection of structured data harvested from many sources, including individual, user-submitted wiki contributions. It aimed to create a centralized global resource that allowed people to access common information on the web more effectively. It was developed by the American software company Metaweb and ran publicly from March 2007. Its acquisition by Google was announced on 16 July 2010, and Google's Knowledge Graph was eventually powered in part by Freebase. Freebase data was available for commercial and non-commercial use under a Creative Commons Attribution License, and an open API, an RDF endpoint, and a database dump were provided for programmers.
Although about 900 person-years had been invested in the creation of Cyc, it was still far from complete. To address this gap, efforts were made to distribute the load across as many contributors as possible through crowdsourcing, which was the approach taken by Freebase, a publicly editable knowledge graph with schema templates available for most kinds of entities, i.e., persons, cities, movies, etc. Freebase defined its data structure as a set of nodes and a set of links that established relationships between the nodes, instead of using tables and keys. Because its core data structure was non-hierarchical, Freebase could model much more complex relationships between individual elements than a conventional database, and users could enter new objects and relationships into its graph.
Queries to the database were made in the Metaweb Query Language (MQL) and served by a triple store called graphd. The last publicly available version of Freebase contained about 50 million entities and 3 billion facts; its schema consisted of about 27,000 entity types and 38,000 relation types. In its overall approach, Freebase differed from the wiki model in many ways. User-created types were adopted into the public commons only after promotion by Metaweb employees, and users could not modify each other's types. The reason was that many external applications relied on Freebase, so opening up its schema permissions was a major concern: changing a type's schema, for example by changing or deleting a property, could potentially break API users' queries and even Freebase itself. A sketch of what an MQL query looked like is given below.
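As a purely historical illustration, the sketch below shows roughly what an MQL read query looked like: a JSON template (written here as a Python data structure) in which empty values ask the service to fill in the missing data. The type and property names are assumptions for illustration; the Freebase API was shut down in 2016, so this cannot be run against a live endpoint.

```python
# A hypothetical sketch of an MQL read query, written as a Python data
# structure. Freebase's API was shut down in 2016, so this is illustrative
# only; the type and property names are assumptions.
mql_query = [{
    "type": "/book/written_work",   # assumed Freebase type for written works
    "name": "The Fountainhead",
    "author": [],                   # empty list asks the service to fill in the authors
}]
```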
C. Wikidata
Wikidata is a Wikimedia project to create an open and collaborative database.[6] It is a collaboratively edited knowledge graph operated by the Wikimedia Foundation, which also hosts the various language editions of Wikipedia. The main intention is to provide a common source of data that can be used by Wikimedia projects such as Wikipedia, or by anyone else, under a public-domain license.
Wikidata also stores relational statements about an entity as well as the interwiki links associated with the pages on the Wikimedia projects that describe that entity. The English Wikipedia uses these interlanguage links stored at Wikidata, and has some limited applications for the statements made in Wikidata.
Each Wikipedia page with an entry in Wikidata uses the language links stored there to populate the language links that show in the left column. Traditional interwiki links in a page’s wiki-text are still recognized, and simply override the information for that language (if any) from Wikidata.
After Freebase was shut down, its data was subsequently moved to Wikidata. The Wikidata repository mainly consists of items, each of which has a label, a description, and any number of aliases, and is uniquely identified by a number prefixed with Q. Statements describe detailed characteristics of an item and consist of a property and a value; properties are likewise identified by a number, prefixed with P. These properties can also be linked to external databases.
A property which links an item to an external database, such as an authority control database used by libraries and archives, is called an identifier. Special sitelinks connect an item to corresponding content on client wikis, for example Wikipedia, Wikibooks, or Wikiquote.
All of the information in the Wikidata graph can be displayed in any language, irrespective of the language in which the data originated, and when accessing these values, client wikis show the most up-to-date data available. A key feature of Wikidata is that provenance metadata can be attached to each statement, such as the source and date for the population figure of a city. As of now it consists of 37,864,327 data items.
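A minimal sketch of querying these items and properties through the public Wikidata Query Service is shown below, using the SPARQLWrapper Python library (an assumption about the reader's environment). Q76 and P26 are Wikidata's identifiers for Barack Obama and the spouse property.

```python
# A minimal sketch of querying the public Wikidata Query Service with
# SPARQLWrapper. Q76 = Barack Obama, P26 = spouse; the rest of the setup is
# an assumption about the environment, not part of Wikidata itself.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery("""
SELECT ?spouseLabel WHERE {
  wd:Q76 wdt:P26 ?spouse .                                   # Barack Obama -> spouse
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["spouseLabel"]["value"])                       # e.g. "Michelle Obama"
```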
D. DBpedia
DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project.[9] This structured information is made available on the World Wide Web. The core feature of DBpedia is that it allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
Tim Berners-Lee described DBpedia as one of the most famous parts of the decentralized Linked Data effort.
According to the official DBpedia website, it is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia is a knowledge graph extracted from structured data in Wikipedia, the main source being the key-value pairs in Wikipedia infoboxes. In a crowd-sourced process, infobox types are mapped to the DBpedia ontology, and the keys used in those infoboxes are mapped to properties in that ontology.
Data is accessed using SPARQL, an SQL-like query language for RDF. The English version of the DBpedia knowledge base describes about 4.58 million things overall, of which 4.22 million are classified in a consistent ontology, including:
- 1,445,000 persons
- 735,000 places (including 478,000 populated places)
- 411,000 creative works (including 123,000 music albums, 87,000 films, and 19,000 video games)
- 241,000 organizations (including 58,000 companies and 49,000 educational institutions)
- 251,000 species
- 6,000 diseases
One of the challenges in extracting information from Wikipedia is that the same concepts can be expressed using different parameters in infobox and other templates, such as |birthplace= and |placeofbirth=. Because of this, queries about where people were born would have to search for both of these properties in order to get complete results. As a result, the DBpedia Mapping Language has been developed to help map these properties to the ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.
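The following is a minimal sketch of this normalization idea in Python: different infobox parameter names are mapped onto a single ontology property. The mapping table is an illustrative assumption and is not the DBpedia Mapping Language itself, which is maintained as a community-edited mappings wiki.

```python
# A minimal sketch of normalizing synonymous infobox parameters onto one
# ontology property. The mapping table is an illustrative assumption, not the
# actual DBpedia Mapping Language.
INFOBOX_TO_ONTOLOGY = {
    "birthplace": "dbo:birthPlace",
    "placeofbirth": "dbo:birthPlace",
}

def normalize(infobox_key):
    """Map a raw infobox key onto its ontology property, if known."""
    key = infobox_key.strip().lower().replace(" ", "").replace("_", "")
    return INFOBOX_TO_ONTOLOGY.get(key)

print(normalize("placeofbirth"), normalize("birth place"))   # dbo:birthPlace dbo:birthPlace
```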
In addition, DBpedia is available in localized versions in about 125 languages. All these versions together describe about 38.3 million things, of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia.
The full DBpedia data set features:
- 38 million labels
- Abstracts in 125 different languages
- 25.2 million links to images
- 29.8 million links to external web pages
- 80.9 million links to Wikipedia categories
- 41.2 million links to YAGO categories
DBpedia is connected with other Linked Datasets by around 50 million RDF links. Altogether, the DBpedia 2014 release consists of 3 billion pieces of information (RDF triples), of which 580 million were extracted from the English edition of Wikipedia and 2.46 billion from other language editions. Detailed statistics about the DBpedia datasets in 24 popular languages are provided at Dataset Statistics.
The DBpedia knowledge base has several advantages over existing knowledge bases:
- It covers many domains.
- It represents real community agreement.
- It automatically evolves as Wikipedia changes.
- It is truly multilingual in nature.
The DBpedia knowledge base allows individuals to ask quite sophisticated queries against Wikipedia, for example "Give me all cities in California with more than 20,000 inhabitants" or "Give me all Indian restaurants within a 10-mile radius". Altogether, the use cases of the DBpedia knowledge base are widespread and range from enterprise knowledge management, through web search, to revolutionizing Wikipedia search.
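The first of these example queries could be sketched against DBpedia's public SPARQL endpoint roughly as follows; the exact ontology properties used here (dbo:isPartOf, dbo:populationTotal) are assumptions that may vary between DBpedia releases.

```python
# A minimal sketch of the "cities in California with more than 20,000
# inhabitants" query against DBpedia's public SPARQL endpoint. The exact
# property names are assumptions that may vary between DBpedia releases.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:isPartOf dbr:California ;
        dbo:populationTotal ?population .
  FILTER (?population > 20000)
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```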
E. YAGO
YAGO (Yet Another Great Ontology) is an open-source knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken.[10] YAGO is a huge semantic knowledge base automatically extracted and derived from Wikipedia, WordNet, and GeoNames.
Currently, YAGO has knowledge of more than 10 million entities, including persons, organizations, cities, etc., and contains more than 120 million facts about these entities. YAGO has been used in the Watson artificial intelligence system.
Features:
- YAGO has a confirmed accuracy of about 95%.
- Every relation is annotated with its confidence value.
- YAGO assigns entities to more than 350,000 classes by combining the clean taxonomy of WordNet with the richness of the Wikipedia category system.
- It is an ontology which is anchored in time and space.
- To integrate YAGO into the Linked Data cloud, it has been linked to the DBpedia ontology and to the SUMO ontology.
- YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities; a sketch of such an annotated fact is given after this list.
- In addition to a taxonomy, YAGO has thematic domains such as “music” or “science” from WordNet Domains.
- YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.
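A minimal, hypothetical sketch of what such a confidence- and time-annotated fact might look like as a data structure is shown below; the field names and values are illustrative assumptions, not YAGO's actual serialization.

```python
# A hypothetical sketch of a YAGO-style fact record carrying a confidence
# value and a temporal annotation. Field names and values are illustrative
# assumptions, not YAGO's actual serialization format.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AnnotatedFact:
    head: str
    relation: str
    tail: str
    confidence: float                                   # extraction confidence in [0, 1]
    valid_during: Optional[Tuple[str, str]] = None      # (start, end) of validity

fact = AnnotatedFact(
    head="Barack_Obama",
    relation="holdsPosition",                           # assumed relation name
    tail="President_of_the_United_States",
    confidence=0.97,                                    # illustrative value
    valid_during=("2009-01-20", "2017-01-20"),
)
print(fact)
```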
F. Google Knowledge Graph
The Google Knowledge Graph is a system that Google launched in May 2012, which is also when the term knowledge graph was coined. It is a system that understands facts about people, places, and things and how these entities are all connected, introduced in the article "Google Launches Knowledge Graph to Provide Answers, Not Just Links".[5]
The Google Knowledge Graph is the fruit of years of compiling search results and user actions. It works by leveraging information from already structured sources such as Wikipedia (long an object of Google's admiration), Freebase (an offshoot acquired by Google in 2010), IMDb (the main Internet Movie Database), and many others. Added to this are all the ideas, notions, and concepts that Google has been able to collect via pages marked up with microformats, RDF, or, more recently, Schema.org. Overall, Google's Knowledge Graph contains 18 billion statements about 570 million entities, with a schema of 1,500 entity types and 35,000 relation types. Google's Knowledge Graph is used both behind the scenes to help Google improve its search relevancy and to present Knowledge Graph boxes, at times, within its search results that provide direct answers.
G. Yahoo Knowledge Graph
The Yahoo Knowledge Graph is a knowledge base used by Yahoo to enhance its search engine's results with semantic-search information gathered from a wide variety of sources. It builds on public data such as Wikipedia and Freebase as well as closed commercial sources for various domains. It uses wrappers for different sources and monitors evolving sources, such as Wikipedia, for constant updates.[1]
Information about entities is acquired and extracted from multiple sources on a daily basis using available information extraction techniques, leveraging both open data sources such as Wikipedia and closed data sources from paid providers. The information is stored uniformly in a central knowledge repository, where entities and their respective attributes are categorized, normalized, and validated against a common ontology of about 300 classes and 950 properties, using a scalable and generalized framework.
Machine learning techniques are used to disambiguate and blend together the various entities that co-refer to the same real-world object. This is meant to turn siloed, incomplete, inconsistent, and possibly inaccurate information into a rich, unified, and disambiguated knowledge graph.
The Yahoo Knowledge Graph team uses a plugin system to enrich the graph with inferred information useful for the applications it supports, and simultaneously leverages editorial curation for hot fixes wherever available. Access to the knowledge graph is provided via APIs. In addition, a large set of data exports is generated on a regular basis to support large-scale offline data processing. Yahoo's knowledge graph contains roughly 3.5 million entities and 1.4 billion relations. Its schema, which is aligned with schema.org, comprises 250 types of entities and 800 types of relations.
H. Microsoft Satori
Satori is Microsoft’s equivalent to Google’s Knowledge Graph. Microsoft’s Satori (named after a Zen Buddhist term for enlightenment) is a graph-based repository that comes out of Microsoft Research’s Trinity graph database and computing platform.
Although almost no public information on the construction, schema, or data volume of Satori is available, it was said to consist of 300 million entities and 800 million relations in 2012, and its data representation format is RDF.
Microsoft's Satori extracts data from the unstructured information on web pages to create a structured database of the nouns of the Internet: people, places, things, and the relationships between them all. It uses the Resource Description Framework and the SPARQL query language, and it was designed to handle billions of RDF triples (or entities). For a sense of scale, the 2010 US Census in RDF form comprises about one billion triples.[3]
Satori needs to be able to handle queries from Bing's front end, even ones that require traversing potentially billions of nodes, in milliseconds. To make sure it does not suffer from latency while waiting for calls to storage, Microsoft built the Trinity engine that Satori is based on entirely in memory, atop a distributed memory-based storage layer. The result is a memory cloud based on a key-value store, similar to Microsoft Azure's distributed disk storage filesystem.
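For intuition only, the following is a minimal sketch of the general idea of keying a node's outgoing edges by its identifier in an in-memory key-value store; Trinity's actual interfaces are not public, so this is an assumption-laden illustration, not Microsoft's API.

```python
# A minimal sketch of an in-memory key-value adjacency store, illustrating
# the general idea only; this is not Microsoft's Trinity API.
memory_store = {}                        # node id -> list of (relation, node id)

def add_edge(store, head, relation, tail):
    store.setdefault(head, []).append((relation, tail))

add_edge(memory_store, "node:seattle", "locatedIn", "node:washington")
add_edge(memory_store, "node:seattle", "type", "node:city")
print(memory_store["node:seattle"])
# [('locatedIn', 'node:washington'), ('type', 'node:city')]
```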
5. Conclusions
To conclude, knowledge graphs are widely used these days, although the expressiveness of a knowledge graph is largely limited to triples. Over the course of writing this paper, we worked on understanding the semantic function of a knowledge graph and the elaborate relationships that exist between data, information, and knowledge, with the aim of clarifying the notion of a knowledge graph at different levels of data.