YAGO 4.5

YAGO 4.5 is the latest version of the YAGO knowledge base. It is based on Wikidata — the largest public general-purpose knowledge base. YAGO refines the data as follows:

  1. All entity identifiers and property identifiers are human-readable.
  2. The top-level classes come from schema.org, a standard repertoire of classes and properties maintained by Google and others. The lower-level classes are a careful selection from the Wikidata taxonomy.
  3. The properties come from schema.org.
  4. YAGO 4.5 contains semantic constraints in the form of SHACL. These constraints keep the data clean, and allow for logical reasoning on YAGO.

YAGO is thus a simplified, cleaned, and “reasonable” version of Wikidata. It contains 49 million entities and 109 million facts.

If you use YAGO 4.5 for scientific purposes, please cite our paper:

Fabian M. Suchanek, Mehwish Alam, Thomas Bonald, Lihu Chen, Pierre-Henri Paris, Jules Soria:
YAGO 4.5: A Large and Clean Knowledge Base with a Rich Taxonomy
Resource paper at the Conference on Research and Development in Information Retrieval (SIGIR), 2024

How to use YAGO

YAGO is an RDFS knowledge base. It is a collection of facts, each of which consists of a subject, a predicate, and an object — as in yago:Elvis_Presley rdf:type schema:Person.
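
To make the triple structure concrete, here is a minimal sketch in Python that stores a few facts as (subject, predicate, object) tuples and filters them by pattern. The `Fact` type and the `match` helper are hypothetical names chosen for illustration, and the second fact uses an assumed identifier (`yago:Tupelo`). None of this is part of the YAGO tooling.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Fact(NamedTuple):
    """One RDF statement, with prefixed names kept as plain strings."""
    subject: str
    predicate: str
    object: str


# The first fact is the example from the text. The second is illustrative only,
# and its object identifier is an assumption.
FACTS: List[Fact] = [
    Fact("yago:Elvis_Presley", "rdf:type", "schema:Person"),
    Fact("yago:Elvis_Presley", "schema:birthPlace", "yago:Tupelo"),
]


def match(
    facts: Iterable[Fact],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Fact]:
    """Yield the facts that agree with every component that is not None."""
    for fact in facts:
        if subject is not None and fact.subject != subject:
            continue
        if predicate is not None and fact.predicate != predicate:
            continue
        if obj is not None and fact.object != obj:
            continue
        yield fact
```

Calling `list(match(FACTS, predicate="rdf:type"))` would return only the type fact. A real application would more likely load the Turtle files into an RDF library and query them with SPARQL.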

YAGO puts each entity into at least one class. The classes form a taxonomy, where the higher classes are taken from schema.org, and the lower classes from Wikidata. The highest class is schema:Thing.
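
The taxonomy can be pictured as a graph of rdfs:subClassOf edges that lead up to schema:Thing. The following sketch walks such a graph to collect all superclasses of a class. The edge table is a toy example: `yago:Singer` is a hypothetical lower-level class, and `superclasses` is an illustrative helper, not a YAGO function.

```python
from typing import Dict, List, Set

# Toy rdfs:subClassOf edges (class -> direct superclasses).
# "yago:Singer" is a hypothetical lower-level class used only for illustration.
SUBCLASS_OF: Dict[str, List[str]] = {
    "yago:Singer": ["schema:Person"],
    "schema:Person": ["schema:Thing"],
    "schema:Thing": [],
}


def superclasses(cls: str, subclass_of: Dict[str, List[str]]) -> Set[str]:
    """Return all direct and indirect superclasses of cls."""
    seen: Set[str] = set()
    stack = list(subclass_of.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:  # guards against revisiting shared ancestors
            seen.add(parent)
            stack.extend(subclass_of.get(parent, []))
    return seen
```

With this table, every class other than schema:Thing reaches schema:Thing, which mirrors the statement that it is the highest class.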

The facts come from Wikidata, and the predicates have been mapped manually to the predicates of schema.org. Facts whose predicates could not be mapped were omitted. All predicates, all classes, and most entities have human-readable names. YAGO entities are mapped with owl:sameAs to Wikidata and with schema:sameAs to WordNet and other sources.

YAGO comes with SHACL constraints that specify the disjointness of certain classes, as well as the domains, ranges, and cardinalities of relations. Please find a detailed description of the upper taxonomy as well as our design document here.
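
To illustrate what such constraints express, the sketch below checks disjointness, domain, range, and maximum cardinality on a handful of facts in plain Python. It is not a SHACL validator, and the constraint tables are assumptions made up for the example. The actual constraints are the ones in the YAGO schema file. The check also ignores subclass reasoning, which a real validator would take into account.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical constraint tables, for illustration only.
DISJOINT = [frozenset({"schema:Person", "schema:Place"})]
DOMAIN = {"schema:birthPlace": "schema:Person"}
RANGE = {"schema:birthPlace": "schema:Place"}
MAX_COUNT = {"schema:birthPlace": 1}

Triple = Tuple[str, str, str]


def find_violations(facts: List[Triple], types: Dict[str, Set[str]]) -> List[tuple]:
    """Return one tuple per violated constraint (no subclass reasoning)."""
    problems: List[tuple] = []

    # Disjointness: no entity may be an instance of both classes of a pair.
    for entity, classes in types.items():
        for pair in DISJOINT:
            if pair <= classes:
                problems.append(("disjoint", entity, tuple(sorted(pair))))

    # Cardinality: count the distinct objects per (subject, predicate).
    objects = defaultdict(set)
    for s, p, o in facts:
        objects[(s, p)].add(o)
    for (s, p), objs in objects.items():
        if p in MAX_COUNT and len(objs) > MAX_COUNT[p]:
            problems.append(("maxCount", s, p))

    # Domain and range: subject and object must carry the expected class.
    for s, p, o in facts:
        if p in DOMAIN and DOMAIN[p] not in types.get(s, set()):
            problems.append(("domain", s, p))
        if p in RANGE and RANGE[p] not in types.get(o, set()):
            problems.append(("range", s, p, o))

    return problems
```

For example, two different schema:birthPlace values for the same person would be reported as a `maxCount` violation under these toy tables.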

Data

YAGO 4.5 is licensed under a Creative Commons Attribution-ShareAlike license by the YAGO team of Télécom Paris. Some facts are imported from schema.org, which releases its data under the same license.

The YAGO 4.5 knowledge base consists of the following set of Turtle files:

  • Schema: The upper taxonomy, constraints, and property definitions in SHACL.
  • Taxonomy: The full taxonomy of classes.
  • Facts: All facts about entities that have an English Wikipedia page.
  • Facts beyond Wikipedia: All facts about entities that do not have an English Wikipedia page.
  • Meta: The fact annotations (“facts about facts”) in RDF*.

YAGO 4.5 can be downloaded here.

The .ntx files use the RDF* (a.k.a. RDF star) N-Triples syntax. They can be parsed using Jena, RDF4J, N3.js, RDF.rb, Blazegraph, AnzoGraph, Stardog, or GraphDB.
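
For a quick look at a single annotation without installing one of these tools, the sketch below reads one RDF* N-Triples line of the form `<< s p o >> p2 o2 .` with a regular expression. It assumes that terms are IRIs or simple literals and that quoted triples are not nested. It is not a full parser. `parse_annotation` is a hypothetical helper, and the example line is illustrative rather than copied from the YAGO files.

```python
import re
from typing import Optional, Tuple

# One RDF term: an IRI in angle brackets, or a quoted literal with an optional
# datatype or language tag. Blank nodes and nested quoted triples are not handled.
TERM = r'(<[^>]*>|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'

# A quoted triple in << ... >> followed by an annotation predicate and object.
ANNOTATION = re.compile(
    rf'^<<\s*{TERM}\s+{TERM}\s+{TERM}\s*>>\s+{TERM}\s+{TERM}\s*\.$'
)

Triple = Tuple[str, str, str]


def parse_annotation(line: str) -> Optional[Tuple[Triple, str, str]]:
    """Split one line into (quoted triple, predicate, object), or return None."""
    match = ANNOTATION.match(line.strip())
    if match is None:
        return None
    s, p, o, meta_p, meta_o = match.groups()
    return (s, p, o), meta_p, meta_o


# Illustrative line; the IRIs and the fact are assumptions for this example.
EXAMPLE = (
    '<< <http://yago-knowledge.org/resource/Elvis_Presley> '
    '<http://schema.org/spouse> '
    '<http://yago-knowledge.org/resource/Priscilla_Presley> >> '
    '<http://schema.org/startDate> '
    '"1967-05-01"^^<http://www.w3.org/2001/XMLSchema#date> .'
)
```

`parse_annotation(EXAMPLE)` returns the quoted triple together with the annotation predicate and object. Lines that are plain triples yield `None`. For anything beyond inspection, one of the parsers listed above is the safer choice.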

Code

If you are only interested in the YAGO data, there is no need to use the code repository. You can download the data above.

The source code of YAGO is a Python project that ingests facts from Wikidata and transforms them into YAGO. If you run the code yourself, you can add other sources or modify the generation of the knowledge base. The YAGO 4.5 source code is available on GitHub. It is licensed under the Creative Commons Attribution License.

Acknowledgements

YAGO can only be so large because it is based on other sources. We would like to thank

  • the creators and contributors of Wikidata.
    Thank you for having and implementing such an ambitious vision of building a “Wikipedia for machines”, and thank you for keeping it open!
  • the team of Schema.org, who created the taxonomy and the properties that we use in YAGO 4.5.