YAGO 3
YAGO 3 combines the information from the Wikipedias in multiple languages with WordNet, GeoNames, and other data sources. YAGO 3 taps into multilingual resources of Wikipedia, getting to know more local entities and facts. This version has been extracted from 10 different Wikipedia versions (English, German, French, Dutch, Italian, Spanish, Polish, Romanian, Persian, and Arabic). YAGO 3 is special in several ways:
- YAGO 3 combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
- YAGO 3 is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
- In addition to taxonomy, YAGO has thematic domains such as “music” or “science” from WordNet Domains.
- YAGO 3 extracts and combines entities and facts from 10 Wikipedias in different languages.
- YAGO 3 contains canonical representations of entities appearing in different Wikipedia language editions.
- YAGO 3 integrates all non-English entities into the rich type taxonomy of YAGO.
- YAGO 3 provides a mapping between non-English infobox attributes and YAGO relations.
YAGO 3 knows more than 17 million entities (like persons, organizations, cities, etc.) and contains more than 150 million facts about these entities. As with all major releases, the accuracy of YAGO 3 has been manually evaluated, confirming an accuracy of 95%. Every relation is annotated with its confidence value.
If you use YAGO3 for scientific purposes, please cite our paper:
Farzaneh Mahdisoltani, Joanna “Asia” Biega, Fabian M. Suchanek:
YAGO3: A Knowledge Base from Multilingual Wikipedias
Full paper at the Conference on Innovative Data Systems Research (CIDR), 2015
How to use YAGO 3
YAGO classifies each entity into a taxonomy of classes. Every entity is an instance of one or multiple classes. Every class (except the root class) is a subclass of one or multiple classes. This yields a hierarchy of classes — the taxonomy. The YAGO taxonomy is the backbone of the ontology, and is designed with much care and attention to correctness. For those interested in the details of that taxonomy, we provide here a more in-depth explanation of the classes. The taxonomy consists of 4 layers:
- The root node of the taxonomy is rdfs:Resource. It includes entities, but also properties, literals, etc. rdfs:Resource has a subclass owl:Thing, which is the class of things (entities).
- Under owl:Thing, there is the class taxonomy from WordNet. Each class name is of the form <wordnet_XXX_YYY>, where XXX is the name of the concept (e.g., singer), and YYY is the WordNet 3.0 synset id of the concept (e.g., 110599806). For example, the class of singers is <wordnet_singer_110599806>. Each class is connected to its more general class by the rdfs:subClassOf relationship.
- The middle layer of the taxonomy consists of classes that have been derived from Wikipedia categories. For example, one class is <wikicategory_American_rock_singers>, derived from the Wikipedia category American rock singers. Each of these classes is connected to one class of the WordNet layer by an rdfs:subClassOf relationship. In the example, <wikicategory_American_rock_singers> rdfs:subClassOf <wordnet_singer_110599806>. Not all Wikipedia categories become classes in YAGO.
- The lowest layer of the taxonomy is the layer of instances. Instances comprise individual entities such as rivers, people, or movies. For example, this layer contains <Elvis_Presley>. Each instance is connected to one or multiple classes of the higher layers by the relationship rdf:type. In the example: <Elvis_Presley> rdf:type <wikicategory_American_rock_singers>.
This way, you can walk from the instance up to its class by rdf:type, and then further up by rdfs:subClassOf.
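The walk described above can be sketched in a few lines of Python. This is only an illustration, not part of the YAGO code base: the function names are made up, the facts are the examples from this section, and the direct link from the singer class to owl:Thing is a simplifying assumption (the real WordNet layer has intermediate classes).

```python
from collections import defaultdict

# Sample facts as (subject, predicate, object), taken from the examples above.
# ASSUMPTION: the singer class is linked directly to owl:Thing to keep the
# sample short; the real taxonomy has more classes in between.
FACTS = [
    ("<Elvis_Presley>", "rdf:type", "<wikicategory_American_rock_singers>"),
    ("<wikicategory_American_rock_singers>", "rdfs:subClassOf", "<wordnet_singer_110599806>"),
    ("<wordnet_singer_110599806>", "rdfs:subClassOf", "owl:Thing"),
    ("owl:Thing", "rdfs:subClassOf", "rdfs:Resource"),
]


def build_index(facts):
    """Split the facts into an instance->classes map and a class->superclasses map."""
    types = defaultdict(set)
    parents = defaultdict(set)
    for subject, predicate, obj in facts:
        if predicate == "rdf:type":
            types[subject].add(obj)
        elif predicate == "rdfs:subClassOf":
            parents[subject].add(obj)
    return types, parents


def all_classes(entity, types, parents):
    """Follow rdf:type once, then rdfs:subClassOf repeatedly, collecting every class."""
    seen = set()
    stack = list(types.get(entity, ()))
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue  # a class can be reached via several superclasses
        seen.add(cls)
        stack.extend(parents.get(cls, ()))
    return seen


if __name__ == "__main__":
    types, parents = build_index(FACTS)
    for cls in sorted(all_classes("<Elvis_Presley>", types, parents)):
        print(cls)
```

Because a class may have several superclasses, the sketch keeps a set of visited classes instead of assuming a single path to the root.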
In YAGO (as in RDF), each fact consists of a subject, a predicate, and an object. Every fact can have a fact id. The fact id is simply computed as a hash from the subject, predicate, and object of the fact. For example, the fact <Elvis_Presley> rdf:type <person> could have the fact identifier <id_42>.
YAGO contains facts about these fact identifiers. For example, YAGO contains:
<id_42> <occursSince> "1935-01-08"
<id_42> <occursUntil> "1977-08-16"
<id_42> <extractionSource> <http://en.wikipedia.org/Elvis_Presley>
These facts mean that Elvis was a person from the year 1935 to the year 1977, and that this fact was found in Wikipedia.
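One way to work with such meta-facts is to group them under the fact they describe. The following Python sketch assumes the facts are already available as (fact id, subject, predicate, object) tuples. The function name and data layout are hypothetical, and the sketch does not reproduce YAGO's hash function, which is not specified here.

```python
# Sample data from the example above. The first tuple is the base fact with
# its identifier; the others are meta-facts whose subject is that identifier.
# ASSUMPTION: meta-facts without an identifier of their own carry None.
FACTS = [
    ("<id_42>", "<Elvis_Presley>", "rdf:type", "<person>"),
    (None, "<id_42>", "<occursSince>", '"1935-01-08"'),
    (None, "<id_42>", "<occursUntil>", '"1977-08-16"'),
    (None, "<id_42>", "<extractionSource>", "<http://en.wikipedia.org/Elvis_Presley>"),
]


def attach_meta_facts(facts):
    """Return {fact_id: {"fact": (s, p, o), "meta": {predicate: [objects]}}}."""
    annotated = {
        fact_id: {"fact": (s, p, o), "meta": {}}
        for fact_id, s, p, o in facts
        if fact_id is not None
    }
    for _, subject, predicate, obj in facts:
        # A fact is a meta-fact when its subject is the id of another fact.
        if subject in annotated:
            annotated[subject]["meta"].setdefault(predicate, []).append(obj)
    return annotated


if __name__ == "__main__":
    for fact_id, entry in attach_meta_facts(FACTS).items():
        print(fact_id, entry["fact"])
        for predicate, objects in entry["meta"].items():
            print("   ", predicate, objects)
```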
Data
YAGO 3 is licensed under a Creative Commons Attribution 3.0 License by the YAGO team of the Max Planck Institute for Informatics. The exact version number of this data is 3.0.2. The data was extracted from the following versions of Wikipedia: en:20140626, de:20130422, fr:20140315, nl:UNKNOWN, it:20140317, es:20140320, ro:20140314, pl:20140312, ar:20140323, fa:20140319.
The YAGO3 knowledge base is a set of independent modular full-text files, which together constitute the knowledge base. This data is available in two formats:
- The TSV format contains 5 columns: fact identifier, subject, predicate, object, numerical value of the object (if applicable). This file contains the entire YAGO3, split into “Themes”, i.e., into files that group facts of a certain topic (taxonomy, labels, dates, etc.).
- The Turtle format is an export of the TSV format in the W3C standard Turtle. It replicates the themes of the TSV format. It does not contain the textual facts from Wikipedia (anchor texts, Infobox templates, etc.; Theme “Other”) and the facts about extraction provenance (which fact was extracted from where and how; themes labeled “*Sources*”). It does, however, contain all other facts, as well as the meta-facts (the “facts about facts”) concerning the location and the duration of facts. For this purpose, the file stores the fact identifiers in a comment line before the actual fact. This means that they are not visible (and not an obstacle) to systems that do not support this type of comments.
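As a rough illustration of the five-column TSV layout, the Python sketch below reads tab-separated lines into named tuples. The sample lines, the Fact type, and the treatment of an empty first column as "no fact identifier" are assumptions made for this example; check the downloaded theme files for the exact conventions.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional


class Fact(NamedTuple):
    fact_id: Optional[str]   # column 1: fact identifier (may be empty)
    subject: str             # column 2
    predicate: str           # column 3
    obj: str                 # column 4
    value: Optional[float]   # column 5: numerical value of the object, if any


def read_facts(lines: Iterable[str]) -> Iterator[Fact]:
    """Parse tab-separated fact lines; rows with fewer than 4 columns are skipped."""
    # QUOTE_NONE keeps the double quotes of literals such as "1935-01-08" intact.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue
        row = row + [""] * (5 - len(row))  # pad a missing numeric column
        fact_id, subject, predicate, obj, raw_value = row[:5]
        try:
            value = float(raw_value) if raw_value else None
        except ValueError:
            value = None
        yield Fact(fact_id or None, subject, predicate, obj, value)


# Hypothetical sample lines, made up for this sketch.
SAMPLE = (
    "<id_42>\t<Elvis_Presley>\trdf:type\t<person>\t\n"
    "\t<Elvis_Presley>\t<hasHeight>\t\"1.82\"^^<m>\t1.82\n"
)

if __name__ == "__main__":
    for fact in read_facts(io.StringIO(SAMPLE)):
        print(fact)
```

To read a real theme file, pass an open file handle instead of the StringIO object, for example `open(path, encoding="utf-8", newline="")`.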
2022 Revival Data
YAGO 3.0.3 is an unofficial updated version of YAGO from 2022. It is licensed under a Creative Commons Attribution 3.0 License by the YAGO team. We are very grateful to Jovan Stojanovic from INRIA Saclay for this work! The data was extracted from the English and French Wikipedia as of 2022, and has not been evaluated manually.
The data is structured as for the standard release, and is available in two formats:
- The TSV format contains 5 columns: fact identifier, subject, predicate, object, numerical value of the object (if applicable). This file contains the entire YAGO3, split into “Themes”, i.e., into files that group facts of a certain topic (taxonomy, labels, dates, etc.).
- The Parquet format is an export of the TSV format into Apache Parquet, a column-oriented file format with built-in compression.
Code
If you are just interested in the data of YAGO, there is no need to use the present code repository. You can download the YAGO data above.
The source code of YAGO is a Java project that extracts facts from Wikipedia and the other data sources, and stores these facts in files. These files make up the YAGO knowledge base. If you run the code yourself, you can define (a) which Wikipedia languages to cover, and (b) which specific Wikipedia, Wikidata, and Wikimedia Commons snapshots should be used during the build. The YAGO 3 source code is available on GitHub. It is licensed under the GNU General Public License, version 3 or later.
Acknowledgements
YAGO can only be so large because it is based on other sources. We would like to thank
- the numerous voluntary editors of Wikipedia. Thank you for giving mankind such a wonderful huge encyclopedia!
- the team of GeoNames. Thank you for creating this marvellous collection of geographical data, and thank you for providing this work for free!
- the creators of WordNet. Thank you for organizing and analyzing the English language in such a diligent way and thank you for making your work available for free!
- the WordNet Domains project for categorizing WordNet synsets into thematic domains.
- the Universal WordNet, which provided YAGO with multilingual labels for classes.