Concured/wikidata-taxonomy

Wikidata-Taxonomy

Command-line tool to extract taxonomies from wikidata.

Installation

wikidata-taxonomy requires Node.js version 4 or later.

Install globally to make command wdtaxonomy accessible from your shell $PATH:

$ npm install -g wikidata-taxonomy

Usage

This module provides the command wdtaxonomy. When called without arguments, it prints usage help:

$ wdtaxonomy

The first argument must be a Wikidata identifier to be used as root. For instance, to extract a taxonomy of planets (Q634):

$ wdtaxonomy Q634

The extracted taxonomy is based on statements using the property "subclass of" (P279) for items or "subproperty of" (P1647) for properties, and it is enriched with additional statistics. Option --sparql prints the SPARQL queries that are used.
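To make the underlying relation concrete, the following sketch builds a SPARQL query for the direct children of an item. It is not the query that wdtaxonomy generates (use --sparql to see those). The function name directChildrenQuery and the query shape are assumptions made for illustration; only the wd: and wdt: prefixes of the Wikidata Query Service are taken as given.

```javascript
// Hypothetical helper, not part of wikidata-taxonomy: builds a SPARQL query
// listing entities directly linked to `root` via a hierarchy property.
function directChildrenQuery(root, property = 'P279') {
  if (!/^[PQ][0-9]+$/.test(root)) throw new Error(`invalid identifier: ${root}`)
  if (!/^P[0-9]+$/.test(property)) throw new Error(`invalid property: ${property}`)
  return [
    'SELECT ?item WHERE {',
    `  ?item wdt:${property} wd:${root} .`,
    '}'
  ].join('\n')
}

// Direct subclasses of planet (Q634)
console.log(directChildrenQuery('Q634'))
```

A full taxonomy needs this relation followed transitively and combined with site and instance counts, which this sketch leaves out.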

Taxonomy extraction and output can be controlled by several options. For instance, this command lists a biological taxonomy of mammals:

$ wdtaxonomy Q7377 --property P171 --brief

Output formats

Tree format

By default, the taxonomy is printed in "tree" format with colored Unicode characters:

$ wdtaxonomy Q17362350
planet of the Solar System (Q17362350) •2 ↑
├──outer planet (Q30014) •23 ×4 ↑
└──inner planets (Q3504248) •8 ×4 ↑

The output contains item labels, Wikidata identifiers, the number of Wikimedia sites connected to each item (indicated by the bullet character "•"), the number of instances (property P31, indicated by a multiplication sign "×"), and an upwards arrow ("↑") as an indicator of additional superclasses not included in the tree.

Option "--instances" (or "-i") explicitly includes instances:

$ wdtaxonomy -i Q17362350
planet of the Solar System (Q17362350) •2 ↑
├──outer planet (Q30014) •23 ↑
|   -Saturn (Q193)
|   -Jupiter (Q319)
|   -Uranus (Q324)
|   -Neptune (Q332)
└──inner planets (Q3504248) •8 ↑
    -Earth (Q2)
    -Mars (Q111)
    -Mercury (Q308)
    -Venus (Q313)

Classes that occur in multiple places in the taxonomy (multihierarchy) are marked on repeated occurrences with a double-line connector ("╞══") and a trailing ellipsis ("…"), as in the following example:

$ wdtaxonomy Q634
planet (Q634) •196 ×7 ↑
├──extrasolar planet (Q44559) •81 ×833 ↑
|  ├──circumbinary planet (Q205901) •14 ×10
|  ├──super-Earth (Q327757) •32 ×46
...
├──terrestrial planet (Q128207) •67 ×7
|  ╞══super-Earth (Q327757) •32 ×46  …
...

CSV format

The CSV format ("--format csv") is optimized for comparing changes over time. Each output row consists of six fields:

  • level in the hierarchy, indicated by zero or more "-" (default) or "=" characters (multihierarchy).

  • id of the item. Items on the same level are sorted by their id.

  • label of the item. Language can be selected with option --language. The character , in labels is replaced by a whitespace.

  • sites: number of connected sites (Wikipedia and related project editions). Larger numbers may indicate more established concepts.

  • instances: number of Wikidata items that are direct instances (property P31) of the item.

  • parents outside of the hierarchy, indicated by zero or more "^" characters.

For instance, the CSV output for Q634 looks like this:

$ wdtaxonomy -f csv Q634
level,id,label,sites,instances,parents
,Q634,planet,196,7,^
-,Q44559,extrasolar planet,81,833,^
--,Q205901,circumbinary planet,14,10,
--,Q327757,super-Earth,32,46,
...
-,Q128207,terrestrial planet,67,7,
==,Q327757,super-Earth,32,46,
...

In this example there are 196 Wikipedia editions or other sites with an article about planets, and seven Wikidata items are direct instances of planet. The "^" at the end of the line indicates that "planet" has one superclass outside the hierarchy. In the next row, "extrasolar planet" (Q44559) is a subclass of planet with another superclass, again indicated by "^". Both "circumbinary planet" and "super-Earth" are subclasses of "extrasolar planet". The latter also occurs as a subclass of "terrestrial planet", where it is marked by "==" instead of "--".
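The field layout described above can be read back with a few lines of JavaScript. This is a minimal sketch, not code from wikidata-taxonomy: the function name parseTaxonomyCsv and the property names of the returned objects are made up for illustration, and it assumes the six-column header shown above with no quoted fields.

```javascript
// Sketch: parse wdtaxonomy CSV output into plain objects.
// Assumes the six-column layout shown above and no quoted fields
// (commas in labels are replaced by whitespace, so a plain split suffices).
function parseTaxonomyCsv(text) {
  const lines = text.split('\n').filter(line => line.trim() !== '')
  return lines.slice(1).map(line => {          // skip the header row
    const [level, id, label, sites, instances, parents] = line.split(',')
    return {
      depth: level.length,                     // number of "-" or "=" characters
      repeated: level.startsWith('='),         // "=" marks a multihierarchy occurrence
      id,
      label,
      sites: Number(sites),
      instances: Number(instances),
      outsideParents: parents.length           // number of "^" characters
    }
  })
}

// The rows from the example above, without the elided "..." lines.
const sample = [
  'level,id,label,sites,instances,parents',
  ',Q634,planet,196,7,^',
  '-,Q44559,extrasolar planet,81,833,^',
  '--,Q205901,circumbinary planet,14,10,',
  '--,Q327757,super-Earth,32,46,',
  '-,Q128207,terrestrial planet,67,7,',
  '==,Q327757,super-Earth,32,46,'
].join('\n')

const rows = parseTaxonomyCsv(sample)
console.log(rows.filter(row => row.repeated).map(row => row.id))
```

For the sample rows, the only repeated entry is Q327757 (super-Earth). Because rows on the same level are sorted by id, two snapshots parsed this way can be compared row by row to spot changes over time.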

JSON format

Option --format json serializes the taxonomy as a JSON object with the following fields:

  • root: Wikidata identifier of the root item/property
  • items: object with Wikidata items/properties, indexed by their identifier
  • narrower
  • broader
  • instances (if option --instances is enabled)
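A consumer of the JSON output would typically start at root and follow narrower. The sketch below shows one way to do that. It rests on assumptions the field list does not confirm: that narrower maps each identifier to an array of child identifiers, and that each entry in items carries a label. The function name walk and the sample data are hypothetical.

```javascript
// Sketch: depth-first walk over a taxonomy object.
// ASSUMPTION: `narrower` maps each identifier to an array of child identifiers
// and each entry in `items` has a `label`; the field list above does not say so.
function walk(taxonomy, visit, id = taxonomy.root, depth = 0, seen = new Set()) {
  visit(id, taxonomy.items[id], depth)
  if (seen.has(id)) return                     // do not expand a class twice
  seen.add(id)
  for (const child of (taxonomy.narrower || {})[id] || []) {
    walk(taxonomy, visit, child, depth + 1, seen)
  }
}

// Hand-written sample data in the assumed shape, not real tool output.
const taxonomy = {
  root: 'Q17362350',
  items: {
    Q17362350: { label: 'planet of the Solar System' },
    Q30014: { label: 'outer planet' },
    Q3504248: { label: 'inner planets' }
  },
  narrower: { Q17362350: ['Q30014', 'Q3504248'] }
}

const lines = []
walk(taxonomy, (id, item, depth) => {
  lines.push(`${'  '.repeat(depth)}${item.label} (${id})`)
})
console.log(lines.join('\n'))
```

The seen set keeps a class that occurs in several places (multihierarchy) from being expanded more than once, similar to how the tree format marks repeated occurrences instead of repeating their subtrees.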

Specialized taxonomies

The hierarchy properties used to build taxonomies, P279 ("subclass of") and P31 ("instance of") by default, can be changed with option --property (-P).

Members of (P463) the European Union (Q458):

$ wdtaxonomy Q458 -P P463

Members of (P463) the European Union (Q458) and the number of their citizens in Wikidata (P27):

$ wdtaxonomy Q458 -P 463/27

Because Wikidata is not a strict ontology, subproperties are not factored in. For instance, the following query does not include members of the European Union, although P463 is a subproperty of P361.

Parts of (P361) the European Union (Q458):

$ wdtaxonomy Q458 -P P361

A taxonomy of subproperties can be queried like a taxonomy of items. When the root is a property, the hierarchy property is set to P1647 ("subproperty of") by default:

$ wdtaxonomy P361
$ wdtaxonomy P361 -P P1647  # equivalent

Subproperties of "part of" (P361) and which of them have an inverse property (P1696):

$ wdtaxonomy P361 -P P1647/P1696

Inverse properties are not factored in either, so queries like these do not necessarily return the same results:

What the hand (Q33767) is part of (P361):

$ wdtaxonomy Q33767 -P 361 -r

What parts the hand (Q33767) has (P527):

$ wdtaxonomy Q33767 -P 527

Release notes

Release notes are listed in the file CHANGES.md in the source code repository.

See Also


This document

Related tools
