
Wikidata: Open Linked Data

Taking a break from the neverending saga of the Cryptocurrency Rush, I noticed an article by Vrandecic and Krotzsch (apologies for the missing diacritical marks; my software choked on them) in the October Communications of the ACM, “Wikidata: a free collaborative knowledgebase”.

If there is one thing truly and deeply missing from Wikipedia, it is deep indexing throughout the whole collection, especially in machine-usable forms.  In other words, linked data (AKA a semantic web).  There are lots of cross-links and lots of metainformation, but it is human-generated, inconsistent, and mostly opaque to computers.

I am not surprised to find that the Wikipedia community sees the same lack, and seeks to address it with what is called Wikidata.

First, let me say that this is basically the right idea.  Linked Data is designed for exactly this situation: maintaining an independent collection of assertions about relations among data items.  We don’t want to (and couldn’t) centralize Wikipedia or try to impose an impractical process on everyone.  We need to ‘find’ the data as it is, yet be able to make connections in a logical and machine-readable way.
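To make that concrete, here is a minimal sketch (in Python, using plain tuples rather than a real RDF store) of what such assertions look like.  The Q- and P-identifiers follow Wikidata’s actual naming scheme, but the tiny “store” and the helper function are mine, purely for illustration:

```python
# A minimal sketch of linked-data assertions: each fact is a
# (subject, predicate, object) triple over globally named items.
# The identifiers are real Wikidata IDs, used here only as examples.
assertions = [
    ("Q937",  "P31", "Q5"),     # Albert Einstein -- instance of -- human
    ("Q937",  "P19", "Q3012"),  # Albert Einstein -- place of birth -- Ulm
    ("Q3012", "P17", "Q183"),   # Ulm -- country -- Germany
]

def related(item, triples):
    """Return every assertion that mentions the given item."""
    return [t for t in triples if item in (t[0], t[2])]

# A program can now follow links without parsing any article text:
for triple in related("Q3012", assertions):
    print(triple)
```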

There are automatic tools coming online that extract relations from Wikipedia pages as they are edited, and generate assertions about those relations.  The initial target is to document related items across languages.  This concept is such a good idea that I even did a little demo of something similar a while back (see the McGrath reference below).

This is what Wikidata does, and in this respect it is a pretty standard linked data system.
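The data is available to programs through a public web API; for instance, the wbgetentities action returns an item’s cross-language “sitelinks”, i.e., the corresponding article titles in each language edition.  A rough sketch (Q42, Douglas Adams, is just an example item, and the User-Agent string is a placeholder):

```python
import json
import urllib.request

# Fetch the cross-language sitelinks for one item from the public
# Wikidata API.  Q42 (Douglas Adams) is just an example.
url = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetentities&ids=Q42&props=sitelinks&format=json")
req = urllib.request.Request(url, headers={"User-Agent": "wikidata-demo/0.1"})

with urllib.request.urlopen(req) as response:
    data = json.load(response)

sitelinks = data["entities"]["Q42"]["sitelinks"]
for site in ("enwiki", "dewiki", "frwiki"):
    if site in sitelinks:
        print(site, "->", sitelinks[site]["title"])
```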

There are a couple of interesting points about Wikidata that are not common or standard for linked data, both of which come out of the “Wikipedia” culture.

First, the data is designed to be editable by anyone, just like Wikipedia.  And like Wikipedia, you wonder how this will work.  The answer is the same as for the texts: it works better than it reasonably should.  However, if people are going to be using the data via APIs, instability and variability in quality are going to be a challenge.  How can my analytics know what is stable, carefully vetted data, and what is iffy stuff that no human has ever looked at?
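There is at least one crude, machine-checkable signal: each statement in the entity JSON can carry references.  A sketch of counting sourced vs. unsourced statements for one property (Q64 is Berlin, P1082 is population; treating “has a reference” as “vetted” is my own heuristic, not anything Wikidata prescribes):

```python
import json
import urllib.request

# Count sourced vs. unsourced statements for one property of one
# item, as a crude quality signal.  Q64 is Berlin, P1082 is
# "population"; the heuristic itself is my own invention.
url = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetentities&ids=Q64&props=claims&format=json")
req = urllib.request.Request(url, headers={"User-Agent": "wikidata-demo/0.1"})

with urllib.request.urlopen(req) as response:
    entity = json.load(response)["entities"]["Q64"]

statements = entity["claims"].get("P1082", [])
sourced = [s for s in statements if s.get("references")]
print(f"{len(sourced)} of {len(statements)} population statements "
      "carry at least one reference")
```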

Second, and more interesting, is the design for “plurality”: “It would be naive to expect global agreement on the “true” data, since many facts are disputed or simply uncertain. Wikidata allows conflicting data to coexist and provides mechanisms to organize this plurality.” (V&K, p. 79)

As far as I can tell, these mechanisms are built on the fact that linked data can have many assertions about the same items, which need not agree.  So long as the source of each assertion is kept clear, they can coexist, and even enrich the information.
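In sketch form, the idea is just that the “same” fact can appear several times, each assertion tagged with where it came from (the figures and source labels below are invented for illustration):

```python
# Conflicting assertions can coexist as long as each one carries
# its source.  The values and source names here are made up.
population_claims = [
    # (subject, predicate, value, source)
    ("Q64", "P1082", 3_500_000, "national census, 2011"),
    ("Q64", "P1082", 3_700_000, "city register, 2019"),
]

# A consumer can pick by source (or date) instead of being forced
# into a single "true" value:
for subj, pred, value, source in population_claims:
    print(f"{subj} {pred} = {value}  [{source}]")
```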

This is actually standard operating procedure for linked data, though many systems that use linked data will try to “clean up” the data, to impose an orderly consistency on their view of the data.

I gather that there are also conventions for marking assertions as “preferred” and “deprecated”.  No doubt these editorial markings will acquire more nuance in the future.  They will obviously be problematic in sharply contested cases (e.g., disputed political borders).
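In Wikidata these show up as statement “ranks” (preferred, normal, deprecated), and they are machine-queryable.  A sketch against the public SPARQL endpoint, fetching every population statement for Berlin along with its rank (the endpoint and the wikibase:rank predicate are real; the choice of item and property is just an example):

```python
import json
import urllib.parse
import urllib.request

# Ask the public Wikidata SPARQL endpoint for all population
# statements on Q64 (Berlin) together with their ranks, so a
# consumer can see preferred/normal/deprecated values side by side.
query = """
SELECT ?population ?rank WHERE {
  wd:Q64 p:P1082 ?statement .
  ?statement ps:P1082 ?population ;
             wikibase:rank ?rank .
}
"""
url = ("https://query.wikidata.org/sparql?format=json&query="
       + urllib.parse.quote(query))
req = urllib.request.Request(url, headers={"User-Agent": "wikidata-demo/0.1"})

with urllib.request.urlopen(req) as response:
    results = json.load(response)["results"]["bindings"]

for row in results:
    print(row["population"]["value"], row["rank"]["value"])
```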

It will be interesting to see if this gains as much traction as Wikipedia has. If so, it will be pretty cool.

Denny Vrandecic and Markus Krotzsch, “Wikidata: a free collaborative knowledgebase.” Communications of the ACM, 57 (10):78-85, 2014.

Robert E. McGrath, “How to Build a Better Wikipedia: Ubiquitous Infrastructure for Deep Accountability”, Microsoft E-Science Workshop, Indianapolis, December 7-9, 2008.