Wednesday, November 20, 2013

#Wikidata & #Freebase - an #interview with Denny Vrandečić

When Denny signalled the availability of data that links Wikidata to Freebase, there was a lot of follow-up, and several questions were left unanswered. So I asked my questions, and I am happy with the answers I received. Enjoy this interview about the Wikidata and Freebase connection.
Thanks,
       GerardM

By moving to California you embrace yet another culture and language. What does it do to you and your family?
My wife is from Uzbekistan, I am from Croatia. We met in Israel. She has lived the last few years in Estonia, I in Germany. We have moved to California, and we are very excited about this move. Finally to live in a country where we both speak the language! And also, we have high hopes with regards to the weather.
The US in general, and San Francisco in particular, is a real melting pot - or rather a fruit salad, as they say these days - of people from all over the world. This seems to be a good match for our own background, so we are looking forward to seeing what this will mean for us.
What is your job description and title at Google (and what is it you actually do)?
My job title is "Ontologist". I work on Google's Knowledge Graph. The Knowledge Graph is for Google basically what Wikidata is for the Wikimedia projects: a repository of structured knowledge about the world that is used in many different ways across various applications. Together with a great team, I work on the schema and ontology of the Knowledge Graph: its data model and the way it captures and represents knowledge.
What is Freebase and how does it compare with Wikidata?
Freebase is the publicly available and editable part of the Knowledge Graph. Freebase and Wikidata are very similar. At the same time, there are quite a few differences: the user communities, the incentive architecture, the license, and the way sources and knowledge diversity are handled, to name the big ones. There are also minor differences in the data model, the way classes, properties, and types interact with each other, the UI, the workflows, the prominence of internationalization, and currently also the size and scope of the knowledge bases. Wikidata currently has 22 million statements, Freebase has 2 billion facts. One should always be careful with such simple metrics, and these numbers in particular are not directly comparable, but they hint at some differences between the knowledge bases.
Do you work on FB or is this only an outward-facing project?
Freebase is a part of the Knowledge Graph, and I work on the Knowledge Graph, so yeah, I also work on Freebase.
With a link between FB and WD, data can flow from WD to FB because of WD's license. What about a flow in the other direction?
I am not a lawyer, but unfortunately it seems that the licenses used by Freebase and Wikidata would not allow for that direction. We are aware that this is not good, and we are working on ways to fix this.
But even without actually letting data flow from Freebase to Wikidata, such an alignment can be valuable for the Wikidata editors: bots could compare the data and flag inconsistencies. Bots could do simple comparisons: go through all the countries, compare the capitals, and report the differences. And this can be done not only with Freebase, but with any other structured or semi-structured knowledge base that Wikidata connects to. One example is the work that Maximilian Klein did, comparing the sex/gender of authors connected through the VIAF ID.
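To make the idea concrete, here is a minimal sketch (my illustration, not one of the actual bots) of such a consistency check in Python. It reads a country's capital from Wikidata through the real wbgetentities API (property P36 is "capital") and compares it against a value from another source; the external mapping here is a hypothetical stand-in for Freebase, VIAF, or any other knowledge base.

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def wikidata_capital(country_qid):
        """Return the item ID of the capital (P36) recorded in Wikidata, or None."""
        resp = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities",
            "ids": country_qid,
            "props": "claims",
            "format": "json",
        })
        statements = resp.json()["entities"][country_qid]["claims"].get("P36", [])
        for statement in statements:
            snak = statement["mainsnak"]
            if snak["snaktype"] == "value":  # skip "novalue"/"somevalue" snaks
                return snak["datavalue"]["value"]["id"]
        return None

    # Hypothetical data from another knowledge base, keyed by Wikidata item ID.
    external_capitals = {"Q183": "Q64"}  # Germany -> Berlin

    for country, expected in external_capitals.items():
        actual = wikidata_capital(country)
        if actual != expected:
            print("Mismatch for %s: Wikidata says %s, other source says %s"
                  % (country, actual, expected))

A real bot would of course report such mismatches for human review rather than overwrite either side.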
The data for the FB<->WD connection is based on a WD dump. How will it be maintained in the future?
This is not decided yet. In principle, we could create this dump regularly, but I am unsure if this will be needed. We will have to see how things develop before deciding on what to do next.
When info exists in both WD and FB and it differs, what would you like to see done?
The difference to be fixed, obviously! Both knowledge bases may and do contain errors, and the ability to compare them can lead to an increased quality level in both. It is obvious that the respective communities should take care of such errors in their knowledge bases. The links between the knowledge bases should make it easier to automate this.
Would that be a model for collaboration with other sources like VIAF?
Yes, in many cases that should work. It is clear that some sources and their communities are more open to corrections than others. Wikidata can become a central hub for identity on the Web, collecting and reconciling IDs from many different sources. What makes Wikidata so interesting in itself is its commitment to let everyone share in the sum of all knowledge - both by participating in creating it and by accessing it. The barriers are so much lower than almost anywhere else. In which other knowledge base can you easily fix an error?
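As a small illustration of the hub idea (my sketch, not from the interview): identifiers from other sources live on a Wikidata item as ordinary statements, for example the VIAF ID under property P214, so they can be read and reconciled like any other data.

    import requests

    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetentities",
        "ids": "Q42",        # Q42 = Douglas Adams, a frequently used example item
        "props": "claims",
        "format": "json",
    })
    claims = resp.json()["entities"]["Q42"]["claims"]
    for statement in claims.get("P214", []):  # P214 = VIAF ID
        print("VIAF ID:", statement["mainsnak"]["datavalue"]["value"])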
Does FB have sources for its statements?
Yes, but this is quite a different notion from what Wikidata does with sources. When a dataset is uploaded to Freebase, the source for this upload is usually recorded. It is closer to what is usually called "provenance", whereas Wikidata's notion of a source is closer to what is called a "reference": an external authority that makes the given claim. A strength of Wikidata is the diversity of the sources, and that they can be refined later by the community. A statement in Wikidata can have several sources (which makes sense if you think of them as references supporting the statement), but in Freebase every statement has only one source (which again makes sense if you think of it as the provenance of that statement, i.e. where the statement came from).
[Image: spot the differences]
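That difference is visible in Wikidata's JSON: each statement carries an optional list of references. A short sketch (again my illustration, using the real wbgetentities API) that counts the references attached to the population statements (property P1082) of Berlin (Q64):

    import requests

    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetentities",
        "ids": "Q64",
        "props": "claims",
        "format": "json",
    })
    claims = resp.json()["entities"]["Q64"]["claims"]
    for statement in claims.get("P1082", []):  # P1082 = population
        references = statement.get("references", [])
        print("statement %s has %d reference(s)" % (statement["id"], len(references)))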
What do you prefer, no diffs or sourced diffs?
(I don't understand the question, so I will rephrase it as: do you prefer an unsourced statement over no statement?)
In most cases, yes, I'd rather have an unsourced statement than no statement at all. In a perfect world, all statements in Wikidata would have great, authoritative sources. But for now, I really think that Wikidata should be lenient with regard to unsourced statements in most cases. There are obvious cases where this is not true: data about living persons has to be more carefully sourced, especially when it has the potential to hurt the person. Also, we have to remember that a Wikipedia community can decide to use a statement from Wikidata only if it is sourced, and to drop it otherwise. Once Wikidata has matured a bit, I expect it to move towards a stricter policy, and to see tools develop that help with getting there. That will be quite an exciting time.
What FB feature would you LOVE to have in Wikidata?
The expressive query features of Freebase. The Wikidata team is working hard towards enabling this, and I am very much looking forward to seeing it happen: it will open a whole new world of possibilities for everyone using Wikidata's data, and it will lead to much more visibility of the data and help refine the Wikidata properties.
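The expressive queries Denny hopes for did not exist for Wikidata when this interview was published; they later arrived as the Wikidata Query Service (query.wikidata.org). The following sketch is therefore an after-the-fact illustration of the kind of query he means: list countries (items with P31 = Q6256) together with their capitals (P36).

    import requests

    QUERY = """
    SELECT ?countryLabel ?capitalLabel WHERE {
      ?country wdt:P31 wd:Q6256 .
      ?country wdt:P36 ?capital .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10
    """

    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["countryLabel"]["value"], "->", row["capitalLabel"]["value"])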
