Sunday, August 31, 2014

#Wikidata - my #workflow enriching Wikidata using tools

As I have other commitments, I do not have the same amount of time to do what I used to do. The workflow I use is now quite stable and dependable so I am happy to publish it. It is fairly easy and obvious. You can do this too.

Objectives are important; mine are:
  • make Wikidata more informative by adding relevant statements
  • provide the basis for further usage of data
My workflow is based on the people who died in 2014. These deaths are reported in categories. ToolScript informs me about all items that do not have a date of death. Every line represents an item; typically they are human, but horses and other critters are included as well. I click the Reasonator icon, and the links to articles provide me with the first lines of each article. Typically the dates of birth and death are included. When the text is not in English, I copy it and use Google Translate. From the translated text I copy the dates of birth and death. I click on the Q number in Reasonator and add these dates in Wikidata.

The ToolScript can easily point to 2013 or any other year. Obviously you can make your own script to do whatever.
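The filtering step can be sketched in a few lines of Python. This is a minimal sketch and not the ToolScript itself: the item records, the placeholder identifiers and the helper `missing_date_of_death` are hypothetical. Only the property numbers P569 (date of birth) and P570 (date of death) are real Wikidata identifiers.

```python
# Hypothetical in-memory records standing in for items found in a
# "deaths in 2014" category; real data would come from Wikidata.
ITEMS = [
    {"id": "Q_A", "label": "Person A", "claims": {"P569": "1900-01-01", "P570": "2014-01-01"}},
    {"id": "Q_B", "label": "Person B", "claims": {"P569": "1900-01-02"}},
    {"id": "Q_C", "label": "Horse C", "claims": {}},
]

DATE_OF_DEATH = "P570"  # Wikidata property for date of death


def missing_date_of_death(items):
    """Return the items that have no date-of-death claim yet."""
    return [item for item in items if DATE_OF_DEATH not in item["claims"]]


# Every printed line represents one item that still needs attention.
for item in missing_date_of_death(ITEMS):
    print(item["id"], item["label"])
```

Pointing such a script at another year would only mean feeding it the items from another category.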

Once somebody is registered as dead, I look at the article for interesting categories. They can be anything from "Alma mater university x" to "player of Whatever FC". Most interesting are the implied facts that are NOT yet recorded for the dearly departed. Any category may contain hundreds of other items for which we are not aware of said fact. The first thing to do is to document said category; this category can be on any wiki. Documenting is done by including a statement with "is a list of" "human" and adding a qualifier like "alma mater" "University X". Reasonator will show at most the first 500 entries of the resulting query.

When many entries are still missing, Autolist2 is the tool to use. From the Reasonator page of the category, copy the name of the category, the P and the Q value to the appropriate spot. Do not forget to make sure that the right Wiki has been selected (en in the example). Consider the depth; depth 0 is safest. Make sure that the WDQ mode is on "AND" and press "Run". This will generate the list that is selected for processing. Check the list and copy the P and Q values to the control box. Click "Process commands" when you feel comfortable with the results. Once the process starts, you will find the changes in the Reasonator page for the item you add statements for; in the example of the illustration it is the New Zealand Order of Merit.
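The selection made in this step comes down to simple set logic: the members of the category minus the items that already have the statement. The sketch below only illustrates that logic. It does not reproduce Autolist2 or its command syntax, and the function names and placeholder identifiers are assumptions made for the example.

```python
def items_needing_statement(category_members, items_with_claim):
    """Items in the category that do not yet carry the target statement."""
    return sorted(set(category_members) - set(items_with_claim))


def planned_statements(item_ids, prop, value):
    """Pair every selected item with the property and value to add."""
    return [(item_id, prop, value) for item_id in item_ids]


# Placeholder identifiers; substitute the real P and Q values
# copied from the Reasonator page of the category.
members = ["Q_1", "Q_2", "Q_3", "Q_4"]
already_done = ["Q_2", "Q_4"]

todo = items_needing_statement(members, already_done)
for triple in planned_statements(todo, "P_AWARD", "Q_ORDER"):
    print(triple)
```

Printing the planned statements before anything is written mirrors the advice to check the list first.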

For best results, look at the category in the "local language"; that is often where most entries are, as in this example for people who work(ed) at the University of Innsbruck.

With a workflow like this you are more effective. The work is documented and slowly but surely Wikidata becomes truly informative.

Friday, August 29, 2014

#Wikidata - Adolf Butenandt, Nobel laureate, professor and student

For many professors we know in Wikidata by which university they are or have been employed. Data about this has been added one category at a time. Often this has been repeated for categories about the same university from different Wikipedias.

At the same time information has been added for the universities where people studied. However, there is an increasing number of professors for whom it is not known where they studied.

Professor Butenandt is a case in point; he studied at the university of Marburg and the university of Göttingen. It is known on one Wikipedia and not on others. Given that categories are linked as well, it is fairly easy to signal missed opportunities.

Thanks to this query by Magnus, we know about 23,351 professors without an alma mater. For Mr Butenandt information has been or will be added and, obviously, there is much more work left to do.

Wednesday, August 27, 2014

#Wikipedia - Professor Hermann Buhl "Leichtathlet"

Mr Buhl died in Tirol wandering through the Alps. He used to be an athlete of repute and became a professor at the Julius Maximilians-Universiteit.

It is obvious that Mr Buhl was a professor because of his presence in a category. It is not obvious in the same way where he studied and what he taught. The text expects a lot of knowledge about the DDR from its readers before such things become obvious.

Every Wikipedia has its notability criteria, and the German Wikipedia is no different. Mr Buhl is certainly notable as an athlete, but his career did not end there. He is probably notable as well for the "latter" part of his career. Some would argue that he started to contribute in a meaningful way when he taught at university.

Monday, August 25, 2014

#Wikimedia - "Share in the sum of all available knowledge"

When we are to focus on the available knowledge we have to share, statistics are key. They cut the crap and focus on numbers. Given that information can be made out of data, it is relevant to know how much additional information is available that is easily understood by people who can read English. Two reports are relevant; one shows the number of links in English and the other shows the number of labels in English [1].

At this time there are 757,967 items with an English label and without an article. This is 4.7% of the total number of items Wikidata holds. At the same time 58% of the items do not have a label in English.

Not having a label does not mean that we cannot provide meaningful information. The name of a Dutch or Spanish person is for instance perfectly understood; it is typically written exactly the same in English. Reasonator understands this and always presents a label anyway.

It is fairly easy to start sharing this "missing" information. It is already done in many Wikipedias. The suggestion to share more information has been put to all Wikipedias and several "communities" do not think it is a good idea. In effect they prefer an inferior product providing a subset of the information that should be available to all our readers.

[1] It shows the numbers for other languages as well, and the statistics are near real time. It takes a minute for them to be presented to you.

Sunday, August 24, 2014

#Reasonator - A new metric for #Wikimedia

Denny wrote a really good article in the SignPost. It includes a "TL;DR" that I am happy to quote.
TL;DR: We should focus on measuring how much knowledge we allow every human to share in, instead of number of articles or active editors. A project to measure Wikimedia's success has been started. We can already start using this metric to evaluate new proposals with a common measure.
The point Denny makes is great; we aim to enable every human being to share in the sum of all knowledge and we should measure the extent to which we are achieving this goal. When you read the article carefully it does not say Wikipedia, it says Wikimetrics. The point Denny makes is very much that we need to focus on what it takes to bring information to people.

Presenting data that is available to us as information is what Reasonator does. It relies on what is known in Wikidata about articles that exist in any Wikipedia. To make this understood to a person, the number of available statements and the number of available labels for an item are key.

When Wikimetrics is to appreciate the potential of Wikidata and the approach Reasonator takes, it should include three bits of information;
  • the number of statements per item
  • the number of labels per language
  • how items are covered with labels in a language
With such an approach the graph will be substantially different. Not one language covers 50% of all the topics known to Wikidata and consequently the graph will show that there is much more work for us to do. It will also indicate that the amount of information that is available for a public that can read English is much larger and the amount available to people who can only read Gujarati is much less.
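The three measures listed above are easy to compute once the data is at hand. What follows is a minimal sketch under stated assumptions: the four sample items, their placeholder identifiers and the function names are made up for illustration and are not part of Wikimetrics or Reasonator.

```python
from collections import Counter

# Hypothetical sample: each item has a statement count and labels by language code.
ITEMS = {
    "Q_1": {"statements": 7, "labels": {"en": "Example one", "nl": "Voorbeeld een"}},
    "Q_2": {"statements": 2, "labels": {"nl": "Voorbeeld twee"}},
    "Q_3": {"statements": 0, "labels": {"en": "Example three", "gu": "label in Gujarati"}},
    "Q_4": {"statements": 3, "labels": {}},
}


def statements_per_item(items):
    """Average number of statements per item."""
    return sum(item["statements"] for item in items.values()) / len(items)


def labels_per_language(items):
    """Number of items that have a label, counted per language code."""
    counts = Counter()
    for item in items.values():
        counts.update(item["labels"].keys())
    return counts


def coverage(items, language):
    """Share of all items that have a label in the given language."""
    return labels_per_language(items)[language] / len(items)


print(statements_per_item(ITEMS))
print(labels_per_language(ITEMS))
print(coverage(ITEMS, "en"), coverage(ITEMS, "gu"))
```

In this toy sample half of the items have an English label and a quarter have a Gujarati one; the real numbers would come from the label statistics.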

#Wikidata - Ameyo Adadevoh a physician from #Nigeria

When a Mr Sawyer arrived in Lagos and showed symptoms of Ebola, Mrs Adadevoh took control of the situation and thanks to her efforts Ebola was largely contained. In the end it did not save her; as a physician in the frontline of the fight against Ebola she became a victim herself.

Mrs Adadevoh is another hero of our times. When you google about Ebola and Nigeria, there are two things that are of interest; sadly there are the opinion pieces that see a conspiracy in the coming of Mr Sawyer to Nigeria, but more positive is the information about the efforts to contain Ebola in Nigeria and what you can do to avoid becoming infected; personal hygiene is key.

There is a call to ensure that hospital staff are immunised. It is quite obvious that no country can really afford to lose key people like Mrs Adadevoh. It is equally obvious that all doctors and nurses who have to deal with Ebola patients need to be protected. Without them, containing and treating Ebola is impossible.

Saturday, August 23, 2014

#Wikidata - the #beta label lister

At the hackathon of #Wikimania2014, work was done on a new version of the label lister. It is a gadget that allows you to edit labels and aliases in other languages. It proved to be an indispensable tool to me. Today I learned that the new label lister is now available.

The most wonderful thing is that it became much more compact; you do not need to click as much anymore and it "just works". In the screenshot you see Mrs Bundschuh, a former member of the Landtag of Bavaria, and as you can see it is trivially easy to add a label in your language.

I hope that functionality like the label lister will make it into a core feature of Wikidata.

Monday, August 18, 2014

#Twitter - #WikiParliaments.. but what about #Wikidata and #Austria?

Twitter advertised several things that I might like. WikiParliaments could be one of them. Today I learned that Othmar Tödling died. He was a member of the "Nationalrat" of Austria. As such he might be very much of interest to WikiParliaments.

Politicians are human too; they die. When they do, it is often noted in a category what function they held. Today I started adding statements for those humans who hold or held the function of parliamentarian in Austria.

My hope is that people who care about parliaments will make these items even prettier and embellish them with even more statements and qualifiers.

Sunday, August 17, 2014

#MediaWiki - #MediaViewer rehashed

Some things are plain stupid, sometimes I am and sometimes someone else is. I filed a bug about my experience of the MediaViewer. For me it is a show stopper; it prevents me from using it easily.

The problem is that Chrome shows a really awful URL for an image with funny characters in its title. When I look at it using the MediaViewer it is bad but it looks fine when I look at it from the Commons page.
  • File:%C3%89cole_normale_sup%C3%A9rieure_de_Paris,_26_January_2013.jpg
  • File:École normale supérieure de Paris, 26 January 2013.jpg
According to the Bugzilla triage I must be stupid because it works; it complies with specifications and indeed, technically it works. It just stopped working for me.
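Both titles above are the same file name; the first is its percent-encoded UTF-8 form. A short Python sketch using the standard library shows the relationship. It only illustrates the encoding and makes no claim about what the MediaViewer or Chrome do internally.

```python
from urllib.parse import unquote

encoded = "File:%C3%89cole_normale_sup%C3%A9rieure_de_Paris,_26_January_2013.jpg"

# Decode the UTF-8 percent escapes, then show underscores as spaces,
# the way page titles are displayed.
readable = unquote(encoded).replace("_", " ")
print(readable)
```

The printed title is the readable one from the second bullet.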

Several reactions are possible. My choice was to shrug, mutter "it is the user experience, stupid" and get on with my life. Others find it a precursor to the invasion of an evil overlord who does not understand the world and prepare for war.

By filing a bug, by posting this blog I have rid myself of my frustrations. I know several developers; I met many of them at Wikimania and I know they are really dedicated and mean well. I also know that such things pass. I am sure someone will see the light or Google will fix Chrome (if that is where the bug lives). In the end I do not look at images that often as a result.

#Wikidata - sources or confidence

At this time Wikidata has more than 36,396,372 statements; these statements are associated with some 15,335,451 items. The majority of these items have fewer than five statements and, even worse, for many items it is not known what they are about.

When you consider the quality of this data, there are two schools of thought. There are those who insist on sources with every statement, and there are those who have confidence in the validity of the data because they know where it came from.

Either way, when you want to assert that a specific approach is superior, it becomes a numbers game, and understanding the relative merits is what it is all about. When something is sourced, you can be confident that it is highly probable at the time of the sourcing. There is however no certainty that the data remains stable. Confidence can be maintained by regularly comparing the data with what the source has to say.

When the data is regularly compared, it does not matter that much if Wikidata has source information itself. The source is typically one of the Wikipedias and they are said to have sources; this may provide us with enough reasons for confidence. The comparison of data increases this confidence, particularly when multiple sources prove to be in agreement.

Practically, the basic building blocks to start comparing exist. It has been done before by Amir and he produced long lists of differences. Three things are needed to establish new best practices:
  • a well defined place needs to be established where such reports may be found
  • communities need to understand that it raises confidence in their project
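A comparison of this kind can be sketched as follows. This is a minimal sketch and not the approach Amir used: the function name, the placeholder item identifiers and the dates are hypothetical. P569 (date of birth) and P570 (date of death) are real Wikidata property numbers.

```python
def differences(wikidata_values, source_values):
    """List (item, property, Wikidata value, source value) where the two disagree."""
    report = []
    # Only keys known to both sides can be compared.
    for key in sorted(wikidata_values.keys() & source_values.keys()):
        if wikidata_values[key] != source_values[key]:
            item_id, prop = key
            report.append((item_id, prop, wikidata_values[key], source_values[key]))
    return report


# Hypothetical values keyed by (item, property).
wikidata = {
    ("Q_1", "P569"): "1900-01-01",
    ("Q_1", "P570"): "2000-01-01",
    ("Q_2", "P570"): "2014-01-01",
}
source = {
    ("Q_1", "P569"): "1900-01-01",
    ("Q_1", "P570"): "2000-01-02",
    ("Q_2", "P570"): "2014-01-01",
}

for line in differences(wikidata, source):
    print(line)
```

Running the same comparison against several sources and counting how often they agree is one way to turn such a report into a measure of confidence.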

#Wikidata - giving a #category an application

Many #Wikimedia categories have interlanguage links. Obviously these linked categories do not all have the same content. Someone has to add the articles; sometimes it gets done and sometimes it doesn't. Often articles just do not exist.

When the facts that are implicit in what a category is about make it to all the items in all the categories, typically you have a superset in Wikidata. It does not stop there; items in Wikidata may be included that are not in any of those linked categories.

This is all theoretical unless ... unless you can query Wikidata and use the results. Much data has been added to Wikidata based on the content of categories, and queries have been used to identify missing items; this is done using AutoList2. This is one application; it is used by some of the "advanced" users of Wikidata.

What is even more interesting is showing which Wikidata items should be in a category. This is done using Reasonator. At this time statements that define a query are included for over 690 categories. Such a query is already complex enough that the Wikidata functionality will not be able to express the results.

These queries could be of use to "advanced" Wikipedians because it is a basis for identifying articles that have not been categorised or articles that still need to be written in their Wikipedia. For everyone else it is just interesting; this information exists and it is readily available. It is one way of learning that Wikidata knows for instance about 121,922 politicians.

Saturday, August 16, 2014

#Wikidata - application for its long tail

When Lauren Bacall died this week, it was all over the news. When Marjorie Stapp died on June 2, 2014 it was noted in the English Wikipedia only yesterday. Today it is known to Wikidata, and several bits of information were added to the item about Mrs Stapp as well.

Among those statements is her identifier in the IMDB. The IMDB does not know yet about the demise of Mrs Stapp and it is not unlikely that there are more actors and actresses we know about that have died. Providing external sources like the IMDB with an RSS feed of the changes that are made in Wikidata is not hard.
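To give an idea of how little is needed, here is a minimal sketch that turns a list of changes into an RSS 2.0 document with Python's standard library. The change record, the placeholder item identifier `Q_STAPP`, the channel texts and the function name are assumptions made for the example; no such feed is claimed to exist. P570 is the Wikidata property for date of death.

```python
import xml.etree.ElementTree as ET


def build_feed(changes):
    """Build a minimal RSS 2.0 document from a list of change records."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # An RSS channel needs a title, a link and a description.
    ET.SubElement(channel, "title").text = "Wikidata changes for items with an IMDb identifier"
    ET.SubElement(channel, "link").text = "https://www.wikidata.org/"
    ET.SubElement(channel, "description").text = "Sketch of a change feed"
    for change in changes:
        entry = ET.SubElement(channel, "item")
        ET.SubElement(entry, "title").text = (
            f"{change['label']}: {change['property']} set to {change['value']}"
        )
        # The identifier is not a URL, so it is marked as not being a permalink.
        ET.SubElement(entry, "guid", isPermaLink="false").text = change["item"]
    return ET.tostring(rss, encoding="unicode")


changes = [
    {"item": "Q_STAPP", "label": "Marjorie Stapp", "property": "P570", "value": "2014-06-02"},
]
print(build_feed(changes))
```

An external party would then only need to poll the feed and match the entries against its own identifiers.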

When we share our information in this way, we gain friends. With these new friends we may do friendly things like noting differences between the data that we hold. Equally important, we add a reason why people might maintain the data that is in Wikidata. As our data gains in application, we will grow and diversify our community.

Thursday, August 14, 2014

#Wikimedia - the quality of access to the sum of all human knowledge

Again, a big flare up of "we the community" demand this and that. Again, what Wikipedia and the Wikimedia Foundation are about is conveniently forgotten. At Wikimania there was a really interesting presentation by Raph Koster, author of "A Theory of Fun for Game Design". Well recommended once it is available for viewing.

An abstraction of the current huha is in there and this community is described as the monsters who rule it all (my words, his pictures). These people who impose their world on others have forgotten what the game is about. It is about providing access to the sum of all knowledge. From that perspective their issues with the multimedia viewer are hardly significant compared with the increased ease for people who just access the parts of human knowledge we do give access to.

My pet example of "the community" not caring about providing access to our available knowledge is in the decision that easy and obvious access to fonts adds clutter to the user interface and is therefore not acceptable... About seven percent of a population is dyslexic and it is extremely hard to find and enable the OpenDyslexic font. It took a MediaWiki developer over two minutes and he enabled it in a way I did not know existed... He knew it existed, he knew the name of the font. This demonstrates how relevant seven percent of our reader population is to our community.

Should we primarily care about access or is it a playground for monsters?

#Wikidata - It ain't got a thing

A rose, a rose, is a rose by any other name as beautiful.. Eh actually people are quite smart and know a rose when they see one. Machines need to be told what is a rose.

Wikidata has this requirement of being usable by machines. So we need to know what thing a thing is and for all humans it needs to be stated that all of them are considered human.

Several high powered people at Wikimania expressed the opinion that for Wikidata to get in full swing, we have to identify every thing.

I have identified a few hundred "list articles". Items that start with "List of " or "Member of " for instance. I have identified a lot of "group of people" who were supposed to be born in the XXth century.

At Wikidata a thing is bad. We cannot safely select it, we cannot auto describe it. We should get rid of every thing.

Saturday, August 09, 2014

#Wikidata - Dear Lila, it is all about the application

Three months in the job, Lila did an analysis of where we are with our projects. The way she brought it was very much traditional; Wikipedia and English Wikipedia at that. The challenges were not that traditional; much of the public will be mobile and they will not be where they are today.

Another requirement is that all the new people need to be able to contribute. Removing the existing road blocks is absolutely necessary.

When people are to contribute, they have to have a reason to contribute. They will need to benefit from the effort. This year Commons will be wikidatafied and it will become possible to search in multiple languages. The Amnesty International community may add the people on their watch list to Wikidata. In this way what we do in Wikidata gets more of an application.

When we start thinking in terms of how people will be able to use the data we have in store for them, we will find more contributors. Their data will become better connected. The value of our data will increase and we will realise the aspiration of more people in more countries being involved in what we do. We will not only share in the sum of our available information; they will put it to use for us.

Friday, August 08, 2014

Dear #Wikipedia, they are not what we call a "human"

At #Wikidata it hurts when you are cheating. For us a human is singular; he or she has a date of birth, maybe a date of death and that is what we expect to find in "20th-century births" and all its subcategories.

We could argue that a horse, a cat or a dog has a date of birth as well but really, Scott Alexander and Larry Karaszewski, for instance, are not one human or singular. Together they do not have a date of birth, they have two.

Because of the problems articles like the one about Scott and Larry generate, we put them on a "black list". We make them a "group of people". In this way we will not consider them for all kinds of subsequent statements. We will not make them an alumnus or give them an occupation. That is reserved for humans.

Thursday, August 07, 2014

#Wikimania - Mr Salil Shetty, #Amnesty International

Mr Shetty spoke at Wikimania 2014. He explained how much Amnesty International and the Wikimedia Foundation have in common. He did a good job and many of the people at the conference proved to have been involved with campaigns of Amnesty in the past. One of the things it does is ask for international attention for people who are in trouble; they are often in jail, they have been tortured and, as the record shows, attention helps.

What we could do at Wikidata is decide that all the people who need attention for reasons valid to Amnesty International are notable enough to have an item. We would have all the notable information about these people; this includes their profession, the fact that AI considers them at risk and links to more information, for instance at the website of AI. This provides the basic information when people decide that there are enough reasons to write a Wikipedia article.

All too often the tipping point for writing an article is the death of a person on a list like this. My hope is that there will be few of these occasions when an article actually gets written.

Monday, August 04, 2014

#Reasonator - Malungsgrynnan är en/ett Ö

Malungsgrynnan is an island in the Kalix archipelago in Sweden. According to this Dutch category, there are some 259 islands.

It is quite interesting that the Dutch Wikipedia knows about these islands while the Swedish Wikipedia does not. Information has been harvested and information about all these islands can be seen in Swedish and any other language thanks to Reasonator.

The best part is that this information, any information available in Wikidata, may be fully available in your language as well. Just add the missing labels.

Saturday, August 02, 2014

#Wikipedia - Municipio de #Prescott (#Dakota del Norte)

Prescott is a township in Renville county, North Dakota. It has 28 inhabitants and such is its notability that only the Spanish and the Vietnamese Wikipedia have an article. Pictures can be found thanks to Google Maps.

What is really funny is that all the localities of Renville county can be found in the Spanish category. It is not hard at all to add all of them to Wikidata. It is funny to see all those Spanish names when Reasonator presents information about Renville county.

A message to the researchers of Wikipedia; Rambot did not add all the places of the United States. Wikidata is actually more inclusive than the English Wikipedia. Who knew...

#Wikidata - is the 60th Infantry Division (Wehrmacht) a "division"

It seems obvious; the 60th Infantry Division (Wehrmacht) is an "infantry division" and it is part of the Wehrmacht. There are a few problems though:
  • the logo has it that it is the "60th motorised division"
  • Wikidata does not know "infantry division"
  • The German Wikipedia has a category of all infantry divisions of the Wehrmacht
  • The Waffen SS is considered to be a division of the Wehrmacht in Wikidata
It is fairly easy to include all the divisions that are in the German category as divisions and as being part of the Wehrmacht. It is fairly easy to create an item for "infantry division" as well and use that instead of "division". Or have a "motorised division".

It is all too easy to have other people become angry when you decide one way or the other. They are the splitters and the lumpers. The worst thing is that all communication is only happening in writing and often quite angrily. Communication on IRC is dysfunctional and this is considered a failure of the community because they could use one channel that is dominated by discussions about Wikidata development. 

Never mind; you cannot win, maybe play even.

Friday, August 01, 2014

#Wikidata - #priorities

Wikidata is relevant because it is used. The primary use case that makes Wikidata relevant is the interwiki links. Wikidata gained relevance as more Wikimedia projects used Wikidata for their interwiki links.

There are ambitions to add more relevance to Wikidata. This is done by giving an application to the data that is included in Wikidata.

  • external sources are shown based on Wikidata information
  • automatically generated text for "humans" in English
  • automated descriptions
  • presentation of information in Reasonator
  • query in WDQ, Autolist and Autolist2.
  • automated descriptions are used in "Wikidata search" on many Wikipedias
Particularly for humans, this works rather well. This house of cards relies very much on one premise; that we know what a Wikidata item is about; is it a human, a house, a settlement or a record label?

For most of our items we do not know what the item is about. For me it is a priority that as many items as possible get identified for what they are as soon as possible. This makes it possible to make a statement that a person is a journalist, an engineer or whatever.

There are issues with this approach. There is this potential for false positives and at the same time it provides us with possibilities as well. Once false positives have been recognised as such, they can be used to identify other items for what they are. Once the right identifiers have been set as well, it becomes easier to remove the wrong identifiers.

The biggest issue however is with our community. Not everyone shares the same priorities and insights and this is made worse because many people at Wikidata only know the standard functionality. For them "Widar" is playing a game. They are not aware that the bulk of the edits are done with Widar or by bot and that the interface they use is inferior. These people have good intentions but they have no clue.

Add to this that communication channels do not really function and what we have is quite messy. 

It will not really improve until we share more of our priorities, listen to each other's arguments and work together towards solutions. As it is, there are too many perspectives and conflicting priorities aplenty. Consequently there is plenty of frustration shared by all.