Tuesday, January 31, 2006

Rule number one: You are wrong !

There are those days when you must be wrong .. Today was one such day: because the vision that is behind WiktionaryZ "appears to be based on a simplistic and Eurocentric view of language". It must be true because that is what is being said .. If not, rule number one must be applicable ..

Today a nice standard called OLIF was pointed out to me. OLIF stands for Open Lexicon Interchange Format; it is an open standard for lexical/terminological data encoding. It is essentially a Free and Open standard, and it has many illustrious and industrious backers. But it is a tad Eurocentric. It now has East Asian language support, but it is best at English, German, French, Spanish, Danish and Portuguese.

I would not want to dismiss a standard like OLIF when people are actively involved in it .. To be useful, it would be necessary to define how the standard is lacking; it might be just a matter of some refining. It might be that the standard does consider things that I have not considered yet (and I do know that I do not consider everything all the time).

On Meta we are voting on a new Wikimedia project; this project is about standards .. It is about discussing standards, but more importantly it is a conduit for working on those standards that do not fit the requirements that we have for standards in the Wikimedia Foundation's projects. Now there are two options: I am wrong or the standards will fit our needs like a glove.

Thanks,
GerardM

Saturday, January 28, 2006

About new words

More and more I find how my appreciation of things to do with words is changing. By chance I found this book, Taal van het jaar vijf. It is a publication by van Dale, the Dutch producer of quality dictionaries; it is a collection of 52 articles describing new words that make their appearance in the Dutch language. With my interest in words and more, I find it a superb collection of observations on how a language changes through the inventiveness of the people who use it.

If you know Dutch, you will love it. If there is more of this, also on other languages, on the Internet .. I would appreciate learning about it :)

Thanks,
GerardM

Monday, January 23, 2006

Standing on the shoulders of giants

At a conference I went to in December, I received a promotional CD for the "Referentiebestand Nederlands" of the TST-Centrale. Yesterday I spent a considerable amount of time digesting the information on the CD. I learned a lot, particularly about the power of having content in a database that can be used in many different ways. The "referentiebestand" is a corpus-based dataset with 45,000 Expressions with morphological, syntactic, collocational, semantic and pragmatic descriptions. (Many of the articles on these terms are considered stubs in the English Wikipedia; please help in describing these subjects better.)

I have learned a few things. I am right when I understand that the location of words in a sentence is relevant. The TST-Centrale uses a fixed notation for it; I wonder how universal it is.. On the CD it is explained that people who use this content, can select the information they want to use and that this is part of the secret of its success. We hope that having the WiktionaryZ available as a database will serve in a similar way.

Reading back the first paragraph, I feel like a Tom Thumb. Every second noun is something to look up and it was hard work writing it. The great thing is that when you have it in front of you, when you see it demonstrated, it does make sense. When we collaborate on content and structure, when we make WiktionaryZ something that is useful because it has an application, we will have giants that help us little people make progress when we are allowed to stand on their shoulders and move forwards with them.

Thanks,
GerardM

Saturday, January 21, 2006

A discussion on trusted computing

Today I had a big discussion about trusted computing. The question was: is it bad, is it good? Should we be against it, and why?

From my perspective the biggest problem with trusted computing is that it may be an open standard, but it is not a free standard. The specifications of the standard from the Trusted Computing Group are available to organisations, but it costs at least $1000, and this excludes all the people who create wonderful programs that people share. Because of this lack of openness it has a fundamental problem. It makes me trust organisations that I do not necessarily trust; why should I trust Yahoo, MSN and AOL as they demonstrate what I perceive as a lack of protection for the privacy of their clients? This trusted computing architecture does not allow me to trust my own software or the software created by a friend. It does not because I do not see how I can create software that will be trusted by my system.

Personally I do not think trusted computing is the equivalent of digital rights management. I am of the opinion that DRM leads to giving away rights that are mine.

Trusted computing does one other thing. I expect that it will take away much of the anonymity that is still with us on the Internet. This aspect is probably something that few people have considered. My first clue came after I realised that it is a perfect tool for fighting vandals on Wikipedia. It gives us a tool to know more precisely where these people come from and it can give us much better protection against this scourge. The other side of the coin is that when it provides us with more control, it will also give more control to those that I do not trust to use it wisely.

Thanks,
GerardM

Sweet nuggets of relevancy

The Kamusi project is one of those deserving projects; it is a project dedicated to the Swahili language. It intends to create a dictionary. A dictionary that has been worked on for many years. A resource to be proud of.

It is only lacking in this one resource.. money.. If it does not find money it will pack it in. They are now finishing off software so that the project will be in the best shape it can be. I truly hope that the Kamusi project will find its way into a bright future. If you can help, please do :)

Thanks,
GerardM

Friday, January 20, 2006

etymonline.com

Giving good content a home is expensive. When you have good quality content, the people will come. Two truisms that are obvious. Dvortygirl mentioned the etymonline website today. In this website you see both truisms in action. It is a great resource, people love it and there is a problem hosting the great content because of the cost associated with a great user experience.

As it says in the information on the main page, they only have some 20,000 visitors, and the size of the database is "only" 51.08 MB. My reaction to these problems is predictable; I would like to host this information in WiktionaryZ when we are ready for etymological content. There are however a few issues. These issues have to do with the perception of Wikipedia.. Let me quote:
  • "Approach Wikipedia about a partnership, or actually merge the site into Wikipedia. This is a painful option. In a sense, this site is the anti-Wikipedia. It is deliberately not open source. You'd understand why if you saw the regular stream of e-mails I get from people insisting that their own crackpot folk-etymology idea is absolutely correct. Such things can be based on some fervent politico-racial agenda, simple insanity, or "my French teacher in 10th grade told me."

    A major reason this site exists is to serve as a template against which to measure people's best guesses and wacko theories. The whole Internet is a big Wikipedia; this site is a compilation of the most rigorous academic information."

I would have several answers but the most relevant thing is Wikipedia=Crackpots. Yes, our project is not Wikipedia. Yes, we want well researched information. No, it is not Open Source it is Open Content. All this does not address the issue of the quality of content.

In WiktionaryZ we have a need to address quality. In the current Wiktionary we have our crackpots. There is this loon who insists on there being a word called "exicornt". He even threatened admins in order to have this word exist in the Wiktionary.. :( We have to deal with these persons.

We could do something for etymonline. We can approach them. We can offer to host its content by importing it (obviously with proper attribution), but we sure have to address the issues. Quality is important and we have to protect quality content for our own sake. When we do, and when we are successful, opportunities like this will be less problematic.

Thanks,
GerardM

Thursday, January 19, 2006

NPOV and sources (continued)

In the blog entry titled NPOV and sources, I mentioned the need for one resource where the sources are given for the Wikipedia articles on one subject. The consequence is that you may end up with a resource that is truly big. Today I learned about a UNESCO resource that has 1,600,000 references: the Index Translationum. The Index Translationum is a resource of books and their translations.

This is a marvelous resource, but there are some caveats. The on-line data is only from 1979 onwards while the resource started in 1932. The printed edition is available in the UNESCO library in Paris...

There are many possibilities with a resource like this.. It is indeed one of the resources that would benefit on a massive scale from a Google Book action. Obviously I would not care who does the scanning and who spends effort on digitizing this content. It is however an extremely important resource and everyone would benefit if the data from 1932 to 1979 becomes publicly available, as is the intention of this project.

There is more to learn about this project. When considering using it for presenting our sources in Wikimedia projects, we would like an API to refer to it.. I have not looked into it.

Thanks,
GerardM

Wednesday, January 18, 2006

What do you do when your computer is broken

Well, you ask if you can use someone else´s computer. I did. So then I used my mother´s computer.

What do you do when the WIFI router does not work anymore .. You use the ethernet option this router provides...

What do you do when the keyboard mapping is broken .. you do not use question marks ..

Life is a bitch.

Thanks,
GerardM

Monday, January 16, 2006

What to write about ..

With the new WiktionaryZ blog, topics about WiktionaryZ will go there. There are many topics that I have been writing about that are not directly related to WiktionaryZ. These are topics like IEEE-LOM, fundraising, Wikidata, how I think we can improve Wikimedia projects..

Advertising

Yesterday Amgine wrote an "Advertising proposal". The idea is that we can advertise the services that the Wikimedia Foundation provides. Essential for good advertising and marketing is that you target your audience. With such an attitude comes the focus that would improve our content. The advertising would be done using an advertising server; the many people that have their own wiki, their own blog, could include advertisements from this server.

Wikidata

The functionality that was originally conceived for Wikidata will end up in MediaWiki itself; this is both a blessing and a curse. The great thing is that it does signal that much of the functionality that we conceived is indeed relevant and it will bring functionality to all the wikis that use MediaWiki. The drawback is that it has implications for the design. It does complicate things at this stage, but on the other hand when we have a great instead of a good design from the start, the extra time and effort needed now may pay for itself in the long run.

Thanks,
GerardM

Sunday, January 15, 2006

Our blog

Words and what not is my blog. Now that we have a "commission", my blog will become different. It must become different because it will be important that it is understood that all these people have their own opinions, ideas and that this commission is also a process. A process of getting this thing on the road, a process of developing a consensus of what we are doing.

All these things are not new. It happened all the time.. Now we will make this process more visible; to see it, read this other blog, the WiktionaryZ.blogger.com blog of the commission.

The thing I am not sure about is to what extent I will continue blogging here and to what extent I will blog on the new blog.

Thanks,
GerardM

Saturday, January 14, 2006

Presenting: the Commission

WiktionaryZ is a big project. It is becoming bigger. It is an innovation on the Wiktionaries, and as such it is a continuation of these projects. It requires innovation in the MediaWiki code; the first software is about to be merged in and become available in all MediaWiki wikis. There is the first alpha version of the WiktionaryZ software, and it hosts the GEMET data. We are talking with several relevant organizations that share the dream that still is WiktionaryZ.

There is one issue; people think that the problem with this project is that it is run by one person: me. It was once put to me that I represented a "truck factor" for the project. From my perspective, WiktionaryZ is certainly my dream, but it has never been my dream alone. I have worked hard to make WiktionaryZ happen, but I am not the only one that worked hard to make it happen this far. I came up with many ideas that made WiktionaryZ what it is, but not all ideas have been mine; all the ideas became WiktionaryZ because of the many conversations, e-mails and IRC chats about them.

From my perspective, there is this issue that I do not scale. There are things that amount to policy and they need to be expressed, because policy dictates technological choices and we are building the technology for WiktionaryZ. This implies that some policies have been set, but it also implies that more questions will rear their heads that require an answer, and require it quickly, because we are building the software, the network, the connections now.

I did discuss this issue with Jimmy Wales, and he came up with the great idea to have a commission for projects like Wiktionary. This commission could have several functions; it can act for projects in a similar way as the chapters do for countries. One role of the commission would be to ensure that the policies of a project are consistent with the aims, the policies of the Wikimedia Foundation. Another equally important part would be to represent the community that will make WiktionaryZ its own. To do all this, members of such a commission have to be part of the discussions about the developing Wikidata and WiktionaryZ.

I like the idea, and I have asked several people to become part of an initial commission. They are all Wiktionarians (with one exception), they represent many Wiktionary projects and they are and have to be communicative; they do use Skype/VOIP and often can be found on IRC.

The commission:
Including Sabine Cretella is obvious. Sabine developed WiktionaryZ with me from the start. WiktionaryZ is a dream she has fostered for a long long time. Sabine is active on the Italian Wiktionary and is one of the initiators of the Neapolitan Wikipedia. Sabine is a professional translator and is known in this world as an evangelist of Open Source and Open Content.

Of all the Wiktionaries, English is the most relevant. Dvortygirl has been active there for a long time. Like me, she is an admin and she is well liked and respected for her work.

Gangleri is active on many Wiktionaries. The thing I really appreciate is his involvement with right-to-left languages like Yiddish. On the Internet, these languages have their own issues; Gangleri is active in the Mozilla organisations to address several of these.

Yann is active on the French, Hindi and Gujarati Wiktionaries. He is also the treasurer of the French chapter. One of Yann's challenges is to help us get more people interested in the languages of India.

Erik is the one who is not into Wiktionary. The reason why he is invited is because he is the architect and realiser of Wikidata. This is the enabling technology for WiktionaryZ. Erik is also the realiser for WiktionaryZ. Wikidata has in WiktionaryZ its first application. As this is a truly big and complex project, many of the things that would hit Wikidata eventually, need to be addressed from the start. Erik is also important as a linking pin to the Mediawiki developers.

GerardM, if I need an introduction I invite you to read this blog.

Some policies or, a glossary of our policies:
Availability: "our data is to be made available through open standards and in a non-discriminatory manner"
Data design: "WiktionaryZ is implemented as a relational database. When information that is relevant in the context of WiktionaryZ cannot be added, we will try to amend the design."
Full functionality: "WiktionaryZ needs to be able to include the information that is available in the Wiktionaries"
Partner: "a partner is an organisation that collaborates with us in the realisation of what we intend with WiktionaryZ"
Sponsor: "a person or organisation that donates money or content to the project or to the Foundation"
Success: "success is when people find an application for the WiktionaryZ data that we did not think of."
User Interface: "the interface should be in any language. We want this both for the Mediawiki and the WiktionaryZ user interface"

Thursday, January 12, 2006

Dialects within a language

There are Wikipedias in many languages. So far there are some 212. Most of these languages have an ISO-639 code. There are two versions of this code that are "official", and there is one version that is workable; the ISO/DIS-639-3 is currently maintained by SIL International. However, workable does not mean that it is perfect. Today there were two moments where the current practices relating to ISO-639 were the issue.

Java uses ISO-639 for its language codes. The code used is ISO-639-1. Consequently the Neapolitan language is not known. OmegaT is an open source CAT tool; it uses the languages known to Java as the languages that it can translate.. So in order to translate to Neapolitan you have to pretend that it is a different language.. Not nice.. So the nice people of Sun were asked about this and we have great expectations.
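The gap is easy to picture with a small sketch. The two tables below are hand-written excerpts, not a complete registry, and the function name `two_letter_code` is made up for this illustration. The only point is that a tool limited to two-letter codes has no entry for Neapolitan, while a three-letter code (`nap`) does exist.

```python
from typing import Optional

# Hand-picked excerpt of three-letter codes (not a full registry).
THREE_LETTER = {"nld": "Dutch", "ita": "Italian", "nap": "Neapolitan"}

# Excerpt of the two-letter codes: only some languages have one.
TWO_LETTER = {"nld": "nl", "ita": "it"}


def two_letter_code(three_letter: str) -> Optional[str]:
    """Return the two-letter code for a language, or None if it has none."""
    if three_letter not in THREE_LETTER:
        raise KeyError(f"unknown language code: {three_letter}")
    return TWO_LETTER.get(three_letter)
```

A tool that only accepts the result of `two_letter_code` gets nothing back for `nap`, which is why a translator has to pretend Neapolitan is some other language.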

Today there was a request on Meta, the website about the Wikimedia Foundation's projects, for a new Wikipedia. The request is for Tarantino; it is considered a dialect, a dialect of Neapolitan. This request is problematic because there is not even an ISO-639 code. Consequently there is little chance of a Wikipedia being created for it. Now, with the new namespace manager, it is possible to create a separate namespace within nap.wikipedia.org for the Tarantino dialect. This is also a solution for the problematic request for a Lower Saxon Wikipedia that will be in an orthography that is not German..

It is sobering to see that standards can enable things and can prevent things from happening. Good standards are vital and ISO/DIS 639-3 is a big move forward.

Thanks,
GerardM

Wednesday, January 11, 2006

Machine translation

Machine translation is a difficult thing. It is hard to get it right; even the English language Wikipedia has a problem with writing a good article about it. Some people think that with WiktionaryZ we are in the business of machine translation, and as a result the subject gets on our radar screen every now and then. It would be a good characterization to say that the Machine Translation we could use is the one that is able to translate the definitions of a DefinedMeaning into all languages. The resulting translation should be good enough that people have an idea of what is meant .. in Bambara, or Papiamento... Translations like these are what we need, and as far as I am aware there are no Machine Translation engines for them.

To create a Machine Translation engine for these languages, you would need all kinds of rules about the languages that you want to translate to. Now I would like to know these rules. Not so much to build Machine Translation engines but because they have this other application, one that is much closer to my heart: they are needed for software that teaches people languages. The Universität Bamberg is working on exactly this. They want to use the data of WiktionaryZ for this purpose. So if you have a nice set of rules for them I would be obliged.

Thanks,
GerardM

The importance of statistics

When you have statistics about your website, you know about your public. At this moment in time, the Wikimedia Foundation does not invest in resources for statistics. Yesterday, some people announced "AOL-day"; it reflects the astonishing growth of the Wikipedia project. We are able to notice these things through the services of Alexa.com. When you look at the data, you find the distribution of the traffic to the different wikipedias.

As astonishing as the growth of Wikipedia is the apparent popularity of Wiktionary. According to Alexa Wiktionary is more popular than dictionary projects that have much more content. This is probably the effect that the association with Wikipedia brings us.

When you look at the traffic details for Wiktionary, the thing that strikes me most is the popularity of the Russian Wiktionary. Such details point to the apparent strength of this project or to the need for Russian content. For me, this is relevant because it could be one reason why we concentrate resources on a given language.

As WiktionaryZ is a true Wiki project, there is no need to concentrate too much on the user interface. People WILL find what works and what does not work and to a large extent the user interface will evolve. However, at this moment we are thinking about the infrastructure of the project.

In the current infrastructure a resource is indicated by preceding the project name with the ISO-639 code; the Dutch Wiktionary is therefore http://nl.wiktionary.org. For WiktionaryZ we do not have a separate database for each project. When we maintain this link and point it to the main page for a language, we benefit in several ways: there is a main page for the whole project, there is a localised entry level per language, and the statistics of Alexa remain relevant for some basic analysis of the demand for our project.
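A minimal sketch of how this could work, assuming a single shared database behind all the per-language addresses. Both function names and the "Main Page/<code>" title scheme are invented for this illustration; nothing here describes how the WiktionaryZ software actually does it.

```python
def language_host(iso_code: str, project: str = "wiktionary") -> str:
    """Build the per-language host name used today, e.g. nl.wiktionary.org."""
    code = iso_code.strip().lower()
    if not code.isalpha():
        raise ValueError(f"not a language code: {iso_code!r}")
    return f"{code}.{project}.org"


def entry_page_for_host(host: str) -> str:
    """Map a per-language host to a main page inside one shared database.

    The "Main Page/<code>" title scheme is an assumption made for this sketch.
    """
    code = host.split(".", 1)[0]
    return f"Main Page/{code}"
```

With something like this the old addresses keep working, so traffic per language stays visible to an outside service that counts visits per host name.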


Tuesday, January 10, 2006

Transliterations

In my previous post I wrote about a mail that I received from someone from LISA. It was a great mail and what made it even more super was the thoughtful following correspondence.

The current implementation of our database is the first iteration of what is going to be WiktionaryZ. Its intention started off as at least including all the information that is included in the Wiktionaries. So far I was against the inclusion of transliterations, as we would want the translations anyway and often a translation is in effect a transliteration. The case was made that this is too simple.

Several points were made:
  • people find transliterations useful
  • many phrasebooks include them (their transliterations are often quite bad)
  • having a "standard" transliteration helps because for names like معمر القذافي in excess of 20 different transliterations exist.
  • transliteration should exist in addition to translations AND they are specific to a language
  • there are standards for transliteration (eg for transliterating Russian into English)
The best argument was left to last; transliterations were very useful in his work.

Given my stance on what should be included in the database, I would say I want to have this. However, this is complicated by the fact that transliterations are language specific AND they are on the same level as pronunciations are.
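To make the complication concrete, here is a rough sketch of what "language specific and on the same level as pronunciations" could mean in a data model. The class and field names are my own invention and are not the WiktionaryZ design; the spellings are just examples of renderings that are commonly seen.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Expression:
    """A spelling in one language, with data that hangs off the expression itself."""
    spelling: str
    language: str
    # Pronunciations belong to the expression, not to a translation of it.
    pronunciations: List[str] = field(default_factory=list)
    # Transliterations sit at the same level, keyed by the language they are meant for.
    transliterations: Dict[str, List[str]] = field(default_factory=dict)

    def add_transliteration(self, target_language: str, text: str) -> None:
        self.transliterations.setdefault(target_language, []).append(text)


name = Expression("معمر القذافي", "ar")
name.add_transliteration("en", "Muammar Gaddafi")
name.add_transliteration("en", "Muammar al-Qaddafi")
name.add_transliteration("fr", "Mouammar Kadhafi")
```

Keying by target language is what allows several competing renderings to coexist while still leaving room to mark one of them as the "standard" one later.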

Yes, this gentleman knows how to find Njoe Jork; he studied some Afrikaans at one point in his life.. :)

Thanks,
GerardM

Just another day

Today is just another day. I try to write something of interest every day. This is becoming more difficult. This is not because of a lack of things to write about. If anything it is the opposite. I do not want to write more than once a day and therefore selection is what is problematic.

  • We have discussed that we are in danger of getting too much content. This may mean that we have to slowly absorb content and not make a big mess.
  • There was someone new to me who came to Sabine and wanted to help us connect with someone we are already working with. The great thing is that this is a great person that can help the Wiki for Standards project along. He is also someone who could become relevant for WiktionaryZ
  • Today I found this proposal called Wikimaps. This is a proposal that was missing localisation, so I added it as a "must have" feature. When this is going to happen, many translators will be happy with a resource that helps with geographic data.
  • I have been thinking more about partners. There is so much work and there are so many problems that will come our way and there are so many organisations that can help us manage. It would be folly not to collaborate and it would be ungrateful not to acknowledge these organisations.
  • Quistnix told me that conversions are happening because of messages about "namespaces". He is very active with interwiki links so he would notice these things.
  • I received a mail about Maltese verbal morphology. This came with a request on how we could collaborate.
Truly most of these items deserve their own blog entry and I am certain that I have not mentioned everything that was relevant today.. For instance I received this nice mail from someone in the LISA organisation..

Thanks,
GerardM

Monday, January 09, 2006

What words do we want in WiktionaryZ

WiktionaryZ wants to have all words in all languages. So what words should we concentrate on? Brian0918 created a big list with some 306,390 English expressions. This list is a compilation of the entries in the "OED and AHD". To me there are several issues that should be addressed.
  • The way copyright infringements are signalled in relation to lexicological content
  • Do we want to emulate what others have done, or are we masters of our own destiny
  • Where are the French, the Swahili, the Kannada and the words of all the other languages that are equally deserving attention
People who compile lexicons include information that normally would not make it into a dictionary. The reason why they do so is that it functions as a "watermark". When this watermark is found, this is considered proof of copyright infringement. When you combine the lists of both the "OED and AHD", the prevailing theory would be that infringement is proven for both bodies of work.
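The way such a watermark would be detected can be sketched in a few lines. This is only an illustration of the idea; the function name is made up, and a real publisher keeps its planted entries secret, so the trap words used when trying this out are placeholders.

```python
from typing import Iterable, List


def find_watermarks(word_list: Iterable[str], planted_entries: Iterable[str]) -> List[str]:
    """Return the planted entries that also show up in a word list.

    Any hit suggests the list was copied from the work that planted them.
    """
    return sorted(set(word_list) & set(planted_entries))
```

A merged list inherits the planted entries of every source it was compiled from, which is why combining two dictionaries would show up as a hit against both.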

With WiktionaryZ we have the opportunity to use Wikipedia as our corpus. In Wikipedia we find the words as they are used today. When we concentrate our effort on these words, we provide added value to the information contained in Wikipedia and at the same time Wikipedia adds value to WiktionaryZ because it allows us to show words in context. In Wikipedia we have people from many countries that contribute, they do use words that are normal in their locale. With the "OED and AHD" we do not necessarily get these words.

It is not a bad thing that people interested in English concentrate on English. As such I welcome this effort. However, in my opinion the emphasis is too much on main languages like English. In these languages it is hard for WiktionaryZ to become relevant. To become relevant we have to do things that others do not. Relevancy can be gained in many ways; translation to minor languages is a way for some, counting the characters in a word is a way for others.

From my perspective, we become relevant by harbouring communities and special interest groups and allowing them to make WiktionaryZ their project. When we maintain our core values of Freedom, of inclusiveness, of non-discriminatory access, of open standards, we will be relevant to some if not to all.

Thanks,
GerardM

Sunday, January 08, 2006

Welcome Sabine :)

Today Sabine Cretella became a new user on the English Wikipedia. She was welcomed by TheRingess. Many people are really happy when they are welcomed. It is a reason many people give why they stayed with our project.

When we are going to have single login, people will know that Sabine is not a newbie. So what will happen when you are new to a project; would that mean that you will not be welcomed? Sabine and I found it funny that she was welcomed; then again, it is the charm of our project that you are welcomed.. What is funny is this..

Thanks,
GerardM

Mediawiki is very much a collaborative effort

Yesterday there was this meeting in Rotterdam. It proved to me again how powerful such meetings are.. I was very happy to meet Erik Zachte. Erik has earned himself a great reputation for the statistics that we provide with our projects. I learned that the Wikipedia and the Wiktionary statistics are based on similar data but they are slightly different in their presentation.

The biggest improvement for the production of statistics has been the policy that the dumps of the Wikimedia Foundation are in an XML format. This provides a much more stable basis for the production of statistics. With the advent of Wikidata and single login, I worry about how this will affect these popular statistics.

It was great to have Erik reassure me that we will cross that bridge when we meet it, "because we have always done that". Issues that may arise are: currently we do not have Wikimedia wide statistics; with single login implemented we can. Wikidata projects will be .. where and will probably not count for the project that the data has been entered for. With the introduction of the namespace manager in MediaWiki 1.6, some of the assumptions in the statistics may have to be revisited..

Thanks,
GerardM

NPOV and sources

The NPOV or Neutral Point of View is one of the fundamental policies of the Wikimedia Foundation. It serves us well, and it is a great instrument to prevent POV pushers from getting the upper hand. At this moment there is a clarion call for more quality and providing sources is seen by many as a "silver bullet".

Let me be clear about my position; I am all in favour of providing sources with articles. However, for every controversial subject you can find literature that "proves" all the crackpot ideas that are floating around. To complicate things even more, literature available in one country or language is not all the literature that exists on a subject. It is therefore my position that you do not prove anything by providing sources, you only prove that there are sources that helped us come up with a particular article and that it raises the standard of quality.

Yesterday, there was a New Year meeting of Wikipedians in Rotterdam, and as is usual at such meetings all kinds of things were discussed, including the problems with sources. During the discussion we came up with the following:

We need a central database to do away with the current interwiki links. When we have such a database, we should have all sources for the same subject, never mind what Wikipedia it is, in there as well. It helps people find sources, but also when there is a difference in the view taken between the different Wikipedias, the sources can be compared and it will prove to be a useful instrument to deal with cultural bias.
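What we talked about could look roughly like the sketch below. It is only a thought experiment; the class name, the method names and the idea of keying on a wiki code are my assumptions, and no such database exists yet.

```python
from collections import defaultdict
from typing import Dict, Set


class SourceRegistry:
    """Keeps the sources for one subject together, whichever Wikipedia cited them."""

    def __init__(self) -> None:
        # subject -> wiki code -> set of cited sources
        self._by_subject: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def add(self, subject: str, wiki: str, source: str) -> None:
        self._by_subject[subject][wiki].add(source)

    def all_sources(self, subject: str) -> Set[str]:
        """Every source any Wikipedia has given for this subject."""
        wikis = self._by_subject.get(subject, {})
        return set().union(*wikis.values())

    def only_cited_by(self, subject: str, wiki: str) -> Set[str]:
        """Sources one Wikipedia relies on that no other Wikipedia cites."""
        wikis = self._by_subject.get(subject, {})
        others: Set[str] = set()
        for name, sources in wikis.items():
            if name != wiki:
                others |= sources
        return set(wikis.get(wiki, set())) - others
```

The second query is the interesting one for cultural bias: it shows at a glance where one language community leans on literature the others have never looked at.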

The thing that triggered this idea was that Oscar had bought a book about the Amazigh. On the Dutch Wikipedia there was a guy who quoted sources that none of us could read as they were in the Berber language. I was really pleased that Oscar took the effort to buy this because it shows very much his good faith, our good faith. By having all sources for the same subject in one place, we would show similar good faith, and it would be a really powerful tool to remove much cultural bias from the Wikipedias.

Thanks,
GerardM

Friday, January 06, 2006

Portal pages in WiktionaryZ

I had a great conversation with Dvortygirl today. We discussed many of the things that have to do with WiktionaryZ. We discussed locales and that the locales you use in computing do not match linguistic locales. Dvortygirl knew this resource where there used to be a dialect survey where Americans were asked some 50 questions to determine how the usage differs from place to place.

When people want to do this kind of research, it makes sense to have a place for it in WiktionaryZ as well. It dawned on me that I have not really considered many of the User Interface questions that will come up. Then again some things are obvious. People will not only want to select the language that they see. There will be a need for a portal page for each language. This could be the kernel for a portal page for the English language (from the Dutch Wiktionary).

Starter portals like this can be expanded in many ways. Links to many internal (to the Wikimedia Foundation) and external resources can be added to give users a rich experience. The Main page of WiktionaryZ would therefore be similar to the http://wiktionary.org website, particularly in pointing people in the right direction.

Thanks,
GerardM

Thursday, January 05, 2006

Angela's report

I read a report by Angela about her visit to South Africa. In it Wiktionary was mentioned. Apparently there is a need for a translation glossary for legal terms in South Africa. South Africa has 11 official languages and the quality of translations is often not as good as it should be.

The fun will start when English, South African, Australian, New Zealand, Canadian and US-American legal terms end up together. The only way this can work is by having distinct terms, each clearly associated with the legal glossary for a specific body of legislation.

The same report mentions a biblical Hebrew dictionary; it seems that the people working on such a resource want to have it in a collaborative environment.

Both are examples of the pent up demand that exists for a resource like WiktionaryZ.

Thanks,
GerardM

Wednesday, January 04, 2006

Organizing the Ultimate Wiktionary project

WiktionaryZ is going to be big. It is going to be seriously big. It is going to attract serious attention, and I know it will because it already does. At first it was primarily a technological challenge: can we combine the data that is in the different language Wiktionaries? The open nature of these projects made them a combination of lexicological, terminological and thesaurus data.

We ended up with a design that will allow for a lot of refinement. However, many people who have looked at it think it has great potential. Ultimate Wiktionary, the project, is our dream. As we worked on it for the last year and a half, we understood more and more the pent-up demand that exists for information that can be stored in our project.

Our dreams are big. We want to realise them. We are fortunate as we are associated with the best organisation to make this happen, the Wikimedia Foundation. It has a great reputation, it has a great community and what they do with Wikipedia is astounding.

From an organisational point of view, the WiktionaryZ project will be different from the Wikipedia project. Wikipedia is community driven; its community creates the data and finances the project. WiktionaryZ will be different because much of the data that will be included already exists. Many organisations struggle to maintain their resources. For WiktionaryZ the opportunity exists to focus all this energy in one place.

The development of WiktionaryZ was made possible by organisations supporting our effort. Kennisnet, a WMF partner, provided the initial investment. The Universität Bamberg was the second organisation to help. More work needs to be done and there are more organisations that are willing to collaborate technically and who are willing to share their resources to make WiktionaryZ happen.

WiktionaryZ is going to be big. My bet is that in the first year of full-featured production we will need two hundred thousand euro (not US$) in servers. There are people who have told me that my “guestimation” is on the low end. :)

People that work on content are and will be attributed in the normal way; it can and will be found in the history of the content. Organisations that prove to be partners of our project could be credited on the left hand side and may end up under the toolbox. A link will refer to a page about our partner.

Our project is as much about collaboration as any of the other Wikimedia projects. There is however no other project where organisations will play such an important role. This calls for a different way of organising their effort. I therefore propose to combine these organisations in a consortium that will be the focal point for the contributions of organisations.

The WiktionaryZ consortium will have two functions; managing the collaboration of organisations and finding the resources to make WiktionaryZ possible.

Thanks,
GerardM

Tuesday, January 03, 2006

Interproject and Interlanguage links in Wikimedia

On the Wikimedia projects many people, including myself, run "bots" to link articles to articles in other languages. The object is that information on the same subject can be found in another language. The number of edits generated is massive; compare for instance the number of edits of my bot with the number of my own edits on the English Wiktionary, where I am not very active but where I am an "admin". Maintaining interlanguage links on Wiktionary is simple.

The situation is more complicated on the Wikipedias. Because of disambiguation, articles are not necessarily named in an obvious way. With currently 212 languages it is impossible to maintain the links in a timely and reliable way.

There is also the problem that some people consider linking from Wikipedia to Wiktionary "spamming" and are so extreme as to ban people for this. The situation is a mess.

With the advent of more centralisation for users and the integration of Wikidata in MediaWiki, my solution would be technical. Each Wikipedia can link one article once to projects and languages. Other projects can link in the same way. The behaviour of these projects is not necessarily the same. WiktionaryZ could, because of its thesaurus structures, provide a universal browsing capability through the articles and the terminology of many languages. Commons could provide galleries of pictures that are associated with a word as its "category".

I would also have this mechanism integrated with our search engine. When for instance the word Chihuahua does not exist at all in a Wikipedia, it might exist in Wiktionary, which can then provide a disambiguation similar to the one you find in the English Wikipedia. When the Russian Wikipedia has an article on the dog, the reader can be pointed to that one instead, thanks to the relation in the thesaurus.
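The fallback I have in mind for the search engine can be sketched roughly as follows. This is Python only for the sake of illustration; the function search, the three lookup tables and all the sample titles and meaning identifiers are my own assumptions, not an existing MediaWiki or Wikidata interface.

```python
def search(word, lang, wikipedia, wiktionary, links):
    """Look up a word, falling back from Wikipedia to Wiktionary to other languages.

    wikipedia:  language code -> set of article titles in that Wikipedia
    wiktionary: word -> list of meaning identifiers (the disambiguation)
    links:      meaning identifier -> {language code: article title},
                i.e. the central table where each article is linked once
    """
    # 1. The local Wikipedia has the article: nothing else to do.
    if word in wikipedia.get(lang, set()):
        return [(lang, word)]

    # 2. Otherwise use the meanings known to Wiktionary as a disambiguation.
    suggestions = []
    for meaning in wiktionary.get(word, []):
        per_language = links.get(meaning, {})
        if lang in per_language:
            # The local Wikipedia covers this meaning under another title.
            suggestions.append((lang, per_language[lang]))
        else:
            # 3. Suggest the articles that other Wikipedias have on it.
            for other in sorted(per_language):
                suggestions.append((other, per_language[other]))
    return suggestions


# Invented sample data: a Wikipedia without any article called "Chihuahua".
wikipedia = {"nl": {"Hond"}}
wiktionary = {"Chihuahua": ["chihuahua-dog", "chihuahua-state"]}
links = {
    "chihuahua-dog": {"ru": "Chihuahua (dog)"},
    "chihuahua-state": {"en": "Chihuahua (state)"},
}

print(search("Chihuahua", "nl", wikipedia, wiktionary, links))
```

Because every article is linked once to a meaning, the search needs no bot-maintained links between the individual Wikipedias; it only follows the meaning to whatever languages happen to have an article.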

The benefits in short:
  • Better and more timely information.
  • Fewer edits by bots.
  • Improved user experience.
  • Better integration of the smaller Wikipedias.
Thanks,
GerardM

Monday, January 02, 2006

WiktionaryZ will not be a dictionary done digitally

WiktionaryZ will be able to host the content that can be found in the Wiktionaries. It is therefore extremely important that the data in those Wiktionaries conforms to certain specifications. These specifications are simple:
  • All words must belong to a language
  • All words must be spelled as they are used in that language
  • Only data that is more or less structured can be parsed and thereby become a candidate for inclusion in the database
When there are dictionary conventions for a particular language that do not conform to these specifications, we have a problem. I have a problem with the Latin Wiktionary. The word "os" for instance is not Os and is not ŏs. The word "avia" is not Avia nor ăvĭa. The reason given for writing them like that is to conform to the standards of some dead-wood dictionaries. When there is a need to give information in addition to the way a word normally appears in a text, we can provide that information alongside the normal appearance.
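For the Latin examples, a conversion could keep both forms: the word as it appears in a text and the dictionary form with its length marks. The sketch below uses Python's standard unicodedata module. The function names are hypothetical, and lower-casing every headword is my own simplifying assumption for these examples; it would be wrong for proper nouns, and stripping marks would be wrong for languages where the diacritics are part of the normal spelling.

```python
import unicodedata


def plain_spelling(headword):
    """Return the headword without combining marks and in lower case.

    Assumption: this is only suitable for dictionary-style Latin headwords
    such as the ones discussed above.
    """
    # NFD splits "ŏ" into "o" followed by a combining breve.
    decomposed = unicodedata.normalize("NFD", headword)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


def split_entry(headword):
    """Keep the normal appearance and the annotated form side by side."""
    return {"expression": plain_spelling(headword), "annotated": headword}


for headword in ["Os", "ŏs", "Avia", "ăvĭa"]:
    print(split_entry(headword))
```

This way nothing from the dead-wood convention is thrown away; it simply moves from the spelling of the word to additional information about the word.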

If this means that some data cannot be converted, I find this disturbing. I do not, however, see how it can be done in a different way.

Thanks,
GerardM

Sunday, January 01, 2006

Happy New Year

I wish everyone joy and happiness. I hope that we will find the cooperation that will make WiktionaryZ the success that we hope for.

The next important deliverable is a "write up" by Erik in which he explains what Wikidata is about, how it can be used and how certain essentials are delivered. As much as an explanation, it will also be the mental exercise needed to get more programmers involved in Wikidata. Combine this with the amount of inline documentation that PHP provides and it will open up many possibilities to many programmers, inside the Wikimedia Foundation and out.

Let me be really clear about one thing: Wikidata is powerful stuff in its own right. In one way it is really great that its first implementation is so ambitious; when you can model this, you can model almost everything. In another way it means that, as WiktionaryZ is dependent on Wikidata, it will move forward technically at the same speed. Given that the namespace manager will be in MediaWiki 1.6 and given that other infrastructure issues are being addressed, things could not look more promising than they do.

For the language guys reading this: we are seriously considering "stealing" a page out of the LMF book. Particularly for lexicological information we are going to have "Attributes" that are defined in a language-specific way and that are going to be defined in three places: the Expression, the SynTrans and the DefinedMeaning level. This means that we are about to ditch the LexicalItem. This will do two things for us: it will make the core of WiktionaryZ more efficient and it will allow us to be more language specific from the start. As we will have Attributes that are conditional on other Attributes, and as this will be reflected in the User Interface, I think this may reflect the core idea of LMF. Then again I do not really know, as I do not really understand enough of LMF yet.
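To show what I mean by Attributes that are language specific, attached to a level and conditional on other Attributes, here is a small sketch in Python. It is not the WiktionaryZ data model: the class AttributeDef, the function applicable and the Dutch sample attributes (including the choice to put them at the Expression level) are assumptions made up for this illustration.

```python
from dataclasses import dataclass

# The three levels named above.
LEVELS = ("Expression", "SynTrans", "DefinedMeaning")


@dataclass(frozen=True)
class AttributeDef:
    """A language-specific attribute definition (hypothetical)."""
    name: str
    language: str
    level: str
    values: tuple
    # (attribute name, value) pairs that must already hold for this
    # attribute to make sense, e.g. gender only applies to nouns.
    requires: tuple = ()

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")


def applicable(defs, language, level, current):
    """Names of the attributes a user interface could offer right now.

    current maps attribute names to the values already chosen.
    """
    return [
        d.name
        for d in defs
        if d.language == language
        and d.level == level
        and all(current.get(name) == value for name, value in d.requires)
    ]


# Sample definitions for Dutch; where they are attached is an assumption.
DUTCH = [
    AttributeDef("part of speech", "nl", "Expression", ("noun", "verb")),
    AttributeDef("gender", "nl", "Expression",
                 ("masculine", "feminine", "neuter"),
                 requires=(("part of speech", "noun"),)),
]

print(applicable(DUTCH, "nl", "Expression", {}))
print(applicable(DUTCH, "nl", "Expression", {"part of speech": "noun"}))
```

The second call offers "gender" only because "part of speech" has been set to "noun", which is the kind of conditional behaviour the User Interface would have to reflect.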

Thanks,
GerardM