Monday, January 02, 2006

WiktionaryZ will not be a dictionary done digitally

WiktionaryZ will be able to host the content that can be found in the Wiktionaries. It is therefore of extreme importance that the data in those Wiktionaries conforms to certain specifications. These specifications are simple:
  • All words must belong to a language
  • All words must be spelled as they are used in that language
  • Only data that is more or less structured can be parsed and thereby become a candidate for inclusion in the database
When there are dictionary conventions for a particular language that do not conform to these specifications, we have a problem. I have a problem with the Latin Wiktionary. The word "os", for instance, is not Os and is not ŏs. The word "avia" is not Avia nor ăvĭa. The reason given for writing them like that is to conform to the standards of some dead wood dictionaries. When there is a need to give information beyond the way a word normally appears in a text, we can provide that information in addition to the normal appearance.

If this means that some data cannot be converted, I find that disturbing. I do not, however, see how it can be done in a different way.

Thanks,
GerardM

3 comments:

Anonymous said...

Um, this is a basic text processing exercise, hardly even a hurdle, not a vast sundering dogmatic gulf.

* Ā, Ă → A
* ā, ă → a
* Ē, Ĕ → E
* ē, ĕ → e
* Ī, Ĭ, J → I
* ī, ĭ, j → i
* Ō, Ŏ → O
* ō, ŏ → o
* Ū, Ŭ → U
* ū, ŭ → u
* Ȳ, Y̆ → Y
* ȳ, y̆ → y
* Æ → Ae
* Œ → Oe
* æ → ae
* œ → oe

A human can do it easily; a bot could do it automatically on import without breaking a sweat. I am certain it is not the only massaging of text you will have to do.
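The mapping listed above could be sketched in Python roughly as follows. This is only an illustration, not anything an actual import bot uses: the function name `strip_length_marks` and the two constants are hypothetical, and it assumes the headwords arrive as Unicode strings. It decomposes each character, drops the combining macron and breve, and then replaces the ligatures and J/j.

```python
import unicodedata

# Combining macron (U+0304) and combining breve (U+0306), the two
# vowel-length marks covered by the mapping above.
LENGTH_MARKS = {"\u0304", "\u0306"}

# Ligatures and J/j, following the same mapping (hypothetical constant name).
REPLACEMENTS = str.maketrans({
    "Æ": "Ae", "Œ": "Oe", "æ": "ae", "œ": "oe",
    "J": "I", "j": "i",
})


def strip_length_marks(word: str) -> str:
    """Return the word as it would normally appear in running text."""
    # NFD splits e.g. "ŏ" into "o" plus a combining breve.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop only the length marks; other combining marks are kept.
    stripped = "".join(ch for ch in decomposed if ch not in LENGTH_MARKS)
    # Recompose whatever is left, then apply the ligature and J/j mapping.
    return unicodedata.normalize("NFC", stripped).translate(REPLACEMENTS)


print(strip_length_marks("ŏs"))
print(strip_length_marks("ăvĭa"))
```

The sketch leaves capitalisation alone, so a headword stored as "Os" would still need separate handling, and the marked form could be kept alongside the plain one rather than thrown away.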

You are acting as if the information you want does not exist simply because it has other information layered on top of it. This idea you have, that the underlying information is somehow irretrievable, is ridiculous.

GerardM said...

The Latin Wiktionary is in many ways one of the best. I would therefore like to convert it as one of the first. This is, however, dependent on the extra work that is needed to convert the data. More work means a later conversion. Changing things now saves us from having to work on it when things become more hectic.

Thanks,
GerardM

Anonymous said...

Somewhat Off Topic:

How will links from Wikipedia to Wiktionary be handled with the new database? Will links be created automatically? And what will the links look like?

I'm wondering about this because there have been some problems with that today. The German Wikipedia today blocked the bot which created links to the German Wiktionary and accused the Wiktionarians of "spamming" and providing false and incomplete information in their articles. This is all quite sad because the user who maintained the bot has now left Wikimedia altogether.