This page briefly documents the Tanaka Corpus of parallel Japanese-English sentences, and in particular the modifications and editing carried out to enable use of the corpus as a source of examples in the WWWJDIC dictionary server and other systems.
The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his Pacling2001 paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)
Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.
From inspection, it appears that many of the sentence pairs have been derived from textbooks, e.g. books used by Japanese students of English. Some are lines of songs, others are from popular books and Biblical passages.
The original collection contained large numbers of errors, in both the Japanese and the English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese or English contained grammatical, syntactic or other errors, or the translations did not match at all.
The original file can still be downloaded (see below).
As described below, the Tanaka Corpus has been edited and adapted to be used within the WWWJDIC dictionary server as a set of example sentences associated with words in the dictionary. In order to adapt the corpus for this role, it has been edited as follows:
The process described above is ongoing, and at present the edited corpus has just over 160,000 sentence pairs.
In addition, a small number of sentence pairs have been added to provide examples of the usage of Japanese words and phrases not present in the original corpus.
(The incorporation of the Tanaka Corpus in the WWWJDIC server is described in a paper presented to the 2003 Papillon workshop.)
In order to facilitate the linking of sentences in the Corpus to words in the online dictionary, a list of Japanese words and phrases was extracted from each sentence. This was carried out using the Chasen morphological analysis program. Compound words which had dictionary entries were recombined as necessary. At present about 27,000 unique Japanese words and phrases are indexed.
The list of words associated with each sentence is used by the WWWJDIC server to select examples of the usage of the words. In addition, users of the WWWJDIC server can search the Corpus using text strings in Japanese and/or English, and using regular expressions. Users can also submit corrections to sentences via a WWW feedback form. Several thousand corrections have been submitted this way.
More information on the WWWJDIC use of the corpus is in the documentation.
To see an example of the sentence linking, here is the sentence display for 大学生. There is also a function for browsing the sentences.
The file is in text format, with the Japanese in either EUC-JP or UTF8 encoding. If you wish to have it in any other format or coding, you will have to convert it yourself.
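For readers who need another encoding, the conversion can be done in a few lines. The following is a minimal sketch, assuming the EUC-JP version of the file and Shift_JIS as the target; the function name `convert_encoding` and the choice of target encoding are illustrative only and are not part of the corpus distribution.

```python
def convert_encoding(data: bytes, source: str = "euc_jp", target: str = "shift_jis") -> bytes:
    """Re-encode raw file bytes from one Japanese encoding to another.

    The default source and target are assumptions made for this example.
    Characters that the target encoding cannot represent raise
    UnicodeEncodeError instead of being dropped silently.
    """
    return data.decode(source).encode(target)


# Round-trip check on the Japanese half of the example pair from this page.
sample_text = "その家はかなりぼろ屋になっている。"
euc_bytes = sample_text.encode("euc_jp")
sjis_bytes = convert_encoding(euc_bytes)
print(sjis_bytes.decode("shift_jis") == sample_text)
```

To convert a whole file, read it in binary mode, pass the bytes through the function, and write the result back out in binary mode.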
The format is as follows:
The following example pair illustrates the format:
A: その家はかなりぼろ屋になっている。[TAB]The house is quite run down.#ID=25507
B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}
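The example pair above can be read programmatically. The following is a minimal sketch, based only on what is visible in that example: the A line holds the Japanese, a tab, the English and a trailing `#ID=` number, and the B line holds space-separated index words. The interpretation of the decorations is an assumption made here: `(...)` is taken to be a reading, `[n]` a sense number, `{...}` the form as it appears in the sentence, and `~` a marker whose meaning is not described above. The names `IndexWord`, `parse_token` and `parse_pair` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class IndexWord(NamedTuple):
    headword: str            # dictionary form, e.g. 其の
    reading: Optional[str]   # text inside (...), assumed to be a reading
    sense: Optional[int]     # number inside [...], assumed to be a sense number
    surface: Optional[str]   # text inside {...}, the form used in the sentence
    marked: bool             # True if the token ends with "~"


# One index word: headword, then optional (reading), [sense], {surface}, ~.
# The order of the optional parts follows the example pair only.
TOKEN_RE = re.compile(
    r"(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<mark>~)?"
)


def parse_token(token: str) -> IndexWord:
    """Split one B-line token into its parts."""
    m = TOKEN_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"unrecognised index word: {token!r}")
    sense = int(m.group("sense")) if m.group("sense") else None
    return IndexWord(m.group("headword"), m.group("reading"), sense,
                     m.group("surface"), m.group("mark") is not None)


def parse_pair(a_line: str, b_line: str) -> Tuple[str, str, Optional[str], List[IndexWord]]:
    """Return (japanese, english, sentence_id, index_words) for one A/B pair."""
    if not a_line.startswith("A: ") or not b_line.startswith("B: "):
        raise ValueError("expected an 'A: ' line followed by a 'B: ' line")
    japanese, _, rest = a_line[3:].rstrip("\n").partition("\t")
    english, sep, sentence_id = rest.rpartition("#ID=")
    if not sep:  # no ID marker present: keep the whole remainder as English
        english, sentence_id = rest, None
    words = [parse_token(t) for t in b_line[3:].split()]
    return japanese, english, sentence_id, words


# The example pair from this page, with [TAB] written as a real tab character.
a = "A: その家はかなりぼろ屋になっている。\tThe house is quite run down.#ID=25507"
b = "B: 其の{その} 家(いえ)[1] は 可也{かなり} ぼろ屋[1]~ になる[1]{になっている}"
japanese, english, sentence_id, words = parse_pair(a, b)
print(english, sentence_id)
for word in words:
    print(word)
```

The sentence ID is kept as a string rather than converted to a number, since nothing above guarantees that every ID is purely numeric.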
An automatically generated subset of the edited corpus is also available. The generation selects sentences at random, while ensuring that all the indexed words continue to be represented. The subset is about 30% of the size of the full file.
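One way to select sentences at random while keeping every indexed word represented is sketched below. This is an illustration of the idea only, not the procedure actually used to build the subset, and it makes no attempt to hit a particular size. The function name `select_subset` and the input layout (a mapping from sentence ID to the set of words indexed for that sentence) are assumptions made for the example.

```python
import random
from typing import Dict, Hashable, List, Optional, Set


def select_subset(index: Dict[Hashable, Set[str]], seed: Optional[int] = None) -> List[Hashable]:
    """Pick sentences in random order, keeping those that add an uncovered word.

    `index` maps a sentence ID to the set of words indexed for that sentence.
    Every word that occurs in `index` is guaranteed to occur in at least one
    of the chosen sentences.
    """
    rng = random.Random(seed)
    ids = list(index)
    rng.shuffle(ids)            # random visiting order
    covered: Set[str] = set()
    chosen: List[Hashable] = []
    for sentence_id in ids:
        new_words = index[sentence_id] - covered
        if new_words:           # keep only sentences that contribute a new word
            chosen.append(sentence_id)
            covered |= new_words
    return chosen


# Toy index with made-up sentence IDs and words.
toy_index = {
    1: {"家", "ぼろ屋"},
    2: {"家"},
    3: {"大学生", "家"},
    4: {"大学生"},
}
subset = select_subset(toy_index, seed=0)
print(sorted(subset))
```

Because a sentence is kept only when it contributes a word not yet seen, the result depends on the shuffle order; different seeds give different subsets that all cover the same set of words.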
The Corpus is a useful and interesting collection of matched Japanese and English sentence pairs; however, it cannot be regarded as containing natural or representative examples of text in either language. This is because of the way it was originally compiled and the artificial nature of the sources. It also still contains a large number of errors and repetitions. It certainly should not be used for any statistical analyses of the text. While the Corpus appears to be adequate and useful as a source of examples of word usage, the user is advised to be cautious and critical. The following points should be considered:
While the corpus may be freely downloaded and used in servers, etc., two special requests are made:
The file is in the Public Domain. Professor Tanaka made the original file available on this basis, and although many hours have been spent editing it and adding the indices, I don't think its status should be other than freely available. However, if you are using the file in a system, it would be polite to mention where you got it, and provide a link back to this page.
The original file is available from here (in UTF8 coding) or here (in EUC-JP coding). (Please do not use these versions in projects. See the warning above.)
The edited version used in the WWWJDIC server can be downloaded from:
A number of projects in addition to WWWJDIC use the Tanaka corpus as a source of example sentences for Japanese words.
The Tatoeba Project is expanding the corpus to include translations of the sentences in other European languages.
Many people have played a part in editing the examples file, and extending and correcting the indices. I particularly wish to acknowledge the contribution of Paul Blay, whose work significantly improved the indices, including the creation of all the {} extensions. Paul is currently maintaining the file.
Jim Breen
May 2008