rsync server

Download the files and convert the ones encoded as EUC-JP to UTF-8:

for f in JMdict JMdict_e edict edict2 edict2u edict_sub kanjd212 kanjd213u kanjidic kanjidic2.xml kanjidic_comb_utf8 JMnedict.xml enamdict examples.utf kradfile-u.gz; do
  rsync "ftp.monash.edu.au::nihongo/$f" "$f"
done
# edict2 is included in the list above so that the conversion below finds it
for f in edict edict2 edict_sub enamdict kanjd212; do
  iconv -f euc-jp -t utf-8 "$f" > "$f.utf"
done

List all files:

rsync ftp.monash.edu.au::nihongo


The old KANJIDIC format looks like this:

哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい {pathetic} {grief} {sorrow} {sympathize}
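
As a minimal sketch of how the prefixed fields in such a line can be read, the function below prints the kanji and the value of the first field that consists of a given prefix followed by digits, such as G for the grade. The name kanjidic_field is made up for this example, and the sample line is an abbreviated copy of the line above.

```shell
# kanjidic_field PREFIX: read KANJIDIC lines on stdin and print the kanji and
# the value of the first field that is PREFIX followed only by digits.
# (Hypothetical helper; it assumes that fields are separated by single spaces.)
kanjidic_field() {
  awk -v p="$1" '{
    for (i = 3; i <= NF; i++) {            # field 1 is the kanji, field 2 the JIS code
      rest = substr($i, length(p) + 1)
      if (index($i, p) == 1 && rest ~ /^[0-9]+$/) {
        print $1, rest
        next
      }
    }
  }'
}

# Abbreviated sample line; prints "哀 8"
printf '%s\n' '哀 3025 U54c0 B8 C30 G8 S9 F1715 アイ {grief}' | kanjidic_field G
```

With the real file, something like `kanjidic_field G < kanjidic` would list the grade of every kanji that has one, provided the file has been converted to the encoding of the terminal.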

The grade numbers are based on the 2010 jōyō kanji list and the 2004 jinmeiyō kanji list.

All 2136 jōyō kanji have been assigned a grade:

$ grep ' G[1-8] ' kanjidic|wc -l

There are two special tags used in the reading field: T1 for nanori (such as 藤: トウ ドウ ふじ T1 ぞう と ふじゅ) and T2 for radical names (such as 气: ケ いき T2 きがまえ).
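
The effect of the T1 and T2 tags can be sketched with a small awk wrapper that labels each reading according to the tag that precedes it. The function name kanjidic_readings and the labels are invented for this example, and the sample line is shortened so that it only contains the kanji and the readings quoted above.

```shell
# kanjidic_readings: read KANJIDIC lines on stdin and print every reading with
# a label: "reading", "nanori" (after T1), or "radical" (after T2).
# (Hypothetical helper; it assumes that code fields start with an ASCII letter
# or digit and that the meanings start at the first field beginning with "{".)
kanjidic_readings() {
  awk '{
    kind = "reading"
    for (i = 2; i <= NF; i++) {
      if ($i == "T1") { kind = "nanori"; continue }
      if ($i == "T2") { kind = "radical"; continue }
      if (index($i, "{") == 1) break        # the meanings start here
      if ($i ~ /^[A-Za-z0-9]/) continue     # skip code fields such as U54c0 or G8
      print $1, kind, $i
    }
  }'
}

# Shortened sample line without the code fields
printf '%s\n' '气 ケ いき T2 きがまえ' | kanjidic_readings
```

For the sample line this prints ケ and いき as ordinary readings and きがまえ as a radical name.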


KANJIDIC2 is the XML version of KANJIDIC. In addition to the information included in KANJIDIC, KANJIDIC2 includes French, Portuguese, and Spanish meanings, hangul versions of Korean readings, and accepted miscounts of stroke counts.

$ grep ^哀 kanjidic.txt
哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい {pathetic} {grief} {sorrow} {pathos} {pity} {sympathize}
$ xml sel -t -c '/kanjidic2/character[literal="哀"]' kanjidic2.xml|xml fo
<cp_value cp_type="ucs">54c0</cp_value>
<cp_value cp_type="jis208">16-5</cp_value>
<rad_value rad_type="classical">30</rad_value>
<rad_value rad_type="nelson_c">8</rad_value>
<dic_ref dr_type="nelson_c">304</dic_ref>
<dic_ref dr_type="nelson_n">791</dic_ref>
<dic_ref dr_type="halpern_njecd">2068</dic_ref>
<dic_ref dr_type="halpern_kkld">1310</dic_ref>
<dic_ref dr_type="heisig">401</dic_ref>
<dic_ref dr_type="gakken">1670</dic_ref>
<dic_ref dr_type="oneill_kk">1249</dic_ref>
<dic_ref dr_type="moro" m_vol="2" m_page="0997">3580</dic_ref>
<dic_ref dr_type="henshall">998</dic_ref>
<dic_ref dr_type="sh_kk">1675</dic_ref>
<dic_ref dr_type="jf_cards">1131</dic_ref>
<dic_ref dr_type="tutt_cards">1239</dic_ref>
<dic_ref dr_type="kanji_in_context">1804</dic_ref>
<dic_ref dr_type="kodansha_compact">327</dic_ref>
<dic_ref dr_type="maniette">408</dic_ref>
<q_code qc_type="skip">2-2-7</q_code>
<q_code qc_type="sh_desc">2j7.4</q_code>
<q_code qc_type="four_corner">0073.2</q_code>
<q_code qc_type="deroo">465</q_code>
<reading r_type="pinyin">ai1</reading>
<reading r_type="korean_r">ae</reading>
<reading r_type="korean_h">애</reading>
<reading r_type="ja_on">アイ</reading>
<reading r_type="ja_kun">あわ.れ</reading>
<reading r_type="ja_kun">あわ.れむ</reading>
<reading r_type="ja_kun">かな.しい</reading>
<meaning m_lang="fr">pitoyable</meaning>
<meaning m_lang="fr">peine</meaning>
<meaning m_lang="fr">chagrin</meaning>
<meaning m_lang="fr">pitié</meaning>
<meaning m_lang="fr">pathétique</meaning>
<meaning m_lang="fr">compatir</meaning>
<meaning m_lang="es">compasión</meaning>
<meaning m_lang="es">lástima</meaning>
<meaning m_lang="es">miseria</meaning>
<meaning m_lang="es">piedad</meaning>
<meaning m_lang="es">pena</meaning>
<meaning m_lang="es">compadecerse de</meaning>
<meaning m_lang="pt">patético</meaning>
<meaning m_lang="pt">pesar</meaning>
<meaning m_lang="pt">pena</meaning>
<meaning m_lang="pt">emoção</meaning>
<meaning m_lang="pt">compaixão</meaning>
<meaning m_lang="pt">solidariesar</meaning>


The edict file has only one headword and one reading on each line, but the edict2 file has multiple headwords or readings on some lines. edict_sub is a subset of the edict file that only includes the prioritized entries marked with (P).

$ grep shin/shank edict
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
脛 [はぎ] /(ok) (n) (uk) shin/shank/lower leg/
臑 [すね] /(n) (uk) shin/shank/lower leg/
$ grep shin/shank edict2
脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
$ grep shin/shank edict_sub
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
$ wc -l edict{,2,_sub}
  227336 edict
  170430 edict2
   22636 edict_sub
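
The (P) marker is the last slash-delimited field of a line in the edict file, so the prioritized lines can be picked out with a plain grep. This is only a sketch: the function name edict_priority is made up, and the result is not guaranteed to be identical to edict_sub.

```shell
# edict_priority: print only the EDICT lines that end with the /(P)/ marker.
# (Hypothetical helper; in a basic regular expression the parentheses are literal.)
edict_priority() {
  grep '/(P)/$'
}

# Sample lines copied from the grep output above; only the first one is printed
printf '%s\n' \
  '脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/' \
  '脛 [はぎ] /(ok) (n) (uk) shin/shank/lower leg/' \
  '臑 [すね] /(n) (uk) shin/shank/lower leg/' |
  edict_priority
```

The pattern does not work for edict2, where the (P) field is followed by the entry ID.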

Here are examples of entries in the edict2 file:

脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
成り;為り [なり] /(n) (See 成る・7) being promoted (shogi)/EntL2611370/
生足;なま足;生脚 [なまあし] /(n) (sl) (See 生・なま・2) bare legs/bare feet/stockingless legs/EntL2113910/
引用句 [いんようく] /(n) {ling} quotation/EntL1169670X/
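
The layout of these lines can be taken apart roughly as follows. The function name edict2_parse and the output format are invented for this sketch, and it assumes that headwords and readings do not contain spaces.

```shell
# edict2_parse: read EDICT2 lines on stdin and print one line for each headword
# and each reading, prefixed with the numeric entry ID.
# (Hypothetical helper; assumes that headwords and readings contain no spaces.)
edict2_parse() {
  awk '{
    # The entry ID is the last slash-delimited field, for example EntL1570850X.
    n = split($0, parts, "/")
    id = parts[n - 1]
    sub(/^EntL/, "", id)
    sub(/X$/, "", id)                 # drop the trailing X if there is one

    # The headwords are in the first space-delimited field.
    m = split($1, heads, ";")
    for (i = 1; i <= m; i++) {
      sub(/\(.*/, "", heads[i])       # drop tags such as (P)
      print id, "headword", heads[i]
    }

    # The readings are in the second field when it is enclosed in brackets.
    if ($2 ~ /^\[.*\]$/) {
      r = $2
      sub(/^\[/, "", r)
      sub(/\]$/, "", r)
      m = split(r, reads, ";")
      for (i = 1; i <= m; i++) {
        sub(/\(.*/, "", reads[i])     # drop tags such as (P) or (ok)
        print id, "reading", reads[i]
      }
    }
  }'
}

# Sample line copied from the examples above
printf '%s\n' '脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/' |
  edict2_parse
```

For the sample line this prints the headwords 脛 and 臑 and the readings すね and はぎ, each prefixed with 1570850.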

Finding the first reading of the entry for a headword with the lowest ID in JMdict or EDICT2

In EDICT2 and JMdict, the readings of an entry are typically listed with the more common readings first, but the lines in EDICT are not sorted by how common the reading is:

$ grep '^舅 ' edict2
舅 [しゅうと(P);しうと;しいと(ok)] /(n) (See 姑) father-in-law/(P)/EntL1571280X/
$ grep '^舅 ' edict
舅 [しいと] /(ok) (n) father-in-law/
舅 [しうと] /(n) father-in-law/
舅 [しゅうと] /(n) father-in-law/(P)/

EDICT2 and JMdict have fewer headwords with two or more entries than EDICT, but they still have some:

$ grep 雨雪 edict2
雨雪 [あめゆき;あまゆき] /(n) (col) sleet (mixture of snow and rain)/EntL2768340/
雨雪 [うせつ] /(n) (1) snow and rain/(2) (arch) snowfall/EntL2768330/
$ grep 然然 edict2
然然;然々 [ささ] /(adv) (arch) such and such/EntL2173610/
然然;然々 [しかじか] /(adv,n,adj-no) such and such/EntL1831970X/

In versions of JMdict and EDICT2 from 2013, there were 817 headwords with 2 entries, 59 headwords with 3 entries, 13 headwords with 4 entries, 1 headword with 5 entries, and 2 headwords with 6 entries.
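
Counts like these can be produced from edict2 along the following lines. This is a sketch and not necessarily the method used for the figures above; the function name multi_entry_histogram is made up, and a headword is taken to be any semicolon-separated item in the first field of a line.

```shell
# multi_entry_histogram: read EDICT2 lines on stdin and print, for each number
# of entries n >= 2, how many headwords occur in exactly n entries.
# (Hypothetical helper; the output lines have the form "n number_of_headwords".)
multi_entry_histogram() {
  awk '{
    m = split($1, heads, ";")
    for (i = 1; i <= m; i++) {
      h = heads[i]
      sub(/\(.*/, "", h)              # drop tags such as (P)
      count[h]++
    }
  }
  END {
    for (h in count) if (count[h] > 1) hist[count[h]]++
    for (n in hist) print n, hist[n]
  }' | sort -n
}

# Sample lines copied from the grep output above; prints "2 3" because
# 雨雪, 然然, and 然々 each occur in two entries
printf '%s\n' \
  '雨雪 [あめゆき;あまゆき] /(n) (col) sleet (mixture of snow and rain)/EntL2768340/' \
  '雨雪 [うせつ] /(n) (1) snow and rain/(2) (arch) snowfall/EntL2768330/' \
  '然然;然々 [ささ] /(adv) (arch) such and such/EntL2173610/' \
  '然然;然々 [しかじか] /(adv,n,adj-no) such and such/EntL1831970X/' |
  multi_entry_histogram
```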

Here are different ways to find the first reading of the entry for a headword with the lowest ID:

$ xml sel -t -v '/JMdict/entry[k_ele/keb="雨雪"][1]/r_ele[1]/reb' JMdict
$ ruby -rnokogiri -e'puts Nokogiri.XML(IO.read("JMdict_e")).xpath("/JMdict/entry[k_ele/keb=\"雨雪\"][1]/r_ele[1]/reb").text'
$ awk -v v=雨雪 '{split($1,headwords,";");for(i in headwords){headword=headwords[i];sub(/\(.*/,"",headword);if(headword==v){split($2,readings,/[ -~]/);print readings[2];exit}}}' <(awk -F/ '{print$(NF-1),$0}' edict2|LC_ALL=C sort -nk1,1|cut -d\  -f2-)

The entries in JMdict are sorted by the IDs but the lines in EDICT2 are not. An entry with a lower ID is more likely to be for a more common sense or reading of a headword.


JMdict is the XML version of EDICT/EDICT2. In addition to the information included in EDICT2, JMdict includes glosses in languages other than English and the dates when the entry was changed. JMdict_e only includes English glosses.

$ grep '^名詞 ' edict
名詞 [なことば] /(ok) (n) (ling) noun/
名詞 [めいし] /(n) (ling) noun/(P)/
$ grep 1531570 edict2
名詞 [めいし(P);なことば(ok)] /(n) {ling} noun/(P)/EntL1531570X/
$ xml sel -t -c '/JMdict/entry[ent_seq=1531570]' JMdict_e
<re_inf>out-dated or obsolete kana usage</re_inf>
<upd_detl>Entry created</upd_detl>
<upd_detl>Entry amended</upd_detl>
<pos>noun (common) (futsuumeishi)</pos>
<field>linguistics terminology</field>

WWWJDIC audio files

WWWJDIC has audio samples for over 100,000 of the words (or readings of words) in JMdict/EDICT. I used this script to download the audio files:

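# $php is not defined in these notes; it has to hold the URL of the audio script
mkdir -p /tmp/pod
# Lines without a bracketed reading (kana-only headwords)
sed -n 's,^\([^[]*\) /.*,\1,p' edict|while read -r x;do curl "$php?kana=$x" -o "/tmp/pod/$x.mp3";done
# Lines with a headword and a bracketed reading
sed -n 's,^\(.*\) \[\(.*\)\] .*,\1 \2,p' edict|while read -r x y;do curl "$php?kanji=$x&kana=$y" -o "/tmp/pod/$x $y.mp3";done
# Delete the placeholder files that were served for words that were not found
find /tmp/pod \( -size 52288c -o -size 53303c \) -delete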

An audio file with a size of 52288 or 53303 bytes is served for a word that is not found.

When I ran the script, it took about 10 hours, and it downloaded audio files for 127,677 out of 213,800 words (or pairs of a word and a reading) in the edict file. I uploaded the files here: http://jptxt.net/wwwjdic-audio.tar (the file is hidden from the directory listing because it's almost 2 GB).