EDICT and KANJIDIC

Files

JMdict/EDICT:

KANJIDIC2/KANJIDIC:

Other:

rsync server

This downloads the files and converts the ones encoded as EUC-JP to UTF-8:

for f in JMdict JMdict_e edict edict2 edict2u edict_sub kanjd212 kanjd213u kanjidic kanjidic2.xml kanjidic_comb_utf8 JMnedict.xml enamdict examples.utf kradfile-u.gz;do rsync ftp.monash.edu.au::nihongo/$f $f;done
for f in edict edict2 edict_sub enamdict kanjd212;do iconv -f euc-jp -t utf-8 $f>$f.utf;done

This lists all files:

rsync ftp.monash.edu.au::nihongo
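
The EUC-JP to UTF-8 conversion step above can also be sketched in Python. This is only an illustration: the function name is my own, and I am assuming that Python's euc_jp codec behaves like iconv's euc-jp for these files.

```python
from pathlib import Path


def convert_euc_jp_to_utf8(src, dst):
    """Re-encode a EUC-JP text file as UTF-8, roughly like `iconv -f euc-jp -t utf-8`."""
    text = Path(src).read_bytes().decode("euc_jp")
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Same file list as the iconv loop; files that are not present are skipped.
    for name in ["edict", "edict2", "edict_sub", "enamdict", "kanjd212"]:
        if Path(name).exists():
            convert_euc_jp_to_utf8(name, name + ".utf")
```

Decoding raises UnicodeDecodeError on bytes that are not valid EUC-JP, so a failure here usually means the input was already UTF-8.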

KANJIDIC

The old KANJIDIC format looks like this:

哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい {pathetic} {grief} {sorrow} {sympathize}

The grade numbers are based on the 2010 jōyō kanji list and the 2004 jinmeiyō kanji list.

All 2136 jōyō kanji have been assigned a grade:

$ grep ' G[1-8] ' kanjidic|wc -l
    2136

There are two special tags used in the reading field: T1 for nanori (such as 藤: トウ ドウ ふじ T1 ぞう と ふじゅ) and T2 for radical names (such as 气: ケ いき T2 きがまえ).
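
A minimal sketch of a parser for one line of the old format, assuming that every info code starts with an uppercase ASCII letter and that everything after T1 or T2 belongs to that tag until the next tag. The function name and the dictionary keys are my own, not part of KANJIDIC.

```python
import re


def parse_kanjidic_line(line):
    """Split one line of the old KANJIDIC format into a dict."""
    # Meanings sit in braces and may contain spaces, so extract them first.
    meanings = re.findall(r"\{([^}]*)\}", line)
    fields = re.sub(r"\{[^}]*\}", " ", line).split()
    entry = {
        "literal": fields[0],
        "jis": fields[1],
        "codes": {},          # e.g. {"G": ["8"], "S": ["9"]}
        "readings": [],
        "nanori": [],         # readings after T1
        "radical_names": [],  # readings after T2
        "meanings": meanings,
    }
    target = "readings"
    for field in fields[2:]:
        code = re.match(r"([A-Z]+)(.*)", field)
        if field == "T1":
            target = "nanori"
        elif field == "T2":
            target = "radical_names"
        elif code:
            # Info code: leading uppercase letters are the key, the rest is the value.
            entry["codes"].setdefault(code.group(1), []).append(code.group(2))
        else:
            entry[target].append(field)
    return entry


line = ("哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 "
        "DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 "
        "P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい "
        "{pathetic} {grief} {sorrow} {sympathize}")
entry = parse_kanjidic_line(line)
print(entry["codes"]["G"], entry["codes"]["S"])  # ['8'] ['9']
print(entry["readings"])
print(entry["meanings"])
```

Codes are stored as lists because some of them can occur more than once on a line.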

KANJIDIC2

KANJIDIC2 is the XML version of KANJIDIC. In addition to the information included in KANJIDIC, KANJIDIC2 includes French, Portuguese, and Spanish meanings, hangul versions of the Korean readings, and common miscounts of the stroke count.

$ grep ^哀 kanjidic.txt
哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい {pathetic} {grief} {sorrow} {pathos} {pity} {sympathize}
$ xml sel -t -c '/kanjidic2/character[literal="哀"]' kanjidic2.xml
<character>
<literal>哀</literal>
<codepoint>
<cp_value cp_type="ucs">54c0</cp_value>
<cp_value cp_type="jis208">16-5</cp_value>
</codepoint>
<radical>
<rad_value rad_type="classical">30</rad_value>
<rad_value rad_type="nelson_c">8</rad_value>
</radical>
<misc>
<grade>8</grade>
<stroke_count>9</stroke_count>
<freq>1715</freq>
<jlpt>1</jlpt>
</misc>
<dic_number>
<dic_ref dr_type="nelson_c">304</dic_ref>
<dic_ref dr_type="nelson_n">791</dic_ref>
<dic_ref dr_type="halpern_njecd">2068</dic_ref>
<dic_ref dr_type="halpern_kkld">1310</dic_ref>
<dic_ref dr_type="heisig">401</dic_ref>
<dic_ref dr_type="gakken">1670</dic_ref>
<dic_ref dr_type="oneill_kk">1249</dic_ref>
<dic_ref dr_type="moro" m_vol="2" m_page="0997">3580</dic_ref>
<dic_ref dr_type="henshall">998</dic_ref>
<dic_ref dr_type="sh_kk">1675</dic_ref>
<dic_ref dr_type="jf_cards">1131</dic_ref>
<dic_ref dr_type="tutt_cards">1239</dic_ref>
<dic_ref dr_type="kanji_in_context">1804</dic_ref>
<dic_ref dr_type="kodansha_compact">327</dic_ref>
<dic_ref dr_type="maniette">408</dic_ref>
</dic_number>
<query_code>
<q_code qc_type="skip">2-2-7</q_code>
<q_code qc_type="sh_desc">2j7.4</q_code>
<q_code qc_type="four_corner">0073.2</q_code>
<q_code qc_type="deroo">465</q_code>
</query_code>
<reading_meaning>
<rmgroup>
<reading r_type="pinyin">ai1</reading>
<reading r_type="korean_r">ae</reading>
<reading r_type="korean_h">애</reading>
<reading r_type="ja_on">アイ</reading>
<reading r_type="ja_kun">あわ.れ</reading>
<reading r_type="ja_kun">あわ.れむ</reading>
<reading r_type="ja_kun">かな.しい</reading>
<meaning>pathetic</meaning>
<meaning>grief</meaning>
<meaning>sorrow</meaning>
<meaning>pathos</meaning>
<meaning>pity</meaning>
<meaning>sympathize</meaning>
<meaning m_lang="fr">pitoyable</meaning>
<meaning m_lang="fr">peine</meaning>
<meaning m_lang="fr">chagrin</meaning>
<meaning m_lang="fr">pitié</meaning>
<meaning m_lang="fr">pathétique</meaning>
<meaning m_lang="fr">compatir</meaning>
<meaning m_lang="es">compasión</meaning>
<meaning m_lang="es">lástima</meaning>
<meaning m_lang="es">miseria</meaning>
<meaning m_lang="es">piedad</meaning>
<meaning m_lang="es">pena</meaning>
<meaning m_lang="es">compadecerse de</meaning>
<meaning m_lang="pt">patético</meaning>
<meaning m_lang="pt">pesar</meaning>
<meaning m_lang="pt">pena</meaning>
<meaning m_lang="pt">emoção</meaning>
<meaning m_lang="pt">compaixão</meaning>
<meaning m_lang="pt">solidariesar</meaning>
</rmgroup>
</reading_meaning>
</character>
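
The same element can be read with Python's xml.etree.ElementTree. This is a sketch that parses a shortened copy of the element above; the function name and the dictionary keys are my own, and I am assuming that a meaning without an m_lang attribute is English, as in the output above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <character> element shown above.
SAMPLE = """<character>
<literal>哀</literal>
<misc>
<grade>8</grade>
<stroke_count>9</stroke_count>
<freq>1715</freq>
</misc>
<reading_meaning>
<rmgroup>
<reading r_type="pinyin">ai1</reading>
<reading r_type="ja_on">アイ</reading>
<reading r_type="ja_kun">あわ.れ</reading>
<reading r_type="ja_kun">あわ.れむ</reading>
<reading r_type="ja_kun">かな.しい</reading>
<meaning>pathetic</meaning>
<meaning>grief</meaning>
<meaning m_lang="fr">pitoyable</meaning>
</rmgroup>
</reading_meaning>
</character>"""


def summarize_character(character):
    """Collect a few fields from a KANJIDIC2 <character> element."""
    readings = {}
    for reading in character.iter("reading"):
        readings.setdefault(reading.get("r_type"), []).append(reading.text)
    meanings = {}
    for meaning in character.iter("meaning"):
        # Assumption: no m_lang attribute means the meaning is English.
        meanings.setdefault(meaning.get("m_lang", "en"), []).append(meaning.text)
    return {
        "literal": character.findtext("literal"),
        "grade": character.findtext("misc/grade"),
        # findtext returns the first stroke_count element only.
        "stroke_count": character.findtext("misc/stroke_count"),
        "readings": readings,
        "meanings": meanings,
    }


info = summarize_character(ET.fromstring(SAMPLE))
print(info["literal"], info["grade"], info["stroke_count"])
print(info["readings"]["ja_kun"])
print(info["meanings"]["en"])
```

For the whole file, something like ET.parse("kanjidic2.xml").getroot().iter("character") should yield the same kind of element, though I have only tried the function on the snippet.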

EDICT and EDICT2

The edict file has one headword and one reading per line, but the edict2 file combines multiple headwords and readings on some lines. edict_sub is a subset of the edict file that contains only the prioritized entries marked with (P).

$ grep shin/shank edict
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
脛 [はぎ] /(ok) (n) (uk) shin/shank/lower leg/
臑 [すね] /(n) (uk) shin/shank/lower leg/
$ grep shin/shank edict2
脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
$ grep shin/shank edict_sub
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
$ wc -l edict edict2 edict_sub
  227336 edict
  170430 edict2
   22636 edict_sub

Here are examples of entries in EDICT2:

脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
成り;為り [なり] /(n) (See 成る・7) being promoted (shogi)/EntL2611370/
生足;なま足;生脚 [なまあし] /(n) (sl) (See 生・なま・2) bare legs/bare feet/stockingless legs/EntL2113910/
引用句 [いんようく] /(n) {ling} quotation/EntL1169670X/
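
A rough sketch of how lines like these could be split into fields in Python. The regular expressions are inferred from the examples above and not from a format specification, so lines in another shape make the function return None. The function names and dictionary keys are my own.

```python
import re

# headwords, optional [readings], then slash-separated glosses ending in the entry ID
LINE_RE = re.compile(r"^(?P<head>\S+)(?: \[(?P<read>[^\]]*)\])? /(?P<rest>.*)/$")


def strip_tags(item):
    """Drop trailing tags: 'すね(P)' -> 'すね', 'はぎ(脛)(ok)' -> 'はぎ'."""
    return re.sub(r"(?:\([^()]*\))+$", "", item)


def parse_edict2_line(line):
    """Parse one EDICT2 line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    glosses = match.group("rest").split("/")
    ent_seq = None
    id_match = re.fullmatch(r"EntL(\d+)X?", glosses[-1])
    if id_match:
        ent_seq = int(id_match.group(1))  # a trailing X is dropped
        glosses.pop()
    return {
        "headwords": [strip_tags(h) for h in match.group("head").split(";")],
        "readings": [strip_tags(r) for r in (match.group("read") or "").split(";") if r],
        "common": "(P)" in glosses,
        "glosses": [g for g in glosses if g != "(P)"],
        "ent_seq": ent_seq,
    }


entry = parse_edict2_line(
    "脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/")
print(entry["headwords"])  # ['脛', '臑']
print(entry["readings"])   # ['すね', 'はぎ']
print(entry["glosses"])    # ['(n) (uk) shin', 'shank', 'lower leg']
print(entry["ent_seq"])    # 1570850
```

The part-of-speech and usage markers such as (n) and (uk) are left inside the first gloss; separating them would need a list of the known markers.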

Finding the first reading of the lowest-ID entry for a headword in JMdict or EDICT2

EDICT2 and JMdict typically list more common readings first, but EDICT does not:

$ grep '^舅 ' edict2
舅 [しゅうと(P);しうと;しいと(ok)] /(n) (See 姑) father-in-law/(P)/EntL1571280X/
$ grep '^舅 ' edict
舅 [しいと] /(ok) (n) father-in-law/
舅 [しうと] /(n) father-in-law/
舅 [しゅうと] /(n) father-in-law/(P)/

EDICT2 and JMdict have fewer headwords that appear in two or more entries than EDICT does, but some remain:

$ grep 雨雪 edict2
雨雪 [あめゆき;あまゆき] /(n) (col) sleet (mixture of snow and rain)/EntL2768340/
雨雪 [うせつ] /(n) (1) snow and rain/(2) (arch) snowfall/EntL2768330/
$ grep 然然 edict2
然然;然々 [ささ] /(adv) (arch) such and such/EntL2173610/
然然;然々 [しかじか] /(adv,n,adj-no) such and such/EntL1831970X/

In versions of JMdict and EDICT2 from 2013, there were 817 headwords with 2 entries, 59 headwords with 3 entries, 13 headwords with 4 entries, 1 headword with 5 entries, and 2 headwords with 6 entries.

Here are different ways to find the first reading of the lowest-ID entry that contains a headword:

$ xml sel -t -v '/JMdict/entry[k_ele/keb="雨雪"][1]/r_ele[1]/reb' JMdict
うせつ
$ ruby -rnokogiri -e'puts Nokogiri.XML(IO.read("JMdict_e")).xpath("/JMdict/entry[k_ele/keb=\"雨雪\"][1]/r_ele[1]/reb").text'
うせつ
$ awk -v v=雨雪 '{split($1,headwords,";");for(i in headwords){headword=headwords[i];sub(/\(.*/,"",headword);if(headword==v){split($2,readings,/[ -~]/);print readings[2];exit}}}' <(awk -F/ '{print$(NF-1),$0}' edict2|LC_ALL=C sort -k1,1|cut -d\  -f2-)
うせつ

The entries in JMdict are sorted by ID, but the lines in EDICT2 are not. An entry with a lower ID is more likely to include a more common sense or reading of a headword.
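
The same lookup can be sketched in Python without sorting the file first. The function below is hypothetical and only handles EDICT2 lines that have a reading in brackets and an EntL ID at the end; it keeps the matching entry with the numerically lowest ID.

```python
import re

# headwords, [readings], glosses, and the trailing entry ID
ENTRY_RE = re.compile(r"^(\S+) \[([^\]]*)\] /.*/EntL(\d+)X?/$")


def first_reading(lines, headword):
    """Return the first reading of the lowest-ID entry that lists headword."""
    best = None  # (entry ID, first reading)
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # kana-only entry or a line in another shape
        # Drop tags such as (P) from each headword before comparing.
        headwords = [re.sub(r"\(.*", "", h) for h in match.group(1).split(";")]
        if headword not in headwords:
            continue
        ent_seq = int(match.group(3))
        if best is None or ent_seq < best[0]:
            reading = re.sub(r"\(.*", "", match.group(2).split(";")[0])
            best = (ent_seq, reading)
    return best[1] if best else None


lines = [
    "雨雪 [あめゆき;あまゆき] /(n) (col) sleet (mixture of snow and rain)/EntL2768340/",
    "雨雪 [うせつ] /(n) (1) snow and rain/(2) (arch) snowfall/EntL2768330/",
]
print(first_reading(lines, "雨雪"))  # うせつ
```

With a real file, the lines argument can be an open file object, for example open("edict2", encoding="utf-8"), assuming the file has already been converted to UTF-8.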

JMdict

JMdict is the XML version of EDICT and EDICT2. In addition to the information included in EDICT2, JMdict includes glosses in languages other than English and the dates when each entry was changed. JMdict_e only includes English glosses.

$ grep '^名詞 ' edict
名詞 [なことば] /(ok) (n) (ling) noun/
名詞 [めいし] /(n) (ling) noun/(P)/
$ grep 1531570 edict2
名詞 [めいし(P);なことば(ok)] /(n) {ling} noun/(P)/EntL1531570X/
$ xml sel -t -c '/JMdict/entry[ent_seq=1531570]' JMdict_e
<entry>
<ent_seq>1531570</ent_seq>
<k_ele>
<keb>名詞</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news2</ke_pri>
<ke_pri>nf38</ke_pri>
</k_ele>
<r_ele>
<reb>めいし</reb>
<re_pri>ichi1</re_pri>
<re_pri>news2</re_pri>
<re_pri>nf38</re_pri>
</r_ele>
<r_ele>
<reb>なことば</reb>
<re_inf>out-dated or obsolete kana usage</re_inf>
</r_ele>
<info>
<audit>
<upd_date>2010-07-27</upd_date>
<upd_detl>Entry created</upd_detl>
</audit>
<audit>
<upd_date>2010-07-28</upd_date>
<upd_detl>Entry amended</upd_detl>
</audit>
</info>
<sense>
<pos>noun (common) (futsuumeishi)</pos>
<field>linguistics terminology</field>
<gloss>noun</gloss>
</sense>
</entry>
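
A minimal sketch of reading such an entry with Python's xml.etree.ElementTree, using a shortened copy of the element above. The function name and the dictionary keys are my own. I have only tried it on a snippet like this one, where the entities are already expanded as in the xml sel output, not on the whole JMdict file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <entry> element shown above.
SAMPLE = """<entry>
<ent_seq>1531570</ent_seq>
<k_ele>
<keb>名詞</keb>
<ke_pri>ichi1</ke_pri>
</k_ele>
<r_ele>
<reb>めいし</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>なことば</reb>
<re_inf>out-dated or obsolete kana usage</re_inf>
</r_ele>
<sense>
<pos>noun (common) (futsuumeishi)</pos>
<field>linguistics terminology</field>
<gloss>noun</gloss>
</sense>
</entry>"""


def summarize_entry(entry):
    """Collect the ID, headwords, readings, and senses of a JMdict <entry>."""
    return {
        "ent_seq": int(entry.findtext("ent_seq")),
        "headwords": [k.findtext("keb") for k in entry.findall("k_ele")],
        # Readings are kept in document order, so the first one is r_ele[1].
        "readings": [r.findtext("reb") for r in entry.findall("r_ele")],
        "senses": [
            {
                "pos": [p.text for p in sense.findall("pos")],
                "glosses": [g.text for g in sense.findall("gloss")],
            }
            for sense in entry.findall("sense")
        ],
    }


info = summarize_entry(ET.fromstring(SAMPLE))
print(info["ent_seq"], info["headwords"], info["readings"])
print(info["senses"][0]["glosses"])  # ['noun']
```

Kana-only entries have no k_ele element, so headwords is an empty list for them.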

WWWJDIC audio files

WWWJDIC has audio samples for over 100,000 of the words (or readings of words) in JMdict/EDICT. I used this script to download the audio files:

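# Base URL of the audio endpoint
php=http://assets.languagepod101.com/dictionary/japanese/audiomp3.php
mkdir -p /tmp/pod
# Kana-only entries (lines without a [reading] field)
sed -n 's,^\([^[]*\) /.*,\1,p' ~/japanese/data/edict|while read -r x;do curl "$php?kana=$x" -o "/tmp/pod/$x.mp3";done
# Entries with a headword and a [reading]
sed -n 's,^\(.*\) \[\(.*\)\] .*,\1 \2,p' ~/japanese/data/edict|while read -r x y;do curl "$php?kanji=$x&kana=$y" -o "/tmp/pod/$x $y.mp3";done
# Delete the placeholder files that are served for missing words
find /tmp/pod \( -size 52288c -o -size 53303c \) -delete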

An audio file with a size of 52288 or 53303 bytes is served for missing words.

When I ran the script, it took about 10 hours, and it downloaded audio files for 127,677 out of 213,800 words. I uploaded the files to http://jptxt.net/wwwjdic-audio.tar (1.8 GB).
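
The URL building and the placeholder check from the script can be sketched in Python like this. It is not the script I ran: the function names are hypothetical, and I am assuming that the server treats percent-encoded UTF-8 parameters the same way as the raw UTF-8 that curl sends.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://assets.languagepod101.com/dictionary/japanese/audiomp3.php"
# Sizes in bytes of the files that are served for missing words.
PLACEHOLDER_SIZES = {52288, 53303}


def audio_url(kana, kanji=None):
    """Build the request URL; kanji is omitted for kana-only entries."""
    params = {}
    if kanji:
        params["kanji"] = kanji
    params["kana"] = kana
    return BASE_URL + "?" + urlencode(params)  # percent-encodes as UTF-8


def is_placeholder(data):
    """True if the response has the size of a 'missing word' file."""
    return len(data) in PLACEHOLDER_SIZES


def download(kana, kanji=None):
    """Return the MP3 bytes for a word, or None if only a placeholder is served."""
    with urlopen(audio_url(kana, kanji)) as response:
        data = response.read()
    return None if is_placeholder(data) else data


print(audio_url("すね", "脛"))
```

Checking the size before writing the file avoids the separate find step, at the cost of holding each response in memory.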