This downloads the dictionary files and converts the ones encoded as EUC-JP to UTF-8:
for f in JMdict JMdict_e edict edict2 edict2u edict_sub kanjd212 kanjd213u kanjidic kanjidic2.xml kanjidic_comb_utf8 JMnedict.xml enamdict examples.utf kradfile-u.gz;do rsync "ftp.monash.edu.au::nihongo/$f" "$f";done
for f in edict edict2 edict_sub enamdict kanjd212;do iconv -f euc-jp -t utf-8 "$f">"$f.utf";done
This lists all files:
rsync ftp.monash.edu.au::nihongo
The old KANJIDIC format looks like this:
哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい {pathetic} {grief} {sorrow} {sympathize}
L401 means that the kanji has RTK frame number 401 in the fifth and earlier editions of RTK1, even though it changed to 395 in the sixth edition.
S9 means that the kanji has 9 strokes.
F1715 means that the kanji has a frequency rank of 1715. The frequency ranks are only included for the first 2501 characters.
B8 means that the kanji has radical (bushu) number 8 (亠).
G8 means that the kanji is taught in secondary school.
The grade numbers are based on the 2010 jōyō kanji list and the 2004 jinmeiyō kanji list.
All 2136 jōyō kanji have been assigned a grade:
$ grep ' G[1-8] ' kanjidic|wc -l
2136
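The codes described above can be pulled out of a KANJIDIC line with a few lines of awk. This is only a sketch: kanjidic_field is a hypothetical helper name, and it assumes that the code is a prefix followed by digits only (as with S9, F1715, or L401), so codes such as P2-2-7 are not handled.

```shell
# Print the numeric value of a KANJIDIC code such as S (stroke count),
# F (frequency rank), or L (RTK frame number) for each line on stdin.
# kanjidic_field is a hypothetical helper name, not part of KANJIDIC.
kanjidic_field() {
  awk -v prefix="$1" '{
    # Field 1 is the kanji and field 2 is the JIS code, so start at 3.
    for (i = 3; i <= NF; i++) {
      if ($i ~ /^[{]/) break            # the meanings start here
      rest = substr($i, length(prefix) + 1)
      # The field must start with the prefix and continue with digits only,
      # so that F does not match DF1131 and D does not match DK1310.
      if (index($i, prefix) == 1 && rest ~ /^[0-9]+$/) {
        print rest
        next
      }
    }
  }'
}
```

For example, grep '^哀 ' kanjidic.txt | kanjidic_field L should print 401 for the line shown above.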
There are two special tags used in the reading field: T1 for nanori (such as 藤: トウ ドウ ふじ T1 ぞう と ふじゅ) and T2 for radical names (such as 气: ケ いき T2 きがまえ).
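The T1 tag makes it possible to separate the nanori from the other readings. Here is a minimal sketch, assuming that the nanori are the fields between T1 and either T2 or the first {meaning}; kanjidic_nanori is a hypothetical helper name.

```shell
# Print the nanori readings (the fields after T1) of each KANJIDIC line on
# stdin, separated by spaces. Lines without T1 produce no output.
# kanjidic_nanori is a hypothetical helper name.
kanjidic_nanori() {
  awk '{
    mode = 0; out = ""
    for (i = 2; i <= NF; i++) {
      if ($i == "T1") { mode = 1; continue }   # nanori follow
      if ($i == "T2") { mode = 2; continue }   # radical names follow
      if ($i ~ /^[{]/) break                   # the meanings start here
      if (mode == 1) out = out (out == "" ? "" : " ") $i
    }
    if (out != "") print out
  }'
}
```

Feeding it the reading field of 藤 quoted above (トウ ドウ ふじ T1 ぞう と ふじゅ) prints ぞう と ふじゅ.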
KANJIDIC2 is the XML version of KANJIDIC. In addition to the information included in KANJIDIC, KANJIDIC2 includes French, Portuguese, and Spanish meanings, hangul versions of Korean readings, and accepted miscounts of stroke counts.
$ grep ^哀 kanjidic.txt
哀 3025 U54c0 B8 C30 G8 S9 F1715 J1 N304 V791 H2068 DK1310 L401 K1670 DO1249 MN3580 MP2.0997 E998 IN1675 DF1131 DT1239 DJ1804 DG327 DM408 P2-2-7 I2j7.4 Q0073.2 DR465 Yai1 Wae アイ あわ.れ あわ.れむ かな.しい {pathetic} {grief} {sorrow} {pathos} {pity} {sympathize}
$ xml sel -t -c '/kanjidic2/character[literal="哀"]' kanjidic2.xml
<character>
<literal>哀</literal>
<codepoint>
<cp_value cp_type="ucs">54c0</cp_value>
<cp_value cp_type="jis208">16-5</cp_value>
</codepoint>
<radical>
<rad_value rad_type="classical">30</rad_value>
<rad_value rad_type="nelson_c">8</rad_value>
</radical>
<misc>
<grade>8</grade>
<stroke_count>9</stroke_count>
<freq>1715</freq>
<jlpt>1</jlpt>
</misc>
<dic_number>
<dic_ref dr_type="nelson_c">304</dic_ref>
<dic_ref dr_type="nelson_n">791</dic_ref>
<dic_ref dr_type="halpern_njecd">2068</dic_ref>
<dic_ref dr_type="halpern_kkld">1310</dic_ref>
<dic_ref dr_type="heisig">401</dic_ref>
<dic_ref dr_type="gakken">1670</dic_ref>
<dic_ref dr_type="oneill_kk">1249</dic_ref>
<dic_ref dr_type="moro" m_vol="2" m_page="0997">3580</dic_ref>
<dic_ref dr_type="henshall">998</dic_ref>
<dic_ref dr_type="sh_kk">1675</dic_ref>
<dic_ref dr_type="jf_cards">1131</dic_ref>
<dic_ref dr_type="tutt_cards">1239</dic_ref>
<dic_ref dr_type="kanji_in_context">1804</dic_ref>
<dic_ref dr_type="kodansha_compact">327</dic_ref>
<dic_ref dr_type="maniette">408</dic_ref>
</dic_number>
<query_code>
<q_code qc_type="skip">2-2-7</q_code>
<q_code qc_type="sh_desc">2j7.4</q_code>
<q_code qc_type="four_corner">0073.2</q_code>
<q_code qc_type="deroo">465</q_code>
</query_code>
<reading_meaning>
<rmgroup>
<reading r_type="pinyin">ai1</reading>
<reading r_type="korean_r">ae</reading>
<reading r_type="korean_h">애</reading>
<reading r_type="ja_on">アイ</reading>
<reading r_type="ja_kun">あわ.れ</reading>
<reading r_type="ja_kun">あわ.れむ</reading>
<reading r_type="ja_kun">かな.しい</reading>
<meaning>pathetic</meaning>
<meaning>grief</meaning>
<meaning>sorrow</meaning>
<meaning>pathos</meaning>
<meaning>pity</meaning>
<meaning>sympathize</meaning>
<meaning m_lang="fr">pitoyable</meaning>
<meaning m_lang="fr">peine</meaning>
<meaning m_lang="fr">chagrin</meaning>
<meaning m_lang="fr">pitié</meaning>
<meaning m_lang="fr">pathétique</meaning>
<meaning m_lang="fr">compatir</meaning>
<meaning m_lang="es">compasión</meaning>
<meaning m_lang="es">lástima</meaning>
<meaning m_lang="es">miseria</meaning>
<meaning m_lang="es">piedad</meaning>
<meaning m_lang="es">pena</meaning>
<meaning m_lang="es">compadecerse de</meaning>
<meaning m_lang="pt">patético</meaning>
<meaning m_lang="pt">pesar</meaning>
<meaning m_lang="pt">pena</meaning>
<meaning m_lang="pt">emoção</meaning>
<meaning m_lang="pt">compaixão</meaning>
<meaning m_lang="pt">solidariesar</meaning>
</rmgroup>
</reading_meaning>
</character>
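The meanings for one language can be filtered out of a character element like the one above with sed. This is a rough sketch rather than a real XML parser: it assumes that each meaning element is on its own line, as in the output above, and it does not decode XML entities. kanjidic2_meanings is a hypothetical helper name.

```shell
# Print the meanings for one language from KANJIDIC2 XML on stdin.
# Use en for the English meanings, which have no m_lang attribute.
# kanjidic2_meanings is a hypothetical helper name.
kanjidic2_meanings() {
  lang=$1
  if [ "$lang" = en ]; then
    sed -n 's,.*<meaning>\(.*\)</meaning>.*,\1,p'
  else
    sed -n 's,.*<meaning m_lang="'"$lang"'">\(.*\)</meaning>.*,\1,p'
  fi
}
```

It can be combined with the xml sel command above: xml sel -t -c '/kanjidic2/character[literal="哀"]' kanjidic2.xml | kanjidic2_meanings fr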
The edict file has one headword and reading per line, but the edict2 file has multiple headwords and readings on some lines. edict_sub is a subset of the edict file that contains only the prioritized entries marked with (P).
$ grep shin/shank edict
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
脛 [はぎ] /(ok) (n) (uk) shin/shank/lower leg/
臑 [すね] /(n) (uk) shin/shank/lower leg/
$ grep shin/shank edict2
脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
$ grep shin/shank edict_sub
脛 [すね] /(n) (uk) shin/shank/lower leg/(P)/
$ wc -l edict edict2 edict_sub
227336 edict
170430 edict2
22636 edict_sub
Here are examples of entries in EDICT2:
脛(P);臑 [すね(P);はぎ(脛)(ok)] /(n) (uk) shin/shank/lower leg/(P)/EntL1570850X/
成り;為り [なり] /(n) (See 成る・7) being promoted (shogi)/EntL2611370/
生足;なま足;生脚 [なまあし] /(n) (sl) (See 生・なま・2) bare legs/bare feet/stockingless legs/EntL2113910/
引用句 [いんようく] /(n) {ling} quotation/EntL1169670X/
In EntL1570850X, 1570850 is the ID used in JMdict and JMdictDB, and X means that there is an audio file for the entry in WWWJDIC.
In 脛(P);臑 [すね(P);はぎ(脛)(ok)], すね is marked as a reading for both 脛 and 臑, but はぎ is only marked as a reading for 脛.
(P) is used for the prioritized headwords and readings that have an element like <ke_pri>ichi1</ke_pri> or <re_pri>ichi1</re_pri> in JMdict.
(See 成る・7) refers to the seventh sense of the entry that has 成る as a headword.
(See 生・なま・2) refers to the second sense of the entry that has 生 as a headword and なま as a reading, since there are multiple entries that have 生 as a headword.
{ling} is a field of application tag for linguistic terminology.
EDICT2 and JMdict typically have more common readings first but EDICT does not:
$ grep '^舅 ' edict2
舅 [しゅうと(P);しうと;しいと(ok)] /(n) (See 姑) father-in-law/(P)/EntL1571280X/
$ grep '^舅 ' edict
舅 [しいと] /(ok) (n) father-in-law/
舅 [しうと] /(n) father-in-law/
舅 [しゅうと] /(n) father-in-law/(P)/
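Two of the EDICT2 conventions described above, the EntL ID with its optional X suffix and the readings that are restricted to certain headwords, can be sketched in awk as follows. Both function names are hypothetical. The second function assumes that neither headwords nor readings contain spaces, that a parenthesized ASCII string after a reading is a marker like (P) or (ok), and that anything else in parentheses is a headword restriction (possibly comma-separated). It does not check that the given headword actually occurs on the line.

```shell
# Print the JMdict entry ID of each EDICT2 line on stdin, followed by
# "audio" if the ID has the X suffix and "no-audio" otherwise.
edict2_id() {
  awk -F/ 'NF > 2 {
    id = $(NF - 1)                       # for example EntL1570850X
    if (id !~ /^EntL[0-9]+X?$/) next     # skip lines without an ID field
    sub(/^EntL/, "", id)
    flag = sub(/X$/, "", id) ? "audio" : "no-audio"
    print id, flag
  }'
}

# Print the readings of each EDICT2 line on stdin that apply to the
# headword given as the first argument, separated by spaces.
edict2_readings_for() {
  awk -v want="$1" '$2 ~ /^\[/ {
    readings = $2                        # for example [すね(P);はぎ(脛)(ok)]
    gsub(/^\[|\]$/, "", readings)
    n = split(readings, items, /;/)
    out = ""
    for (i = 1; i <= n; i++) {
      m = split(items[i], parts, /[(]/)  # parts[1] is the reading itself
      restricted = 0; matched = 0
      for (j = 2; j <= m; j++) {
        tag = parts[j]
        sub(/[)]$/, "", tag)
        if (tag ~ /^[!-~]+$/) continue   # ASCII markers such as P or ok
        restricted = 1                   # anything else restricts the reading
        k = split(tag, heads, /,/)
        for (l = 1; l <= k; l++) if (heads[l] == want) matched = 1
      }
      if (!restricted || matched) out = out (out == "" ? "" : " ") parts[1]
    }
    if (out != "") print out
  }'
}
```

For the 脛(P);臑 line, grep EntL1570850X edict2 | edict2_readings_for 臑 prints すね, and the same command with 脛 prints すね はぎ.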
Even though EDICT2 and JMdict have fewer headwords with two or more entries than EDICT, they still have some headwords with two or more entries:
$ grep 雨雪 edict2
雨雪 [あめゆき;あまゆき] /(n) (col) sleet (mixture of snow and rain)/EntL2768340/
雨雪 [うせつ] /(n) (1) snow and rain/(2) (arch) snowfall/EntL2768330/
$ grep 然然 edict2
然然;然々 [ささ] /(adv) (arch) such and such/EntL2173610/
然然;然々 [しかじか] /(adv,n,adj-no) such and such/EntL1831970X/
In versions of JMdict and EDICT2 from 2013, there were 817 headwords with 2 entries, 59 headwords with 3 entries, 13 headwords with 4 entries, 1 headword with 5 entries, and 2 headwords with 6 entries.
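A histogram like this could be derived from edict2 roughly as follows. This is only a sketch, and headword_histogram is a hypothetical helper name: it treats the first field of kana-only entries as a headword as well, so the figures it prints will not necessarily match the ones above and will differ between versions of the file.

```shell
# Read EDICT2 lines on stdin and print "N COUNT" pairs, where COUNT is the
# number of headwords that occur in exactly N entries (for N > 1).
# headword_histogram is a hypothetical helper name.
headword_histogram() {
  awk '$0 ~ /\/EntL[0-9]+X?\/$/ {        # only lines that end with an ID
    n = split($1, heads, /;/)
    for (i = 1; i <= n; i++) {
      word = heads[i]
      sub(/[(].*/, "", word)             # drop markers such as (P)
      count[word]++
    }
  }
  END {
    for (word in count) if (count[word] > 1) hist[count[word]]++
    for (c in hist) print c, hist[c]
  }' | sort -n
}
```

Usage: headword_histogram < edict2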
Here are different ways to find the first reading of the lowest-ID entry for a headword:
$ xml sel -t -v '/JMdict/entry[k_ele/keb="雨雪"][1]/r_ele[1]/reb' JMdict
うせつ
$ ruby -rnokogiri -e'puts Nokogiri.XML(IO.read("JMdict_e")).xpath("/JMdict/entry[k_ele/keb=\"雨雪\"][1]/r_ele[1]/reb").text'
うせつ
$ awk -v v=雨雪 '{split($1,headwords,";");for(i in headwords){headword=headwords[i];sub(/\(.*/,"",headword);if(headword==v){split($2,readings,/[ -~]/);print readings[2];exit}}}' <(awk -F/ '{print$(NF-1),$0}' edict2|LC_ALL=C sort -nk1,1|cut -d\ -f2-)
うせつ
The entries in JMdict are sorted by the IDs but the lines in EDICT2 are not. An entry with a lower ID is more likely to include a more common sense or reading of a headword.
JMdict is the XML version of EDICT and EDICT2. In addition to the information included in EDICT2, JMdict also includes glosses in languages other than English and the dates when the entry was changed. JMdict_e only includes English glosses.
$ grep '^名詞 ' edict
名詞 [なことば] /(ok) (n) (ling) noun/
名詞 [めいし] /(n) (ling) noun/(P)/
$ grep 1531570 edict2
名詞 [めいし(P);なことば(ok)] /(n) {ling} noun/(P)/EntL1531570X/
$ xml sel -t -c '/JMdict/entry[ent_seq=1531570]' JMdict_e
<entry>
<ent_seq>1531570</ent_seq>
<k_ele>
<keb>名詞</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news2</ke_pri>
<ke_pri>nf38</ke_pri>
</k_ele>
<r_ele>
<reb>めいし</reb>
<re_pri>ichi1</re_pri>
<re_pri>news2</re_pri>
<re_pri>nf38</re_pri>
</r_ele>
<r_ele>
<reb>なことば</reb>
<re_inf>out-dated or obsolete kana usage</re_inf>
</r_ele>
<info>
<audit>
<upd_date>2010-07-27</upd_date>
<upd_detl>Entry created</upd_detl>
</audit>
<audit>
<upd_date>2010-07-28</upd_date>
<upd_detl>Entry amended</upd_detl>
</audit>
</info>
<sense>
<pos>noun (common) (futsuumeishi)</pos>
<field>linguistics terminology</field>
<gloss>noun</gloss>
</sense>
</entry>
WWWJDIC has audio samples for over 100,000 of the words (or readings of words) in JMdict/EDICT. I used this script to download the audio files:
php=http://assets.languagepod101.com/dictionary/japanese/audiomp3.php
mkdir /tmp/pod
sed -n 's,^\([^[]*\) /.*,\1,p' ~/japanese/data/edict|while read x;do curl "$php?kana=$x" -o /tmp/pod/$x.mp3;done
sed -n 's,^\([^ ]*\) \[\([^]]*\)\] .*,\1 \2,p' ~/japanese/data/edict|while read x y;do curl "$php?kanji=$x&kana=$y" -o "/tmp/pod/$x $y.mp3";done
find /tmp/pod \( -size 52288c -o -size 53303c \) -delete
An audio file with a size of 52288 or 53303 bytes is served for missing words.
When I ran the script, it took about 10 hours, and it downloaded audio files for 127,677 out of 213,800 words. I uploaded the files to http://jptxt.net/wwwjdic-audio.tar (1.8 GB).
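The size check could also be done per file, for example to remove a placeholder right after it has been downloaded instead of running find at the end. This is a minimal sketch that assumes the two placeholder sizes mentioned above have not changed; is_placeholder_audio is a hypothetical helper name.

```shell
# Succeed if the file has one of the two sizes that are served for
# missing words. is_placeholder_audio is a hypothetical helper name.
is_placeholder_audio() {
  # The arithmetic expansion strips the padding that some versions of
  # wc add around the byte count.
  size=$(( $(wc -c < "$1") ))
  [ "$size" -eq 52288 ] || [ "$size" -eq 53303 ]
}
```

Inside the download loops, it could be used like this: is_placeholder_audio "/tmp/pod/$x.mp3" && rm "/tmp/pod/$x.mp3"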