Scripts

Estimate the difficulty of Japanese example sentences by summing the frequency ranks of words

The script uses MeCab to print the lemma (base form) of each recognized word, keeps words whose character type is 2 (one or more kanji) or 6 (hiragana), and sums the words' positions in a word frequency list.

mecab -F '%f[6]\t%t\t' -E '\n'|awk -F\\t 'NR==FNR{a[$0]=NR;next}{sum=0;for(i=1;i<NF-1;i+=2){if($(i+1)~/2|6/)sum+=a[$i]};print sum}' <(curl -s jptxt.net/word-frequency.txt|grep -v '^#'|cut -d\; -f1) -
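
For example, if the pipeline is saved as difficulty.sh (an illustrative name), it prints one sum per line of Japanese text on standard input; larger sums indicate sentences built from rarer words:

echo 吾輩は猫である。|bash difficulty.sh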

Print an EDICT entry for each word from STDIN

awk 'NR==FNR{a[$1]=a[$1]$0"\n";next}{printf"%s",a[$0]}' <(curl -s ringtail.its.monash.edu.au/pub/nihongo/edict.gz|gzip -d|iconv -f euc-jp -t utf-8) -

The EDICT file is the old plain-text version of EDICT/EDICT2/JMdict; each of its lines contains a single headword and reading. EDICT2 is a similar format in which a line can contain multiple headwords or readings.
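
For illustration (these are not verbatim entries), an EDICT line looks like this:

単語 [たんご] /(n) word/vocabulary/

An EDICT2 line can carry several headwords and readings:

御飯;ご飯 [ごはん] /(n) cooked rice/meal/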

Generate an audio file for reviewing vocabulary items

This script creates an audio file for reviewing vocabulary items by concatenating the Japanese audio files used by WWWJDIC and English audio files generated by the say command that comes with OS X. wordlist.txt contains lines like 単語 vocabulary, and the audio files have filenames like 単語 たんご.mp3.

i=10000
cat wordlist.txt|awk -F/ 'NR==FNR{x=$NF;gsub(/ .*/,"",x);a[x]=$0;next}{jp=en=$0;sub(/ .*/,"",jp);sub(/^[^ ]* /,"",en)}jp in a{print jp"\t"en"\t"a[jp]}' <(printf %s\\n ~/japanese/files/pod/*.mp3) -|while IFS=$'\t' read -r jp en mp3;do
  ffmpeg -v 0 -i "$mp3" -ar 22050 /tmp/$((i++)).aif</dev/null # </dev/null keeps ffmpeg from consuming the loop's stdin
  say "[[volm 0.6]]$en[[slnc 800]]" -v alex -o /tmp/$((i++)).aif
done
sox /tmp/*.aif /tmp/0.aif
ffmpeg -i /tmp/0.aif -c:a libfaac -q:a 150 -y output.m4a
rm /tmp/*.aif

Convert RTK keywords to kanji

printf %s\\n "${@-$(cat)}"|awk -F\; 'NR==FNR{a[$3]=$1;next}{print a[$0]}' <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#) -|paste -sd\\0 -

For example, when the script is run with the arguments Sino- character, it prints 漢字.

Convert kanji to RTK keywords

printf %s\\n "${@-$(cat)}"|grep -o .|awk -F\; 'NR==FNR{a[$1]=$3;next}$0 in a{print$0,a[$0]}' <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#) -
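
For example, when the script is run with the argument 漢字, it prints 漢 Sino- and 字 character on separate lines.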

Replace Latin letters with katakana

You can use its output to practice reading katakana while reading text that is written in the Latin alphabet. The three-letter romaji (shi, chi, tsu) are replaced before the two-letter romaji so that, for example, su does not consume part of tsu.

awk 'NR==FNR{k[++n]=$1;r[n]=$2;next}{for(l=3;l>=2;l--)for(i=1;i<=n;i++)if(length(r[i])==l)gsub(r[i],k[i])}1' <(printf %s\\n カ ka ガ ga キ ki ギ gi ク ku グ gu ケ ke ゲ ge コ ko ゴ go サ sa ザ za シ shi ジ ji ス su ズ zu セ se ゼ ze ソ so ゾ zo タ ta ダ da チ chi ツ tsu テ te デ de ト to ド do ナ na ニ ni ヌ nu ネ ne ノ no ハ ha バ ba パ pa ヒ hi ビ bi ピ pi フ fu ブ bu プ pu ヘ he ベ be ペ pe ホ ho ボ bo ポ po マ ma ミ mi ム mu メ me モ mo ヤ ya ユ yu ヨ yo ラ ra リ ri ル ru レ re ロ ro ワ wa ヰ wi ヱ we ヲ wo|paste - -) -

Replace Latin letters with hiragana

awk 'NR==FNR{k[++n]=$1;r[n]=$2;next}{for(l=3;l>=2;l--)for(i=1;i<=n;i++)if(length(r[i])==l)gsub(r[i],k[i])}1' <(printf %s\\n か ka が ga き ki ぎ gi く ku ぐ gu け ke げ ge こ ko ご go さ sa ざ za し shi じ ji す su ず zu せ se ぜ ze そ so ぞ zo た ta だ da ち chi つ tsu づ du て te で de と to ど do な na に ni ぬ nu ね ne の no は ha ば ba ぱ pa ひ hi び bi ぴ pi ふ fu ぶ bu ぷ pu へ he べ be ぺ pe ほ ho ぼ bo ぽ po ま ma み mi む mu め me も mo や ya ゆ yu よ yo ら ra り ri る ru れ re ろ ro わ wa ゐ wi ゑ we を wo|paste - -) -

Convert hiragana to katakana

tr $'[\u3040-\u309f]' $'[\u30a0-\u30ff]'

This does not work with GNU tr, which does not support multibyte characters. Another option is to use this command: ruby -pe'$_.tr!"\u{3040}-\u{309f}","\u{30a0}-\u{30ff}"'.
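
For example, with a multibyte-capable tr like the one that ships with OS X:

echo ひらがな|tr $'[\u3040-\u309f]' $'[\u30a0-\u30ff]'

This prints ヒラガナ.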

Convert ASCII printable characters to full-width characters

tr ' -~' $'\u3000\uff01-\uff5e'

The Unicode block for full-width forms contains versions of all printable ASCII characters except space, in ASCII order, so a single tr range covers them; space itself is mapped to U+3000, the ideographic space.

This does not work with GNU tr, which does not support multibyte characters. Another option is to use this command: ruby -pe'$_.tr!" -~","\u{3000}\u{ff01}-\u{ff5e}"'.
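
For example, with a multibyte-capable tr:

echo 'Hello, world!'|tr ' -~' $'\u3000\uff01-\uff5e'

This prints Ｈｅｌｌｏ，　ｗｏｒｌｄ！ with an ideographic space between the words.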

Search for entries in the edict file

[ $# -ne 0 ]&&grep -Eie "$*" ~/f/edict

You can download the edict file and convert it to UTF-8 by running this command: curl -s ringtail.its.monash.edu.au/pub/nihongo/edict.gz|gzip -d|iconv -f euc-jp -t utf-8>~/f/edict.

Replace RTK keywords with kanji in English text

sed "$(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#|head -2200|awk -F\; '$3{print$1$3}'|sed 's,[][/.*],\\&,g;s,\(.\)\(.*\),s/\\b\2\\b/\1/g,')"

The output looks like this:

"An 醜 bit 之 古 metal," says the 聖 男 to the shopkeeper; "but it
will 為 井 enough to 煮 my humble drop 之 水 之 an 夕. 吾'll
呉 you 三 厘 for it." This 彼 did and took the kettle 宅,
rejoicing; for it was 之 bronze, fine 働, the very 物 for the
Cha-no-yu.

Practice RTK keywords by typing the keywords of kanji in a shell

This script displays one kanji at a time, prompts you to type the RTK keyword of the kanji, and displays the correct answer for two seconds if the answer is wrong. After showing 50 kanji, it exits and shows the correct keywords for all incorrect answers.

#!/usr/bin/env bash

cd "${0%/*}"
export LC_ALL=en_US.UTF-8
trap onexit EXIT
clear

onexit(){
  echo
  clear
  echo "$log"|awk '$2==0{print$3,$4}'|paste -sd' ' -
  awk -v t="$(date +%s)" -v d="$d" -v n="$(wc -l<<<"$log")" 'BEGIN{printf"Average time per kanji: %.1f\n",(t-d)/n}'
}

IFS=$';\n' read -d '' -a keywords< <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#|head -n2200|cut -d\; -f1,3)

n=${1-50}
d=$(date +%s)

for ((i=1;i<=n;i++));do
  framenumber=$(($RANDOM$RANDOM%(${#keywords[@]}/2)))
  kanji=${keywords[framenumber*2]}
  keyword=${keywords[framenumber*2+1]}
  pad=$(printf %$(($(tput cols)/2-7))s)
  read -ep"$pad$kanji " -n${#keyword} answer
  if [[ $answer = "$keyword" ]];then
    status=1
  else
    clear
    echo "$pad$kanji $keyword"
    sleep 2
    clear
    status=0
    read -d '' -t0.001 -n99999 # clear the typeahead buffer
    printf '\e[2K\r'
  fi
  logline="$(date +%s) $status $kanji $keyword"
  echo "$logline">>rtktypelog
  log+=$logline$'\n'
done
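
The number of kanji to show can be passed as the first argument (the default is 50), and every answer is appended to the rtktypelog file in the script's directory.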

Add furigana to pieces of Japanese text that are paired with the same text written in kana

def kata2hira(x)
  x.gsub(/[\u{30a1}-\u{30fa}]/) { [$&.ord - 96].pack("U") }
end

def furigana(word, reading = "") # reading defaults to "" for lines like ごろごろ that have no separate reading
  hira = kata2hira(reading)
  if word == reading or reading == "" or word =~ /^[\u{3040}-\u{30ff}\u{ff00}-\u{ffef}]+$/
    word
  elsif word == hira
    word
  elsif word =~ /^[\u{4e00}-\u{9fff}]+$/
    "<ruby><rb>#{word}</rb><rt>#{hira}</rt></ruby>"
  else
    groups = word.scan(/(?:[^\u{3040}-\u{30ff}]+|[\u{3040}-\u{30ff}]+)/)
    regex = "^" + groups.map { |g|
      if g =~ /^[\u{3040}-\u{30ff}]+$/
        "(#{Regexp.escape(kata2hira(g))}|#{Regexp.escape(g)})"
      else
        "(.+?)"
      end
    }.join + "$"
    kanagroups = hira.scan(Regexp.new(regex))[0]
    return "<ruby><rb>#{word}</rb><rt>#{hira}</rt></ruby>" unless kanagroups
    0.upto(groups.length - 1) { |i|
      unless groups[i] =~ /[\u{3040}-\u{30ff}]/
        groups[i] = "<ruby><rb>#{groups[i]}</rb><rt>#{kanagroups[i]}</rt></ruby>"
      end
    }
    groups.join
  end
end

if __FILE__ == $0
  "次々 つぎつぎ
ユニークな ユニークな
痛い いたい
困難な こんなんな
言い訳 いいわけ
ごろごろ
カット かっと
くっ付ける くっつける
ジェット機 じぇっとき
湿っぽい しめっぽい
東京ドーム とうきょうドーム
3月 さんげつ
一ヶ月 いっかげつ
X線 エックスせん
八ッ橋 やつはし
4ヵ年 よんかねん
ィ形容詞 イけいようし
黄色い きいろい
物の怪 もののけ
鬼に金棒 おににかなぼう
千円貸してください せんえんかしてください".split("\n").each { |line|
    puts furigana(*line.split(" ", 2))
  }
end
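
For example, the 言い訳 いいわけ line should print this:

<ruby><rb>言</rb><rt>い</rt></ruby>い<ruby><rb>訳</rb><rt>わけ</rt></ruby>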

Use MeCab to add furigana to Japanese text

Anki's Japanese support plugin also uses MeCab to generate furigana (https://github.com/dae/ankiplugins/blob/master/japanese/reading.py).

require "./furigana" # this is the script in the section above

def mecab_furigana(line)
  IO.popen(["mecab", "-F%m\\t%f[7]\\n", "-U%m\\t\\n", "-E", ""], "r+") { |io|
    io.puts line
    io.close_write
    io.read
  }.lines.map { |l| furigana(*l.chomp.split("\t", 2)) }.join
end

"「IT」は何の略か知っていますか。
この綱は直径20cmあるそうです。
妹は来年、二十歳になります。
今日の新聞、どこに置いた?
3月は仕事が忙しい。
彼は数学の博士だそうです。
彼女はOLです。
工事は3月まで続きます。
定価から2000円割り引きますよ。
私の国について少しお話しましょう。
東京ドーム
10ヶ国
12ヶ月
どうしよ~。
X線
No.2
命の親
〆切".split.each { |line|
  puts mecab_furigana(line)
}

Find kanji compounds whose first translation in EDICT is the same as the RTK keywords of the kanji joined by spaces

rtk = Hash[IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l|
  l.split(";").values_at(0, 2)
}]
kanji = rtk.keys.join
IO.foreach("#{Dir.home}/japanese/data/edict") { |l|
  c = l.scan(/^([#{kanji}]{2}) \[(.*?)\] \/(.*?)\//)[0] # first (only) match: [compound, reading, first gloss]
  next unless c
  c[2] = c[2].sub(/^(\([^)]*\) )*/, "").sub(/ \([^)]*\)$/, "").sub(/^to /, "")
  next unless rtk[c[0][0]] + " " + rtk[c[0][1]] == c[2]
  puts c[1].rjust(5, "\u{3000}") + " " + c[0] + " " + c[2]
}

The output consists of lines like this:

  とくぎ 特技 special skill
  のうは 脳波 brain waves
 さんそう 山荘 mountain villa

Generate an HTML file for reviewing uncommon words in subtitle files or other Japanese text

Dir.chdir(__dir__)

require "./furigana"

edict = {}
IO.read("../data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each { |entry|
  keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
  next if edict.key?(keb)
  reb = entry[/(?<=<reb>).*(?=<\/reb>)/]
  gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/]
  gloss = gloss.sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
  edict[keb] = [reb, gloss]
}

freq = Hash[IO.read("#{Dir.home}/Sites/jp/word-frequency.txt").scan(/^[^#;]+/)[20000..200000].map { |w| [w, nil] }]
rtk = IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l| l[0] }.join
words = `for f in ~/desktop/*.srt;do mecab -F '%t %f[6]\\n' "$f";done|awk '$1=="2"{print$2}'`
output = ""

words.split.uniq.shuffle.each { |word|
  next unless freq.key?(word)
  next unless word =~ /^[\u{3040}-\u{309f}#{rtk}]{2,}$/
  reb, gloss = edict[word] || next
  next unless gloss =~ /^[a-z -]{1,18}$/
  output << "<div onclick=\"highlight(this)\"><div>#{furigana(word, reb)}</div><div>#{gloss}</div></div>\n"
}

exit if output == ""
f = "../review/episodes.html"
IO.write(f, IO.read(f).sub(/<div.*<\/div>/m, output))
system("open", f)

Modify Japanese SRT subtitles to add translations after uncommon words

freq = {}
IO.readlines("#{Dir.home}/Sites/jp/word-frequency.txt").grep(/^[^#]/).each { |x|
  freq[x.split(";")[0]] = nil
}

edict = {}
IO.read("../data/JMdict_e.xml").scan(/<entry>.*?<\/entry>/m).each { |entry|
  keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
  next if edict[keb]
  next unless freq.key?(keb)
  gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/]
  gloss = gloss.sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
  next if gloss.length > 20
  edict[keb] = gloss
}

Dir["#{Dir.home}/Desktop/*.srt"].each { |f|
  out = ""
  IO.read(f).gsub("\r", "").split("\n\n").each { |s|
    id, time, subs = s.split("\n", 3)
    out << id + "\n" + time + "\n"
    IO.popen(["mecab", "-F%M\t%f[6]\n", "-U%M\n", "-E", "EOS\n"], "r+") { |io|
      io.puts subs
      io.close_write
      io.read
    }.split("\n").each { |morpheme|
      if morpheme == "EOS"
        out << "\n"
      elsif morpheme =~ /(.+)\t(.+)/
        if english = edict[$2]
          out << " " + $1 + " " + english + " "
        else
          out << $1
        end
      else
        out << morpheme
      end
    }
    out << "\n"
  }
  IO.write(f, out.gsub(/^ | $/, "")) # gsub, not gsub!, which returns nil when nothing matches
}

The output looks like this:

778
01:03:09,196 --> 01:03:13,200
上巻 first volume 下巻 last volume じゃなくて
上中下だって。

Convert parts of kanjidic2.xml to TSV

require"nokogiri"

xml=IO.read("#{Dir.home}/japanese/data/kanjidic2.xml")

Nokogiri.XML(xml).css("character").each{|e|
  puts[
    e.css("literal").text,
    e.css("reading[r_type='ja_on']").map(&:text)*" ",
    e.css("reading[r_type='ja_kun']").map(&:text)*" "),
    e.css("nanori").map(&:text)*" ",
    e.css("meaning:not([m_lang])").map(&:text)*", ",
    e.css("grade").text,
    e.css("stroke_count").text,
    e.css("rad_value[rad_type='classical']").text
  ]*"\t"
}
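
Each output line contains the kanji, its on readings, kun readings, nanori readings, English meanings, school grade, stroke count, and classical radical number, separated by tabs.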

Generate an HTML file for reviewing kanji compounds where homophones are grouped together

edict = {}
IO.read("#{Dir.home}/japanese/data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each { |entry|
  keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
  next if edict[keb]
  reb = entry[/(?<=<reb>).*(?=<\/reb>)/]
  gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/].sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
  edict[keb] = [reb, gloss]
}

freq = {}
IO.foreach("#{Dir.home}/sites/jp/word-frequency.txt").grep(/^[^#]/)[10000..50000].each { |l|
  freq[l.split(";")[0]] = nil
}

rtk = Hash[IO.foreach("#{Dir.home}/sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l|
  l.split(";").values_at(0, 2)
}]

words = []
IO.read("#{Dir.home}/japanese/data/edict.txt").scan(/^([#{rtk.keys.join}]{2}) \[(.*?)\] \/(.*?)\//).each { |w|
  next unless freq.key?(w[0])
  w[2] = w[2].sub(/^(\(.*?\) )*/, "").sub(/ \([^)]*?\)$/, "").sub(/^to /, "")
  next unless w[2] =~ /^[a-z -]{1,20}$/
  words << w
}

kana = words.transpose[1].uniq
output = ""

kana.shuffle.each { |k|
  found = words.select { |w| w[1] == k }
  found = found.select { |w| found.transpose[2].count(w[2]) == 1 }.sample(4)
  next unless found.size >= 2
  output << "<div><span>#{k}</span>\n"
  found.each { |f|
    output << "<div><div>#{f[0]}</div><div>#{f[2]}</div></div>\n"
  }
  output << "</div>\n"
}

exit if output == ""

f = "#{Dir.home}/Sites/jp/printable-homophones.html"
IO.write(f, IO.read(f).sub(/<div.*\/div>\n/m, output))

Generate an HTML file for reviewing words that consist of two kanji

edict={}
IO.read("#{Dir.home}/japanese/data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each{|entry|
  keb=entry[/(?<=<keb>).*(?=<\/keb>)/]||next
  next if edict.key?(keb)
  reb=entry[/(?<=<reb>).*(?=<\/reb>)/]
  gloss=entry[/(?<=<gloss>).*(?=<\/gloss>)/].sub(/^\(.*?\) */,"").sub(/ \(.*?\)$/,"").sub(/^to /,"")
  edict[keb]=[reb,gloss]
}

rtk=IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map{|x|x[0]}.join
freq=IO.read("#{Dir.home}/Sites/jp/word-frequency.txt").scan(/^[^#;]+/)[15000..40000]
core=IO.read("#{Dir.home}/Sites/jp/core-6000.txt").scan(/^[^#;]+/)
output=""

(freq-core).shuffle.each{|word|
  next unless word=~/^[#{rtk}]{2}$/
  next unless edictword=edict[word]
  next unless edictword[1]=~/^[a-z -]{1,20}$/
  output<<"<div><div>#{edictword[0]}</div><div>#{word}</div><div>#{edictword[1]}</div></div>\n"
}

f="#{Dir.home}/Sites/jp/printable-two-kanji.html"
IO.write(f,IO.read(f).sub(/<div.*<\/div>\n/m,output))