Scripts

sentence-difficulty

This script uses MeCab to find the lemma of each word, keeps words of MeCab character type 2 (one or more kanji) or type 6 (hiragana), and sums the positions of those words on a word frequency list, so sentences with rarer words get higher scores.

mecab -F '%f[6]\t%t\t' -E '\n'|awk -F\\t 'NR==FNR{a[$0]=NR;next}{sum=0;for(i=1;i<NF-2;i+=2){if($(i+1)~/2|6/)sum+=a[$i]};print sum}' <(curl -s jptxt.net/word-frequency.txt|grep -v '^#'|cut -d\; -f1) -
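The scoring idea can be sketched in Ruby; the word list here is a made-up sample, not the real frequency file:

```ruby
# Difficulty score: sum each word's 1-based rank in a
# frequency-ordered word list; unknown words contribute nothing.
freq = %w[の に は を た]  # hypothetical sample, most frequent first
rank = {}
freq.each_with_index { |w, i| rank[w] = i + 1 }

words = %w[は の を]
score = words.sum { |w| rank.fetch(w, 0) }
# score == 8: は(3) + の(1) + を(4)
```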

edictwordlist

This script prints EDICT entries for each word from STDIN.

awk 'NR==FNR{a[$1]=a[$1]$0"\n";next}{printf"%s",a[$0]}' <(curl ringtail.its.monash.edu.au/pub/nihongo/edict.gz|gzip -d|iconv -f euc-jp -t utf-8) -

wwwjdic-tts-audio

This script creates an audio file for reviewing vocabulary items by concatenating the Japanese audio files used by WWWJDIC and English audio files generated by the say command that comes with OS X. wordlist.txt contains lines like 単語 vocabulary, and the audio files are named like 単語 たんご.mp3.

i=10000
cat wordlist.txt|awk -F/ 'NR==FNR{x=$NF;gsub(/ .*/,"",x);a[x]=$0;next}$1 in a{print$1"\t"$2"\t"a[$1]}' <(printf %s\\n ~/japanese/files/pod/*.mp3) -|while IFS=$'\t' read jp en mp3;do
  ffmpeg -v 0 -i "$mp3" -ar 22050 /tmp/$((i++)).aif</dev/null
  say "[[volm 0.6]]$en[[slnc 800]]" -v alex -o /tmp/$((i++)).aif
done
sox /tmp/*.aif /tmp/0.aif
ffmpeg -i /tmp/0.aif -c:a libfaac -q 150 -y output.m4a
rm /tmp/*.aif

rtk

This script converts RTK keywords to kanji.

printf %s\\n "${@-$(cat)}"|awk -F\; 'NR==FNR{a[$3]=$1;next}{print a[$0]}' <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#) -|paste -sd\\0 -

For example, rtk Sino- character prints 漢字.

katalatin

This script replaces Latin letters with katakana. You can use its output to practice reading katakana while reading text that is written in the Latin alphabet.

awk 'NR==FNR{a[$1]=$2;next}{for(i in a)gsub(a[i],i)}1' <(printf %s\\n カ ka ガ ga キ ki ギ gi ク ku グ gu ケ ke ゲ ge コ ko ゴ go サ sa ザ za シ shi ジ ji ス su ズ zu セ se ゼ ze ソ so ゾ zo タ ta ダ da チ chi ツ tsu テ te デ de ト to ド do ナ na ニ ni ヌ nu ネ ne ノ no ハ ha バ ba パ pa ヒ hi ビ bi ピ pi フ fu ブ bu プ pu ヘ he ベ be ペ pe ホ ho ボ bo ポ po マ ma ミ mi ム mu メ me モ mo ヤ ya ユ yu ヨ yo ラ ra リ ri ル ru レ re ロ ro ワ wa ヰ wi ヱ we ヲ wo|paste - -) -

hiralatin

This script replaces Latin letters with hiragana.

awk 'NR==FNR{a[$1]=$2;next}{for(i in a)gsub(a[i],i)}1' <(printf %s\\n か ka が ga き ki ぎ gi く ku ぐ gu け ke げ ge こ ko ご go さ sa ざ za し shi じ ji す su ず zu せ se ぜ ze そ so ぞ zo た ta だ da ち chi つ tsu づ du て te で de と to ど do な na に ni ぬ nu ね ne の no は ha ば ba ぱ pa ひ hi び bi ぴ pi ふ fu ぶ bu ぷ pu へ he べ be ぺ pe ほ ho ぼ bo ぽ po ま ma み mi む mu め me も mo や ya ゆ yu よ yo ら ra り ri る ru れ re ろ ro わ wa ゐ wi ゑ we を wo|paste - -) -

hira2kata

This script converts hiragana to katakana. It does not work with GNU tr, which does not support multibyte characters. Another option is to use ruby -pe'$_.tr!"\u{3040}-\u{309f}","\u{30a0}-\u{30ff}"'.

tr $'[\u3040-\u309f]' $'[\u30a0-\u30ff]'
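Wrapped as a method, the Ruby alternative mentioned above looks like this; the katakana block sits exactly 0x60 above the hiragana block, so a single codepoint-range translation covers everything:

```ruby
# Convert hiragana to katakana by shifting each codepoint from the
# hiragana block (U+3040-U+309F) to the katakana block (U+30A0-U+30FF).
def hira2kata(s)
  s.tr("\u3040-\u309f", "\u30a0-\u30ff")
end

hira2kata("こんにちは")  # "コンニチハ"
```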

half2full

This script converts ASCII printable characters to full-width characters. It does not work with GNU tr, which does not support multibyte characters. Another option is to use ruby -pe'$_.tr!" -~","\u{3000}\u{ff01}-\u{ff5e}"'.

tr ' -~' $'\u3000\uff01-\uff5e'
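The Ruby alternative works the same way: ASCII 0x21 through 0x7E maps onto U+FF01 through U+FF5E, and the space maps to the ideographic space U+3000:

```ruby
# Convert printable ASCII to full-width characters.
def half2full(s)
  s.tr(" -~", "\u3000\uff01-\uff5e")
end

half2full("ABC 123")  # "ＡＢＣ　１２３"
```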

edict

This script searches for entries in the edict file. You can download the edict file and convert it to UTF-8 by running curl -s ringtail.its.monash.edu.au/pub/nihongo/edict.gz|gzip -d|iconv -f euc-jp -t utf-8>~/edict.

[ $# -ne 0 ]&&grep -Eie "$*" ~/edict

rtk-english

This script replaces RTK keywords with kanji in English text.

sed "$(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#|head -2200|awk -F\; '$3{print$1$3}'|sed 's,[][/.*],\\&,g;s,\(.\)\(.*\),s/\\b\2\\b/\1/g,')"

The output looks like this:

"An 醜 bit 之 古 metal," says the 聖 男 to the shopkeeper; "but it
will 為 井 enough to 煮 my humble drop 之 水 之 an 夕. 吾'll
呉 you 三 厘 for it." This 彼 did and took the kettle 宅,
rejoicing; for it was 之 bronze, fine 働, the very 物 for the
Cha-no-yu.

rtktype

This script displays one kanji at a time, prompts you to type the RTK keyword of the kanji, and displays the correct answer for two seconds if the answer is wrong. After showing 50 kanji, it exits and shows the correct keywords for all incorrect answers.

#!/usr/bin/env bash

cd "${0%/*}"
LC_ALL=en_US.UTF-8
trap onexit EXIT
clear

onexit(){
  echo
  clear
  echo "$log"|awk '$2==0{print$3,$4}'|paste -sd' ' -
  awk 'BEGIN{printf"%s%.1f","Average time per kanji: ",(systime()-'"$d)/$(wc -l<<<"$log")}"
}

IFS=$';\n' read -d '' -a keywords< <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#|head -n2200|cut -d\; -f1,3)

n=${1-50}
d=$(date +%s)

for ((i=1;i<=n;i++));do
  framenumber=$(($RANDOM$RANDOM%(${#keywords[@]}/2)))
  kanji=${keywords[framenumber*2]}
  keyword=${keywords[framenumber*2+1]}
  pad=$(printf %$(($(tput cols)/2-7))s)
  read -ep"$pad$kanji " -n${#keyword} answer
  if [[ $answer = $keyword ]];then
    status=1
  else
    clear
    echo "$pad$kanji $keyword"
    sleep 2
    clear
    status=0
    read -d '' -t0.001 -n99999 # clear the typeahead buffer
    printf '\e[2K\r'
  fi
  logline="$(date +%s) $status $kanji $keyword"
  echo "$logline">>rtktypelog
  log+=$logline$'\n'
done

furigana.rb

This script adds furigana (as HTML ruby markup) to a Japanese word written with kanji, given the same word written in hiragana or katakana.

def kata2hira(x)
  x.gsub(/[\u{30a1}-\u{30fa}]/) { [$&.ord - 96].pack("U") }
end

def furigana(word, reading)
  hira = kata2hira(reading)
  if word == reading or reading == "" or word =~ /^[\u{3040}-\u{30ff}\u{ff00}-\u{ffef}]+$/
    word
  elsif word == hira
    word
  elsif word =~ /^[\u{4e00}-\u{9fff}]+$/
    "<ruby><rb>#{word}</rb><rt>#{hira}</rt></ruby>"
  else
    groups = word.scan(/(?:[^\u{3040}-\u{30ff}]+|[\u{3040}-\u{30ff}]+)/)
    regex = "^" + groups.map { |g|
      if g =~ /^[\u{3040}-\u{30ff}]+$/
        "(#{Regexp.escape(kata2hira(g))}|#{Regexp.escape(g)})"
      else
        "(.+?)"
      end
    }.join + "$"
    kanagroups = hira.scan(Regexp.new(regex))[0]
    return "<ruby><rb>#{word}</rb><rt>#{hira}</rt></ruby>" unless kanagroups
    0.upto(groups.length - 1) { |i|
      unless groups[i] =~ /[\u{3040}-\u{30ff}]/
        groups[i] = "<ruby><rb>#{groups[i]}</rb><rt>#{kanagroups[i]}</rt></ruby>"
      end
    }
    groups.join
  end
end

if __FILE__ == $0
  "次々 つぎつぎ
ユニークな ユニークな
痛い いたい
困難な こんなんな
言い訳 いいわけ
ごろごろ
カット かっと
くっ付ける くっつける
ジェット機 じぇっとき
湿っぽい しめっぽい
東京ドーム とうきょうドーム
3月 さんがつ
一ヶ月 いっかげつ
X線 エックスせん
八ッ橋 やつはし
4ヵ年 よんかねん
ィ形容詞 イけいようし
黄色い きいろい
物の怪 もののけ
鬼に金棒 おににかなぼう
千円貸してください せんえんかしてください".split("\n").each { |line|
    puts furigana(*line.split(" ", 2))
  }
end

mecab-furigana.rb

This script uses MeCab to add furigana to Japanese text. Anki's Japanese support plugin also uses MeCab to generate furigana (see https://github.com/dae/ankiplugins/blob/master/japanese/reading.py).

require "./furigana" # this is the furigana.rb script above

def mecab_furigana(line)
  IO.popen(["mecab", "-F%m\\t%f[7]\\n", "-U%m\\t\\n", "-E", ""], "r+") { |io|
    io.puts line
    io.close_write
    io.read
  }.lines.map { |l| furigana(*l.chomp.split("\t", 2)) }.join
end

"「IT」は何の略か知っていますか。
この綱は直径20cmあるそうです。
妹は来年、二十歳になります。
今日の新聞、どこに置いた?
3月は仕事が忙しい。
彼は数学の博士だそうです。
彼女はOLです。
工事は3月まで続きます。
定価から2000円割り引きますよ。
私の国について少しお話しましょう。
東京ドーム
10ヶ国
12ヶ月
どうしよ~。
X線
No.2
命の親
〆切".split.each { |line|
  puts mecab_furigana(line)
}

rtk-compounds.rb

This script finds two-kanji compounds whose first translation in EDICT is the same as the RTK keywords of the two kanji joined by spaces.

rtk = Hash[IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l|
  l.split(";").values_at(0, 2)
}]
kanji = rtk.keys.join
IO.foreach("#{Dir.home}/japanese/data/edict") { |l|
  c = l.scan(/^([#{kanji}]{2}) \[(.*?)\] \/(.*?)\//)[0]
  next unless c
  c[2] = c[2].sub(/^(\([^)]*\) )*/, "").sub(/ \([^)]*\)$/, "").sub(/^to /, "")
  next unless rtk[c[0][0]] + " " + rtk[c[0][1]] == c[2]
  puts c[1].rjust(5, "\u{3000}") + " " + c[0] + " " + c[2]
}

The output consists of lines like this:

  とくぎ 特技 special skill
  のうは 脳波 brain waves
 さんそう 山荘 mountain villa
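The matching step relies on Ruby string indexing to pull out the individual kanji of the compound; with a hypothetical two-entry keyword table it reduces to:

```ruby
# Hypothetical keyword table (kanji => RTK keyword), matching the
# 特技 example from the output above.
rtk = { "特" => "special", "技" => "skill" }

word = "特技"
# word[0] and word[1] are the individual kanji characters.
candidate = rtk[word[0]] + " " + rtk[word[1]]
candidate  # "special skill" -- equal to the EDICT gloss, so 特技 is kept
```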

episodes.rb

This script generates an HTML file for reviewing uncommon words in subtitle files or other Japanese text.

Dir.chdir(__dir__)

require "./furigana"

edict = {}
IO.read("../data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each { |entry|
  keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
  next if edict.key?(keb)
  reb = entry[/(?<=<reb>).*(?=<\/reb>)/]
  gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/]
  gloss = gloss.sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
  edict[keb] = [reb, gloss]
}

freq = Hash[IO.read("#{Dir.home}/Sites/jp/word-frequency.txt").scan(/^[^#;]+/)[20000..200000].map { |w| [w, nil] }]
rtk = IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l| l[0] }.join
words = `for f in ~/desktop/*.srt;do mecab -F '%t %f[6]\\n' "$f";done|awk '$1=="2"{print$2}'`
output = ""

words.split.uniq.shuffle.each { |word|
  next unless freq.key?(word)
  next unless word =~ /^[\u{3040}-\u{309f}#{rtk}]{2,}$/
  reb, gloss = edict[word] || next
  next unless gloss =~ /^[a-z -]{1,18}$/
  output << "<div onclick=\"highlight(this)\"><div>#{furigana(word, reb)}</div><div>#{gloss}</div></div>\n"
}

exit if output == ""
f = "../review/episodes.html"
IO.write(f, IO.read(f).sub(/<div.*<\/div>/m, output))
system("open", f)

edict-subs.rb

This script modifies Japanese SRT subtitles to add translations after uncommon words.

freq = {}
IO.readlines("#{Dir.home}/Sites/jp/word-frequency.txt").grep(/^[^#]/).each { |x|
  freq[x.split(";")[0]] = nil
}

edict = {}
IO.read("../data/JMdict_e.xml").scan(/<entry>.*?<\/entry>/m).each { |entry|
  keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
  next if edict[keb]
  next unless freq.key?(keb)
  gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/]
  gloss = gloss.sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
  next if gloss.length > 20
  edict[keb] = gloss
}

Dir["#{Dir.home}/Desktop/*.srt"].each { |f|
  out = ""
  IO.read(f).gsub("\r", "").split("\n\n").each { |s|
    id, time, subs = s.split("\n", 3)
    out << id + "\n" + time + "\n"
    IO.popen(["mecab", "-F%M\t%f[6]\n", "-U%M\n", "-E", "EOS\n"], "r+") { |io|
      io.puts subs
      io.close_write
      io.read
    }.split("\n").each { |morpheme|
      if morpheme == "EOS"
        out << "\n"
      elsif morpheme =~ /(.+)\t(.+)/
        if english = edict[$2]
          out << " " + $1 + " " + english + " "
        else
          out << $1
        end
      else
        out << morpheme
      end
    }
    out << "\n"
  }
  IO.write(f, out.gsub(/^ | $/, ""))
}

The output looks like this:

778
01:03:09,196 --> 01:03:13,200
上巻 first volume 下巻 last volume じゃなくて
上中下だって。

kanjidic-tsv.rb

This script converts parts of kanjidic2.xml to TSV.

require"nokogiri"

xml=IO.read("#{Dir.home}/japanese/data/kanjidic2.xml")

Nokogiri.XML(xml).css("character").each{|e|
  puts[
    e.css("literal").text,
    e.css("reading[r_type='ja_on']").map(&:text)*" ",
    e.css("reading[r_type='ja_kun']").map(&:text)*" ",
    e.css("nanori").map(&:text)*" ",
    e.css("meaning:not([m_lang])").map(&:text)*", ",
    e.css("grade").text,
    e.css("stroke_count").text,
    e.css("rad_value[rad_type='classical']").text
  ]*"\t"
}

printable-homophones.rb

This script generates an HTML file for reviewing kanji compounds where homophones are grouped together.

edict = {}
IO.read("#{Dir.home}/japanese/data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each { |entry|
  keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
  next if edict[keb]
  reb = entry[/(?<=<reb>).*(?=<\/reb>)/]
  gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/].sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
  edict[keb] = [reb, gloss]
}

freq = {}
IO.foreach("#{Dir.home}/sites/jp/word-frequency.txt").grep(/^[^#]/)[10000..50000].each { |l|
  freq[l.split(";")[0]] = nil
}

rtk = Hash[IO.foreach("#{Dir.home}/sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l|
  l.split(";").values_at(0, 2)
}]

words = []
IO.read("#{Dir.home}/japanese/data/edict.txt").scan(/^([#{rtk.keys.join}]{2}) \[(.*?)\] \/(.*?)\//).each { |w|
  next unless freq.key?(w[0])
  w[2] = w[2].sub(/^(\(.*?\) )*/, "").sub(/ \([^)]*?\)$/, "").sub(/^to /, "")
  next unless w[2] =~ /^[a-z -]{1,20}$/
  words << w
}

kana = words.transpose[1].uniq
output = ""

kana.shuffle.each { |k|
  found = words.select { |w| w[1] == k }
  found = found.select { |w| found.transpose[2].count(w[2]) == 1 }.sample(4)
  next unless found.size >= 2
  output << "<div><span>#{k}</span>\n"
  found.each { |f|
    output << "<div><div>#{f[0]}</div><div>#{f[2]}</div></div>\n"
  }
  output << "</div>\n"
}

exit if output == ""

f = "#{Dir.home}/Sites/jp/printable-homophones.html"
IO.write(f, IO.read(f).sub(/<div.*\/div>\n/m, output))

printable-two-kanji.rb

This script generates an HTML file for reviewing kanji compounds that consist of two kanji.

edict={}
IO.read("#{Dir.home}/japanese/data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each{|entry|
  keb=entry[/(?<=<keb>).*(?=<\/keb>)/]||next
  next if edict.key?(keb)
  reb=entry[/(?<=<reb>).*(?=<\/reb>)/]
  gloss=entry[/(?<=<gloss>).*(?=<\/gloss>)/].sub(/^\(.*?\) */,"").sub(/ \(.*?\)$/,"").sub(/^to /,"")
  edict[keb]=[reb,gloss]
}

rtk=IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map{|x|x[0]}.join
freq=IO.read("#{Dir.home}/Sites/jp/word-frequency.txt").scan(/^[^#;]+/)[15000..40000]
core=IO.read("#{Dir.home}/Sites/jp/core-6000.txt").scan(/^[^#;]+/)
output=""

(freq-core).shuffle.each{|word|
  next unless word=~/^[#{rtk}]{2}$/
  next unless edictword=edict[word]
  next unless edictword[1]=~/^[a-z -]{1,20}$/
  output<<"<div><div>#{edictword[0]}</div><div>#{word}</div><div>#{edictword[1]}</div></div>\n"
}

f="#{Dir.home}/Sites/jp/printable-two-kanji.html"
IO.write(f,IO.read(f).sub(/<div.*<\/div>\n/m,output))