This script uses MeCab to find the lemma of each word, keeps words whose character type is 2 (one or more kanji) or 6 (hiragana), and sums their positions on a word frequency list.
mecab -F '%f[6]\t%t\t' -E '\n'|awk -F\\t 'NR==FNR{a[$0]=NR;next}{sum=0;for(i=1;i<NF;i+=2){if($(i+1)~/2|6/)sum+=a[$i]};print sum}' <(curl -s jptxt.net/word-frequency.txt|grep -v '^#'|cut -d\; -f1) -
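The rank-sum idea can be sketched in Ruby; the frequency list and input words here are made up, not taken from the real data:

```ruby
# Sketch of the rank-sum scoring above: each word contributes its position
# in the frequency list, unknown words contribute 0, so a lower total
# means more common vocabulary. The word list below is invented.
freqlist = %w[する の に は を 勉強]
rank = {}
freqlist.each_with_index { |w, i| rank[w] = i + 1 }
words = %w[勉強 を する]
score = words.sum { |w| rank.fetch(w, 0) }
puts score  # => 12 (6 + 5 + 1)
```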
This script prints EDICT entries for each word from STDIN.
awk 'NR==FNR{a[$1]=a[$1]$0"\n";next}{printf"%s",a[$0]}' <(curl ringtail.its.monash.edu.au/pub/nihongo/edict.gz|gzip -d|iconv -f euc-jp -t utf-8) -
This script creates an audio file for reviewing vocabulary items by concatenating the Japanese audio files used by WWWJDIC and English audio files generated by the say command that comes with OS X. wordlist.txt contains lines like 単語 vocabulary, and the audio files are named like 単語 たんご.mp3.
i=10000
cat wordlist.txt|awk 'NR==FNR{n=split($0,p,"/");x=p[n];gsub(/ .*/,"",x);a[x]=$0;next}$1 in a{t=$0;sub(/^[^ ]+ +/,"",t);print$1"\t"t"\t"a[$1]}' <(printf %s\\n ~/japanese/files/pod/*.mp3) -|while IFS=$'\t' read jp en mp3;do
ffmpeg -v 0 -i "$mp3" -ar 22050 /tmp/$((i++)).aif</dev/null
say "[[volm 0.6]]$en[[slnc 800]]" -v alex -o /tmp/$((i++)).aif
done
sox /tmp/*.aif /tmp/0.aif
ffmpeg -i /tmp/0.aif -c:a libfaac -q 150 -y output.m4a
rm /tmp/*.aif
This script converts RTK keywords to kanji.
printf %s\\n "${@-$(cat)}"|awk -F\; 'NR==FNR{a[$3]=$1;next}{print a[$0]}' <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#) -|paste -sd\\0 -
For example, if the script is saved as rtk, running rtk Sino- character prints 漢字.
This script replaces Latin letters with katakana. You can use its output to practice reading katakana while reading text that is written in the Latin alphabet.
awk 'NR==FNR{a[$1]=$2;next}{for(i in a)gsub(a[i],i)}1' <(printf %s\\n カ ka ガ ga キ ki ギ gi ク ku グ gu ケ ke ゲ ge コ ko ゴ go サ sa ザ za シ shi ジ ji ス su ズ zu セ se ゼ ze ソ so ゾ zo タ ta ダ da チ chi ツ tsu テ te デ de ト to ド do ナ na ニ ni ヌ nu ネ ne ノ no ハ ha バ ba パ pa ヒ hi ビ bi ピ pi フ fu ブ bu プ pu ヘ he ベ be ペ pe ホ ho ボ bo ポ po マ ma ミ mi ム mu メ me モ mo ヤ ya ユ yu ヨ yo ラ ra リ ri ル ru レ re ロ ro ワ wa ヰ wi ヱ we ヲ wo|paste - -) -
This script replaces Latin letters with hiragana.
awk 'NR==FNR{a[$1]=$2;next}{for(i in a)gsub(a[i],i)}1' <(printf %s\\n か ka が ga き ki ぎ gi く ku ぐ gu け ke げ ge こ ko ご go さ sa ざ za し shi じ ji す su ず zu せ se ぜ ze そ so ぞ zo た ta だ da ち chi つ tsu づ du て te で de と to ど do な na に ni ぬ nu ね ne の no は ha ば ba ぱ pa ひ hi び bi ぴ pi ふ fu ぶ bu ぷ pu へ he べ be ぺ pe ほ ho ぼ bo ぽ po ま ma み mi む mu め me も mo や ya ゆ yu よ yo ら ra り ri る ru れ re ろ ro わ wa ゐ wi ゑ we を wo|paste - -) -
This script converts hiragana to katakana. It does not work with GNU tr, which does not support multibyte characters. Another option is to use ruby -pe'$_.tr!"\u{3040}-\u{309f}","\u{30a0}-\u{30ff}"'.
tr $'[\u3040-\u309f]' $'[\u30a0-\u30ff]'
This script converts ASCII printable characters to full-width characters. It does not work with GNU tr, which does not support multibyte characters. Another option is to use ruby -pe'$_.tr!" -~","\u{3000}\u{ff01}-\u{ff5e}"'.
tr ' -~' $'\u3000\uff01-\uff5e'
This script searches for entries in the edict file. You can download the edict file and convert it to UTF-8 by running curl -s ringtail.its.monash.edu.au/pub/nihongo/edict.gz|gzip -d|iconv -f euc-jp -t utf-8>~/edict.
[ $# -ne 0 ]&&grep -Eie "$*" ~/edict
This script replaces RTK keywords with kanji in English text.
sed "$(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#|head -2200|awk -F\; '$3{print$1$3}'|sed 's,[][/.*],\\&,g;s,\(.\)\(.*\),s/\\b\2\\b/\1/g,')"
The output looks like this:
"An 醜 bit 之 古 metal," says the 聖 男 to the shopkeeper; "but it
will 為 井 enough to 煮 my humble drop 之 水 之 an 夕. 吾'll
呉 you 三 厘 for it." This 彼 did and took the kettle 宅,
rejoicing; for it was 之 bronze, fine 働, the very 物 for the
Cha-no-yu.
This script displays one kanji at a time, prompts you to type the RTK keyword of the kanji, and displays the correct answer for two seconds if the answer is wrong. After showing 50 kanji, it exits and shows the correct keywords for all incorrect answers.
#!/usr/bin/env bash
cd "${0%/*}"
LC_ALL=en_US.UTF-8
trap onexit EXIT
clear
onexit(){
echo
clear
echo "$log"|awk '$2==0{print$3,$4}'|paste -sd' ' -
awk 'BEGIN{printf"%s%.1f\n","Average time per kanji: ",(systime()-'"$d)/$(wc -l<<<"$log")}"
}
IFS=$';\n' read -d '' -a keywords< <(curl -s jptxt.net/rtk-keywords.txt|grep -v ^\#|head -n2200|cut -d\; -f1,3)
n=${1-50}
d=$(date +%s)
for ((i=1;i<=n;i++));do
framenumber=$(($RANDOM$RANDOM%(${#keywords[@]}/2)))
kanji=${keywords[framenumber*2]}
keyword=${keywords[framenumber*2+1]}
pad=$(printf %$(($(tput cols)/2-7))s)
read -ep"$pad$kanji " -n${#keyword} answer
if [[ $answer = "$keyword" ]];then
status=1
else
clear
echo "$pad$kanji $keyword"
sleep 2
clear
status=0
read -d '' -t0.001 -n99999 # clear the typeahead buffer
printf '\e[2K\r'
fi
logline="$(date +%s) $status $kanji $keyword"
echo "$logline">>rtktypelog
log+=$logline$'\n'
done
This script adds furigana to Japanese text written with kanji, given the same text written in hiragana or katakana.
def kata2hira(x)
x.gsub(/[\u{30a1}-\u{30fa}]/) { [$&.ord - 96].pack("U") }
end
def furigana(word, reading = "")
hira = kata2hira(reading)
if word == reading or reading == "" or word =~ /^[\u{3040}-\u{30ff}\u{ff00}-\u{ffef}]+$/
word
elsif word == hira
word
elsif word =~ /^[\u{4e00}-\u{9fff}]+$/
"<ruby><rb>#{word}</rb><rt>#{hira}</rt></ruby>"
else
groups = word.scan(/(?:[^\u{3040}-\u{30ff}]+|[\u{3040}-\u{30ff}]+)/)
regex = "^" + groups.map { |g|
if g =~ /^[\u{3040}-\u{30ff}]+$/
"(#{Regexp.escape(kata2hira(g))}|#{Regexp.escape(g)})"
else
"(.+?)"
end
}.join + "$"
kanagroups = hira.scan(Regexp.new(regex))[0]
return "<ruby><rb>#{word}</rb><rt>#{hira}</rt></ruby>" unless kanagroups
0.upto(groups.length - 1) { |i|
unless groups[i] =~ /[\u{3040}-\u{30ff}]/
groups[i] = "<ruby><rb>#{groups[i]}</rb><rt>#{kanagroups[i]}</rt></ruby>"
end
}
groups.join
end
end
if __FILE__ == $0
"次々 つぎつぎ
ユニークな ユニークな
痛い いたい
困難な こんなんな
言い訳 いいわけ
ごろごろ
カット かっと
くっ付ける くっつける
ジェット機 じぇっとき
湿っぽい しめっぽい
東京ドーム とうきょうドーム
3月 さんげつ
一ヶ月 いっかげつ
X線 エックスせん
八ッ橋 やつはし
4ヵ年 よんかねん
ィ形容詞 イけいようし
黄色い きいろい
物の怪 もののけ
鬼に金棒 おににかなぼう
千円貸してください せんえんかしてください".split("\n").each { |line|
puts furigana(*line.split(" ", 2))
}
end
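The group-matching branch can be illustrated standalone; this sketch repeats the regex construction for 言い訳/いいわけ, omitting the katakana alternative that the full function also accepts in kana groups:

```ruby
# The word is split into alternating kana and non-kana runs; kana runs
# must match literally in the reading, and each non-kana run becomes a
# lazy wildcard that captures its portion of the reading.
word = "言い訳"
groups = word.scan(/(?:[^\u{3040}-\u{30ff}]+|[\u{3040}-\u{30ff}]+)/)
# groups == ["言", "い", "訳"]
regex = "^" + groups.map { |g|
  g =~ /^[\u{3040}-\u{30ff}]+$/ ? "(#{Regexp.escape(g)})" : "(.+?)"
}.join + "$"
kanagroups = "いいわけ".scan(Regexp.new(regex))[0]
p kanagroups  # => ["い", "い", "わけ"]
```

The captured readings い and わけ are what end up inside the <rt> tags for 言 and 訳.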
This script uses MeCab to add furigana to Japanese text. Anki's Japanese support plugin also uses MeCab to generate furigana (see https://github.com/dae/ankiplugins/blob/master/japanese/reading.py).
require "./furigana" # this is the furigana.rb script above
def mecab_furigana(line)
IO.popen(["mecab", "-F%m\\t%f[7]\\n", "-U%m\\t\\n", "-E", ""], "r+") { |io|
io.puts line
io.close_write
io.read
}.lines.map { |l| furigana(*l.chomp.split("\t", 2)) }.join
end
"「IT」は何の略か知っていますか。
この綱は直径20cmあるそうです。
妹は来年、二十歳になります。
今日の新聞、どこに置いた?
3月は仕事が忙しい。
彼は数学の博士だそうです。
彼女はOLです。
工事は3月まで続きます。
定価から2000円割り引きますよ。
私の国について少しお話しましょう。
東京ドーム
10ヶ国
12ヶ月
どうしよ~。
X線
No.2
命の親
〆切".split.each { |line|
puts mecab_furigana(line)
}
This script finds kanji compounds whose first translation in EDICT is the same as the RTK keywords of the kanji joined by spaces.
rtk = Hash[IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l|
l.split(";").values_at(0, 2)
}]
kanji = rtk.keys.join
IO.foreach("#{Dir.home}/japanese/data/edict") { |l|
c = l.scan(/^([#{kanji}]{2}) \[(.*?)\] \/(.*?)\//)[0]
next unless c
c[2] = c[2].sub(/^(\([^)]*\) )*/, "").sub(/ \([^)]*\)$/, "").sub(/^to /, "")
next unless rtk[c[0][0]] + " " + rtk[c[0][1]] == c[2]
puts c[1].rjust(5, "\u{3000}") + " " + c[0] + " " + c[2]
}
The output consists of lines like this:
とくぎ 特技 special skill
のうは 脳波 brain waves
さんそう 山荘 mountain villa
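The regex in the script relies on the EDICT line format of headword, bracketed reading, and slash-delimited glosses; here it is applied to a single sample line (illustrative, not necessarily verbatim from the file):

```ruby
# EDICT lines look like "headword [reading] /gloss/gloss/.../";
# the lazy groups pick out the headword, reading, and first gloss.
line = "特技 [とくぎ] /special skill/(P)/"
m = line.match(/^(\S+) \[(.*?)\] \/(.*?)\//)
p m.captures  # => ["特技", "とくぎ", "special skill"]
```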
This script generates an HTML file for reviewing uncommon words in subtitle files or other Japanese text.
Dir.chdir(__dir__)
require "./furigana"
edict = {}
IO.read("../data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each { |entry|
keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
next if edict.key?(keb)
reb = entry[/(?<=<reb>).*(?=<\/reb>)/]
gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/]
gloss = gloss.sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
edict[keb] = [reb, gloss]
}
freq = Hash[IO.readlines("#{Dir.home}/Sites/jp/word-frequency.txt").grep(/^[^#]/)[20000..200000].map { |l| [l.split(";")[0], nil] }]
rtk = IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l| l[0] }.join
words = `for f in ~/desktop/*.srt;do mecab -F '%t %f[6]\\n' "$f";done|awk '$1=="2"{print$2}'`
output = ""
words.split.uniq.shuffle.each { |word|
next unless freq.key?(word)
next unless word =~ /^[\u{3040}-\u{309f}#{rtk}]{2,}$/
reb, gloss = edict[word] || next
next unless gloss =~ /^[a-z -]{1,18}$/
output << "<div onclick=\"highlight(this)\"><div>#{furigana(word, reb)}</div><div>#{gloss}</div></div>\n"
}
exit if output == ""
f = "../review/episodes.html"
IO.write(f, IO.read(f).sub(/<div.*<\/div>/m, output))
system("open", f)
This script modifies Japanese SRT subtitles to add translations after uncommon words.
freq = {}
IO.readlines("#{Dir.home}/Sites/jp/word-frequency.txt").grep(/^[^#]/).each { |x|
freq[x.split(";")[0]] = nil
}
edict = {}
IO.read("../data/JMdict_e.xml").scan(/<entry>.*?<\/entry>/m).each { |entry|
keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
next if edict[keb]
next unless freq.key?(keb)
gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/]
gloss = gloss.sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
next if gloss.length > 20
edict[keb] = gloss
}
Dir["#{Dir.home}/Desktop/*.srt"].each { |f|
out = ""
IO.read(f).gsub("\r", "").split("\n\n").each { |s|
id, time, subs = s.split("\n", 3)
out << id + "\n" + time + "\n"
IO.popen(["mecab", "-F%M\t%f[6]\n", "-U%M\n", "-E", "EOS\n"], "r+") { |io|
io.puts subs
io.close_write
io.read
}.split("\n").each { |morpheme|
if morpheme == "EOS"
out << "\n"
elsif morpheme =~ /(.+)\t(.+)/
if english = edict[$2]
out << " " + $1 + " " + english + " "
else
out << $1
end
else
out << morpheme
end
}
out << "\n"
}
IO.write(f, out.gsub(/^ | $/, ""))
}
The output looks like this:
778
01:03:09,196 --> 01:03:13,200
上巻 first volume 下巻 last volume じゃなくて
上中下だって。
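The split at the top of the per-file loop relies on the SRT cue layout: an index line, a timestamp line, then one or more subtitle lines, with cues separated by blank lines. A minimal sketch:

```ruby
# Within one cue, the first two newlines separate the index, the
# timestamps, and the (possibly multi-line) subtitle text.
cue = "778\n01:03:09,196 --> 01:03:13,200\n上巻 下巻じゃなくて\n上中下だって。"
id, time, subs = cue.split("\n", 3)
p id    # => "778"
p time  # => "01:03:09,196 --> 01:03:13,200"
p subs  # => "上巻 下巻じゃなくて\n上中下だって。"
```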
This script converts parts of kanjidic2.xml to TSV.
require"nokogiri"
xml=IO.read("#{Dir.home}/japanese/data/kanjidic2.xml")
Nokogiri.XML(xml).css("character").each{|e|
puts[
e.css("literal").text,
e.css("reading[r_type='ja_on']").map(&:text)*" ",
e.css("reading[r_type='ja_kun']").map(&:text)*" "),
e.css("nanori").map(&:text)*" ",
e.css("meaning:not([m_lang])").map(&:text)*", ",
e.css("grade").text,
e.css("stroke_count").text,
e.css("rad_value[rad_type='classical']").text
]*"\t"
}
This script generates an HTML file for reviewing kanji compounds where homophones are grouped together.
edict = {}
IO.read("#{Dir.home}/japanese/data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each { |entry|
keb = entry[/(?<=<keb>).*(?=<\/keb>)/] || next
next if edict[keb]
reb = entry[/(?<=<reb>).*(?=<\/reb>)/]
gloss = entry[/(?<=<gloss>).*(?=<\/gloss>)/].sub(/^\(.*?\) */, "").sub(/ \(.*?\)$/, "").sub(/^to /, "")
edict[keb] = [reb, gloss]
}
freq = {}
IO.foreach("#{Dir.home}/sites/jp/word-frequency.txt").grep(/^[^#]/)[10000..50000].each { |l|
freq[l.split(";")[0]] = nil
}
rtk = Hash[IO.foreach("#{Dir.home}/sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map { |l|
l.split(";").values_at(0, 2)
}]
words = []
IO.read("#{Dir.home}/japanese/data/edict.txt").scan(/^([#{rtk.keys.join}]{2}) \[(.*?)\] \/(.*?)\//).each { |w|
next unless freq.key?(w[0])
w[2] = w[2].sub(/^(\(.*?\) )*/, "").sub(/ \([^)]*?\)$/, "").sub(/^to /, "")
next unless w[2] =~ /^[a-z -]{1,20}$/
words << w
}
kana = words.transpose[1].uniq
output = ""
kana.shuffle.each { |k|
found = words.select { |w| w[1] == k }
found = found.select { |w| found.transpose[2].count(w[2]) == 1 }.sample(4)
next unless found.size >= 2
output << "<div><span>#{k}</span>\n"
found.each { |f|
output << "<div><div>#{f[0]}</div><div>#{f[2]}</div></div>\n"
}
output << "</div>\n"
}
exit if output == ""
f = "#{Dir.home}/Sites/jp/printable-homophones.html"
IO.write(f, IO.read(f).sub(/<div.*\/div>\n/m, output))
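The homophone selection above can be sketched as a group-by on the reading; the entries here are toy data, not the real EDICT contents:

```ruby
# Bucket [kanji, reading, gloss] triples by reading and keep readings
# shared by at least two entries with distinct glosses.
words = [["特技", "とくぎ", "special skill"],
         ["工事", "こうじ", "construction work"],
         ["公示", "こうじ", "public announcement"]]
homophones = words.group_by { |w| w[1] }.select { |_, ws|
  ws.map { |w| w[2] }.uniq.size >= 2
}
p homophones.keys  # => ["こうじ"]
```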
This script generates an HTML file for reviewing kanji compounds that consist of two kanji.
edict={}
IO.read("#{Dir.home}/japanese/data/JMdict_e").scan(/<entry>.*?<\/entry>/m).each{|entry|
keb=entry[/(?<=<keb>).*(?=<\/keb>)/]||next
next if edict.key?(keb)
reb=entry[/(?<=<reb>).*(?=<\/reb>)/]
gloss=entry[/(?<=<gloss>).*(?=<\/gloss>)/].sub(/^\(.*?\) */,"").sub(/ \(.*?\)$/,"").sub(/^to /,"")
edict[keb]=[reb,gloss]
}
rtk=IO.readlines("#{Dir.home}/Sites/jp/rtk-keywords.txt").grep(/^[^#]/).take(2200).map{|x|x[0]}.join
freq=IO.read("#{Dir.home}/Sites/jp/word-frequency.txt").scan(/^[^#;]+/)[15000..40000]
core=IO.read("#{Dir.home}/Sites/jp/core-6000.txt").scan(/^[^#;]+/)
output=""
(freq-core).shuffle.each{|word|
next unless word=~/^[#{rtk}]{2}$/
next unless edictword=edict[word]
next unless edictword[1]=~/^[a-z -]{1,20}$/
output<<"<div><div>#{edictword[0]}</div><div>#{word}</div><div>#{edictword[1]}</div></div>\n"
}
f="#{Dir.home}/Sites/jp/printable-two-kanji.html"
IO.write(f,IO.read(f).sub(/<div.*<\/div>\n/m,output))