tsuduku inserts つづく and
3000-303F is CJK Symbols and Punctuation.
4E00-9FFF is CJK Unified Ideographs.
FF00-FFEF is Halfwidth and Fullwidth Forms.
All 6355 kanji in KANJIDIC and 12559 of the 13108 kanji in KANJIDIC2 are in the CJK Unified Ideographs block. In a version of JMdict from 2013, the headwords and readings do not contain any hiragana or katakana characters outside the Hiragana and Katakana blocks, but they do contain one kanji (𩸽, U+29E3D) outside the CJK Unified Ideographs block.
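Checks like the one above can be done by comparing code points against the block ranges. This is a small Python sketch; the block table and the helper name are my own, not from any library:

```python
# Sketch: report which of the blocks above a character belongs to.
BLOCKS = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
]

def block_of(ch):
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"

print(block_of("日"))   # CJK Unified Ideographs
print(block_of("𩸽"))   # other: U+29E3D is in CJK Unified Ideographs Extension B
```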
You can convert between hiragana and katakana by shifting code points by 0x60 (96):
$ echo ひらがな|tr $'[\u3040-\u309f]' $'[\u30a0-\u30ff]'
ヒラガナ
$ echo カタカナ|tr $'[\u30a0-\u30ff]' $'[\u3040-\u309f]'
かたかな
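The same shift can be done in Python. A minimal sketch, assuming the standard ranges (hiragana ぁ U+3041 through ゖ U+3096, katakana ァ U+30A1 through ヶ U+30F6), with everything outside those ranges passed through unchanged:

```python
# Katakana code points sit exactly 0x60 above the corresponding hiragana.
def hira_to_kata(text):
    return "".join(
        chr(ord(c) + 0x60) if 0x3041 <= ord(c) <= 0x3096 else c
        for c in text
    )

def kata_to_hira(text):
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c
        for c in text
    )

print(hira_to_kata("ひらがな"))  # ヒラガナ
print(kata_to_hira("カタカナ"))  # かたかな
```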
The block for full width characters has versions of all printable ASCII characters except space in ASCII order:
$ echo example|tr ' -~' $'\u3000\uff01-\uff5e'
ｅｘａｍｐｌｅ
$ echo ｅｘａｍｐｌｅ|tr $'\u3000\uff01-\uff5e' ' -~'
example
U+3000 is ideographic space.
The commands above do not work with GNU tr, which does not support Unicode.
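Since GNU tr cannot do this, str.translate in Python is one portable alternative. A sketch built from the ranges above: space maps to ideographic space U+3000, and the remaining printable ASCII characters are offset by 0xFEE0 (0x21 + 0xFEE0 = 0xFF01, and so on up to 0x7E):

```python
# Translation tables between printable ASCII and the full-width forms.
TO_FULLWIDTH = {0x20: 0x3000, **{c: c + 0xFEE0 for c in range(0x21, 0x7F)}}
TO_HALFWIDTH = {v: k for k, v in TO_FULLWIDTH.items()}

print("example".translate(TO_FULLWIDTH))  # ｅｘａｍｐｌｅ
print("ｅｘａｍｐｌｅ".translate(TO_HALFWIDTH))  # example
```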
The Japanese Sensei iPhone app uses the same or similar example sentences as the Japanese Core series published by iKnow / Smart.fm. The App Store description of Japanese Sensei says that the application uses data from the CJK Dictionary Institute, but I am not sure if that refers to the example sentences or to some other data.
The data from the Japanese Sensei app was originally posted at the RevTK forums in 2011. The spreadsheet for the data (https://docs.google.com/spreadsheet/ccc?key=0AuGISeQ3yLCedGNYWnBkYXIwaTdMNVlydE45UDRSWmc&usp=sharing) is now set to private access. I have not found the original AAC sound files anywhere, but the Core10Kv4 Anki deck includes MP3 versions of the sound files.
The data from the Japanese Sensei app is often called something like "Core 10k" even though it is not part of the Japanese Core series published by iKnow / Smart.fm and it only contains 9619 pairs of words and sentences. When I compared the data from the spreadsheet linked above with JSON files for the Core 6000 data, 5544 words and 3739 sentences were identical, and about 1500 sentences only had small differences.
The word frequency lists posted at http://forum.koohii.com/viewtopic.php?pid=177749#p177749 (cb's Japanese Text Analysis Tool thread) are based on a corpus of about 5000 novels.
word_freq_report_mecab.txt includes about 190,000 words, about 54,000 of which are included in JMnedict.
word_freq_report_jparser.txt includes about 290,000 words, about 140,000 of which are included in JMnedict. JParser is the morphological or lexical analyzer that is used by Translation Aggregator.
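Overlap counts like these can be reproduced with a short script. This is only a sketch: the file names are hypothetical, and it assumes the frequency list has the word in its first whitespace-separated column and that the JMnedict headwords have been dumped to a file with one entry per line.

```python
def count_jmnedict_overlap(freq_path, jmnedict_path):
    # Load the (hypothetical) one-headword-per-line JMnedict dump.
    with open(jmnedict_path, encoding="utf-8") as f:
        names = {line.strip() for line in f if line.strip()}
    # Count how many frequency-list words appear among the headwords.
    hits = total = 0
    with open(freq_path, encoding="utf-8") as f:
        for line in f:
            cols = line.split()
            if not cols:
                continue
            total += 1
            hits += cols[0] in names
    return hits, total
```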
http://corpus.leeds.ac.uk/frqc/internet-jp.num is based on a corpus of websites analyzed with ChaSen. The corpus size is 253,071,774 tokens, but the list is cut off after the first 15,000 words.
The word frequency list posted at http://pomax.nihongoresources.com/index.php?entry=1222520260 is based on a corpus of about a thousand novels downloaded with the Perfect Dark P2P application. It includes about 160,000 words, about 30,000 of which are included in JMnedict.
http://shang.kapsi.fi/kanji/jawp-mecab-words.csv (which was posted at http://forum.koohii.com/viewtopic.php?id=3367&p=1) is based on a 25 GB data dump of the Japanese language Wikipedia. It is cut off after the first 20,000 words.
kale-p-u.csv from http://edrdg.org/~smg/ includes the number of Google search results for every word in a version of JMdict from 2007.
EDICT-freq (from http://www.geocities.jp/ep3797/edict_01.html) includes the number of Yahoo search results on the blog.goo.ne.jp domain for every word in a version of EDICT from 2008.
wordfreq and wordfreq_ck (from http://ftp.monash.edu.au/pub/nihongo/00INDEX.html) are based on a corpus of about 4 years of articles from the Mainichi newspaper. wordfreq contains one line per surface form and wordfreq_ck contains one line per lemma.
wordfreq_ck was also edited to remove some words, like English and romaji words, numbers and special characters, and words that start or end with the particle を.
wordfreq_ck includes about 140,000 words, about 20,000 of which are included in JMnedict.
edict_dupefree_freq_distribution (from http://ftp.monash.edu.au/pub/nihongo/00INDEX.html) is based on a corpus of about 500 MB of text from the websites of the Yomiuri and Mainichi newspapers. Instead of using a morphological or lexical analyzer like MeCab, ChaSen, or JParser, the author used a method of guessing the base forms of words based on shared prefixes.
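The shared-prefix method is not described in detail, so the following is only a guess at what such a heuristic might look like: given several surface forms assumed to belong to one word, take their longest common prefix as the base form. The function name and minimum-length threshold are my own invention.

```python
def guess_base(forms, min_prefix=2):
    # Trim the first form until it is a prefix of every other form.
    prefix = forms[0]
    for form in forms[1:]:
        while not form.startswith(prefix):
            prefix = prefix[:-1]
    # Fall back to the first form if the shared prefix is too short.
    return prefix if len(prefix) >= min_prefix else forms[0]

print(guess_base(["食べる", "食べた", "食べない"]))  # 食べ
```

A real analyzer like MeCab would instead look the forms up in a lexicon, which is why this kind of prefix guessing is only a rough approximation.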