the space of all haiku
Nov 9, 2011

How many haiku can there possibly be? Due to their small, rigid form, we should be able to roughly determine the size of the haikuspace. We will use Japanese, as it is the only language suitable for proper haiku.*

* Of course words come to mean many things, but if you're used to reading and writing haiku in English with a 5-7-5 syllable pattern, I highly recommend investigating some Japanese haiku, and writing with something like 3-4-3 (syllables) or 2-3-2 (words) to get a feeling for the Japanese style.

Phonetic Attack

Japanese syllables are generally smaller than syllables in English. They consist of a consonant and a vowel, or a vowel by itself. Here are various estimates on the size of the Japanese sound inventory:

Source	Count	Notes
the fifty sounds, see also i ro ha	50	only the basic sounds of Japanese, and so a lower bound on their total number
Wikipedia article on hiragana	102	the vowels a/i/u/e/o (5), Ya/Yu/Yo (3), Wa/Wo (2), Da/De/Do (3), K/S/T/N/H/M/R/G/Z/B/P (11) combined with a/i/u/e/o/ya/yu/yo (8), and N by itself (1), for a total of 5+3+2+3+11*8+1 = 102
Japanese pronunciation	113	14 consonants * 8 vowels + syllabic n
The Range of Sounds in Japanese	133
JMdict	172	from all kana entries, counting only syllable-characters, see below

We'll eliminate the 50, as it's clearly a low-boru. A haiku's 5-7-5 pattern is 17 syllables total, and so the upper bound is between 102¹⁷ = 14002414191924244276669361796022272 ≈ 10^34.146 and 172¹⁷ = 100921476901355254279645541839050637312 ≈ 10^38.004.
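The arithmetic behind these bounds is easy to reproduce. Here is a minimal Python sketch (the function name `phonetic_bound` is just an illustrative choice) that raises each inventory size from the table to the 17th power and reports the base-10 logarithm:

```python
import math

SYLLABLES_PER_HAIKU = 5 + 7 + 5  # 17

def phonetic_bound(inventory_size: int) -> int:
    """Number of 17-syllable strings over an inventory of the given size."""
    return inventory_size ** SYLLABLES_PER_HAIKU

# Inventory sizes taken from the table above (the 50 is left out).
for size in (102, 113, 133, 172):
    bound = phonetic_bound(size)
    print(f"{size}^17 = {bound} ~ 10^{math.log10(bound):.3f}")
```

Python integers are arbitrary precision, so the exact values come out without any special handling.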

This is still a pretty wide range (about four orders of magnitude, or a factor of 10,000), and the numbers are pretty unfathomable. Here are a few others for comparison. A googol is 10¹⁰⁰. There are estimated to be about 10⁸⁰ atoms in the observable universe. The number of possible positions in chess is fewer than 10^46.7. There are about 10²⁶ molecules of water in a gallon of the stuff. But those don't really help, do they?

Dictionary Attack

From JMdict, a machine-readable Japanese dictionary containing nearly 160,000 entries, we extract the most common* kanji (ideographic) and kana (syllabic/reading) records from each entry. Syllables are counted by applying the regular expression substitution below, and then taking the length of the resulting string.

* Roughly, determined using JMdict's "priority" markers, otherwise using the first one. (Most entries (92%) have only one anyway.)

Thanks to memoization, it takes mere seconds for these huge permutations to be computed.
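The counting code itself isn't shown here, so the following is only a sketch of how a memoized count could work. It assumes that a "permutation fitting in n syllables" means an ordered sequence of dictionary words whose syllable counts add up to exactly n. The histogram `WORDS_BY_LENGTH` is a hypothetical toy stand-in for the real per-length word counts from JMdict.

```python
from functools import lru_cache

# Hypothetical toy histogram: syllable length -> number of dictionary
# entries of that length. Real figures would come from JMdict.
WORDS_BY_LENGTH = {1: 3, 2: 5, 3: 4}

@lru_cache(maxsize=None)
def count_lines(syllables: int) -> int:
    """Count ordered word sequences that fill exactly `syllables` syllables."""
    if syllables == 0:
        return 1  # only the empty sequence fills zero syllables
    # Pick the length of the first word, then fill the remainder recursively.
    return sum(
        n_words * count_lines(syllables - length)
        for length, n_words in WORDS_BY_LENGTH.items()
        if length <= syllables
    )

def count_haiku() -> int:
    """Two 5-syllable lines and one 7-syllable line, chosen independently."""
    return count_lines(5) ** 2 * count_lines(7)

print(count_lines(5), count_lines(7), count_haiku())
```

Because each `count_lines(n)` is cached, the recursion only ever evaluates eight distinct arguments (0 through 7), which is why the run time stays small no matter how large the resulting numbers get.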

Non-syllable-character removal regex:

s/([きしちにひみりぎじびぴ])[ゃゅょ]/\1/g

(Please let me know if there are other characters or cases which do not count as syllables.)
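For reference, the same substitution can be written in Python roughly as follows. This is a direct transcription of the pattern above, not a more complete syllable counter; the name `count_syllables` is illustrative.

```python
import re

# Python form of the substitution above: a small ゃ/ゅ/ょ that follows one of
# the listed kana is dropped, so the pair counts as a single syllable.
YOON = re.compile(r"([きしちにひみりぎじびぴ])[ゃゅょ]")

def count_syllables(kana: str) -> int:
    """Length of the reading after removing the non-syllable characters."""
    return len(YOON.sub(r"\1", kana))

print(count_syllables("きょう"))    # きょ + う -> 2
print(count_syllables("しゃしん"))  # しゃ + し + ん -> 3
```

Note that the pattern as written only lists hiragana, so a katakana combination such as キョ is left untouched and counts as two characters.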

All characters used in JMdict's kana entries: (172 characters)

、〜ぁあぃいうぇえおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゆょよらりるれろゎわゐゑをんゝゞァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロワヰヱヲンヴヶ・ーヽヾ

Using All Kana Entries

Permutations fitting in 5 syllables = 13724842934828

Permutations fitting in 7 syllables = 2495396740987223584

Permutations of 5-7-5 lines = 470061162017233273469657393428518492432749056 ≈ 10^44.672154

Using Only Common* Kana Entries

Permutations fitting in 5 syllables = 94865603412

Permutations fitting in 7 syllables = 2411754014092300

Permutations of 5-7-5 lines = 21704538552340125271960104096068971200 ≈ 10^37.336551

* as denoted by JMdict's "priority" markers

Using Only Unique Kana Entries

Permutations fitting in 5 syllables = 21007905554

Permutations fitting in 7 syllables = 302428066343444

Permutations of 5-7-5 lines = 133471212337745718580643080665018704 ≈ 10^35.125388
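Each 5-7-5 total is consistent with squaring the 5-syllable count (two such lines) and multiplying by the 7-syllable count. A short Python check, using only the per-line figures reported above (the dictionary keys and function name are illustrative labels):

```python
import math

# (5-syllable count, 7-syllable count) as reported above.
REPORTED = {
    "all": (13724842934828, 2495396740987223584),
    "common": (94865603412, 2411754014092300),
    "unique": (21007905554, 302428066343444),
}

def haiku_total(n5: int, n7: int) -> int:
    """Two independent 5-syllable lines and one 7-syllable line."""
    return n5 * n5 * n7

for name, (n5, n7) in REPORTED.items():
    total = haiku_total(n5, n7)
    print(f"{name}: {total} ~ 10^{math.log10(total):.6f}")
```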

Duplicate Kana Entries: 18784 out of 158685 entries.

The duplication is a bit of a wrinkle. It appears (by sifting randomly through duplicates) that the vast majority of duplicate readings are indeed for separate meanings/kanji, and so I am inclined to believe the "all entries" number. The truth is probably somewhere in the middle, but don't forget we've only used one dictionary.

Tangent: I would love to be able to get a number on the phonetic saturation of Japanese from this, perhaps after some input regarding syllable counting from those more fluent in Japanese. Until then, I'll just say this: if you map kana readings to kanji entries, there are 9377 readings (6.7%) with 2 or more kanji entries, 1161 (0.8%) with 5 or more, and 181 (0.1%) with 10 or more. Look at that beautiful power law action.
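A tally like this can be produced by grouping kanji entries under their kana reading and counting how many readings clear each threshold. The sketch below assumes the dictionary has already been reduced to (reading, kanji) pairs; `homophone_tail` and the tiny `SAMPLE` list are hypothetical and stand in for the real JMdict data.

```python
from collections import defaultdict

def homophone_tail(pairs, thresholds=(2, 5, 10)):
    """Map each threshold to the number of readings that have at least
    that many distinct kanji entries.

    pairs: iterable of (kana_reading, kanji_entry) tuples.
    """
    kanji_by_reading = defaultdict(set)
    for reading, kanji in pairs:
        kanji_by_reading[reading].add(kanji)
    return {
        t: sum(1 for entries in kanji_by_reading.values() if len(entries) >= t)
        for t in thresholds
    }

# Hypothetical toy data, not taken from the JMdict run described here.
SAMPLE = [
    ("かみ", "紙"), ("かみ", "神"), ("かみ", "髪"),
    ("はし", "橋"), ("はし", "箸"),
    ("ねこ", "猫"),
]
print(homophone_tail(SAMPLE))  # {2: 2, 5: 0, 10: 0}
```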

Summary

That was rather blustery, so here's the take-away: haikuspace is huge. Like 10⁴⁴ huge. On top of that, a phonetic approach doesn't reach a good upper bound, apparently because of homophones, which increase the haikuspace by almost seven(!) orders of magnitude. Some independent confirmation of that would be nice, though.

The next major step in finding a lower upper-bound would be to apply some sort of "sense-making" filter to the poems. This is beyond the scope of this writeup.

Some Random Haiku

A natural consequence of being able to permute all the words of a Japanese dictionary into haiku is being able to generate random haiku. And so here are a few of those that rose slightly above noise. Translations courtesy of mauler!

詰め込む間ざあざあネオン酸化物: While I cram / Whooshing neon / Oxide
狂暴戸レッドテープ子史籍ポロ: Enraged door / Red tape child / A history of polo
険悪絵願掛け火食公有気: Hostile pictures / Prayer cooked food / Public aspiration
孝道子引ったくり急穴居人: Michiko Takashi / Sudden snatching / Caveman
国花櫛結論回目圏外死: National flower comb / Conclusionth / Out of range death
代弁課身の上西部簾戸葉書: Department of spokesmen / Circumstances western / Bamboo blinds postcard
沿海二心嚢浸す教唆罪: Coast two / Soak pericardium / Criminal incitement
幼児予示ボンレスハム荷バラスト医: Infant foreshadowing / A load of boneless ham / Ballast medicine
横に頃民利草規矩横丁科: That horizontal time / The people's interests, grass rules / Department of alleys
投げ入れミ拒絶滑りい浸食シ: Throw mi / Rejection slippage i / Erosion shi
表立つ夏枯れ無窮真鶸説: Stand out / Summer slump eternal / Siskin theory
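How the samples above were drawn isn't described, so the following is only one possible approach, sketched in Python. It reuses the counting idea: if the number of ways to fill any remaining syllable count is known, each word length can be chosen with probability proportional to the number of completions it allows, which gives a uniform draw over all word sequences. The `LEXICON` here is a hypothetical toy word list keyed by syllable count.

```python
import random
from functools import lru_cache

# Hypothetical toy lexicon: syllable count -> kana words of that length.
LEXICON = {
    1: ["き", "め"],
    2: ["ねこ", "はな", "かぜ"],
    3: ["さくら", "こころ"],
}

@lru_cache(maxsize=None)
def count_lines(syllables: int) -> int:
    """Number of ordered word sequences filling exactly `syllables`."""
    if syllables == 0:
        return 1
    return sum(len(words) * count_lines(syllables - k)
               for k, words in LEXICON.items() if k <= syllables)

def random_line(syllables: int, rng: random.Random) -> list:
    """Draw one word sequence uniformly from all sequences filling the line."""
    line = []
    remaining = syllables
    while remaining > 0:
        lengths = [k for k in LEXICON if k <= remaining]
        # Weight each candidate length by how many full lines start with it.
        weights = [len(LEXICON[k]) * count_lines(remaining - k) for k in lengths]
        k = rng.choices(lengths, weights=weights)[0]
        line.append(rng.choice(LEXICON[k]))
        remaining -= k
    return line

def random_haiku(rng: random.Random) -> list:
    """Three lines of 5, 7, and 5 syllables."""
    return [random_line(n, rng) for n in (5, 7, 5)]

for words in random_haiku(random.Random()):
    print("".join(words))
```

Passing a seeded `random.Random` makes a draw reproducible, which helps when fishing for poems that rise above noise.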

Update 2013 March 22

Having just read this exploration of the size of Twitterspace, it occurred to me that I could use written language entropy as another estimate on the size of haikuspace:

number of haiku = 2^{(5 + 7 + 5) * b}

where b is the number of bits per character for Japanese. I'm going to use 2.4 (= 452337 * 8 / 1519224) from this paper (html version via google). This gives 2^40.8 ≈ 10^12.3 haiku, a little more than a bit shy (as expected) of my previous estimate of 10⁴⁴.
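The entropy estimate is a one-liner to reproduce. This Python sketch follows the formula above and, like it, treats each of the 17 syllables as one character; the function name `entropy_estimate` is illustrative.

```python
import math

# Bits per character derived from the two figures quoted above
# (about 2.38, rounded to 2.4 in the text).
BITS_PER_CHAR = 452337 * 8 / 1519224
CHARS = 5 + 7 + 5

def entropy_estimate(bits_per_char: float, chars: int = CHARS) -> float:
    """Return log10 of 2^(chars * bits_per_char)."""
    return chars * bits_per_char * math.log10(2)

print(f"b = {BITS_PER_CHAR:.3f}")
print(f"with b = 2.4: 10^{entropy_estimate(2.4):.1f} haiku")
print(f"with unrounded b: 10^{entropy_estimate(BITS_PER_CHAR):.1f} haiku")
```

Working in log10 avoids raising 2 to a fractional power directly and makes the result comparable with the dictionary-based figures.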

See Also

kigo—season word

senryu, tanka, renga, waka—other haiku-like forms

‌
