the space of all haiku Nov 9, 2011
How many haiku can there possibly be? Due to their small, rigid form, we should be able to roughly determine the size of the haikuspace. We will use Japanese, as it is the only language suitable for proper haiku.*
* Of course words come to mean many things, but if you're used to reading and writing haiku in English with a 5-7-5 syllable pattern, I highly recommend investigating some Japanese haiku, and writing with something like 3-4-3 (syllables) or 2-3-2 (words) to get a feeling for the Japanese style.
Japanese syllables are generally smaller than syllables in English. They consist of a consonant and a vowel, or a vowel by itself. Here are various estimates of the size of the Japanese sound inventory:
| Source | Sounds | Notes |
|---|---|---|
| the fifty sounds, see also i ro ha | 50 | only the basic sounds of Japanese, and so a lower bound on their total number |
| Wikipedia article on hiragana | 102 | the vowels a/i/u/e/o (5), Ya/Yu/Yo (3), Wa/Wo (2), Da/De/Do (3), K/S/T/N/H/M/R/G/Z/B/P (11) combined with a/i/u/e/o/ya/yu/yo (8), and N by itself (1), for a total of 5+3+2+3+11*8+1 = 102 |
| Japanese pronunciation | 113 | 14 consonants * 8 vowels + syllabic n |
| The Range of Sounds in Japanese | 133 | |
| JMdict | 172 | from all kana entries, counting only syllable-characters, see below |
We'll eliminate the 50, as it's clearly a lowball. A haiku's 5-7-5 pattern is 17 syllables total, and so the upper bound is between 102^17 = 14002414191924244276669361796022272 ≈ 10^34.146 and 172^17 = 100921476901355254279645541839050637312 ≈ 10^38.004.
This is still a pretty wide range (about four orders of magnitude, or a factor of 10,000), and the numbers are pretty unfathomable. Here are a few others for comparison. A googol is 10^100. There are estimated to be about 10^80 atoms in the observable universe. The number of possible positions in chess is fewer than 10^46.7. There are about 10^26 molecules of water in a gallon of the stuff. But those don't really help, do they?
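For anyone who wants to reproduce the two phonetic bounds, here is a minimal sketch in Python. It only uses the inventory sizes 102 and 172 from the table and the 17-syllable total; the variable names are mine.

```python
import math

# Sound-inventory estimates from the table (50 is dropped as a lowball).
LOW, HIGH = 102, 172
SYLLABLES = 5 + 7 + 5  # 17 syllables per haiku

# Every syllable slot can hold any sound, so the bound is inventory ** slots.
low_bound = LOW ** SYLLABLES
high_bound = HIGH ** SYLLABLES

# math.log10 accepts arbitrarily large Python integers.
low_exp = math.log10(low_bound)
high_exp = math.log10(high_bound)

print(low_bound, round(low_exp, 3))
print(high_bound, round(high_exp, 3))
# Spread between the two bounds, in orders of magnitude.
print(round(high_exp - low_exp, 2))
```

The spread comes out a little under four orders of magnitude, which is where the "factor of 10,000" comes from.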
From JMdict, a machine-readable Japanese dictionary containing nearly 160,000 entries, we extract the most common* kanji (ideographic) and kana (syllabic/reading) records from each entry. Syllables are counted by applying the regular expression substitution below, and then taking the length of the resulting string.
* Roughly, determined using JMdict's "priority" markers, otherwise using the first one. (Most entries (92%) have only one anyway.)
Thanks to memoization, it takes mere seconds for these huge permutations to be computed.
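A rough sketch of how such a memoized count might look, in Python. It assumes that "permutations fitting in N syllables" means ordered sequences of dictionary words whose syllable counts add up to exactly N, which is my reading rather than something spelled out here. `WORDS_BY_LENGTH` is a made-up toy histogram, not JMdict data.

```python
from functools import lru_cache

# Hypothetical histogram: syllable length -> number of dictionary words
# of that length. Real values would come from counting JMdict readings.
WORDS_BY_LENGTH = {1: 2, 2: 3}


@lru_cache(maxsize=None)
def count_lines(n):
    """Number of ordered word sequences whose syllables sum to exactly n."""
    if n == 0:
        return 1  # exactly one way to fill nothing: the empty sequence
    # Choose the first word (k syllables, c candidates), then fill the rest.
    return sum(
        c * count_lines(n - k)
        for k, c in WORDS_BY_LENGTH.items()
        if k <= n
    )


def count_haiku():
    """Lines are independent, so the 5-7-5 total is a plain product."""
    return count_lines(5) * count_lines(7) * count_lines(5)


print(count_lines(5), count_lines(7), count_haiku())
```

The cache means each line length is computed only once, so the work grows with the number of distinct word lengths rather than with the (astronomical) number of sequences.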
Non-syllable-character removal regex:
(Please let me know if there are other characters or cases which do not count as syllables.)
All characters used in JMdict's kana entries: (172 characters)
Using All Kana Entries
Permutations fitting in 5 syllables = 13724842934828
Permutations fitting in 7 syllables = 2495396740987223584
Permutations of 5-7-5 lines = 470061162017233273469657393428518492432749056 ≈ 10^44.672154
Using Only Common* Kana Entries
Permutations fitting in 5 syllables = 94865603412
Permutations fitting in 7 syllables = 2411754014092300
Permutations of 5-7-5 lines = 21704538552340125271960104096068971200 ≈ 10^37.336551
* as denoted by JMdict's "priority" markers
Using Only Unique Kana Entries
Permutations fitting in 5 syllables = 21007905554
Permutations fitting in 7 syllables = 302428066343444
Permutations of 5-7-5 lines = 133471212337745718580643080665018704 ≈ 10^35.125388
Duplicate Kana Entries: 18784 out of 158685 entries.
The duplication is a bit of a wrinkle. It appears (by sifting randomly through duplicates) that the vast majority of duplicate readings are indeed for separate meanings/kanji, and so I am inclined to believe the "all entries" number. The truth is probably somewhere in the middle, but don't forget we've only used one dictionary.
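As a sanity check on the three result blocks, each 5-7-5 total should be the 5-syllable count times the 7-syllable count times the 5-syllable count again. A small Python sketch, using only the per-line counts and exponents reported above (the dictionary keys are my own labels):

```python
import math

# (5-syllable count, 7-syllable count, reported log10 of the 5-7-5 total)
REPORTED = {
    "all": (13724842934828, 2495396740987223584, 44.672154),
    "common": (94865603412, 2411754014092300, 37.336551),
    "unique": (21007905554, 302428066343444, 35.125388),
}


def haiku_log10(five, seven):
    """log10 of the number of 5-7-5 combinations for the given line counts."""
    return math.log10(five * seven * five)


for name, (five, seven, reported) in REPORTED.items():
    print(name, round(haiku_log10(five, seven), 6), reported)
```

The computed exponents line up with the reported ones, and the gap between "all" and "unique" is about nine and a half orders of magnitude.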
Tangent: I would love to be able to get a number on the phonetic saturation of Japanese from this, perhaps after some input regarding syllable counting from those more fluent in Japanese. Until then, I'll just say this: if you map kana readings to kanji entries, there are 9377 readings (6.7%) with 2 or more kanji entries, 1161 (0.8%) with 5 or more, and 181 (0.1%) with 10 or more. Look at that beautiful power law action.
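The reading-to-kanji tally can be sketched in a few lines of Python. The `ENTRIES` list below is a tiny hand-picked stand-in for the dictionary (three well-known homophone groups), and the function name is hypothetical; the real counts would come from running the same grouping over all of JMdict.

```python
from collections import defaultdict

# Hypothetical (reading, kanji) pairs standing in for JMdict entries.
ENTRIES = [
    ("はし", "橋"), ("はし", "箸"), ("はし", "端"),
    ("かみ", "紙"), ("かみ", "髪"),
    ("ねこ", "猫"),
]


def readings_with_at_least(entries, threshold):
    """Count readings that map to at least `threshold` distinct kanji forms."""
    kanji_by_reading = defaultdict(set)
    for reading, kanji in entries:
        kanji_by_reading[reading].add(kanji)
    return sum(
        1 for forms in kanji_by_reading.values() if len(forms) >= threshold
    )


for threshold in (2, 3):
    print(threshold, readings_with_at_least(ENTRIES, threshold))
```

Using a set per reading means a kanji form listed twice under the same reading is only counted once.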
That was rather blustery, so here's the take-away: haikuspace is huge. Like 10^44 huge. On top of that, a phonetic approach doesn't reach a good upper bound, apparently because of homophones, which increase the haikuspace by almost seven(!) orders of magnitude. Some independent confirmation of that would be nice, though.
The next major step in finding a lower upper-bound would be to apply some sort of "sense-making" filter to the poems. This is beyond the scope of this writeup.
Some Random Haiku
A natural consequence of being able to permute all the words of a Japanese dictionary into haiku is being able to generate random haiku. And so here are a few of those that rose slightly above noise. Translations courtesy of mauler!
While I cram
Red tape child
A history of polo
Prayer cooked food
National flower comb
Out of range death
Department of spokesmen
Bamboo blinds postcard
A load of boneless ham
That horizontal time
The people's interests, grass rules
Department of alleys
Rejection slippage i
Summer slump eternal
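For the curious, here is one way random haiku could be drawn from a word list, sketched in Python. This is not necessarily how the samples above were produced. It assumes a lexicon grouped by syllable count and picks each word so that every possible line is equally likely; `LEXICON` is a made-up toy list of romanized words.

```python
import random
from functools import lru_cache

# Hypothetical toy lexicon: syllable length -> words (romanized).
LEXICON = {
    1: ["ki", "te", "me"],
    2: ["neko", "yama", "kawa"],
    3: ["sakura", "kokoro"],
}


@lru_cache(maxsize=None)
def count_lines(n):
    """Number of ordered word sequences filling exactly n syllables."""
    if n == 0:
        return 1
    return sum(
        len(words) * count_lines(n - k)
        for k, words in LEXICON.items()
        if k <= n
    )


def random_line(n, rng):
    """Draw one line of exactly n syllables, uniformly over all such lines."""
    words = []
    while n > 0:
        lengths = [k for k in LEXICON if k <= n]
        # Weight each word length by how many complete lines start with it.
        weights = [len(LEXICON[k]) * count_lines(n - k) for k in lengths]
        k = rng.choices(lengths, weights=weights)[0]
        words.append(rng.choice(LEXICON[k]))
        n -= k
    return words


def random_haiku(rng=None):
    rng = rng or random.Random()
    return [random_line(n, rng) for n in (5, 7, 5)]


for line in random_haiku(random.Random(2011)):
    print(" ".join(line))
```

Weighting by the remaining count is what keeps the draw uniform; picking word lengths with equal probability would over-represent lines made of long words.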
Update 2013 March 22
Having just read this exploration of the size of Twitterspace, it occurred to me that I could use written language entropy as another estimate on the size of haikuspace:
number of haiku = 2^((5 + 7 + 5) * b)
where b is the number of bits per character for Japanese. I'm going to use 2.4 (= 452337 * 8 / 1519224) from this paper (html version via google). This gives 2^40.8 ≈ 10^12.3 haiku, a little more than a bit shy (as expected) of my previous estimate of 10^44.
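The arithmetic behind the entropy estimate, as a short Python sketch. The two raw figures and the rounded value of b are the ones quoted above; the variable names are mine.

```python
import math

SYLLABLES = 5 + 7 + 5

# Bits per character from the two figures quoted in the text (about 2.38).
bits_per_char = 452337 * 8 / 1519224
b = 2.4  # the rounded value actually used in the estimate

total_bits = SYLLABLES * b                # exponent of 2
log10_haiku = total_bits * math.log10(2)  # same quantity as an exponent of 10

print(round(bits_per_char, 2), round(total_bits, 1), round(log10_haiku, 1))
```

Converting the exponent rather than computing 2 ** 40.8 directly keeps the result in the same "power of ten" form as the earlier estimates.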