Home of the finest science fiction and science fact


Zipf’s Lottery and Big Rocks From Space
Howard V. Hendrix

The Universe is characterized by its having very many small things and far fewer large things in it. Logarithmically, we humans seem to be of middling sort—about midway along the scale from Planck length to the size of the Universe. More locally and less speculatively, there are far fewer large Near-Earth Objects in space than there are smaller NEOs. As a result, the bigger the NEO, the smaller the chance of it colliding with Earth. Or at least that has long been presumed to be the case.

If we look at log-log graphs showing the size and frequency distributions of asteroid impact craters—indicative of asteroid diameter versus impact frequency—we note that such graphs follow an inverse proportionality relationship, specifically Zipf’s law. A fascinatingly self-referential mathematical description, Zipf’s law expresses how the number of times a particular element (the word “asteroid,” say) occurs in a particular set (such as the set of all words in this essay) is correlated with the place of that element in the rank-ordered list (from most frequently occurring to least frequently occurring) of all elements in that set.

Crater diameters per square kilometer on the Moon abide by Zipf’s law, as do other crater counts. Even in space, however, the law’s applicability is not limited to rocks coming out of the black. Peak gamma ray intensity of solar flares in counts per second, for instance, and cumulative distribution plots of galaxy evolution in the early universe—these too are Zipfian.

More down-to-Earth phenomena also show Zipf distributions: Telephone calls placed. Best-selling books sold. Website hits. City populations. Citations of scientific papers. Wealth of the world’s richest individuals. Intensities of wars measured in battle deaths per ten thousand of participating nations’ populations. Richter magnitudes of earthquakes. Power outage sizes and frequencies. Volcanic eruptions. Word and phrase distributions in all human languages. Dolphin whistles. Whale songs. Pleasing musical scores. Shannon entropies. Relative abundances of expressed genes. Chemical abundances that optimize the efficiency and faithfulness of self-reproduction. Methods for speeding up quantum computational attacks on stored, human-chosen passwords.

All Zipfian.

George Kingsley Zipf’s initial study, however, was not concerned with any of those things. Instead, Zipf took a copy of Herman Melville’s Moby-Dick and plotted each word’s frequency-of-occurrence against that same word’s frequency rank, from most common to least common. (Lestrade, Sander. “Unzipping Zipf’s Law.” PLOS One. August 9, 2017. doi: 10.1371/journal.pone.0181987) Zipf found that the second most frequent word in Moby-Dick occurs half as often as the most frequent word, the third most frequent word occurs one third as often as the most frequent word, the fourth most frequent word occurs one quarter as often as the most frequent word. All down the line, the frequency of a word is inversely proportional to its frequency rank.

Zipf also found that, in Moby-Dick, there are many words that occur very rarely, and a relatively few words that occur very commonly. This relationship between frequency and rank has turned out to be present not only in Moby-Dick but in all large texts, in all natural languages—and many other places besides.
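
The rank-versus-frequency tabulation described above can be sketched in a few lines of Python. This is a minimal illustration, not Zipf’s own procedure: the function name `rank_frequency`, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for the example. On a full novel, the product of rank and frequency would be expected to stay roughly constant; a single sentence is far too small to show that.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Hypothetical helper for illustration; the tokenizer is deliberately crude.
    """
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    sample = "the whale and the sea and the ship and the sky"  # made-up sample
    for rank, word, count in rank_frequency(sample):
        # Under an ideal Zipf distribution, rank * count is roughly constant.
        print(rank, word, count, rank * count)
```

Feeding the function the full text of a long book, rather than the made-up sample, is what would reproduce the kind of table the essay describes.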

Patterns are very important to understanding Zipf’s law, since the law fundamentally deals with how tokens (individual instances) are distributed into types (classes). Take the pattern of letters that constitute the word “chiasmus”: this content-driven word, describing an inverse proportionality relationship of a more poetic sort, is relatively rarely used in the English language. There are, however, many similarly rare content-driven words in English, most of them nouns naming something specific, as “chiasmus” also does. A surprisingly large number of these rare words meet the definition of hapax legomenon, a word that occurs only once in a text, or in a particular author’s works, or in the written record of an entire language.

Contrast the content-driven (or “semantic”) nature of a word like chiasmus with the pattern of letters that constitute the word “and”—the third most common word in English, after “the” and “of.” Out of the set of all words in English, there are surprisingly few words that are anywhere near as common as “and” in their occurrence—and most of that small set of very common words are function (also called “syntactic” or “grammatic”) words, as the conjunction “and” likewise is.

Such exceptionally common words have been referred to as stop words in both computational linguistics and natural language processing because they have been considered so frequent as to be insignificant—the inverse of hapax legomena, which by the same disciplines have often been considered too infrequent to be significant. In these computation-meets-language disciplines, both stop words and hapax legomena have often been considered “degenerate cases.”

Across the entirety of a large text like Moby-Dick, the occurrences of the relatively many words that appear rarely in that text add up to approximately the same total as the occurrences of the relatively few words that appear frequently. Although a particular rare word on its own is far less likely to occur in a corpus than a particular common word is, there are far more rare words than common words in that corpus overall. They balance each other out almost exactly. In Zipf’s “lottery,” any word drawn randomly from a large body of text is as likely to be a word from the large set of rare words as it is to be a word from the small set of common words, a very strange sort of coinflip equality.
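
The balance described above can be checked against an idealized case. The sketch below assumes a vocabulary whose word frequencies follow an exact 1/rank law, which real texts only approximate; the function name `half_mass_rank` and the vocabulary size of 10,000 are assumptions chosen for illustration. It finds how many of the most common words are needed to account for half of all the words drawn.

```python
def half_mass_rank(vocabulary_size):
    """Smallest k such that the k most common words cover half of all tokens.

    Assumes an ideal Zipf distribution: the word of rank r has weight 1 / r.
    """
    weights = [1.0 / rank for rank in range(1, vocabulary_size + 1)]
    total = sum(weights)
    running = 0.0
    for rank, weight in enumerate(weights, start=1):
        running += weight
        if running >= total / 2:
            return rank
    return vocabulary_size


if __name__ == "__main__":
    n = 10_000  # hypothetical vocabulary size
    k = half_mass_rank(n)
    print(f"{k} of {n} word types account for half of all tokens")
```

In this idealized case, fewer than a hundred of the ten thousand word types carry as much weight as all the rarer ones combined, and the dividing rank grows roughly in proportion to the square root of the vocabulary size.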

Zipf distributions are so ubiquitous as to seem almost magical. Their very ubiquity, though, has led some physical scientists to conclude that Zipf’s law is another of those linguistic theories that overstates the reality, much the way that another theory brought over from linguistics, namely the Sapir–Whorf hypothesis, has been popularly misinterpreted to mean that the language one speaks determines how one thinks (the strong version, or linguistic determinism), rather than merely influencing how one thinks (the weak, but also more empirically sound, version). Skeptical physical scientists have wondered if asserting that crater-size follows Zipf’s law down Moon Rabbit holes might not itself be a rarified mathematical form of pareidolia, the observing of patterns that aren’t really there.

The deeper question, though, is not about whether Zipfian distributions are “really there”—but rather why it is so many observations graph to Zipf’s characteristic logarithmic slope of negative one. We know that occurrences of Zipf’s law almost always involve distributions across orders of magnitude in size and over broad ranges of frequencies, but the law also suggests something beyond just those characteristics. Empirical observations of Zipf’s law are important because these observations almost always point to the existence of some latent variable structure that is yet to be discovered.
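
The “slope of negative one” can be made concrete with a short sketch. Assuming ideal frequencies proportional to 1/rank (an assumption, since measured data are noisier), an ordinary least-squares line fitted to log frequency against log rank should come out with a slope of about -1. The function name `loglog_slope` and the constant 1000 are hypothetical choices for the example.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be positive, ordered from most to least frequent,
    and contain at least two values.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


if __name__ == "__main__":
    # Ideal Zipf frequencies: the item of rank r occurs 1000 / r times.
    ideal = [1000.0 / rank for rank in range(1, 501)]
    print(round(loglog_slope(ideal), 6))
```

A noticeably steeper or shallower fitted slope on real data would point to a power law other than Zipf’s.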

We have in fact recently discovered some of those latent variables.

On their own, graphs of the distributions of individual “parts of speech” manifest narrow, non-Zipfian, generic distributions (“normal,” “bell curve,” or purged of “degenerate cases”). When considered together, however, these various individual parts regularly add up to broad, non-generic, Zipfian distributions—like those for all words in a large text or natural language. This behavior suggests that the parts of speech in this case function as the lexical “latent variable” underlying the syntax and semantics of texts.

These latent but discoverable variables exist in the Zipfian distributions of a great number of domains, not just in texts and languages. The ubiquity of these distributions suggests, paradoxically, that it is not unheard of for degenerate cases to add up to an inverse power law, of which Zipf’s law is one. A Zipfian distribution is not, after all, a “normal” distribution. As Rudy Rucker has noted, inverse power laws “are self-organizing and self-maintaining. For reasons that aren’t entirely understood they emerge spontaneously in a wide range of parallel computations, both social and natural.”

Zipf’s law, existing as it does on the edge of criticality and complexity—between ordered and random, tokens and types, samples and sources—seems to be quite good at reflecting emergent things (perhaps because it is one itself). Since Zipf’s law is a power law, it is self-similar, like a fractal. Astronomers Henry Lin and Avi Loeb have argued that the distribution of galaxies is Zipfian, emerging naturally from two-dimensional geometry (galaxies are seen as projected onto the two-dimensional plane of the sky) and a clustering behavior that is independent of size (“scale-invariance”) so that a small region of space looks the same as a large region.

“Small region looks the same as a large region” is another way of saying “fractal.”

One might even argue that spacetime itself is Zipfian, with quantum entanglement functioning as the latent variable underlying the orders of magnitude and broad ranges of frequencies characteristic of our universe, in both its spatial and temporal dimensions, its clocks and its geometries. If quantum entanglement is a condition for the existence of spacetime that nonetheless cannot be expressed in terms of spacetime, then the existence of spacetime must itself be an emergent property.

Returning to Earth, we can say that, like fractal structures in our world (coastlines, mountain ranges), Zipfian distributions aren’t perfectly self-similar, particularly when their number of samples or tokens is small. Greater sample size, however, means the coastline (or the Zipfian distribution curve, for that matter) looks “more like itself”: that is, its Shannon entropy and its shape stabilize.

Looked at this way, something even stranger than the coinflip equality of rarity and commonness also occurs. The Zipfian description of the relationship between sizes and rankings of discrete phenomena manifests itself in large numbers of rare events. As a result, frequency falls off relatively slowly with rank overall. Dealing as they do with those large numbers of rare events, Zipfian distributions also show us that (if we do not too enthusiastically “purge degenerate cases”) uncommon events are not as rare as we might think, and rare observations are paradoxically more common—and more impactful—than we might typically expect.

Think again of pictures of asteroids, juxtaposed with the log-log graph of their Zipfian distribution—specifically a Zipfian distribution plotting “approximate frequency of impacts” (across years covering many orders of magnitude) against “approximate TNT equivalent energy” (across energies also covering many orders of magnitude). Such a graph features plot-points labeled “Hiroshima” and “Annual Event~20 kilotons.” Then “Chelyabinsk,” “Typical Nuclear Bomb,” and “Tunguska.” Then “1000-Year Event~50 Megatons” and “Most Powerful Nuclear Bomb.” Then, considerably further along the scale, “Threshold of Global Catastrophe” and, at last, “Chicxulub Impact.”

Zipf’s law here is only broadly predictive. It can say that impacts will occur, and roughly in what sizes and how often, including for those future impactors whose crater sizes have not yet been recorded because the impacts have not yet occurred. It cannot, however, say with certainty when those impacts will occur. The broad, non-generic, Zipfian distribution of asteroids means that a powerful asteroid impact is significantly more likely to occur at any given moment than simpler odds calculations and timelines, based in assumptions of randomness, might suggest.

Recently, astronomers have raised alarms about the plans by Starlink and other satellite constellation companies to launch and position many tens of thousands of satellites, mostly into low Earth orbit. Starlink and other satellite internet systems are vying to deliver fast (low latency) internet to pretty much anyone anywhere on the planet. Yet these swarms of satellites are most visible at morning and evening twilight, which is also when near-Earth asteroids are most readily detectable. It’s also the time when, without these satellite swarms, the newer sky-survey and wide-field observatories could best discover a variety of previously undetected near-Earth objects. Given that the sun most brightly illuminates the solar panels of low Earth orbit satellites precisely at morning and evening twilight, many scientists in the astronomy community believe their partially sunlit observations are being made far more difficult by this unparalleled proliferation of satellites. Even radio astronomy will likely be imperiled by the noise of radio communications among the myriad swarm satellites themselves.

The majority of near-miss meteors and bolides that have “snuck up” on our warning systems and shocked everyone in the meteor astronomy community have, on the Zipfian log-log graph, tended to be of rather middling size and frequency—and have come out of the direction of the sun during evening or morning twilight, that is, crepuscularly. The Chelyabinsk superbolide of 2013, which produced a blast of roughly 500 kilotons, is an example of just such an undetected early-morning asteroid, as was its considerably larger and earlier predecessor, the Tunguska event of 1908.

During the 2001-2002 India-Pakistan standoff, too, the Hiroshima-sized meteor airburst of the June 6, 2002 Eastern Mediterranean Event raised concerns that the flash and shock wave effects of even a small undetected asteroid could—if those effects occurred in a time and place of greatly heightened military tensions—accidentally trigger a nuclear war. As US Air Force General Simon Worden said at the time, had the June 2002 bolide airburst occurred a few hours earlier over the India-Pakistan border, it could have radically changed the history of humans on this planet by providing “the spark that would have ignited the nuclear horror.”

Our best protection against such surprises is robust optical and radio astronomy, particularly wide-field sky surveys. Without applying a full Zipfian analysis, we can already say that the proliferation of satellite swarms is killing the dark, cold, and quiet of night, blinding and deafening some of our best space-guard instruments, potentially impairing astronomy until we cannot see to see, cannot hear to hear.

Do those corporations putting up these satellite swarms—and aware of their potential consequences—perceive the increased likelihood of space rocks getting through our warning systems as an acceptable risk? Especially given the lofty goal of making the internet accessible to everyone everywhere? Or the less lofty goal of maximizing bottom-line profits? Will they downplay the threat, contending that such slip-through space rocks are unlikely in the extreme? Or that, even if they should occur, they would only be smaller local or regional devastators—not the globally catastrophic “existential” perils posed by mountain-sized impactors falling from the sky?

Might the threat of a nuclear war-triggering event check the boxes of existential-risk criteria obviously enough to move certain billionaires to reconsider the current specifications of their satellite swarm operations? Might it lead them to go beyond reducing satellite albedos or providing satellite avoidance schedules, to repositioning their swarms in less problematic orbits or reducing the total number of satellites deployed?

I do not know. I do know, though, that no matter what the Zipfian distribution of the wealth of the world’s richest persons might suggest, no human individual should have the right to control what all humanity can or cannot see in the night sky—and for more than just esthetic reasons.



Howard V. Hendrix is the author of six novels, many works of shorter fiction (collected in several volumes, most recently The Girls with Kaleidoscope Eyes), many essays (scholarly and opinion/analysis pieces), and poems. He is also author, editor, or coeditor of seven nonfiction books. He taught at the college level for many years. His recent work appears regularly in Star*Line, the San Francisco Chronicle, Scientific American, and Analog. He has at one time or another also served as head of SFWA’s Credits and Ethics Committee, as SFWA Western Regional Director, and as SFWA Vice President. He most recently served as a juror for SFPA’s Rhysling Award.

Copyright © 2024 Howard V. Hendrix
