How many times do you forget a stranger?

Reddit user katyvs1 created an AskReddit thread with the following question: What’s a random statistic about yourself you’d love to know, but never will? I was excited to see that many of the responses were bona fide Fermi questions, and one response in particular caught my imagination:

How many times have I walked past someone that I’ve walked past before without realising? — u/colecr

While we can’t give that user a relevant answer without knowing more about their life, it seems to me that we can come up with an estimate for the average American city dweller (from here on out referred to as Mike). The first step is to break the problem down into simpler questions.

  1. How many people does Mike walk past every day?
  2. How many people does Mike pass on more than one occasion?
  3. How many people can Mike remember in total?

How many people does Mike walk past every day?

Let’s set a fairly realistic scene. Mike lives and works within a bustling neighborhood situated in a large city. We’ll call this neighborhood Fermitown. It’s roughly one square mile and holds about 20,000 residents.¹ Rather conveniently for this analysis, Fermitown is perfectly square, spanning twelve blocks in each dimension.² In total, Fermitown has 144 blocks and 576 sidewalks (four per block).

That’s more than enough sidewalk for Mike, who only walks about two miles a day on average.³ We’ll ignore street crossings and say that Mike walks 24 blocks each day, or 4.2% of the sidewalks. Like most city dwellers, Mike walks a relatively brisk five feet per second. That means he spends 2,112 seconds walking, or 2.4% of the day. Let’s keep it simple and say that for any given second of the day, he has a 0.1% chance of being on a specific sidewalk.

Remember that Mike is the everyman of his neighborhood. If we ignore popular paths and commute-time rushes, we can use his 0.1% figure to estimate that twenty residents are on any given sidewalk at any moment. From personal experience, that seems like a reasonable guess.⁴ So if we assume that half of those pedestrians are walking in the opposite direction of Mike, we can say that he passes ten residents per block.

Answer: Mike walks past 240 residents each day.
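For readers who want to check the arithmetic, here is a quick recap in Python. Every value is an assumption taken from the text above, not measured data, and the small rounding differences are harmless.

```python
# A back-of-the-envelope recap of the numbers above; all inputs are the
# assumptions stated in the text, not measured data.
BLOCKS_WALKED = 24
SIDEWALKS = 576                        # 144 blocks * 4 sides
WALK_SECONDS = 2 * 5280 / 5            # two miles at five feet per second = 2,112 s
DAY_SECONDS = 24 * 60 * 60

share_of_sidewalks = BLOCKS_WALKED / SIDEWALKS        # ~4.2%
share_of_day = WALK_SECONDS / DAY_SECONDS             # ~2.4%
p_on_sidewalk = share_of_sidewalks * share_of_day     # ~0.1% for a given second

residents = 20_000
on_each_sidewalk = residents * p_on_sidewalk          # ~20 people per sidewalk
passed_per_block = on_each_sidewalk / 2               # half walk the other way
print(round(passed_per_block * BLOCKS_WALKED))        # ~240 residents per day
```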

How many people does Mike pass more than once?

According to this model, Mike walks past 240 residents, or 1.2% of Fermitown’s population, every day. We can use that to approximate a 1.2% chance of seeing a specific resident on the sidewalk on any given day. After one year, Mike will have seen about 93% of the population more than once.⁵ After two years, assuming residents move every nine years, Mike will have seen about 21,900 faces at least twice, and a quarter of those faces at least ten times.
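As a sanity check, here is a minimal sketch of that estimate in Python. It is not the simulation linked below on GitHub: it assumes each resident is passed independently with a 1.2% chance per day and ignores resident turnover, which the full simulation accounts for, but it reproduces the roughly 93% one-year figure.

```python
# A minimal sketch of the repeat-encounter estimate (not the GitHub simulation).
# Assumes each resident is passed independently with probability 0.012 per day
# and ignores resident turnover, which the full simulation accounts for.
import numpy as np

rng = np.random.default_rng(0)

def residents_seen_at_least_twice(days, daily_chance=0.012, residents=20_000):
    # Each resident's encounter count over `days` is Binomial(days, daily_chance).
    encounters = rng.binomial(days, daily_chance, size=residents)
    return int((encounters >= 2).sum())

# One year: roughly 93% of the 20,000 residents, matching the figure above.
print(residents_seen_at_least_twice(365))

# Closed-form check: probability of at least two encounters in n days.
p, n = 0.012, 365
print(1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1))   # ~0.93
```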

Answer: After two years Mike passes around 21,900 residents at least twice.

How many people can Mike remember?

This is the hardest part of the question. The best statistic I can find is a recent study that pegs average recall at about 1,000 faces and average recognition at about 5,000 faces (although at least one participant could recognize over 10,000). The problem is that these numbers don’t tell us Mike’s actual capacity for facial recognition. We can, however, assume that his capacity is partly used up by friends, family, significant figures, and celebrities. We also have to remember that we’re concerned with Mike’s ability to recognize faces he passes on the street. In reality, I suspect it takes multiple significant encounters for a face to be committed to memory. We’re most likely overestimating the likelihood of Mike committing a face to memory and underestimating his ability to recognize faces in general, so I’ll leave the estimate at 5,000.

Answer: 5,000 faces.

Putting it all together

After two years of living in Fermitown, Mike has unknowingly passed at least 16,900 people on more than one occasion.⁶ Assuming his facial recognition capacity is limited to 5,000 faces, he would keep passing the rest of his neighborhood’s residents multiple times without being any the wiser. In reality, I think Mike might perform much better than we’ve estimated. But even if he recognized 50% of the people he passed, that would still leave thousands unrecognized.

Let’s continue this example by moving Mike to the suburbs. Population density is much lower, so we’ll say he lives in a town of 50,000 people and walks past only 15 strangers each day. Assuming he lives there for nine years, he’ll have seen nearly 10,500 residents at least twice. It took a little longer for repeat encounters to accumulate, but they eventually grew to twice our limit of 5,000 faces.

Final estimate: The average American will have unknowingly passed by thousands of strangers more than once. You can see my simulation’s code on GitHub. Let me know what I did wrong in the comments. (:

Footnotes

  1. Using San Francisco as a guide, the average population per square mile is a little over 18,000. However, most neighborhoods have higher population densities.
  2. I’m assuming a block length of 350 feet and street widths of 60 feet, using San Francisco as a guide once again.
  3. Americans don’t seem to walk much at all.
  4. It also implies that over half of the neighborhood’s residents are walking at any given time, which sounds less reasonable.
  5. I originally tried a naive equation of 1 − 0.988³⁶⁵, but it didn’t match my (most likely more accurate) simulation, likely because that expression gives the chance of at least one encounter rather than at least two. I’m going to look into a better way to model this distribution.
  6. Remember this number only includes residents.

Using Neural Networks to Identify Hate Speech

Last year, researchers from Cornell published a paper describing their work classifying tweets as hateful, offensive, or neither. They experimented with a handful of models and parameters before settling on logistic regression. In addition to the effort spent building and evaluating models, they also extracted a handful of features for their model beyond the text itself, including the readability, sentiment, and metadata of each tweet. This was a lot of work, but the results were promising.

I wanted a similar classifier for my own purposes, but I didn’t have the patience necessary to extract the same data. Instead, I decided to leverage a technique that would require no feature engineering on my end: a convolutional neural network. Using this method, words are represented by vectors (thanks to pre-trained word embeddings), and features are learned by the neural network itself.

I implemented the CNN architecture suggested by Yoon Kim, deviating only in the word embeddings, the number of filters generated at the convolutional layer, and the error metric. The resulting model performed as well as, if not better than, the model used by the Cornell researchers when comparing overall F1 scores on the same data (0.91). This is gratifying, because I didn’t optimize my model’s parameters, let alone extract any features.
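For readers who want to try something similar, here is a minimal sketch of a Kim-style CNN in Keras. It is not the exact model described above: the vocabulary size, sequence length, filter counts, and embedding setup are placeholders, and in practice the embedding weights would be initialized from pre-trained vectors.

```python
# A minimal sketch of a Kim (2014)-style CNN for 3-way tweet classification.
# Vocabulary size, sequence length, and filter counts are placeholder values.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # assumed vocabulary size
MAX_LEN = 50          # assumed maximum tweet length in tokens
EMBED_DIM = 300       # typical dimensionality of pre-trained word vectors

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
# In practice the embedding weights would be loaded from pre-trained vectors.
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Parallel convolutions over 3-, 4-, and 5-gram windows, as in Kim's architecture.
pooled = []
for kernel_size in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=kernel_size, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

x = layers.Concatenate()(pooled)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```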

However, at the classification level, my model didn’t do as good a job at differentiating between hate speech and offensive speech. This is an important distinction to make, as it may be the difference between what is legal and illegal in some localities. Contradictions within the source data may be responsible for this confusion. “Tired of hoes man” is labeled as offensive, while “some lying ass hoe lol” is labeled as hate speech. Although both are offensive, I can’t say that one is more hateful than the other. This problem arises for a multitude of racial and gender-based slurs.

And although the previous examples show that the annotators were not working with clearly defined categories, the problems extend beyond that. Even a tweet stating “Lakers are trash right now” was labeled as hate speech. If I don’t know what the annotators were thinking, I can’t expect my model to.

So what about future work? Right now, I have a reasonably accurate (90%) classifier of offensive tweets that I can run against individual accounts or topics. Alternatively, I could try to create a better dataset. This seems hard, because people have different opinions about what hate speech actually is, but I do think that a good first step in that direction would be building a classifier that identifies genocidal or violent messages. This is both a clearer definition of hate and a more immediate concern in an era of mass shootings and populist leanings.

My code is available here.

An Introduction to the Storefront Index

Joe Cortright and Dillon Mahmoudi introduced the ‘Storefront Index’ in a 2016 report available here. The metric is a simple one, counting the total number of businesses within a city that meet several conditions (publicly accessible, densely located, and close to the city center). Despite its simplicity, this metric identifies the vibrancy of major metropolitan centers and indirectly measures other features such as walkability, safety, and economic health — all of which contribute to the quality of life of local citizens.

Although Cortright and Mahmoudi only calculated scores for the nation’s fifty largest metropolitan areas, they provided enough detail to replicate their method for smaller towns and cities. To demonstrate the utility of this metric, I will apply it to Oxnard, CA: a coastal city of over 200,000 people (and this author’s hometown).

Methodology

Cortright and Mahmoudi acquired their data through a third party; for Oxnard, the city’s data portal was all I needed, though some additional preprocessing was required. I had to convert the Business Data dataset’s 11,000 business licenses into a list of storefronts, which required de-duplication and manual curation of some categories. The end result was a list of Oxnard’s 1,200 storefronts.
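For illustration, a rough pandas sketch of that license-to-storefront preprocessing might look like the following. The file name, column names, and category list are hypothetical; the real work also involved manual curation.

```python
# A rough sketch of the license-to-storefront preprocessing, using pandas.
# The file name, column names, and categories below are hypothetical.
import pandas as pd

licenses = pd.read_csv("oxnard_business_licenses.csv")

# Keep only license categories that imply a publicly accessible storefront.
STOREFRONT_CATEGORIES = {"retail", "restaurant", "personal services"}
storefronts = licenses[licenses["category"].str.lower().isin(STOREFRONT_CATEGORIES)]

# Multiple licenses can share one location; de-duplicate on the street address.
storefronts = storefronts.drop_duplicates(subset="address")
print(len(storefronts))
```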

The preprocessed data was loaded into QGIS as a delimited text layer, with the latitude and longitude columns interpreted as point coordinates in WGS 84 (the standard GPS coordinate system). These points were then re-projected into a local coordinate system (EPSG:26745) for more accurate distance measurements. (Plotting the points against XYZ basemap tiles is a useful sanity check.)

Once the points were mapped, QGIS’ distance matrix tool was used to calculate the distance from each storefront to its nearest neighbor, so that storefronts more than 100 meters away from their nearest neighbor could be removed. Next, the distance to nearest hub tool was used to filter out locations more than three miles away from Oxnard’s city hall (manually added as a separate layer).
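The same workflow can also be approximated outside of QGIS. The following geopandas sketch mirrors the steps above; the file and column names are hypothetical, the city hall coordinates are only approximate, and EPSG:26745 (NAD27 / California zone V) measures distances in feet.

```python
# A geopandas sketch approximating the QGIS workflow described above.
# File and column names are hypothetical; city hall coordinates are approximate.
import geopandas as gpd
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from shapely.geometry import Point

df = pd.read_csv("oxnard_storefronts.csv")
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",                  # WGS 84, i.e. plain GPS coordinates
).to_crs("EPSG:26745")                # local projection, measured in feet

# Distance from each storefront to its nearest neighbor (k=2 skips the point itself).
xy = np.column_stack([points.geometry.x, points.geometry.y])
dist, _ = cKDTree(xy).query(xy, k=2)
points["nn_feet"] = dist[:, 1]
dense = points[points["nn_feet"] <= 100 * 3.28084]        # within ~100 meters

# Keep storefronts within three miles of city hall.
city_hall = (
    gpd.GeoSeries([Point(-119.178, 34.197)], crs="EPSG:4326")
    .to_crs("EPSG:26745")
    .iloc[0]
)
score = dense[dense.distance(city_hall) <= 3 * 5280]
print(len(score))
```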

Analysis

After performing the steps described in the methodology section, the original count of 1,200 storefronts was reduced to 963. This final number is Oxnard’s storefront index score. Surprisingly, a score of 963 is competitive with the scores attributed to larger cities (circa 2014) in Cortright and Mahmoudi’s research, including Austin and Pittsburgh. The comparison may be misleading, however, because the underlying datasets differ.

Figure 1: filtering out Oxnard’s neighborless and distant storefronts

Plotting the qualifying storefronts (i.e. those that are densely located and close to city hall) provides greater insight into city-wide trends, and certainly more insight than simply plotting all storefronts would (see Figure 1 for a before-and-after comparison). The resulting visual reveals a predominantly north-south, linear orientation. Significantly, only two regions show clustering: the larger of the two is anchored by city hall (the yellow dot), and the lesser is distributed around the Ventura Freeway (Highway 101). These results will make sense to locals, as the first cluster represents Oxnard’s historic downtown and business improvement district, and the second mainly represents a new outdoor shopping center known as The Collection.

Neighborhoods with high walkability and convenience scores will most likely be adjacent to or interlaced within these clusters. Getting to these areas, however, is another question. The city hall cluster sits on the road connecting Highway 101 and the Pacific Coast Highway, the arterial roadways of the Central Coast, and The Collection sits at a freeway exit off the 101 itself. The rest of the city’s storefronts are essentially strip malls, none of which is enough to meet the variety of entertainment and consumption needs of nearby households (see Figure 2). This points to critical gaps in Oxnard’s neighborhood development, fueled in part by the historical presence of major highways.

Figure 2: Strip malls influence the index.

Thoughts

The Storefront Index is meant to measure economic strength, but it is also meant to shed light on vibrant communities. In this sense, fast food restaurants and big box retailers don’t contribute as much to a neighborhood’s character as independent bookstores, pinball arcades, and bars do. The Storefront Index would need to acknowledge this difference to better reflect neighborhood quality.

I also propose an additional step to prevent sparse or linearly oriented storefronts from inflating a city’s score. Starting from the hub point (in Oxnard’s case, City Hall), only a single chain of storefronts linked by distances of less than 100 meters should be included. This would exclude clusters of stores that are marooned on the boundaries and strengthen the index’s relationship to walkability scores.
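One way to sketch that “single chain” rule is as a breadth-first search that only keeps storefronts reachable from the hub through hops of under 100 meters. The helper below is hypothetical and assumes the points are already in a projected coordinate system; pass the hop threshold in whatever units that projection uses (roughly 328 feet for 100 meters in EPSG:26745).

```python
# A sketch of the proposed "single chain" rule: keep only storefronts reachable
# from the hub through hops shorter than max_hop. Points are assumed to be in a
# projected coordinate system; max_hop must use that system's units.
from collections import deque

import numpy as np
from scipy.spatial import cKDTree

def chained_storefronts(points: np.ndarray, hub: np.ndarray, max_hop: float = 100.0):
    """Breadth-first search over storefronts, starting from the point nearest the hub."""
    tree = cKDTree(points)
    start = tree.query(hub)[1]          # index of the storefront nearest the hub
    visited = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in tree.query_ball_point(points[current], r=max_hop):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return sorted(visited)
```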

The (Unchanging) Statistics of Deadly Quarrels

An illustration of Richardson’s vision of human computers performing calculations within a forecast factory. NOAA / L. Bengtsson.

Statistics of Deadly Quarrels was written by Lewis Fry Richardson and published in 1960. The book is notable both for its findings and for being one of the first applications of quantitative methods to international relations. Richardson, a meteorologist by trade, turned his revolutionary and now widely used weather forecasting methods toward the outbreak of interstate conflict, hoping to find predictive variables by analyzing the years between 1809 and 1950. Although Richardson failed in this regard, he made a rather shocking discovery: outbreaks of war mirror the occurrence rates of rare events like meteor strikes and earthquakes, the category of events known as “acts of God”.

The occurrences of these events and others, such as genetic mutations and customer arrivals, can be statistically modeled with Poisson distributions. The basic requirements of a Poisson distribution are that events occur independently of each other and that the rate of occurrence is fixed over the period of time being studied. That the outbreak of war follows a distribution meeting these assumptions raises interesting mathematical and philosophical questions that have yet to be resolved, and it simultaneously asserts and rejects the value of forecasting attempts in this realm.
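For reference, a Poisson distribution with an average rate of λ events per year assigns the following probability to observing exactly k events in a given year (this is the distribution the table below is checked against):

$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$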

I first learned of Richardson’s work while reading the June edition of Harper’s Magazine on an airplane. The article’s author, Gary Greenberg, went on to describe Richardson as a visionary who imagined large rooms filled with “computers” (in this case people) who would perform calculations on incoming data in real time. In a fitting testament to Richardson’s foresight, I just so happened to be en route to D.C., where I’d be spending my summer interning as a quantitative geopolitical analyst. The era of big data had arrived, and the $200 billion industry now reflects the popularization of the belief that any question can be answered with enough observations and computational power.

Out of curiosity, I decided to pick up where Richardson left off and conduct the same analysis on interstate conflicts through the present day. Specifically, I wanted to compare the observed frequency of n conflict outbreaks per year against the frequency expected under a Poisson distribution. Thankfully, the task is much easier today than it would have been 50 years ago. There would be no monotonous paging through encyclopedias or lengthy calculations by hand for me. After a relatively simple Google search, I was able to get the data I needed from the UCDP/PRIO Armed Conflict dataset, which provided well-coded observations from 1946 through 2009. (In order to avoid overlap and any resulting bias, I only looked at the years from 1952 onward.) And, 60 lines of code later, here are the results:

wars started in a given year | count (observed) | count (expected) | proportion (observed) | proportion (expected)
0 | 30 | 30.65 | .52 | .53
1 | 21 | 19.55 | .36 | .35
2 | 5  | 6.24  | .09 | .11
3 | 2  | 1.33  | .03 | .02

To summarize the table: there were 30 years in which no new conflicts started, 21 years in which one conflict started, five years in which two conflicts started, and two years in which three conflicts started. As a comparison of the expected and observed columns shows, the distribution of actual conflict outbreaks mirrors a Poisson distribution. This was verified at the 95% confidence level using a Yates-corrected chi-square goodness-of-fit test. From the results, it appears that Richardson’s finding remains relevant well into the new millennium.
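The comparison is easy to reproduce. The sketch below is not the 60 lines of code mentioned above; it simply recomputes the expected counts from the observed ones and applies a Yates-corrected goodness-of-fit statistic, with the handling of the sparse three-conflict category and the degrees of freedom being my own assumptions.

```python
# A sketch reproducing the Poisson comparison above. Observed counts come from
# the table; the tail handling and degrees of freedom are my own assumptions,
# not necessarily the exact test used in the post.
import numpy as np
from scipy.stats import chi2, poisson

observed = np.array([30, 21, 5, 2])              # years with 0, 1, 2, 3 new conflicts
years = observed.sum()                           # 58 years, 1952-2009
lam = (np.arange(4) * observed).sum() / years    # mean outbreaks per year, ~0.64

expected = years * poisson.pmf(np.arange(4), lam)
print(np.round(expected, 2))                     # approx. [30.65, 19.55, 6.24, 1.33]

# Yates-corrected goodness-of-fit statistic; one extra df lost for the estimated rate.
stat = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()
dof = len(observed) - 1 - 1
print(stat, chi2.sf(stat, dof))                  # large p-value: fit not rejected
```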

View my code on GitHub.

Sources