What is the most common dad joke? What pun is the most worthy of posting?

Join me on this useless and odd journey!

Why?

I really love dad jokes and similar puns.

I also love analysing big chunks of data.

…soooo let’s go!

Where’s the data?

There is a subreddit called /r/dadjokes that has been a bastion of bad jokes since 2011. Most of the jokes posted there have the setup in the title and the punchline in the body (or selftext, as Reddit calls it).

[Image: joke-1]

Oof that’s a groaner.

Then I had this idea.

Let’s just download all of them. Easy right? Right.

Again, where’s the data?

There’s a Python library called psaw, which is a wrapper around the pushshift.io API.

With that library I could set up something like this.

from psaw import PushshiftAPI
import datetime as dt
import time

api = PushshiftAPI()

# Start here and go backwards
start_epoch = int(dt.datetime(2019, 8, 18).timestamp())

for i in range(500):
    # Generator that will only give you 2000 at a time
    joke_list = api.search_submissions(before = start_epoch,
                                        subreddit = 'dadjokes',
                                        filter = ['id', 'title','selftext', 'score'],
                                        limit = 2000)

    for joke in joke_list:
        # Some posts don't have .selftext
        selftext = ""
        if hasattr(joke, "selftext"):
            selftext = joke.selftext.replace("\n", " ")

        # Dump that out to be piped into a file
        # ||| for quick simple parsing later
        print(joke.created_utc, "|||", joke.id, "|||", joke.score, "|||", joke.title, "|||", selftext)

        # The next search continues from the oldest entry seen so far
        start_epoch = joke.created_utc
    
    # be nice to rate limits
    time.sleep(2)

I ran this, saw it was giving me sensible results, then I set it to fetch a million jokes and went to get dinner.

I arrived back to a script that had finished running.

How many jokes? 157,266, going all the way back to the first joke posted in 2011.

Awesome, that’s a nice little data set!

But first

Obviously the first thing I had to do was run all the jokes through a Markov chain generator. I used markovify for this because it’s the quickest way. I even used their “Basic Usage” example almost verbatim.
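If you haven't met Markov chains before, here is a minimal hand-rolled sketch of the word-level idea. To be clear, this is not markovify and not the code I ran for this post: `build_chain`, `generate` and the tiny three-joke corpus are illustrative stand-ins, and a real library handles sentence boundaries and state size far more carefully.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    starts = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        starts.append(words[0])
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain, starts


def generate(chain, starts, rng, max_words=20):
    """Walk the chain from a random starting word until a dead end or max_words."""
    word = rng.choice(starts)
    output = [word]
    while word in chain and len(output) < max_words:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


# Three jokes borrowed from later in this post, used as a toy corpus
corpus = [
    "How do you find Will Smith in the snow? You look for fresh prints!",
    "How do you make holy water? By boiling the hell out of it",
    "Why did the scarecrow win an award? For being outstanding in his field",
]

chain, starts = build_chain(corpus)
rng = random.Random(0)  # seeded only so repeated runs give the same text
generated = generate(chain, starts, rng)
print(generated)
```

With a corpus this small the output mostly stitches the same few jokes back together. The nonsense only gets good once you feed it a lot more text.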

Ready to read some Markov-chain-generated dad jokes? Most of them are bad. The ones I picked here are the ones that made some sense, and none of them has a punchline.

Why washing machine instead of giving me cannons. Things didnt work out.

Knock, Knock. Our local TV weatherman broke both her arms?

What the difference between a piano, and an empty plate in the knees, and naturally, he was Finnish.

Any guy who invented the knock knock joke, shouldve been a post earlier about the wheels.

Did you hear about the hard-working mechanic who specializes in small groups

Dad: Sure, wheres the punchline?

Ok enough nonsense

Right. Here’s my idea.

Most jokes have a similar structure. Setup and punchline.

Let’s look at a common joke.

How do you find Will Smith in the snow? You look for fresh prints!

So this joke starts with a common setup, “How do you…”, and the punchline is the last couple of words, here “fresh prints”.

Cleanup

We might get somewhere if we only look at the first three words of the joke. But first we need to clean up the text.

I created a simple class to give each joke some structure.

class Joke:
    def __init__(self, aTimestamp, aId, aScore, aJoke):
        self.myTimestamp = aTimestamp
        self.myId = aId
        self.myScore = aScore
        self.myJoke = aJoke.strip()

And then I read in the jokes.txt file generated before to create an array of jokes.

jokes = []

with open("jokes.txt") as myFile:
    for line in myFile:
        s = line.split(" ||| ")

        # a couple of jokes parsed badly, let's just ignore them
        # (we need all five fields, since s[4] is read below)
        if len(s) < 5:
            continue

        j = Joke(s[0], s[1], s[2], s[3] + " " + s[4])
        jokes.append(j)

You’ll see here that I treat the setup and punchline as the same string. Splitting on title and selftext is not a very reliable separation, so I combine them.

To clean up the text I use a few methods.
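For reference, here is a small sketch of what parsing a single dumped line could look like. The `parse_line` helper and the sample line are made up for illustration (the timestamp, id and score are placeholders, not real data), and the `maxsplit` argument is my own assumption about how to keep a stray " ||| " inside the selftext from breaking the split.

```python
def parse_line(line):
    """Split one dumped line into (timestamp, id, score, joke_text), or None if malformed."""
    # maxsplit=4 leaves any later " ||| " inside the selftext untouched
    parts = line.rstrip("\n").split(" ||| ", 4)
    if len(parts) < 5:
        return None
    timestamp, post_id, score, title, selftext = parts
    try:
        score = int(score)
    except ValueError:
        # A non-numeric score means the line did not parse cleanly
        return None
    # Title and selftext are joined, because the setup/punchline split is unreliable
    return timestamp, post_id, score, (title + " " + selftext).strip()


# Hypothetical sample line in the same shape the download script prints
sample = "1566086400 ||| abc123 ||| 42 ||| How do you find Will Smith in the snow? ||| You look for fresh prints!\n"
parsed = parse_line(sample)
print(parsed)
print(parse_line("this line has no separators"))
```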

I tokenize the joke using nltk.tokenize.word_tokenize.

from nltk.tokenize import word_tokenize

words = word_tokenize(self.myJoke)

I use string.punctuation to translate any punctuation to an empty string.

import string

punktTable = str.maketrans('', '', string.punctuation)
strippedWords = list(filter(None, [w.translate(punktTable) for w in words]))

And lastly I use nltk.stem.snowball.SnowballStemmer to stem each word. This helps immensely when you want to group together similar sentences.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

stemmedJoke = []
for word in strippedWords:
    stemmedJoke.append(stemmer.stem(word))

Setup

When this is done, the first three items in the list stemmedJoke are the cleaned up first three words of the joke.

Let’s join those words together, print them out and count the occurrences of each triplet.
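The counting step can be sketched with `collections.Counter`. This is one way to do it, not necessarily how the numbers in this post were produced; `first_three` and the hand-written `stemmed_jokes` lists are stand-ins for the real stemmed output.

```python
from collections import Counter


def first_three(stemmed_words):
    """Join the first three cleaned-up words into a single key."""
    return "_".join(stemmed_words[:3])


# Hand-written stand-ins for the stemmedJoke lists built in the cleanup step
stemmed_jokes = [
    ["what", "do", "you", "call", "a", "belt"],
    ["what", "do", "you", "call", "a", "fish"],
    ["how", "do", "you", "find", "will", "smith"],
]

counts = Counter(first_three(words) for words in stemmed_jokes)

# Print the most common openings, count first
for triplet, n in counts.most_common(10):
    print(f"{n:7d} {triplet}")
```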

Here is the top 10.

   7898 what_do_you
   2632 what_did_the
   2629 did_you_hear
   2159 whi_did_the
   1667 how_do_you
    806 what_the_differ
    725 what_kind_of
    663 what_is_the
    561 did_you_know
    543 what_doe_a

So 7898 jokes start with “What do you…”, followed pretty far behind by “What did the…” and “Did you hear…”.

This is the stemmed version, which explains why one of them says “differ” and not “difference”.

Punchline

What if we do the exact same thing but for the last 3 words of the joke? Here are the 20 most common endings, with a version of the joke (or jokes) below each entry.

    172 sticki_a_stick
Whats brown and sticky? A stick.

    168 out_of_it
How do you make holy water? By boiling the hell out of it
I used to be a very small kid But i grew out of it

    150 in_his_field
Why did the scarecrow win an award? For being outstanding in his field

    142 grow_on_me
Ive gotten pretty attached to my beard. Its really starting to grow on me

    120 all_of_them
Hey dad, did you get a haircut? No son, they cut all of them
How many apples grow on a tree? all of them
How many dead people do you think are in the cemetary? Hopefully all of them

    115 medium_at_larg
What do you call a midget psychic on the run from the law? 
A small medium at large

    114 see_that_well
Why did the blind man fall down the well? He couldnt see that well

    113 get_in_there
Thats a nice cemetery, I hear people are dying to get in there

    106 have_2020_vision
I dont know where I see myself in a year. I dont have 20/20 vision

     94 waist_of_time
What do you call a belt made from a watch? A waist of time

     93 he_woke_up
Did you hear about the kidnapping at school? Its ok he woke up

     93 a_dad_joke
* This isn't a joke, it's mostly people ending the joke by saying
* something like "I made a dad joke"

     87 to_get_in
Thats a nice cemetery, I hear people are dying to get in
* Same as the one above, just skipping "there"

     85 trip_all_day
I bought some shoes from a drug dealer.. I dont really know 
what he laced them up with, but I was tripping all day

     85 do_he_laugh
I dont always tell dad jokes... ...but when I do, he laughs

     85 a_littl_lighter
Whats the difference between a hippo and a Zippo?
Ones heavy, ones a little lighter

     83 them_all_cut
Hey dad, did you get a haircut? No son, I got them all cut
* Same as above, but the ending is worded differently

     81 make_up_everyth
Never trust an atom. They make up everything

     76 a_chicken_sedan
Why does a chicken coup only have 2 doors? 
Because, if it had 4 doors it would be a chicken sedan

     74 food_no_atmospher
Have you heard about the restaurant on the moon? 
Great food, no atmosphere
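Grouping jokes by their ending, as in the list above, could look roughly like this. It is a simplified sketch: `simple_tokens` only lowercases and strips punctuation, with no tokenizer or stemmer, so its keys will not match the stemmed ones above exactly (for example `make_up_everything` instead of `make_up_everyth`). The helper names and the three sample jokes are picked for illustration.

```python
import string
from collections import defaultdict

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def simple_tokens(text):
    """Lowercase, drop punctuation, split on whitespace (no stemming)."""
    return text.lower().translate(PUNCT_TABLE).split()


def ending_key(text, n=3):
    """Join the last n cleaned-up words into a single key."""
    return "_".join(simple_tokens(text)[-n:])


sample_jokes = [
    "Hey dad, did you get a haircut? No son, they cut all of them",
    "How many apples grow on a tree? all of them",
    "Never trust an atom. They make up everything",
]

# Collect every joke under its ending so examples can be shown per entry
by_ending = defaultdict(list)
for joke in sample_jokes:
    by_ending[ending_key(joke)].append(joke)

# Most common endings first, each followed by the jokes that share it
for ending, group in sorted(by_ending.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"{len(group):7d} {ending}")
    for joke in group:
        print(joke)
    print()
```

Keeping the jokes in a list per key, rather than only a count, is what makes it easy to spot entries where different jokes happen to share the same last three words.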

Copy paste

Some of the jokes are impressively copy-pasted, to the letter.

Here’s "I dont always tell dad jokes... ...but when I do, he laughs" but I’m only searching for "he laughs"

[Image: joke-copy]

Yeah, it has been posted before, sorry.

Conclusion

I don’t think there is one. This was mostly for me. If you want the data set, yell at me on Twitter and we’ll figure something out.

Later.