I've been playing around with OpenAI's new GPT-3 language model. When I got beta access, the first thing I wondered was, how human is GPT-3? How close is it to passing a Turing test?
How It Works
Let me explain how exactly I'm generating these conversations. GPT-3 is a general language model, trained on a large amount of uncategorized text from the internet. It isn't specific to a conversational format, and it isn't trained to answer any specific type of question. The only thing it does is, given some text, guess what text comes next.
So if we want GPT-3 to generate answers to questions, we need to seed it with a prompt. I'm using this prompt to initialize all of the Q&A sessions:
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.
Q: What party did he belong to?
A: He belonged to the Republican Party.
Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.
Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.
This is the default prompt suggested by OpenAI for Q&A, so I'm not cherry-picking it to prove anything. The point of this prompt is just to show GPT-3 that we're doing questions and answers, not to provide it with information. For the prompt, both questions and answers are provided by a human. For all the other exchanges below, the answers are generated by GPT-3.
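For the curious, driving one of these Q&A sessions takes only a few lines against the beta API. Here's a minimal sketch using the openai Python client, with the prompt abbreviated; the engine name and sampling parameters are illustrative assumptions, not necessarily the exact settings I used:

```python
import openai  # beta-era Python client: pip install openai

openai.api_key = "YOUR_API_KEY"  # placeholder; issued with beta access

# The default Q&A prompt from above (abbreviated), with a new question appended.
prompt = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.
Q: How many eyes does a giraffe have?
A:"""

# GPT-3 only guesses what text comes next, so we cut it off at the
# end of the answer line, before it starts inventing the next "Q:".
response = openai.Completion.create(
    engine="davinci",   # illustrative engine name
    prompt=prompt,
    max_tokens=64,
    temperature=0,      # assumption: near-deterministic decoding for Q&A
    stop=["\n"],
)
print(response.choices[0].text.strip())
```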
Common Sense
Traditionally, artificial intelligence struggles with common sense. But GPT-3 can answer a lot of common sense questions.
Q: What is your favorite animal?
A: My favorite animal is a dog.
Q: Why?
A: Because dogs are loyal and friendly.
Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many legs does a frog have?
A: A frog has four legs.
Q: Are there any animals with three legs?
A: No, there are no animals with three legs.
Q: Why don’t animals have three legs?
A: Animals don’t have three legs because they would fall over.
Ten years ago, if I had this conversation, I would have assumed the
entity on the other end was a human. You can no longer take it for
granted that an AI does not know the answer to common sense questions.
How does GPT-3 know that a giraffe has two eyes? I wish I had some sort of debug output to answer that question. I don't know for sure, but I can only theorize that there must be some web page in its training data that discusses how many eyes a giraffe has. If we want to stump GPT-3 with common sense questions, we need to think of questions about things so mundane that they will not appear on the internet.
Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.
Q: Which is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.
Q: Which is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.
Q: Which is heavier, a paper clip or a can of soda?
A: A can of soda is heavier than a paper clip.
Q: Which is heavier, a football player or a car?
A: A car is heavier than a football player.
It's only 4/5. We're closer to stumping GPT-3 here. I think a human would be pretty close to 100% on these questions. It makes sense these are trickier – there probably isn't any web page that compares toasters and pencils by weight. It's only indirectly that humans gain this knowledge.
This gives us a hint for how to stump the AI more consistently. We
need to ask questions that no normal human would ever talk about.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Now we're getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn't quite know how to say, "Wait a moment... your question is nonsense." It also doesn't know how to say, "I don't know."
Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.
Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.
Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.
Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep
furiously.
Q: Do you understand these questions?
A: I understand these questions.
The lesson here is that if you're a judge in a Turing test, make sure you ask some nonsense questions, and see if the interviewee responds the way a human would.
Trivia Questions
GPT-3 is quite good at answering questions about obscure things.
Q: Who was president of the United States in 1801?
A: Thomas Jefferson was president of the United States in 1801.
Q: What city is in the northwest corner of Ohio?
A: Toledo is in the northwest corner of Ohio.
Q: What highway do you take to get to North Lake Tahoe from Oakland?
A: Highway 80 is the highway you take to get to North Lake Tahoe from Oakland.
Q: What are Darius’s abilities in League of Legends?
A: Darius’s abilities in League of Legends are: Apprehend, Decimate,
Noxian Guillotine, Crippling Strike, Apprehend, and Hemorrhage.
Q: Who did Capablanca defeat to become world chess champion?
A: Capablanca defeated Lasker to become world chess champion.
Oops, a repeat snuck in with question 4, but a human would make that sort of error too. GPT-3 seems to be above human-level on this sort of question. The tricky thing for applications, I think, is figuring out when the answer can be relied on. The OpenAI API does expose more data than just the generated text, so perhaps something clever is possible.
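For example, each completion can come back with per-token log probabilities, which might serve as a crude confidence signal. Here's a sketch of the idea; the cutoff value is entirely made up:

```python
import math
import openai

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,      # a Q&A prompt like the one above
    max_tokens=64,
    temperature=0,
    stop=["\n"],
    logprobs=1,         # ask the API for per-token log probabilities
)
choice = response.choices[0]

# Geometric-mean probability of the tokens in the generated answer.
token_logprobs = choice.logprobs.token_logprobs
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))

# The 0.5 cutoff is arbitrary, purely for illustration.
answer = choice.text.strip()
print(answer if confidence > 0.5 else f"(unsure) {answer}")
```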
In general, if you are trying to distinguish an AI from a human, you don't want to ask it obscure trivia questions. GPT-3 is pretty good at a wide variety of topics.
One trend that continues from the common sense questions is that GPT-3 is reluctant to express that it doesn't know the answer. So invalid questions get wrong answers.
Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.
Q: Who was president of the United States in 1650?
A: Richard Bennett was president of the United States in 1650.
Q: Who was president of the United States in 1620?
A: James I was president of the United States in 1620.
Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.
These wrong answers are actually fascinating! None of these were presidents of the United States, of course, since the US didn't exist then. But they are all prominent political figures who were in charge of some US-related political entity around that time. In a sense, they are good guesses.
Q: Who won the World Series in 2021?
A: The New York Yankees won the World Series in 2021.
Q: Who won the World Series in 2022?
A: The New York Yankees won the World Series in 2022.
Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.
A bleak view of a dystopian future.
Q: How many Super Bowls do the Cincinnati Bengals win in the 2030s?
A: The Cincinnati Bengals win two Super Bowls in the 2030s.
Encouraging for a Bengals fan, but perhaps not the most objective prediction. We'll have to wait and see.
Logic
People are used to computers being superhuman at logical activities,
like playing chess or adding numbers. It might come as a surprise that
GPT-3 is not perfect at simple math questions.
Q: When counting, what number comes before 100?
A: Ninety-nine comes before 100.
Q: When counting, what number comes before 123?
A: One hundred and twenty-two comes before 123.
Q: When counting, what number comes before 1000?
A: Nine hundred and ninety-nine comes before 1000.
Q: When counting, what number comes before 10,000?
A: Nine thousand and ninety-nine comes before 10,000.
Q: When counting, what number comes before a million?
A: Nine hundred thousand and ninety-nine comes before a million.
This is where the generic nature of GPT-3 comes into play. It isn't just the generic model, though; it's also the architecture of neural networks themselves. As far as I know, there is no neural network capable of doing basic arithmetic like addition and multiplication on a large number of digits based on training data rather than hardcoding.
It's funny, because these operations are simple for a customized program. But recursive logic that performs some operation and repeats it several times often doesn't map well onto the architecture of a neural net.
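For comparison, the "what comes before" questions that tripped up GPT-3 are trivial for a conventional program:

```python
# A conventional program answers "what number comes before N" exactly,
# for numbers of any size, with no training data at all.
for n in [100, 123, 1000, 10_000, 1_000_000]:
    print(f"{n - 1:,} comes before {n:,}")
# 99 comes before 100
# 122 comes before 123
# 999 comes before 1,000
# 9,999 comes before 10,000
# 999,999 comes before 1,000,000
```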
An interesting corollary is that GPT-3 often finds it easier to write code to solve a programming problem than to solve the problem on one example input:
Q: Write one line of Ruby code to reverse an array.
A: ary.reverse
Q: Write one line of Ruby code to count how many X’s are in an array.
A: ary.count { |x| x == 'X' }
Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]
A: [10, 6, 4, 2, 77, 3, 5, 1]
Q: How many X’s are in this array: [1, 3, 5, 6, 10, 4, 2, 77]
A: 2
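Notably, the Ruby one-liners GPT-3 wrote are correct; it's only the direct evaluation that goes wrong. Running the equivalent operations ourselves (here in Python, for the sake of a runnable check) gives the answers it should have given:

```python
ary = [1, 3, 5, 6, 10, 4, 2, 77]

# Equivalent of GPT-3's ary.reverse
print(ary[::-1])        # [77, 2, 4, 10, 6, 5, 3, 1]

# Equivalent of GPT-3's ary.count { |x| x == 'X' }
print(ary.count('X'))   # 0; a list of numbers contains no 'X'
```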
This problem shows up in more human questions as well, if you ask
it about the result of a sequence of operations.
Q: If I have a marble and a paper clip in a box, put a pencil in the box,
and remove the marble, what is left?
A: A paper clip.
Q: If I have two shoes in a box, put a pencil in the box, and
remove one shoe, what is left?
A: A shoe.
Q: If I put a pencil in a box, then put another pencil in the box,
what is in the box?
A: Two pencils.
Q: Then if I take out a pencil and put in a frog, what is in the box?
A: A frog.
It's like GPT-3 has a limited short-term memory, and has trouble reasoning about more than one or two objects in a sentence.
Conclusion
We have certainly come a long way. The state of the art before modern neural networks was:
> Hello, I am Eliza.
Q: Who was president of the United States in 1801?
A: What answer would please you the most?
Q: Which is heavier, a mouse or an elephant?
A: Come, come, elucidate your thoughts.
GPT-3 is quite impressive in some areas, and still clearly subhuman in
others. My hope is that with a better understanding of its strengths
and weaknesses, we software engineers will be better equipped to use
modern language models in real products.
As I write this, the GPT-3 API is still in a closed beta, so you have to join a waitlist to use it. I recommend that you sign up for the waitlist and check it out when you get the chance.