Transformers: more than meets the eye? (Part 1: The Past)

Posted on 2019-06-18 by Stephen Hogg
Original article and Stephen's blog can be found here: https://breakitdownto.earth/2019/05/10/Transformers_part1.html

Reposted with permission.

There’s quite a lot of talk about how wonderful Transformers are and how they solve a lot of problems. Fair enough, they are pretty good. That said, all the explanations I’ve seen so far include math and neural network diagrams (nothing wrong with that!). What I’m going to attempt to do is talk about how they work in a more conceptual way, because it’s much easier to understand the math and diagrams if you have some idea of why those equations even exist.

In this post I’ll attempt to walk through the history that got us to the point where Transformers (the neural network technique, for clarity), seemed like a good idea. There are a couple of things we’ll need to think about:

  • The situation we use a Transformer in - and why
  • How have we evolved in thinking about how to deal with natural language?
  • Why wasn’t any of the stuff we thought of in the past good enough? Why did we have to keep going and invent transformers?
  • How do transformers work?

We’ll look at the first two points in this post and the second two in the next post. I’ll be assuming you know what you’re doing with neural networks, to at least some extent.

Hopefully what you’ll see at the end of all of this is that Transformers contain echoes of all the things that came before and that in conceptual terms they’re actually quite simple. Finally, I’m going to talk about this through the lens of Natural Language Processing (NLP), because that’s what these things get used for. If you’re interested in this with respect to financial time series stuff or otherwise, just squint really hard and imagine some of the words are replaced with whatever the appropriate domain-specific replacement is.

What problem is a transformer meant to solve?

This is actually the simple bit of these posts, if anything is. It’s what is known as a sequence to sequence problem. Also known as a seq2seq problem if you’re one of the people who works on this sort of thing.

Now, what the heck is that?

Basically, it’s the sort of thing you see over and over in NLP. Translation is a perfect example. Take for instance the following input and output pair:

Input : I do want to eat vegetables Output: 我想吃蔬菜

The goal of this task is to turn the input (English), into the output (Mandarin). Clearly, the input is just a bunch of words and so is the output. BUT! The order in which they appear matters. To see this in action, the input “do I want to eat vegetables” clearly has a different meaning. Some languages might have less of a problem with this than others, but in any event it’s a real source of information that we want to incorporate when we’re working on this problem. So the challenge is, how to predict a sequence of values - not necessarily words, it could be a financial time series problem you’re working on - using not just the knowledge of the input values but their structure (i.e. order), as well.

Just to briefly emphasise, this general type of problem has existed forever. People creating neural networks didn’t invent this type of problem, only new ways of solving it.1

Important digression!

Looking at the above example and a great deal of explanatory material on the topic of seq2seq problems, there’s a common problem you can encounter. It’s this: the problem looks like the input and output are concurrent, but they actually aren’t.

The way the problem works is, you feed in an input sequence and then want your model to complete the sequence. So really, the input and output pair above should be presented more like this2:

I do want to eat vegetables BREAK 我想吃蔬菜

The reason for the BREAK there is that you need something to tell your model to stop reading input and start generating output.

This will become very important as time goes on in this discussion.

Baby steps

Ok so first let’s step back in time.3 In the beginning (and I mean actually a long time ago, like 40 years or more), we didn’t have any of the gleaming mathematical machinery we now do for working on language problems. So where to start?

N-grams

This was the first, and simplest go at trying to figure out the what comes next? problem beyond just feeding in your input and hoping for the best. The idea here is that you can actually do a better job of predicting the next word if you use not only the current one, but the one before it. All the “N” in “N-grams” tells you is that you might use more or less context as opposed to specifically one or two words of history. The following examples might make things clearer.

Guess the next word given the input:

is

This is basically impossible, right? There’s some stuff that might not be that likely to come next, but by and large we can’t really narrow it down.

What about this input:

weather is

This is a bit more workable. We could probably make a few guesses about what goes next, but it’s definitely a narrower scope of possibilities than last time.

Now what about this:

The weather is

At this point I’m fairly sure something like hot or cold comes next.

So far so good, but what’s the cost? It’s pretty simple, in fact. Firstly, the reason we know one of those words comes next is because we’ve all watched too much news on the TV so far in this life. That’s well and good for a human, but sometimes you don’t have enough data for that to work for a machine.

Another problem arises if you’re trying to predict more than just the next word. If you’re trying to predict a whole sentence, you’re in a lot of trouble. Maybe you can get the first word of your output correct, but after that you’re at the mercy of whatever you predicted as your first word. If that’s wrong, everything else becomes even wronger. We’ve only really started solving this problem comparatively recently.

Even worse, the longer your input sequence the rarer it is! In the case of “The weather is”, you might be able to see it a few times in a dataset if you’re lucky. But not as many times as you see “weather is”, perhaps. So we’re stuck in a classic problem where we can get a few edge cases right all of the time, or have hopefully better than random performance over a much broader set of cases, assuming your input doesn’t fundamentally change as time goes on. That’s probably not good enough. So people kept researching stuff.

Time becomes a loop4

Another thing that people discovered when talking about seq2seq problems is that sometimes stuff a long time ago in your input sequence actually matters in terms of the prediction you make at the end of it. Doing stuff with N-grams won’t work here because for example a 128-gram is basically unique. You’ll only every see that sequence of tokens the once, if at all, so it’s no use for prediction. So what can you do about this?

This is where things start getting complicated, unfortunately. That is to say, as soon as you try and do anything beyond the most basic stuff with a seq2seq problem, it gets complicated.

Intuitively, what we want is something that can remember the earlier parts of a sequence like this:4

I was born in a water moon. Some people, especially its inhabitants, called it a planet, but as it was only a little over two hundred kilometres in diameter, ‘moon’ seems the more accurate term. The moon was made entirely of water, by which I mean it was a globe that not only had no land, but no rock either, a sphere with no solid core at all, just liquid water, all the way down to the very centre of the globe.
If it had been much bigger the moon would have had a core of ice, for water, though supposedly incompressible, is not entirely so, and will change under extremes of pressure to become ice. (If you are used to living on a planet where ice floats on the surface of water, this seems odd and even wrong, but nevertheless it is the case.) The moon was not quite of a size for an ice core to form, and therefore one could, if one was sufficiently hardy, and adequately proof against the water pressure, make one’s way down, through the increasing weight of water above, to the very centre of the moon.
Where a strange thing happened.
For here, at the very centre of this watery globe, there seemed to be no gravity. There was colossal pressure, certainly, pressing in from every side, but one was in effect weightless (on the outside of a planet, moon or other body, watery or not, one is always being pulled towards its centre; once at its centre one is being pulled equally in all directions), and indeed the pressure around one was, for the same reason, not quite as great as one might have expected it to be, given the mass of water that the moon was made up from.
This was, of course,

Imagine you’re trying to guess what comes next after “This was, of course,”. Do you think it’s something about the physics stuff towards the end of the passage, or about the perspective of people who live on planets where ice floats on the surface referred to much earlier in the text? It could be either, but if your model isn’t capable of having some way of hanging onto relevant info from much earlier in your input sequence, you have no choice. You can only really consider the physics option.

So we needed some way of hanging onto not everything but at least some important info from the earlier parts of a potentially long sequence. This is where neural networks start to play a role.

The original go at this involved trying to get a neural network to have a memory of its own, after a fashion.5 I don’t want to go on forever about this because it turns into talking about math, but the long and the short of it was that you wanted a bit of your network to loop around on itself in a way that made some of the information coming in from a given sequence element take a while to get back out again.

It’s a bit like Gumbo. 6 This is a slightly weird analogy, but stick with me. Gumbo is this dish you get in the southern parts of the U.S. that takes forever to cook. But the trick is to make it so that the flavour of the ingredients you put in early on is still retained some hours later in the pot when you finally finish cooking, even if you’ve taken that ingredient back out.

In our case, we want to put words in and move on to subsequent parts of the sequence without necessarily losing all of the earlier flavour. By the time we get to the end of that passage I put in above, we still want to be able to have some hint of the earlier part where the speaker says his planet was a moon.

In practical terms, what you do is make information take a really long path to get out of your network. Problem is, this isn’t perfect either. For a really long sequence, you can wind up with a gigantic number of parameters in your network. You also don’t really know what to hang on to, so by creating a recurrent neural network all you’re really doing is trying to hang on to as much of your input information as possible for as long as possible. This is on top of the other problem where introducing recurrence makes it slower to process an input sequence because now you have a dependency problem: you can only process some of your input after you’ve processed the bits that came before.

That said, it’s still a big big step forward.

My model accurately predicts everything!

The next step forward after recurrence seemed at first like it solved everything. We’ll see that wasn’t the case though.

It turns out two of the problems with the basic approach to recurrence can be solved in the same way. The non-selectiveness of memory storage in a basic recurrent net and the escalating number of parameters needed were both addressed by the Long Short-Term Memory cell.

What this thing did differently was conceptually simple enough. Rather than just adding links between cells in a neural network so that it took a while for information to find the exit, each cell now got its own memory storage.

As always, so far so good. With this particular invention, you could get to the end of a passage like the one in the previous section and have some sort of useful numerical information from earlier on predict what comes next.

Going back to the first example I gave:

I do want to eat vegetables BREAK 我想吃蔬菜

Now that we have an LSTM to work with, the input sequence can comfortably be quite large and we’ll still have all of it to work with when we start predicting an output sequence.

That’s another big step forward, but there’s still a problem here. Now, although we have a lot of information to work with when we start predicting an output sequence, we still have the problem that the sequence is only as good as the first prediction.

That is to say, we need the entire sequence to be meaningfully encoded in whatever you have after you’ve looked at the last item in your input. It also has to be able to let you spell out the entire output sequence. And when you do, you’re still stuck with the problem that your second prediction depends on your first one and your third prediction depends on your second one.

Also this LSTM stuff is relatively slow because you have to process one element of your sequence at a time and there’s no real way around that in the framework. That said, as always, it was a big leap forwards and started to show some serious potential for dealing with seq2seq problems.

Wouldn’t it make more sense and potentially be faster if we could use all of the bits of the input sequence independently in determining an output, instead of pancaking the information you have up against an invisible wall separating your input sequence and whatever comes next? This question is more or less what the Attention method and Transformers grow out of. I’ll talk about why next time.


  1. Just in case anyone was wondering. ↩︎

  2. From The Algebraist by Iain M. Banks. R.I.P. ↩︎

  3. This is secretly a joke. If you work out what I’m referring to, I’m not sorry. ↩︎

  4. https://www.youtube.com/watch?v=DNb4VKln1uw ↩︎

  5. Not as cool as anything invented by Miles Bennett Dyson, but still. ↩︎

  6. https://en.wikipedia.org/wiki/Gumbo#Preparation_and_serving ↩︎

ai

Machine Learning Jokes

Posted on 2019-05-20 by Imam

Machine Learning Jokes

Robot performing a stand up comedy routine.

If you don’t find this funny, well, its time to brush up your linear algebra and multivariable calculus.

  1. Why are neural networks better than humans ? Because they actually learn from their mistakes.
  2. Why is machine learning so racist ? Because it divides everything in classes.
  3. What do you call glorified curve fitting ? Deep learning.
  4. Why can’t random forest learn anything on its own ? Because it needs supervision.
  5. Data scientist goes in for a kidney transplant. Asks the surgeon “ Have you done this before “ The surgeon says “ No, but I have watched the first 10 minutes of ‘Learn surgery in 1 hour’".
  6. How do you raise a billion dollar for your coffee business in silicon valley ? You rename the business to Deep.Coffee.AI
  7. What is Ian Goodfellow's favourite catch phrase ? Yes we GAN.
  8. 1 Million robots marched towards the cliff, 999999 fell off the cliff and were destroyed but 1 didn't, why ? Because it takes a robot that many examples to learn something.
  9. Whats the difference between a LSTM and Donald Trump ? A LSTM is capable of generating perfect English sentences.
  10. What did the cost function say to the waiter after looking at the dinner bill ? Hey can you minimize that.

ai  ml  humour  jokes

Symbolic AI, return of the prodigal son

Posted on 2018-03-26 by Imam

Robot looking at a dog in the snow. It can't decide whether he sees a wolf or a dog.

What is Symbolic AI ?

As humans, we represent the world around us through high level concepts. For example, words like 'Trees', 'Flowers', 'Skyscrapers' are symbols that stand for something or someone. The essence of symbolic AI is reasoning through relationships among high level concepts, such as “All men are mortal; Caius is a man; therefore Caius is mortal – associating mortality with all men. However, it is quite labour intensive, as a human has to encode these high level concepts and their relationship into rules. These rules are encoded as if then statements.

A bit about its history

Symbolic AI was the dominant form of AI until the 1980s when statistical methods took over. The term artificial intelligence was coined at the seminal conference in 1956 at Dartmouth, organized by John McCarthy. The pioneers of the field John McCarthy and Marvin Minsky believed a rule-based system will ultimately lead to human level intelligence (Artificial General Intelligence) and beyond. Symbolic AI is also often referred as 'Good Old Fashioned AI' because of this historical context. There is an ongoing project called 'Cyc' is-, one of the longest running. It aims to codify all of the 'common sense knowledge' a human possesses. The hypothesis is that an artificial system should eventually be able to reason like a human once this system has access to all common sense knowledge.

Why it failed to deliver on its promise ?

The approach was overtaken by statistical methods for basically two reasons.

  1. Hard coding a chain of rules for sufficiently complex tasks can be overwhelming and requires domain expertise. Imagine defining a set of rules to recognize dogs in images- they are furry (but not always), have got two eyes and four legs, and have different shapes, sizes and colours. Defining rules to identify every possible species of dogs can soon become overwhelmingly complex and impractical.
  2. Rule based reasoning systems are not robust against noisy data. The real world is noisy, peppered with counter examples. Going back to earlier examples of defining rules for identifying dogs. Dogs are four legged animals, but some might be born with three due to a genetic defect or might lose one due to an accident. A system based on rigid set or chain of rules is very difficult to modify and accommodate for all the variations found in nature.

Finally, its resurgence or why do we need it again ?

Statistical machine learning and its modern avatar deep learning (including deep reinforcement learning) has had a profound impact on our society. Autonomous cars, super human performance in the game of GO and near perfect speech recognition is the result of deep learning and specialized hardware. However, these systems are notorious for being data intensive.

When DeepMind achieved super human results in the Atari game of Pong and Breakout using deep reinforcement learning it had to play millions of games with itself, while humans can understand the game mechanics and become decent players with just few examples. The deep network has no concept of a ball, pedal or that the ball needs to be struck with the pedal. The system learns everything from scratch (just pixels).

This is where a hybrid system (Symbolic AI + Deep learning) could potentially be revolutionary as they complement each other quite well. If the system can learn to represent high level concepts like ball, pedal and their relationships as step 1 that would speed up learning to play the game quite significantly. Higher level concepts and relationships also makes it re-usable and easy to transfer knowledge.

Summary

Symbolic AI is set of powerful tools to build intelligent systems that humans can understand and reason about. The tedious nature of building these systems eventually resulted in statistical methods becoming dominant for AI, however it issreturningmaking a comeback . Shortcomings of deep learning align well with strengths of Symbolic systems, hence its poised to take the field of AI to the next level.

To learn more

Symbolic reasoning

A nice video introduction

What was Good Old Fashioned AI and why did it fail

Towards Deep Symbolic Reinforcement Learning

Artwork by offoff.studio

symbolic-ai  ml