[What is Inductive Bias? – Essential for predicting the future of AI semiconductors]
Among the lectures I’ve given so far, this may be the most impactful one in terms of business planning.
First of all, if we interpret inductive bias literally, it means something like a “biased way of thinking about something.” (Hard, right? What on earth does that mean?) The term comes up constantly when people study Transformers or read deep learning books, but there are few places that explain it in plain language, so I have rarely met anyone who clearly understands this important concept. Even when you search for it, it’s hard to find a good explanation. So today I’m going to try to unpack this story as simply as possible. It’s also an important clue for predicting the future of semiconductors.
Easy examples always contain some absurdities, but the concept matters, so please bear with me. Let’s say there are 1,000 women and 1,000 men in the world, and each one is on a journey to find the best spouse. What if every woman meets all 1,000 men and then decides? Then there is no (or extremely low) inductive bias. But meeting everyone like this is inefficient and expensive. So someone says, “I’ll only meet the men whose number ends in zero, so I’ll meet one in ten,” or “I’ll only meet someone from my neighborhood,” or “people who live nearby probably have similar hobbies and backgrounds,” or “I’ll only meet people above a certain level of education and judgment.” In other words, my subjective assumptions are now involved, and the more complicated and demanding the conditions, the higher the inductive bias. It may seem like a crude example, but actual AI algorithms are full of cases like these. For example, when computing attention, you can look only at the tokens near you (on the assumption that nearby words are the most relevant for understanding the text), or skip over some tokens and quickly select the important ones by some criterion; there are plenty of algorithms like that. In a CNN, the idea is that the convolution itself performs only a special, local computation. Inductive bias usually works like these examples: picking out a few special things to compute on, rather than trying to consider everything.
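To make the attention example concrete, here is a minimal sketch (plain NumPy, with made-up sizes, random data, and an arbitrary window size, not any specific published model) contrasting full attention, which has essentially no such bias, with a windowed variant that only lets each token look at its neighbors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask marks which key positions
    # each query is allowed to look at (True = visible).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide masked positions
    return softmax(scores) @ v

n, d, window = 8, 16, 2
rng = np.random.default_rng(0)
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

# No inductive bias: every token attends to every other token.
full_out = attention(q, k, v)

# Strong inductive bias: each token may only attend to neighbors
# within +/- `window` positions ("nearby words matter most").
idx = np.arange(n)
local_mask = np.abs(idx[:, None] - idx[None, :]) <= window
local_out = attention(q, k, v, mask=local_mask)

print(full_out.shape, local_out.shape)  # (8, 16) (8, 16)
```

The local mask is exactly the “I’ll only meet people from my neighborhood” rule from the example above: it saves computation, but only pays off if the assumption behind it actually holds.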
If you look at how scholars have created inductive biases, it is hardly an exaggeration to say that this is almost the entire history of AI. About 20 years ago, when I was studying abroad at Tokyo Institute of Technology in Japan, I encountered neural networks for the first time, and I saw that the reference books were full of formulas. I wondered why these formulas kept appearing: did I have to memorize them unconditionally, or was there some important reason behind them that I should understand? Since I couldn’t motivate myself from the start, I dreaded having to memorize so many formulas. Looking back now, the books were full of inductive biases but contained no explanation of why those inductive biases were valid (so it’s only natural that my past self couldn’t understand them and felt no need to study such formulas…), and I think that’s why studying AI wasn’t interesting to me at the time. One of the things I studied back then was this: the human eye tends to recognize things using specific patterns, namely sets of diagonal lines or hat-shaped local patterns. If you build the neural network that way, it can achieve higher accuracy than before, and I was surprised to see that it really did show much higher accuracy (on MNIST). In other words, if you craft a good inductive bias, you can build a very good AI model; so if you study various fields (biology or brain science), you can find your own good inductive bias, write a good paper, and become a good AI scholar. (Of course, I fell into semiconductors soon after.)
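Just to illustrate what I mean by such hand-designed local patterns, here is a rough sketch (the filter values and the random image are purely illustrative, not the model I actually built back then) of fixing a few 3x3 diagonal and hat-shaped filters as a first layer instead of learning them:

```python
import numpy as np

# Hand-crafted 3x3 filters: two diagonal-edge detectors and a "hat"
# (center-surround) pattern. Fixing these as the first layer, instead
# of learning them from data, is a strong inductive bias.
filters = np.array([
    [[ 2, -1, -1],
     [-1,  2, -1],
     [-1, -1,  2]],   # "\" diagonal
    [[-1, -1,  2],
     [-1,  2, -1],
     [ 2, -1, -1]],   # "/" diagonal
    [[-1, -1, -1],
     [-1,  8, -1],
     [-1, -1, -1]],   # hat-like center-surround
], dtype=float)

def conv2d_valid(img, kernel):
    # Plain "valid" 2D sliding-window filter, no padding.
    h, w = kernel.shape
    out = np.zeros((img.shape[0] - h + 1, img.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+h, j:j+w] * kernel)
    return out

img = np.random.default_rng(1).random((28, 28))  # stand-in for an MNIST digit
feature_maps = np.stack([conv2d_valid(img, f) for f in filters])
print(feature_maps.shape)  # (3, 26, 26)
```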
In the early days of deep learning there were still many inductive biases, and before deep learning there were even more. For example, vision or language input data was specially processed before being fed into a neural network (which in those days had at most one or two layers), and computer-vision formulas, basic linguistics, help from speech experts, and domain-specific analysis were all essential. Pre-processing and post-processing were important, so you needed not only a GPU but also a CPU. Then, with the advent of deep learning, the inductive bias gradually decreased. Rather than engineering the input, we just feed in raw data (for an image, the pixels as they are), and the computation is kept as simple as possible, without human intervention. As a result, recent ultra-large AI models have very little inductive bias. Speech, language, and vision are fed into the model with hardly any preprocessing, and most of the computation inside the model consists of matrix multiplications (commonly called linear layers). Since the whole pipeline is a single end-to-end AI model without separate pre- or post-processing, the configuration has become dramatically simpler than before.
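As a rough sketch of that shift (the feature functions and layer sizes below are made up for illustration, not any specific historical pipeline), compare a hand-crafted-feature front end feeding one shallow layer with raw pixels flattened straight into stacked linear layers:

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.random((28, 28))  # raw pixel input, e.g. a digit image

# --- Old style: heavy inductive bias via hand-crafted preprocessing ---
def handcrafted_features(img):
    gx = np.diff(img, axis=1)          # horizontal gradients
    gy = np.diff(img, axis=0)          # vertical gradients
    return np.array([img.mean(), img.std(),
                     np.abs(gx).mean(), np.abs(gy).mean()])

shallow_w = rng.normal(size=(4, 10))   # at most one or two layers on top
old_logits = handcrafted_features(image) @ shallow_w

# --- Modern style: low inductive bias, raw pixels into stacked linear layers ---
def relu(x):
    return np.maximum(x, 0.0)

x = image.reshape(-1)                  # just flatten the raw pixels
w1 = rng.normal(size=(784, 256)) * 0.05
w2 = rng.normal(size=(256, 256)) * 0.05
w3 = rng.normal(size=(256, 10)) * 0.05
new_logits = relu(relu(x @ w1) @ w2) @ w3  # mostly matrix multiplications

print(old_logits.shape, new_logits.shape)  # (10,) (10,)
```

The second path encodes almost no human assumptions about what matters in the image; it relies on depth, data, and matrix multiplication instead.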
If you look at machine learning as a tool, you can view it in two parts: 1) a universal approximator, a huge formula for expressing all kinds of things, whose expressive power grows exponentially as the number of parameters increases, and 2) regularization, that is, how to solve the overfitting problem that arises when you train a universal approximator without any restriction. So there are two kinds of formulas: those that express, and those that effectively constrain learning. For the universal approximator, it turned out that stacking layers on top of each other is much more efficient than stretching one or two layers sideways, so deep learning methods that stack layers high (with a non-linear function in between) became popular. The remaining problem was regularization, and until recently human intervention there was huge. Let’s look at an example. The L2-norm term added to the loss function is exactly this kind of inductive bias: it says the weights should follow a Gaussian distribution centered near zero. In this way, regularization always involves human intuition. Up until a few years ago, many new regularization methods were proposed, and they were quite effective; a paper that introduced an intuitively interesting regularization method that worked in real experiments quickly became a star paper. Back when researchers competed to show the highest accuracy on a given benchmark dataset with what are, by today’s standards, very small models, the most effective way to achieve higher accuracy at a given model size was to come up with a good inductive bias intuitively, turn it into a regularization formula, and prove it experimentally. In fact, this is still very much the trend for small models.
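For the L2 example, here is a minimal sketch (a linear model on synthetic data, with an arbitrary penalty weight) of how the regularization term is simply added to the data loss:

```python
import numpy as np

def l2_regularized_loss(w, x, y, lam=1e-2):
    # Data term: mean squared error of a linear model.
    pred = x @ w
    data_loss = np.mean((pred - y) ** 2)
    # L2 term: an inductive bias saying "weights should stay near zero"
    # (equivalent to assuming a zero-mean Gaussian prior on the weights).
    reg_loss = lam * np.sum(w ** 2)
    return data_loss + reg_loss

rng = np.random.default_rng(3)
x = rng.normal(size=(100, 5))
true_w = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = x @ true_w + 0.1 * rng.normal(size=100)

w = rng.normal(size=5)
print(l2_regularized_loss(w, x, y))
```

The `lam` knob is exactly where human intuition enters: how strongly do I believe the weights should be small?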
The problem is that regularization methods are effective mainly when the training data is small. As it says at the beginning of any deep learning textbook, the best way to avoid overfitting is to put in a lot of data. With enough data you can prevent overfitting without the regularization (the inductive bias) that would otherwise be necessary, and this is a more fundamental way to prevent it. So as AI develops and model sizes grow, with the amount of training data increasing accordingly, lowering the inductive bias becomes the natural direction. Take the thousand-person blind-date example above: if you really have enough time and resources, why not meet all thousand? Whatever criterion I set, the best spouse may actually be someone I never expected to like. (Those of you with dating experience, do you agree? It can be important to meet people without prejudice ^^) Just as it can be better to have fewer preconceptions, that is, less inductive bias, in dating (of course, our time and effort are limited, so in reality there is plenty of inductive bias, whether in dating or in company interviews), if AI can be trained on enough data, less inductive bias is advantageous.
This isn’t just theory: in the Transformer era, it has become very clear that inductive bias should be reduced. In particular, the famous study by Yi Tay showed that as model and data size grow, the vanilla Transformer overwhelmingly outperforms models that were known to be efficient, such as Performer and Reformer, and from then on OpenAI and Google began investing in large models with strong conviction. In other words, it was best to have no inductive bias.