Deep learning theoretical knowledge

Part of the content is collapsed; the map contains 1,216 modules in total. It is based on two books by Yasuki Saito (translated by Lu Yujie): "Introduction to Deep Learning: Theory and Implementation Based on Python" and "Advanced Deep Learning: Natural Language Processing". This is the most suitable book for getting started with deep learning that I have ever read, and I highly recommend reading it before studying "Hands-On Deep Learning" by Li Mu. It requires no prior knowledge, teaches from scratch, and can be understood by high school students.

Edited at 2024-02-04 00:57:48

PlotWizard

Outline

Deep learning theoretical knowledge

introduction

basic concepts

The deep learning problem is a machine learning problem, which refers to summarizing general rules through algorithms from limited examples and applying them to new unknown data.

Unlike traditional machine learning, the models used in deep learning are generally more complex.

The data flow from the original input to the output target passes through multiple linear or nonlinear components. Each component processes information and in turn affects subsequent components.

When we finally get the output, we do not know exactly how much each component contributed. This is called the contribution allocation problem.

The contribution allocation problem is also often called the credit assignment problem (CAP) or the credit allocation problem.

The contribution allocation problem is a very critical issue, which is related to how to learn the parameters in each component.

At present, the artificial neural network (ANN) is the model that handles the contribution allocation problem comparatively well.

Neural networks and deep learning are not equivalent. Deep learning can use neural network models or other models (for example, a deep belief network is a probabilistic graphical model).

AI basic concepts

Concept of intelligence

natural intelligence

definition

Refers to the intellectual and behavioral abilities of humans and some animals

natural human intelligence

It is the comprehensive ability of human beings in understanding the objective world that is manifested by thinking processes and mental activities.

Different views and hierarchies of intelligence

View

theory of mind

Intelligence comes from thinking activities

knowledge threshold theory

Intelligence depends on applicable knowledge

evolutionary theory

Intelligence can be achieved by gradual evolution

Hierarchy

Characteristic capabilities included in intelligence

Perception

memory and thinking skills

learning and adaptability

capacity

Artificial intelligence concept

explain

Use artificial methods to achieve intelligence on machines

Study how to construct intelligent machines or systems that simulate and extend human intelligence

Turing test

Basic content of AI research

The subject position of artificial intelligence

The intersection of natural sciences and social sciences

Core: Thinking and Intelligence

Basic subjects: mathematics, thinking science, computer

Interdisciplinary research with brain science and cognitive science

Research on methods and technologies of intelligent simulation

machine perception

Vision

hearing

machine thinking

machine learning

machine behavior

Domain classification

Perception: that is, simulating human perception ability to perceive and process external stimulus information (visual and speech, etc.). Main research areas include speech information processing and computer vision.

Learning: Simulating human learning ability, mainly studying how to learn from examples or interacting with the environment. Main research areas include supervised learning, unsupervised learning and reinforcement learning.

Cognition: Simulates human cognitive abilities. The main research areas include knowledge representation, natural language understanding, reasoning, planning, decision-making, etc.

history

Different schools of AI research

symbolism

Symbolism is also known as logicism, the psychological school, or the computer school. It analyzes the functions of human intelligence and then realizes these functions on computers.

Basic assumptions

Information can be represented using symbols

Symbols can be manipulated through explicit rules (such as logical operations)

Human cognitive processes can be viewed as symbolic manipulation processes. In the reasoning period and knowledge period of artificial intelligence, the symbolic method is more popular and has achieved a lot of results.

connectionism

Connectionism, also known as the bionic school or the physiological school, is a family of information processing methods and theories in the field of cognitive science.

In cognitive science, the human cognitive process can be regarded as an information processing process. Connectionism holds that human cognition is information processing in neural networks composed of a large number of simple neurons, rather than symbolic operations.

Therefore, the main structure of a connectionist model is an interconnected network composed of a large number of simple information processing units, characterized by nonlinearity, distributed and parallel processing, local computation, and adaptability.

Behaviorism

Behaviorism believes that artificial intelligence stems from cybernetics.

In addition to deep learning, there is currently another exciting technology in the field of machine learning, reinforcement learning.

An agent continuously takes different actions, changes its state, and interacts with the environment to obtain different rewards. We only need to design appropriate reward rules, and the agent can learn a suitable strategy through continuous trial and error.

Neural Networks

brain neural network

Artificial neural networks

The development history of neural networks

Model proposed

The period from 1943 to 1969 was the first climax period of the development of neural networks. During this period, scientists proposed many neuron models and learning rules.

In 1943, psychologist Warren McCulloch and mathematician Walter Pitts first described an idealized artificial neural network and constructed a computing mechanism based on simple logical operations. The neural network model they proposed is called the MP model.

Ice age

From 1969 to 1983, neural network development went through its first trough. During this period, research on neural networks stagnated and remained at a low ebb for many years.

In 1969, Marvin Minsky published the book "Perceptrons", pointing out two key flaws of neural networks: first, the perceptron cannot handle the XOR problem; second, the computers of the time could not provide the computing power required to process large neural networks.

In 1974, Paul Werbos of Harvard University invented the backpropagation (BP) algorithm, but it did not receive the attention it deserved at the time.

The renaissance caused by the backpropagation algorithm

1983～1995. The backpropagation algorithm has reignited interest in neural networks.

Caltech physicist John Hopfield proposed a neural network for associative memory and optimization calculations, called the Hopfield network. The Hopfield network achieved the best results at the time on the traveling salesman problem and caused a sensation.

David Rumelhart and James McClelland provided a comprehensive discussion of the application of connectionism to computer simulations of neural activity and reinvented the backpropagation algorithm.

Decline in popularity

1995～2006. Support vector machines and other simpler methods (such as linear classifiers) are gradually surpassing neural networks in popularity in the field of machine learning.

The rise of deep learning

2006 to now. Multi-layer feedforward neural networks can be pre-trained layer by layer and then fine-tuned with the backpropagation algorithm, which allows them to learn effectively.

machine learning

Data preprocessing

Preprocess the data, for example by removing noise. In text classification, for instance, stop words are removed.

Feature extraction

Extract some effective features from raw data. For example, in image classification, extract edges, scale invariant feature transform (SIFT) features, etc.

Feature transformation

Apply further processing to the features, such as reducing or increasing their dimensionality. Dimensionality reduction includes two approaches: feature extraction and feature selection. Commonly used feature transformation methods include principal component analysis (PCA) and linear discriminant analysis (LDA).

predict

The core part of machine learning, making predictions through a function

Representation learning

In order to improve the accuracy of machine learning systems, convert input information into effective features

If there is an algorithm that can automatically learn effective features and improve the performance of the final machine learning model, then this kind of learning can be called representation learning.

Representation methods

local representation

One way to represent colors is to give each color its own name

The dimension is high and the representation cannot be extended to new colors. The similarity between any two different colors is 0.

distributed representation

RGB values to represent colors
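
The contrast between the two representations can be sketched in a few lines of NumPy. This is only an illustration: the four-color vocabulary, the approximate RGB value for orange, and the use of cosine similarity are assumptions made for the example, not part of the original notes.

```python
import numpy as np

# Hypothetical 4-color vocabulary, used only for illustration.
colors = ["red", "green", "blue", "orange"]

# Local (one-hot) representation: one dimension per color name.
one_hot = np.eye(len(colors))

# Distributed representation: RGB values scaled to [0, 1].
rgb = np.array([
    [1.0, 0.0, 0.0],   # red
    [0.0, 1.0, 0.0],   # green
    [0.0, 0.0, 1.0],   # blue
    [1.0, 0.65, 0.0],  # orange (approximate value, assumed)
])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

red, orange = colors.index("red"), colors.index("orange")
sim_local = cosine(one_hot[red], one_hot[orange])    # different names never overlap
sim_distributed = cosine(rgb[red], rgb[orange])      # red and orange share a component
```

With one-hot vectors every pair of distinct colors has similarity 0, and adding a color means adding a dimension. With RGB, a new color is just another point in the same 3-dimensional space.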

To learn a good high-level semantic representation (generally distributed representation), it is usually necessary to start from the low-level features and go through multiple steps of non-linear transformation to obtain it.

deep learning

step

Contribution distribution problem

Different from "shallow learning", the key problem that deep learning needs to solve is the distribution of contribution

Take the following game of Go as an example. Whenever a game is played, the final result is either a win or a loss. We will think about which moves led to the final victory, and which moves led to the final defeat. How to judge the contribution of each move is the problem of contribution distribution, which is also a very difficult problem.

In a sense, deep learning can also be regarded as a kind of reinforcement learning (RL). Each internal component cannot obtain supervision information directly; it has to obtain it from the final supervision information (reward) of the entire model, and with a certain delay.

The neural network model can use the error back propagation algorithm, which can better solve the contribution distribution problem.

End-to-end learning

traditional learning style

In some complex tasks, traditional machine learning methods need to artificially cut the input and output of a task into many sub-modules (or multiple stages), and each sub-module is learned separately.

For example, a natural language understanding task generally requires steps such as word segmentation, part-of-speech tagging, syntactic analysis, semantic analysis, and semantic reasoning.

There are two problems with this way of learning

First, each module needs to be optimized separately, and its optimization goals and the overall mission goals are not guaranteed to be consistent.

The second is error propagation, that is, errors in the previous step will have a great impact on subsequent models. This increases the difficulty of practical application of machine learning methods.

New way of learning

End-to-End Learning, also known as end-to-end training, refers to the overall goal of directly optimizing the task without conducting training in modules or stages during the learning process.

Generally, there is no need to explicitly give the functions of different modules or stages, and no human intervention is required in the intermediate process.

Most deep learning using neural network models can also be regarded as an end-to-end learning.

Commonly used deep learning frameworks

Theano: A Python toolkit from the University of Montreal, used to efficiently define, optimize and evaluate mathematical expressions over multidimensional arrays. Theano can transparently use GPUs and supports efficient symbolic differentiation. The Theano project is currently out of maintenance.

Caffe: The full name is Convolutional Architecture for Fast Feature Embedding. It is a computing framework for convolutional network models. The network structure to be implemented can be specified in a configuration file and does not require coding. Caffe is implemented in C++ and Python and is mainly used for computer vision.

TensorFlow: A Python toolkit developed by Google that can run on any device with a CPU or GPU. TensorFlow's calculation process is represented using data flow graphs. The name TensorFlow comes from the fact that the objects operated on during computation are multi-dimensional arrays, that is, tensors.

Chainer: One of the earliest neural network frameworks that uses dynamic computing graphs. Its core development team is Preferred Networks, a machine learning startup from Japan. Compared with static calculation graphs used by Tensorflow, Theano, Caffe and other frameworks, dynamic calculation graphs can dynamically construct calculation graphs at runtime, so they are very suitable for some complex decision-making or reasoning tasks.

PyTorch: A deep learning framework developed and maintained by Facebook, NVIDIA, Twitter and other companies. Its predecessor is Torch, written in Lua. PyTorch is also a framework based on dynamic computing graphs, which has obvious advantages in tasks that require dynamically changing the structure of neural networks.

Organization of this book

perceptron

A perceptron is an algorithm with inputs and outputs. Given an input, a given value will be output.

The perceptron sets weights and biases as parameters

Logic circuits such as AND gates and OR gates can be represented using perceptrons.

An XOR gate cannot be represented by a single layer perceptron.

An XOR gate can be represented using a 2-layer perceptron

Single-layer perceptrons can only represent linear spaces, while multi-layer perceptrons can represent nonlinear spaces.

A 2-layer perceptron can (in theory) represent a computer.
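
The points above can be sketched as follows. The weights and biases are one possible choice among infinitely many (they are illustrative values, not the only correct ones), and the helper name `perceptron` is introduced only for this example. XOR is built as a 2-layer perceptron by feeding NAND and OR into AND.

```python
import numpy as np

def perceptron(x1, x2, w, b):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    return int(np.dot(w, [x1, x2]) + b > 0)

def AND(x1, x2):
    return perceptron(x1, x2, w=[0.5, 0.5], b=-0.7)

def NAND(x1, x2):
    return perceptron(x1, x2, w=[-0.5, -0.5], b=0.7)

def OR(x1, x2):
    return perceptron(x1, x2, w=[0.5, 0.5], b=-0.2)

def XOR(x1, x2):
    # Layer 1: NAND and OR. Layer 2: AND of their outputs.
    return AND(NAND(x1, x2), OR(x1, x2))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_table = [XOR(a, b) for a, b in inputs]
```

No single choice of `w` and `b` in `perceptron` reproduces `xor_table`, because the four XOR points are not linearly separable; stacking the gates adds the missing nonlinearity.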

Neural Networks

Perceptrons and Neural Networks

"Naive perceptron" refers to a single-layer network and a model that uses the step function as an activation function.

"Multilayer perceptron" refers to a neural network, that is, a multilayer network that uses smooth activation functions such as the sigmoid function or the ReLU function.

Operation: inner product of neural network

Y = np.dot(X, W)

Neural networks can be implemented efficiently by using matrix operations.

Affine layer

The matrix product operation performed in the forward propagation of the neural network is called "affine transformation" in the field of geometry

Affine transformation includes a linear transformation and a translation, which respectively correspond to the weighted sum operation and offset operation of the neural network.

Y = sigmoid(Y)
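
The two one-line expressions above fit together as one forward step. The following is a minimal sketch, assuming a batch of 4 samples with 3 features and a layer of 2 neurons; the shapes, the random values, and the bias `b` are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # batch of 4 samples, 3 features each
W = rng.normal(size=(3, 2))   # weights: 3 inputs -> 2 neurons
b = np.zeros(2)               # bias, broadcast across the batch

A = np.dot(X, W) + b          # affine transformation: weighted sum plus bias
Y = sigmoid(A)                # activation applied element-wise
```

Because `X` holds one sample per row, the same matrix product handles a whole batch at once, which is why the matrix formulation is efficient.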

output layer

Activation function: Identity function is used for regression problems, and softmax function is used for classification problems.

Identity function

The input signal will be output unchanged

softmax function

Assume that the output layer has n neurons in total, and calculate the output yk of the k-th neuron.

Features: The sum of the output values of the output layer is 1

Note: overflow issues
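
For reference, the softmax output is y_k = exp(a_k) / Σ_i exp(a_i), where a_i are the inputs to the output layer. A minimal sketch of the usual overflow countermeasure follows: subtracting the maximum input before exponentiating leaves the result unchanged mathematically but keeps exp() within floating-point range. The input values are made up for the example.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax for a 1-D array of scores."""
    a = np.asarray(a, dtype=float)
    shifted = a - np.max(a)      # largest entry becomes 0, so exp() cannot overflow
    exp_a = np.exp(shifted)
    return exp_a / np.sum(exp_a)

# A naive exp(1010.0) would overflow to inf; the shifted version does not.
y = softmax([1010.0, 1000.0, 990.0])
```

Since softmax is monotonic, the largest input still gets the largest output, so at inference time the class can be read from the scores directly.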

Number of neurons

Classification problem

Generally set to the number of categories

Handwritten digit recognition

The input layer has 28*28=784 neurons, and the output layer has 10 neurons. There are also two hidden layers, and the number of neurons can be any value.

Batch processing

Enter multiple sets of data at once

Neural network learning

loss function

Introduce concepts

When looking for optimal parameters (weights and biases), you are looking for parameters that make the value of the loss function as small as possible, so you need to calculate the derivatives of the parameters (the gradient to be precise)

Why not directly use recognition accuracy as an indicator?

The derivative of the parameter will become 0 in most places

The same goes for step functions as activation functions

type

mean square error

cross entropy error
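
Both loss functions are short enough to sketch directly. The version below assumes a single sample with a one-hot label `t` and a network output `y`; the factor 1/2 in the mean square error and the small constant `eps` that guards against log(0) are conventions assumed here, and the sample values are illustrative.

```python
import numpy as np

def mean_squared_error(y, t):
    """E = 1/2 * sum_k (y_k - t_k)^2 for one sample."""
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-7):
    """E = -sum_k t_k * log(y_k); eps avoids log(0)."""
    return -np.sum(t * np.log(y + eps))

# One-hot label: the correct class is index 2.
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
# Output that puts most probability on the correct class.
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
# Output that puts most probability on a wrong class (index 7).
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])

mse_good, mse_bad = mean_squared_error(y_good, t), mean_squared_error(y_bad, t)
cee_good, cee_bad = cross_entropy_error(y_good, t), cross_entropy_error(y_bad, t)
```

Both losses are smaller for `y_good` than for `y_bad`, and both change continuously as the outputs change, which is what makes them usable for gradient-based learning.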

mini-batch

Randomly extract a subset of the training data

gradient

The vector formed by collecting the partial derivatives of all variables is called the gradient

The gradient points in the direction in which the function value increases fastest at each point, so the negative gradient is the direction in which the function value decreases the most.
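
A gradient can be approximated numerically with central differences. The sketch below assumes a 1-D float array as input and uses f(x) = x0² + x1² as a test function; the function name `numerical_gradient` and the step `h` are choices made for this example.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central-difference estimate of the gradient of f at the 1-D float array x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x[i]
        x[i] = orig + h
        f_plus = f(x)
        x[i] = orig - h
        f_minus = f(x)
        grad[i] = (f_plus - f_minus) / (2 * h)
        x[i] = orig                  # restore the original value
    return grad

def f(x):
    return np.sum(x ** 2)

x = np.array([3.0, 4.0])
g = numerical_gradient(f, x)         # analytic gradient is 2x = [6, 8]
x_next = x - 0.1 * g                 # one step against the gradient
```

Moving against the gradient lowers the function value here: `f(x_next)` is smaller than `f(x)`.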

hyperparameters

Manual setting

Learning rate η

minibatch size

Update times iters_num

Obtained by training

Weight w and bias theta

Neural Networks

The gradient of the loss function with respect to the weight parameters

epoch

Number of iterations per epoch = training data size / mini-batch size

Stochastic Gradient Descent (SGD)

Perform gradient descent on randomly selected data
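
The pieces above (mini-batch, gradient, hyperparameters, epoch) come together in a training loop. The following is a minimal sketch on a toy linear model rather than a neural network, so that the gradient can be written by hand; the synthetic data y = 2x + 1 and all hyperparameter values are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (illustrative): y = 2x + 1 plus small noise.
x_train = rng.uniform(-1, 1, size=200)
y_train = 2.0 * x_train + 1.0 + 0.01 * rng.normal(size=200)

# Hyperparameters are set by hand.
learning_rate = 0.1
batch_size = 20
iters_num = 1000

# Parameters are obtained by training.
w, b = 0.0, 0.0

for _ in range(iters_num):
    idx = rng.choice(x_train.size, batch_size)   # randomly selected mini-batch
    xb, yb = x_train[idx], y_train[idx]
    err = w * xb + b - yb                        # prediction error on the batch
    grad_w = 2 * np.mean(err * xb)               # d(mean squared error)/dw
    grad_b = 2 * np.mean(err)                    # d(mean squared error)/db
    w -= learning_rate * grad_w                  # step against the gradient
    b -= learning_rate * grad_b

iters_per_epoch = x_train.size // batch_size     # 200 / 20 = 10 iterations per epoch
```

After training, `w` and `b` end up close to the values 2 and 1 used to generate the data.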

error back propagation method

Although numerical differentiation is simple and easy to implement, its disadvantage is that it is time-consuming to calculate. There is a method that can efficiently calculate the gradient of weight parameters: error back propagation method

Computational graph

By using calculation graphs, you can intuitively grasp the calculation process

Forward propagation of computational graphs performs general computations. By backpropagating the computational graph, the derivatives of each node can be calculated

The error term of layer l can be computed from the error term of layer l + 1; this is the back propagation of error.

formula

calculate

Quantities shown in yellow are values obtained during backpropagation

Quantities shown in green are known quantities

By implementing the constituent elements of a neural network as layers, gradients can be calculated efficiently

By comparing the results obtained by numerical differentiation and the error back propagation method, you can confirm whether the implementation of the error back propagation method is correct (gradient confirmation)
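
Gradient confirmation can be sketched with two small layers. The `Affine` and `Sigmoid` classes below are simplified stand-ins written for this example (with the sum of the outputs used as a dummy loss), not the reference implementation of any library.

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, dout):
        return dout * self.out * (1.0 - self.out)

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, dout):
        self.dW = self.x.T @ dout        # gradient with respect to the weights
        self.db = dout.sum(axis=0)       # gradient with respect to the bias
        return dout @ self.W.T           # gradient passed to the previous layer

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)
layers = [Affine(W, b), Sigmoid()]

def loss():
    out = x
    for layer in layers:
        out = layer.forward(out)
    return out.sum()                     # dummy scalar loss

# Gradient by error back propagation.
loss()
dout = np.ones((5, 2))                   # d(loss)/d(output) for a sum
for layer in reversed(layers):
    dout = layer.backward(dout)
grad_backprop = layers[0].dW

# Gradient by numerical differentiation (central differences on each weight).
h = 1e-5
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        orig = W[i, j]
        W[i, j] = orig + h
        loss_plus = loss()
        W[i, j] = orig - h
        loss_minus = loss()
        W[i, j] = orig
        grad_numeric[i, j] = (loss_plus - loss_minus) / (2 * h)

max_diff = np.abs(grad_backprop - grad_numeric).max()
```

If the backward methods are implemented correctly, `max_diff` is tiny (many orders of magnitude below the gradient values themselves); a large difference points to a bug in a `backward` method.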

Reference video

https://www.bilibili.com/video/BV1LM411J7cW/?spm_id_from=333.788&vd_source=048c7bdfe54313b8b3ee1483d9d07e38

convolutional neural network

Everything should be as simple as possible, but not too simple. [Albert Einstein]

the whole frame

Comparison

Network based on fully connected layer (Affine layer)

CNN based network

Layer order

Convolution[convolution layer]-ReLU-(Pooling[pooling layer])

The layer close to the output uses the previous Affine [affine transformation] - ReLU combination

The final output layer uses the previous Affine-Softmax combination

convolution layer

Convolution concept

Problems with the fully connected layer

The shape of the data is "ignored". The image is usually a 3-dimensional shape in the height, length, and channel directions, but the 3-dimensional data needs to be flattened into 1-dimensional data when input.

The image is a 3-dimensional shape, and this shape should contain important spatial information.

Spatially adjacent pixels have similar values

The RGB channels are closely related to each other.

There is little correlation between pixels that are far apart

The convolutional layer can keep the shape unchanged

definition

The input and output data of the convolution layer are called feature maps

The input data of the convolutional layer is called the input feature map

The output data is called the output feature map

Convolution operation

The convolution operation is equivalent to the filter operation in image processing

The main function of convolution is to slide a convolution kernel (ie filter) on an image (or some feature) and obtain a new set of features through the convolution operation.

two-dimensional

Given an image X ∈ R^(M×N) and a filter W ∈ R^(m×n), where generally m << M and n << N, the convolution is

$$y_{ij} = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{uv} \, x_{i-u+1,\, j-v+1}$$

three dimensional

Cross-correlation

In the process of calculating convolution, it is often necessary to flip the convolution kernel.

Flip is to reverse the order in two dimensions (top to bottom, left to right), that is, rotate 180 degrees.

In terms of specific implementation, cross-correlation operations are used instead of convolutions, which will reduce some unnecessary operations or overhead.

Cross-Correlation is a function that measures the correlation between two series, usually implemented using a sliding window dot product calculation.

Given an image X ∈ R^(M×N) and a convolution kernel W ∈ R^(m×n), their cross-correlation is

$$y_{ij} = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{uv} \, x_{i+u-1,\, j+v-1}$$

The difference between cross-correlation and convolution is only whether the convolution kernel is flipped. Cross-correlation can also be called non-flip convolution.

Convolution is used in neural networks for feature extraction. Whether the convolution kernel is flipped has nothing to do with its feature extraction capability. Especially when the convolution kernel is a learnable parameter, convolution and cross-correlation are equivalent.
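
The relationship between the two operations can be checked with a small sketch. It assumes stride 1 and no padding, and the function names and the 4×4 input are made up for the example: convolution is implemented as cross-correlation with the kernel rotated by 180 degrees.

```python
import numpy as np

def cross_correlate2d(x, w):
    """Slide w over x (stride 1, no padding) and take dot products."""
    M, N = x.shape
    m, n = w.shape
    out = np.zeros((M - m + 1, N - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + m, j:j + n] * w)
    return out

def convolve2d(x, w):
    """Convolution = cross-correlation with the kernel flipped in both dimensions."""
    return cross_correlate2d(x, w[::-1, ::-1])

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[1.0, 0.0],
              [0.0, -1.0]])

cc = cross_correlate2d(x, w)   # x[i, j] - x[i+1, j+1] at every position
cv = convolve2d(x, w)          # x[i+1, j+1] - x[i, j] at every position
```

For this asymmetric kernel the two results differ only in sign. When the kernel is learned, training can simply arrive at the flipped kernel, which is why frameworks can use cross-correlation without losing expressive power.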

Variants of convolution

Zero padding

In order to keep the space size constant, the input data needs to be padded

Stride

The interval of positions at which the filter is applied is called the stride

Commonly used convolutions

Narrow Convolution: Step size s = 1, no zero padding at either end (p = 0); the output length after convolution is n − m + 1.

Wide Convolution: Step size s = 1, zero padding at both ends p = m − 1; the output length after convolution is n + m − 1.

Equal-Width Convolution: Step size s = 1, zero padding at both ends p = (m − 1)/2; the output length after convolution is n.
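
All three cases follow from the general output-length formula ⌊(n − m + 2p) / s⌋ + 1 for input length n, filter size m, padding p on each side, and stride s. A small sketch, with the helper name and the example sizes chosen for illustration:

```python
def conv_output_length(n, m, p=0, s=1):
    """Output length for input length n, filter size m, padding p per side, stride s."""
    return (n - m + 2 * p) // s + 1

n, m = 7, 3
narrow = conv_output_length(n, m, p=0)              # n - m + 1
wide = conv_output_length(n, m, p=m - 1)            # n + m - 1
equal = conv_output_length(n, m, p=(m - 1) // 2)    # n (m assumed odd)
strided = conv_output_length(n, m, p=0, s=2)        # stride 2 shrinks the output further
```

The same formula is applied independently to height and width for 2-D inputs.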

Mathematical properties of convolution

Convolution operation on 3D data

The input data and filter channel numbers should be set to the same value.

Multiple convolution operations

Regarding the filters of the convolution operation, the number of filters must also be considered. Therefore, as 4-dimensional data, the weight data of the filter should be written in the order of (output_channel, input_channel, height, width). For example, if there are 20 filters with a channel number of 3 and a size of 5 × 5, it can be written as (20, 3, 5, 5).

Batch processing

We hope that the convolution operation also corresponds to batch processing. To do this, the data passed between each layer needs to be saved as 4-dimensional data. Specifically, the data is saved in the order of (batch_num, channel, height, width).

Properties of convolutional layers

Local connection: Each neuron in the convolutional layer (assumed to be layer l) is connected only to the neurons in a local window of the previous layer (layer l − 1), forming a locally connected network. The number of connections between the convolutional layer and the previous layer is greatly reduced, from n(l) × n(l − 1) connections to n(l) × m connections, where m is the filter size.

Weight sharing: The filter w(l) as a parameter is the same for all neurons in layer l.

Due to local connections and weight sharing, the parameters of the convolutional layer consist only of an m-dimensional weight w(l) and a 1-dimensional bias b(l), for a total of m + 1 parameters.

The number of neurons in layer l is not chosen arbitrarily, but satisfies n(l) = n(l − 1) − m + 1.

Pooling layer

Also called aggregation layer, subsampling layer

Pooling is feature selection, which reduces the number of features, thereby reducing the number of parameters, reducing feature dimensions, and reducing the space in the height and length directions.

Commonly used aggregation functions

Maximum pooling (Max): Generally, the maximum value of all neurons in a region is taken.

Average aggregation (Mean): Generally, the average value of all neurons in the area is taken.

A typical pooling layer divides each feature map into non-overlapping regions of 2×2 size, and then uses maximum pooling for downsampling.
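
The typical case just described can be sketched with a reshape. The function assumes a single 2-D feature map whose height and width are even; the function name and the example values are illustrative.

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling over non-overlapping 2x2 regions of a 2-D feature map."""
    H, W = x.shape
    assert H % 2 == 0 and W % 2 == 0, "height and width must be even"
    # Split rows and columns into (block index, position inside block),
    # then take the maximum inside each block.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 1, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 4, 0, 1]])
pooled = max_pool_2x2(x)   # shape (2, 2)
```

There is nothing to learn in this operation, and applying it per channel leaves the number of channels unchanged.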

The pooling layer can also be regarded as a special convolutional layer

In some early convolutional networks (such as LeNet-5), a nonlinear activation function was sometimes applied in the pooling layer:

$$Y'^{d} = f\left(w^{d} \, Y^{d} + b^{d}\right)$$

where Y′(d) is the output of the pooling layer, Y(d) is the result of the pooling operation on the d-th feature map, f(·) is the nonlinear activation function, and w(d) and b(d) are a learnable scalar weight and bias.

Characteristics of the pooling layer

There are no parameters to learn

The number of channels does not change

Robust to small position changes (robustness)

parameter learning

Calculation of error terms

Visualization of CNN

Visualization of layer 1 weights

The filters before learning are randomly initialized, so their black-and-white shading shows no pattern, but after learning they become regular images. Through learning, the filters are updated into structured filters, such as filters that change gradually from white to black, filters that contain blocky areas (called blobs), and filters that respond to horizontal and vertical edges.

It can be seen that the filters of the convolutional layer will extract original information such as edges or patches. The CNN just implemented will pass this raw information to subsequent layers.

Information extraction based on hierarchical structure

Information extracted by the convolutional layers of a CNN: the neurons in layer 1 respond to edges or blobs, layer 3 responds to textures, layer 5 responds to object parts, and the final fully connected layer responds to the category of the object (dog or car).

If multiple convolutional layers are stacked, as the layers deepen, the extracted information becomes more complex and abstract. This is a very interesting part of deep learning. As the layers deepen, neurons change from simple shapes to "high-level" information. In other words, just as we understand the "meaning" of things, the objects of response gradually change.

Typical convolutional neural network

LeNet-5

LeNet was proposed in 1998 as a network for handwritten digit recognition. It has consecutive convolutional layers and pooling layers, and finally outputs the results through a fully connected layer.

Excluding the input layer, LeNet-5 has a total of 7 layers.

Input layer: The input image size is 32 × 32 = 1024.

C1, convolutional layer: Using six 5 × 5 filters, six feature maps of size 28 × 28 = 784 are obtained. Therefore, the number of neurons in layer C1 is 6 × 784 = 4,704, the number of trainable parameters is 6 × 25 + 6 = 156, and the number of connections is 156 × 784 = 122,304 (including biases, the same below).

S2, pooling layer: The sampling window is 2 × 2, average pooling is used, and a nonlinear function is applied. The number of neurons is 6 × 14 × 14 = 1,176, the number of trainable parameters is 6 × (1 + 1) = 12, and the number of connections is 6 × 196 × (4 + 1) = 5,880.

C3, convolutional layer: A connection table is used in LeNet-5 to define the dependency between the input and output feature maps. A total of 60 filters of size 5 × 5 are used to obtain 16 feature maps of size 10 × 10. The number of neurons is 16 × 100 = 1,600, the number of trainable parameters is (60 × 25) + 16 = 1,516, and the number of connections is 100 × 1,516 = 151,600.

S4, pooling layer: The sampling window is 2 × 2, and 16 feature maps of size 5 × 5 are obtained. The number of trainable parameters is 16 × 2 = 32, and the number of connections is 16 × 25 × (4 + 1) = 2,000.

C5, convolutional layer: Using 120 × 16 = 1,920 filters of size 5 × 5, 120 feature maps of size 1 × 1 are obtained. The number of neurons in layer C5 is 120, the number of trainable parameters is 1,920 × 25 + 120 = 48,120, and the number of connections is 120 × (16 × 25 + 1) = 48,120.

F6, fully connected layer: It has 84 neurons, and the number of trainable parameters is 84 × (120 + 1) = 10,164. The number of connections equals the number of trainable parameters, 10,164.

Output layer: The output layer consists of 10 Euclidean radial basis functions
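
The layer-by-layer counts above are plain arithmetic and can be re-derived in a few lines. The variable names (c1, s2, c3, s4, c5, f6) follow the usual LeNet-5 layer labels and are used here only to organize the calculation.

```python
# C1: six 5x5 filters on a 32x32 input -> six 28x28 maps.
c1_params = 6 * 25 + 6                    # weights + one bias per filter
c1_connections = c1_params * 28 * 28

# S2: 2x2 average pooling with one weight and one bias per map -> six 14x14 maps.
s2_params = 6 * (1 + 1)
s2_connections = 6 * 14 * 14 * (4 + 1)

# C3: 60 5x5 filters (via the connection table) -> sixteen 10x10 maps.
c3_params = 60 * 25 + 16
c3_connections = 10 * 10 * c3_params

# S4: 2x2 pooling -> sixteen 5x5 maps.
s4_params = 16 * 2
s4_connections = 16 * 5 * 5 * (4 + 1)

# C5: 120 x 16 filters of size 5x5 -> 120 maps of size 1x1.
c5_params = 1920 * 25 + 120
c5_connections = 120 * (16 * 25 + 1)

# F6: fully connected, 120 -> 84.
f6_params = 84 * (120 + 1)

total_params = c1_params + s2_params + c3_params + s4_params + c5_params + f6_params
```

Note how the convolutional layers have far more connections than parameters, a direct effect of weight sharing, while for C5 and F6 the two numbers coincide.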

Connection table

A fully connected relationship between the input and output feature maps of a convolutional layer is not necessary; each output feature map can depend on only some of the input feature maps

Define a connection table (link table) T to describe the connection relationship between the input and output feature maps.

If the p-th output feature map depends on the d-th input feature map, then T(p, d) = 1; otherwise T(p, d) = 0.

AlexNet

It was proposed in 2012 and uses many technical methods of modern deep convolutional networks.

Parallel training using GPUs

The activation function uses ReLU

Use Dropout to prevent overfitting

Use data augmentation to improve model accuracy

Use LRN (Local Response Normalization) layer for local normalization

Inception network

Inception module: A convolutional layer contains multiple convolution operations of different sizes

The Inception network is stacked by multiple inception modules and a small number of aggregation layers.

V1 version

The earliest version of the Inception network, v1, is the well-known GoogLeNet [Szegedy et al., 2015], which won the 2014 ImageNet image classification competition.

Residual network ResNet

Information propagation efficiency is improved by adding direct (shortcut) connections around the nonlinear convolutional layers.

nonlinear elements

Can be one or more convolutional layers

Let this nonlinear unit f(x, θ) approximate an objective function h(x)

A nonlinear unit composed of a neural network is expressive enough to approximate either the original objective function or the residual function, but in practice the latter is easier to learn

Conclusion: Let the nonlinear unit f(x, θ) approximate the residual function h(x) − x, and use f(x, θ) + x to approximate h(x).

The residual network is a very deep network composed of many residual units connected in series.
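
The idea of a residual unit can be sketched in NumPy. For brevity the nonlinear unit f(x, θ) below is built from two dense matrices instead of convolutional layers; this substitution, the shapes, and the function names are assumptions of the example, not how ResNet itself is defined.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_unit(x, W1, W2):
    """Return x + f(x, theta), where f is a small two-layer nonlinear unit."""
    f = relu(x @ W1) @ W2     # stand-in for one or more convolutional layers
    return x + f              # direct edge: the input is added back to the output

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))

# With all-zero weights f(x, theta) = 0, so the unit reduces to the identity.
y_identity = residual_unit(x, np.zeros((4, 4)), np.zeros((4, 4)))

# With non-zero weights the unit only has to model the difference h(x) - x.
W1 = 0.1 * rng.normal(size=(4, 4))
W2 = 0.1 * rng.normal(size=(4, 4))
y = residual_unit(x, W1, W2)
```

Because the identity mapping is available "for free", stacking many such units does not force every layer to relearn how to pass its input through unchanged.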

Other convolution methods

transposed convolution

Atrous convolution (dilated convolution)

deep learning

Deepen the network

Heading towards a deeper network

This network refers to VGG which will be introduced in the next section.

Convolutional layer based on 3×3 small filters

The activation function is ReLU

The Dropout layer is used behind the fully connected layer.

Adam-based optimization

Use He initialization for the initial weight values

Recognition accuracy is 99.38%

Further improve recognition accuracy

Ensemble learning

learning rate decay

Data Augmentation

Increase the number of images by applying small changes such as rotation, vertical or horizontal movement, cropping, flipping, increasing brightness, etc.

Deeper motivation

Improve recognition performance

The importance of deepening can be seen from the results of large-scale image recognition competitions represented by ILSVRC. The results of this competition show that the top methods recently are mostly based on deep learning and have a tendency to gradually deepen the layers of the network. In other words, it can be seen that the deeper the layer, the higher the recognition performance.

Reduce the number of parameters of the network

The advantage of stacking small filters to deepen the network is that it reduces the number of parameters while expanding the receptive field (the local spatial region whose changes affect a neuron). Moreover, by stacking layers, activation functions such as ReLU are sandwiched between the convolutional layers, further improving the expressiveness of the network. This is because "nonlinear" expressiveness based on activation functions is added to the network. Through the superposition of nonlinear functions, more complex things can be expressed.
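
The parameter saving can be made concrete with a short calculation. The sketch assumes stride 1, no dilation, and counts weights per input-output channel pair while ignoring biases; the function names are introduced only for this example.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked kernel x kernel convolutions (stride 1)."""
    return layers * (kernel - 1) + 1

def stacked_weights(kernel, layers):
    """Weights per channel pair for `layers` stacked kernel x kernel filters."""
    return layers * kernel * kernel

# Two 3x3 layers see the same 5x5 region as one 5x5 filter.
rf_two = stacked_receptive_field(3, 2)
weights_two_3x3 = stacked_weights(3, 2)
weights_one_5x5 = stacked_weights(5, 1)

# Three 3x3 layers see the same 7x7 region as one 7x7 filter.
rf_three = stacked_receptive_field(3, 3)
weights_three_3x3 = stacked_weights(3, 3)
weights_one_7x7 = stacked_weights(7, 1)
```

The gap widens as the covered region grows: 18 versus 25 weights for a 5 × 5 region, and 27 versus 49 for a 7 × 7 region.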

Make learning more efficient

Compared with a shallower network, a deeper network can learn efficiently from less training data.

Deep learning network structure

ILSVRC competition

ImageNet contains a variety of images, and each image is associated with a label (category name). The ILSVRC Image Recognition Competition using this huge dataset is held every year.

The large-scale image recognition competition ILSVRC was held in 2012. In that year's competition, the method based on deep learning (commonly known as AlexNet) won overwhelmingly, completely subverting previous image recognition methods. In subsequent competitions, deep learning has been active at the center of the stage.

In particular, 2015's ResNet (a deep network with more than 150 layers) reduced the false recognition rate to 3.5%. It is said that this result even exceeds the recognition ability of ordinary people.

VGG

VGG is a basic CNN composed of convolutional layers and pooling layers. Its distinguishing feature is depth: it stacks weighted layers (convolutional or fully connected layers) 16 or 19 deep, and is called "VGG16" or "VGG19" accordingly.

GoogLeNet

The network not only has depth vertically, but also breadth horizontally, which is called the Inception structure.

ResNet

Has a deeper structure than previous networks

We already know that greater depth is important for improving performance. However, in deep learning, if the network is made too deep, learning often does not proceed smoothly and final performance suffers. To solve this problem, ResNet introduces a "shortcut structure" (also called a "shortcut" or "path"). With this shortcut structure, performance can keep improving as layers are added (although there is still a limit to how deep the network can go).

In practice, weights learned on the huge ImageNet data set are often reused. This is called transfer learning: the learned weights (or part of them) are copied to another neural network, which is then trained again (fine-tuning). For example, prepare a network with the same structure as VGG, use the learned weights as initial values, and re-train on the new data set. Transfer learning is very effective when the data set at hand is small.

Speeding up deep learning

Problems that need to be solved

The time ratio of each layer in AlexNet's forward processing: the left side is when using a GPU, and the right side is when using a CPU. In the figure, "conv" corresponds to the convolutional layer, "pool" to the pooling layer, "fc" to the fully connected layer, and "norm" to the normalization layer.

Convolutional layers account for 95% of the total processing time on the GPU and 89% on the CPU.

GPU-based speedup

GPUs are mainly provided by two companies, NVIDIA and AMD. Although both GPUs can be used for general numerical calculations, NVIDIA's GPU is more "close" to deep learning. In fact, most deep learning frameworks only benefit from NVIDIA's GPUs. This is because the deep learning framework uses CUDA, a comprehensive development environment for GPU computing provided by NVIDIA.

Distributed learning

Distributed computing on multiple GPUs or multiple machines

Google's TensorFlow and Microsoft's CNTK attach great importance to distributed learning during the development process

The horizontal axis is the number of GPUs. The vertical axis is the speedup rate compared to a single GPU.

Digit reduction of arithmetic precision

Regarding numerical precision (the number of digits used to represent a value), we already know that deep learning does not need high precision. This is an important property of neural networks, and it follows from their robustness.

In the future, half-precision floating point numbers will be used as a standard, and it is expected to achieve speeds up to approximately 2 times that of the previous generation of GPUs.

Deep learning application cases

Object detection

Determine the type of object and the location of the object from the image

Among the methods of using CNN for object detection, there is a method called R-CNN

Image segmentation

Classify images at the pixel level

FCN classifies all pixels through one forward process.

FCN literally means "a network composed entirely of convolutional layers". Compared with the general CNN containing fully connected layers, FCN replaces the fully connected layers with convolutional layers that play the same role.

Image caption generation

A representative method for generating image captions based on deep learning is called NIC

NIC is composed of deep CNN and RNN (Recurrent Neural Network) that processes natural language.

The future of deep learning

Image style transformation

Image generation

Autopilot

reinforcement learning

Distributed representation of natural language and words

Marty: “This is heavy.” Dr. Brown: “In the future, things are so heavy?” —The movie "Back to the Future"

What is natural language processing

Our language is made of words, and the meaning of language is made of words. In other words, a word is the smallest unit of meaning.

Three ways to get computers to understand the meaning of words

Thesaurus-based approach

count-based approach

Inference-based approach (word2vec)

Thesaurus-based approach

Consider manually defining word meanings

The synonym dictionary (thesaurus) is currently in wide use

A diagram based on the hypernym-hyponym relationship according to the meaning of each word

WordNet

The most famous synonym dictionary

Uses

Get synonyms for a word

Calculate similarity between words

Used via the NLTK module

Problems

New words continue to appear, making it difficult to adapt to changes in the times

High labor cost

Unable to express subtle differences in words

count-based approach

corpus

A corpus is a large amount of text data

Corpora used in the field of natural language processing sometimes add additional information to text data. For example, each word of the text data can be marked with a part-of-speech. Here, it is assumed that the corpus we use has no tags added.

Python-based corpus preprocessing

famous corpora

Wikipedia and Google News

preprocessing

Uppercase -> Lowercase

text.lower()

Process punctuation

text.replace('.', ' .')

re.split(r'(\W+)', text)

\W: Matches non-word characters (not letters, numbers, or underscores)

+: Indicates matching the previous pattern "\W" repeated one or more times

Create word IDs and correspondence tables

Convert a list of words into a list of word IDs

corpus = [word_to_id[w] for w in words]
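The preprocessing steps above can be combined into one function. This is a minimal sketch: the function name `preprocess` and the choice to tokenize with `re.split` (rather than `text.replace` followed by `split`) are assumptions made for illustration.

```python
import re


def preprocess(text):
    """Lowercase the text, split off punctuation, and map each word to an integer ID."""
    text = text.lower()
    # Split on runs of non-word characters; the capture group keeps punctuation as tokens
    tokens = re.split(r'(\W+)', text)
    # Drop pure-whitespace tokens and strip spaces around punctuation
    words = [t.strip() for t in tokens if t.strip()]

    word_to_id, id_to_word = {}, {}
    for w in words:
        if w not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[w] = new_id
            id_to_word[new_id] = w

    corpus = [word_to_id[w] for w in words]
    return corpus, word_to_id, id_to_word


corpus, word_to_id, id_to_word = preprocess('You say goodbye and I say hello.')
print(corpus)
print(id_to_word)
```

IDs are assigned in order of first appearance, so the repeated word "say" maps to the same ID both times.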

Distributed representation of words

Construct compact and reasonable vector representations in the word domain

Distribution hypothesis

The meaning of a word is formed by the words surrounding it

Context refers to the words surrounding a centered word

The size of the context is called the window size

The window size is 1 and the context contains 1 word on the left and right

co-occurrence matrix

The simplest way to represent a word as a vector is to count how many times each other word appears around it.

text = 'You say goodbye and I say hello.'

Set window size to 1
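Counting the neighbors of every word yields the co-occurrence matrix. The sketch below assumes word IDs were assigned in order of first appearance (you=0, say=1, goodbye=2, and=3, i=4, hello=5, '.'=6); the function name `create_co_matrix` is illustrative.

```python
import numpy as np


def create_co_matrix(corpus, vocab_size, window_size=1):
    """For every word ID, count how often each other word ID appears within the window."""
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        for offset in range(1, window_size + 1):
            left, right = idx - offset, idx + offset
            if left >= 0:
                co_matrix[word_id, corpus[left]] += 1
            if right < len(corpus):
                co_matrix[word_id, corpus[right]] += 1
    return co_matrix


# 'you say goodbye and i say hello .' as word IDs (assumed ID assignment)
corpus = [0, 1, 2, 3, 4, 1, 5, 6]
C = create_co_matrix(corpus, vocab_size=7, window_size=1)
print(C)
```

Each row of `C` is the count-based vector of one word; the row for "say" has a 1 in the columns for "you", "goodbye", "i" and "hello".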

similarity between vectors

cosine similarity

Sorting of similar words

Get the word vector of the query word

Obtain the cosine similarity between the word vector of the query word and all other word vectors respectively.

Results based on cosine similarity, showing their values in descending order
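The ranking procedure above can be sketched as follows. The function names, the small `eps` guard against division by zero, and the hard-coded co-occurrence matrix for 'You say goodbye and I say hello.' (window size 1) are assumptions for illustration.

```python
import numpy as np


def cos_similarity(x, y, eps=1e-8):
    """Cosine similarity; eps avoids division by zero for all-zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return float(np.dot(nx, ny))


def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    """Return (word, similarity) pairs for the words closest to `query`, best first."""
    query_vec = word_matrix[word_to_id[query]]
    sims = np.array([cos_similarity(word_matrix[i], query_vec)
                     for i in range(len(id_to_word))])
    ranked = [i for i in (-sims).argsort() if id_to_word[i] != query]
    return [(id_to_word[i], float(sims[i])) for i in ranked[:top]]


words = ['you', 'say', 'goodbye', 'and', 'i', 'hello', '.']
word_to_id = {w: i for i, w in enumerate(words)}
id_to_word = {i: w for i, w in enumerate(words)}
C = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
])

for word, sim in most_similar('you', word_to_id, id_to_word, C, top=3):
    print(word, round(sim, 4))
```

With such a tiny corpus the three nearest words to "you" are tied at about 0.71, so their relative order is arbitrary; a larger corpus is needed for meaningful rankings.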

Improvements in count-based methods

Pointwise mutual information

In the co-occurrence matrix, common words like the will be considered to have a strong correlation with nouns like car

PMI

PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), where P(x) represents the probability of x occurring, P(y) represents the probability of y occurring, and P(x, y) represents the probability of x and y occurring simultaneously.

PMI based on co-occurrence matrix

Shortcoming

When the number of co-occurrences of two words is 0, log2(0) = −∞

Positive pointwise mutual information (PPMI)

Get the PPMI matrix based on the co-occurrence matrix
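One way to obtain the PPMI matrix is to estimate the probabilities from co-occurrence counts, so that PMI(x, y) ≈ log2( C(x, y) · N / (C(x) C(y)) ), and then clip negative values to 0. This is a vectorized sketch under that approximation; the function name `ppmi`, the `eps` constant, and the hard-coded matrix for 'You say goodbye and I say hello.' are assumptions.

```python
import numpy as np


def ppmi(C, eps=1e-8):
    """Positive PMI computed from a co-occurrence matrix C."""
    C = C.astype(np.float64)
    N = C.sum()                                  # total number of co-occurrences
    S = C.sum(axis=0)                            # co-occurrence total per word
    denom = np.maximum(np.outer(S, S), eps)      # guard against words with no neighbors
    pmi = np.log2(C * N / denom + eps)           # eps avoids log2(0) = -inf
    return np.maximum(0.0, pmi)


C = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
])
W = ppmi(C)
print(np.round(W, 3))
```

Pairs that never co-occur end up at 0 rather than −∞, and pairs involving a very frequent word such as "say" receive a lower score than their raw counts would suggest.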

Dimensionality reduction

We need to observe the distribution of data and find important "axes"

Singular value decomposition (SVD)

SVD decomposes any matrix X into the product of 3 matrices: X = U S Vᵀ

where U and V are orthogonal matrices whose column vectors are orthogonal to each other, and S is a diagonal matrix in which all but the diagonal elements are 0

The original matrix can be approximated by removing the redundant column vectors in the matrix U

U, S, V = np.linalg.svd(W)

If the matrix size is N*N, the computational complexity of SVD will reach O(N^3). Therefore, faster methods such as Truncated SVD are often used. Truncated SVD achieves high speed by truncating the parts with smaller singular values.

from sklearn.utils.extmath import randomized_svd
U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)

PTB dataset

The PTB corpus is often used as a benchmark for evaluating proposed methods

Preprocessing of PTB corpus

Replace rare words with the special token <unk>

Replace specific numbers with "N"

The preprocessing I did

Concatenate all the sentences and treat it as one big time series data. At this time, a special character <eos> is inserted at the end of each sentence

Hyperparameter assignment

window_size = 2

wordvec_size = 100

Evaluation based on PTB data set

For the query word you, you can see that personal pronouns such as i and we are ranked first. These are words with the same usage in grammar.

The query word year has synonyms such as month and quarter.

The query word car has synonyms such as auto and vehicle.

When using toyota as the query term, car manufacturer names or brand names such as nissan, honda and lexus appeared.

Summary

Use the corpus to count the words that appear in each word's context, convert the counts into a PPMI matrix, and then obtain good word vectors by reducing the dimensionality with SVD.

word2vec

“If you don’t have a basis for judgment, don’t reason.” ——Arthur Conan Doyle, "A Scandal in Bohemia"

word embedding

Word2Vec is an algorithm for generating "word embeddings"

In addition to Word2Vec, there are other methods for generating word embeddings, such as GloVe (Global Vectors for Word Representation), FastText, etc. These methods may use different strategies and algorithms, but they all aim to effectively capture the semantic information of words in vector space.

Inference-based methods and neural networks

Problems with count-based methods

In the real world, corpora deal with a very large number of words. For example, it is said that the English vocabulary has over 1 million words. If the vocabulary size exceeds 1 million, then using the count-based method requires generating a huge matrix of 1 million × 1 million, but it is obviously unrealistic to perform SVD on such a large matrix.

Inference-based approaches using neural networks

Learning on mini-batch data. That is, using part of the data to learn and repeatedly updating the weights.

The learning of neural networks can be performed in parallel using multiple machines and multiple GPUs, thus accelerating the entire learning process.

Summary of inference-based methods

Target

Predict what words will come in the middle when given the surrounding words (context), like a cloze

reasoning method

Input context, and the model outputs the occurrence probability of each word.

As a product of model learning, we will get a distributed representation of the word

How to process words in neural networks

Convert words to vectors

Neural networks cannot directly process words like you or say. To use neural networks to process words, you need to first convert the words into fixed-length vectors.

Conversion method

one-hot vector

Only one element is 1, the other elements are 0

Neural Networks

input layer

Fully connected layer

The initial weights are random

Simple word2vec

Inference of CBOW model

structure

Features

There are two input layers

The transformation from the input layer to the intermediate layer is completed by the same fully connected layer (weight W(in))

The transformation from the intermediate layer to the output layer neurons is completed by another fully connected layer (weight W(out))

The neurons in the middle layer are the "average" of the values obtained by each input layer after being transformed by the fully connected layer.

The neurons in the output layer hold the score of each word; the larger the value, the higher the occurrence probability of the corresponding word.

CBOW model learning

Convert scores into probabilities using the Softmax function

Find the cross-entropy error between these probabilities and the supervised labels

Train the network using this error as the loss

Weights and distributed representations in word2vec

The weight W(in) is the distributed representation of words that we want
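The inference and loss computation of the CBOW model described above can be sketched in a few lines of NumPy. This is not the book's layer-based implementation: the weight scale 0.01, the variable names, and the two hand-picked training examples are assumptions, and backpropagation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 7, 3

# Small random initial weights (the 0.01 scale is an assumption)
W_in = 0.01 * rng.standard_normal((vocab_size, hidden_size))
W_out = 0.01 * rng.standard_normal((hidden_size, vocab_size))


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def cbow_loss(contexts, target):
    """contexts: (N, 2, V) one-hot, target: (N, V) one-hot; returns mean cross-entropy."""
    h0 = contexts[:, 0] @ W_in           # both inputs share the same weight W_in
    h1 = contexts[:, 1] @ W_in
    h = 0.5 * (h0 + h1)                  # intermediate layer is the average
    probs = softmax(h @ W_out)           # scores converted to probabilities
    return float(-np.mean(np.sum(target * np.log(probs + 1e-7), axis=1)))


eye = np.eye(vocab_size)
contexts = eye[np.array([[0, 2], [1, 3]])]   # contexts (you, goodbye) and (say, and)
target = eye[np.array([1, 2])]               # targets say and goodbye
loss = cbow_loss(contexts, target)
print(loss)
```

Before any training the predictions are nearly uniform, so the loss is close to log(7); each row of `W_in` is the (still untrained) vector of one word.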

Preparing training data

context and target words

Convert to one-hot representation
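The two preparation steps can be sketched as follows, assuming the corpus is already a list of word IDs (you=0, say=1, goodbye=2, and=3, i=4, hello=5, '.'=6). The function names are illustrative.

```python
import numpy as np


def create_contexts_target(corpus, window_size=1):
    """Build (contexts, target) pairs: each target word with its surrounding words."""
    target = corpus[window_size:-window_size]
    contexts = []
    for idx in range(window_size, len(corpus) - window_size):
        cs = [corpus[idx + t]
              for t in range(-window_size, window_size + 1) if t != 0]
        contexts.append(cs)
    return np.array(contexts), np.array(target)


def convert_one_hot(ids, vocab_size):
    """Turn an integer array of any shape into one-hot vectors along a new last axis."""
    return np.eye(vocab_size, dtype=np.int32)[ids]


corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, target = create_contexts_target(corpus, window_size=1)
contexts_oh = convert_one_hot(contexts, vocab_size=7)
target_oh = convert_one_hot(target, vocab_size=7)
print(contexts.shape, target.shape, contexts_oh.shape, target_oh.shape)
```

The first and last words of the corpus have no complete context, so an 8-word corpus with window size 1 yields 6 training pairs.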

Implementation of CBOW model

Additional information

CBOW models and probability

P(wt | wt−1, wt+1): the probability that wt occurs given that wt−1 and wt+1 have occurred.

Loss function L (negative log likelihood) of the CBOW model: L = −log P(wt | wt−1, wt+1)

skip-gram model

word2vec has two models

CBOW

skip-gram

Skip-gram is a model that inverts the context and target words processed by the CBOW model.

skip-gram network structure diagram

skip-gram models and probability

The skip-gram model has only one input layer, and the number of output layers is equal to the number of words in the context. First, the losses of each output layer must be calculated separately and then added together as the final loss.

Predict the context wt−1 and wt+1 based on the middle word (target word) wt

The loss function of the skip-gram model can be expressed as L = −log P(wt−1, wt+1 | wt) = −(log P(wt−1 | wt) + log P(wt+1 | wt))

Loss function comparison

The number of predictions of the skip-gram model is as many as the number of context words, so its loss function requires the sum of the losses corresponding to each context word. The CBOW model only requires the loss of the target word.

Judging from the accuracy of the distributed representation of words, the skip-gram model gives better results in most cases.

Counting-based vs. inference-based

Scenarios where new words need to be added to the vocabulary and the distributed representation of the words updated

Count-based methods require calculations from scratch

Inference-based methods allow for incremental learning of parameters

Properties of distributed representations of words

Count-based methods mainly encode the similarity of words

Inference-based methods can understand complex patterns between words

king − man + woman = queen

Accuracy of distributed representations of words

Inference-based methods and count-based methods are on par; neither is clearly superior

Speeding up word2vec

Don't try to know everything, or you will know nothing. ——Democritus (ancient Greek philosopher)

Improve

study

other

Application of word2vec

The distributed representation of words obtained using word2vec can be used to find approximate words

transfer learning

Knowledge learned in one field can be applied to other fields

When solving natural language processing tasks, word2vec is generally not used to learn the distributed representation of words from scratch. Instead, the representation is first learned on a large-scale corpus (text data such as Wikipedia or Google News), and the learned distributed representation is then applied to the individual task.

In natural language processing tasks such as text classification, text clustering, part-of-speech tagging, and sentiment analysis, the first step of word vectorization can use the distributed representation of learned words.

Distributed representations of words work great in almost all types of natural language processing tasks!

Using distributed representations of words, it is also possible to convert documents (sequences of words) into fixed-length vectors.

If you can convert natural language into vectors, you can use many machine learning methods

How to evaluate word vectors

Evaluate using a manually created word similarity evaluation set

The similarity between cat and animal is 8, and the similarity between cat and car is 2... Similar to this, the similarity between words is manually scored with a score from 0 to 10.

Compare the scores given by people and the cosine similarity given by word2vec to examine the correlation between them

Conclusions

Different models have different accuracies (choose the best model based on the corpus)

The bigger the corpus, the better the results (big data is always needed)

The dimensionality of word vectors must be moderate (too large will lead to poor accuracy)

RNN

I just remember me meowing and crying in a dark and humid place. ——Natsume Soseki's "I am a Cat"

Probability and Language Models

A simple feedforward network cannot fully learn the properties of time series data. As a result, RNN (Recurrent Neural Network) came into being.

word2vec from a probabilistic perspective

Can the original purpose of the CBOW model "predict target words from context" be used for something? Can P(wt|wt−2, wt−1) play a role in some practical scenarios?

The windows we considered before were all symmetrical; from here on we consider only the left window.

language model

Use probability to evaluate the likelihood that a sequence of words will occur, that is, the extent to which a sequence of words is natural.

Probability representation

P(w1, ..., wm) = ∏t P(wt | w1, ..., wt−1), which follows from repeatedly applying the product rule P(A,B) = P(A|B)*P(B) = P(B|A)*P(A)

Using the CBOW model as a language model

Markov chain

When the probability of an event depends only on the N events preceding it, it is called an "N-order Markov chain."

Limiting the context to the 2 words on the left is a second-order Markov chain

Shortcomings

If the window is too short, context further back cannot be taken into account

If the window is too long, the order of words in the context will be ignored.

CBOW is short for Continuous Bag-Of-Words. Bag-Of-Words means "a bag of words," which means that the order of the words in the bag is ignored.

RNN has a mechanism that can remember context information no matter how long the context is. Therefore, time series data of arbitrary length can be processed using RNN.

word2vec is a method aimed at obtaining distributed representation of words, and is generally not used in language models.

RNN

recurrent neural network

The structure of the RNN layer

The input at time t is xt, which implies that time series data (x0, x1, ··· , xt, ···) will be input into the layer. Then, in the form corresponding to the input, output (h0, h1, ··· , ht, ···)

The output has two forks, which means the same thing was copied. A fork in the output will become its own input.

unroll loop

We use the word "moment" to refer to the index of time series data (that is, the input data at time t is xt). Expressions such as "the t-th word" and "the t-th RNN layer" are used, as well as expressions such as "the word at time t" or "the RNN layer at time t".

The RNN layer at each moment receives two values, which are the input passed to this layer and the output of the previous RNN layer.

RNN has two weights, namely the weight Wx that converts the input x into the output h and the weight Wh that converts the output of the previous RNN layer into the output at the current moment. Additionally, there is a bias b. The output is computed as ht = tanh(h(t−1) Wh + xt Wx + b), where h(t−1) and xt are row vectors.

From another perspective, RNN has a state h, which is continuously updated through the above formula. So h can be called a hidden state or a hidden state vector

The two schematic drawing methods are equivalent
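The unrolled loop can be written directly from the update rule ht = tanh(h(t−1) Wh + xt Wx + b). The sketch below covers forward propagation only; the shapes (N samples, T time steps, D input dimensions, H hidden dimensions) and the random initialization are assumptions for illustration.

```python
import numpy as np


def rnn_step(x, h_prev, Wx, Wh, b):
    """One time step of a simple RNN with row-vector inputs."""
    return np.tanh(h_prev @ Wh + x @ Wx + b)


rng = np.random.default_rng(0)
N, T, D, H = 2, 5, 4, 3
Wx = rng.standard_normal((D, H)) / np.sqrt(D)
Wh = rng.standard_normal((H, H)) / np.sqrt(H)
b = np.zeros(H)

xs = rng.standard_normal((N, T, D))   # time series input (x0, x1, ..., x4)
h = np.zeros((N, H))                  # initial hidden state
hs = []
for t in range(T):
    # The same weights are reused at every time step; only h changes
    h = rnn_step(xs[:, t], h, Wx, Wh, b)
    hs.append(h)
hs = np.stack(hs, axis=1)             # hidden states (h0, ..., h4), shape (N, T, H)
print(hs.shape)
```

The loop makes the "state" view concrete: `h` is overwritten at each step and carries information from all earlier inputs.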

Backpropagation Through Time

time-based backpropagation

To find the gradient based on BPTT, the intermediate data of the RNN layer at each moment must be saved in memory. Therefore, as the time series data gets longer, the computer's memory usage (not just calculations) also increases.

Truncated BPTT

In order to solve the above problem, when processing long time series data, the common practice is to cut the network connection into an appropriate length.

Networks that are too long in the direction of the time axis are truncated at appropriate locations to create multiple small networks, and error backpropagation is then performed on each of the small networks. This method is called Truncated BPTT.

In Truncated BPTT, the backward propagation connection of the network is cut off, but the forward propagation connection is still maintained.

Processing order

The first thing to do is to feed the input data of block 1 (x0, ... , x9) into the RNN layer.

Perform forward propagation first and then back propagation to get the desired gradient.

Next, the input data of the next block (x10, x11, ··· , x19) are fed into the RNN layer. The calculation of this forward propagation requires the last hidden state h9 of the previous block, so that the forward propagation connection can be maintained.

Mini-batch learning of Truncated BPTT

The starting position of the input data needs to be offset for each sample in the batch.

Notes

Feed the data in order

Shift the starting position of each sample in the batch
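Both points can be seen in a small batch generator. This is a sketch under assumed names (`truncated_bptt_batches`, `jump`, `time_idx`): sample i of the batch starts reading at offset i × (data size ÷ batch size), and every call continues where the previous block ended.

```python
import numpy as np


def truncated_bptt_batches(corpus, batch_size, time_size):
    """Yield (xs, ts) blocks of shape (batch_size, time_size), read in order with per-sample offsets."""
    xs_all, ts_all = corpus[:-1], corpus[1:]    # inputs and next-word targets
    data_size = len(xs_all)
    jump = data_size // batch_size
    offsets = [i * jump for i in range(batch_size)]
    time_idx = 0
    while True:
        batch_x = np.empty((batch_size, time_size), dtype=np.int64)
        batch_t = np.empty((batch_size, time_size), dtype=np.int64)
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                pos = (offset + time_idx) % data_size   # wrap around at the end
                batch_x[i, t] = xs_all[pos]
                batch_t[i, t] = ts_all[pos]
            time_idx += 1
        yield batch_x, batch_t


corpus = np.arange(1001)        # stand-in corpus: word IDs 0..1000
gen = truncated_bptt_batches(corpus, batch_size=2, time_size=10)
x1, t1 = next(gen)
x2, t2 = next(gen)
print(x1[:, 0], x2[:, 0])
```

With 1000 inputs and a batch size of 2, the first sample reads from position 0 and the second from position 500; the second block continues at positions 10 and 510, which keeps the forward connection between blocks meaningful.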

Implementation of RNN

Considering learning based on Truncated BPTT, the target neural network receives time series data of length T (T is any value), and these T states can be regarded as a layer

Call a layer that processes T steps at a time a "Time RNN layer"

The layer that performs the single-step processing in the Time RNN layer is called the "RNN layer"

Like Time RNN, layers that process time series data holistically are named starting with the word "Time", which is the naming convention laid out in this book. After that, we will also implement the Time Affine layer, Time Embedding layer, etc.

Implementation of RNN layer

forward propagation

Backpropagation

Implementation of Time RNN layer

forward propagation

Time RNN layer saves the hidden state h in a member variable to inherit the hidden state between blocks

Use the parameter stateful to control whether the hidden state h is carried over between calls. In forward propagation, when stateful is False, the hidden state of the first RNN layer is initialized to the zero matrix.

Backpropagation

We store the gradient flowing to the hidden state at the previous moment in the member variable dh. This is because we will use it when we discuss seq2seq (sequence-to-sequence) in Chapter 7

Implementation of layers for processing time series data

Full picture of RNNLM

RNN-based language models are called RNNLM

structure

Layer 1 is the Embedding layer, which converts word IDs into distributed representations of words (word vectors). This word vector is fed into the RNN layer.

The RNN layer outputs the hidden state to the next layer (top), and also outputs the hidden state to the next RNN layer (right).

The hidden state output by the RNN layer upward passes through the Affine layer and is passed to the Softmax layer.

Sample

you say goodbye and i say hello

The first word, you with word ID 0, is entered. At this time, looking at the probability distribution output by the Softmax layer, we can see that the probability of say is the highest, which indicates that the word that appears after you is correctly predicted to be say.

The second word, say, is entered. At this time, the output of the Softmax layer has a higher probability at goodbye and hello. This is because the RNN layer saves the past information "you say" as a compact hidden state vector. The job of the RNN layer is to pass this information to the Affine layer above and to the RNN layer at the next moment.

Implementation of Time layer

Target neural network structure

Time Affine

The Time Affine layer does not simply use T Affine layers, but uses matrix operations to achieve efficient overall processing.

Time Softmax

The loss is implemented together with Softmax: the Cross Entropy Error layer calculates the cross-entropy error.

T Softmax with Loss layers each calculate the loss, then add them together and average, and the resulting value is used as the final loss.

Learning and evaluation of RNNLM

Implementation of RNNLM

Language model evaluation

When there is a single input datum

Perplexity is often used as an indicator to evaluate the prediction performance of language models. Perplexity = 1 / probability, where the probability is the one assigned to the correct word.

When there are multiple input data

L = −(1/N) Σn Σk tnk log ynk, and perplexity = e^L. Here, it is assumed that the amount of data is N. tn is the correct solution label in the form of a one-hot vector, tnk represents the k-th value of the n-th data, and ynk represents the probability distribution (the output of Softmax in a neural network). L is the loss of the neural network.

The higher the probability, the better; the lower the perplexity, the better.

Perplexity represents the number of options that can be chosen next. If the perplexity is 1.25, it means that the number of candidates for the next word is about 1.
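Perplexity can be computed from the predicted probabilities as the exponential of the average cross-entropy loss. In this sketch the targets are given as word IDs rather than one-hot vectors, which is an assumption made to keep the code short; the function name is illustrative.

```python
import numpy as np


def perplexity(probs, targets):
    """probs: (N, V) predicted distributions; targets: (N,) correct word IDs."""
    n = len(targets)
    correct = probs[np.arange(n), targets]   # probability given to each correct word
    loss = -np.mean(np.log(correct))         # average cross-entropy L
    return float(np.exp(loss))               # perplexity = e^L


# A single prediction giving 0.8 to the correct word: 1 / 0.8, about 1.25
confident = perplexity(np.array([[0.8, 0.1, 0.1]]), np.array([0]))

# A uniform guess over 5 words: about 5 candidates remain
uniform = perplexity(np.full((4, 5), 0.2), np.zeros(4, dtype=int))
print(confident, uniform)
```

The two cases illustrate the "number of candidates" reading: a confident model leaves roughly one candidate, while a uniform guess over 5 words leaves all 5.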

Learning of RNNLM

Trainer class of RNNLM

Encapsulate the above operations into classes

Extended to graph structures

recursive neural network

There are three hidden layers h1, h2 and h3, where h1 is calculated from two inputs x1 and x2, h2 is calculated from two other input layers x3 and x4, and h3 is calculated from two hidden layers h1 and h2.

graph network

Gated RNN

Take off your baggage and travel light. ——Nietzsche

When we say RNN, we refer more to the LSTM layer than the RNN from the previous chapter. When we need to explicitly refer to the RNN from the previous chapter, we say "simple RNN" or "Elman".

Problems with simple RNN

During the learning process, the RNN layer learns dependencies in the time direction by passing "meaningful gradients" back to the past. But the gradient is difficult to control during learning, which can lead to vanishing or exploding gradients.

reason

activation function

tanh

As the graph of its derivative shows, the derivative of tanh is at most 1.0 and gets smaller as x moves away from 0. This means that each time the backpropagated gradient passes through a tanh node, its value becomes smaller. Therefore, if the gradient passes through the tanh function T times, it shrinks T times over.

ReLU

Gradient does not degrade

MatMul (Matrix Product) Node

gradient explosion

As shown in the figure, the size of the gradient increases exponentially with the time step. If a gradient explosion occurs, it will eventually lead to overflow and values such as NaN (Not a Number, non-numeric value). As a result, the learning of the neural network will not work correctly.

vanishing gradient

As shown in the figure, the size of the gradient decreases exponentially with the time step. When the gradient vanishes, it quickly becomes tiny; the weights are then barely updated, and the model cannot learn long-term dependencies.

Reason for change

The matrix Wh is repeatedly multiplied T times. If Wh were a scalar, the problem would be simple: when Wh is greater than 1, the gradient increases exponentially; when Wh is less than 1, the gradient decreases exponentially.

If Wh is a matrix, the singular values of the matrix become the indicator. Simply put, the singular values of a matrix represent the degree of dispersion of the data. Depending on whether the largest singular value is greater than 1, one can predict how the magnitude of the gradient will change.

Countermeasures against exploding gradients

There is an established method to solve gradient explosion, which is called gradient clipping.

It is assumed here that the gradients of all parameters used by the neural network are gathered into one variable, represented by the symbol g, and that the threshold is set to threshold. If the L2 norm ‖g‖ of the gradient is greater than or equal to the threshold, the gradient is corrected as g = (threshold / ‖g‖) g.
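Gradient clipping takes only a few lines. In this sketch the gradients are a list of NumPy arrays modified in place; the function name `clip_grads` and the small constant `1e-6` added to the denominator are assumptions.

```python
import numpy as np


def clip_grads(grads, max_norm):
    """Rescale all gradients in place if their combined L2 norm reaches max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm >= max_norm:
        rate = max_norm / (total_norm + 1e-6)
        for g in grads:
            g *= rate          # in-place update keeps the caller's arrays
    return float(total_norm)


grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
norm_before = clip_grads(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
print(norm_before, norm_after)
```

The norm is taken over all parameters together, so clipping shrinks the gradient's length without changing its direction.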

Vanishing gradients and LSTM

In RNN learning, vanishing gradients are also a big problem. To solve it, the structure of the RNN layer needs to be changed fundamentally. This is where Gated RNN, the topic of this chapter, comes in. Many Gated RNN architectures (network structures) have been proposed, of which LSTM and GRU are the representative ones.

LSTM interface

LSTM is the abbreviation of Long Short-Term Memory, which means that it can maintain short-term memory (Short-Term Memory) for a long time.

First express the calculation of tanh(h(t−1)*Wh + xt*Wx + b) as a rectangular node tanh (h(t−1) and xt are row vectors)

Let’s compare the interface (input and output) of LSTM and RNN

The difference between the interface of LSTM and RNN is that LSTM also has path c. This c is called a memory unit (or simply "unit"), and it serves as the dedicated memory store of the LSTM.

The characteristic of the memory unit is that it only receives and transmits data within the LSTM layer. That is to say, from the side receiving the output of the LSTM, the output of the LSTM only has the hidden state vector h. Memory unit c is invisible to the outside world, and we don't even need to consider its existence.

The structure of the LSTM layer

ct stores the memory of the LSTM at time t, which can be considered to contain all necessary information from the past up to time t. Based on this ct, which carries the memory, the hidden state ht is output.

calculate

The current memory unit ct is calculated from the three inputs c(t−1), h(t−1) and xt through "some kind of calculation" (described later).

The hidden state ht is calculated using the updated ct, the formula is ht = tanh(ct)

Gate

The degree to which a gate is open is also learned automatically from the data. It is represented by a real number from 0.0 to 1.0 (1.0 is fully open)

output gate

The hidden state ht only applies the tanh function to the memory unit ct, and we consider applying gates to tanh(ct). Since this gate manages the output of the next hidden state ht, it is called an output gate.

The output gate is calculated as follows, where σ() denotes the sigmoid function: o = σ(xt Wx(o) + h(t−1) Wh(o) + b(o))

ht is calculated from the product of o and tanh(ct): ht = o ⊙ tanh(ct). The product here is the element-wise product, that is, the product of the corresponding elements. It is also called the Hadamard product.

forget gate

Now, we add a gate that forgets unnecessary memories to the memory unit c(t−1); it is called the forget gate here.

The forget gate is calculated as follows: f = σ(xt Wx(f) + h(t−1) Wh(f) + b(f))

ct is obtained as the element-wise product of this f and the previous memory unit c(t−1): ct = f ⊙ c(t−1)

new memory unit

Now we also want to add some new information to this memory unit that should be remembered, for this we add a new tanh node

The result calculated by the tanh node, g = tanh(xt Wx(g) + h(t−1) Wh(g) + b(g)), is added to the memory unit c(t−1) from the previous moment.

The role of this tanh node is not to gate, but to add new information to the memory unit. Therefore, it does not use the sigmoid function as the activation function, but uses the tanh function.

input gate

We add a gate to g in Figure 6-17. This newly added gate is called the input gate here.

The input gate determines the value of each element of the new information g. Input gates do not add new information without consideration; rather, they make choices about which information to add. In other words, the input gate adds weighted new information.

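The input gate is calculated as follows: i = σ(xt Wx(i) + h(t−1) Wh(i) + b(i)). The element-wise product of i and g is then added to the memory unit: ct = f ⊙ c(t−1) + g ⊙ i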

LSTM gradient flow

Backpropagation of the memory unit only flows through the "+" and "×" nodes. The "+" node passes the gradient from upstream through as it is, so the gradient does not change (degrade). The calculation of the "×" node is not a matrix product but the product of the corresponding elements (Hadamard product), which does not cause the gradient to vanish or explode.

Implementation of LSTM

The affine transformations of the form x*Wx + h*Wh + b can be integrated into one formula

Matrix libraries are generally faster when computing "large matrices" and the source code is cleaner by managing the weights together.
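
As an illustration of the integrated affine transformation, here is a minimal NumPy sketch of one LSTM time step. The function name `lstm_step`, the slice order (f, g, i, o) and the toy shapes are assumptions made for this example, not a fixed convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step with the four affine transformations fused.

    Shapes (N = batch, D = input size, H = hidden size):
    x: (N, D), h_prev and c_prev: (N, H),
    Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
    """
    H = h_prev.shape[1]
    A = x @ Wx + h_prev @ Wh + b      # a single affine transformation
    f = sigmoid(A[:, :H])             # forget gate
    g = np.tanh(A[:, H:2 * H])        # new information (tanh node)
    i = sigmoid(A[:, 2 * H:3 * H])    # input gate
    o = sigmoid(A[:, 3 * H:])         # output gate
    c = f * c_prev + g * i            # element-wise (Hadamard) products
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4                     # toy sizes
h, c = lstm_step(rng.standard_normal((N, D)),
                 np.zeros((N, H)), np.zeros((N, H)),
                 rng.standard_normal((D, 4 * H)) * 0.1,
                 rng.standard_normal((H, 4 * H)) * 0.1,
                 np.zeros(4 * H))
```

Only one matrix product per input is needed, and the gates are recovered by slicing the result.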

Language model using LSTM

The language model implemented here is almost the same as the previous chapter. The only difference is that where the Time RNN layer was used in the previous chapter, the Time LSTM layer is used this time.

Further improvements to RNNLM

Multi-layering of LSTM layers

Deepening the LSTM layer (stacking multiple LSTM layers) often works well.

How many layers are appropriate?

Because the number of layers is a hyperparameter, it needs to be determined based on the complexity of the problem to be solved and the size of the training data that can be provided.

In the case of learning a language model on the PTB data set, better results can be obtained when the number of LSTM layers is 2 to 4

The GNMT model used in Google Translate is a network that stacks 8 LSTM layers.

Suppress overfitting based on Dropout

By deepening the depth, more expressive models can be created, but such models often suffer from overfitting. To make matters worse, RNNs are more prone to overfitting than conventional feedforward neural networks, so overfitting countermeasures for RNNs are very important.

Countermeasures

Add training data

Reduce model complexity

Regularization

Dropout

Dropout randomly selects a subset of neurons, then ignores them and stops transmitting signals forward.

Dropout layer insertion position

Regular Dropout

Incorrect structure

If Dropout is inserted in the time series direction, information will be gradually lost over time as the model learns.

Correct structure

Insert the Dropout layers in the depth (vertical) direction

Variational Dropout

By sharing the mask between Dropouts of the same layer, the mask is "fixed". In this way, the way information is lost is also "fixed", so the exponential information loss that occurs with regular Dropout can be avoided.

weight sharing

Weight tying can be literally translated as "weight binding".

Weight sharing is the technique of binding (sharing) the weights of the Embedding layer and the Affine layer. By sharing the weights between these two layers, the number of parameters to learn can be greatly reduced. In addition, it improves accuracy.
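
A small sketch of the idea, assuming an Embedding weight of shape (V, H) whose transpose is reused by the Affine layer. The variable names and toy sizes are hypothetical, and the LSTM in between is replaced by a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 4                                     # vocabulary size, hidden size (toy values)
embed_W = rng.standard_normal((V, H)) * 0.01     # Embedding weights, shape (V, H)

word_ids = np.array([1, 5, 7])
x = embed_W[word_ids]                # Embedding layer: row lookup, shape (3, H)

h = x                                # stand-in for the LSTM output, shape (3, H)
affine_b = np.zeros(V)
scores = h @ embed_W.T + affine_b    # Affine layer reuses the transposed weights
```

Because `embed_W.T` is a view of the same array, no separate (H, V) weight matrix has to be learned.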

Better RNNLM implementation

Frontier Research

Generate text based on RNN

There is no such thing as a perfect article, just like there is no perfect despair. [Haruki Murakami, "Hear the Wind Sing"]

Generate text using language models

How to generate the next new word

Deterministic: always select the word with the highest probability, so the result is uniquely determined

Probabilistic: sample according to the probability distribution, so words with high probability are likely to be selected and words with low probability are unlikely to be selected
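
The two selection strategies can be sketched as follows. The vocabulary and the probabilities are made-up toy values standing in for a language model's Softmax output.

```python
import numpy as np

vocab = ["you", "say", "goodbye", "hello"]    # toy vocabulary (assumption)
p = np.array([0.1, 0.6, 0.05, 0.25])          # toy output probabilities

# Deterministic: always pick the most probable word.
greedy_word = vocab[int(np.argmax(p))]

# Probabilistic: sample word indices according to p.
rng = np.random.default_rng(0)
samples = rng.choice(len(vocab), size=1000, p=p)
counts = np.bincount(samples, minlength=len(vocab))
```

With sampling, the generated text changes from run to run, while the most probable words still appear most often.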

Implementation of text generation

Better text generation

Use better language models

seq2seq model

Seq2Seq (Sequence to Sequence, sequence to sequence model)

Models for converting time series data into other time series data

The principle of seq2seq

This model has two modules - Encoder and Decoder. The encoder encodes the input data and the decoder decodes the encoded data.

seq2seq consists of two LSTM layers, the encoder LSTM and the decoder LSTM.

The hidden state h of the LSTM is a fixed-length vector. The only difference between the decoder and the language model in the previous section is that the decoder's LSTM layer receives the vector h. This single, small change allows an ordinary language model to evolve into a decoder that can handle translation.

A simple attempt at time series data conversion

Trying to get seq2seq to do addition calculations

Variable length time series data

Padding

Pad the original data with invalid (meaningless) data so that all the data have the same length.

When using padding you need to add some padding-specific processing to seq2seq

When padding is input to the decoder, its loss should not be calculated (this can be solved by adding a mask function to the Softmax with Loss layer)

When padding is input to the encoder, the LSTM layer should output the input from the previous moment as is
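
A minimal sketch of the loss mask for the decoder side. The function name `masked_cross_entropy` and the padding ID 0 are assumptions for this example.

```python
import numpy as np

def masked_cross_entropy(probs, targets, pad_id):
    """Average cross-entropy over non-padding positions only.

    probs: (T, V) softmax outputs, targets: (T,) word IDs.
    """
    mask = targets != pad_id
    picked = probs[np.arange(len(targets)), targets]   # probability of each target
    losses = -np.log(picked + 1e-7) * mask             # zero out padded positions
    return losses.sum() / mask.sum()

PAD = 0  # hypothetical padding ID
probs = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.2, 0.6],
                  [0.9, 0.05, 0.05]])
targets = np.array([1, 2, PAD])
loss = masked_cross_entropy(probs, targets, PAD)
```

The third position is padding, so only the first two positions contribute to the average.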

Addition data set

Implementation of seq2seq

Improvements to seq2seq

Reverse input data

In many cases, learning progresses faster and final accuracy improves after using this technique.

Intuitively, the propagation of gradients can be smoother and more effective after inverting the data.

Peeky

The decoder that uses this technique is called a Peeky Decoder, and the seq2seq using a Peeky Decoder is called Peeky seq2seq.

The encoder converts the input sentence into a fixed-length vector h, which concentrates all the information required by the decoder. It is the only source of information for the decoder.

The output h of the encoder, which concentrates the important information, can also be given to the other layers of the decoder

Two vectors are input to the LSTM layer and the Affine layer at the same time, which actually represents the concatenation of the two vectors.

Application of seq2seq

Machine Translation: Convert "text in one language" to "text in another language"

Automatic summarization: Convert "a long text" into a "short summary"

Question and answer system: convert "question" into "answer"

Email auto-reply: Convert "received email text" to "reply text"

chatbot

algorithm learning

Automatic image description

Attention

Attention is all you need. [Title of the paper by Vaswani et al.]

Attention is undoubtedly one of the most important technologies in the field of deep learning in recent years. The goal of this chapter is to understand the structure of Attention at the code level, and then apply it to practical problems to experience its wonderful effects.

Attention structure

Problems with seq2seq

An encoder is used in seq2seq to encode the time series data, and then the encoded information is passed to the decoder. At this point, the output of the encoder is a fixed-length vector.

A fixed-length vector means that the input sentence is converted into a vector of the same length no matter how long it is.

Encoder improvements

The length of the encoder's output should change accordingly based on the length of the input text

Because the encoder processes from left to right, strictly speaking, the vector for "猫" (cat) only contains the information of the three words "吾輩", "は" and "猫". Considering the overall balance, it is best to include information from around the word "猫" evenly. In this case, a bidirectional RNN (or bidirectional LSTM) that processes time series data from both directions is more effective.

Decoder improvements

Previously we put the "last" hidden state of the encoder's LSTM layer into the "initial" hidden state of the decoder's LSTM layer.

The decoder in the previous chapter only used the last hidden state of the encoder's LSTM layer. In terms of hs, only its last row was extracted and passed to the decoder. Next we improve the decoder so that it can use all of hs.

We focus on a certain word (or set of words) and convert that word as needed. This allows seq2seq to learn which words in the input correspond to which words in the output

Example

吾輩 [わがはい] = I

猫[ねこ] = cat

Many studies exploit knowledge of word correspondences such as "猫 = cat". Such information indicating the correspondence between words (or phrases) is called alignment. So far, alignment has mainly been done manually, but the Attention technique introduced here brings the idea of alignment into seq2seq automatically. This is an evolution from "manual operation" to "mechanical automation".

structural changes

How to calculate

Can the operation of "selection" be replaced by a differentiable operation? Instead of "single selection", it is better to "select all". We separately calculate the weight representing the importance (contribution value) of each word.

Like a probability distribution, each element of a is a scalar from 0.0 to 1.0, and the elements sum to 1. The weighted sum of the word vectors hs, using these importance weights, gives the target vector.

When processing sequence data, the network should pay more attention to the important parts of the input and ignore the unimportant parts. Attention explicitly weights the parts of the input sequence by learning a weight for each part, so that the model can better focus on the information relevant to the output. The key to the Attention mechanism is to dynamically calculate the weight of each position in the input sequence, so that at each time step the parts of the input sequence are weighted and summed to obtain the output of the current time step. When generating each output, the decoder pays different amounts of attention to different parts of the input sequence.

Learning of weight a

Our goal is to express numerically how "similar" this h is to the individual word vectors of hs.

Here we use the simplest vector inner product.

There are several ways to calculate vector similarity. In addition to inner products, there is also the practice of using small neural networks to output scores.

Next, s is normalized using the familiar Softmax function
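
The score, Softmax and weighted-sum steps can be sketched together. This is a single-sample NumPy illustration with toy values; the function name `attention` is an assumption.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)       # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention(hs, h):
    """hs: (T, H) encoder hidden states, h: (H,) decoder hidden state."""
    s = hs @ h              # inner-product similarity scores, shape (T,)
    a = softmax(s)          # weights: each in [0, 1], summing to 1
    c = a @ hs              # weighted sum of hs (context vector), shape (H,)
    return c, a

hs = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
h = np.array([2.0, 0.0])
c, a = attention(hs, h)
```

The rows of hs that are most similar to h (the first and third here) receive the largest weights.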

Integrate

Implementation of seq2seq with Attention

Attention's evaluation

We confirm the effect of seq2seq with Attention by studying the "date format conversion" problem

Date format conversion problem

Learning of seq2seq with Attention

Visualization of Attention

Other topics about Attention

Bidirectional RNN

If we consider the overall balance, we want the vector to contain information around the word "cat" more evenly.

Bidirectional LSTM adds, alongside the previous LSTM layer, an LSTM layer that processes the data in the opposite direction.

Concatenate the hidden states of the two LSTM layers at each moment and use the result as the final hidden state vector (instead of concatenating, you can also "sum" or "average" them, etc.)

How to use the Attention layer

The output of the attention layer (context vector) is connected to the input of the LSTM layer at the next moment. Through this structure, the LSTM layer is able to use the information of the context vector. In contrast, the model we implemented uses context vectors in the Affine layer.

Deepening of seq2seq and skip connection

Deepen the RNN layer

seq2seq with attention using 3 layers of LSTM layer

residual connection

At the junction of the residual connection, two outputs are added.

Because addition propagates gradients "as is" during backpropagation, gradients in residual connections are propagated to the previous layer unchanged. In this way, even if the network is deepened, the gradient can propagate normally without vanishing (or exploding), and learning can proceed smoothly.

Application of Attention

GNMT

The history of machine translation

Rule-based translation

Example-based translation

Statistics-based translation

Neural Machine Translation

Since 2016, Google Translate has been using neural machine translation in its actual service. This machine translation system is called GNMT.

GNMT requires large amounts of data and computing resources. It uses a large amount of training data, and one model is trained on nearly 100 GPUs for 6 days. In addition, GNMT tries to further improve accuracy with techniques such as ensemble learning, in which 8 models are trained in parallel, and reinforcement learning.

Transformer

RNN can handle variable-length time series data well. However, RNN also has shortcomings, such as parallel processing problems.

RNN needs to be calculated step by step based on the calculation results of the previous moment, so it is (basically) impossible to calculate RNN in parallel in the time direction. This will become a big bottleneck when performing deep learning in a parallel computing environment using GPUs, so we have the motivation to avoid RNN.

Transformer does not use RNN, but uses Attention for processing. Let's take a brief look at this Transformer.

Self-Attention

Transformer is based on Attention, and the important technique it uses is Self-Attention. Self-Attention literally means "attention to oneself": it is Attention computed within a single piece of time series data, aiming to observe the relationship between each element of that data and its other elements.

Use a fully connected neural network with one hidden layer and activation function ReLU. In addition, Nx in the figure means that the elements surrounded by the gray background are stacked N times.

NTM

NTM (Neural Turing Machine)

Neural networks can also gain additional capabilities using external storage devices.

Based on Attention, the encoder and decoder implement "memory operations" like those of a computer. In other words, the encoder writes the necessary information to memory and the decoder reads the necessary information from memory.

In order to imitate a computer's memory operations, NTM's memory operations use two kinds of Attention: content-based Attention and position-based Attention.

Content-based Attention is the same as the Attention we introduced before, and is used to find similar vectors of a certain vector (query vector) from memory.

Position-based Attention is used to move forward and backward from the memory address (the weight of each location in the memory) that was focused on at the last moment. Here we omit the discussion of its technical details, which can be achieved through one-dimensional convolution operation. The movement function based on the memory location can reproduce the unique computer activity of "reading while advancing (one memory address)".

NTM has successfully solved long-term memory problems, sorting problems (arranging numbers from large to small), etc.

Network optimization and regularization

No mathematical trick can compensate for the lack of information [Cornelius Lanczos]

Two major difficulties

Optimization

Difficult to optimize and computationally intensive

generalization problem

The fitting ability is too strong and it is easy to overfit.

Network Optimization

Difficulties in network optimization

Network structure diversity

It is difficult to find a general optimization method. The same optimization method can also perform quite differently on different network structures.

Difficulties with low-dimensional spaces

How to choose initialization parameters

Escape from the local optimum

Difficulties with high-dimensional spaces

How to choose initialization parameters

How to escape from a saddle point

A saddle point is the highest point along some dimensions and the lowest point along others

Flat minima

There are many parameters in deep neural networks and there is a certain degree of redundancy, which results in each single parameter having a relatively small impact on the final loss.

stuck in local minimum

optimization

Gradient descent method type

batch gradient descent

stochastic gradient descent

mini-batch gradient descent

In batch gradient descent, calculating the gradient on the entire training data at every iteration requires a lot of computing resources. Additionally, the data in large training sets is often very redundant, so there is no need to compute gradients over the entire training set.
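
A minimal sketch of mini-batch gradient descent on a toy one-parameter regression problem. The data, learning rate and batch size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(1000)   # toy data, true slope 3

w, lr, batch_size = 0.0, 0.1, 32
for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw a mini-batch
    xb, yb = X[idx, 0], y[idx]
    grad = 2.0 * np.mean((w * xb - yb) * xb)   # gradient of the mean squared error
    w -= lr * grad
```

Each update uses only 32 of the 1000 samples, yet w still approaches the true slope.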

learning rate decay

The learning rate should be kept larger at the beginning to ensure the convergence speed, and smaller when it converges to near the optimal point to avoid back and forth oscillations.

type

Inverse time decay

exponential decay

natural exponential decay

β is the decay rate, generally taking a value of 0.96.
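
The three decay schedules can be sketched as below, assuming their commonly used forms (lr0 is the initial learning rate and t the iteration count). The function names and the initial rate 0.1 are hypothetical.

```python
import math

def inverse_time_decay(lr0, beta, t):
    return lr0 / (1.0 + beta * t)

def exponential_decay(lr0, beta, t):
    return lr0 * beta ** t            # beta < 1, e.g. 0.96

def natural_exp_decay(lr0, beta, t):
    return lr0 * math.exp(-beta * t)

# Learning rates for the first three iterations with exponential decay.
lrs = [exponential_decay(0.1, 0.96, t) for t in range(3)]
```

All three start at lr0 when t = 0 and shrink as t grows; they differ only in how quickly.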

There are also methods for adaptively adjusting the learning rate, such as AdaGrad, RMSprop, AdaDelta, etc.

AdaGrad method

Among the effective techniques for learning rate is a method called learning rate decay, which gradually decreases the learning rate as learning proceeds.

AdaGrad takes this idea further, adjusting the learning rate appropriately for each element of the parameters while learning at the same time

Ada comes from the English word Adaptive, which means "adjusting to fit"

Like the previous SGD, W represents the weight parameter to be updated, the partial derivative represents the gradient, and η represents the learning rate.

But a new variable h appears, which stores the sum of the squares of all previous gradient values. Therefore, the longer learning proceeds, the smaller the updates become.
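
A minimal sketch of the AdaGrad update, assuming the usual form in which the learning rate is divided by the square root of h. The small constant eps (to avoid division by zero) and the constant toy gradient are assumptions.

```python
import numpy as np

def adagrad_update(W, grad, h, lr=0.01, eps=1e-7):
    """One AdaGrad step; h accumulates the squared gradients element-wise."""
    h = h + grad * grad
    W = W - lr * grad / (np.sqrt(h) + eps)
    return W, h

W, h = np.array([1.0, 1.0]), np.zeros(2)
grad = np.array([0.5, 0.5])             # a constant toy gradient
steps = []
for _ in range(3):
    W_new, h = adagrad_update(W, grad, h)
    steps.append(W[0] - W_new[0])       # size of this update
    W = W_new
```

Even with a constant gradient, each recorded step is smaller than the one before it.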

RMSProp method

With AdaGrad, if learning continues endlessly, the update amount will become 0

The RMSProp method does not add all the past gradients equally, but gradually forgets the past gradients and reflects more information about the new gradients when doing the addition operation.

Technically speaking, this operation is called "exponential moving average", which exponentially reduces the scale of past gradients.
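
A minimal sketch of the RMSProp update, assuming a decay rate of 0.99 for the exponential moving average. The parameter names and the constant toy gradient are hypothetical.

```python
import numpy as np

def rmsprop_update(W, grad, h, lr=0.01, decay_rate=0.99, eps=1e-7):
    """One RMSProp step; h is an exponential moving average of squared gradients."""
    h = decay_rate * h + (1.0 - decay_rate) * grad * grad
    W = W - lr * grad / (np.sqrt(h) + eps)
    return W, h

W, h = np.array([1.0]), np.zeros(1)
for _ in range(2000):
    W_prev = W
    W, h = rmsprop_update(W, np.array([0.5]), h)   # a constant toy gradient
last_step = float(W_prev[0] - W[0])
```

Because old gradients are forgotten, the step size settles near lr instead of shrinking toward 0.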

Gradient direction optimization

Momentum method

In mini-batch gradient descent, if the number of samples selected each time is relatively small, the loss will decrease in an oscillating manner.

The momentum method uses the average gradient over a recent period of time, instead of the gradient at the current moment, as the direction of the parameter update.

also called momentum method

Disadvantages of SGD

f(x,y)=(1/20)*x^2 + y^2

Optimized update path based on SGD: moving towards the minimum value (0, 0) in a zigzag shape, low efficiency

ways to improve

Like the previous SGD, W represents the weight parameter to be updated, the partial derivative represents the gradient, and η represents the learning rate.

But a new variable v appears, which corresponds to velocity in physics. The update can be understood as a force applied to the object in the gradient direction, which increases the object's velocity.
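
A minimal sketch of the Momentum update applied to f(x,y) = (1/20)*x^2 + y^2. The starting point (-7, 2), the learning rate 0.1 and the momentum coefficient 0.9 are assumptions for this example.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 / 20.0 + y**2

def grad_f(p):
    x, y = p
    return np.array([x / 10.0, 2.0 * y])

def momentum_update(W, grad, v, lr=0.1, momentum=0.9):
    """v <- momentum * v - lr * grad;  W <- W + v"""
    v = momentum * v - lr * grad
    return W + v, v

W, v = np.array([-7.0, 2.0]), np.zeros(2)   # start point is an assumption
initial_loss = f(W)
for _ in range(100):
    W, v = momentum_update(W, grad_f(W), v)
final_loss = f(W)
```

The velocity v keeps accumulating along the gently sloped x direction, which is what lets the point keep moving toward (0, 0).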

Adam method

Momentum moves according to the physical rules of a ball rolling in a bowl, and AdaGrad adjusts the update pace appropriately for each element of the parameter. Combining them is Adam's idea
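
A minimal sketch of one Adam step, assuming the commonly used form with bias correction and the default values beta1 = 0.9 and beta2 = 0.999. The function name `adam_update` is hypothetical.

```python
import numpy as np

def adam_update(W, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # Momentum-like average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad    # per-element average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
W, m, v = adam_update(W, np.array([0.5, -4.0]), m, v, t=1)
```

On the first step both elements move by about lr even though their gradients differ by a factor of 8, which shows the per-element scaling.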

(Currently) there is no method that performs well on all problems. Each of the four methods (SGD, Momentum, AdaGrad and Adam) has its own characteristics, with problems it is good at solving and problems it is not.

Gradient clipping

If the gradient suddenly increases, using the large gradient to update the parameters will move them far away from the optimal point.

When the norm of the gradient is greater than a certain threshold, the gradient is truncated.

Limit the norm of the gradient to an interval, and truncate the gradient when its norm falls below or above this interval.

type

Truncate by value

gt = max(min(gt, b), a).

Truncate by norm
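
Both kinds of truncation can be sketched as follows. The norm version assumes the common rule of rescaling the gradient to the threshold when its L2 norm exceeds it; the function names are hypothetical.

```python
import numpy as np

def clip_by_value(g, a, b):
    """Clip each element of g to the interval [a, b]."""
    return np.maximum(np.minimum(g, b), a)

def clip_by_norm(g, threshold):
    """Rescale g so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([3.0, -4.0])            # L2 norm is 5
by_value = clip_by_value(g, -1.0, 1.0)
by_norm = clip_by_norm(g, 1.0)
```

Truncating by value can change the direction of the gradient, while truncating by norm keeps the direction and only shortens the vector.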

Parameter initialization

Gaussian distribution initialization

The Gaussian initialization method is the simplest initialization method. The parameters are randomly initialized from a Gaussian distribution with a fixed mean (such as 0) and a fixed variance (such as 0.01).

When the number of input connections of a neuron is nin, its input connection weights can be initialized with the Gaussian distribution N(0, sqrt(1/nin)).

If the number of output connections nout is also considered, the weights can be initialized with the Gaussian distribution N(0, sqrt(2/(nin + nout)))

Uniformly distributed initialization

Uniform distribution initialization uses uniform distribution to initialize parameters within a given interval [−r, r]. The setting of the hyperparameter r can also be adjusted adaptively according to the number of connections of neurons.

Activation function type

logistic function

tanh

Xavier initial value

We tried using the weight initial values recommended in the paper by Xavier Glorot et al.

If the number of nodes in the previous layer is n, the initial value uses a Gaussian distribution with a standard deviation of (1/sqrt(n))

ReLU weight initial value

When the activation function uses ReLU, it is generally recommended to use the initial value dedicated to ReLU, which is the initial value recommended by Kaiming He et al., also known as the "He initial value".

When the number of nodes in the previous layer is n, the He initial value uses a Gaussian distribution with a standard deviation of sqrt(2/n)
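
A short sketch of the two initial values, using the standard deviations 1/sqrt(n) (Xavier) and sqrt(2/n) (He), where n is the number of nodes in the previous layer. The layer sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 1000, 500     # toy layer sizes (assumption)

# Xavier initial value: suited to sigmoid / tanh activations.
W_xavier = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

# He initial value: suited to ReLU activations.
W_he = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
```

The He initial value is sqrt(2) times wider, compensating for ReLU setting the negative half of its inputs to 0.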

Data preprocessing

different units

The different sources and measurement units of each dimensional feature will cause the distribution range of these feature values to be very different. When we calculate the Euclidean distance between different samples, features with a large value range will play a dominant role.

scaling normalization

The value range of each feature is normalized to [0, 1] or [−1, 1] by scaling.

standard normalization

Also called z-score normalization

Each dimensional feature is processed to conform to the standard normal distribution (mean is 0, standard deviation is 1).
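
Both normalizations can be sketched on a toy data matrix whose two features have very different ranges. The data values are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])     # rows are samples, columns are features

# Scaling (min-max) normalization to [0, 1], per feature.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard (z-score) normalization: mean 0, standard deviation 1 per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After either transformation, the two columns are on the same scale, so neither dominates a Euclidean distance.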

Data redundancy

After the input data is whitened, the correlation between features is low and all features have the same variance.

One of the main ways to achieve whitening is to use principal component analysis to remove the correlation between components.

Layer-by-layer normalization

When using stochastic gradient descent to train a network, each parameter update will cause the distribution of inputs to each layer in the middle of the network to change. The deeper the layer, the more obviously the distribution of its input will change.

batch normalization

Also called Batch Normalization, BN method

In order to make each layer have the appropriate breadth, the distribution of activation values is "forced" to be adjusted.

Perform normalization so that the mean of the data distribution is 0 and the variance is 1.

Any intermediate layer in the neural network can be normalized.

Advantages

Can make learning happen quickly (can increase the learning rate)

Less dependent on initial values (not so sensitive to initial values)

Suppress overfitting (reduce the need for Dropout, etc.)

Batch Norm layer

Affine->Batch Norm->ReLU
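
A minimal training-mode sketch of the Batch Norm forward pass placed between Affine and ReLU. The learnable scale gamma and shift beta are the usual additions and are an assumption here; running statistics for inference are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """Normalize each feature over the mini-batch (axis 0), then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 5)) * 3.0 + 10.0    # toy Affine outputs
out = batch_norm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
activated = np.maximum(0.0, out)                 # ReLU applied after Batch Norm
```

With gamma = 1 and beta = 0, every feature of `out` has mean 0 and variance 1 over the mini-batch.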

layer normalization

Limitations of batch normalization

Batch normalization is a normalization operation on a single neuron in an intermediate layer, so the number of mini-batch samples must not be too small, otherwise it will be difficult to calculate the statistical information of a single neuron.

If the distribution of a neuron's net input changes dynamically in a neural network, such as a recurrent neural network, then the batch normalization operation cannot be applied

Layer normalization normalizes all neurons in an intermediate layer.

Batch normalization is very effective in convolutional neural networks (CNN), while layer normalization is more common in recurrent neural networks (RNN) and Transformer networks.
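
For contrast with batch normalization, here is a minimal sketch of layer normalization without the usual learnable scale and shift. It normalizes across the neurons of each sample, so it works even when the mini-batch holds a single sample.

```python
import numpy as np

def layer_norm(x, eps=1e-7):
    """Normalize across the neurons of each sample (axis 1)."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # a mini-batch containing one sample
out = layer_norm(x)
```

The only change from batch normalization is the axis over which the mean and variance are computed.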

Hyperparameter optimization

composition

Network structure

connections between neurons

Number of layers

Number of neurons per layer

Type of activation function

Optimization parameters

Network optimization methods

learning rate

Mini-batch size

regularization coefficient

Validation data (validation set)

Hyperparameter performance cannot be evaluated using test data

If the test data is used to confirm the "goodness" of the hyperparameter value, it will cause the hyperparameter value to be adjusted to only fit the test data.

The training data is used for parameter learning (weights and biases), and the validation data is used for performance evaluation of hyperparameters. In order to confirm the generalization ability, the test data should be used at the end (ideally only once)

Optimization

grid search

A method that finds a suitable hyperparameter configuration by trying all combinations of hyperparameter values.

Choose several empirical values for each hyperparameter. For example, for the learning rate α we can set α ∈ {0.01, 0.1, 0.5, 1.0}.

random search

Set the range of hyperparameters and randomly sample from the set range of hyperparameters

Evaluate recognition accuracy on the validation data (with the number of epochs set very small)

Repeat the above (100 times, etc.) and narrow the range of hyperparameters based on the results of their recognition accuracy
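
The random search steps above can be sketched as follows. The function `validation_accuracy` is a hypothetical stand-in for a short training run evaluated on validation data, and the sampling ranges are illustrative.

```python
import random

def validation_accuracy(lr, weight_decay):
    """Hypothetical stand-in for a short training run scored on validation data."""
    return 1.0 / (1.0 + abs(lr - 0.01) * 100 + abs(weight_decay - 1e-5) * 1e4)

random.seed(0)
results = []
for _ in range(100):
    lr = 10 ** random.uniform(-6, -2)             # sample on a log scale
    weight_decay = 10 ** random.uniform(-8, -4)
    results.append((validation_accuracy(lr, weight_decay), lr, weight_decay))

results.sort(reverse=True)
top = results[:5]    # inspect the best trials to narrow the ranges for the next round
```

Sampling the exponent uniformly covers several orders of magnitude evenly, which suits hyperparameters such as the learning rate.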

Bayesian optimization

Dynamic resource allocation

network regularization

Purpose: To suppress overfitting

weight decay

Weight decay is a method that has been frequently used to suppress overfitting. This method penalizes large weights during the learning process.

Simply put, an L2 penalty on the weights W is added to the loss function L, which becomes L + (1/2)*λ*W^2

λ is a hyperparameter that controls the strength of regularization
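
A minimal sketch of weight decay, assuming the common L2 form in which 0.5 * λ * sum(W^2) is added to the loss for every weight array. The function names and toy values are hypothetical.

```python
import numpy as np

def loss_with_weight_decay(data_loss, weights, lam):
    """Add the L2 penalty 0.5 * lam * sum(W**2) over all weight arrays."""
    penalty = sum(0.5 * lam * np.sum(W ** 2) for W in weights)
    return data_loss + penalty

def grad_with_weight_decay(grad_W, W, lam):
    """The penalty adds lam * W to the gradient of each weight array."""
    return grad_W + lam * W

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
total = loss_with_weight_decay(data_loss=0.3, weights=[W1], lam=0.1)
```

Large weights increase the penalty quadratically, so gradient descent is pushed toward smaller weights.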

discard method

Dropout Method

If the network model becomes very complex, it will be difficult to deal with it using only weight decay.

Dropout randomly deletes neurons during the learning process, discarding a different random set of neurons each time. The simplest way is to set a fixed probability p and retain each neuron with probability p.
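
A minimal sketch of a Dropout layer with retention probability `keep_prob`. It uses the "inverted" variant, which rescales the surviving activations during training; this is an assumption, and other implementations scale at test time instead.

```python
import numpy as np

class Dropout:
    """Inverted dropout: keep each neuron with probability keep_prob during training."""
    def __init__(self, keep_prob=0.5, seed=0):
        self.keep_prob = keep_prob
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x, train=True):
        if not train:
            return x                                 # nothing is dropped at test time
        self.mask = self.rng.random(x.shape) < self.keep_prob
        return x * self.mask / self.keep_prob        # rescale to keep the expected value

    def backward(self, dout):
        return dout * self.mask / self.keep_prob     # gradients pass only through kept neurons

layer = Dropout(keep_prob=0.5)
x = np.ones((1000, 10))
y = layer.forward(x)
```

A fresh mask is drawn on every training forward pass, so a different subset of neurons is discarded each time.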

data augmentation

Rotate, flip, scale, translate, add noise

label smoothing

Add noise to output labels to avoid model overfitting
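
A minimal sketch of label smoothing, assuming the common form that moves a small amount eps of probability mass from the true class to the other K - 1 classes. The function name and eps = 0.1 are hypothetical.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Move eps of the probability mass from the true class to the other K - 1 classes."""
    K = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (K - 1)

y = np.array([[0.0, 1.0, 0.0, 0.0]])
y_smooth = smooth_labels(y, eps=0.1)
```

The smoothed target still sums to 1, but the model is no longer pushed to assign probability exactly 1 to a single class.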

Model-independent learning methods

Ensemble learning

Integrate multiple models through a certain strategy to improve decision-making accuracy through group decision-making. The primary issue in ensemble learning is how to integrate multiple models. The more commonly used integration strategies include direct average, weighted average, etc.

Self-training and co-training

Both belong to semi-supervised learning

self training

Self-training first uses labeled data to train a model and uses this model to predict the labels of unlabeled samples. Samples with relatively high prediction confidence, together with their predicted pseudo labels, are added to the training set, a new model is retrained, and the process is repeated.

Co-training

Co-training is an improved method of self-training

Two classifiers based on different views promote each other. A lot of data has different perspectives that are relatively independent.

Because different views are conditionally independent, models trained on different views are equivalent to understanding the problem from different perspectives and are complementary to some degree. Co-training is a method that uses this complementarity to perform self-training. First, two models f1 and f2 are trained on the training set according to the different views. Then f1 and f2 are used to predict on the unlabeled data set, and samples with relatively high prediction confidence are selected and added to the training set. The two models for the different views are then retrained, and the process is repeated.

multi-task learning

General machine learning models are aimed at a single specific task, such as handwritten digit recognition, object detection, etc. Models for different tasks are learned separately on their respective training sets.

If two tasks are related, there will be some shared knowledge between them, and this knowledge will be helpful to both tasks. These shared knowledge can be representations (features), model parameters, or learning algorithms, etc.

type

transfer learning

Suppose there is a related task that already has a large amount of training data. Although the distribution of this training data differs from that of the target task, its scale is relatively large, so we assume that some generalizable knowledge can be learned from it, and this knowledge will be of some help to the target task. How to transfer the generalizable knowledge in the training data of related tasks to the target task is the problem that transfer learning aims to solve.

Transfer learning refers to the process of knowledge transfer between two different fields, using the knowledge learned in the source domain (Source Domain) DS to help the learning tasks in the target domain (Target Domain) DT. The number of training samples in the source domain is generally much larger than that in the target domain.

Classification

inductive transfer learning

A model is learned on the training data set that minimizes the expected risk (i.e., the error rate on the real data distribution).

Transductive transfer learning

Learn a model that minimizes error on a given test set

Fine-tuning is an application method of transfer learning. It usually refers to using new, task-specific data sets to perform additional training on the basis of an already trained model to improve the performance of the model on a specific task. The purpose of fine-tuning is to use the general knowledge learned by the pre-trained model on large-scale data to accelerate and optimize the learning process on specific tasks.

lifelong learning

Problems

Once training is completed, the model remains fixed and is no longer iteratively updated.

It is still very difficult for a model to be successful on many different tasks at the same time.

Lifelong Learning, also called Continual Learning, refers to the ability to learn continuously as humans do: using the experience and knowledge learned in past tasks to help learn new tasks as they constantly emerge, while this experience and knowledge keeps accumulating and old knowledge is not forgotten because of new tasks.

In lifelong learning, it is assumed that a lifelong learning algorithm has learned a model on historical tasks T1, T2, ..., Tm. When a new task Tm+1 appears, the algorithm can use the knowledge learned on the past m tasks to help with the (m+1)-th task, while accumulating knowledge on all m+1 tasks.

This setting is very similar to inductive transfer learning, but the goal of inductive transfer learning is to optimize the performance of the target task without caring about the accumulation of knowledge. The goal of lifelong learning is continuous learning and knowledge accumulation. In addition, unlike multi-task learning, lifelong learning does not involve learning on all tasks simultaneously.

meta-learning

According to the no free lunch theorem, no universal learning algorithm is effective on all tasks. Therefore, when using machine learning algorithms to implement a certain task, we usually need to analyze each case on its own terms and choose the appropriate model, loss function, optimization algorithm, and hyperparameters for the specific task.

The ability to dynamically adjust one's own learning method is called meta-learning, also known as learning to learn.

Another machine learning problem related to meta-learning is few-shot learning

Two typical meta-learning methods

Optimizer-based meta-learning

The difference between different optimization algorithms lies in the different rules for updating parameters. Therefore, a natural meta-learning is to automatically learn a rule for updating parameters, that is, modeling the gradient descent process through another neural network (such as a recurrent neural network).

Model-agnostic meta-learning

It is a simple model-independent and task-independent meta-learning algorithm.