
Chapter 6: Taking Control of Training with Keras


In the previous chapters, we learned how to train your own models using CreateML and Turi Create.

These are convenient tools that don't require writing much code. The downside is that they don't give you much control over the training process. If you want to understand what you're doing, and want to get more out of machine learning, you need to use more powerful tools.

 

In this chapter, we'll train the snacks classifier using Keras, a popular deep learning tool.

 

Keras runs on top of a backend that performs the actual computations. We'll use the most popular one, TensorFlow.

 

TensorFlow is a tool for building not just neural networks but any kind of computational graph. Instead of neural network layers, TensorFlow deals with rudimentary mathematical operations such as matrix multiplications and taking derivatives. There are higher-level abstractions in TensorFlow too, but many people use Keras because it's more convenient. In fact, Keras is so popular that there is a version of it built into TensorFlow.
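To see what "rudimentary mathematical operations" means in practice, here is a minimal sketch (assuming TensorFlow 2.x, which is newer than the version used in the book) that multiplies matrices and takes a derivative:

import tensorflow as tf

# Two of TensorFlow's rudimentary operations: matrix multiplication
# and automatic differentiation.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable([[0.5], [0.5]])

with tf.GradientTape() as tape:
    y = tf.matmul(x, w)            # matrix multiplication
    loss = tf.reduce_sum(y ** 2)   # a scalar to differentiate

print(tape.gradient(loss, w))      # d(loss)/d(w), computed automatically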


* Getting started

Set up a Python environment for running Keras, using the Terminal.

Copy the snacks dataset used in the previous chapters into the project folder.


* Back to basics with logistic regression

Transfer learning: a logistic regression model is trained on top of features extracted from the training images.

The advantage of transfer learning is that it is much faster than training from scratch, because the model reuses the knowledge already contained in the pre-trained feature extractor.

Hence, you are transferring knowledge from one problem domain to another. In this case, the feature extractors are trained on the general problem of recognizing objects in photos, and you’ll adapt them to the specific problem of recognizing 20 different types of snacks.


[What is transfer learning?]

- Transfer learning uses a deep network only as a feature extractor, then trains a different model on the extracted features.
- Building a new model on top of an existing one makes training faster and often improves predictions.
- Typically, you take a model that has already been pre-trained, such as VGG, ResNet or GoogLeNet, and adapt it to the task you want with small adjustments; this is transfer learning.
- The already-learned weights are transferred and then trained to fit your own model.
- This retraining process of the network is called fine-tuning.
- In practice, CNNs are almost never trained from scratch (random initialization).
- Instead, a ConvNet pre-trained on a large dataset such as ImageNet is used.


[Source] https://blog.naver.com/PostView.nhn?blogId=flowerdances&logNo=221189533377&categoryNo=32&parentCategoryNo=0&viewDate=&currentPage=1&postListTopCurrentPage=1&from=postView

 


 

To verify the claim that 'using a feature extractor works better than training the logistic regression classifier on the image pixels directly,' let's build a logistic regression model that skips the feature extraction part and works directly on pixels.


* A quick refresher

Logistic regression is a statistical model that tries to find the straight line between your data points that best separates the classes.
It works well when the classes can be divided by a straight line, or by a hyperplane in higher dimensions.

 

* Let's talk math


 y = a[0]*x[0] + a[1]*x[1] + b
The b is still the y-intercept — the value of y at the origin of the coordinate system — although in machine learning it is called the bias. This is the value of y when both x[0] and x[1] are 0.

The coefficients a[0] and a[1] are constants. b is also a constant. In fact, what logistic regression learns during training is the values of these constants. Therefore, we call those the learned parameters of the model. Our model currently has three learned parameters: a[0], a[1] and b.
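As a quick numeric illustration (the values below are made up, not from the book), evaluating the model at one input point is just this arithmetic:

a = [0.8, -0.3]   # hypothetical learned slopes
b = 1.5           # hypothetical learned bias
x = [2.0, 4.0]    # one input point

y = a[0] * x[0] + a[1] * x[1] + b   # 1.6 - 1.2 + 1.5 = 1.9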

 

By the way, when programmers say parameters, we often refer to the values that we pass into functions. Mathematicians call these arguments. To a mathematician, a parameter is a constant that is used inside the function. So, technically speaking, parameters and arguments are two different things — and if the mathematicians are to be believed, we programmers tend to use the wrong term.

 

* Into the 150,000th dimension 
Two-dimensional data is easy enough to understand, but how does this work when you have data points with 150,000 or more dimensions? You just keep adding coefficients to the formula:

y = a[0]*x[0] + a[1]*x[1] + a[2]*x[2] + ... + a[149999]*x[149999] + b

This is a bit labor intensive, which is why mathematicians came up with a shorter notation: the dot product. You can treat a and x as arrays — or vectors in math-speak — with 150,000 elements each. And then you can write:

y = dot(a, x) + b
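In code, the dot product collapses the 150,000-term sum into a single call. A sketch with NumPy (the array contents here are random placeholders):

import numpy as np

a = np.random.randn(150_000)   # hypothetical learned coefficients
x = np.random.randn(150_000)   # one flattened input, e.g. image pixels
b = 0.1                        # bias

y = np.dot(a, x) + b           # same as summing a[i]*x[i] over all i, plus b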

 

So far, the formula we’ve talked about for the line (actually, hyperplane) is for linear regression, not logistic. The linear regression formula just describes the best line that goes between the data points, which is useful in case you want to predict what x[1] is when you only have a given x[0].

 

* From linear to logistic

To turn the linear regression formula into a classifier, you extend the formula to make it a logistic regression:

probability = sigmoid(dot(a, x) + b)

The sigmoid function, also known as the logistic sigmoid, takes the decision boundary and looks at which side of the line the given point x is. The formula for sigmoid is:

sigmoid(x) = 1 / (1 + exp(-x))
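The same function in NumPy, with two sample values to show how it maps distances from the decision boundary to probabilities:

import numpy as np

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(0.0))   # 0.5: exactly on the decision boundary
print(sigmoid(4.0))   # ~0.98: far on the positive side of the line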

 

* Not everything is black and white...

What if you have more than two classes? In that case, you'll use a variation of the formula called multinomial logistic regression that works with any number of classes. Instead of one output, you now compute a separate prediction for each class:

probability[k] = sigmoid(dot(a[k], x) + b[k])    for each class k = 1, ..., K

 

If you have K classes, you end up with K different logistic regressions. Each has its own slopes and bias, which is why you now don’t have just one a and b but several different ones. For each class, you do the dot product of the input x with the coefficients for that class, add the bias, and take the sigmoid.

 

In practice all of these individual slopes are combined into a big matrix called the weights matrix.

 

It’s now possible for more than one class to be chosen, since these K probabilities are independent from one another. This is known as a multi-label classifier. You would use this kind of classifier if you wanted to identify more than one kind of object in the same image.

However, for a multi-class classifier, such as the one you’ve been reading about in the past chapters, you don’t want independent probabilities. Instead, you want to choose the best class amongst the K different ones. You can do that by applying a different function instead of the logistic sigmoid, called softmax:

softmax(x[k]) = exp(x[k]) / (exp(x[0]) + exp(x[1]) + ... + exp(x[K-1]))
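A small NumPy sketch of softmax (the scores are made-up outputs for three classes):

import numpy as np

def softmax(z):
    # subtracting the max doesn't change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))       # ~[0.66, 0.24, 0.10], sums to 1.0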

 

* Building the model

Let's turn the math above into running code using Keras.

Create a new Python 3 notebook in Jupyter. You can also use the LogisticRegression.ipynb notebook from the previous chapter.

 

1. Import the required packages (a combined sketch of steps 1 to 3 appears after step 3 below).

 

2. Define constants such as image_width, image_height and num_classes.

   Since the model predicts 20 different types of objects, set num_classes to 20.

 

3. Define the logistic regression model using Keras.

   model = Sequential()

   model.add(Flatten(input_shape = (image_height, image_width, 3)))

   model.add(Dense(num_classes))

   model.add(Activation("softmax"))

   This is called a Sequential model: a simple pipeline that consists of a list of layers. Each layer is a stage in the pipeline that transforms the data in some particular way. Here, we added three layers to the model.
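Putting steps 1 to 3 together, the notebook cells might look like the sketch below. The import paths assume the standalone keras package, and the image size constants are placeholders, not values from the book; use whatever size you resize your images to:

import numpy as np
from keras.models import Sequential
from keras.layers import Activation, Dense, Flatten

# Step 2: constants (the sizes here are assumptions)
image_width = 32
image_height = 32
num_classes = 20    # the snacks dataset has 20 classes

# Step 3: logistic regression as a three-layer Keras pipeline
model = Sequential()
model.add(Flatten(input_shape=(image_height, image_width, 3)))
model.add(Dense(num_classes))
model.add(Activation("softmax"))
model.summary()     # prints the layers and learnable parameter counts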

 

The logistic regression model in Keras

The features the logistic regression works on are the pixels of the input images.

The first layer, Flatten, converts the 3D image into a 1D vector.

Because logistic regression expects a 1D vector as its input, the Flatten layer simply unrolls the image's lines of pixels into one big strip.

Flatten turns the 3D image into a 1D vector

Flatten doesn't compute anything; it only changes the shape of the input.

The actual logistic regression happens in the Dense layer.

This layer has 20 outputs, one for each class in the snacks dataset, and every output is connected to every input.

What the Dense layer does

(See p. 214 of the book.)

 

 

* Compiling the model

Compiling the model tells Keras how you want to train it. You need to specify three things (a compile sketch follows this list):

  • The loss function to use: Recall from the introduction that the loss function determines how good — or rather, how bad — the model is at making predictions. During training, the loss is initially high as the model just makes random predictions at the start. But as training progresses the loss should become lower and lower while the model gets better and better.

    It’s important to choose a loss function that makes sense for your model. Because your model uses softmax to produce the final output, the corresponding loss function is the categorical cross-entropy. That sounds nasty, but categorical just means you’re building a classifier with more than two classes, and cross-entropy is the loss that belongs with softmax. For a classifier with two classes, you’d use binary cross-entropy loss instead.

  • An optimizer: This is the object that implements the Stochastic Gradient Descent or SGD process that finds the best values for the weights and biases. As the loss function computes how wrong the model is at making predictions, the optimizer uses that loss and tweaks the learnable parameters in the model to make the model slightly better. Mathematically speaking, the optimizer finds the parameters that minimize the loss.

    There are different types of optimizers but they all work in kind of the same way. You’re using the Adam optimizer, which is a good default choice, with learning rate 1e-3 or 0.001. The learning rate or LR determines the size of the steps taken by the optimizer. If the LR is too big, the optimizer will go nuts and the loss never becomes any smaller (or may even blow up into a huge number). If the LR is too small, it will take forever for the model to learn anything.

    The learning rate is one of the most important hyperparameters that you can set, and finding a good value for the LR is key to getting your model to learn. The author tried out a few different values and settled on 1e-3 as a good choice for this particular model.

  • Any metrics you want to see: As it is training your model, Keras will always print out the loss value, but you’re also interested in the accuracy of the model as that is an easier metric to interpret. A loss value of 0.35 by itself doesn’t say much about how good the model is, but an accuracy value of 94% correct does.
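A sketch of the compile call matching the three choices above; the Adam import path assumes the standalone keras package:

from keras.optimizers import Adam

model.compile(
    loss="categorical_crossentropy",   # the loss that belongs with softmax
    optimizer=Adam(lr=1e-3),           # learning rate from the text
    metrics=["accuracy"])              # extra metric to print during training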

 

* Loading the data

Load the snacks data files you downloaded earlier.

 

The pixels have values between 0 and 255. Normalize them before training.

Normalizing or feature scaling means that the data will have an average value or mean of 0 and usually also a standard deviation of 1. This is important when different features are not all in the same numerical range. For example, if your data has one feature with values between 0 and 1000 and another feature with values between 5 and 10, training will generally work better if you first normalize the features so that they both are between -1 and +1.
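For image pixels in the 0-255 range, one simple normalization, and a plausible sketch of the chapter's normalize_pixels function, maps them into [-1, +1]:

def normalize_pixels(image):
    # map pixel values from [0, 255] into [-1, +1]
    return image / 127.5 - 1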

 

* Too soon to start making predictions?

Even before the model is trained, you can make predictions for an input image.

The untrained model will predict roughly the same value for every class, because it hasn't learned yet how to distinguish the classes.

If you were to make predictions for the entire dataset at this point, each class would be predicted the same number of times and the overall accuracy would be 0.05 or 5% — basically a random guess. The goal of machine learning is to train a classifier that can do better than random guessing.

 

* Using generators

The data generator takes the normalize_pixels function as its preprocessing function so that it automatically normalizes the images as it loads them.

 

The train_generator is for images from the train folder, the val_generator for images from the val folder, and the test_generator for images from the test folder. In this case you’re making a generator that can produce images by loading them from the given folder. The reason you need to use generators is that you cannot possibly load all the images into memory all at once, since that would require many gigabytes or even terabytes of RAM — more than fits in your computer! The only way to deal with that much data is to load the images on-demand. That’s what the Keras generators allow you to do.
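A sketch of the three generators (the folder paths and batch size are assumptions; adjust them to your setup):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(preprocessing_function=normalize_pixels)

train_generator = datagen.flow_from_directory(
    "snacks/train", target_size=(image_height, image_width),
    batch_size=64, class_mode="categorical")
val_generator = datagen.flow_from_directory(
    "snacks/val", target_size=(image_height, image_width),
    batch_size=64, class_mode="categorical")
test_generator = datagen.flow_from_directory(
    "snacks/test", target_size=(image_height, image_width),
    batch_size=64, class_mode="categorical", shuffle=False)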

 

You may have expected to see a label like 'apple' or 'cake', but instead you get a vector with 20 numbers. When you try this, you’ll probably get a different label than what’s printed in the book, since the training set is randomly shuffled. But whatever label you get, it should consist of 19 zeros and a single one.

This is called one-hot encoding. The position of the 1 corresponds to the name of the class. In this case the 1 is in the 13th position, which belongs to class pineapple. You can see this by executing the cell:
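The cell itself isn't shown in these notes; reconstructed from the description in the next paragraph, it is likely something like:

index2class = {v: k for k, v in train_generator.class_indices.items()}
index2class[13]    # 'pineapple'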

 

This is a so-called Python dictionary comprehension. It takes all the key-value pairs in the class_indices dictionary and creates a new dictionary that flips the order of the key and value. Now you can look up the name of the class by the index of the element that is 1 in the one-hot encoded vector for the label.

 

* The first evaluation 
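Before training, evaluate the untrained model on the test set as a sanity check. A sketch, using the old standalone-Keras API (newer versions fold this into evaluate):

test_loss, test_acc = model.evaluate_generator(test_generator)
print(test_loss, test_acc)   # expect accuracy near 1/20 = 0.05 and loss near np.log(20), about 3.0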

* Training the logistic regression model

Training is really just a matter of calling fit_generator() on the model. To start with, you’ll train for five epochs — an epoch is one pass through all the training images.
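A sketch of that call (the epoch count comes from the text; workers is an optional speed-up):

model.fit_generator(
    train_generator,
    validation_data=val_generator,
    epochs=5,
    workers=4)    # load images on multiple threads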

 

* What happens during training?

When you train the model, Keras picks images at random from the train folder and shows them to the model.

 

When you create the model, the learnable parameters are just randomly chosen numbers and the predictions will be way off. Over the course of training, these random numbers slowly change into something more reasonable that can actually make good predictions.

Since Keras knows which folder each image came from, it can compute the loss between the prediction and the ground-truth label for the image. One-hot encoding turns a label such as banana into numbers, so Keras can do this computation, and the model's outputs can be read as class probabilities.

Now that you have two vectors of 20 elements each, it means you can compare them. The formula for this is known as the cross-entropy loss. This chapter has already had enough math in it, so let’s just say that this compares each element between the two vectors in some fashion, and adds up the results. This gives the loss for this particular image, which is just a single number.

If the prediction was also (mostly) banana, then the softmax output looks a lot like the ground-truth and the loss is very small; if the prediction was 100% banana, then the loss is 0 because it’s exactly right.

If the prediction for this image is not banana, then the loss is a larger number. The worse the prediction is, the less the predicted probabilities match the ground-truth probabilities, and the higher the loss will be.
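A tiny worked example of this loss (the class index and probabilities are made up):

import numpy as np

y_true = np.zeros(20)
y_true[1] = 1.0                  # one-hot ground truth: hypothetical 'banana'
y_pred = np.full(20, 0.01)
y_pred[1] = 0.81                 # softmax output that is mostly 'banana'

loss = -np.sum(y_true * np.log(y_pred))
print(loss)                      # ~0.21: small, because the prediction is nearly right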

 

* Hey, it's progress! 

While training is in progress, Keras displays a progress bar.

 

* It could be better

Overfitting.

The model isn’t actually learning to classify images, it’s just learning to tell apart the images that are in the training set. It’s likely that the model is learning which combinations of pixels belong to which training image — and that’s not what you want. You want the model to understand what those pixels represent in a more abstract sense.

You don’t want to train a model that remembers specific training images; you want a model that can learn to classify images it hasn’t seen yet.

 

* Your first neural network

The fix is to add a layer of hidden neurons between the input and the logistic regression. Each hidden neuron is followed by an activation function called the rectified linear unit, or ReLU, which transforms the data in a non-linear way: negative values are rectified to zero, while positive values pass through unchanged.
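A classical neural network, as the key points below put it, is just two or more logistic regressions in a row. A sketch of such a network (the hidden layer size of 500 is an assumption, not necessarily the book's value):

model = Sequential()
model.add(Flatten(input_shape=(image_height, image_width, 3)))
model.add(Dense(500, activation="relu"))   # hidden layer with ReLU
model.add(Dense(num_classes))
model.add(Activation("softmax"))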

 

* KEY POINTS

  • Linear regression is one of the most basic machine-learning models, dating back to the 1800s when Gauss and others discovered the method of Ordinary Least Squares. It models the relationship between different variables. You can turn linear regression into logistic regression with the sigmoid function, making it a classifier model.

  • To build a logistic regression classifier in Keras, you just need one Dense layer followed by softmax activation. To use images with the Dense layer, you need to Flatten the image data into a one-dimensional vector first.

  • To train a model in Keras, you need to choose a loss function — cross-entropy for a classifier — as well as an optimizer. Setting the optimizer’s learning rate is important or the model won’t be able to learn anything.

  • Load your data with ImageDataGenerator. Use a normalization function to give your data a mean of 0 and a standard deviation of 1. Choose a batch size that fits on your GPU — 32 or 64 is a good default choice.

  • Be sure to check the loss and accuracy of your test set on the untrained model, to see if you get reasonable values. The accuracy should be approximately 1/num_classes, the loss should be close to np.log(num_classes).

  • Keep your eye on the validation accuracy during training. If it stops improving while the training accuracy continues going up, your model is overfitting.

  • A classical neural network is just two or more logistic regressions in a row.

  • Logistic regression and classical feed-forward neural networks are not the best choice for building image classifiers.

 

 

 

 
