Exploring Large Datasets - Cats vs Dogs
In this part of the week, the course focuses more on convolutional networks, using the Cats vs Dogs Kaggle competition as an example.
Dataset and file/directory structure
First we need to organize the directories in a way that mimics the structure we used in the previous example, but this time with the cats and dogs dataset.
1. Download the dataset
First we need to download the dataset from Kaggle (the Cats vs Dogs dataset).
This dataset contains images of cats and dogs, all in a single directory.
2. Reorganize the dataset in a specific directory structure
We want to re-organize the data to follow the structure:
| cats-v-dogs
--> training
    ---> dogs
    ---> cats
--> testing
    ---> dogs
    ---> cats
|
For that we will make use of the os library and the os.path.join() and os.mkdir() functions.
Regarding the code
1. Libraries to import
We will need several libraries. For this first step we need something to handle the dataset's .zip file, something to build paths and create directories, and something to copy files to the new directories:
- import os: allows us to create directories and build paths.
- import zipfile: allows us to handle .zip files.
- from shutil import copyfile: allows us to copy files from one directory to another.
- import random: we will use it later to shuffle the samples, so we can split the dataset into training and testing sets at random.
| import os
import zipfile
import random
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from shutil import copyfile
|
We will use the ImageDataGenerator class just to rescale the data, but in the future we will also use it to generate (augment) more data.
2. Handle the zipfile
This code is meant to be run on Colab or Jupyter notebooks; it might be different if we are just running a script locally.
| !wget --no-check-certificate \
"https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip" \
-O "/tmp/cats-and-dogs.zip"
local_zip = '/tmp/cats-and-dogs.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp')
zip_ref.close()
|
The content of the zip file will be extracted into the /tmp directory.
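If we are running a plain local script instead of a notebook (where the !wget shell command is not available), a minimal sketch of the same download and extraction step using only the Python standard library could look like this (same URL and paths as above):
| import urllib.request
import zipfile

url = ('https://download.microsoft.com/download/3/E/1/'
       '3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip')
local_zip = '/tmp/cats-and-dogs.zip'

# Download the zip file and extract its content into /tmp
urllib.request.urlretrieve(url, local_zip)
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')
|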
Checking the number of files in the directories
We can check how many files each class directory contains using len(os.listdir('path')):
| print(len(os.listdir('/tmp/PetImages/Cat/')))
print(len(os.listdir('/tmp/PetImages/Dog/')))
|
3. Creating the directory structure and moving files
Creating directories is a task that might fail, so it is good practice to wrap it in a try/except block.
First we will build the paths of the directories we want to create; we will do that using os.path.join(parent_dir, child_dir).
| base_path = os.path.join('/tmp', 'cats-v-dogs')
training_path = os.path.join(base_path, 'training')
train_dog_path = os.path.join(training_path, 'dogs')
train_cat_path = os.path.join(training_path, 'cats')
testing_path = os.path.join(base_path, 'testing')
test_dog_path = os.path.join(testing_path, 'dogs')
test_cat_path = os.path.join(testing_path, 'cats')
|
Now we need to use os.mkdir('path')
to create the directories, all of this inside a try/except
block.
| try:
os.mkdir(base_path)
os.mkdir(training_path)
os.mkdir(testing_path)
os.mkdir(train_dog_path)
os.mkdir(train_cat_path)
os.mkdir(test_dog_path)
os.mkdir(test_cat_path)
except OSError:
pass
|
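As a side note (just an alternative sketch, not the approach used in the course), os.makedirs() can create a directory together with all its missing parents in a single call, and exist_ok=True avoids failing when the directory already exists:
| # Creating the leaf directories also creates /tmp/cats-v-dogs, training and testing
for path in [train_dog_path, train_cat_path, test_dog_path, test_cat_path]:
    os.makedirs(path, exist_ok=True)
|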
4. Split the dataset in Training and Testing
Now we need to create one set of data to be used for training and another for testing. For that we will write a function that takes the parameters SOURCE, TRAINING, TESTING and SPLIT_SIZE, where SOURCE is the original folder and SPLIT_SIZE is the ratio used to split the dataset.
| def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE):
    # Keep only files whose size is greater than 0 (skip corrupt/empty images)
    clean_list = [data for data in os.listdir(SOURCE) if os.path.getsize(os.path.join(SOURCE, data))]
    # Shuffle the file names
    random_list = random.sample(clean_list, len(clean_list))
    # Index where the training portion ends
    limit = int(len(random_list)*SPLIT_SIZE)
    training_list = random_list[:limit]
    testing_list = random_list[limit:]
    for img in training_list:
        copyfile(os.path.join(SOURCE, img), os.path.join(TRAINING, img))
    for img in testing_list:
        copyfile(os.path.join(SOURCE, img), os.path.join(TESTING, img))
|
In a nutshell:
- We take the full dataset and remove the files that are corrupt or whose size is 0, using os.path.getsize().
- Then we shuffle the dataset with random.sample().
- We compute a limit based on the split ratio for the dataset (for example, \(0.9\) means \(90\%\) of the data goes to training and \(10\%\) to testing).
- We create two lists: one for training and one for testing.
- Finally, using copyfile(), we copy the files from the source to their final destination.
Here is how it will look in the notebook:
| def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE):
    print(f' Source path {SOURCE},\n Training {TRAINING}')
    clean_list = [data for data in os.listdir(SOURCE) if os.path.getsize(os.path.join(SOURCE, data))]
    random_list = random.sample(clean_list, len(clean_list))
    limit = int(len(random_list)*SPLIT_SIZE)
    print(f'total: {len(random_list)} after split {limit}')
    training_list = random_list[:limit]
    testing_list = random_list[limit:]
    print(f'{len(training_list)}, {len(testing_list)}')
    for img in training_list:
        copyfile(os.path.join(SOURCE, img), os.path.join(TRAINING, img))
    for img in testing_list:
        copyfile(os.path.join(SOURCE, img), os.path.join(TESTING, img))
CAT_SOURCE_DIR = "/tmp/PetImages/Cat/"
TRAINING_CATS_DIR = "/tmp/cats-v-dogs/training/cats/"
TESTING_CATS_DIR = "/tmp/cats-v-dogs/testing/cats/"
DOG_SOURCE_DIR = "/tmp/PetImages/Dog/"
TRAINING_DOGS_DIR = "/tmp/cats-v-dogs/training/dogs/"
TESTING_DOGS_DIR = "/tmp/cats-v-dogs/testing/dogs/"
split_size = .9
split_data(CAT_SOURCE_DIR, TRAINING_CATS_DIR, TESTING_CATS_DIR, split_size)
split_data(DOG_SOURCE_DIR, TRAINING_DOGS_DIR, TESTING_DOGS_DIR, split_size)
|
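After running the split we can sanity-check the result by counting the files in each new directory (the exact numbers depend on the dataset version and on how many zero-size files were skipped):
| print(len(os.listdir('/tmp/cats-v-dogs/training/cats/')))
print(len(os.listdir('/tmp/cats-v-dogs/training/dogs/')))
print(len(os.listdir('/tmp/cats-v-dogs/testing/cats/')))
print(len(os.listdir('/tmp/cats-v-dogs/testing/dogs/')))
|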
5. The model
The architecture of the model
We are going to use 3 convolutional layers and 3 pooling layers. The important detail is the first layer, which defines input_shape=(150, 150, 3): images of 150x150 pixels with 3 channels, because they are color (RGB) images. The last layer uses a different activation function, in this case a sigmoid, because the classification we are doing is binary (cat or dog), so sigmoid is the best option.
| model = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(16,(3,3),activation='relu', input_shape=(150,150,3)),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Conv2D(32,(3,3), activation='relu'),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Conv2D(64,(3,3), activation='relu'),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(512,activation='relu'),
tf.keras.layers.Dense(1,activation='sigmoid')
])
|
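To check how the image shrinks through the convolution and pooling layers, and how many parameters each layer adds, we can print a summary of the model:
| model.summary()
|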
For the compilation we will use the RMSprop optimizer with a learning rate of lr=0.001, the loss function will be binary_crossentropy (again, because it is a binary classification), and the metric will be accuracy.
| model.compile(optimizer=RMSprop(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])
|
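Note that in more recent TensorFlow/Keras versions the lr argument is deprecated in favor of learning_rate, so depending on the installed version the compile step may need to be written as:
| model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
|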
Image Data Generator
In this case we won't use the different augmentation parameters of the ImageDataGenerator class; we will only use the rescale parameter.
| train_datagen = ImageDataGenerator(rescale=1./255)
|
Now, to use it correctly, we will follow these steps:
- Define the training directory path.
- Create an instance of ImageDataGenerator.
- Use flow_from_directory to load the data in batches.
| batch_size = 10 # we need this later
#1. Training directory
TRAINING_DIR = "/tmp/cats-v-dogs/training"
#2. Instance of the ImageDataGenerator class
train_datagen = ImageDataGenerator(rescale=1./255)
#3. use `flow_from_directory` to process the files
train_generator = train_datagen.flow_from_directory(
directory=TRAINING_DIR,
target_size =(150,150),
batch_size=batch_size,
class_mode='binary' )
VALIDATION_DIR = '/tmp/cats-v-dogs/testing'
validation_datagen = ImageDataGenerator(rescale=1./255)
validation_generator = validation_datagen.flow_from_directory(
directory=VALIDATION_DIR,
target_size =(150,150),
batch_size=batch_size,
class_mode='binary' )
|
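With class_mode='binary' the generator assigns the labels 0 and 1 based on the alphabetical order of the sub-directory names. We can check which label corresponds to which class:
| print(train_generator.class_indices)  # expected something like {'cats': 0, 'dogs': 1}
|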
Training the model
We will use the generators created before to train the model.
| history = model.fit(train_generator,
epochs=15,
verbose=1,
validation_data=validation_generator)
|
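Optionally, once training finishes we can save the trained model so we don't need to retrain it later (the file path here is just an example):
| model.save('/tmp/cats-v-dogs.h5')  # example path; newer Keras versions may prefer the .keras extension
|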
6. Plot the results
The following is copied from the code provided in the Coursera course; it is used to plot the results of the model, showing the loss and accuracy curves.
| # PLOT LOSS AND ACCURACY
%matplotlib inline
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
#-----------------------------------------------------------
# Retrieve a list of list results on training and test data
# sets for each training epoch
#-----------------------------------------------------------
acc=history.history['accuracy']
val_acc=history.history['val_accuracy']
loss=history.history['loss']
val_loss=history.history['val_loss']
epochs=range(len(acc)) # Get number of epochs
#------------------------------------------------
# Plot training and validation accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc, 'r', "Training Accuracy")
plt.plot(epochs, val_acc, 'b', "Validation Accuracy")
plt.title('Training and validation accuracy')
plt.figure()
#------------------------------------------------
# Plot training and validation loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'r', "Training Loss")
plt.plot(epochs, val_loss, 'b', "Validation Loss")
plt.title('Training and validation loss')
# Desired output. Charts with training and validation metrics. No crash :)
|