ImageDataGenerator: A Boon for TensorFlow Users

akhil anand
4 min read · Mar 6, 2021

How to handle large image datasets


Suppose you are dealing with an image dataset that has 100 classes, each consisting of 1,000 images. If we try to load the whole dataset at once on a low-configuration device, training will run out of memory. So what should we do? This is where the Keras API comes in with ImageDataGenerator. Instead of taking the whole dataset at once, with the help of ImageDataGenerator we divide the data into batches and feed those batches of image data into the network for image classification or other CNN applications. Please also read the TensorFlow documentation; believe me, it will give you more insight into everything I am going to explain further. Let's get started…
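To make the batching idea concrete, here is a minimal pure-Python sketch (the file names and the batch size of 32 are made up for illustration): a generator yields one small batch at a time instead of holding all 100,000 images in memory, which is the core idea behind ImageDataGenerator's flow_* methods.

```python
# A toy generator that yields batches of file names instead of
# loading the full dataset at once.
def batch_generator(filenames, batch_size=32):
    for start in range(0, len(filenames), batch_size):
        yield filenames[start:start + batch_size]

# 100 classes x 1000 images = 100,000 hypothetical file paths
files = [f"class_{c}/img_{i}.jpg" for c in range(100) for i in range(1000)]

batches = list(batch_generator(files, batch_size=32))
print(len(batches))     # → 3125 batches per epoch
print(len(batches[0]))  # → 32 images per batch
```

Only one batch of actual pixel data ever needs to live in memory at a time, which is why this scales to datasets far larger than RAM.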

Type I :- When the data is inside a directory on the local system


Importing the ImageDataGenerator

from tensorflow.keras.preprocessing.image import ImageDataGenerator

Augmentation :- This technique expands the effective size of the training set by generating modified copies of each image, so that the convolution layers can extract more and more features and the model becomes more generalized. An important point to remember: we should not apply augmentation to test images.

Typical augmentations include rescaling the image pixels into a certain range, flipping the image horizontally, changing the angle of the image, and so on.
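As a minimal illustration of one such transform, here is a pure-Python sketch of a horizontal flip on a tiny 2x3 "image" (in practice ImageDataGenerator does this for you; this only shows what the operation does to the pixels):

```python
# Horizontally flipping an image mirrors each row of pixels.
def horizontal_flip(image):
    return [row[::-1] for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]

flipped = horizontal_flip(image)
print(flipped)  # → [[3, 2, 1], [6, 5, 4]]
```

Applying the flip twice returns the original image, which is why flips are a cheap, label-preserving way to double the variety the network sees.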

train_datagen = ImageDataGenerator(rescale=1./255,
                                   validation_split=0.2,
                                   rotation_range=30,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   horizontal_flip=True,
                                   vertical_flip=True)
train_images = train_datagen.flow_from_directory(
    directory="Parent Folder/train",
    target_size=(256, 256),
    color_mode='rgb',
    class_mode='categorical',
    batch_size=32,
    shuffle=True,
    seed=1,
    subset="training")
val_images = train_datagen.flow_from_directory(
    directory="Parent Folder/train",
    target_size=(256, 256),
    color_mode='rgb',
    class_mode='categorical',
    batch_size=32,
    shuffle=False,
    seed=1,
    subset="validation")
test_datagen = ImageDataGenerator(rescale=1./255)
test_images = test_datagen.flow_from_directory(
    directory="Parent Folder/test",
    target_size=(256, 256),
    color_mode='rgb',
    class_mode='categorical',
    batch_size=32,
    shuffle=False)

rescale=1./255 As we know, each pixel value ranges between 0 and 255, so we rescale every pixel into the same 0–1 scale to normalize the images.
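A quick sketch of what rescale=1./255 does to individual pixels, assuming standard 8-bit pixel values:

```python
# Rescaling maps 8-bit pixel values from [0, 255] into [0.0, 1.0].
pixels = [0, 64, 128, 255]
scaled = [p * (1.0 / 255.0) for p in pixels]
print(scaled[0], scaled[-1])  # → 0.0 1.0
```

Keeping inputs in a small, consistent range helps gradient-based training behave well across images with different brightness levels.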

validation_split Splits the train directory into a train set and a validation set, where the validation set is used to estimate how accurate the model trained on the train set is.

flow_from_directory Reads the image data present in a local directory, inferring the class labels from the subdirectory names.

class_mode="categorical" Most commonly used for multi-class classification. The labels are returned in one-hot encoded form.
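To see what "one-hot encoded form" means, here is a pure-Python sketch for three hypothetical classes (the generator builds these vectors for you from the subdirectory names):

```python
# class_mode='categorical' turns each class label into a one-hot vector:
# all zeros except a 1.0 at the index of that label's class.
classes = ['cat', 'dog', 'horse']

def one_hot(label, classes):
    return [1.0 if c == label else 0.0 for c in classes]

print(one_hot('dog', classes))  # → [0.0, 1.0, 0.0]
```

This is the label format expected by a softmax output layer trained with categorical cross-entropy loss.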

target_size The dimensions to which every image will be resized. By default it is (256, 256).

seed=1 Makes the random shuffling and splitting reproducible; passing the same seed to the training and validation generators keeps the two subsets consistent with each other.
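The effect of a fixed seed can be sketched with Python's own random module: the same seed always produces the same shuffle order, which is why using the same seed in both generator calls keeps the train/validation split consistent.

```python
import random

items = list(range(10))

# Two shuffles with the same seed give identical orders.
a = items[:]
random.Random(1).shuffle(a)
b = items[:]
random.Random(1).shuffle(b)

print(a == b)  # → True
```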

Note :- We create two different datagens: train_datagen, on which we apply data augmentation, and test_datagen, on which we apply no augmentation at all.

Type II :- When the data is in CSV format


When we want to access image data listed in a CSV file, we must use datagen.flow_from_dataframe().

The CSV case works differently from the directory case: one column of the CSV holds the image filenames and another column holds their class labels.
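A hypothetical CSV of this shape, with an 'image' column of filenames and a 'class' column of labels, can be sketched with the standard library (the column names here are illustrative and match the x_col/y_col convention):

```python
import csv
import io

# A tiny in-memory stand-in for the real CSV file: one column of
# image filenames and one column of class labels.
raw = """image,class
img_001.jpg,cat
img_002.jpg,dog
img_003.jpg,cat
"""

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]['image'], rows[0]['class'])  # → img_001.jpg cat
```

flow_from_dataframe resolves each filename in the 'image' column against the directory argument and pairs it with the label in the 'class' column.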

# importing the libraries needed to work with CSV-based image data
from sklearn.model_selection import train_test_split
import pandas as pd

Here we will divide the dataset into three parts: a train set, a validation set, and a test set.

Train set :- the data on which we train the model. Validation set :- the data on which we check how accurate the model is during training. Test set :- the unseen data on which the final predictions are made. Let's play with the CSV file.

df=pd.read_csv("/parent.csv",dtype=str)
df_train,df_test=train_test_split(df,test_size=0.1,random_state=1)

With test_size=0.1, this split puts 90% of the rows into the train set and 10% into the test set.
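The proportions of such a split can be checked with a quick sketch (a seeded shuffle-and-slice standing in for train_test_split, which additionally handles DataFrames and stratification):

```python
import random

rows = list(range(1000))        # stand-in for the CSV rows
random.Random(1).shuffle(rows)  # analogue of random_state=1

test_size = 0.1
n_test = int(len(rows) * test_size)
test_rows, train_rows = rows[:n_test], rows[n_test:]

print(len(train_rows), len(test_rows))  # → 900 100
```

Fixing the random state here plays the same role as the generator seeds above: rerunning the script reproduces the exact same split.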

train_datagen = ImageDataGenerator(rescale=1./255,
                                   validation_split=0.2,
                                   rotation_range=30,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   horizontal_flip=True,
                                   vertical_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
train_images = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    directory="Parent Folder/images",
    x_col='image', y_col='class',
    target_size=(256, 256), color_mode='rgb',
    class_mode='categorical', batch_size=32,
    shuffle=True, subset="training", seed=42)
val_images = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    directory="Parent Folder/images",
    x_col='image', y_col='class',
    target_size=(256, 256), color_mode='rgb',
    class_mode='categorical', batch_size=32,
    shuffle=False, subset="validation", seed=42)
test_images = test_datagen.flow_from_dataframe(
    dataframe=df_test,
    directory="Parent Folder/images",
    x_col='image', y_col='class',
    target_size=(256, 256), color_mode='rgb',
    class_mode='categorical', batch_size=32,
    shuffle=False)

directory :- The path to the folder where the image files themselves are stored; the filenames in x_col are resolved relative to it.

x_col, y_col :- The CSV columns containing the image filenames and their class labels, respectively.

shuffle :- We shuffle the data for the train set; for the validation and test sets we leave the order as it is, so that predictions stay aligned with the labels.

Note :- Whenever you get custom data as a directory of folders, try converting it into CSV format before applying image classification. You will get a lot of benefit out of the added flexibility.

Conclusion :-

I hope this article helps you resolve many such issues. Please comment below with any suggestions or improvements you would like to see in my future blogs. Keep learning, keep exploring…
