Improving accuracy and speeding up Document Image Classication through parallel systems

This paper presents a study showing the benefits of the EfficientNet models compared with heavier Convolutional Neural Networks (CNNs) in the Document Classification task.

We show in the RVL-CDIP dataset that we can improve previous results with a much lighter model and present its transfer learning capabilities on a smaller in-domain dataset such as Tobacco3482. Moreover, we present an ensemble pipeline which is able to boost solely image input by combining image model predictions with the ones generated by BERT model on extracted text by OCR.

We also show that the batch size can be effectively increased without hindering its accuracy so that the training process can be sped up by parallelizing throughout multiple GPUs, decreasing the computational time needed. Lastly, we expose the training performance differences between PyTorch and Tensorflow Deep Learning frameworks.

Paper: Improving accuracy and speeding up Document Image Classification through parallel systems

View code on GitHub

section icon
Model

Parallel pre-training in BigTobacco and fine-tuning in SmallTobacco:

ensemble

SmallTobacco training/fine-tuning:

ensemble

section icon
results

Results comparison

tfvspy

TensorFlow vs PyTorch distributed training

tfvspy

section icon
usage

Image Model Distributed Training

PyTorch

efficientnet_pytorch library downloads the models in .cache/torch/checkpoints. In case your machine has no internet connection, make sure to add the models manually.

1
2
3
4
5
python -m torch.distributed.launch eff_big_training_distributed.py \
	-n 1 -g 4 -nr 0 \
	--epochs 20 \
	--eff_model b0 \
	--load_path /gpfs/scratch/bsc31/bsc31275/
  • n: number of nodes

  • g: number of gpus in each node

  • nr: the rank of the current node within all the nodes

  • epochs: training number of epochs

  • eff_model: EfficientNet model

  • load_path: path where datasets are stored

TensorFlow

efficientnet.tfkeras library downloads the models in .keras/models.

1
python distr_effnet_shear.py --image_model 0 --optimizer sgd --epochs 20
  • image_model: : EfficientNet model

  • optimizer: optimizer to be used

  • epochs: training number of epochs

Text Model (PyTorch)

pytorch_transformers library downloads the models in .cache/torch/pytorch_transformers. BERT training is simply done by running python main.py. To get the ensemble results run python ensemble.py.

Also see README.md on github!

section icon
datasets

Datasets

BigTobacco and SmallTobacco raw datasets can be downloaded here and here.

We provide the scripts to generate the .hdf5 and .TfRecord used here.

For BigTobacco run python ./Data/python BT_hdf5_dataset_creation.py to create the .hdf5 files for train, test and validation sets. Run python ./Data/python hdf5_to_tfrecord.py to convert .hdf5 files to tfRecord.

For SmallTobacco, we provide the scripts for both obtaining Tesserect OCR .txt files and generating random splits .hdf5 files.

Run:

  • python ./Data/ocr_tobacco.py to extract OCR and save .txt files in the same path as the images.

  • python ./Data/python ST_hdf5_dataset_creation.py to create the .hdf5 file dataset.

Please contact the repository owner for more information.

citation

If you find this paper useful, consider citing:

@InProceedings{10.1007/978-3-030-50417-5_29,
author="Ferrando, Javier
and Dom{\'i}nguez, Juan Luis
and Torres, Jordi
and Garc{\'i}a, Ra{\'u}l
and Garc{\'i}a, David
and Garrido, Daniel
and Cortada, Jordi
and Valero, Mateo",
editor="Krzhizhanovskaya, Valeria V.
and Z{\'a}vodszky, G{\'a}bor
and Lees, Michael H.
and Dongarra, Jack J.
and Sloot, Peter M. A.
and Brissos, S{\'e}rgio
and Teixeira, Jo{\~a}o",
title="Improving Accuracy and Speeding Up Document Image Classification Through Parallel Systems",
booktitle="Computational Science -- ICCS 2020",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="387--400",
isbn="978-3-030-50417-5"
}