The example below shows how to fine-tune a TensorFlow text classification model using your own dataset in .csv format. The .csv file is expected to have two columns: a numerical class label and the text/sentence to classify. Note that while the TLT API is more flexible, allowing map functions that translate string class names to numerical values and filtering of which columns are used, the CLI only accepts .csv files in the expected format.
The --dataset-dir argument is the path to the directory where your dataset is located, and --dataset-file is the name of the .csv file to load from that directory. Use the --class-names argument to specify a list of the classes, and --delimiter to specify the character that separates the two columns. If no --delimiter is specified, the CLI defaults to a comma (,). For a tab-separated file, pass a literal tab character (for example, $'\t' in bash, as in the commands below).
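For reference, here is what a correctly formatted two-column file looks like. The rows and file path below are made up purely to illustrate the layout, using a tab as the delimiter:
# Hypothetical example: numerical label, a tab, then the text to classify
printf '0\tMeet me at the library at noon\n1\tWIN a FREE prize! Reply YES now\n' > /tmp/sample.csv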
This example downloads the SMS Spam Collection dataset, whose .zip archive contains a tab-separated value file. The dataset consists of SMS text messages labeled as either ham or spam. The first column in the data file holds the label (ham or spam) and the second column holds the text of the SMS message. The string class labels are replaced with numerical values before training.
# Create dataset and output directories
export DATASET_DIR=/tmp/data
export OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}
# Download and extract the dataset
wget -P ${DATASET_DIR} https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
unzip ${DATASET_DIR}/sms+spam+collection.zip -d ${DATASET_DIR}
# Make a copy of the .csv file with 'numerical' in the file name
DATASET_FILE=SMSSpamCollection_numerical.csv
cp ${DATASET_DIR}/SMSSpamCollection ${DATASET_DIR}/${DATASET_FILE}
# Replace string class labels with numerical values in the .csv file
# These numerical class labels are passed as --class-names during training and evaluation
# Anchor the match at the start of the line so occurrences of 'ham'/'spam' in the message text are untouched
sed -i 's/^ham\t/0\t/' ${DATASET_DIR}/${DATASET_FILE}
sed -i 's/^spam\t/1\t/' ${DATASET_DIR}/${DATASET_FILE}
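# Optional sanity check (an extra step, not required): confirm the labels
# were converted by inspecting the first few lines of the file
head -3 ${DATASET_DIR}/${DATASET_FILE}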
# Train google_bert_uncased_L-10_H-256_A-4 using our dataset file, which has tab delimiters
tlt train \
-f tensorflow \
--model-name google_bert_uncased_L-10_H-256_A-4 \
--output-dir ${OUTPUT_DIR} \
--dataset-dir ${DATASET_DIR} \
--dataset-file ${DATASET_FILE} \
--epochs 2 \
--class-names 0,1 \
--delimiter $'\t'
# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
--model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1 \
--model-name google_bert_uncased_L-10_H-256_A-4 \
--dataset-dir ${DATASET_DIR} \
--dataset-file ${DATASET_FILE} \
--class-names 0,1 \
--delimiter $'\t'
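Since each training run saves to a new numbered subdirectory under the model name, you can locate the most recent one from the shell before running tlt eval. A minimal sketch, assuming the default output layout shown above:
# Pick the highest-numbered (most recent) run directory
MODEL_DIR=$(ls -d ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/* | sort -V | tail -1)
echo ${MODEL_DIR}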
This example demonstrates using the Intel Transfer Learning Tool CLI to fine-tune a text classification model using a dataset from the TensorFlow Datasets (TFDS) catalog. Intel Transfer Learning Tool supports the following text classification datasets from TFDS: imdb_reviews, glue/sst2, and glue/cola.
# Create dataset and output directories
export DATASET_DIR=/tmp/data
export OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}
# Name of the dataset to use
DATASET_NAME=imdb_reviews
# Train google_bert_uncased_L-10_H-256_A-4 using the TFDS dataset
tlt train \
-f tensorflow \
--model-name google_bert_uncased_L-10_H-256_A-4 \
--output-dir ${OUTPUT_DIR} \
--dataset-dir ${DATASET_DIR} \
--dataset-name ${DATASET_NAME} \
--epochs 2
# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
--model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/2 \
--model-name google_bert_uncased_L-10_H-256_A-4 \
--dataset-dir ${DATASET_DIR} \
--dataset-name ${DATASET_NAME}
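To train with one of the other supported TFDS datasets, only the dataset name needs to change. For example, substituting glue/sst2 in the commands above:
# Use the GLUE SST-2 dataset instead of imdb_reviews
DATASET_NAME=glue/sst2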
This example runs a distributed PyTorch training job using the TLT CLI. It fine-tunes a text classification model for document-level sentiment analysis using a dataset from the Hugging Face catalog, such as the sst2 dataset used below.
Follow these instructions to set up your machines for distributed training with PyTorch. This will ensure your environment has the right prerequisites, package dependencies, and hostfile configuration. Once you have successfully run the sanity check, the following commands fine-tune bert-large-uncased with sst2 for one epoch using 2 nodes and 2 processes per node (4 worker processes in total).
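The --hostfile argument points to a plain-text file listing the participating machines, one hostname or IP address per line. A minimal sketch with placeholder addresses (substitute the hostnames or IPs of your own nodes):
# Create a hostfile for the 2 nodes (placeholder IP addresses)
cat > hostfile <<EOF
192.168.1.10
192.168.1.11
EOF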
# Create dataset and output directories
export DATASET_DIR=/tmp/data
export OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}
# Name of the dataset to use
DATASET_NAME=sst2
# Train bert-large-uncased using the Hugging Face dataset sst2
tlt train \
-f pytorch \
--model-name bert-large-uncased \
--dataset-name ${DATASET_NAME} \
--output-dir ${OUTPUT_DIR} \
--dataset-dir ${DATASET_DIR} \
--distributed \
--hostfile hostfile \
--nnodes 2 \
--nproc-per-node 2
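Citations for the datasets used in these examples: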
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
@inproceedings{wang2019glue,
title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
note={In the Proceedings of ICLR.},
year={2019}
}
@misc{misc_sms_spam_collection_228,
author = {Almeida, Tiago},
title = {{SMS Spam Collection}},
year = {2012},
howpublished = {UCI Machine Learning Repository}
}
@inproceedings{socher-etal-2013-recursive,
title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
author = "Socher, Richard and
Perelygin, Alex and
Wu, Jean and
Chuang, Jason and
Manning, Christopher D. and
Ng, Andrew and
Potts, Christopher",
booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
month = oct,
year = "2013",
address = "Seattle, Washington, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D13-1170",
pages = "1631--1642",
}