# Text Classification Intel® Transfer Learning Tool CLI Example

## Fine Tuning Using Your Own Dataset

The example below shows how to fine-tune a TensorFlow text classification model using your own dataset in .csv format. The .csv file is expected to have two columns: a numerical class label and the text/sentence to classify. Note that while the Intel Transfer Learning Tool (TLT) API is more flexible, allowing map functions that translate string class names to numerical values and letting you filter which columns are used, the CLI only accepts .csv files in the expected format.

The --dataset-dir argument is the path to the directory where your dataset is located, and --dataset-file is the name of the .csv file to load from that directory. Use the --class-names argument to specify a comma-separated list of the class labels and --delimiter to specify the character that separates the two columns. If no --delimiter is specified, the CLI defaults to a comma (,).

This example downloads the SMS Spam Collection dataset, which is distributed as a .zip archive containing a tab-separated values file. The dataset consists of SMS text messages labeled as either ham or spam. The first column of the data file holds the label (ham or spam) and the second column holds the text of the SMS message. The string class labels are replaced with numerical values before training.
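For reference, after the label replacement step below, rows of the tab-delimited .csv file look like the following (the message text here is invented for illustration; the real dataset's messages differ):

```
0	Ok, see you at the usual place at 7.
1	URGENT! You have won a free prize. Reply WIN to claim.
```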

```bash
# Create dataset and output directories
export DATASET_DIR=/tmp/data
export OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Download and extract the dataset
wget -P ${DATASET_DIR} https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
unzip ${DATASET_DIR}/sms+spam+collection.zip -d ${DATASET_DIR}

# Make a copy of the .csv file with 'numerical' in the file name
DATASET_FILE=SMSSpamCollection_numerical.csv
cp ${DATASET_DIR}/SMSSpamCollection ${DATASET_DIR}/${DATASET_FILE}

# Replace string class labels with numerical values in the .csv file.
# The list of numerical class labels is passed as --class-names during training and evaluation.
# Anchoring the patterns to the start of the line avoids corrupting message
# text that happens to contain the strings 'ham' or 'spam'.
sed -i 's/^ham\t/0\t/' ${DATASET_DIR}/${DATASET_FILE}
sed -i 's/^spam\t/1\t/' ${DATASET_DIR}/${DATASET_FILE}
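
# Optional sanity check (an extra step, not required by the example):
# confirm the first tab-delimited column now contains only 0 and 1
cut -f1 ${DATASET_DIR}/${DATASET_FILE} | sort | uniq -c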

# Train google_bert_uncased_L-10_H-256_A-4 using our dataset file, which has tab delimiters
tlt train \
    -f tensorflow \
    --model-name google_bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --epochs 2 \
    --class-names 0,1 \
    --delimiter $'\t'

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1 \
    --model-name google_bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-file ${DATASET_FILE} \
    --class-names 0,1 \
    --delimiter $'\t'
```
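The numbered directory under the model name holds the model exported after training. If your TLT version exports TensorFlow models in the SavedModel format (the versioned-directory layout above suggests it does), you can optionally inspect the export with TensorFlow's saved_model_cli utility before evaluating. This is an extra step, not part of the original example:

```bash
# List the exported files and show the SavedModel signatures
# (assumes saved_model_cli from the TensorFlow package is on your PATH)
ls ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1
saved_model_cli show --dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/1 --all
```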

## Fine Tuning Using a Dataset from the TFDS Catalog

This example demonstrates using the Intel Transfer Learning Tool CLI to fine-tune a text classification model using a dataset from the TensorFlow Datasets (TFDS) catalog. Intel Transfer Learning Tool supports the following text classification datasets from TFDS: imdb_reviews, glue/sst2, and glue/cola.
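To see which models are available for this use case, your TLT installation may provide a list subcommand. This is a hedged sketch: the exact subcommand and flags may differ by version, so check tlt --help first:

```bash
# List supported text classification models (assumes the `tlt list models`
# subcommand and its --use-case flag exist in your TLT version)
tlt list models --use-case text_classification
```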

```bash
# Create dataset and output directories
export DATASET_DIR=/tmp/data
export OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=imdb_reviews

# Train google_bert_uncased_L-10_H-256_A-4 using the TFDS dataset
tlt train \
    -f tensorflow \
    --model-name google_bert_uncased_L-10_H-256_A-4 \
    --output-dir ${OUTPUT_DIR} \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME} \
    --epochs 2

# Evaluate the model exported after training
# Note that your --model-dir path may vary, since each training run creates a new directory
tlt eval \
    --model-dir ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4/2 \
    --model-name google_bert_uncased_L-10_H-256_A-4 \
    --dataset-dir ${DATASET_DIR} \
    --dataset-name ${DATASET_NAME}
```
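As noted above, each training run creates a new numbered directory under the model name (this run used 2 because it was the second run writing to the same output directory). If you are unsure which number to pass to --model-dir, list the model's output directory:

```bash
# Show the numbered run directories; pass the most recent one to --model-dir
ls ${OUTPUT_DIR}/google_bert_uncased_L-10_H-256_A-4
```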

## Distributed Transfer Learning Using a Dataset from Hugging Face

This example runs a distributed PyTorch training job using the TLT CLI. It fine-tunes a text classification model for document-level sentiment analysis using a dataset from the Hugging Face catalog. Intel Transfer Learning Tool supports select text classification datasets from Hugging Face; this example uses sst2.

Follow the tool's setup instructions for distributed training with PyTorch to make sure your environment has the right prerequisites, package dependencies, and hostfile configuration. Once you have successfully run the sanity check, the following commands fine-tune bert-large-uncased on sst2 for one epoch using 2 nodes and 2 processes per node.
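The commands below assume a hostfile already exists in the working directory. Its exact format is defined by the setup instructions, but as a sketch it is typically a plain-text file with one reachable hostname or IP address per line:

```bash
# Hypothetical hostfile for a 2-node cluster; replace these addresses
# with the hostnames or IPs of your own machines
cat > hostfile <<EOF
192.168.1.10
192.168.1.11
EOF
```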

```bash
# Create dataset and output directories
export DATASET_DIR=/tmp/data
export OUTPUT_DIR=/tmp/output
mkdir -p ${DATASET_DIR}
mkdir -p ${OUTPUT_DIR}

# Name of the dataset to use
DATASET_NAME=sst2

# Train bert-large-uncased using the Hugging Face dataset sst2
tlt train \
    -f pytorch \
    --model_name bert-large-uncased \
    --dataset_name sst2 \
    --output_dir $OUTPUT_DIR \
    --dataset_dir $DATASET_DIR \
    --distributed \
    --hostfile hostfile \
    --nnodes 2 \
    --nproc_per_node 2
```

## Citations

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

@misc{misc_sms_spam_collection_228,
  author       = {Almeida, Tiago},
  title        = {{SMS Spam Collection}},
  year         = {2012},
  howpublished = {UCI Machine Learning Repository}
}

@inproceedings{socher-etal-2013-recursive,
    title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
    author = "Socher, Richard  and
      Perelygin, Alex  and
      Wu, Jean  and
      Chuang, Jason  and
      Manning, Christopher D.  and
      Ng, Andrew  and
      Potts, Christopher",
    booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
    month = oct,
    year = "2013",
    address = "Seattle, Washington, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D13-1170",
    pages = "1631--1642",
}
```