This project uses TensorFlow and Keras to build and train neural network models that recognize the 26 English letters in CAPTCHA images. It builds on the Keras Official Example, optimized for specific CAPTCHA scenarios.
The repository contains the following files:
- `tutorial.ipynb`: Model training tutorial with CTC
- `noCTC_tutorial.ipynb`: Alternative model training tutorial without CTC
- `image.ipynb`: Image processing experiments (for adapting the code to other image scenarios)
- `captcha_images`: CAPTCHA image folder
- `userscript`: Folder containing the userscript and models (work in progress)
The ultimate goal is to develop a userscript capable of automatic recognition and submission. To enable browser-side prediction, we attempted to convert the Keras model for deployment using TFJS, ONNX, and other conversion methods, but ran into problems (see Known Issues below).
Currently, the project provides easy-to-reproduce model training for corresponding CAPTCHAs with simple tutorials.
Create and activate a conda environment using the following commands:

```bash
conda create -n captcha_ocr python=3.10.13
conda activate captcha_ocr
```
Install the required Python packages:

```bash
pip install tensorflow==2.9.0
pip install numpy==1.26.4
pip install matplotlib==3.9.2
```
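To verify the environment, a quick import check (an optional sanity step, not part of the original tutorials) can be run:

```python
# Optional sanity check: confirm the pinned versions imported correctly
import tensorflow as tf
import numpy as np
import matplotlib

print(tf.__version__)          # expected: 2.9.0
print(np.__version__)          # expected: 1.26.4
print(matplotlib.__version__)  # expected: 3.9.2
```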
You can quickly prepare the dataset using CaptchaToolkit.
The project provides 78 images for learning and experimentation.
As a reference, using 860 images as a dataset can achieve training results like the following:

```
Epoch 58/150
48/49 [============================>.] - ETA: 0s - loss: 0.9573
Epoch 58: val_loss improved from 0.57060 to 0.48835, saving model to model.h5
Epoch 124/150
48/49 [============================>.] - ETA: 0s - loss: 0.0981
Epoch 124: val_loss improved from 0.02669 to 0.02192, saving model to model.h5
```
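The `val_loss improved ... saving model to model.h5` lines above come from a model checkpoint callback. A minimal sketch of such a setup, where `train_ds` and `val_ds` are hypothetical placeholders for the datasets prepared in the notebook (the actual code may differ):

```python
from tensorflow import keras

# Save the best weights whenever validation loss improves; verbose=1
# prints the "val_loss improved ..." messages seen in the log above.
checkpoint = keras.callbacks.ModelCheckpoint(
    "model.h5", monitor="val_loss", save_best_only=True, verbose=1
)

# train_ds / val_ds are hypothetical placeholders for the training
# and validation datasets built earlier in the notebook.
model.fit(train_ds, validation_data=val_ds, epochs=150, callbacks=[checkpoint])
```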
In actual prediction, the accuracy is as high as 99.99%.
Follow the steps in `tutorial.ipynb`, which contains detailed annotations. Note that Jupyter Notebook support needs to be installed after manually selecting the kernel.
- CNN is specialized for processing grid-structured data (like images) and automatically learns image features through convolution operations.
- In this project, CNN extracts visual features from CAPTCHA images, such as edges, textures, and shapes.
- Through multiple layers of convolution and pooling operations, CNN builds feature representations from low to high level.
- RNN excels at processing sequential data and can make more accurate predictions using contextual information.
- In CAPTCHA recognition, RNN improves recognition accuracy by considering character context relationships.
- For example, when recognizing the letter "m", RNN uses contextual information to avoid misidentification as "i" or other similar characters.
- CTC automatically aligns input sequences and label sequences without explicit segmentation position information.
- CTC calculates conditional probability P(Y|X) to predict target sequence Y given input sequence X.
- CTC introduces blank labels (ε) to handle spacing between characters and merges repeated predictions to output the complete text sequence (see the decoding sketch below).
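To make the blank-and-merge behavior concrete, here is a minimal greedy CTC decoding sketch (illustrative only, not the notebook's actual code):

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank_index):
    """probs: (timesteps, vocab_size + 1) per-step probability matrix."""
    best_path = np.argmax(probs, axis=-1)       # most likely label per timestep
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank_index:  # merge repeats, drop blanks (ε)
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)

# With charset "AB" and the blank at index 2, the raw path
# [0, 0, 2, 1, 1] ("AAεBB") decodes to "AB".
```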
The following introduction is based on tutorial.ipynb.
Input(shape=(width, height, 1)) → BatchNormalization()
- Receives grayscale image input
- Implements data normalization through BatchNormalization
Conv2D(32, 3×3) → BatchNorm → ReLU → MaxPool(2×2) → Dropout(0.2)
Conv2D(64, 3×3) → BatchNorm → ReLU → MaxPool(2×2) → Dropout(0.2)
- Two convolution blocks extract image features
- Each block includes normalization, activation, and pooling operations
- Dropout layers prevent overfitting
Reshape((width/4, height/4 * 64)) → Dense(128) → BatchNorm → ReLU → Dropout(0.4)
- Reshapes feature dimensions for sequence processing
- Feature transformation through fully connected layer
- Higher Dropout rate (0.4) enhances generalization
Bidirectional(LSTM(64, return_sequences=True, dropout=0.3))
- Bidirectional LSTM processes sequence features
- Sequence return mode captures contextual information
- Built-in Dropout mechanism prevents overfitting
Dense(vocab_size + 1, activation='softmax') → CTCLayer
- Softmax layer outputs character probability distribution
- CTC layer handles sequence alignment and loss calculation
- Optimizer: Adam (learning_rate=0.0005)
- Loss Function: CTC Loss
- Regularization: L2 regularization (a sketch of the full model follows)
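Put together, a runnable sketch of this architecture, following the Keras OCR example the project is based on (details such as the L2 regularizer settings are omitted here and may differ from the notebook):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

width, height, vocab_size = 280, 80, 26  # 26 letters, A-Z

class CTCLayer(layers.Layer):
    """Computes CTC loss at training time and passes predictions through."""
    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_len = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_len = tf.cast(tf.shape(y_true)[1], dtype="int64")
        input_len = input_len * tf.ones((batch_len, 1), dtype="int64")
        label_len = label_len * tf.ones((batch_len, 1), dtype="int64")
        self.add_loss(keras.backend.ctc_batch_cost(
            y_true, y_pred, input_len, label_len))
        return y_pred

image = layers.Input(shape=(width, height, 1), name="image")
labels = layers.Input(shape=(None,), name="label")

x = layers.BatchNormalization()(image)
for filters in (32, 64):  # two Conv → BN → ReLU → MaxPool → Dropout blocks
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.2)(x)

# Treat each of the width/4 columns as one timestep of the sequence
x = layers.Reshape((width // 4, (height // 4) * 64))(x)
x = layers.Dense(128)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.4)(x)

x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.3))(x)

# vocab_size + 1 classes: 26 letters plus the CTC blank label
softmax = layers.Dense(vocab_size + 1, activation="softmax")(x)
output = CTCLayer(name="ctc_loss")(labels, softmax)

model = keras.Model(inputs=[image, labels], outputs=output)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0005))
```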
| Layer (Type) | Output Shape | Params | Description |
|---|---|---|---|
| image (InputLayer) | (None, 280, 80, 1) | 0 | Grayscale image input |
| BatchNormalization | (None, 280, 80, 1) | 4 | Input normalization |
| Conv1 (Conv2D) | (None, 280, 80, 32) | 320 | First convolution |
| BatchNormalization | (None, 280, 80, 32) | 128 | Feature normalization |
| Activation (ReLU) | (None, 280, 80, 32) | 0 | ReLU activation |
| pool1 (MaxPooling2D) | (None, 140, 40, 32) | 0 | Feature pooling |
| Dropout | (None, 140, 40, 32) | 0 | Prevent overfitting |
| Conv2 (Conv2D) | (None, 140, 40, 64) | 18,496 | Second convolution |
| BatchNormalization | (None, 140, 40, 64) | 256 | Feature normalization |
| Activation (ReLU) | (None, 140, 40, 64) | 0 | ReLU activation |
| pool2 (MaxPooling2D) | (None, 70, 20, 64) | 0 | Feature pooling |
| Dropout | (None, 70, 20, 64) | 0 | Prevent overfitting |
| reshape (Reshape) | (None, 70, 1280) | 0 | Reshape features |
| dense1 (Dense) | (None, 70, 128) | 163,968 | Fully connected layer |
| BatchNormalization | (None, 70, 128) | 512 | Feature normalization |
| Activation (ReLU) | (None, 70, 128) | 0 | ReLU activation |
| Dropout | (None, 70, 128) | 0 | Prevent overfitting |
| Bidirectional (LSTM) | (None, 70, 128) | 98,816 | Bidirectional LSTM |
| label (InputLayer) | (None, None) | 0 | Label input |
| dense2 (Dense) | (None, 70, 27) | 3,483 | Output layer |
| ctc_loss (CTCLayer) | (None, 70, 27) | 0 | CTC loss calculation |
Total Parameters: 285,983
Trainable Parameters: 285,533
Non-trainable Parameters: 450
Unfortunately, when using the model in JavaScript, neither TensorFlow.js nor ONNX.js has a native implementation of CTC loss. So if you want to run predictions in the browser, you may need to reconsider the model architecture.
This is also why I created a second model without the CTC layer, which you can find in noCTC_tutorial.ipynb.
This is an alternative solution that does not use a CTC layer. It requires a larger dataset: as a reference, 13,000 images can achieve a val_loss below 2 after 200 epochs of training.
Since I haven't researched this in depth, it may not be the best model structure, so here is just a brief overview of its implementation:
- Split the 6-character CAPTCHA into 6 independent character recognition tasks, with each character having its own prediction branch responsible for predicting one of the 26 English letters.
- Process images uniformly into 80×280 grayscale images, with each character represented by a 26-dimensional one-hot vector (corresponding to A-Z).
- Use 3 CNN convolution blocks to extract image features, each block containing batch normalization, ReLU activation, max pooling, and Dropout, then predict the character at each position through 6 independent branches.
- Use categorical_crossentropy as the loss function, employ BatchNormalization and Dropout to prevent overfitting, and adjust the learning rate automatically based on validation set performance (see the sketch after this list).
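A rough sketch of this CTC-free architecture, with assumed filter counts and Dropout rates (the notebook's actual values may differ):

```python
from tensorflow import keras
from tensorflow.keras import layers

height, width, num_chars, vocab = 80, 280, 6, 26  # 6 positions over A-Z

inputs = layers.Input(shape=(height, width, 1))
x = inputs
for filters in (32, 64, 128):  # three conv blocks; filter counts assumed
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.25)(x)

x = layers.Flatten()(x)

# One independent softmax branch per character position
outputs = [
    layers.Dense(vocab, activation="softmax", name=f"char_{i}")(x)
    for i in range(num_chars)
]

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # applied to each of the 6 branches
    metrics=["accuracy"],
)

# Learning rate adjusts automatically based on validation performance
lr_cb = keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                          factor=0.5, patience=5)
```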
Following noCTC_tutorial.ipynb, I created two HDF5 model files and trained them in the same way, with the only difference being the dataset size, resulting in different val_loss values.
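For reference, one way to convert an HDF5 Keras model for TensorFlow.js (a sketch of a standard invocation; the exact flags this project used may differ):

```bash
pip install tensorflowjs
tensorflowjs_converter --input_format=keras model.h5 web_model/
```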
For the test sample "TARJOT", the results are as follows:
- TensorFlow 2.15.0 (Keras 3): val_loss ≈ 2.0, prediction: RBRQMQ
- TensorFlow 2.9.0 (Keras 2): val_loss ≈ 6.0, prediction: TMRBQZ
Interestingly, although the newer model has a lower validation loss (and should theoretically perform better), its predictions in the JavaScript environment are actually worse than those of the older model with the higher validation loss. Both models correctly recognize the test samples in the Python environment, but their performance drops significantly after conversion to JavaScript.
This anomaly might stem from issues in the model conversion process or from limitations of the TensorFlow.js converter itself. Since my knowledge in this area is limited, I haven't investigated further.
This project is licensed under the MIT License.
For issues or suggestions, please submit an Issue. If you would like to help complete the remaining parts of the project, feel free to contact me at any time.
This tool is for educational and research purposes only. Users should comply with the website's terms of service and relevant laws and regulations. The authors are not responsible for any misuse or potential consequences of using this tool.