deploy: 1f4b186
arxyzan committed Oct 29, 2023
0 parents commit 399d0c1
Showing 493 changed files with 88,891 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 7254a81eeaa30845194afa3d93625a5e
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/contribute/add_datasets.doctree
Binary file not shown.
Binary file added .doctrees/contribute/add_docs.doctree
Binary file not shown.
Binary file added .doctrees/contribute/add_models.doctree
Binary file not shown.
Binary file added .doctrees/contribute/add_tests.doctree
Binary file not shown.
Binary file added .doctrees/contribute/contribute_to_hezar.doctree
Binary file not shown.
Binary file added .doctrees/contribute/index.doctree
Binary file not shown.
Binary file added .doctrees/contribute/pull_requests.doctree
Binary file not shown.
Binary file added .doctrees/environment.pickle
Binary file not shown.
Binary file added .doctrees/get_started/index.doctree
Binary file not shown.
Binary file added .doctrees/get_started/installation.doctree
Binary file not shown.
Binary file added .doctrees/get_started/overview.doctree
Binary file not shown.
Binary file added .doctrees/get_started/quick_tour.doctree
Binary file not shown.
Binary file added .doctrees/guide/advanced_training.doctree
Binary file not shown.
Binary file added .doctrees/guide/hezar_architecture.doctree
Binary file not shown.
Binary file added .doctrees/guide/index.doctree
Binary file not shown.
Binary file added .doctrees/guide/models_advanced.doctree
Binary file not shown.
Binary file added .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.builders.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.configs.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.constants.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.data.datasets.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.data.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.embeddings.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.accuracy.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.bleu.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.cer.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.f1.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.metric.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.precision.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.recall.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.seqeval.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.metrics.wer.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.models.backbone.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.models.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.models.image2text.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.models.model.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.preprocessors.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.registry.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.trainer.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.trainer.trainer.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.audio_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.common_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.core_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.data_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.file_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.hub_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.image_utils.doctree
Binary file not shown.
Binary file added .doctrees/source/hezar.utils.logging.doctree
Binary file not shown.
Binary file added .doctrees/source/index.doctree
Binary file not shown.
Binary file added .doctrees/source/modules.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/datasets.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/index.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/models.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/preprocessors.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/training.doctree
Binary file not shown.
Empty file added .nojekyll
Empty file.
21 changes: 21 additions & 0 deletions _sources/contribute/add_datasets.md.txt
@@ -0,0 +1,21 @@
# Add a Dataset
Adding datasets involves two main steps:
1. Uploading the dataset to the Hub and providing a load script.
2. Providing a proper dataset class in Hezar.

## Uploading dataset to the Hub
Different types of datasets require different formats for raw files and annotations. In Hezar, we prefer
uploading the files to the same repo as the dataset. How you organize a dataset on the Hub is up to you,
but it's better to follow the same convention for every dataset. The recommended approach is to take
inspiration from other datasets on the Hub that target a similar task, whether provided by Hezar or by others.
Some notes to consider:
- Providing zip files rather than a folder of raw files is recommended.
- For datasets containing raw files such as images or audio, use a CSV annotation file that maps files to labels.
- Train and test splits are required; a validation split is optional.
- Put all files in the `data` folder as `X_train.zip`, `X_test.zip`, and `X_validation.zip`, or put all files, named after their splits, inside a single `data.zip` file.
- Don't forget to provide a dataset card (`README.md`) and specify properties such as task, license, and tags.
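
The per-split layout from the notes above can be sketched with the Python standard library. This is only an illustration: the `build_split_archive` helper, the `annotations.csv` file name, and the `path`/`label` column names are assumptions made for the example, not a format that Hezar prescribes.

```python
import csv
import io
import zipfile
from pathlib import Path


def build_split_archive(samples, split, out_dir="data", prefix="X"):
    """Write `<out_dir>/<prefix>_<split>.zip` with raw files and a CSV annotation file.

    `samples` is an iterable of (file_name, raw_bytes, label) tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archive_path = out_dir / f"{prefix}_{split}.zip"

    # Collect the annotations in memory while the raw files are written.
    annotations = io.StringIO()
    writer = csv.writer(annotations)
    writer.writerow(["path", "label"])  # hypothetical column names

    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, raw_bytes, label in samples:
            archive.writestr(file_name, raw_bytes)
            writer.writerow([file_name, label])
        # Hypothetical annotation file name, stored next to the raw files.
        archive.writestr("annotations.csv", annotations.getvalue())

    return archive_path
```

Calling the helper once per split (`"train"`, `"test"`, and optionally `"validation"`) would produce the three archives named in the notes.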

## Providing a loading script
Hezar has some ready-to-use templates for dataset loading scripts. You can find them [here](https://github.com/hezarai/hezar/tree/main/templates/dataset_scripts).
You can learn more about dataset loading scripts [here](https://huggingface.co/docs/datasets/dataset_script).
It's recommended to upload the dataset to the Hub to test that it works properly.
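
To give a rough idea of what the example-generation part of a loading script does, here is a minimal standard-library sketch. It is not one of the Hezar templates and does not use the Hugging Face `datasets` builder classes that a real loading script is built on. The `generate_examples` name, the `annotations.csv` file, and its `path`/`label` columns are all assumptions made for illustration.

```python
import csv
import io
import zipfile


def generate_examples(archive_path, annotation_file="annotations.csv"):
    """Yield (index, example) pairs from a zip archive holding raw files and a CSV annotation file."""
    with zipfile.ZipFile(archive_path) as archive:
        # Read the whole annotation file first, then look up each raw file by name.
        annotation_text = archive.read(annotation_file).decode("utf-8")
        reader = csv.DictReader(io.StringIO(annotation_text))
        for index, row in enumerate(reader):
            yield index, {
                "path": row["path"],    # hypothetical column: file name inside the archive
                "label": row["label"],  # hypothetical column: label for that file
                "bytes": archive.read(row["path"]),
            }
```

Running a generator like this against a local copy of the archive is a cheap way to catch missing files or malformed annotations before uploading.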
1 change: 1 addition & 0 deletions _sources/contribute/add_docs.md.txt
@@ -0,0 +1 @@
# Contribute to Docs
1 change: 1 addition & 0 deletions _sources/contribute/add_models.md.txt
@@ -0,0 +1 @@
# Add a Model
1 change: 1 addition & 0 deletions _sources/contribute/add_tests.md.txt
@@ -0,0 +1 @@
# Add Tests
1 change: 1 addition & 0 deletions _sources/contribute/contribute_to_hezar.md.txt
@@ -0,0 +1 @@
# Contribute to Hezar
10 changes: 10 additions & 0 deletions _sources/contribute/index.md.txt
@@ -0,0 +1,10 @@
# Contribute

```{toctree}
contribute_to_hezar.md
add_models.md
add_datasets.md
add_docs.md
add_tests.md
pull_requests.md
```
1 change: 1 addition & 0 deletions _sources/contribute/pull_requests.md.txt
@@ -0,0 +1 @@
# Sending a Pull Request
8 changes: 8 additions & 0 deletions _sources/get_started/index.md.txt
@@ -0,0 +1,8 @@
# Get Started
```{toctree}
:maxdepth: 1

overview.md
installation.md
quick_tour.md
```
41 changes: 41 additions & 0 deletions _sources/get_started/installation.md.txt
@@ -0,0 +1,41 @@
# Installation

## Install from PyPI
Installing Hezar is as easy as any other Python library! Most of the requirements are cross-platform and installing
them on any machine is a piece of cake!

```
pip install hezar
```
### Installation variations
Hezar is packed with a lot of tools that depend on other packages. Most of the
time you won't want everything installed, so Hezar provides multiple installation
variations that keep the installation light and fast for general use.

You can install optional dependencies for each mode like so:
```
pip install hezar[nlp] # For natural language processing
pip install hezar[vision] # For computer vision and image processing
pip install hezar[audio] # For audio and speech processing
pip install hezar[embeddings] # For word embeddings
```
Or you can install everything using:
```
pip install hezar[all]
```
## Install from source
You can also install the development version of the library from source:
```
pip install git+https://github.com/hezarai/hezar.git
```

## Test installation
From a Python console or the command line, import `hezar` and check the version:
```python
import hezar

print(hezar.__version__)
```
```
0.23.1
```
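
If you would rather check the installation without importing the package, for example in a setup script, the standard library's `importlib.metadata` can report the installed version. This is a small sketch; the `installed_version` helper is a name introduced here, not part of Hezar.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if Hezar is installed, otherwise None.
print(installed_version("hezar"))
```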
20 changes: 20 additions & 0 deletions _sources/get_started/overview.md.txt
@@ -0,0 +1,20 @@
# Overview

Welcome to Hezar! Hezar is a library that makes state-of-the-art machine learning for the Persian
language as easy as possible, built by the Persian community!

In Hezar, the primary goal is to provide plug-and-play AI/ML utilities so that you don't need to know much about what's
going on under the hood. Hezar is not just a model library; it covers every part of an ML pipeline, such as
datasets, trainers, preprocessors, and feature extractors.

Hezar is a library that:
- brings together all the best works in AI for Persian
- makes using AI models as easy as a couple of lines of code
- seamlessly integrates with Hugging Face Hub for all of its models
- has a highly developer-friendly interface
- has a task-based model interface which is more convenient for general users
- is packed with additional tools like word embeddings, tokenizers, feature extractors, etc.
- comes with a lot of supplementary ML tools for deployment, benchmarking, optimization, etc.
- and more!

To find out more, just take the [quick tour](quick_tour.md)!
190 changes: 190 additions & 0 deletions _sources/get_started/quick_tour.md.txt
@@ -0,0 +1,190 @@
# Quick Tour
Let's take a quick tour of some of the most important features of Hezar!

### Models
There are a bunch of ready-to-use trained models for different tasks on the Hub. To see all the models, look [here](https://huggingface.co/hezarai)!

- **Text classification (sentiment analysis, categorization, etc.)**
```python
from hezar import Model

example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load("hezarai/bert-fa-sentiment-dksf")
outputs = model.predict(example)
print(outputs)
```
```
{'labels': ['positive'], 'probs': [0.812910258769989]}
```
- **Sequence Labeling (POS, NER, etc.)**
```python
from hezar import Model

pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k") # Part-of-speech
ner_model = Model.load("hezarai/bert-fa-ner-arman") # Named entity recognition
inputs = ["شرکت هوش مصنوعی هزار"]
pos_outputs = pos_model.predict(inputs)
ner_outputs = ner_model.predict(inputs)
print(f"POS: {pos_outputs}")
print(f"NER: {ner_outputs}")
```
```
POS: [[{'token': 'شرکت', 'tag': 'Ne'}, {'token': 'هوش', 'tag': 'Ne'}, {'token': 'مصنوعی', 'tag': 'AJe'}, {'token': 'هزار', 'tag': 'NUM'}]]
NER: [[{'token': 'شرکت', 'tag': 'B-org'}, {'token': 'هوش', 'tag': 'I-org'}, {'token': 'مصنوعی', 'tag': 'I-org'}, {'token': 'هزار', 'tag': 'I-org'}]]
```
- **Language Modeling**
```python
from hezar import Model

roberta_mlm = Model.load("hezarai/roberta-fa-mlm")
inputs = ["سلام بچه ها حالتون <mask>"]
outputs = roberta_mlm.predict(inputs)
print(outputs)
```
```
{'filled_texts': ['سلام بچه ها حالتون چطوره'], 'filled_tokens': [' چطوره']}
```
- **Speech Recognition**
```python
from hezar import Model

whisper = Model.load("hezarai/whisper-small-fa")
transcripts = whisper.predict("examples/assets/speech_example.mp3")
print(transcripts)
```
```
{'transcripts': ['و این تنها محدود به محیط کار نیست']}
```
- **Image to Text (OCR)**
```python
from hezar import Model
# OCR with TrOCR
model = Model.load("hezarai/trocr-base-fa-v1")
texts = model.predict(["examples/assets/ocr_example.jpg"])
print(f"TrOCR Output: {texts}")

# OCR with CRNN
model = Model.load("hezarai/crnn-base-fa-64x256")
texts = model.predict("examples/assets/ocr_example.jpg")
print(f"CRNN Output: {texts}")
```
```
TrOCR Output: {'texts': [' چه میشه کرد، باید صبر کنیم']}
CRNN Output: {'texts': ['چه میشه کرد، باید صبر کنیم']}
```

- **Image to Text (Image Captioning)**
```python
from hezar import Model

model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
texts = model.predict("examples/assets/image_captioning_example.jpg")
print(texts)
```
```
{'texts': ['سگی با توپ تنیس در دهانش می دود.']}
```
We are constantly adding and training new models, so this section will hopefully expand over time ;)
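
The sequence labeling outputs shown earlier tag each token separately, so a common next step is to merge consecutive tokens into whole entities. The sketch below does that for the NER output printed above. `group_entities` is a helper written for this example and is not part of Hezar. It assumes the `B-`/`I-` prefix scheme visible in that output, and treating every other tag (such as `O`) as "outside any entity" is an additional assumption.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens into (text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for item in tagged_tokens:
        prefix, _, entity_type = item["tag"].partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts here (an I- tag with a different type is treated as a start).
            flush()
            current_tokens, current_type = [item["token"]], entity_type
        elif prefix == "I":
            current_tokens.append(item["token"])
        else:
            # Any other tag is assumed to lie outside an entity and closes the current one.
            flush()
            current_tokens, current_type = [], None

    flush()
    return entities


# NER output for a single input, copied from the sequence labeling example above.
ner_output = [
    {"token": "شرکت", "tag": "B-org"},
    {"token": "هوش", "tag": "I-org"},
    {"token": "مصنوعی", "tag": "I-org"},
    {"token": "هزار", "tag": "I-org"},
]
print(group_entities(ner_output))
```

For this input the four tokens are merged into a single `org` entity.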
### Word Embeddings
- **FastText**
```python
from hezar import Embedding

fasttext = Embedding.load("hezarai/fasttext-fa-300")
most_similar = fasttext.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7579, 'word': 'میلیون'},
{'score': 0.6943, 'word': '21هزار'},
{'score': 0.6861, 'word': 'میلیارد'},
{'score': 0.6825, 'word': '26هزار'},
{'score': 0.6803, 'word': '٣هزار'}]
```
- **Word2Vec (Skip-gram)**
```python
from hezar import Embedding

word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7885, 'word': 'چهارهزار'},
{'score': 0.7788, 'word': '۱۰هزار'},
{'score': 0.7727, 'word': 'دویست'},
{'score': 0.7679, 'word': 'میلیون'},
{'score': 0.7602, 'word': 'پانصد'}]
```
- **Word2Vec (CBOW)**
```python
from hezar import Embedding

word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7407, 'word': 'دویست'},
{'score': 0.7400, 'word': 'میلیون'},
{'score': 0.7326, 'word': 'صد'},
{'score': 0.7276, 'word': 'پانصد'},
{'score': 0.7011, 'word': 'سیصد'}]
```
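
To build some intuition for the `score` values above, here is a toy version of a nearest-neighbor lookup over word vectors. It assumes the scores are cosine similarities, which is the usual choice for word embeddings but is not stated in these docs. The functions and the two-dimensional vectors are made up for illustration and are unrelated to Hezar's `Embedding` internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_n=5):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    query_vector = vectors[query]
    scored = [
        {"score": round(cosine_similarity(query_vector, vector), 4), "word": word}
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_n]


# Toy two-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "thousand": [1.0, 0.0],
    "million": [1.0, 1.0],
    "cat": [0.0, 1.0],
}
print(most_similar("thousand", toy_vectors, top_n=2))
# [{'score': 0.7071, 'word': 'million'}, {'score': 0.0, 'word': 'cat'}]
```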

### Datasets
You can load any of the datasets on the [Hub](https://huggingface.co/hezarai) like below:
```python
from hezar import Dataset

sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance
...
```

### Training
Hezar makes it super easy to train models using out-of-the-box models and datasets provided in the library.
```python
from hezar import (
    BertSequenceLabeling,
    BertSequenceLabelingConfig,
    Trainer,
    TrainerConfig,
    Dataset,
    Preprocessor,
)

base_model_path = "hezarai/bert-base-fa"
dataset_path = "hezarai/lscp-pos-500k"

train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)

model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label))
preprocessor = Preprocessor.load(base_model_path)

train_config = TrainerConfig(
    task="sequence_labeling",
    device="cuda",
    init_weights_from=base_model_path,
    batch_size=8,
    num_epochs=5,
    checkpoints_dir="checkpoints/",
    metrics=["seqeval"],
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)
trainer.train()

trainer.push_to_hub("bert-fa-pos-lscp-500k") # push model, config, preprocessor, trainer files and configs
```
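
The training example passes `train_dataset.config.id2label` to the model config. The docs do not show what this mapping contains, so the sketch below assumes it is a dictionary from integer class index to tag string, as the name suggests, and shows how such a mapping would turn predicted class indices back into tags. The tag names and the `decode_predictions` helper are hypothetical.

```python
def decode_predictions(predicted_ids, id2label):
    """Map each predicted class index in each sequence to its tag string."""
    return [[id2label[index] for index in sequence] for sequence in predicted_ids]


# Hypothetical mapping; the real one comes from the dataset config.
id2label = {0: "N", 1: "AJ", 2: "NUM"}
print(decode_predictions([[0, 1, 2], [2, 0]], id2label))
# [['N', 'AJ', 'NUM'], ['NUM', 'N']]
```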

Want to go deeper? Check out the [guides](../guide/index.md).
2 changes: 2 additions & 0 deletions _sources/guide/advanced_training.md.txt
@@ -0,0 +1,2 @@
# Advanced Training
Docs coming soon, stay tuned!