deploy: 23a80d7
arxyzan committed Aug 23, 2023
0 parents commit 160d9a5
Showing 94 changed files with 11,943 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: cdec0220fc6fa5a0b01e3d9fbaabfd7d
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/contribute/add_datasets.doctree
Binary file not shown.
Binary file added .doctrees/contribute/add_docs.doctree
Binary file not shown.
Binary file added .doctrees/contribute/add_models.doctree
Binary file not shown.
Binary file added .doctrees/contribute/add_tests.doctree
Binary file not shown.
Binary file added .doctrees/contribute/contribute_to_hezar.doctree
Binary file not shown.
Binary file added .doctrees/contribute/index.doctree
Binary file not shown.
Binary file added .doctrees/contribute/pull_requests.doctree
Binary file not shown.
Binary file added .doctrees/environment.pickle
Binary file not shown.
Binary file added .doctrees/get_started/index.doctree
Binary file not shown.
Binary file added .doctrees/get_started/installation.doctree
Binary file not shown.
Binary file added .doctrees/get_started/overview.doctree
Binary file not shown.
Binary file added .doctrees/get_started/quick_tour.doctree
Binary file not shown.
Binary file added .doctrees/guide/hezar_architecture.doctree
Binary file not shown.
Binary file added .doctrees/guide/index.doctree
Binary file not shown.
Binary file added .doctrees/guide/models_in_depth_overview.doctree
Binary file not shown.
Binary file added .doctrees/guide/train_custom_models.doctree
Binary file not shown.
Binary file added .doctrees/index.doctree
Binary file not shown.
Binary file added .doctrees/source/index.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/datasets.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/index.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/models.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/preprocessors.doctree
Binary file not shown.
Binary file added .doctrees/tutorial/training.doctree
Binary file not shown.
Empty file added .nojekyll
Empty file.
1 change: 1 addition & 0 deletions _sources/contribute/add_datasets.md.txt
@@ -0,0 +1 @@
# Add a Dataset
1 change: 1 addition & 0 deletions _sources/contribute/add_docs.md.txt
@@ -0,0 +1 @@
# Contribute to Docs
1 change: 1 addition & 0 deletions _sources/contribute/add_models.md.txt
@@ -0,0 +1 @@
# Add a Model
1 change: 1 addition & 0 deletions _sources/contribute/add_tests.md.txt
@@ -0,0 +1 @@
# Add Tests
1 change: 1 addition & 0 deletions _sources/contribute/contribute_to_hezar.md.txt
@@ -0,0 +1 @@
# Contribute to Hezar
10 changes: 10 additions & 0 deletions _sources/contribute/index.md.txt
@@ -0,0 +1,10 @@
# Contribute

```{toctree}
contribute_to_hezar.md
add_models.md
add_datasets.md
add_docs.md
add_tests.md
pull_requests.md
```
1 change: 1 addition & 0 deletions _sources/contribute/pull_requests.md.txt
@@ -0,0 +1 @@
# Sending a Pull Request
8 changes: 8 additions & 0 deletions _sources/get_started/index.md.txt
@@ -0,0 +1,8 @@
# Get Started
```{toctree}
:maxdepth: 2

overview.md
installation.md
quick_tour.md
```
25 changes: 25 additions & 0 deletions _sources/get_started/installation.md.txt
@@ -0,0 +1,25 @@
# Installation

#### Install from PyPI
Hezar installs like any other Python library. Most of its requirements are cross-platform, so installation
should work on any machine.

```
pip install hezar
```
#### Install from source
You can also install the development version of the library directly from source:
```
pip install git+https://github.com/hezarai/hezar.git
```

#### Test installation
From a Python console or the command line, import `hezar` and check the version:
```python
import hezar

print(hezar.__version__)
```
```
0.23.1
```
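
If importing the package is not convenient (for example in a setup script), the installed version can also be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Hezar.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if Hezar is installed, otherwise None.
print(installed_version("hezar"))
```

A `None` result suggests that `pip install hezar` has not been run in the active environment.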
20 changes: 20 additions & 0 deletions _sources/get_started/overview.md.txt
@@ -0,0 +1,20 @@
# Overview

Welcome to Hezar! Hezar is a library that makes state-of-the-art machine learning as easy as possible for the Persian
language, built by the Persian community!

The primary goal of Hezar is to provide plug-and-play AI/ML utilities so that you don't need to know much about what's
going on under the hood. Hezar is not just a model library: it also covers the other parts of an ML pipeline, such as
datasets, trainers, preprocessors, and feature extractors.

Hezar is a library that:
- brings together the best work in AI for Persian
- makes using AI models as easy as a couple of lines of code
- integrates seamlessly with the Hugging Face Hub for all of its models
- has a highly developer-friendly interface
- has a task-based model interface that is more convenient for general users
- is packed with additional tools like word embeddings, tokenizers, feature extractors, etc.
- comes with supplementary ML tools for deployment, benchmarking, optimization, etc.
- and more!

To find out more, just take the [quick tour](quick_tour.md)!
151 changes: 151 additions & 0 deletions _sources/get_started/quick_tour.md.txt
@@ -0,0 +1,151 @@
# Quick Tour
Let's take a quick tour of some of the most important features of Hezar!

### Models
The Hub hosts a number of trained, ready-to-use models for different tasks. You can browse all of them [here](https://huggingface.co/hezarai).

- **Text classification (sentiment analysis, categorization, etc.)**
```python
from hezar import Model

example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load("hezarai/bert-fa-sentiment-dksf")
outputs = model.predict(example)
print(outputs)
```
```
{'labels': ['positive'], 'probs': [0.812910258769989]}
```
- **Sequence labeling (POS, NER, etc.)**
```python
from hezar import Model

pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k") # Part-of-speech
ner_model = Model.load("hezarai/bert-fa-ner-arman") # Named entity recognition
inputs = ["شرکت هوش مصنوعی هزار"]
pos_outputs = pos_model.predict(inputs)
ner_outputs = ner_model.predict(inputs)
print(f"POS: {pos_outputs}")
print(f"NER: {ner_outputs}")
```
```
POS: [[{'token': 'شرکت', 'tag': 'Ne'}, {'token': 'هوش', 'tag': 'Ne'}, {'token': 'مصنوعی', 'tag': 'AJe'}, {'token': 'هزار', 'tag': 'NUM'}]]
NER: [[{'token': 'شرکت', 'tag': 'B-org'}, {'token': 'هوش', 'tag': 'I-org'}, {'token': 'مصنوعی', 'tag': 'I-org'}, {'token': 'هزار', 'tag': 'I-org'}]]
```
- **Speech recognition**
```python
from hezar import Model
from datasets import load_dataset

ds = load_dataset("mozilla-foundation/common_voice_11_0", "fa", split="test")
sample = ds[1001]
whisper = Model.load("hezarai/whisper-small-fa")
transcript = whisper.predict(sample["path"]) # or pass `sample["audio"]["array"]` (with the right sample rate)
print(transcript)
```
```
{'transcription': ['و این تنها محدود به محیط کار نیست']}
```
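
The sequence labeling example above returns one tag per token, with NER tags in a `B-`/`I-` scheme. Downstream code often wants whole entities rather than tagged tokens. The sketch below merges consecutive tokens into entities using plain Python; `group_entities` is a hypothetical helper, not a Hezar function, and it assumes that tokens outside any entity carry a tag without a `B-` or `I-` prefix (such as `O`).

```python
def group_entities(tagged_tokens):
    """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for item in tagged_tokens:
        prefix, _, label = item["tag"].partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A B- tag (or an I- tag of a different type) starts a new entity.
            flush()
            current_tokens, current_type = [item["token"]], label
        elif prefix == "I":
            # An I- tag of the same type continues the current entity.
            current_tokens.append(item["token"])
        else:
            # Any other tag (assumed to mean "outside") closes the entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# The NER output shown above, copied as literal data.
ner_outputs = [[
    {"token": "شرکت", "tag": "B-org"},
    {"token": "هوش", "tag": "I-org"},
    {"token": "مصنوعی", "tag": "I-org"},
    {"token": "هزار", "tag": "I-org"},
]]
print(group_entities(ner_outputs[0]))
```

For the sample sentence, all four tokens are merged into a single `org` entity.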

### Word Embeddings
- **FastText**
```python
from hezar import Embedding

fasttext = Embedding.load("hezarai/fasttext-fa-300")
most_similar = fasttext.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7579, 'word': 'میلیون'},
{'score': 0.6943, 'word': '21هزار'},
{'score': 0.6861, 'word': 'میلیارد'},
{'score': 0.6825, 'word': '26هزار'},
{'score': 0.6803, 'word': '٣هزار'}]
```
- **Word2Vec (Skip-gram)**
```python
from hezar import Embedding

word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7885, 'word': 'چهارهزار'},
{'score': 0.7788, 'word': '۱۰هزار'},
{'score': 0.7727, 'word': 'دویست'},
{'score': 0.7679, 'word': 'میلیون'},
{'score': 0.7602, 'word': 'پانصد'}]
```
- **Word2Vec (CBOW)**
```python
from hezar import Embedding

word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
```
```
[{'score': 0.7407, 'word': 'دویست'},
{'score': 0.7400, 'word': 'میلیون'},
{'score': 0.7326, 'word': 'صد'},
{'score': 0.7276, 'word': 'پانصد'},
{'score': 0.7011, 'word': 'سیصد'}]
```
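
The documentation does not say how the scores above are computed. A common choice for ranking word embeddings is cosine similarity, so the sketch below illustrates that idea on toy two-dimensional vectors. It is an assumption for illustration only, not Hezar's implementation; the vectors and the English words are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_n=5):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    query_vec = vectors[query]
    scored = [
        {"score": round(cosine_similarity(query_vec, vec), 4), "word": word}
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_n]


# Made-up vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "thousand": [1.0, 0.0],
    "million": [0.8, 0.6],
    "cat": [0.0, 1.0],
}
print(most_similar("thousand", toy_vectors, top_n=2))
```

Under this scheme, words whose vectors point in nearly the same direction as the query get scores close to 1, which is consistent with number words ranking highest in the outputs above.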

### Datasets
You can load any of the datasets on the [Hub](https://huggingface.co/hezarai) as shown below:
```python
from hezar import Dataset

sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance
...
```

### Training
Hezar makes it super easy to train models using out-of-the-box models and datasets provided in the library.
```python
from hezar import (
    BertSequenceLabeling,
    BertSequenceLabelingConfig,
    TrainerConfig,
    SequenceLabelingTrainer,
    Dataset,
    Preprocessor,
)

base_model_path = "hezarai/bert-base-fa"
dataset_path = "hezarai/lscp-pos-500k"

train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)

model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label))
preprocessor = Preprocessor.load(base_model_path)

train_config = TrainerConfig(
    device="cuda",
    init_weights_from=base_model_path,
    batch_size=8,
    num_epochs=5,
    checkpoints_dir="checkpoints/",
    metrics=["seqeval"],
)

trainer = SequenceLabelingTrainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)
trainer.train()

trainer.push_to_hub("bert-fa-pos-lscp-500k") # push model, config, preprocessor, trainer files and configs
```
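
The training example passes `id2label` from the dataset config to the model config. By the usual convention, such a mapping links integer class ids to tag strings so that predictions can be decoded. The sketch below shows that idea in plain Python. The tag list is a hypothetical subset taken from the POS output earlier in this tour, and the exact key types Hezar uses are an assumption.

```python
# Hypothetical subset of POS tags; a real dataset defines the full list.
tags = ["Ne", "AJe", "NUM"]

# Map class ids to labels, and labels back to class ids.
id2label = dict(enumerate(tags))
label2id = {label: idx for idx, label in id2label.items()}

# Decode a sequence of predicted class ids into tag strings.
predicted_ids = [0, 0, 1, 2]
predicted_tags = [id2label[i] for i in predicted_ids]
print(predicted_tags)
```

Because the model's output layer size follows from the number of labels, taking the mapping from the training dataset keeps the model and the data in agreement.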

Want to go deeper? Check out the [guides](../guide/index.md).