Go-Sastrawi is a Go package for stemming words in the Indonesian language. It is based on Sastrawi for PHP by Andy Librian.
According to Wikipedia, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. For example:
- menahan => tahan
- pewarna => warna
The most basic usage relies on the default dictionary provided by Sastrawi:
package main

import (
	"fmt"

	"github.com/RadhiFadlillah/go-sastrawi"
)

func main() {
	// Original sentence
	sentence := "Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya. Baca berita selengkapnya di http://www.kompas.com."

	// Reduce inflected words to their root form
	dictionary := sastrawi.DefaultDictionary()
	stemmer := sastrawi.NewStemmer(dictionary)
	for _, word := range sastrawi.Tokenize(sentence) {
		fmt.Printf("%s => %s\n", word, stemmer.Stem(word))
	}
}
Besides using the default dictionary, you can also create your own dictionary of root words:
package main

import (
	"fmt"

	"github.com/RadhiFadlillah/go-sastrawi"
)

func main() {
	// Create a new dictionary
	dictionary := sastrawi.NewDictionary("lapar")
	dictionary.Print("")

	// Add new words to the dictionary
	dictionary.Add("ingin", "makan", "gizi", "enak", "lezat")
	dictionary.Print("")

	// Remove some words from the dictionary
	dictionary.Remove("enak", "lezat")
	dictionary.Print("")

	// Use the new dictionary for stemming
	sentence := "Aku kelaparan dan menginginkan makanan yang bergizi."
	stemmer := sastrawi.NewStemmer(dictionary)
	for _, word := range sastrawi.Tokenize(sentence) {
		fmt.Printf("%s => %s\n", word, stemmer.Stem(word))
	}
}
Sastrawi also provides a list of stop words that can be used to remove common words from Indonesian text. This list is an ordinary Dictionary, so you can add or remove stop words to suit your purpose:
package main

import (
	"fmt"

	"github.com/RadhiFadlillah/go-sastrawi"
)

func main() {
	// Load the default stop words and the default root words
	stopwords := sastrawi.DefaultStopword()
	dictionary := sastrawi.DefaultDictionary()
	stemmer := sastrawi.NewStemmer(dictionary)

	sentence := "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan"
	for _, word := range sastrawi.Tokenize(sentence) {
		// Skip stop words, stem everything else
		if stopwords.Contains(word) {
			continue
		}
		fmt.Printf("%s => %s\n", word, stemmer.Stem(word))
	}
}
- Nazief and Adriani Algorithm
- Asian, J. 2007. Effective Techniques for Indonesian Text Retrieval. PhD thesis, School of Computer Science and Information Technology, RMIT University, Australia.
- Arifin, A.Z., I.P.A.K. Mahendra and H.T. Ciptaningtyas. 2009. Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language, Proceeding of International Conference on Information & Communication Technology and Systems (ICTS).
- Tahitoe, A.D. and D. Purwitasari. 2010. Implementasi Modifikasi Enhanced Confix Stripping Stemmer Untuk Bahasa Indonesia dengan Metode Corpus Based Stemming, Institut Teknologi Sepuluh Nopember (ITS), Surabaya, 60111, Indonesia.
- Additional stemming rules from Sastrawi's contributors.
The stemming process in this package depends heavily on the root words dictionary. Sastrawi uses the root words dictionary from kateglo.com with some changes.
Like Sastrawi for PHP, Go-Sastrawi is distributed under the MIT license. The root words dictionary is distributed by Kateglo under the CC-BY-NC-SA 3.0 license.
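To illustrate why the dictionary matters, here is a deliberately simplified sketch of dictionary-checked affix stripping. It is not the Nazief and Adriani algorithm and not what Go-Sastrawi implements; `toyStem`, `toyPrefixes` and `toySuffixes` are hypothetical names, and the affix lists are a small sample chosen for the example. The point is only that a candidate stem is accepted when it appears in the root word list.

```go
package main

import (
	"fmt"
	"strings"
)

// Toy affix lists, chosen only for this illustration.
var (
	toyPrefixes = []string{"ber", "di", "ke", "me", "pe"}
	toySuffixes = []string{"kan", "an", "i"}
)

// toyStem returns word unchanged if it is a known root. Otherwise it tries
// removing one suffix, one prefix, or both, and returns the first candidate
// that exists in roots. If nothing matches, the word is returned as-is.
func toyStem(word string, roots map[string]bool) string {
	if roots[word] {
		return word
	}

	// Collect the word itself plus every variant with one suffix removed.
	bases := []string{word}
	for _, suffix := range toySuffixes {
		if strings.HasSuffix(word, suffix) {
			bases = append(bases, strings.TrimSuffix(word, suffix))
		}
	}

	// Check each base directly, then with one prefix removed.
	for _, base := range bases {
		if roots[base] {
			return base
		}
		for _, prefix := range toyPrefixes {
			rest := strings.TrimPrefix(base, prefix)
			if rest != base && roots[rest] {
				return rest
			}
		}
	}
	return word
}

func main() {
	roots := map[string]bool{"lapar": true, "makan": true, "gizi": true}
	for _, word := range []string{"kelaparan", "makanan", "bergizi", "menahan"} {
		fmt.Printf("%s => %s\n", word, toyStem(word, roots))
	}
}
```

In this sketch "kelaparan", "makanan" and "bergizi" reduce to "lapar", "makan" and "gizi", but removing "makan" from `roots` leaves "makanan" unchanged. "menahan" is also left unchanged, because recovering "tahan" requires restoring a dropped letter, which is the kind of rule a real stemming algorithm adds on top of the dictionary lookup.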
- Sastrawi - PHP
- JSastrawi - Java
- cSastrawi - C
- PySastrawi - Python
- Sastrawi-Ruby - Ruby