update README, add jieba
Yasheed1995 committed Dec 5, 2017
1 parent 63bd602 commit 42fabca
Showing 37 changed files with 1,441,777 additions and 2 deletions.
Binary file removed project/7.0.pdf
Binary file not shown.
28 changes: 26 additions & 2 deletions project/README.md
Original file line number Diff line number Diff line change
@@ -14,9 +14,33 @@ find . -size +90M | sed 's|^\./||g' | cat >> .gitignore; awk '!NF || !seen[$0]++
```
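
The shell pipeline in the hunk header above lists files larger than 90M and appends their paths to `.gitignore`; the `awk` part is cut off in this view, so reading it as a duplicate-line filter is an assumption. A rough Python sketch of the same idea follows. The function name `ignore_large_files` and its `limit_bytes` parameter are hypothetical, and unlike `find .` it skips the `.git` directory.

```python
import os
from pathlib import Path
from typing import List

LIMIT_BYTES = 90 * 1024 * 1024  # 90 MiB, mirroring `find -size +90M`


def ignore_large_files(root: str, limit_bytes: int = LIMIT_BYTES) -> List[str]:
    """Append files larger than limit_bytes to root/.gitignore, skipping known entries."""
    root_path = Path(root)
    ignore_path = root_path / ".gitignore"
    text = ignore_path.read_text(encoding="utf-8") if ignore_path.exists() else ""
    seen = set(text.splitlines())
    added = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        # Assumption: Git's own object store should not be scanned.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            full = Path(dirpath) / name
            if full.is_symlink() or full.stat().st_size <= limit_bytes:
                continue
            rel = full.relative_to(root_path).as_posix()
            if rel not in seen:  # avoid writing the same path twice
                seen.add(rel)
                added.append(rel)
    if added:
        with ignore_path.open("a", encoding="utf-8") as handle:
            if text and not text.endswith("\n"):
                handle.write("\n")  # keep the new entries on their own lines
            handle.write("".join(rel + "\n" for rel in added))
    return added
```

Calling `ignore_large_files(".")` from the repository root returns the paths it appended, so running it a second time should return an empty list.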

Training data download
```


https://drive.google.com/open?id=1rqz_-uIrPyVee96H83hGo6KNy9UJt1uB
```


Jieba note


https://github.com/ldkrsi/jieba-zh_TW


Pre-trained word vectors


https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md


Powerpoint


https://docs.google.com/presentation/d/1dz4c0CQPBC1CJ7Iy7pcop0L8Ko0EdT0qyqNz1sK_WQo/edit#slide=id.p5
https://docs.google.com/presentation/d/1bFF5a35awPQpyHAdqu81nEYKJq9lfkF2DhH_LfB0mms/edit#slide=id.g197aa5b1a4_0_72

Kaggle


https://www.kaggle.com/c/ml2017fallfinaltaiwanese/data


### Evaluation Metric
22 changes: 22 additions & 0 deletions project/jieba-zh_TW-master/.gitattributes
@@ -0,0 +1,22 @@
# Auto detect text files and perform LF normalization
* text=auto

# Custom for Visual Studio
*.cs diff=csharp
*.sln merge=union
*.csproj merge=union
*.vbproj merge=union
*.fsproj merge=union
*.dbproj merge=union

# Standard to msysgit
*.doc diff=astextplain
*.DOC diff=astextplain
*.docx diff=astextplain
*.DOCX diff=astextplain
*.dot diff=astextplain
*.DOT diff=astextplain
*.pdf diff=astextplain
*.PDF diff=astextplain
*.rtf diff=astextplain
*.RTF diff=astextplain
173 changes: 173 additions & 0 deletions project/jieba-zh_TW-master/.gitignore
@@ -0,0 +1,173 @@
#################
## Eclipse
#################

*.pydevproject
.project
.metadata
bin/
tmp/
*.tmp
*.bak
*.swp
*~.nib
local.properties
.classpath
.settings/
.loadpath

# External tool builders
.externalToolBuilders/

# Locally stored "Eclipse launch configurations"
*.launch

# CDT-specific
.cproject

# PDT-specific
.buildpath


#################
## Visual Studio
#################

## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.

# User-specific files
*.suo
*.user
*.sln.docstates

# Build results
[Dd]ebug/
[Rr]elease/
*_i.c
*_p.c
*.ilk
*.meta
*.obj
*.pch
*.pdb
*.pgc
*.pgd
*.rsp
*.sbr
*.tlb
*.tli
*.tlh
*.tmp
*.vspscc
.builds
*.dotCover

## TODO: If you have NuGet Package Restore enabled, uncomment this
#packages/

# Visual C++ cache files
ipch/
*.aps
*.ncb
*.opensdf
*.sdf

# Visual Studio profiler
*.psess
*.vsp

# ReSharper is a .NET coding add-in
_ReSharper*

# Installshield output folder
[Ee]xpress

# DocProject is a documentation generator add-in
DocProject/buildhelp/
DocProject/Help/*.HxT
DocProject/Help/*.HxC
DocProject/Help/*.hhc
DocProject/Help/*.hhk
DocProject/Help/*.hhp
DocProject/Help/Html2
DocProject/Help/html

# Click-Once directory
publish

# Others
[Bb]in
[Oo]bj
sql
TestResults
*.Cache
ClientBin
stylecop.*
~$*
*.dbmdl
Generated_Code #added for RIA/Silverlight projects

# Backup & report files from converting an old project file to a newer
# Visual Studio version. Backup files are not needed, because we have git ;-)
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
############
## pycharm
############
.idea

############
## Windows
############

# Windows image file caches
Thumbs.db

# Folder config file
Desktop.ini


#############
## Python
#############

*.py[co]

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox

#Translations
*.mo

#Mr Developer
.mr.developer.cfg

# Mac crap
.DS_Store
*.log
test/tmp/*

#jython
*.class

MANIFEST
test.py
4 changes: 4 additions & 0 deletions project/jieba-zh_TW-master/Changelog
@@ -0,0 +1,4 @@
2016-07-18: forked jieba (0.38)
1. Replaced the dictionary (dict.txt) with a zh-tw version
2. Replaced the HMM probability tables with a zh-tw version

20 changes: 20 additions & 0 deletions project/jieba-zh_TW-master/LICENSE
@@ -0,0 +1,20 @@
The MIT License (MIT)

Copyright (c) 2013 Sun Junyi

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
2 changes: 2 additions & 0 deletions project/jieba-zh_TW-master/MANIFEST.in
@@ -0,0 +1,2 @@
graft README.md
graft Changelog
67 changes: 67 additions & 0 deletions project/jieba-zh_TW-master/README.md
@@ -0,0 +1,67 @@
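# jieba-zh_TW

Taiwanese Traditional Chinese version of the jieba (結巴) word segmenter


## How it works

Uses the same algorithm as the original jieba, with its dictionary and HMM probability tables replaced to produce a jieba segmenter tailored to Taiwanese Traditional Chinese.


## Usage

* Compatible with Python 2 and Python 3
* Place the `jieba` folder inside your program's folder
* `import jieba`


## Code examples

Usage is the same as the original jieba.

### Word segmentation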

```python
import jieba

# If you need two versions of jieba on the same machine, set a custom cache file name so the two caches do not overwrite each other
# jieba.dt.cache_file = 'jieba.cache.new'

seg_list = jieba.cut("在非洲,每六十秒,就有一分鐘過去")
print("|".join(seg_list))
# 在|非洲|,|每|六十秒|,|就|有|一分鐘|過去

```

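### Keyword extraction
The probability table has not been replaced yet, so the output is very unreliable.


### Part-of-speech tagging
This will probably throw an error as soon as it runs.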


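## Reliability
This code is compared against two other approaches, *converting the text to Simplified Chinese and then segmenting with jieba* and *segmenting the Traditional Chinese text directly with jieba*, on [this news article written by a Taiwanese reporter](http://www.appledaily.com.tw/appledaily/article/international/20160715/37308809/). Taking the output of the [Academia Sinica Chinese Word Segmentation System](http://ckipsvr.iis.sinica.edu.tw/) as the reference answer, the [edit distance](https://en.wikipedia.org/wiki/Edit_distance) between each of the three methods' output and the Academia Sinica result is computed at the word level.


|Edit distance|Paragraph 1 (92)|Paragraph 2 (136)|Paragraph 3 (75)|Paragraph 4 (52)|Paragraph 5 (63)|
|---|---|---|---|---|---|
|jieba zh_TW |9|20|12|12|9|
|jieba after converting to Simplified|44|43|31|23|21|
|jieba directly on Traditional text|53|74|43|34|34|
(Numbers in parentheses are the word counts produced by the Academia Sinica segmenter.)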
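
To illustrate the word-level comparison used for the table, here is a minimal sketch that treats each segmented word as one symbol and computes a Levenshtein distance with unit costs. The README does not say which edit distance variant or which tool was used, so the unit-cost choice, the function name `word_edit_distance`, and the sample segmentations are all assumptions for illustration.

```python
from typing import Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(ref) + 1))  # distance from an empty hypothesis to each ref prefix
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # distance from hyp[:i] to an empty reference
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical segmentations of the same sentence, split on the "|" separator.
reference = "在|非洲|,|每|六十秒|,|就|有|一分鐘|過去".split("|")
candidate = "在|非洲|,|每|六十|秒|,|就|有|一分鐘|過去".split("|")
print(word_edit_distance(candidate, reference))  # 2: one substitution plus one deletion
```

A lower value means the segmentation is closer to the reference, which is how the rows of the table are meant to be read.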


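## Acknowledgements

* The online Chinese word segmentation service of the CKIP group (詞庫小組), Institute of Information Science, Academia Sinica

## Notes

When using this code, please comply with the derived-data provisions of the [Academia Sinica segmentation service terms of service](http://ckipsvr.iis.sinica.edu.tw/terms.htm).


## Some issues

See this post on my blog: [A few questions about Jieba word segmentation](https://blog.ldkrsi.in/%E9%97%9C%E6%96%BC%E7%B5%90%E5%B7%B4%E6%96%B7%E8%A9%9E%E7%9A%84%E5%B9%BE%E5%80%8B%E5%95%8F%E9%A1%8C/)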