Commit 42fabca (1 parent: 63bd602)
Showing 37 changed files with 1,441,777 additions and 2 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Auto detect text files and perform LF normalization | ||
* text=auto | ||
|
||
# Custom for Visual Studio | ||
*.cs diff=csharp | ||
*.sln merge=union | ||
*.csproj merge=union | ||
*.vbproj merge=union | ||
*.fsproj merge=union | ||
*.dbproj merge=union | ||
|
||
# Standard to msysgit | ||
*.doc diff=astextplain | ||
*.DOC diff=astextplain | ||
*.docx diff=astextplain | ||
*.DOCX diff=astextplain | ||
*.dot diff=astextplain | ||
*.DOT diff=astextplain | ||
*.pdf diff=astextplain | ||
*.PDF diff=astextplain | ||
*.rtf diff=astextplain | ||
*.RTF diff=astextplain |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,173 @@ | ||
################# | ||
## Eclipse | ||
################# | ||
|
||
*.pydevproject | ||
.project | ||
.metadata | ||
bin/ | ||
tmp/ | ||
*.tmp | ||
*.bak | ||
*.swp | ||
*~.nib | ||
local.properties | ||
.classpath | ||
.settings/ | ||
.loadpath | ||
|
||
# External tool builders | ||
.externalToolBuilders/ | ||
|
||
# Locally stored "Eclipse launch configurations" | ||
*.launch | ||
|
||
# CDT-specific | ||
.cproject | ||
|
||
# PDT-specific | ||
.buildpath | ||
|
||
|
||
################# | ||
## Visual Studio | ||
################# | ||
|
||
## Ignore Visual Studio temporary files, build results, and | ||
## files generated by popular Visual Studio add-ons. | ||
|
||
# User-specific files | ||
*.suo | ||
*.user | ||
*.sln.docstates | ||
|
||
# Build results | ||
[Dd]ebug/ | ||
[Rr]elease/ | ||
*_i.c | ||
*_p.c | ||
*.ilk | ||
*.meta | ||
*.obj | ||
*.pch | ||
*.pdb | ||
*.pgc | ||
*.pgd | ||
*.rsp | ||
*.sbr | ||
*.tlb | ||
*.tli | ||
*.tlh | ||
*.tmp | ||
*.vspscc | ||
.builds | ||
*.dotCover | ||
|
||
## TODO: If you have NuGet Package Restore enabled, uncomment this | ||
#packages/ | ||
|
||
# Visual C++ cache files | ||
ipch/ | ||
*.aps | ||
*.ncb | ||
*.opensdf | ||
*.sdf | ||
|
||
# Visual Studio profiler | ||
*.psess | ||
*.vsp | ||
|
||
# ReSharper is a .NET coding add-in | ||
_ReSharper* | ||
|
||
# Installshield output folder | ||
[Ee]xpress | ||
|
||
# DocProject is a documentation generator add-in | ||
DocProject/buildhelp/ | ||
DocProject/Help/*.HxT | ||
DocProject/Help/*.HxC | ||
DocProject/Help/*.hhc | ||
DocProject/Help/*.hhk | ||
DocProject/Help/*.hhp | ||
DocProject/Help/Html2 | ||
DocProject/Help/html | ||
|
||
# Click-Once directory | ||
publish | ||
|
||
# Others | ||
[Bb]in | ||
[Oo]bj | ||
sql | ||
TestResults | ||
*.Cache | ||
ClientBin | ||
stylecop.* | ||
~$* | ||
*.dbmdl | ||
Generated_Code #added for RIA/Silverlight projects | ||
|
||
# Backup & report files from converting an old project file to a newer | ||
# Visual Studio version. Backup files are not needed, because we have git ;-) | ||
_UpgradeReport_Files/ | ||
Backup*/ | ||
UpgradeLog*.XML | ||
############ | ||
## pycharm | ||
############ | ||
.idea | ||
|
||
############ | ||
## Windows | ||
############ | ||
|
||
# Windows image file caches | ||
Thumbs.db | ||
|
||
# Folder config file | ||
Desktop.ini | ||
|
||
|
||
############# | ||
## Python | ||
############# | ||
|
||
*.py[co] | ||
|
||
# Packages | ||
*.egg | ||
*.egg-info | ||
dist | ||
build | ||
eggs | ||
parts | ||
bin | ||
var | ||
sdist | ||
develop-eggs | ||
.installed.cfg | ||
|
||
# Installer logs | ||
pip-log.txt | ||
|
||
# Unit test / coverage reports | ||
.coverage | ||
.tox | ||
|
||
#Translations | ||
*.mo | ||
|
||
#Mr Developer | ||
.mr.developer.cfg | ||
|
||
# Mac crap | ||
.DS_Store | ||
*.log | ||
test/tmp/* | ||
|
||
#jython | ||
*.class | ||
|
||
MANIFEST | ||
test.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
2016-07-18: fork jieba (0.38) | ||
1. Replaced the dictionary (dict.txt) with a zh-tw version | ||
2. Replaced the HMM probability tables with zh-tw versions | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
The MIT License (MIT) | ||
|
||
Copyright (c) 2013 Sun Junyi | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy of | ||
this software and associated documentation files (the "Software"), to deal in | ||
the Software without restriction, including without limitation the rights to | ||
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of | ||
the Software, and to permit persons to whom the Software is furnished to do so, | ||
subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS | ||
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR | ||
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER | ||
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | ||
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
graft README.md | ||
graft Changelog |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# jieba-zh_TW | ||
|
||
A Taiwanese Traditional Chinese version of the jieba (結巴) word segmenter | ||
|
||
|
||
## How it works | ||
|
||
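Uses the same algorithm as the original jieba; the dictionary and HMM probability tables are replaced to produce a jieba segmenter tailored to Taiwanese Traditional Chinese. | ||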
|
||
|
||
## Usage | ||
|
||
* Compatible with Python 2 and Python 3 | ||
* Place the jieba folder inside your program's folder | ||
* `import jieba` | ||
|
||
|
||
## Code examples | ||
|
||
Usage is the same as the original jieba. | ||
|
||
### Word segmentation | ||
|
||
```python | ||
import jieba | ||
|
||
# If your machine needs to use two versions of jieba at the same time, set a custom cache file name so the two caches do not overwrite each other | ||
#jieba.dt.cache_file = 'jieba.cache.new' | ||
|
||
seg_list = jieba.cut("在非洲,每六十秒,就有一分鐘過去") | ||
print("|".join(seg_list)) | ||
# 在|非洲|,|每|六十秒|,|就|有|一分鐘|過去 | ||
|
||
``` | ||
|
||
### Keyword extraction | ||
The probability table has not been replaced yet, so the output is very unreliable. | ||
|
||
|
||
### Part-of-speech tagging | ||
In its current state this will probably raise an error as soon as it is run. | ||
|
||
|
||
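## Reliability | ||
This code was compared against two other approaches, *converting to Simplified Chinese and then segmenting with jieba* and *segmenting Traditional Chinese directly with jieba*, by segmenting [this news article written by a Taiwanese reporter](http://www.appledaily.com.tw/appledaily/article/international/20160715/37308809/). With the output of the [Academia Sinica Chinese Word Segmentation System](http://ckipsvr.iis.sinica.edu.tw/) taken as the reference answer, the [edit distance](https://en.wikipedia.org/wiki/Edit_distance) between each of the three methods' output and the Academia Sinica result was computed with words as the unit. | ||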
|
||
|
||
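|Edit distance|Paragraph 1 (92)|Paragraph 2 (136)|Paragraph 3 (75)|Paragraph 4 (52)|Paragraph 5 (63)| | ||
|---|---|---|---|---|---| | ||
|jieba zh_TW |9|20|12|12|9| | ||
|jieba after converting to Simplified Chinese|44|43|31|23|21| | ||
|jieba directly on Traditional Chinese|53|74|43|34|34| | ||
(Numbers in parentheses are the word counts produced by the Academia Sinica segmenter.) | ||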
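A word-level edit distance of the kind used in this comparison can be sketched as follows. This is a minimal illustration, assuming the Levenshtein variant (insert, delete, substitute, each costing 1); the README does not say which variant was used. The function name `word_edit_distance` and the sample token lists are hypothetical and are not part of this repository or of jieba.

```python
def word_edit_distance(candidate, reference):
    """Levenshtein distance between two token lists, counted in whole words."""
    # At the start of iteration i, previous[j] is the distance between
    # candidate[:i-1] and reference[:j].
    previous = list(range(len(reference) + 1))
    for i, cand_word in enumerate(candidate, start=1):
        current = [i]  # distance from candidate[:i] to an empty reference
        for j, ref_word in enumerate(reference, start=1):
            substitution = previous[j - 1] + (cand_word != ref_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Illustrative token lists only; these are not real segmenter output.
reference = ["在", "非洲", ",", "每", "六十秒"]
candidate = ["在", "非", "洲", ",", "每", "六十", "秒"]
print(word_edit_distance(candidate, reference))  # 4
```

Because whole words are compared, a segmenter that splits one reference word into two pieces pays for one substitution plus one extra token, which is why over-segmentation shows up as a larger distance.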
|
||
|
||
## Acknowledgements | ||
|
||
* The online Chinese word segmentation service of the CKIP group (詞庫小組), Institute of Information Science, Academia Sinica | ||
|
||
## Notes | ||
|
||
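When using this code, please comply with the derived-data provisions in the [terms of service of the Academia Sinica segmentation service](http://ckipsvr.iis.sinica.edu.tw/terms.htm). | ||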
|
||
|
||
## Some issues | ||
|
||
See this post on my blog: [關於結巴(Jieba)斷詞的幾個問題 (Some issues with jieba word segmentation)](https://blog.ldkrsi.in/%E9%97%9C%E6%96%BC%E7%B5%90%E5%B7%B4%E6%96%B7%E8%A9%9E%E7%9A%84%E5%B9%BE%E5%80%8B%E5%95%8F%E9%A1%8C/) |