Commit ccd73228 authored by haven-jeon

first commit


.gitignore

0 → 100644
+4 −0
setup.py
__pycache__
.eggs
dist

MANIFEST.in

0 → 100644
+5 −0
include kospacing/resources/dicts/*
include kospacing/resources/models/* 
include kospacing/embedding_maker.py
include kospacing/kospacing.py

README.md

0 → 100644
+61 −0
KoSpacing 
---------------

Python package for automatic Korean word spacing.


[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)


#### Introduction

Word spacing is an important preprocessing step in Korean text analysis, and its accuracy strongly affects the accuracy of subsequent analysis. `KoSpacing` performs automatic word spacing with fairly high accuracy and works especially well on online text originating from SNS.

`KoSpacing` is based on a deep learning model trained on a large corpus (more than 100 million news articles from [Chan-Yub Park](https://github.com/mrchypark)).


#### Performance

| Test Set  | Accuracy | 
|---|---|
| Sejong (colloquial style) Corpus (1M) | 97.1% |
| OOOO (literary style) Corpus (3M) | 94.3% |

- Accuracy = (# correctly spaced characters) / (# characters in the test data).
  - Accuracy might improve if compound words are normalized.
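
The accuracy definition above can be sketched in a few lines of Python. The evaluation code used for the table is not part of this commit, so the helper names `spacing_labels` and `spacing_accuracy` are hypothetical, and the sketch assumes that the gold and predicted strings contain the same non-space characters and that a character counts as "correctly spaced" when both strings agree on whether a space follows it.

```python
def spacing_labels(text):
    """Return one 0/1 label per non-space character: 1 if a space follows it.

    Hypothetical helper for illustration; not part of the kospacing package.
    """
    labels = []
    for ch in text.strip():
        if ch == " ":
            # Mark the previous character as "followed by a space".
            if labels:
                labels[-1] = 1
        else:
            labels.append(0)
    return labels


def spacing_accuracy(gold, predicted):
    """Fraction of characters whose spacing label matches the gold label."""
    gold_labels = spacing_labels(gold)
    pred_labels = spacing_labels(predicted)
    if len(gold_labels) != len(pred_labels):
        raise ValueError("gold and predicted must contain the same non-space characters")
    if not gold_labels:
        return 1.0  # nothing to score
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return correct / len(gold_labels)
```

For instance, comparing a gold string `"ab cd ef"` with a prediction `"abcd ef"` gives one wrong label out of six characters.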


#### Install

Install conda from https://www.anaconda.com/download/ and use Python 3.6 or later.

To install from GitHub, use

    pip install git+git://github.com/haven-jeon/.git


#### Example 

    >>> from kospacing import spacing
    >>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
    "김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."


#### Model Architecture

![](arch.png)


#### Citation

```bibtex
@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing with R},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}
}
```


arch.png

0 → 100644
+105 KiB

kospacing/__init__.py

0 → 100644
+5 −0


from pykospacing.kospacing import *

__version__ = '0.01'