Unverified commit 09c6bc04, authored by Heewon Jeon(gogamza) and committed by GitHub

Merge pull request #29 from JustKode/master

Add new rules-related function and type checking
parents 2e303d3c d6b02aa3
+16 −0
@@ -77,6 +77,22 @@ To install from GitHub, use
'귀 밑에서 턱까지 잇따라 난 수염을 구레나룻이라고 한다.'
```

Setting rules from a CSV file (you only need to call the `set_rules_by_csv()` method).

```bash
$ cat test.csv
인덱스,단어
1,네이버영화
2,언급된단어
```

```python
>>> from pykospacing import Spacing
>>> spacing = Spacing(rules=[''])
>>> spacing.set_rules_by_csv('./test.csv', '단어')
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
```
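
The column lookup can also be sketched with the standard library's `csv.DictReader`, which consumes the header row itself. This is only an illustration under stated assumptions: the helper name `rules_from_csv` is hypothetical and not part of pykospacing, and escaping each character with `re.escape` is an extra precaution that `set_rules_by_csv()` is not shown to take.

```python
import csv
import io
import re


def rules_from_csv(csvfile, key):
    """Hypothetical helper: map each word in column `key` to a compiled pattern."""
    reader = csv.DictReader(csvfile)  # the first row is read as the header
    if key not in (reader.fieldnames or []):
        raise KeyError(f"'{key}' is not in csv file")
    rules = {}
    for row in reader:
        word = row[key]
        if word:  # skip empty or missing cells
            # Allow optional whitespace between the characters of the word.
            rules[word] = re.compile(r'\s*'.join(re.escape(ch) for ch in word))
    return rules


# In-memory stand-in for the test.csv file shown above.
sample = io.StringIO("인덱스,단어\n1,네이버영화\n2,언급된단어\n")
rules = rules_from_csv(sample, '단어')
print(list(rules))  # ['네이버영화', '언급된단어']
```

Because the header is handled by the reader, the column name `단어` itself never ends up among the rules.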

Run on command line (thanks [lqez](https://github.com/lqez)).

+29 −2
# -*- coding: utf-8 -*-
import os
import re
+import csv

import numpy as np
import pkg_resources
@@ -28,7 +29,33 @@ class Spacing:
        self._w2idx = W2IDX
        self.max_len = MAX_LEN
        self.pattern = re.compile(r'\s+')
-        self.rules = [(re.compile('\s*'.join(r)), r) for r in rules]
+        self.rules = {}
+        for r in rules:
+            if isinstance(r, str):
+                self.rules[r] = re.compile(r'\s*'.join(r))
+            else:
+                raise ValueError("rules must contain only string values.")
+
+    def set_rules_by_csv(self, file_path, key=None):
+        with open(file_path, 'r', encoding='UTF-8') as csvfile:
+            csv_var = csv.reader(csvfile)
+            if key is None:
+                for line in csv_var:  # no key column given: every cell becomes a rule
+                    for word in line:
+                        self.rules[word] = re.compile(r'\s*'.join(word))
+            else:
+                csv_var = list(csv_var)
+                index = -1
+                for i, word in enumerate(csv_var[0]):
+                    if word == key:
+                        index = i
+                        break
+
+                if index == -1:
+                    raise KeyError(f"'{key}' is not in csv file")
+
+                for line in csv_var[1:]:  # skip the header row so the column name is not a rule
+                    self.rules[line[index]] = re.compile(r'\s*'.join(line[index]))

    def get_spaced_sent(self, raw_sent):
        raw_sent_ = "«" + raw_sent + "»"
@@ -57,7 +84,7 @@ class Spacing:
        return subs

    def apply_rules(self, spaced_sent):
-        for rgx, word in self.rules:
+        for word, rgx in self.rules.items():
            spaced_sent = rgx.sub(word, spaced_sent)
        return spaced_sent
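
A standalone sketch of how a rule undoes unwanted spacing: joining the characters of a word with `\s*` yields a pattern that matches the word with any whitespace between its characters, and substituting the original word collapses it again. The functions `build_rule` and `apply_rules` below are illustrative stand-ins written for this note, not the library's own methods, and no model is involved.

```python
import re


def build_rule(word):
    # '네이버영화' becomes the pattern 네\s*이\s*버\s*영\s*화
    return re.compile(r'\s*'.join(word))


def apply_rules(spaced_sent, rules):
    # rules maps each word to its compiled pattern, as in the dict-based version.
    for word, rgx in rules.items():
        spaced_sent = rgx.sub(word, spaced_sent)
    return spaced_sent


rules = {w: build_rule(w) for w in ['네이버영화', '언급된단어']}
result = apply_rules("네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을", rules)
print(result)  # 네이버영화 정보 네티즌 10점 평에서 언급된단어들을
```

Keying the dict by word means that registering the same word twice simply overwrites the earlier pattern instead of adding a duplicate rule.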