README.md +16 −0

    @@ -77,6 +77,22 @@ To install from GitHub, use
     '귀 밑에서 턱까지 잇따라 난 수염을 구레나룻이라고 한다.'
     ```
     
    +Setting rules with csv file. (you only need to use `set_rules_by_csv()` method.)
    +```bash
    +$ cat test.csv
    +인덱스,단어
    +1,네이버영화
    +2,언급된단어
    +```
    +
    +```python
    +>>> from pykospacing import Spacing
    +>>> spacing = Spacing(rules=[''])
    +>>> spacing.set_rules_by_csv('./test.csv', '단어')
    +>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
    +"김형호 영화시장 분석가는 '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
    +```
    +
     Run on command line(thanks [lqez](https://github.com/lqez)).
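The README example relies on each rule being a regex that tolerates whitespace between the characters of a word, so that a word split by the model is joined back together. The following is a minimal standalone sketch of that idea, not the library's code: `compile_rule` and the free function `apply_rules` are hypothetical names, the `re.escape` call is an addition that the diff does not make, and the input string is an assumed intermediate result rather than real model output.

```python
import re


def compile_rule(word):
    # Allow any run of whitespace between consecutive characters of the word.
    # re.escape is an addition here; the diff joins the raw characters directly.
    return re.compile(r'\s*'.join(re.escape(ch) for ch in word))


def apply_rules(rules, sentence):
    # Replace every whitespace-tolerant match with the unspaced rule word.
    for word, rgx in rules.items():
        sentence = rgx.sub(word, sentence)
    return sentence


rules = {w: compile_rule(w) for w in ['네이버영화', '언급된단어']}

# Assumed intermediate string in which both rule words have been split.
spaced = "'1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을"
print(apply_rules(rules, spaced))
# → '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을
```

Keeping the rules in a dict keyed by the word, as the diff does, means that registering the same word twice simply overwrites the earlier pattern.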
pykospacing/kospacing.py +29 −2

     # -*- coding: utf-8 -*-
     import os
     import re
    +import csv
     import numpy as np
     import pkg_resources

    @@ -28,7 +29,33 @@ class Spacing:
             self._w2idx = W2IDX
             self.max_len = MAX_LEN
             self.pattern = re.compile(r'\s+')
    -        self.rules = [(re.compile('\s*'.join(r)), r) for r in rules]
    +        self.rules = {}
    +        for r in rules:
    +            if type(r) == str:
    +                self.rules[r] = re.compile('\s*'.join(r))
    +            else:
    +                raise ValueError("rules must to have only string values.")
    +
    +    def set_rules_by_csv(self, file_path, key=None):
    +        with open(file_path, 'r', encoding='UTF-8') as csvfile:
    +            csv_var = csv.reader(csvfile)
    +            if key == None:
    +                for line in csv_var:
    +                    for word in line:
    +                        self.rules[word] = re.compile('\s*'.join(word))
    +            else:
    +                csv_var = list(csv_var)
    +                index = -1
    +                for i, word in enumerate(csv_var[0]):
    +                    if word == key:
    +                        index = i
    +                        break
    +
    +                if index == -1:
    +                    raise KeyError(f"'{key}' is not in csv file")
    +
    +                for line in csv_var:
    +                    self.rules[line[index]] = re.compile('\s*'.join(line[index]))
     
         def get_spaced_sent(self, raw_sent):
             raw_sent_ = "«" + raw_sent + "»"

    @@ -57,7 +84,7 @@ class Spacing:
             return subs
     
         def apply_rules(self, spaced_sent):
    -        for rgx, word in self.rules:
    +        for word, rgx in self.rules.items():
                 spaced_sent = rgx.sub(word, spaced_sent)
             return spaced_sent
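As written in the hunk above, the keyed branch of `set_rules_by_csv()` loops over every row of `csv_var`, including the header row `csv_var[0]`, so the column name itself (here `단어`) would also be registered as a rule. A possible variation is sketched below as a standalone function. It is not the merged code: `load_rules_from_csv` is a hypothetical name, and skipping the header row, ignoring empty cells, and escaping with `re.escape` are all assumptions about what a reviewer might want.

```python
import csv
import io
import re


def load_rules_from_csv(csvfile, key=None):
    """Hypothetical helper: build {word: compiled pattern} from an open CSV file."""
    rows = list(csv.reader(csvfile))
    if not rows:
        return {}
    if key is None:
        # No key given: treat every cell of every row as a rule word.
        words = [cell for row in rows for cell in row]
    else:
        header, body = rows[0], rows[1:]
        if key not in header:
            raise KeyError(f"'{key}' is not in csv file")
        index = header.index(key)
        # Assumption: the header row is not a rule, and short rows are skipped.
        words = [row[index] for row in body if len(row) > index]
    # Empty strings are dropped; each character is escaped before joining.
    return {w: re.compile(r'\s*'.join(map(re.escape, w))) for w in words if w}


# Same contents as test.csv in the README hunk, held in memory for the example.
sample = io.StringIO("인덱스,단어\n1,네이버영화\n2,언급된단어\n")
rules = load_rules_from_csv(sample, key='단어')
print(list(rules))
# → ['네이버영화', '언급된단어']
```

Taking an open file object rather than a path keeps the sketch testable without touching the filesystem; a method on `Spacing` could wrap it with `open(file_path, 'r', encoding='UTF-8')` and merge the result into `self.rules`.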