Merge pull request #8 from manorie/development
Refactorings.
Mehmet Cetin committed Oct 22, 2015
2 parents 368dce9 + cefe8e7 commit c923646
Showing 13 changed files with 114 additions and 40 deletions.
28 changes: 14 additions & 14 deletions README.md
@@ -6,7 +6,7 @@
[![Dependency Status](https://gemnasium.com/manorie/textoken.svg)](https://gemnasium.com/manorie/textoken)
[![Gem Version](https://badge.fury.io/rb/textoken.svg)](http://badge.fury.io/rb/textoken)

Textoken is a Ruby library for text tokenization. This gem extracts words from text with many customizations. It can be used in many fields like crawling and Natural Language Processing.
Textoken is a Ruby library for text tokenization. This gem extracts words from text with many customizations. It can be used in many fields like Web Crawling and Natural Language Processing.

## Basic Usage

@@ -43,7 +43,7 @@ Textoken('Oh, no! Alfa 2000 is at home.', only_regexp: '^[0-9]*$').tokens

You can combine all options. 'Only' and 'Exclude' Options support multiple option values like **only: 'punctuations, dates, numerics'**

Public interface of Textoken presents two methods, tokens & word;
Public interface of Textoken presents two methods, **tokens** & **words**

```ruby
Textoken('Alfa.').tokens
@@ -57,31 +57,31 @@ Textoken('Alfa.').words

## Current Options

**only:** accepts any regexp defined in [option_values.yml](//github.com/manorie/textoken/blob/development/lib/textoken/regexps/option_values.yml)
- **only:** Accepts any regexp defined in [option_values.yml](//github.com/manorie/textoken/blob/development/lib/textoken/regexps/option_values.yml)

**exclude:** accepts any regexp defined in [option_values.yml](https://github.com/manorie/textoken/blob/development/lib/textoken/regexps/option_values.yml)
- **only_regexp:** Accepts any regexp but only one regexp can be given.

**less_than:** accepts any integer bigger than 1
- **exclude:** Accepts any regexp defined in [option_values.yml](https://github.com/manorie/textoken/blob/development/lib/textoken/regexps/option_values.yml)

**more_than:** accepts any positive integer
- **exclude_regexp** Accepts any regexp but only one regexp can be given.

**only_regexp:** accepts any regexp but only one regexp can be given
- **less_than:** Accepts any integer bigger than 1.

**exclude_regexp** accepts any regexp but only one regexp can be given
- **more_than:** Accepts any positive integer.

## Option Meanings

**only:** If a word in text consist of a regexp or regexps, only option includes it in result.
- **only:** If a word in text consist of a regexp or regexps, only option includes it in result.

**only_regexp:** If a word in text consist of user given regexp, only_regexp option includes it in result.
- **only_regexp:** If a word in text consist of user given regexp, only_regexp option includes it in result.

**exclude:** If a word in text does not have a regexp at some part, exclude option excludes it from result. Opposite of only.
- **exclude:** If a word in text does not have a regexp at some part, exclude option excludes it from result. Opposite of only.

**exclude_regexp:** If a word in text does not have user given regexp at some part, exclude option excludes it from result. Opposite of only_regexp.
- **exclude_regexp:** If a word in text does not have user given regexp at some part, exclude option excludes it from result. Opposite of only_regexp.

**less_than:** Filters result by the word length less than the option value given.
- **less_than:** Filters result by the word length less than the option value given.

**more_than:** Filters result by the word length bigger than the option value given.
- **more_than:** Filters result by the word length bigger than the option value given.


## Installation
1 change: 1 addition & 0 deletions lib/textoken.rb
@@ -9,6 +9,7 @@
require 'textoken/tokenizer'
require 'textoken/scanner'

require 'textoken/options/modules/tokenizable_option'
require 'textoken/options/modules/numeric_option'
require 'textoken/options/modules/conditional_option'
require 'textoken/options/modules/regexp_option'
5 changes: 3 additions & 2 deletions lib/textoken/options/exclude.rb
@@ -4,11 +4,12 @@ module Textoken
class Exclude
include ConditionalOption

private

# base.text is raw tokens splitted with ' '
# values are Regexps array to search
# base.findings, Findings object for pushing matching tokens
def tokenize(base)
@base = base
def tokenize_condition
tokenize_if { |word, regexp| !word.match(regexp) }
end
end
11 changes: 5 additions & 6 deletions lib/textoken/options/less_than.rb
@@ -4,15 +4,14 @@ module Textoken
class LessThan
include NumericOption

def tokenize(base)
@base = base
private

def tokenize_condition
tokenize_if { |word| word.length < number }
end

private

def validate_option_value(value)
validate { value.class == Fixnum && value > 1 }
def validate_option_value
validate { |value| value > 1 }
end
end
end
4 changes: 4 additions & 0 deletions lib/textoken/options/modules/conditional_option.rb
@@ -1,6 +1,8 @@
module Textoken
# This module will be shared in options like, only and exclude
module ConditionalOption
include TokenizableOption

attr_reader :regexps, :findings, :base

def priority
@@ -12,6 +14,8 @@ def initialize(values)
@findings = Findings.new
end

private

def tokenize_if(&block)
regexps.each do |r|
base.text.each_with_index do |w, i|
10 changes: 7 additions & 3 deletions lib/textoken/options/modules/numeric_option.rb
@@ -1,18 +1,22 @@
module Textoken
# This module will be shared in options like, more_than and less_than
module NumericOption
attr_reader :number, :findings, :base
include TokenizableOption

attr_reader :number, :findings

def priority
2
end

def initialize(value)
validate_option_value(value)
@number = value
@findings = Findings.new
validate_option_value
end

private

def tokenize_if(&code)
base.text.each_with_index do |w, i|
findings.push(i, w) if code.call(w)
@@ -21,7 +25,7 @@ def tokenize_if(&code)
end

def validate(&code)
return if code.call
return if number.class == Fixnum && code.call(number)
Textoken.expression_err "value #{number} is not permitted for
#{self.class.name} option."
end
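
The reworked validation flow in this hunk can be sketched in isolation: the value is stored first, the shared `validate` helper does the type check, and each option supplies only its range check as a block. The sketch below is an assumption-laden stand-in, not the gem's code. `ValidationSketch`, `NumericCheck` and `InvalidOption` are hypothetical names, a plain exception replaces `Textoken.expression_err`, and `Integer` is used in place of `Fixnum` because `Fixnum` was deprecated in Ruby 2.4 and removed in Ruby 3.2.

```ruby
# Hypothetical stand-ins for illustration; not part of Textoken.
module ValidationSketch
  class InvalidOption < StandardError; end

  # Shared behaviour: store the value, then let the including class validate it.
  module NumericCheck
    attr_reader :number

    def initialize(value)
      @number = value
      validate_option_value
    end

    private

    # Type check lives here once; the block carries the option-specific range check.
    def validate
      return if number.is_a?(Integer) && yield(number)
      raise InvalidOption,
            "value #{number.inspect} is not permitted for #{self.class.name} option."
    end
  end

  class LessThan
    include NumericCheck

    private

    def validate_option_value
      validate { |value| value > 1 }
    end
  end
end

puts ValidationSketch::LessThan.new(5).number
begin
  ValidationSketch::LessThan.new(1)
rescue ValidationSketch::InvalidOption => e
  puts e.message
end
```

Because the type check short-circuits, the block is never called with a non-integer, so option classes no longer need to repeat the class comparison.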
18 changes: 18 additions & 0 deletions lib/textoken/options/modules/tokenizable_option.rb
@@ -0,0 +1,18 @@
module Textoken
# This module will be shared in options like, only_regexp and exclude_regexp
module TokenizableOption
attr_reader :base

def tokenize(base)
@base = base
tokenize_condition
end

private

def tokenize_condition
Textoken.type_err('tokenize_condition method has to be implemented
for Options.')
end
end
end
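
The new module follows what is usually called the template method pattern: `tokenize` is defined once, and each option supplies only a private `tokenize_condition`. A minimal standalone sketch of that shape follows, under stated assumptions: `TemplateSketch`, `Tokenizable`, `ShortWords` and `Unfinished` are hypothetical names, `base` is a plain array rather than the gem's base object, and Ruby's built-in `NotImplementedError` stands in for `Textoken.type_err`.

```ruby
# Hypothetical names for illustration; the real module is Textoken::TokenizableOption.
module TemplateSketch
  module Tokenizable
    attr_reader :base

    # Template method: store the base, then delegate to the hook.
    def tokenize(base)
      @base = base
      tokenize_condition
    end

    private

    # Default hook: fail loudly when an including class forgets to override it.
    def tokenize_condition
      raise NotImplementedError,
            "#{self.class.name} must implement tokenize_condition"
    end
  end

  # An option that overrides the hook.
  class ShortWords
    include Tokenizable

    private

    def tokenize_condition
      base.select { |word| word.length < 3 }
    end
  end

  # An option that forgets to override the hook.
  class Unfinished
    include Tokenizable
  end
end

p TemplateSketch::ShortWords.new.tokenize(%w[a bb ccc])
begin
  TemplateSketch::Unfinished.new.tokenize([])
rescue NotImplementedError => e
  puts e.message
end
```

Keeping the hook private, as the commit does, leaves `tokenize` as the only public entry point on every option.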
11 changes: 5 additions & 6 deletions lib/textoken/options/more_than.rb
@@ -4,15 +4,14 @@ module Textoken
class MoreThan
include NumericOption

def tokenize(base)
@base = base
private

def tokenize_condition
tokenize_if { |word| word.length > number }
end

private

def validate_option_value(value)
validate { value.class == Fixnum && value >= 0 }
def validate_option_value
validate { |value| value >= 0 }
end
end
end
5 changes: 3 additions & 2 deletions lib/textoken/options/only.rb
@@ -4,11 +4,12 @@ module Textoken
class Only
include ConditionalOption

private

# base.text is raw tokens splitted with ' '
# values are Regexps array to search
# base.findings, Findings object for pushing matching tokens
def tokenize(base)
@base = base
def tokenize_condition
tokenize_if { |word, regexp| word.match(regexp) }
end
end
2 changes: 1 addition & 1 deletion lib/textoken/version.rb
@@ -1,3 +1,3 @@
module Textoken
VERSION = "1.1.0"
VERSION = "1.1.1"
end
4 changes: 2 additions & 2 deletions spec/lib/textoken/options/modules/numeric_option_spec.rb
@@ -17,8 +17,8 @@ def tokenize_false(base)

private

def validate_option_value(value)
validate { value.class == Fixnum && value > 1 }
def validate_option_value
validate { |value| value > 1 }
end
end
end
50 changes: 50 additions & 0 deletions spec/lib/textoken/options/modules/tokenizable_option_spec.rb
@@ -0,0 +1,50 @@
require 'spec_helper'

module Textoken
# A test dummy
class TheDumy
include TokenizableOption

private

def tokenize_condition
end
end
end

module Textoken
# Another test dummy
class TheErrorDumy
include TokenizableOption
end
end

describe Textoken::TokenizableOption do
describe '#tokenize' do
context 'sets the base' do
it 'as expected' do
t = Textoken::TheDumy.new
object = Object.new
t.tokenize(object)
expect(t.base).to eq(object)
end
end

context 'sends tokenize_condition' do
it 'as expected' do
t = Textoken::TheDumy.new
expect(t).to receive(:tokenize_condition)
t.tokenize(Object.new)
end
end

context 'raises error when not implemented' do
it 'as expected' do
t = Textoken::TheErrorDumy.new
expect do
t.tokenize(Object.new)
end.to raise_error(Textoken::TypeError)
end
end
end
end
5 changes: 1 addition & 4 deletions textoken.gemspec
@@ -11,14 +11,11 @@ Gem::Specification.new do |s|
s.email = ["[email protected]"]
s.homepage = "https://github.com/manorie/textoken"
s.summary = "Simple and customizable text tokenization gem."
s.description = "Textoken is a Ruby library for text tokenization.
This gem extracts words from text with many customizations.
It can be used in many fields like crawling and Natural Language Processing."
s.description = "Textoken is a Ruby library for text tokenization. This gem extracts words from text with many customizations. It can be used in many fields like Web Crawling and Natural Language Processing."
s.license = "MIT"

s.files = Dir["{app,config,db,lib}/**/*", "MIT-LICENSE", "Rakefile", "README.rdoc"]

s.add_development_dependency 'rspec', '~> 3.3.0', '>= 3.3.0'
s.add_development_dependency 'rake', '~> 10.0'
s.add_development_dependency 'pry', '~> 0'
end
