missing tokens after state update #4

Open · bitnik opened this issue Jun 13, 2017 · 1 comment

bitnik commented Jun 13, 2017

Hello,

While using this module, I noticed that for one revision the current_tokens list returned by the state update method is missing some tokens.

Here is example code to reproduce the problem:

import requests
from pprint import pprint
import mwpersistence
import deltas
from mwreverts.defaults import RADIUS
from deltas.tokenizers.wikitext_split import wikitext_split


page_id = 2161298
rev_id = 480327915  # for testing purpose, process only this revision id
# wikitext_split is used; the default is text_split.
state = mwpersistence.DiffState(deltas.SegmentMatcher(tokenizer=wikitext_split), 
                                revert_radius=RADIUS)

# get text of given revision
params = {'pageids': page_id, 'action': 'query', 'prop': 'revisions',
          'rvprop': 'content|ids|timestamp|sha1|comment|flags|user|userid',
          'rvlimit': 1, 'format': 'json', 'rvstartid': rev_id}
result = requests.get(url='https://en.wikipedia.org/w/api.php', params=params).json()
_, page = result['query']['pages'].popitem()
for rev in page.get('revisions', []):
    text = rev.get('*', '')
    text = text.lower()
    # process revision
    current_tokens, tokens_added, tokens_removed = state.update(text, revision=rev_id)

    # split rev text to compare with returned current_tokens
    tokens = wikitext_split.tokenize(text)

    print(len(current_tokens), len(tokens))
    # pprint(current_tokens)

When you run this code, you will see that the two token counts differ (3822 vs. 5563), and that the last 5 tokens in 'current_tokens' are:

  • Token('has', type='word', revisions=[480327915]),
  • Token(' ', type='whitespace', revisions=[480327915]),
  • Token('higher', type='word', revisions=[480327915]),
  • Token(' ', type='whitespace', revisions=[480327915]),
  • Token('mechanical', type='word', revisions=[480327915])
    which are not the last tokens in that revision.
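
One way to see exactly where the two sequences diverge is to diff them (a rough sketch, added for illustration; it assumes str() on both token types yields the underlying token text, so adjust if your Token class exposes it differently):

import difflib

a = [str(t) for t in current_tokens]
b = [str(t) for t in tokens]
# Print every non-matching run of tokens, along with the first few tokens
# from the freshly tokenized text, to show what current_tokens is missing.
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if tag != 'equal':
        print(tag, i1, i2, j1, j2, b[j1:j1 + 5])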

First, am I using these modules correctly? If so, why are those tokens missing?

I am working with Python 3.5.3 and installed all modules with pip.


groceryheist commented Sep 5, 2018

This is fixed in halfak/deltas@aa314c8
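
If you still see the mismatch, it may help to check which deltas build is installed and upgrade it before re-running the reproduction script (a minimal check, assuming setuptools' pkg_resources is available; whether a given release includes the commit above is not stated here):

import pkg_resources

# Print the installed deltas version; if it predates the fix referenced
# above, upgrade the package and re-run the script. With the fix in place,
# the two token counts printed by the script should agree.
print(pkg_resources.get_distribution('deltas').version)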
