Hello,

While using this module, I noticed that for one revision the current_tokens list returned by the state's update method is missing some tokens.
Here is example code that reproduces the problem:
import requests
from pprint import pprint
import mwpersistence
import deltas
from mwreverts.defaults import RADIUS
from deltas.tokenizers.wikitext_split import wikitext_split

page_id = 2161298
rev_id = 480327915  # for testing purposes, process only this revision id

# wikitext_split is used, default is text_split.
state = mwpersistence.DiffState(deltas.SegmentMatcher(tokenizer=wikitext_split),
                                revert_radius=RADIUS)

# get text of given revision
params = {'pageids': page_id, 'action': 'query', 'prop': 'revisions',
          'rvprop': 'content|ids|timestamp|sha1|comment|flags|user|userid',
          'rvlimit': 1, 'format': 'json', 'rvstartid': rev_id}
result = requests.get(url='https://en.wikipedia.org/w/api.php', params=params).json()
_, page = result['query']['pages'].popitem()
for rev in page.get('revisions', []):
    text = rev.get('*', '')
    text = text.lower()
    # process revision
    current_tokens, tokens_added, tokens_removed = state.update(text, revision=rev_id)
    # split rev text to compare with returned current_tokens
    tokens = wikitext_split.tokenize(text)
    print(len(current_tokens), len(tokens))
    # pprint(current_tokens)
When you run this code, you will see that the two token counts differ (3822 in 'current_tokens' versus 5563 from tokenizing the text directly), and the last 5 tokens in 'current_tokens' are:
which are not the last tokens in that revision.
First, I would like to ask whether I am using these modules correctly. If so, why are those tokens missing?
I am working with Python 3.5.3 and installed all modules with pip.
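To narrow down exactly which tokens are missing, the two lists could be aligned with the standard library's difflib. This is only a rough sketch: missing_tokens is a hypothetical helper, not part of mwpersistence or deltas, and it assumes the tokens can be compared after converting them with str(). The toy lists stand in for current_tokens and tokens from the example above.

```python
from difflib import SequenceMatcher


def missing_tokens(returned, expected):
    """Return (index, token) pairs from `expected` that have no counterpart in `returned`.

    Hypothetical helper: both arguments are sequences of tokens that are
    converted to plain strings before being aligned.
    """
    a = [str(t) for t in returned]
    b = [str(t) for t in expected]
    # autojunk=False so frequent tokens (spaces, brackets) are not ignored
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    missing = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' mark spans of `expected` absent from `returned`
        if tag in ('insert', 'replace'):
            missing.extend((j, b[j]) for j in range(j1, j2))
    return missing


# Toy stand-ins for current_tokens (returned) and tokens (expected)
returned = ['a', 'b', 'c']
expected = ['a', 'b', 'c', 'd', 'e']
print(missing_tokens(returned, expected))
```

In the script above, the call would be missing_tokens(current_tokens, tokens). With several thousand tokens the alignment may take a moment, since SequenceMatcher can be quadratic in the worst case.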