# -*- coding: utf-8 -*-
"""
This bot will make direct text replacements. It will retrieve information on
which pages might need changes either from an XML dump or a text file, or only
change a single page.
These command line parameters can be used to specify which pages to work on:
&params;
-xml Retrieve information from a local XML dump (pages-articles
or pages-meta-current, see http://download.wikimedia.org).
Argument can also be given as "-xml:filename".
-page Only edit a specific page.
Argument can also be given as "-page:pagetitle". You can
give this parameter multiple times to edit multiple pages.
Furthermore, the following command line parameters are supported:
-regex Make replacements using regular expressions. If this argument
isn't given, the bot will make simple text replacements.
-nocase Use case insensitive regular expressions.
-dotall Make the dot match any character at all, including a newline.
Without this flag, '.' will match anything except a newline.
-multiline '^' and '$' will now match begin and end of each line.
-xmlstart (Only works with -xml) Skip all articles in the XML dump
before the one specified (may also be given as
-xmlstart:Article).
-save Saves the titles of the articles to a file instead of
modifying the articles. This way you may collect titles to
work on in automatic mode, and process them later with
-file. Opens the file for appending if it exists.
If you insert the contents of the file into a wikipage, it
will appear as a numbered list, and may be used with -links.
Argument may also be given as "-save:filename".
-savenew Just like -save, except that it overwrites the existing file.
Argument may also be given as "-savenew:filename".
-saveexc With this parameter a new option will appear in choices:
"no+eXcept". If you press x, the text will not be replaced,
and the title of the page will be saved to the given exception
file to exclude this page from future replacements. At the
moment you may paste the contents directly into 'title' list
of the exceptions dictionary of your fix (use tab to indent).
Reading back the list from file will be implemented later.
Argument may also be given as "-saveexc:filename".
Opens the file for appending if it exists.
-saveexcnew Just like -saveexc, except that it overwrites the existing file.
Argument may also be given as "-saveexcnew:filename".
-readexc Reserved for reading saved exceptions from a file.
Not implemented yet.
-addcat:cat_name Adds "cat_name" category to every altered page.
-excepttitle:XYZ Skip pages with titles that contain XYZ. If the -regex
argument is given, XYZ will be regarded as a regular
expression. Use multiple times to ignore multiple pages.
-requiretitle:XYZ Only do pages with titles that contain XYZ. If the -regex
argument is given, XYZ will be regarded as a regular
expression.
-excepttext:XYZ Skip pages which contain the text XYZ. If the -regex
argument is given, XYZ will be regarded as a regular
expression.
-exceptinside:XYZ Skip occurrences of the to-be-replaced text which lie
within XYZ. If the -regex argument is given, XYZ will be
regarded as a regular expression.
-exceptinsidetag:XYZ Skip occurrences of the to-be-replaced text which lie
within an XYZ tag.
-summary:XYZ Set the summary message text for the edit to XYZ, bypassing
the predefined message texts with original and replacements
inserted.
-sleep:123 If you use -fix, the bot may check several regexes on every
page. This can waste a lot of CPU, because the bot runs
every regex back to back, using all available resources.
This option makes the bot sleep the given number of seconds
between one regex and the next, to reduce the CPU load.
-query: The maximum number of pages that the bot will load at once.
Default value is 60. Ignored when reading an XML file.
-fix:XYZ Perform one of the predefined replacements tasks, which are
given in the dictionary 'fixes' defined inside the files
fixes.py and user-fixes.py.
The -regex, -recursive and -nocase arguments and any given
replacements and exceptions will be ignored if you use -fix
and they are present in the 'fixes' dictionary.
Currently available predefined fixes are:
&fixes-help;
-always Don't prompt you for each replacement
-recursive Recurse replacement as long as possible. Be careful, this
might lead to an infinite loop.
-allowoverlap When occurrences of the pattern overlap, replace all of them.
Be careful, this might lead to an infinite loop.
other: First argument is the old text, second argument is the new
text. If the -regex argument is given, the first argument
will be regarded as a regular expression, and the second
argument might contain expressions like \\1 or \g<name>.
It is possible to introduce more than one pair of old text
and replacement.
-replacementfile Lines from the given file name(s) will be read as if they
were added to the command line at that point. I.e. a file
containing lines "a" and "b", used as
python replace.py -page:X -replacementfile:file c d
will replace 'a' with 'b' and 'c' with 'd'. However, using
python replace.py -page:X c -replacementfile:file d will
also work, and will replace 'c' with 'a' and 'b' with 'd'.
Examples:
If you want to change templates from the old syntax, e.g. {{msg:Stub}}, to the
new syntax, e.g. {{Stub}}, download an XML dump file (pages-articles) from
http://download.wikimedia.org, then use this command:
python replace.py -xml -regex "{{msg:(.*?)}}" "{{\\1}}"
If you have a dump called foobar.xml and want to fix typos in articles, e.g.
Errror -> Error, use this:
python replace.py -xml:foobar.xml "Errror" "Error" -namespace:0
If you want to do more than one replacement at a time, use this:
python replace.py -xml:foobar.xml "Errror" "Error" "Faail" "Fail" -namespace:0
If you have a page called 'John Doe' and want to fix the format of ISBNs, use:
python replace.py -page:John_Doe -fix:isbn
Let's suppose, you want to change "color" to "colour" manually, but gathering
the articles is too slow, so you want to save the list while you are sleeping.
You have Windows, so "python" is not necessary. Use this:
replace.py -xml -save:color.txt color colour -always
You may use color.txt later with -file or -links, if you upload it to the wiki.
This command will change 'referer' to 'referrer', but not in pages which
talk about HTTP, where the typo has become part of the standard:
python replace.py referer referrer -file:typos.txt -excepttext:HTTP
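With -regex, replacements follow the semantics of Python's re module; the
first example above can be sketched standalone (the sample text is
illustrative, and the bot itself is not involved):

```python
import re

# Standalone sketch of: replace.py -xml -regex "{{msg:(.*?)}}" "{{\1}}"
pattern = re.compile(r'\{\{msg:(.*?)\}\}')
new_text = pattern.sub(r'{{\1}}', 'A {{msg:Stub}} article.')
# new_text == 'A {{Stub}} article.'
```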
Please type "replace.py -help | more" if you can't read the top of the help.
"""
from __future__ import generators
#
# (C) Daniel Herding & the Pywikipedia team, 2004-2012
#
__version__='$Id: replace.py 10163 2012-05-01 14:40:41Z xqt $'
#
# Distributed under the terms of the MIT license.
#
import sys, re, time, codecs
import wikipedia as pywikibot
import pagegenerators
import editarticle
from pywikibot import i18n
import webbrowser
# Imports predefined replacements tasks from fixes.py
import fixes
# This is required for the text that is shown when you run this script
# with the parameter -help.
docuReplacements = {
'&params;': pagegenerators.parameterHelp,
'&fixes-help;': fixes.help,
}
class XmlDumpReplacePageGenerator:
"""
Iterator that will yield Pages that might contain text to replace.
These pages will be retrieved from a local XML dump file.
Arguments:
* xmlFilename - The dump's path, either absolute or relative
* xmlStart - Skip all articles in the dump before this one
* replacements - A list of 2-tuples of original text (as a
compiled regular expression) and replacement
text (as a string).
* exceptions - A dictionary which defines when to ignore an
occurrence. See the documentation of the
ReplaceRobot constructor below.
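The core decision of this generator (yield a page only if some replacement
would change its text) can be sketched with plain re; the replacement list
and text below are made up for illustration:

```python
import re

# Illustrative replacement list; the real one comes from the command line
replacements = [(re.compile('Errror'), 'Error')]
text = 'An Errror occurred.'
new_text = text
for old, new in replacements:
    new_text = old.sub(new, new_text)
changed = new_text != text  # the generator yields the page only when True
```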
"""
def __init__(self, xmlFilename, xmlStart, replacements, exceptions):
self.xmlFilename = xmlFilename
self.replacements = replacements
self.exceptions = exceptions
self.xmlStart = xmlStart
self.skipping = bool(xmlStart)
self.excsInside = []
if "inside-tags" in self.exceptions:
self.excsInside += self.exceptions['inside-tags']
if "inside" in self.exceptions:
self.excsInside += self.exceptions['inside']
import xmlreader
self.site = pywikibot.getSite()
dump = xmlreader.XmlDump(self.xmlFilename)
self.parser = dump.parse()
def __iter__(self):
try:
for entry in self.parser:
if self.skipping:
if entry.title != self.xmlStart:
continue
self.skipping = False
if not self.isTitleExcepted(entry.title) \
and not self.isTextExcepted(entry.text):
new_text = entry.text
for old, new in self.replacements:
new_text = pywikibot.replaceExcept(
new_text, old, new, self.excsInside, self.site)
if new_text != entry.text:
yield pywikibot.Page(self.site, entry.title)
except KeyboardInterrupt:
try:
if not self.skipping:
pywikibot.output(
u'To resume, use "-xmlstart:%s" on the command line.'
% entry.title)
except NameError:
pass
def isTitleExcepted(self, title):
if "title" in self.exceptions:
for exc in self.exceptions['title']:
if exc.search(title):
return True
if "require-title" in self.exceptions:
for req in self.exceptions['require-title']:
if not req.search(title): # if not all requirements are met:
return True
return False
def isTextExcepted(self, text):
if "text-contains" in self.exceptions:
for exc in self.exceptions['text-contains']:
if exc.search(text):
return True
return False
class ReplaceRobot:
"""
A bot that can do text replacements.
"""
def __init__(self, generator, replacements, exceptions={},
acceptall=False, allowoverlap=False, recursive=False,
addedCat=None, sleep=None, editSummary='', articles=None,
exctitles=None):
"""
Arguments:
* generator - A generator that yields Page objects.
* replacements - A list of 2-tuples of original text (as a
compiled regular expression) and replacement
text (as a string).
* exceptions - A dictionary which defines when not to change an
occurrence. See below.
* acceptall - If True, the user won't be prompted before changes
are made.
* allowoverlap - If True, when matches overlap, all of them are
replaced.
* addedCat - If set to a value, add this category to every page
touched.
* articles - An open file to save the page titles. If None,
we work on our wikisite immediately (default).
Corresponds to titlefile variable of main().
* exctitles - An open file to save the excepted titles. If None,
we don't ask the user about saving them (default).
Corresponds to excoutfile variable of main().
Structure of the exceptions dictionary:
This dictionary can have these keys:
title
A list of regular expressions. All pages with titles that
are matched by one of these regular expressions are skipped.
text-contains
A list of regular expressions. All pages with text that
contains a part which is matched by one of these regular
expressions are skipped.
inside
A list of regular expressions. All occurrences which lie
within a text region matched by one of these regular
expressions are skipped.
inside-tags
A list of strings. These strings must be keys from the
exceptionRegexes dictionary in pywikibot.replaceExcept().
require-title
Opposite of title. Only pages with titles that are matched by
ALL of these regular expressions will be processed.
This is not an exception, and is here for technical reasons.
Listing the same regex in title and require-title will thus
prevent the bot from doing anything.
include
One standalone value, either the name of a dictionary in your
file or the name of a callable function that takes the name of
the fix as argument and returns a dictionary of exceptions.
This dictionary may have any of the five above keys (but not
'include' itself!), and the lists belonging to those keys will
be added to your exceptions. This way you may define one or
more basic collection of exceptions used for multiple fixes,
and add separate exceptions to each fix.
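A minimal sketch of such a dictionary; all patterns are made up for
illustration, and only the shape of the structure matters:

```python
import re

# Hypothetical exceptions dictionary (patterns are not from any real fix)
exceptions = {
    'title': [re.compile('^Talk:')],
    'text-contains': [re.compile(r'\{\{[Ii]nuse')],
    'inside': [re.compile('<!--.*?-->', re.DOTALL)],
    'inside-tags': ['nowiki'],
    'require-title': [],
}
# A title matched by any 'title' regex is skipped
skipped = any(exc.search('Talk:Foo') for exc in exceptions['title'])
```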
"""
self.generator = generator
self.replacements = replacements
self.exceptions = exceptions
self.acceptall = acceptall
self.allowoverlap = allowoverlap
self.recursive = recursive
if addedCat:
site = pywikibot.getSite()
self.addedCat = pywikibot.Page(site, addedCat, defaultNamespace=14)
self.sleep = sleep
# Some function to set default editSummary should probably be added
self.editSummary = editSummary
self.articles = articles
self.exctitles = exctitles
# An edit counter to split the file by 100 titles if -save or -savenew
# is on, and to display the number of edited articles otherwise.
self.editcounter = 0
# A counter for saved exceptions
self.exceptcounter = 0
def isTitleExcepted(self, title):
"""
Iff one of the exceptions applies for the given title, returns True.
"""
if "title" in self.exceptions:
for exc in self.exceptions['title']:
if exc.search(title):
return True
if "require-title" in self.exceptions:
for req in self.exceptions['require-title']:
if not req.search(title):
return True
return False
def isTextExcepted(self, original_text):
"""
Iff one of the exceptions applies for the given page contents,
returns True.
"""
if "text-contains" in self.exceptions:
for exc in self.exceptions['text-contains']:
if exc.search(original_text):
return True
return False
def doReplacements(self, original_text):
"""
Returns the text which is generated by applying all replacements to
the given text.
"""
new_text = original_text
exceptions = []
if "inside-tags" in self.exceptions:
exceptions += self.exceptions['inside-tags']
if "inside" in self.exceptions:
exceptions += self.exceptions['inside']
for old, new in self.replacements:
if self.sleep is not None:
time.sleep(self.sleep)
new_text = pywikibot.replaceExcept(new_text, old, new, exceptions,
allowoverlap=self.allowoverlap)
return new_text
def writeEditCounter(self):
""" At the end of our work this writes the counter. """
if self.articles:
pywikibot.output(u'%d title%s saved.'
% (self.editcounter,
(lambda x: bool(x-1) and 's were' or ' was')
(self.editcounter)))
else:
pywikibot.output(u'%d page%s changed.'
% (self.editcounter,
(lambda x: bool(x-1) and 's were' or ' was')
(self.editcounter)))
def writeExceptCounter(self):
""" This writes the counter of saved exceptions if applicable. """
if self.exctitles:
pywikibot.output(u'%d exception%s saved.'
% (self.exceptcounter,
(lambda x: bool(x-1) and 's were' or ' was')
(self.exceptcounter)))
def splitLine(self):
"""Returns a splitline after every 100th title. Splitline is in HTML
comment format in case we want to insert the list into a wikipage.
We use it to make the file more readable.
"""
if self.editcounter % 100:
return ''
else:
return (u'<!-- ***** %dth title is above this line. ***** -->\n' %
self.editcounter)
def run(self):
"""
Starts the robot.
"""
# Run the generator which will yield Pages which might need to be
# changed.
for page in self.generator:
if self.isTitleExcepted(page.title()):
pywikibot.output(
u'Skipping %s because the title is on the exceptions list.'
% page.title(asLink=True))
continue
try:
# Load the page's text from the wiki
original_text = page.get(get_redirect=True)
if not (self.articles or page.canBeEdited()):
pywikibot.output(u"You can't edit page %s"
% page.title(asLink=True))
continue
except pywikibot.NoPage:
pywikibot.output(u'Page %s not found' % page.title(asLink=True))
continue
new_text = original_text
while True:
if self.isTextExcepted(new_text):
pywikibot.output(
u'Skipping %s because it contains text that is on the exceptions list.'
% page.title(asLink=True))
break
new_text = self.doReplacements(new_text)
if new_text == original_text:
pywikibot.output(u'No changes were necessary in %s'
% page.title(asLink=True))
break
if self.recursive:
newest_text = self.doReplacements(new_text)
while newest_text != new_text:
new_text = newest_text
newest_text = self.doReplacements(new_text)
if hasattr(self, "addedCat"):
cats = page.categories()
if self.addedCat not in cats:
cats.append(self.addedCat)
new_text = pywikibot.replaceCategoryLinks(new_text,
cats)
# Show the title of the page we're working on.
# Highlight the title in purple.
pywikibot.output(u"\n\n>>> \03{lightpurple}%s\03{default} <<<"
% page.title())
pywikibot.showDiff(original_text, new_text)
if self.acceptall:
break
if self.exctitles:
choice = pywikibot.inputChoice(
u'Do you want to accept these changes?',
['Yes', 'No', 'no+eXcept', 'Edit',
'open in Browser', 'All', 'Quit'],
['y', 'N', 'x', 'e', 'b', 'a', 'q'], 'N')
else:
choice = pywikibot.inputChoice(
u'Do you want to accept these changes?',
['Yes', 'No', 'Edit', 'open in Browser', 'All',
'Quit'],
['y', 'N', 'e', 'b', 'a', 'q'], 'N')
if choice == 'e':
editor = editarticle.TextEditor()
as_edited = editor.edit(original_text)
# if user didn't press Cancel
if as_edited and as_edited != new_text:
new_text = as_edited
continue
if choice == 'b':
webbrowser.open("http://%s%s" % (
page.site().hostname(),
page.site().nice_get_address(page.title())
))
i18n.input('pywikibot-enter-finished-browser')
try:
original_text = page.get(get_redirect=True, force=True)
except pywikibot.NoPage:
pywikibot.output(u'Page %s has been deleted.'
% page.title())
break
new_text = original_text
continue
if choice == 'q':
self.writeEditCounter()
self.writeExceptCounter()
return
if choice == 'a':
self.acceptall = True
if choice == 'x': #May happen only if self.exctitles isn't None
self.exctitles.write(
u"ur'^%s$',\n" % re.escape(page.title()))
self.exctitles.flush()
self.exceptcounter += 1
if choice == 'y':
if not self.articles:
# Primary behaviour: working on wiki
page.put_async(new_text, self.editSummary)
self.editcounter += 1
# Bug: this increments even if put_async fails
# This is kept in two separate if clauses so that
# feedback from put_async can be handled in the future
else:
#Save the title for later processing instead of editing
self.editcounter += 1
self.articles.write(u'#%s\n%s'
% (page.title(asLink=True, textlink=True),
self.splitLine()))
self.articles.flush() # For the peace of our soul :-)
# choice must be 'N'
break
if self.acceptall and new_text != original_text:
if not self.articles:
#Primary behaviour: working on wiki
try:
page.put(new_text, self.editSummary)
self.editcounter += 1 #increment only on success
except pywikibot.EditConflict:
pywikibot.output(u'Skipping %s because of edit conflict'
% (page.title(),))
except pywikibot.SpamfilterError, e:
pywikibot.output(
u'Cannot change %s because of blacklist entry %s'
% (page.title(), e.url))
except pywikibot.PageNotSaved, error:
pywikibot.output(u'Error putting page: %s'
% (error.args,))
except pywikibot.LockedPage:
pywikibot.output(u'Skipping %s (locked page)'
% (page.title(),))
else:
#Save the title for later processing instead of editing
self.editcounter += 1
self.articles.write(u'#%s\n%s'
% (page.title(asLink=True, textlink=True),
self.splitLine()))
self.articles.flush()
#Finally:
self.writeEditCounter()
self.writeExceptCounter()
def prepareRegexForMySQL(pattern):
pattern = pattern.replace(r'\s', '[:space:]')
pattern = pattern.replace(r'\d', '[:digit:]')
pattern = pattern.replace(r'\w', '[:alnum:]')
pattern = pattern.replace("'", "\\" + "'")
#pattern = pattern.replace('\\', '\\\\')
#for char in ['[', ']', "'"]:
# pattern = pattern.replace(char, '\%s' % char)
return pattern
def main(*args):
add_cat = None
gen = None
# summary message
summary_commandline = False
# Array which will collect commandline parameters.
# First element is original text, second element is replacement text.
commandline_replacements = []
# A list of 2-tuples of original text and replacement text.
replacements = []
# Don't edit pages which contain certain texts.
exceptions = {
'title': [],
'text-contains': [],
'inside': [],
'inside-tags': [],
'require-title': [], # using a separate requirements dict needs some
} # major refactoring of code.
# Should the elements of 'replacements' and 'exceptions' be interpreted
# as regular expressions?
regex = False
# Predefined fixes from dictionary 'fixes' (see above).
fix = None
# the dump's path, either absolute or relative, which will be used
# if -xml flag is present
xmlFilename = None
useSql = False
PageTitles = []
# will become True when the user presses a ('yes to all') or uses the
# -always flag.
acceptall = False
# Will become True if the user inputs the commandline parameter -nocase
caseInsensitive = False
# Will become True if the user inputs the commandline parameter -dotall
dotall = False
# Will become True if the user inputs the commandline parameter -multiline
multiline = False
# Do all hits when they overlap
allowoverlap = False
# Do not recurse replacement
recursive = False
# This is the maximum number of pages to load per query
maxquerysize = 60
# This factory is responsible for processing command line arguments
# that are also used by other scripts and that determine on which pages
# to work on.
genFactory = pagegenerators.GeneratorFactory()
# Load default summary message.
# BUG WARNING: This is probably incompatible with the -lang parameter.
editSummary = i18n.twtranslate(pywikibot.getSite(), 'replace-replacing',
{'description': u''})
# Sleep some time between one regex and the next (when using -fix),
# so as not to waste too much CPU
sleep = None
# Do not save the page titles, rather work on wiki
filename = None # The name of the file to save titles
titlefile = None # The file object itself
# If we save, primary behaviour is append rather than new file
append = True
# Default: don't write titles to exception file and don't read them.
excoutfilename = None # The name of the file to save exceptions
excoutfile = None # The file object itself
# excinfilename: reserved for later use (reading back exceptions)
# If we save exceptions, primary behaviour is append
excappend = True
# Read commandline parameters.
for arg in pywikibot.handleArgs(*args):
if arg == '-regex':
regex = True
elif arg.startswith('-xmlstart'):
if len(arg) == 9:
xmlStart = pywikibot.input(
u'Please enter the dumped article to start with:')
else:
xmlStart = arg[10:]
elif arg.startswith('-xml'):
if len(arg) == 4:
xmlFilename = i18n.input('pywikibot-enter-xml-filename')
else:
xmlFilename = arg[5:]
elif arg =='-sql':
useSql = True
elif arg.startswith('-page'):
if len(arg) == 5:
PageTitles.append(pywikibot.input(
u'Which page do you want to change?'))
else:
PageTitles.append(arg[6:])
elif arg.startswith('-saveexcnew'):
excappend = False
if len(arg) == 11:
excoutfilename = pywikibot.input(
u'Please enter the filename to save the excepted titles' +
u'\n(will be deleted if exists):')
else:
excoutfilename = arg[12:]
elif arg.startswith('-saveexc'):
if len(arg) == 8:
excoutfilename = pywikibot.input(
u'Please enter the filename to save the excepted titles:')
else:
excoutfilename = arg[9:]
elif arg.startswith('-savenew'):
append = False
if len(arg) == 8:
filename = pywikibot.input(
u'Please enter the filename to save the titles' +
u'\n(will be deleted if exists):')
else:
filename = arg[9:]
elif arg.startswith('-save'):
if len(arg) == 5:
filename = pywikibot.input(
u'Please enter the filename to save the titles:')
else:
filename = arg[6:]
elif arg.startswith('-replacementfile'):
if len(arg) == len('-replacementfile'):
replacefile = pywikibot.input(
u'Please enter the filename to read replacements from:')
else:
replacefile = arg[len('-replacementfile')+1:]
try:
commandline_replacements.extend(
[x.lstrip(u'\uFEFF').rstrip('\r\n')
for x in codecs.open(replacefile, 'r', 'utf-8')])
except IOError:
raise pywikibot.Error(
'\n%s cannot be opened. Try again :-)' % replacefile)
elif arg.startswith('-excepttitle:'):
exceptions['title'].append(arg[13:])
elif arg.startswith('-requiretitle:'):
exceptions['require-title'].append(arg[14:])
elif arg.startswith('-excepttext:'):
exceptions['text-contains'].append(arg[12:])
elif arg.startswith('-exceptinside:'):
exceptions['inside'].append(arg[14:])
elif arg.startswith('-exceptinsidetag:'):
exceptions['inside-tags'].append(arg[17:])
elif arg.startswith('-fix:'):
fix = arg[5:]
elif arg.startswith('-sleep:'):
sleep = float(arg[7:])
elif arg == '-always':
acceptall = True
elif arg == '-recursive':
recursive = True
elif arg == '-nocase':
caseInsensitive = True
elif arg == '-dotall':
dotall = True
elif arg == '-multiline':
multiline = True
elif arg.startswith('-addcat:'):
add_cat = arg[8:]
elif arg.startswith('-summary:'):
editSummary = arg[9:]
summary_commandline = True
elif arg.startswith('-allowoverlap'):
allowoverlap = True
elif arg.startswith('-query:'):
maxquerysize = int(arg[7:])
else:
if not genFactory.handleArg(arg):
commandline_replacements.append(arg)
if pywikibot.verbose:
pywikibot.output(u"commandline_replacements: " +
', '.join(commandline_replacements))
if (len(commandline_replacements) % 2):
raise pywikibot.Error('An even number of replacements is required.')
elif (len(commandline_replacements) == 2 and fix is None):
replacements.append((commandline_replacements[0],
commandline_replacements[1]))
if not summary_commandline:
editSummary = i18n.twtranslate(pywikibot.getSite(),
'replace-replacing',
{'description': ' (-%s +%s)'
% (commandline_replacements[0],
commandline_replacements[1])})
elif (len(commandline_replacements) > 1):
if (fix is None):
for i in xrange (0, len(commandline_replacements), 2):
replacements.append((commandline_replacements[i],
commandline_replacements[i + 1]))
if not summary_commandline:
pairs = [( commandline_replacements[i],
commandline_replacements[i + 1] )
for i in range(0, len(commandline_replacements), 2)]
replacementsDescription = '(%s)' % ', '.join(
[('-' + pair[0] + ' +' + pair[1]) for pair in pairs])
editSummary = i18n.twtranslate(pywikibot.getSite(),
'replace-replacing',
{'description':
replacementsDescription})
else:
raise pywikibot.Error(
'Specifying -fix with replacements is undefined')
elif fix is None:
old = pywikibot.input(u'Please enter the text that should be replaced:')
new = pywikibot.input(u'Please enter the new text:')
change = '(-' + old + ' +' + new
replacements.append((old, new))
while True:
old = pywikibot.input(
u'Please enter another text that should be replaced,' +
u'\nor press Enter to start:')
if old == '':
change += ')'
break
new = i18n.input('pywikibot-enter-new-text')
change += ' & -' + old + ' +' + new
replacements.append((old, new))
if not summary_commandline:
default_summary_message = i18n.twtranslate(pywikibot.getSite(),
'replace-replacing',
{'description': change})
pywikibot.output(u'The summary message will default to: %s'
% default_summary_message)
summary_message = pywikibot.input(
u'Press Enter to use this default message, or enter a ' +
u'description of the\nchanges your bot will make:')
if summary_message == '':
summary_message = default_summary_message
editSummary = summary_message
else:
# Perform one of the predefined actions.
fixname = fix # Save the name for passing to exceptions function.
try:
fix = fixes.fixes[fix]
except KeyError:
pywikibot.output(u'Available predefined fixes are: %s'
% fixes.fixes.keys())
return
if "regex" in fix:
regex = fix['regex']
if "msg" in fix:
if isinstance(fix['msg'], basestring):
editSummary = i18n.twtranslate(pywikibot.getSite(),
str(fix['msg']))
else:
editSummary = pywikibot.translate(pywikibot.getSite(),
fix['msg'])
if "exceptions" in fix:
exceptions = fix['exceptions']
# Try to append common extensions for multiple fixes.
# It must be either a dictionary or a function that returns a dict.
if 'include' in exceptions:
incl = exceptions['include']
if callable(incl):
baseExcDict = incl(fixname)
else:
try:
baseExcDict = incl
except NameError:
pywikibot.output(
u'\nIncluded exceptions dictionary does not exist.' +
u' Continuing with the exceptions\ngiven in fix.\n')
baseExcDict = None
if baseExcDict:
for l in baseExcDict:
try:
exceptions[l].extend(baseExcDict[l])
except KeyError:
exceptions[l] = baseExcDict[l]
if "recursive" in fix:
recursive = fix['recursive']
if "nocase" in fix:
caseInsensitive = fix['nocase']
try:
replacements = fix['replacements']
# enable regex/replacements as a dictionary for different langs
if isinstance(replacements, dict):
replacements = replacements[pywikibot.getSite().lang]
except KeyError:
pywikibot.output(
u"No replacements given in fix.")
return
# Set the regular expression flags
flags = re.UNICODE
if caseInsensitive:
flags = flags | re.IGNORECASE
if dotall:
flags = flags | re.DOTALL
if multiline:
flags = flags | re.MULTILINE
# Pre-compile all regular expressions here to save time later
for i in range(len(replacements)):
old, new = replacements[i]
if not regex:
old = re.escape(old)
oldR = re.compile(old, flags)
replacements[i] = oldR, new
for exceptionCategory in [
'title', 'require-title', 'text-contains', 'inside']:
if exceptionCategory in exceptions:
patterns = exceptions[exceptionCategory]
if not regex:
patterns = [re.escape(pattern) for pattern in patterns]
patterns = [re.compile(pattern, flags) for pattern in patterns]
exceptions[exceptionCategory] = patterns
if xmlFilename:
try:
xmlStart
except NameError:
xmlStart = None
gen = XmlDumpReplacePageGenerator(xmlFilename, xmlStart,
replacements, exceptions)
elif useSql:
whereClause = 'WHERE (%s)' % ' OR '.join(
["old_text RLIKE '%s'" % prepareRegexForMySQL(old.pattern)
for (old, new) in replacements])
if exceptions:
exceptClause = 'AND NOT (%s)' % ' OR '.join(
["old_text RLIKE '%s'" % prepareRegexForMySQL(exc.pattern)
for exc in exceptions])
else:
exceptClause = ''
query = u"""
SELECT page_namespace, page_title
FROM page
JOIN text ON (page_id = old_id)
%s
%s
LIMIT 200""" % (whereClause, exceptClause)
gen = pagegenerators.MySQLPageGenerator(query)
elif PageTitles:
pages = [pywikibot.Page(pywikibot.getSite(), PageTitle)
for PageTitle in PageTitles]
gen = iter(pages)
gen = genFactory.getCombinedGenerator(gen)
if not gen:
# syntax error, show help text from the top of this file
pywikibot.showHelp('replace')
return
preloadingGen = pagegenerators.PreloadingGenerator(gen,
pageNumber=maxquerysize)
# Finally we open the file for page titles or set parameter article to None
if filename:
try:
# This opens in strict error mode, that means bot will stop
# on encoding errors with ValueError.
# See http://docs.python.org/library/codecs.html#codecs.open
titlefile = codecs.open(filename, encoding='utf-8',
mode=(lambda x: x and 'a' or 'w')(append))
except IOError:
pywikibot.output("%s cannot be opened for writing." %
filename)
return
# The same process with exceptions file:
if excoutfilename:
try:
excoutfile = codecs.open(
excoutfilename, encoding='utf-8',
mode=(lambda x: x and 'a' or 'w')(excappend))
except IOError:
pywikibot.output("%s cannot be opened for writing." %
excoutfilename)
return
bot = ReplaceRobot(preloadingGen, replacements, exceptions, acceptall,
allowoverlap, recursive, add_cat, sleep, editSummary,
titlefile, excoutfile)
try:
bot.run()
finally:
# Just for the spirit of programming (they were flushed)
if titlefile:
titlefile.close()
if excoutfile:
excoutfile.close()
if __name__ == "__main__":
try:
main()
finally:
pywikibot.stopme()