Feature/port to py3 #184
base: develop
Conversation
It looks like we are not using it anymore
Celery 4.0 won't automatically register tasks that inherit from Task. We need to solve this before we can use the newer version. Celery's documentation only mentions class-based tasks in a way that is very different from what we do here, though.
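For reference, a minimal sketch of the explicit registration Celery 4 expects (the app name and task class here are assumptions for illustration, not the project's real ones):

```python
from celery import Celery, Task

app = Celery('pypln')  # hypothetical app instance

class PyPLNTask(Task):
    # an explicit name is needed when registering manually
    name = 'workers.hypothetical_task'

    def run(self, document):
        return document

# Celery 4 dropped auto-registration of Task subclasses, so an
# instance has to be registered explicitly on the app:
app.register_task(PyPLNTask())
```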
This test was changed by 2to3, that's why it broke.
This test now makes sure only the correct document is updated.
2to3 introduced an error because it couldn't know what StringIO was being used for.
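For illustration, the usual shape of this 2to3 pitfall (a sketch, not the project's actual code): 2to3 rewrites every `StringIO` to `io.StringIO`, even when the buffer actually held bytes.

```python
import io

# Python 3 splits the old StringIO in two: io.StringIO holds str and
# io.BytesIO holds bytes. 2to3 cannot tell which one was intended.
text_buffer = io.StringIO('unicode text')        # for str
byte_buffer = io.BytesIO(b'raw file contents')   # for bytes
```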
The next few commits will change tests in a similar manner. This test is no longer touching the database, because we rely on the PyPLNTask test to test the fetch/save functionality.
List sorting changed in Python 3 and apparently string sorting did too.
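As a hedged illustration of the kind of breakage this refers to (the failing test itself isn't shown here):

```python
# Python 2 ordered unrelated types arbitrarily but consistently;
# Python 3 raises TypeError instead:
try:
    sorted([3, 'a'])
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'str' and 'int'
```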
This test is still not passing because cld is throwing an exception
Address a possible exception raised by Magic.id_buffer and remove the superfluous `text.decode('utf-8', 'replace')` call, since decoding with the iso8859-1 codec will never raise a UnicodeDecodeError exception.
```python
    checker.set_text(document['text'])
    errors = [[e.word, e.wordpos, e.suggest()] for e in checker]
except KeyError:
    errors = None
```
I really like the change in the way we treat the case where the specific dictionary does not exist (catching an exception for that wasn't a very good idea), and I understand why you kept the try/except block (because there is another possible `KeyError` there). But now that this change was made, I think we can get rid of the entire try/except. That `KeyError` is possible in every Task (it represents a pre-requisite not being there when the task runs). I prefer to let this exception trickle up so we can see later that there was an error, as happens everywhere else. In short: you found a bug that meant we were silencing errors that shouldn't be silenced :)
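A minimal sketch of what that could look like, assuming pyenchant and guessing at the surrounding names from the excerpt above (the dictionary map is hypothetical):

```python
from enchant.checker import SpellChecker

# hypothetical language -> dictionary mapping
AVAILABLE_DICTIONARIES = {'en': 'en_US', 'pt': 'pt_BR'}

def process(document):
    dictionary = AVAILABLE_DICTIONARIES.get(document.get('language'))
    if dictionary is None:
        # no dictionary for this language: skip explicitly instead of
        # silencing a KeyError
        return {'spelling_errors': None}
    checker = SpellChecker(dictionary)
    # a missing 'text' key is a broken pre-requisite; let the KeyError
    # trickle up like in every other task
    checker.set_text(document['text'])
    errors = [[e.word, e.wordpos, e.suggest()] for e in checker]
    return {'spelling_errors': errors}
```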
pypln/backend/workers/extractor.py (outdated)
```python
# Decoding with iso8859-1 doesn't raise UnicodeDecodeError, so this is
# a last resort.
result = text.decode('iso8859-1')
return result
```
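For context, a hedged sketch of the fallback chain this excerpt presumably ends (the function name and the earlier codec are assumptions based on this diff and on the `call('utf-8'), call('iso8859-1')` assertion in the tests below):

```python
def decode_text(text: bytes) -> str:
    # hypothetical reconstruction, not the exact source
    try:
        return text.decode('utf-8')
    except UnicodeDecodeError:
        pass
    # Decoding with iso8859-1 doesn't raise UnicodeDecodeError, so this
    # is a last resort: every byte value maps to a valid code point.
    return text.decode('iso8859-1')
```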
From what I understand here, the behavior didn't change, because we were never executing `result = text.decode('utf-8', 'replace')` (since decoding with iso8859-1 never raises a `UnicodeDecodeError`), right? If that's the case, perfect. If not, I think it would be a good idea to keep the forced decoding.
Do you know why decoding with iso8859-1 never raises this exception? (I'm not doubting it, I just didn't understand the reason :) )
Because iso8859-1 is a single-byte encoding, and every byte from `\x00` to `\xff` is a "valid" char in it. If, for instance, you were to erroneously decode chars encoded in utf-8, you would just obtain a sequence of strange chars (aka mojibake) for each multi-byte utf-8 char. If you can provide a test case that proves otherwise, I'd be glad to change my mind, though.
A sort of proof of what I said about every byte being a valid iso8859-1 char:
```python
In [1]: import struct

In [2]: for i in range(256):
   ...:     byte = struct.pack('B', i)
   ...:     char = byte.decode('iso8859-1')
   ...:     print(i, repr(char), char)
   ...:
```

```
0 '\x00'
1 '\x01'
2 '\x02'
[... control characters \x03 to \x1f elided: all decode to escape sequences ...]
32 ' '
33 '!' !
34 '"' "
[... printable ASCII 35 to 126 elided: every byte decodes to itself ...]
127 '\x7f'
[... control range \x80 to \x9f elided: all decode ...]
160 '\xa0'
161 '¡' ¡
162 '¢' ¢
[... Latin-1 printable range 163 to 254 elided: all decode to the corresponding accented and symbol characters ...]
255 'ÿ' ÿ
```
@geron that makes perfect sense to me. As I said, I didn't doubt it, I just didn't understand it before. Now I am a little ashamed of not realizing that.
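A compact way to double-check the claim, for anyone reading along (equivalent to the loop above):

```python
# every one of the 256 possible byte values decodes without error
decoded = bytes(range(256)).decode('iso8859-1')
assert len(decoded) == 256  # one code point per byte, nothing raised
```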
```python
call('utf-8'),
call('iso8859-1')])
```
Thanks! We really needed this!
u", ".join(unicode(item) for item in diff_set))) | ||
self.assertEqual(refreshed_document['mimetype'], 'application/pdf') | ||
", ".join(str(item) for item in diff_set))) | ||
self.assertEqual(result['mimetype'], 'application/pdf') | ||
|
For some reason, when running this test on my machine, the date format is different. It might be due to the version of pdfinfo or something like that. For the test to pass on my machine I needed the following patch:
```diff
diff --git a/tests/test_worker_extractor.py b/tests/test_worker_extractor.py
index 4c94349..9f4d2e0 100644
--- a/tests/test_worker_extractor.py
+++ b/tests/test_worker_extractor.py
@@ -123,7 +123,7 @@ class TestExtractorWorker(TestCase):
             'Author': 'Álvaro Justen',
             'Creator': 'Writer',
             'Producer': 'LibreOffice 3.5',
-            'CreationDate': 'Fri Jun 1 17:07:57 2012',
+            'CreationDate': 'Fri Jun 1 17:07:57 2012 BRT',
             'Tagged': 'no',
             'Pages': '1',
             'Encrypted': 'no',
```
This probably does not happen on your system, so we might need to check whether the problem really is with software versions, and make sure we always use the same versions everywhere.
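One way to make the assertion tolerant of the timezone suffix, as a sketch (the helper is hypothetical, not something in the test suite):

```python
import re

def normalize_creation_date(value):
    # strip a trailing timezone abbreviation such as 'BRT', if present
    return re.sub(r' [A-Z]{2,4}$', '', value)

assert (normalize_creation_date('Fri Jun 1 17:07:57 2012 BRT')
        == 'Fri Jun 1 17:07:57 2012')
```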
tests/test_worker_extractor.py (outdated)
```python
result = Extractor().process(data)
self.assertEqual(result['text'], expected)
self.assertEqual(result['file_metadata'], {})
self.assertEqual(result['language'], 'en')
```
This is the test that is still not passing because pycld2 is blowing up on the weird byte, right?
I don't know exactly what to do about this; I don't think any of the options is very good. From what I remember, you gave me three options (of course we can think of new ones):
- Try to use the old cld version
- Not detect the language for these documents
- Try to remove the weird byte somehow

Option 1 sounds terrible because it is old, unmaintained code that does not support Python 3. It's my last option!
Option 2 has one big problem: if we don't detect the language, we won't run palavras for the documents that are in Portuguese, which means a lot of the important analysis won't happen.
Option 3 sounds more interesting if we can do it the right way (a sketch of what that might look like follows below). I'm worried about two things: we should only try it when we detect the weird byte in the document, and it should be reasonably fast. This option has another limitation, though: if we find different, new weird bytes, we have to change our code to check for them too.
@fccoelho Any ideas on how to deal with this?
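A sketch of option 3, under the assumption that the failure is pycld2 rejecting input that is not valid UTF-8 (the round-trip below is one generic way to replace any such bytes; it is not the project's code):

```python
import pycld2

def detect_language(text):
    # re-encoding with errors='replace' swaps anything that can't be
    # represented as valid UTF-8 for U+FFFD before detection
    clean = text.encode('utf-8', 'replace').decode('utf-8', 'replace')
    is_reliable, bytes_found, details = pycld2.detect(clean)
    return details[0][1]  # language code of the best match, e.g. 'pt'
```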
Minor comment: Mongodict is still in the requirements...
@fccoelho the purpose of this branch is only to port to Python 3. Migrating to another database will come in steps, in another branch, after this one is merged.
Mongodict hasn't been used for about a year, so if it is in the requirements file, it should go. I didn't find it there, though.
When I pulled from Geron's fork to look at the PR, PyCharm detected it as a requirement. But I didn't check the requirements file.
In the bigram implementation, the order of the bigrams does not matter, since they only represent the unique bigrams found in a document.
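A tiny illustration of why order doesn't matter there (generic sketch, not the project's implementation):

```python
def unique_bigrams(tokens):
    # a set of adjacent pairs: membership matters, order doesn't
    return set(zip(tokens, tokens[1:]))

unique_bigrams(['the', 'cat', 'sat'])  # {('the', 'cat'), ('cat', 'sat')}
```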