ANSI to UTF8 conversion #2

thib-b · 2011-06-29T13:06:13Z

I've added support for CSV files encoded in ANSI.
Importing ANSI files in UTF8 mode into redmine replaces accents like é,à or è by some unrecognized character.
I've implemented transcoding ANSI into UTF8 before parsing (FasterCSV does not parse ANSI files). This uses 'iconv' from ruby standard library.

This is convenient for windows users. Feel free to ask for questions or give some feedback.
thank you !

thib-b · 2011-06-29T13:23:03Z

I've found some issues, not ready to be added to trunk now.
I will reopen this pull request when it will be fully tested.

UTF8 without BOM works properly UTF8 with BOM is having issues(the very first field is unrecognized, which can lead to issues if the field isn't specified by user in the GUI)

UTF8 without BOM properly working UTF8 with BOM having issues (very first field not automatically recognized, must be set by user in the match view)

Conflicts: app/controllers/importer_controller.rb test/ansi2utf8/utf8_accent.csv test/ansi2utf8/utf8_noaccent.csv

thib-b · 2011-06-29T16:14:34Z

ANSI issues are now resolved and fully tested.

Talking about charsets, there is still an issue with charsets (which is not related to what I added).
UTF8 without BOM does not have problems.
However, UTF8 with BOM (marker added in the beginning of the file) makes the firste field unrecognized (in the match view, for example if "Subject" is the first field, it's not matched with "Subject", it's just unrecognized("Ignore").

Sorry for the previous issues. My add is now fully-tested and working.
Hope you'll like it and be interested in the "BOM" issue.
thibault

(btw, I'm new to github and don't know how to add my new commit in this pull request)

* unwanted separators at end of lines are removed * unwanted spaces at start of line and around separators are removed

leovitch · 2011-08-03T02:09:34Z

Thanks for the pull request! I'm sorry I just haven't gotten a chance to review and pull it, I'm also out of the country for a couple weeks but I'll get to it I promise!

thib-b · 2011-08-03T08:19:41Z

No worries !

For later, let me explain you why we needed this and what I've done.

We're using a lot MS Excel and we wanted to be able to fill in the issues fileds in Excel and export them to CSV.
The problem was that Excel exports its data in ANSI encoding, wich was not handled by your plugin.

The way I implemented it was to add the ANSI choice and convert the ANSI data to UTF-8 (with iconv). It's just preprocessing of data before FasterCSV parsing.

Because Excel makes any cell exist since it has been clicked on, this generated lot of irrelevant data (lines of separators ";;;;;;;;;..." until the clicked cell for example). I added some processing to handle that.

A last point was the presence of blanks (" ") around separators which made the plugin return a 500 Error, or before the first field (which could make it unrecognised in the match view of the plugin). I also handled those cases.

As I said before, there is a last thing regarding to charset issues, it's the UTF-8 with BOM which makes the very first field unrecognised (don't understand why FasterCSV wouldn't handle that). I did'nt take care of this in my commits since I wasn't really impacted by the issue, but I guess something like this might do the trick:

if csv_data.starts_with?("ï»¿")
  csv_data.delete("ï»¿")
end

(there's must be a cleaner way to do this, but here is the idea)

I'll also add some notes to the code so you can have a cleaner view of what I've done.
By the way, if you like the feature but not the way it's coded don't hesitate to refactor if the way you want, I won't be upset ;)

thib-b · 2011-08-03T08:21:26Z

app/controllers/importer_controller.rb

+    end
+
+    # preparing the import
+    chomp_separators!(iip)


Here is the call to the function removing the irrelevant separators in the csv_data

leovitch · 2012-01-04T05:57:46Z

Hi Thib,

I'm finally getting a chance to look at this in detail, sorry for the long delay.

First off, could you let me know what version/flavor of Excel you're using? Most of my work with the plugin has been with Excel-generated CSV files on Windows, but I'm confused on two points.

All of the versions of Excel I've ever seen have a pulldown menu in the export dialog which allows you to choose the exported character encoding. This is really kind of a better practice to encourage, but you didn't say what vintage/flavor of Excel you're using...
I've never seen Excel do the behavior you're talking about with spaces or so forth, so again I'm curious about the version (AFAICS you didn't include a sample file for this; more on why I'm asking below).

A couple other comments:

I don't really want to propagate the error of calling the Windows CP1252 encoding "ANSI". This actually confused me when I first read your post! Check out http://en.wikipedia.org/wiki/Windows-1252, especially the second reference, for a discussion. It's a Windows-proprietary encoding, not an ANSI standard, so I'm thinking calling it "Windows (CP1252)" or something like that in the pulldown menu would be better.
The spaces behavior in your diffs is unconditional (i.e., applied to all incoming files). While I realize it's usually OK, if you look at the closest thing we have to a spec for CSV, RFC 4180, those spaces are actually valid data, and it's incorrect to trim them off. I'm open to adding that option, but it should definitely be a checkbox that defaults off.
For the separators behavior, I understand what it's doing, but it seems easy to also construct a CSV file where that behavior will make the plugin fail on a valid import. For instance, if I'm importing a bunch of issues and the first few have values for every column, but the last ones had blank values, your regexp will chop those values off and the import will fail. So, I'd kind of rather not incorporate the chomp separators functionality at all (extra values in the row shouldn't bother the importer)
A possibly related feature might be to ignore totally blank lines. Right now it doesn't do that, but I can't think of any case where a totally blank line would be useful to try and import, and it's trivial to implement by passing :blank=>true to FasterCSV if it's a problem for you.

Sorry to pull you back to something you did so long ago!

Leo

Russian locale

thib-b added 2 commits June 29, 2011 12:13

added ANSI charset handlilng in CSV files import

6643c70

added csv files to test ansi2utf8 encoding

e24c638

thib-b closed this Jun 29, 2011

thib-b added 3 commits June 29, 2011 17:44

fixed ANSI result issues (issues not translated into utf8)

14e7378

UTF8 without BOM works properly UTF8 with BOM is having issues(the very first field is unrecognized, which can lead to issues if the field isn't specified by user in the GUI)

fixed ANSI result issues (untranscoded data in results)

9653978

UTF8 without BOM properly working UTF8 with BOM having issues (very first field not automatically recognized, must be set by user in the match view)

Merge branch 'master' of https://github.com/thib-b/redmine_importer

c46c592

Conflicts: app/controllers/importer_controller.rb test/ansi2utf8/utf8_accent.csv test/ansi2utf8/utf8_noaccent.csv

thib-b reopened this Jun 29, 2011

added preprocessing of CSV data

d1190df

* unwanted separators at end of lines are removed * unwanted spaces at start of line and around separators are removed

thib-b reviewed Aug 3, 2011
View reviewed changes

octaviogl pushed a commit to octaviogl/redmine_importer that referenced this pull request Jan 10, 2014

Merge pull request leovitch#2 from logical-and/master

7b18385

Russian locale

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ANSI to UTF8 conversion #2

ANSI to UTF8 conversion #2

thib-b commented Jun 29, 2011

thib-b commented Jun 29, 2011

thib-b commented Jun 29, 2011

leovitch commented Aug 3, 2011

thib-b commented Aug 3, 2011

thib-b Aug 3, 2011

leovitch commented Jan 4, 2012

ANSI to UTF8 conversion #2

Are you sure you want to change the base?

ANSI to UTF8 conversion #2

Conversation

thib-b commented Jun 29, 2011

thib-b commented Jun 29, 2011

thib-b commented Jun 29, 2011

leovitch commented Aug 3, 2011

thib-b commented Aug 3, 2011

thib-b Aug 3, 2011

Choose a reason for hiding this comment

leovitch commented Jan 4, 2012