Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANSI to UTF8 conversion #2

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

ANSI to UTF8 conversion #2

wants to merge 6 commits into from

Conversation

thib-b
Copy link

@thib-b thib-b commented Jun 29, 2011

I've added support for CSV files encoded in ANSI.
Importing ANSI files in UTF8 mode into redmine replaces accents like é,à or è by some unrecognized character.
I've implemented transcoding ANSI into UTF8 before parsing (FasterCSV does not parse ANSI files). This uses 'iconv' from ruby standard library.

This is convenient for windows users. Feel free to ask for questions or give some feedback.
thank you !

@thib-b
Copy link
Author

thib-b commented Jun 29, 2011

I've found some issues, not ready to be added to trunk now.
I will reopen this pull request when it will be fully tested.

@thib-b thib-b closed this Jun 29, 2011
thib-b added 3 commits June 29, 2011 17:44
UTF8 without BOM works properly
UTF8 with BOM is having issues(the very first field is unrecognized, which can lead to issues if the field isn't specified by user in the GUI)
UTF8 without BOM properly working
UTF8 with BOM having issues (very first field not automatically recognized, must be set by user in the match view)
Conflicts:
	app/controllers/importer_controller.rb
	test/ansi2utf8/utf8_accent.csv
	test/ansi2utf8/utf8_noaccent.csv
@thib-b
Copy link
Author

thib-b commented Jun 29, 2011

ANSI issues are now resolved and fully tested.

Talking about charsets, there is still an issue with charsets (which is not related to what I added).
UTF8 without BOM does not have problems.
However, UTF8 with BOM (marker added in the beginning of the file) makes the firste field unrecognized (in the match view, for example if "Subject" is the first field, it's not matched with "Subject", it's just unrecognized("Ignore").

Sorry for the previous issues. My add is now fully-tested and working.
Hope you'll like it and be interested in the "BOM" issue.
thibault

(btw, I'm new to github and don't know how to add my new commit in this pull request)

@thib-b thib-b reopened this Jun 29, 2011
* unwanted separators at end of lines are removed
* unwanted spaces at start of line and around separators are removed
@leovitch
Copy link
Owner

leovitch commented Aug 3, 2011

Thanks for the pull request! I'm sorry I just haven't gotten a chance to review and pull it, I'm also out of the country for a couple weeks but I'll get to it I promise!

@thib-b
Copy link
Author

thib-b commented Aug 3, 2011

No worries !

For later, let me explain you why we needed this and what I've done.

We're using a lot MS Excel and we wanted to be able to fill in the issues fileds in Excel and export them to CSV.
The problem was that Excel exports its data in ANSI encoding, wich was not handled by your plugin.

The way I implemented it was to add the ANSI choice and convert the ANSI data to UTF-8 (with iconv). It's just preprocessing of data before FasterCSV parsing.

Because Excel makes any cell exist since it has been clicked on, this generated lot of irrelevant data (lines of separators ";;;;;;;;;..." until the clicked cell for example). I added some processing to handle that.

A last point was the presence of blanks (" ") around separators which made the plugin return a 500 Error, or before the first field (which could make it unrecognised in the match view of the plugin). I also handled those cases.

As I said before, there is a last thing regarding to charset issues, it's the UTF-8 with BOM which makes the very first field unrecognised (don't understand why FasterCSV wouldn't handle that). I did'nt take care of this in my commits since I wasn't really impacted by the issue, but I guess something like this might do the trick:

if csv_data.starts_with?("")
  csv_data.delete("")
end

(there's must be a cleaner way to do this, but here is the idea)

I'll also add some notes to the code so you can have a cleaner view of what I've done.
By the way, if you like the feature but not the way it's coded don't hesitate to refactor if the way you want, I won't be upset ;)

end

# preparing the import
chomp_separators!(iip)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the call to the function removing the irrelevant separators in the csv_data

@leovitch
Copy link
Owner

leovitch commented Jan 4, 2012

Hi Thib,

I'm finally getting a chance to look at this in detail, sorry for the long delay.

First off, could you let me know what version/flavor of Excel you're using? Most of my work with the plugin has been with Excel-generated CSV files on Windows, but I'm confused on two points.

  • All of the versions of Excel I've ever seen have a pulldown menu in the export dialog which allows you to choose the exported character encoding. This is really kind of a better practice to encourage, but you didn't say what vintage/flavor of Excel you're using...
  • I've never seen Excel do the behavior you're talking about with spaces or so forth, so again I'm curious about the version (AFAICS you didn't include a sample file for this; more on why I'm asking below).

A couple other comments:

  • I don't really want to propagate the error of calling the Windows CP1252 encoding "ANSI". This actually confused me when I first read your post! Check out http://en.wikipedia.org/wiki/Windows-1252, especially the second reference, for a discussion. It's a Windows-proprietary encoding, not an ANSI standard, so I'm thinking calling it "Windows (CP1252)" or something like that in the pulldown menu would be better.
  • The spaces behavior in your diffs is unconditional (i.e., applied to all incoming files). While I realize it's usually OK, if you look at the closest thing we have to a spec for CSV, RFC 4180, those spaces are actually valid data, and it's incorrect to trim them off. I'm open to adding that option, but it should definitely be a checkbox that defaults off.
  • For the separators behavior, I understand what it's doing, but it seems easy to also construct a CSV file where that behavior will make the plugin fail on a valid import. For instance, if I'm importing a bunch of issues and the first few have values for every column, but the last ones had blank values, your regexp will chop those values off and the import will fail. So, I'd kind of rather not incorporate the chomp separators functionality at all (extra values in the row shouldn't bother the importer)
  • A possibly related feature might be to ignore totally blank lines. Right now it doesn't do that, but I can't think of any case where a totally blank line would be useful to try and import, and it's trivial to implement by passing :blank=>true to FasterCSV if it's a problem for you.

Sorry to pull you back to something you did so long ago!

Leo

octaviogl pushed a commit to octaviogl/redmine_importer that referenced this pull request Jan 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants