Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] io: Handle mismatched number of header/data values #3237

Merged
merged 4 commits into from
Sep 12, 2018

Conversation

robertcv
Copy link
Collaborator

@robertcv robertcv commented Sep 7, 2018

Issue

Fixes #1471
Orange can't load fils where there are more data then header columns

Description of changes

Strip data columns

Includes
  • Code changes
  • Tests
  • Documentation

@lanzagar
Copy link
Contributor

lanzagar commented Sep 7, 2018

The issue was about stripping empty columns, however the solution in this PR strips all columns, even those containing data, when there are more items in a row than there are headers.

I am not saying that is the wrong approach, but we should know what we are choosing.
As I see it, there are 2 options (other than raising an error and doing nothing):

  • strip columns in rows (as is done here)
  • add columns to the header, e.g. if a header has just "a" and "b" and there are 3 values in a row, generate a new unnamed variable ("Feature 1")

I don't mind going with the first, since it is already done here.
However, I think a warning when loading the data would be nice, when data is discarded due to problems!

Comments anyone? @ajdapretnar @astaric

@ajdapretnar
Copy link
Contributor

Hmm... I think it was assumed here that the values will be empty, meaning no column names AND no values. At least that's what happened sometimes with Excel.
If there are more items in a row than there are headers, then I think we should go with the second approach, i.e. Feature 1. You don't want to discard values when there are no columns names.
A warning in either case would be nice.

@astaric
Copy link
Member

astaric commented Sep 7, 2018

+1 for warning.

my vote is for 1. since it is already implemented (and less complicated :))

@codecov-io
Copy link

codecov-io commented Sep 10, 2018

Codecov Report

Merging #3237 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3237      +/-   ##
==========================================
- Coverage   82.82%   82.81%   -0.01%     
==========================================
  Files         346      346              
  Lines       59629    59636       +7     
==========================================
+ Hits        49385    49390       +5     
- Misses      10244    10246       +2

table = CSVReader(c).read()
self.assertEqual(len(table.domain.attributes), 2)
self.assertEqual(cm.warning.args[0],
"Columns with no headers were striped.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Striped --> Removed or stripped.

return lst

# Ensure all data is of equal width in a column-contiguous array
data = [_equal_length([s.strip() for s in row])
for row in data if any(row)]
data = np.array(data, dtype=object, order='F')

if strip:
warnings.warn("Columns with no headers were striped.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer "Columns with no headers were removed.", since stripped is a bit Pythonish. Or at least 'stripped'.

@lanzagar lanzagar changed the title [FIX] io: strip columns [FIX] io: Handle mismatched number of header/data values Sep 12, 2018
@lanzagar lanzagar added this to the 3.16 milestone Sep 12, 2018
@lanzagar lanzagar merged commit 1a916d7 into biolab:master Sep 12, 2018
@robertcv robertcv deleted the fixes/file_strip_columns branch May 31, 2019 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants