-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xls addition #9
Xls addition #9
Conversation
3984b8e
to
e6ed78f
Compare
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException { | ||
|
||
CombineFileSplit split = (CombineFileSplit) genericSplit; | ||
Configuration job = context.getConfiguration(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change variable name to jobConf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done !
* | ||
* @return A hashmap of column names and their manually set schemas. | ||
*/ | ||
public Map<String, Schema> getOverride() throws IllegalArgumentException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have this in all file plugins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case ERROR: | ||
return null; | ||
default: | ||
throw new IllegalStateException("Unexpected celltype (" + cellType + ")"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use String.format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done !
} | ||
|
||
@Override | ||
protected void addFormatProperties(Map<String, String> properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this property also to keep split as 1.
properties.put(FileInputFormat.SPLIT_MINSIZE, Long.toString(Long.MAX_VALUE));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done !
context.getFailureCollector().addFailure("Sheet number must be a number.", null) | ||
.withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE); | ||
} | ||
FailureCollector collector = context.getFailureCollector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are evaluating the schema from the file, when this will be triggered.
* Utilities around XLS input format. | ||
*/ | ||
public class XlsInputFormatUtils { | ||
private static final Pattern NOT_VALID_PATTERN = Pattern.compile("[^A-Za-z0-9_]+"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the latest naming convention for BQ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This impl is taken from file plugin.
This is the invalid, column for CDAP.
We are not sure of the side effect this may cause, are you sure ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some unit tests before raising the external PR for review.
format-xls/pom.xml
Outdated
<version>2.12.0-SNAPSHOT</version> | ||
</parent> | ||
<artifactId>format-xls</artifactId> | ||
<name>XLSX format plugins</name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets be consistent with format XLS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed !
XLS format plugins
format-xls/pom.xml
Outdated
<dependency> | ||
<groupId>org.apache.poi</groupId> | ||
<artifactId>poi</artifactId> | ||
<version>5.2.4</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define the property
for others as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
<poi.version>5.2.4</poi.version>
isRowNull = true; | ||
for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) { | ||
if (cellIndex >= fields.size()) { | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this check outside the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can i move cellIndex
logic outside loop ?
cellIndex
is part of this loop 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean instead of checking it inside the loop you can do it before the loop, something like:
throw new IllegalArgumentException( | |
if (row.getLastCellNum() > fields.size()) { | |
throw new IllegalArgumentException(String.format("Schema contains less fields than the number of columns in the excel file. Schema fields: %s, Excel columns: %s", fields.size(), row.getLastCellNum())); | |
} |
} | ||
Cell cell = row.getCell(cellIndex, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL); | ||
if (cell == null) { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some comments regarding why the cell is not processed or can't we ignore this continue
checks and add null
in line#160
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
// Blank cells are skipped, builder will set null for the field, no processing needed.
We are also using the continue logic to track isRowNull
so we can't move it inside line#160
"Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " + | ||
"'Sheet Number' as 'sheet' input respectively. Sheet number starts with 0."; | ||
public static final String DESC_TERMINATE_ROW = "Specify whether processing needs to be terminated in case an" + | ||
" empty row is encountered while processing excel files. Options to select are true or false."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of Options to select are true or false.
, write The default value is false.
or we can remove this also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to Default value is false.
/** | ||
* Formats the cell value of an Excel file. | ||
* | ||
* @param cell the cell to format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added @param type the schema type of the cell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you update the comment as follows:
{@code Cell} the cell to format
case ERROR: | ||
return null; | ||
default: | ||
throw new IllegalStateException(String.format("Unexpected celltype (%s)", cellType)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elaborate the message by specifying the list of Supported types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an impossible state , only possible if xls
has a new cell type added in future, this is internal representation of cell by library we are using. This is an implementation details not relevant for end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in impossible state we should be informative about the error messages. So, lets rewrite the error messages something as:
Failed to format (%s) due to unsupported cell type (%s)
.
format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatProvider.java
Show resolved
Hide resolved
for (int rowIndex = rowStart; rowIndex <= rowEnd; rowIndex++) { | ||
Row row = workSheet.getRow(rowIndex); | ||
if (row == null) { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
come up with some logic to ignore the usage of continue
and break down the header & data processing into separate methods for better readability.
format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatProvider.java
Outdated
Show resolved
Hide resolved
return false; | ||
} | ||
// Get the next row. | ||
Row row = workSheet.getRow(rowIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it guaranteed that workSheet.getRow(rowIndex)
never return null
?
a1f6456
to
51beae1
Compare
698dd7e
to
cd5f2a0
Compare
cd5f2a0
to
1f42fb7
Compare
[s] Review Squashed
1f42fb7
to
5ae5ae1
Compare
TODO:
Add a description