Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement xlsxioread_sheet_next_cell_struct with cell types #47

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

Paxa
Copy link

@Paxa Paxa commented Mar 13, 2019

I took suggested implementation from #12 and add new method to keep existing API same

Fixes #12

Now I found similar fork matejd@a0b6a0b

Travis status: Build Status (https://travis-ci.org/Paxa/xlsxio)

How to use xlsxioread_sheet_next_cell_struct:

  xlsxioreader xlsxioread;

  if ((xlsxioread = xlsxioread_open(filename)) == NULL) {
    fprintf(stderr, "Error opening .xlsx file\n");
    return 1;
  }

  xlsxioread_cell value;
  const char *typeNames[] = { "none", "value", "boolean", "string", "date" };

  printf("Contents of first sheet:\n");
  xlsxioreadersheet sheet = xlsxioread_sheet_open(xlsxioread, NULL, XLSXIOREAD_SKIP_EMPTY_ROWS);
  while (xlsxioread_sheet_next_row(sheet)) {
    while ((value = xlsxioread_sheet_next_cell_struct(sheet)) != NULL) {
      /*
        value is a struct:
            size_t row_num;
            size_t col_num;
            char* data;
            cell_type_enum cell_type;
            char* number_fmt; - only for numbers and dates
      */
      printf("row_num = %zu \n", value->col_num);
      printf("col_num = %zu \n", value->row_num);
      printf("type = %s\n", typeNames[value->cell_type]);
      if (value->number_fmt) {
        printf("number_format = %s value = %s\n", value->number_fmt);
      }
      XML_Char_printf(X("value = %s\n"), value->data);
      free(value);
    }
    printf("\n");
  }
  xlsxioread_sheet_close(sheet);
  xlsxioread_close(xlsxioread);

How to use xlsxioread_debug_internals:
it prints internal structures with extracted number formats (just for debugging)

  xlsxioreader xlsxioread;

  if ((xlsxioread = xlsxioread_open(filename)) == NULL) {
    fprintf(stderr, "Error opening .xlsx file\n");
    return 1;
  }
  xlsxioread_debug_internals(xlsxioread);
  xlsxioread_close(xlsxioread);

Output:

XLSXIO_READ DEBUG - test_dates.xlsx
number_formats:
  number_format 0 - 0
  number_format 1 - 164
  number_format 2 - 165
date_formats:
  date_format id = 164	 isDate = 0	 fmt = #,##0.00
  date_format id = 165	 isDate = 1	 fmt = [$-409]m/d/yy h:mm AM/PM;@

@Paxa
Copy link
Author

Paxa commented Mar 15, 2019

Do you know how to install specify minizip path in windows?
Travis builds works only for windows and mac now

@brechtsanders
Copy link
Owner

You can tell XLSX I/O buid to use a specific Minizip install location with the CMake parameter -DMINIZIP_DIR:PATH=

I have started reviewing some of your code, and already have several remarks:

  • You added a dependancy on uthash, which uses a BSD revised license. I believe this is not entirely compatible with the MIT license (but I still have to check if this is really incompatible)
  • You included files from uthash, instead of using it as a library, but I could fix that if your code gets accepted (I already have static and shared libraries built for Windows with MinGW)
  • There are several unshielded malloc() calls, causing possible vulnerabilities/unstabilities which I am not willing to introduce

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

I not sure about licenses, will it be better if we download utarray.h and uthash.h in a make file?
utarray.h and uthash.h is only header files with macros, not sure if we can compile it as a library
Do you have any suggestions for alternatives?

I don't think we gonna have many styles, so performance not so critical.
And in our use case we just need append and get, may be implement own linked list for that

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

What should we do with malloc? check if (res == NULL) { return NULL }?

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

I deleted files inside build folder, is it ok? (I don't know what those files was for)
btw I made all these changes to create ruby library https://github.com/Paxa/fast_excel_reader
according to my benchmark already got 9-15 times improvement

@brechtsanders
Copy link
Owner

I usually do:

if ((s = (char*)malloc(wantedsize)) == NULL)
  return NULL; //or some error code

that way you don't assume malloc worked causing the application to crash or even to allow hacker to exploit this.

The build folder contains the Code::Blocks project files, which I use to develop and debug on Windows.
Please don't delete it.

uthash is a package in most package systems (e.g. uthash-dev apt package on Debian Linux).
I would just add a CMake parameter to locate it for systems that don't have it in their package manager (like Windows).
But can you tell me why exactly we need this? I'm not keen to add dependancies...

Question:
Do your changes look at the type on cell-level, or also to colomn or row settings in case it's not defined on cell level?

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

Will try to make malloc as suggested and bring back build/ folder

But can you tell me why exactly we need this? I'm not keen to add dependancies...

it was in code uploaded in #12 and I keep it. It is used to find relations between cell and number format:

Example styles.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <numFmts count="2">
    <numFmt numFmtId="164" formatCode="#,##0.00"/>
    <numFmt numFmtId="165" formatCode="[$-409]m/d/yy h:mm AM/PM;@"/>
  </numFmts>
  ...
  <cellStyleXfs count="1">
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
  </cellStyleXfs>
  <cellXfs count="3">
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>
    <xf numFmtId="164" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/>
    <xf numFmtId="165" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/>
  </cellXfs>
  <cellStyles count="1">
    <cellStyle name="Normal" xfId="0" builtinId="0"/>
  </cellStyles>
  <dxfs count="0"/>
  <tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/>
</styleSheet>

Example sheet.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <dimension ref="A1:G5"/>
  ...
  <cols>
    <col min="2" max="2" width="20.7109375" style="1" customWidth="1"/>
    <col min="3" max="3" width="20.7109375" style="2" customWidth="1"/>
  </cols>
  <sheetData>
    <row r="1" spans="1:7">
      <c r="A1" t="s">
        <v>0</v>
      </c>
      <c r="B1" s="1">
        <v>1234</v>
      </c>
      <c r="C1" s="2">
        <v>43537.51777806081</v>
      </c>
      <c r="D1">
        <v>1234</v>
      </c>
      <c r="E1" t="b">
        <v>1</v>
      </c>
      <c r="F1">
        <f>(B1 * 2)</f>
        <v>0</v>
      </c>
      <c r="G1" t="s">
        <v>1</v>
      </c>
    </row>
    <row r="5" s="2" customFormat="1" ht="20" customHeight="1"/>
  </sheetData>
  <hyperlinks>
    <hyperlink ref="G1" r:id="rId1"/>
  </hyperlinks>
  <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>

In example files

  1. <c r="C1" s="2"> means cell have style = 2
  2. style number 2 is <xf numFmtId="165" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/>, it reference to number format 165
  3. which is <numFmt numFmtId="165" formatCode="[$-409]m/d/yy h:mm AM/PM;@"/>

From format code we can recognize it's a date, not just number. This code also consider builtin number formats 14 - 22.

If <numFmts ...> and <cellXfs ...> will always have count attribute then we can read it and allocate simple array of int for style-format relations and struct for number formats, or event array of style_id->number format struct (should be fast and simple).


For now only on cell-level, I tried to create file with row format and column format, but when extract xml files, and each cell will have style reference e.g s="2"
If you have example of file with column format or row format - I will try to make it work

Btw where do you find format specification?

@brechtsanders
Copy link
Owner

Styles are numbered. That's why I was wondering why hashing was used. A (sparse) array structure would do for this.
But I guess it will do for now. Worst case I witch the license from MIT to BSD.

How much of the API is modified? I'm assuming your changes are not very backwards compatible?

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

I changed code to use arrays of structs, but not sure if I implemented it correctly, especially memory allocation and memory releasing. So not using utarray and uthash anymore

How much of the API is modified? I'm assuming your changes are not very backwards compatible?

Should be totally backwards compatible, changed xlsxioread_sheet_next_cell to xlsxioread_sheet_next_cell_struct and added xlsxioread_sheet_next_cell that returns just string

What is your preferences for variable names? camelCase or snake_case? e.g. handle->numberFormats or handle->number_formats?

@brechtsanders
Copy link
Owner

Wow, you're quick!
I avoid camelCase in pure C. I stick with lower case and use underscore to separate things that are separate (e.g. get_numberformat).

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

lower case and use underscore to separate things that are separate (e.g. get_numberformat).

done

@Paxa
Copy link
Author

Paxa commented Mar 16, 2019

may be you know how to install minizip on windows? or is windows build is important at all?

@brechtsanders
Copy link
Owner

For me MinGW in Windows is my primary development platform, but I also make sure it builds on Debian Linux and macOS.
MiniZip 1.2.8 (downloaded from http://www.gaia-gis.it/gaia-sins/dataseltzer-sources/) builds fine on Windows with MinGW using CMake but for me it needed this patch to avoid WinRT specifics:

patch -ulbf iowin32.c << EOF
--- iowin32.c  2013-04-29 00:57:12 +0200
+++ iowin32.c  2014-08-09 23:11:20 +0200
@@ -28,3 +28,3 @@

-#if defined(WINAPI_FAMILY_PARTITION) && (!(defined(IOWIN32_USING_WINRT_API)))
+#if defined(WINAPI_FAMILY_PARTITION) && (!(defined(IOWIN32_USING_WINRT_API))) && !defined(__MINGW32__)
 #if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_APP)
EOF

If you need binaries I could send them to you if you like.

@AlexanderBurtasov
Copy link

Hi, all.
(Excuse my poorly-bad English).
I'm trying to use this PR in my project and facing with problem: Date-format cells is recognised as "none", not "date".

@Paxa
Copy link
Author

Paxa commented Mar 25, 2019

Do you have example of a file?

@AlexanderBurtasov
Copy link

mixed.xlsx

@Paxa
Copy link
Author

Paxa commented Mar 26, 2019

Thank you @AlexanderBurtasov for founding a bug. I fixed it

@AlexanderBurtasov
Copy link

AlexanderBurtasov commented Mar 27, 2019

17596_reduced.xlsx
Hi again.
PR completely falls while loading this file. Falls while calling xlsxioread_open.

@Paxa
Copy link
Author

Paxa commented Mar 27, 2019

@AlexanderBurtasov fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Determine cell type
3 participants