Hi @martindurant
{0: {
'foo.with.strings-data': array([0, 1, -1], dtype=int8),
'foo.with.strings-cats': ["hey", "there"],
'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
'foo.with.lists.list.element-cats': [0]}
}
I am also curious to know what the input will be for the general
Thank you for your feedback! |
These are complex columns. In this case, a list-of-lists is made up of the data values, offsets and maybe an index (in the case of categoricals). There will be some simple wrappers in https://github.com/dask/fastparquet/blob/a9d3f309068189043f5ecec5f616de90c11fa305/fastparquet/wrappers.py to provide access to these nested structures, or the arrays could be passed directly to arrow, awkward or other libraries that know what to do with them. becomes ["hey", "there", None] as a list becomes Yes, |
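To make the layout above concrete, here is a minimal sketch that decodes the example dict by hand with plain numpy and Python lists. The helper names `decode_categorical` and `decode_lists` are hypothetical and are not part of fastparquet or its wrappers module. The sketch assumes that a negative code means null and that each offsets array holds one more entry than there are lists.

```python
import numpy as np

def decode_categorical(codes, cats):
    """Map integer codes onto category values; negative codes become None (assumed convention)."""
    return [cats[c] if c >= 0 else None for c in codes]

def decode_lists(offsets, values):
    """Cut a flat sequence of values into lists using consecutive offset pairs."""
    return [list(values[start:stop]) for start, stop in zip(offsets[:-1], offsets[1:])]

# The row group from the example above, written out as a plain dict of arrays.
row_group = {
    'foo.with.strings-data': np.array([0, 1, -1], dtype=np.int8),
    'foo.with.strings-cats': ["hey", "there"],
    'foo.with.ints-data': np.array([1, 2, 3], dtype=np.uint8),
    'foo.with.lists.list-offsets': np.array([0, 1, 2, 3]),
    'foo.with.lists.list.element-data': np.array([0, 0, 0], dtype=np.uint8),
    'foo.with.lists.list.element-cats': [0],
}

# Categorical strings: the codes index into the category list.
strings = decode_categorical(
    row_group['foo.with.strings-data'],
    row_group['foo.with.strings-cats'],
)

# List column: decode the categorical elements first, then split them by offsets.
elements = decode_categorical(
    row_group['foo.with.lists.list.element-data'],
    row_group['foo.with.lists.list.element-cats'],
)
lists = decode_lists(row_group['foo.with.lists.list-offsets'], elements)

print(strings)  # ['hey', 'there', None]
print(lists)    # [[0], [0], [0]]
```

Materialising Python objects like this is only for illustration; the point of the flat representation is that the arrays can be handed to other libraries without this step.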
|
Thanks a lot for your quick feedback!
|
Yes, I think so. So in the simple case of tabular data (nothing nested), this is essentially what pandas gives you anyway: |
|
@erykoff has an interest in a "dict of arrays output" and has his numparquet project: https://github.com/erykoff/numparquet. I do not want to muddle this PR with new ideas/features, but I do want to connect you all together since I think you all have common goals. :) |
|
@erykoff : happy to talk and help. I have not had a chance to see your work, since I didn't know about it until just now. |
|
My work did not exist until just now! It was a holiday break hobby project to see how far I could get. I hadn't looked at fastparquet because it was so entwined with pandas (which I try to avoid). Nevertheless, I now realize that the primitives here really do almost everything that we need. What we are looking for is:
I'm happy to look at this PR and see if I can make a minimal working example of what we need. But I don't know if it's general... |
|
To give more detail on the experiment in this PR, it does work, including variable-length strings, lists and nested records with or without nulls. You should find the thrift implementation here significantly faster than thriftpy2 (which may be important for big schemas). Those various complex types are returned as sets of offsets into data arrays, e.g., strings should be a (numpy) uint8 array and a uint32 array of offsets. This is best for loading speed and storage size unless you actually want python strings. The complexity is in how to combine pages and row-groups, particularly if you intend to try to parallelise. |
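As an illustration of the string layout described above, here is a minimal sketch of a uint8 data array paired with a uint32 offsets array, and of what combining two chunks involves. The function names are hypothetical, and the convention of `n + 1` offsets starting at zero is an assumption (it matches arrow and awkward, but the branch may differ in detail).

```python
import numpy as np

def encode_strings(strings):
    """Pack python strings into a uint8 byte array plus uint32 offsets (n + 1 entries, assumed)."""
    encoded = [s.encode("utf-8") for s in strings]
    data = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    offsets = np.zeros(len(encoded) + 1, dtype=np.uint32)
    offsets[1:] = np.cumsum([len(b) for b in encoded])
    return data, offsets

def decode_strings(data, offsets):
    """Materialise python strings; only needed if you actually want str objects."""
    raw = data.tobytes()
    return [raw[start:stop].decode("utf-8") for start, stop in zip(offsets[:-1], offsets[1:])]

def concat_chunks(chunk_a, chunk_b):
    """Combine two (data, offsets) chunks, e.g. from two pages or row groups.

    The second chunk's offsets must be shifted by the byte length of the first,
    which is why combining chunks is less trivial than a plain concatenate.
    """
    data_a, offsets_a = chunk_a
    data_b, offsets_b = chunk_b
    data = np.concatenate([data_a, data_b])
    offsets = np.concatenate([offsets_a, offsets_b[1:] + offsets_a[-1]])
    return data, offsets

first = encode_strings(["hey", "there"])
second = encode_strings(["", "world"])
data, offsets = concat_chunks(first, second)

print(offsets.tolist())               # [0, 3, 8, 8, 13]
print(decode_strings(data, offsets))  # ['hey', 'there', '', 'world']
```

The offset shift in `concat_chunks` depends on the total size of every preceding chunk, so chunks decoded in parallel can only be stitched together once those sizes are known.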
|
Did anyone have any use for the work in this PR? |
Due to the upcoming hard dependence of pandas on pyarrow, this branch investigates what it would look like to have a fastparquet that avoids pandas altogether and deals with numpy arrays alone. For complex columns, the representation will be similar to, and compatible with, awkward/arrow buffers, but will not require those packages.