Use uint32_t instead of size_t where possible#110
Use uint32_t instead of size_t where possible#110glebm wants to merge 1 commit intos-yata:masterfrom
uint32_t instead of size_t where possible#110Conversation
|
I have a question. Is it guaranteed that |
|
I don't think it is guaranteed by the standard yet but in practice every implementation declares these. If you prefer, I can add the |
|
How about adding |
|
Even if a library uses In addition, These changes should be checked carefully. |
This would be easier if this PR were split into two parts, one to replace |
jmr
left a comment
There was a problem hiding this comment.
Easier to follow if s/UInt32/uint32_t/ is a separate PR.
|
Oops, current implementations use |
These changes were proposed as a part of #110. UIntXX were added because an ancient VC++ didn't provide uintXX_t. The latest VC++ provides uintXX_t and we don't need UIntXX anymore.
These changes were proposed as a part of #110. UIntXX were added because an ancient VC++ didn't provide uintXX_t. The latest VC++ provides uintXX_t and we don't need UIntXX anymore.
|
9feab4f did the |
65aba94 to
1ac6dc1
Compare
Marisa node IDs are 32-bit integers, so using a 64-bit `size_t` for nodes and related types wastes memory. Additionally, key and query lengths etc are all effectively limited to 32-bit. Replaces `size_t` with `uint32_t` where appropriate. This reduces the sizes of various structs throughout. Also replaces the remaining `marisa::UInt*` usages with `<cstdint>` types. I've opted to not use the `std::` prefix for `<cstdint>` types because it is verbose and not required.
|
Rebased |
|
Are you still working on this? |
|
This is actually ready for review, I just forgot to mark it as such |
pgaskin
left a comment
There was a problem hiding this comment.
I haven't reviewed this fully yet, but these changes seem sane overall:
- The API still takes a size_t for lengths provided for pointer arguments (this prevents the change from breaking existing source code and means stuff doesn't need to be casted unnecessarily).
- Keys are now explicitly uint32_t.
- Most return values are uint32_t.
- The SWIG bindings will need to be regenerated too.
Note that this change is ABI-breaking and will require stuff to be recompiled.
I'll take a closer look at these changes later.
| uint32_t base_blocks_size_ = 0; | ||
| uint32_t base_blocks_capacity_ = 0; | ||
| std::unique_ptr<std::unique_ptr<char[]>[]> extra_blocks_; | ||
| std::size_t extra_blocks_size_ = 0; | ||
| std::size_t extra_blocks_capacity_ = 0; | ||
| uint32_t extra_blocks_size_ = 0; | ||
| uint32_t extra_blocks_capacity_ = 0; | ||
| std::unique_ptr<std::unique_ptr<Key[]>[]> key_blocks_; | ||
| std::size_t key_blocks_size_ = 0; | ||
| std::size_t key_blocks_capacity_ = 0; | ||
| uint32_t key_blocks_size_ = 0; | ||
| uint32_t key_blocks_capacity_ = 0; |
There was a problem hiding this comment.
While these should fit in a uint32_t, it might be better to keep this as a size_t since they're primarily used for indexing memory.
| } | ||
| std::size_t id() const { | ||
| uint32_t id() const { | ||
| return union_.id; |
There was a problem hiding this comment.
The setter on line 41 is void set_id(std::size_t id), but should probably also be a uint32_t.
There was a problem hiding this comment.
Also, #include <cstdint> should probably be added to this file.
| } | ||
| std::size_t length() const { | ||
| uint32_t length() const { | ||
| return length_; |
There was a problem hiding this comment.
Add an assert(length <= UINT32_MAX); to the set_str methods like the other files.
|
|
||
| std::size_t length = 0; | ||
| uint32_t length = 0; | ||
| while (str[length] != '\0') { | ||
| ++length; | ||
| } |
There was a problem hiding this comment.
I think it might be better to keep this as a size_t, then assert that it fits in a uint32_t (it doesn't matter in practice, but it's cleaner and aligns with what the rest of the code already does, e.g., in key.h).
| uint32_t num_keys() const { | ||
| return size(); | ||
| } |
There was a problem hiding this comment.
This also needs to be changed in marisa/trie.cc.
|
I think I might also open a separate PR which only touches the key IDs to make it easier to review and change incrementally. |
|
Do you mind sending a PR to my branch with the suggested improvements? Otherwise, I'm not sure when I'll get around to addressing them |
|
I think I might propose a slightly different approach to make this less error-prone, and to make the commit which has the API/ABI-breaking change as small and standalone as possible. |
Marisa node IDs are 32-bit integers, so using a 64-bit
size_tfor nodes and related types wastes memory.Additionally, key and query lengths etc are all effectively limited to 32-bit.
Replaces
size_twithuint32_twhere appropriate. This reduces the sizes of various structs throughout.Also replaces the remaining
marisa::UInt*usages with<cstdint>types. I've opted to not use thestd::prefix for<cstdint>types because it is verbose and not required.