Skip to content

Use uint32_t instead of size_t where possible#110

Open
glebm wants to merge 1 commit intos-yata:masterfrom
glebm:size_t
Open

Use uint32_t instead of size_t where possible#110
glebm wants to merge 1 commit intos-yata:masterfrom
glebm:size_t

Conversation

@glebm
Copy link
Contributor

@glebm glebm commented May 31, 2025

Marisa node IDs are 32-bit integers, so using a 64-bit size_t for nodes and related types wastes memory.
Additionally, key and query lengths etc are all effectively limited to 32-bit.

Replaces size_t with uint32_t where appropriate. This reduces the sizes of various structs throughout.

Also replaces the remaining marisa::UInt* usages with <cstdint> types. I've opted to not use the std:: prefix for <cstdint> types because it is verbose and not required.

@s-yata
Copy link
Owner

s-yata commented Jun 1, 2025

I have a question.

Is it guaranteed that uintXX_t (not std::uintXX_t) are declared in <cstdint>?

@glebm
Copy link
Contributor Author

glebm commented Jun 1, 2025

I don't think it is guaranteed by the standard yet but in practice every implementation declares these.

If you prefer, I can add the std:: qualifiers.

@s-yata
Copy link
Owner

s-yata commented Jun 1, 2025

How about adding using std::uintXX_t in "marisa/base.h"?

@s-yata
Copy link
Owner

s-yata commented Jun 2, 2025

Even if a library uses uint32_t internally, the API shoud accept std::size_t.
For example, std::size_t is generally used to store a string length and it can be implicitly narrow-casted to uint32_t as an argument.

In addition, uint32_t seems to be not enough in some cases, e.g. for capacities.

These changes should be checked carefully.

@jmr
Copy link
Contributor

jmr commented Jun 2, 2025

These changes should be checked carefully.

This would be easier if this PR were split into two parts, one to replace UInt32 with uint32_t and one to change the types from size_t to uint32_t.

Copy link
Contributor

@jmr jmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to follow if s/UInt32/uint32_t/ is a separate PR.

@glebm glebm marked this pull request as draft June 2, 2025 08:57
@s-yata
Copy link
Owner

s-yata commented Jun 4, 2025

Oops, current implementations use static_cast<uint32_t> to size_t arguments...

s-yata added a commit that referenced this pull request Jun 8, 2025
These changes were proposed as a part of #110.
UIntXX were added because an ancient VC++ didn't provide uintXX_t.
The latest VC++ provides uintXX_t and we don't need UIntXX anymore.
s-yata added a commit that referenced this pull request Jun 13, 2025
These changes were proposed as a part of #110.

UIntXX were added because an ancient VC++ didn't provide uintXX_t.
The latest VC++ provides uintXX_t and we don't need UIntXX anymore.
@jmr
Copy link
Contributor

jmr commented Jun 14, 2025

9feab4f did the uint*_t change. Can you rebase?

@glebm glebm force-pushed the size_t branch 4 times, most recently from 65aba94 to 1ac6dc1 Compare June 14, 2025 13:24
Marisa node IDs are 32-bit integers, so using a 64-bit `size_t`
for nodes and related types wastes memory.
Additionally, key and query lengths etc are all effectively
limited to 32-bit.

Replaces `size_t` with `uint32_t` where appropriate.
This reduces the sizes of various structs throughout.

Also replaces the remaining `marisa::UInt*` usages with `<cstdint>`
types. I've opted to not use the `std::` prefix for `<cstdint>`
types because it is verbose and not required.
@glebm
Copy link
Contributor Author

glebm commented Jun 14, 2025

Rebased

@pgaskin
Copy link

pgaskin commented Nov 4, 2025

Are you still working on this?

@glebm glebm marked this pull request as ready for review November 4, 2025 10:41
@glebm
Copy link
Contributor Author

glebm commented Nov 4, 2025

This is actually ready for review, I just forgot to mark it as such

Copy link

@pgaskin pgaskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed this fully yet, but these changes seem sane overall:

  • The API still takes a size_t for lengths provided for pointer arguments (this prevents the change from breaking existing source code and means stuff doesn't need to be casted unnecessarily).
  • Keys are now explicitly uint32_t.
  • Most return values are uint32_t.
  • The SWIG bindings will need to be regenerated too.

Note that this change is ABI-breaking and will require stuff to be recompiled.

I'll take a closer look at these changes later.

Comment on lines +64 to +71
uint32_t base_blocks_size_ = 0;
uint32_t base_blocks_capacity_ = 0;
std::unique_ptr<std::unique_ptr<char[]>[]> extra_blocks_;
std::size_t extra_blocks_size_ = 0;
std::size_t extra_blocks_capacity_ = 0;
uint32_t extra_blocks_size_ = 0;
uint32_t extra_blocks_capacity_ = 0;
std::unique_ptr<std::unique_ptr<Key[]>[]> key_blocks_;
std::size_t key_blocks_size_ = 0;
std::size_t key_blocks_capacity_ = 0;
uint32_t key_blocks_size_ = 0;
uint32_t key_blocks_capacity_ = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these should fit in a uint32_t, it might be better to keep this as a size_t since they're primarily used for indexing memory.

}
std::size_t id() const {
uint32_t id() const {
return union_.id;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setter on line 41 is void set_id(std::size_t id), but should probably also be a uint32_t.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, #include <cstdint> should probably be added to this file.

}
std::size_t length() const {
uint32_t length() const {
return length_;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an assert(length <= UINT32_MAX); to the set_str methods like the other files.

Comment on lines -48 to 51

std::size_t length = 0;
uint32_t length = 0;
while (str[length] != '\0') {
++length;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to keep this as a size_t, then assert that it fits in a uint32_t (it doesn't matter in practice, but it's cleaner and aligns with what the rest of the code already does, e.g., in key.h).

Comment on lines +39 to 41
uint32_t num_keys() const {
return size();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs to be changed in marisa/trie.cc.

@pgaskin
Copy link

pgaskin commented Nov 5, 2025

I think I might also open a separate PR which only touches the key IDs to make it easier to review and change incrementally.

@glebm
Copy link
Contributor Author

glebm commented Nov 5, 2025

Do you mind sending a PR to my branch with the suggested improvements? Otherwise, I'm not sure when I'll get around to addressing them

@pgaskin
Copy link

pgaskin commented Nov 5, 2025

I think I might propose a slightly different approach to make this less error-prone, and to make the commit which has the API/ABI-breaking change as small and standalone as possible.

@pgaskin
Copy link

pgaskin commented Nov 5, 2025

I've opened #130, #131, and #132 with a different approach.

glebm/marisa-trie@size_t...pgaskin:marisa-trie:api-uint32

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants