Ensure free space in Vector is zero-initialized#134

Open
pgaskin wants to merge 1 commit into s-yata:master from pgaskin:zero-initialize-vector

pgaskin commented Nov 8, 2025

This fixes non-reproducible output when writing the FlatVector LoudsTrie.extras_. The affected data comes from the Vector<uint32_t> terminals, which is filled with the Tail offsets during Tail::build (called from LoudsTrie::build_next_trie).

pgaskin force-pushed the zero-initialize-vector branch from 87b1510 to 4797668 on November 8, 2025 21:27

pgaskin commented Nov 8, 2025

Building words.txt before and after this change:

diff --git a/before b/after
index 2b19266..6705ebd 100644
--- a/before
+++ b/after
@@ -82100,7 +82100,7 @@
 00140b30: 010a 0101 0101 0101 2401 6701 0100 0805  ........$.g.....
 00140b40: 014a 0101 0101 0101 0101 0100 002b 0100  .J...........+..
 00140b50: 0000 1909 094a 094a 092b 1000 4a4a 0900  .....J.J.+..JJ..
-00140b60: 01fc 0100 0000 0000 0800 0000 ff00 0000  ................
+00140b60: 0100 0000 0000 0000 0800 0000 ff00 0000  ................
 00140b70: 990b 0500 0000 0000 0000 0000 0000 0000  ................
 00140b80: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00140b90: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Stack trace of the write call that produced the bytes at that offset:

marisa.write(i32,i32)
marisa.wasm.marisa::grimoire::io::Writer::data(void const*, unsigned long)(i32,i32,i32)
marisa.wasm.void marisa::grimoire::io::Writer::write<unsigned int>(unsigned int const*, unsigned long)(i32,i32,i32)
marisa.wasm.marisa::grimoire::vector::Vector<unsigned int>::write(marisa::grimoire::io::Writer&) const(i32,i32)
marisa.wasm.marisa::grimoire::vector::FlatVector::write(marisa::grimoire::io::Writer&) const(i32,i32)
marisa.wasm.marisa::grimoire::trie::LoudsTrie::write_(marisa::grimoire::io::Writer&) const(i32,i32)
marisa.wasm.marisa::grimoire::trie::LoudsTrie::write(marisa::grimoire::io::Writer&) const(i32,i32)
marisa.wasm.marisa::Trie::write(int) const(i32,i32)
marisa.wasm.marisa_save()

The buffer of length 330656 passed to the write function (i.e., the uint32_t* Vector<uint32_t>.const_objs_ of FlatVector.units_ of LoudsTrie.extras_):

--- snip ---
00050b90  09 2b 10 00 4a 4a 09 00  01 fc 01 00 00 00 00 00  |.+..JJ..........|
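
The bytes 01 fc 01 00 in the dump above would be consistent with a unit whose low bits were written while the remaining bits kept whatever the allocator left behind. The following is a minimal sketch of that effect, assuming a packed store that only overwrites the bits it owns. `set_low_bits`, `serialize`, and `build_with_stale` are hypothetical helpers, not marisa-trie functions, and the 8-bit width and the stale value 0x0001fc00 are picked only to mirror the bytes shown; the stale content is passed in explicitly so the example never reads uninitialized memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not marisa-trie code): stores `value` in the low
// `bits` bits of `unit` and leaves every other bit exactly as it was.
inline void set_low_bits(std::uint32_t &unit, std::uint32_t value,
                         unsigned bits) {
  assert(bits >= 1 && bits <= 32);
  const std::uint32_t mask =
      (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
  unit = (unit & ~mask) | (value & mask);
}

// Emits each unit as four little-endian bytes, standing in for a raw write
// of the backing buffer.
inline std::vector<std::uint8_t> serialize(
    const std::vector<std::uint32_t> &units) {
  std::vector<std::uint8_t> out;
  for (std::uint32_t u : units) {
    for (int shift = 0; shift < 32; shift += 8) {
      out.push_back(static_cast<std::uint8_t>(u >> shift));
    }
  }
  return out;
}

// Builds a one-unit buffer whose initial content is `stale` (simulating
// leftover allocator bytes), packs one 8-bit value into it, and serializes.
inline std::vector<std::uint8_t> build_with_stale(std::uint32_t stale) {
  std::vector<std::uint32_t> units(1, stale);
  set_low_bits(units[0], 0x01u, 8);
  return serialize(units);
}
```

With a zeroed starting unit the serialized bytes depend only on the stored value; with any other starting content, the untouched bits end up in the output, so two otherwise identical builds can differ.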

pgaskin added a commit to pgaskin/go-marisa that referenced this pull request Nov 8, 2025
See s-yata/marisa-trie#134. Without this, written tries may include
uninitialized memory, which makes them non-reproducible.

pgaskin commented Nov 8, 2025

Note that the output of a simple marisa-build command isn't changed by this PR, as its uninitialized memory happens to be zero.

This is an alternative patch which would have the same effect:

diff --git a/lib/marisa/grimoire/vector/vector.h b/lib/marisa/grimoire/vector/vector.h
index afc7d9a..b25031c 100644
--- a/lib/marisa/grimoire/vector/vector.h
+++ b/lib/marisa/grimoire/vector/vector.h
@@ -256,7 +256,7 @@ class Vector {
     assert(new_capacity >= size_);
     assert(new_capacity <= max_size());
 
-    std::unique_ptr<char[]> new_buf(new char[sizeof(T) * new_capacity]);
+    std::unique_ptr<char[]> new_buf(new char[sizeof(T) * new_capacity]());
     T *new_objs = reinterpret_cast<T *>(new_buf.get());
 
     static_assert(std::is_trivially_copyable_v<T>);
@@ -273,7 +273,7 @@ class Vector {
   void copyInit(const T *src, std::size_t size, std::size_t capacity) {
     assert(size_ == 0);
 
-    std::unique_ptr<char[]> new_buf(new char[sizeof(T) * capacity]);
+    std::unique_ptr<char[]> new_buf(new char[sizeof(T) * capacity]());
     T *new_objs = reinterpret_cast<T *>(new_buf.get());
 
     static_assert(std::is_trivially_copyable_v<T>);
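
The only change in the alternative patch is the trailing `()` on the array new-expression, which value-initializes the array; for `char` that means every byte is zero. Below is a small standalone sketch of the same idiom, assuming nothing about the rest of `Vector`. `make_zeroed_storage`, `copy_into_zeroed`, and `all_zero` are hypothetical names introduced for illustration and do not exist in marisa-trie.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <type_traits>

// Allocates raw storage for `count` objects of type T with every byte zero.
// `new char[n]()` value-initializes the array; `new char[n]` would leave the
// bytes indeterminate.
template <typename T>
std::unique_ptr<char[]> make_zeroed_storage(std::size_t count) {
  return std::unique_ptr<char[]>(new char[sizeof(T) * count]());
}

// Copies `size` live objects into fresh storage sized for `capacity`
// objects. The bytes past the copied objects stay zero.
template <typename T>
std::unique_ptr<char[]> copy_into_zeroed(const T *src, std::size_t size,
                                         std::size_t capacity) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy requires a trivially copyable type");
  assert(size <= capacity);
  std::unique_ptr<char[]> buf = make_zeroed_storage<T>(capacity);
  if (size > 0) {
    std::memcpy(buf.get(), src, sizeof(T) * size);
  }
  return buf;
}

// Returns true when the first `len` bytes of `buf` are all zero.
inline bool all_zero(const char *buf, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    if (buf[i] != 0) return false;
  }
  return true;
}
```

The trade-off is that the whole buffer is written once on every allocation, including the part that is about to be overwritten by the copy.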
