Skip to content

Conversation

@jstucke
Copy link
Contributor

@jstucke jstucke commented Jan 29, 2026

  • little endian test files were created with mkfs.minix
    • since there does not seem to be a tool to create big endian MINIX filesystems, those test files were created by byte swapping the little endian test files
  • known firmware samples with MINIX FS:
    • Trendnet IP cameras, e.g. TV-IP110W and TV-IP422WN (both MINIX FS v1 LE with 30 byte filenames)
    • Netgear switches, e.g. FSM7326P and GSM7248 (both MINIX FS v2 BE with 30 byte filenames)
    • I have not encountered MINIX v3 filesystems in firmware images, but adding it anyway was little overhead

@qkaiser qkaiser self-requested a review January 29, 2026 09:48
@qkaiser qkaiser added enhancement New feature or request format:filesystem python Pull requests that update Python code labels Jan 29, 2026
@jstucke
Copy link
Contributor Author

jstucke commented Jan 29, 2026

hmm I noticed that I should probably add a few sanity checks to calculate_chunk to avoid false positives

Copy link
Contributor

@qkaiser qkaiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First review pass over lunch, will get back with a more detailed review.

yield self._read_zone_data(index)

def _read_directory(self, inode) -> Iterator[cstruct]:
for entry in chunked(self._read_file_data(inode), self.dir_entry_size):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have iterate_file in file_utils that could help you if you're doing what I think you're doing. Lots of things are already implemented in file_utils, so you probably don't need the more-itertools dependency either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed _read_file_data to stream chunks of data, but these chunks have block size and not dir_entry_size, so chunked() is still beneficial (making _read_file_data yield chunks of size dir_entry_size is a bit much overhead because of the indirect data zone structure).

I could replace the chunked with slices. That would remove the dependency, but would not really improve readability:

            for dir_data in chunked(zone_data, self.dir_entry_size):
            for index in range(0, len(zone_data), self.dir_entry_size):
                dir_data = zone_data[index : index + self.dir_entry_size]


def __init__(self):
super().__init__()
self.EXTRACTOR = MinixFSExtractor(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the attempt at generalization but I don't think this self passing is a good design. I'll try to come up with something better when I'm thru the review.

Copy link
Contributor

@qkaiser qkaiser Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should define a _MinixFSExtractor class that do not override the __init__ function of Extractor. Then you create 3 extractors:

  • MinixFSv1Extractor(_MinixFSExtractor)
  • MinixFSv2Extractor(_MinixFSExtractor)
  • MinixFSv3Extractor(_MinixFSExtractor)

None of these extractors should rely on the handler they're attached to. No handlers argument or attribute here.

You should define a _MinixFSHandler(StructHandler) that does not override the __init__ function of StructHandler. This generic handler should define:

  • HEADER_STRUCT as minix_super_block

Then, create three different handlers:

  • MinixFSv1Handler(_MinixFSHandler)
  • MinixFSv2Handler(_MinixFSHandler)
  • MinixFSv3Handler(_MinixFSHandler)

Each of these handlers must override the C_DEFINITIONS entry with their own. To do so I would split the C_DEFINITIONS in 3 blocks:

  • C_DEFINITIONS_V1 (only minix_inode, minix_super_block, minix_dir_entry)
  • C_DEFINITIONS_V2 (only minix2_inode, minix_super_block, minix_dir_entry)
  • C_DEFINITIONS_V3 (only minix_inode, minix3_super_block, minix3_dir_entry)

In each of these C definitions, use the same names. Do not use minix2_inode vs minix_inode. Use generic names without version info. That's the reason we can define HEADER_STRUCT in the handler parent class.

With this approach you can get rid of DIR_STRUCT and INODE_STRUCT since they will be generic. Just use the generic name.

You can also drop the VERSION class attribute. The VERSION is currently used for two things:

  • derive the block_size, just check if your header has an s_blocksize attribute. If it does, use it otherwise default to BLOCK_SIZE
  • derive the inode_size, this can be done through the cstruct parser that expose its fields size (e.g. self._struct_parser.cparser_le.minix_inode.size

Do not define DOC in __init__, just define it in the class with a version that's not resolved at runtime.

Each handler must set its extractor explicitly like:

  • EXTRACTOR = MinixFSv3Extractor()

This should make the code easier to follow and reduce complexity.

Copy link
Contributor Author

@jstucke jstucke Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you come up with anything? I'm still a bit puzzled how to solve this (since both handler and extractor both need many of the fields including the C_DEFINITIONS). Is overriding Handler.extract an option? 😅
edit: it seems I did not refresh the page before commenting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to change everything like you suggested. I couldn't get completely rid of VERSION, though.

Comment on lines 150 to 151
zone_count = self._get_zone_count(header)
if zone_count != 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since zone_count is in the header, you can write patterns that do not match on headers with a null zone count. This way you don't need to get it here or check the value since it won't be matched in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would I do that? There is the not (~) syntax in YARA but this does not seem to work. At least I get an error when using it in a pattern:

>           raise InvalidHexString(str(e)) from e
E           unblob.parser.InvalidHexString: No terminal matches '~' in the current parser context, at line 1 col 44
E           
E           ] (7f | 8f) 13 (00 | 01 | 02) 00 [2] 00 ~00 00 00
E                                                   ^
E           Expected one of: 
E           	* START_ANCHOR
E           	* SECONDNIBLE
E           	* JUMP
E           	* END_ANCHOR
E           	* LPAR
E           	* WILDCARD
E           	* RANGE_JUMP
E           	* RPAR
E           	* FIRSTNIBLE
E           	* LITERAL
E           	* ALTERNATIVE_SEPARATOR
E           
E           Previous tokens: Token('LITERAL', '00')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's something we may want to fix in unblob itself. Naive approach is to accept everything but 00:

( 01 | 02 | 03 | ... | ff)

You may have an easier time writing patterns as Regex rathern than HexString for that purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried writing a Regex instead, but it seems that it does not like null bytes:

    @lru_cache
    def build_hyperscan_database(handlers: Handlers) -> StreamDatabase:
        patterns = []
        for handler_class in handlers:
            handler = handler_class()
            for pattern in handler.PATTERNS:
                try:
                    patterns.append(
                        Pattern(
                            pattern.as_regex(),
                            Flag.SOM_LEFTMOST,
                            Flag.DOTALL,
                            tag=handler,
                        )
                    )
                except InvalidHexString as e:
                    logger.error(
                        "Invalid pattern",
                        handler=handler.NAME,
                        pattern=pattern,
                        error=str(e),
                    )
                    raise
>       return StreamDatabase(*patterns)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
E       ValueError: Pattern expression contains NULL byte

Maybe it is best to do sanity checking during chunk extraction and change it as soon as it is supported?

endianness_fmt = "<" if self._handler.ENDIANNESS == Endian.LITTLE else ">"
return list(struct.unpack(f"{endianness_fmt}{count}{ptr_fmt}", data))

def _read_file_data(self, inode: cstruct) -> bytes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the cstruct typing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some better way to do this? I removed it for the time being

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to use Structure (but since that riddled the code with missing attribute warnings, I added more concrete subclasses as type hints (maybe that was overkill, though))

entry_inode = self._read_inode(entry.inode)

if self._is_file(entry_inode):
fs.write_bytes(Path(entry_path), self._read_file_data(entry_inode))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend using write_chunks and implement _read_file_data as a generator that yields chunks of bytes. This way we don't have to load large files in memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it, so that _read_file_data streams the file data instead (and renamed it to _stream_file_data)

@qkaiser
Copy link
Contributor

qkaiser commented Jan 29, 2026

@jstucke your recent contributions are highly appreciated. Keep them coming ! :)

@jstucke
Copy link
Contributor Author

jstucke commented Jan 29, 2026

@jstucke your recent contributions are highly appreciated. Keep them coming ! :)

Haha thank you for taking the time for a thorough review. I added some validity checks and tests with invalid headers. I also found the original tool from the minix project and was also able to create samples with block size > 1024 and even log_zone_size != 0 (which is not possible with the mkfs.minix included in linux), but sadly I am unable to mount the filesystems and therefore could not add any files. The validity checks seem to work, though. There is are some images on the minix download page which uses v3 with blocksize 4096 and it seems to be unpacked without problems.

I will probably fix the remaining issues tomorrow.

@jstucke jstucke force-pushed the minixfs branch 2 times, most recently from f17d888 to 57475ad Compare January 30, 2026 09:58
@jstucke jstucke force-pushed the minixfs branch 3 times, most recently from 91dc5a2 to 7dc5667 Compare January 30, 2026 14:33
* little endian test files were created with mkfs.minix
  * since there does not seem to be a tool to create big endian MINIX filesystems, those test files were created by byte swapping the little endian test files
* known firmware samples with MINIX FS:
  * Trendnet IP cameras, e.g. TV-IP110W and TV-IP422WN (both MINIX FS v1 LE with 30 byte filenames)
  * Netgear switches, e.g. FSM7326P and GSM7248 (both MINIX FS v2 BE with 30 byte filenames)
@qkaiser
Copy link
Contributor

qkaiser commented Feb 2, 2026

@jstucke I'll do some cleanups on this branch sometimes this week, but not sure when. Hang on :)

@qkaiser
Copy link
Contributor

qkaiser commented Feb 12, 2026

Cleaning up the code and doing cross-validation as we speak. Will push my changes when I'm done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request format:filesystem python Pull requests that update Python code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants