Skip to content

Set of multiple fixes - batch 2#6

Open
hubpav wants to merge 90 commits intomainfrom
pavel/fixes-batch-2
Open

Set of multiple fixes - batch 2#6
hubpav wants to merge 90 commits intomainfrom
pavel/fixes-batch-2

Conversation

@hubpav
Copy link
Contributor

@hubpav hubpav commented Feb 17, 2026

No description provided.

Replace `return ret` with `res = ret; goto error` in
app_machine_probe_enable_tilt_alert, disable_tilt_alert, and
get_tilt_alert. The direct return skipped COMM_EPILOGUE, leaving the
1-Wire bus locked and the mutex held, permanently deadlocking the
machine probe subsystem on any transient I2C failure.

Also remove unreachable `return 0` after COMM_EPILOGUE in
read_accelerometer and get_tilt_alert for consistency.
Only update the nonce counter after successful decryption. Previously,
a crafted NFC message with nonce_counter = UINT32_MAX and garbage
ciphertext would permanently poison the monotonic nonce, blocking all
future legitimate NFC configuration.
The else-if chain for LED status indication in the main loop spans a
preprocessor boundary: the LoRaWAN block is guarded by
#if defined(CONFIG_LORAWAN), but the alarm and default LED blocks were
chained with else-if across the #endif. When CONFIG_LORAWAN is disabled,
this produces a compile error (dangling else without preceding if).

Use a `led_handled` flag to decouple the blocks so each section is
self-contained and compiles correctly regardless of CONFIG_LORAWAN.
Cast the TMP112 register value to int16_t before right-shifting to
preserve two's complement sign extension. Without the cast,
sys_get_be16 returns uint16_t and negative temperatures (e.g. -1 C
represented as 0xFF70) are interpreted as ~255 C.
Compare against -ENODATA (negative) instead of ENODATA (positive).
Zephyr's sensor_sample_fetch returns negative error codes, so the
comparison with positive ENODATA was always true, causing the retry
loop to break on the first attempt instead of retrying.
Move LOG_DBG inside the null check for the voltage parameter. The log
statement unconditionally dereferenced *voltage, causing a hard fault
when the caller passes NULL.
Validate lorawan.region, lorawan.network, and lorawan.activation
against their respective enum ranges before assignment. Protobuf does
not enforce enum ranges for unknown values, so a crafted NFC message
could set arbitrary integers that cause undefined behavior in the
LoRaWAN stack's switch statements.
Apply the same range checks used by shell commands to NFC-ingested
configuration values:
- interval_sample: 0 or 5-3600
- interval_report: 60-86400
- temperature lo/hi: -30 to 70
- temperature hst: 0 to 5
- humidity lo/hi: 0 to 100
- humidity hst: 0 to 20
- pressure lo/hi: 500 to 1200
- pressure hst: 0 to 50
- correction temperatures: -5 to 5

Without validation, a crafted NFC message could set interval_report=1
causing uplinks every second, exhausting duty cycle and battery.
Use fixed-size stack buffers instead of k_malloc for I2C transaction
buffers in ds28e17_i2c_write_, ds28e17_i2c_read_, and
ds28e17_i2c_write_read_. The DS28E17 protocol bounds write_len to 255,
so the maximum buffer sizes are known at compile time:
- write: 3 + 255 + 2 = 260 bytes
- read: 5 bytes
- write_read: 3 + 255 + 3 = 261 bytes

This eliminates heap fragmentation risk with the 2KB system heap and
removes the ENOMEM failure path from every I2C transaction.
Move LOG_DBG inside the null check for the serial_number parameter.
The log statement unconditionally dereferenced *serial_number, causing
a hard fault when the caller passes NULL.
The sht_read_serial function auto-detects SHT43 (0x44) vs SHT30/SHT33
(0x45) but the result was not stored. All operational functions
(init/convert/read) unconditionally used the SHT30 I2C address (0x45)
and SHT30-specific commands, causing every read to fail on SHT43-based
machine probes.

Add SHT43-specific init (0x94 soft reset), convert (0xFD high-precision
measure), and read functions using the correct I2C address (0x44) and
SHT4x command set. Store the detected sensor type per-probe and branch
accordingly in app_machine_probe_read_hygrometer.

Auto-detection is performed on first hygrometer read if the type is
unknown, and cached for subsequent reads.
Cast both sides of the enum range comparisons to int to avoid
cross-enum-type comparison warnings between protobuf-generated enums
and app_config enums.
Set m_state to APP_LRW_STATE_RECONNECT before starting the backoff
timer when lorawan_start() fails during rejoin. Without this, the
state remained JOINING, causing the backoff timer handler to dispatch
link_check_work instead of join_work. The link_check_work_handler
then returned immediately (state == JOINING), and no further work was
ever scheduled — permanently deadlocking the LoRaWAN subsystem.
The Conversion Ready Flag (CRF) is bit 7 of the 16-bit Configuration
Register, which maps to read_buf[1] (LSB), not read_buf[0] (MSB).
The MSB check tested RN3 (auto-range result), which is only set for
ranges 8-11. In dim lighting (ranges 0-7, <655 lux), RN3=0 caused
a false -EIO return despite the sensor operating correctly.
Change nwkkey max_length from 16 to 32. A 16-byte LoRaWAN network key
requires 32 hex characters in string form. The 16-char limit caused
nanopb to silently truncate the key, which then failed the length
check in parse_hex_string, making NFC nwkkey provisioning always fail.

All sibling key fields (appkey, nwkskey, appskey) already used
max_length:32.
Add goto error after sensor_channel_get failure to prevent
fall-through into the temperature computation. Without it, the
uninitialized sensor_value struct (stack garbage) was used to compute
and write a garbage temperature value.
The LoRaWAN MAC callbacks (downlink_callback, link_check_callback)
execute on the system work queue but directly mutated state variables
shared with handlers on m_work_q. The system work queue (cooperative
priority) could preempt m_work_q mid-operation, causing races such as
a link check response being handled between setting m_link_check_pending
and starting the timeout timer — leading to spurious timeout failures.

Route all state mutations through m_work_q by having the MAC callbacks
submit work items instead of calling handle_link_check_success/failure
directly. This serializes all state machine transitions on a single
thread.
Guard app_lrw_send_with_link_check() to reject requests in RECONNECT
or JOINING states. The m_link_check_timer serves both as link check
response timeout (10s) and rejoin backoff (60-3600s). A shell command
invoking send_with_link_check during RECONNECT would overwrite the
long backoff timer with a 10s timeout, causing an immediate rejoin
that bypasses the exponential backoff.
Initialize orientation to 0 instead of 0xff. When no orientation data
is available, the 0xff value was OR'd into the header at line 197,
setting the lower 8 bits and corrupting unrelated header flags.
Replace single k_sleep(60s) with a loop that feeds the watchdog each
second. The previous approach could trigger a watchdog reset before
the intended reboot, preventing diagnostic logs from being flushed.
Add flash_area_close(fa) before returning when flash_area_erase fails.
Without this, the flash area handle leaks on the error path.
Refactor app_battery_measure to use goto-based cleanup so the ADC is
always suspended even when adc_read or adc_raw_to_millivolts fails.
Previously, early returns left the ADC in the active state, wasting
power until the next measurement cycle.
m_state is written from the LoRaWAN work queue thread and read from
the main thread via app_lrw_get_state/app_lrw_is_ready. Without
atomic operations the compiler may cache the value in a register,
causing the main thread to read stale state.
The static lorawan_join_config was only zero-initialized at program
start. On subsequent calls to join_work_handler, stale fields from a
previous activation mode (e.g. OTAA pointers after switching to ABP)
could remain in the struct.
Previously, alarm state variables tracked threshold transitions even
when the corresponding alarm type was disabled. When re-enabled, the
stale state could cause an immediate spurious alarm. Now each alarm
block resets its state to false when disabled, preventing false
activations on re-enable.
Previously, app_sensor_init returned immediately on any subsystem
failure, preventing all remaining sensors from being initialized and
the sampling timer from starting. Now failures are logged but
initialization continues, so a single faulty peripheral does not
block the entire sensor pipeline.
strtoul silently converts negative strings like "-1" to large
unsigned values, making the old `value < 0` check always false.
Validate the input string does not start with '-' before parsing.
Explicitly sign-extend the 24-bit raw altitude value before shifting.
The previous expression shifted a uint32_t left by 8 and assigned to
int32_t, which is implementation-defined for negative altitudes where
bit 23 is set.
The strict equality check (reg_status != 0x0e) rejected valid data
when overwrite flags (PTOW/POW/TOW) were set in the status register.
Use a bitmask to check only the data-ready bits, accepting readings
even when an overwrite occurred during the conversion delay.
Unnamed threads appear as blank in Zephyr thread analyzer and
debugger views.
Post-increment with == caused first NFC check to occur at blink 11
instead of 10. Use pre-increment with >= for correct interval.
hex2bin returns 0 on failure; returning that to the shell signals
success. Return -EINVAL instead to properly indicate the error.
cmd_print_serial_numbers, cmd_reset_sample, and cmd_print_sample
had void return and missing argc/argv parameters. Zephyr's
SHELL_CMD_ARG expects int(shell, argc, argv).
Without endptr check, non-numeric input like "abc" silently yields 0,
which disables sampling without warning.
Reads of g_app_sensor_data fields were unprotected, risking torn
reads of float values written by the sensor work queue thread.
Without the check, a play request with no APP_LED_CMD_END as the
first command would silently execute zero iterations.
Restarting m_send_timer with K_NO_WAIT cancelled the next periodic
send, so rapid events from hall/input could starve periodic reporting.
Voltages above 5.1V would wrap around due to unsigned overflow when
cast to uint8_t.
Right-shifting signed negative values is implementation-defined in C.
Cast to uint16_t to guarantee zero-fill behavior on all compilers.
Setting boot=false before APPEND_BYTE calls meant the boot flag
would be lost if the buffer was too small and the function returned
-ENOSPC. Move it to just before return 0.
ABP join completes immediately and triggers send_work, which could
call app_sensor_sample before sensors are initialized. Deferring
join until after sensor init avoids this race.
Starting periodic sampling after init failures could cause repeated
error logs from failed sensor reads.
When get_tilt_alert failed, the continue statement discarded valid
hygrometer temperature and humidity data. Store hygrometer values
first, then attempt tilt-alert separately.
Early returns on GPIO errors could leave hall pins in PULL_UP state,
causing ~50uA continuous leakage until the next successful poll cycle.
The DS28E17 echoes the configuration register after a write. Read
it back and verify it matches to detect bit-flip or bus errors.
The unified struct request union forced every app_led_blink caller to
allocate ~400 bytes for the play variant. Split into separate blink
(~24 bytes) and play queues with a shared semaphore to wake the
thread.
Hall/input counts from g_app_sensor_data could be from a different
sample cycle than the notify flags from app_hall_get_data. Use the
snapshot's own counts to keep data consistent.
A new event between get_data and clear_notify_flags would have its
flag immediately cleared without being reported. Add atomic
get_data_and_clear_notify functions that snapshot and clear under
a single mutex hold.
Use net_buf_simple for type-safe serialization instead of a hand-rolled
macro with per-byte bounds checking. This replaces manual big-endian byte
shifting with idiomatic net_buf_simple_add_be32/be16/u8 calls.
Native 32-bit type is idiomatic, eliminates overflow concerns, and
generates slightly smaller code on Cortex-M4 (no byte-masking).
Move irq_lock/unlock inside the bitbang loop so IRQs are only disabled
for a single bit frame (~100us). k_sched_lock prevents preemption for
the full sequence while allowing ISRs to fire between bits.
The cmd_reset_sample shell command only zeroed g_app_sensor_data
counters, but module-local m_hall_data/m_input_data were unchanged.
On the next sample cycle the old counts were copied back. Add
app_hall_reset_counts() and app_input_reset_counts() to zero the
authoritative module-local counters.
The SI7210 needs ~3ms to complete an ADC conversion after ONEBURST.
Without a delay the DSPSIGM/DSPSIGL registers return stale data from
the previous measurement cycle.
PIR detector and general-purpose inputs share physical GPIO pins
(GPIOA-11, GPIOB-4). When both are enabled via NFC configuration,
PIR takes priority and input initialization is skipped with a warning.
@hubpav hubpav requested a review from Hymbajs February 17, 2026 06:25
@hubpav hubpav self-assigned this Feb 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments