mwc is a POSIX-based multithreaded word count program written in C. It is designed as a learning-focused systems project that explores:
- File I/O using
pread - Correct word counting across chunk boundaries
- Thread-level parallelism with
pthreads - Avoiding shared mutable state
- Benchmarking and performance considerations
The project is conceptually similar to wc -w, but is intentionally implemented from first principles to understand how such tools work internally.
Given a large text file, count the total number of words using multiple threads.
A word is defined as a contiguous sequence of non-whitespace characters. Any whitespace character (space, newline, tab, etc.) is treated as a delimiter.
The key challenge is:
How do you split a file across multiple threads without double-counting or missing words at chunk boundaries?
- The file is split into byte-range chunks based on the number of available CPU cores.
- Each chunk is processed by a separate worker thread.
- Threads read their assigned file regions independently using
pread. - Each thread produces a local word count.
- The main thread joins all workers and aggregates the results.
No thread shares mutable state with another thread.
Instead of trying to detect word ends, this project counts word starts.
A word is counted only when its first character is encountered.
This is implemented using the following logic:
-
A character starts a word if:
- It is not whitespace, and
- The previous character is whitespace (or the character is at file offset 0)
- Each word has exactly one starting character
- Words spanning multiple buffers or chunks are counted exactly once
- No need to peek into the next chunk
This approach completely avoids complex boundary-adjustment logic.
Each worker thread is given:
offset: starting byte offset in the filesize: number of bytes it is responsible for
To correctly handle the first word in a chunk:
- If
offset == 0, the thread may count the first word - Otherwise, the thread inspects the previous byte (
offset - 1) to determine whether the current position starts a new word or is a continuation
This guarantees:
- No double-counting
- No missing words
-
Threads are created using
pthread_create -
Each thread:
- Reads from the file using
pread - Maintains its own local buffer
- Produces a local word count
- Reads from the file using
preadreads from an explicit file offset- It does not modify the file descriptor's shared offset
- Multiple threads can safely read from the same file descriptor concurrently
No locks or synchronization primitives are required.
- File size is determined using
stat - The file is logically divided into chunks based on CPU core count
- Each thread reads its chunk incrementally using a fixed-size buffer (
READ_BUFFER_SIZE)
The program never loads the entire file into memory. It streams data in small buffers, making it suitable for very large files.
Compile with a POSIX-compliant compiler:
gcc -O2 -pthread mwc.c -o mwcOptimization (-O2) is recommended for realistic performance measurements.
long count = mwc("large_file.txt");The mwc function:
- Opens the file
- Spawns worker threads
- Waits for completion
- Returns the total word count
This project is intended to explore:
- Correct parallelization of sequential data
- Thread-safe file I/O in POSIX systems
- Boundary conditions in text processing
This project is provided for educational purposes. Use and modify freely.