Multithreaded Word Count (mwc)

mwc is a POSIX-based multithreaded word count program written in C. It is designed as a learning-focused systems project that explores:

File I/O using pread
Correct word counting across chunk boundaries
Thread-level parallelism with pthreads
Avoiding shared mutable state
Benchmarking and performance considerations

The project is conceptually similar to wc -w, but is intentionally implemented from first principles to understand how such tools work internally.

Problem Statement

Given a large text file, count the total number of words using multiple threads.

A word is defined as a contiguous sequence of non-whitespace characters. Any whitespace character (space, newline, tab, etc.) is treated as a delimiter.

The key challenge is:

How do you split a file across multiple threads without double-counting or missing words at chunk boundaries?

Design Overview

High-level approach

The file is split into byte-range chunks based on the number of available CPU cores.
Each chunk is processed by a separate worker thread.
Threads read their assigned file regions independently using pread.
Each thread produces a local word count.
The main thread joins all workers and aggregates the results.

No thread shares mutable state with another thread.

Word Counting Strategy

Instead of trying to detect word ends, this project counts word starts.

Core rule

A word is counted only when its first character is encountered.

This is implemented using the following logic:

A character starts a word if:
- It is not whitespace, and
- The previous character is whitespace (or the character is at file offset 0)

Why this works

Each word has exactly one starting character
Words spanning multiple buffers or chunks are counted exactly once
No need to peek into the next chunk

This approach completely avoids complex boundary-adjustment logic.

Handling Chunk Boundaries

Each worker thread is given:

offset: starting byte offset in the file
size: number of bytes it is responsible for

To correctly handle the first word in a chunk:

If offset == 0, the thread may count the first word
Otherwise, the thread inspects the previous byte (offset - 1) to determine whether the current position starts a new word or is a continuation

This guarantees:

No double-counting
No missing words

Concurrency Model

Threads are created using pthread_create
Each thread:
- Reads from the file using pread
- Maintains its own local buffer
- Produces a local word count

Why `pread`?

pread reads from an explicit file offset
It does not modify the file descriptor's shared offset
Multiple threads can safely read from the same file descriptor concurrently

No locks or synchronization primitives are required.

File I/O

File size is determined using stat
The file is logically divided into chunks based on CPU core count
Each thread reads its chunk incrementally using a fixed-size buffer (READ_BUFFER_SIZE)

The program never loads the entire file into memory. It streams data in small buffers, making it suitable for very large files.

Building

Compile with a POSIX-compliant compiler:

gcc -O2 -pthread mwc.c -o mwc

Optimization (-O2) is recommended for realistic performance measurements.

Usage

long count = mwc("large_file.txt");

The mwc function:

Opens the file
Spawns worker threads
Waits for completion
Returns the total word count

Learning Goals

This project is intended to explore:

Correct parallelization of sequential data
Thread-safe file I/O in POSIX systems
Boundary conditions in text processing

License

This project is provided for educational purposes. Use and modify freely.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
include		include
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multithreaded Word Count (mwc)

Problem Statement

Design Overview

High-level approach

Word Counting Strategy

Core rule

Why this works

Handling Chunk Boundaries

Concurrency Model

Why `pread`?

File I/O

Building

Usage

Learning Goals

License

About

Uh oh!

Releases

Packages

Languages

suyash-purwar/mwc

Folders and files

Latest commit

History

Repository files navigation

Multithreaded Word Count (mwc)

Problem Statement

Design Overview

High-level approach

Word Counting Strategy

Core rule

Why this works

Handling Chunk Boundaries

Concurrency Model

Why pread?

File I/O

Building

Usage

Learning Goals

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Why `pread`?

Packages