Matrix Multiplication

Exploiting parallelism for higher efficiency

Source Code:
View the project on GitHub

Overview

This project focuses on implementing matrix multiplication with an emphasis on parallel execution to improve performance and efficiency. Matrix multiplication is a core computational kernel used extensively in graphics processing, machine learning, and scientific computing, making it an ideal workload for exploring hardware-level parallelism.

The goal of this work is to analyze how parallelization strategies can significantly reduce computation time compared to a purely sequential implementation, and to understand the trade-offs among hardware resource usage, throughput, and scalability.


Motivation

Matrix multiplication is inherently parallel: each output element can be computed independently, as the dot product of one row of the left operand and one column of the right. This property makes it a foundational example for:

  • GPU-style parallel computation
  • SIMD/SIMT execution models
  • Accelerator and FPGA-based designs

This project uses matrix multiplication as a test case for studying how parallel compute units can be structured and coordinated to achieve higher performance.
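Concretely, output element C[i][j] = Σₖ A[i][k] · B[k][j] depends only on row i of A and column j of B, never on any other element of C. As a minimal illustration of that independence (the names, types, and float element choice below are illustrative, not taken from the project source), each element can be produced by a standalone dot-product routine:

    #include <vector>
    #include <cstddef>

    using Matrix = std::vector<std::vector<float>>;

    // One output element C[i][j] is the dot product of row i of A and
    // column j of B. It depends on no other output element, so all
    // elements of C can, in principle, be computed in parallel.
    float dot_row_col(const Matrix& A, const Matrix& B,
                      std::size_t i, std::size_t j) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < B.size(); ++k)
            acc += A[i][k] * B[k][j];  // multiply-accumulate step
        return acc;
    }

    // Sequential baseline: fill C one element at a time.
    Matrix matmul_sequential(const Matrix& A, const Matrix& B) {
        Matrix C(A.size(), std::vector<float>(B[0].size(), 0.0f));
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < B[0].size(); ++j)
                C[i][j] = dot_row_col(A, B, i, j);
        return C;
    }

Because no output element depends on another, the two loops over i and j in the baseline can be distributed across parallel units in any order.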


Approach

The implementation explores:

  • Decomposing matrix multiplication into independent operations
  • Executing multiple multiply–accumulate operations in parallel
  • Structuring computation to maximize data reuse and throughput
  • Comparing parallel execution against a baseline sequential approach

The design emphasizes clarity and correctness while progressively introducing parallelism to demonstrate measurable performance gains.
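As a software analogue of this decomposition, the sketch below distributes disjoint blocks of output rows across std::thread workers. The project itself reasons about hardware parallel units, so this is an analogy rather than the project's actual implementation (the Matrix alias repeats the one from the earlier sketch):

    #include <thread>
    #include <vector>
    #include <algorithm>
    #include <cstddef>

    using Matrix = std::vector<std::vector<float>>;

    // Row-parallel multiply: each worker owns a disjoint block of output
    // rows, so no synchronization is needed while computing.
    Matrix matmul_parallel(const Matrix& A, const Matrix& B,
                           unsigned num_workers) {
        const std::size_t rows = A.size(), cols = B[0].size(), inner = B.size();
        Matrix C(rows, std::vector<float>(cols, 0.0f));

        auto worker = [&](std::size_t row_begin, std::size_t row_end) {
            for (std::size_t i = row_begin; i < row_end; ++i)
                for (std::size_t j = 0; j < cols; ++j) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < inner; ++k)
                        acc += A[i][k] * B[k][j];
                    C[i][j] = acc;  // each worker writes only its own rows
                }
        };

        std::vector<std::thread> pool;
        const std::size_t chunk = (rows + num_workers - 1) / num_workers;
        for (unsigned w = 0; w < num_workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(rows, begin + chunk);
            if (begin < end) pool.emplace_back(worker, begin, end);
        }
        for (auto& t : pool) t.join();
        return C;
    }

Because each worker writes a disjoint set of rows of C, no locks are needed during computation; the only synchronization point is the final join.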


Parallelism Strategy

Key parallelization concepts explored include:

  • Parallel computation of multiple output elements
  • Independent processing of rows and columns
  • Synchronization of partial results
  • Trade-offs between parallel unit count and resource usage

These strategies mirror the techniques used in GPU compute pipelines and hardware accelerators.
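The synchronization point above has a direct analogue when the inner (k) dimension, rather than the output, is split across units: each unit produces a partial sum that must be merged safely. A hedged sketch of that pattern, again using threads in place of the project's hardware units:

    #include <thread>
    #include <mutex>
    #include <vector>
    #include <cstddef>

    // Split the inner dimension across two workers for a single output
    // element; each produces a partial dot product that is merged under a
    // lock. In hardware this merge would be an adder tree or reduction stage.
    float dot_split_k(const std::vector<float>& row,
                      const std::vector<float>& col) {
        float total = 0.0f;
        std::mutex m;
        const std::size_t mid = row.size() / 2;

        auto partial = [&](std::size_t begin, std::size_t end) {
            float acc = 0.0f;
            for (std::size_t k = begin; k < end; ++k)
                acc += row[k] * col[k];
            std::lock_guard<std::mutex> lock(m);  // synchronize the merge
            total += acc;
        };

        std::thread t1(partial, 0, mid);
        std::thread t2(partial, mid, row.size());
        t1.join();
        t2.join();
        return total;
    }

The cost of this merge stage is one place where the trade-off between parallel unit count and resource usage becomes visible: more units produce more partial results to combine.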


Results and Observations

Through parallel execution:

  • Computation time is reduced relative to a single-lane implementation
  • Performance scales with the number of parallel operations available
  • Hardware complexity increases as more parallel units are introduced

This highlights the classic performance–resource trade-off in hardware design.
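These observations can be reproduced with a simple wall-clock comparison. A minimal harness along the following lines, reusing the matmul_sequential and matmul_parallel sketches from earlier sections (the matrix size and worker count are arbitrary illustrative choices):

    #include <chrono>
    #include <iostream>
    #include <vector>
    #include <cstddef>

    int main() {
        const std::size_t n = 512;  // illustrative size
        Matrix A(n, std::vector<float>(n, 1.0f));
        Matrix B(n, std::vector<float>(n, 2.0f));

        // Time a callable with a steady (monotonic) clock.
        auto time = [](auto&& fn) {
            auto start = std::chrono::steady_clock::now();
            fn();
            auto stop = std::chrono::steady_clock::now();
            return std::chrono::duration<double>(stop - start).count();
        };

        double seq = time([&] { matmul_sequential(A, B); });
        double par = time([&] { matmul_parallel(A, B, 8); });
        std::cout << "sequential: " << seq << " s\n"
                  << "parallel:   " << par << " s\n"
                  << "speedup:    " << seq / par << "x\n";
    }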


Skills and Concepts Demonstrated

  • Parallel algorithm design
  • Hardware-oriented thinking for compute-intensive workloads
  • Understanding of matrix multiplication as a fundamental compute kernel
  • Performance analysis and comparison of sequential vs. parallel execution
  • Foundational concepts used in GPU and accelerator architectures

Future Work

Potential extensions of this project include:

  • Scaling to larger matrix sizes
  • Introducing pipelining alongside parallel execution
  • Mapping the design to FPGA or GPU-style hardware architectures
  • Exploring memory bandwidth and data locality optimizations (see the tiling sketch after this list)
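For the data-locality item, one common direction is cache blocking (tiling): computing the product over small sub-blocks so the working set of all three matrices stays in fast memory. A rough sketch of what that could look like, with the tile size as a tunable assumption:

    #include <vector>
    #include <algorithm>
    #include <cstddef>

    using Matrix = std::vector<std::vector<float>>;

    // Blocked (tiled) multiply: iterate over tile x tile sub-blocks so the
    // active portions of A, B, and C fit in cache, improving data reuse.
    Matrix matmul_tiled(const Matrix& A, const Matrix& B,
                        std::size_t tile = 64) {
        const std::size_t rows = A.size(), cols = B[0].size(), inner = B.size();
        Matrix C(rows, std::vector<float>(cols, 0.0f));
        for (std::size_t ii = 0; ii < rows; ii += tile)
            for (std::size_t kk = 0; kk < inner; kk += tile)
                for (std::size_t jj = 0; jj < cols; jj += tile)
                    // Accumulate the contribution of block (ii,kk) x (kk,jj).
                    for (std::size_t i = ii; i < std::min(rows, ii + tile); ++i)
                        for (std::size_t k = kk; k < std::min(inner, kk + tile); ++k) {
                            const float a = A[i][k];
                            for (std::size_t j = jj; j < std::min(cols, jj + tile); ++j)
                                C[i][j] += a * B[k][j];
                        }
        return C;
    }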