GPU Prototype
Ongoing project building a basic parallel GPU-style accelerator on the iCEstick iCE40 FPGA
Overview
This is an ongoing hardware project focused on building a very basic GPU-style accelerator on the iCEstick (iCE40 FPGA). The main goal is to explore how GPU-like speedups come from parallelization, even with limited FPGA resources, by running many small processing elements in parallel on the same workload.
Rather than targeting a full graphics pipeline, this project centers on designing a minimal, educational “GPU core” that demonstrates the key ideas behind GPUs: SIMD/SIMT-style execution, parallel compute units, and high-throughput operations.
Project Goals
- Implement a small set of parallel compute lanes (processing elements) on the iCE40
- Create a simple instruction/control model (broadcast control with lane-level execution)
- Build a lightweight memory interface suitable for small test workloads
- Demonstrate measurable speedup vs. a single-lane design on FPGA
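The broadcast-control model in the goals above can be sketched as a small software reference. This is an illustrative Python model, not the project's RTL; the names (`dispatch`, `OPS`) and the 8-bit lane width are assumptions for the sketch.

```python
# Minimal software sketch of broadcast ("SIMT-style") execution:
# one opcode is issued per step, and every lane applies it to its
# own data element. 8-bit lanes are an assumption for illustration.

OPS = {
    "ADD": lambda a, b: (a + b) & 0xFF,
    "MUL": lambda a, b: (a * b) & 0xFF,
}

def dispatch(op, lanes_a, lanes_b):
    """Broadcast one opcode; each lane computes on its own operands."""
    fn = OPS[op]
    return [fn(a, b) for a, b in zip(lanes_a, lanes_b)]

# Four lanes executing the same ADD on different data:
result = dispatch("ADD", [1, 2, 3, 4], [10, 20, 30, 40])
# result == [11, 22, 33, 44]
```

The key point the hardware version shares with this sketch is that control (the opcode) is issued once, while execution happens per lane.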
Architecture (Work in Progress)
Planned high-level blocks:
- Controller / Dispatcher: issues operations to multiple compute lanes in parallel (GPU-like “warp” control at a very small scale)
- Parallel Compute Lanes: multiple identical ALU-style units that run the same operation across different data elements
- On-chip Memory / Buffers: simple scratchpad-style buffers for storing inputs/outputs (within iCE40 constraints)
- Output / Verification Interface: a basic way to inspect results (UART and/or LED/debug signals, depending on the build)
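One way to think about the scratchpad block is as one independent buffer bank per lane, so all lanes can read and write in the same cycle. The following Python model is a hypothetical sketch of that arrangement; the 4-lane width, 64-entry depth, and class/method names are assumptions, not final RTL.

```python
# Sketch of a banked scratchpad: one bank per lane, so a single
# address broadcast by the controller yields one operand per lane.
# Lane count and depth are illustrative assumptions.

class Scratchpad:
    def __init__(self, lanes=4, depth=64):
        self.banks = [[0] * depth for _ in range(lanes)]

    def read(self, addr):
        """All lanes read the same address, each from its own bank."""
        return [bank[addr] for bank in self.banks]

    def write(self, addr, values):
        """Each lane writes its value to the same address in its bank."""
        for bank, v in zip(self.banks, values):
            bank[addr] = v

sp = Scratchpad()
sp.write(0, [10, 20, 30, 40])
# sp.read(0) == [10, 20, 30, 40]
```

Banking per lane avoids a shared-port bottleneck, at the cost of using more of the iCE40's limited block RAM.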
Constraints
The iCEstick iCE40 platform is intentionally resource-limited, so this project emphasizes:
- Small, efficient RTL blocks
- Careful tradeoffs between lane count and timing/resource usage
- Simple memory patterns and constrained bandwidth
- Step-by-step verification before scaling complexity
Testing and Verification
Current / planned verification approach:
- Unit tests for each lane (ALU ops, registers, control)
- Small vector workloads to validate parallel correctness
- Comparison against a software reference model
- Timing/resource checks after each major feature addition
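The reference-model comparison above can be sketched as follows. Both functions here are illustrative Python stand-ins (in practice the lane side would be output captured from an RTL simulation), but the structure of the check is the same: run the workload through both paths and compare element by element.

```python
# Checking parallel correctness against a software golden model:
# the lane model processes the vector one "warp" of `lanes`
# elements at a time; the golden model is plain element-wise Python.
# Function names and the 8-bit width are assumptions for the sketch.

def golden_vector_add(a, b):
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def lane_vector_add(a, b, lanes=4):
    out = []
    for i in range(0, len(a), lanes):          # one warp per step
        chunk_a, chunk_b = a[i:i + lanes], b[i:i + lanes]
        out.extend((x + y) & 0xFF for x, y in zip(chunk_a, chunk_b))
    return out

a = list(range(16))
b = [250] * 16                                 # exercises 8-bit wraparound
assert lane_vector_add(a, b) == golden_vector_add(a, b)
```

Deliberately including wraparound cases (as with `250` above) catches width-mismatch bugs that small test values would miss.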
Output and Demo Plan
Planned demo outputs include:
- Running small parallel kernels (e.g., vector add, dot product variants, simple image-style operations)
- Showing correctness via printed/serial output or debug traces
- Comparing runtime/throughput between:
  - a 1-lane “CPU-like” version
  - a multi-lane “GPU-like” parallel version
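A back-of-the-envelope model of the expected speedup: with N elements and L lanes issuing one operation per cycle, the parallel version needs roughly ceil(N / L) issue cycles versus N for the single-lane build. This sketch ignores memory stalls and control overhead, which the real FPGA measurement must include; the function name and the 256-element/8-lane figures are illustrative assumptions.

```python
# First-order issue-cycle model: ideal speedup is N / ceil(N / L),
# i.e. close to L when N is a multiple of the lane count.

from math import ceil

def issue_cycles(n_elements, lanes):
    return ceil(n_elements / lanes)

N = 256
print(issue_cycles(N, 1))   # 1-lane "CPU-like" version: 256 cycles
print(issue_cycles(N, 8))   # 8-lane "GPU-like" version: 32 cycles
```

Comparing the measured ratio against this ideal model is a simple way to quantify how much of the speedup is lost to memory bandwidth and control overhead.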
Status
In progress.
This page will be updated as the RTL, testing harness, and demo workloads mature.