COSCUP
motivation/end-game
reduce risk/cost of someone creating a startup to rival nvidia
seek out a large company, such as Facebook, and get them to tape it out and manufacture it
create a startup oneself, using the project as evidence of domain expertise
design goals
work with PyTorch
deep learning framework
extensively used in industry and research
specialized in machine learning
eliminate all die space not necessary for ML
only BF16; no FP16, no FP32, no FP64 (conversion sketch below)
only the fp operations needed by ml, e.g. exp, log, tanh
goal is to keep tape-out costs down, and yield up
fast
use primarily opensource tools, or at least, free tools, where possible
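aside: bf16 is fp32 with the mantissa truncated from 23 bits to 7, keeping the same 8-bit exponent; that is what makes it cheap in die area. a minimal conversion sketch in c++ (truncating for brevity; real hardware would likely round-to-nearest-even):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // bf16 is the top 16 bits of an IEEE-754 fp32: sign, 8-bit exponent,
    // and the 7 most significant mantissa bits.
    static uint16_t fp32_to_bf16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        return static_cast<uint16_t>(bits >> 16);  // truncate low mantissa bits
    }

    static float bf16_to_fp32(uint16_t b) {
        uint32_t bits = static_cast<uint32_t>(b) << 16;  // zero-fill low bits
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main() {
        float x = 3.14159f;
        // round-trips to ~3.14: same range as fp32, fewer mantissa bits
        printf("%f -> %f\n", x, bf16_to_fp32(fp32_to_bf16(x)));
    }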
design
work with pytorch
could modify pytorch
but lots of dev needed
could use opencl
open standard
not currently supported by pytorch => would need dev effort
slow, because it has to support so many types of hardware, and requirements from many consortium members
could use nvidia cuda
tempting: already works with pytorch
unclear how to defend against IP-related cease-and-desist letters from nvidia
decision: use AMD HIP
works with pytorch already
opensource (MIT) license
almost the same as CUDA, so relatively standard
using RISC-V ISA for kernels
currently using zfinx extension, so same register file for floats and ints
might need to break with RISC-V somewhat when switching to use BF16 (currently using FP32, to get things working)
intend to use drop-in third-party IP for memory controller, and PCIe connection to computer
since these are relatively standard, anyone taping out the GPU could simply drop these in, ideally
avoids needing to reinvent the wheel
intend to use AXI4 for connection
though this doesn't seem sufficient to fully specify the interface, so there will be some controller-specific work to do
open question: is a network-on-chip needed?
are GPUs sufficiently hierarchical that one is unnecessary?
verification
creating unit tests as I go along
most run in iverilog
migrating tests to work in verilator too
challenge: 'x' values, uninitialized state
not caught by simulators by default
with iverilog, can use gate level simulation to catch many errors
but some still get through
with verilator, can use random initialization
this seems pretty effective: it found bugs that iverilog with GLS did not (harness sketch below)
sometimes tricky to track down the bug, but at least know there is one
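for illustration, a minimal verilator harness in the style used here; the module and signal names (top, clk, rst) are placeholders, while the flags in the comments are standard verilator options:

    // build (placeholder file names):
    //   verilator --cc --exe --build --x-initial unique top.v sim_main.cpp
    // run with randomly-initialized state, to flush out missing resets:
    //   ./obj_dir/Vtop +verilator+rand+reset+2
    #include <verilated.h>
    #include "Vtop.h"  // model generated by verilator from top.v

    int main(int argc, char** argv) {
        Verilated::commandArgs(argc, argv);  // picks up +verilator+... flags
        Vtop top;
        top.clk = 0;
        top.rst = 1;
        for (int tick = 0; tick < 200; ++tick) {
            top.clk = !top.clk;
            top.eval();
            if (tick == 10) top.rst = 0;  // release reset after a few cycles
        }
        top.final();
        return 0;
    }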
performance
using yosys to synthesize down to gate-level netlist
using a freely-available 90nm cell library (SAED_EDK90)
use custom script to walk the netlist (sketch of the idea below)
calculate propagation delay, in units of 'nand gate propagation delay'
calculate die area, in units of 'nand gate die area'
measure cycle count for various test programs
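the delay calculation amounts to a longest-path walk over the gate-level netlist, viewed as a DAG of cells whose delays are normalized so a nand gate is 1.0; a toy sketch of the idea in c++ (not the project's actual script, which reads the yosys output):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Cell {
        double delay;             // in units of 'nand gate propagation delay'
        std::vector<int> fanins;  // indices of the cells driving this one
    };

    // latest arrival time at the output of cell i (memoized DFS)
    double arrival(const std::vector<Cell>& netlist, std::vector<double>& memo, int i) {
        if (memo[i] >= 0) return memo[i];
        double latest = 0.0;
        for (int f : netlist[i].fanins)
            latest = std::max(latest, arrival(netlist, memo, f));
        return memo[i] = latest + netlist[i].delay;
    }

    int main() {
        // toy 3-cell netlist; the delays here are made up
        std::vector<Cell> netlist = {
            {1.0, {}},      // cell 0: nand fed by primary inputs
            {1.0, {}},      // cell 1: nand fed by primary inputs
            {1.4, {0, 1}},  // cell 2: nor driven by cells 0 and 1
        };
        std::vector<double> memo(netlist.size(), -1.0);
        double critical = 0.0;
        for (size_t i = 0; i < netlist.size(); ++i)
            critical = std::max(critical, arrival(netlist, memo, (int)i));
        printf("critical path: %.1f nand-delays\n", critical);  // 2.4
    }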
CI server
CircleCI, free for opensource projects
runs verification and performance tests at each commit
opensource tooling used
icarus verilog
4-state simulator
good points
very easy to use
can write test cases in verilog
generates/compiles code very quickly
bad points
strict GPLv2 license
potentially an issue for using VPI
limited support for system verilog
verilator
2-state simulator
good points
great reputation in industry
runs quickly
relatively unrestrictive license (LGPLv2)
detects initialization errors reliably, using random initialization
easy to link with C++
system verilog support underway
bad points
lots of system verilog functionality missing
cannot write standard verilog test bench unit tests
compilation/generation slow, hard to configure
yosys
awesome opensource synthesizer
good points
very reliable
haven't found a bug in it yet
can handle a diverse space of verilog code
bad points
limited support for system verilog
opensource tooling not used
sv2v
converts system verilog code to verilog code
which can be given to other tools
good points
supports system verilog
bad points
whenever something goes wrong, throws you into generated verilog code
I'd like to be thrown into the sv code, ideally
opensta / opentimer
can be used to measure propagation delay
I found it hard to use
gaps/opportunities in open source tooling
systemverilog support
opensource DDR5 or 6 controller, with a full verification suite, as standard an interface as possible, and some kind of 'proxy' module for running simulations
same for PCIe controller
is there an opensource way to create a bmem (block memory) block?
make chip layout easy to run
I tried qflow, but found it hard to use
maybe as easy as adding noob documentation to qflow?
probably need some tests for various edge-cases though, perhaps with a CI server
current status
basic gpu core
present
basic int and fp arithmetic
unified register file using zfinx instructions
RISC-V based
good propagation delay
parallel instructions
todo
migrate to bf16
other fp operations: exp, log, tanh (sketch of one approach below)
warps
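for a flavour of the exp work: one standard approach is range reduction, exp(x) = 2^(n+f) with f in [-0.5, 0.5], plus a short polynomial for 2^f; a rough fp32 sketch (illustrative only, not the design's actual algorithm or coefficients):

    #include <cmath>
    #include <cstdio>

    float exp_sketch(float x) {
        const float log2e = 1.44269504f;
        float t = x * log2e;             // exp(x) = 2^t
        float n = std::floor(t + 0.5f);  // integer part: handled by exponent field
        float f = t - n;                 // fractional part, in [-0.5, 0.5]
        // 2^f = e^g with g = f*ln2; 3rd-order Taylor is plenty for bf16
        float g = f * 0.6931472f;
        float p = 1.0f + g * (1.0f + g * (0.5f + g * (1.0f / 6.0f)));
        return std::ldexp(p, (int)n);    // scale by 2^n: exponent adjust only
    }

    int main() {
        float xs[] = {-2.0f, 0.0f, 1.0f, 3.0f};
        for (float x : xs)
            printf("exp_sketch(%.1f)=%.5f  std::exp=%.5f\n", x, exp_sketch(x), std::exp(x));
    }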
gpu controller
working
can be used to copy memory to/from the host
can be used to launch kernels
single source c++
can compile and run single-source c++ programs, containing both host and kernel code (example sketched below)
uses clang to factor the code into host and device code
uses clang to transform c++ into llvm ir
uses clang to transform llvm ir into RISC-V assembler
we provide a HIP hostside runtime
handles memory management
handles launching kernels
handles communications with gpu controller
we handle rewriting the LLVM IR to call into our HIP hostside runtime
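for illustration, what a single-source program looks like; this is standard HIP, and how much of the API the hostside runtime covers so far is an assumption, so treat it as a sketch:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // device code: clang lowers this to llvm ir, then to RISC-V assembler
    __global__ void add_one(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main() {
        const int n = 32;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        // host code: these calls go through the HIP hostside runtime,
        // which talks to the gpu controller
        float* dev = nullptr;
        hipMalloc(&dev, n * sizeof(float));
        hipMemcpy(dev, host, n * sizeof(float), hipMemcpyHostToDevice);
        hipLaunchKernelGGL(add_one, dim3(1), dim3(n), 0, 0, dev, n);
        hipDeviceSynchronize();
        hipMemcpy(host, dev, n * sizeof(float), hipMemcpyDeviceToHost);
        hipFree(dev);

        printf("host[5] = %f\n", host[5]);  // expect 6.0
    }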
PyTorch
we can allocate gpu memory from pytorch (sketch at the end of this section)
need to look into building pytorch from scratch, in order to rebuild all kernels as RISC-V
need to add an interface to the ddr controller, and to the pcie controller
might need network-on-chip (undecided currently)
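as an aside on the pytorch allocation: since the runtime presents as HIP, allocation should follow the standard path; a sketch using libtorch, pytorch's c++ frontend (assuming the HIP backend appears under pytorch's 'cuda' device type, as it does for ROCm builds):

    #include <torch/torch.h>
    #include <iostream>

    int main() {
        // on a ROCm/HIP build of pytorch, HIP devices appear as device type 'cuda'
        torch::Tensor t = torch::zeros({64, 64},
            torch::TensorOptions().device(torch::kCUDA).dtype(torch::kBFloat16));
        std::cout << "allocated on " << t.device() << std::endl;
    }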