coscup

motivation/end-game

reduce risk/cost of someone creating a startup to rival nvidia

seek out a large company, such as Facebook, and get them to tape it out and manufacture it

create a startup oneself, using the project as evidence of domain expertise

design goals

work with PyTorch

deep learning framework

extensively used in industry and research

specialized in machine learning

eliminate all die space not necessary for ML

only BF16; no FP16, no FP32, no FP64

only the fp operations needed by ML, e.g. exp, log, tanh

goal is to keep tape-out costs down, and yield up
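The BF16-only choice is easy to see in code: BF16 is just FP32 with the low 16 mantissa bits dropped, so conversion is a truncation of the FP32 bit pattern. A minimal Python sketch (illustrative, not project code; the round-to-nearest-even step is an assumption — hardware might simply truncate):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Convert a float to its BF16 bit pattern.

    BF16 keeps FP32's sign bit and all 8 exponent bits, but only the
    top 7 mantissa bits, so conversion drops the low 16 bits of the
    FP32 encoding (here with round-to-nearest-even on the dropped bits).
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Expand a BF16 bit pattern to FP32 by zero-filling the low mantissa bits."""
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]
```

Because BF16 keeps FP32's exponent range, the conversion never overflows; only mantissa precision is lost, which ML training tolerates well.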

fast

use primarily opensource tools, or at least free tools, where possible


design

work with pytorch

could modify pytorch

but lots of dev needed

could use opencl

open standard

not currently supported by pytorch => would need dev effort

slow, because it has to support so many types of hardware, and requirements from many consortium members

could use nvidia cuda

tempting: already works with pytorch

unclear how to defend against IP related C&Ds from nvidia

use AMD HIP

works with pytorch already

opensource (MIT) license

almost the same as CUDA, so relatively standard

Using RISC-V ISA for kernels

currently using zfinx extension, so same register file for floats and ints

might need to break with RISC-V somewhat when switching to use BF16 (currently using FP32, to get things working)
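Targeting the RISC-V ISA means kernel instructions follow the standard fixed-width encodings; with zfinx, float instructions address the same x-registers as integer ones. A quick Python sanity check of the standard R-type encoding (illustrative, not project code):

```python
def encode_r_type(funct7: int, rs2: int, rs1: int,
                  funct3: int, rd: int, opcode: int) -> int:
    """Pack the fields of a RISC-V R-type instruction into a 32-bit word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
            (funct3 << 12) | (rd << 7) | opcode)

# add x3, x1, x2  (OP opcode 0b0110011, funct3=000, funct7=0000000)
word = encode_r_type(0b0000000, 2, 1, 0b000, 3, 0b0110011)
```

With zfinx, a float add reuses this same register-index layout, which is what lets the core keep a single unified register file.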

intend to use drop-in third-party IP for memory controller, and PCIe connection to computer

since relatively standard, anyone taping out the GPU could simply drop these in, ideally

avoids needing to reinvent the wheel

intend to use AXI4 for connection

though this doesn't seem sufficient to fully specify the interface, so there will be some controller-specific work to do

question in my mind: is a network-on-chip needed?

are GPUs sufficiently hierarchical that there's no need?

verification

creating unit tests as I go along

most run in iverilog

migrating tests to work in verilator too

challenge: 'x' values, uninitialized state

not caught by simulators by default

with iverilog, can use gate level simulation to catch many errors

but some still get through

with verilator, can use random initialization

this seems pretty effective; found bugs that iverilog with GLS did not

sometimes tricky to track down the bug, but at least know there is one
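Why random initialization catches these bugs can be seen with a toy model (conceptual Python, not a real simulator): a register with a forgotten reset happens to hold the "right" value when the simulator zero-fills state, so the bug is invisible; random-filling the same state exposes it.

```python
import random

def run_accumulator(init_value: int, inputs) -> int:
    """Toy model of hardware with a missing reset.

    A 2-state simulator must pick a concrete start value for every
    register. Zero-filling happens to match the intended reset value,
    hiding the missing reset; a random fill exposes it.
    """
    acc = init_value & 0xFFFF  # BUG: reset logic should force acc = 0
    for v in inputs:
        acc = (acc + v) & 0xFFFF  # 16-bit accumulator
    return acc

inputs = [1, 2, 3]
assert run_accumulator(0, inputs) == 6  # zero-fill: bug hidden

random.seed(0)
rand_init = random.getrandbits(16) | 1  # guaranteed nonzero start value
assert run_accumulator(rand_init, inputs) != 6  # random fill: bug caught
```

A 4-state simulator would start the register as 'x' instead, which is why iverilog's gate-level simulation catches many (but not all) of the same errors.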

performance

using yosys to synthesize down to gate-level netlist

using an opensource 90nm cell library (SAED_EDK90)

use custom script to walk the netlist

calculate propagation delay, in units of 'nand gate propagation delay'

calculate die area, in units of 'nand gate die area'

measure cycle count for various test programs
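The netlist-walking script is essentially a longest-path computation over a DAG of gates, with each cell's delay expressed in nand-delay units. A simplified sketch of the idea (the gate graph and delay table here are made up for illustration; the real script walks the yosys netlist with SAED_EDK90 cells):

```python
from functools import lru_cache

# per-cell propagation delay, in units of one nand gate's delay
# (illustrative numbers, not the real cell-library values)
GATE_DELAY = {'nand': 1.0, 'nor': 1.2, 'inv': 0.6}

# netlist as a DAG: gate name -> (cell type, list of driving gates);
# 'input' marks primary inputs with zero delay
netlist = {
    'a':  ('input', []),
    'b':  ('input', []),
    'g1': ('nand', ['a', 'b']),
    'g2': ('inv',  ['g1']),
    'g3': ('nor',  ['g1', 'g2']),
}

@lru_cache(maxsize=None)
def arrival(gate: str) -> float:
    """Worst-case signal arrival time at a gate's output, in nand-delays."""
    cell, drivers = netlist[gate]
    delay = GATE_DELAY.get(cell, 0.0)
    return delay + max((arrival(d) for d in drivers), default=0.0)

critical = max(arrival(g) for g in netlist)  # critical-path delay
```

Die area in nand-units is the analogous computation, except it is a sum over all cells rather than a maximum over paths.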

CI server

CircleCI, free for opensource projects

runs verification and performance tests at each commit

opensource tooling used

icarus verilog

4-state simulator

good points

very easy to use

can write test cases in verilog

generates/compiles code very quickly

bad points

strict GPLv2 license

potentially an issue for using VPI

limited support for system verilog

verilator

2-state simulator

good points

great reputation in industry

runs quickly

relatively unrestrictive license (LGPLv2)

detects initialization errors reliably, using random initialization

easy to link with C++

system verilog support underway

bad points

lots of system verilog functionality missing

cannot write standard verilog test bench unit tests

compilation/generation slow, hard to configure

yosys

awesome opensource synthesizer

good points

very reliable

didn't find a bug yet

can handle a diverse space of verilog code

bad points

limited support for system verilog

opensource tooling not used

sv2v

converts system verilog code to verilog code

which can be given to other tools

good points

supports system verilog

bad points

whenever something goes wrong, throws you into the generated verilog code

I'd like to be thrown into the sv code, ideally

opensta/timer

can be used to measure propagation delay

I found it hard to use

gaps/opportunities in open source tooling

systemverilog support

opensource DDR5 or 6 controller, with a full verification suite, as standard an interface as possible, and some kind of 'proxy' module for running simulations

same for PCIe controller

is there an opensource way to create a bmem block?

make chip layout easy to run

I tried qflow, but found it hard to use

maybe as easy as adding noob documentation to qflow?

probably need some tests for various edge-cases though, perhaps with a CI server


current status

basic gpu core

present

basic int and fp arithmetic

unified register file using zfinx instructions

RISC-V based

good propagation delay

parallel instruction

todo

migrate to bf16

other fp operations: exp, log, tanh

warps

gpu controller

working

can use to copy memory to/from host

can use to launch kernels

single source c++

can compile and run single-source c++ programs, containing both host and kernel code

uses clang to factor the code into host and device code

uses clang to transform c++ into llvm ir

uses clang to transform llvm ir into RISC-V assembler

we provide a HIP hostside runtime

handles memory management

handles launching kernels

handles communications with gpu controller

we handle rewriting the LLVM IR to call into our HIP hostside runtime
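The single-source flow above can be summarized as a staged driver. The sketch below builds the conceptual stages without running them; the flag spellings and the `ir-rewriter` tool name are placeholders, not the project's actual invocations:

```python
# Hypothetical staged driver for the single-source HIP flow described above.
# Flags and tool names are illustrative placeholders.

def build_pipeline(src: str):
    """Return the conceptual compile stages for one single-source file."""
    return [
        # 1. clang splits the single source and lowers the device side to LLVM IR
        ['clang++', '-x', 'hip', '--offload-device-only',
         '-S', '-emit-llvm', src, '-o', 'kernel.ll'],
        # 2. rewrite the host-side IR so kernel launches and allocations call
        #    into our HIP host-side runtime (placeholder tool name)
        ['ir-rewriter', 'host.ll', '-o', 'host_patched.ll'],
        # 3. lower the device IR to RISC-V assembler
        ['clang++', '-target', 'riscv32', '-S', 'kernel.ll', '-o', 'kernel.s'],
    ]

stages = build_pipeline('my_kernel.cpp')
```

The point is the shape of the flow: one input file, a host/device split, an IR rewrite to target our runtime, and a final lowering to RISC-V.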

PyTorch

we can allocate gpu memory from pytorch

need to look into building pytorch from scratch, in order to rebuild all kernels as RISC-V

need to add in interface to ddr controller, and to pcie controller

might need network-on-chip (undecided currently)