Whenever I try to learn something new, I tend to spend about a day (or longer, depending on scope) researching the best resources to use. I usually keep these findings to myself, but I was recently building a syllabus for learning CUDA, and some friends recommended I formalize and share it.
Just give me the syllabus
If you don’t want to read my yapping, this is the syllabus with no added commentary:
Conversational Understanding
- Programming Massively Parallel Processors, 4th Edition (PMPP): Chapters 1-6 and its associated lectures.
- First 3 MPs of the associated Coursera course (the course seems to be dead now, but someone has backed up the assignments).
- As many exercises from those first 6 chapters as I see fit.
- Referencing Chapters 1-5 of the CUDA C++ Programming Guide. My friend Andre, who I would consider a CUDA expert, also recommends reading Chapters 8, 16, and 19 (with emphasis on 16) for more conversational knowledge.
Working Understanding
Guided Learning:
- Completing PMPP, its associated lectures, and exercises.
- Skimming CMU’s 15-418 lectures and assignments.
- 15-418 covers not just CUDA but also general parallel programming with OpenMP and MPI. The last two assignments in the class use OpenMP, and it would be interesting to try to implement some of that functionality in CUDA to solidify my understanding.
- Skimming PPC exercises.
- These likely overlap heavily with PMPP and 15-418, but they should help cover anything that slips through the cracks.
Open-ended Learning:
- Implementing deep learning models from scratch.
- I suspect that implementing and benchmarking a simple CNN and a simple transformer will reinforce all the things I’ve learned.
- Integrating with other areas of knowledge:
- I’ve built a raytracer before while following along with Ray Tracing in One Weekend. Nvidia has a guide on re-implementing this functionality in CUDA. I plan to follow along with it and then extend it as far as my knowledge allows.
- I’ve written a parallelized Black-Scholes options pricer in the past; reimplementing it in CUDA seems like a valuable learning experience.
- Do all of these from scratch, and then again using real-world libraries like CCCL.
Expert Knowledge
- Understand real-world CUDA optimizations.
- Simon Boehm has a wonderful writeup walking through the thought process behind each optimization step in making a matmul kernel fast.
- Read and try to understand important CCCL, CuTe, and CUTLASS kernels.
- Even more complex projects.
- llm.c looks like a wonderful resource. Following along with Karpathy’s GPT series and attempting to build some of the core components in CUDA will likely be a valuable learning experience.
- CuPy is quite functional; attempting to build some of its core features from scratch will also likely be a good learning experience.
- Paper reading and grad classes.
- I have not looked into this too deeply, but I know that UIUC has a follow-up class to PMPP that covers more complex parallel programming topics, along with a list of relevant papers. MIT’s class also has some papers to look at.
- Twitter also seems to be a good resource for finding recent papers relevant to parallel programming and CUDA. Following major conferences such as ASPLOS, MLSys, and SIGGRAPH will also help in staying up to date.
Conversational Understanding
I tend to split my learning into three checkpoints, the first being conversational understanding. The goal for this checkpoint is to have enough knowledge about the topic to be in, or a little past, the “Valley of Despair” on the Dunning-Kruger curve: able to have conversations with experts and understand what they are talking about, but likely unable to reach the same conclusions or build the same things they do without further learning.
For getting to this point with CUDA, here are the resources I plan on going through:
- Programming Massively Parallel Processors, 4th Edition (PMPP): Chapters 1-6 and its associated lectures.
- First 3 MPs of the associated Coursera course (the course seems to be dead now, but someone has backed up the assignments).
- As many exercises from those first 6 chapters as I see fit.
- Referencing Chapters 1-5 of the CUDA C++ Programming Guide. My friend Andre, who I would consider a CUDA expert, also recommends reading Chapters 8, 16, and 19 (with emphasis on 16) for more conversational knowledge.
This should take no more than 10-20 hours (I have already completed these resources, and it took about 10 hours). After that, I’d feel comfortable assessing further resources and talking with friends and others who are more knowledgeable about CUDA.
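To give a sense of the level this checkpoint targets, here is a minimal vector-add sketch, roughly the first program PMPP walks you through (the names and sizes here are my own illustration, not from the book):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard against the final partial block
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; PMPP begins with explicit
    // cudaMalloc/cudaMemcpy, which is worth doing at least once.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // ceil(n / threads)
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Being able to read this, explain why the bounds check exists, and say roughly what a block and a grid are is about the level of fluency I mean by “conversational.”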
Working Understanding
Once I have the vocabulary and a general understanding of how CUDA works, the next step is getting to the point where I feel comfortable using CUDA for my own work, and maybe even in a professional setting (with guidance).
Guided Learning:
- Completing PMPP, its associated lectures, and exercises.
- Skimming CMU’s 15-418 lectures and assignments.
- 15-418 covers not just CUDA but also general parallel programming with OpenMP and MPI. The last two assignments in the class use OpenMP, and it would be interesting to try to implement some of that functionality in CUDA to solidify my understanding (see the sketch after this list).
- Skimming PPC exercises.
- These likely overlap heavily with PMPP and 15-418, but they should help cover anything that slips through the cracks.
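To illustrate the kind of OpenMP-to-CUDA translation I mean (my own sketch, not taken from the 15-418 assignments): an OpenMP parallel-for maps naturally onto a CUDA grid, with the loop body becoming the kernel and the loop index becoming the thread index.

```cuda
#include <cuda_runtime.h>

// CPU version, as it might appear in a 15-418-style assignment:
//
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i)
//       out[i] = scale * in[i];

// CUDA version: the loop disappears, and each thread handles one index.
__global__ void scaleKernel(const float* in, float* out, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale * in[i];
}

// The grid-stride form handles arrays larger than the launched grid and is
// the more idiomatic general-purpose version of the same mapping.
__global__ void scaleKernelStride(const float* in, float* out, float scale, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = scale * in[i];
}
```

The interesting part of the exercise is everything this sketch leaves out: deciding which data to move to the device, when to synchronize, and whether the kernel is memory- or compute-bound.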
Open-ended Learning:
- Implementing deep learning models from scratch.
- I suspect that implementing and benchmarking a simple CNN and a simple transformer will reinforce all the things I’ve learned.
- Integrating with other areas of knowledge:
- I’ve built a raytracer before while following along with Ray Tracing in One Weekend. Nvidia has a guide on re-implementing this functionality in CUDA. I plan to follow along with it and then extend it as far as my knowledge allows.
- I’ve written a parallelized Black-Scholes options pricer in the past; reimplementing it in CUDA seems like a valuable learning experience (see the sketch after this list).
- Do all of these from scratch, and then again using real-world libraries like CCCL.
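For the Black-Scholes idea, the core is embarrassingly parallel: one thread per option. The formula below is the standard closed-form call price; the kernel itself is my own sketch, not Nvidia’s sample.

```cuda
#include <cuda_runtime.h>

// One thread prices one European call with the Black-Scholes formula:
//   d1 = (ln(S/K) + (r + sigma^2/2) * T) / (sigma * sqrt(T))
//   d2 = d1 - sigma * sqrt(T)
//   C  = S * N(d1) - K * exp(-r*T) * N(d2)
__global__ void blackScholesCall(const float* S, const float* K, const float* T,
                                 float r, float sigma, float* price, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i])
               / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    // normcdff is the CUDA math library's single-precision standard normal CDF.
    price[i] = S[i] * normcdff(d1) - K[i] * expf(-r * T[i]) * normcdff(d2);
}
```

A CCCL version of the same pricer could replace the hand-rolled launch with a thrust::transform over the option arrays, which makes for a nice from-scratch vs. library comparison.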
In general, the idea here is to complete the equivalent of an undergraduate course on CUDA/parallel programming, and then work on some “classic” projects, such as the CNN and transformer. The classes and resources here are just the ones I know people recommend, but the wonderful thing about CS is that nearly every major university publishes its assignments and lecture slides publicly, so if you want to take the Stanford equivalent or the MIT equivalent, you can. These all tend to cover the same material, though, so there is rarely a reason to look at more than one or two of them.
Once I’ve done those, I want to apply CUDA to things I already know well, so that the task is more complex but much of the non-CUDA work is already done. This will likely take 60-100 hours, but by the end I should have enough knowledge to feel confident taking on most CUDA work.
Expert Knowledge
I probably won’t pursue CUDA beyond a working understanding, but it is still useful to collect resources for going further. These tend to be far more open-ended and complex, and usually revolve around understanding the work and writing of people I would already consider experts.
- Understand real-world CUDA optimizations.
- Simon Boehm has a wonderful writeup walking through the thought process behind each optimization step in making a matmul kernel fast (a minimal sketch of the first step appears after this list).
- Read and try to understand important CCCL, CuTe, and CUTLASS kernels.
- Even more complex projects.
- llm.c looks like a wonderful resource. Following along with Karpathy’s GPT series and attempting to build some of the core components in CUDA will likely be a valuable learning experience.
- CuPy is quite functional; attempting to build some of its core features from scratch will also likely be a good learning experience.
- Paper reading and grad classes.
- I have not looked into this too deeply, but I know that UIUC has a follow-up class to PMPP that covers more complex parallel programming topics, along with a list of relevant papers. MIT’s class also has some papers to look at.
- Twitter also seems to be a good resource for finding recent papers relevant to parallel programming and CUDA. Following major conferences such as ASPLOS, MLSys, and SIGGRAPH will also help in staying up to date.
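To make the matmul bullet concrete, the first big jump in Boehm’s progression is from a naive kernel to shared-memory tiling. This is a minimal sketch of that step (the tile size and names are my own; his writeup goes much further):

```cuda
#define TILE 16

// Naive: each thread walks a full row of A and column of B in global memory.
__global__ void matmulNaive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Tiled: each block stages TILE x TILE tiles of A and B in shared memory, so
// each global-memory element is loaded once per block rather than once per
// thread. Assumes N is a multiple of TILE to keep the sketch short.
__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait until everyone is done reading the tile
    }
    C[row * N + col] = acc;
}
```

Boehm’s writeup then layers on register blocking, vectorized loads, and more, benchmarking each version against cuBLAS, which is where most of the learning happens.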
In my opinion, being an expert requires active, constant involvement, so the majority of the time will be spent talking to others and getting involved in CUDA and parallel programming work. There is no estimated timeline for this stage because, ideally, you never stop.