[Profiling] Tracker Issue for Profiling first steps
Opened this issue · 4 comments
This issue lists out steps for profiling! (Mostly so I can organize my TODOs.) Will update as I move along.
Inspections & QoL improvements to profiler
- Running the profiler on more (big) programs
- Make a test suite for profiling
- Use Cider2 benchmarks for "real programs"
- Brainstorm ways to give actionable feedback for bigger programs
- Fix profiler tests CI
- Have runt test print out more things to properly debug
- Inspections
- Look through the waveforms for the "weird behavior"/mystery cycles
- Minimization to see different behaviors
- For any FSM-managed group, collect two pieces of info and report both to the users. The diff would display the mystery redundant cycles that Calyx is consuming.
- ground truth (`go`, `done` ports)
- FSM (what Calyx says is allowed to run)
- QoL improvements:
- Connect invokes & pars (cond groups for whiles?) with user identifiable info (line numbers?)
- Visualizations
- Check out README: https://github.com/Auterion/embedded-debug-tools/tree/main/ext/orbetto
- Check out: https://ui.perfetto.dev/
- Check out: https://profiler.firefox.com/
- Make a first-pass visualization for cycle counts
- Find tools that display flame graphs (rather than a timeline view)
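For the ground-truth-versus-FSM comparison listed under Inspections, here is a minimal sketch of what the diff could look like. It assumes both sources have already been reduced to one 0/1 flag per cycle for a single group; the function name and the input lists are hypothetical, not part of the profiler.

```python
def mystery_cycles(fsm_active, ground_truth_active):
    """Return cycle indices where the FSM says the group may run
    but the ground truth (go/done ports) says it is not running.

    Both arguments are hypothetical per-cycle 0/1 lists of equal length.
    """
    return [
        cycle
        for cycle, (fsm, truth) in enumerate(zip(fsm_active, ground_truth_active))
        if fsm and not truth
    ]


# Made-up example: the FSM enters the group's state one cycle
# before the group's go signal actually goes high.
fsm = [1, 1, 1, 0]
truth = [0, 1, 1, 0]
extra = mystery_cycles(fsm, truth)
```

Reporting both flag lists alongside `extra` would let users see which cycles are attributed to the group only by the FSM.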
First Pass: Cycle-level performance info at the Calyx level
- Metadata generation
- Print JSON from TDCC (add another pass option to print JSON instead of the dump)
- Write JSON to file
- Instead of hacking through the enable assignment, we directly keep track of group to FSM state mappings
- Refactor this by directly building a `FSMStateInfo` when processing enables.
- Fix JSON emission to output a single JSON file at the end (when there are multiple TDCC groups, like in `language-tutorial-iterate`, the individual TDCC FSMs overwrite each other)
- Right now (for optimization purposes?) the first group is morphed with the setup. Want to differentiate for more accurate counts of the first group.
- Merge `dump-fsm` and `dump-fsm-json` for TDCC
- Add FSM name information to JSON
- If the par arm/component does not yield a FSM, need to output corresponding information (check `go` and `done` instead!)
- We want information about parentage (if a FSM is managing a par arm, we want to know what the par itself is)?
- Loading in the trace
- Figure out what tool to use?
- Kevin's Wellen library for Surfer
- Some Python libraries for a first pass:
- Make first pass script for reading vcd and outputting group lengths based on FSM values
- Remove assumption that there is only one FSM
- Remove assumption that each cycle takes 10ms (have a counter mechanism of how many cycles passed between X ms and Y ms)
- Sample signals on rising/falling clock edge (comment)
- Check out example programs with parallelism
- Produce summary: compute the total cycles that a given group was active, the number of times it was active (the number of segments), and the average running time (which is just the quotient of the previous two values).
- Multi-component programs:
- Update TDCC to write one JSON file reflecting all components
- Output cell names info using a backend instead of TDCC?
- Fix hardcoding of `"TOP.TOP.main.go"`
- Find edge cases where timing info is not actionable
- Don't start counting clock cycles until `main.go` is 1
- Make flame graphs
- There is probably a library out there to generate a flame graph.
- Flame graphs resource: https://www.brendangregg.com/flamegraphs.html
- JavaScript library: https://github.com/spiermar/d3-flame-graph
- https://profiler.firefox.com/
- https://ui.perfetto.dev/
- Write wrapper script around the pipeline
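The summary step in the list above (total active cycles, number of segments, and their quotient) can be sketched as follows, together with the rule of not counting until `main.go` is 1. This is only an illustration: it assumes each signal has already been sampled once per clock cycle into a list of 0/1 values, and the function names and dictionary keys are made up.

```python
def trim_until_go(main_go, active):
    """Drop samples before the first cycle where main.go is 1."""
    start = main_go.index(1) if 1 in main_go else len(main_go)
    return active[start:]


def summarize(active_per_cycle):
    """Total active cycles, number of active segments, and average length."""
    total = 0
    segments = 0
    prev = 0
    for active in active_per_cycle:
        if active:
            total += 1
            if not prev:
                # A 0 -> 1 transition starts a new segment.
                segments += 1
        prev = active
    avg = total / segments if segments else 0.0
    return {"total_cycles": total, "segments": segments, "avg_cycles": avg}


# Made-up per-cycle samples for main.go and one group.
main_go = [0, 0, 1, 1, 1, 1]
group = [1, 0, 1, 1, 0, 1]
summary = summarize(trim_until_go(main_go, group))
```

In this example the first sample of `group` is ignored because `main.go` is still 0, leaving two segments covering three cycles.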
Thanks for opening this @ayakayorihiro! Could you add the "Tracker" label to this issue?
Thanks @rachitnigam ! Just added the tracker label, will keep in mind for next time :)
- Remove assumption that each cycle takes 10ms (have a counter mechanism of how many cycles passed between X ms and Y ms)
For synchronous designs like the ones Calyx produces, I generally recommend sampling signals on a rising or falling clock edge (depending on how the testbench works). That way you stay independent of the actual timing. Here is how I find the sample point in a Rust implementation: https://github.com/ekiwi/rtl-repair/blob/71e1afc0b9a2327d008b46acd415cf3f0343a938/scripts/osdd/src/main.rs#L113
Similar thing but with the `vcdvcd` library in Python:
https://github.com/ekiwi/rtl-repair/blob/861e244c599e682efe5dbd8e3295c3b8e3590a34/scripts/calc_osdd.py#L215
https://github.com/ekiwi/rtl-repair/blob/861e244c599e682efe5dbd8e3295c3b8e3590a34/scripts/calc_osdd.py#L195
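As a rough, library-independent illustration of the idea (not the code from the links above), the sketch below samples a signal at each rising clock edge. It assumes every signal is available as a time-sorted list of `(time, value)` changes, which is an assumption about how the VCD was loaded; all names are hypothetical. It reads the value held just before each edge, i.e. what a register clocked on that edge would see, which is one possible choice.

```python
from bisect import bisect_left


def rising_edges(clock_changes):
    """Return the times at which the clock goes from 0 to 1."""
    edges = []
    prev = None
    for time, value in clock_changes:
        if value == 1 and prev == 0:
            edges.append(time)
        prev = value
    return edges


def value_before(changes, time):
    """Value a signal held just before `time` (None if not yet set)."""
    times = [t for t, _ in changes]
    idx = bisect_left(times, time) - 1
    return changes[idx][1] if idx >= 0 else None


def sample_on_rising_edges(clock_changes, signal_changes):
    """One sample per clock cycle, independent of the clock period."""
    return [value_before(signal_changes, t) for t in rising_edges(clock_changes)]


# Made-up value changes: a clock with period 10 and a go signal.
clk = [(0, 0), (5, 1), (10, 0), (15, 1), (20, 0), (25, 1)]
go = [(0, 0), (7, 1), (17, 0)]
samples = sample_on_rising_edges(clk, go)
```

Counting entries in `samples` gives cycle counts directly, so no assumption about the length of a cycle in simulation time is needed.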
Thanks @ekiwi ! I'll take a stab following your work with the `vcdvcd` library :)