Kyrie-Curiosities-Cabinet

Kyrie's Cabinet of Curiosities

  1. Compiler for federated learning (FL), jointly optimizing computation and communication

  2. Is it possible to adopt imitation learning or other IRL approaches to mimic DNN-compiler-generated kernels, shortening the long tuning process and boosting cross-device compilation?

  3. NeRF; Q4ML / ML4Q; quantum compilers (a dazzling space). See "Closing the Gap between Quantum Algorithms and Machines with Hardware-Software Co-Design"

  4. Sim2Real runtime engine, e.g. for MineDojo or autonomous driving? [X] Nature of simulation: massive data, exploration cost, distributed training <-> the real world.

  5. Carbon-aware DNN Compiler
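The imitation-learning idea in item 2 can be caricatured in a few lines: treat logged (workload features -> best schedule) pairs from a prior autotuning run as expert demonstrations, and serve new workloads by similarity lookup instead of re-running the search. All features and schedule names below are invented for illustration:

```python
import numpy as np

# Hypothetical sketch of "imitate the tuner": reuse expert schedule choices
# from a prior tuning log instead of searching from scratch.
log_feats = np.array([[128, 128, 128],   # (M, N, K) of previously tuned matmuls
                      [512, 512, 512],
                      [1024, 64, 256]], dtype=float)
log_scheds = ["tile8x8", "tile32x32", "tile16x4"]  # the tuner's choices

def imitate(feat):
    """Pick the expert schedule of the most similar logged workload
    (nearest neighbor in log2 feature space)."""
    d = np.linalg.norm(np.log2(log_feats) - np.log2(np.asarray(feat, float)), axis=1)
    return log_scheds[int(np.argmin(d))]

print(imitate([600, 480, 512]))  # reuses the schedule of the closest record
```

A real system would replace the lookup with a learned policy over richer kernel features, but the zero-search serving path is the point.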

From OctoML (clipped quote): "The OctoML Platform has always provided automation for exploring multiple model acceleration techniques. Via our new TVM-ONNX Runtime integration, …"

  1. A Duet-style NN compiler + runtime for dual-device inference: heterogeneous subgraph optimization, tuned to the parallel performance of each device.

  2. Diffusion model survey: "Diffusion Models: A Comprehensive Survey of Methods and Applications"

  3. Crypto | Privacy | Security + Accelerator

  • CryptGPU
  • PolyMPCNet
  • Crypten
  • Cheetah
  4. On-device AI

  1. Summarizing CPU and GPU Design Trends with Product Data

  2. prompt

  3. webGPU (https://github.com/mlc-ai/web-stable-diffusion)

  4. Zeroth-order optimization (black-box ML)

  5. Unified abstractions for IoT data streams.

  6. MLIR for heterogeneous-device ML flows (DNN and non-DNN operators and control flow)

  7. Tile IR => mixed-tile IR? Cross-domain/modality IR? Pain point: redundant code optimization across multi-layer IRs. IREE supports Vulkan/SPIR-V for mobile GPUs and CPUs => compiler support for the data flow in end-to-end autonomous vehicles

  8. LLMs running alongside CNNs/DNNs: co-running transformer and CNN workloads.

  9. Transfer the stronger representation capacity of large models to small online models.

  10. NPUs running LLMs, and the energy problem.

  11. The sparsity of LLM tokens and the dynamics of input tokens mean that not all communication in distributed inference is necessary.

  12. Machine unlearning + out-of-distribution (OOD) data

  13. Data owners are unwilling to provide labels; routing everything through the cloud is too slow and overloads the cloud.

  14. Should unknown-task detection run on the edge, or in the cloud?

  15. How much edge data, kept on the edge side, is enough to fine-tune a good vertical (domain-specific) large model?

  16. Simulator for edge?

  17. Large-model compilers for the edge, and their portability

  18. Fine-tuning edge LLMs + compiler support

  19. rTile, rGraph: redefine the basic unit. Don't treat the op as the unit; view the model as a load-compute-store dataflow. Large models on the edge as dynamic NNs.

  20. Develop compiler strategies that can efficiently distribute model computations between edge and server GPUs, considering factors such as network latency, communication overhead, and load balancing.

  21. To facilitate this mapping, WELDER provides an abstracted accelerator device with hierarchical memory layers.

  22. A tensor-program version of ImageBind / Meta-Transformer: tuned records of the same ops/graphs (objects) on different hardware (modalities). Goal: unify hardware intrinsics; feature: cost model. Analogy: from SVMs on images to today's encoders.

  23. Explore Data Placement Algorithm for Balanced Recovery Load Distribution

  24. zpoline: a system call hook mechanism based on binary rewriting

  25. TensorIR, code embeddings, LLMs

  26. Decompiling: executables -> IR -> another device, end to end. Take a model compiled for an NVIDIA TX2 and migrate it to another device with one click.
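The zeroth-order optimization note above refers to gradient estimation from function evaluations alone, which is what makes it attractive for black-box ML. A minimal NumPy sketch of the two-point estimator, g ≈ mean[(f(x + mu*u) - f(x - mu*u)) / (2*mu) * u] with u ~ N(0, I):

```python
import numpy as np

# Zeroth-order (black-box) optimization: estimate the gradient of f using
# only function evaluations, then run plain gradient descent on the estimate.
def zo_gradient(f, x, mu=1e-3, n_samples=20, seed=0):
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape[0])
        # Two-point finite-difference estimate along random direction u.
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_samples

# Minimize a black-box quadratic without ever querying its true gradient.
target = np.array([1.0, -2.0, 0.5])
f = lambda x: float(np.sum((x - target) ** 2))

x = np.zeros(3)
for _ in range(300):
    x = x - 0.05 * zo_gradient(f, x)
# x drifts toward `target` using function values alone.
```

The same estimator underlies SPSA-style methods and recent memory-efficient LLM fine-tuning work; only the query budget and smoothing radius `mu` change.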
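The token-sparsity note above (item 11) suggests that distributed inference need not ship every token's activations. A hypothetical top-k sketch, with all shapes and magnitudes invented for illustration:

```python
import numpy as np

# Before shipping per-token activations to a remote shard, keep only the
# top-k tokens by activation norm and send (indices, values); the receiver
# scatters them back and treats dropped tokens as zero.
def sparsify_tokens(acts, k):
    """acts: (tokens, hidden). Returns indices and values of the k most
    active tokens, i.e. the payload actually communicated."""
    norms = np.linalg.norm(acts, axis=1)
    idx = np.argsort(norms)[-k:]
    return idx, acts[idx]

def densify_tokens(idx, vals, n_tokens, hidden):
    out = np.zeros((n_tokens, hidden))
    out[idx] = vals
    return out

rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 64))
acts[3:] *= 0.01                     # most tokens carry little signal
idx, vals = sparsify_tokens(acts, k=3)
recon = densify_tokens(idx, vals, 16, 64)
# Only 3 of 16 token vectors cross the wire, yet the dominant tokens survive.
```

Whether zeroing the dropped tokens is acceptable depends on the layer; the research question is exactly which communication is redundant, and when.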
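The edge/server distribution idea (item 20) can be made concrete with a toy cost model: enumerate cut points in a linear layer chain and charge each candidate its compute time on both sides plus the activation-transfer cost at the cut. All numbers below are assumed:

```python
# Toy sketch: choose the layer-chain split between an edge GPU and a server
# GPU that minimizes end-to-end latency.
edge_ms   = [4.0, 8.0, 20.0, 40.0]  # per-layer latency on the edge (assumed)
server_ms = [1.0, 1.5, 2.0, 2.5]    # per-layer latency on the server (assumed)
act_mb    = [8.0, 4.0, 2.0, 1.0]    # activation size after each layer, MB (assumed)
input_mb  = 16.0                    # model input size, MB (assumed)
bw_mbps   = 1000.0                  # edge -> server uplink bandwidth
rtt_ms    = 5.0                     # network round-trip latency

def transfer_ms(mb):
    """Time to move `mb` megabytes over the uplink, plus one round trip."""
    return rtt_ms + mb * 8.0 / bw_mbps * 1000.0

def split_latency(s):
    """Layers [0, s) run on the edge, layers [s, N) on the server."""
    n = len(edge_ms)
    compute = sum(edge_ms[:s]) + sum(server_ms[s:])
    if s == n:                      # fully on-edge: nothing crosses the network
        return compute
    sent = input_mb if s == 0 else act_mb[s - 1]
    return compute + transfer_ms(sent)

best = min(range(len(edge_ms) + 1), key=split_latency)
print(best, split_latency(best))    # cut after layer 2 in this toy setting
```

A real compiler pass would search over graph partitions rather than a single cut, and re-solve when bandwidth or server load changes; the cost-model structure stays the same.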