-
Compiler for FL (federated learning): computation + communication.
-
Is it possible to adopt imitation learning or other IRL approaches to mimic DNN-compiler-generated kernels, shortening the long tuning process and boosting cross-device compilation?
-
NeRF, Q4ML, ML4Q, QCompiler... dazzling. See "Closing the Gap between Quantum Algorithms and Machines with Hardware-Software Co-Design".
-
Sim2Real runtime engine, e.g. for MineDojo or autonomous driving?[X] Nature of simulation: massive data, exploration cost, distributed training <-> real world.
-
Carbon-aware DNN Compiler
- EDEN: Enabling energy-efficient, high-performance deep neural network inference using approximate DRAM by Koppula, Skanda, et al., MICRO 2019
- Carbon Explorer: A Holistic Framework for Designing Carbon Aware Datacenters by Acun, Bilge, et al., ASPLOS 2023
- OctoML
- lamppost, energy = cost, "However, AI inference at such a massive scale is very expensive."
- Zeus
- "The OctoML Platform has always provided automation for exploring multiple model acceleration techniques. Via our new TVM-ONNX Runtime integration, ..."
-
An NN compiler + runtime for Duet-style dual-device inference: heterogeneous subgraph optimization, tuned to the parallel performance of each device.
-
Diffusion model survey: "Diffusion models: A comprehensive survey of methods and applications".
-
Crypto | Privacy | Security + Accelerator
- CryptGPU
- PolyMPCNet
- Crypten
- Cheetah
-
On-device AI
- EfficientFormer: Vision Transformers at MobileNet Speed
- FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?
-
Zeroth-order optimization (black-box ML).
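A minimal sketch of the core black-box trick, assuming the standard two-point (symmetric-difference) gradient estimator; the function names and hyperparameters here are illustrative, not from any specific paper:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate:
    average of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random directions u.
    Only needs function evaluations, no backprop."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

# Minimize f(x) = ||x||^2 using only black-box evaluations of f.
f = lambda x: float(x @ x)
x = np.ones(5)
for _ in range(200):
    x -= 0.05 * zo_gradient(f, x)
```

For a quadratic, the estimator is exact in expectation (it averages to 2x here), so plain gradient descent on the estimate converges; the sample count trades query cost against estimator variance.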
-
Unified abstractions for IoT data streams.
-
MLIR for heterogeneous-device ML flows (DNN & non-DNN operators and flows).
-
Tile IR => mixed-tile IR? Cross-domain/cross-modality IR? Pain point: redundant code optimization across multi-layer IRs. IREE supports Vulkan/SPIR-V for mobile GPU and CPU => compiler support for the data flow in end-to-end autonomous vehicles.
-
LLMs running alongside CNNs and other DNNs; co-running transformer and CNN workloads.
-
Transfer the stronger representation capability of large models to small online models.
-
Running LLMs on NPUs; the energy problem.
-
The sparsity of LLM tokens and the dynamics of input tokens mean that not all communication in distributed inference is necessary.
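One way to exploit that sparsity, sketched here as a hypothetical filter (the norm-based importance score and the name `select_tokens` are my own illustration, not an existing system's API): keep only the highest-activation tokens when shipping hidden states to another device.

```python
import numpy as np

def select_tokens(hidden, k):
    """Keep only the k tokens with the largest activation norm before a
    cross-device transfer; return kept rows plus their indices so the
    receiver can scatter them back into the full sequence."""
    norms = np.linalg.norm(hidden, axis=-1)   # (seq_len,) importance proxy
    keep = np.sort(np.argsort(norms)[-k:])    # indices of the top-k tokens
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.standard_normal((128, 64))       # seq_len x d_model activations
payload, idx = select_tokens(hidden, k=32)
# Only 32 of 128 token vectors cross the wire; the rest are treated as redundant.
```

Whether a norm threshold is the right importance signal is exactly the research question; attention mass or learned routers are plausible alternatives.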
-
Machine unlearning + OOD (out-of-distribution) data.
-
The data-owning side is unwilling to provide labels; routing everything through the cloud is too slow, and the cloud load becomes too heavy.
-
Unknown-task detection on the edge, or unknown-task detection in the cloud?
-
How much edge data kept on the edge side is enough to fine-tune a reasonably good vertical (domain-specific) large model?
-
Simulator for edge?
-
Large-model compiler for the edge: portability across devices.
-
Fine-tuning edge LLMs + compiler support.
-
rTile, rGraph: redefine the basic unit; instead of taking the op as the unit, see the model as dataflow (load-compute-store). Large models on the edge as dynamic NNs.
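The load-compute-store view can be made concrete with a toy tiled matmul; this is a generic tiling sketch of the idea, not rTile/rGraph's actual representation:

```python
import numpy as np

TILE = 16

def tiled_matmul(A, B):
    """Matmul expressed as a load-compute-store pipeline over tiles,
    rather than as one monolithic 'op' node."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)  # on-chip accumulator
            for k in range(0, K, TILE):
                a = A[i:i+TILE, k:k+TILE]   # load: tile of A
                b = B[k:k+TILE, j:j+TILE]   # load: tile of B
                acc += a @ b                # compute: tile-level MAC
            C[i:i+TILE, j:j+TILE] = acc     # store: tile of C
    return C
```

Once the graph is phrased this way, fusion and scheduling become decisions about which loads/stores between tiles can be elided, which is the point of treating tiles rather than ops as the unit.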
-
Develop compiler strategies that can efficiently distribute model computations between edge and server GPUs, considering factors such as network latency, communication overhead, and load balancing.
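A first-order version of such a strategy is a split-point search over a latency cost model; everything below (function name, units, the simplification of ignoring input/output transfer and load balancing) is an illustrative assumption:

```python
def best_split(layer_ms_edge, layer_ms_server, act_bytes, bandwidth_bps):
    """Pick the layer index at which computation moves from the edge GPU to the
    server GPU, minimizing edge compute + activation transfer + server compute.
    Split s means layers [0, s) run on the edge and [s, n) on the server;
    input/output transfer is ignored for simplicity."""
    n = len(layer_ms_edge)
    best = (float("inf"), 0)
    for s in range(n + 1):
        edge = sum(layer_ms_edge[:s])
        server = sum(layer_ms_server[s:])
        # activation size at the cut (no transfer if everything stays on one side)
        xfer = 0.0 if s in (0, n) else act_bytes[s - 1] / bandwidth_bps * 1000
        total = edge + xfer + server
        if total < best[0]:
            best = (total, s)
    return best  # (latency_ms, split_index)
```

A real compiler pass would extend this with measured per-device kernel times, variable network latency, and multi-branch graphs, but the cut-point enumeration is the core idea.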
-
WELDER: "To facilitate this mapping, WELDER provides an abstracted accelerator device with hierarchical memory layers."
-
A tensor-program version of ImageBind / Meta-Transformer: tuned records of the same ops/graphs (the objects) on different hardware (the modalities). Goal: unify hardware intrinsics; key feature: the cost model. Analogy: SVMs on images <-> today's encoders.
-
Explore Data Placement Algorithm for Balanced Recovery Load Distribution
-
zpoline: a system call hook mechanism based on binary rewriting
-
TensorIR, code embeddings, LLMs.
-
Decompiling: executables => IR => another device, end to end; e.g., take a model compiled for the NVIDIA TX2 and migrate it to another device in one click.