Issues
- 2
same train_loader but got different loader size
#80 opened by Hyaloid - 4
Some error about communication
#43 opened by jglicat - 1
optimizer got an empty parameter list when rank=1
#79 opened by Hyaloid - 1
- 2
When I was testing the pipedream code with version-updated torch, I encountered the following error (1.1.0 -> 1.11.0):
#78 opened by lengien - 0
Supporting T5
#68 opened by gperrotta - 1
Running in Docker gives an error that a physical address can't be found
#76 opened by guanyonglai - 1
- 1
- 1
- 4
Translation demo: Division by zero
#49 opened by grwlf - 5
Planner for PipeDream-2BW
#57 opened by nict-wisdom - 0
Question about PipeDream's optimizer
#71 opened by lllukehuang - 6
Multi-machine distribution problem
#34 opened by ADAM-CT - 0
- 10
What's the latest version of PyTorch supported?
#52 opened by SimonZsx - 0
The arguments of self.start_helper_thread() should be more flexible instead of fixed as int64.
#69 opened by gouchangjiang - 0
- 6
Is there AllReduce in data parallelism?
#65 opened by Allen-Czyysx - 3
- 0
- 1
- 0
Resource temporarily unavailable
#61 opened by liulixinkerry - 0
The BLEU score of the translation model seems abnormal. The model doesn't seem to train effectively.
#63 opened by njuyexiangyu - 0
- 0
- 0
Running a transformer module
#58 opened by oranichu - 6
Multi node training
#36 opened by ADAM-CT - 6
- 0
Hanging with [4,3,1] GPU assignment
#55 opened by BestSonny - 0
Can the profiler handle dynamic graphs?
#54 opened by rahul003 - 8
Unexpected Error
#31 opened by kanonjz - 0
- 1
- 1
- 3
- 6
Batch size and optimizer
#47 opened by nirandaperera - 3
- 3
Error occurred in profiling
#44 opened by gudiandian - 1
Is this the version used in SOSP paper?
#46 opened by nirandaperera - 1
- 1
docker pull error
#42 opened by cnzhanj - 4
How to determine replication factors
#37 opened by ADAM-CT - 2
"stage_to_depth_map" not found
#38 opened by ADAM-CT - 2
- 3
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai
#32 opened by ADAM-CT - 2
bandwidth parameter
#35 opened by ADAM-CT - 2
- 7
GPU underutilization
#29 opened by ADAM-CT - 8