learning-at-home/hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Python · MIT

Issues

Seeking advice on improving the reliability of communication.
#624 opened 23 days ago by samsja
0
[BUG] GradScaler does not work with torch 2.3.0
#610 opened 23 days ago by samsja
1
pydantic < 2.0.0 is starting to conflict with other dependencies
#597 opened 10 months ago by Vectorrent
1
How to get list of nearest peers connected in the DHT network?
#606 opened 7 months ago by abhi1263
0
question about how rpc works in the hivemind package
#605 opened 7 months ago by drimeF0
0
[Feature Request] Network Statistics
#520 opened 2 years ago by chavinlo
4
connecting to private petals using ec2 dht problems
#604 opened 10 months ago by brandnamewater
0
Support for Windows
#596 opened a year ago by ParisNeo
1
[BUG] Unable to start hivemind server when using gradient clipping
#592 opened a year ago by SamAg19
2
Support for fully homomorphic encryption on training, finetuning, and inference
#584 opened a year ago by sirus20x6
0
does/can hivemind work with deepspeed ZeRO-3 Offload? [Feature Request]
#583 opened a year ago by sirus20x6
0
proto/runtime_pb2.py missing when installing from sources
#582 opened a year ago by poedator
1
forking before initialization of the MPFuture handler - server runtime not initialized in WSL --new_hive
#581 opened a year ago by poedator
1
[BUG] Getting '[Errno 13] Permission denied' when importing hivemind
#580 opened a year ago by yuanluw
0
How well does it scale?
#575 opened a year ago by lonnietc
2
[BUG] hivemind.compression is not compatible with bitsandbytes == 0.39.1
#572 opened a year ago by borzunov
2
Local Gradient Accumulation is slower than the PyTorch implementation.
#566 opened a year ago by cirquit
0
hivemind.compression: TypedStorage is deprecated
#563 opened a year ago by borzunov
1
Failed to close hivemind.P2P
#564 opened a year ago by borzunov
1
Metaclasses for logging
#556 opened a year ago by StrangeTcy
1
AttributeError in MPFuture
#552 opened 2 years ago by borzunov
2
Failed to connect to bootstrap peers
#551 opened 2 years ago by amerfarooq
1
[Feature Request] improve bfloat16 serialization when there is no compression
#550 opened 2 years ago by justheuristic
1
[BUG][MINOR] relayFinder already running
#549 opened 2 years ago by justheuristic
0
[BUG] Unable to train a bfloat16-compressed model
#545 opened 2 years ago by the-beee
1
Mismatched protobuf versions in sub-dependencies
#539 opened 2 years ago by briansemrau
3
[Feature Request] enable circuit relay v2
#536 opened 2 years ago by justheuristic
4
Read {run_id}_progress from DHT manually throws exceptions
#533 opened 2 years ago by cirquit
1
hivemind.averaging.partition.AllreduceException: Averaging step failed: could not find a group
#519 opened 2 years ago by chavinlo
3
[BUG] Cyclic references in TaskPool
#534 opened 2 years ago by justheuristic
0
[chore] deprecations for v1.2.0
#526 opened 2 years ago by justheuristic
0
Unable to decrease loss OR Unable to synchronize
#515 opened 2 years ago by chavinlo
2
[BUG] stale gradients
#514 opened 2 years ago by elricwan
0
[BUG] Failed to load_state_from_peers the first time because of "list index out of range" error
#504 opened 2 years ago by alex-snd
2
GPU lost
#509 opened 2 years ago by elricwan
6
[BUG] Tests for compression fail on GPU servers with bitsandbytes installed
#507 opened 2 years ago by mryab
0
Would you consider adding some CV examples with hivemind? [Feature Request]
#500 opened 2 years ago by elricwan
2
[Feature Request] Supporting RWKV (a RNN that can match transformer LM & zero-shot performance at 1B+ params)
#496 opened 2 years ago by BlinkDL
3
On the fresh run with cifar10 on macos 11.5.2
#498 opened 2 years ago by stoneyang
2
[Feature Request] MoE enhancements
#478 opened 2 years ago by GreenFatGuy
1
[Feature Request] quality-of-life changes to examples/albert
#474 opened 2 years ago by justheuristic
1
[Feature Request] Create docker image for WSL2
#461 opened 2 years ago by kotenok2000
4
[BUG] Global connection not working
#472 opened 2 years ago by Lednik7
2
[Feature Request] fp16/bf16 gpu params with fp32 offloading in hivemind.Optimizer
#476 opened 2 years ago by justheuristic
0
[Feature Request] Example of training CNN with large batch size
#464 opened 2 years ago by elricwan
1
[BUG] The peer terminates automatically when training a large model
#458 opened 3 years ago by elricwan
0
Set a wait time for other peers to join
#455 opened 3 years ago by elricwan
4
[BUG][MINOR] monitor does not recover from failing to load state
#453 opened 3 years ago by justheuristic
0
[BUG] Your current contribution: 0 samples
#451 opened 3 years ago by elricwan
3
[BUG] Loss did not decrease in Albert example after 125000 max steps.
#447 opened 3 years ago by elricwan
5