ipfs/ipget

Cannot complete download

vmx opened this issue · 2 comments

vmx commented

I'm using ipget on a remote node in a data center. There, the download of a large file stalls partway through. I'm using ipget v0.7.0 (the latest on dist.ipfs.io). I don't know how to debug this. I suspect it's the environment, but it's hard to tell. The command I used is:

LIBP2P_TCP_REUSEPORT=false /var/tmp/ipget-v0.6.0/ipget/ipget QmNPc75iEfcahCwNKdqnWLtxnjspUGGR4iscjiz3wP3RtS -o /var/tmp/filecoin-proof-parameters/v28-empty-sector-update-merkletree-poseidon_hasher-8-8-0-3b7f44a9362e3985369454947bc94022e118211e49fd672d52bec1cbfd599d18.params --progress --node=temp --peers /dns4/collab-cluster-am6-2.cluster.dwebops.net/tcp/4001/p2p/12D3KooWCrBiagtZMzpZePCr1tfBbrZTh4BRQf7JurRqNMRi8YHF --peers /dns4/collab-cluster-am6-3.cluster.dwebops.net/tcp/4001/p2p/12D3KooWDpp7U7W9Q8feMZPPEpPP5FKXTUakLgnVLbavfjb9mzrT --peers /dns4/collab-cluster-ams1-1.cluster.dwebops.net/tcp/4001/p2p/QmNMs4C2taBgMP716bgaN6wyyLRMTzLSVC5aqXrNjHE33Z --peers /dns4/collab-cluster-dc13-1.cluster.dwebops.net/tcp/4001/p2p/12D3KooWHVXoJnv2ifmr9K6LWwJPXxkfvzZRHzjiTZMvybeTnwPy --peers /dns4/collab-cluster-dc13-2.cluster.dwebops.net/tcp/4001/p2p/12D3KooWEDBLgMaCr6ZFwjDXr7eMXzb7s7SnHJHrYRYWYbQSxMif --peers /dns4/collab-cluster-sjc1-1.cluster.dwebops.net/tcp/4001/p2p/Qmde7irdYqkbhfFsu6xKzBgmGWJPnx8bS7TNVdAko4gswW --peers /dns4/collab-cluster-sjc1-2.cluster.dwebops.net/tcp/4001/p2p/12D3KooWKZLdYX8fEqMu5jNKpSKzyXjjNYosJGj5T9uDXKxseAsw

This is blocking me on some things, as I cannot get the Filecoin proof parameters to that machine easily.

My suspicion is that if the connection gets dropped for some reason and the data isn't properly advertised in the DHT, you'll never re-establish the connection, because --peers only connects to the peers once, up front.

This could be resolved by making connect (ipget/util.go, line 13 in 5397b06):

func connect(ctx context.Context, ipfs iface.CoreAPI, peers []string) error {

repeatedly connect to the target peers. The code in https://github.com/ipfs/go-ipfs/blob/d6de97b417def4feaf1382d0ff423e22fd2ff08b/peering/peering.go can serve as inspiration.

I attempted to fix this in this branch: https://github.com/shawnrader/ipget/tree/reconnect. However, according to Swarm.Peers(), the swarm is already maintaining connections to hundreds of hosts, so attempting to re-establish the connection does not address the issue. My guess at this point is that either (1) the hosts serving the Filecoin param files are not doing so reliably, and/or (2) there is an issue deeper in IPFS causing the download to get stuck. I think we need to make sure (1) is addressed before investigating further. When the issue occurs, I see the download progress bar drop to 0 bytes/sec and stay there.