FIP Discussion: Extend the fault period of cc sector from 2 weeks to 6 weeks

Question

IPFSUnion opened this issue 3 years ago · 19 comments

Motivation

Any country in the world is likely to face force majeure factors such as major natural disasters or social abnormal events, causing storage providers to be unable to provide services normally for a long period of time. To this end, we must plan ahead.
In the current implementation of the protocol, the sector will be forcibly terminated if there are two consecutive weeks of faults. However, two weeks is not enough for a large storage provider to complete the data migration and restart the service. If appropriate measures are not taken, it will not only cause huge economic losses to the storage provider, but also cause large fluctuations in the storage power of the entire Filecoin network.
Therefore, it is necessary to adjust the sector fault period. For deal sectors, the data across all deals currently totals about 32 PiB, so completing the actual data migration within two weeks is not a big problem. For cc sectors, however, which are at the EiB level, two weeks is far from enough to complete the migration, so we propose to extend the cc sector fault period from 2 weeks to 6 weeks.

Proposal

Extend fault period of cc sector from 2 weeks to 6 weeks, and calculate the right fee structure.

Answer 1 · 2021-10-01T13:50:49.000Z

Relevant post from @zixuanzh #183 (comment)

There are a few principles around faults and penalties that we want to hold true from a CryptoEconomics point of view.

Block rewards are minted conditioned on storage being reliable and hence when it is not the case, vesting and minted block rewards are clawed back and slashed.

CryptoEcon mechanisms and parameters are set carefully with good reasons, tradeoffs, and validation. We can still adjust parameters as a community but the bar for doing so should be very high. Changes should preferably not be a temporary fix (unless it is a real emergency) but good for the network and ecosystem as a whole in the long term.

For all the improvements discussed here, we should be very cognizant of potential unintended consequences such as shocks that they might induce to the network.

As such, given the short turnaround time that we have to the Chocolate upgrade, CryptoEconLab reviewed most of the proposals above and thinks it's reasonable to extend the termination period, as faults may sometimes take longer than the original 14 days to recover. This will generally make the network stronger and more resilient to faults. However, we don't have a good answer as to what the duration should be (it is unclear how long it takes to migrate large volumes of data), but somewhere between 4-6 weeks seems reasonable (10% of a sector's lifetime). Note that even with a longer termination window, the fault fees continue to strongly incentivize highly reliable storage, encouraging storage providers to invest in keeping an N+1 copy or live migration wherever possible.

Answer 2 · 2021-10-01T14:40:16.000Z

Initial draft was submitted at #190

Answer 3 · 2021-10-01T19:24:44.000Z

The FIP draft also suggested lowering the wdpost penalty fee, which I think should be decoupled from the scope of this FIP (extending the fault period of cc sectors from 2 weeks to 6 weeks). Lowering the wdpost penalty fee may have a large security impact on the network, which needs more careful analysis.
I think that even with the sector termination period extended, the storage provider should still try to bring the sector and deal back online and prove that the data is being safely stored as soon as possible, and a reasonably high penalty is necessary to help enforce this.

Answer 4 · 2021-10-02T09:32:41.000Z

Totally agree with @jennijuju's comment:

Lowering the wdpost penalty fee may have a large security impact on the network, which needs more careful analysis.

Hastily just lowering the penalty fee without any analysis does not seem like a good idea.

Answer 5 · 2021-10-04T21:17:23.000Z

NOTE: These are hasty "back of the envelope" calculations. Please do your own calculations and don't make important business decisions purely based on this analysis.

Let's quickly talk about costs. From my calculations, it will likely be more cost effective to simply buy new hardware and copy the data than to shut down operations and move, assuming a storage provider is able to keep the original data center running while the move is in progress. This is a case where planning ahead is important.

Assuming a conversion of 1FIL to 60USD (conservative bound).
Assuming a block reward of 0.025FIL/TiB (currently 0.027, was 0.031 a month ago).
Leaving sectors faulty costs 3.5BR per day in fines, plus 1BR per day in missed block rewards. That's 4.5BR per day or ~6.7USD/day per TiB. Let's call this F (fees).
According to my numbers (which are very rough estimates), a 24*16TB rack plus a PC to handle continuous proving (window post) shouldn't cost more than 18K USD, as far as I can tell. That's 51.54USD per TiB. Let's call this H (hardware cost).
Let's assume some upkeep cost (U) per TiB.

Given the above, it becomes rational to simply buy all new hardware and copy once the expected downtime exceeds H/(F-U) days. Assuming 10% of profits go towards upkeep (U = 0.025*60*.1 = 0.15), that's 7.87 days (unfortunately, I have no idea what these costs actually are).

Even assuming storage providers are totally unprofitable (all rewards go to upkeep), H/(F-U) = 9.9 days.
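For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch of the calculation above. The function names are made up for illustration, and the inputs are just the rough assumptions listed in this comment (60 USD/FIL, 0.025 FIL/TiB block reward, 3.5BR fine plus 1BR missed reward per day, an 18K USD rack of 24 16TB drives). Note that it uses the unrounded F = 6.75 USD/day, so it gives slightly lower break-even values than the 7.87 and 9.9 days quoted above, which follow from rounding F down to 6.7.

```python
TB_PER_TIB = 2**40 / 10**12  # one TiB expressed in TB (about 1.0995)


def fault_cost_per_tib_day(fil_usd, br_fil_per_tib, fine_br=3.5, missed_br=1.0):
    """F: daily cost in USD per TiB of leaving a sector faulty (fines + missed rewards)."""
    return (fine_br + missed_br) * br_fil_per_tib * fil_usd


def hardware_cost_per_tib(rack_usd, drives, drive_tb):
    """H: hardware cost in USD per TiB of raw capacity."""
    raw_tib = drives * drive_tb / TB_PER_TIB
    return rack_usd / raw_tib


def break_even_days(hardware, fees, upkeep):
    """Downtime (days) beyond which buying new hardware is cheaper: H / (F - U)."""
    if fees <= upkeep:
        raise ValueError("fees must exceed upkeep for a break-even point to exist")
    return hardware / (fees - upkeep)


# Rough assumptions taken from the comment above.
daily_reward = 0.025 * 60                             # 1.5 USD/day per TiB
fees = fault_cost_per_tib_day(60, 0.025)              # 6.75 USD/day per TiB
hardware = hardware_cost_per_tib(18_000, 24, 16)      # about 51.54 USD per TiB

profitable = break_even_days(hardware, fees, 0.1 * daily_reward)  # 10% upkeep
unprofitable = break_even_days(hardware, fees, daily_reward)      # all rewards go to upkeep

print(f"F = {fees:.2f} USD/day/TiB, H = {hardware:.2f} USD/TiB")
print(f"break-even with 10% upkeep:  {profitable:.2f} days")
print(f"break-even with 100% upkeep: {unprofitable:.2f} days")
```

Plugging in a halved fine (`fine_br=1.75`) or a different FIL price is a quick way to see how sensitive the break-even point is to each assumption.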

Importantly, that's assuming buying all new hardware. However, there's a third option: buy, e.g., 10% of the necessary hardware, then move in 10 steps. That will take 10 times longer, but will also be 10 times cheaper.

Additionally, these calculations are ignoring the value of the new hardware. If this hardware is eventually used for something, H goes to 0 and shutting down never makes sense, financially at least.

Finally, even if the penalty fee is halved, it's unlikely that shutting down for more than 14 days would be rational given the numbers above.

Answer 6 · 2021-10-05T22:51:42.000Z

Sharing a few thoughts here with respect to the potential meta-governance dynamics of this FIP. Unfortunately, due to the rushed nature of the proposal, there is no time for a full analysis. That being said, hopefully some of these ideas are useful.

From a high level it seems like there is an external event and the response is an internal political process to change network policy (modifying protocol parameters via a FIP). This is not good or bad, but it could set a precedent that in the future, when external events cause exogenous shocks to stakeholders, they can and should use the FIP process to request changes to the protocol. External events are a matter of when, not if. As such, this could affect stakeholder expectations wrt network resilience, incentives, and governance:

Resilience: stakeholders might be less incentivized to invest in and plan for their own resilience if they think they can get bailed out by the network.
Incentives: stakeholders might trust the rules of the protocol less if they think that whenever there's an external event it might result in a rushed change to the protocol rules.
Governance: if external events are an opportunity to get proposals passed with less scrutiny, stakeholders might take advantage of these opportunities to push for protocol changes that might otherwise not be accepted.

These are a few of the dynamics, but there are many more. The point I'm trying to make is that beyond the direct impacts of any decision, how a decision is made will create expectations around how future decisions will be made, which will create expectations around the credibility and stability of the network as a whole. Some of these dynamics were mentioned in #183, but more focused on the details of the FIP itself. To complement that, I thought it might be useful to also think about these dynamics at a high level. That way it's easier to determine what class of problems is present (in this case responding to the potential shock from an external event), and whether the solution at hand is right for that type of problem (an immediate policy decision) and/or whether other approaches might be a better fit.

TL;DR: even if you don't have a strong opinion on this FIP, the meta-governance dynamics around it could impact future FIPs you do care about, which could inform your opinion on what to do now.

Answer 7 · 2021-10-06T03:24:36.000Z

A rough timeline of this "Issue":

Warning, i will only cite a few significant snippets of the mentioned issues - to get the full picture please dive into the threads and make up your own mind! Thanks

19th of March - #84

Introduce a "maintenance window" flag that can be set for a range of epochs. Any deadlines that end within a maintenance window will not be penalized for missed Window PoSts.

closed in favor of this discussion:

24th of May - #103

The Chinese Vice Premier called for a crackdown on cryptocurrency mining on 5/24. As follow-up instructions to IDCs might come from the government really soon, there is a risk that IDCs cut off infrastructure supplies for hosted Filecoin nodes in China in a matter of days. This would be a massive hit for nodes hosted in China and a huge, permanent hit for the Filecoin network as well. As a prepared emergency remediation for that worst-case scenario, we propose an intermediate period for those nodes to migrate overseas, after which their power can be allowed to come back.

Define a start height and an end height. Between the start and end heights, the ongoing penalty for missing window post is reduced to far less than 5br, and the 14-day deadline for terminating sectors that miss posts is suspended; once the end height passes, penalties and sector terminations resume. This will allow miners some time (presumably much longer than 14 days, considering the logistical complexity) to physically migrate their nodes to locations where mining is allowed for the long term.

FIPs then created in:

28th of May - #106

23rd of July - #106 closed, discussion delegated back to #103 - no activity in the discussion after that!

29th of September - #183

With full confidence in the future development of Filecoin, Chinese storage providers are actively providing storage power, which accounts for a large proportion of the entire Filecoin network. However, the new policy issued by the Chinese government on September 24th proposes a comprehensive rectification and removal of Chinese domestic virtual currency “mining” projects. If the Filecoin project is to be removed, according to the current protocol, not only will it cause huge economic losses to Chinese storage providers, but the overall stability of the Filecoin network will also be battered.

The dormant period can be set between 10 and 180 days. The lower limit is longer than 10 days to prevent storage providers from using the dormant period to evade windowpost penalties, and a 180-day dormant period is enough for a large storage provider to move its storage devices overseas and restart the service.

1st of October - #189

Any country in the world is likely to face force majeure factors such as major natural disasters or social abnormal events, causing storage providers to be unable to provide services normally for a long period of time. To this end, we must plan ahead.
In the current implementation of the protocol, the sector will be forcibly terminated if there are two consecutive weeks of faults. However, two weeks is not enough for a large storage provider to complete the data migration and restart the service. If appropriate measures are not taken, it will not only cause huge economic losses to the storage provider, but also cause large fluctuations in the storage power of the entire Filecoin network.
Therefore, it is necessary to adjust the sector fault period. For deal sectors, the data across all deals currently totals about 32 PiB, so completing the actual data migration within two weeks is not a big problem. For cc sectors, however, which are at the EiB level, two weeks is far from enough to complete the migration, so we propose to extend the cc sector fault period from 2 weeks to 6 weeks.

Extend fault period of cc sector from 2 weeks to 6 weeks, and calculate the right fee structure.

1st of October - #190

Extend fault period of cc sector from 2 weeks to 6 weeks, and calculate the right fee structure.

Now here we are, needing a decision on this within merely hours:

I believe this FIP and its various comments/suggestions needs to be resolved & merged by end of week and approved in the Core Devs meeting on Thursday to be included in the upcoming Chocolate v14 network upgrade. Otherwise, this improvement proposal will be deferred to the next scheduled network upgrade in ~2022.

(#190 (comment))

2 main observations:

1. this is not a new issue
2. most to all issues/proposals want a reduction in fees

About 1.: why rush an implementation? Storage providers had enough time to react to the triggering, external events - since May 24th!

External events are a matter of when, not if.

(#189 (comment))

What are we giving incentives out for here if the problem still exists for storage providers?!?! What would a 4 week extension do for resilient storage? The problem has been known for 4 months....

About 2.: why rush an implementation if the main demand, exclusion/lowering of wdPost fees is not even implemented?

There are so many open questions regarding the implementation of this FIP that i do not see the need for a termination deadline extension outweighing these:

why treat deal sectors and CC sectors the same?
why 6 weeks? why not 4 or 10 or 25?
given a minimal sector life time of 180 days: should we allow storage providers not proving them for ~25% of their lifetime?
without reducing/eliminating the wdPost fault penalties - is this what the proponents want?
is this what the network/community of SPs want?
what is the economic impact of watering down the rules as drastically as this intends to do?
does this cheat me out of my deal as a client?
etc. etc. etc.
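As a rough check of the ~25% figure in the list above, here is a tiny sketch that expresses a fault window as a share of the 180-day minimal sector lifetime. The helper name is made up for illustration, and it assumes the whole window is spent faulty in one stretch.

```python
MIN_SECTOR_LIFETIME_DAYS = 180  # minimal sector lifetime cited in the list above


def unproven_share(fault_weeks, lifetime_days=MIN_SECTOR_LIFETIME_DAYS):
    """Share of a sector's lifetime that a fault window of `fault_weeks` weeks covers."""
    return fault_weeks * 7 / lifetime_days


for weeks in (2, 4, 6):
    print(f"{weeks} weeks: {unproven_share(weeks):.1%} of a {MIN_SECTOR_LIFETIME_DAYS}-day sector")
```

For a 180-day sector that is roughly 8% today and roughly 23% with a 6 week window; sectors committed for longer would see a proportionally smaller share.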

Short: There is no urgency, the problem this tries to solve is old. Months old. We should not implement this in the v14 network upgrade - the consequences are unknown, the impact possibly significant.

Answer 8 · 2021-10-06T04:07:21.000Z

I agree with basically the entirety of @f8-ptrk's post above, and am fairly opposed to this change. The one thing I'd disagree with is

the consequences are unknown, the impact possibly significant.

I feel relatively confident that the change has no potentially significant impact.

Answer 9 · 2021-10-06T04:10:20.000Z

@arajasek wouldn't we be watering down the rules to a point where we triple the time a deal could go unproven? i see that as significant

Answer 10 · 2021-10-06T14:46:36.000Z

@f8-ptrk Yes, 100%, but I think the penalty is high enough that there's no practical impact. It's mostly a weird theoretical point IMO (that the deal may not be stored for almost 33% of the time), but not immediately concerning to me.

(Relatedly, I'm not sure how much this will actually help the problem trying to be solved, but that's not my focus here)

Answer 11 · 2021-10-06T22:37:39.000Z

@arajasek I believe

(Relatedly, I'm not sure how much this will actually help the problem trying to be solved, but that's not my focus here)

This is actually the issue. This isn't a network related issue but a business one. A FIP is the wrong avenue for requesting help with the actual problem at hand. If SPs have found themselves up the creek without a paddle, then requesting assistance in terms of logistics or contacts or loans or technical support would be the first port of call, not changing how Filecoin works.

Answer 12 · 2021-10-07T00:32:03.000Z

I agree with @jennijuju 's comments. I am in favor of the change but without reducing the penalties. I do not see a risk with this FIP and it serves the community which is at risk of losing a lot of power if this is not adjusted.

Answer 13 · 2021-10-07T00:34:01.000Z

@arajasek I believe

(Relatedly, I'm not sure how much this will actually help the problem trying to be solved, but that's not my focus here)

This is actually the issue. This isn't a network related issue but a business one. A FIP is the wrong avenue for requesting help with the actual problem at hand. If SPs have found themselves up the creek without a paddle, then requesting assistance in terms of logistics or contacts or loans or technical support would be the first port of call, not changing how Filecoin works.

I mostly agree with the thinking here. However, if we step away from the motivation of the original discussion, some period of extension may help the storage provider community in general. For example, we've got some MinerX fellows who have had zfspool incidents in the past few weeks and who could benefit from this FIP to buy more time and get more support to recover the sectors before they are permanently gone.

Answer 14 · 2021-10-07T00:51:38.000Z

i agree that we need a solution for sp's in dire situations. but i do not believe that we need it in v14!

we should properly discuss this.

for my part i think a "default - applies to all" solution is the wrong way to go since most of the sp's will never ever need it. an opt-in solution would be better. or a solution that adds a lifetime limit on not proving a sector, 4-6 weeks overall, not in a row.

there are so many ways to make this a screw to fix what is wrong instead of hammering the nail down now in a rush....
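To make the difference concrete, here is a minimal sketch comparing a consecutive-fault limit with the lifetime (cumulative) limit suggested above. Everything in it is an illustrative assumption: the function names, the per-day granularity, the 14- and 28-day thresholds, and the `>=` comparison. It is not how the actors code tracks faults.

```python
def terminated_consecutive(faulty_days, limit_days):
    """True if the sector is ever faulty for `limit_days` days in a row."""
    run = 0
    for faulty in faulty_days:
        run = run + 1 if faulty else 0  # a proven day resets the streak
        if run >= limit_days:
            return True
    return False


def terminated_cumulative(faulty_days, budget_days):
    """True if the total number of faulty days over the lifetime reaches `budget_days`."""
    total = 0
    for faulty in faulty_days:
        if faulty:
            total += 1  # faulty days are never forgiven
        if total >= budget_days:
            return True
    return False


# Hypothetical history: three separate 10-day outages, each followed by 30 healthy days.
history = ([True] * 10 + [False] * 30) * 3

print(terminated_consecutive(history, 14))  # no single outage reaches 14 days in a row
print(terminated_cumulative(history, 28))   # 30 faulty days overall exceed a 28-day budget
```

Under a lifetime budget a single long migration is tolerated once, while repeated shorter outages eventually add up to termination, which a purely consecutive rule never catches.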

Answer 15 · 2021-10-07T01:12:03.000Z

a solution that adds a lifetime limit on not proving a sector, 4-6 weeks overall, not in a row

This is a good suggestion for further discussion imho.

Answer 16 · 2021-10-07T03:30:08.000Z

@arajasek I believe

(Relatedly, I'm not sure how much this will actually help the problem trying to be solved, but that's not my focus here)

This is actually the issue. This isn't a network related issue but a business one. A FIP is the wrong avenue for requesting help with the actual problem at hand. If SPs have found themselves up the creek without a paddle, then requesting assistance in terms of logistics or contacts or loans or technical support would be the first port of call, not changing how Filecoin works.

I mostly agree with the thinking here. However, if we step away from the motivation of the original discussion, some period of extension may help the storage provider community in general. For example, we've got some MinerX fellows who have had zfspool incidents in the past few weeks and who could benefit from this FIP to buy more time and get more support to recover the sectors before they are permanently gone.

That may be the case, but that is not what is being discussed here. Those SPs should be putting forward FIPs that target their problems. They may have the same or similar solution but those are also issues that don't require pushing through changes in an accelerated fashion.

Answer 17 · 2021-10-07T16:45:05.000Z

Reposting from the Core Dev Call

Question: Why 6 weeks?

from @IPFSUnion

Ocean freight can take about 15-30 days to cross an ocean. 4-6 weeks might be the right number for the maximum suspended migration time limit.

Answer 18 · 2021-10-07T17:01:39.000Z

Any country in the world is likely to face force majeure factors such as major natural disasters or social abnormal events, causing storage providers to be unable to provide services normally for a long period of time.

if that is the premise then i do not see any reason why someone should ship something across an ocean! we are not talking about a relocation here but a "forced" migration i think

btw. the street dwell time (the time it takes to get the container out of the port) at west coast ports alone has averaged somewhere between 6 and 8.5 days over the last weeks. with the amount of ships waiting out there, add another 2 days. shipping stuff that is time critical to the west coast is suicide right now. i doubt it's any better globally...

https://www.freightwaves.com/news/record-shattered-61-container-ships-stuck-waiting-off-california

Answer 19 · 2021-10-12T19:26:05.000Z

Closing due to FIP acceptance!