US Census Block plugin - add better recovery after failure
jefffriesen opened this issue · 0 comments
jefffriesen commented
Is there a way to get better recovery after failures? I have about 35K unique lat/lon points that I'm looking up census blocks for, and at one second per request that's roughly 9 hours. After about 7 hours I got the failure below; the script didn't recover, and everything collected up to that point was lost. Is there a more graceful way for the code to recover, or at least keep the results that were already fetched? (I've put a rough sketch of the kind of thing I mean after the log.)
[01:53:45] [INFO] [dku.utils] - 17651 - processing: (40.0309365,-105.2930896)
[01:53:46] [INFO] [dku.utils] - 17652 - processing: (40.0309393,-105.2643413)
[01:55:43] [INFO] [dku.utils] - *************** Recipe code failed **************
[01:55:43] [INFO] [dku.utils] - Begin Python stack
[01:55:43] [INFO] [dku.utils] - Traceback (most recent call last):
[01:55:43] [INFO] [dku.utils] - File "/Users/jeffers/Library/DataScienceStudio/dss_home/jobs/BOULDERCOUNTYSOURCETRANSFORMS/Build_census_blocks_2017-09-13T00-37-50.653/compute_census_blocks_NP/custompyrecipehdMLSMuNyO49/python-exec-wrapper.py", line 3, in <module>
[01:55:43] [INFO] [dku.utils] - execfile(sys.argv[1])
[01:55:43] [INFO] [dku.utils] - File "/Users/jeffers/Library/DataScienceStudio/dss_home/jobs/BOULDERCOUNTYSOURCETRANSFORMS/Build_census_blocks_2017-09-13T00-37-50.653/compute_census_blocks_NP/custompyrecipehdMLSMuNyO49/script.py", line 68, in <module>
[01:55:43] [INFO] [dku.utils] - 'showall': 'true'
[01:55:43] [INFO] [dku.utils] - File "/Applications/DataScienceStudio.app/Contents/Resources/kit/python.packages/requests/api.py", line 70, in get
[01:55:43] [INFO] [dku.utils] - return request('get', url, params=params, **kwargs)
[01:55:43] [INFO] [dku.utils] - File "/Applications/DataScienceStudio.app/Contents/Resources/kit/python.packages/requests/api.py", line 56, in request
[01:55:43] [INFO] [dku.utils] - return session.request(method=method, url=url, **kwargs)
[01:55:43] [INFO] [dku.utils] - File "/Applications/DataScienceStudio.app/Contents/Resources/kit/python.packages/requests/sessions.py", line 488, in request
[01:55:43] [INFO] [dku.utils] - resp = self.send(prep, **send_kwargs)
[01:55:43] [INFO] [dku.utils] - File "/Applications/DataScienceStudio.app/Contents/Resources/kit/python.packages/requests/sessions.py", line 609, in send
[01:55:43] [INFO] [dku.utils] - r = adapter.send(request, **kwargs)
[01:55:43] [INFO] [dku.utils] - File "/Applications/DataScienceStudio.app/Contents/Resources/kit/python.packages/requests/adapters.py", line 499, in send
[01:55:43] [INFO] [dku.utils] - raise ReadTimeout(e, request=request)
[01:55:43] [INFO] [dku.utils] - ReadTimeout: HTTPConnectionPool(host='data.fcc.gov', port=80): Read timed out. (read timeout=None)
[01:55:43] [INFO] [dku.utils] - End Python stack
[01:55:43] [INFO] [com.dataiku.dip.recipes.customcode.CustomPythonRecipeRunner] - Error file found, trying to throw it: /Users/jeffers/Library/DataScienceStudio/dss_home/jobs/BOULDERCOUNTYSOURCETRANSFORMS/Build_census_blocks_2017-09-13T00-37-50.653/compute_census_blocks_NP/custompyrecipehdMLSMuNyO49/error.json
[01:55:43] [ERROR] [com.dataiku.dip.dataflow.streaming.DatasetWritingService] - Wait session error: null
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.server.HttpInput$3.noContent(HttpInput.java:464)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:124)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.dataiku.dip.input.stream.InputStreamLineReader.readLine(InputStreamLineReader.java:30)
at com.dataiku.dip.input.formats.csv.RFC4180CSVParser.next(RFC4180CSVParser.java:21)
at com.dataiku.dip.dataflow.streaming.DatasetWriter.appendFromCSVStream(DatasetWriter.java:139)
at com.dataiku.dip.dataflow.streaming.DatasetWritingService.pushData(DatasetWritingService.java:255)
at com.dataiku.dip.dataflow.kernel.slave.KernelSession.pushData(KernelSession.java:237)
at com.dataiku.dip.dataflow.kernel.slave.KernelServlet.service(KernelServlet.java:199)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:738)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:551)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1111)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:478)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1045)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:462)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:279)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:232)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:534)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
at java.lang.Thread.run(Thread.java:745)
[01:55:43] [INFO] [dku.flow.activity] - Run thread failed for activity compute_census_blocks_NP
com.dataiku.common.server.APIError$SerializedErrorException: Error in Python process: <class 'requests.exceptions.ReadTimeout'>: HTTPConnectionPool(host='data.fcc.gov', port=80): Read timed out. (read timeout=None)
at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:304)
at com.dataiku.dip.recipes.customcode.CustomPythonRecipeRunner.run(CustomPythonRecipeRunner.java:79)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:353)
[01:55:43] [ERROR] [com.dataiku.dip.dataflow.streaming.DatasetWritingService] - Push data error during streaming:null
org.eclipse.jetty.io.EofException
at org.eclipse.jetty.server.HttpInput$3.noContent(HttpInput.java:464)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:124)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.dataiku.dip.input.stream.InputStreamLineReader.readLine(InputStreamLineReader.java:30)
at com.dataiku.dip.input.formats.csv.RFC4180CSVParser.next(RFC4180CSVParser.java:21)
at com.dataiku.dip.dataflow.streaming.DatasetWriter.appendFromCSVStream(DatasetWriter.java:139)
at com.dataiku.dip.dataflow.streaming.DatasetWritingService.pushData(DatasetWritingService.java:255)
at com.dataiku.dip.dataflow.kernel.slave.KernelSession.pushData(KernelSession.java:237)
at com.dataiku.dip.dataflow.kernel.slave.KernelServlet.service(KernelServlet.java:199)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:738)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:551)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1111)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:478)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1045)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:462)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:279)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:232)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:534)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
at java.lang.Thread.run(Thread.java:745)
[01:55:43] [DEBUG] [dku.jobs] - Command /tintercom/datasets/push-data processed in 26269806ms
[01:55:43] [DEBUG] [dku.jobs] - Command /tintercom/datasets/wait-write-session processed in 26269807ms
[01:55:43] [INFO] [dku.flow.activity] running compute_census_blocks_NP - activity is finished
[01:55:43] [ERROR] [dku.flow.activity] running compute_census_blocks_NP - Activity failed
com.dataiku.common.server.APIError$SerializedErrorException: Error in Python process: <class 'requests.exceptions.ReadTimeout'>: HTTPConnectionPool(host='data.fcc.gov', port=80): Read timed out. (read timeout=None)
at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:304)
at com.dataiku.dip.recipes.customcode.CustomPythonRecipeRunner.run(CustomPythonRecipeRunner.java:79)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:353)
[01:55:43] [INFO] [dku.flow.activity] running compute_census_blocks_NP - Executing default post-activity lifecycle hook
[01:55:43] [INFO] [dku.flow.activity] running compute_census_blocks_NP - Removing samples for BOULDERCOUNTYSOURCETRANSFORMS.census_blocks
[01:55:43] [INFO] [dku.flow.activity] running compute_census_blocks_NP - Done post-activity tasks
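For illustration, here's a rough sketch of the kind of recovery I have in mind (this is not the plugin's actual code; the endpoint URL, the response layout, and the checkpoint file name are my assumptions): give each request an explicit timeout, retry transient errors with a short backoff, and append every result to a checkpoint file as it arrives, so a re-run can skip the points that were already fetched instead of starting over.

```python
import csv
import os
import time

import requests

# Assumptions for this sketch: the endpoint and response layout below are what the
# recipe appears to be calling (the log only shows the host data.fcc.gov), and the
# checkpoint file name is made up.
FCC_URL = "http://data.fcc.gov/api/block/find"
CHECKPOINT = "census_blocks_partial.csv"


def already_done():
    """Return the set of (lat, lon) string pairs already written to the checkpoint."""
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return {(row["lat"], row["lon"]) for row in csv.DictReader(f)}


def lookup(lat, lon, retries=3, timeout=60):
    """Query the block API with an explicit timeout and a few retries with backoff."""
    for attempt in range(retries):
        try:
            resp = requests.get(
                FCC_URL,
                params={"latitude": lat, "longitude": lon,
                        "showall": "true", "format": "json"},
                timeout=timeout,  # the failed run had read timeout=None
            )
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, ... before retrying


def fetch_blocks(points):
    """Fetch a block for each (lat, lon) pair, skipping points already checkpointed."""
    done = already_done()
    new_file = not os.path.exists(CHECKPOINT)
    with open(CHECKPOINT, "a", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["lat", "lon", "block_fips"])
        if new_file:
            writer.writeheader()
        for lat, lon in points:
            if (str(lat), str(lon)) in done:
                continue  # fetched on a previous run; keep it
            data = lookup(lat, lon)
            fips = data.get("Block", {}).get("FIPS")  # assumed response layout
            writer.writerow({"lat": lat, "lon": lon, "block_fips": fips})
            out.flush()  # persist each row so a crash loses at most the current point
            time.sleep(1)  # stay around 1 request/second


if __name__ == "__main__":
    fetch_blocks([(40.0309365, -105.2930896), (40.0309393, -105.2643413)])
```

With something along those lines, a single ReadTimeout seven hours in would cost a few retries at worst rather than the whole run.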