Problem with CSV containing "WEXPRO COMPANY","MUSSER, B W "B"","6"
idontgetoutmuch opened this issue · 3 comments
With this file
line_id,report_month,report_year,ST,api_county_code,api_seq_num,sidetrack_num,formation_code,well_status,prod_days,water_disp_code,water_vol,water_press_tbg,water_press_csg,bom_invent,oil_vol,oil_sales,adjustment,eom_invent,gravity_sales,gas_sales,flared,gas_vol,shrink,gas_prod,btu_sales,gas_press_tbg,gas_press_csg,operator_num,name,facility_name,facility_num,accepted_date,revised
0,01,1999,"05","045","06036","00","WMFK","SI",,,,,,,,,,,,,,,,,,,,88348,"TIMBERLINE ENERGY INC","LILLIAN B. SMITH, ET AL","1",2017-05-18 16:53:40.077000000,
1,01,1999,"05","081","05447","00","WSTC","SI",0,,,,,,,,,,,,,,,,,,,95960,"WEXPRO COMPANY","MUSSER, B W "B"","6",2017-03-23 17:45:54.713000000,
and this code
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes #-}
import Frames hiding ((:&))
import qualified Control.Foldl as Foldl
import Control.Lens
import Pipes.Prelude (fold)
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Core as PC
import qualified Data.Foldable as F
import Foo
import Data.Void
import qualified Data.Vinyl as V
import Data.Vinyl (Rec(..))
import Data.Text
import Data.Attoparsec.Text
import Control.Applicative
tableTypesText "ProductionData" "/Users/dom/Downloads/2017_prod_reports_8.csv"
declareColumn "oil_vol_double" ''Double
getOilVol :: ProductionData -> Record '[OilVolDouble]
getOilVol x = pure (Col ((f $ (parseOnly parseDouble) (x ^. oilVol)))) :& Nil
where
f (Left _) = 0.0 / 0.0
f (Right y) = y
readOilVol :: MonadSafe m => Producer (Record '[OilVolDouble]) m ()
readOilVol = (readTable "/Users/dom/Downloads/2017_prod_reports_8.csv") >->
(P.map getOilVol)
oilVolLength :: Foldl.Fold (Record '[OilVolDouble]) Int
oilVolLength = Foldl.length
totalOilVol :: Foldl.Fold (Record '[OilVolDouble]) Double
totalOilVol = (Foldl.handles oilVolDouble) Foldl.sum
oilVolTotalAndLength :: Foldl.Fold (Record '[OilVolDouble]) (Double, Int)
oilVolTotalAndLength = (,) <$> totalOilVol <*> oilVolLength
parseDouble :: Parser Double
parseDouble =
do d <- double
return d
<|>
do _ <- string ""
return 0.0
test = do
(t, l) <- runSafeT $
Foldl.purely fold oilVolTotalAndLength readOilVol
putStrLn $ show l ++ " records totalling " ++ show t
we get
*Main> test
1 records totalling 0.0
With Python's pandas package we get 2 records as expected
Side note: the second facility_name
in the python example is interesting.
I've fixed this on the master branch, but I'm a bit uncertain about the fix. It doesn't break any of our tests, so that's good, but I don't think I wrote the original code so I'm not sure about the thinking that went into it.
The problem is detecting sections of CSV data that are quoted strings. The logic we employed looked for an odd number of quotation marks at the start of a string to indicate the beginning of a quoted string, and an odd number of quotation marks at the end of a string to indicate the end of a quoted string. The "MUSSER, B W "B""
datum breaks that logic, but I'm not certain why the code was written the way it was.
I've added this data and a similar program as a test case. We may want to add more data extraction tests as it looks like a particularly challenging data set.
I'm going to take another look around to see if it will be easier to offload the CSV parsing parts and try to break the revised quotation logic before pushing a new release.
Thank you for the report, the data, and the sample program, by the way, it was extremely helpful!
I'm going to take another look around to see if it will be easier to offload the CSV parsing parts and try to break the revised quotation logic before pushing a new release.
That sounds a very good idea. I have used cassava. In fact I might just try it now with this example. https://hackage.haskell.org/package/cassava says
The goal is to roughly accept what the Python csv module accepts.
BTW I am immensely grateful for your continued efforts with this package.