acowley/Frames

Problem with CSV containing "WEXPRO COMPANY","MUSSER, B W "B"","6"

idontgetoutmuch opened this issue · 3 comments

With this file

line_id,report_month,report_year,ST,api_county_code,api_seq_num,sidetrack_num,formation_code,well_status,prod_days,water_disp_code,water_vol,water_press_tbg,water_press_csg,bom_invent,oil_vol,oil_sales,adjustment,eom_invent,gravity_sales,gas_sales,flared,gas_vol,shrink,gas_prod,btu_sales,gas_press_tbg,gas_press_csg,operator_num,name,facility_name,facility_num,accepted_date,revised
0,01,1999,"05","045","06036","00","WMFK","SI",,,,,,,,,,,,,,,,,,,,88348,"TIMBERLINE ENERGY INC","LILLIAN B. SMITH, ET AL","1",2017-05-18 16:53:40.077000000,
1,01,1999,"05","081","05447","00","WSTC","SI",0,,,,,,,,,,,,,,,,,,,95960,"WEXPRO COMPANY","MUSSER, B W "B"","6",2017-03-23 17:45:54.713000000,

and this code

{-# LANGUAGE TemplateHaskell   #-}
{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE FlexibleContexts  #-}
{-# LANGUAGE TypeOperators     #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}

import           Frames hiding ((:&))
import qualified Control.Foldl as Foldl
import           Control.Lens
import           Pipes.Prelude (fold)
import           Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Core as PC
import qualified Data.Foldable as F

import           Foo

import           Data.Void

import qualified Data.Vinyl as V
import           Data.Vinyl (Rec(..))

import           Data.Text
import           Data.Attoparsec.Text
import           Control.Applicative



tableTypesText "ProductionData" "/Users/dom/Downloads/2017_prod_reports_8.csv"

declareColumn "oil_vol_double" ''Double

getOilVol :: ProductionData -> Record '[OilVolDouble]
getOilVol x = pure (Col ((f $ (parseOnly parseDouble) (x ^. oilVol)))) :& Nil
  where
    f (Left _)  = 0.0 / 0.0
    f (Right y) = y

readOilVol :: MonadSafe m => Producer (Record '[OilVolDouble]) m ()
readOilVol = (readTable "/Users/dom/Downloads/2017_prod_reports_8.csv") >->
             (P.map getOilVol)

oilVolLength :: Foldl.Fold (Record '[OilVolDouble]) Int
oilVolLength = Foldl.length

totalOilVol :: Foldl.Fold (Record '[OilVolDouble]) Double
totalOilVol = (Foldl.handles oilVolDouble) Foldl.sum

oilVolTotalAndLength :: Foldl.Fold (Record '[OilVolDouble]) (Double, Int)
oilVolTotalAndLength = (,) <$> totalOilVol <*> oilVolLength

parseDouble :: Parser Double
parseDouble =
  do d <- double
     return d
  <|>
  do _ <- string ""
     return 0.0

test = do
  (t, l) <- runSafeT $
            Foldl.purely fold oilVolTotalAndLength readOilVol
  putStrLn $ show l ++ " records totalling " ++ show t

we get

*Main> test
1 records totalling 0.0

With Python's pandas package we get 2 records as expected

pandasforoil

Side note: the second facility_name in the python example is interesting.

I've fixed this on the master branch, but I'm a bit uncertain about the fix. It doesn't break any of our tests, so that's good, but I don't think I wrote the original code so I'm not sure about the thinking that went into it.

The problem is detecting sections of CSV data that are quoted strings. The logic we employed looked for an odd number of quotation marks at the start of a string to indicate the beginning of a quoted string, and an odd number of quotation marks at the end of a string to indicate the end of a quoted string. The "MUSSER, B W "B"" datum breaks that logic, but I'm not certain why the code was written the way it was.

I've added this data and a similar program as a test case. We may want to add more data extraction tests as it looks like a particularly challenging data set.

I'm going to take another look around to see if it will be easier to offload the CSV parsing parts and try to break the revised quotation logic before pushing a new release.

Thank you for the report, the data, and the sample program, by the way, it was extremely helpful!

I'm going to take another look around to see if it will be easier to offload the CSV parsing parts and try to break the revised quotation logic before pushing a new release.

That sounds a very good idea. I have used cassava. In fact I might just try it now with this example. https://hackage.haskell.org/package/cassava says

The goal is to roughly accept what the Python csv module accepts.

BTW I am immensely grateful for your continued efforts with this package.