ruby/strscan

Two helpful methods: currchar and nextchar

shreeve opened this issue · 3 comments

Edited: This was the original message, which has been changed since then.

I'm working on a small but very performant CSV parser and am using strscan in several key places. The strscan library is really amazing and the code is beautiful, but I believe the addition of 3 methods would be extremely helpful. Of course, I can and have mimicked these methods in my Ruby code, but adding them to strscan directly would be a performance boost.

The methods are:

  1. scan_upto: This would work identically to scan_until, except that it would not include the match. Currently, if my string is abcdef and I run s.scan_until(/de/), it returns abcde and leaves the pointer on f. In contrast, s.scan_upto(/de/) would return abc and leave the pointer on d (i.e., everything "up to, but not including" what I scanned for).

  2. next_char: Currently, we can use getch to return the "current" character and then advance the pointer. So, if I have a scanner with abcdef and s.pos is 3 (i.e., on the d character), calling getch returns d and advances the pointer to e. To learn the "next" character, I then have to read again. s.peek(1) would give me the e, but peek is not multibyte-aware, so I may get only one byte of a multibyte character. I propose that next_char work much like getch does today, except that it would first advance the pointer one character (in a multibyte-aware manner, as getch does) and then return the now-current character (again, multibyte-aware) at the new position. In the example above, with abcdef and the pointer on d, next_char would advance to e and return e.

  3. curr_char: Right now, I can call s.string[s.pos] to get the character at the current position (this only works for single-byte text, since pos is a byte offset while String#[] indexes by character). It is also slower than I would like, and when writing highly performant code the slowdown is noticeable. Instead, I would like curr_char to do the same thing getch does today (including being multibyte-aware), except not advance the pointer. So far I've been using s.peek(1) to read the current character (it's faster than s.string[s.pos]), but s.peek(1) is not multibyte-aware, so it's not safe to use in the general case.
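
To make the current behavior described in the list concrete, here is a small sketch using only methods that exist in strscan today. The variable names are just for illustration.

```ruby
require 'strscan'

s = StringScanner.new("abcdef")
until_result = s.scan_until(/de/)  # => "abcde" (the match is included)
until_pos    = s.pos               # => 5, i.e. on "f"

s.pos  = 3                         # back on "d"
got    = s.getch                   # => "d", and the pointer moves to "e"
peeked = s.peek(1)                 # => "e"

m = StringScanner.new("\u00E9!")   # "é" takes two bytes in UTF-8
partial       = m.peek(1)          # only the first byte of "é"
partial_bytes = partial.bytes      # => [195]
partial_valid = partial.valid_encoding? # => false
whole         = m.getch            # => "é" (getch is multibyte-aware)
after_getch   = m.pos              # => 2 (pos counts bytes)
```

The last group shows the gap: peek(1) hands back a broken half-character, while getch reads the whole character but also moves the pointer.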

These three small methods would give a welcome boost to the code I am working on, where I have had to implement them in Ruby until now.
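
For reference, the semantics of the three proposed methods can be approximated in plain Ruby roughly as follows. This is only a sketch of the intended behavior: SketchScanner is a made-up class name, none of these methods exist in strscan, and a native implementation would presumably avoid the extra work done here.

```ruby
require 'strscan'

# Hypothetical pure-Ruby stand-ins for the proposed methods.
class SketchScanner < StringScanner
  # Like scan_until, but returns only the text before the match and
  # leaves the pointer on the first character of the match.
  def scan_upto(pattern)
    start = pos
    return nil unless scan_until(pattern)
    stop = pos - matched_size            # byte offset where the match began
    self.pos = stop
    string.byteslice(start, stop - start)
  end

  # Multibyte-aware character at the pointer, without advancing.
  # Returns nil at the end of the string. Note: this goes through getch,
  # so it overwrites the scanner's last-match state.
  def curr_char
    saved = pos
    char = getch
    self.pos = saved
    char
  end

  # Advance one character, then return the now-current character.
  def next_char
    getch
    curr_char
  end
end

s = SketchScanner.new("abcdef")
upto     = s.scan_upto(/de/)   # => "abc"
upto_pos = s.pos               # => 3, i.e. on "d"
current  = s.curr_char         # => "d", pointer stays on "d"
following = s.next_char        # => "e", pointer is now on "e"

m = SketchScanner.new("a\u00E9\u00FC")  # "aéü"
first  = m.curr_char           # => "a"
second = m.next_char           # => "é" (whole character, not one byte)
third  = m.next_char           # => "ü"
last   = m.next_char           # => nil at the end of the string
```

The curr_char sketch saves and restores pos around getch, which is exactly the kind of overhead a built-in method could skip.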

Here's the code I am working on...

#!/usr/bin/env ruby

# ==============================================================================
# censive - A quick and lightweight CSV handling library for Ruby
#
# Author: Steve Shreeve (steve.shreeve@gmail.com)
#   Date: Feb 1, 2023
#
# Thanks to https://crystal-lang.org/api/1.7.2/CSV.html (Crystal's CSV library)
# and, also https://github.com/ruby/strscan/blob/master/ext/strscan/strscan.c
# ==============================================================================
# The goals are:
#
# 1. Faster than Ruby's default CSV library
# 2. Lightweight code base with streamlined logic
# 3. Support for most non-compliant CSV variations
#
# Todo:
#
# 1. Support IO streaming
# 2. Add option to strip whitespace
# 3. Support CSV headers in first row
# 4. Confirm file encodings such as UTF-8, UTF-16, etc.
#
# NOTE: Only getch and scan_until advance strscan's position
# NOTE: getch returns peek(1) but *then* advances; it's out of sync
# TODO: add curr_char to strscan that is like peek(1), but is multibyte-aware
# TODO: add next_char to strscan that advances and *then* returns curr_char
# TODO: add scan_upto to strscan that is like scan_until but returns pre_match
# TODO: the scan_upto should leave pos at the first character of what matched
# ==============================================================================

require 'strscan'

class Censive < StringScanner

  def self.writer(obj=$stdout, **opts, &code)
    case obj
    when String then File.open(obj, 'w') {|file| yield new(out: file, **opts, &code) }
    when IO     then new(out: obj, **opts, &code)
    else abort "#{File.basename($0)}: invalid #{obj.class} object in writer"
    end
  end

  def initialize(str=nil,
    drop:  false   , # drop trailing empty fields?
    eol:   "\n"    , # line endings for exports
    excel: false   , # literals(="01") formulas(=A1 + B2); http://bit.ly/3Y7jIvc
    mode:  :compact, # export mode: compact or full
    out:   nil     , # output stream, needs to respond to <<
    quote: '"'     , # quote character
    relax: false   , # relax quote parsing so ,"Fo"o, => ,"Fo""o",
    sep:   ','     , # column separator character
    **opts           # grab bag
  )
    super(str || '')
    reset

    @drop   = drop
    @eol    = eol  .freeze #!# TODO: are the '.freeze' statements helpful?
    @excel  = excel
    @mode   = mode
    @out    = out
    @quote  = quote.freeze
    @relax  = relax
    @sep    = sep  .freeze

    @es     = ""   .freeze
    @cr     = "\r" .freeze
    @lf     = "\n" .freeze
    @eq     = "="  .freeze
    @esc    = (@quote * 2).freeze

    @tokens = [@sep,@quote,@cr,@lf,@es,nil]
  end

  def reset(str=nil)
    self.string = str if str
    super()
    @char = peek(1)
    @flag = nil

    @rows = nil
    @cols = @cells = 0
  end

  # ==[ Lexer ]==

  def next_char
    getch
    @char = peek(1) #!# FIXME: not multibyte encoding aware
  end

  def next_token
    case @flag
    when @es then @flag = nil; [@cr,@lf,@es,nil].include?(@char) and return @es
    when @cr then @flag = nil; next_char == @lf and next_char
    when @lf then @flag = nil; next_char
    else          @flag = nil
    end if @flag

    # Excel literals ="0123" and formulas =A1 + B2 (see http://bit.ly/3Y7jIvc)
    if @excel && @char == @eq
      @flag = @eq
      next_char
    end

    if @tokens.include?(@char)
      case @char
      when @quote # consume quoted cell
        match = ""
        while true
          next_char # move past the quote that got us here
          match << (scan_until(/(?=#{@quote})/o) or bomb "unclosed quote")
          case next_char
          when @sep            then @flag = @es; next_char; break
          when @quote          then match << @quote
          when @cr,@lf,@es,nil then break
          else @relax ? match << (@quote + @char) : bomb("invalid character after quote")
          end
        end
        match
      when @sep    then @flag = @es; next_char; @es
      when @cr     then @flag = @cr; nil
      when @lf     then @flag = @lf; nil
      when @es,nil then              nil
      end
    else # consume unquoted cell
      match = scan_until(/(?=#{@sep}|#{@cr}|#{@lf}|\z)/o) or bomb "unexpected character"
      match = @eq + match and @flag = nil if @flag == @eq
      @char = peek(1) #!# FIXME: not multibyte encoding aware
      @char == @sep and @flag = @es and next_char
      match
    end
  end

  def bomb(msg)
    abort "\n#{File.basename($0)}: #{msg} at character #{pos} near '#{string[pos-4,7]}'"
  end

  # ==[ Parser ]==

  def parse
    @rows = []
    while row = next_row
      @rows << row
      count = row.size
      @cols = count if count > @cols
      @cells += count
    end
    @rows
  end

  def next_row
    token = next_token or return
    row = [token]
    row << token while token = next_token
    row
  end

  # ==[ Helpers ]==

  # returns 2 (must be quoted and escaped), 1 (must be quoted), 0 (neither)
  def grok(str)
    if pos = str.index(/(#{@quote})|#{@sep}|#{@cr}|#{@lf}/o)
      $1 ? 2 : str.index(/#{@quote}/o, pos) ? 2 : 1
    else
      0
    end
  end

  # output a row
  def <<(row)
    @out or return super

    # drop trailing empty columns
    row.pop while row.last&.empty? if @drop # &. stops cleanly once the row is empty

    #!# FIXME: Excel output needs to protect 0-leading numbers

    s,q = @sep, @quote
    out = case @mode
    when :compact
      case grok(row.join)
      when 0
        row
      when 1
        row.map do |col|
          col.match?(/#{@sep}|#{@cr}|#{@lf}/o) ? "#{q}#{col}#{q}" : col
        end
      else
        row.map do |col|
          case grok(col)
          when 0 then col
          when 1 then "#{q}#{col}#{q}"
          else        "#{q}#{col.gsub(q, @esc)}#{q}"
          end
        end
      end
    when :full
      row.map {|col| "#{q}#{col.gsub(q, @esc)}#{q}" }
    end.join(s)

    @out << out + @eol
  end

  def each
    @rows ||= parse
    @rows.each {|row| yield row }
  end

  def export(...)
    out = self.class.writer(...)
    each {|row| out << row }
  end

  def stats
    wide = string.bytesize.to_s.size
    puts "%#{wide}d rows"    % @rows.size
    puts "%#{wide}d columns" % @cols
    puts "%#{wide}d cells"   % @cells
    puts "%#{wide}d bytes"   % string.bytesize # size would count characters, not bytes
  end
end

if __FILE__ == $0
  raw = DATA.gets("\n\n").chomp
  csv = Censive.new(raw, excel: true)
  csv.export # (sep: "\t", excel: true)
end

__END__
Name,Age,Shoe
Alice,27,5
Bob,33,10 1/2
Charlie or "Chuck",=B2 + B3,9
"Doug E Fresh",="007",10
Subtotal,=sum(B2:B5),="01234"

The above code does not use these three new methods, but would benefit from them.

Going to close this and create a new, simpler issue.