willtrnr/pyxlsb

use numba to speed up conversion?

Closed this issue · 3 comments

I'm converting an XLSB file of about 150 MB, and it takes more than 20 minutes to complete. That's too long for me. How do I speed this up? I tried numba, but it did not work, probably due to the mixture of text and numbers in my file. Is pyxlsb known to work with numba while reading Excel rows?

What I am after is a fast way to read an XLSB file into a Pandas DataFrame.

Here is my current code.

from numba import jit
from pyxlsb import open_workbook as open_xlsb
import pandas as pd

# @jit(nopython=True, parallel=True)  # numba decorator tried and disabled; it fails on this function
def xlsb2array(xlsb, sheetnum=2):
    csvArr = []
    with open_xlsb(xlsb) as wb:
        # Read the requested sheet row by row, keeping only the cell values
        with wb.get_sheet(sheetnum) as sheet:
            for row in sheet.rows(sparse=True):
                vals = [item.v for item in row]
                csvArr.append(vals)
    return csvArr

df = pd.DataFrame(xlsb2array(myxlsb))
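As a small follow-up sketch: if the first spreadsheet row holds column names (an assumption about this particular file, not something pyxlsb guarantees), the same list can be split so pandas gets proper headers instead of numeric column labels:

rows = xlsb2array(myxlsb)
# First row becomes the header, remaining rows become the data
df = pd.DataFrame(rows[1:], columns=rows[0])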

I actually didn't know about numba until now, but considering how I'm seeking around in the data file to read the rows, I really don't expect that parallelizing with it will work properly.

I'd like to allow direct cell addressing by memory-mapping the file and indexing some regions, which might help with performance and multi-threaded scenarios.
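As a rough illustration of that idea (this is not pyxlsb internals; the file name, offsets, and helper name below are hypothetical), Python's mmap module lets several threads read arbitrary byte ranges of a file without sharing a seek position on a single file handle:

import mmap
from concurrent.futures import ThreadPoolExecutor

def read_region(mm, offset, length):
    # Slicing a memory-mapped file copies just that byte range;
    # no file-position state is shared between callers.
    return mm[offset:offset + length]

with open("big.xlsb", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Hypothetical (offset, length) pairs for pre-indexed regions of the file
        regions = [(0, 64), (1024, 128), (4096, 256)]
        with ThreadPoolExecutor() as pool:
            chunks = list(pool.map(lambda r: read_region(mm, *r), regions))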

Also, reading into a pandas DataFrame seems like a very obvious use case, and I'd like to look into how I can support that directly in #12.
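For reference, newer pandas releases (1.0 and later) can already dispatch to pyxlsb as an Excel engine, so the manual row loop above can be replaced with a single call; the path below is a placeholder:

import pandas as pd

# Requires pandas >= 1.0 with pyxlsb installed; sheet_name is 0-indexed
# when given as an integer, unlike pyxlsb's 1-indexed get_sheet().
df = pd.read_excel("myfile.xlsb", sheet_name=1, engine="pyxlsb")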

Great to hear about the plan for pandas support.

There are several levels of using numba: compiled Python functions -> CPU parallelization -> GPU parallelization. I came to know about numba while working with CUDA, and I have only tried it with numbers and matrix manipulation, so I don't know whether it works for other kinds of data. Cython is probably a better fit.
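To make that concrete, here is a minimal sketch (the function name is made up) of where numba's nopython mode helps and where it doesn't: it compiles tight loops over homogeneous numeric arrays, but the mixed text/number cell values a spreadsheet row produces won't type-check:

import numpy as np
from numba import jit

@jit(nopython=True)
def col_sum(values):
    # Compiles to machine code because `values` is a homogeneous float array
    total = 0.0
    for v in values:
        total += v
    return total

print(col_sum(np.random.rand(1_000_000)))  # fast after the first (compiling) call

# A list mixing strings, floats and None, like a row of spreadsheet cells,
# cannot be typed in nopython mode and raises a TypingError:
# col_sum(["a", 1.0, None])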

Closing this as a duplicate of #12