A new parameter to capture skipped rows metadata as a column
cschloer opened this issue · 2 comments
Overview
Hi,
I'd like to propose a parameter to tabulator called "skipped_rows_capture" or something similar. It takes a list of dicts, each containing a regular expression string with one capture group and a string with a column name. The regular expression is then compared against each skipped row in the data.
For example:
skipped_rows_capture = [{ 'regex': '\*\* Latitud (.*)$', 'name': 'latitude' }]
skip_rows = ['**']
Would match the comment/skipped line:
** Latitud 10 29.99
And create a new column:
latitude
10 29.99
10 29.99
...
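To make the intent concrete, here is a minimal standalone sketch of the proposed matching step in plain Python (the parameter name and dict layout are taken from the example above; capture_from_skipped is just an illustrative helper, not an existing tabulator API):

import re

# Proposed configuration, as in the example above
skipped_rows_capture = [{'regex': r'\*\* Latitud (.*)$', 'name': 'latitude'}]

def capture_from_skipped(skipped_row, rules=skipped_rows_capture):
    # Return {column_name: captured_value} for the first rule that matches
    # this skipped row, or an empty dict if no rule matches.
    for rule in rules:
        match = re.search(rule['regex'], skipped_row)
        if match:
            return {rule['name']: match.group(1)}
    return {}

print(capture_from_skipped('** Latitud 10 29.99'))  # {'latitude': '10 29.99'}

Each value captured this way would then be repeated in the new column for every data row, as shown above.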
@cschloer
With tabulator you can extract information like this using post_parse:
- https://github.com/frictionlessdata/tabulator-py#post-parse
For example, given this file saved as tmp/issue331.csv:
id,name
1,english
** Lat 50
1,german
import re
from tabulator import Stream

def capture(store, name, regex):
    # Build a post_parse processor that pulls the captured value out of
    # matching rows (saving it into `store`) and drops those rows.
    pattern = re.compile(regex)
    def processor(extended_rows):
        for row_number, headers, row in extended_rows:
            match = pattern.match(row[0] if row else '')
            if match:
                store[name] = int(match.group(1))
                continue
            yield (row_number, headers, row)
    return processor

store = {}
with Stream('tmp/issue331.csv', post_parse=[capture(store, 'lat', r'^\*\* Lat (.*)')]) as stream:
    print(stream.read())  # [['id', 'name'], ['1', 'english'], ['1', 'german']]
    print(store)  # {'lat': 50}
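If the goal is to also attach the captured value as a column on each row, a second post_parse processor can do that on the user side. This is only a sketch reusing capture and store from the snippet above (add_column is a hypothetical helper, not a tabulator feature); note that the value is only available for rows streamed after the marker line, so in data where the comment precedes the rows (as in the original proposal) the column is filled for every row:

def add_column(store, name):
    # Append the captured value (or None if nothing has been captured yet)
    # to every row; the first yielded row here is the header row.
    def processor(extended_rows):
        first = True
        for row_number, headers, row in extended_rows:
            if first:
                yield (row_number, headers, row + [name])
                first = False
            else:
                yield (row_number, headers, row + [store.get(name)])
    return processor

store = {}
processors = [capture(store, 'lat', r'^\*\* Lat (.*)'), add_column(store, 'lat')]
with Stream('tmp/issue331.csv', post_parse=processors) as stream:
    print(stream.read())  # [['id', 'name', 'lat'], ['1', 'english', None], ['1', 'german', 50]]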
Reshaping, such as adding a column, is out of scope for tabulator itself, but if you're interested in having such an extractor available in DPP, we can think about a dataflows processor / load parameter to achieve the goal. Of course, if you need it in Python, you can just use the snippet above.
Please create a DPP issue if it's still needed, or re-open this one if you still think it's a good addition to tabulator.