Help Wanted: Bad performance using etl.dicts(self.table)
Maksymilian-Plywaczyk opened this issue · 4 comments
What happened?
Hi guys!
I need your help with a performance problem in etl.dicts(). When I wrap that function in list(), performance is very slow. A code example looks like this:
etl.dicts(self.table)
What is the expected behavior?
Can someone give me advice or a hint on how to convert a table to a list of dictionaries in an efficient way? Or maybe suggest another solution for that. In the end I need the converted Table turned back into native Python data structures.
Reproducible test case
No response
In what version of petl have you found the bug?
v1.7.12
Version
Python 3.13+
What OS are you seeing the problem on?
Linux
What OS version are you using?
No response
What package manager did you use to install?
pip
What are the currently installed packages?
No response
Relevant log output
No response
Additional Notes
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
What source/destination are you reading/writing?
Do you have any code snippet for analysis?
@juarezr
Yes, I am writing a service to transform a given dataset. The dataset is a response from an external API. I am building this service using method chaining. At the end, after transforming, I want to get back the same type of dataset that was given. The last method I call is convert_table_to_dict, and that is where I get very bad performance. The dataset has only ~100 rows.
from datetime import datetime
from typing import Any

import petl as etl


class DatasetTransformation:
    def __init__(self, dataset: list[dict[str, Any]]):
        self.dataset = dataset
        self.table = self._transform_dataset_to_table()

    def _transform_dataset_to_table(self) -> etl.Table:
        return etl.fromdicts(self.dataset)

    def drop_fields(self, fields: list[str]) -> "DatasetTransformation":
        self.table = etl.cutout(self.table, *fields)
        return self

    def add_date_field_from_timestamp(
        self, timestamp_field: str, date_field: str
    ) -> "DatasetTransformation":
        def format_date(row) -> str:
            timestamp = row[timestamp_field]
            date = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").strftime(
                "%Y-%m-%d"
            )
            return date

        self.table = etl.addfield(self.table, date_field, format_date)
        return self

    def add_field(
        self, field_name: str, value_function: Any = None
    ) -> "DatasetTransformation":
        self.table = etl.addfield(self.table, field_name, value_function)
        return self

    def convert_table_to_dict(self) -> list[dict[str, Any]]:
        converted_data = list(etl.dicts(self.table))
        return converted_data

    def to_table(self) -> etl.Table:
        return self.table
This call will make your petl object materialize:
converted_data = list(etl.dicts(self.table))
What will happen is that:
- The list() call will traverse the whole result set that self.table would otherwise retrieve lazily.
- All resulting rows/dicts will be allocated in the Python app's memory.
- Each service call will allocate its own list full of rows, so the app's total memory grows with the number of concurrent calls. A minimal sketch of the two access patterns follows this list.
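For illustration, here is a minimal sketch of those two access patterns (the sample data and the print() consumer are placeholders, not code from this issue):

import petl as etl

sample = [{"id": i, "name": f"row-{i}"} for i in range(100)]
table = etl.fromdicts(sample)

# Materializing: list() drains the whole pipeline and keeps every dict in memory.
all_rows = list(etl.dicts(table))

# Lazy iteration: only one dict is alive per loop step.
for row in etl.dicts(table):
    print(row)  # stand-in consumer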
What happens if you change it to something like:

from typing import Iterable

def convert_table_to_dict(self) -> Iterable[dict[str, Any]]:
    data_generator = etl.dicts(self.table)
    return data_generator
If you could use/consume it this way:
- Each service method invocation would allocate just one object/row/dict in petl per transformation used.
- You could write the response through a foreach loop and stream the rows/results as you receive them from the source/database, as in the sketch after this list.
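For example, a minimal consumption sketch (service stands in for an instance of the DatasetTransformation class above; the NDJSON output path is made up):

import json

rows = service.convert_table_to_dict()  # now a generator; nothing is materialized yet
with open("/tmp/report.ndjson", "w") as out:
    for row in rows:  # only one row/dict is alive per iteration
        out.write(json.dumps(row) + "\n")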
I used this trick in a custom Airflow Operator and it worked well.
But maybe I'm missing something...
@juarezr thank you for your answer. Let me give you more context. I have a function that saves data to my filesystem, and it takes as a parameter the data coming from convert_table_to_dict(self). I checked that performance collapses at the moment convert_table_to_dict runs. I also tried returning the Table object from to_table, having save_data take the Table object as a parameter, and wrapping the write in try/except like this (but this also performed very poorly). I have the impression that the last method call in the chain is always where performance degrades, as if it is clogged.

try:
    etl.tocsv(data, "filename")
except Exception:
    raise
The save_data function:

import os

import pandas as pd
from django.conf import settings
from django.utils import timezone

def save_data(self, data: list[dict[str, Any]]) -> None:
    timestamp = timezone.now().strftime("%Y-%m-%d-%H-%M-%S")
    filename = self.FILENAME_TEMPLATE.format(timestamp=timestamp)
    filepath = os.path.join(settings.MEDIA_ROOT, self.REPORTS_PATH, filename)
    try:
        # json_normalize materializes the whole dataset into a DataFrame
        df = pd.json_normalize(data, max_level=0)
        df.to_csv(filepath, index=False)
    except Exception as e:
        raise e
    self.model.objects.create(filename=filename)
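As an alternative, a sketch of save_data accepting the petl Table from to_table() directly: petl's tocsv writes rows as it iterates, so neither the intermediate list of dicts nor the DataFrame gets built. Whether this removes the slowdown depends on where the time is actually spent, e.g. per-row work done lazily inside transformations such as get_field_from_url_field:

import os

import petl as etl
from django.conf import settings
from django.utils import timezone

def save_data(self, table: etl.Table) -> None:
    timestamp = timezone.now().strftime("%Y-%m-%d-%H-%M-%S")
    filename = self.FILENAME_TEMPLATE.format(timestamp=timestamp)
    filepath = os.path.join(settings.MEDIA_ROOT, self.REPORTS_PATH, filename)
    etl.tocsv(table, filepath)  # streams row by row; no list(), no DataFrame
    self.model.objects.create(filename=filename)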
### Where this class and function are used
def transform_data(self, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
    transformation = DatasetTransformation(data, api_client=APIClient())
    transformed_data = (
        transformation.add_date_field_from_timestamp("edited", "date")
        .get_field_from_url_field("films")
        .drop_fields(["created", "edited"])
        .convert_table_to_dict()
    )
    return transformed_data
def get(self, request, *args, **kwargs):
data = self.get_data()
transformed_data = self.transform_data(data)
self.save_data(transformed_data)
return redirect("index")