darshan-hpc/darshan

PyDarshan job summary cleanup

shanedsnyder opened this issue · 4 comments

From an in-person review of the current Darshan job summary reports at Argonne, it would be nice to try to address as much of this as we can before doing a release.

First some high-level things up for discussion:

  • drop blue I/O performance estimate text box, and instead add a per-module table detailing key details of the module (including total files/datasets(H5D)/variables(PNETCDF_VAR) accessed, total bytes read/written, and I/O performance estimate)
    • this way, each module starts off with a high-level summary table that describes key details of I/O, and gets rid of awkward text object we are using now
  • disable DXT heatmap plot generation by default, print warning if DXT is available but HEATMAP isn't, and add CLI option to force DXT usage
    • for some logs, it's probably just too expensive to generate DXT heatmaps by default, so this protects against that
    • this would require us to either push a little on CLI stuff in #809 or use something simple temporarily?
  • after dust settles, potentially reorganize plots/tables in per-module sections?
    • nothing concrete to suggest just yet (Tyler -0.5 on this)

Smaller things:

  • Captions are hard to read (light colored text, also italicized)
    image
  • Add "% runtime" y-axis label to right-side of I/O cost plot
    image
  • Offset autolabeling text at 45 degrees in access size histogram plot and POSIX access pattern plot as is done in I/O op count graph
  • Don't autolabel 0 values to further help with cluttering of plots
  • If not too complicated, can we add more buffer to top of these plots to avoid autolabels overlapping with plot boundaries?
  • We should use a ylim of 0 for access histogram and op count plots to ensure no negative values are plotted.
    image
  • Erroneous "type" row in File Count Summary table
    image
  • drop scientific notation entirely (partially addressed in #812?), and rely on humanize to generate sensible human readable byte strings?
  • File access by category table font really small
    image
  • Pointed out in #907, add a forgotten space to our warning string when a log has no data

There is some duplication with i.e., gh-804 here as well.

I embarrassingly can't figure this one out:

Erroneous "type" row in File Count Summary table

Seems like it ought to be a one-liner to move the "type" index up to a column like other values, but I can't quite seem to pull it off.

Most of the other issues I've already addressed in #907

It wasn't an error--it was an intentional index label. You can remove it, but then you'll need to shim the test suite as well (which is pretty good to do even for aesthetic changes anyway). This is pretty straightforward, but it wasn't a bug (it was intentional) so I'd prefer not using this approach:

--- a/darshan-util/pydarshan/darshan/lib/accum.py
+++ b/darshan-util/pydarshan/darshan/lib/accum.py
@@ -71,7 +71,6 @@ def log_file_count_summary_table(derived_metrics,
                              "read/write files":3}
     df = pd.DataFrame.from_dict(darshan_file_category, orient="index")
     df.rename(columns={0:"index"}, inplace=True)
-    df.index.rename('type', inplace=True)
     df["number of files"] = np.zeros(4, dtype=int)
     df["avg. size"] = np.zeros(4, dtype=str)
     df["max size"] = np.zeros(4, dtype=str)
diff --git a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
index 080264d1..3d23fb76 100644
--- a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
+++ b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
@@ -203,7 +203,6 @@ def test_file_count_summary_table(log_name,
                          "read-only files",
                          "write-only files",
                          "read/write files"]
-    expected_df.index.rename('type', inplace=True)
 
     log_path = get_log_path(log_name)
     with darshan.DarshanReport(log_path, read_all=True) as report:

image

I'd prefer if you didn't mutate the data structures for the sole purpose of removing a row in the report, and instead do something like this (a small test may be warranted) so that end users get what they want from to_html but no need to inconvenience developers who like index labels:

--- a/darshan-util/pydarshan/darshan/lib/accum.py
+++ b/darshan-util/pydarshan/darshan/lib/accum.py
@@ -100,5 +100,6 @@ def log_file_count_summary_table(derived_metrics,
     df.drop(columns="index", inplace=True)
     ret = plot_common_access_table.DarshanReportTable(df,
                                                       col_space=200,
-                                                      justify="center")
+                                                      justify="center",
+                                                      index_names=False)
     return ret

Last bits of this closed via #927.