astrolabsoftware/spark-fits

subset of rows from multiple DoubleType columns is incorrectly loaded as very small/large values

jacobic opened this issue · 12 comments

Hi @JulienPeloton ,

This one is a bit of a mystery to me, but it looks like spark-fits is unable to parse a subset of the values in a single fits file with a header that looks like this: header.bad.txt (when used with default settings). The file itself is not so large and can be downloaded here.
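
For reference, the file is loaded with the spark-fits data source along these lines (a sketch; the file name and the HDU index are assumptions for a standard binary-table extension):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the binary table (HDU 1) with spark-fits default settings;
# the file name here is an assumption.
df = spark.read.format("fits") \
    .option("hdu", 1) \
    .load("hsc_pdr2_dud_wide_spec.fits")
df.printSchema()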

It looks like spark-fits sets a subset of the dataframe elements to very small or very large values. To give a bit of additional context, I believe the original file was made from the concatenation of several files, so I originally thought that there may be varying precision within a single column, which could be causing spark-fits to parse incorrect values. Despite this, spark-fits seems to correctly determine the schema.

Another thing to note is that all of the offending columns that I happen to be interested in are doubles ('specz_ra', 'specz_dec', 'specz_redshift', 'specz_redshift_err'):

Screenshot 2019-06-04 at 21 00 43

These problems can easily be highlighted with a show:

df.select('specz_ra', 'specz_dec', 'specz_redshift', 'specz_redshift_err').show()

I can confirm that these values are correct.

Screenshot 2019-06-04 at 20 29 31

and with summary followed by show:

df.select('specz_ra', 'specz_dec', 'specz_redshift', 'specz_redshift_err').summary().show()

As you can see, there are some very large/small values which make the summary statistics quite crazy:

Screenshot 2019-06-04 at 20 30 03

In Topcat the fits file is parsed just fine with no extra large or extra small values:

Screenshot 2019-06-04 at 20 07 05

Screenshot 2019-06-04 at 20 07 19

Screenshot 2019-06-04 at 20 07 33

Screenshot 2019-06-04 at 20 07 44

Originally the fits file has 2.3 million rows; however, after loading with spark-fits and then removing the bad (very large/small) values, i.e. keeping only rows with (0.0 <= ra <= 360.0), (-90.0 <= dec <= 90.0), and (0 < z, z_err < 5), the dataframe contains only a small fraction of the original rows (see the sketch below).
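
A sketch of that cleaning step in PySpark, using the column names from the select above (the exact cuts are only the ones described here, not a definitive recipe):

from pyspark.sql.functions import col

# Keep only physically sensible values; everything else is treated as "bad".
df_clean = df.filter(
    (col("specz_ra") >= 0.0) & (col("specz_ra") <= 360.0) &
    (col("specz_dec") >= -90.0) & (col("specz_dec") <= 90.0) &
    (col("specz_redshift") > 0.0) & (col("specz_redshift") < 5.0) &
    (col("specz_redshift_err") > 0.0) & (col("specz_redshift_err") < 5.0)
)
print(df_clean.count())  # only a small fraction of the original 2.3 million rows survive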

The only way I could resolve the problem was to re-download the exact same file as a .csv to avoid the parsing issue in spark-fits. In Topcat I confirmed that the .csv and .fits files contain exactly the same data with no unusual values, and that Topcat can parse both with no problem. I just wanted to let you know in case there was an opportunity to improve spark-fits.

Here is the .csv summary which shows the values that I would expect:

Screenshot 2019-06-04 at 21 27 51

Please let me know if you have any idea what could be causing the problem.

Cheers,
Jacob

Hi @jacobic, thanks for reporting this weird feature indeed! Could you fix the link to the FITS file? I cannot access it currently ("We're sorry, but something went wrong." error) :-(

Also, from the header file you sent, specz_redshift and specz_redshift_err seem to be float and not double (E type).

Hi @JulienPeloton,

Thanks for looking into this :)

Sorry about the missing file, here are the fits and csv files (in a more permanent storage location). Oops, my mistake, well spotted: the redshift columns are indeed of float type.

Cheers,
Jacob

Hi @JulienPeloton,

I have found many more files that are affected by this issue (all of which happen to be downloaded from the same SQL server as the file I previously uploaded).

Please let me know if you need any more minimal examples with e.g. just 4 columns. I cannot work out a pattern to the rows that are affected, as the issue seems to occur in random chunks throughout the resulting dataframe.

Cheers,
Jacob

Hi @JulienPeloton,

This might be useful information to locate the cause of the issue. I just converted all of the fits files to csv using stilts (see here for details) with the following command:

for f in $(ls); do java -jar stilts.jar tcopy ifmt=fits ofmt=csv in=$f out=$f.csv; done

Using the standard Spark CSV dataframe reader with a glob pattern that matches these converted files loads the files as expected, i.e. there are no unusual values present in the resulting dataframe.
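
For example, something along these lines (a sketch; the directory and glob pattern are assumptions):

# Read all stilts-converted files at once with the standard Spark CSV reader
# (`spark` is the active SparkSession).
df_csv = spark.read.csv("converted/*.fits.csv", header=True, inferSchema=True)

# The summary statistics look sensible here, unlike the spark-fits ones.
df_csv.select('specz_ra', 'specz_dec', 'specz_redshift', 'specz_redshift_err') \
      .summary().show()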

Therefore the key to this issue may be to work out the difference between the stilts / topcat FITS parser and spark-fits for these tricky files.
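
One possible way to pin down where the two parsers disagree (a sketch, assuming object_id is a unique key and that df_fits / df_csv hold the spark-fits and CSV readings of the same file):

from pyspark.sql.functions import col

# Join the two readings on object_id and count rows where the parsed values differ.
a = df_fits.select("object_id", col("specz_ra").alias("ra_fits"))
b = df_csv.select("object_id", col("specz_ra").alias("ra_csv"))
diff = a.join(b, "object_id").filter(col("ra_fits") != col("ra_csv"))
print(diff.count())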

The relevant code can be found in the uk.ac.starlink.fits repository, since stilts / topcat both import uk.ac.starlink.fits.

Cheers,
Jacob

Hi @jacobic, thanks for the detailed inspection. I suspect this comes from a different definition of NaN or null values.

Crazy non-sense values do not correspond to flags

I checked that the non-sense values are not correlated with the _flag values:

df.select(["specz_ra_isnull", "specz_ra"])\
  .filter(df["specz_ra_isnull"] == False)\
  .describe().show()
# +-------+--------------------+
# |summary|            specz_ra|
# +-------+--------------------+
# |  count|             2291160|
# |   mean|2.644092601122523E32|
# | stddev|3.175778285851278...|
# |    min|             33.5782|
# |    max|6.536526700553566E32|
# +-------+--------------------+

Most of the values are non-sense

If I just filter out the non-sense values, the distributions look quite good:

df.filter(df["specz_ra"] < 1e10).describe().show()
# +-------+--------------------+-------------------+---------+--------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+
# |summary|           object_id|              d_pos|    d_mag|          specz_name|          specz_ra|          specz_dec|    specz_redshift|specz_redshift_err|specz_original_flag|       specz_mag_i|
# +-------+--------------------+-------------------+---------+--------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+
# |  count|              162829|             162829|   162829|              162829|            162829|             162829|            162829|            162829|             162829|            162829|
# |   mean|4.142068022597776...| 0.2859233633074504|      NaN|                null|138.46369882832755|0.19190868892516239|0.5191814587443169|               NaN|  2.639890110678344|20.835611242214505|
# | stddev|8.357324012297646E15|0.17357401946699097|      NaN|                null|139.96104005162462|  13.15167630360855|0.6246790693905229|               NaN| 13.914875342474708| 4.068454938287414|
# |    min|   36424655564702203|       0.0010935713|-Infinity|  DEEP2DR4uniq_27043|           33.5782|          -6.101336|              -1.0|               0.0|                 -1|             -99.0|
# |    max|   76557903720366950|          1.0027766|      NaN|SDSS-DR12-1237679...|         353.99005|          56.899819|           7.01196|               NaN|                  8|            25.784|
# +-------+--------------------+-------------------+---------+--------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+

but the count is very small, as if most of the file were made of these E32 values.

d_mag count is suspect!

I also compared side-by-side the DataFrames obtained with the FITS and CSV Spark readers:

# CSV
df_csv.describe().show()
# +-------+--------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+------------------+
# |summary|         # object_id|              d_pos|             d_mag|        specz_name|          specz_ra|         specz_dec|    specz_redshift| specz_redshift_err|specz_original_flag|       specz_mag_i|
# +-------+--------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+------------------+
# |  count|             2299624|            2299624|           1923478|           2299624|           2299624|           2299624|           2299624|            2299624|            2299624|           2299624|
# |   mean|4.325592681794838...|0.20328248783487124|-1.147986607593022|              null|157.35652208546713|2.6757737337724925|0.6220282019904788|0.47723445930272873| 2.8340504323957076|18.701468728515767|
# | stddev|8.044117100715417E15|0.19084390071088472| 61.84223908366595|              null|119.00443140006479|12.329135967245323|1.1161052966416039|  22.31814510522508|  23.44710362080591| 93.63632404604363|
# |    min|   36407037608854089|      5.95199119E-4|   -0.000101089478| 3DHST_AEGIS_10002|           6.86E-4|         -7.282074|       -9.99999046|                0.0|                 -1|             -99.0|
# |    max|   76557903720366950|         1.00481975|               nan|ZCOSMOS-DR3-950074|         359.99936|         56.899819|        9.99989986|                nan|                B,1|               nan|
# +-------+--------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+------------------+

# FITS
df_fits.describe().show()
# +-------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+------------------+
# |summary|           object_id|               d_pos|    d_mag|          specz_name|            specz_ra|           specz_dec|      specz_redshift|specz_redshift_err| specz_original_flag|       specz_mag_i|
# +-------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+------------------+
# |  count|             2299624|             2299624|  2299624|             2299624|             2299624|             2299624|             2299624|           2299624|             2299624|           2299624|
# |   mean|4.706030777296113...|1.316641291227980...|      NaN|                null|2.644702000237748E32|4.255107080655406E31|1.543973276428019...|               NaN|   2.639890110678344|11946.621215034598|
# | stddev|1.287658037011944...|2.117191692113347...|      NaN|                null|3.175889256852315...|1.404418850026732...|2.119624192883826E37|               NaN|  13.914875342474708| 3309.132510468521|
# |    min|   36424655564702203|       -2.6364035E38|-Infinity|-100002...|             33.5782|           -6.101336|       -2.6346233E38|               0.0|,1...|             -99.0|
# |    max| 5061905922000338833|        2.6364035E38|      NaN|uniq_9999...|6.536526700553566E32|6.946725054123978E32|        2.6348404E38|               NaN|            FTF��F|         13585.582|
# +-------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+------------------+

Despite the obvious non-sense values, notice that the count is different in only one place: d_mag. The DataFrame from the CSV seems to have fewer entries in this column than the FITS one, as if some values were silently discarded.
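
A quick way to check this (a sketch): the count difference is 2299624 - 1923478 = 376146, so one can ask whether the CSV DataFrame simply has that many null entries in d_mag, which describe() would skip:

from pyspark.sql.functions import col

# describe() only counts non-null entries, so nulls would explain a lower count.
print(df_csv.filter(col("d_mag").isNull()).count())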

d_mag schema is suspect!

If I look closely at the d_mag type for each DataFrame, they are different:

df_csv.select("d_mag").printSchema()
# root
#  |-- d_mag: string (nullable = true)

df_fits.select("d_mag").printSchema()
# root
#  |-- d_mag: float (nullable = true)

Note that the same applies for specz_redshift_err and specz_mag_i. Note also that the Spark CSV reader assigns StringType to any column containing NaN (NullType).
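
If the CSV route is used as a workaround, those string-typed columns can be cast back to float explicitly (a sketch; the column names are taken from the comparison above):

from pyspark.sql.functions import col

# Cast the NaN-containing columns back to FloatType after CSV schema inference.
for c in ["d_mag", "specz_redshift_err", "specz_mag_i"]:
    df_csv = df_csv.withColumn(c, col(c).cast("float"))

df_csv.select("d_mag", "specz_redshift_err", "specz_mag_i").printSchema()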

The error could come from misinterpreted NaN values.

Looking at the file with astropy for instance reveals that only d_mag, specz_mag_i and specz_redshift_err contain NaN values:

from astropy.io import fits
import numpy as np
fits_data = fits.open("hsc_pdr2_dud_wide_spec.fits")[1].data

for name in fits_data.names:
  try: print(name, np.mean(fits_data[name]))
  except TypeError: pass
object_id 4.325592681796465e+16
object_id_isnull 0.0
d_pos 0.20328242
d_pos_isnull 0.0
d_mag nan
d_mag_isnull 0.1635684790209182
specz_name_isnull 0.0
specz_ra 157.3565220854671
specz_ra_isnull 0.0
specz_dec 2.675773733772397
specz_dec_isnull 0.0
specz_redshift 0.6220284
specz_redshift_isnull 0.0
specz_redshift_err nan
specz_redshift_err_isnull 0.0
specz_original_flag_isnull 0.0
specz_flag_homogeneous 0.7922008119588245
specz_flag_homogeneous_isnull 0.0
specz_mag_i nan
specz_mag_i_isnull 0.0
specz_flag_sdss_dr14 0.5167009911185481
specz_flag_sdss_dr14_isnull 0.0
specz_flag_deep3_v201701 0.05422843038688064
specz_flag_deep3_v201701_isnull 0.0
specz_flag_primus_dr1 0.21585528764702402
specz_flag_primus_dr1_isnull 0.0
specz_flag_vipers_pdr1 0.07486006408004091
specz_flag_vipers_pdr1_isnull 0.0
specz_flag_vvds 0.051264467582526535
specz_flag_vvds_isnull 0.0
specz_flag_gama_dr2 0.07452914041599844
specz_flag_gama_dr2_isnull 0.0
specz_flag_wigglez_dr1 0.0286285931961051
specz_flag_wigglez_dr1_isnull 0.0
specz_flag_zcosmos_dr3 0.029318271160850642
specz_flag_zcosmos_dr3_isnull 0.0
specz_flag_udsz_mar2014 0.002188183807439825
specz_flag_udsz_mar2014_isnull 0.0
specz_flag_3dhst_v4_1_5 0.010012071538651536
specz_flag_3dhst_v4_1_5_isnull 0.0
specz_flag_fmos_cosmos_v1_0 0.0005631355386793667
specz_flag_fmos_cosmos_v1_0_isnull 0.0

FITS recommends the use of IEEE NaN to represent null values (and topcat must use it), but spark-fits is probably not handling those correctly. So the difference might come from this. I will give it a try later in the week and keep you posted.
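
A quick sanity check along these lines could tell how many IEEE NaN values spark-fits actually returns in those three columns (a sketch using Spark's isnan):

from pyspark.sql.functions import count, isnan, when

# Count NaN entries per column in the spark-fits DataFrame.
df_fits.select([count(when(isnan(c), c)).alias(c)
                for c in ["d_mag", "specz_mag_i", "specz_redshift_err"]]).show()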

After digging a bit more, the problem is not coming from NaN misinterpretation. For example, if I set all columns but one (a column without NaN) to zero, the resulting column still gives wrong values.

Looking closely at the non-sense values, they seem to be grouped by location within the file:

from pyspark.sql.functions import monotonically_increasing_id

# Add ID
df = df.withColumn("id", monotonically_increasing_id())

# Keep only good values
df.filter("specz_ra < 1e10").select("id").describe().show()
# +-------+-----------------+
# |summary|               id|
# +-------+-----------------+
# |  count|           162829|
# |   mean|          81414.0|
# | stddev|47004.82782828448|
# |    min|                0|
# |    max|           162828|
# +-------+-----------------+

Only the first 162829 values are good (min to max), and then non-sense values appear until the end of the file.
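
The complementary check (a sketch) confirms the picture: the ids of the non-sense rows form a contiguous block starting right after the first chunk.

# Same id column as above, but keeping only the non-sense values.
df.filter("specz_ra >= 1e10").select("id").describe().show()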

Actually, the 1e32 values remind me of the values I get when I mess up the pointer value while reading a FITS file. This sudden wrong pointer could be explained if the file you sent me is the concatenation of several files whose data type changed between two files (float to double, or int to long). Is this the case? How was this FITS file created?

Hi Julien,

Thanks for looking into this. The catalog originates from the HSC PDR2 PostgreSQL server. The example file is the result of a select * from pdr2_wide.specz; the schema information can be found here, and screenshots of the query plan and schema are shown below:

Screenshot 2019-06-11 at 14 30 21

Screenshot 2019-06-11 at 14 11 35

The user is able to specify the format of the data they want (csv, csv.gz, fits, etc.), and when the query is completed the user can download the data.

Screenshot 2019-06-11 at 14 32 06

The information in this table is definitely made up of multiple different surveys/files/data sources, but I doubt an explicit concatenation of fits files ever occurs, as the table already exists in the database (although I suspect they might have merged files in order to create the table in the first place). As the data originates from a SQL server, I would expect the data types to be cast to the correct types, and the schema of the fits file is probably directly inferred from the schema of the table.

In my other files (of which I have many), I query another table and the same problem occurs, although the origin of the table is different. What all these columns have in common is that they are all matched with photometric data which comes from the HSC image processing pipeline. This data originally comes in the form of fits files (which could theoretically have slightly different schemas regarding their column types) before it is ingested into the SQL database. Despite this, if all this data is able to live in a SQL server, doesn't that mean the schema should be consistent, and therefore that the original origin of the files shouldn't make a difference?

Please let me know what you think.

Cheers,
Jacob

Oh, and just for completeness (I don't think it is relevant): I think I also joined the pdr2_wide.specz table with the pdr2_dud.specz table in the query for the original data files that I sent you.

Cheers,
Jacob

Thanks @jacobic, it helps a lot to understand the issue.

Actually, I dug deep inside spark-fits. When the file is read on my computer, it is basically split into chunks of 33,554,432 bytes. Guess what? 33,554,432 is very close to 162829 [rows] * 206 [bytes/row] (see the quick check after the snippet below)... So it seems there is a mismatch in the starting index when reading a new chunk of this file. Looking deeper, I found the bug... It was introduced by an ugly fix for Image HDU:

// Here is an attempt to fix a bug when reading images:
//
// I noticed that when an image is split across several HDFS blocks,
// the transition is not done correctly, and there is one line typically
// missing. After some manual inspection, I found that introducing a shift
// in the starting index helps removing the bug. The shift is function of
// the HDU index, and depends whether the primary HDU is empty or not.
// By far I'm not convinced about this fix in general, but it works for
// the few examples that I tried. If you face a similar problem,
// or find a general solution, let me know!
var shift = if (primaryfits.empty_hdu) {
  FITSBLOCK_SIZE_BYTES * (conf.get("hdu").toInt - 3)
} else {
  FITSBLOCK_SIZE_BYTES * (conf.get("hdu").toInt - 1)
}
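
Here is the quick arithmetic check referred to above:

# How many 206-byte rows fit in one 33,554,432-byte chunk?
chunk_size = 33554432  # bytes per chunk on my machine
row_size = 206         # bytes per row for this table
print(chunk_size // row_size)  # 162885, very close to the 162829 good rows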

I'm still not 100% sure this is the final word, but fixing it seems to give the correct answer:

# +-------+--------------------+-------------------+---------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+
# |summary|           object_id|              d_pos|    d_mag|        specz_name|          specz_ra|         specz_dec|    specz_redshift|specz_redshift_err|specz_original_flag|specz_mag_i|
# +-------+--------------------+-------------------+---------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+
# |  count|             2299625|            2299625|  2299625|           2299625|           2299625|           2299625|           2299625|           2299625|            2299625|    2299625|
# |   mean|4.325592616529654...|0.20328246733677513|      NaN|              null|157.35647009225943|2.6757728602627218|0.6220279768352934|               NaN| 2.8340504323962143|        NaN|
# | stddev|8.044115412593749E15|0.19084386174782547|      NaN|              null|119.00443164434095|12.329133357718852|1.1161051065508547|               NaN| 23.447103620806015|        NaN|
# |    min|   36407037608854089|        5.951991E-4|-Infinity| 3DHST_AEGIS_10002|           6.86E-4|         -7.282074|          -9.99999|               0.0|                 -1|  -9999.166|
# |    max|   76557903720366950|          1.0048198|      NaN|ZCOSMOS-DR3-950074|         359.99936|         56.899819|            9.9999|               NaN|                B,1|        NaN|
# +-------+--------------------+-------------------+---------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------+

I will make a fix tomorrow.

Great stuff, thanks so much @JulienPeloton!
I'm very happy to test things out as usual.
Cheers,
Jacob

Fixed in #86