pandas-dev/pandas

to_stata: Fixed width strings in Stata .dta files are limited to 244 (or fewer)

Closed this issue ยท 14 comments

Code Sample, a copy-pastable example if possible

import pandas
frame = pandas.DataFrame({'A':['h'*250,'hi','hola']})
frame.to_excel("text.xlsx", index=False)
frame.to_stata("test.dta")

Problem description

Raises the following error:

ValueError:
Fixed width strings in Stata .dta files are limited to 244 (or fewer)
characters. Column 'A' does not satisfy this restriction.

However this restriction seems to not exists in STATA, as the Excel file can be imported correctly
Open STATA, import the Excel file

import excel "C:\data\tesi\software\text.xlsx", sheet("Sheet1") firstrow clear

Now we can get the type of data in column 'A', and as you can see, it's str250. So STATA can store string longer than 244 characters

. describe A

A str250 %250s A

Expected Output

File gets exported with the correct format and without problems

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.5.5
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.5.3
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1

cc @bashtage does this sound right to you?

@raffamaiden I'm not very familar with stata's file format. Is it possible that the file-format is limited to 244 characters, while Stata itself can handle datasets with names > 244 characters in memory?

That is correct. There is no support for the writing the variable length strings that was added in recent Stata versions, and so writing stata is limited to 244 ascii-like (latin-1, 8 byte fixed) characters. Longer strings were added in Stata 13 and there were further changes to written strings in Stata 14 (unicode only).

Reading longer strings works fine thanks to @kshedden .

Little reason to support these IMO since reading is more important than writing.

Thanks.

@raffamaiden do you have a need for this? If you're willing to implement the code to support writing longer strings, I'm sure it'd be accepted, but as @bashtage says it'll be pretty low priority on our end.

Oh, yeah didn't think about that.

I'm going to close this for now then. We can revisit it in the future once compat with older versions of Stata is less important.

Well I can save the file in Excel format and have STATA convert it to dta by running a STATA script in batch mode

I still think it would be a nice feature to have

You may add an argument to the to_stata() function asking the file format version you want to save into.

Maybe have the lowest possible file format version set by default, and change the error message to something like "This file version doesn't support strings longer than 244 characters. Save in a more recent file format version, thus breaking compatibility with older versions of STATA"

As to implement it by myself, i'm quite new with pandas and have no idea of the internal file structure of stata file.

Maybe I can give it a try in the summer if I can ask support to pandas developers

Ok. Most of that would start here

Feel free to post here if / when you want to take it up.

Giving this item a bump. Stata 15 has been out a while now.
Also since Stata 13 there has be a supported "strL" data type.
This data type can accommodate 2,000,000,000 characters of text or binary data.

More information here : https://www.youtube.com/watch?time_continue=89&v=y6KZvm1oXAk

Stata data type specs here: https://www.stata.com/help.cgi?data+types

I'm a fan of the ideas from @raffam

Would be a helpful feature for those that use both pandas and Stata.
Count me in as a collaborator when it is time to make this improvement.

Thanks for bumping. Let's re-open it.

Are there numbers on which versions of Stata are actually used? Should we care at all about anything older than Stata 15?

You may add an argument to the to_stata() function asking the file format version you want to save into.

This also seems sensible. If it isn't too much additional effort to implement and maintain, then that's best. Otherwise, it's best to just make a clean break.

My thought is that we should look at being compatible with Stata 14 and onward. Many still use 14. I believe there are a handful of folks using 13.

Though it might not be necessary to forgo reverse compatibility. Without having given this a deep dive I would suspect that since the "strL" data type with 2billion characters has been around since Stata 13, it might be a possible to adjust the crosswalk at about line 1781 here without losing Stata 13 reverse compatibility.

Also, perhaps the solution would be to modify def _cast_to_stata_types(data): line 491 here so that it will explicitly cast string data types?

Reading support for these is already in. As the saying goes, PRs are welcome for writing.

The solution is more complicated than a cast. strL are stored in a different place in a lookup table. This is efficient but makes encoding them more complex than just changing a column type. Essentially one has to build up a strL dictionary and then write it at the end of the dta file in a particular (and idiosyncratic) format.

Writing Stata is probably lower priority than reading Stata since CSV is an easy method to move data to Stata. Reading is useful since there are many published dta files on the web, and without a reader these are inaccessible.

I took a further look at this and encoding strl's isn't that hard. The harder part of implementing this is that the current version supported in 114/115. The minimum for strl is 117.

117 makes a lot of changes the file format. The biggest change is the addition of tags surrounding regions. For example the header in 117 looks like

         <header>
                <release>117</release>
                <byteorder>MSF</byteorder>
                <K>0002</K>
                <N>00000001</N>
                <label>00</label>
                <timestamp>1110 Jul 2013 14:23</timestamp>
            </header>

while the header in 114 is a 109 byte blob described as:

        Contents            Length    Format    Comments
        -----------------------------------------------------------------------
        ds_format                1    byte      contains 114 = 0x72
        byteorder                1    byte      0x01 -> HILO, 0x02 -> LOHI
        filetype                 1    byte      0x01
        unused                   1    byte      0x00
        nvar (number of vars)    2    int       encoded per byteorder
        nobs (number of obs)     4    int       encoded per byteorder
        data_label              81    char      dataset label, \0 terminated
        time_stamp              18    char      date/time saved, \0 terminated
        -----------------------------------------------------------------------
        Total                  109

Adding StrL's requires a substantial rewrite of StataWriter, especially

_write_header
_write_descriptors
_write_variable_labels
_write_value_labels

See:

https://www.stata.com/help.cgi?dta_114
https://www.stata.com/help.cgi?dta_117

A mostly working implementation is here:

master...bashtage:strl-support

I noticed one larger potential advantage of this -- when writing largish data files with strings > 8 characters StrLs can reduce file size significantly if there are many repeated values. They can also reduce files size when writing sparse strings again as long as the maximum string length is > 8 characters (this happens since blank strings are replaced with an 8 bute uinteger).