python – Weird behaviour initializing a numpy array of string data


NumPy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this by checking my_array.dtype; it will show |S1, meaning a one-character string. Subsequent assignments into the array are truncated to fit this length.
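
A minimal sketch of that truncation (on current Python 3 builds of NumPy the default string dtype reports as <U1 rather than |S1, but the length limit of 1 is the same):

import numpy as np

a = np.empty([1, 2], dtype=str)
print(a.dtype)      # a length-1 string dtype

a[0, 0] = "hello"
print(a[0, 0])      # prints 'h' -- the assignment was truncated to one character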

You can pass an explicit datatype with your maximum length by doing, e.g.:

my_array = numpy.empty([1, 2], dtype='S10')

The 'S10' will create an array of length-10 strings. You have to decide in advance what length will be enough to hold all the data you want to store.

I got a codec error when I tried to use a non-ASCII character with dtype='S10'.

You also get an array with binary strings, which confused me.
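
A short sketch of both points, assuming a recent NumPy on Python 3 (the exact error message may vary):

import numpy as np

b = np.empty([1, 2], dtype='S10')
b[0, 0] = "hello"
print(b[0, 0])      # b'hello' -- stored as a byte string, not str

try:
    b[0, 1] = "héllo"              # non-ASCII character
except UnicodeEncodeError as exc:
    print(exc)                     # the codec error mentioned above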

I think it is better to use:

my_array = numpy.empty([1, 2], dtype='<U10')

Here '<U10' translates to a little-endian Unicode string of length 10.
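
For comparison, a sketch with the Unicode dtype: non-ASCII text is stored as ordinary Python strings, with no codec error and no b'' prefix.

import numpy as np

c = np.empty([1, 2], dtype='<U10')
c[0, 0] = "héllo"
print(c[0, 0])      # 'héllo'
print(c.dtype)      # <U10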


The NumPy string array is limited by its fixed length (length 1 by default). If you're unsure what length you'll need for your strings in advance, you can use dtype=object and get arbitrary-length strings for your data elements:

my_array = numpy.empty([1, 2], dtype=object)
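
For example (a quick sketch; the strings here are just placeholders):

import numpy as np

d = np.empty([1, 2], dtype=object)
d[0, 0] = "short"
d[0, 1] = "a much longer string than ten characters"
print(d)            # both values are kept in full, no truncation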

I understand there may be efficiency drawbacks to this approach, but I don't have a good reference to support that.
