python – Create numpy array with random elements from list

There are a couple of ways of doing this, each with its pros and cons. The following four were just
off the top of my head …

  • Python's own random.sample is simple and built in, though it may not be the fastest…
  • numpy.random.permutation is again simple, but it creates a copy which we then have to slice, ouch!
  • numpy.random.shuffle is faster since it shuffles in place, but we still have to slice.
  • numpy.random.sample is the fastest, but it only works on the interval 0 to 1, so we have
    to scale it by the population size and convert it to ints to get the random indices, and at the end we
    still have to slice. Note that scaling to the size we want does not necessarily generate a uniform random distribution.
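As a rough illustration of the four options listed above, here is a minimal sketch. It is written for Python 3 and a current NumPy rather than the Python 2 setup used in the benchmark, and the variable names (via_sample, via_permutation and so on) are only illustrative.

```python
import random
import numpy as np

values = list(range(50))
number_of_members = 20

# 1. random.sample: draws without replacement and returns a new list.
via_sample = random.sample(values, number_of_members)

# 2. numpy.random.permutation: returns a shuffled copy, which we then slice.
via_permutation = np.random.permutation(values)[:number_of_members]

# 3. numpy.random.shuffle: shuffles in place, so work on a copy to keep `values` intact.
shuffled = np.array(values)
np.random.shuffle(shuffled)
via_shuffle = shuffled[:number_of_members]

# 4. numpy.random.sample: floats in [0, 1) scaled up and truncated to integer indices.
indices = (np.random.sample(number_of_members) * len(values)).astype(int)
via_indices = np.asarray(values)[indices]

print(via_sample, via_permutation, via_shuffle, via_indices, sep="\n")
```

The first three give number_of_members distinct values; the fourth picks each index independently.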

Here are some benchmarks.

import timeit
from matplotlib import pyplot as plt

setup = """
import numpy
import random

number_of_members = 20
values = range(50)
"""

number_of_repetitions = 20
array_sizes = (10, 200)

python_random_times = [timeit.timeit(stmt = "[random.sample(values, number_of_members) for index in xrange({0})]".format(array_size),
                                     setup = setup,
                                     number = number_of_repetitions)
                                        for array_size in xrange(*array_sizes)]

numpy_permutation_times = [timeit.timeit(stmt = "[numpy.random.permutation(values)[:number_of_members] for index in xrange({0})]".format(array_size),
                               setup = setup,
                               number = number_of_repetitions)
                                    for array_size in xrange(*array_sizes)]

numpy_shuffle_times = [timeit.timeit(stmt =
                                """
                                random_arrays = []
                                for index in xrange({0}):
                                    numpy.random.shuffle(values)
                                    random_arrays.append(values[:number_of_members])
                                """.format(array_size),
                                setup = setup,
                                number = number_of_repetitions)
                                     for array_size in xrange(*array_sizes)]

numpy_sample_times = [timeit.timeit(stmt =
                                    """
                                    values = numpy.asarray(values)
                                    random_arrays = [values[indices][:number_of_members]
                                                for indices in (numpy.random.sample(({0}, len(values))) * len(values)).astype(int)]
                                    """.format(array_size),
                                    setup = setup,
                                    number = number_of_repetitions)
                                         for array_size in xrange(*array_sizes)]

line_0 = plt.plot(xrange(*array_sizes),
                  python_random_times,
                  color = 'black',
                  label = 'random.sample')

line_1 = plt.plot(xrange(*array_sizes),
                  numpy_permutation_times,
                  color = 'red',
                  label = 'numpy.random.permutation')

line_2 = plt.plot(xrange(*array_sizes),
                  numpy_shuffle_times,
                  color = 'yellow',
                  label = 'numpy.random.shuffle')

line_3 = plt.plot(xrange(*array_sizes),
                  numpy_sample_times,
                  color = 'green',
                  label = 'numpy.random.sample')

plt.xlabel('Number of Arrays')
plt.ylabel('Time in (s) for %i rep' % number_of_repetitions)
plt.title('Different ways to sample.')
plt.legend()

plt.show()

And the result:

[Figure: time in seconds versus number of arrays for the four sampling methods]

So it looks like numpy.random.permutation is the worst, which is not surprising. Python's own random.sample is holding its own, and it is a close race between numpy.random.shuffle and numpy.random.sample, with numpy.random.sample edging ahead, so either should suffice. Even though numpy.random.sample has a higher memory footprint, I still prefer it, since I really don't need to build the arrays, I just need the random indices …

$ uname -a
Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386

$ python --version
Python 2.6.1

$ python -c "import numpy; print numpy.__version__"
1.6.1

UPDATE

Unfortunately numpy.random.sample doesn't draw unique elements from a population, so you'll get repetition. Just stick with shuffle, which is just as fast.
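To make the repetition concrete, here is a small sketch that counts how many sampled rows contain a repeated value under each approach. It assumes Python 3 and a current NumPy. The helper rows_with_repeats and the row count of 1000 are made up for this illustration, and the seed is fixed only so the run is repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is repeatable
population = np.arange(50)
number_of_members = 20
n_rows = 1000

def rows_with_repeats(rows):
    """Count rows that contain at least one repeated value."""
    return sum(len(set(row)) < len(row) for row in rows.tolist())

# Scaled floats used as indices: every index is drawn independently,
# so the same element can be picked more than once within a row.
indices = (np.random.sample((n_rows, number_of_members)) * len(population)).astype(int)
by_index = population[indices]

# Shuffle and slice: each row is the head of a permutation, so it has no repeats.
by_shuffle = np.empty((n_rows, number_of_members), dtype=population.dtype)
for row in by_shuffle:
    scratch = population.copy()
    np.random.shuffle(scratch)
    row[:] = scratch[:number_of_members]

print(rows_with_repeats(by_index), rows_with_repeats(by_shuffle))
```

With 20 draws from 50 values, nearly every index-based row should contain a repeat, while the shuffled rows never do.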

UPDATE 2

If you want to remain within numpy to leverage some of its built-in functionality, just convert the values into numpy arrays.

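import numpy as np

values = ['cat', 'popcorn', 'mescaline']
number_of_members = 2
N = 1000000

# One row per sample; every row starts out as a copy of values.
random_arrays = np.asarray([values] * N)

# Each row is a view into random_arrays, so shuffling it in place reorders that row.
for array in random_arrays:
    np.random.shuffle(array)

# Keep the first number_of_members columns of every shuffled row.
subset = random_arrays[:, :number_of_members]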

Note that N here is quite large, so you are going to get repeated permutations. By repeated permutations I mean the same ordering of values occurring more than once, not repeated values within a permutation. Fundamentally there is a finite number of permutations of any given finite set: if you take the whole set it is n!, and if you select only k elements it is n!/(n - k)!. Even if this weren't the case, meaning our set was much larger, we might still get repetitions depending on the random function's implementation, since shuffle, permutation and so on only work with the current set and have no idea of the population. This may or may not be acceptable, depending on what you are trying to achieve. If you want a set of unique permutations, then you are going to have to generate that set and subsample it.
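For a small population, the "generate the set and subsample it" idea can be sketched with itertools. This is only an illustration for Python 3: the names all_arrangements and unique_subset and the sample size of 4 are made up here, and enumerating every arrangement is feasible only while n!/(n - k)! stays small.

```python
import itertools
import math
import random

values = ["cat", "popcorn", "mescaline"]
k = 2
n = len(values)

# Every ordered selection of k distinct elements: n! / (n - k)! of them.
all_arrangements = list(itertools.permutations(values, k))
expected_count = math.factorial(n) // math.factorial(n - k)

# random.sample draws without replacement, so no arrangement appears twice.
unique_subset = random.sample(all_arrangements, 4)

print(len(all_arrangements), expected_count)
print(unique_subset)
```

Asking random.sample for more arrangements than exist raises ValueError, which is a handy reminder of the upper bound.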

Here's a way to do it using numpy's np.random.randint:

In [68]: l = np.array(['cat', 'mescaline', 'popcorn'])

In [69]: l[np.random.randint(len(l), size=(3,2))]
Out[69]: 
array([['cat', 'popcorn'],
       ['popcorn', 'popcorn'],
       ['mescaline', 'cat']], 
      dtype='|S9')

EDIT: after the additional details that each element should appear at most once in each row:

This is not very space efficient. Do you need something better?

In [29]: l = np.array(['cat', 'mescaline', 'popcorn'])

In [30]: np.array([np.random.choice(l, 3, replace=False) for i in xrange(5)])
Out[30]: 
array([['mescaline', 'popcorn', 'cat'],
       ['mescaline', 'popcorn', 'cat'],
       ['popcorn', 'mescaline', 'cat'],
       ['mescaline', 'cat', 'popcorn'],
       ['mescaline', 'cat', 'popcorn']], 
      dtype='|S9')
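One possible alternative that avoids the Python-level loop is to argsort a matrix of random keys, which gives an independent permutation of column indices for every row. This is a swapped-in technique, not something from the answer above. It assumes Python 3 and NumPy 1.17 or later for np.random.default_rng, and n_rows and n_members are illustrative names. It still allocates a temporary n_rows by len(l) float array, so it trades memory for speed.

```python
import numpy as np

l = np.array(["cat", "mescaline", "popcorn"])
n_rows, n_members = 5, 2

rng = np.random.default_rng()

# Argsorting each row of random keys yields one random permutation
# of the column indices 0 .. len(l) - 1 per row.
order = np.argsort(rng.random((n_rows, len(l))), axis=1)

# Keep the first n_members indices of each permutation and look up the values.
rows = l[order[:, :n_members]]

print(rows)
```

Each row contains n_members distinct elements, although the same row can still appear more than once across rows.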

>>> import numpy
>>> l = numpy.array(['cat', 'mescaline', 'popcorn'])
>>> l[numpy.random.randint(0, len(l), (3, 2))]
array([['popcorn', 'mescaline'],
       ['mescaline', 'popcorn'],
       ['cat', 'cat']], 
      dtype='|S9')
