probability

limited_shuffle(iterable, buffer_size, rand_generator=None)[source]

Some algorithms require the order of data to be randomized. An obvious solution is to put it all in a list and shuffle, but if you don’t want to load everything into memory that’s not an option. This method iterates over the data, keeping a buffer and choosing at random from the buffer what to yield next. The result is less thoroughly shuffled than the simple solution, but the amount of memory used at any one time is limited to the buffer size.
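The buffered-shuffle technique can be sketched as follows. This is a simplified, hypothetical reimplementation (here called buffered_shuffle to distinguish it from the library function), not the library code: the rand_generator parameter is omitted and plain random is used instead.

```python
import random

def buffered_shuffle(iterable, buffer_size):
    """Yield items in a partially shuffled order, holding at most
    buffer_size items in memory at once."""
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything
            buffer.append(item)
        else:
            # Pick a random slot: yield what was there, store the new item
            idx = random.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Flush the remaining buffered items in random order
    random.shuffle(buffer)
    yield from buffer
```

Every input item is yielded exactly once; how shuffled the output is depends on buffer_size, with a larger buffer giving a more thorough shuffle at the cost of memory.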

limited_shuffle_numpy(iterable, buffer_size, randint_buffer_size=1000)[source]

Identical behaviour to limited_shuffle(), but uses NumPy’s random sampling routines to generate a large number of random integers at once. This can make execution somewhat bursty, but tends to speed things up overall, since the random sampling is done in one big call to NumPy.

batched_randint(low, high=None, batch_size=1000)[source]

Infinite iterable that produces random numbers in the given range. NumPy is called periodically to generate a large batch of random numbers at once, which are then yielded one by one. This is faster than sampling one number at a time.

Parameters:
  • low – lowest number in range
  • high – highest number in range
  • batch_size – number of ints to generate in one go
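A minimal sketch of the batching idea, assuming numpy.random.randint semantics for the low/high arguments (when high is omitted, NumPy samples from [0, low)); this is an illustration, not the library source:

```python
import numpy as np
from itertools import islice

def batched_randint(low, high=None, batch_size=1000):
    """Endless stream of random ints, drawn from NumPy in batches."""
    while True:
        # One NumPy call produces batch_size numbers at once, which we
        # then yield one by one as plain Python ints
        for n in np.random.randint(low, high, size=batch_size):
            yield int(n)

# Draw five numbers in [0, 10) from the infinite stream
five = list(islice(batched_randint(10), 5))
```

Since the iterable is infinite, consume it with islice() or break out of the loop explicitly; iterating to completion will never terminate.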
sequential_document_sample(corpus, start=None, shuffle=None, sample_rate=None)[source]

Wrapper around a pimlico.datatypes.tar.TarredCorpus to draw infinite samples of documents from the corpus, by iterating over the corpus (looping infinitely), yielding documents at random. If sample_rate is given, it should be a float between 0 and 1, specifying the rough proportion of documents to sample. A lower value spreads out the documents more on average.

Optionally, the samples are shuffled within a limited scope. Set shuffle to the size of this scope (higher will shuffle more, but need to buffer more samples in memory). Otherwise (shuffle=0), they will appear in the order they were in the original corpus.

If start is given, that number of documents will be skipped before drawing any samples. Set start=0 to start at the beginning of the corpus. By default (start=None) a random point in the corpus will be skipped to before beginning.

sequential_sample(iterable, start=0, shuffle=None, sample_rate=None)[source]

Draw infinite samples from an iterable, by iterating over it (looping infinitely), yielding items at random. If sample_rate is given, it should be a float between 0 and 1, specifying the rough proportion of items to sample. A lower value spreads out the samples more on average.

Optionally, the samples are shuffled within a limited scope. Set shuffle to the size of this scope (higher will shuffle more, but need to buffer more samples in memory). Otherwise (shuffle=0), they will appear in the order they were in the original iterable.

If start is given, that number of items will be skipped before drawing any samples. Set start=0 to start at the beginning of the iterable. Note that setting this to a high number can result in a slow start-up, if iterating over the items is slow.
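The looping-and-sampling technique can be sketched as follows. This is a simplified, generic reimplementation (here called looping_sample), not the library code: the shuffle option is omitted, and a make_iter factory argument is assumed in place of a plain iterable, since an iterator would be exhausted after the first pass.

```python
import random
from itertools import islice

def looping_sample(make_iter, start=0, sample_rate=None):
    """Loop over the data forever, yielding items at random.

    make_iter is a zero-argument factory returning a fresh iterable
    for each pass over the data.
    """
    first_pass = True
    while True:
        it = iter(make_iter())
        if first_pass:
            # Skip the first `start` items before drawing any samples
            next(islice(it, start, start), None)
            first_pass = False
        for item in it:
            # With no sample_rate, yield everything; otherwise keep each
            # item with probability sample_rate
            if sample_rate is None or random.random() < sample_rate:
                yield item
```

With sample_rate=None this simply cycles through the data indefinitely, skipping start items on the first pass only.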

Note

If you’re sampling documents from a TarredCorpus, it’s better to use sequential_document_sample(), since it makes use of TarredCorpus’s built-in features to do the skipping and sampling more efficiently.

subsample(iterable, sample_rate)[source]

Subsample the given iterable at the given rate, a float between 0 and 1 specifying the rough proportion of items to keep.
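The technique amounts to an independent coin flip per item; a minimal sketch (an illustration under that assumption, not the library source):

```python
import random

def subsample(iterable, sample_rate):
    """Yield each item independently with probability sample_rate."""
    for item in iterable:
        if random.random() < sample_rate:
            yield item
```

Unlike random.sample(), this never materializes the data, so it works on arbitrarily long (or infinite) iterables, at the cost of only approximating the requested proportion.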