pimlico.utils.probability module

pimlico.utils.probability.limited_shuffle(iterable, buffer_size)

Some algorithms require the order of the data to be randomized. An obvious solution is to put it all in a list and shuffle, but that is not an option if you don’t want to load everything into memory. This function iterates over the data, keeping a buffer and choosing at random from the buffer which item to yield next. The result is less thoroughly shuffled than with the simpler solution, but the amount of memory used at any one time is limited to the buffer size.
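The buffering idea described above can be sketched as follows. This is a minimal illustration, not Pimlico’s actual implementation: the name limited_shuffle_sketch and the rng parameter are hypothetical, and the exact policy for filling and emptying the buffer is an assumption.

```python
import random


def limited_shuffle_sketch(iterable, buffer_size, rng=random):
    """Yield the items of iterable in a locally shuffled order.

    Hypothetical sketch: holds at most buffer_size items in memory.
    """
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: move a randomly chosen item to the end and emit it
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: emit the remaining items in random order
    rng.shuffle(buffer)
    while buffer:
        yield buffer.pop()
```

In this sketch an item can move arbitrarily far back in the output, but it can never be yielded more than buffer_size - 1 positions earlier than its position in the input, which is why the result is less mixed than a full shuffle.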

pimlico.utils.probability.sequential_document_sample(corpus, start=None, shuffle=None, sample_rate=None)

Wrapper around a pimlico.datatypes.tar.TarredCorpus that draws an infinite stream of document samples by iterating over the corpus (looping infinitely) and yielding documents at random. If sample_rate is given, it should be a float between 0 and 1, specifying the rough proportion of documents to sample. A lower value leaves larger gaps, on average, between the sampled documents.

Optionally, the samples are shuffled within a limited scope. Set shuffle to the size of this scope: a higher value shuffles more, but needs to buffer more samples in memory. Otherwise (shuffle=0), the samples appear in the order in which they occur in the original corpus.

If start is given, that number of documents will be skipped before drawing any samples. Set start=0 to start at the beginning of the corpus. By default (start=None), sampling begins at a random point in the corpus.

pimlico.utils.probability.sequential_sample(iterable, start=0, shuffle=None, sample_rate=None)

Draw an infinite stream of samples from an iterable by iterating over it (looping infinitely) and yielding items at random. If sample_rate is given, it should be a float between 0 and 1, specifying the rough proportion of items to sample. A lower value leaves larger gaps, on average, between the sampled items.

Optionally, the samples are shuffled within a limited scope. Set shuffle to the size of this scope: a higher value shuffles more, but needs to buffer more samples in memory. Otherwise (shuffle=0), the samples appear in the order in which they occur in the original iterable.

If start is given, that number of items will be skipped before drawing any samples. Set start=0 to start at the beginning of the iterable. Note that setting this to a high number can result in a slow start-up if iterating over the items is slow.

Note

If you’re sampling documents from a TarredCorpus, it’s better to use sequential_document_sample(), since it makes use of TarredCorpus’s built-in features to do the skipping and sampling more efficiently.
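The looping, skipping and sampling behaviour described for sequential_sample() can be sketched roughly as below. This is an illustration under stated assumptions, not the library’s code: the name sequential_sample_sketch and the rng parameter are hypothetical, the input is assumed to be re-iterable (such as a list), sample_rate=None is assumed to mean that every item is kept, and the optional limited-scope shuffling is left out.

```python
import itertools
import random


def sequential_sample_sketch(iterable, start=0, sample_rate=None, rng=random):
    """Loop over a re-iterable source indefinitely, yielding a random subset.

    Hypothetical sketch: the shuffle option is omitted.
    """
    skip = start or 0
    while True:
        read_any = False
        # Only the first pass skips items; later passes start from the beginning
        for item in itertools.islice(iter(iterable), skip, None):
            read_any = True
            # Assumption: sample_rate=None keeps every item
            if sample_rate is None or rng.random() < sample_rate:
                yield item
        if not read_any and skip == 0:
            # Nothing could be read (empty or exhausted source): stop
            # instead of spinning forever
            return
        skip = 0


# Take the first 7 samples, skipping 3 items on the first pass only
first = list(itertools.islice(sequential_sample_sketch([0, 1, 2, 3, 4], start=3), 7))
```

Because the skipped items still have to be read and discarded, a large start costs one iteration step per skipped item, which is the slow start-up mentioned above.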

pimlico.utils.probability.subsample(iterable, sample_rate)

Subsample the given iterable at the given rate, a float between 0 and 1.
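One common way to subsample a stream at a fixed rate is to keep each item independently with probability sample_rate. The sketch below assumes that interpretation; the name subsample_sketch and the rng parameter are hypothetical, and the library’s own function may differ in detail.

```python
import random


def subsample_sketch(iterable, sample_rate, rng=random):
    """Yield each item of iterable independently with probability sample_rate."""
    for item in iterable:
        # rng.random() is uniform on [0, 1), so a rate of 1.0 keeps everything
        # and a rate of 0.0 keeps nothing
        if rng.random() < sample_rate:
            yield item
```

With this approach the number of items kept is only approximately sample_rate times the input length, and the relative order of the kept items is unchanged.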