python_hll.hll module

class python_hll.hll.HLL(log2m, regwidth, expthresh=-1, sparseon=True, type=1)[source]

Bases: object

A probabilistic set of hashed long elements. Useful for computing the approximate cardinality of a stream of data in very small storage.

A modified version of the ‘HyperLogLog’ data structure and algorithm is used, which combines both probabilistic and non-probabilistic techniques to improve the accuracy and storage requirements of the original algorithm.

More specifically, initializing and storing a new HLL will allocate a sentinel value symbolizing the empty set (HLLType.EMPTY). After adding the first few values, a sorted list of unique integers is stored in a HLLType.EXPLICIT hash set. When configured, accuracy can be sacrificed for memory footprint: the values in the sorted list are “promoted” to a “HLLType.SPARSE” map-based HyperLogLog structure. Finally, when enough registers are set, the map-based HLL will be converted to a bit-packed “HLLType.FULL” HyperLogLog structure.

This data structure is interoperable with the implementations found at:

when properly serialized.

MAXIMUM_EXPLICIT_THRESHOLD = 131072
MAXIMUM_EXPTHRESH_PARAM = 18
MAXIMUM_LOG2M_PARAM = 30
MAXIMUM_REGWIDTH_PARAM = 8
MINIMUM_EXPTHRESH_PARAM = -1
MINIMUM_LOG2M_PARAM = 4
MINIMUM_REGWIDTH_PARAM = 1
add_raw(raw_value)[source]

Adds rawValue directly to the HLL.

Parameters:raw_value (long) – the value to be added. It is very important that this value already be hashed with a strong (but not necessarily cryptographic) hash function. For instance, the MurmurHash3 implementation is an excellent hash function for this purpose.
Return type:void
cardinality()[source]

Computes the cardinality of the HLL.

Returns:the cardinality of HLL. This will never be negative.
Return type:long
clear()[source]

Clears the HLL. The HLL will have cardinality zero and will act as if no elements have been added.

NOTE: Unlike addRaw(long), clear does NOT handle transitions between HLLType’s - a probabilistic type will remain probabilistic after being cleared.

Return type:void
classmethod create_for_testing(log2m, regwidth, explicit_threshold, sparse_threshold, type)[source]

Convenience constructor for testing. Assumes that both HLLType.EXPLICIT and HLLType.SPARSE representations should be enabled.

Parameters:
  • log2m (int) – log-base-2 of the number of registers used in the HyperLogLog algorithm. Must be at least 4 and at most 30.
  • regwidth (int) – number of bits used per register in the HyperLogLog algorithm. Must be at least 1 and at most 8.
  • explicit_threshold (int) – cardinality threshold at which the HLLType.EXPLICIT representation should be promoted to HLLType.SPARSE. This must be greater than zero and less than or equal to MAXIMUM_EXPLICIT_THRESHOLD.
  • sparse_threshold (int) – register count threshold at which the HLLType.SPARSE representation should be promoted to HLLType.FULL. This must be greater than zero.
  • type (HLLType) – the type in the promotion hierarchy which this instance should start at. This cannot be None.
Return type:

HLL

classmethod from_bytes(bytes)[source]

Deserializes the HLL (in toBytes() format) serialized into bytes.

Parameters:bytes (list) – the serialized bytes of new HLL
Returns:the deserialized HLL. This will never be None.
Return type:HLL
get_type()[source]

Returns the type in the promotion hierarchy of this instance. This will never be None.

Return type:HLLType
to_bytes(schema_version=<python_hll.serialization.SchemaVersionOne object>)[source]

Serializes the HLL to an array of bytes in correspondence with the format of the default schema version, SerializationUtil.DEFAULT_SCHEMA_VERSION.

Parameters:schema_version (SchemaVersion) – the schema version dictating the serialization format
Returns:the array of bytes representing the HLL. This will never be None or empty.
Return type:list
union(other)[source]

Computes the union of HLLs and stores the result in this instance.

Parameters:other (HLL) – the other HLL instance to union into this one. This cannot be None.
Return type:void