python_hll.hll module¶
-
class
python_hll.hll.
HLL
(log2m, regwidth, expthresh=-1, sparseon=True, type=1)[source]¶ Bases:
object
A probabilistic set of hashed
long
elements. Useful for computing the approximate cardinality of a stream of data in very small storage.A modified version of the ‘HyperLogLog’ data structure and algorithm is used, which combines both probabilistic and non-probabilistic techniques to improve the accuracy and storage requirements of the original algorithm.
More specifically, initializing and storing a new HLL will allocate a sentinel value symbolizing the empty set (HLLType.EMPTY). After adding the first few values, a sorted list of unique integers is stored in a HLLType.EXPLICIT hash set. When configured, accuracy can be sacrificed for memory footprint: the values in the sorted list are “promoted” to a “HLLType.SPARSE” map-based HyperLogLog structure. Finally, when enough registers are set, the map-based HLL will be converted to a bit-packed “HLLType.FULL” HyperLogLog structure.
This data structure is interoperable with the implementations found at:
when properly serialized.
-
MAXIMUM_EXPLICIT_THRESHOLD
= 131072¶
-
MAXIMUM_EXPTHRESH_PARAM
= 18¶
-
MAXIMUM_LOG2M_PARAM
= 30¶
-
MAXIMUM_REGWIDTH_PARAM
= 8¶
-
MINIMUM_EXPTHRESH_PARAM
= -1¶
-
MINIMUM_LOG2M_PARAM
= 4¶
-
MINIMUM_REGWIDTH_PARAM
= 1¶
-
add_raw
(raw_value)[source]¶ Adds
rawValue
directly to the HLL.Parameters: raw_value (long) – the value to be added. It is very important that this value already be hashed with a strong (but not necessarily cryptographic) hash function. For instance, the MurmurHash3 implementation is an excellent hash function for this purpose. Return type: void
-
cardinality
()[source]¶ Computes the cardinality of the HLL.
Returns: the cardinality of HLL. This will never be negative. Return type: long
-
clear
()[source]¶ Clears the HLL. The HLL will have cardinality zero and will act as if no elements have been added.
NOTE: Unlike
addRaw(long)
,clear
does NOT handle transitions betweenHLLType
’s - a probabilistic type will remain probabilistic after being cleared.Return type: void
-
classmethod
create_for_testing
(log2m, regwidth, explicit_threshold, sparse_threshold, type)[source]¶ Convenience constructor for testing. Assumes that both
HLLType.EXPLICIT
andHLLType.SPARSE
representations should be enabled.Parameters: - log2m (int) – log-base-2 of the number of registers used in the HyperLogLog algorithm. Must be at least 4 and at most 30.
- regwidth (int) – number of bits used per register in the HyperLogLog algorithm. Must be at least 1 and at most 8.
- explicit_threshold (int) – cardinality threshold at which the
HLLType.EXPLICIT
representation should be promoted toHLLType.SPARSE
. This must be greater than zero and less than or equal toMAXIMUM_EXPLICIT_THRESHOLD
. - sparse_threshold (int) – register count threshold at which the
HLLType.SPARSE
representation should be promoted toHLLType.FULL
. This must be greater than zero. - type (HLLType) – the type in the promotion hierarchy which this instance should
start at. This cannot be
None
.
Return type:
-
classmethod
from_bytes
(bytes)[source]¶ Deserializes the HLL (in
toBytes()
format) serialized intobytes
.Parameters: bytes (list) – the serialized bytes of new HLL Returns: the deserialized HLL. This will never be None
.Return type: HLL
-
get_type
()[source]¶ Returns the type in the promotion hierarchy of this instance. This will never be
None
.Return type: HLLType
-
to_bytes
(schema_version=<python_hll.serialization.SchemaVersionOne object>)[source]¶ Serializes the HLL to an array of bytes in correspondence with the format of the default schema version,
SerializationUtil.DEFAULT_SCHEMA_VERSION
.Parameters: schema_version (SchemaVersion) – the schema version dictating the serialization format Returns: the array of bytes representing the HLL. This will never be None
or empty.Return type: list
-