python_hll.serialization module

class python_hll.serialization.BigEndianAscendingWordDeserializer(word_length, byte_padding, bytes)

Bases: object

A corresponding deserializer for BigEndianAscendingWordSerializer.

BITS_PER_BYTE = 8
BYTE_MASK = 255
read_word()

Return the next word in the sequence. Should not be called more than total_word_count times.

Return type:long
total_word_count()

Returns the number of words that could be encoded in the sequence.

NOTE: the sequence that was encoded may be shorter than the value this method returns, because the last byte may contain padding bits. The return value is therefore only an upper bound on the number of times read_word() can be called.
Returns:the maximum number of words that could be read from the sequence.
Return type:int
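
The following is a minimal standalone sketch of the reading side, not the class's actual implementation. The helper name unpack_words is hypothetical, and treating byte_padding as a count of leading bytes to skip is an assumption. It shows why the word count is only an upper bound: seven 5-bit words occupy 35 bits, which need five bytes (40 bits), so an eighth all-zero "word" can still be read from the padding.

```python
def unpack_words(data, word_length, byte_padding=0):
    """Read fixed-width big-endian words from a list of byte values.

    Hypothetical helper for illustration; not part of python_hll.
    byte_padding is assumed to be the number of leading bytes to skip.
    """
    payload = bytes(data[byte_padding:])
    total_bits = len(payload) * 8
    # Upper bound on the word count; trailing bits may be padding.
    total_word_count = total_bits // word_length

    bits = int.from_bytes(payload, "big")
    mask = (1 << word_length) - 1
    words = []
    for i in range(total_word_count):
        # The first word sits in the highest bits of the payload.
        shift = total_bits - (i + 1) * word_length
        words.append((bits >> shift) & mask)
    return words


# The two bytes from the BigEndianAscendingWordSerializer example.
short_sequence = unpack_words([0xF8, 0x4A], word_length=5)

# Seven 5-bit words (1..7) stored in five bytes; an eighth word is readable.
long_sequence = unpack_words([0x08, 0x86, 0x42, 0x98, 0xE0], word_length=5)
print(short_sequence, long_sequence)
```

A caller that knows how many words were written should stop at that count rather than read up to the upper bound.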
class python_hll.serialization.BigEndianAscendingWordSerializer(word_length, word_count, byte_padding)

Bases: object

A serializer that writes a sequence of fixed bit-width ‘words’ to a byte array. Bitwise OR is used to write words into bytes, so a low bit in a word is also a low bit in a byte. However, a high byte in a word is written at a lower index in the array than a low byte in a word. The first word is written at the lowest array index. Each serializer is one time use and returns its backing byte array.

This encoding was chosen so that when reading bytes as octets in the typical first-octet-is-the-high-nibble fashion, an octet-to-binary conversion would yield a high-to-low, left-to-right view of the “short words”.

Example:

Say short words are 5 bits wide. Our word sequence is the values [31, 1, 5]. In big-endian binary format, the values are [0b11111, 0b00001, 0b00101]. We use 15 of 16 bits in two bytes and pad the last (lowest) bit of the last byte with a zero:

[0b11111000, 0b01001010] = [0xF8, 0x4A]
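
The example above can be reproduced with a short standalone sketch. It does not use python_hll, and the helper name pack_words is hypothetical. For brevity it accumulates the words in one large integer instead of OR-ing them into bytes in place, which produces the same byte layout.

```python
def pack_words(words, word_length):
    """Pack fixed-width words big-endian, first word at the lowest index.

    Hypothetical helper for illustration; not part of python_hll.
    """
    mask = (1 << word_length) - 1
    bits = 0
    for word in words:
        # Each new word is appended below the words already written.
        bits = (bits << word_length) | (word & mask)

    total_bits = word_length * len(words)
    byte_count = (total_bits + 7) // 8
    # Zero-pad the unused low bits of the last byte.
    bits <<= byte_count * 8 - total_bits
    return list(bits.to_bytes(byte_count, "big"))


packed = pack_words([31, 1, 5], word_length=5)
print([hex(b) for b in packed])  # ['0xf8', '0x4a']
```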
BITS_PER_BYTE = 8
get_bytes()

Returns the backing array of bytes that contains the serialized words.

Returns:the serialized words as a list of bytes.
Return type:list
write_word(word)

Writes the word to the backing array.

Parameters:word (long) – the word to write.
Return type:None
class python_hll.serialization.HLLMetadata(schema_version, type, register_count_log2, register_width, log2_explicit_cutoff, explicit_off, explicit_auto, sparse_enabled)

Bases: object

The metadata and parameters associated with an HLL.

explicit_auto()
Returns:True if the HLLType.EXPLICIT representation cutoff cardinality is set to be automatically chosen, False otherwise.
Return type:boolean
explicit_off()
Returns:True if the HLLType.EXPLICIT representation has been disabled, False otherwise.
Return type:boolean
hll_type()
Returns:the type of the HLL. This will never be None.
Return type:HLLType
log2_explicit_cutoff()
Returns:the log-base-2 of the explicit cutoff cardinality. This will always be greater than or equal to zero and less than 31, per the specification.
Return type:int
register_count_log2()
Returns:the log-base-2 of the register count parameter of the HLL. This will always be greater than or equal to 4 and less than or equal to 31.
Return type:int
register_width()
Returns:the register width parameter of the HLL. This will always be greater than or equal to 1 and less than or equal to 8.
Return type:int
schema_version()
Returns:the schema version of the HLL. This will never be None.
Return type:int
sparse_enabled()
Returns:True if the HLLType.SPARSE representation is enabled, False otherwise.
Return type:boolean
class python_hll.serialization.SchemaVersionOne

Bases: object

A serialization schema for HLLs. Reads and writes HLL metadata to and from byte representations.

EXPLICIT_AUTO = 63
EXPLICIT_OFF = 0
HEADER_BYTE_COUNT = 3
SCHEMA_VERSION = 1
TYPE_ORDINALS = [5, 1, 2, 3, 4]
get_deserializer(type, word_length, bytes)

Builds an HLL deserializer that matches this schema version.

Parameters:
  • type (HLLType) – the HLL type that will be deserialized. This cannot be None.
  • word_length (int) – the length of the ‘words’ that comprise the data of the serialized HLL. Words must be at least 5 bits and at most 64 bits long.
  • bytes (list) – the serialized HLL to deserialize. This cannot be None.
Returns:

a byte array deserializer used to deserialize an HLL serialized according to this schema version’s specification.

Return type:

BigEndianAscendingWordDeserializer

get_serializer(type, word_length, word_count)

Builds an HLL serializer that matches this schema version.

Parameters:
  • type (HLLType) – the HLL type that will be serialized. This cannot be None.
  • word_length (int) – the length of the ‘words’ that comprise the data of the HLL. Words must be at least 5 bits and at most 64 bits long.
  • word_count (int) – the number of ‘words’ in the HLL’s data.
Returns:

a byte array serializer used to serialize an HLL according to this schema version’s specification.

Return type:

BigEndianAscendingWordSerializer

padding_bytes(type)

The number of metadata bytes required for a serialized HLL of the specified type.

Parameters:type (HLLType) – the type of the serialized HLL
Returns:the number of padding bytes needed in order to fully accommodate the needed metadata.
Return type:int
read_metadata(bytes)

Reads the metadata bytes of the serialized HLL.

Parameters:bytes (list) – the serialized HLL
Returns:the HLL metadata
Return type:HLLMetadata
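
As a rough illustration of what reading the header involves, the sketch below decodes three header bytes into raw fields using the bit layouts documented for SerializationUtil. It is not the library's implementation: the helper name read_header is hypothetical, and the byte order (version, parameters, cutoff) is an assumption that this page does not state.

```python
def read_header(data):
    """Decode three header bytes of a serialized HLL into raw fields.

    Hypothetical helper; the byte order (version, parameters, cutoff)
    is an assumption, not something stated in this reference.
    """
    version_byte, parameters_byte, cutoff_byte = data[:3]
    return {
        # Top nibble: schema version; bottom nibble: type ordinal.
        "schema_version": (version_byte >> 4) & 0x0F,
        "type_ordinal": version_byte & 0x0F,
        # Top 3 bits store register_width - 1; bottom 5 bits store log2.
        "register_width": ((parameters_byte >> 5) & 0x07) + 1,
        "register_count_log2": parameters_byte & 0x1F,
        # Second-highest bit: sparse flag; lowest 6 bits: explicit cutoff.
        "sparse_enabled": bool((cutoff_byte >> 6) & 1),
        "explicit_cutoff": cutoff_byte & 0x3F,
    }


header = read_header([0x14, 0x8B, 0x7F])
print(header)
```

Here the explicit cutoff field decodes to 63, which matches the EXPLICIT_AUTO constant listed above.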
schema_version_number()
Returns:the schema version number
Return type:int
write_metadata(bytes, metadata)

Writes the metadata bytes to the serialized HLL.

Parameters:
  • bytes (list) – the padded data bytes of the HLL
  • metadata (HLLMetadata) – the metadata to write to the padding bytes
Return type:

None

class python_hll.serialization.SerializationUtil

Bases: object

A collection of constants and utilities for serializing and deserializing HLLs.

DEFAULT_SCHEMA_VERSION = <python_hll.serialization.SchemaVersionOne object>
EXPLICIT_CUTOFF_BITS = 6
EXPLICIT_CUTOFF_MASK = 63
LOG2_REGISTER_COUNT_BITS = 5
LOG2_REGISTER_COUNT_MASK = 31
NIBBLE_BITS = 4
NIBBLE_MASK = 15
REGISTERED_SCHEMA_VERSIONS = [None, <python_hll.serialization.SchemaVersionOne object>]
REGISTER_WIDTH_BITS = 3
REGISTER_WIDTH_MASK = 7
VERSION_ONE = <python_hll.serialization.SchemaVersionOne object>
classmethod explicit_cutoff(cutoff_byte)

Extracts the explicit cutoff value from the cutoff byte of a serialized HLL.

Parameters:cutoff_byte (byte) – the cutoff byte of the serialized HLL
Returns:the explicit cutoff value
Return type:int
classmethod get_schema_version(bytes)

Get the appropriate SchemaVersion for the specified serialized HLL.

Parameters:bytes (list) – the serialized HLL whose schema version is desired.

Returns:the schema version for the specified HLL. This will never be None.
Return type:SchemaVersion

classmethod get_schema_version_from_number(schema_version_number)
Parameters:schema_version_number (int) – the version number of the SchemaVersion desired. This must be a registered schema version number.
Returns:The SchemaVersion for the given number. This will never be None.
Return type:SchemaVersion
classmethod pack_cutoff_byte(explicit_cutoff, sparse_enabled)

Generates a byte that encodes the log-base-2 of the explicit cutoff or sentinel values for ‘explicit-disabled’ or ‘auto’, as well as the boolean indicating whether to use HLLType.SPARSE in the promotion hierarchy.

The top bit is always padding, the second highest bit indicates the ‘sparse-enabled’ boolean, and the lowest six bits encode the explicit cutoff value.

Parameters:
  • explicit_cutoff (int) – the explicit cutoff value to encode:
      • If ‘explicit-disabled’ is chosen, this value should be 0.
      • If ‘auto’ is chosen, this value should be 63.
      • If a cutoff of 2^n is desired, for 0 <= n < 31, this value should be n + 1.
  • sparse_enabled (boolean) – whether HLLType.SPARSE should be used in the promotion hierarchy to improve HLL storage.
Return type:

byte
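
The bit layout described for the cutoff byte can be sketched as follows. This is a standalone illustration rather than the library's code; the function names mirror the documented ones for readability, unpack_cutoff_byte is a hypothetical helper, and the constants are copied from the values listed for SerializationUtil.

```python
EXPLICIT_CUTOFF_BITS = 6
EXPLICIT_CUTOFF_MASK = 63


def pack_cutoff_byte(explicit_cutoff, sparse_enabled):
    """Illustrative re-creation of the documented layout (not library code)."""
    # The sparse flag goes in the second-highest bit (bit 6).
    sparse_bit = (1 << EXPLICIT_CUTOFF_BITS) if sparse_enabled else 0
    # The lowest six bits hold the encoded cutoff value.
    return sparse_bit | (explicit_cutoff & EXPLICIT_CUTOFF_MASK)


def unpack_cutoff_byte(cutoff_byte):
    """Hypothetical inverse: returns (explicit_cutoff, sparse_enabled)."""
    explicit_cutoff = cutoff_byte & EXPLICIT_CUTOFF_MASK
    sparse_enabled = bool((cutoff_byte >> EXPLICIT_CUTOFF_BITS) & 1)
    return explicit_cutoff, sparse_enabled


# A cutoff of 2**7 = 128 is encoded as n + 1 = 8.
cutoff_byte = pack_cutoff_byte(7 + 1, sparse_enabled=True)
print(hex(cutoff_byte), unpack_cutoff_byte(cutoff_byte))
```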

classmethod pack_parameters_byte(register_width, register_count_log2)

Generates a byte that encodes the parameters of an HLLType.FULL or HLLType.SPARSE HLL.

The top 3 bits are used to encode register_width - 1 (the range of register_width is thus 1-8) and the bottom 5 bits are used to encode register_count_log2 (the range of register_count_log2 is thus 0-31).

Parameters:
  • register_width (int) – the register width (must be at least 1 and at most 8)
  • register_count_log2 (int) – the log-base-2 of the register count (must be at least 0 and at most 31)
Returns:

the packed parameters byte

Return type:

byte
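
A minimal sketch of this packing, written independently of the library: three bits can hold the values 0 to 7, so storing register_width - 1 there covers widths 1 to 8. The function name mirrors the documented one, unpack_parameters_byte is a hypothetical helper, and the constants are copied from the values listed for SerializationUtil.

```python
LOG2_REGISTER_COUNT_BITS = 5
LOG2_REGISTER_COUNT_MASK = 31
REGISTER_WIDTH_MASK = 7


def pack_parameters_byte(register_width, register_count_log2):
    """Illustrative re-creation of the documented layout (not library code)."""
    # The top 3 bits store register_width - 1.
    width_bits = ((register_width - 1) & REGISTER_WIDTH_MASK) << LOG2_REGISTER_COUNT_BITS
    # The bottom 5 bits store log2 of the register count.
    return width_bits | (register_count_log2 & LOG2_REGISTER_COUNT_MASK)


def unpack_parameters_byte(parameters_byte):
    """Hypothetical inverse: returns (register_width, register_count_log2)."""
    register_width = ((parameters_byte >> LOG2_REGISTER_COUNT_BITS) & REGISTER_WIDTH_MASK) + 1
    register_count_log2 = parameters_byte & LOG2_REGISTER_COUNT_MASK
    return register_width, register_count_log2


parameters_byte = pack_parameters_byte(register_width=5, register_count_log2=11)
print(hex(parameters_byte), unpack_parameters_byte(parameters_byte))
```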

classmethod pack_version_byte(schema_version, type_ordinal)

Generates a byte that encodes the schema version and the type ordinal of the HLL.

The top nibble is the schema version and the bottom nibble is the type ordinal.

Parameters:
  • schema_version (int) – the schema version to encode.
  • type_ordinal (int) – the type ordinal of the HLL to encode.
Returns:

the packed version byte

Return type:

byte
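
The nibble layout can be sketched in a few lines. This is an illustration under the layout described above, not the library's code; unpack_version_byte is a hypothetical helper, and the constants are copied from the values listed for SerializationUtil.

```python
NIBBLE_BITS = 4
NIBBLE_MASK = 15


def pack_version_byte(schema_version, type_ordinal):
    """Illustrative re-creation of the documented layout (not library code)."""
    # Top nibble: schema version; bottom nibble: type ordinal.
    return ((schema_version & NIBBLE_MASK) << NIBBLE_BITS) | (type_ordinal & NIBBLE_MASK)


def unpack_version_byte(version_byte):
    """Hypothetical inverse: returns (schema_version, type_ordinal)."""
    schema_version = (version_byte >> NIBBLE_BITS) & NIBBLE_MASK
    type_ordinal = version_byte & NIBBLE_MASK
    return schema_version, type_ordinal


version_byte = pack_version_byte(schema_version=1, type_ordinal=4)
print(hex(version_byte), unpack_version_byte(version_byte))
```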

classmethod register_count_log2(parameters_byte)

Extracts the log2(register_count) from the parameters byte of a serialized HLLType.FULL HLL.

Parameters:parameters_byte (byte) – the parameters byte of the serialized HLL
Returns:log2(register_count) of the serialized HLL
Return type:int
classmethod register_width(parameters_byte)

Extracts the register width from the parameters byte of a serialized HLLType.FULL HLL.

Parameters:parameters_byte (byte) – the parameters byte of the serialized HLL
Returns:the register width of the serialized HLL
Return type:int
classmethod schema_version(version_byte)

Extracts the schema version from the version byte of a serialized HLL.

Parameters:version_byte (byte) – the version byte of the serialized HLL
Returns:the schema version of the serialized HLL
Return type:int
classmethod sparse_enabled(cutoff_byte)

Extracts the ‘sparse-enabled’ boolean from the cutoff byte of a serialized HLL.

Parameters:cutoff_byte (byte) – the cutoff byte of the serialized HLL
Returns:the ‘sparse-enabled’ boolean
Return type:boolean
classmethod type_ordinal(version_byte)

Extracts the type ordinal from the version byte of a serialized HLL.

Parameters:version_byte (byte) – the version byte of the serialized HLL
Returns:the type ordinal of the serialized HLL
Return type:int