# Deterministic Split

Functions for deterministic data splitting with configurable hash methods.
## HashMethod

Bases: `Enum`

Available hash methods for deterministic splitting.
Attributes:

| Name | Type | Description |
|---|---|---|
| `DEFAULT` | `str` | PySpark's default hash function |
| `XXHASH64` | `str` | xxHash algorithm, generally faster than cryptographic hashes |
| `MD5` | `str` | MD5 cryptographic hash function |
| `SHA2` | `str` | SHA-256 cryptographic hash function |
Source code in heiwhy/data_split/deterministic_split.py
### from_string(method_name) `classmethod`

Convert a string to the corresponding `HashMethod` enum value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method_name` | `str` | Name of the hash method to use. Case-insensitive. | *required* |
Returns:

| Type | Description |
|---|---|
| `HashMethod` | The corresponding HashMethod enum value |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the method name is not recognized |
Source code in heiwhy/data_split/deterministic_split.py
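The case-insensitive lookup described above can be sketched in plain Python. This is an illustrative reimplementation of the documented behavior, not the library's actual source; the real enum lives in `heiwhy/data_split/deterministic_split.py` and may differ in detail:

```python
from enum import Enum


class HashMethod(Enum):
    # Member values mirror the documented valid strings.
    DEFAULT = "default"
    XXHASH64 = "xxhash64"
    MD5 = "md5"
    SHA2 = "sha2"

    @classmethod
    def from_string(cls, method_name: str) -> "HashMethod":
        """Case-insensitive lookup; raises ValueError for unknown names."""
        try:
            return cls(method_name.lower())
        except ValueError:
            valid = ", ".join(m.value for m in cls)
            raise ValueError(
                f"Unknown hash method {method_name!r}. Valid options: {valid}"
            ) from None


print(HashMethod.from_string("XxHash64"))  # HashMethod.XXHASH64
```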
## deterministic_balanced_split(dataframe, id_column, number_of_splits, output_column='group', group_names=None, hash_method=None)
Assign records to groups based on a hash of the ID column for A/B/n testing.
This function performs deterministic group assignment for A/B/n testing by hashing ID values. The assignment process guarantees several key properties:
- Deterministic: The same ID will always be assigned to the same group
- Consistent: The assignment remains stable regardless of dataset size
- Balanced: Groups are as evenly sized as possible given the hash distribution
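The hash-then-modulo idea behind these properties can be sketched in plain Python. This is a simplified illustration using `hashlib.md5` on single values; the actual function operates on a PySpark DataFrame column with the configured hash method:

```python
import hashlib


def assign_group(record_id, number_of_splits, group_names=None):
    """Illustrative sketch: hash the ID, take it modulo the split count."""
    names = group_names or [f"group_{i + 1}" for i in range(number_of_splits)]
    digest = hashlib.md5(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % number_of_splits
    return names[bucket]


# The same ID always lands in the same group, regardless of dataset size.
assert assign_group(42, 2) == assign_group(42, 2)
```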
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataframe` | `DataFrame` | Input DataFrame containing the data to be split | *required* |
| `id_column` | `str` | Name of the column containing unique identifiers | *required* |
| `number_of_splits` | `int` | Number of groups to split the data into | *required* |
| `output_column` | `str` | Name of the output column containing group assignments | `'group'` |
| `group_names` | `list[str] \| None` | Custom names for the groups. Must match `number_of_splits` if provided. If None, groups are named "group_1", "group_2", etc. | `None` |
| `hash_method` | `str \| HashMethod \| None` | Hash method to use, as a string or `HashMethod` enum. Valid string values: "default", "xxhash64", "md5", "sha2". If None, the most balanced method is selected automatically. | `None` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame with an additional column containing group assignments |
Examples:

```python
>>> # Example 1: Automatic hash method selection
>>> df = spark.createDataFrame(
...     data=[(1,), (2,), (3,)],
...     schema=["user_id"]
... )
>>> result = deterministic_balanced_split(
...     dataframe=df,
...     id_column="user_id",
...     number_of_splits=2
... )

>>> # Example 2: Specify hash method as string with custom group names
>>> result = deterministic_balanced_split(
...     dataframe=df,
...     id_column="user_id",
...     number_of_splits=2,
...     group_names=["control", "treatment"],
...     hash_method="xxhash64",
...     output_column="experiment_group"
... )
```
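When `hash_method` is None, the function picks whichever method splits the data most evenly. The idea can be sketched in plain Python: compute the spread in group sizes for each candidate hash and keep the smallest. Function names here are illustrative, not the library's internals, and only `hashlib` algorithms stand in for the PySpark hash functions:

```python
import hashlib
from collections import Counter


def spread(ids, number_of_splits, hash_name):
    """Difference between the largest and smallest group for one hash."""
    hash_fn = getattr(hashlib, hash_name)
    counts = Counter(
        int(hash_fn(str(i).encode("utf-8")).hexdigest(), 16) % number_of_splits
        for i in ids
    )
    sizes = [counts.get(bucket, 0) for bucket in range(number_of_splits)]
    return max(sizes) - min(sizes)


ids = range(1, 1001)
# Keep the candidate that distributes these IDs most evenly.
best = min(["md5", "sha256"], key=lambda name: spread(ids, 3, name))
```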
Source code in heiwhy/data_split/deterministic_split.py