Batch data processing
Install Begin’s Library
First, install the Begin AI library from pip:

```shell
pip install beginai
```

If you use virtual environments, remember to add `beginai` to your `requirements.txt` as well.
Add Your Account Credentials
Next, open your Python editor and import the library; then initialize it with the app_id and license_key that you can find under your settings menu in your account.
```python
import beginai as bg

applier = bg.AlgorithmsApplier(app_id=APP_ID, license_key=LICENSE_KEY)
```
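To keep the credentials out of source control, one option is to read them from environment variables before initializing the applier. This is only a sketch — the variable names `BEGIN_APP_ID` and `BEGIN_LICENSE_KEY` are made up; use whatever convention your team prefers:

```python
import os

# Hypothetical environment-variable names -- not part of the SDK itself.
APP_ID = os.environ.get("BEGIN_APP_ID", "")
LICENSE_KEY = os.environ.get("BEGIN_LICENSE_KEY", "")
```

You can then pass these values to `bg.AlgorithmsApplier(...)` exactly as above.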
Preparing the CSV files
The SDK is not case-sensitive, so don't worry if the columns in your CSV use different casing than the Schema. You do, however, still need to make sure there are no typos or other differences between your CSV and your Schema.
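Because the matching tolerates case differences but not typos, a quick header check before uploading can save a round trip. A minimal sketch — the helper name and the schema property names here are placeholders, not part of the SDK:

```python
import csv
import io

def check_headers(csv_text, schema_columns):
    """Return schema columns missing from the CSV header, ignoring case."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = {c.lower() for c in schema_columns}
    found = {c.lower() for c in header}
    return sorted(expected - found)

# Example: 'dificulty' is a typo and gets flagged.
missing = check_headers(
    "mission_id,loot_available,dificulty,number_of_enemies",
    ["loot_available", "difficulty", "number_of_enemies"],
)
print(missing)  # ['difficulty']
```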
HEADS UP: In Python, dates are required to follow the format `dd-mm-yyyy`. For example, July 10th, 1985 must be expressed as `10-07-1985`.

To use the Batch SDK, your CSVs must contain the correct columns and structure. The next sections walk you through the criteria.
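If your dates live in Python `date` or `datetime` objects, the standard library's `strftime` produces the required format directly:

```python
from datetime import date

# July 10th, 1985 in the dd-mm-yyyy format the SDK expects.
formatted = date(1985, 7, 10).strftime("%d-%m-%Y")
print(formatted)  # 10-07-1985
```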
CSV structure for Users and Objects
When using either `load_user_data` or `load_object_data`, your CSV must contain the following structure:

- `id_column`: a column representing the id of the User/Object. Note: the column doesn't need to be named `id_column`.
- Columns representing the properties defined on the Schema (the names must match the Schema).
As an example, let's assume you have an object named `Mission` in your Schema with the following attributes:

- `number_of_enemies`
- `loot_available`
- `difficulty`

Your CSV should have the following structure:

```
mission_id,loot_available,difficulty,number_of_enemies
```

The order of the columns does not need to match the order in which the properties were defined on the Schema.
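The standard library's `csv` module writes this structure for you. A sketch using the `Mission` columns above — the row values are invented for illustration:

```python
import csv

# Invented example rows for the hypothetical Mission object.
rows = [
    {"mission_id": "m-1", "loot_available": True, "difficulty": 3, "number_of_enemies": 12},
    {"mission_id": "m-2", "loot_available": False, "difficulty": 5, "number_of_enemies": 40},
]

with open("missions.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["mission_id", "loot_available", "difficulty", "number_of_enemies"]
    )
    writer.writeheader()
    writer.writerows(rows)
```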
CSV structure for Session
When using the method `load_session_data`, the CSV must contain the following structure:

- `id_column`: a column representing the id of the User. Note: the column doesn't need to be named `id_column`.
- `session_date`: a column representing the session date (in the format `dd-mm-yyyy`). Note: the column doesn't need to be named `session_date`.
- `duration`: a column representing the session duration in minutes. Note: the column doesn't need to be named `duration`.

Your CSV should have the following structure:

```
id_column,session_date,duration
```
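If you record raw timestamps rather than durations, both columns can be derived. A sketch assuming you have start and end `datetime` values per session (the timestamps and user id are invented):

```python
from datetime import datetime

start = datetime(2023, 7, 10, 14, 0)
end = datetime(2023, 7, 10, 14, 45)

session_date = start.strftime("%d-%m-%Y")            # dd-mm-yyyy, as required
duration = int((end - start).total_seconds() // 60)  # duration in whole minutes

print(f"user-1,{session_date},{duration}")  # user-1,10-07-2023,45
```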
CSV structure for Interactions
When using the method `load_interactions`, the CSV must contain the following structure:

- `id_column`: a column representing the id of the User. Note: the column doesn't need to be named `id_column`.
- `target_id_column`: a column representing the id of the Object. Note: the column doesn't need to be named `target_id_column`.
- `action`: a column representing the name of the action performed. Note: the column doesn't need to be named `action`.
- Columns representing the properties defined on the Schema for the interaction (the names must match the Schema).
As an example, let's assume you have an object named `Hero` in your Schema with two interactions, `has` and `played`. The `has` interaction has the following properties:

- `acquired_date`
- `is_winner`
- `competed_before`

while the `played` interaction has no properties. Your CSV should have the following structure:

```
id_column,target_id_column,action,acquired_date,is_winner,competed_before
```
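A few example rows for this structure (all values are invented). Rows for an action with no properties simply leave the property columns empty, which `csv.DictWriter` does automatically for missing keys:

```python
import csv

fieldnames = ["id_column", "target_id_column", "action",
              "acquired_date", "is_winner", "competed_before"]

rows = [
    # A 'has' row fills in all of the interaction's properties.
    {"id_column": "u-1", "target_id_column": "hero-7", "action": "has",
     "acquired_date": "10-07-1985", "is_winner": True, "competed_before": False},
    # 'played' has no properties, so those columns stay empty.
    {"id_column": "u-1", "target_id_column": "hero-7", "action": "played"},
]

with open("interactions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```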
You can have both actions (`played` and `has`) in the same CSV; just make sure all the properties for both actions are defined.

Load Your Data
By default, the SDK assumes that your CSV uses `,` as the column separator. If your CSV uses a different character, just add `file_separator='{your_value}'` to the calls to `load_user_data`, `load_object_data`, `load_session_data`, and `load_interactions`.

Now you're ready to load your users' data from a CSV.
```python
applier.load_user_data('users.csv', 'user_id_column_name', label_column='label_column', created_at_column='created_at_column', file_separator='your_csv_file_separator')
applier.learn_from_data()
```
The two lines above load your CSV into memory. From there, the SDK locally applies Begin's platform-generated instructions, anonymizing your users' data by converting it into mathematical signatures. These signatures are then submitted to Begin's platform.
Similarly, you can apply Begin’s learning algorithms on the remaining objects and on users’ interactions.
```python
applier.load_object_data('objects.csv', 'object_name_as_defined_in_schema', 'object_id_column_name', label_column='label_column', created_at_column='created_at_column', file_separator='your_csv_file_separator')
applier.learn_from_data()

applier.load_session_data('session.csv', 'user_id_column', 'session_date_column', 'session_duration_column', file_separator='your_csv_file_separator')
applier.learn_from_data()
```
And for interactions between the user and the object:
```python
applier.load_interactions(
    'interactions.csv',
    'user_id_column',
    'object_name_as_defined_in_schema',
    'object_id_column',
    'interaction_column',
    created_at_column='created_at_column',
    file_separator='your_csv_file_separator'
)
applier.learn_from_data()
```
HEADS UP: Make sure to use the exact name of the object/interaction as defined in the schema.
Processing Large Datasets
We recommend splitting your CSV into multiple smaller CSVs if you're processing a large amount of data: every call to learn from data refreshes the library's memory, and you can load as many CSVs as you like (the example on GitHub loads about 200 million interactions, split over multiple CSVs).
How large is “too large”? Our recommendation is: if your laptop can’t handle it, split it. An average laptop can process about 300k records in 30 minutes on a dataset with 30 features.
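Splitting can be done with the standard library alone. A sketch (the function names and chunk-file naming are my own, not part of the SDK) that copies the header into every chunk so each file stays loadable on its own:

```python
import csv

def _write_chunk(header, rows, prefix, index):
    """Write one chunk file with the original header repeated."""
    name = f"{prefix}_{index}.csv"
    with open(name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return name

def split_csv(path, rows_per_chunk, prefix="chunk"):
    """Split one CSV into several smaller ones; returns the chunk filenames."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, index, written = [], 0, []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                written.append(_write_chunk(header, chunk, prefix, index))
                chunk, index = [], index + 1
        if chunk:  # leftover rows form the final, smaller chunk
            written.append(_write_chunk(header, chunk, prefix, index))
    return written
```

Each chunk file can then be passed through `load_user_data` / `learn_from_data` (or the other loaders) in turn.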