Dataset and features

Dataset format

When you do a GET request on the URL under a specific dataset key in dataUrls, such as initial, you'll get the full dataset with all features returned as JSON.

{
  "data": [
    [
      "eb987a78daea43a8af908f7bad5fcdbb",
      "2020-04-20T14:32:00.000Z",
      "2020-05-14T18:12:44.184Z",
      "true",
      "2020-05-14T18:12:44.184Z",
      0.2594,
      "initial",
      "2020-04-20T14:32:00.000Z",
      "2020-04-20T14:32:00.000Z",
      "false",
      "false",
      "US",
      "facebook",
      "iPhone10,1"
    ],
    [
      "fo98f3a78daea43a8aff9e77bad5fcdef",
      "2020-04-20T14:32:00.000Z",
      "2020-05-14T18:12:44.184Z",
      "false",
      null,
      0.1394,
      "initial",
      "2020-04-20T14:32:00.000Z",
      "2020-04-20T14:32:00.000Z",
      "false",
      "false",
      "NL",
      "youtube",
      "Android9"
    ],
    ...
  ],
  "metadata": {
    "columns": [
      { "name": "user_id" },
      { "name": "user_created" },
      { "name": "data_now" },
      { "name": "y_value" },
      { "name": "y_timestamp" },
      { "name": "random" },
      { "name": "moment_key" },
      { "name": "moment_timestamp" },
      { "name": "user_moment_base_timestamp" },
      { "name": "feature_acaa2cf5d" },
      { "name": "feature_3c11322c9" },
      { "name": "feature_7f88703f1" },
      { "name": "feature_2fe4e4093" },
      { "name": "feature_440572d3d" }
    ]
  }
}

Under data you'll find an array of rows, where each row represents a unique user with their features and properties. Under metadata.columns you'll find the column names; each index corresponds to the column index in each user row in data.
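As a sketch of how to work with this positional format, each row can be zipped with the column names to get dicts keyed by column name. The dataset literal below is a truncated, hypothetical example in the same shape as the response above:

```python
def rows_as_dicts(response_json):
    """Map each positional row in `data` to a dict keyed by column name."""
    names = [col["name"] for col in response_json["metadata"]["columns"]]
    return [dict(zip(names, row)) for row in response_json["data"]]

# Truncated sample in the same shape as the response above.
dataset = {
    "data": [["eb987a78daea43a8af908f7bad5fcdbb", "2020-04-20T14:32:00.000Z"]],
    "metadata": {"columns": [{"name": "user_id"}, {"name": "user_created"}]},
}

rows = rows_as_dicts(dataset)
print(rows[0]["user_id"])  # the first column of the first row
```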

Here are the column names that are always available for each dataset. All timestamps are in ISO 8601 format.

  • user_id: The unique ID of the user. If the user has been identified by a custom ID, this will be that ID.
  • user_created: The timestamp when the user was first created/tracked.
  • data_now: The timestamp when the dataset was created.
  • y_value: Will be "true" if the user converted and "false" if not. Note this is a string type, not boolean!
  • y_timestamp: If the user converted, this is the earliest timestamp when the conversion happened. If the conversion goal is that a user did play_song, and the user did that event 10 times, this is the timestamp of the first time that event happened.
  • random: A random value between 0 and 1. Can be used to query for a smaller sample of all users in the dataset, see next section.
  • moment_key: This contains the same value as the key of the dataset this is for, such as initial, train_more, etc.
  • user_moment_base_timestamp: The user base timestamp defaults to the user's 'created at', so in that case it will be the same as the user_created column above. But it can also be any other user property, such as 'identified at' or any other timestamp that you're tracking. See the Insight API intelligence plugin section later on how to specify a different user base timestamp.
  • moment_timestamp: A dataset is always based on a snapshot taken a number of seconds after user_moment_base_timestamp. For the initial dataset, where the moment is 0 seconds, moment_timestamp equals user_created. For 60 seconds, it will be user_moment_base_timestamp + 60 seconds. If the dataset moment is latest, it will be equal to data_now. The moment timestamp is useful for a specific type of analysis, such as predicting what behavior leads to conversion: we don't want to analyze users whose snapshot state is beyond the conversion y_timestamp, as our analysis would be biased then. So in this specific case we want to filter out all converted users where moment_timestamp >= y_timestamp.
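That last filter could be sketched as follows, assuming the rows have already been mapped to dicts keyed by column name:

```python
from datetime import datetime

def parse_ts(value):
    # Timestamps are ISO 8601 with a trailing "Z"; normalize it so that
    # datetime.fromisoformat accepts the string on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def drop_biased_rows(rows):
    """Keep non-converted users plus converted users whose snapshot
    (moment_timestamp) was taken before their conversion (y_timestamp)."""
    kept = []
    for row in rows:
        if row["y_value"] == "false" or row["y_timestamp"] is None:
            kept.append(row)
        elif parse_ts(row["moment_timestamp"]) < parse_ts(row["y_timestamp"]):
            kept.append(row)
    return kept
```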

We can also have one or more features. Each column name that starts with feature_ is a feature, and the corresponding feature details can be found in the JSON manifest discussed in the previous section. Here is the features part again:

...
"features": {
  "feature_acaa2cf5d": {
    "name": "Did event Play song",
    "type": "categorical",
    "nativeType": "boolean",
    "moment": "dynamic",
    "details": {
      "propertyType": "event",
      "value": "play_song"
    }
  },
  "feature_3c11322c9": {
    "name": "Did event Sign up",
    "type": "categorical",
    "nativeType": "boolean",
    "moment": "dynamic",
    "details": {
      "propertyType": "event",
      "value": "sign_up"
    }
  },
  "feature_7f88703f1": {
    "name": "Country",
    "type": "categorical",
    "nativeType": "string",
    "moment": "static",
    "details": {
      "propertyType": "userProperty",
      "value": {
        "type": "text",
        "name": "initial_country",
        "value": "u_col_initial_country"
      }
    }
  },
  "feature_2fe4e4093": {
    "name": "Campaign source",
    "type": "categorical",
    "nativeType": "string",
    "moment": "static",
    "details": {
      "propertyType": "userProperty",
      "value": {
        "type": "text",
        "name": "initial_campaign_source",
        "value": "u_col_initial_campaign_source"
      }
    }
  },
  "feature_440572d3d": {
    "name": "Device model",
    "type": "categorical",
    "nativeType": "string",
    "moment": "static",
    "details": {
      "propertyType": "userProperty",
      "value": {
        "type": "text",
        "name": "initial_device_model",
        "value": "u_col_initial_device_model"
      }
    }
  }
}

Each feature has a user-friendly name.

The moment of a feature can be either static or dynamic. Static features are user properties such as country or device: those are always known at the initial moment (0 seconds since user creation) and don't change if the moment of the dataset changes from, say, 0 to 3600 seconds since user creation. Dynamic features, on the other hand, do change, because we calculate them at the moment of the dataset snapshot; at 300 seconds since user creation, for example, the user has had 5 minutes to perform some actions.

Each feature can have a type:

  • "integer": Discrete numeric value (1, 2, 9, 10)
  • "numeric": Continuous numeric value (3.330, 8.4846)
  • "categorical": String based categorical value ("true", "false", "red", "green"). Boolean values are encoded as categorical too.
  • "text": Content based text value (user review or comment for example), which could be used with NLP or other text based analysis.
  • "string": A string value that is mostly unusable for most cases. For example a user id, some random hash, etc.

nativeType can be one of string, integer, float, boolean, timestamp. For example, country is stored as a string, so its type would be categorical and its nativeType would be string.

Finally, the details of a feature can be looked up, but this is needed only in rare cases.
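As an illustration of the type field, here is a minimal sketch that picks out the feature columns worth analyzing from a manifest, skipping plain "string" features. The manifest literal is a trimmed, hypothetical example (feature_deadbeef0 is made up for the demo):

```python
def usable_feature_columns(manifest):
    """Return feature column names, skipping plain "string" features
    (IDs, random hashes) that are mostly unusable for analysis."""
    usable_types = {"integer", "numeric", "categorical", "text"}
    return [
        key
        for key, spec in manifest["features"].items()
        if spec["type"] in usable_types
    ]

# Trimmed, hypothetical manifest fragment.
manifest = {
    "features": {
        "feature_7f88703f1": {"name": "Country", "type": "categorical"},
        "feature_deadbeef0": {"name": "Some hash", "type": "string"},
    }
}
print(usable_feature_columns(manifest))  # only the categorical column survives
```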

Querying datasets

When you do a GET on any of the dataUrls, it returns the whole dataset. While you can then trim the dataset down by filtering locally, this can get inefficient. A better way is to append the query parameter to those data URLs, so you can run any SQL query directly against the dataset table. All columns such as y_value, y_timestamp and feature_abcd1234 are available as-is and can be used in the query. The table is called DATA_TABLE.

Here is an example that returns 10% of the dataset where time since conversion > 1 hour:

...?query=SELECT * FROM DATA_TABLE WHERE
date_diff('second', from_iso8601_timestamp(y_timestamp), now())
> 3600 AND random < 0.10

Don't forget to URL-encode the query parameter when using it in a GET request.
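In Python, for example, urlencode takes care of the escaping. The base URL below is a placeholder for one of your actual data URLs:

```python
from urllib.parse import urlencode

query = (
    "SELECT * FROM DATA_TABLE WHERE "
    "date_diff('second', from_iso8601_timestamp(y_timestamp), now()) > 3600 "
    "AND random < 0.10"
)

# urlencode percent-escapes spaces, quotes and comparison operators.
data_url = "https://example.com/datasets/initial.json"  # placeholder URL
full_url = data_url + "?" + urlencode({"query": query})
```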

The random column is in the range of 0.0 to 1.0, so we can use it to get a sub-selection of the dataset. There is a shortcut using the range_start_gt_or_eq and range_end_lt query parameters:

  • ...?range_start_gt_or_eq=0.50 translates to random >= 0.50
  • ...?range_end_lt=0.33 translates to random < 0.33
  • ...?range_start_gt_or_eq=0.10&range_end_lt=0.75 translates to random >= 0.10 AND random < 0.75
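One common use of these shortcuts is fetching disjoint slices of the same dataset, for instance an 80/20 train/validation split. A sketch, with a placeholder base URL:

```python
from urllib.parse import urlencode

data_url = "https://example.com/datasets/train_more.json"  # placeholder URL

# random < 0.80: roughly 80% of users for training.
train_url = data_url + "?" + urlencode({"range_end_lt": "0.80"})
# random >= 0.80: the remaining ~20% for validation, with no overlap.
validation_url = data_url + "?" + urlencode({"range_start_gt_or_eq": "0.80"})
```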

Events input data

Events input data is a list of events with their timestamp, ID, name and the requested event properties. Because each row in the complete dataset represents a unique user, the events belonging to that user are encoded in a column as a JSON-encoded nested array. Each element in that array represents an event, such as an item in a webshop being interacted with.

A dataset can contain many event input data features, so each one will be represented in a separate column, encoded as JSON. Each event array always contains three elements:

  1. The first element is the unique ID of the event. Because we can have multiple event input data features, such as price and color, each is JSON encoded in its own column in the dataset. To then tie those two properties price and color to the same event in your code, use the unique ID of the event. In the JSON example below we can see how one event's three properties (SKU id, color and price) are represented in three different arrays.
  2. The second element contains the ISO 8601 timestamp of the event. Sometimes you want to take only certain events into consideration, such as the ones that happened before the conversion timestamp, available under y_timestamp in the dataset. To do so, filter events in your own code (Python, etc.) to those where the event timestamp (this second element) < y_timestamp.
  3. The third element always contains the requested event property, such as color, price or SKU id.

Because events input data is based on events, it collects only events up to the dataset moment. So for 0 seconds since user creation, there will never be any events input data. For dataset moment latest it will have the full history for that event, so be sure to always collect events input data from the right dataset. In addition, each JSON-encoded events input data array is limited to a maximum of 200 events.

Below is an example of roughly how the arrays of events will look, where each one comes from a different column in the dataset:

[
  ["event-id-1", "2020-04-25T09:57:13.503Z", "item_sku_id_1"],
  ["event-id-99", "2020-04-26T11:23:44.149Z", "item_sku_id_40"]
]
[
  ["event-id-1", "2020-04-25T09:57:13.503Z", "yellow"],
  ["event-id-99", "2020-04-26T11:23:44.149Z", "purple"]
]
[
  ["event-id-1", "2020-04-25T09:57:13.503Z", 179.0],
  ["event-id-99", "2020-04-26T11:23:44.149Z", 29.0]
]
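As a sketch, the three arrays above can be joined on their event IDs after json.loads()-ing each column value:

```python
import json

# The three column values from the example above, as raw JSON strings.
sku_col = '[["event-id-1", "2020-04-25T09:57:13.503Z", "item_sku_id_1"], ["event-id-99", "2020-04-26T11:23:44.149Z", "item_sku_id_40"]]'
color_col = '[["event-id-1", "2020-04-25T09:57:13.503Z", "yellow"], ["event-id-99", "2020-04-26T11:23:44.149Z", "purple"]]'
price_col = '[["event-id-1", "2020-04-25T09:57:13.503Z", 179.0], ["event-id-99", "2020-04-26T11:23:44.149Z", 29.0]]'

def join_events(**columns):
    """Merge per-property event arrays into one dict per event ID."""
    events = {}
    for prop, encoded in columns.items():
        for event_id, timestamp, value in json.loads(encoded):
            event = events.setdefault(event_id, {"timestamp": timestamp})
            event[prop] = value
    return events

events = join_events(sku=sku_col, color=color_col, price=price_col)
# events["event-id-1"] now holds the timestamp plus sku, color and price.
```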

The event input data above maps to the JSON manifest inputData shown below. The important part is column, which indicates which column of the dataset this event input data can be found under.

...,
"inputData": {
  "recommend_item_sku_id": {
    "name": "Product SKU id of View Item",
    "type": "categorical",
    "nativeType": "string",
    "column": "data_7d89806a",
    "details": {
      "event": "view_item",
      "property": "e_col_447b7147e84be512208dcc0995d67ebc"
    }
  },
  "recommender_item_text1": {
    "name": "Color of View item",
    "type": "categorical",
    "nativeType": "string",
    "column": "data_31691826",
    "details": {
      "event": "view_item",
      "property": "e_col_70dda5dfb8053dc6d1c492574bce9bfd"
    }
  },
  "recommender_item_number1": {
    "name": "Price of View item",
    "type": "numeric",
    "nativeType": "integer",
    "column": "data_15eb4d73",
    "details": {
      "event": "view_item",
      "property": "e_col_78a5eb43deef9a7b5b9ce157b9d52ac4"
    }
  }
},
...

Note that for inputData, the type is as-is; text doesn't mean it's NLP-able, it's just a text type. For numeric, the nativeType can be integer or float.