When you do a GET request on the `dataUrls` URL under a specific dataset key, such as `initial`, you'll get the full dataset with all features returned as JSON.
Under `data` an array of rows can be found, where each row represents a unique user with its features and properties. Under `metadata.columns` you will find the column names, where each index number corresponds to the column index for each user row in `data`.
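Reading such a response in Python can be sketched as follows. The payload below is a made-up miniature of a real response; a real one would be fetched with a GET on one of the data URLs:

```python
import json

# A made-up miniature of the dataset response: column names live under
# metadata.columns, user rows under data, joined by index position.
payload = json.loads("""
{
  "metadata": {"columns": ["user_id", "user_created", "y_value"]},
  "data": [
    ["u_1", "2024-01-01T00:00:00Z", "true"],
    ["u_2", "2024-01-02T00:00:00Z", "false"]
  ]
}
""")

columns = payload["metadata"]["columns"]

# Turn each row into a dict keyed by column name via the shared index.
users = [dict(zip(columns, row)) for row in payload["data"]]
print(users[0]["y_value"])  # "true"
```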
Here are the column names that are always available for each dataset. All timestamps are in ISO 8601 format.
- user_id: The unique ID of the user. In case the user has been identified by a custom ID, this will be that ID.
- user_created: The timestamp when the user was first created/tracked.
- data_now: The timestamp when the dataset was created.
- y_value: Will be `"true"` if the user converted and `"false"` if not. Note this is a string type, not a boolean!
- y_timestamp: If the user converted, this is the earliest timestamp when the conversion happened. If the conversion goal is that a user did `play_song`, and the user did that event 10 times, this is the timestamp of the first time that event happened.
- random: A random value between 0 and 1. Can be used to query for a smaller sample of all users in the dataset, see next section.
- moment_key: This contains the same value as the key of the dataset this is for, such as `initial`.
- user_moment_base_timestamp: The user base timestamp defaults to the user's 'created at', so in that case it will be the same as the `user_created` column from above. But we can also choose any other user property, such as 'identified at' or any other timestamp that you're tracking. See the Insight API intelligence plugin section later on how to specify a different user base timestamp.
- moment_timestamp: A dataset is always based on a snapshot taken a number of seconds after the user's creation date, or after whatever `user_moment_base_timestamp` was specified. For the `initial` dataset, where the moment is 0 seconds, the moment timestamp equals `user_created`. For a moment of 60 seconds, it will be `user_moment_base_timestamp + 60 seconds`. If the dataset moment is `latest`, it will be equal to `data_now`. The moment timestamp is useful for a specific type of analysis, such as predicting what behavior leads to conversion: we don't want to analyze users whose snapshot state is beyond the conversion `y_timestamp`, as our analysis would be biased then. So in this specific case we want to keep only users where `moment_timestamp < y_timestamp`.
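The bias filter above can be sketched in Python. The column layout and values here are made up for illustration, and the choice to keep non-converted users as-is is an assumption:

```python
from datetime import datetime

# Illustrative rows: keep converted users only if the snapshot was taken
# before their conversion (moment_timestamp < y_timestamp).
columns = ["user_id", "y_value", "y_timestamp", "moment_timestamp"]
rows = [
    ["u_1", "true",  "2024-01-01T10:00:00+00:00", "2024-01-01T09:00:00+00:00"],
    ["u_2", "true",  "2024-01-01T10:00:00+00:00", "2024-01-01T11:00:00+00:00"],
    ["u_3", "false", None,                        "2024-01-01T09:00:00+00:00"],
]

col = {name: i for i, name in enumerate(columns)}

def unbiased(row):
    # Non-converted users have no y_timestamp; keep them as-is (assumption).
    if row[col["y_value"]] != "true":
        return True
    return (datetime.fromisoformat(row[col["moment_timestamp"]])
            < datetime.fromisoformat(row[col["y_timestamp"]]))

kept = [r for r in rows if unbiased(r)]
print([r[0] for r in kept])  # ['u_1', 'u_3']
```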
We can also have one or more features. Each column name that starts with `feature_` is a feature, and we can find the corresponding feature details in the JSON manifest discussed in the previous section. Here is the features part again:
Each feature has a user-friendly name. The moment of a feature can be either `static` or `dynamic`. Static features are user properties such as country or device: those are always known at the initial moment (0 seconds since user creation) and don't change if the moment of the dataset changes from, say, 0 seconds to 3600 seconds since user creation. Dynamic features, on the other hand, do change, as we calculate them at the moment of the dataset snapshot time, for example 300 seconds since user creation, so that users have 5 minutes to do some actions.
Each feature has a `type`, which can be one of:
- "integer": Discrete numeric value
(1, 2, 9, 10)
- "numeric": Continous numeric value
- "categorical": String based categorical value
("true", "false", "red", "green"). Boolean values are encoded as categorical too.
- "text": Content based text value (user review or comment for example), which could be used with NLP or other text based analysis.
- "string": A string value that is mostly unusable for most cases. For example a user id, some random hash, etc.
A feature also has a `nativeType`, the raw type it was tracked as, such as `timestamp`. For example, country would be encoded as a string, so its `type` would be `categorical` and its `nativeType` would be `string`.
The details of a feature can be looked up as well, but this is used only in rare cases.
When you do a GET on any of the `dataUrls`, it will return the whole dataset. While you can then trim the dataset down by filtering locally, this can get inefficient. A better way is to append the `query` parameter to those data URLs, so you can run any SQL query directly on the dataset table. All columns, such as `feature_abcd1234`, are available as-is and can be used in the query.
The table is called
Here is an example that returns 10% of the dataset where time since conversion > 1 hour:
Don't forget to escape the `query` parameter when used in a URL.
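A sketch of building such an escaped URL in Python; the base URL and the table name `dataset` are hypothetical placeholders, not confirmed by the API:

```python
from urllib.parse import urlencode

# Hypothetical data URL; the real one comes from the dataUrls in the manifest.
data_url = "https://api.example.com/datasets/initial"

# Hypothetical table name "dataset"; sample 10% of users via the random column.
sql = "SELECT * FROM dataset WHERE random < 0.10"

# urlencode percent-escapes the SQL so it is safe inside the query string.
url = data_url + "?" + urlencode({"query": sql})
print(url)
```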
The `random` column is in the range of 0.0 to 1.0, so we can use it to get a sub-selection of the dataset. There is a shortcut by using query parameters such as `range_end_lt`:
- translates to `random >= 0.50`
- translates to `random < 0.33`
- translates to `random >= 0.10 AND random < 0.75`
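The range shortcut is equivalent to filtering on the `random` column yourself; a small sketch with made-up rows, where the last value is the `random` column:

```python
# Made-up rows: [user_id, random]
rows = [
    ["u_1", 0.05],
    ["u_2", 0.40],
    ["u_3", 0.80],
]

# Equivalent of: random >= 0.10 AND random < 0.75
start, end = 0.10, 0.75
sample = [r for r in rows if start <= r[1] < end]
print([r[0] for r in sample])  # ['u_2']
```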
Events input data
Events input data is a list of events with their timestamp, ID, name and the requested event properties. Because each row in the complete dataset represents a unique user, the events belonging to that user are encoded in a column as a JSON encoded nested array. Each element in that array represents an event, such as an item in a webshop interacted with.
A dataset can contain many event input data features, so each one will be represented in a separate column, encoded as JSON. Each event array always contains three elements:
- The first element is the unique ID of the event. Because we can have multiple event input data features, such as `color`, each will be JSON encoded in its own column in the dataset. To then tie those properties to the same event in your code, do so through the unique ID of the event. In the JSON example below we can see how one event's three properties (SKU id, color and price) are represented in three different arrays.
- The second element contains the ISO 8601 timestamp of the event. Sometimes you want to take only certain events into consideration, such as the ones that happened before the conversion timestamp, available under `y_timestamp` in the dataset. To do so, you need to filter events in your own code (Python, etc.), keeping only events where this array's second element satisfies `event timestamp < y_timestamp`.
- The third element always contains the requested event property, such as color, price or SKU id.
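Putting the three elements together, a decoding sketch in Python; the column contents, event IDs and property names here are made up for illustration:

```python
import json
from datetime import datetime

# Two JSON encoded event columns (made-up values): each element is
# [event_id, ISO 8601 timestamp, requested property].
color_col = json.loads('[["ev1", "2024-01-01T09:00:00+00:00", "red"],'
                       ' ["ev2", "2024-01-01T11:00:00+00:00", "green"]]')
price_col = json.loads('[["ev1", "2024-01-01T09:00:00+00:00", 10],'
                       ' ["ev2", "2024-01-01T11:00:00+00:00", 25]]')
y_timestamp = "2024-01-01T10:00:00+00:00"

def before_conversion(events, y_ts):
    # Keep only events whose timestamp (second element) precedes y_timestamp.
    cutoff = datetime.fromisoformat(y_ts)
    return [e for e in events if datetime.fromisoformat(e[1]) < cutoff]

# Join the property values of both columns on the event ID (first element).
events = {}
for event_id, ts, color in before_conversion(color_col, y_timestamp):
    events[event_id] = {"timestamp": ts, "color": color}
for event_id, ts, price in before_conversion(price_col, y_timestamp):
    events.setdefault(event_id, {"timestamp": ts})["price"] = price

print(events)  # only ev1 survives the y_timestamp filter
```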
Because events input data is based on events, it collects only events up to the dataset moment. So for 0 seconds since user creation, there will never be any events input data. For dataset moment `latest` it will have the full history for that event, so be sure to always collect events input data from the right dataset. In addition, the number of events in each input data JSON encoded array is limited to a maximum of 200.
Below is an example of how the arrays of events will roughly look, where each one comes from a different column in the dataset:
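An illustrative sketch with made-up values and hypothetical column names, where the two columns' arrays are tied together by the shared event IDs:

```json
{
  "feature_colorabcd": [
    ["ev1", "2024-01-01T09:00:00Z", "red"],
    ["ev2", "2024-01-01T11:00:00Z", "green"]
  ],
  "feature_priceabcd": [
    ["ev1", "2024-01-01T09:00:00Z", 10],
    ["ev2", "2024-01-01T11:00:00Z", 25]
  ]
}
```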
The event input data above maps to the JSON manifest `inputData` as shown below. The important part of that is `column`, which indicates what column of the dataset this event input data can be found under.
Note that for `inputData`, the type is as-is; so `text` doesn't mean it is NLP-able, it's just a text type. For numeric it can be