Review: What is a transformation?

Trellis helps you turn your documents into data. A Trellis “transformation” is essentially a formula that describes what data you want to extract, and how you extract it. Once you define a transformation, you can start to upload documents to Trellis. The extracted data will fit the schema you define in the transformation.

How do you define a transformation?

Defining a transformation includes specifying three key parameters:

  1. model - which Trellis LLM engine do you want the transformation to use?

  2. mode - will you upload freeform document-type assets, or structured table-type assets?

  3. operations - what data do you want to extract, and how?

Throughout the Trellis API, the transform_params object is the single source of truth for all three of these parameters. Defining and redefining transform_params is equivalent to setting up and updating your transformation.

The transform_params object

The transform_params object contains configuration settings for performing data transformations using our AI models. Below are the detailed parameters included in transform_params:

model

  • Type: string

  • Description: Specifies which LLM engine to use for the transformation. Options include trellis-vertix and trellis-scale, each offering a different level of speed and accuracy. We recommend trellis-vertix.

mode

  • Type: string

  • Description: How the data is processed. Your default choice should be document; use table if you’re parsing tables.

operations

  • Type: list of operation objects

  • Description: Each operation is a data field to extract from your data. Each operation specifies the target column, the data type of that column, the type of transformation to apply, and a description of the task.
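The three top-level parameters can be sanity-checked before a request is sent. The following is a minimal sketch, not part of the Trellis API: check_top_level is a hypothetical helper, and it assumes that the two models and two modes named above are the only valid values, which the text does not guarantee.

```python
# Values taken from the parameter descriptions above.
# Assumption: these are the only valid options; the real API may accept more.
KNOWN_MODELS = {"trellis-vertix", "trellis-scale"}
KNOWN_MODES = {"document", "table"}


def check_top_level(params: dict) -> list[str]:
    """Return a list of problems found in the top-level keys (hypothetical helper)."""
    problems = []
    if params.get("model") not in KNOWN_MODELS:
        problems.append(f"unknown model: {params.get('model')!r}")
    if params.get("mode") not in KNOWN_MODES:
        problems.append(f"unknown mode: {params.get('mode')!r}")
    operations = params.get("operations")
    if not isinstance(operations, list) or not operations:
        problems.append("operations must be a non-empty list")
    return problems


params = {
    "model": "trellis-vertix",
    "mode": "document",
    "operations": [
        {
            "column_name": "invoice",
            "column_type": "assets",
            "transform_type": "parse",
            "task_description": "N/A",
        }
    ],
}
print(check_top_level(params))  # []
```

Returning a list of problems rather than raising on the first one lets you report every issue in a single pass.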

Defining operations

Each operation object contains the following information:

  1. Name of the field you wish to populate

  2. Data type

  3. Transformation type (extraction, classification, generation, etc.)

  4. Task description (guidelines for the Trellis LLM to populate your data)

Important: Every transform_params object requires a minimum of one operation with the following parameters:

  1. column_type = 'assets'

  2. transform_type = 'parse'
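The requirement above can be expressed as a short check. This is a sketch under the assumption that the required parse operation sits at the top level of the operations list; has_parse_operation is a hypothetical helper name, not a Trellis function.

```python
def has_parse_operation(transform_params: dict) -> bool:
    """True if at least one top-level operation parses an assets column."""
    return any(
        op.get("column_type") == "assets" and op.get("transform_type") == "parse"
        for op in transform_params.get("operations", [])
    )


valid = {
    "operations": [
        {"column_name": "invoice", "column_type": "assets", "transform_type": "parse"},
    ]
}
missing_parse = {
    "operations": [
        {"column_name": "total", "column_type": "numeric", "transform_type": "extraction"},
    ]
}
print(has_parse_operation(valid))          # True
print(has_parse_operation(missing_parse))  # False
```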

Here are the specific fields required for each operation:

column_name

  • Type: string

  • Description: Names the column in the dataset on which the operation will be executed. It identifies the specific data point that will undergo transformation or extraction. Must be in snake case.
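To catch naming mistakes early, a column name can be tested against a snake case pattern. The pattern below is an assumption about what counts as snake case here (lowercase letters and digits, words separated by single underscores, starting with a letter); the exact rule Trellis enforces may differ.

```python
import re

# Assumed definition of snake case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """True if the whole name matches the assumed snake case pattern."""
    return SNAKE_CASE.fullmatch(name) is not None


print(is_snake_case("invoice_amount"))  # True
print(is_snake_case("InvoiceAmount"))   # False
print(is_snake_case("invoice rows"))    # False
```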

column_type

  • Type: string

  • Description: Indicates the data type of the target column, following the PostgreSQL data types described in the PostgreSQL documentation (https://www.postgresql.org/docs/current/datatype.html). Valid types include, but are not limited to, assets for file data, text for string data, text[] for arrays of text, numeric for numerical data, and date for date values.

Transformations: column_type

Reference our guide on column types for more information!

transform_type

  • Type: string

  • Description: Describes the transformation or extraction method to apply to the data in the target column. The parse type transforms assets-type columns into parsed text data, ready for other columns to reference. The extraction type retrieves specific pieces of data from the column. Other types include classification and generation.

Transformations: transform_type

Reference our guide on transform types for more information!

task_description

  • Type: string

  • Description: Provides a clear, human-readable explanation of what the operation seeks to achieve. Use double curly braces to reference other columns at transformation runtime (e.g. Extract the invoice amount from {{invoice}}). Use cases for references include extracting data from parsed assets, classifying extracted data, and more.

Important: Except for operations with transform_type in ['parse', 'manual'], every operation’s task description must reference at least one other operation, using the format {{column_name}}.

For operations with transform_type in ['parse', 'manual'], set the task description to “N/A”.
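The two task description rules can be checked mechanically. The sketch below is illustrative only: check_task_descriptions is a hypothetical helper, it looks only at top-level operations, and it assumes references are written exactly as {{column_name}} with no spaces inside the braces.

```python
import re

# Matches {{column_name}} and captures the name (assumes no inner whitespace).
REFERENCE = re.compile(r"\{\{(\w+)\}\}")
NO_REFERENCE_TYPES = {"parse", "manual"}


def check_task_descriptions(operations: list[dict]) -> list[str]:
    """Return one message per top-level operation that breaks the rules above."""
    names = {op["column_name"] for op in operations}
    problems = []
    for op in operations:
        name = op["column_name"]
        description = op.get("task_description", "")
        if op["transform_type"] in NO_REFERENCE_TYPES:
            if description != "N/A":
                problems.append(f"{name}: expected 'N/A'")
            continue
        # A valid reference must point at a different, existing column.
        referenced = set(REFERENCE.findall(description)) - {name}
        if not referenced & names:
            problems.append(f"{name}: must reference another column")
    return problems


ops = [
    {"column_name": "invoice", "transform_type": "parse", "task_description": "N/A"},
    {
        "column_name": "total",
        "transform_type": "extraction",
        "task_description": "Extract the invoice amount from {{invoice}}",
    },
    {
        "column_name": "vendor",
        "transform_type": "extraction",
        "task_description": "Extract the vendor name",
    },
]
print(check_task_descriptions(ops))  # ['vendor: must reference another column']
```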

Example transform_params object

JSON
{
  "model": "trellis-vertix",
  "mode": "document",
  "operations": [
    {
      "column_name": "invoice",
      "column_type": "assets",
      "transform_type": "parse",
      "task_description": "N/A"
    },
    {
      "column_name": "invoice_rows",
      "column_type": "list",
      "transform_type": "extraction",
      "task_description": "Extract the invoice rows from {{invoice}}",
      "operations": [
        {
          "column_name": "invoice_row",
          "column_type": "object",
          "transform_type": "extraction",
          "task_description": "The invoice row",
          "operations": [
            {
              "column_name": "service_name",
              "column_type": "text",
              "transform_type": "extraction",
              "task_description": "Name of the charging service."
            },
            {
              "column_name": "date",
              "column_type": "date",
              "transform_type": "extraction",
              "task_description": "Date of the charge"
            },
            {
              "column_name": "invoice_amount",
              "column_type": "number",
              "transform_type": "extraction",
              "task_description": "The invoice amount in dollars"
            }
          ]
        }
      ]
    }
  ]
}
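The example nests operations inside other operations through an inner operations key. To see every column such a definition declares, the tree can be walked recursively. This is a sketch only: column_paths is a hypothetical helper, and the dotted path notation is an illustration, not a Trellis convention. The sample data is a trimmed copy of the example above.

```python
def column_paths(operations: list[dict], prefix: str = "") -> list[str]:
    """List every column in a nested operations tree as a dotted path."""
    paths = []
    for op in operations:
        path = f"{prefix}.{op['column_name']}" if prefix else op["column_name"]
        paths.append(path)
        # Recurse into child operations, if any.
        paths.extend(column_paths(op.get("operations", []), path))
    return paths


operations = [
    {"column_name": "invoice"},
    {
        "column_name": "invoice_rows",
        "operations": [
            {
                "column_name": "invoice_row",
                "operations": [
                    {"column_name": "service_name"},
                    {"column_name": "date"},
                    {"column_name": "invoice_amount"},
                ],
            }
        ],
    },
]

for path in column_paths(operations):
    print(path)
```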

Once you have defined the transformation, go to “create transforms” to kick off the transformation run.