Hume AI Streaming API
0.3.0

The streaming API provides access to Hume models through secure websocket connections. You can build real-time applications by running continuous inference on streams of audio, video, and text data.

This is the documentation for version 0.3.0 of the API. Last updated on Mar 16, 2023.

Protocol: wss
wss://api.hume.ai/v0/stream
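
As a quick orientation, here is a minimal Python sketch of connecting to one channel and exchanging a single message. It uses the third-party websockets package; the X-Hume-Api-Key header and the practice of appending the channel path (here /models) to the base URL are assumptions to verify against your account setup.

import asyncio
import json

import websockets  # pip install websockets

API_KEY = "<your-api-key>"  # hypothetical placeholder

async def main() -> None:
    # Assumption: channel paths such as /models are appended to the base URL.
    uri = "wss://api.hume.ai/v0/stream/models"
    # Assumption: the API key is sent as a request header.
    # (Newer websockets releases name this argument additional_headers.)
    async with websockets.connect(uri, extra_headers={"X-Hume-Api-Key": API_KEY}) as ws:
        # Analyze a short piece of raw text with the language model.
        await ws.send(json.dumps({
            "data": "Hello from the streaming API!",
            "raw_text": True,
            "models": {"language": {}},
        }))
        print(json.loads(await ws.recv()))

asyncio.run(main())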

Error Codes

About

The streaming API maintains a list of error codes that you might encounter when using it. Many of these can be resolved by reconfiguring the streaming payload. If an issue cannot be resolved with the documentation here, please reach out to support@hume.ai for further assistance.

Configuration Errors

Configuration errors indicate that something about the API call was not configured correctly.

Error Code Description
E0100 Websockets can accept data in many different formats. The Hume streaming API requires payloads that can be parsed as JSON. Ensure that your input payload is valid JSON.
E0101 This generic error indicates that the format of your payload is invalid. The details of what exactly is invalid can be found in the API response error message.
E0102 Some model types are not compatible with every file type. For example, no facial expressions will be detected in a text file.
E0200 The streaming API supports a wide range of media types including audio, video, image, and text. However, the data provided could not be parsed into a known file format.
E0201 Media must be sent as base64-encoded bytes. This error means that the service could not decode the data field; there may have been an encoding issue.
E0202 This error indicates that the media provided could not be parsed into a valid audio file, despite audio models being configured.
E0203 The streaming API is intended for near real-time processing of data streams. For larger files, consider using the Hume batch API instead.
E0300 Your account is out of credits. Go to the Hume platform website to purchase more credits.

Warnings

Warnings indicate that the streaming payload was configured correctly, but no results could be returned.

Error Code Description
W0101 No vocal bursts could be detected in the streamed media.
W0102 No facemeshes could be detected in the streamed media.
W0103 No faces could be detected in the streamed media.
W0104 No emotional language could be detected in the streamed media.
W0105 No speech could be detected in the streamed media.

Service Errors

Service errors indicate that there was an unexpected failure in a Hume service. There may be an outage, or you may have come across a bug. If you receive any of these error codes, reach out to support@hume.ai.

Error Code
I0100
I0101
I0102
I0103
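
As a rough sketch, a client can route incoming messages by code prefix: E for configuration errors, W for warnings, and I for service errors (the message shapes follow the error and warning payloads documented under SUBSCRIBE /models below).

def classify_message(msg: dict) -> str:
    """Bucket a parsed streaming API message by its code prefix."""
    code = msg.get("code", "")
    if code.startswith("E"):
        return "configuration error"  # fix the payload and resend
    if code.startswith("W"):
        return "warning"  # payload was valid, but no results were produced
    if code.startswith("I"):
        return "service error"  # unexpected failure: contact support@hume.ai
    return "prediction"  # normal model output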

SUBSCRIBE /burst

SUB /burst

Analyze a continuous stream of audio for vocal bursts. Vocal bursts, also called non-verbal exclamations, are any sounds you make that express emotion and aren't words.

Accepts one of the following messages.

SUBSCRIBE /burst

Model Predictions

Payload

  • predictions array[object]
    • name string

      Name of the emotion being expressed.

    • score number

      Embedding value for the emotion being expressed.

Payload example
{
  "predictions": [
    {
      "name": "string",
      "score": 42.0
    }
  ]
}

Prediction warning

Payload

  • warning string

    Warning message text.

Payload example
{
  "warning": "string"
}

PUBLISH /burst

PUB /burst

Analyze a continuous stream of audio for vocal bursts. Vocal bursts, also called non-verbal exclamations, are any sounds you make that express emotion and aren't words.

PUBLISH /burst

Raw audio payload

Payload

  • audio string

    Raw bytes of audio file. Should be base64 encoded and in a standard audio file format like wav or mp3.

  • reset boolean

    Setting reset to true will mark the new audio sample as part of a distinct audio stream. False means that the new audio sample is a continuation of previously sent samples. Default is false.

Payload example
{
  "audio": "string",
  "reset": true
}
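
Putting the two fields together, here is a sketch of publishing one complete audio file to /burst (the connection details are the same assumptions as in the connection example above):

import asyncio
import base64
import json

import websockets

async def send_burst(path: str) -> dict:
    # Base64-encode the raw bytes of a standard audio file (wav or mp3).
    audio_b64 = base64.b64encode(open(path, "rb").read()).decode("ascii")
    uri = "wss://api.hume.ai/v0/stream/burst"  # assumed channel URL
    async with websockets.connect(uri, extra_headers={"X-Hume-Api-Key": "<key>"}) as ws:
        # reset=True marks this sample as the start of a distinct audio stream.
        await ws.send(json.dumps({"audio": audio_b64, "reset": True}))
        return json.loads(await ws.recv())

print(asyncio.run(send_burst("sample.wav")))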

SUBSCRIBE /face

SUB /face

Analyze a continuous stream of images for facial expressions.

SUBSCRIBE /face

Model Predictions

Payload

  • predictions array[object]
    • misc object

      Additional information about the frame. (Not yet implemented)

    • bbox object

      A bounding box around a face.

      • x number

        x-coordinate of bounding box top left corner.

        Minimum value is 0.

      • y number

        y-coordinate of bounding box top left corner.

        Minimum value is 0.

      • w number

        Bounding box width.

        Minimum value is 0.

      • h number

        Bounding box height.

        Minimum value is 0.

    • prob number

      Probability that the detected face is really a face.

      Minimum value is 0, maximum value is 1.

    • face_id string

      Identifier for tracking face identity across frames.

    • emotions array[object]

      A high-dimensional embedding in emotion space.

      • name string

        Name of the emotion being expressed.

      • score number

        Embedding value for the emotion being expressed.

Payload example
{
  "predictions": [
    {
      "misc": {},
      "bbox": {
        "x": 42.0,
        "y": 42.0,
        "w": 42.0,
        "h": 42.0
      },
      "prob": 42.0,
      "face_id": "string",
      "emotions": [
        {
          "name": "string",
          "score": 42.0
        }
      ]
    }
  ]
}
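
As an illustrative sketch, here is one way to reduce such a message to the strongest emotion per confidently detected face (field names as documented above; the 0.9 probability cutoff is an arbitrary choice):

import json

def top_emotions(message: str) -> list[tuple[str, str, float]]:
    """Return (face_id, emotion name, score) for each face in a /face message."""
    results = []
    for pred in json.loads(message).get("predictions", []):
        if pred["prob"] < 0.9:  # skip low-confidence face detections
            continue
        best = max(pred["emotions"], key=lambda e: e["score"])
        results.append((pred["face_id"], best["name"], best["score"]))
    return results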

PUBLISH /face

PUB /face

Analyze a continuous stream of images for facial expressions.

PUBLISH /face

Raw image payload

Payload

  • image string

    Raw bytes of image file. Should be base64 encoded and in a standard image file format like png or jpeg.

Payload example
{
  "image": "string"
}
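
For example, here is a sketch that turns one video frame into this payload as a base64 JPEG (OpenCV is used purely for illustration; any encoder that produces standard image bytes works):

import base64
import json

import cv2  # pip install opencv-python

def frame_payload(frame) -> str:
    """Encode one BGR frame from cv2 as a /face raw image payload."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return json.dumps({"image": base64.b64encode(jpeg.tobytes()).decode("ascii")})

# Usage: encode the first webcam frame, then send it over the /face websocket.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    payload = frame_payload(frame)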

SUBSCRIBE /language

SUB /language

Analyze a continuous stream of text for emotion classification.

SUBSCRIBE /language

Model Predictions

Payload

  • predictions array[object]
    • name string

      Name of the emotion being expressed.

    • score number

      Embedding value for the emotion being expressed.

Payload example
{
  "predictions": [
    {
      "name": "string",
      "score": 42.0
    }
  ]
}

PUBLISH /language

PUB /language

Analyze a continuous stream of text for emotion classification.

PUBLISH /language

Raw text payload

Payload

  • text string

    Text to be analyzed for emotional language.

Payload example
{
  "text": "string"
}
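
A minimal round-trip sketch for this channel, with the same assumed connection details as above:

import asyncio
import json

import websockets

async def classify_text(text: str) -> list[dict]:
    uri = "wss://api.hume.ai/v0/stream/language"  # assumed channel URL
    async with websockets.connect(uri, extra_headers={"X-Hume-Api-Key": "<key>"}) as ws:
        await ws.send(json.dumps({"text": text}))
        return json.loads(await ws.recv())["predictions"]

for p in asyncio.run(classify_text("I can't believe we finally won!")):
    print(p["name"], p["score"])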

SUBSCRIBE /models

SUB /models

Analyze a continuous media stream with Hume models.

Supports the following Hume models:

  • burst
  • face
  • facemesh
  • language
  • prosody

To learn more about what these models do and the science behind them please check out the Hume platform help pages.

Note that this endpoint has a timeout of 10 minutes. To maintain a connection longer than 10 minutes you will need to implement custom reconnect logic in your application.
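
A hedged sketch of such reconnect logic: reopen the socket whenever the server closes it, then resume sending from the first unacknowledged payload.

import asyncio
import json

import websockets

async def stream_with_reconnect(uri: str, payloads: list[dict]) -> None:
    """Resume streaming after the ~10 minute server-side timeout closes the socket."""
    pending = list(payloads)
    while pending:
        try:
            async with websockets.connect(uri, extra_headers={"X-Hume-Api-Key": "<key>"}) as ws:
                while pending:
                    await ws.send(json.dumps(pending[0]))
                    print(json.loads(await ws.recv()))
                    pending.pop(0)  # only drop a payload once its response arrives
        except websockets.ConnectionClosed:
            await asyncio.sleep(1)  # brief backoff before reconnecting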

Accepts one of the following messages.

SUBSCRIBE /models

Model predictions

Payload

  • burst object

    Response for the vocal burst emotion model.

    • predictions array[object]
      • time object

        A time range with a beginning and end, measured in seconds.

        • begin number

          Beginning of time range in seconds.

          Minimum value is 0.

        • end number

          End of time range in seconds.

          Minimum value is 0.

      • emotions array[object]

        A high-dimensional embedding in emotion space.

        • name string

          Name of the emotion being expressed.

        • score number

          Embedding value for the emotion being expressed.

  • face object

    Response for the facial expression emotion model.

    • predictions array[object]
      • frame number

Frame number.

      • time number

        Time in seconds when face detection occurred.

      • bbox object

        A bounding box around a face.

        • x number

          x-coordinate of bounding box top left corner.

          Minimum value is 0.

        • y number

          y-coordinate of bounding box top left corner.

          Minimum value is 0.

        • w number

          Bounding box width.

          Minimum value is 0.

        • h number

          Bounding box height.

          Minimum value is 0.

      • prob number

        The predicted probability that a detected face was actually a face.

      • face_id string

Identifier for a face. Note that this defaults to unknown unless face identification is enabled in the face model configuration.

      • emotions array[object]

        A high-dimensional embedding in emotion space.

        • name string

          Name of the emotion being expressed.

        • score number

          Embedding value for the emotion being expressed.

      • facs array[object]

        FACS action unit predictions.

        • name string

          Name of the FACS feature.

        • score number

          Score for the FACS feature.

      • descriptions array[object]

        Descriptive features of the facial expression.

        • name string

          Name of the descriptive feature.

        • score number

          Score for the descriptive feature.

  • facemesh object

    Response for the facemesh emotion model.

    • predictions array[object]
      • emotions array[object]

        A high-dimensional embedding in emotion space.

        • name string

          Name of the emotion being expressed.

        • score number

          Embedding value for the emotion being expressed.

  • language object

    Response for the language emotion model.

    • predictions array[object]
      • text string

        A segment of text (like a word or a sentence).

      • position object

        Position of a segment of text within a larger document, measured in characters. Uses zero-based indexing. The beginning index is inclusive and the end index is exclusive.

        • begin number

          The index of the first character in the text segment, inclusive.

          Minimum value is 0.

        • end number

The index after the last character in the text segment, exclusive.

          Minimum value is 0.

      • emotions array[object]

        A high-dimensional embedding in emotion space.

        • name string

          Name of the emotion being expressed.

        • score number

          Embedding value for the emotion being expressed.

      • sentiment array[object]

        Sentiment predictions returned as a distribution. This model predicts the probability that a given text could be interpreted as having each sentiment level from 1 (negative) to 9 (positive).

Compared to returning a single estimate of sentiment, this enables a more nuanced analysis of a text's meaning. For example, a text with very neutral sentiment would have an average rating of 5, but so would a text that could be interpreted as either very positive or very negative. Because the average is less informative than the full distribution, this API returns a value for each sentiment level. (A worked sketch of reducing the distribution to its average follows the payload example below.)

        • name string

Level of sentiment, ranging from 1 (negative) to 9 (positive).

        • score number

Prediction for this level of sentiment.

      • toxicity array[object]

        Toxicity predictions returned as probabilities that the text can be classified into the following categories: toxic, severe_toxic, obscene, threat, insult, and identity_hate.

        • name string

          Category of toxicity.

        • score number

Prediction for this category of toxicity.

  • prosody object

    Response for the speech prosody emotion model.

    • predictions array[object]
      • time object

        A time range with a beginning and end, measured in seconds.

        • begin number

          Beginning of time range in seconds.

          Minimum value is 0.

        • end number

          End of time range in seconds.

          Minimum value is 0.

      • emotions array[object]

        A high-dimensional embedding in emotion space.

        • name string

          Name of the emotion being expressed.

        • score number

          Embedding value for the emotion being expressed.

Payload example
{
  "burst": {
    "predictions": [
      {
        "time": {
          "begin": 42.0,
          "end": 42.0
        },
        "emotions": [
          {
            "name": "string",
            "score": 42.0
          }
        ]
      }
    ]
  },
  "face": {
    "predictions": [
      {
        "frame": 42.0,
        "time": 42.0,
        "bbox": {
          "x": 42.0,
          "y": 42.0,
          "w": 42.0,
          "h": 42.0
        },
        "prob": 42.0,
        "face_id": "string",
        "emotions": [
          {
            "name": "string",
            "score": 42.0
          }
        ],
        "facs": [
          {
            "name": "string",
            "score": 42.0
          }
        ],
        "descriptions": [
          {
            "name": "string",
            "score": 42.0
          }
        ]
      }
    ]
  },
  "facemesh": {
    "predictions": [
      {
        "emotions": [
          {
            "name": "string",
            "score": 42.0
          }
        ]
      }
    ]
  },
  "language": {
    "predictions": [
      {
        "text": "string",
        "position": {
          "begin": 42.0,
          "end": 42.0
        },
        "emotions": [
          {
            "name": "string",
            "score": 42.0
          }
        ],
        "sentiment": [
          {
            "name": "string",
            "score": 42.0
          }
        ],
        "toxicity": [
          {
            "name": "string",
            "score": 42.0
          }
        ]
      }
    ]
  },
  "prosody": {
    "predictions": [
      {
        "time": {
          "begin": 42.0,
          "end": 42.0
        },
        "emotions": [
          {
            "name": "string",
            "score": 42.0
          }
        ]
      }
    ]
  }
}
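
To make the sentiment distribution concrete, here is a sketch that reduces each language prediction in such a message to its expected sentiment level, i.e. the weighted average over the nine levels discussed above (this assumes sentiment was enabled in the model configuration and that level names parse as integers):

import json

def expected_sentiment(message: str) -> list[float]:
    """Weighted-average sentiment level (1-9) per text segment in a /models message."""
    levels = []
    for pred in json.loads(message).get("language", {}).get("predictions", []):
        # Each score is the probability of that sentiment level, so the
        # expectation is the probability-weighted sum of the levels.
        levels.append(sum(int(s["name"]) * s["score"] for s in pred["sentiment"]))
    return levels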

Error message

Payload

  • error string

    Error message text.

  • code string

    Unique identifier for the error.

Payload example
{
  "error": "string",
  "code": "string"
}

Warning message

Payload

  • warning string

    Warning message text.

  • code string

    Unique identifier for the warning.

Payload example
{
  "warning": "string",
  "code": "string"
}

PUBLISH /models

PUB /models

Analyze a continuous media stream with Hume models.

Supports the following Hume models:

  • burst
  • face
  • facemesh
  • language
  • prosody

To learn more about what these models do and the science behind them please check out the Hume platform help pages.

Note that this endpoint has a timeout of 10 minutes. To maintain a connection longer than 10 minutes you will need to implement custom reconnect logic in your application.

PUBLISH /models

Models endpoint payload

Payload

  • data string

    Raw bytes of media file. Should be base64 encoded and in a standard media format.

    Recommended filetypes:

    • audio: wav, mp3, mp4
    • image: png, jpeg
    • text: txt
    • facemesh: json

    Note: Streaming video files is not recommended. For better latency you should stream videos by sending a single frame image in each payload.

  • models object

    Configuration used to specify which models should be used and with what settings.

    • burst object

      Configuration for the vocal burst emotion model.

      Note: Model configuration is not currently available in streaming.

      Please use the default configuration by passing an empty object {}.

    • face object

      Configuration for the facial expression emotion model.

      Note: Using the reset_stream parameter does not have any effect on face identification. A single face identifier cache is maintained over a full session whether reset_stream is used or not.

      • facs object

        Configuration for FACS predictions. If missing or null, no FACS predictions will be generated.

      • descriptions object

        Configuration for Descriptions predictions. If missing or null, no Descriptions predictions will be generated.

      • identify_faces boolean

        Whether to return identifiers for faces across frames. If true, unique identifiers will be assigned to face bounding boxes to differentiate different faces. If false, all faces will be tagged with an "unknown" ID.

        Default value is false.

    • facemesh object

      Configuration for the facemesh emotion model.

      Note: Model configuration is not currently available in streaming.

      Please use the default configuration by passing an empty object {}.

    • language object

      Configuration for the language emotion model.

      • sentiment object

        Configuration for sentiment predictions. If missing or null, no sentiment predictions will be generated.

      • toxicity object

        Configuration for toxicity predictions. If missing or null, no toxicity predictions will be generated.

    • prosody object

      Configuration for the speech prosody emotion model.

      Note: Model configuration is not currently available in streaming.

      Please use the default configuration by passing an empty object {}.

  • stream_window_ms number

    Length in milliseconds of the streaming sliding window.

    Extending the length of this window will prepend media context from past payloads into the current payload.

    For example, if on the first payload you send 500ms of data and on the second payload you send an additional 500ms of data, a window of at least 1000ms will allow the model to process all 1000ms of stream data.

    A window of 600ms would append the full 500ms of the second payload to the last 100ms of the first payload.

    Note: This feature is currently only supported for audio data and audio models. For other file types and models this parameter will be ignored.

    Minimum value is 500, maximum value is 10000. Default value is 5000.

  • reset_stream boolean

    Whether to reset the streaming sliding window before processing the current payload.

    If this parameter is set to true then past context will be deleted before processing the current payload.

    Use reset_stream when one audio file is done being processed and you do not want context to leak across files.

    Default value is false.

  • job_details boolean

    Set to true to get details about the job.

    This parameter can be set in the same payload as data, or it can be sent without data and models configuration to get the job details between payloads.

    This parameter is useful for getting the unique job ID.

    Default value is false.

  • raw_text boolean

    Set to true to enable the data parameter to be parsed as raw text rather than base64 encoded bytes.
    This parameter is useful if you want to send text to be processed by the language model, but it cannot be used with other file types like audio, image, or video.

    Default value is false.

Payload example
{
  "data": "string",
  "models": {
    "burst": {},
    "face": {
      "facs": {},
      "descriptions": {},
      "identify_faces": false
    },
    "facemesh": {},
    "language": {
      "sentiment": {},
      "toxicity": {}
    },
    "prosody": {}
  },
  "stream_window_ms": 5000,
  "reset_stream": false,
  "job_details": false,
  "raw_text": false
}
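
Putting these fields together, here is a sketch that builds and sends one such payload for an audio chunk (connection details assumed as before):

import asyncio
import base64
import json

import websockets

async def analyze_audio_chunk(chunk: bytes) -> dict:
    payload = {
        "data": base64.b64encode(chunk).decode("ascii"),
        "models": {
            "burst": {},    # default configuration (none available in streaming yet)
            "prosody": {},
        },
        "stream_window_ms": 5000,  # keep up to 5s of past audio as context
        "reset_stream": False,     # this chunk continues the current stream
    }
    uri = "wss://api.hume.ai/v0/stream/models"
    async with websockets.connect(uri, extra_headers={"X-Hume-Api-Key": "<key>"}) as ws:
        await ws.send(json.dumps(payload))
        return json.loads(await ws.recv())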

SUBSCRIBE /multi

SUB /multi

Send data to multiple models at once. Data must be compatible with all models specified. For example, passing a stream of audio data and requesting face predictions will result in an error.

SUBSCRIBE /multi

PUBLISH /multi

PUB /multi

Send data to multiple models at once. Data must be compatible with all models specified. For example, passing a stream of audio data and requesting face predictions will result in an error.

PUBLISH /multi

Input for streaming multiple models

Payload

  • data string

    Raw base64 encoded bytes of a media file.

  • models array[string]

    List of model types to apply to the data. (e.g. 'burst', 'face', 'facemesh', 'language', 'prosody')

    Values are burst, face, facemesh, language, or prosody.

Payload example
{
  "data": "string",
  "models": [
    "burst"
  ]
}
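
For instance, here is a sketch of a payload that requests two audio-compatible models for the same data (listing face here instead would trigger the incompatibility error noted above):

import base64
import json

def multi_payload(audio_path: str) -> str:
    data = base64.b64encode(open(audio_path, "rb").read()).decode("ascii")
    # burst and prosody both accept audio, so they are compatible with this data.
    return json.dumps({"data": data, "models": ["burst", "prosody"]})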

SUBSCRIBE /prosody

SUB /prosody

Analyze a continuous stream of audio for speech prosody. Speech prosody includes the intonation, stress, and rhythm of the spoken word.

Accepts one of the following messages.

SUBSCRIBE /prosody

Model Predictions

Payload

  • predictions array[object]
    • name string

      Name of the emotion being expressed.

    • score number

      Embedding value for the emotion being expressed.

Payload example
{
  "predictions": [
    {
      "name": "string",
      "score": 42.0
    }
  ]
}

Prediction warning

Payload

  • warning string

    Warning message text.

Payload example
{
  "warning": "string"
}

PUBLISH /prosody

PUB /prosody

Analyze a continuous stream of audio for speech prosody. Speech prosody includes the intonation, stress, and rhythm of the spoken word.

PUBLISH /prosody

Raw audio payload

Payload

  • audio string

    Raw bytes of audio file. Should be base64 encoded and in a standard audio file format like wav or mp3.

  • reset boolean

    Setting reset to true will mark the new audio sample as part of a distinct audio stream. False means that the new audio sample is a continuation of previously sent samples. Default is false.

Payload example
{
  "audio": "string",
  "reset": true
}
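
Finally, here is a sketch of streaming a recording to /prosody as a sequence of samples, with reset set to true only for the first so that later samples continue the same stream. Each element of segments is assumed to be decodable audio bytes (for example, separately encoded wav segments); naive byte slicing of one file would break the audio format.

import asyncio
import base64
import json

import websockets

async def stream_prosody(segments: list[bytes]) -> None:
    uri = "wss://api.hume.ai/v0/stream/prosody"  # assumed channel URL
    async with websockets.connect(uri, extra_headers={"X-Hume-Api-Key": "<key>"}) as ws:
        for i, segment in enumerate(segments):
            await ws.send(json.dumps({
                "audio": base64.b64encode(segment).decode("ascii"),
                "reset": i == 0,  # true only for the first sample of a new stream
            }))
            print(json.loads(await ws.recv()))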