Language Models

DataCrunch Inference LLM endpoints documentation

Our inference service provides Language Model endpoints compatible with the Text Generation Inference (TGI) schema.

The LLM endpoints are available in both streaming and non-streaming variants. Both endpoints take one required parameter, inputs, which holds the prompt. Optional parameters are passed in the parameters object.

Curl examples

Non-streaming example, /generate endpoint:

curl -X POST https://<ENDPOINT_URL>/generate \
  -H "Content-Type: application/json" \
  -H <AUTH_HEADERS> \
  -d \
'{
    "model": "<MODEL_NAME>",
    "inputs": "My name is Olivier and I",
    "parameters": {
      "best_of": 1,
      "decoder_input_details": true,
      "details": true,
      "do_sample": false,
      "max_new_tokens": 20,
      "repetition_penalty": 1.03,
      "return_full_text": false,
      "seed": null,
      "stop": [
        "photographer"
      ],
      "temperature": 0.5,
      "top_k": 10,
      "top_p": 0.95,
      "truncate": null,
      "typical_p": 0.95,
      "watermark": true
   }
 }'
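
The same request can be made from Python. The sketch below uses the requests library and sends only a subset of the optional parameters; the endpoint URL, the authentication header, and the model name are placeholders to replace with your own values:

import requests

ENDPOINT_URL = "https://<ENDPOINT_URL>"  # replace with your endpoint URL
HEADERS = {
    "Content-Type": "application/json",
    # replace with the auth header(s) required by your deployment,
    # i.e. whatever <AUTH_HEADERS> expands to
}

payload = {
    "model": "<MODEL_NAME>",
    "inputs": "My name is Olivier and I",
    "parameters": {
        "max_new_tokens": 20,
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "details": True,
    },
}

response = requests.post(f"{ENDPOINT_URL}/generate", headers=HEADERS, json=payload)
response.raise_for_status()
# the non-streaming endpoint returns the completion in generated_text
print(response.json()["generated_text"])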

Streaming example, /generate_stream endpoint:

curl -N -X POST https://<ENDPOINT_URL>/generate_stream \
  -H "Content-Type: application/json" \
  -H <AUTH_HEADERS> \
  -d \
'{
    "model": "<MODEL_NAME>",
    "inputs": "My name is Olivier and I",
    "parameters": {
      "best_of": 1,
      "decoder_input_details": false,
      "details": true,
      "do_sample": false,
      "max_new_tokens": 20,
      "repetition_penalty": 1.03,
      "return_full_text": false,
      "seed": null,
      "stop": [
        "photographer"
      ],
      "temperature": 0.5,
      "top_k": 10,
      "top_p": 0.95,
      "truncate": null,
      "typical_p": 0.95,
      "watermark": true
   }
 }'

Note: the decoder_input_details parameter must be set to false for the streaming endpoint.
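
The streaming endpoint returns server-sent events, one data: line per generated token. The following is a minimal Python sketch for consuming the stream with the requests library; as above, the URL, auth header, and model name are placeholders:

import json
import requests

ENDPOINT_URL = "https://<ENDPOINT_URL>"  # replace with your endpoint URL
HEADERS = {
    "Content-Type": "application/json",
    # replace with the auth header(s) required by your deployment
}

payload = {
    "model": "<MODEL_NAME>",
    "inputs": "My name is Olivier and I",
    "parameters": {
        "max_new_tokens": 20,
        "decoder_input_details": False,  # must be false when streaming
    },
}

with requests.post(f"{ENDPOINT_URL}/generate_stream",
                   headers=HEADERS, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # skip keep-alive blanks and anything that is not an SSE data line
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        # each event carries the newly generated token; the final event
        # also carries the full generated_text
        print(event["token"]["text"], end="", flush=True)
print()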

Parameters

List of optional parameters for TGI-based endpoints:

  • do_sample (bool, optional): Activate logits sampling. Defaults to False.

  • max_new_tokens (int, optional): Maximum number of generated tokens. Defaults to 20.

  • repetition_penalty (float, optional): The parameter for repetition penalty. A value of 1.0 means no penalty. See the CTRL paper (Keskar et al., 2019, arXiv:1909.05858) for more details. Defaults to None.

  • return_full_text (bool, optional): Whether to prepend the prompt to the generated text. Defaults to False.

  • stop (List[str], optional): Stop generating tokens when one of these sequences is generated. Defaults to an empty list.

  • seed (int, optional): Random sampling seed. Defaults to None.

  • temperature (float, optional): The value used to modulate the logits distribution. Defaults to None.

  • top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k-filtering. Defaults to None.

  • top_p (float, optional): If set to a value less than 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Defaults to None.

  • truncate (int, optional): Truncate input tokens to the given size. Defaults to None.

  • typical_p (float, optional): Typical Decoding mass. See Typical Decoding for Natural Language Generation for more information. Defaults to None.

  • best_of (int, optional): Generate best_of sequences and return the one with the highest token logprobs. Defaults to None.

  • watermark (bool, optional): Enable watermarking as described in A Watermark for Large Language Models (Kirchenbauer et al., 2023). Defaults to False.

  • details (bool, optional): Get generation details. Defaults to False.

  • decoder_input_details (bool, optional): Get decoder input token logprobs and ids. Defaults to False.
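
To make the interaction between the sampling parameters concrete, the two illustrative parameters objects below (values chosen arbitrarily) contrast greedy decoding with sampled decoding. With do_sample set to false, generation is deterministic and the temperature, top_k, top_p, and typical_p settings have no effect; with do_sample set to true, they shape the token distribution and seed makes the sampling reproducible.

Greedy decoding:

"parameters": {
  "do_sample": false,
  "max_new_tokens": 50,
  "repetition_penalty": 1.03
}

Sampled decoding:

"parameters": {
  "do_sample": true,
  "seed": 42,
  "temperature": 0.7,
  "top_k": 50,
  "top_p": 0.95,
  "max_new_tokens": 50
}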
