Description

Running `bash run.sh 3 3` fails to bring up the Triton server: the three Python backend models (`speaker_embedding`, `cosyvoice2`, `token2wav`) all fail to load with a `ModuleNotFoundError` (`cosyvoice` for two of them, `matcha` for the third), and startup aborts with "failed to load all models". Full log:
root@jdev-3090:/mnt/runtime/triton_trtllm# bash run.sh 3 3
Starting Triton server
I0130 06:34:46.490943 7929 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fbb44000000' with size 268435456"
I0130 06:34:46.493070 7929 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0130 06:34:46.497172 7929 model_lifecycle.cc:473] "loading: audio_tokenizer:1"
I0130 06:34:46.497192 7929 model_lifecycle.cc:473] "loading: cosyvoice2:1"
I0130 06:34:46.497201 7929 model_lifecycle.cc:473] "loading: speaker_embedding:1"
I0130 06:34:46.497221 7929 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I0130 06:34:46.497230 7929 model_lifecycle.cc:473] "loading: token2wav:1"
I0130 06:34:47.470954 7957 pb_stub.cc:320] Failed to initialize Python stub for auto-complete: ModuleNotFoundError: No module named 'cosyvoice'
At:
  /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/speaker_embedding/1/model.py(35): <module>
  <frozen importlib._bootstrap>(488): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(995): exec_module
  <frozen importlib._bootstrap>(950): _load_unlocked
  <frozen importlib._bootstrap>(1334): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1360): _find_and_load
E0130 06:34:47.479864 7929 model_lifecycle.cc:654] "failed to load 'speaker_embedding' version 1: Internal: ModuleNotFoundError: No module named 'cosyvoice'\n\nAt:\n  /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/speaker_embedding/1/model.py(35): <module>\n  <frozen importlib._bootstrap>(488): _call_with_frames_removed\n  <frozen importlib._bootstrap_external>(995): exec_module\n  <frozen importlib._bootstrap>(950): _load_unlocked\n  <frozen importlib._bootstrap>(1334): _find_and_load_unlocked\n  <frozen importlib._bootstrap>(1360): _find_and_load\n"
I0130 06:34:47.479899 7929 model_lifecycle.cc:789] "failed to load 'speaker_embedding'"
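Note: the `speaker_embedding` stub dies before auto-complete because `speaker_embedding/1/model.py` imports `cosyvoice` at line 35, and the Python backend stub only sees whatever `sys.path`/`PYTHONPATH` the `tritonserver` process was launched with. A minimal sanity check, assuming the CosyVoice sources live at `/mnt/runtime/CosyVoice` (that path is a guess; adjust to the actual checkout):

```bash
# Run in the same shell/container that runs run.sh.
# /mnt/runtime/CosyVoice is an assumed checkout location, not from the log.
python3 -c "import cosyvoice; print(cosyvoice.__file__)" \
  || export PYTHONPATH=/mnt/runtime/CosyVoice:$PYTHONPATH
```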
I0130 06:34:47.614020 7929 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0130 06:34:47.614039 7929 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0130 06:34:47.614042 7929 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0130 06:34:47.614044 7929 libtensorrtllm.cc:86] "backend configuration:\n{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I0130 06:34:47.620198 7929 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][INFO] num_nodes is not specified, will be set to 1
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2560) * 24
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 32768
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 32767 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 1215 MiB
[TensorRT-LLM][INFO] Engine load time 589 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1024.03 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
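(Side note: the "profiling verbosity" message above is informational only; if those deeper engine diagnostics were ever wanted, the engine would presumably need rebuilding with `trtllm-build --profiling_verbosity detailed`, keeping the rest of the build flags unchanged.)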
I0130 06:34:48.374294 7954 pb_stub.cc:320] Failed to initialize Python stub for auto-complete: ModuleNotFoundError: No module named 'matcha'
At:
  /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(42): <module>
  <frozen importlib._bootstrap>(488): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(995): exec_module
  <frozen importlib._bootstrap>(950): _load_unlocked
  <frozen importlib._bootstrap>(1334): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1360): _find_and_load
E0130 06:34:48.385283 7929 model_lifecycle.cc:654] "failed to load 'cosyvoice2' version 1: Internal: ModuleNotFoundError: No module named 'matcha'\n\nAt:\n  /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(42): <module>\n  <frozen importlib._bootstrap>(488): _call_with_frames_removed\n  <frozen importlib._bootstrap_external>(995): exec_module\n  <frozen importlib._bootstrap>(950): _load_unlocked\n  <frozen importlib._bootstrap>(1334): _find_and_load_unlocked\n  <frozen importlib._bootstrap>(1360): _find_and_load\n"
I0130 06:34:48.385305 7929 model_lifecycle.cc:789] "failed to load 'cosyvoice2'"
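Note: `matcha` is the Matcha-TTS package, which CosyVoice vendors as a git submodule under `third_party/Matcha-TTS` rather than declaring as a pip dependency, so it has to be importable as well. A sketch, under the same assumed `/mnt/runtime/CosyVoice` checkout as above:

```bash
# Fetch the vendored Matcha-TTS and expose it; the checkout path is assumed.
cd /mnt/runtime/CosyVoice && git submodule update --init --recursive
export PYTHONPATH=/mnt/runtime/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
```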
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1209 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 13.87 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 46.10 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.57 GiB, available: 20.74 GiB, extraCostMemory: 0.00 GiB
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 80
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 1024 [window size=2560]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.03 GiB for max tokens in paged KV cache (2560).
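(The KV cache numbers are self-consistent: 80 primary blocks × 32 tokens per block = 2560 tokens, so the explicit `max_tokens_in_paged_kv_cache=2560` cap, not the 0.5 free-memory fraction over the 20.74 GiB reported available, is what sized the 0.03 GiB pool, exactly as the warning above says.)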
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0130 06:34:48.517434 7929 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I0130 06:34:48.517569 7929 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm'"
I0130 06:34:48.553925 8103 pb_stub.cc:320] Failed to initialize Python stub for auto-complete: ModuleNotFoundError: No module named 'cosyvoice'
At:
  /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/token2wav/1/model.py(39): <module>
  <frozen importlib._bootstrap>(488): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(995): exec_module
  <frozen importlib._bootstrap>(950): _load_unlocked
  <frozen importlib._bootstrap>(1334): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1360): _find_and_load
E0130 06:34:48.562428 7929 model_lifecycle.cc:654] "failed to load 'token2wav' version 1: Internal: ModuleNotFoundError: No module named 'cosyvoice'\n\nAt:\n  /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/token2wav/1/model.py(39): <module>\n  <frozen importlib._bootstrap>(488): _call_with_frames_removed\n  <frozen importlib._bootstrap_external>(995): exec_module\n  <frozen importlib._bootstrap>(950): _load_unlocked\n  <frozen importlib._bootstrap>(1334): _find_and_load_unlocked\n  <frozen importlib._bootstrap>(1360): _find_and_load\n"
I0130 06:34:48.562449 7929 model_lifecycle.cc:789] "failed to load 'token2wav'"
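Note: same root cause as `speaker_embedding` above; `token2wav/1/model.py` imports `cosyvoice` at line 39, so the single `PYTHONPATH` fix sketched earlier should cover both models.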
I0130 06:34:48.872798 7929 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: audio_tokenizer_0_0 (CPU device 0)"
I0130 06:34:50.673777 7929 model_lifecycle.cc:849] "successfully loaded 'audio_tokenizer'"
I0130 06:34:50.673925 7929 server.cc:611]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0130 06:34:50.673944 7929 server.cc:638]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0130 06:34:50.673974 7929 server.cc:681]
+-------------------+---------+-----------------------------------------------------------------------------------------------+
| Model             | Version | Status                                                                                        |
+-------------------+---------+-----------------------------------------------------------------------------------------------+
| audio_tokenizer   | 1       | READY                                                                                         |
| cosyvoice2        | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'matcha'                          |
|                   |         |                                                                                               |
|                   |         | At:                                                                                           |
|                   |         |   /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(42): <module>        |
|                   |         |   <frozen importlib._bootstrap>(488): _call_with_frames_removed                               |
|                   |         |   <frozen importlib._bootstrap_external>(995): exec_module                                    |
|                   |         |   <frozen importlib._bootstrap>(950): _load_unlocked                                          |
|                   |         |   <frozen importlib._bootstrap>(1334): _find_and_load_unlocked                                |
|                   |         |   <frozen importlib._bootstrap>(1360): _find_and_load                                         |
| speaker_embedding | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'cosyvoice'                       |
|                   |         |                                                                                               |
|                   |         | At:                                                                                           |
|                   |         |   /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/speaker_embedding/1/model.py(35): <module> |
|                   |         |   <frozen importlib._bootstrap>(488): _call_with_frames_removed                               |
|                   |         |   <frozen importlib._bootstrap_external>(995): exec_module                                    |
|                   |         |   <frozen importlib._bootstrap>(950): _load_unlocked                                          |
|                   |         |   <frozen importlib._bootstrap>(1334): _find_and_load_unlocked                                |
|                   |         |   <frozen importlib._bootstrap>(1360): _find_and_load                                         |
| tensorrt_llm      | 1       | READY                                                                                         |
| token2wav         | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'cosyvoice'                       |
|                   |         |                                                                                               |
|                   |         | At:                                                                                           |
|                   |         |   /mnt/runtime/triton_trtllm/model_repo_cosyvoice2/token2wav/1/model.py(39): <module>         |
|                   |         |   <frozen importlib._bootstrap>(488): _call_with_frames_removed                               |
|                   |         |   <frozen importlib._bootstrap_external>(995): exec_module                                    |
|                   |         |   <frozen importlib._bootstrap>(950): _load_unlocked                                          |
|                   |         |   <frozen importlib._bootstrap>(1334): _find_and_load_unlocked                                |
|                   |         |   <frozen importlib._bootstrap>(1360): _find_and_load                                         |
+-------------------+---------+-----------------------------------------------------------------------------------------------+
I0130 06:34:50.730851 7929 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090"
I0130 06:34:50.733660 7929 metrics.cc:783] "Collecting CPU metrics"
I0130 06:34:50.733709 7929 tritonserver.cc:2598]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.59.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | ./model_repo_cosyvoice2                                                                                                                                                                                         |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0130 06:34:50.733754 7929 server.cc:312] "Waiting for in-flight requests to complete."
I0130 06:34:50.733758 7929 server.cc:328] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0130 06:34:50.733919 7929 server.cc:343] "All models are stopped, unloading models"
I0130 06:34:50.733922 7929 server.cc:352] "Timeout 30: Found 2 live models and 0 in-flight non-inference requests"
[TensorRT-LLM][INFO] Refreshed the MPI local session
I0130 06:34:50.867384 7929 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm' version 1"
I0130 06:34:51.734022 7929 server.cc:352] "Timeout 29: Found 1 live models and 0 in-flight non-inference requests"
I0130 06:34:52.104783 7929 model_lifecycle.cc:636] "successfully unloaded 'audio_tokenizer' version 1"
I0130 06:34:52.734213 7929 server.cc:352] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
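All three load failures reduce to the same thing: the environment that launches `tritonserver` cannot import the CosyVoice sources or their vendored Matcha-TTS, so the three Python backend models never come up and the server exits. A hedged end-to-end workaround sketch; every path below is an assumption about the local layout, only the repo URL and `run.sh 3 3` come from the setup itself:

```bash
# Clone CosyVoice with its submodules; /mnt/runtime/CosyVoice is an assumed location.
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git /mnt/runtime/CosyVoice

# Make both the repo and the vendored Matcha-TTS visible to the Python backend stubs.
export PYTHONPATH=/mnt/runtime/CosyVoice:/mnt/runtime/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH

# Retry the failing stage.
cd /mnt/runtime/triton_trtllm && bash run.sh 3 3
```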