May 27, 23:58

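How do you implement custom loss functions and custom metrics in TensorFlow?

TensorFlow 2.x ships with common loss functions such as MSE and cross-entropy and with metrics such as Accuracy, but real projects often involve severe class imbalance, business-specific evaluation logic, or a loss that has to combine several optimization objectives. In those cases you have to write the loss function or metric yourself. The sections below cover how to implement them, the details that matter, and the common pitfalls.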

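Two ways to write a custom loss function

Functional style: simple and direct

If the loss logic does not depend on extra parameters, a plain function with the signature (y_true, y_pred) -> scalar is enough: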

python
import tensorflow as tf

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: more robust to outliers than MSE."""
    error = y_true - y_pred
    abs_error = tf.abs(error)
    quadratic = tf.minimum(abs_error, delta)
    linear = abs_error - quadratic
    return tf.reduce_mean(0.5 * quadratic ** 2 + delta * linear)

# `model` is assumed to be an existing tf.keras.Model.
model.compile(optimizer="adam", loss=huber_loss)

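The functional style is concise, but it cannot hold configurable state: Keras calls the function with only y_true and y_pred, so delta always takes the default written in the function signature and cannot be set when calling model.compile.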
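One common workaround, sketched here as an illustration, is a closure: a factory function captures delta and returns a function with the (y_true, y_pred) signature that compile expects. The name make_huber_loss is hypothetical and the input values are made up.

```python
import tensorflow as tf

def make_huber_loss(delta=1.0):
    """Return a Huber loss function with `delta` baked in (hypothetical helper)."""
    def huber_loss(y_true, y_pred):
        abs_error = tf.abs(y_true - y_pred)
        quadratic = tf.minimum(abs_error, delta)  # part of the error up to delta
        linear = abs_error - quadratic            # part of the error beyond delta
        return tf.reduce_mean(0.5 * quadratic ** 2 + delta * linear)
    return huber_loss

# Errors of 1 and 4 with delta=2: the first is purely quadratic,
# the second is quadratic up to 2 and linear beyond that.
loss_fn = make_huber_loss(delta=2.0)
value = loss_fn(tf.constant([0.0, 0.0]), tf.constant([1.0, 4.0]))
```

The result would be passed as model.compile(optimizer="adam", loss=make_huber_loss(delta=2.0)). A closure still has no get_config, so the subclassing approach remains the better fit when the model has to be saved and reloaded.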

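Subclassing: supports parameterization and serialization

Subclassing tf.keras.losses.Loss is the recommended approach: hyperparameters are passed to the constructor when you call compile, and get_config makes the loss serializable: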

python
class WeightedMSE(tf.keras.losses.Loss):
    def __init__(self, pos_weight=2.0, name="weighted_mse", **kwargs):
        super().__init__(name=name, **kwargs)
        self.pos_weight = pos_weight

    def call(self, y_true, y_pred):
        error = tf.square(y_true - y_pred)
        # Positive samples get a larger weight to offset class imbalance
        weights = tf.where(y_true > 0, self.pos_weight, 1.0)
        return tf.reduce_mean(weights * error)

    def get_config(self):
        config = super().get_config()
        config.update({"pos_weight": self.pos_weight})
        return config

model.compile(
    optimizer="adam",
    loss=WeightedMSE(pos_weight=3.0)  # adjustable at compile time
)

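Key points:

call does not have to return a scalar. Keras expects per-sample losses from call and applies the reduction in __call__. Returning an already reduced scalar, as WeightedMSE above does, also works, but then sample_weight can no longer be applied per sample.
The loss must be differentiable with respect to the model's weights. Operations such as tf.argmax and tf.floor pass no useful gradient back; if every path from the loss to the weights goes through one of them, the gradients are None and training stops with an error such as `No gradients provided for any variable`.
Don't leave out get_config; without it the hyperparameters cannot be restored when the model is saved and reloaded.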
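The differentiability point can be checked directly with a gradient tape. The following is a minimal sketch with made-up weights and inputs: the same prediction is turned into a loss once through tf.argmax and once through tf.nn.softmax, and only the second has a gradient.

```python
import tensorflow as tf

w = tf.Variable([[0.5, -0.2], [0.1, 0.3]])
x = tf.constant([[1.0, 2.0]])
y_true = tf.constant([1.0])

with tf.GradientTape(persistent=True) as tape:
    logits = tf.matmul(x, w)

    # Hard decision: argmax returns integer indices, which carry no gradient.
    hard = tf.cast(tf.argmax(logits, axis=-1), tf.float32)
    hard_loss = tf.reduce_mean(tf.square(y_true - hard))

    # Soft decision: probability of class 1, differentiable everywhere.
    soft = tf.nn.softmax(logits)[:, 1]
    soft_loss = tf.reduce_mean(tf.square(y_true - soft))

hard_grad = tape.gradient(hard_loss, w)  # no path to w survives argmax
soft_grad = tape.gradient(soft_loss, w)  # a regular tensor shaped like w
del tape
```

Running a check like this on a new loss before training is cheaper than discovering the problem from an optimizer error a few minutes into fit.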

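Adding losses inside layers with add_loss

Some losses depend on the outputs of intermediate layers (regularization terms, or the contrastive loss in contrastive learning). The call(y_true, y_pred) signature is not enough for these; register them from inside a layer or model with self.add_loss():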

python
class RegularizedDense(tf.keras.layers.Layer):
    def __init__(self, units, l2_coef=0.01, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.l2_coef = l2_coef

    def build(self, input_shape):
        self.kernel = self.add_weight(
            name="kernel", shape=[input_shape[-1], self.units]
        )
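        super().build(input_shape)

    def call(self, inputs):
        # Register the L2 penalty on every forward pass, so it is always
        # computed from the current kernel and recorded by the gradient tape.
        # (build runs only once, so a loss added there can go stale.)
        self.add_loss(self.l2_coef * tf.reduce_sum(tf.square(self.kernel)))
        return tf.matmul(inputs, self.kernel)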

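Losses registered with add_loss are collected in the model.losses list and added to the compiled loss by model.fit, so they are optimized during training without being listed in compile.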
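To see what ends up in that list, the sketch below calls a small layer once and inspects its losses attribute. ActivityPenalty is a hypothetical layer written for this illustration; it penalizes the L1 norm of its input and passes the input through unchanged.

```python
import tensorflow as tf

class ActivityPenalty(tf.keras.layers.Layer):
    """Identity layer that adds an L1 penalty on its input (illustrative)."""

    def __init__(self, coef=0.1, **kwargs):
        super().__init__(**kwargs)
        self.coef = coef

    def call(self, inputs):
        # Registered on every forward pass, from the current inputs.
        self.add_loss(self.coef * tf.reduce_sum(tf.abs(inputs)))
        return inputs

layer = ActivityPenalty(coef=0.1)
_ = layer(tf.constant([[1.0, -2.0, 3.0]]))

extra_losses = layer.losses        # one scalar tensor per add_loss call
penalty = float(extra_losses[0])   # 0.1 * (1 + 2 + 3)
```

In a hand-written train_step or training loop these terms are not added automatically; the usual pattern is to add sum(self.losses) to the main loss inside the gradient tape.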

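Implementing custom metrics

The core difference between a metric and a loss: the loss takes part in backpropagation and drives the weight updates, while a metric is only evaluated and never differentiated. A metric therefore does not have to be differentiable and can freely use rounding, thresholds, or argmax.

Subclassing Metric: a complete F1-score implementation

A custom metric subclasses tf.keras.metrics.Metric and implements four methods (__init__, update_state, result, and reset_state):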

python
class F1Score(tf.keras.metrics.Metric):
    def __init__(self, name="f1_score", **kwargs):
        super().__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name="tp", initializer="zeros")
        self.false_positives = self.add_weight(name="fp", initializer="zeros")
        self.false_negatives = self.add_weight(name="fn", initializer="zeros")

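    def update_state(self, y_true, y_pred, sample_weight=None):
        # Flatten to one value per sample (assumes a single binary output)
        # and threshold the predictions at 0.5.
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(tf.cast(tf.round(y_pred), tf.float32), [-1])

        # Per-sample indicators; they are summed only after weighting.
        tp = y_true * y_pred
        fp = (1 - y_true) * y_pred
        fn = y_true * (1 - y_pred)

        if sample_weight is not None:
            # Weight each sample before summing. Weighting the already
            # summed counts would scale them all by sum(sample_weight).
            sample_weight = tf.reshape(tf.cast(sample_weight, tf.float32), [-1])
            tp *= sample_weight
            fp *= sample_weight
            fn *= sample_weight

        self.true_positives.assign_add(tf.reduce_sum(tp))
        self.false_positives.assign_add(tf.reduce_sum(fp))
        self.false_negatives.assign_add(tf.reduce_sum(fn))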

    def result(self):
        precision = self.true_positives / (
            self.true_positives + self.false_positives + tf.keras.backend.epsilon()
        )
        recall = self.true_positives / (
            self.true_positives + self.false_negatives + tf.keras.backend.epsilon()
        )
        return 2 * precision * recall / (
            precision + recall + tf.keras.backend.epsilon()
        )

    def reset_state(self):
        self.true_positives.assign(0.0)
        self.false_positives.assign(0.0)
        self.false_negatives.assign(0.0)

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[F1Score()]
)

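Implementation notes:

Create state variables with self.add_weight rather than a bare tf.Variable; the former is what correctly supports distributed training and model saving.
update_state accepts a sample_weight argument. This is part of the Keras metric API: Keras passes sample weights through this argument, so leaving it out can make the call fail when weights are supplied to fit.
reset_state (called reset_states in older TF 2.x releases) is invoked automatically by the framework at the start of each epoch. The base class already zeroes every variable created with add_weight, so the override above mainly makes the behavior explicit; any state that is not reset accumulates across epochs.
Adding epsilon() to each denominator guards against division by zero and is standard practice.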
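Why the metric keeps running counts instead of averaging per-batch scores can be seen with plain arithmetic. The sketch below uses made-up (tp, fp, fn) counts for two batches and plain Python in place of TensorFlow; f1_from_counts mirrors the formula in result above.

```python
def f1_from_counts(tp, fp, fn, eps=1e-7):
    """F1 from raw counts, with the same epsilon guards as the metric."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Hypothetical (tp, fp, fn) counts for two batches.
batches = [(8, 2, 0), (0, 1, 3)]

# Averaging the per-batch scores: the second batch contributes an F1 of 0.
per_batch = [f1_from_counts(*b) for b in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Accumulating the counts first, as update_state does, then scoring once.
total_tp = sum(b[0] for b in batches)
total_fp = sum(b[1] for b in batches)
total_fn = sum(b[2] for b in batches)
accumulated = f1_from_counts(total_tp, total_fp, total_fn)
```

With these counts the averaged score is about 0.44 while the accumulated score is about 0.73, which is the F1 of the epoch as a whole.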

Functional metrics: lightweight, but no global accumulation

python
def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

model.compile(optimizer="adam", loss="mse", metrics=[rmse])

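A functional metric is computed independently for each batch, and Keras reports the running mean of those per-batch values. That mean is not the same as the statistic over the whole dataset when the metric is non-linear, so metrics that need global statistics (F1, AUC) have to use the subclassing approach.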
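The gap between the two is easy to reproduce by hand. This is a small sketch in plain Python with made-up errors for two batches; the rmse helper shadows the metric above only for the sake of the arithmetic.

```python
import math

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical residuals (y_true - y_pred) for two batches of two samples.
batch_errors = [[0.0, 0.0], [4.0, 4.0]]

# What a functional metric reports: the mean of the per-batch values.
mean_of_batch_rmse = sum(rmse(b) for b in batch_errors) / len(batch_errors)

# RMSE over all four samples at once.
global_rmse = rmse([e for b in batch_errors for e in b])
```

Here the per-batch mean is 2.0 while the global RMSE is sqrt(8), roughly 2.83. A stateful metric avoids the gap by accumulating the sum of squared errors and the sample count, and taking the square root only in result; Keras ships tf.keras.metrics.RootMeanSquaredError for this case.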

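Custom training step: advanced use of losses and metrics

When the standard model.compile + model.fit workflow is not flexible enough (alternating generator/discriminator updates in a GAN, or dynamically adjusted multi-task weights, for example), you can override train_step: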

python
class CustomModel(tf.keras.Model):
    def __init__(self, generator, discriminator, latent_dim, **kwargs):
        super().__init__(**kwargs)
        self.generator = generator
        self.discriminator = discriminator
        self.latent_dim = latent_dim
        self.discriminator_loss_tracker = tf.keras.metrics.Mean(name="d_loss")
        self.generator_loss_tracker = tf.keras.metrics.Mean(name="g_loss")

    def compile(self, d_optimizer, g_optimizer, **kwargs):
        # Each sub-network gets its own optimizer.
        super().compile(**kwargs)
        self.d_optimizer = d_optimizer
        self.g_optimizer = g_optimizer

    def train_step(self, data):
        # Assumes the dataset yields (images, labels); labels are unused.
        real_images, _ = data
        batch_size = tf.shape(real_images)[0]

        # Train the discriminator. discriminator_loss and generator_loss
        # are assumed to be defined elsewhere in the module.
        with tf.GradientTape() as tape:
            fake_images = self.generator(
                tf.random.normal([batch_size, self.latent_dim]), training=True
            )
            real_output = self.discriminator(real_images, training=True)
            fake_output = self.discriminator(fake_images, training=True)
            d_loss = discriminator_loss(real_output, fake_output)
        grads = tape.gradient(d_loss, self.discriminator.trainable_variables)
        self.d_optimizer.apply_gradients(
            zip(grads, self.discriminator.trainable_variables)
        )

        # Train the generator
        with tf.GradientTape() as tape:
            fake_images = self.generator(
                tf.random.normal([batch_size, self.latent_dim]), training=True
            )
            fake_output = self.discriminator(fake_images, training=True)
            g_loss = generator_loss(fake_output)
        grads = tape.gradient(g_loss, self.generator.trainable_variables)
        self.g_optimizer.apply_gradients(
            zip(grads, self.generator.trainable_variables)
        )

        # Update the loss trackers
        self.discriminator_loss_tracker.update_state(d_loss)
        self.generator_loss_tracker.update_state(g_loss)
        return {
            "d_loss": self.discriminator_loss_tracker.result(),
            "g_loss": self.generator_loss_tracker.result(),
        }

    @property
    def metrics(self):
        return [self.discriminator_loss_tracker, self.generator_loss_tracker]

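After overriding train_step you can still train with model.fit, but the logic inside each step is entirely your own. Note that the metrics property must return every tracker, so that the framework can call reset_state on them automatically at the start of each epoch.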
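The train_step above calls discriminator_loss and generator_loss without defining them. One plausible pair of definitions, given here as a sketch and not as the author's own, is the standard binary cross-entropy GAN loss. It assumes the discriminator outputs raw logits; the logit values at the bottom are made up.

```python
import tensorflow as tf

# Assumption: the discriminator's last layer has no sigmoid (raw logits).
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    """Real images should be classified as 1, generated images as 0."""
    real_loss = bce(tf.ones_like(real_output), real_output)
    fake_loss = bce(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    """The generator wants its images to be classified as 1."""
    return bce(tf.ones_like(fake_output), fake_output)

# A discriminator that is confidently right about both kinds of input:
# its own loss is close to zero, while the generator's loss is large.
real_logits = tf.constant([[10.0], [10.0]])
fake_logits = tf.constant([[-10.0], [-10.0]])
d_value = float(discriminator_loss(real_logits, fake_logits))
g_value = float(generator_loss(fake_logits))
```

Both functions return scalars, which is what the Mean trackers in train_step expect.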

Common pitfalls and how to debug them

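Problem	Cause	Fix
`No gradients provided for any variable`	The loss uses a non-differentiable operation (such as `tf.argmax`)	Switch to a continuous approximation such as `tf.nn.softmax`, or build a straight-through estimator with `tf.stop_gradient`
Metric value does not update	The argument dtypes passed to `update_state` do not match the data	Convert explicitly with `tf.cast`
Metric accumulates across epochs	State is never reset: it bypasses `reset_state` or the tracker is not exposed	Create state with `self.add_weight` rather than `tf.Variable`, and make sure the `metrics` property returns every tracker
Loss from `add_loss` is missing or None	`add_loss` was called before the layer's weights existed (before `build`)	Call `add_loss` inside `call`, after the weights have been created in `build`
Saving the model fails	The custom class has no `get_config`	Add `get_config` and call `super().get_config()` in it; pass the class through `custom_objects` when loading
Metrics are inaccurate in distributed training	State was created with `tf.Variable` instead of `add_weight`	`add_weight` sets up cross-replica aggregation automatically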
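For the serialization row, a quick way to confirm that get_config is complete is to round-trip the object through its own config. The sketch below repeats the WeightedMSE class from earlier so that it runs on its own.

```python
import tensorflow as tf

class WeightedMSE(tf.keras.losses.Loss):
    def __init__(self, pos_weight=2.0, name="weighted_mse", **kwargs):
        super().__init__(name=name, **kwargs)
        self.pos_weight = pos_weight

    def call(self, y_true, y_pred):
        weights = tf.where(y_true > 0, self.pos_weight, 1.0)
        return tf.reduce_mean(weights * tf.square(y_true - y_pred))

    def get_config(self):
        config = super().get_config()
        config.update({"pos_weight": self.pos_weight})
        return config

original = WeightedMSE(pos_weight=3.0)
config = original.get_config()

# from_config passes the dict back to __init__; a hyperparameter that is
# missing from get_config would silently fall back to its default here.
restored = WeightedMSE.from_config(config)
```

When loading a saved model, the class also has to be made known to Keras, for example with tf.keras.models.load_model(path, custom_objects={"WeightedMSE": WeightedMSE}), where path stands for wherever the model was saved.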

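Debugging tip: before training, run one forward pass and gradient computation by hand on a small batch, and confirm that the loss is a scalar, that no gradient is None, and that the metric updates and resets correctly.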

python
# Quick sanity-check script (assumes `model` maps [batch, 10] inputs to [batch, 1] outputs)
x = tf.random.normal([4, 10])
y = tf.random.uniform([4, 1], 0, 2, dtype=tf.int32)
y_float = tf.cast(y, tf.float32)

loss_fn = WeightedMSE(pos_weight=2.0)
metric_fn = F1Score()

with tf.GradientTape() as tape:
    pred = model(x, training=False)
    loss = loss_fn(y_float, pred)

grads = tape.gradient(loss, model.trainable_variables)
assert loss.shape == (), f"Loss must be scalar, got {loss.shape}"
assert all(g is not None for g in grads), "Some gradients are None"

metric_fn.update_state(y_float, pred)
assert metric_fn.result().numpy() >= 0, "Metric should be non-negative"
metric_fn.reset_state()
assert metric_fn.result().numpy() == 0, "Reset failed"
print("All checks passed!")
