[OpenCV in Practice] 5 Deep-Learning-Based Text Detection
In this post we will try to locate the words in an image, one word at a time. The text detection here is based on a recent paper:
EAST: An Efficient and Accurate Scene Text Detector.
https://arxiv.org/abs/1704.03155v2
https://github.com/argman/EAST
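Note that text detection is not the same as text recognition. In text detection we only find the bounding boxes around the text. In text recognition we actually work out what is written inside each box. For example, in the image given below, text detection gives you the bounding box around the word, and text recognition tells you that the box contains the word STOP. This post covers text detection only.
This post uses a TensorFlow model that is invoked through OpenCV. We will go through how the algorithm works step by step. You will need an OpenCV version later than 3.4.3 to run the code. Other models are loaded through the OpenCV DNN module in much the same way.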
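The steps involved are:
- Download the EAST model
- Load the model into memory
- Prepare the input image
- Forward pass the blob through the network
- Process the output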
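1 Loading the network
We use the cv::dnn::readNet function (C++) or cv2.dnn.readNet (Python, imported as cv in the code below) to load the network into memory. It detects the configuration and the framework automatically from the file name. In our case the file is a .pb file, so it is assumed to be a TensorFlow network. Unlike some other model formats, no separate model-structure description file is supplied here.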
C++
Net net = readNet(model);
Python
net = cv.dnn.readNet(model)
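To make the idea of extension-based detection concrete, here is a minimal sketch of how a loader could map a file name to a framework. The function `guess_framework` and its lookup table are hypothetical and are not OpenCV code; the table is only a subset of the extensions listed in the `readNet` documentation, so treat it as an assumption rather than the library's actual dispatch logic.

```python
import os

# Hypothetical illustration only. OpenCV's readNet does this kind of dispatch
# internally; this table is an assumed subset based on its documentation.
_EXTENSION_TO_FRAMEWORK = {
    ".pb": "tensorflow",
    ".caffemodel": "caffe",
    ".weights": "darknet",
    ".t7": "torch",
    ".onnx": "onnx",
}


def guess_framework(model_path):
    """Return the framework name implied by the model file extension."""
    ext = os.path.splitext(model_path)[1].lower()
    if ext not in _EXTENSION_TO_FRAMEWORK:
        raise ValueError("unrecognized model file extension: %r" % ext)
    return _EXTENSION_TO_FRAMEWORK[ext]


print(guess_framework("./model/frozen_east_text_detection.pb"))  # tensorflow
```

In practice you simply pass the path to `readNet`; the sketch only shows why a lone .pb file is enough for the call above.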
2 Reading the image
We need to create a 4-D input blob to feed the image to the network. This is done with the blobFromImage function.
C++
blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);
Python
blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)
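We need to specify several parameters for this function. The positions below follow the Python call; in the C++ call shown above, the output blob is passed as an extra second argument, so the remaining positions shift by one.
- The first parameter is the image itself.
- The second parameter is the scale factor applied to each pixel value. It is not needed here, so we keep it at 1.
- The third parameter is the input size. The network's default input is 320×320, so we specify that when creating the blob. It is best to match the network's input size.
- The fourth parameter is the per-channel mean used during training. It is subtracted from the image.
- The fifth parameter says whether to swap the R and B channels. This is required because OpenCV uses BGR order while TensorFlow uses RGB (Caffe models use BGR).
- The last parameter says whether to crop the image and take the center crop. We pass False here.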
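The parameters above can be illustrated with a small pure-Python sketch of the per-pixel work: swap B and R, subtract the mean, scale, and rearrange the data into a 1 x 3 x H x W layout. The helper `make_blob` is hypothetical and deliberately simplified: it assumes the image already has the network input size, so it does no resizing or cropping, and it is not a replacement for `blobFromImage`.

```python
def make_blob(image_bgr, mean_rgb=(123.68, 116.78, 103.94), scale=1.0, swap_rb=True):
    """Convert an H x W image (rows of (B, G, R) tuples) to a 1 x 3 x H x W blob.

    Simplification (assumption): the image is already at the network input
    size, so no resizing or cropping happens here.
    """
    height = len(image_bgr)
    width = len(image_bgr[0])
    # One plane per channel, addressed as blob[0][channel][y][x]
    planes = [[[0.0] * width for _ in range(height)] for _ in range(3)]
    for y in range(height):
        for x in range(width):
            pixel = tuple(image_bgr[y][x])
            if swap_rb:
                pixel = pixel[::-1]  # BGR -> RGB
            for c in range(3):
                # Subtract the training mean first, then apply the scale factor
                planes[c][y][x] = scale * (pixel[c] - mean_rgb[c])
    return [planes]


# A tiny 2 x 3 "image" in which every pixel is B=10, G=20, R=30
image = [[(10, 20, 30)] * 3 for _ in range(2)]
blob = make_blob(image)
print(len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))  # 1 3 2 3
```

After the swap, channel 0 of the blob holds the red values minus 123.68, which is why the mean is given in R, G, B order even though the input image is BGR.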
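3 Forward pass
Now that the input is ready, we pass it through the network. The network has two outputs: one gives the geometry (position) of the text boxes, and the other gives the confidence score of each detected box. The two output layers are: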
feature_fusion/concat_3
feature_fusion/Conv_7/Sigmoid
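You can find these two outputs by opening the .pb model in Netron and looking at the final output nodes. Netron is a model-structure visualization tool that supports TensorFlow, Caffe, Keras, MXNet and other frameworks. Netron can be downloaded from: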
https://github.com/lutzroeder/Netron
The C++ code that names the output layers is:
std::vector<String> outputLayers(2);
outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
outputLayers[1] = "feature_fusion/concat_3";
The Python code that names the output layers is:
outputLayers = []
outputLayers.append("feature_fusion/Conv_7/Sigmoid")
outputLayers.append("feature_fusion/concat_3")
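Next, we get the output by passing the input image through the network. As mentioned above, the output has two parts: confidence scores and geometry.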
C++
std::vector<Mat> output;
net.setInput(blob);
net.forward(output, outputLayers);
Mat scores = output[0];
Mat geometry = output[1];
Python
net.setInput(blob)
output = net.forward(outputLayers)
scores = output[0]
geometry = output[1]
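It helps to know what shapes to expect from `scores` and `geometry` before decoding them. The sketch below derives them from the input size, based on the checks in the decode function later in this post (1 score channel, 5 geometry channels, feature maps 4 times smaller than the input) and on the script's note that width and height should be multiples of 32. The function name `east_output_shapes` is hypothetical.

```python
def east_output_shapes(inp_width, inp_height):
    """Return the expected (scores_shape, geometry_shape) for a given input size.

    Assumptions taken from the decode code in this post: the feature maps are
    4 times smaller than the input, scores has 1 channel, geometry has 5.
    """
    if inp_width % 32 != 0 or inp_height % 32 != 0:
        raise ValueError("input width and height should be multiples of 32")
    rows, cols = inp_height // 4, inp_width // 4
    return (1, 1, rows, cols), (1, 5, rows, cols)


scores_shape, geometry_shape = east_output_shapes(320, 320)
print(scores_shape)    # (1, 1, 80, 80)
print(geometry_shape)  # (1, 5, 80, 80)
```

With the default 320×320 input, each of the 80×80 cells is one candidate text box, which is why a filtering step is needed afterwards.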
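4 Processing the output
As mentioned above, we use the outputs of the two layers to decode the position and orientation of each text box. This can yield many candidate boxes, so we need to keep only the best ones. This is done with non-maximum suppression.
Non-maximum suppression is widely used in object detection. For details, see: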
http://www.it610.com/article/5215825.htm
https://blog.csdn.net/qq_14845119/article/details/52064928
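As a rough illustration of the idea, here is a minimal greedy non-maximum suppression in plain Python. It is a simplified sketch, not OpenCV's implementation: it works on axis-aligned boxes given as (x, y, w, h), whereas the rotated variant used later in this post measures the overlap of rotated rectangles. The function names `iou` and `nms` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold, nms_threshold):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    # Drop low-confidence boxes, then visit the rest from best to worst score
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10), (80, 80, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
print(nms(boxes, scores, 0.5, 0.4))  # [0, 2]
```

Box 1 is dropped because it overlaps box 0 with an IoU of about 0.68, and box 3 is dropped because its score is below the confidence threshold.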
1 Decoding
C++:
std::vector<RotatedRect> boxes;
std::vector<float> confidences;
decode(scores, geometry, confThreshold, boxes, confidences);
Python:
[boxes, confidences] = decode(scores, geometry, confThreshold)
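To show what decode does for a single feature-map cell, here is a small sketch that applies the same arithmetic as the full decode function listed later in this post. The function name `decode_cell` and the parameter names `d_top`, `d_right`, `d_bottom` and `d_left` are my own labels for the four geometry channels (called x0 to x3 in the full code); reading them as distances to the four box edges is an interpretation, not something the code states.

```python
import math


def decode_cell(x, y, d_top, d_right, d_bottom, d_left, angle):
    """Decode one feature-map cell into (center, (w, h), angle_in_degrees).

    x, y are cell indices in the feature map; angle is in radians.
    The d_* names are assumed labels for geometry channels 0..3.
    """
    # Feature maps are 4 times smaller than the network input
    offset_x, offset_y = x * 4.0, y * 4.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    h = d_top + d_bottom
    w = d_right + d_left
    # Reference corner of the box, then two neighbouring corners p1 and p3
    ox = offset_x + cos_a * d_right + sin_a * d_bottom
    oy = offset_y - sin_a * d_right + cos_a * d_bottom
    p1 = (-sin_a * h + ox, -cos_a * h + oy)
    p3 = (-cos_a * w + ox, sin_a * w + oy)
    center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
    return center, (w, h), -angle * 180.0 / math.pi


center, size, angle_deg = decode_cell(10, 5, 8.0, 20.0, 12.0, 10.0, 0.0)
print(center, size)  # (45.0, 22.0) (30.0, 20.0)
```

For an unrotated box the result is easy to check by hand: the cell maps to pixel (40, 20), the box is 30 wide and 20 high, and its center is shifted by half the difference between the opposite edge distances.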
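2 Non-maximum suppression
We use the OpenCV function NMSBoxes (C++) or NMSBoxesRotated (Python) to filter out false positives and heavily overlapping boxes and obtain the final predictions.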
C++:
std::vector<int> indices;
NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);
Python:
indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
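One practical detail, stated here as an assumption from my own experience rather than from the original post: depending on the OpenCV version, the Python NMS functions return the kept indices either as an N x 1 array (older releases) or as a flat array (newer releases), and as an empty tuple when nothing is kept. A small hypothetical helper such as `flatten_indices` below lets the same loop work in all of these cases.

```python
def flatten_indices(indices):
    """Return NMS indices as a flat list of ints.

    Accepts both nested results such as [[0], [2]] and flat ones such as [0, 2].
    """
    flat = []
    for item in indices:
        if hasattr(item, "__len__"):
            flat.append(int(item[0]))  # nested form: each entry wraps one index
        else:
            flat.append(int(item))     # flat form: the entry is the index itself
    return flat


print(flatten_indices([[0], [2]]))  # [0, 2]
print(flatten_indices([0, 2]))      # [0, 2]
print(flatten_indices(()))          # []
```

With this helper the drawing loop can be written as `for i in flatten_indices(indices): ... boxes[i]` regardless of the installed version.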
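3 Results and code
3.1 Results
I ran the C++ code in VS2017. The OpenCV version must be at least 3.4.5; otherwise reading the model fails. The model file is too large to include here; see the download link:
If you have no download credits (the site sets the resource price automatically), see the reference link. I ported the material from there without major modifications.
Alternatively, if you can reach Dropbox (a proxy may be needed), download the model directly: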
https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
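The results are shown below. Detection quality is decent and the speed is acceptable.
3.2 Code
The C++ code has been modified somewhat; the Python code has not. I am not very familiar with text detection, so the comments are sparse.
C++ code: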
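// text_detection.cpp : This file contains the "main" function. Program execution begins and ends here.
//
#include "pch.h" // Visual Studio precompiled header; remove this line when building outside VS
#include <iostream>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;
using namespace cv::dnn;
// Decode the network outputs into rotated boxes and confidences
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
            std::vector<RotatedRect> &detections, std::vector<float> &confidences);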
/**
 * @brief Detect text in an image with the EAST model and draw the detected boxes
 *
 * @param srcImg image to run detection on
 * @param inpWidth network input width
 * @param inpHeight network input height
 * @param confThreshold confidence threshold
 * @param nmsThreshold non-maximum suppression threshold
 * @param net loaded network
 * @return Mat copy of the input image with the boxes drawn on it
 */
Mat text_detect(Mat srcImg, int inpWidth, int inpHeight, float confThreshold, float nmsThreshold, Net net)
{
    // Outputs
    std::vector<Mat> output;
    std::vector<String> outputLayers(2);
    outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
    outputLayers[1] = "feature_fusion/concat_3";
    // Image to run detection on
    Mat frame, blob;
    frame = srcImg.clone();
    // Build the network input
    blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);
    net.setInput(blob);
    // Run the forward pass
    net.forward(output, outputLayers);
    // Confidence scores
    Mat scores = output[0];
    // Geometry (position) parameters
    Mat geometry = output[1];
    // Decode predicted bounding boxes to get the position and orientation of each text box
    // Text box positions
    std::vector<RotatedRect> boxes;
    // Text box confidences
    std::vector<float> confidences;
    decode(scores, geometry, confThreshold, boxes, confidences);
    // Apply non-maximum suppression
    // Indices of the text boxes that are kept
    std::vector<int> indices;
    NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);
    // Render detections
    // Scale factors from network input size back to the original image size
    Point2f ratio((float)frame.cols / inpWidth, (float)frame.rows / inpHeight);
    for (size_t i = 0; i < indices.size(); ++i)
    {
        RotatedRect &box = boxes[indices[i]];
        Point2f vertices[4];
        box.points(vertices);
        // Map the corner points back to original image coordinates
        for (int j = 0; j < 4; ++j)
        {
            vertices[j].x *= ratio.x;
            vertices[j].y *= ratio.y;
        }
        // Draw the box
        for (int j = 0; j < 4; ++j)
        {
            line(frame, vertices[j], vertices[(j + 1) % 4], Scalar(0, 255, 0), 2, LINE_AA);
        }
    }
    // Put efficiency information (inference time)
    std::vector<double> layersTimes;
    double freq = getTickFrequency() / 1000;
    double t = net.getPerfProfile(layersTimes) / freq;
    std::string label = format("Inference time: %.2f ms", t);
    putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
    return frame;
}
// Model path
auto model = "./model/frozen_east_text_detection.pb";
// Image to run detection on
auto detect_image = "./image/patient.jpg";
// Network input size
auto inpWidth = 320;
auto inpHeight = 320;
// Confidence threshold
auto confThreshold = 0.5f;
// Non-maximum suppression threshold
auto nmsThreshold = 0.4f;
int main()
{
    // Load the model
    Net net = readNet(model);
    // Read the image
    Mat srcImg = imread(detect_image);
    if (srcImg.empty())
    {
        // Stop here: running detection on an empty image would fail
        cout << "failed to read image: " << detect_image << endl;
        return -1;
    }
    cout << "read image success!" << endl;
    Mat resultImg = text_detect(srcImg, inpWidth, inpHeight, confThreshold, nmsThreshold, net);
    imshow("result", resultImg);
    waitKey();
    return 0;
}
/**
 * @brief Extract the detected text boxes and their confidences from the network outputs
 *
 * @param scores confidence scores
 * @param geometry geometry (position) information
 * @param scoreThresh confidence threshold
 * @param detections output text boxes
 * @param confidences output confidence of each text box
 */
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
            std::vector<RotatedRect> &detections, std::vector<float> &confidences)
{
    detections.clear();
    confidences.clear();
    // Check that the outputs have the expected shapes
    CV_Assert(scores.dims == 4);
    CV_Assert(geometry.dims == 4);
    CV_Assert(scores.size[0] == 1);
    CV_Assert(geometry.size[0] == 1);
    CV_Assert(scores.size[1] == 1);
    CV_Assert(geometry.size[1] == 5);
    CV_Assert(scores.size[2] == geometry.size[2]);
    CV_Assert(scores.size[3] == geometry.size[3]);
    const int height = scores.size[2];
    const int width = scores.size[3];
    for (int y = 0; y < height; ++y)
    {
        // Confidence scores for this row
        const float *scoresData = scores.ptr<float>(0, 0, y);
        // Text box geometry for this row
        const float *x0_data = geometry.ptr<float>(0, 0, y);
        const float *x1_data = geometry.ptr<float>(0, 1, y);
        const float *x2_data = geometry.ptr<float>(0, 2, y);
        const float *x3_data = geometry.ptr<float>(0, 3, y);
        // Text box angles for this row
        const float *anglesData = geometry.ptr<float>(0, 4, y);
        // Visit every candidate box in this row
        for (int x = 0; x < width; ++x)
        {
            float score = scoresData[x];
            // Skip boxes below the confidence threshold
            if (score < scoreThresh)
            {
                continue;
            }
            // Decode a prediction.
            // Multiply by 4 because the feature maps are 4 times smaller than the input image.
            float offsetX = x * 4.0f, offsetY = y * 4.0f;
            // Angle and its sine and cosine
            float angle = anglesData[x];
            float cosA = std::cos(angle);
            float sinA = std::sin(angle);
            float h = x0_data[x] + x2_data[x];
            float w = x1_data[x] + x3_data[x];
            Point2f offset(offsetX + cosA * x1_data[x] + sinA * x2_data[x],
                           offsetY - sinA * x1_data[x] + cosA * x2_data[x]);
            Point2f p1 = Point2f(-sinA * h, -cosA * h) + offset;
            Point2f p3 = Point2f(-cosA * w, sinA * w) + offset;
            // Rotated rectangle built from center point, box size (width, height) and angle in degrees
            RotatedRect r(0.5f * (p1 + p3), Size2f(w, h), -angle * 180.0f / (float)CV_PI);
            // Store the box
            detections.push_back(r);
            // Store the confidence of the box
            confidences.push_back(score);
        }
    }
}
Python code:
# Import required modules
import cv2 as cv
import math
import argparse
parser = argparse.ArgumentParser(description='Use this script to run text detection deep learning networks using OpenCV.')
# Input argument
parser.add_argument('--input', help='Path to input image or video file. Skip this argument to capture frames from a camera.')
# Model argument
parser.add_argument('--model', default="./model/frozen_east_text_detection.pb",
                    help='Path to a binary .pb file of the model containing trained weights.'
                    )
# Width argument
parser.add_argument('--width', type=int, default=320,
                    help='Preprocess input image by resizing to a specific width. It should be a multiple of 32.'
                    )
# Height argument
parser.add_argument('--height', type=int, default=320,
                    help='Preprocess input image by resizing to a specific height. It should be a multiple of 32.'
                    )
# Confidence threshold
parser.add_argument('--thr', type=float, default=0.5,
                    help='Confidence threshold.'
                    )
# Non-maximum suppression threshold
parser.add_argument('--nms', type=float, default=0.4,
                    help='Non-maximum suppression threshold.'
                    )
args = parser.parse_args()
############ Utility functions ############
def decode(scores, geometry, scoreThresh):
    detections = []
    confidences = []
    ############ CHECK DIMENSIONS AND SHAPES OF geometry AND scores ############
    assert len(scores.shape) == 4, "Incorrect dimensions of scores"
    assert len(geometry.shape) == 4, "Incorrect dimensions of geometry"
    assert scores.shape[0] == 1, "Invalid dimensions of scores"
    assert geometry.shape[0] == 1, "Invalid dimensions of geometry"
    assert scores.shape[1] == 1, "Invalid dimensions of scores"
    assert geometry.shape[1] == 5, "Invalid dimensions of geometry"
    assert scores.shape[2] == geometry.shape[2], "Invalid dimensions of scores and geometry"
    assert scores.shape[3] == geometry.shape[3], "Invalid dimensions of scores and geometry"
    height = scores.shape[2]
    width = scores.shape[3]
    for y in range(0, height):
        # Extract data from scores and geometry for this row
        scoresData = scores[0][0][y]
        x0_data = geometry[0][0][y]
        x1_data = geometry[0][1][y]
        x2_data = geometry[0][2][y]
        x3_data = geometry[0][3][y]
        anglesData = geometry[0][4][y]
        for x in range(0, width):
            score = scoresData[x]
            # If score is lower than threshold score, move to next x
            if score < scoreThresh:
                continue
            # Calculate offset (feature maps are 4 times smaller than the input)
            offsetX = x * 4.0
            offsetY = y * 4.0
            angle = anglesData[x]
            # Calculate cos and sin of angle
            cosA = math.cos(angle)
            sinA = math.sin(angle)
            h = x0_data[x] + x2_data[x]
            w = x1_data[x] + x3_data[x]
            # Calculate offset
            offset = ([offsetX + cosA * x1_data[x] + sinA * x2_data[x], offsetY - sinA * x1_data[x] + cosA * x2_data[x]])
            # Find points for rectangle
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0], sinA * w + offset[1])
            center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
            detections.append((center, (w, h), -1 * angle * 180.0 / math.pi))
            confidences.append(float(score))
    # Return detections and confidences
    return [detections, confidences]
if __name__ == "__main__":
    # Read and store arguments
    confThreshold = args.thr
    nmsThreshold = args.nms
    inpWidth = args.width
    inpHeight = args.height
    model = args.model
    # Load network
    net = cv.dnn.readNet(model)
    # Names of the two output layers: confidence scores and box geometry
    outputLayers = []
    outputLayers.append("feature_fusion/Conv_7/Sigmoid")
    outputLayers.append("feature_fusion/concat_3")
    # Read frame
    frame = cv.imread("./image/stop1.jpg")
    if frame is None:
        raise SystemExit("Could not read ./image/stop1.jpg")
    # Get frame height and width
    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)
    # Create a 4D blob from frame.
    blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)
    # Run the model
    net.setInput(blob)
    output = net.forward(outputLayers)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())
    # Get scores and geometry
    scores = output[0]
    geometry = output[1]
    [boxes, confidences] = decode(scores, geometry, confThreshold)
    # Apply NMS
    indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
    for i in indices:
        # Depending on the OpenCV version, each entry is a 1-element array or a plain index
        idx = int(i[0]) if hasattr(i, "__len__") else int(i)
        # get 4 corners of the rotated rect
        vertices = cv.boxPoints(boxes[idx])
        # scale the bounding box coordinates based on the respective ratios
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        for j in range(4):
            # Drawing functions expect integer pixel coordinates
            p1 = (int(vertices[j][0]), int(vertices[j][1]))
            p2 = (int(vertices[(j + 1) % 4][0]), int(vertices[(j + 1) % 4][1]))
            cv.line(frame, p1, p2, (0, 255, 0), 2, cv.LINE_AA)
        # cv.putText(frame, "{:.3f}".format(confidences[idx]), (int(vertices[0][0]), int(vertices[0][1])), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1, cv.LINE_AA)
    # Put efficiency information
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))
    # Display the frame
    cv.imshow("result", frame)
    cv.waitKey(0)
References
https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/