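0. Abstract
I took part in a Tianchi competition on extracting information from PDF resumes. This post reviews, organizes, and shares what I did.
The task is to extract fields such as name and native place from PDF resumes. I built a BiLSTM-CRF model that extracts the required fields from them.
The model scored 0.727 on the online leaderboard, ranking 21st out of 1200+ entries.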
1. About the task
Model goal: PDF resume --> category information
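2. Approach

Use the Python library pdfminer to extract the text from each PDF resume. Then use the JSON annotation file to match the annotated field values against the extracted text and label the text with BIO tags, one tag per character. Finally, feed the labeled text into the BiLSTM-CRF model for training.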
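The matching and BIO labeling step can be sketched roughly as follows. This is a minimal illustration, not the competition code: the function name `bio_tag`, the field names `name` and `native_place`, and the sample text are all assumptions made for the example, and real annotations would need more careful matching (repeated values, overlapping fields, whitespace differences from PDF extraction).

```python
def bio_tag(text, fields):
    """Give every character of `text` a BIO tag.

    `fields` maps a category name to the value string taken from the
    annotation file (hypothetical format, for illustration only).
    """
    tags = ["O"] * len(text)
    for label, value in fields.items():
        if not value:
            continue
        start = text.find(value)
        while start != -1:
            end = start + len(value)
            # Only tag spans that are still unlabeled, so earlier fields win.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            start = text.find(value, end)
    return tags


# Hypothetical resume fragment and annotation.
text = "姓名:张三 籍贯:广东"
fields = {"name": "张三", "native_place": "广东"}
tags = bio_tag(text, fields)

for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```

Each character gets exactly one tag: `B-` marks the first character of a field value, `I-` marks the rest of it, and `O` marks everything else.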
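3. BiLSTM-CRF model

Each character in the text is first one-hot encoded, which in practice means mapping it to an index in the vocabulary. After the Embedding layer, each character corresponds to a character vector, so the whole text can be represented as a matrix. This text matrix is fed into the BiLSTM layer, whose output gives each character a class score vector indicating how strongly that character belongs to each class (in the usual formulation these are unnormalized scores rather than true probabilities). The scores for all characters together form a class score matrix. Feeding this matrix into the CRF layer yields the highest-scoring tag sequence for the text.
For reference, here is the official PyTorch BiLSTM-CRF tutorial: https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html#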
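To make the last step concrete, here is a small pure-Python sketch of how a CRF layer can pick the highest-scoring tag sequence with Viterbi decoding. It is an illustration under stated assumptions, not the model from the repository: the tag set, the emission scores (standing in for the BiLSTM output), and the transition scores are made-up numbers, and the START/STOP tags used in the PyTorch tutorial are left out for brevity.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sequence.

    emissions[t][k]   : score of tag k at position t (stand-in for BiLSTM output)
    transitions[j][k] : score of moving from tag j to tag k (CRF parameter)
    """
    num_tags = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []            # best previous tag for each position and tag
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    best_last = max(range(num_tags), key=lambda k: score[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[best_last], path


# Made-up example with three tags and a three-character input.
tag_names = ["O", "B-name", "I-name"]
emissions = [
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 1.5],   # taken alone, this position prefers I-name
    [0.0, 0.0, 2.0],
]
transitions = [
    [0.0, 0.0, -10.0],   # from O: moving to I-name is heavily penalized
    [0.0, -10.0, 1.0],   # from B-name
    [0.0, 0.0, 0.5],     # from I-name
]
best_score, best_path = viterbi_decode(emissions, transitions)
decoded = [tag_names[k] for k in best_path]
print(best_score, decoded)
```

In this example, taking the best tag at each position independently would give `O, I-name, I-name`, which is not a valid BIO sequence. The transition scores push the decoder to `O, B-name, I-name` instead, which is the main reason for putting a CRF layer on top of the BiLSTM.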
4. Code
https://github.com/Agwave/PDF-Resume-Information-Extraction