Make absolutely sure you master Python, along with libraries like NLTK and spaCy. spaCy in particular is far faster than any tool I used before. Once you move into deep learning, libraries such as PyTorch also depend on Python, so learning Python is essential.
I recommend 《数学之美》 (The Beauty of Mathematics). It is written as lively, accessible popular science, and I'm sure you won't find it dry. I recommend it strongly: I believe the real reason to do research is interest, not utilitarian rewards.
Next, 《统计自然语言处理基础》 (Foundations of Statistical Natural Language Processing). This book is really old by now, but it is a classic; read it or not as you like.
Modern NLP relies heavily on statistics, so I very, very strongly recommend Li Hang's 《统计学习方法》 (Statistical Learning Methods). Li Hang wrote it in his spare time over seven years, with PhD students reviewing drafts. NLP is different from machine learning research: machine learning leans on rigorous mathematics and derivations to create one algorithm after another, while NLP mostly takes what the machine learning experts have created and uses it as a tool. So for getting started you only need broad coverage: understand how each model works, without necessarily working through every derivation.
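To make the "use models as tools" point concrete, here is a toy multinomial Naive Bayes text classifier written from scratch (my own minimal sketch, not taken from any of the books above; the training data is made up):

```python
import math
from collections import Counter, defaultdict

# Toy training data: (tokens, label). In practice you would use a real corpus.
train = [
    (["good", "great", "movie"], "pos"),
    (["great", "plot", "good"], "pos"),
    (["bad", "boring", "movie"], "neg"),
    (["boring", "bad", "plot"], "neg"),
]

# Count class priors and per-class word frequencies.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for tokens, label in train:
    word_counts[label].update(tokens)
    vocab.update(tokens)

def predict(tokens):
    """Pick the class maximizing log P(c) + sum log P(w|c), with add-one smoothing."""
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(["good", "plot"]))  # → pos
```

Twenty-odd lines, and the model itself is just counting plus Bayes' rule; that is the level of understanding that is enough for getting started.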
Prof. Zong Chengqing's 《统计自然语言处理(第2版)》 (Statistical Natural Language Processing, 2nd edition, in the Chinese Information Processing series, the blue-covered one) is also very good.
Then there are the Stanford open courses, which do require a certain level of English. I think the Coursera lectures are better than those of most Chinese instructors.
For example:
http://www.ark.cs.cmu.edu/LS2/in...
or
http://www.stanford.edu/class/cs...
Before doing engineering work, first search for existing tools; don't start from scratch. Before doing research, do a thorough survey!
Now some toolkit recommendations:
For Chinese, the obvious choice is LTP (Language Technology Platform), the open-source toolkit developed by HIT-SCIR (the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology).
For English (Python):
- pattern - simpler to get started than NLTK
- chardet - character encoding detection
- pyenchant - easy access to dictionaries
- scikit-learn - has support for text classification
- unidecode - because ASCII is much easier to deal with
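To give a feel for the last item, here is a rough stdlib-only approximation of what unidecode does (this uses `unicodedata` and only strips diacritics; the real unidecode package also transliterates many scripts that this sketch cannot handle):

```python
import unicodedata

def ascii_fold(text):
    """Crude stand-in for unidecode: decompose to NFKD, drop combining marks,
    then drop anything that still isn't ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("naïve café résumé"))  # → naive cafe resume
```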
I hope you can get a good grasp of the following tools:
CRF++
GIZA
Word2Vec
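Once you have vectors from a tool like Word2Vec, the most common basic operation is cosine similarity between them. A self-contained sketch with made-up 3-dimensional vectors (real embeddings are learned from a corpus and have hundreds of dimensions):

```python
import math

# Hypothetical tiny embeddings; real word2vec vectors come from training on a corpus.
vectors = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words should score higher than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))  # → True
```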
Remember Digimon from childhood? Each cute Digimon evolves because of things that happen to its human partner, and I think the NLP field works the same way. Here is my own summary of the traits of each stage, and how to level up from it.
1. Baby stage: NLP is awesome, I know nothing, but I really want to get better
Suggestion... go watch the open courses, and do the Kaggle sentiment analysis competition.
2. Rookie stage: simple models feel too naive; the fancy ones must be better
At this stage you should implement some advanced (or rather, commonly used) algorithms by hand: LDA, SVM, logistic regression. Embrace Kaggle and learn how important tricks are in this field. Now that pretrained models and Transformers exist, you must master both thoroughly. How thoroughly? You should be able to say off the top of your head how BERT Base's parameter count is derived.
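As for reciting BERT Base's parameter count, the arithmetic can be spelled out directly from the published hyperparameters (12 layers, hidden size 768, FFN size 3072, vocab 30522, 512 positions, 2 segment types):

```python
V, H, L, F, P, T = 30522, 768, 12, 3072, 512, 2  # vocab, hidden, layers, FFN, positions, segment types

# Embeddings: token + position + segment tables, plus one LayerNorm (gamma + beta).
embeddings = V * H + P * H + T * H + 2 * H

# One encoder layer: Q/K/V/output projections (weights + biases),
# FFN up/down projections, and two LayerNorms.
attention = 4 * (H * H + H)
ffn = (H * F + F) + (F * H + H)
layer = attention + ffn + 2 * (2 * H)

# Pooler: one dense H x H layer applied to the [CLS] position.
pooler = H * H + H

total = embeddings + L * layer + pooler
print(total)  # → 109482240, i.e. the familiar "110M"
```

Being able to walk through this breakdown (embeddings ≈ 24M, 12 layers ≈ 85M, pooler ≈ 0.6M) is exactly the kind of fluency the stage above demands.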
3. Champion stage: the fancy models don't work; only feature engineering plus rules work
Most people are probably at this level, myself included. I keep wanting to evolve, but my accumulated knowledge isn't enough yet. You feel that the fancy models exist only so someone could write a paper, and that the humble down-to-earth methods are the true "heavy sword with no edge, great skill without artifice". At this stage, the path forward is to keep reading papers and studying model variants; computing sentence similarity with word2vec cosine no longer satisfies you.
4. Ultimate stage: making some fancy model actually work on a public dataset
Probably only a minority of PhDs get here. I honestly don't know how to improve beyond this level; perhaps all one can say is: stay true to your original aspiration.
5. Mega stage: see Michael Jordan and Andrew Ng.
Exercise and stay healthy, so you can hold the Mega form longer.
I hope you can understand the basic NLP pipeline: word segmentation => POS tagging => parsing.
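The pipeline can be sketched as three functions chained together. These are deliberately naive rule-based stand-ins of my own, just to show the shape of the pipeline; real systems use statistical models at every step:

```python
import re

def segment(text):
    """Toy word segmentation: for English, just split on non-word characters."""
    return [t for t in re.split(r"\W+", text) if t]

def pos_tag(tokens):
    """Toy POS tagging with a hand-written lexicon; real taggers are statistical."""
    lexicon = {"the": "DET", "cat": "NOUN", "sat": "VERB", "mat": "NOUN", "on": "ADP"}
    return [(t, lexicon.get(t.lower(), "X")) for t in tokens]

def parse(tagged):
    """Toy 'parser': just group DET+NOUN pairs into NP chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "DET" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            chunks.append(("NP", tagged[i][0], tagged[i + 1][0]))
            i += 2
        else:
            chunks.append(tagged[i])
            i += 1
    return chunks

print(parse(pos_tag(segment("The cat sat on the mat"))))
# → [('NP', 'The', 'cat'), ('sat', 'VERB'), ('on', 'ADP'), ('NP', 'the', 'mat')]
```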
NLP papers recommended on Quora (excerpted from Quora; the notes in parentheses briefly explain each paper):
Parsing (syntactic structure analysis; heavy on linguistics, so it can be a bit dry)
- Klein & Manning: "Accurate Unlexicalized Parsing"
- Klein & Manning: "Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency" (a revolutionary use of unsupervised learning to build a parser)
- Nivre: "Deterministic Dependency Parsing of English Text" (shows that deterministic parsing actually works quite well)
- McDonald et al.: "Non-Projective Dependency Parsing using Spanning-Tree Algorithms" (the other main method of dependency parsing, MST parsing)
Machine Translation (skip this if you don't work on MT, though translation models have applications in other areas too)
- Knight: "A statistical MT tutorial workbook" (easy to understand, use instead of the original Brown paper)
- Och: "The Alignment-Template Approach to Statistical Machine Translation" (foundations of phrase based systems)
- Wu: "Inversion Transduction Grammars and the Bilingual Parsing of Parallel Corpora" (arguably the first realistic method for biparsing, which is used in many systems)
- Chiang: "Hierarchical Phrase-Based Translation" (significantly improves accuracy by allowing for gappy phrases)
Language Modeling
- Goodman: "A bit of progress in language modeling" (a survey covering just about everything related to n-gram language models, including smoothing and clustering)
- Teh: "A Bayesian interpretation of Interpolated Kneser-Ney" (shows how to get state-of-the-art accuracy in a Bayesian framework, opening the path for other applications)
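To get a feel for what Goodman's survey covers, here is a minimal bigram language model with add-one (Laplace) smoothing, the simplest of the smoothing methods discussed in that literature (toy corpus of my own):

```python
import math
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams, bigrams = Counter(), Counter()
vocab = set()
for sent in corpus:
    vocab.update(sent)
    unigrams.update(sent[:-1])           # history counts (everything that precedes a word)
    bigrams.update(zip(sent, sent[1:]))  # adjacent word pairs

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def sentence_logprob(sent):
    return sum(math.log(bigram_prob(p, w)) for p, w in zip(sent, sent[1:]))

# A word order seen in training should score higher than an unseen scramble,
# but smoothing keeps the unseen one from getting probability zero.
print(sentence_logprob(["<s>", "the", "cat", "sat", "</s>"]))
print(sentence_logprob(["<s>", "cat", "the", "sat", "</s>"]))
```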
Machine Learning for NLP
- Sutton & McCallum: "An introduction to conditional random fields for relational learning" (CRFs are just so useful in NLP!!! And as we all know there are plenty of ready-made tools implementing them; this is a simple paper explaining CRFs, though it still gets fairly mathematical)
- Knight: "Bayesian Inference with Tears" (explains the general idea of Bayesian techniques quite well)
- Berg-Kirkpatrick et al.: "Painless Unsupervised Learning with Features" (this is from this year and thus a bit of a gamble, but this has the potential to bring the power of discriminative methods to unsupervised learning)
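Under the hood, CRF tools like CRF++ decode the best tag sequence with the Viterbi algorithm. A minimal sketch with made-up scores for a hypothetical two-tag problem (in a real CRF these scores come from learned feature weights):

```python
# Hypothetical scores for a 2-tag (NOUN/VERB) sequence labeling problem.
tags = ["NOUN", "VERB"]
emit = [  # emit[t][tag]: score of tag at position t
    {"NOUN": 2.0, "VERB": 0.5},
    {"NOUN": 0.5, "VERB": 2.0},
    {"NOUN": 1.5, "VERB": 1.0},
]
trans = {  # trans[(prev, cur)]: transition score
    ("NOUN", "NOUN"): 0.2, ("NOUN", "VERB"): 1.0,
    ("VERB", "NOUN"): 1.0, ("VERB", "VERB"): 0.2,
}

def viterbi(emit, trans, tags):
    """Dynamic program: keep the best score (and backpointer) per tag per step."""
    score = {tag: emit[0][tag] for tag in tags}
    back = []
    for t in range(1, len(emit)):
        new_score, pointers = {}, {}
        for cur in tags:
            prev_best = max(tags, key=lambda p: score[p] + trans[(p, cur)])
            new_score[cur] = score[prev_best] + trans[(prev_best, cur)] + emit[t][cur]
            pointers[cur] = prev_best
        score, back = new_score, back + [pointers]
    # Trace back from the best final tag.
    best = max(tags, key=score.get)
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(emit, trans, tags))  # → ['NOUN', 'VERB', 'NOUN']
```

The point of the dynamic program is that it finds the globally best sequence without enumerating all 2^n tag combinations.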
Information Extraction
- Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING 1992. (The very first paper for all the bootstrapping methods in NLP. It is hypothetical work in the sense that it gives no experimental results, but it influenced its followers a lot.)
- Collins and Singer. Unsupervised Models for Named Entity Classification. EMNLP 1999. (It applies several co-training-like IE methods to the NER task and motivates why. Students can learn from this work the logic of writing a good NLP research paper.)
Computational Semantics
- Gildea and Jurafsky. Automatic Labeling of Semantic Roles. Computational Linguistics 2002. (It opened up the semantic role labeling line of work in NLP, followed by several CoNLL shared tasks dedicated to SRL. It shows how linguistics and engineering can collaborate. It has a shorter version in ACL 2000.)
- Pantel and Lin. Discovering Word Senses from Text. KDD 2002. (Supervised WSD was explored a lot in the early 00's thanks to the Senseval workshops, but few systems actually benefit from WSD because manually crafted sense mappings are hard to obtain. These days we see a lot of evidence that unsupervised clustering improves NLP tasks such as NER, parsing, SRL, etc.)
Actually, I suspect most of you are more interested in higher-level applications than in how to implement word segmentation, named entity recognition, and so on; and probably more interested in information retrieval. But NLP and information retrieval are still different fields, so I won't cover that here.