What’s the Deep Learning Revolution?

Deep learning is receiving enormous attention after some recent breakthrough results. But what exactly happened to get us here?

The revolution (i.e., a real step-change) happening at the moment is an effect of a number of developments:

1. Improvements in hardware acceleration and commoditization of GPU access: electrical engineers have increased computation speed by providing GPUs like NVIDIA K40 and the associated CUDA parallel programming paradigm. Furthermore, clusters of GPUs are now available for rent on demand in the cloud for affordable prices (e.g. Amazon AWS EC2).

2. Increasing RAM availability: electrical engineers have increased integration density, so that computers with many GBs or even several TB of RAM are available.

3. Big data availability: enormous amounts of data (text, images, audio, videos) are now available for training machine learning systems. Digitization of analog assets and connecting machines over the Internet amplify this effect.

4. Progress in machine learning models, representation and learning algorithms:

  • Deep Belief Networks (DBN) (Hinton et al. 2006) and greedy pre-training: a disruptive change occurred when Hinton and co-workers discovered more effective and efficient training procedures for multi-layer neural networks, based on pre-training individual layers in a greedy and unsupervised fashion (via Restricted Boltzmann Machine (RBM) learning) (Bengio 2009: 6). Auto-Encoders, a.k.a. Auto-Associators or Diabolo networks, and Stacked Auto-Encoders (Bengio 2009: 45-47) are another family of models, discovered shortly after DBNs, that likewise exploit unsupervised training applied locally to intermediate layers of representation;
  • Word embeddings (Weston et al. 2008; Collobert and Weston 2008) encode word meaning and context as vectors derived from a word's co-text, and achieve competitive results in natural language processing tasks;
  • the Neocognitron and its feed-forward descendants, Convolutional Neural Networks (CNN) (Bengio 2009: 43-45), are families of models capable of sequence modeling (corresponding to HMM- or CRF-based sequence taggers in traditional statistical modeling);
  • Neural Network Language Models (NNLM) (Mnih and Hinton 2009) can replace traditional statistical word n-gram language models; and
  • Recursive Neural Networks (RNN) (Goller and Küchler 1996; Socher, Manning, and Ng 2010) “operate on any hierarchical structure, combining child representations into parent representations” (Wikipedia). They are a generalization of the longer-known Recurrent Neural Networks: a recursive network over a simpler structure, namely a (time-)linear chain, is a Recurrent NN, which combines the previous time step's hidden representation with the current input into the representation for the current time step in a feedback loop. Socher applied recursive networks to parsing, to learning relations between images and text, and to sentiment analysis.

In a previous revolution, when statistical methods led to a paradigm shift in natural language processing in the 1990s, the trigger was the success of statistics in speech recognition, pursued by electrical engineers and computer scientists. This time, the breakthrough originates with machine learning researchers from within computer science.
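To make the word-embedding idea above concrete, here is a toy sketch (the vocabulary and vector values are invented for illustration; real embeddings are learned from data and have hundreds of dimensions):

```python
import math

# Hypothetical 4-dimensional embeddings; values are made up for illustration.
embeddings = {
    "king":  [0.8, 0.1, 0.7, 0.2],
    "queen": [0.7, 0.2, 0.8, 0.2],
    "car":   [0.1, 0.9, 0.0, 0.6],
}

def cosine(u, v):
    """Cosine similarity: dot product normalized by vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(a * a for a in w))
    return dot / (norm(u) * norm(v))

sim_royal = cosine(embeddings["king"], embeddings["queen"])
sim_other = cosine(embeddings["king"], embeddings["car"])
```

With learned embeddings, semantically related words like “king” and “queen” end up closer in vector space than unrelated ones, which is what makes such representations useful as features for NLP tasks.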


Bengio, Y. (2009) “Learning Deep Architectures for AI” Foundations and Trends in Machine Learning 2(1)

G. E. Hinton, S. Osindero, and Y. Teh (2006) “A fast learning algorithm for deep belief nets,” Neural Computation 18, pp. 1527–1554

R. Collobert and J. Weston (2008)  “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), (W. W. Cohen, A. McCallum, and S. T. Roweis, eds.), pp. 160–167, ACM

C. Goller and A. Küchler (1996) “Learning task-dependent distributed representations by backpropagation through structure,” Proceedings of the IEEE International Conference on Neural Networks

A. Mnih and G. E. Hinton (2009) “A scalable hierarchical distributed language model,” in Advances in Neural Information Processing Systems 21 (NIPS’08), (D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds.), pp. 1081–1088

R. Socher, C. D. Manning, A. Y. Ng (2010) “Learning continuous phrase representations and syntactic parsing with recursive neural networks.” Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, 1-9.

J. Weston, F. Ratle, and R. Collobert (2008) “Deep learning via semi-supervised embedding,” in Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), (W. W. Cohen, A. McCallum, and S. T. Roweis, eds.), pp. 1168–1175, New York, NY, USA: ACM

Germany and ‘Elite’ Universities?

Germany’s government is trying to move away from what I have long considered the main strength of its academic system, namely that wherever you go, you get a very decent education, by instead fostering a smaller number of “excellence” universities (in computer science, these are LMU Munich, KIT (Karlsruhe), RWTH Aachen and a few others).

A lot of money was pumped into that effort, and the results are now beginning to be assessed by the government and reported in a recent Nature article.

Now I’m not saying excellence isn’t a good thing, but it’s even better if a country can excel on top of retaining the very strong broad baseline it has established in the past,
especially given that tertiary education in Germany is still free of charge, or pretty close to it (depending on the Bundesland).

Interestingly, many universities that didn’t get the additional “excellence” money also became more excellent, so the German federal government’s self-assessment needs to be careful to craft a message consistent with the actual numbers (a sanity check that every experienced reviewer carries out before
pressing the ‘submit’ button).

According to Nature (elsewhere), Johanna Wanka, Germany’s federal research and education minister, is considering revisiting one of the main reasons why many German academics have fled to other countries, notably the US, to seek “academic asylum” at places like MIT: a law that limits to 12 years (6 before the PhD, and another 6 after) the time that scientists can be employed before they must leave their institution.
This rule, originally intended to create competitiveness and to avoid nepotism, means that researchers at German universities, unlike counter staff at a petrol station, say, never gain permanent employment, except for the select few who are able to obtain professorial posts. In my personal opinion, this has driven top talent out of the country and has contributed more to ‘brain drain’ than the limited funding. (This is a slightly simplified account; in reality, curiously, people have creatively circumvented the official rule in certain cases by retaining former role titles like Akademischer Rat, just as the Habilitation was never fully replaced by the Juniorprofessur, so the official rules and the de-facto rules differ.)

So I hope Ms. Wanka gets enough support so that this law can become history. And it may well be that one smart rule change has as big an effect as a billion in additional research funding.

Eugene Garfield and the Journal Impact Factor: A Man and His Metric

What is Scientometrics?

Scientometrics is the scholarly study of analysing and quantifying scientific, technological and innovation output.
Research questions include ‘How do we measure the impact of a publication?’, ‘How impactful is a journal?’, and ‘How impactful is an institution?’ Scientometrics involves gathering and processing scientific citations, mapping scientific fields, and producing indicators (metrics) for use in science policy and in managing research groups and institutions (founding institutions, staffing for emerging fields). Scientometrics is also a learned journal covering the field.

The Man

Eugene “Gene” Garfield was born in New York on 1925-09-16 and grew up in an Italian-Lithuanian family with orthodox-Jewish religious influence. Garfield obtained a PhD in Structural Linguistics from the University of Pennsylvania and founded ISI (now part of Thomson Reuters) in Philadelphia, Pennsylvania in 1955.

He is one of the founders of scientometrics, and he is responsible for many bibliographic products, including Current Contents, the Science Citation Index (SCI), and other citation databases, the Journal Citation Reports, and Index Chemicus. He is the founding editor and publisher of The Scientist, a news magazine for life scientists.

In 1955 he embarked on the development of a computerized citation index showing the spread of scientific thinking via citation graphs (see below).

Garfield’s work pre-dates several IR ranking methods, such as HITS or PageRank, that are informed by “citations” between Web pages connected by hyperlinks.
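As a toy illustration of the citation-graph idea underlying PageRank, here is a minimal power-iteration sketch (the link graph and parameter values are invented for illustration):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration on a small link graph.
    links maps each node to the list of nodes it links to (i.e. 'cites')."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}           # start uniform
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}    # teleportation mass
        for u in nodes:
            out = links[u]
            if out:
                share = d * rank[u] / len(out)   # spread rank over out-links
                for v in out:
                    new[v] += share
            else:                                # dangling node: spread evenly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# Hypothetical mini-Web: a cites b and c, b cites c, c cites a.
r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The analogy to citation indexing is direct: a page (or paper) gains rank by being linked (cited) by other pages, weighted by how highly ranked the citing pages themselves are.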

Watch this series of videos interviewing Eugene Garfield at the Web of Stories (requires Flash).

The Metric

Dr. Garfield founded the Institute for Scientific Information (ISI), and its product, the ISI database – now called the Thomson Reuters Web of Science® – constitutes the basis for the Journal Impact Factor (JIF), a metric to assess the quality of learned publications based on citation indexing, as laid out in Garfield (1955).

The journal impact factor was originally developed to help select journals for the Science Citation Index (SCI): having an objective metric that can be calculated by machine, so it was reckoned, could make the selection less prone to subjective bias.

The JIF for a journal is defined as the ratio between the number of citations in the current year to any items published in that journal in the previous two years, and the number of substantive articles (source items) published in that same two-year window.
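In code, the definition above amounts to a simple ratio (the numbers in the example are invented for illustration):

```python
def journal_impact_factor(citations_to_prev_two_years, source_items_prev_two_years):
    """JIF for year Y: citations received in Y to items the journal published
    in Y-1 and Y-2, divided by the number of substantive articles (source
    items) the journal published in Y-1 and Y-2."""
    return citations_to_prev_two_years / source_items_prev_two_years

# Hypothetical journal: 200 citations in 2015 to its 2013-2014 items,
# of which there were 80 source items.
jif = journal_impact_factor(200, 80)  # 2.5
```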

Nowadays, the JIF is highly valued by publishers to advertise their learned publications to customers such as university librarians. However, it has also been abused by governments and research institutions to assess individuals; arguably, it is not the fault of a formula to be used for unintended purposes, and as you can see in the video, Dr. Garfield himself explicitly cautions against this use.

Using the citation graph to compute impact is a clever idea. If you want to see a very retro explanation, consider the YouTube video below.


Garfield, Eugene (1999). “Journal impact factor: a brief review”. Canadian Medical Association Journal 161: 979–980

Garfield, Eugene (1955). “Citation indexes for science…” Science 122 (3159): 108–111

My Tools

Operating Systems: Atari TOS/GEM, MS DOS/PC DOS, Microsoft Windows 3.11/NT/XP/7, Apple MacOS 9/X, (Slackware/SuSE/Fedora/Red Hat/Ubuntu) Linux (and X11), SunOS/Solaris, HP-UX 9.03/10/11, Sequent DYNIX, Apple iOS, Google Android
Programming languages: BASIC/VisualBasic/VBA/GFA Basic, MOS6510 assembler (also MC68000, i860, ARM), PASCAL, Forth, Modula-2, LOGO, FORTRAN IV/95, LISP/Scheme, SAP ABAP/4, JavaScript, Python, Perl, Java, C, C++, C#, PROLOG, R, MATLAB/Octave, Scala
Typesetting/word processing: LaTeX/beamer, Word, PowerPoint, Keynote
Drawing tools: xfig, gnuplot, Paint, OmniGraffle
E-Mail tools: Outlook, Mutt, VM, Elm, Entourage, PINE, Pegasus, Zimbra
Communication tools: talk, Newsbeuter, IRSSI/IRC, Skype, WebEx, Jabber, USENET/gnus
Text editors: XEmacs, Sublime, Atom, Notepad++, vi
Databases: PostgreSQL, MySQL, Oracle 8, SQL Server, MS Access, SQLite, BerkeleyDB, MongoDB, CouchDB
Search engines/search libraries: MG, Lemur, Terrier, Lucene, Solr, ElasticSearch
Integrated development environments: Netbeans, Eclipse, Microsoft Visual IDE, XCode, IntelliJ, XEmacs
Documentation tools: Mediawiki, Confluence, Docuwiki, WordPress, JIVE, Doxygen, JavaDoc
Requirements capture/agile tools: JIRA, Balsamiq, WireframeSketcher
Revision control tools: Perforce, Subversion, CVS, RCS, Mercurial/BitBucket, Git/GitHub
Web browsers: Chrome/Chromium, Firefox, Netscape Communicator, Mozilla, lynx, Safari, Internet Explorer
Debugging/testing tools: Purify, JUnit
Build tools: Make, Ant, Maven, sbt, CMake, configure, Hudson/Jenkins
Software architecture/analysis/design tools: ArgoUML, Dia, Borland TogetherJ, OmniGraffle, PowerPoint
Project tools: Microsoft Project, OmniPlan, Merlin
Big data processing tools: Hadoop, Spark, Sun Grid Engine
Special purpose software: SAP R/3, ESRI ArcGIS

List of Natural Language Processing, Machine Learning & Search Blogs

Natural Language Processing Blogs

Information Retrieval Blogs

Machine Learning Blogs

Mixed Content Blogs (NLP, IR, ML & co.)

(Please share other should-read blogs from the areas above that might be missing from this list.)

Contextual Bandits

It is well known that statistics owes a lot to gambling. One-armed bandits as they are commonly encountered in Las Vegas have inspired a statistical model (H. Robbins, 1952) that is commonly applied to ranking search results, online ads and to making recommendations.

In a K-armed bandit, we assume a decision maker that receives a reward r_t at time t, which is a function of the state (the chosen arm) as well as a random element. We want to maximize the expected discounted reward E[ Σ_t γ^t r_t ] following a policy π, given a discount factor γ.
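A minimal simulation of this setup with an ε-greedy policy can be sketched as follows (the arm means, Gaussian noise model, and all names are hypothetical; an undiscounted running average stands in for the discounted objective):

```python
import random

def run_bandit(true_means, epsilon=0.1, steps=10000, seed=0):
    """Simulate an epsilon-greedy policy on a K-armed bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # pulls per arm
    estimates = [0.0] * k     # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit
        reward = true_means[arm] + rng.gauss(0, 1)            # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total / steps

# Three hypothetical arms with mean rewards 0.1, 0.5 and 0.9.
est, avg = run_bandit([0.1, 0.5, 0.9])
```

After enough pulls the estimate for the best arm dominates, and the policy spends most of its time exploiting it while still exploring occasionally, which is the exploration-exploitation trade-off the bandit literature studies.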


Some older books like Berry and Fristedt’s Bandit Problems: Sequential Allocation of Experiments (1985) provide readable introductions (pp. 1-7, it also has a 50+ page annotated bibliography). White’s recent book Bandit Algorithms for Website Optimization gives a more programmer-friendly introduction for the Python-inclined. Check out this NIPS paper by Hofmann, Whiteson and de Rijke (2011) for IR applications.

Sergey Feldman has a post about applying Contextual Bandits to personalization.

When Will Devices Be Cable-Free?

Computers, digital cameras, and mobile phones are getting faster, cheaper and more powerful.

Yet despite the apparent progress, the cable is alive and kicking. Each generation of devices usually brings its own plugs and comes with an incompatible charger, much to the annoyance of frequently traveling business people (and scientists).

Thoughts From A Curious Traveler in Space-Time