Classic papers will be tweeted...
You have probably come across my "ex-libris of a data scientist" series. The six-part series covered mostly books (part I: introduction, data and databases; part II: models; part III: technology; part IV: code; part V: visualization; part VI: communications). Here and there, I also mentioned or linked to a few classic papers, especially in the database section.
I've also started posting some classic papers on twitter, as an experiment (3-4 at most per week). The papers run the gamut of subjects, from how short-term memory works to what a perceptron is. If you are not following me on twitter, or you are looking for a condensed document instead of a thread, here's the activity on that thread for January 2019 (pretty much the text from twitter as is, formatted for LinkedIn):
I'll be sharing some classic papers on twitter this year that are very much related to data science.
Week 1
Let's start the year with some papers on cognition and our ability to process our visual field. The first one is from 1956: "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" by George Miller:
The above paper covers the capacity of short-term memory.
The next one covers its duration: "Short-term retention of individual verbal items" by Peterson and Peterson, from 1959.
These should inform our decisions when it comes to communicating information, whether as a table, visualization, semi-graphic display, video or even audio. We can only chunk so much information at once, and more than likely won't retain it for very long without emphasis and repetition.
Week 2
A new week, some new (to some of you) classic papers. Today we take on the subject of correlation and causation.
"The environment and disease: association or causation?" from 1965, by Sir Austin Bradford Hill:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1898525/
The paper talks about strength, consistency, specificity, temporality and five more viewpoints (biological gradient, plausibility, coherence, experiment and analogy).
If the full implications of the above are not obvious, consider this much more recent paper: "The missed lessons of Sir Austin Bradford Hill" by Phillips and Goodman:
The other classic paper on the subject I want to introduce is Sewall Wright's "Correlation and Causation", written in 1921 (it took me a while to find a publicly available copy online).
Nearly a century old, but coming back in style.
Week 3
Another week, more classic data science related papers. Today I'll cover database papers.
The first classic #database paper I want to introduce is a paper Edgar Codd published in 1972. It starts with the following sentence:
"In the near future, we can expect a great variety of languages to be proposed for interrogating and updating databases".
It might seem obvious now, but it wasn't at the time. In this paper, Codd describes the Alpha relational language.
It is found here ("Relational Completeness of Data Base Sublanguages"): https://pdfs.semanticscholar.org/6a04/8dc38250ffce49c5e6a5040b4c91ca05e83d.pdf
What's interesting is that he worked on Alpha while at IBM, but the company was not too keen on the idea, as it would chip away at their existing products. Still, it was adapted into a product, with a language called SEQUEL. The rest is history...
The other classic #database paper I want to introduce is about Postgres. Now, the first paper on Postgres is from 1986, and Postgres itself originates from INGRES, whose first paper dates from 1976. But the classic reading is the one from 1990. #databases #paper
It is "The Implementation of Postgres" (1990), IEEE Transactions Knowledge and Data Engineering, M. Stonebraker. And it is available online at:
http://db.cs.berkeley.edu/papers/ERL-M90-34.pdf
#datascience #databases #paper
If you are interested in more classic papers on databases (and books), you might want to read my article on LinkedIn:
Papers are found toward the end. #datascience #books #papers #databases #data
Week 4
Another week, more data science papers. I've held off long enough on neural networks. But please, use them responsibly. Neural networks and deep learning are for when the simpler stuff fails, not a drop-in replacement for everything... #datascience
So. We have to start with the classic 1943 paper by McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity". This is where it really starts.
http://www.cse.chalmers.se/~coquand/AUTOMATA/mcp.pdf
Before I go to the next paper, I would also suggest reading the book "The Organization of Behavior" by Donald Hebb (1949).
The next one is fundamental. It is one of the most important papers you'll ever read (along with Shannon's work):
"The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" by Frank Rosenblatt (1959)
You might not know it, but Rosenblatt pretty much laid out the concept of back propagation, and it would be rediscovered several times over the years.
Typically, the next paper people go to is one from the 1988 Connectionist Models Summer School. Instead, let me recommend Mikel Olazaran's "A Sociological History of the Neural Network Controversy" (Advances in Computers, 1993, starting at p. 335).
And you don't have to buy the book or hunt it down in a library, since it is also available online:
One of the better surveys you'll read on neural networks and the early work. It also includes six pages' worth of references!
#datascience #ai #NeuralNetworks
Week 5
Another week, and we are continuing this thread with more classic papers related to data science. This time, let's talk about software.
In a previous episode I provided a link to my ex-libris article on technology (https://www.linkedin.com/pulse/ex-libris-data-scientist-part-iii-technology-francois-dion/), which covered some essential papers by Claude Shannon and Alan Turing: a fascinating look at logic circuits built from relays, and at tape machines. But I digress. What about software?
Software as an entity, separate from hardware, wasn't really a thing in the 30s and 40s. For the popularization of the "software" concept, we turn to John W. Tukey's 1958 paper "The Teaching of Concrete Mathematics": https://www.maa.org/sites/default/files/pdf/CUPM/first_40years/1958-65Tukey.pdf
This is a particularly visionary paper by Tukey, and it clearly shows his polymath background, especially his view on teaching programming to as many people as possible, even if they don't understand how it all works behind the scenes. That was a revolutionary thought at the time.
And you thought this would be a waterfall vs agile list of papers...
Continuing with classic data science related papers, we jump from 1958 to 1963: "On the computational complexity of algorithms" by Hartmanis and Stearns, for which they received the 1993 ACM Turing Award.
Their paper, the foundation of computational complexity theory, can be found at:
The third classic paper related to #datascience I'll share this week came out in 1971, and it brings up an important concept in software design: David L. Parnas published "On the Criteria To Be Used in Decomposing Systems into Modules". Not 2011, not 2001: 1971.
In it, he suggests "a mechanism for improving the flexibility and comprehensibility of a system while allowing the shortening of its development time".
You can read it here:
And the last classic paper I'll share this week is from John Backus, from 1977. The title? "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs". You can read it here:
#datascience related classic papers