Classic papers will be tweeted...
You have probably come across my "ex-libris of a data scientist" series. The six-part series covered mostly books (part I: introduction, data and databases; part II: models; part III: technology; part IV: code; part V: visualization; part VI: communications). Here and there, I also mentioned or linked to a few classic papers, especially in the database section.
I've also started posting some classic papers on twitter, as an experiment (3-4 at most per week). The papers run the gamut of subjects, from how short-term memory works to what a perceptron is. If you are not following me on twitter, or you are looking for a condensed document instead of a thread, here's the activity on that thread for January 2019 (pretty much the text from twitter as is, formatted for LinkedIn):
I'll be sharing some classic papers on twitter this year that are very much related to data science.
Week 1
Let's start the year with some papers on cognition and our ability to process our visual field. The first one is from 1956: "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" by George Miller:
The above paper covers the capacity of short-term memory.
The next one covers its duration: "Short-term retention of individual verbal items" by Peterson and Peterson, from 1959.
These should inform our decisions when it comes to communicating information, whether as a table, visualization, semi-graphic display, video or even audio. We can only chunk so much information at once, and more than likely won't retain it for very long without emphasis and repetition.
Week 2
A new week, some new (to some of you) classic papers. Today we take on the subject of correlation and causation.
"The environment and disease: association or causation?" from 1965, by Sir Austin Bradford Hill:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1898525/
The paper talks about strength, consistency, specificity, temporality and five more viewpoints (biological gradient, plausibility, coherence, experiment and analogy).
If the full implications of the above are not obvious, consider this much more recent paper: "The missed lessons of Sir Austin Bradford Hill" by Phillips and Goodman:
The other classic paper on the subject I want to introduce is Sewall Wright's "Correlation and Causation", written in 1921 (it took me a while to find a publicly available copy online).
Nearly a century old, but coming back in style.
Week 3
Another week, more classic data science related papers. Today I'll cover database papers.
The first classic #database paper I want to introduce is a paper Edgar Codd published in 1972. It starts with the following sentence:
"In the near future, we can expect a great variety of languages to be proposed for interrogating and updating databases".
It might seem obvious now, but it wasn't at the time. In this paper, Codd describes the Alpha relational language.
It is found here ("Relational Completeness of Data Base Sublanguages"): https://pdfs.semanticscholar.org/6a04/8dc38250ffce49c5e6a5040b4c91ca05e83d.pdf
What's interesting is that he worked on Alpha while at IBM, but the company was not too keen on the idea, as it would chip away at their existing products. Still, it was adapted into a product, with a language called SEQUEL. The rest is history...
The other classic #database paper I want to introduce is about Postgres. Now, the first paper on Postgres is from 1986, and Postgres itself originates from INGRES, whose first paper dates from 1976. But the classic reading is the one from 1990. #databases #paper
It is "The Implementation of Postgres" (1990), IEEE Transactions Knowledge and Data Engineering, M. Stonebraker. And it is available online at:
http://db.cs.berkeley.edu/papers/ERL-M90-34.pdf
#datascience #databases #paper
If you are interested in more classic papers on databases (and books), you might want to read my article on LinkedIn:
Papers are found toward the end. #datascience #books #papers #databases #data
Week 4
Another week, more data science papers. I've held off long enough on neural networks. But please, use them responsibly. Neural networks and deep learning are for when the simpler stuff fails, not a drop-in replacement for everything... #datascience
So. We have to start with the classic 1943 paper by McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity". This is where it really starts.
http://www.cse.chalmers.se/~coquand/AUTOMATA/mcp.pdf
Before I go to the next paper, I would also suggest reading the book "The Organization of Behavior" by Donald Hebb (1949).
The next one is fundamental. It is one of the most important papers you'll ever read (along with Shannon's work):
"The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" by Frank Rosenblatt (1959)
You might not know it, but Rosenblatt pretty much laid out the concept of back propagation, and it would be rediscovered several times over the years.
Typically, the next paper people go to is one from the 1988 Connectionist Models Summer School. Instead, let me recommend Mikel Olazaran's "A Sociological History of the Neural Network Controversy" (Advances in Computers, 1993, starting at p. 335).
And you don't have to buy the book or hunt it down in a library, since it is also available online:
One of the better surveys you'll read on neural networks and the early work. It also includes six pages' worth of references!
#datascience #ai #NeuralNetworks
Week 5
Another week, and we are continuing this thread with more classic papers related to data science. This time, let's talk about software.
In a previous episode I provided a link to my ex-libris article on technology (https://www.linkedin.com/pulse/ex-libris-data-scientist-part-iii-technology-francois-dion/), which covered some essential papers by Claude Shannon and Alan Turing: a fascinating look at logic circuits built from relays, and at tape machines. But I digress. What about software?
Software as an entity, separate from hardware, wasn't really a thing in the 30s and 40s. For the popularization of the "software" concept, we turn to John W. Tukey's 1958 paper "The Teaching of Concrete Mathematics": https://www.maa.org/sites/default/files/pdf/CUPM/first_40years/1958-65Tukey.pdf
This is a particularly visionary paper by Tukey, and it clearly shows his polymath background, especially his view on teaching programming to as many people as possible, even if they don't understand how it all works behind the scenes. That was a revolutionary thought at the time.
And you thought this would be a waterfall vs agile list of papers...
Continuing with classic data science related papers, we jump from 1958 to 1963: "On the computational complexity of algorithms" by Hartmanis and Stearns, for which they received the 1993 ACM Turing Award.
Their paper, the foundation of computational complexity theory, can be found at:
The third classic paper related to #datascience I'll share this week came out in 1971, and it brings up an important concept in software design: David L. Parnas published "On the Criteria To Be Used in Decomposing Systems into Modules". Not 2011, not 2001: 1971.
In it, he suggests "a mechanism for improving the flexibility and comprehensibility of a system while allowing the shortening of its development time".
You can read it here:
And the last classic paper I'll share this week is from John Backus, from 1977. The title? "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs". You can read it here:
#datascience related classic papers