Data science is still a very fuzzy term: some claim it is just a nice name for statistics, others say it is the new business intelligence but for big data, and still others believe that data science is a completely new subject. There is neither a consensus nor an official definition. It is, however, a very sexy topic at the moment, and almost every publisher now has a book on it in its catalog; O'Reilly is no exception.
Doing Data Science: Straight Talk from the Frontline tries to be an introduction to the topic without a lot of mathematical theory behind it. Instead, it aims to describe the everyday work of data scientists and to build a basic understanding of data science.
The authors are well known in the scene: Cathy O'Neil is a well-known blogger (mathbabe) with a very strong mathematical background, and Rachel Schutt teaches at Columbia University in New York. These are the right conditions for a good book on the subject: experienced experts who write very well and work in data science.
Yet this is exactly where the problem lies: the book is neither a textbook nor a novel. It feels like a collection of blog posts or a long magazine article, because some chapters are guest contributions from other experts or from students of the aforementioned course. On top of that, the book itself is a compilation of presentations and talks from the data science course at Columbia University, so the style differs somewhat from chapter to chapter. In addition, many things in the book are unfortunately either explained extremely superficially or are outright wrong.
This is a real shame, because the book has enormous potential. Even so, Doing Data Science: Straight Talk from the Frontline is on the whole a positive contribution. Anyone can understand the topics, it is entertaining, and the reading recommendations are very comprehensive. Anyone interested in data science can therefore read the book quickly and get a good overview. Afterwards, one can learn the theory behind data science from a more comprehensive textbook on the recommendation list or from “An Introduction to Statistical Learning”.
First off - this book is bloated with "historical context" of Data Science and the personal experiences of the authors. Not only do these sections contain no relevant information, they also bog down the reading process. There is no "straight" talk from the frontline in this book. Many keywords and phrases specific to Data Science are also not explained, further hindering the process of understanding this topic. The first chapters would be expected to explain what Data Science is - and fail. This is typical of the entire book. It talks about methods, but not how these methods are used to do Data Science. Statistical Inference, Exploratory Data Analysis and so on are touched upon (not thoroughly explained), but not how a Data Scientist would use them, how she would interpret the results, what part of the results would be particularly interesting and what decisions she would make based upon these results. A clear example of how a Data Scientist would tackle a particular problem, describing in detail what steps she would take and, most importantly, WHY, would have been necessary to make heads or tails of the information presented in this book.
The "exercises" in this book are a particularly shining example of all these problems: For example, the second chapter contains an exercise about a real estate house buying company and asks the reader to formulate a "data strategy" for this company, based on its website data, and to analyze this data for anything unusual... If you have no prior knowledge about real-estate house buying, you will have trouble even understanding what this company is actually doing, what separates it from its competitors and why it earns money this way. It is really that badly explained. And analyzing the company's website? How? What? Why? How many clicks the users need to get to useful information? How pleasing it is to the eye? What information would we log from this website about our users - based on... what? What is the bloody purpose of this analysis? How the authors expect the reader to solve a problem without defining this problem clearly is beyond me. That no exemplary solutions are given to these work assignments doesn't help either...
And on the next page the authors write about how important it is to ask questions when some "domain-expert" uses incomprehensible, domain-specific jargon. Yeah, that's rich...
They at least attempted to give some help here; I have to give them credit for that, for what it's worth. They provided R code for a dataset similar to those given for the exercises. But that doesn't help much. Of course we plot histograms and box plots for those variables, but what are we looking for in these graphs? What would that tell a Data Scientist and what would she do about it?
To summarize: this book completely fails to meet its set goal.
“Data Science” has become one of the most trendy research fields in recent years, as well as a catchall rubric for various job descriptions and work functions. The cynics and skeptics, and there are many of those, contend that “Data Science” is nothing more than repackaged Statistics, with a bit of coding and hacking thrown in. Its proponents, however, point out that most practicing data scientists use a variety of skills and techniques in their daily work, and come from a vast spectrum of career paths and backgrounds. I tend to side with the latter group, but I too am an outsider to this field and am still trying to get a better understanding of what it really entails.
“Doing Data Science: Straight Talk from the Frontline” is a compendium of chapters that deal with data science as it is practiced in the real world. Each chapter is written by a different author, all of whom have significant practical experience and are acknowledged authorities on data science. Most of the contributors work in industry, but data science is still so fresh and new that there is a lot of crossing over between academia and the corporate world.
A few of the chapters include exercises, but these tend to be too advanced and assume too much background material for an introductory book. The exercises still give you a good idea of what kinds of problems data scientists tend to grapple with. However, this book is definitely not a textbook and cannot be effectively used as such. The book doesn’t provide any background on R, statistics, data scrubbing, machine learning, and various other techniques used by data scientists. It is highly unlikely that any single textbook would be able to do justice to all of that material anyway, but a book of that sort could still have a lot of potential use.
There are two groups of people who would benefit from this book. The first are people who have absolutely no background in data science or any of its related fields, but would like to get a flavor of what data science is all about and are interested in exploring it for career purposes. The second group are people with significant technical background in one of the fields related to data science (programming, statistics, machine learning, etc.) who are interested in broadening their skills and would like to see how their particular strengths would fit within the broader data science field.
I have by now read about 8 books on data science, machine learning, MapReduce, Spark and general stats - but this book has helped me tremendously with getting my hands dirty in the real work of data science, how to think about it and do it.
If you enjoy being a wizard who can foretell the future :) The book is a superb start to data science. Has it all and gives you clues as to how to continue building up your knowledge. Gives you some basic R code samples that are priceless for quick and practical learning. Beware, however, that this book just gets you started and then one has to follow the threads. If you know nothing about statistics, perhaps you should take an introductory book or course on statistics and probability. I did study math 20 years ago and never used stats or probability for work until recently. This book got me going, so you need nothing but rudimentary information.