- Paperback: 416 pages
- Publisher: Manning; Edition: Pap/Psc (April 10, 2014)
- Language: English
- ISBN-10: 1617291560
- ISBN-13: 978-1617291562
- Size and/or weight: 18.5 x 2.5 x 23.1 cm
- Average customer rating: 1 customer review
- Amazon Bestseller Rank: No. 45,006 in Foreign Language Books
Practical Data Science with R (English) Paperback – April 10, 2014
About the Author and Other Contributors
Nina Zumel co-founded Win-Vector, a data science consulting firm in San Francisco. She holds a Ph.D. in robotics from Carnegie Mellon and was a content developer for EMC's Data Science and Big Data Analytics Training Course. Nina also contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics and optimization.
John Mount co-founded Win-Vector, a data science consulting firm in San Francisco. He has a Ph.D. in computer science from Carnegie Mellon and over 15 years of applied experience in biotech research, online advertising, price optimization and finance. He contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics and optimization.
Most helpful customer reviews on Amazon.com (beta)
UPD. With the benefit of a little more life experience, I would say: don't spend your time on *any* R book. Python is the way to go.
Was excited to see this book coming to publication. I'm a fan of practical, non-academic approaches to subjects and prefer working from concrete examples to abstract principles (rather than the other way around). I think this is both the most difficult and the most needed type of resource that can be put into print. This book handles the task OK: it falls a bit short on practical, concrete use cases, as it alternates between hands-on work with datasets and shotgun coverage of principles and techniques at a higher level. I'd have much preferred sticking with single datasets for longer (say, a couple of chapters per dataset), but didn't feel cheated out of hands-on work.
- Easy access to the datasets via Github; good documentation on where to find others
- Key Takeaways provided at end of chapter are good summaries of overall information provided.
- A good focus on not just data analysis, but the process as a whole; very Agile like, practical, and non-dogmatic.
- Battle-tested advice: you can tell some of it comes from hard-fought battles. For example, why not use the sample() function instead of manually creating a sample column? Because with a sample column, you can repeatedly select the same data (e.g. all rows where the sample column is < 2) for repeatable output and for regression testing (to avoid introducing bugs).
- Builds your analyst vocabulary, increasing your all-important google-fu skills. Not knowing what to Google is, imho, the single hardest problem when learning a new set of problems and APIs.
- Good use of appendices for introducing R syntax and installation, rather than stuffing it into one of the early chapters.
- Doesn't stick with datasets long enough. I went to the trouble of setting up a true database to use the first dataset (chapter 2), only to move on to a different dataset in the very next chapter (the book did eventually return to the dataset).
- Feels a bit back and forth at times on whether it wants to be a truly pragmatic, focused work or a principles-driven, broadly scoped book (thinking of chapters 5-7 here). Not necessarily a knock, depending on what you're looking for.
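The sample-column advice in the list above is easier to see with a small example. The sketch below is a Python analogue of the idea, not code from the book; the function name `add_sample_group`, the seed value and the toy rows are all hypothetical. Each row receives a pseudo-random group number once, and later selections filter on that stored number instead of drawing a fresh random sample each time.

```python
import random

def add_sample_group(rows, seed=12345, n_groups=10):
    """Return copies of the rows, each tagged with a group id in 0..n_groups-1.

    Hypothetical helper for illustration: a fixed seed makes the
    assignment identical on every run.
    """
    rng = random.Random(seed)  # local generator, independent of global state
    return [dict(row, sample_group=rng.randrange(n_groups)) for row in rows]

# Toy data standing in for a real dataset (an assumption of this sketch).
rows = [{"id": i, "value": i * i} for i in range(1000)]
tagged = add_sample_group(rows)

# Filtering on the stored column yields roughly a 20% sample,
# and exactly the same rows every time the script is run.
subset = [r for r in tagged if r["sample_group"] < 2]
```

Because the group number is stored with the data, the same subset can be re-selected later for regression tests, which a fresh random draw on each run would not guarantee.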
I've read a few books on getting started in data analysis, R, statistics, etc. This book is solid enough that were I to choose among them, I'd recommend it first. I think if the book focused on using datasets for longer stretches, allowing you to learn the data well and apply multiple types of analyses on top of it (especially earlier on), it would be a bit more engaging.
Lastly, it has good coverage of R principles but (per its scope) doesn't get into the nitty-gritty. I'd recommend "The Art of R Programming" for that, which would be a good companion to this book (i.e. it covers R but not data analysis). I've heard R in Action is good as well, though I haven't read it. Caveat emptor.
Disclaimer: I received an e-copy of the book from Manning for review.
Ch1 describes the job of the data scientist, the workflow, and the characters you run into on a project.
Ch2 outlines some of the tools used to get at the data, including the authors' tool, "SQL Screwdriver." I'd have liked some genuflections at the Unix tools used to clean data before it is put anywhere important (sed, awk, tr, sort and cut), but I'm not sure if there is a graceful way of doing this. Or perhaps I'm the only weirdo who uses these in the ETL process.
Ch3 Exploring data: uses the various plot utilities in ggplot2 (the graphics library everyone should be using), covering bar charts, histograms, summary statistics and scatter plots.
Ch4 Managing data: what they call "cleaning data" I call reshaping data (and I use reshape, sometimes anyway; I would have mentioned this, though I got on well without it for years).
Ch5 gets into specifying the problem: is it a classification problem? Scoring? A recommendation engine? How do I quantify success? This chapter is very helpful in doing this. Of course, problems evolve over time, and customers change their minds, but there are very helpful mappings here which will point you in the right direction. There are a few new techniques which should probably be included in future editions of this chapter, depending on how they pan out: I'm impressed with using dropout techniques to prevent overfitting, for example (this is bleeding-edge stuff, generally in the context of deep learning).
Ch6 Memorization techniques covers Naive Bayes, KNN and decision trees. It would have been nice to have more information on the various kinds of variable selection techniques (particularly important for NB and KNN), but mentioning this will allow the practitioner to go find their own information.
Ch7 Logistic and linear regression: most would have done these first, but these are actually more complex than memorization techniques, and there are more things to know to keep the practitioner out of trouble. In my opinion, this chapter really shines. Everyone who is going to do this for a living has had some exposure to regression models; this chapter makes them practical.
Ch8 Unsupervised methods: covers clustering, including hierarchical clustering (one of the most useful tricks you will use in data science) and k-means (it has to be done, though I never found it to be useful), and association rules.
Ch9 Advanced methods: GAMs, SVM, bagging and random forests (the importance measure trick: if you don't know it, pay attention; this is a very good trick). These are the "industrial strength" tools used in industry. I personally would have stuck GAMs in their own chapter and mentioned boosting here, but everyone is a little different in their tastes.
Ch10 Documentation and deployment: they use knitr; I just use vanilla Sweave (I've tried brew, but never took to it). They introduce git here: something I would have done in chapter 1 or 2, but it is a fairly natural place to mention it. They use the Rook tool to deploy HTTP services; I've never used it, though I have used Shiny, which I can recommend. They mention PMML briefly (I've never used it).
The appendix on R is helpful, though it doesn't include the most valuable advice of all for using R in production: you need to maintain a distribution of R and all used packages, as well as a dependency toolchain if the code will be deployed on multiple servers.
Data science in general terms requires a confusing mix of talents. This book really highlights that by the amount of material that was covered.
To be in this field, you need to have an understanding of stats and linear algebra, programming, SQL, source control, and general computer savvy.
Most importantly, you need to have desire; it’s a lot of material.
I found myself fairly critical of its early content, mainly the business and methodology; getting through the first couple of chapters was absolute torture.
I have a computer programming background and have been through a couple of generations of "methodologies", i.e. waterfall, Agile, etc. For me, this time would have been better spent learning R. This brought me to my next challenge, R syntax. I basically took a sabbatical from PDSR and read a book on R (R in Action).
I moved on to the guts of the book after learning a bit of R. Here I found a bunch of great concrete examples with REAL data. The example data sets are fantastic.
The only complaint I have here is that it would have been nice to carry some of those data examples a little further, as opposed to having more examples.
I could see that the authors had really invested themselves in the examples, and it was worth it. For me they carried the book. All of the examples are published on GitHub.
One of my major criticisms: because the book's audience is so diverse, introducing obscure technologies like the H2 database and the "SQLScrewdriver" utilities throws readers into unnecessary tangents. H2 and SQLScrewdriver are edge tools; as the writers pointed out, there are more mainstream databases and data loading tools available, either open source or free "express versions", all of which have more than adequate data loading tools. Those tools have plenty of “googlability” and would require less effort for more results on the reader's part (and the authors', for that matter).
Overall I enjoyed the book. It was a very hands-on book, not overly academic. I am sure readers with different backgrounds will be critical of the sections of the book where they are more experienced. I think that's OK, simply because the book needs to cover so much material that some of it is bound to be review for one audience or another.
Moving forward, I'll use the book as a reference, especially the examples.
As you'd expect from the authors, both experienced practicing data scientists with PhDs from Carnegie Mellon, Part 2 of the book, which presents individual modeling techniques, is comprehensive and useful. But Parts 1 and 3, which complement the algorithmic detail, are also terrific: typical roles in a data science project and practical guidance on data exploration and visualization in Part 1, and documentation, delivery, and presentation in Part 3. That content is rarely available, illustrated with examples and runnable code, in a single book as it is here.
I used early versions of some of the chapters in a graduate class I taught on Managing Analytics Projects at CMU last fall, and was very happy with the results; I would not hesitate to recommend this to other practitioners or faculty looking for a data science textbook for their classes.