Bad Data Handbook [English] [Paperback]

Q. Ethan McCallum


Short description

November 20, 2012
What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they've recovered from nasty data problems. From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. Among the many topics covered, you'll discover how to:

* Test drive your data to see if it's ready for analysis
* Work spreadsheet data into a usable form
* Handle encoding problems that lurk in text data
* Develop a successful web-scraping effort
* Use NLP tools to reveal the real sentiment of online reviews
* Address cloud computing issues that can impact your analysis effort
* Avoid policies that create data analysis roadblocks
* Take a systematic approach to data quality analysis

About the author and other contributors

Q. Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on The O'Reilly Network and Java.net, and also in print publications such as C/C++ Users Journal, Dr. Dobb's Journal, and Linux Magazine. In his professional roles, he helps companies to make smart decisions about data and technology.

Customer reviews

Most helpful customer reviews on Amazon.com (beta)
Amazon.com: 3.5 out of 5 stars, 13 reviews
7 of 7 customers found the following review helpful
4.0 out of 5 stars Well worth reading. December 17, 2012
By William E. J. Doane - Published on Amazon.com
Format: Paperback
Bad data is a fact of life. Coping with bad data is a valuable, learned skill. Bad Data Handbook offers insights from over 20 authors based on their years of personal experience managing ill-defined, often chaotic and incomplete data. We begin with an exploration of what is meant by *bad data* and what checks we can perform to help us understand data quality as a prerequisite to data analysis.

Kevin Fink offers suggestions on approaching data critically in order to ensure that we understand what we're working with before we begin to try to manipulate it. Fink offers useful scripts in shell and Perl that can be used to inspect data and perform basic sanity checks. Paul Murrell uses R to tackle the problem of scraping data from sources formatted for human consumption into a format more amenable to algorithmic analysis. And on and on.

Each chapter addresses a critical concern in the data life-cycle: identifying, annotating, capturing, archiving, versioning, manipulating, analyzing, and deriving actionable information from imperfect or incomplete data. The advice offered is both powerful and immediately useful to data scientists and newcomers to the field alike and for me has spurred several ideas for how to approach teaching statistics.

Given the number of authors who contributed to this volume, it should come as no surprise that the tone, writing styles, and tools used vary greatly among the chapters, sometimes wandering into technical minutiae, but only infrequently. The book holds together remarkably well, regardless, and was a pleasure to read.

Disclosure: I received a complimentary ebook copy of this book to review.
10 of 12 customers found the following review helpful
4.0 out of 5 stars Taming Bad Data November 29, 2012
By Shawn Day - Published on Amazon.com
Format: Kindle Edition
A great concept for a book. In this day and age, as we increasingly engage with things we call datasets, take on challenges to make sense of big data, and engage with one another around stuff we call data, here is a series of lessons for dealing with data ... Taking a very case-oriented approach, the articles in this edited volume look at the problems we run into, knowingly or unknowingly, when working with data. How many of us have run into the character encoding challenge, received data in a semi-structured form and needed to transform it quickly and efficiently into something more usable, or had to find a way to identify potential bias or the results of collection errors? Well, that's what the Bad Data Handbook is all about.

Editor Q. Ethan McCallum has assembled an impressive array of contributors who present articles on determining data quality and detecting potential flaws, fixing data errors to make the data usable for your specific purpose, and using the most up-to-date techniques and methods to tame data and effectively interrogate it for analytical purposes. The premise of this book is that data not fit for purpose ... or at least not for the purpose you might have in mind for it, is what we will call bad data. The various chapters look at doing 'sniff tests' on the data to see whether it is sound for the purposes you might consider putting it to. How do we find outliers? Can we spot gaps through the use of some handy automated routines? The second chapter looks at techniques for taking data that was formatted for human consumption and transforming it into a machine-readable form. Subsequently the authors explore ways to examine the data models used to define the collection and processing procedures, which may or may not render data unfit for purpose.
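
To make the 'sniff test' idea concrete, here is a minimal sketch of two such checks: flagging outliers with the common interquartile-range rule and listing missing days in a date series. It is an illustration only and is not taken from the book; the function names (`iqr_outliers`, `missing_days`), the 1.5 multiplier, and the sample values are all assumptions.

```python
import statistics
from datetime import date, timedelta


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    Hypothetical helper: the IQR rule is one common heuristic,
    not the method any particular chapter prescribes.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def missing_days(days):
    """Return the calendar days absent between the first and last date seen."""
    present = set(days)
    start, end = min(present), max(present)
    span = (end - start).days
    candidates = (start + timedelta(days=offset) for offset in range(span + 1))
    return [day for day in candidates if day not in present]


# Made-up sample data, for illustration only.
readings = [10, 12, 11, 13, 12, 11, 95]
observed = [date(2012, 11, 1), date(2012, 11, 2), date(2012, 11, 5)]

print(iqr_outliers(readings))
print(missing_days(observed))
```

Checks like these only point at records worth a closer look; whether a flagged value is actually wrong still depends on what the data is meant to represent.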

The collection of articles in this book is deadly valuable and the solutions proposed are code-based. Dealing with the data ultimately involves applying routines to make the data suit your needs. The routines are Python-based, so they are about as approachable as possible for users who may be less familiar with or accustomed to using code to deal with data problems.

I was particularly impressed by the inclusion of a section on working with various text encoding formats and applying techniques to remedy situations which render the data 'bad'. The inclusion of a series of quick exercises in this section is particularly apt.

The general presentation of the book is to identify a specific problem, explain its significance and then to provide hands-on examples of how a user can approach a solution.

The later chapters move on to applied techniques that look at data more broadly, such as using sentiment analysis and Natural Language Processing to sniff out whether online reviews are genuine, and so address real-world problems with online information rather than with data itself.

This is an intriguing book. It looks at the down and dirty manipulation and munging of data, then takes higher-level looks at how we might mistake information for solid data. In all cases it applies good techniques and suggests how one can use sound statistical reasoning, interrogate the data model, or delve into code-based manipulation in the pursuit of more truthful data. Due to the broad coverage of this book, it is harder to determine who it is directly aimed at. I believe that selective reading of it could inform general practitioners in the digital humanities and in emerging areas of study increasingly engaging with data in new ways. It brings to light many lessons of experience that are simply invaluable and would normally be learned only through hands-on tinkering and discovery, often well into larger projects. It also appeals to data scientists more broadly, who benefit for similar reasons, but also from the wealth of hands-on techniques provided that refine and empower standard practice.

In any case I do feel that, as a collection of articles, it can be a very helpful reference source, with individual sections consulted as needed - by no means is this a linearly designed volume. It is, however, a very valuable contribution to a field that is gaining mass popular engagement.
7 of 8 customers found the following review helpful
5.0 out of 5 stars Innovative reference on bad data December 18, 2012
By Ben Rothke - Published on Amazon.com
Format: Paperback
In the movie The Sixth Sense, Cole Sear said "I see dead people". Author Q. Ethan McCallum, whose excellent book Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work just came out, likely sees bad data just about everywhere.

So just what is this monster called bad data? McCallum writes on page 1 that it is difficult to explicitly define what bad data is. He writes that some people consider it a purely hands-on, technical phenomenon, namely missing values, malformed records, and incompatible file formats. But he also notes that it is much more than that.

Chapter 1 notes that bad data includes data that eats up your time, causes you to stay late at the office, drives you to tear out your hair in frustration and more. It's data that you can't access, data that you had and then lost, data that's not the same today as it was yesterday. Ultimately, bad data is data that gets in the way. And there are so many ways to get there, from bad storage, to poor representation, to misguided policy.

In the book, McCallum gathered numerous authors to detail how bad data issues have affected them and what they have done to deal with and remediate them.

Most books that have close to 20 authors suffer from poor organization, repetitive material and overall lack of structure. This title suffers from none of that, and provides the reader with an excellent guidebook to use to ensure that they don't run into the garbage in, garbage out scenario when dealing with data. This is particularly important given that we are living in a data driven society.

While the topic is ostensibly dry, the authors' expertise is such that they are able to make the text most interesting. This is particularly true in chapters 2 - Is It Just Me, or Does This Data Smell Funny?, 8 - How Chemists Make Up Numbers, and 16 - How to Feed and Care for Your Machine-Learning Experts.

Another interesting chapter is 14, on the Myths of Cloud Computing. Steve Francia debunks 4 pervasive cloud myths, including the notions that the cloud is a great solution for all infrastructure components and that the cloud will always save you money.

The beginning of the book has a lot of code that may turn off some non-programmers. After chapter 7, the coding examples are limited, and the message the authors give is definitely worth reading.

While hardware is cheap and bandwidth even cheaper, the book shows that bad data is extremely expensive. Bad data has significant and always negative consequences.

The book takes a highly systematic approach to data quality analysis, which is a most important task. Given the importance of the topic, Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work is a title that has relevance for nearly everyone in IT, and should be read by anyone who is concerned with the integrity of their organization's data.
6 of 7 customers found the following review helpful
2.0 out of 5 stars All over the place, a different place October 21, 2013
By Dimitri Shvorob - Published on Amazon.com
Format: Paperback
"Bad data" follows the low-cost formula of O'Reilly's earlier "Beautiful data": 15-20 practitioners each contribute a short essay, then O'Reilly puts the manuscripts together and adds a sexy, vague title. Anything goes (in) - "bad" as not available in a form ready for analysis? here's an essay on web scraping in Python! - and while this increases the chances of you finding something interesting, or even useful, it also increases the percentage of content that you will find irrelevant. Speaking of my own expectations, "Bad data" is not a book dedicated to data quality - if this is what you would like, I recommend "Data quality assessment" by Arkady Maydanchik.
1 of 1 customers found the following review helpful
4.0 out of 5 stars Real-world anecdotes and lessons-learned November 3, 2013
By Peter Clark - Published on Amazon.com
Format: Paperback
TL;DR summary of the review - awesome book. If you work with real-world datasets, or you work with people who do, you owe it to yourself to read this book. I wish it had been around 8 years earlier when I started working with large-scale social sciences census data. All of the fun, and all of the pain, of dealing with government data and social sciences data is particularly true for census information.

Much of the book could be summed up as noting that less-than-perfect data is still very useful, but you need to understand how the data is bad - is it random? What kinds of bias are introduced, if any? What impact will that have on your conclusions? Go get your hands dirty with the data itself - go look at a few hundred records in a text editor to see what you've got. You'll want to test the data all through your analysis, to ensure that you can identify both where you're hitting issues and where you're introducing issues yourself, and you'll be happier if you can automate these tests so that you can run them often without creating a burden for yourself. Prefer simple tools and portable file formats - in particular, Excel is not your friend. The book discusses a number of different case studies and anecdotes for dealing with data that has problems of one flavor or another. The authors have been there before and you can learn from their experience.
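
As one way to automate the kind of repeatable check described above, the following minimal sketch counts empty and non-numeric cells per column of a delimited file. It is an illustration under stated assumptions rather than code from the book: the function name `profile_columns`, the column names, and the sample rows are all made up.

```python
import csv
import io
from collections import Counter


def profile_columns(text, delimiter=","):
    """Count rows, empty cells, and non-numeric cells for each column.

    Hypothetical helper for illustration; it expects a header row.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    stats = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Short rows yield None for the missing cells; treat them as empty.
            value = (row[name] or "").strip()
            stats[name]["rows"] += 1
            if value == "":
                stats[name]["empty"] += 1
                continue
            try:
                float(value)
            except ValueError:
                stats[name]["non_numeric"] += 1
    return stats


# Made-up sample: one blank age and one placeholder value.
SAMPLE = "id,age,city\n1,34,Chicago\n2,,Boston\n3,n/a,Denver\n"

report = profile_columns(SAMPLE)
for name, counts in report.items():
    print(name, dict(counts))
```

Because the function is cheap to run, it can be called again after every transformation step, which is the point the review makes about separating problems in the source data from problems you introduce yourself.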

Discussions of social sciences survey data and its inherent imperfections and messy metadata definitely rang true with my experiences dealing with census data, as did the chapter on the lowly, undervalued flat file as a data structure.

I'll summarize three takeaway messages that resonated with my own experience:
1. It's generally easy to do some basic analysis of your data to look for problems, gaps, inconsistencies, unusual distributions; and doing so will give you insight into what you're dealing with. Going through your actual data file, rather than trusting the metadata and documentation, is the only way to really know what sort of issues are lying in wait.

2. There's lots of interesting data that's structured for human consumption rather than machine-driven analysis. Restructuring it to be in a format that's more amenable to machine analysis can be tedious, but it's also automatable. Rather than converting a huge list of documents by hand, write some code to restructure it. This notion is explored in chapter 2, where the code is in the R stats language. R is a good fit for two-dimensional data such as tables, unlike the base unix tools (perl, sed, awk), which tend to be line-oriented. However, there's nothing here that can't be done in awk too. Don't shy away from writing code to transform data into something useful, and expect that to be an iterative process.

3. Oftentimes, "plain text" files are anything but. You can find "plain text" files that are ASCII, or UTF-8, or ISO-8859, or CP-1252, all of which will look the same until you start to run into non-English characters. I've seen this in dealing with internationally-sourced data, or even US data that includes Puerto Rico. The authors provide some guidance about how to deal with this in chapter 4, but more importantly, they discuss the fact that it's a surprisingly and frustratingly complex problem that you need to be aware of. Another issue is that when looking at data generated from a web app, you may find text that's been encoded or escaped to avoid SQL injection or cross-site scripting attacks. These are web app best practices, and it's generally easy to get it back to plain text once you know what you're looking at. The author gives code samples in Python, which has strong library support for text transformation, but the main point is to see how to identify these kinds of problems with your input data.
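
The third point can be illustrated with a short sketch using only the Python standard library. It shows that ASCII-only text yields identical bytes in UTF-8 and CP-1252, that a non-English character does not, and that HTML-escaped text can be turned back into plain text. This is not the book's code: the helper `decode_with_fallback`, its list of candidate encodings, and the sample strings are assumptions, and trying encodings in order is only a heuristic, since bytes that decode without error may still have been written in a different encoding.

```python
import html


def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used).

    Hypothetical helper: a successful decode does not prove the guess is right.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Made-up samples: one ASCII-only name and one containing a non-English character.
ascii_only = "San Juan".encode("utf-8")
utf8_bytes = "Mayagüez".encode("utf-8")
cp1252_bytes = "Mayagüez".encode("cp1252")

# ASCII-only text produces the same bytes in both encodings.
print(ascii_only == "San Juan".encode("cp1252"))

# The same name stored two ways decodes correctly via the fallback chain.
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(cp1252_bytes))

# Decoding UTF-8 bytes with the wrong encoding raises no error but garbles the text.
print(utf8_bytes.decode("cp1252"))

# Text escaped by a web app can be restored with the standard library.
escaped = "Caf&eacute; &amp; Bar &lt;b&gt;"
print(html.unescape(escaped))
```

The silent garbling in the wrong-encoding case is the reason to compare decoded output against values you expect to see, rather than relying on the absence of an exception.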

My only negative is that, as a collection of individual essays, the writing style and tone tend to be all over the map. All in all, this is a book that I enjoyed reading, and have recommended to other software developers starting to work with data scientists.