- Paperback: 262 pages
- Publisher: O'Reilly and Associates; edition: 1 (November 20, 2012)
- Language: English
- ISBN-10: 1449321887
- ISBN-13: 978-1449321888
- Dimensions and/or weight: 17.8 x 1.4 x 23.3 cm
Bad Data Handbook (English) Paperback – November 20, 2012
About the Author and Other Contributors
Q Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on the O'Reilly Network and Java.net, and in print publications such as C/C++ Users Journal, Dr. Dobb's Journal, and Linux Magazine. In his professional roles, he helps companies make smart decisions about data and technology.
The most helpful customer reviews on Amazon.com
Much of the book could be summed up as noting that less-than-perfect data is still very useful, but you need to understand how the data is bad. Is it random? What kinds of bias are introduced, if any? What impact will that have on your conclusions? Go get your hands dirty with the data itself: look at a few hundred records in a text editor to see what you've got. You'll want to test the data all through your analysis, both to identify where you're hitting issues and to catch issues you're introducing yourself, and you'll be happier if you can automate these tests so that you can run them often without creating a burden for yourself. Prefer simple tools and portable file formats; in particular, Excel is not your friend.

The book discusses a number of case studies and anecdotes for dealing with data that has problems of one flavor or another. The authors have been there before, and you can learn from their experience.
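The advice about automating your data tests can be sketched with a few reusable checks. This is a minimal illustration, not code from the book; the column names (`id`, `date`) and thresholds are assumptions:

```python
import csv
import io

def check_rows(rows, required=("id", "date"), max_null_rate=0.1):
    """Run simple, repeatable sanity checks over a list of dict rows.

    Returns a list of human-readable problem descriptions; an empty
    list means all checks passed, so this can run on every refresh.
    """
    problems = []
    if not rows:
        return ["no rows at all"]
    for col in required:
        missing = sum(1 for r in rows if not (r.get(col) or "").strip())
        if missing / len(rows) > max_null_rate:
            problems.append(f"{col}: {missing}/{len(rows)} values missing")
    # Duplicate keys are a classic silent data bug.
    ids = [r.get("id") for r in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate id values found")
    return problems

# Tiny inline sample so the checks can run anywhere.
sample = "id,date\n1,2012-11-20\n2,\n2,2012-11-21\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(check_rows(rows))
```

Because the checker returns findings instead of raising, it is easy to wire into a scheduled job that only alerts when the list is non-empty.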
Discussions of social sciences survey data and its inherent imperfections and messy metadata definitely rang true with my experiences dealing with census data, as did the chapter on the lowly, undervalued flat file as a data structure.
I'll summarize three takeaway messages that resonated for my own experience:
1. It's generally easy to do some basic analysis of your data to look for problems, gaps, inconsistencies, unusual distributions; and doing so will give you insight into what you're dealing with. Going through your actual data file, rather than trusting the metadata and documentation, is the only way to really know what sort of issues are lying in wait.
2. There's lots of interesting data that's structured for human consumption rather than machine-driven analysis. Restructuring it into a format more amenable to machine analysis can be tedious, but it's also automatable. Rather than converting a huge list of documents by hand, write some code to restructure it. This notion is explored in chapter 2, where the code is in the R stats language. R is a good fit for two-dimensional data such as tables, unlike the base unix tools (perl, sed, awk), which tend to be line-oriented. However, there's nothing here that can't be done in awk too. Don't shy away from writing code to transform data into something useful, and expect that to be an iterative process.
3. Oftentimes, "plain text" files are anything but. You can find "plain text" files that are ASCII, or UTF-8, or ISO-8859, or CP-1252, all of which will look the same until you start to run into non-English characters. I've seen this in dealing with internationally sourced data, or even US data that includes Puerto Rico. The author provides some guidance about how to deal with this in chapter 4, but more importantly, he discusses the fact that it's a surprisingly and frustratingly complex problem that you need to be aware of. Another issue is that when looking at data generated from a web app, you may find text that's been encoded or escaped to avoid SQL injection or cross-site scripting attacks. These are web-app best practices, and it's generally easy to get it back to plain text once you know what you're looking at. The author gives code samples in Python, which has strong library support for text transformation, but the main point is learning to identify these kinds of problems in your input data.
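The profiling idea in takeaway 1 can be sketched in a few lines: eyeball counts, blanks, and the most common values of each column before trusting the documentation. The example column and its values are made up for illustration:

```python
from collections import Counter

def profile_column(values):
    """Quick profile of one column: size, blanks, and top values.

    Eyeballing this output is often enough to spot sentinel values
    like -999, encoding junk, or a category that dominates the data.
    """
    values = list(values)
    blanks = sum(1 for v in values if v is None or str(v).strip() == "")
    top = Counter(str(v) for v in values).most_common(3)
    return {"n": len(values), "blanks": blanks, "top_values": top}

# A hypothetical "age" column; the -999 sentinel and the blank
# would never show up in the metadata, only in the data itself.
ages = ["34", "27", "", "34", "-999", "34"]
print(profile_column(ages))
```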
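The restructuring advice in takeaway 2 looks something like the following in Python (the book's chapter 2 uses R; this report layout with heading lines and indented detail lines is an invented stand-in for "data structured for human consumption"):

```python
# Hypothetical human-oriented report: an unindented heading line per
# group, then indented "year value" lines underneath it.
report = """\
Ohio
  2010 11536504
  2011 11544951
Texas
  2010 25145561
"""

def tidy(text):
    """Reshape the heading/indent layout into flat (group, year, value) rows."""
    rows, group = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):      # unindented line starts a new group
            group = line.strip()
        else:
            year, value = line.split()
            rows.append((group, int(year), int(value)))
    return rows

for row in tidy(report):
    print(row)
```

Once the layout is captured in code like this, re-running it over hundreds of similar documents is free, which is the whole point of not converting by hand.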
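Both problems in takeaway 3 (mystery byte encodings and web-app escaping) have simple first-aid treatments in Python. This is a generic sketch, not the book's chapter 4 code; the encoding preference order is an assumption:

```python
import html

def best_effort_decode(raw: bytes):
    """Try a few common encodings in order; return (text, encoding used).

    ASCII-only bytes decode identically under all of these, so the
    order only matters once non-English characters appear.
    """
    for enc in ("utf-8", "cp1252", "iso-8859-1"):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"

# Puerto Rico place names are exactly where this bites US data.
text, enc = best_effort_decode("Peñuelas".encode("utf-8"))
print(text, enc)                         # Peñuelas utf-8

# Web-app escaping is reversible once you recognize it.
print(html.unescape("Pe&ntilde;uelas"))  # Peñuelas
```

Note that ISO-8859-1 accepts any byte sequence, so it acts as a last resort rather than real detection; that is part of why the problem is as frustrating as the reviewer says.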
My only negative is that, as a collection of individual essays, the writing style and tone tend to be all over the map. All in all, this is a book that I enjoyed reading and have recommended to other software developers starting to work with data scientists.
Kevin Fink provides an interesting peek (with code) at processing web log data. Paul Murrell offers advice on getting data out of ‘awkward’ formats like Excel (use XLConnect) and processing it with R.
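Web server logs are a canonical semi-structured input; a minimal sketch of parsing the standard Apache log layout (generic regex code, not Kevin Fink's own, with an example line invented for illustration):

```python
import re

# Field layout of the common Apache access-log format, captured
# with named groups so downstream code reads cleanly.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

line = ('203.0.113.9 - - [20/Nov/2012:10:27:32 -0500] '
        '"GET /index.html HTTP/1.1" 200 2326')

m = LOG_RE.match(line)
print(m.group("ip"), m.group("path"), m.group("status"))
# 203.0.113.9 /index.html 200
```

In real logs a fraction of lines will not match (truncated writes, weird user agents), so production code counts and inspects the non-matches rather than silently dropping them.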
We enjoyed Josh Levy’s chapter on ‘bad data in plain text’, with an authoritative account of character encodings and text processing in Python. Adam Laiacano’s chapter on scraping data from web pages does a good job of showing what an ugly task this can be. For one website using Flash, this meant running Matlab scripts to extract text from screen grabs! Jacob Perkins’ ‘detecting liars on the web’ describes how Python’s NLTK library for natural language processing is used to classify movie reviews. Interesting, but again somewhat off topic!
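The kind of movie-review classification mentioned above is typically a Naive Bayes model over word counts. Here is a toy version of that technique from scratch rather than via NLTK, with made-up training sentences, just to show the mechanics:

```python
import math
from collections import Counter, defaultdict

# Invented miniature training set: (text, label) pairs.
train = [
    ("wonderful heartfelt brilliant film", "pos"),
    ("great story great acting", "pos"),
    ("dull boring waste of time", "neg"),
    ("terrible dull plot", "neg"),
]

word_counts = defaultdict(Counter)   # per-label word frequencies
label_counts = Counter()             # per-label document counts
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one smoothing so unseen words don't zero out a class."""
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("a dull and boring film"))  # neg
```

NLTK's `NaiveBayesClassifier` wraps exactly this sort of counting and smoothing behind a feature-extraction interface.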
A problem with BDH is that the subject means different things to different people. Philipp Janert’s chapter covers defect reduction in manufacturing, analyzing call center data, and making the most of data with statistics-based hypothesis testing. BDH is very much in the modern world of NoSQL, file databases, and the web. The topics of database integrity and naming conventions are not covered, even though these are key routes to clean data.
Ethan McCallum makes a brave attempt to tie all this together, but his is less the role of an editor and more that of an applier of lipstick to the pig. Again, the problem with BDH is the subject, and the fact that the book is mostly about making sense of data as it is found on the web. The issue of how to avoid creating bad data in the first place is not covered, which is a shame, as this is arguably more important.
This review originally appeared in Oil IT Journal (oilit.com)