Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure quality and useful data. Data analysts typically spend the majority of their time wrangling data rather than actually analyzing it.
The process of data wrangling may include further munging, data visualization, data aggregation, training a statistical model, and many other potential uses. Data wrangling typically follows a set of general steps: extracting the data in raw form from the data source, "munging" the raw data (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use. It is closely aligned with the extract, transform, load (ETL) process.
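As a minimal sketch of those general steps (the file names, table name, and column names here are hypothetical, not part of any standard), a small pandas pipeline might extract, munge, and deposit data like this:

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from the source (here, a hypothetical CSV file).
raw = pd.read_csv("survey_raw.csv")

# Munge: sort, parse, and coerce the raw records into a predefined structure.
clean = (
    raw.rename(columns=str.lower)          # normalize column names
       .dropna(subset=["respondent_id"])   # drop records missing a key field
       .sort_values("respondent_id")       # e.g. sorting, as mentioned above
)
clean["submitted"] = pd.to_datetime(clean["submitted"], errors="coerce")

# Deposit: write the result into a data sink for storage and future use.
with sqlite3.connect("warehouse.db") as sink:
    clean.to_sql("survey_clean", sink, if_exists="replace", index=False)
```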
One of the first mentions of data wrangling in a scientific context was by Donald Cline during the NASA/NOAA Cold Lands Processes Experiment. Cline stated the data wranglers "coordinate the acquisition of the entire collection of the experiment data." Cline also specifies duties typically handled by a storage administrator for working with large amounts of data. This can occur in areas like major research projects and the making of films with a large amount of complex computer-generated imagery. In research, this involves both data transfer from the research instrument to a storage grid or storage facility, as well as data manipulation for re-analysis via high-performance computing instruments or access via cyberinfrastructure-based digital libraries.
With the rise of artificial intelligence in data science, it has become increasingly important for automation of data wrangling to have very strict checks and balances, which is why the munging of data has not been fully automated by machine learning. Data munging requires more than an automated solution; it requires knowledge of what information should be removed, and artificial intelligence has not yet reached the point of understanding such things.
An example of data mining that is closely related to data wrangling is ignoring data in a set that is not connected to the goal. Say there is a data set related to the state of Texas and the goal is to get statistics on the residents of Houston; the data in the set related to the residents of Dallas is not useful and can be removed before processing to improve the efficiency of the data mining process.
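A minimal sketch of that kind of goal-directed filtering, using a hypothetical data set and column names:

```python
import pandas as pd

# Hypothetical Texas residents data set with a "city" column.
texas = pd.DataFrame({
    "resident_id": [1, 2, 3, 4],
    "city": ["Houston", "Dallas", "Houston", "Dallas"],
    "age": [34, 51, 27, 45],
})

# Goal: statistics on Houston residents only, so Dallas rows are removed
# up front rather than carried through the rest of the pipeline.
houston = texas[texas["city"] == "Houston"]
print(houston["age"].mean())  # statistics computed on the relevant subset
```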
Note: Benefits depend on documenting steps and using version-controlled workflows; otherwise, wrangling can be time-consuming and error-prone.
These steps form an iterative process that should yield a clean and usable data set that can then be used for analysis. This process is tedious but rewarding, as it allows analysts to get the information they need out of a large set of data that would otherwise be unreadable.
Starting data:

| Name | Phone | Birth date | State |
|------|-------|------------|-------|
| John, Smith | 445-881-4478 | August 12, 1989 | Maine |
| Jennifer Tal | +1-189-456-4513 | 11/12/1965 | Tx |
| Gates, Bill | (876)546-8165 | June 15, 72 | Kansas |
| Alan Fitch | 5493156648 | 2-6-1985 | Oh |
| Jacob Alan | 156-4896 | January 3 | Alabama |

Result:

| Name | Phone | Birth date | State |
|------|-------|------------|-------|
| John Smith | 445-881-4478 | 1989-08-12 | Maine |
| Jennifer Tal | 189-456-4513 | 1965-11-12 | Texas |
| Bill Gates | 876-546-8165 | 1972-06-15 | Kansas |
| Alan Fitch | 549-315-6648 | 1985-02-06 | Ohio |
The result of applying the data wrangling process to this small data set is a significantly easier data set to read. All names are now formatted the same way, {first name last name}; phone numbers are also formatted the same way, {area code-XXX-XXXX}; dates are formatted numerically, {YYYY-MM-DD}; and states are no longer abbreviated. The entry for Jacob Alan did not have fully formed data (the area code of the phone number is missing and the birth date has no year), so it was discarded from the data set. Now that the resulting data set is cleaned and readable, it is ready to be either deployed or evaluated.
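A pandas sketch of the transformations shown above; the helper functions, source file name, and the reliance on pandas/dateutil to infer each mixed date format are illustrative assumptions, not a canonical recipe:

```python
import re
import pandas as pd

def clean_name(name: str) -> str:
    # "Gates, Bill" -> "Bill Gates"; already-ordered names pass through.
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        return f"{first} {last}"
    return name.strip()

def clean_phone(phone: str) -> str | None:
    # Keep the trailing 10 digits (drops a "+1" country code); reject shorter
    # numbers such as "156-4896", which is missing its area code.
    digits = re.sub(r"\D", "", phone)[-10:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

states = {"Tx": "Texas", "Oh": "Ohio"}  # abbreviation lookup, truncated here

df = pd.read_csv("starting_data.csv")  # hypothetical source file
df["Name"] = df["Name"].map(clean_name)
df["Phone"] = df["Phone"].map(clean_phone)
# Mixed formats parsed per entry (pandas >= 2.0); unparseable values -> NaT.
df["Birth date"] = pd.to_datetime(
    df["Birth date"], format="mixed", errors="coerce"
).dt.strftime("%Y-%m-%d")
df["State"] = df["State"].replace(states)
df = df.dropna()  # discard malformed entries such as Jacob Alan's
```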
The recipients could be individuals, such as data architects or data scientists who will investigate the data further, business users who will consume the data directly in reports, or systems that will further process the data and write it into targets such as data warehouses, data lakes, or downstream applications.
Visual data wrangling systems were developed to make data wrangling accessible for non-programmers and simpler for programmers. Some of these also include embedded AI recommenders and programming-by-example facilities to provide user assistance, and program synthesis techniques to autogenerate scalable dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/Berkeley Wrangler research system.
Other terms for these processes have included data franchising ("What is Data Franchising?", IRI, 2003 and 2017), data preparation, and data munging.
Start by determining the structure of the outcome: what is important for understanding the disease diagnosis?
Once a final structure is determined, clean the data by removing any data points that are not helpful or are malformed; this could include patients who have not been diagnosed with any disease.
After cleaning, look at the data again: is there anything already known that could be added to the data set to make it more useful? An example could be the most common diseases in the area; America and India are very different when it comes to most common diseases.
Now comes the validation step: determine validation rules specifying which data points need to be checked for validity; this could include the date of birth or checking for specific diseases.
After the validation step, the data should be organized and prepared for either deployment or evaluation. This process can be beneficial for determining correlations for disease diagnosis, as it reduces the vast amount of data into something that can be easily analyzed for an accurate result.
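A minimal sketch of this workflow; the patient records, column names, regional lookup, and validation rule below are all hypothetical:

```python
import pandas as pd

# Hypothetical patient records; real data would come from a clinical source.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "birth_date": ["1989-08-12", "2090-01-01", "1965-11-12"],
    "diagnosis": ["influenza", "malaria", None],
    "region": ["America", "India", "America"],
})

# Clean: remove data points that are not helpful, e.g. undiagnosed patients.
patients = patients.dropna(subset=["diagnosis"])

# Enrich: add already-known context, e.g. the most common disease per region.
common_disease = {"America": "influenza", "India": "malaria"}  # assumed lookup
patients["regional_common_disease"] = patients["region"].map(common_disease)

# Validate: apply rules such as a plausible (non-future) date of birth.
patients["birth_date"] = pd.to_datetime(patients["birth_date"], errors="coerce")
valid = patients[patients["birth_date"] <= pd.Timestamp.today()]

# The validated frame is now organized for deployment or evaluation.
print(valid)
```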