{"id":7478,"date":"2025-02-04T15:00:00","date_gmt":"2025-02-04T15:00:00","guid":{"rendered":"https:\/\/kocerroxy.com\/?p=7478"},"modified":"2025-10-22T12:07:00","modified_gmt":"2025-10-22T12:07:00","slug":"the-right-way-of-collecting-data-for-machine-learning","status":"publish","type":"post","link":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/","title":{"rendered":"The Right Way of Collecting Data for Machine Learning"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Imagine spending months building a machine learning model, fine-tuning every hyperparameter, and using the latest deep learning techniques, only to realize your model performs terribly in the real world. It\u2019s not because of the algorithm. It might not be because of your code either, although you should always double-check it. It\u2019s because you didn\u2019t follow the right approach when collecting data for machine learning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is why <strong>data collection is one of the most important steps<\/strong> in any machine learning project. It\u2019s like cooking: no matter how skilled the chef is, a bad ingredient will ruin the dish. 
In ML, <strong>bad data leads to bad models<\/strong>\u2014no exceptions.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Why_High-Quality_Data_Matters_in_Machine_Learning\"><\/span>Why High-Quality Data Matters in Machine Learning<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The machine learning engineer and data scientist <a href=\"https:\/\/www.mislavjuric.com\/\" target=\"_blank\" rel=\"noreferrer 
noopener\"><strong>Mislav Juri\u0107<\/strong><\/a> shared with me a great saying: <strong>&#8220;Garbage in, garbage out&#8221;<\/strong> (GIGO). It applies perfectly to machine learning. If you train your model on incomplete, biased, or low-quality data, your results will be unreliable, no matter how powerful your algorithm is.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Case_Study_A_Healthcare_ML_Model_That_Almost_Failed_Due_to_Poor_Data\"><\/span><strong>Case Study: A Healthcare ML Model That Almost Failed Due to Poor Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A team of researchers was building an ML model to classify patients based on medical symptoms. They had collected around <strong>300 data samples<\/strong>, but they quickly noticed that the class distribution was highly uneven:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Class 1:<\/strong> 20 samples<\/li>\n\n\n\n<li><strong>Class 2:<\/strong> 130 samples<\/li>\n\n\n\n<li><strong>Class 3:<\/strong> 130 samples<\/li>\n\n\n\n<li><strong>Class 4:<\/strong> 20 samples<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">If they had split the dataset randomly, the few <strong>Class 1 and Class 4<\/strong> samples could have ended up almost entirely in either the training set or the test set, so the test set\u2019s class distribution would no longer match the real-world one. The test results would then be non-representative, because some classes would be underrepresented or left out completely.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Realizing the issue, they decided <strong>not to use random sampling<\/strong> and instead applied <strong>stratified sampling<\/strong>. 
This ensured that each class was proportionally represented in <strong>training, validation, and test sets<\/strong>, so the model both learned from and was evaluated on all classes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Python code demonstrating <strong>how to apply stratified sampling<\/strong> to ensure proper class distribution:<\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-2c90304e wp-block-group-is-layout-flex\">\n<pre class=\"wp-block-code\"><code>from itertools import chain\n\nimport numpy as np\nfrom sklearn.model_selection import train_test_split\n\n# Example dataset with class labels\nX = &#91;]\nfor i in range(0, 50):  # creates 50 samples (just strings in this example)\n    sample_name = \"sample_\" + str(i)\n    X.append(sample_name)\n\n# creates 5 class 0 labels, 20 class 1 labels, 20 class 2 labels and 5 class 3 labels\ny = list(chain(&#91;\"0\"] * 5, &#91;\"1\"] * 20, &#91;\"2\"] * 20, &#91;\"3\"] * 5))\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)\n\n# class 0: 4 data samples, class 1: 16 data samples, class 2: 16 data samples, class 3: 4 data samples\nprint(\"Training labels distribution:\", dict(zip(*np.unique(y_train, return_counts=True))))\n# class 0: 1 data sample, class 1: 4 data samples, class 2: 4 data samples, class 3: 1 data sample\nprint(\"Testing labels distribution:\", dict(zip(*np.unique(y_test, return_counts=True))))<\/code><\/pre>\n<\/div>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/how-to-prepare-effective-llm-training-data\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>How to Prepare Effective LLM Training Data<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Key_Characteristics_of_High-Quality_Datasets\"><\/span><strong>Key Characteristics of High-Quality Datasets<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When building a machine learning model, <strong>the quality of your dataset is everything<\/strong>. Even the most advanced model won\u2019t perform well if it&#8217;s trained on <strong>biased, mislabeled, or unbalanced data<\/strong>. So, what makes a dataset high-quality? Let\u2019s break it down.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"1_Representativeness_Your_Data_Must_Reflect_the_Real_World\"><\/span><strong>1. Representativeness: Your Data Must Reflect the Real World<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine training a healthcare model to classify diseases, but most of your training data comes from patients in big-city hospitals. If you try using that model in <strong>rural areas where patient demographics, lifestyles, and common diseases are different<\/strong>, it might fail completely.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This happens when your dataset <strong>isn\u2019t representative of the real-world population<\/strong>. 
If your model is supposed to <strong>mimic reality, your data should match that reality as closely as possible<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Always verify dataset representativeness by <strong>consulting domain experts<\/strong> before training a model. If you don\u2019t have access to one, compare your dataset\u2019s distribution with known real-world statistics.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"2_Correct_Labels_Human_Validation_Matters\"><\/span><strong>2. Correct Labels: Human Validation Matters<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Your model can only be as good as the labels it learns from. If the data is mislabeled, your model will learn the wrong patterns, making it unreliable in real-world scenarios.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A <strong>common mistake<\/strong> is trusting auto-labeled data or assuming that a non-expert can label data accurately. But <strong>bad labels = bad models.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Have <strong>humans in the loop<\/strong> for validation. Ideally, experts in the field should verify labels, especially in specialized areas like <strong>medicine, finance, and law<\/strong>. If manual labeling is too expensive, consider using <strong>active learning<\/strong>\u2014where the model flags uncertain cases for human review.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"3_Data_Splitting_Done_Right_Avoiding_Leaks_and_Ensuring_Balance\"><\/span><strong>3. 
Data Splitting Done Right: Avoiding Leaks and Ensuring Balance<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A common approach when splitting data is <strong>randomly assigning samples<\/strong> to training, validation, and test sets. If your dataset is imbalanced\u2014some categories have far fewer samples than others\u2014this can cause serious issues.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"853\" src=\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2025\/02\/Key-Characteristics-of-High-Quality-Datasets-1024x853.webp\" alt=\"Train\/Test Distribution for Random and Stratified Splits\" class=\"wp-image-7480\" srcset=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Key-Characteristics-of-High-Quality-Datasets-1024x853.webp 1024w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Key-Characteristics-of-High-Quality-Datasets-300x250.webp 300w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Key-Characteristics-of-High-Quality-Datasets-768x640.webp 768w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Key-Characteristics-of-High-Quality-Datasets-1536x1280.webp 1536w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Key-Characteristics-of-High-Quality-Datasets-2048x1707.webp 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Train\/Test Distribution for Random and Stratified Splits<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For example, if a rare class ends up mostly in the training set and barely in the test set, your test set results could be misleading.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use <strong>stratified sampling<\/strong> to make sure each class is <strong>proportionally represented<\/strong> in the training, validation, and test sets. 
This keeps rare classes from disappearing from any one split and makes evaluation <strong>much more reliable<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Python code for <strong>checking class distribution before splitting<\/strong>:<\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-2c90304e wp-block-group-is-layout-flex\">\n<pre class=\"wp-block-code\"><code>import pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Example dataset\ndf = pd.DataFrame({'Class': &#91;'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'C', 'C']})\n\n# Plot the number of samples per class\nsns.countplot(x=df&#91;'Class'])\nplt.title('Class Distribution Before Splitting')\nplt.show()<\/code><\/pre>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to do hyperparameter tuning, you will also need to further split the training data into training and validation sets while keeping the class distribution in mind.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/data-parsing-with-proxies\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Data Parsing with Proxies<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Common_Data_Sources_for_Machine_Learning\"><\/span><strong>Common Data Sources for Machine Learning<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before you start training a machine learning model, you need data. 
But not all data sources are created equal. The source you choose depends on the <strong>problem you\u2019re trying to solve<\/strong> and how structured or messy the data is. Let\u2019s go through some of the most common data sources, along with their <strong>challenges and best practices<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Videos_Natural_Language\"><\/span><strong>Videos &amp; Natural Language<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Working with <strong>videos<\/strong> or <strong>text<\/strong>? You\u2019ll need to preprocess the data before you can feed it into an ML model. How much preprocessing you need depends on your use case, but some is almost always required.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In one project, a team collected <strong>videos of patients<\/strong> to train a healthcare ML model. But instead of feeding raw video files into the model, they <strong>extracted key data<\/strong> from the videos and converted it into <strong>structured JSON files<\/strong>. This made the data easier to analyze and train on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you\u2019re dealing with videos, consider <strong>extracting metadata (timestamps, object tracking, facial landmarks, etc.) into structured formats like JSON<\/strong> before moving forward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you\u2019re working with natural language, you\u2019ll need to tokenize the text and, depending on the machine learning model used, clean it before training.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"APIs_vs_Web_Scraping\"><\/span><strong>APIs vs. 
Web Scraping<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you need large amounts of data from the web, you have two main options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>APIs (Application Programming Interfaces)<\/strong><\/li>\n\n\n\n<li><strong>Web Scraping<\/strong><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">APIs are usually the <strong>better choice<\/strong> because they provide <strong>structured, reliable<\/strong> data and keep you within the platform\u2019s terms of service. Many platforms (Twitter, Google, OpenWeather, etc.) offer APIs for developers to access <strong>clean, formatted<\/strong> data.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" width=\"947\" height=\"1024\" src=\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2025\/02\/Untitled-20-947x1024.webp\" alt=\"Weather API\" class=\"wp-image-7482\" srcset=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Untitled-20-947x1024.webp 947w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Untitled-20-277x300.webp 277w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Untitled-20-768x831.webp 768w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Untitled-20-1420x1536.webp 1420w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Untitled-20.webp 1665w\" sizes=\"(max-width: 947px) 100vw, 947px\" \/><figcaption class=\"wp-element-caption\">Weather API<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Example of <strong>a simple API call using Python<\/strong>:<\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-2c90304e wp-block-group-is-layout-flex\">\n<pre class=\"wp-block-code\"><code>import requests\n\n# Replace LATITUDE, LONGITUDE and YOUR_API_KEY with real values\nurl = \"https:\/\/api.openweathermap.org\/data\/3.0\/onecall?lat=LATITUDE&amp;lon=LONGITUDE&amp;appid=YOUR_API_KEY\"\nresponse = requests.get(url, timeout=10)\nresponse.raise_for_status()  # stop on HTTP errors instead of parsing an error response\ndata = response.json()\nprint(data)<\/code><\/pre>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">On the other hand, <strong>web scraping<\/strong> is often a last resort when APIs <strong>aren\u2019t available<\/strong> or are <strong>too restrictive<\/strong>. However, it comes with challenges like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Websites changing layouts, which can break your scrapers.<\/li>\n\n\n\n<li>Legal issues: many sites <strong>prohibit<\/strong> scraping in their terms of service.<\/li>\n\n\n\n<li>Data inconsistency: unstructured HTML can be messy to parse.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Manually_Collected_Data\"><\/span><strong>Manually Collected Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Sometimes, the best way to get high-quality data is <strong>to collect it manually<\/strong>. This might seem tedious, but in many cases, manually labeled or curated datasets <strong>produce better models than automatically gathered ones<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If your use case requires <strong>domain-specific knowledge<\/strong>, manual data collection and curation can be <strong>worth the effort<\/strong>. 
Consider a <strong>hybrid approach<\/strong>: automated data collection, followed by <strong>human validation<\/strong> of key samples.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/the-importance-of-web-scraping\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>The Importance of Web Scraping<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Handling_Real-Time_Data_Collection\"><\/span><strong>Handling Real-Time Data Collection<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Collecting data for machine learning gets even more complicated when you\u2019re dealing with <strong>real-time data<\/strong>. Unlike static datasets that you collect once and clean up later, <strong>real-time data is constantly changing<\/strong>, which means you need a system that can <strong>adapt on the fly<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So how do you handle it properly? Let\u2019s go over the key strategies.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"JSON_for_Dynamic_Data\"><\/span><strong>JSON for Dynamic Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When dealing with real-time user-generated data, one of the best formats for storing it is <strong>JSON<\/strong>. Why? 
Because JSON is <strong>flexible<\/strong>\u2014it can handle new data fields without breaking everything.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" width=\"687\" height=\"997\" src=\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2025\/02\/Handling-Real-Time-Data-Collection.webp\" alt=\"A basic JSON file structure\" class=\"wp-image-7484\" srcset=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Handling-Real-Time-Data-Collection.webp 687w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Handling-Real-Time-Data-Collection-207x300.webp 207w\" sizes=\"(max-width: 687px) 100vw, 687px\" \/><figcaption class=\"wp-element-caption\">A basic JSON file structure<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">But even <a href=\"https:\/\/kocerroxy.com\/blog\/json-vs-csv-which-is-better\">JSON-based data can evolve<\/a> over time. For example, new data points might be added, or the structure might change slightly. 
When that happens, <strong>your data processing scripts need to be updated to handle the changes smoothly<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here&#8217;s sample Python code that handles missing fields in a JSON object by substituting a default value:<\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-2c90304e wp-block-group-is-layout-flex\">\n<pre class=\"wp-block-code\"><code>import json\n\n# Example JSON: the \"city\" field is missing and \"age\" is an extra field\njson_data = '{\"name\": \"John\", \"age\": 30}'\nparsed_data = json.loads(json_data)\n\n# Handling missing keys with default values\nname = parsed_data.get(\"name\", \"Unknown\")\ncity = parsed_data.get(\"city\", \"No City Provided\")\n\nprint(f\"Name: {name}, City: {city}\")<\/code><\/pre>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Use JSON for flexibility, but plan for changes.<\/strong> If the JSON structure evolves, make sure <strong>older scripts don\u2019t break<\/strong> by handling missing or new fields properly.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Storage_Choices\"><\/span><strong>Storage Choices<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When collecting real-time data, you need to store it somewhere. The two most common options are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Databases<\/strong>. 
Ideal for structured data that needs frequent querying.<\/li>\n\n\n\n<li><strong>Cloud Storage (e.g., AWS S3, Google Cloud Storage)<\/strong>. Better for large-scale, less frequently accessed raw data.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/json-vs-csv-which-is-better\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>JSON vs. CSV: Which Is Better?<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Cleaning_and_Preparing_Data_for_ML\"><\/span><strong>Cleaning and Preparing Data for ML<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Once you\u2019ve collected your data, the next step is <strong>cleaning and preparing it<\/strong> before training your model. This is where a lot of people <strong>cut corners<\/strong>\u2014but skipping this step leads to <strong>garbage results<\/strong> no matter how good your algorithm is.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s how to do it <strong>the right way<\/strong> to ensure your model learns from clean, structured, and unbiased data.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Ensuring_Data_is_Ready_for_Use\"><\/span><strong>Ensuring Data is Ready for Use<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The first thing you need to ask yourself is: <strong>&#8220;Can my data be trusted?&#8221;<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Raw data is <strong>never perfect<\/strong>\u2014it often contains errors, inconsistencies, and missing values. 
If you don\u2019t catch these issues early, your model will learn from <strong>bad data<\/strong>, leading to poor performance in real-world scenarios.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"863\" height=\"1024\" src=\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2025\/02\/Cleaning-and-Preparing-Data-for-ML-863x1024.webp\" alt=\"Dropping rows with null values\" class=\"wp-image-7486\" srcset=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Cleaning-and-Preparing-Data-for-ML-863x1024.webp 863w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Cleaning-and-Preparing-Data-for-ML-253x300.webp 253w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Cleaning-and-Preparing-Data-for-ML-768x911.webp 768w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Cleaning-and-Preparing-Data-for-ML.webp 903w\" sizes=\"(max-width: 863px) 100vw, 863px\" \/><figcaption class=\"wp-element-caption\">Dropping rows with null values<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Python script for removing duplicate rows and imputing missing values using pandas:<\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-2c90304e wp-block-group-is-layout-flex\">\n<pre class=\"wp-block-code\"><code>import pandas as pd<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code># Sample dataset with missing values<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>df = pd.DataFrame({'Name': &#91;'Alice', 'Bob', None, 'Charlie', 'Bob'],<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>                   'Age': &#91;25, 30, 35, None, 30]})<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code># Drop duplicate rows and fill missing Age values with the median age<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>df = df.drop_duplicates().fillna({'Age': df&#91;'Age'].median()})<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>print(df)<\/code><\/pre>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Always <strong>have humans review the data<\/strong>, especially if labeling errors could significantly impact your results. Use <strong>double-checking methods<\/strong>\u2014such as <strong>multiple reviewers<\/strong> for critical datasets.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Handling_Unstructured_Data\"><\/span><strong>Handling Unstructured Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Not all data comes neatly packaged in tables. If you\u2019re dealing with <strong>unstructured data<\/strong> like text, images, or videos, <strong>there\u2019s no one-size-fits-all approach<\/strong>\u2014you need a plan based on your specific use case.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Understand your data<\/strong> before deciding on the best way to structure it. Use <strong>Python scripts (pandas, OpenCV, or NLP libraries)<\/strong> to <strong>convert, clean, and format<\/strong> unstructured data.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Mitigating_Bias\"><\/span><strong>Mitigating Bias<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">One of the biggest mistakes in machine learning is assuming your dataset is <strong>neutral and unbiased<\/strong>. <strong>It almost never is.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Bias sneaks into datasets in different ways\u2014imbalanced class distributions, unverified data sources, or human labeling errors. 
If you don\u2019t <strong>test for bias<\/strong>, you might end up with a model that <strong>reinforces unfair patterns<\/strong> rather than making fair and accurate predictions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Instead of assuming your dataset is unbiased, <strong>train your model <\/strong>and analyze its behavior. Check for <strong>disproportionate error rates<\/strong> across different classes\u2014this often indicates bias. If you find bias, <strong>adjust the dataset<\/strong> (e.g., by adding more samples from underrepresented classes) rather than just tweaking the model.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/alternative-data-for-startups\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Alternative Data for Startups<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Tools_Technologies_for_Data_Collection\"><\/span><strong>Tools &amp; Technologies for Data Collection<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Having the right tools can make <strong>collecting data for machine learning<\/strong> much easier and more efficient. Whether you\u2019re scraping data from the web or cleaning raw datasets, the right technology can <strong>save you time and prevent errors<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here are some essential tools that can help you streamline your data collection process.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"1_Web_Scraping_Scrapy\"><\/span><strong>1. Web Scraping: Scrapy<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you need data from the web and an <strong>API isn\u2019t available<\/strong>, web scraping can be a useful approach. 
One of the best tools for this is <strong>Scrapy<\/strong>\u2014a powerful Python framework that makes web scraping easier and more efficient.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"541\" src=\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2025\/02\/A-Scrapy-project-in-an-Anaconda-Prompt-terminal-window-1024x541.png\" alt=\"A Scrapy project in an Anaconda Prompt terminal window\" class=\"wp-image-7488\" srcset=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/A-Scrapy-project-in-an-Anaconda-Prompt-terminal-window-1024x541.png 1024w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/A-Scrapy-project-in-an-Anaconda-Prompt-terminal-window-300x159.png 300w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/A-Scrapy-project-in-an-Anaconda-Prompt-terminal-window-768x406.png 768w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/A-Scrapy-project-in-an-Anaconda-Prompt-terminal-window-1536x812.png 1536w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/A-Scrapy-project-in-an-Anaconda-Prompt-terminal-window.png 1723w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">A Scrapy project in an Anaconda Prompt terminal window<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A simple <strong>Scrapy script to extract website data<\/strong>:<\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-layout-flex wp-container-core-group-is-layout-2c90304e wp-block-group-is-layout-flex\">\n<pre class=\"wp-block-code\"><code>import scrapy<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>from scrapy.crawler import CrawlerProcess<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>class QuotesSpider(scrapy.Spider):<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>    name = \"quotes\"<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>    start_urls = &#91;'http:\/\/quotes.toscrape.com\/']<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>    def parse(self, response):<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>        for quote in response.css('div.quote'):<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>            yield {'text': quote.css('span.text::text').get(),<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>                   'author': quote.css('small.author::text').get()}<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>process = CrawlerProcess(<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>    settings={<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>        \"FEEDS\": {<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>            \"items.json\": {\"format\": \"json\"},<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>        },<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>    }<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>)<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>process.crawl(QuotesSpider)<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>process.start()  # the script will block here until the crawling is finished<\/code><\/pre>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Use Scrapy<\/strong> when you need structured data from websites but no API exists. Be aware of <strong>legal and ethical concerns<\/strong>\u2014always check a website\u2019s <strong>terms of service<\/strong> before scraping.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"2_Data_Processing_Pandas\"><\/span><strong>2. Data Processing: Pandas<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Once you\u2019ve collected data, the next step is <strong>cleaning and transforming it<\/strong>. That\u2019s where <strong>pandas<\/strong> comes in. 
Pandas is a Python library designed for <strong>data manipulation, cleaning, and preprocessing<\/strong>. It\u2019s one of the most commonly used tools in machine learning workflows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use <strong>pandas<\/strong> for <strong>data cleaning, transformation, and analysis<\/strong> before feeding data into your model. When working with JSON, CSV, or databases, pandas helps <strong>structure the data properly<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"3_Checking_Data_Integrity_Custom_Python_Scripts\"><\/span><strong>3. Checking Data Integrity: Custom Python Scripts<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Even if your data <strong>looks<\/strong> clean, small mistakes can lead to <strong>huge problems<\/strong> when training your model. Sometimes, you need <strong>custom scripts<\/strong> to verify that your data follows the correct structure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A team was using <strong>a large language model (LLM) like ChatGPT<\/strong> to generate structured responses for a dataset. However, over time, they noticed that some responses <strong>contained gibberish or incorrect formatting<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To solve this, they wrote a <strong>Python script that checked if the generated responses matched the expected structure<\/strong>. 
While it couldn\u2019t verify content accuracy, it helped them <strong>filter out bad responses automatically<\/strong>, saving hours of manual work.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/the-future-of-ad-verification-ais-impact-on-brand-safety\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>The Future of Ad Verification: AI\u2019s Impact on Brand Safety<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"A_Real-Life_Case_Study_Collecting_Medical_Data\"><\/span><strong>A Real-Life Case Study: Collecting Medical Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Collecting data for machine learning is already tricky\u2014but it becomes even more challenging when <strong>privacy and legal compliance<\/strong> are involved. In fields like <strong>healthcare<\/strong>, where sensitive patient data is collected, mistakes can lead to <strong>serious ethical and legal issues<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s a real-world case where a team had to <strong>collect and process medical video data while ensuring patient privacy<\/strong>\u2014and the smart solutions they used to solve this challenge.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Challenge_Collecting_Medical_Video_Data_While_Protecting_Privacy\"><\/span><strong>The Challenge: Collecting Medical Video Data While Protecting Privacy<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A healthcare research team was working on a <strong>machine learning model<\/strong> that analyzed <strong>medical videos<\/strong> to help with diagnostics. 
However, there was a major issue: <strong>the videos contained identifiable patient information<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">They couldn\u2019t just store and process the videos as they were because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Patient names were visible in medical records attached to the video files.<\/li>\n\n\n\n<li>The patients\u2019 <strong>faces and body features<\/strong> were clearly visible in the footage.<\/li>\n\n\n\n<li>They had to comply with <strong>strict privacy laws<\/strong> like GDPR.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">If they didn\u2019t handle this correctly, the entire project <strong>couldn\u2019t move forward<\/strong> due to ethical and legal risks.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"How_They_Solved_It\"><\/span><strong>How They Solved It<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The team found <strong>two effective solutions<\/strong>:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"1_Data_Anonymization_Replacing_Patient_Names_with_Auto-Generated_Strings\"><\/span><strong>1. 
Data Anonymization: Replacing Patient Names with Auto-Generated Strings<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">To protect patient identities, they <strong>automated the anonymization process<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instead of storing real patient names, they <strong>generated random strings<\/strong> (e.g., Patient_001, Patient_002) and assigned them to each individual.<\/li>\n\n\n\n<li>The mapping between real names and assigned strings was stored <strong>in a separate, highly restricted database<\/strong>, accessible only to authorized personnel.<\/li>\n\n\n\n<li>This ensured that even if the dataset was leaked, it <strong>wouldn\u2019t expose patient identities<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"2_Video_Blurring_Hiding_Identifiable_Features\"><\/span><strong>2. Video Blurring: Hiding Identifiable Features<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">To prevent visual identification, they used <strong>computer vision techniques<\/strong> to blur patient faces:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They applied <strong>key point detection<\/strong> to identify facial landmarks like the <strong>eyes, nose, and mouth<\/strong>.<\/li>\n\n\n\n<li>Once key points were detected, they <strong>automatically blurred<\/strong> the surrounding region.<\/li>\n\n\n\n<li>The same approach was used to blur any <strong>tattoos, scars, or other identifiable body parts<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">They used a machine learning model for detecting key points on the body and OpenCV for blurring the areas around the detected key points.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Did it work?<\/strong> Yes! 
Even when a patient <strong>turned their head to the side or wore hats<\/strong>, the key point detection still functioned well, ensuring their identity remained protected.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Lesson_Learned_Privacy_Should_Be_Handled_from_the_Start\"><\/span><strong>Lesson Learned: Privacy Should Be Handled from the Start<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">One of the biggest takeaways from this project? <strong>Never treat privacy as an afterthought.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If they had waited until the later stages of development to address privacy concerns, they would potentially have had to <strong>redo large portions of their work<\/strong>\u2014wasting time and resources.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Think about privacy from the very beginning\u2014especially when working with sensitive data. Automate anonymization and security measures so privacy protection is built into the data collection process. Consult legal experts early to avoid compliance headaches later. Regularly assess and update your privacy protocols to adapt to evolving regulations and technological advancements. By prioritizing transparency with users about how their data will be used, you can build trust and mitigate potential <a href=\"https:\/\/kocerroxy.com\/blog\/deepseek-ais-privacy-violations-in-data-collection\/\">privacy issues in AI technologies<\/a>. 
Ethical data practices safeguard individuals and enhance the credibility of your organization.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/well-paid-web-scraping-projects\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Well Paid Web Scraping Projects<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Getting_Started_with_Machine_Learning\"><\/span>Getting Started with Machine Learning<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you&#8217;re new to machine learning, it can feel overwhelming at first. Where do you start? What tools should you use? How do you go from raw data to training a real model?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Don\u2019t worry\u2014I\u2019ve got you covered. Here\u2019s a beginner-friendly roadmap to help you <strong>learn, practice, and build your first machine learning project.<\/strong><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Read_a_Great_Beginner-Friendly_ML_Book\"><\/span><strong>Read a Great Beginner-Friendly ML Book<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Before diving into your own projects, it helps to understand <strong>the basics of machine learning<\/strong>. 
A great book to start with is <a href=\"https:\/\/www.amazon.com\/Hands-Machine-Learning-Scikit-Learn-TensorFlow\/dp\/1492032646\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Hands-On Machine Learning<\/strong><\/a><strong> by Aur\u00e9lien G\u00e9ron<\/strong>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It covers <strong>real-world machine learning workflows<\/strong>, including data preprocessing.<\/li>\n\n\n\n<li>The explanations are <strong>clear and beginner-friendly<\/strong>.<\/li>\n\n\n\n<li>It includes <strong>Python code examples<\/strong> so you can follow along and practice.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Read the book <strong>before jumping into complex projects<\/strong>\u2014it will help you understand <strong>many of the machine learning models used today.<\/strong><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Start_with_Kaggle\"><\/span><strong>Start with Kaggle<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">One of the best places to <strong>learn and practice machine learning<\/strong> is <a href=\"https:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Kaggle<\/strong><\/a>. Kaggle is a website where you can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Find <strong>real-world datasets<\/strong> for free.<\/li>\n\n\n\n<li>Solve <strong>machine learning challenges<\/strong> (some with prize money!).<\/li>\n\n\n\n<li>See how experienced data scientists <strong>approach problems<\/strong>\u2014you can study their solutions and improve your skills.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Instead of struggling to find data, Kaggle provides <strong>ready-to-use datasets<\/strong> in various domains\u2014healthcare, finance, sports, and even fun topics like movie ratings and music preferences. 
It\u2019s a <strong>hands-on way to learn<\/strong> without getting stuck in the data collection phase.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"934\" height=\"1024\" src=\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2025\/02\/Kaggle-Dataset-Example-934x1024.png\" alt=\"Kaggle Dataset Example\" class=\"wp-image-7490\" srcset=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Kaggle-Dataset-Example-934x1024.png 934w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Kaggle-Dataset-Example-274x300.png 274w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Kaggle-Dataset-Example-768x842.png 768w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Kaggle-Dataset-Example-1401x1536.png 1401w, https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/Kaggle-Dataset-Example.png 1453w\" sizes=\"(max-width: 934px) 100vw, 934px\" \/><figcaption class=\"wp-element-caption\">Kaggle Dataset Example<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Pick a <strong>simple dataset<\/strong> on Kaggle and try to clean, analyze, and visualize it using Python. 
Pandas is a great tool for this!<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Build_Your_First_Data_Collection_Project\"><\/span><strong>Build Your First Data Collection Project<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Once you\u2019ve read a bit about ML and practiced on Kaggle, the next step is to <strong>collect your own dataset and train a machine learning model on it.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Example Project:<\/strong> Collect <strong>movie scripts<\/strong> and train a text-generation model!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s how <strong>Mislav Juri\u0107<\/strong> did it in his project: <a href=\"https:\/\/www.mislavjuric.com\/movie-script-generator-based-on-gpt-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Movie Script Generator Using GPT-2<\/a>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Found a <strong>movie scripts database<\/strong> online.<\/li>\n\n\n\n<li>Used <strong>web scraping (Scrapy)<\/strong> to collect the movie scripts.<\/li>\n\n\n\n<li>Trained a <strong>GPT-2 model<\/strong> on the data to generate new movie scripts.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Start small\u2014don\u2019t try to <a href=\"https:\/\/kocerroxy.com\/blog\/alternative-data-for-startups\">collect massive datasets right away.<\/a> Choose a <strong>fun dataset<\/strong> so you stay motivated while learning.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">Also read: <a href=\"https:\/\/kocerroxy.com\/blog\/web-scraping-with-proxies\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Web Scraping With Proxies<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusions_on_Collecting_Data_for_Machine_Learning\"><\/span>Conclusions on Collecting Data for Machine 
Learning<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If there\u2019s one thing you should take away from this guide, it\u2019s this: <strong>data quality is king.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can have the most advanced machine learning model, fine-tuned with the latest techniques, but if it\u2019s trained on <strong>bad data<\/strong>, it\u2019s going to fail. <strong>No exceptions.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What\u2019s the fix? <strong>Collecting data the right way from the start. <\/strong>Don\u2019t just grab whatever data is available. Think about what\u2019s truly representative of your problem. <strong>Involve domain experts<\/strong> who can spot issues in your dataset that an algorithm never will.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A mediocre model trained on a <strong>high-quality, well-balanced dataset<\/strong> will always <strong>outperform<\/strong> a powerful model trained on <strong>garbage data<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So, take the time to <strong>collect, clean, and validate your data properly<\/strong>. Your future ML model and your sanity will thank you.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Struggling with collecting data for machine learning? 
Learn how to gather, clean, and prepare high-quality datasets for AI success.<\/p>\n","protected":false},"author":3,"featured_media":7496,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[139],"tags":[176,184,24],"class_list":["post-7478","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping","tag-data-processing","tag-programming","tag-web-scraping"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The Right Way of Collecting Data for Machine Learning - KocerRoxy<\/title>\n<meta name=\"description\" content=\"Struggling with collecting data for machine learning? Learn how to gather, clean, and prepare high-quality datasets for AI success.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The Right Way of Collecting Data for Machine Learning - KocerRoxy\" \/>\n<meta property=\"og:description\" content=\"Struggling with collecting data for machine learning? 
Learn how to gather, clean, and prepare high-quality datasets for AI success.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\" \/>\n<meta property=\"og:site_name\" content=\"KocerRoxy\" \/>\n<meta property=\"article:author\" content=\"https:\/\/www.facebook.com\/TheHelenBold\" \/>\n<meta property=\"article:published_time\" content=\"2025-02-04T15:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-10-22T12:07:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1792\" \/>\n\t<meta property=\"og:image:height\" content=\"1024\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"Helen Bold\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TheHelenBold\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Helen Bold\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\"},\"author\":{\"name\":\"Helen Bold\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/#\/schema\/person\/c9c9120b90dac4268b7012486a55074c\"},\"headline\":\"The Right Way of Collecting Data for Machine Learning\",\"datePublished\":\"2025-02-04T15:00:00+00:00\",\"dateModified\":\"2025-10-22T12:07:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\"},\"wordCount\":3129,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp\",\"keywords\":[\"data processing\",\"programming\",\"web scraping\"],\"articleSection\":[\"Web Scraping\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\",\"url\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\",\"name\":\"The Right Way of Collecting Data for Machine Learning - 
KocerRoxy\",\"isPartOf\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp\",\"datePublished\":\"2025-02-04T15:00:00+00:00\",\"dateModified\":\"2025-10-22T12:07:00+00:00\",\"description\":\"Struggling with collecting data for machine learning? Learn how to gather, clean, and prepare high-quality datasets for AI success.\",\"breadcrumb\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage\",\"url\":\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp\",\"contentUrl\":\"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp\",\"width\":1792,\"height\":1024,\"caption\":\"Collecting Data for Machine 
Learning\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/kocerroxy.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"The Right Way of Collecting Data for Machine Learning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/#website\",\"url\":\"https:\/\/kocerroxy.com\/blog\/\",\"name\":\"Kocerroxy\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/kocerroxy.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/#organization\",\"name\":\"Kocerroxy\",\"url\":\"https:\/\/kocerroxy.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2023\/07\/Favicon.png\",\"contentUrl\":\"https:\/\/kocerroxy.com\/wp-content\/uploads\/2023\/07\/Favicon.png\",\"width\":512,\"height\":512,\"caption\":\"Kocerroxy\"},\"image\":{\"@id\":\"https:\/\/kocerroxy.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/#\/schema\/person\/c9c9120b90dac4268b7012486a55074c\",\"name\":\"Helen 
Bold\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/kocerroxy.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/7624887d3556e306a0883ab27fba8ad89c7f315532399aacf4e5cd49014bc658?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/7624887d3556e306a0883ab27fba8ad89c7f315532399aacf4e5cd49014bc658?s=96&d=mm&r=g\",\"caption\":\"Helen Bold\"},\"description\":\"Helen Bold has been writing about proxies since 2020. Helen specializes in gathering details, checking facts, and bringing value to our readers. In addition to writing articles, Helen does in-depth research and analyzes proxy industry trends. In her free time, she also writes amazing novels. You can read more about her personal work here: helenbold.com\",\"sameAs\":[\"http:\/\/helenbold.com\",\"https:\/\/www.facebook.com\/TheHelenBold\",\"https:\/\/www.instagram.com\/helenboldwriter\/\",\"https:\/\/x.com\/TheHelenBold\"],\"url\":\"https:\/\/kocerroxy.com\/blog\/author\/helen-b\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"The Right Way of Collecting Data for Machine Learning - KocerRoxy","description":"Struggling with collecting data for machine learning? Learn how to gather, clean, and prepare high-quality datasets for AI success.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/","og_locale":"en_US","og_type":"article","og_title":"The Right Way of Collecting Data for Machine Learning - KocerRoxy","og_description":"Struggling with collecting data for machine learning? 
Learn how to gather, clean, and prepare high-quality datasets for AI success.","og_url":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/","og_site_name":"KocerRoxy","article_author":"https:\/\/www.facebook.com\/TheHelenBold","article_published_time":"2025-02-04T15:00:00+00:00","article_modified_time":"2025-10-22T12:07:00+00:00","og_image":[{"width":1792,"height":1024,"url":"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp","type":"image\/webp"}],"author":"Helen Bold","twitter_card":"summary_large_image","twitter_creator":"@TheHelenBold","twitter_misc":{"Written by":"Helen Bold","Est. reading time":"15 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#article","isPartOf":{"@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/"},"author":{"name":"Helen Bold","@id":"https:\/\/kocerroxy.com\/blog\/#\/schema\/person\/c9c9120b90dac4268b7012486a55074c"},"headline":"The Right Way of Collecting Data for Machine 
Learning","datePublished":"2025-02-04T15:00:00+00:00","dateModified":"2025-10-22T12:07:00+00:00","mainEntityOfPage":{"@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/"},"wordCount":3129,"commentCount":0,"publisher":{"@id":"https:\/\/kocerroxy.com\/blog\/#organization"},"image":{"@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp","keywords":["data processing","programming","web scraping"],"articleSection":["Web Scraping"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/","url":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/","name":"The Right Way of Collecting Data for Machine Learning - KocerRoxy","isPartOf":{"@id":"https:\/\/kocerroxy.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage"},"image":{"@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp","datePublished":"2025-02-04T15:00:00+00:00","dateModified":"2025-10-22T12:07:00+00:00","description":"Struggling with collecting data for machine 
learning? Learn how to gather, clean, and prepare high-quality datasets for AI success.","breadcrumb":{"@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#primaryimage","url":"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp","contentUrl":"https:\/\/kocerroxy.com\/blog\/wp-content\/uploads\/2025\/02\/DALL\u00b7E-2025-02-04-16.40.28-A-futuristic-AI-themed-illustration-showcasing-the-concept-of-collecting-data-for-machine-learning.-The-image-features-a-glowing-digital-brain-in-the-1.webp","width":1792,"height":1024,"caption":"Collecting Data for Machine Learning"},{"@type":"BreadcrumbList","@id":"https:\/\/kocerroxy.com\/blog\/the-right-way-of-collecting-data-for-machine-learning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/kocerroxy.com\/blog\/"},{"@type":"ListItem","position":2,"name":"The Right Way of Collecting Data for Machine 
Learning"}]},{"@type":"WebSite","@id":"https:\/\/kocerroxy.com\/blog\/#website","url":"https:\/\/kocerroxy.com\/blog\/","name":"Kocerroxy","description":"","publisher":{"@id":"https:\/\/kocerroxy.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/kocerroxy.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/kocerroxy.com\/blog\/#organization","name":"Kocerroxy","url":"https:\/\/kocerroxy.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/kocerroxy.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/kocerroxy.com\/wp-content\/uploads\/2023\/07\/Favicon.png","contentUrl":"https:\/\/kocerroxy.com\/wp-content\/uploads\/2023\/07\/Favicon.png","width":512,"height":512,"caption":"Kocerroxy"},"image":{"@id":"https:\/\/kocerroxy.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/kocerroxy.com\/blog\/#\/schema\/person\/c9c9120b90dac4268b7012486a55074c","name":"Helen Bold","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/kocerroxy.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/7624887d3556e306a0883ab27fba8ad89c7f315532399aacf4e5cd49014bc658?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7624887d3556e306a0883ab27fba8ad89c7f315532399aacf4e5cd49014bc658?s=96&d=mm&r=g","caption":"Helen Bold"},"description":"Helen Bold has been writing about proxies since 2020. Helen specializes in gathering details, checking facts, and bringing value to our readers. In addition to writing articles, Helen does in-depth research and analyzes proxy industry trends. In her free time, she also writes amazing novels. 
You can read more about her personal work here: helenbold.com","sameAs":["http:\/\/helenbold.com","https:\/\/www.facebook.com\/TheHelenBold","https:\/\/www.instagram.com\/helenboldwriter\/","https:\/\/x.com\/TheHelenBold"],"url":"https:\/\/kocerroxy.com\/blog\/author\/helen-b\/"}]}},"_links":{"self":[{"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/posts\/7478","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/comments?post=7478"}],"version-history":[{"count":2,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/posts\/7478\/revisions"}],"predecessor-version":[{"id":7498,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/posts\/7478\/revisions\/7498"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/media\/7496"}],"wp:attachment":[{"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/media?parent=7478"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/categories?post=7478"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kocerroxy.com\/blog\/wp-json\/wp\/v2\/tags?post=7478"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}