Indian Workers Paid $2.40 an Hour to Film Everyday Actions for AI Training

How the footage is collected

Workers wear a smartphone mounted on a headband or a harness, recording first-person video of ordinary tasks. The pay, roughly $2.40 per hour, is low by global standards but competitive in parts of India where wages for similar gig work range from 150 to 300 rupees an hour. The workers are not named in the reports, but they are part of a growing labor force that supplies the raw material for AI training.

The head-mounted approach captures what is called egocentric footage: video from the wearer's point of view. This is different from third-person video, which shows the whole body. Egocentric data helps AI models learn how a robot's camera would see the world if it were mounted on a humanoid torso or head.

Companies processing the data

Two companies are known to be involved in turning this raw video into usable training data. Objectways, based in the U.S. but with operations in India, specializes in data annotation for AI. Humyn Lab, based in Bangalore, focuses on human-centric data collection and labeling. Both firms take the egocentric footage and add metadata (bounding boxes, action labels, object tags) that machine learning models need to learn from.

The work is painstaking. Each second of video may require several minutes of human annotation to identify what is happening: a hand reaching for a cup, a foot stepping over a threshold, a person opening a door. The resulting datasets are then sold or licensed to AI developers building robots and virtual assistants.
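To make the annotation step concrete, here is a minimal sketch of what one labeled moment of footage might look like as data. The reports do not describe either company's actual schema, so the class names, field names, and pixel values below are all hypothetical; the sketch only illustrates how a bounding box, an action label, and object tags can sit together in one record.

```python
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    # Pixel coordinates of the top-left corner, plus width and height.
    x: int
    y: int
    width: int
    height: int

    def area(self) -> int:
        # Area of the box in square pixels.
        return self.width * self.height


@dataclass
class FrameAnnotation:
    # Hypothetical record layout, not a schema taken from the reports.
    timestamp_s: float                      # position in the video, in seconds
    action_label: str                       # what is happening in the frame
    object_tags: list[str] = field(default_factory=list)
    boxes: dict[str, BoundingBox] = field(default_factory=dict)


# One illustrative record: a hand reaching for a cup (made-up values).
annotation = FrameAnnotation(
    timestamp_s=12.4,
    action_label="hand reaching for cup",
    object_tags=["hand", "cup"],
    boxes={
        "hand": BoundingBox(x=310, y=420, width=180, height=160),
        "cup": BoundingBox(x=520, y=390, width=90, height=120),
    },
)

print(annotation.action_label, annotation.boxes["cup"].area())
```

A single second of video can contain dozens of frames, each potentially needing a record like this, which is one way to see why annotation takes so much longer than the footage itself.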
Investor assessments project the humanoid robot market will reach $38 billion by 2035. That growth depends on robots that can navigate human environments such as homes, offices, and factories without bumping into furniture or misreading a gesture. Training those robots requires vast amounts of first-person video showing how people actually behave, not just staged actions in a lab.

The Indian workers' footage fills a gap. Most existing egocentric datasets come from researchers or volunteers in wealthy countries. The Indian data adds variety: different homes, different objects, different cultural routines. That diversity helps AI systems generalize better, though it also raises questions about labor conditions and consent.
