Data Engineer (NLP-Focused) [Remote]
Київстар
- Remote job
We are looking for a Data Engineer (NLP-Focused) to build and optimize the data pipelines that fuel our Ukrainian LLM and Kyivstar’s NLP initiatives. In this role, you will design robust ETL/ELT processes to collect, process, and manage large-scale text and metadata, enabling our data scientists and ML engineers to develop cutting-edge language models. You will work at the intersection of data engineering and machine learning, ensuring that our datasets and infrastructure are reliable, scalable, and tailored to the needs of training and evaluating NLP models in a Ukrainian language context. This is a unique opportunity to shape the data foundation of a pioneering AI project in Ukraine, working alongside NLP experts and leveraging modern big data technologies.
About us
Kyivstar.Tech is a Ukrainian hybrid IT company and a resident of Diia.City. We are a subsidiary of Kyivstar, one of Ukraine's largest telecom operators.
Our mission is to change lives in Ukraine and around the world by creating technological solutions and products that unleash the potential of businesses and meet users' needs.
More than 600 KS.Tech specialists work daily in various areas: mobile and web solutions, as well as design, development, support, and technical maintenance of high-performance systems and services.
We believe in innovations that truly bring quality changes and constantly challenge conventional approaches and solutions. Each of us is an adherent of entrepreneurial culture, which allows us never to stop, to evolve, and to create something new.
Responsibilities:
• Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information. Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity.
• Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to our language modeling efforts.
• Implement NLP/LLM-specific data processing: text cleaning and normalization (e.g., filtering of toxic content, de-duplication, de-noising), and detection and removal of personal data.
• Build task-specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as teacher.
• Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs.
• Automate data processing workflows and ensure their scalability and reliability. Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles.
• Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs. Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas.
• Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models. Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation.
• Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control.
• Manage data security, access, and compliance. Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources.
Required Qualifications:
Education & Experience: 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms. A Bachelor’s or Master’s degree in Computer Science, Engineering, or related field is preferred. Experience supporting machine learning or analytics teams with data pipelines is a strong advantage.
NLP Domain Experience: Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies). Knowledge of Ukrainian text sources and datasets, or experience with multilingual data processing, is an advantage given our project’s focus. Understanding of the FineWeb2 pipeline or similar data processing approaches.
Data Pipeline Expertise: Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems. Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows. Familiarity with building pipelines for unstructured data (text, logs) as well as structured data.
Programming & Scripting: Strong programming skills in Python for data manipulation and pipeline development. Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.). Experience with SQL for querying and transforming data in relational databases. Knowledge of Bash or other scripting for automation tasks. Writing clean, maintainable code and using version control (Git) for collaborative development.
Databases & Storage: Experience working with relational databases (e.g., PostgreSQL, MySQL) including schema design and query optimization. Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus. Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as our NLP applications may require embedding storage and fast similarity search.
Cloud Infrastructure: Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing. Ability to set up services such as S3/Cloud Storage, data warehouses (e.g., BigQuery, Redshift), and use cloud-based ETL tools or serverless functions. Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus.
Data Quality & Monitoring: Knowledge of data quality assurance practices. Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing. An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks.
Collaboration & Domain Knowledge: Ability to work closely with data scientists and understand the requirements of machine learning projects. Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require. Good communication skills to document data workflows and to coordinate with team members across different functions.
Preferred Qualifications:
Advanced Tools & Frameworks: Experience with distributed data processing frameworks (such as Apache Spark or Databricks) for large-scale data transformation, and with message streaming systems (Kafka, Pub/Sub) for real-time data pipelines. Familiarity with data serialization formats (JSON, Parquet) and handling of large text corpora.
Web Scraping Expertise: Deep experience in web scraping, using tools like Scrapy, Selenium, or Beautiful Soup, and handling anti-scraping challenges (rotating proxies, rate limiting). Ability to parse and clean raw text data from HTML, PDFs, or scanned documents.
CI/CD & DevOps: Knowledge of setting up CI/CD pipelines for data engineering (using GitHub Actions, Jenkins, or GitLab CI) to test and deploy changes to data workflows. Experience with containerization (Docker) to package data jobs and with Kubernetes for scaling them is a plus.
Big Data & Analytics: Experience with analytics platforms and BI tools (e.g., Tableau, Looker) used to examine the data prepared by the pipelines. Understanding of how to create and manage data warehouses or data marts for analytical consumption.
Problem-Solving: Demonstrated ability to work independently in solving complex data engineering problems, optimizing existing pipelines, and implementing new ones under time constraints. A proactive attitude toward exploring new data tools or techniques that could improve our workflows.
What we offer:
• Office or remote – it’s up to you. You can work from anywhere, and we will arrange your workplace.
• Remote onboarding.
• Performance bonuses for everyone (annual or quarterly — depends on the role).
• We train employees, with opportunities to learn through the company’s library, internal resources, and partner programs.
• Health and life insurance.
• Wellbeing program and corporate psychologist.
• Reimbursement of expenses for Kyivstar mobile communication.


