Data Engineer (NLP-Focused) [Remote]

Київстар

Remote
  • Remote job

We are looking for a Data Engineer (NLP-Focused) to build and optimize the data pipelines that fuel our Ukrainian LLM and Kyivstar’s NLP initiatives. In this role, you will design robust ETL/ELT processes to collect, process, and manage large-scale text and metadata, enabling our data scientists and ML engineers to develop cutting-edge language models. You will work at the intersection of data engineering and machine learning, ensuring that our datasets and infrastructure are reliable, scalable, and tailored to the needs of training and evaluating NLP models in a Ukrainian language context. This is a unique opportunity to shape the data foundation of a pioneering AI project in Ukraine, working alongside NLP experts and leveraging modern big data technologies.

About us 

Kyivstar.Tech is a Ukrainian hybrid IT company and a resident of Diia.City. We are a subsidiary of Kyivstar, one of Ukraine's largest telecom operators. 

Our mission is to change lives in Ukraine and around the world by creating technological solutions and products that unleash the potential of businesses and meet users' needs. 

More than 600 KS.Tech specialists work daily in various areas: mobile and web solutions, as well as design, development, support, and technical maintenance of high-performance systems and services.

We believe in innovations that truly bring quality changes and constantly challenge conventional approaches and solutions. Each of us is an adherent of entrepreneurial culture, which allows us never to stop, to evolve, and to create something new. 

Responsibilities:

• Design, develop, and maintain ETL/ELT pipelines for gathering, transforming, and storing large volumes of text data and related information. Ensure pipelines are efficient and can handle data from diverse sources (e.g., web crawls, public datasets, internal databases) while maintaining data integrity.

• Implement web scraping and data collection services to automate the ingestion of text and linguistic data from the web and other external sources. This includes writing crawlers or using APIs to continuously collect data relevant to our language modeling efforts.

• Implement NLP/LLM-specific data processing: text cleaning and normalization (e.g., filtering of toxic content, de-duplication, de-noising), and detection and removal of personal data.

• Build task-specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as the teacher.

• Set up and manage cloud-based data infrastructure for the project. Configure and maintain data storage solutions (data lakes, warehouses) and processing frameworks (e.g., distributed compute on AWS/GCP/Azure) that can scale with growing data needs.

• Automate data processing workflows and ensure their scalability and reliability. Use workflow orchestration tools like Apache Airflow to schedule and monitor data pipelines, enabling continuous and repeatable model training and evaluation cycles.

• Maintain and optimize analytical databases and data access layers for both ad-hoc analysis and model training needs. Work with relational databases (e.g., PostgreSQL) and other storage systems to ensure fast query performance and well-structured data schemas.

• Collaborate with Data Scientists and NLP Engineers to build data features and datasets for machine learning models. Provide data subsets, aggregations, or preprocessing as needed for tasks such as language model training, embedding generation, and evaluation.

• Implement data quality checks, monitoring, and alerting. Develop scripts or use tools to validate data completeness and correctness (e.g., ensuring no critical data gaps or anomalies in the text corpora), and promptly address any pipeline failures or data issues. Implement data version control.

• Manage data security, access, and compliance. Control permissions to datasets and ensure adherence to data privacy policies and security standards, especially when dealing with user data or proprietary text sources.
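As a rough illustration of the text-cleaning responsibilities listed above (normalization, de-duplication, removal of personal data), the following is a minimal Python sketch, not a description of Kyivstar's actual pipeline. The function names (`normalize`, `mask_pii`, `clean_corpus`), the placeholder tokens, and the two regular expressions are all hypothetical and deliberately simple; a production pipeline would likely rely on near-duplicate detection and dedicated PII tooling rather than exact hashes and regexes.

```python
import hashlib
import re
import unicodedata

# Illustrative patterns only (assumption): real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def clean_corpus(docs):
    """Normalize, mask, and drop empty or exactly duplicated documents."""
    seen = set()
    cleaned = []
    for doc in docs:
        text = mask_pii(normalize(doc))
        if not text:
            continue  # skip documents that are empty after cleaning
        # Case-insensitive exact-match key; near-duplicates are NOT detected.
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = [
        "Привіт,   світ!",
        "привіт, світ!",
        "Пишіть на test@example.com або +380 44 123 45 67",
    ]
    for line in clean_corpus(sample):
        print(line)
```

Hashing the case-folded text keeps the example short; it only catches exact repeats after normalization, so it should be read as a starting point rather than a full de-duplication strategy.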

Required Qualifications:

Education & Experience: 3+ years of experience as a Data Engineer or in a similar role, building data-intensive pipelines or platforms. A Bachelor’s or Master’s degree in Computer Science, Engineering, or related field is preferred. Experience supporting machine learning or analytics teams with data pipelines is a strong advantage.

NLP Domain Experience: Prior experience handling linguistic data or supporting NLP projects (e.g., text normalization, handling different encodings, tokenization strategies). Knowledge of Ukrainian text sources and datasets, or experience with multilingual data processing, can be an advantage given our project’s focus. Understanding of the approach used by FineWeb2 or similar processing pipelines.

Data Pipeline Expertise: Hands-on experience designing ETL/ELT processes, including extracting data from various sources, using transformation tools, and loading into storage systems. Proficiency with orchestration frameworks like Apache Airflow for scheduling workflows. Familiarity with building pipelines for unstructured data (text, logs) as well as structured data.

Programming & Scripting: Strong programming skills in Python for data manipulation and pipeline development. Experience with NLP packages (spaCy, NLTK, langdetect, fasttext, etc.). Experience with SQL for querying and transforming data in relational databases. Knowledge of Bash or other scripting for automation tasks. Writing clean, maintainable code and using version control (Git) for collaborative development.

Databases & Storage: Experience working with relational databases (e.g., PostgreSQL, MySQL) including schema design and query optimization. Familiarity with NoSQL or document stores (e.g., MongoDB) and big data technologies (HDFS, Hive, Spark) for large-scale data is a plus. Understanding of or experience with vector databases (e.g., Pinecone, FAISS) is beneficial, as our NLP applications may require embedding storage and fast similarity search.

Cloud Infrastructure: Practical experience with cloud platforms (AWS, GCP, or Azure) for data storage and processing. Ability to set up services such as S3/Cloud Storage, data warehouses (e.g., BigQuery, Redshift), and use cloud-based ETL tools or serverless functions. Understanding of infrastructure-as-code (Terraform, CloudFormation) to manage resources is a plus.

Data Quality & Monitoring: Knowledge of data quality assurance practices. Experience implementing monitoring for data pipelines (logs, alerts) and using CI/CD tools to automate pipeline deployment and testing. An analytical mindset to troubleshoot data discrepancies and optimize performance bottlenecks.

Collaboration & Domain Knowledge: Ability to work closely with data scientists and understand the requirements of machine learning projects. Basic understanding of NLP concepts and the data needs for training language models, so you can anticipate and accommodate the specific forms of text data and preprocessing they require. Good communication skills to document data workflows and to coordinate with team members across different functions.

Preferred Qualifications:

Advanced Tools & Frameworks: Experience with distributed data processing frameworks (such as Apache Spark or Databricks) for large-scale data transformation, and with message streaming systems (Kafka, Pub/Sub) for real-time data pipelines. Familiarity with data serialization formats (JSON, Parquet) and handling of large text corpora.

Web Scraping Expertise: Deep experience in web scraping, using tools like Scrapy, Selenium, or Beautiful Soup, and handling anti-scraping challenges (rotating proxies, rate limiting). Ability to parse and clean raw text data from HTML, PDFs, or scanned documents.

CI/CD & DevOps: Knowledge of setting up CI/CD pipelines for data engineering (using GitHub Actions, Jenkins, or GitLab CI) to test and deploy changes to data workflows. Experience with containerization (Docker) to package data jobs and with Kubernetes for scaling them is a plus.

Big Data & Analytics: Experience with analytics platforms and BI tools (e.g., Tableau, Looker) used to examine the data prepared by the pipelines. Understanding of how to create and manage data warehouses or data marts for analytical consumption.

Problem-Solving: Demonstrated ability to work independently in solving complex data engineering problems, optimizing existing pipelines, and implementing new ones under time constraints. A proactive attitude to explore new data tools or techniques that could improve our workflows.

What we offer:

• Office or remote – it’s up to you. You can work from anywhere, and we will arrange your workplace.

• Remote onboarding.

• Performance bonuses for everyone (annual or quarterly — depends on the role).  

• Employee training, with the opportunity to learn through the company’s library, internal resources, and programs from partners.

• Health and life insurance.  

• Wellbeing program and corporate psychologist.  

• Reimbursement of expenses for Kyivstar mobile communication.  

Vacancy posted more than 2 months ago