🧾 Extract Structured Transition Triplets from DOCX Files

250
ETH, DAI, USDT
+53
0 days (till Jun 13th, 2025)

We are hiring a developer to build a Streamlit application that extracts structured examples of transition phrases from .docx documents containing regional French news articles. These transitions are short, context-appropriate phrases that connect ideas between paragraphs. The goal is to automatically extract them along with their surrounding context and format them as clean, structured datasets for downstream analysis and AI fine-tuning.

Each article follows a consistent and predictable structure:

A header line (number and date)

Title and a short blurb

A marker line: "À savoir également dans votre département"

One long narrative paragraph containing embedded transitions

A final list of 2–3 transitions used within that paragraph

Your task is to focus only on the long narrative paragraph (step 4) and extract 2–3 clean triplets formatted as:

paragraph_a (text before the transition)

transition (the phrase as it appears in the list)

paragraph_b (text following the transition)

This structure will be used to build:

fewshot_examples.json (with paragraph_a, transition, paragraph_b objects, capped at 3 uses per transition)

transitions_only.txt (unique transitions only)

fewshot_examples.jsonl (fine-tuning format with role:user, role:assistant style entries)

and additional .txt files that list duplicates beyond the usage cap

The app must allow:

Upload of a .docx file

Selection of which outputs to generate (not all at once)

Local saving of the output files

Display of how many valid examples were extracted

To guide your implementation, you are provided with two reference files:

word10.docx: a sample input with ~10 articles and 24 expected examples

ccm_raw_paragraphs_dump.txt: a structured version of the same document, where every paragraph is numbered. This helps clarify how the transitions in paragraphs 7–10 relate to the narrative block in paragraph 6, then 17–20 to 16, and so on. This pattern remains consistent across all documents in the project.

Application Requirements:
To be considered for this job, please submit the following:

A working Streamlit demo (link or .py file)

At least 20 extracted examples from word10.docx showing clean paragraph-transition-paragraph triplets

A clear UI that allows output selection

Confirmation that outputs are saved locally as described

This project is the foundation for a much larger AI-based system we are developing for editorial automation. A successful result here could lead to multiple follow-on projects.

A full job description with more detailed formatting rules and two example outputs (from paragraph 6 and 16) is included in the attached PDF: this document outlines exactly what is expected in terms of logic, layout, and transition matching.

We look forward to reviewing your application.

250
ETH, DAI, USDT
+53
0 days (till Jun 13th, 2025)

More Jobs from this customer

Python Developer – Transition Repetition Analysis Module

Milestone 1 – Linguistic QA Validator (French Transition Rules) 🎯 Objective Develop a Python module to validate batches of AI-generated French transition phrases. This module ensures: 1. No stylistically significant word repetition across transitions in...

Python Developer – Transition Repetition Analysis Module

Milestone 1 – Linguistic QA Validator (French Transition Rules) 🎯 Objective Develop a Python module to validate batches of AI-generated French transition phrases. This module ensures: 1. No stylistically significant word repetition across transitions in...

More Jobs like this

Show more
Uniswap v3 & v4 Sniper Bot Developer

Looking for an experienced DeFi developer to build a crypto sniper bot on Base blockchain (OP Stack L2) that supports both Uniswap v3 and v4.Strong Solidity + Node.js backgroundFamiliar with Uniswap v3 and Uniswap v4...

Crypto Bot Setup for Token Launch (Confidential)

The tokens were acquired during a presale, and I plan to sell a portion after the launch, expected between September 1–15, 2025. I have some locked tokens I won’t sell. I need assistance configuring the...

Social media marketing

i need instagram and tiktok marketing for my website

Selling 40B $Seconds Time Farm

You can contact me in telegram @Rareape for faster transact

Pro YouTube Editor for Snowball Channel – $20 per Video

🚨 Only for Real Video Editors (Snowball YouTube Channel) We are looking for professional editors only. If you are not experienced → please don’t apply. Channel: Snowball @snowball1 (950+ subs). What we need: 7 YouTube...

Generate a realistic video with Veo3

Generate a video with Veo3, send me a link for verification and if I like it I'll send the payment and you'll send the file.

Technical writer

Job Title: Copywriter – DeFi Location: Remote Type: Contract / Part-time / Full-time About the Role We are seeking a talented Copywriter with strong technical knowledge of DeFi to craft high-impact, engaging, and conversion-driven content...

OnlyFans Promo Campaign – Acquire 250 Subscribers

We are looking for an experienced marketer to run a promo campaign for an OnlyFans profile. The goal is to acquire 250 paying subscribers for a budget of $270. You will manage audience targeting, outreach,...

SEO Website optimization

I have a recently created website that is a blog and provides information about a specfic person. I'm looking for someone to fine tune the site but more importantly, give it into the search engines...

Solana Copy trading Bot/ Auto Detection & Fast Execution

DEVELOPMENT OF AN ULTRA-FAST COPY TRADING BOT ON SOLANA (BLOCK 0 / BLOCK 1) DESCRIPTION: I am looking for an experienced Solana developer with a strong mastery of the Web3 ecosystem, low-latency RPCs, and network optimization, to...

Uniswap v3 & v4 Sniper Bot Developer

Looking for an experienced DeFi developer to build a crypto sniper bot on Base blockchain (OP Stack L2) that supports both Uniswap v3 and v4.Strong Solidity + Node.js backgroundFamiliar with Uniswap v3 and Uniswap v4...

Crypto Bot Setup for Token Launch (Confidential)

The tokens were acquired during a presale, and I plan to sell a portion after the launch, expected between September 1–15, 2025. I have some locked tokens I won’t sell. I need assistance configuring the...

Social media marketing

i need instagram and tiktok marketing for my website

Selling 40B $Seconds Time Farm

You can contact me in telegram @Rareape for faster transact

Pro YouTube Editor for Snowball Channel – $20 per Video

🚨 Only for Real Video Editors (Snowball YouTube Channel) We are looking for professional editors only. If you are not experienced → please don’t apply. Channel: Snowball @snowball1 (950+ subs). What we need: 7 YouTube...

Generate a realistic video with Veo3

Generate a video with Veo3, send me a link for verification and if I like it I'll send the payment and you'll send the file.

Technical writer

Job Title: Copywriter – DeFi Location: Remote Type: Contract / Part-time / Full-time About the Role We are seeking a talented Copywriter with strong technical knowledge of DeFi to craft high-impact, engaging, and conversion-driven content...

OnlyFans Promo Campaign – Acquire 250 Subscribers

We are looking for an experienced marketer to run a promo campaign for an OnlyFans profile. The goal is to acquire 250 paying subscribers for a budget of $270. You will manage audience targeting, outreach,...

SEO Website optimization

I have a recently created website that is a blog and provides information about a specfic person. I'm looking for someone to fine tune the site but more importantly, give it into the search engines...

Solana Copy trading Bot/ Auto Detection & Fast Execution

DEVELOPMENT OF AN ULTRA-FAST COPY TRADING BOT ON SOLANA (BLOCK 0 / BLOCK 1) DESCRIPTION: I am looking for an experienced Solana developer with a strong mastery of the Web3 ecosystem, low-latency RPCs, and network optimization, to...